Signal and Image Processing Seminar

Searching the Web for General and Scientific Information

C. Lee Giles, Steve Lawrence, Kurt Bollacker
NEC Research Institute
Princeton, NJ

kurt@pegasus.ece.utexas.edu

Monday, August 31, 2:00 PM, ENS 602


Although the World Wide Web was originally created as a collaboration tool for scientists, it has come to be viewed as an extremely large and diverse but poorly organized database. The Web contains a practically endless supply of relevant scientific information for researchers and other users, but finding the answer to even a simple question is often difficult, if not impossible. First-generation search engines made great strides by providing keyword search over Web documents, but the services they provide have several severe shortcomings. The precision and ranking of the pages they return are often poor, and result lists include both dead links and pages stuffed with "spamming" keywords. It has been shown that any single search engine covers only a small part of the Web, and different search engines have differing interfaces and capabilities, which makes using several of them tedious. Documents that are not stored as HTML (e.g., images and PostScript files) are completely invisible to Web search engines, and the explosive growth of the Web only exacerbates all of these problems.

I will discuss some of the general difficulties in using the Web as a scientific tool and some of the challenges that must be met in order to overcome them. There are two projects at NEC which attempt to meet some of these challenges.

Inquirus is a "meta search engine" which provides sophisticated tools for greatly enhancing the process of Web searching. It compiles the results of multiple search engine queries for faster and more complete recall, and uses query term and page analysis as well as clustering to provide improved navigation of recalled documents.

CiteSeer is a Web-based assistant agent that uses citation information to enable semantic search through Web-based scientific literature. Beyond keyword search, navigation through both "citing" and "cited" publications is possible, and citation contexts are automatically found and summarized. Using citation and word frequency information, semantic distance measures between publications can be calculated and used to perform "semantic search" for relevant publications.
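
As a rough illustration of the two ideas above, the following Python sketch merges ranked result lists from several engines and computes two simple similarity measures between publications. It is only a sketch under stated assumptions: the function names, the reciprocal-rank scoring, the cosine measure over word frequencies, and the Jaccard overlap of cited works are illustrative choices, not the methods actually used by Inquirus or CiteSeer.

```python
from collections import Counter
from math import sqrt


def merge_results(result_lists):
    """Merge ranked URL lists from several engines, dropping duplicates.

    Assumption: each URL is scored by the sum of its reciprocal ranks
    across engines. This is an illustrative rule, not Inquirus's own.
    """
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


def cosine_similarity(words_a, words_b):
    """Cosine similarity between the word-frequency vectors of two texts."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = sqrt(sum(c * c for c in freq_a.values()))
    norm_b = sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def citation_overlap(cites_a, cites_b):
    """Jaccard overlap between the sets of works that two papers cite."""
    a, b = set(cites_a), set(cites_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, merge_results([["a", "b"], ["b", "c"]]) places "b" first because both hypothetical engines returned it, and a distance between two publications could be taken as one minus either similarity value.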


A list of digital signal processing seminars is available from the ECE department Web pages under "Seminars". The Web address for the digital signal processing seminars is http://anchovy.ece.utexas.edu/seminars