CS572 Search Engines
Early Search Engines
- Archie (1990): Created by Alan Emtage, P. Deutsch, et al., Archie is recognized as the first tool designed to search FTP servers, marking the beginning of search engine technology.
- Veronica and Jughead (1993): These tools were developed to search text files on Gopher servers, expanding the capabilities of search beyond FTP servers.
- Excite (1993): Founded by Stanford undergraduates, Excite improved search engine technology by incorporating statistical analysis of word relationships, enhancing search results.
- World Wide Web Wanderer (1993): Developed by Matthew Gray, this was one of the first efforts to measure the growth of the web.
- ALIWEB (1993): Created by Martijn Koster, ALIWEB is considered the first web search engine; rather than relying on a crawler, it let site owners submit descriptions of their own pages for indexing.
- Lycos (1994): Quickly cataloging millions of documents, Lycos became a significant player in the search engine market.
- Infoseek (1994): Achieving prominence by becoming Netscape's default search engine, Infoseek offered efficient search capabilities.
- Yahoo (1994): Starting as a directory of websites organized by topic, Yahoo eventually incorporated search engine features.
- AltaVista (1995): Known for its advanced search features, AltaVista was a popular search engine in the mid-1990s.
- Google (1998): Founded by Larry Page and Sergey Brin, Google revolutionized search engine technology and eventually became the leading search engine worldwide.
Search Engine Components
A search engine comprises several key components:
- User: The individual initiating the search query.
- Web: The vast collection of interconnected documents and resources.
- Crawler (Spider): A program that traverses the web to collect pages.
- Indexer: The component that processes collected pages to create searchable indexes.
- Ads: Targeted advertisements relevant to the user's query.
- Query Processor: The engine that processes user queries, retrieves relevant results, and ranks them based on relevance and other factors.
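To make the interaction between these components concrete, the following is a minimal sketch, not an actual search engine implementation; `fetch_page` and the in-memory dictionaries are hypothetical stand-ins for real crawling and indexing infrastructure:

```python
from collections import defaultdict

def crawl(seed_urls, fetch_page):
    """Crawler: traverse the web starting from seeds, collecting pages.
    `fetch_page` is a hypothetical callable returning (text, outlinks)."""
    frontier, seen, pages = list(seed_urls), set(seed_urls), {}
    while frontier:
        url = frontier.pop(0)
        text, links = fetch_page(url)
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

def build_index(pages):
    """Indexer: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def process_query(index, query):
    """Query processor: return documents containing every query term."""
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()
```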
Web Crawler Policies
Web crawlers follow specific policies that govern which pages to select, when to revisit them, how to remain polite to websites, and how to coordinate distributed crawlers. Strategies for maintaining freshness include checking HTTP Last-Modified headers and prioritizing dynamic or popular pages. Revisit policies range from a uniform approach (revisit all pages at the same rate) to a proportional approach (revisit pages more often the more frequently they change).
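The difference between the two revisit strategies can be sketched directly; the page names and per-page change rates below are assumed values for illustration only:

```python
# Estimated changes per day for each page (invented example values).
pages = {"news.example.com": 24.0,   # changes roughly hourly
         "blog.example.com": 1.0,    # changes roughly daily
         "docs.example.com": 0.1}    # changes roughly every 10 days

def uniform_revisit_hours(pages, visits_per_day_total=48):
    """Uniform policy: every page gets the same revisit interval."""
    interval = 24 * len(pages) / visits_per_day_total
    return {url: interval for url in pages}

def proportional_revisit_hours(pages):
    """Proportional policy: revisit more often the more a page changes."""
    return {url: 24 / rate for url, rate in pages.items()}

print(uniform_revisit_hours(pages))       # 1.5 hours for every page
print(proportional_revisit_hours(pages))  # 1 h, 24 h, 240 h respectively
```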
Search Engine Evaluation
The effectiveness of a search engine is often measured by its precision and recall:
- Precision: The ratio of relevant items retrieved to all retrieved items.
- Recall: The ratio of relevant items retrieved to all relevant items in the collection.
The F1 score, the harmonic mean of precision and recall, provides a single metric that balances the two. The formula for the F1 score is:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

Additionally, the more general $F_\beta$ score allows for adjusting the relative importance of recall and precision, accommodating different search scenarios and user needs:

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
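A minimal sketch of these three metrics in Python, treating `retrieved` and `relevant` as sets of document IDs (the example sets are invented):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Compute precision, recall, and the F-beta score for one query.
    beta > 1 weights recall more heavily; beta < 1 weights precision."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Example: 3 of 5 retrieved documents are relevant, out of 4 relevant total.
print(precision_recall_f({1, 2, 3, 4, 5}, {2, 3, 5, 9}))
# -> (0.6, 0.75, 0.666...)
```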
Pythagorean Means
Pythagorean means offer three ways to calculate the average of a set of numbers, each providing different insights:
- Arithmetic Mean: The sum of the numbers divided by their count. For 3, 6, 9, and 12: $(3 + 6 + 9 + 12)/4 = 7.5$.
- Geometric Mean: The $n$th root of the product of the numbers. For 3, 6, 9, and 12, the geometric mean is the 4th root of $3 \cdot 6 \cdot 9 \cdot 12 = 1944$, approximately $6.64$.
- Harmonic Mean: The reciprocal of the arithmetic mean of the reciprocals. For 3, 6, 9, and 12: $4 / (1/3 + 1/6 + 1/9 + 1/12) = 5.76$.
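The three means for the set 3, 6, 9, 12 used above can be verified in a few lines of Python:

```python
import math

xs = [3, 6, 9, 12]

arithmetic = sum(xs) / len(xs)              # (3+6+9+12)/4 = 7.5
geometric = math.prod(xs) ** (1 / len(xs))  # 1944 ** 0.25 ≈ 6.64
harmonic = len(xs) / sum(1 / x for x in xs) # 144/25 = 5.76

print(arithmetic, geometric, harmonic)
```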
Evaluation Metrics
Evaluation metrics are crucial for assessing the effectiveness of search engines.
- Mean Average Precision (MAP): Assumes that a relevant document not retrieved has a precision of zero, emphasizing the importance of retrieving all relevant documents across queries.
- Discounted Cumulative Gain (DCG): Evaluates the quality of the search results based on their order, penalizing relevant documents that appear lower in the search result list. DCG at rank $p$ is calculated using the formula:

$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$

or a variant that emphasizes retrieving highly relevant documents:

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
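A minimal sketch of both DCG variants; the relevance grades in `rels` are invented example values:

```python
import math

def dcg(relevances):
    """Classic DCG: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return relevances[0] + sum(
        rel / math.log2(i) for i, rel in enumerate(relevances[1:], start=2))

def dcg_exp(relevances):
    """Variant emphasizing highly relevant documents:
    sum over i of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

# Graded relevance of the top 5 results, in rank order.
rels = [3, 2, 3, 0, 1]
print(dcg(rels), dcg_exp(rels))
```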
Document Similarity and Deduplication
Efficient handling of document similarity and deduplication is key to maintaining the quality and relevance of search results.
- Shingling: Identifies similar documents by comparing sets of contiguous subsequences of words. For instance, the 4-shingling of "a rose is a rose is a rose" produces the set of unique shingles {a rose is a, rose is a rose, is a rose is}.
- Jaccard Similarity: Measures the similarity between two sets, defined as the size of their intersection divided by the size of their union.
- SimHash: A locality-sensitive hashing technique that condenses each document into a short fingerprint so that similar documents yield similar fingerprints (it approximates cosine similarity, whereas MinHash approximates Jaccard similarity). Documents are considered near-duplicates if the Hamming distance between their SimHash values is below a certain threshold.
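A minimal sketch of shingling and Jaccard similarity, using the rose example above (the second sentence is invented for comparison):

```python
def shingles(text, k=4):
    """Return the set of k-word shingles of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

s = shingles("a rose is a rose is a rose")
print(s)
# -> {'a rose is a', 'rose is a rose', 'is a rose is'}

t = shingles("a rose is a flower which is a rose")
print(jaccard(s, t))
# -> 0.125  (1 shared shingle out of 8 total)
```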
Distance Measures
Distance measures are used to quantify how dissimilar two items are, with applications ranging from clustering to anomaly detection.
- Euclidean Distance: Measures the straight-line distance between two points in Euclidean space.
- Jaccard Distance: Complements the Jaccard similarity (defined as one minus the similarity), measuring the dissimilarity between two sets.
- Cosine Distance: Defined as one minus the cosine of the angle between two vectors, often used to compare the orientation of document vectors independently of their length.
- Edit Distance: Quantifies the minimum number of operations (insertions, deletions, and substitutions) required to transform one string into another, useful for spelling correction and similar applications.
- Hamming Distance: Counts the positions at which two strings of equal length differ, applicable in coding theory and data transmission.
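Each of these measures is small enough to sketch directly in Python; the sample inputs in the final lines are illustrative:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norm

def jaccard_distance(a, b):
    """One minus the Jaccard similarity of two sets."""
    return 1 - len(a & b) / len(a | b)

def hamming(s, t):
    """Positions at which two equal-length strings differ."""
    assert len(s) == len(t)
    return sum(c1 != c2 for c1, c2 in zip(s, t))

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

print(euclidean((0, 0), (3, 4)))           # 5.0
print(hamming("karolin", "kathrin"))       # 3
print(edit_distance("kitten", "sitting"))  # 3
```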
Deduplication Techniques
To manage duplicate content, search engines employ strategies such as fingerprinting with cryptographic hash functions (e.g., MD5, SHA-1, SHA-2), applied either to entire documents for exact-duplicate detection or to content sampled at randomly selected positions for near-duplicate comparison.
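A minimal sketch of hash-based fingerprinting for exact-duplicate removal, assuming whitespace-and-case normalization; SHA-256 is used here, though MD5 or SHA-1 would plug in the same way:

```python
import hashlib

def fingerprint(text):
    """Exact-duplicate fingerprint: hash the normalized document text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep only the first document seen for each fingerprint."""
    seen, unique = set(), []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique

print(len(deduplicate(["A rose is a rose.", "a  rose is a rose.", "Hello."])))
# -> 2  (the first two documents normalize to the same fingerprint)
```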
Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency
Assumption: The more frequent terms in a document are more critical, i.e., more indicative of the topic.
- $f_{ij}$ = frequency of term $i$ in document $j$.
- We normalize term frequency ($tf$) across the entire corpus as: $tf_{ij} = \frac{f_{ij}}{\max_i \{f_{ij}\}}$.
Terms that appear in many different documents are less indicative of the overall topic.
- $df_i$ = document frequency of term $i$ = the number of documents containing term $i$.
- $idf_i$ = inverse document frequency of term $i$ = $\log_2\left(\frac{N}{df_i}\right)$, where $N$ is the total number of documents in the collection. A log is used to indicate a term's discrimination power.
Combined Term Importance Indicator: TF-IDF Weighting
A term occurring frequently in the document but rarely in the rest of the collection is given high weight:

$$w_{ij} = tf_{ij} \times idf_i = tf_{ij} \times \log_2\left(\frac{N}{df_i}\right)$$
Query Scoring
Given a query $q$, we score the query against a document $d_j$ using the formula:

$$score(q, d_j) = \sum_{i \in q} w_{ij}$$

where term $i$ is in $q$ and $w_{ij}$ is the TF-IDF weight of term $i$ in document $d_j$.
Example
Given a document containing 3 terms, A, B, and C, with frequencies 3, 2, and 1, respectively. Assume the collection contains 10,000 documents, and the document frequencies of these 3 terms are:
- A: 50; B: 1300; C: 250.
Then:
- A: $tf = 3/3 = 1$; $idf = \log_2(10000/50) = 7.6$; $tf \times idf = 7.6$
- B: $tf = 2/3$; $idf = \log_2(10000/1300) = 2.9$; $tf \times idf = 2.0$
- C: $tf = 1/3$; $idf = \log_2(10000/250) = 5.3$; $tf \times idf = 1.8$
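The numbers in this example can be reproduced in a few lines of Python, following the $tf$, $idf$, and query-scoring formulas above:

```python
import math

N = 10_000                            # documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}      # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}  # document frequencies of each term

max_f = max(freqs.values())
for term in freqs:
    tf = freqs[term] / max_f
    idf = math.log2(N / dfs[term])
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# A 1.0 7.6 7.6
# B 0.67 2.9 2.0
# C 0.33 5.3 1.8

# Score a query against this document by summing TF-IDF weights,
# per the query-scoring formula above.
query = ["A", "C"]
score = sum((freqs[t] / max_f) * math.log2(N / dfs[t])
            for t in query if t in freqs)
print(round(score, 1))  # 7.6 + 1.8 = 9.4
```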
Search Engine Similarity Measures