CS572 Search Engines
Early Search Engines
- Archie (1990): Created by Alan Emtage, P. Deutsch, et al., Archie is recognized as the first tool designed to search FTP servers, marking the beginning of search engine technology.
- Veronica and Jughead (1993): These tools were developed to search text files on Gopher servers, expanding the capabilities of search beyond FTP servers.
- Excite (1993): Founded by Stanford undergraduates, Excite improved search engine technology by incorporating statistical analysis of word relationships, enhancing search results.
- World Wide Web Wanderer (1993): Developed by Matthew Gray, this was one of the first efforts to measure the growth of the web.
- ALIWEB (1993): Created by Martijn Koster, ALIWEB is considered the first web search engine, allowing users to find web pages.
- Lycos (1994): Quickly cataloging millions of documents, Lycos became a significant player in the search engine market.
- Infoseek (1994): Achieving prominence by becoming Netscape's default search engine, Infoseek offered efficient search capabilities.
- Yahoo (1994): Starting as a directory of websites organized by topic, Yahoo eventually incorporated search engine features.
- AltaVista (1995): Known for its advanced search features, AltaVista was a popular search engine in the mid-1990s.
- Google (1998): Founded by Larry Page and Sergey Brin, Google revolutionized search engine technology and eventually became the leading search engine worldwide.
Search Engine Components
A search engine comprises several key components:
- User: The individual initiating the search query.
- Web: The vast collection of interconnected documents and resources.
- Crawler (Spider): A program that traverses the web to collect pages.
- Indexer: The component that processes collected pages to create searchable indexes.
- Ads: Targeted advertisements relevant to the user's query.
- Query Processor: The engine that processes user queries, retrieves relevant results, and ranks them based on relevance and other factors.
Web Crawler Policies
Web crawlers follow specific policies to dictate page selection, revisit timing, website politeness, and coordination among distributed crawlers. Strategies to maintain freshness include checking LastModified
indicators and prioritizing dynamic or popular pages. Policies also address the frequency of revisits, with some advocating a uniform approach while others suggest a proportional strategy based on change frequency.
Search Engine Evaluation
The effectiveness of a search engine is often measured by its precision and recall:
- Precision: The ratio of relevant items retrieved to all retrieved items.
- Recall: The ratio of relevant items retrieved to all relevant items in the collection.
The F1 score, a harmonic mean of precision and recall, is used to provide a single metric that balances the two. The formula for the F1 score is:
Additionally, the score allows for adjusting the relative importance of recall and precision, accommodating different search scenarios and user needs.
Pythagorean Means
Pythagorean means offer three ways to calculate the average of a set of numbers, each providing different insights:
- Arithmetic Mean: The sum of numbers divided by the count. For example, .
- Geometric Mean: The nth root of the product of the numbers. For sets 3, 6, 9, and 12, the geometric mean is the 4th root of 1944, approximately .
- Harmonic Mean: The reciprocal of the arithmetic mean of the reciprocals. For 3, 6, 9, 12, the harmonic mean is .
Evaluation Metrics
Evaluation metrics are crucial for assessing the effectiveness of search engines.
-
Mean Average Precision (MAP): Assumes that a relevant document not retrieved has a precision of zero, emphasizing the importance of retrieving all relevant documents across queries.
-
Discounted Cumulative Gain (DCG): Evaluates the quality of the search results based on their order, penalizing relevant documents that appear lower in the search result list. DCG is calculated using the formula:
or a variant that emphasizes retrieving relevant documents:
Document Similarity and Deduplication
Efficient handling of document similarity and deduplication is key to maintaining the quality and relevance of search results.
-
Shingling: Identifies similar documents by comparing sets of contiguous subsequences of words. For instance, the 4-shingling of "a rose is a rose is a rose" produces a set of unique shingles.
-
Jaccard Similarity: Measures the similarity between two sets, defined as the size of their intersection divided by the size of their union.
-
SimHash: A technique for approximating the Jaccard similarity by generating a fingerprint for each document. Documents are considered near-duplicates if the Hamming distance between their SimHash values is below a certain threshold.
Distance Measures
Distance measures are used to quantify how dissimilar two items are, with applications ranging from clustering to anomaly detection.
-
Euclidean Distance: Measures the straight-line distance between two points in Euclidean space.
-
Jaccard Distance: Complements the Jaccard similarity, measuring the dissimilarity between two sets.
-
Cosine Distance: Measures the cosine of the angle between two vectors, often used to determine the orientation similarity between documents.
-
Edit Distance: Quantifies the minimum number of operations required to transform one string into another, useful for spelling correction and similar applications.
-
Hamming Distance: Counts the positions at which two strings of equal length differ, applicable in coding theory and data transmission.
Deduplication Techniques
To manage duplicate content, search engines employ strategies like fingerprinting with cryptographic hashing (e.g., MD5, SHA-1, SHA-2) and selecting random positions in documents for comparison.
Term Frequency-Inverse Document Frequency (TF-IDF)
Frequency
Assumption: The more frequent terms in a document are more critical, i.e., more indicative of the topic.
- = frequency of term i in document j.
- We normalize term frequency () across the entire corpus as:
Terms that appear in many different documents are less indicative of the overall topic.
-
= document frequency of term i = number of documents containing term i.
-
. .
An inverse document frequency of term i = . A log is used to indicate a term's discrimination power.
Combined Term Importance Indicator: TF-IDF Weighting
A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
Query Scoring
Given a query , we score the query against a document using the formula:
where is in .
Example
Given a document containing 3 terms with given frequencies . Assume the collection contains 10,000 documents, and document frequencies of these 3 terms are:
- .
Then:
- : ; ;
- : ; ;
- : ; ;
Search Engine Similarity Measures
Frequency
Assumption — The more frequent terms in a document are more critical, i.e., more indicative of the topic.
We normalize term frequency () across the entire corpus as follows:
Terms appearing in many documents are less indicative of the overall topic.
An inverse document frequency of term indicates a term's discrimination power. A log is used to dampen the effect relative to .
Combined term importance indicator, tf-idf weighting,
A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
Given a query , we score the query against a document using the formula,
Similarity Measure
The similarity measure is a function that computes the degree of similarity between two vectors.
Similarity between vectors for the document and query can be computed as the vector inner product:
For binary vectors, the inner product is the number of matched query terms in the document (hamming distance). For weighted term vectors, it is the sum of the products of the matched term weights.
Binary example
- D = [1, 1, 0, 1, 1, 0, 0]
- Q = [1, 0, 1, 0, 0, 1, 1]
- (the inner product)
Weighted example
Cosine similarity measures the cosine of the angle between two vectors.
Cosine Similarity
Document Conversion
Convert all documents in collection to weighted vectors, , for keywords in vocabulary . Convert each query to a weighted vector . For each in do, compute score . Sort documents by decreasing the score. Time complexity is .
Query Representation
Represent the query as a weighted vector. Compute the cosine similarity score for the query vector and each document vector that contains the query term. Rank documents to the query by score. Return the top (e.g., ) to the user.
Preprocess
Pre-compute, for each term, its k nearest docs. Treat each term as a 1-term query. A lot of preprocessing is involved. The result is a "preferred list" for each term.
Search
For a t-term query, take the union of their preferred lists - call this set S. Compute cosines from the query to only the docs in S, and choose top k. This approach misses semantic and syntactic information and assumes term independence. It lacks the control of Boolean models.
Standing queries
Examples include Google Alerts.
Classification Approaches
- Manual Classification
- Hand-coded Rule-based Classifiers
- Supervised Learning
The most straightforward feature selection is to use the most common terms. This works well in scenarios such as Naive Bayes for Spam Filtering. The advantages include speedy learning and testing, low storage requirements, and suitability with many equally important features. This method is robust to many irrelevant features.
Vector Space Classification
The training set corresponds to a labeled set of points (vectors) in vector space classification.
Premise 1 — Documents in the same class form a contiguous region of space.
Premise 2 — Documents from different classes are separate.
To learn a classifier, build surfaces to delineate classes in the space.
Centroid
The centroid for a class is calculated as:
where is the set of all documents that belong to class and is the vector space representation of . The centroid is generally not a unit vector.
Rocchio Classification
Rocchio forms a simple representative for each class: the centroid/prototype. Classification is based on the nearest prototype/centroid.
kNN (k Nearest Neighbor)
To classify a document :
Define k-neighborhood as the k nearest neighbors of . Pick the majority class label in the k-neighborhood. For larger k, we can roughly estimate as . This is based on the contiguity hypothesis.
Nearest-Neighbor Learning
Learning involves just storing the labeled training examples, .
For a testing instance (under 1NN), compute the similarity between and all examples in . Assign the category of the most similar example in . This method does not compute anything beyond storing the examples and is also called case-based learning and memory-based learning, labeled as lazy learning. It can be erroneous due to a single atypical example or noise. To improve, find k (3, 5, odd) examples and return the majority to improve.
Inverted Index
An inverted index is typically composed of a vector containing all distinct words of the text collection in lexicographical order (the vocabulary), and for each word in the language, a list of all documents (and text positions) in which that word occurs.
Case Folding
Convert uppercase letters to lowercase.