CS572 Search Engines
Early Search Engines
- Archie (1990): Created by Alan Emtage, P. Deutsch, et al., Archie is recognized as the first tool designed to search FTP servers, marking the beginning of search engine technology.
- Veronica and Jughead (1993): These tools were developed to search text files on Gopher servers, expanding the capabilities of search beyond FTP servers.
- Excite (1993): Founded by Stanford undergraduates, Excite improved search engine technology by incorporating statistical analysis of word relationships, enhancing search results.
- World Wide Web Wanderer (1993): Developed by Matthew Gray, this was one of the first efforts to measure the growth of the web.
- ALIWEB (1993): Created by Martijn Koster, ALIWEB is considered the first web search engine, allowing users to find web pages.
- Lycos (1994): Quickly cataloging millions of documents, Lycos became a significant player in the search engine market.
- Infoseek (1994): Achieving prominence by becoming Netscape's default search engine, Infoseek offered efficient search capabilities.
- Yahoo (1994): Starting as a directory of websites organized by topic, Yahoo eventually incorporated search engine features.
- AltaVista (1995): Known for its advanced search features, AltaVista was a popular search engine in the mid-1990s.
- Google (1998): Founded by Larry Page and Sergey Brin, Google revolutionized search engine technology and eventually became the leading search engine worldwide.
Search Engine Components
A search engine comprises several key components:
- User: The individual initiating the search query.
- Web: The vast collection of interconnected documents and resources.
- Crawler (Spider): A program that traverses the web to collect pages.
- Indexer: The component that processes collected pages to create searchable indexes.
- Ads: Targeted advertisements relevant to the user's query.
- Query Processor: The engine that processes user queries, retrieves relevant results, and ranks them based on relevance and other factors.
Web Crawler Policies
Web crawlers follow specific policies that dictate page selection, revisit timing, politeness toward websites, and coordination among distributed crawlers. Strategies to maintain freshness include checking Last-Modified indicators and prioritizing dynamic or popular pages. Policies also address the frequency of revisits, with some advocating a uniform approach while others suggest a proportional strategy based on how often a page changes.
Search Engine Evaluation
The effectiveness of a search engine is often measured by its precision and recall:
- Precision: The ratio of relevant items retrieved to all retrieved items.
- Recall: The ratio of relevant items retrieved to all relevant items in the collection.
The F1 score, the harmonic mean of precision and recall, provides a single metric that balances the two:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

More generally, the $F_\beta$ score allows adjusting the relative importance of recall and precision, accommodating different search scenarios and user needs:

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
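As a quick illustration of these metrics, here is a minimal Python sketch (the function name and inputs are my own, not from any evaluation library); with beta = 1 it reduces to F1:

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Precision, recall, and F-beta for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    if p == 0 and r == 0:
        return p, r, 0.0
    b2 = beta * beta
    return p, r, (1 + b2) * p * r / (b2 * p + r)

# beta > 1 favors recall; beta < 1 favors precision.
print(precision_recall_f({1, 2, 3, 4}, {2, 4, 6}))  # (0.5, 0.667, ~0.571)
```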
Pythagorean Means
Pythagorean means offer three ways to calculate the average of a set of numbers, each providing different insights:
- Arithmetic Mean: The sum of the numbers divided by the count. For example, the arithmetic mean of 3, 6, 9, 12 is $(3 + 6 + 9 + 12) / 4 = 7.5$.
- Geometric Mean: The nth root of the product of the numbers. For the set 3, 6, 9, 12, the geometric mean is the 4th root of $3 \cdot 6 \cdot 9 \cdot 12 = 1944$, approximately 6.64.
- Harmonic Mean: The reciprocal of the arithmetic mean of the reciprocals. For 3, 6, 9, 12, the harmonic mean is $4 / (1/3 + 1/6 + 1/9 + 1/12) = 144/25 = 5.76$.
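These three values can be checked with a few lines of Python (math.prod requires Python 3.8+):

```python
import math

nums = [3, 6, 9, 12]

arithmetic = sum(nums) / len(nums)               # 7.5
geometric = math.prod(nums) ** (1 / len(nums))   # 1944 ** 0.25 ~ 6.64
harmonic = len(nums) / sum(1 / x for x in nums)  # 144 / 25 = 5.76

# For positive numbers: harmonic <= geometric <= arithmetic.
print(arithmetic, geometric, harmonic)
```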
Evaluation Metrics
Evaluation metrics are crucial for assessing the effectiveness of search engines.
- Mean Average Precision (MAP): Assumes that a relevant document not retrieved has a precision of zero, emphasizing the importance of retrieving all relevant documents across queries.
- Discounted Cumulative Gain (DCG): Evaluates the quality of the search results based on their order, penalizing relevant documents that appear lower in the result list. DCG at rank position $p$ is calculated using the formula:

$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$

or a variant that places stronger emphasis on retrieving highly relevant documents:

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
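A minimal sketch of both DCG variants in Python (the function names and sample relevance grades are my own):

```python
import math

def dcg(relevances):
    """Classic DCG: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def dcg_exp(relevances):
    """Variant emphasizing highly relevant results: sum of (2^rel - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

# Graded relevance of the first five results, in ranked order (made-up grades).
ranking = [3, 2, 3, 0, 1]
print(dcg(ranking), dcg_exp(ranking))  # ~7.32, ~12.78
```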
Document Similarity and Deduplication
Efficient handling of document similarity and deduplication is key to maintaining the quality and relevance of search results.
- Shingling: Identifies similar documents by comparing sets of contiguous subsequences of words. For instance, the 4-shingling of "a rose is a rose is a rose" produces the set of three unique shingles {(a rose is a), (rose is a rose), (is a rose is)}.
- Jaccard Similarity: Measures the similarity between two sets $A$ and $B$, defined as $J(A, B) = |A \cap B| / |A \cup B|$, the size of their intersection divided by the size of their union.
- SimHash: A locality-sensitive hashing technique that generates a short fingerprint for each document (it approximates cosine similarity rather than Jaccard). Documents are considered near-duplicates if the Hamming distance between their SimHash values is below a certain threshold.
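A minimal Python sketch of shingling and Jaccard similarity (the helper names are my own):

```python
def shingles(text, k=4):
    """Set of k-word shingles (contiguous word sequences) of a text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 1.0

s = shingles("a rose is a rose is a rose")
print(len(s))  # 3 unique 4-shingles, as noted above
print(jaccard(s, shingles("a rose is a flower which is a rose")))  # 0.125
```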
Distance Measures
Distance measures are used to quantify how dissimilar two items are, with applications ranging from clustering to anomaly detection.
- Euclidean Distance: Measures the straight-line distance between two points in Euclidean space.
- Jaccard Distance: Complements the Jaccard similarity ($1 - J(A, B)$), measuring the dissimilarity between two sets.
- Cosine Distance: Derived from the cosine of the angle between two vectors (distance = 1 - cosine similarity), often used to compare the orientation of documents.
- Edit Distance: Quantifies the minimum number of insertions, deletions, and substitutions required to transform one string into another, useful for spelling correction and similar applications.
- Hamming Distance: Counts the positions at which two strings of equal length differ, applicable in coding theory and data transmission.
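Minimal Python implementations of several of these measures (names are my own; the edit-distance routine is standard dynamic programming):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norms

def hamming(s, t):
    """Number of differing positions between equal-length strings."""
    return sum(a != b for a, b in zip(s, t))

def edit_distance(s, t):
    """Minimum insertions/deletions/substitutions to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (a != b))) # substitution
        prev = cur
    return prev[-1]

print(euclidean([0, 0], [3, 4]))           # 5.0
print(hamming("karolin", "kathrin"))       # 3
print(edit_distance("kitten", "sitting"))  # 3
```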
Deduplication Techniques
To manage duplicate content, search engines employ strategies like fingerprinting with cryptographic hashing (e.g., MD5, SHA-1, SHA-2) and selecting random positions in documents for comparison.
Term Frequency-Inverse Document Frequency (TF-IDF)
Frequency
Assumption: The more frequent terms in a document are more critical, i.e., more indicative of the topic.
- $f_{ij}$ = frequency of term $i$ in document $j$.
- We normalize term frequency ($tf$) across the entire corpus as: $tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$.
Terms that appear in many different documents are less indicative of the overall topic.
- $df_i$ = document frequency of term $i$ = number of documents containing term $i$.
- $idf_i$ = inverse document frequency of term $i$ = $\log_2(N / df_i)$, where $N$ is the total number of documents. The log is used to dampen the effect relative to $tf$ and reflects a term's discrimination power.
Combined Term Importance Indicator: TF-IDF Weighting

$$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2(N / df_i)$$

A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
Query Scoring
Given a query $q$, we score the query against a document $d_j$ using the formula:

$$score(q, d_j) = \sum_{t_i \in q \cap d_j} w_{ij}$$

where $t_i$ ranges over the query terms that appear in $d_j$.
Example
Given a document containing 3 terms A, B, and C with frequencies $f_A = 3$, $f_B = 2$, $f_C = 1$. Assume the collection contains 10,000 documents, and the document frequencies of these 3 terms are:
- $df_A = 50$, $df_B = 1300$, $df_C = 250$.
Then:
- A: $tf = 3/3 = 1$; $idf = \log_2(10000/50) \approx 7.6$; $tf \cdot idf \approx 7.6$
- B: $tf = 2/3$; $idf = \log_2(10000/1300) \approx 2.9$; $tf \cdot idf \approx 2.0$
- C: $tf = 1/3$; $idf = \log_2(10000/250) \approx 5.3$; $tf \cdot idf \approx 1.8$
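A short Python sketch that reproduces the worked example (assuming, as above, tf normalization by the maximum term frequency in the document and a base-2 log):

```python
import math

N = 10_000  # documents in the collection

# term -> (frequency in the document, document frequency in the collection)
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}

max_f = max(f for f, _ in terms.values())
for term, (f, df) in terms.items():
    tf = f / max_f           # normalized term frequency
    idf = math.log2(N / df)  # inverse document frequency
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# A 1.0 7.6 7.6
# B 0.67 2.9 2.0
# C 0.33 5.3 1.8
```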
Search Engine Similarity Measures
Similarity Measure
The similarity measure is a function that computes the degree of similarity between two vectors.
Similarity between vectors for the document $d_j$ and query $q$ can be computed as the vector inner product:

$$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \cdot w_{iq}$$

For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection). For weighted term vectors, it is the sum of the products of the matched term weights.
Binary example
- D = [1, 1, 0, 1, 1, 0, 0]
- Q = [1, 0, 1, 0, 0, 1, 1]
- $sim(D, Q) = 1$ (the inner product)
Weighted example
- $D_1 = 2T_1 + 3T_2 + 5T_3$; $D_2 = 3T_1 + 7T_2 + 1T_3$; $Q = 0T_1 + 0T_2 + 2T_3$
- $sim(D_1, Q) = 2 \cdot 0 + 3 \cdot 0 + 5 \cdot 2 = 10$; $sim(D_2, Q) = 3 \cdot 0 + 7 \cdot 0 + 1 \cdot 2 = 2$
Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors:

$$cosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$
Document Conversion
Convert all documents in collection $C$ to weighted vectors $d_j$ over the keywords in vocabulary $V$. Convert each query to a weighted vector $q$. For each $d_j$ in $C$, compute the score $s_j = cosSim(d_j, q)$. Sort documents by decreasing score. Time complexity is $O(|V| \cdot |C|)$.
Query Representation
Represent the query as a weighted tf-idf vector. Compute the cosine similarity score between the query vector and each document vector that contains at least one query term. Rank documents by score. Return the top $k$ (e.g., $k = 10$) to the user.
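A minimal sketch of this ranking loop in Python, using the weighted vectors from the earlier inner-product example (the weights are toy values):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {"D1": [2, 3, 5], "D2": [3, 7, 1]}  # tf-idf weighted document vectors
query = [0, 0, 2]                          # weighted query vector

ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
for d in ranked:
    print(d, round(cosine(docs[d], query), 3))
# D1 0.811
# D2 0.13
```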
Preprocess
Pre-compute, for each term, its k nearest docs. Treat each term as a 1-term query. A lot of preprocessing is involved. The result is a "preferred list" for each term.
Search
For a t-term query, take the union of their preferred lists - call this set S. Compute cosines from the query to only the docs in S, and choose top k. This approach misses semantic and syntactic information and assumes term independence. It lacks the control of Boolean models.
Standing queries
Examples include Google Alerts.
Classification Approaches
- Manual Classification
- Hand-coded Rule-based Classifiers
- Supervised Learning
The most straightforward feature selection is to use the most common terms. This works well in scenarios such as Naive Bayes spam filtering. Advantages include speedy learning and testing, low storage requirements, suitability when many features are equally important, and robustness to many irrelevant features.
Vector Space Classification
The training set corresponds to a labeled set of points (vectors) in vector space classification.
Premise 1 — Documents in the same class form a contiguous region of space.
Premise 2 — Documents from different classes are separate.
To learn a classifier, build surfaces to delineate classes in the space.
Centroid
The centroid for a class $c$ is calculated as:

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$. The centroid is generally not a unit vector.
Rocchio Classification
Rocchio forms a simple representative for each class: the centroid/prototype. Classification is based on the nearest prototype/centroid.
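A minimal nearest-centroid (Rocchio) sketch in Python (the classes and vectors are made up):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy training data: class label -> document vectors.
training = {
    "sports": [[5, 1], [4, 0]],
    "finance": [[0, 4], [1, 5]],
}
prototypes = {c: centroid(vs) for c, vs in training.items()}

def rocchio_classify(doc):
    """Assign the class whose centroid (prototype) is nearest."""
    return min(prototypes, key=lambda c: euclidean(doc, prototypes[c]))

print(rocchio_classify([4, 1]))  # sports
```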
kNN (k Nearest Neighbor)
To classify a document $d$:
Define the k-neighborhood as the $k$ nearest neighbors of $d$ and pick the majority class label in that neighborhood. For larger $k$, we can roughly estimate $P(c \mid d)$ as the fraction of the $k$ nearest neighbors that belong to class $c$. This is based on the contiguity hypothesis.
Nearest-Neighbor Learning
Learning involves just storing the labeled training examples $D$.
For a test instance $x$ (under 1NN), compute the similarity between $x$ and all examples in $D$ and assign the category of the most similar example. This method computes nothing beyond storing the examples; it is also called case-based or memory-based learning, a form of lazy learning. It can err because of a single atypical example or noise; to improve, find the $k$ nearest examples ($k$ odd, e.g., 3 or 5) and return the majority class.
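A corresponding kNN sketch in Python (same made-up data shape as the Rocchio sketch above):

```python
from collections import Counter
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Stored labeled training examples: (vector, class) pairs.
D = [([5, 1], "sports"), ([4, 0], "sports"), ([4, 2], "sports"),
     ([0, 4], "finance"), ([1, 5], "finance")]

def knn_classify(x, k=3):
    """Majority class among the k nearest training examples."""
    nearest = sorted(D, key=lambda ex: euclidean(x, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify([4, 1], k=3))  # sports
```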
Inverted Index
An inverted index is typically composed of a vector containing all distinct words of the text collection in lexicographical order (the vocabulary) and, for each word in the vocabulary, a list of all documents (and text positions) in which that word occurs.
Case Folding
Convert uppercase letters to lowercase.
Stemming
Reduce words to their morphological roots.
Stop Words
Remove highly frequent words. Because postings lists are very sparse, they are stored as sorted linked lists, and skip pointers are used to make merging them faster.
Biword Inefficiency
Using biwords for phrase queries can be inefficient. Instead, one can use a positional index: extract the inverted entries for each distinct term and merge the document:position lists to identify all positions where the phrase occurs.
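A toy positional index and phrase-query merge in Python (the data and function names are my own):

```python
from collections import defaultdict

docs = {1: "to be or not to be", 2: "be quick or be dead"}

# term -> {doc_id -> sorted list of word positions}
index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term][doc_id].append(pos)

def phrase_query(phrase):
    """Docs where the words of `phrase` occur consecutively."""
    words = phrase.split()
    # Candidate docs must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    # Merge position lists: word i must appear at position start + i.
    return {doc for doc in candidates
            for start in index[words[0]][doc]
            if all(start + i in index[w][doc] for i, w in enumerate(words))}

print(phrase_query("to be"))  # {1}
```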
Part-of-Speech Tagging
By tokenizing text and performing part-of-speech tagging, one can classify terms such as:
- Nouns (N)
- Function words (X), including articles and prepositions
Strings of terms matching the pattern NX*N (where "*" denotes zero or more occurrences) are treated as extended biwords and added to the vocabulary. For example, the phrase "renegotiation of the constitution" becomes the extended biword "renegotiation constitution" after the function words are removed.
N-Grams
An n-gram is a sequence of n consecutive words. They form a Zipf distribution where common n-grams often contain stop words. The inverted index for n-grams includes pointers to all dictionary terms containing them. Note that larger values of n require more storage space for n-grams.
Dynamic Indexing
The simplest method for dynamic indexing involves the following:
- Maintaining a primary index for existing documents.
- New documents go into a smaller auxiliary index.
- Searches are conducted across both, with results merged periodically.
- Deletions are managed using an invalidation bit-vector to filter out deleted documents.
- Re-indexing is done periodically to merge into one main index.
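A toy sketch of this main/auxiliary scheme in Python (the class and method names are my own):

```python
class DynamicIndex:
    """Main + auxiliary index with an invalidation set for deletions."""

    def __init__(self):
        self.main = {}        # term -> set of doc_ids (large, merged rarely)
        self.aux = {}         # term -> set of doc_ids (small, for new docs)
        self.deleted = set()  # invalidated doc_ids, filtered at query time

    def add(self, doc_id, text):
        for term in text.split():
            self.aux.setdefault(term, set()).add(doc_id)

    def search(self, term):
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return hits - self.deleted

    def merge(self):
        """Periodic re-indexing: fold the auxiliary index into the main one."""
        for term, ids in self.aux.items():
            self.main.setdefault(term, set()).update(ids - self.deleted)
        self.aux.clear()

idx = DynamicIndex()
idx.add(1, "new page about crawlers")
idx.add(2, "another page about indexing")
idx.deleted.add(2)  # delete via the invalidation set
print(idx.search("page"))  # {1}
```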
Query Box Operations
- The default operation is AND.
- OR and wildcard searches are supported.
- Use "+" to include stop words in a search.
- OR operations are evaluated first.
- Proximity matters: Google prioritizes pages with search terms near each other and in the same order as the query.
- Case insensitivity: Google search is not sensitive to case.
- Certain punctuation and special characters are ignored.
Special Search Operators
- filetype: Restrict results to a particular file type.
- inanchor: Search for text in a page's anchor text.
- intext: Search for text in the body of a document.
- intitle: Search for text in the title of a document.
- inurl: Search within the URL of a page.
- site: Restrict search to a specific site.
- cache:url: View the cached version of a site.
- info: Show information about a page.
- stocks: Search for stock information.
- Note: The phonebook feature has been discontinued.
Mean Reciprocal Rank (MRR)
The mean reciprocal rank (MRR) is a statistical measure used for evaluating systems that return a list of responses to a set of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. MRR is calculated as the average of the reciprocal ranks of results for a sample of queries.
The formula for MRR is given by:

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$

where $|Q|$ is the number of queries and $rank_i$ is the rank of the first correct answer for the $i$-th query.
Example
Consider a system that translates English words to their plurals, making three guesses for each query. The system ranks these guesses, with the first being the most likely correct. Here's how the MRR is calculated for three sample queries:
| Query | Results | Correct Response | Rank | Reciprocal Rank |
|---|---|---|---|---|
| cat | catten, cati, cats | cats | 3 | 1/3 |
| tori | torii, tori, toruses | tori | 2 | 1/2 |
| virus | viruses, virii, viri | viruses | 1 | 1 |
The MRR in this case is $(1/3 + 1/2 + 1) / 3 = 11/18 \approx 0.61$.
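The same calculation in a couple of lines of Python (the helper name is my own):

```python
def mrr(ranks):
    """Mean reciprocal rank; each entry is the rank of the first correct
    answer for one query (None if nothing correct was returned)."""
    return sum(1 / r for r in ranks if r) / len(ranks)

print(mrr([3, 2, 1]))  # 11/18 ~ 0.611
```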
MapReduce Overview
MapReduce is a programming model designed to efficiently process large datasets across distributed clusters. It simplifies parallel computation, fault tolerance, I/O scheduling, and monitoring.
Features
- Automatic Parallelization: Code is executed across multiple threads and processors without manual intervention.
- Fault Tolerance: Provides robustness against the failure of one or more nodes within the system.
- I/O Scheduling: Efficiently manages input and output operations across the cluster.
- Monitoring & Status Updates: Keeps track of the progress and health of the distributed computing tasks.
How MapReduce Works
- Map Function: Processes chunks of data, converting them into key-value pairs.
- Shuffle Step: A master controller organizes the key-value pairs by key, distributing them across reduce tasks so that all pairs with the same key reach the same reduce task.
- Reduce Function: Each reduce task processes a set of key-value pairs, aggregating the values based on the keys.
Example
- The Map function parses documents, emitting each word as a key with the integer value 1 for each occurrence.
- The Reduce function then sums these counts to produce the number of occurrences of each word across all documents.
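A single-machine Python sketch of this word-count example (a real MapReduce system would run the map, shuffle, and reduce phases across a cluster):

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit (word, 1) for every word occurrence."""
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each document is processed independently (parallelizable).
pairs = [pair for doc in documents for pair in map_fn(doc)]

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: one call per key (also parallelizable).
print(dict(reduce_fn(w, cs) for w, cs in groups.items()))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```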
Fault Tolerance Mechanisms
- If a task fails, it is retried on another node; persistent failures result in job failure.
- If a node fails, its current and past map tasks are redistributed to other nodes.
- Slow tasks may be executed in parallel on other nodes for efficiency.
Cluster Architecture
- Racks contain 16-64 nodes, with intra-rack connections via Ethernet.
- Racks are interconnected through switches.
Search Result Ranking
Poisoning Search Results
- Adjust the ranking function to reward or penalize search results based on relevance and position.
- Acceptable ranking functions may include direct proportionality to rank values, such as `rank`, `rank^2`, or `n^rank`.
Semantic Web and RDF
RDF (Resource Description Framework) represents data as triples of the form `Subject - Predicate -> Object`, enabling the linking of disparate data sources and enhancing search result breadth and accuracy.
Benefits of RDF
- Data Integration: Allows data merging even when structures vary.
- Faceted Search: Provides facets for filtering and refining search results.
- Inference: Enables search engines to infer relationships, enriching search results.
Performance Optimizations
Data Structures
Utilize efficient data structures like inverted indexes and hash tables for rapid access and retrieval.
Computational Machinery
Employ distributed computing and specialized hardware to accelerate data processing.
Disk Space
Apply data compression and deduplication to optimize storage requirements.
Bandwidth
Leverage caching and CDNs to minimize bandwidth usage and enhance retrieval speeds.
Classification Techniques
Aggregation Methods
- Rocchio: Centroid-based aggregation.
- k-Nearest Neighbors (kNN): Aggregation based on the closest k documents.
Geometry
Both Rocchio and kNN utilize a geometry of points within the feature space.
Robustness
- Rocchio: Sensitive to noise and outliers.
- kNN: Offers better robustness to noise compared to Rocchio and Nearest Neighbor.
- Nearest Neighbor (k=1): Similar to Rocchio in terms of sensitivity to noise.