2024-01-16 Search Engine
- Hyperlinks
- Internet Archive: Grateful Dead at Internet Archive
Surface Web
- Surface Web: Discoverable by Search Engines
- Deep Web: Not discoverable by search engines (surface-level crawlers); estimated to hold about 90% of the internet's data
- Dark Web: Accessible through specific browsers like Tor
Google Pizza Box
Web Crawlers with Recursions
Web crawlers recurse over hyperlinks:
crawl(link) -> get all hyperlinks on that page -> for each discovered link, crawl(link) (see the sketch below)
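A minimal sketch of this recursion in Python (standard library only). The depth limit and the visited set are assumptions added here so the example cannot loop forever, and the regex-based link extraction is a simplification rather than a real HTML parser.

```python
import re
import urllib.request
from urllib.parse import urljoin

visited = set()

def crawl(url, depth=2):
    # Stop when the page was already seen or the depth budget is spent.
    if url in visited or depth == 0:
        return
    visited.add(url)
    try:
        html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
    except Exception:
        return
    # Get all hyperlinks on the page, then recurse on each one.
    for href in re.findall(r'href="([^"]+)"', html):
        crawl(urljoin(url, href), depth - 1)

crawl("https://example.com")
```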
Web Crawling Issues
How to Crawl?
- Quality — find "best" pages
- Efficiency — avoid duplications
- Etiquette — don't degrade the website's performance; robots.txt states what a site permits (see the robots.txt check below). Related news: New York Times sues Microsoft and OpenAI for 'billions'
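A small sketch of the etiquette check using Python's standard urllib.robotparser; the site URL and the "MyCrawler" user-agent string are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()

# Ask whether our crawler is allowed to fetch a given page.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```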
How much to crawl?
- Coverage — what percentage of the web is crawled?
- Relative coverage — how much do we crawl compared to competitors?
How often do you crawl?
- Freshness — how much of what we crawled has since changed?
Cohere Rerank
Simplest Crawler Lifecycle
- Begin with known seed pages
- Fetch and parse a page, then loop
- Whenever you discover a new link, enqueue it
- Dequeue URLs one by one (see the sketch below)
- Not feasible on one machine; the crawl must be distributed
- Challenges
- Even non-malicious sites pose problems:
- Latency and bandwidth of remote servers
- robots.txt stipulations prevent certain pages from being visited
- Mirrored or duplicate content
- Politeness — visit any single site only occasionally
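A sketch of this lifecycle with a FIFO frontier and a seen-set; the fetch_and_extract_links() helper is a hypothetical stand-in for real downloading and parsing.

```python
from collections import deque

def fetch_and_extract_links(url):
    """Placeholder: download `url` and return the hyperlinks found on it."""
    return []

def crawl(seeds):
    frontier = deque(seeds)                        # known seed pages
    seen = set(seeds)
    while frontier:
        url = frontier.popleft()                   # dequeue one by one
        for link in fetch_and_extract_links(url):  # fetch and parse a page
            if link not in seen:                   # a newly discovered link...
                seen.add(link)
                frontier.append(link)              # ...gets enqueued

crawl(["https://example.com"])
```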
Crawling Strategies
- BFS crawling brings in high-quality pages early in the crawl
- Adding links to the end of the queue (FIFO) results in a BFS
- Adding them to the front of the queue (LIFO) yields a DFS
- Prioritizing links in a certain way can create a "focused crawler" that targets "interesting" pages, such as
- documents that change frequently
- those whose content is relevant to a specific topic.
- One way to re-order the URLs in the frontier queue (see the priority-queue sketch below) is by
- High In-degree
- High PageRank.
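A sketch of how the frontier's queue discipline determines the strategy: appending to the back of a deque and popping from the front gives BFS, popping from the back gives DFS, and a priority queue keyed on a score gives a focused crawler. The score() function and the in-degree numbers below are illustrative assumptions, not a real ranking.

```python
import heapq

def score(url, in_degree):
    """Hypothetical priority: negated in-degree (heapq is a min-heap)."""
    return -in_degree.get(url, 0)

def focused_order(urls, in_degree):
    # Priority-queue frontier: "interesting" pages come out first.
    heap = [(score(u, in_degree), u) for u in urls]
    heapq.heapify(heap)
    while heap:
        _, url = heapq.heappop(heap)   # highest in-degree first
        yield url

in_degree = {"https://a.example": 120, "https://b.example": 3}
print(list(focused_order(in_degree.keys(), in_degree)))
```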
Avoid Page Duplications
- Determine if the page is already seen
- Must store URLs in a "standard" format
- Must develop a fast way to check whether a URL has already been seen: use a trie with "/" as the separator (see the trie sketch after this list)
- Determine if a new page has already been seen:
- Identify whether a particular page was indexed
- Identify whether a near-identical page was already indexed
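A minimal sketch of the "/"-separated trie using nested dicts; the "$" end-marker is an implementation choice for this example, not part of the technique itself.

```python
def add_url(trie, url):
    node = trie
    for part in url.rstrip("/").split("/"):
        node = node.setdefault(part, {})
    node["$"] = True          # marks the end of a stored URL

def seen(trie, url):
    node = trie
    for part in url.rstrip("/").split("/"):
        if part not in node:
            return False
        node = node[part]
    return "$" in node

trie = {}
add_url(trie, "http://example.com/a/b")
print(seen(trie, "http://example.com/a/b"))   # True
print(seen(trie, "http://example.com/a/c"))   # False
```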
Anomalies
- Some anchors don't have links
- Some anchors produce dynamic pages via JavaScript, which can lead to crawl loops
Representing URLs
URLs can be lengthy, averaging about 80 bytes, so storing 1 trillion URLs would require about 80 terabytes. By the same calculation, the 30 trillion unique URLs Google reported finding would need 2,400 terabytes (2.4 petabytes) of storage.
One proposed method to determine if a new URL has already been seen involves hashing the host or domain name first, followed by using a trie data structure to check if the path or resource matches one in the URL database.
Another proposed method is to sort the URLs alphabetically and store them in a text file with delta encoding: each URL is stored as the difference (delta) from the previous URL. This significantly reduces storage requirements, but restoring a URL is slower, since all the deltas back to a full URL must be applied. To speed this up, the full URL is checkpointed periodically (see the sketch below).
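A sketch of delta (front-coding) storage with periodic checkpoints. The CHECKPOINT interval and the (shared-prefix-length, suffix) representation of a delta are assumptions about how the differences are stored.

```python
CHECKPOINT = 4  # assumed interval between full-URL checkpoints

def encode(sorted_urls):
    out, prev = [], ""
    for i, url in enumerate(sorted_urls):
        if i % CHECKPOINT == 0:
            out.append((None, url))          # full URL checkpoint
        else:
            k = 0
            while k < min(len(prev), len(url)) and prev[k] == url[k]:
                k += 1
            out.append((k, url[k:]))         # (shared prefix length, suffix)
        prev = url
    return out

def decode(entries):
    urls, prev = [], ""
    for k, tail in entries:
        prev = tail if k is None else prev[:k] + tail
        urls.append(prev)
    return urls

urls = sorted(["http://a.com/x", "http://a.com/y", "http://b.com/", "http://b.com/z"])
assert decode(encode(urls)) == urls
```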
Normalize URLs
- Convert the scheme and host to lowercase, as they are case-insensitive. Example: HTTP://www.Example.com/ → http://www.example.com/
- Capitalize letters in escape sequences. All letters within a percent-encoding triplet should be capitalized. Example: http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
- Decode percent-encoded octets of unreserved characters. Example: http://www.example.com/%7Eusername/ → http://www.example.com/~username/
- Remove the default port if it appears. Example: http://www.example.com:80/bar.html → http://www.example.com/bar.html
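A sketch of these four rules using urllib.parse; it ignores edge cases such as reserved characters that must stay percent-encoded, so it is a simplification rather than a full normalizer.

```python
from urllib.parse import urlsplit, urlunsplit, unquote, quote

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                 # rule 1: lowercase scheme
    host = (parts.hostname or "").lower()         # rule 1: lowercase host
    port = parts.port
    # Rule 4: keep the port only if it is not the scheme's default.
    if port and port != {"http": 80, "https": 443}.get(scheme):
        host = f"{host}:{port}"
    # Rules 2 and 3: decode percent-encoded octets, then re-encode what is needed;
    # quote() leaves unreserved characters (including "~") alone and emits uppercase hex.
    path = quote(unquote(parts.path))
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))

print(normalize("HTTP://www.Example.com:80/%7Eusername/a%c2%b1b"))
# -> http://www.example.com/~username/a%C2%B1b
```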
Avoiding Spider Traps
A spider trap occurs when a crawler keeps revisiting the same web pages indefinitely. One of the most common traps involves session IDs, such as those managed by J2EE, ASP, .NET, and PHP. Session IDs are used to track visitors, and some websites embed a unique ID in the URL, issuing a new one with each request. The problem arises when Googlebot visits a page, spiders it, and leaves; when it later finds a link back to that page, it is handed a different session ID, so the link appears to be a new URL. One way to avoid such traps is for the crawler to be cautious when the query string contains "ID="; another is to monitor the URL length and stop if it grows too long (see the sketch below).
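A sketch of these two heuristics; the length threshold and the list of session-parameter names are assumptions for illustration, not fixed standards.

```python
from urllib.parse import urlsplit, parse_qs

MAX_URL_LEN = 256  # assumed cutoff for "suspiciously long" URLs
SESSION_PARAMS = {"id", "sid", "sessionid", "jsessionid", "phpsessid"}

def looks_like_trap(url):
    if len(url) > MAX_URL_LEN:                     # runaway URL length
        return True
    query = parse_qs(urlsplit(url).query)
    # "ID=" style session identifiers in the query string
    return any(p.lower() in SESSION_PARAMS for p in query)

print(looks_like_trap("http://example.com/page?ID=abc123"))  # True
```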
Handling Spam Web Pages
The first generation of spam web pages used repeated words to rank high on search engines, making the words invisible by rendering them in the same color as the background. The hidden words still counted, because search engines indexed the raw page text.
The second generation of spam web pages used a technique called cloaking. When a web server detected a request from a crawler, it returned a different page than the one returned from a user request. This caused the page to be mistakenly indexed.
A third generation of spam web pages was the doorway page. These pages contained text and metadata chosen to rank highly on specific search keywords, but when a browser actually requested the doorway page, it received a more "commercially oriented" page with more ads.
Measuring and Tuning a Crawler
Improving the performance of a crawler essentially involves three main aspects: enhancing the speed of URL parsing, improving network bandwidth speed, and enhancing fault tolerance. There are several other issues that may arise, such as determining how often to restart the process (refresh strategies), detecting duplicate pages, identifying mirror sites, accelerating DNS lookup, normalizing URLs, and handling malformed HTML.