2024-01-16 Search Engine
- Hyperlinks
- Internet Archive: Grateful Dead at Internet Archive
Surface Web
- Surface Web: Discoverable by Search Engines
- Deep Web: Not discoverable by search engines (surface-level crawlers). Commonly estimated to hold about 90% of the internet's data
- Dark Web: Accessible through specific browsers like Tor
Google Pizza Box
See content.
Web Crawlers with Recursions
Web crawlers recurse over hyperlinks:
crawler(link) -> get all hyperlinks on that page -> for each link in those hyperlinks, call crawler(link)
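
The recursion above can be sketched in Python. This is a minimal illustration, not a production crawler: `FAKE_WEB`, `LinkExtractor`, `crawl`, and the `max_depth` limit are all assumptions added for the example, and pages come from an in-memory dictionary rather than the network. The `visited` set is what keeps the recursion from looping forever when pages link back to each other.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url, fetch, visited=None, max_depth=3):
    """Recursively visit url and the pages reachable from it."""
    if visited is None:
        visited = set()
    # Stop on pages already seen or once the depth budget is used up.
    if url in visited or max_depth < 0:
        return visited
    visited.add(url)
    html = fetch(url)  # fetch is any callable: url -> HTML text or None
    if html is None:
        return visited
    parser = LinkExtractor()
    parser.feed(html)
    for href in parser.links:
        # Resolve relative links against the current page, then recurse.
        crawl(urljoin(url, href), fetch, visited, max_depth - 1)
    return visited


# Hypothetical three-page site standing in for the real web.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.test/b": '<a href="/a">A</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

For a large crawl, an explicit queue (breadth-first) is usually preferred over recursion, since Python's recursion limit is reached quickly on deep link chains.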
Web Crawling Issues
How to Crawl?
- Quality — find "best" pages
- Efficiency — avoid duplications
- Etiquette — don't degrade the website's performance (the site's robots.txt states which paths crawlers may visit). Related news: New York Times sues Microsoft and OpenAI for 'billions'
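
As a rough sketch of the efficiency and etiquette points, the snippet below filters candidate URLs with the standard library's `urllib.robotparser` and a set of URLs already seen. The robots.txt content, the `ExampleBot` agent name, and the helpers `normalize` and `should_fetch` are hypothetical, chosen only for illustration; a real crawler would download each site's robots.txt rather than hard-coding it.

```python
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def normalize(url):
    """Drop the #fragment so the same page is not counted twice."""
    return urldefrag(url).url


def should_fetch(url, seen, agent="ExampleBot"):
    """True only for URLs that are new and permitted by robots.txt."""
    url = normalize(url)
    if url in seen or not rules.can_fetch(agent, url):
        return False
    seen.add(url)
    return True


seen = set()
candidates = [
    "http://example.test/a",
    "http://example.test/a#top",      # duplicate of the first URL
    "http://example.test/private/x",  # disallowed by robots.txt
    "http://example.test/b",
]
to_fetch = [u for u in candidates if should_fetch(u, seen)]
delay = rules.crawl_delay("ExampleBot")  # seconds to wait between requests

print(to_fetch)
print(delay)
```

Sleeping for `delay` seconds between requests to the same host is one simple way to avoid hurting the site's performance.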
How much to crawl?
- Coverage — What percentage of the web is crawled?
- Relative Coverage — How much do competitors have?
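
One way to make the two coverage questions concrete is to measure them against a sample of known URLs. The sketch below uses made-up sets (`sample`, `ours`, `competitor`) purely to show the arithmetic; the true size of the web is not known, so real measurements rely on estimates of this kind.

```python
# Hypothetical sample of 10 known URLs and two hypothetical indexes.
sample = {f"http://example.test/page{i}" for i in range(10)}
ours = {f"http://example.test/page{i}" for i in range(0, 6)}
competitor = {f"http://example.test/page{i}" for i in range(2, 10)}

our_hits = len(ours & sample)          # sample pages we have indexed
their_hits = len(competitor & sample)  # sample pages the competitor has

coverage = our_hits / len(sample)          # share of the sample we cover
relative_coverage = our_hits / their_hits  # our size relative to theirs

print(f"coverage: {coverage:.0%}")
print(f"relative coverage: {relative_coverage:.2f}")
```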
How often do you crawl?
- Freshness — How much has changed?
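
A simple way to estimate how much has changed between crawls is to store a hash of each page and compare it on the next visit. This is only a sketch under that assumption: the page contents and the helpers `fingerprint` and `changed_urls` are hypothetical, and an exact hash flags even trivial edits (such as a timestamp) as a change.

```python
import hashlib


def fingerprint(body):
    """Return a stable digest of a page's content."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def changed_urls(previous, current):
    """Return URLs whose content differs from the stored fingerprints."""
    return sorted(
        url for url, body in current.items()
        if previous.get(url) != fingerprint(body)
    )


# Hypothetical page contents from two consecutive crawls.
last_crawl = {"/news": "headline v1", "/about": "about us"}
stored = {url: fingerprint(body) for url, body in last_crawl.items()}
this_crawl = {"/news": "headline v2", "/about": "about us"}

stale = changed_urls(stored, this_crawl)
freshness = 1 - len(stale) / len(this_crawl)  # share of pages unchanged

print(stale)
print(freshness)
```

Pages that change often could then be scheduled for more frequent crawls than pages that rarely change.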
Cohere Rerank
See content.