Data Preservation

The LOCKSS ("Lots of Copies Keep Stuff Safe") project, under the auspices of Stanford University, is a peer-to-peer network that develops and supports an open source system allowing libraries to collect, preserve and provide their readers with access to material published on the web. Its main goal is digital preservation. LOCKSS

After the founding of Joseon, one copy of the Annals (Sillok) was kept at the Chunchugwan, and one copy each was stored at the archives (sago) in Chungju, Seongju, and Jeonju, major cities that were transportation hubs, for a total of four copies. During the Imjin War, however, every copy burned except the one kept at the Jeonju archive. The Jeonju copy survived because, when the war broke out, local people moved it deep into the mountains ahead of time. After the war ended, the court reprinted the Annals using the Jeonju copy as the original. 우리역사넷

How to become a pirate archivist - Anna's Blog

The world produces more knowledge and culture than ever before, but also more of it is being lost than ever before. Humanity entrusts corporations like academic publishers, streaming services, and social media companies with this heritage, and they have often not proven to be great stewards.


Alexandra Elbakyan, the founder of Sci-Hub, is very open about her activities. But she is at high risk of being arrested if she visits a Western country at this point, and she could face decades of prison time. Secrecy comes with a psychological cost. Most people love being recognized for their work, yet you cannot take any credit for this in real life. Even simple things can be challenging, like friends asking you what you have been up to (at some point, "messing with my NAS / homelab" gets old).

Domain Selection

Think about your philosophy.

Target Selection

Prefer targets that are large, unique, accessible, and insightful.

Metadata Scraping

We use Python scripts, sometimes curl, and a MySQL database to store the results. Go through a few dozen pages to understand how the site is structured. To get around restrictions, there are a few things you can try:

  • Find another IP without the restrictions you are facing.
  • Find other API endpoints without the restrictions you are facing.
  • Find the download rate that gets your IP blocked. How long does the block last? Or do you get throttled instead?
  • Try creating a new account.
  • Try using HTTP/2 to keep connections open. Does that change the request-response rate?
  • Is there a comprehensive "all-in-one" page? Is the information listed there sufficient?
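
The rate questions in the list above can be explored with a small pacing helper. The following is a minimal sketch, not the method the notes describe: the names `next_delay` and `fetch_paced`, the 10% easing factor, and the choice of HTTP 429 and 503 as "slow down" signals are all assumptions made for illustration. It uses only the Python standard library.

```python
import time
import urllib.error
import urllib.request


def next_delay(delay, status, min_delay=1.0, max_delay=300.0):
    """Return the pause (in seconds) to use before the next request.

    Assumption: HTTP 429 and 503 mean "slow down". On those statuses the
    delay doubles (capped at max_delay); otherwise it eases down by 10%
    (never below min_delay).
    """
    if status in (429, 503):
        return min(delay * 2, max_delay)
    return max(delay * 0.9, min_delay)


def fetch_paced(urls, delay=2.0):
    """Yield (url, status, body) for each URL, adapting the pause as it goes.

    HTTP error responses are reported through their status code; network
    failures (urllib.error.URLError) are not handled and will propagate.
    """
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                status, body = resp.status, resp.read()
        except urllib.error.HTTPError as err:
            status, body = err.code, b""
        delay = next_delay(delay, status)
        yield url, status, body
        time.sleep(delay)
```

Logging the delay at which 429 responses start to appear gives a rough estimate of the rate the server tolerates.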

You should collect title, filename, location, ID, ISBN, DOI, size, hash (MD5, SHA-1), date added/modified, description, category, tags, authors, language, etc.
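
As an illustration of how these fields might be stored, here is a minimal sketch of a metadata table. The notes mention MySQL; this example swaps in SQLite (Python's built-in `sqlite3` module) so that it runs without a server. The table name `records`, the column names, and the helpers `open_store` and `upsert_record` are hypothetical.

```python
import sqlite3

# Column names are assumptions that mirror the fields listed above.
FIELDS = (
    "id", "title", "filename", "location", "isbn", "doi", "size_bytes",
    "md5", "sha1", "added", "modified", "description", "category",
    "tags", "authors", "language",
)


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    other_columns = ", ".join(f"{name} TEXT" for name in FIELDS[1:])
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, {other_columns})"
    )
    return conn


def upsert_record(conn, record):
    """Insert or replace one record given as a dict; missing fields become NULL."""
    placeholders = ", ".join("?" for _ in FIELDS)
    conn.execute(
        f"INSERT OR REPLACE INTO records ({', '.join(FIELDS)}) VALUES ({placeholders})",
        [record.get(name) for name in FIELDS],
    )
    conn.commit()
```

Keying on the source site's own `id` means a re-scrape of the same item overwrites the old row instead of creating a duplicate.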

Downloading the Page

Save the raw HTML and process it later. This way, you don't need to re-download the HTML if you figure out you missed something. Or use the metadata to prioritize a reasonable subset of data to download. Start by downloading files, and expand slowly.
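
The "save first, parse later" idea can be sketched as follows. This is one possible layout, not the one the notes use: the helper names `fetch_url` and `save_raw`, the `.html` suffix, and the choice of naming files by the SHA-1 of the URL are all assumptions.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_url(url):
    """Download a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def save_raw(url, out_dir, fetch=fetch_url):
    """Store the untouched response for `url` under `out_dir`, once.

    The file is named after the SHA-1 of the URL, so a repeated run
    reuses the saved copy instead of downloading the page again.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html")
    if not path.exists():
        path.write_bytes(fetch(url))
    return path
```

Parsing then runs over the files on disk, so a parser bug found later costs a re-parse rather than a re-download. The `fetch` parameter exists so the download step can be replaced, for example by a stub in tests.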


Riskiest. Even after selecting a good VPN, not filling in your details in any forms, and perhaps using a particular browser session (or even a different computer), a highly motivated nation-state actor can look at incoming and outgoing data flows for VPN servers and deduce who you are.