
Preservation of Data

The LOCKSS ("Lots of Copies Keep Stuff Safe") project, under the auspices of Stanford University, is a peer-to-peer network that develops and supports an open-source system allowing libraries to collect, preserve, and provide their readers with access to material published on the Web. Its main goal is digital preservation.

์กฐ์„  ๊ฑด๊ตญ ์ดํ›„ ์‹ค๋ก์€ ์ถ˜์ถ”๊ด€์— ํ•œ ๋ถ€๋ฅผ ๋ณด๊ด€ํ•˜๊ณ , ์ถฉ์ฃผยท์„ฑ์ฃผยท์ „์ฃผ ๋“ฑ ๊ตํ†ต์˜ ์š”์ง€์˜€๋˜ ์ฃผ์š” ๋„์‹œ์˜ ์‚ฌ๊ณ ์— ํ•œ ๋ถ€์”ฉ ๋ชจ๋‘ 4๋ถ€๋ฅผ ๋ณด๊ด€ํ–ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ์ž„์ง„์™œ๋ž€ ๋•Œ ์‹ค๋ก์€ ์ „์ฃผ ์‚ฌ๊ณ ์— ๋ณด๊ด€ํ–ˆ๋˜ ๊ฒƒ๋งŒ ๋‚จ๊ณ  ๋ชจ๋‘ ๋ถˆํƒ€ ๋ฒ„๋ ธ๋‹ค. ์ „์ฃผ ์‚ฌ๊ณ ์˜ ์‹ค๋ก์€ ๋‚œ๋ฆฌ๊ฐ€ ๋‚˜์ž ๋ฐฑ์„ฑ๋“ค์ด ๋ฏธ๋ฆฌ ๊นŠ์€ ์‚ฐ์†์œผ๋กœ ์˜ฎ๊ฒจ์„œ ๋ฌด์‚ฌํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ž„์ง„์™œ๋ž€์ด ๋๋‚œ ํ›„ ์กฐ์ •์—์„œ๋Š” ์ „์ฃผ ์‚ฌ๊ณ ์˜ ์‹ค๋ก์„ ์›๋ณธ์œผ๋กœ ์‚ผ์•„ ์‹ค๋ก์„ ๋‹ค์‹œ ์ธ์‡„ํ–ˆ๋‹ค. ์šฐ๋ฆฌ์—ญ์‚ฌ๋„ท

How to become a pirate archivist - Anna's Blog

The world is producing more knowledge and culture than ever before, but also more of it is being lost than ever before. Humanity largely entrusts corporations like academic publishers, streaming services, and social media companies with this heritage, and they have often not proven to be great stewards.


Alexandra Elbakyan, founder of Sci-Hub, is very open about her activities. But at this point she is at high risk of being arrested if she visits a Western country, and could face decades of prison time. Secrecy comes with a psychological cost. Most people love being recognized for the work that they do, and yet you cannot take any credit for this in real life. Even simple things can be challenging, like friends asking you what you have been up to (at some point "messing with my NAS / homelab" gets old).

Domain Selection

Think about your philosophy.

Target Selection

Better if large, unique, accessible, and insightful.

Metadata Scraping

We use Python scripts, sometimes curl, and a MySQL database to store the results in. Go through a few dozen pages yourself, to get a sense for how that works. To get around restrictions, there are a few things you can try.

  • Find another IP that doesn't face the restrictions you are hitting.
  • Find other API endpoints without those restrictions.
  • Find the download rate at which your IP gets blocked. How long does the block last? Or do you merely get throttled?
  • Try creating a new account.
  • Try using HTTP/2 to keep connections open. Does that change the request-response rate?
  • Is there a comprehensive "all-in-one" page? Is the information listed there sufficient?
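When probing for rate limits, it helps to wrap your fetches in retry logic with exponential backoff, so a temporary block doesn't kill the run. A minimal sketch in Python (the status codes treated as "throttled" and all delay values are assumptions; tune them to the block/throttle behavior you actually observe on the target):

```python
def backoff_delays(base=1.0, factor=2.0, cap=300.0):
    """Yield exponentially growing wait times in seconds, capped.

    base/factor/cap are hypothetical starting values, not anything
    the target documents.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def fetch_with_backoff(fetch, url, max_tries=6):
    """Call fetch(url) until it succeeds or retries run out.

    fetch should return (status_code, body). 429/403 are treated as
    "throttled, wait and retry"; anything below 400 is success.
    """
    delays = backoff_delays()
    for _ in range(max_tries):
        status, body = fetch(url)
        if status < 400:
            return body
        if status in (429, 403):
            wait = next(delays)
            # In real use: time.sleep(wait). Omitted here so the
            # sketch runs without a live server.
            continue
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("gave up after max_tries attempts")
```

The same loop doubles as a measurement tool: log each status and delay, and you learn empirically how long blocks last and whether the server throttles rather than blocks.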

You might want to collect title, filename, location, id, isbn, doi, size, hash (md5, sha1), date added/modified, description, category, tags, authors, language, etc.
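A sketch of what storing those fields could look like. The post says the real pipeline uses MySQL; sqlite3 is used here only so the example is self-contained, and the table/column names are assumptions that mirror the field list above:

```python
import sqlite3

# Hypothetical schema; column names follow the field list above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS records (
    id TEXT PRIMARY KEY,
    title TEXT,
    filename TEXT,
    isbn TEXT,
    doi TEXT,
    size_bytes INTEGER,
    md5 TEXT,
    date_added TEXT,
    language TEXT
)
"""


def store(conn, record):
    """Upsert one scraped metadata record (a dict of column: value)."""
    cols = ",".join(record)
    marks = ",".join("?" * len(record))
    conn.execute(
        f"INSERT OR REPLACE INTO records ({cols}) VALUES ({marks})",
        list(record.values()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store(conn, {"id": "b1", "title": "Example", "size_bytes": 1024})
```

Upserting by id means re-running the scraper over the same pages is harmless, which matters when runs get interrupted and restarted.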

Downloading the Page

Save the raw HTML and process it later. This way you don't need to re-download the HTML if you figure out you missed something. Or use metadata to prioritize a reasonable subset of data to download. Start by downloading files, and expand slowly.
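The save-first, parse-later split can be sketched like this: the first pass writes bytes to disk untouched, and a second pass extracts fields from the saved files (here just the &lt;title&gt;, via the standard-library parser; file layout and names are illustrative):

```python
import pathlib
from html.parser import HTMLParser


def save_page(raw_html, page_id, out_dir="raw_pages"):
    """First pass: write raw HTML to disk untouched, so a parser
    bug later never forces a re-download."""
    d = pathlib.Path(out_dir)
    d.mkdir(exist_ok=True)
    path = d / f"{page_id}.html"
    path.write_text(raw_html, encoding="utf-8")
    return path


class TitleGrabber(HTMLParser):
    """Second pass: pull just the <title> out of a saved page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(path):
    parser = TitleGrabber()
    parser.feed(pathlib.Path(path).read_text(encoding="utf-8"))
    return parser.title.strip()
```

If you later realize you also need, say, the author field, you re-run only the extraction pass over the files on disk instead of hitting the site again.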


This is the riskiest part. Even if you select a good VPN, avoid entering your personal details in any forms, and perhaps use a dedicated browser session (or even a separate computer), a highly motivated nation-state actor can probably examine incoming and outgoing data flows at VPN servers and deduce who you are.