How to become a pirate archivist

annas-blog.org, 2022-10-17 (translations: 中文 [zh])

Before we dive in, two updates on the Pirate Library Mirror (EDIT: moved to Anna’s Archive):
1. We got some extremely generous donations. The first was $10k from the anonymous individual who also has been supporting "bookwarrior", the original founder of Library Genesis. Special thanks to bookwarrior for facilitating this donation. The second was another $10k from an anonymous donor, who got in touch after our last release, and was inspired to help. We also had a number of smaller donations. Thanks so much for all your generous support. We have some exciting new projects in the pipeline which this will support, so stay tuned.
2. We had some technical difficulties with the size of our second release, but our torrents are up and seeding now. We also got a generous offer from an anonymous individual to seed our collection on their very-high-speed servers, so we're doing a special upload to their machines, after which everyone else who is downloading the collection should see a large improvement in speed.

Entire books can be written about the why of digital preservation in general, and pirate archivism in particular, but let us give a quick primer for those who are not too familiar. The world is producing more knowledge and culture than ever before, but also more of it is being lost than ever before. Humanity largely entrusts corporations like academic publishers, streaming services, and social media companies with this heritage, and they have often not proven to be great stewards. Check out the documentary Digital Amnesia, or really any talk by Jason Scott.

There are some institutions that do a good job archiving as much as they can, but they are bound by the law. As pirates, we are in a unique position to archive collections that they cannot touch, because of copyright enforcement or other restrictions. We can also mirror collections many times over, across the world, thereby increasing the chances of proper preservation.

For now, we won't get into discussions about the pros and cons of intellectual property, the morality of breaking the law, musings on censorship, or the issue of access to knowledge and culture. With all that out of the way, let's dive into the how. We'll share how our team became pirate archivists, and the lessons that we learned along the way. There are many challenges when you embark on this journey, and hopefully we can help you through some of them.

Community

The first challenge might be a surprising one. It is not a technical problem, or a legal problem. It is a psychological problem: doing this work in the shadows can be incredibly lonely. Depending on what you're planning to do, and your threat model, you might have to be very careful. On the one end of the spectrum we have people like Alexandra Elbakyan*, the founder of Sci-Hub, who is very open about her activities. But she is at high risk of being arrested if she would visit a western country at this point, and could face decades of prison time. Is that a risk you would be willing to take? We are at the other end of the spectrum; being very careful not to leave any trace, and having strong operational security.

* As mentioned on HN by "ynno", Alexandra initially didn't want to be known: "Her servers were set up to emit detailed error messages from PHP, including full path of faulting source file, which was under directory /home/ringo-ring, which could be traced to a username she had online on an unrelated site, attached to her real name. Before this revelation, she was anonymous." So, use random usernames on the computers you use for this stuff, in case you misconfigure something.

That secrecy, however, comes with a psychological cost. Most people love being recognized for the work that they do, and yet you cannot take any credit for this in real life. Even simple things can be challenging, like friends asking you what you have been up to (at some point "messing with my NAS / homelab" gets old).

This is why it is so important to find some community. You can give up some operational security by confiding in some very close friends, who you know you can trust deeply. Even then be careful not to put anything in writing, in case they have to turn over their emails to the authorities, or if their devices are compromised in some other manner.

Better still is to find some fellow pirates. If your close friends are interested in joining you, great! Otherwise, you might be able to find others online. Sadly this is still a niche community. So far we have found only a handful of others who are active in this space. Good starting places seem to be the Library Genesis forums, and r/DataHoarder. The Archive Team also has likeminded individuals, though they operate within the law (even if in some grey areas of the law). The traditional "warez" and pirating scenes also have folks who think in similar ways.

We are open to ideas on how to foster community and explore ideas. Feel free to message us on Twitter or Reddit. Perhaps we could host some sort of forum or chat group. One challenge is that this can easily get censored when using common platforms, so we would have to host it ourselves. There is also a tradeoff between having these discussions fully public (more potential engagement) versus making it private (not letting potential "targets" know that we're about to scrape them). We'll have to think about that. Let us know if you are interested in this!

Projects

When we do a project, it has a couple of phases:

Domain selection / philosophy: Where do you roughly want to focus on, and why? What are your unique passions, skills, and circumstances that you can use to your benefit?
Target selection: Which specific collection will you mirror?
Metadata scraping: Cataloging information about the files, without actually downloading the (often much larger) files themselves.
Data selection: Based on the metadata, narrowing down which data is most relevant to archive right now. Could be everything, but often there is a reasonable way to save space and bandwidth.
Data scraping: Actually getting the data.
Distribution: Packaging it up in torrents, announcing it somewhere, getting people to spread it.

These are not completely independent phases, and often insights from a later phase send you back to an earlier phase. For example, during metadata scraping you might realize that the target that you selected has defensive mechanisms beyond your skill level (like IP blocks), so you go back and find a different target.

1. Domain selection / philosophy

There is no shortage of knowledge and cultural heritage to be saved, which can be overwhelming. That's why it's often useful to take a moment and think about what your contribution can be.

Everyone has a different way of thinking about this, but here are some questions that you could ask yourself:

Why are you interested in this? What are you passionate about? If we can get a bunch of people who all archive the kinds of things that they specifically care about, that would cover a lot! You will know a lot more than the average person about your passion, like what is important data to save, what are the best collections and online communities, and so on.
What skills do you have that you can use to your benefit? For example, if you are an online security expert, you can find ways of defeating IP blocks for secure targets. If you are great at organizing communities, then perhaps you can rally some people together around a goal. It is useful to know some programming though, if only for keeping good operational security throughout this process.
How much time do you have for this? Our advice would be to start small and doing bigger projects as you get the hang of it, but it can get all-consuming.
What would be a high-leverage area to focus on? If you're going to spend X hours on pirate archiving, then how can you get the biggest "bang for your buck"?
What are unique ways that you are thinking about this? You might have some interesting ideas or approaches that others might have missed.

In our case, we cared in particular about the long term preservation of science. We knew about Library Genesis, and how it was fully mirrored many times over using torrents. We loved that idea. Then one day, one of us tried to find some scientific textbooks on Library Genesis, but couldn't find them, bringing into doubt how complete it really was. We then searched those textbooks online, and found them in other places, which planted the seed for our project. Even before we knew about the Z-Library, we had the idea of not trying to collect all those books manually, but to focus on mirroring existing collections, and contributing them back to Library Genesis.

2. Target selection

So, we have our area that we are looking at, now which specific collection do we mirror? There are a couple of things that make for a good target:

Large
Unique: not already well-covered by other projects.
Accessible: does not use tons of layers of protection to prevent you from scraping their metadata and data.
Special insight: you have some special information about this target, like you somehow have special access to this collection, or you figured out how to defeat their defenses. This is not required (our upcoming project does not do anything special), but it certainly helps!

When we found our science textbooks on websites other than Library Genesis, we tried to figure out how they made their way onto the internet. We then found the Z-Library, and realized that while most books don't first make their appearance there, they do eventually end up there. We learned about its relationship to Library Genesis, and the (financial) incentive structure and superior user interface, both of which made it a much more complete collection. We then did some preliminary metadata and data scraping, and realized that we could get around their IP download limits, leveraging one of our members' special access to lots of proxy servers.

As you're exploring different targets, it is already important to hide your tracks by using VPNs and throwaway email addresses, which we'll talk about more later.

3. Metadata scraping

Let's get a bit more technical here. For actually scraping the metadata from websites, we have kept things pretty simple. We use Python scripts, sometimes curl, and a MySQL database to store the results in. We haven't used any fancy scraping software which can map complex websites, since so far we only needed to scrape one or two kinds of pages by just enumerating through ids and parsing the HTML. If there aren't easily enumerated pages, then you might need a proper crawler that tries to find all pages.

Before you start scraping a whole website, try doing it manually for a bit. Go through a few dozen pages yourself, to get a sense for how that works. Sometimes you will already run into IP blocks or other interesting behavior this way. The same goes for data scraping: before getting too deep into this target, make sure you can actually download its data effectively.

To get around restrictions, there are a few things you can try. Are there any other IP addresses or servers that host the same data but do not have the same restrictions? Are there any API endpoints that do not have restrictions, while others do? At what rate of downloading does your IP get blocked, and for how long? Or are you not blocked but throttled down? What if you create a user account, how do things change then? Can you use HTTP/2 to keep connections open, and does that increase the rate at which you can request pages? Are there pages that list multiple files at once, and is the information listed there sufficient?

Things you probably want to save include:

Title
Filename / location
ID: can be some internal ID, but IDs like ISBN or DOI are useful too.
Size: to calculate how much disk space you need.
Hash (md5, sha1): to confirm that you downloaded the file properly.
Date added/modified: so you can come back later and download files that you didn't download before (though you can often also use the ID or hash for this).
Description, category, tags, authors, language, etc.

We typically do this in two stages. First we download the raw HTML files, usually directly into MySQL (to avoid lots of small files, which we talk more about below). Then, in a separate step, we go through those HTML files and parse them into actual MySQL tables. This way you don't have to re-download everything from scratch if you discover a mistake in your parsing code, since you can just reprocess the HTML files with the new code. It's also often easier to parallelize the processing step, thus saving some time (and you can write the processing code while the scraping is running, instead of having to write both steps at once).

Finally, note that for some targets metadata scraping is all there is. There are some huge metadata collections out there that aren't properly preserved.

4. Data selection

Often you can use the metadata to figure out a reasonable subset of data to download. Even if you eventually want to download all the data, it can be useful to prioritize the most important items first, in case you get detected and defences are improved, or because you would need to buy more disks, or simply because something else comes up in your life before you can download everything.

For example, a collection might have multiple editions of the same underlying resource (like a book or a film), where one is marked as being the best quality. Saving those editions first would make a lot of sense. You might eventually want to save all editions, since in some cases the metadata might be tagged incorrectly, or there might be unknown tradeoffs between editions (for example, the "best edition" might be best in most ways but worse in other ways, like a film having a higher resolution but missing subtitles).

You can also search your metadata database to find interesting things. What is the biggest file that is hosted, and why is it so big? What is the smallest file? Are there interesting or unexpected patterns when it comes to certain categories, languages, and so on? Are there duplicate or very similar titles? Are there patterns to when data was added, like one day in which many files were added at once? You can often learn a lot by looking at the dataset in different ways.

In our case, we deduplicated Z-Library books against the md5 hashes in Library Genesis, thereby saving a lot of download time and disk space. This is a pretty unique situation though. In most cases there are no comprehensive databases of which files are already properly preserved by fellow pirates. This in itself is a huge opportunity for someone out there. It would be great to have a regularly updated overview of things like music and films that are already widely seeded on torrent websites, and are therefore lower priority to include in pirate mirrors.

5. Data scraping

Now you're ready to actually download the data in bulk. As mentioned before, at this point you should already manually have downloaded a bunch of files, to better understand the behavior and restrictions of the target. However, there will still be surprises in store for you once you actually get to downloading lots of files at once.

Our advice here is mainly to keep it simple. Start by just downloading a bunch of files. You can use Python, and then expand to multiple threads. But sometimes even simpler is to generate Bash files directly from the database, and then running multiple of them in multiple terminal windows to scale up. A quick technical trick worth mentioning here is using OUTFILE in MySQL, which you can write anywhere if you disable "secure_file_priv" in mysqld.cnf (and be sure to also disable/override AppArmor if you're on Linux).

We store the data on simple hard disks. Start out with whatever you have, and expand slowly. It can be overwhelming to think about storing hundreds of TBs of data. If that is the situation that you're facing, just put out a good subset first, and in your announcement ask for help in storing the rest. If you do want to get more hard drives yourself, then r/DataHoarder has some good resources on getting good deals.

Try not to worry too much about fancy filesystems. It is easy to fall into the rabbit hole of setting up things like ZFS. One technical detail to be aware of though, is that many filesystems don't deal well with lots of files. We've found that a simple workaround is to create multiple directories, e.g. for different ID ranges or hash prefixes.

After downloading the data, be sure to check the integrity of the files using hashes in the metadata, if available.

6. Distribution

You have the data, thereby giving you possession of the world's first pirate mirror of your target (most likely). In many ways the hardest part is over, but the riskiest part is still ahead of you. After all, so far you've been stealth; flying under the radar. All you had to do was using a good VPN throughout, not filling in your personal details in any forms (duh), and perhaps using a special browser session (or even a different computer).

Now you have to distribute the data. In our case we first wanted to contribute the books back to Library Genesis, but then quickly discovered the difficulties in that (fiction vs non-fiction sorting). So we decided on distribution using Library Genesis-style torrents. If you have the opportunity to contribute to an existing project, then that could save you a lot of time. However, there are not many well-organized pirate mirrors out there currently.

So let's say you decide on distributing torrents yourself. Try to keep those files small, so they are easy to mirror on other websites. You will then have to seed the torrents yourself, while still staying anonymous. You can use a VPN (with or without port forwarding), or pay with tumbled Bitcoins for a Seedbox. If you don't know what some of those terms mean, you'll have a bunch of reading to do, since it's important that you understand the risk tradeoffs here.

You can host the torrent files themselves on existing torrent websites. In our case, we chose to actually host a website, since we also wanted to spread our philosophy in a clear way. You can do this yourself in a similar manner (we use Njalla for our domains and hosting, paid for with tumbled Bitcoins), but also feel free to contact us to have us host your torrents. We are looking to build a comprehensive index of pirate mirrors over time, if this idea catches on.

As for VPN selection, much has been written about this already, so we'll just repeat the general advice of choosing by reputation. Actual court-tested no-log policies with long track records of protecting privacy is the lowest risk option, in our opinion. Note that even when you do everything right, you can never get to zero risk. For example, when seeding your torrents, a highly motivated nation-state actor can probably look at incoming and outgoing data flows for VPN servers, and deduce who you are. Or you can just simply mess up somehow. We probably already have, and will again. Luckily, nation states don't care that much about piracy.

One decision to make for each project, is whether to publish it using the same identity as before, or not. If you keep using the same name, then mistakes in operational security from earlier projects could come back to bite you. But publishing under different names means that you don't build a longer lasting reputation. We chose to have strong operational security from the start so we can keep using the same identity, but we won't hesitate to publish under a different name if we mess up or if the circumstances call for it.

Getting the word out can be tricky. As we said, this is still a niche community. We originally posted on Reddit, but really got traction on Hacker News. For now our recommendation is to post it in a few places and see what happens. And again, contact us. We would love to spread the word of more pirate archivism efforts.

Conclusion

Hopefully this is helpful for newly starting pirate archivists. We're excited to welcome you to this world, so don't hesitate to reach out. Let's preserve as much of the world's knowledge and culture as we can, and mirror it far and wide.

- Anna and the team (Reddit)