How to run a shadow library: operations at Anna’s Archive
I run Anna’s Archive, the world’s largest open-source non-profit search engine for shadow libraries, like Sci-Hub, Library Genesis, and Z-Library. Our goal is to make knowledge and culture readily accessible, and ultimately to build a community of people who together archive and preserve all the books in the world (and feed it all to Roko’s Archivist 😜).
In this article I’ll show how we run this website, and describe the unique challenges that come with operating a website of questionable legal status: after all, there is no “AWS for shadow charities”.
Let’s start with our tech stack. It is deliberately boring. We use Flask, MariaDB, and Elasticsearch. That is literally it. Search is largely a solved problem, and we don’t intend to reinvent it. Besides, we have to spend our innovation tokens on something else: not being taken down by the authorities.
So how legal or illegal is Anna’s Archive exactly? This mostly depends on the legal jurisdiction. Most countries believe in some form of copyright, which means that people or companies are assigned an exclusive monopoly on certain types of works for a certain period of time. As an aside, at Anna’s Archive we believe that, while copyright has some benefits, it is overall a net negative for society, but that is a story for another time.
This exclusive monopoly on certain works means that it is illegal for anyone outside of this monopoly to directly distribute those works — including us. But Anna’s Archive is a search engine that doesn’t directly distribute those works (at least not on our clearnet website), so we should be okay, right? Not exactly. In many jurisdictions it is not only illegal to distribute copyrighted works, but also to link to places that do. A classic example of this is the United States’ DMCA law.
That is the strictest end of the spectrum. On the other end of the spectrum there could theoretically be countries with no copyright laws whatsoever, but these don’t really exist. Pretty much every country has some form of copyright law on the books. Enforcement is a different story. There are plenty of countries with governments that do not care to enforce copyright law. There are also countries in between the two extremes, which prohibit distributing copyrighted works, but do not prohibit linking to such works.
Another consideration is at the company level. If a company operates in a jurisdiction that doesn’t care about copyright, but the company itself is not willing to take any risk, it might shut down your website as soon as anyone complains about it.
Finally, a big consideration is payments. Since we need to stay anonymous, we cannot use traditional payment methods. That leaves cryptocurrencies, which only a small subset of companies accept (there are virtual debit cards that can be topped up with crypto, but they are often not accepted either).
So let’s say that you found some companies that are willing to host your website without shutting you down — let’s call these “freedom-loving providers” 😄. You’ll quickly find that hosting everything with them is rather expensive, so you might want to find some “cheap providers” and do the actual hosting there, proxying through the freedom-loving providers. If you do it right, the cheap providers will never know what you are hosting, and never receive any complaints.
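To make this layering concrete, here is a minimal sketch of the proxy layer in Varnish VCL (the tool we actually use for this, per the list below). The backend IP and port are placeholders, drawn from the 203.0.113.0/24 documentation range, not our real setup:

```vcl
vcl 4.1;

# The application server hidden on a "cheap provider". Only this
# proxy's IP is allowed through the backend's firewall, so the cheap
# provider never sees what is being hosted and never gets complaints.
backend cheap_provider {
    .host = "203.0.113.10";   # placeholder IP
    .port = "8080";           # placeholder port
}

sub vcl_recv {
    set req.backend_hint = cheap_provider;
}
```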
With all of these providers there is a risk of being shut down anyway, so you also need redundancy, at every level of the stack.
One somewhat freedom-loving company that has put itself in an interesting position is Cloudflare. They have argued that they are not a hosting provider, but a utility, like an ISP. They are therefore not subject to DMCA or other takedown requests, and forward any requests to your actual hosting provider. They have gone as far as going to court to protect this structure. We can therefore use them as another layer of caching and protection.
Cloudflare does not accept anonymous payments, so we can only use their free plan. This means that we can’t use their load balancing or failover features. We therefore implemented this ourselves at the domain level. On page load, the browser will check if the current domain is still available, and if not, it rewrites all URLs to a different domain. Since Cloudflare caches many pages, this means that a user can land on our main domain, even if the proxy server is down, and then on the next click be moved over to another domain.
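As a rough sketch of this client-side failover logic — the domain names, the health-check path, and the function names here are all illustrative placeholders, not the actual code running on Anna’s Archive:

```javascript
// Mirror domains, in order of preference (placeholders).
const DOMAINS = ["annas-main.example", "annas-mirror.example"];

// Rewrite one URL from a failing domain to a working mirror;
// everything else about the URL (path, query string) is preserved.
function rewriteUrl(url, fromDomain, toDomain) {
  const u = new URL(url);
  if (u.hostname === fromDomain) {
    u.hostname = toDomain;
  }
  return u.toString();
}

// Browser glue (kept as comments so the sketch stays self-contained):
// on page load, probe a health endpoint on the current domain; if the
// probe fails, rewrite every link on the (Cloudflare-cached) page so
// the user's next click lands on a mirror.
//
// window.addEventListener("DOMContentLoaded", async () => {
//   try {
//     await fetch("/health-check", { cache: "no-store" });
//   } catch (err) {
//     const here = window.location.hostname;
//     const fallback = DOMAINS.find((d) => d !== here);
//     document.querySelectorAll("a[href]").forEach((a) => {
//       a.href = rewriteUrl(a.href, here, fallback);
//     });
//   }
// });
```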
We still also have normal operational concerns to deal with, such as monitoring server health, logging backend and frontend errors, and so on. Our failover architecture allows for more robustness on this front as well, for example by running a completely different set of servers on one of the domains. We can even run older versions of the code and datasets on this separate domain, in case a critical bug in the main version goes unnoticed.
We can also hedge against Cloudflare turning against us, by removing it from one of the domains, such as this separate domain. Different permutations of these ideas are possible.
Let’s look at what tools we use to accomplish all of this. This is very much evolving as we run into new problems and find new solutions.
- Application server: Flask, MariaDB, Elasticsearch, Docker.
- Proxy server: Varnish.
- Server management: Ansible, Checkmk, UFW.
- Development: GitLab, Weblate, Zulip.
- Onion static hosting: Tor, Nginx.
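For the onion service, the Tor side is only a couple of lines of configuration; Nginx then listens on the local port and serves the static files. A minimal torrc sketch, with illustrative paths and ports:

```
HiddenServiceDir /var/lib/tor/annas_archive/   # placeholder path
HiddenServicePort 80 127.0.0.1:8080            # Nginx listens here
```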
There are some decisions that we have gone back and forth on. One is the communication between servers: we used to use Wireguard for this, but found that it occasionally stops transmitting any data, or only transmits data in one direction. This happened with several different Wireguard setups that we tried, such as wesher and wg-meshconf. We also tried tunneling ports over SSH, using autossh and sshuttle, but ran into problems there (though it is still not clear to me if autossh suffers from TCP-over-TCP issues or not — it just feels like a janky solution to me but maybe it is actually fine?).
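For reference, the SSH-tunnel approach we tried looked roughly like this (host, user, and ports are placeholders): autossh keeps the tunnel alive and re-establishes it when the SSH keepalive probes fail.

```shell
# Reverse-forward the app port to the proxy server; -M 0 disables
# autossh's own monitoring port in favor of SSH's ServerAlive checks.
autossh -M 0 -N \
  -o "ServerAliveInterval 15" \
  -o "ServerAliveCountMax 3" \
  -R 8080:127.0.0.1:8080 tunnel@203.0.113.10
```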
Instead, we reverted to direct connections between servers, hiding the fact that a server is running on a cheap provider by IP-filtering with UFW. This has the downside that Docker doesn’t work well with UFW unless you use `network_mode: "host"`. All of this is a bit more error-prone: a tiny misconfiguration will expose your server to the internet. Perhaps we should move back to autossh; feedback would be very welcome here.
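The Docker side of that workaround, as a hedged docker-compose sketch (service and image names are hypothetical): with host networking the container shares the host’s network stack, so UFW’s filter rules actually apply, whereas with the default bridge networking Docker inserts its own iptables rules ahead of UFW’s.

```yaml
services:
  web:
    image: annas-archive-web:latest   # hypothetical image name
    network_mode: "host"              # bypass Docker's bridge so UFW rules apply
    # Note: "ports:" mappings are ignored in host mode; the app binds
    # directly to a host port, which UFW then restricts by source IP.
```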
We’ve also gone back and forth on Varnish vs. Nginx. We currently like Varnish, but it does have its quirks and rough edges. The same applies to Checkmk: we don’t love it, but it works for now. Weblate has been okay but not incredible; I sometimes fear it will lose my data when I sync it with our git repo. Flask has been good overall, but it has some weird quirks that have cost us a lot of debugging time, such as configuring custom domains, and issues with its SQLAlchemy integration.
So far the other tools have been great: we have no serious complaints about MariaDB, Elasticsearch, GitLab, Zulip, Docker, or Tor. All of these have had minor issues, but nothing overly serious or time-consuming.
It has been an interesting experience to learn how to set up a robust and resilient shadow library search engine. There are tons more details to share in later posts, so let me know what you would like to learn more about!
As always, we’re looking for donations to support this work, so be sure to check out the Donate page on Anna’s Archive. We’re also looking for other types of support, such as grants, long-term sponsors, high-risk payment providers, perhaps even (tasteful!) ads. And if you want to contribute your time and skills, we’re always looking for developers, translators, and so on. Thanks for your interest and support.
- Anna and the team (Twitter, Reddit, Zulip chat)