Anna's Blog

ISBNdb dump, or How Many Books Are Preserved Forever?

annas-blog.org, 2022-10-31

With the Pirate Library Mirror, our aim is to take all the books in the world, and preserve them forever.1 Between our Z-Library torrents, and the original Library Genesis torrents, we have 11,783,153 files. But how many is that, really? If we properly deduplicated those files, what percentage of all the books in the world have we preserved? We’d really like to have something like this:

10% of humanity’s written heritage preserved forever

For a percentage, we need a denominator: the total number of books ever published.2 Before the demise of Google Books, an engineer on the project, Leonid Taycher, tried to estimate this number. He came up — tongue-in-cheek — with 129,864,880 (“at least until Sunday”). He estimated this number by building a unified database of all the books in the world. For this, he pulled together different datasets and then merged them in various ways.

As a quick aside, there is another person who attempted to catalog all the books in the world: Aaron Swartz, the late digital activist and Reddit co-founder.3 He started Open Library with the goal of “one web page for every book ever published”, combining data from lots of different sources. He ended up paying the ultimate price for his digital preservation work when he got prosecuted for bulk-downloading academic papers, leading to his suicide. Needless to say, this is one of the reasons our group is pseudonymous, and why we’re being very careful. Open Library is still heroically being run by folks at the Internet Archive, continuing Aaron’s legacy. We’ll get back to this later in this post.

In the Google blog post, Taycher describes some of the challenges with estimating this number. First, what constitutes a book? There are a few possible definitions:

“Editions” seem the most practical definition of what “books” are. Conveniently, this definition is also used for assigning unique ISBN numbers. An ISBN, or International Standard Book Number, is commonly used for international commerce, since it is integrated with the international barcode system (”International Article Number”). If you want to sell a book in stores, it needs a barcode, so you get an ISBN.

Taycher’s blog post mentions that while ISBNs are useful, they are not universal, since they were only really adopted in the mid-seventies, and not everywhere around the world. Still, ISBN is probably the most widely used identifier of book editions, so it’s our best starting point. If we can find all the ISBNs in the world, we get a useful list of which books still need to be preserved.

So, where do we get the data? There are a number of existing efforts that are trying to compile a list of all the books in the world:

In this post, we are happy to announce a small release (compared to our previous Z-Library releases). We scraped most of ISBNdb, and made the data available for torrenting on the website of the Pirate Library Mirror (we won’t link it here directly, just search for it). These are about 30.9 million records (20GB as JSON Lines; 4.4GB gzipped). On their website they claim that they actually have 32.6 million records, so we might somehow have missed some, or they could be doing something wrong. In any case, for now we will not share exactly how we did it — we will leave that as an exercise for the reader. ;-)

What we will share is some preliminary analysis, to try to get closer to estimating the number of books in the world. We looked at three datasets: this new ISBNdb dataset, our original release of metadata that we scraped from the Z-Library shadow library (which includes Library Genesis), and the Open Library data dump.

Let’s start with some rough numbers:

Editions ISBNs
ISBNdb - 30,851,787
Z-Library 11,783,153 3,581,309
Open Library 36,657,084 17,371,977

In both Z-Library/Libgen and Open Library there are many more books than unique ISBNs. Does that mean that lots of those books don’t have ISBNs, or is the ISBN metadata simply missing? We can probably answer this question with a combination of automated matching based on other attributes (title, author, publisher, etc), pulling in more data sources, and extracting ISBNs from the actual book scans themselves (in the case of Z-Library/Libgen).

How many of those ISBNs are unique? This is best illustrated with a Venn diagram:

To be more precise:

ISBNdb ∩ OpenLib 10,177,281
ISBNdb ∩ Zlib 2,308,259
Zlib ∩ OpenLib 1,837,598
ISBNdb ∩ Zlib ∩ OpenLib 1,534,342

We were surprised by how little overlap there is! ISBNdb has a huge amount of ISBNs that do not show up in either Z-Library or Open Library, and the same holds (to a smaller but still substantial degree) for the other two. This raises a lot of new questions. How much would automated matching help in tagging the books that were not tagged with ISBNs? Would there be a lot of matches and therefore increased overlap? Also, what would happen if we bring in a 4th or 5th dataset? How much overlap would we see then?

This does give us a starting point. We can now look at all the ISBNs that were not in the Z-Library dataset, and that do not match title/author fields either. That can give us a handle on preserving all the books in the world: first by scraping the internet for scans, then by going out in real life to scan books. The latter could even be crowd-funded, or driven by “bounties” from people who would like to see particular books digitized. All that is a story for a different time.

If you want to help out with any of this — further analysis; scraping more metadata; finding more books; OCR’ing of books; doing this for other domains (eg papers, audiobooks, movies, tv shows, magazines) or even making some of this data available for things like ML / large language model training — please contact me (Twitter, Reddit). I’m nowadays also hanging out on the Discord of The Eye (IYKYK).

If you’re specifically interested in the data analysis, we are working on making our datasets and scripts available in a more easy to use format. It would be great if you could just fork a notebook and start playing with this.

Finally, if you want to support this work, please consider making a donation. This is an entirely volunteer-run operation, and your contribution makes a huge difference. Every bit helps. For now we take donations in crypto; see pilimi.org.

- Anna and the Pirate Library Mirror team (Twitter, Reddit)

1. For some reasonable definition of "forever". ;)
2. Of course, humanity’s written heritage is much more than books, especially nowadays. For the sake of this post and our recent releases we’re focusing on books, but our interests stretch further.
3. There is a lot more that can be said about Aaron Swartz, but we just wanted to mention him briefly, since he plays a pivotal part in this story. As time passes, more people might come across his name for the first time, and can subsequently dive into the rabbit hole themselves.