Anna's Blog

Putting 5,998,794 books on IPFS

annas-blog.org, 2022-11-19

Z-Library has been taken down, and its founders arrested. For the uninitiated, a quick recap: Z-Library was a massive “shadow library” of books, similar to Sci-Hub or Library Genesis. They had taken the concept of a shadow library to the next level, with a great user interface, bulk uploading and deduplication systems, and all sorts of other features. They were thriving on donations, and were therefore able to hire a professional team to keep improving the site.

Until it all came crashing down two weeks ago. Their domains were seized by the FBI, and the (alleged) founders were arrested in Argentina. The site continues to run on Tor (presumably maintained by their employees), but no one knows how sustainable that is. It was sad day for the free flow of information, knowledge, and culture. Антон Напольский and Валерия Ермакова — we stand with you. Much love to you and your families, and thank you for what you have done for the world.

Just a few months ago, we released our second backup of Z-Library — for about 31TB in total. This turned out to be timely. We also already had started working on a search aggregator for shadow libraries: “Anna’s Archive” (not linking here, but you can Google it). With Z-Library down, we scrambled to get this running as soon as possible, and we did a soft-launch shortly thereafter. Now we’re trying to figure out what is next. This seems the right time to step up and help shape the next chapter of shadow libraries.

One such thing is to put the books up on IPFS. Some of the Library Genesis mirrors have already done this a few years ago for their books, and it makes access to their collection more resiliant. After all, they don’t have to host any files themselves over HTTP anymore, but can instead link to one of the many IPFS Gateways, which will happily proxy the books from one of the many volunteer-run machines (this is the big advantage IPFS has over BitTorrent). These machines can be hidden behind VPNs, or run on seedboxes paid for using crypto, similar to torrents. You can even get other people’s machines to host the data, by paying for that service using Filecoin.

However, putting dozens of terabytes of data on IPFS is no joke. We haven’t fully succeeded in this project yet, so today we’ll share where we’ve gotten so far. If you have experience pushing the limits of IPFS (or other systems, for that matter), and want to help our cause, please reach out on Reddit or Twitter.

File organization

When we released our first backup, we used torrents that contained tons of individual files. This turns out not to be great for two reasons: 1. torrent clients struggle with this many files (especially when trying to display them in a UI) 2. magnetic hard drives and filesystems struggle as well. You can get a lot of fragmentation and seeking back and forth.

For our second release, we learned from this, and packaged the files in large “.tar” files. This solves these problems, but creates a new one: how do we now serve individual files on IPFS? We could simply extract the tar files, but then if you want to both seed the torrents, and seed the IPFS files, you need twice as much space: 62TB instead of 31TB (which was already pushing it).

Luckily, there is a good solution for this: mounting the tar files using ratarmount. This creates a virtual filesystem using FUSE. Typically we run it like this:

sudo ratarmount --fuse "allow_other" zlib2-data/*.tar zlib2/

In order to figure out which file is located where, ratarmount creates index files which it places next to the tar files. It takes some time to do this when you run it for the first time, so at some point we will share these index files on our torrent page, for your convenience.

Root CIDs

The second problem we ran into, was performance issues with IPFS. The most noticeable of these is the “advertising” or “providing” phase, where your IPFS node tells the rest of the IPFS network what data you have. A single file typically gets split up in 256KiB chunks, each of which gets an identifier, called a “Content Identifier”, or “CID”. The file itself also gets a CID, which refers to a list of the child CIDs. All in all, a single file can easily have several, if not hundreds of these CIDs — and we have millions of files. All of these CIDs have to be advertised on the network!

We first thought that we could solve this by using a particular feature of the “providing” algorithm: only advertising the root CIDs of directories. The idea was that we could take the different directories that our files were already organized in, and advertise just the CID of that directory, and then address them using:

/ipfs/<directory CID>/<filename>

Initially this seemed to work, but we ran into issues requesting more than one or a few files at once. It took us several days to debug this, but eventually it seems like we found the root cause, and filed a bug report. Sadly, this looks like a deep, fundamental issue, which we cannot easily work around. So we’ll have to deal with lots of CIDs, at least for now.

Sharding

One mitigation is to use a larger chunk size. Instead of 256KiB, we can use 1MiB (the current maximum), by using --chunker=size-1048576 on add. Another thing that helps, is using the AcceleratedDHTClient, which batches multiple advertising calls to the same node. Still, various operations can take a long time, from “providing”, to just getting some stats on the repo.

This is why we’ve been playing with sharding the data across multiple IPFS nodes, even on the same machine. We started with 32 nodes, but there the per-node overhead seemed to get quite big, especially in terms of memory usage. But providing became quite fast: about 5 minutes per node, where each node had about 1 million CIDs to advertise. We are now playing with different numbers, to see what is optimal. Unfortunately IPFS doesn’t let you easily merge or split nodes, so this is quite time-consuming.

This is what our docker-compose.yml looks like, for example, with a single node (other nodes omitted for brevity):

x-ipfs: &default-ipfs
  image: ipfs/kubo:v0.16.0
  restart: unless-stopped
  environment:
    - IPFS_PATH=/data/ipfs
    - IPFS_PROFILE=server
  command: daemon --migrate=true --agent-version-suffix=docker --routing=dhtclient

services:
  ipfs-zlib2-0:
    <<: *default-ipfs
    ports:
      - "4011:4011/tcp"
      - "4011:4011/udp"
    volumes:
      - "./container-init.d/:/container-init.d"
      - "./ipfs-dirs/ipfs-zlib2-0:/data/ipfs"
      - "./zlib2/pilimi-zlib2-0-14679999-extra/:/data/files/pilimi-zlib2-0-14679999-extra/"
      - "./zlib2/pilimi-zlib2-14680000-14999999/:/data/files/pilimi-zlib2-14680000-14999999/"
      - "./zlib2/pilimi-zlib2-15000000-15679999/:/data/files/pilimi-zlib2-15000000-15679999/"
      - "./zlib2/pilimi-zlib2-15680000-16179999/:/data/files/pilimi-zlib2-15680000-16179999/"
      # etc.

In the container-init.d/ folder that is referred there, we have a single shell script, with the following content:

#!/bin/sh
ipfs config --json Experimental.FilestoreEnabled true
ipfs config --json Experimental.AcceleratedDHTClient true

We also manually changed the config for each node to use a unique IP address.

Processing CIDs

Once you have a bunch of nodes running, you can add data to it. In the example configuration above, we would run:

docker-compose exec ipfs-zlib2-0 ipfs add --progress=false --nocopy --recursive --hash=blake2b-256 --chunker=size-1048576 /data/files > ipfs-zlib2-0.log

This logs the filenames and CIDs to ipfs-zlib2-0.log. Now we can scoop up all the different log files into a CSV, using a little Python script:

import glob

def process_line(line, csv):
  components = line.split()
  if len(components) == 3 and components[0] == "added":
    file_components = components[2].split("/")
    if len(file_components) == 3 and file_components[0] == "files":
      csv.write(file_components[2] + "," + components[1] + "\n")

with open("ipfs.csv", "w") as csv:
  for file in glob.glob("*.log"):
    print("Processing", file)
    with open(file) as f:
      for line in f:
        process_line(line, csv)

Because the filenames are simply the Z-Library IDs, the CSV looks something like this:

1,bafk2bzacedrabzierer44yu5bm7faovf5s4z2vpa3ry2cx6bjrhbjenpxifio
2,bafk2bzaceckyxepao7qbhlohijcqgzt4d2lfcgecetfjd6fhzvuprqgwgnygs
3,bafk2bzacec3yohzdu5rfebtrhyyvqifib5rxadtu35vvcca5a3j6yaeds3yfy
4,bafk2bzaceacs3a4t6kfbjjpkgx562qeqzhkbslpdk7hmv5qozarqn2jid5sfg
5,bafk2bzaceac2kybzpe6esch3auugpi2zoo2yodm5bx7ddwfluomt2qd3n6kbg
6,bafk2bzacealxowh6nddsktetuixn2swkydjuehsw6chk2qyke4x2pxltp7slw

Most systems support reading CSV. For example, in Mysql you could write:

CREATE TABLE zlib_ipfs (
  zlibrary_id INT NOT NULL,
  ipfs_cid CHAR(62) NOT NULL,
  PRIMARY KEY(zlibrary_id)
);
LOAD DATA INFILE '/var/lib/mysql/ipfs.csv'
  INTO TABLE zlib_ipfs
  FIELDS TERMINATED BY ',';

This data should be exactly the same for everyone, as long as you run ipfs add with the same parameters as we did. For your convenience, we will also release our CSV at some point, so you can link to our files on IPFS without doing all the hashing yourself.

Remote file storage

One thing you learn quickly when hosting ~controversial~ content, is that it’s quite useful to have long-term “backend” servers, which you don’t expose on the public internet, and publicly facing “frontend” servers, which are more at risk of being taken down. For serving websites, the “frontend” server can be a simple proxy (HTTP proxy like Varnish, VPN node like Wireguard, etc). But with IPFS, the better solution might be to actually run IPFS on the frontend server directly. This has several advantages:

  1. Traffic speed and latency are better without a proxy.
  2. You can get a storage backend server with lots of hard drives and weak cpu/memory, and the inverse for the frontend server.
  3. You can shard across multiple physical IPFS servers, without having to move tons of data around all the time.

For this, we use remote mounted filesystems. The easiest way to set that up seemed to be rclone:

# File server:
rclone -vP serve sftp --addr :1234 --user hello --pass hello ./zlib1
# IPFS machine:
sudo rclone mount -v --sftp-host *redacted* --sftp-port 1234 --sftp-user hello --sftp-pass `rclone obscure hello` --sftp-set-modtime=false --read-only --vfs-cache-mode full --attr-timeout 100000h --dir-cache-time 100000h --vfs-cache-max-age 100000h --vfs-cache-max-size 300G --no-modtime --transfers 6 --cache-dir ./zlib1cache --allow-other :sftp:/zlib1 ./zlib1

We’re not sure if this is the best way to do this, so if you have tips for how to most efficiently set up a remote immutable file system with good local caching, let us know.

Final thoughts

We’re still figuring all of this out, and don’t have it all running quite yet, so if you have experience with this, please contact us. We’re also interested in learning from people who have set up IPFS Collaborative Clusters, so more people can easily participate in hosting these books. We’re also always looking for volunteers to run IPFS and torrent nodes, help build new projects, and so on (we noticed that lots of technical talent just left a certain social media company — and who particularly care about the free flow of information.. hi!).

If you believe in preserving humanity’s knowledge and culture, please consider supporting us. I have personally been working on this full time, mostly self-funded, plus a couple of large generous donations. But to make this work sustainable, we would probably need to set up a sort of “shadow Patreon”. In the meantime, please consider donating through one of these crypto addresses:

Thanks so much!

- Anna and the Pirate Library Mirror team (Twitter, Reddit)