# Bluesky indexer

This is a bunch of code that can download all of Bluesky into a giant table in
PostgreSQL.

The structure of that table is roughly `(repo, collection, rkey) -> JSON`, and
it is a good idea to partition it by collection.
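
To make that shape concrete, here is a minimal sketch of such a table; the
column names and types are illustrative assumptions, not the project's actual
schema:

```sql
-- Illustrative sketch only; the real schema is defined by the code and
-- differs in details.
CREATE TABLE records (
    repo       text NOT NULL,  -- DID of the account that owns the record
    collection text NOT NULL,  -- record type, e.g. 'app.bsky.feed.post'
    rkey       text NOT NULL,  -- record key within the collection
    content    jsonb,          -- the record itself
    PRIMARY KEY (repo, collection, rkey)
);
```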

## System requirements

NOTE: all of this is valid as of April 2024, when Bluesky had ~5.5M accounts,
~1.2B records total, and an average daily peak of ~100 commits/s.

* One decent SATA SSD is plenty fast to keep up. Preferably a dedicated one
  (definitely not the same one that your system is installed on). There will
  be a lot of writes happening, so the total durability of the disk will be
  used up at a non-negligible rate.
* 16GB of RAM, but the more the better, obviously.
* ZFS with compression enabled is highly recommended, but not strictly
  necessary (see the sketch after this list).
  * Compression will cut down on IO bandwidth quite a bit, as well as on used
    disk space. On a compressed FS the whole database takes up about 270GB,
    without compression - almost 3 times as much.
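
If you go with ZFS, enabling compression is a one-liner at dataset creation
time. A sketch, assuming a pool named `tank`; the dataset name is made up:

```sh
# Pool and dataset names here are hypothetical - adjust to your setup.
zfs create -o compression=zstd tank/bluesky-db
# Later, check how well the data compresses:
zfs get compression,compressratio tank/bluesky-db
```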

## Overview of components

### Lister

Once a day, gets a list of all repos from all known PDSs and adds any that are
missing to the database.

### Consumer

Connects to the firehose of each PDS and stores all received records in the
database.

### Record indexer

Goes over all repos that might have missing data, gets a full checkout from
the PDS, and adds all missing records to the database.

### PLC mirror

Syncs the PLC operations log into a local table and allows other components to
resolve `did:plc:` DIDs without putting strain on https://plc.directory and
hitting rate limits.
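
For reference, resolving a `did:plc:` DID is a single GET against the
directory; the mirror keeps the same operations log locally so these lookups
don't have to go upstream. The DID below is a placeholder:

```sh
# Upstream API shown for illustration; substitute a real DID.
curl -s 'https://plc.directory/did:plc:xxxxxxxxxxxxxxxxxxxxxxxx'
```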

## Setup

* Decide where you want to store the data.
* Copy `example.env` to `.env` and edit it to your liking.
  * `POSTGRES_PASSWORD` can be anything; it will be used on the first start of
    the `postgres` container to initialize the database.
* Optional: copy `docker-compose.override.yml.example` to
  `docker-compose.override.yml` to change some parts of `docker-compose.yml`
  without actually editing it (and introducing the possibility of merge
  conflicts later on).
* `make start-plc`
  * This will start PostgreSQL and the PLC mirror.
* `make wait-for-plc`
  * This will wait until the PLC mirror has fully replicated the operations
    log.
* `make init-db`
  * This will add the initial set of PDS hosts into the database.
  * You can skip this if you're specifying `CONSUMER_RELAYS` in
    `docker-compose.override.yml` (see the sketch after this list).
* `make up`
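
As an illustration of the `CONSUMER_RELAYS` option mentioned above, an
override file could look roughly like this; the service name and relay URL are
assumptions, so check `docker-compose.yml` for the actual names:

```yaml
# docker-compose.override.yml - hypothetical sketch, verify the service name
# against docker-compose.yml before using.
services:
  consumer:
    environment:
      CONSUMER_RELAYS: "https://bsky.network"
```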

## Additional commands

* `make status` - shows container status and resource usage
* `make psql` - starts an SQL shell inside the `postgres` container
* `make logs` - streams container logs into your terminal
* `make sqltop` - shows currently running queries
* `make sqldu` - shows disk space usage for each table and index

## Tweaking the number of indexer threads at runtime

The record indexer exposes a simple HTTP handler that allows you to do this:

`curl -s 'http://localhost:11003/pool/resize?size=10'`

## Advanced topics

### Table partitioning

With partitioning by collection you can have separate indexes for each record
type. Doing any kind of heavy processing on a particular record type will also
be faster, because all of those records will be in a separate table and
PostgreSQL will just read them sequentially, instead of checking the
`collection` column for each row.
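
As a rough sketch of what that means in DDL (names are illustrative; the
authoritative script is linked in the steps below):

```sql
-- Illustrative only; see db-migration/migrations/20240217_partition.sql
-- for the real migration.
CREATE TABLE records (
    repo       text NOT NULL,
    collection text NOT NULL,
    rkey       text NOT NULL,
    content    jsonb,
    PRIMARY KEY (repo, collection, rkey)
) PARTITION BY LIST (collection);

CREATE TABLE records_post PARTITION OF records
    FOR VALUES IN ('app.bsky.feed.post');

-- Collections without a dedicated partition land here.
CREATE TABLE records_default PARTITION OF records DEFAULT;
```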

You can do the partitioning at any point, but the more data you already have
in the database, the longer it will take.

Before doing this you need to run `lister` at least once in order to create
the tables (`make init-db` does this for you as well).

* Stop all containers except for `postgres`.
* Run the [SQL script](db-migration/migrations/20240217_partition.sql) in
  `psql`.
* Check the [`migrations`](db-migration/migrations/) dir for any additional
  migrations you might be interested in.
* Once all is done, start the other containers again.