Bluesky indexer

This is a bunch of code that can download all of Bluesky into a giant table in PostgreSQL.

The structure of that table is roughly (repo, collection, rkey) -> JSON, and it is a good idea to partition it by collection.

System requirements

NOTE: all of this is valid as of April 2024, when Bluesky had ~5.5M accounts, ~1.2B records total, and an average daily peak of ~100 commits/s.

  • One decent SATA SSD is plenty fast to keep up. Preferably use a dedicated one (definitely not the one your system is installed on). There will be a lot of writes, so the disk's write endurance will be used up at a non-negligible rate.
  • 16GB of RAM, but the more the better, obviously.
  • ZFS with compression enabled is highly recommended, but not strictly necessary.
    • Compression will cut down on IO bandwidth quite a bit, as well as on used disk space. On a compressed FS the whole database takes up about 270GB; without compression it takes almost 3 times as much.

Overview of components

Lister

Once a day, gets a list of all repos from all known PDSs and adds any that are missing to the database.

Consumer

Connects to the firehose of each PDS and stores all received records in the database.

If CONSUMER_RELAYS is specified, it will also add to the database any new PDSs whose records are sent through a relay.

Record indexer

Goes over all repos that might have missing data, gets a full checkout from the PDS and adds all missing records to the database.

PLC mirror

Syncs the PLC operations log into a local table, and allows other components to resolve did:plc: DIDs without putting strain on https://plc.directory and hitting rate limits.

Setup

  • Decide where you want to store the data.
  • Copy example.env to .env and edit it to your liking.
    • POSTGRES_PASSWORD can be anything; it will be used on the first start of the postgres container to initialize the database.
  • Optional: copy docker-compose.override.yml.example to docker-compose.override.yml to change some parts of docker-compose.yml without actually editing it (and introducing the possibility of merge conflicts later on).
  • make start-plc
    • This will start PostgreSQL and the PLC mirror.
  • make wait-for-plc
    • This will wait until the PLC mirror has fully replicated the operations log. That's gonna take a few hours.
    • Technically you can start everything before it is caught up: the mirror will return errors and other components will fall back to querying https://plc.directory. But you will be rate-limited quite hard.
  • make init-db
    • This will add the initial set of PDS hosts into the database.
    • You can skip this if you're specifying CONSUMER_RELAYS in docker-compose.override.yml
  • make up
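
The steps above can be chained in a small script. This is only a sketch: it assumes you run it from the repository root and that .env has already been created and edited. The names run, first_time_setup, DRY_RUN and SKIP_INIT_DB are made up for this example; only the make targets come from this README.

```shell
# Print each command when DRY_RUN is set; otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# First-time setup, in the order listed above.
first_time_setup() {
  # Refuse to continue without a .env, since it has to be edited by hand first.
  [ -f .env ] || { echo "copy example.env to .env and edit it first" >&2; return 1; }
  run make start-plc || return 1      # PostgreSQL and the PLC mirror
  run make wait-for-plc || return 1   # blocks until the operations log is replicated
  # Set SKIP_INIT_DB (hypothetical variable) if you rely on CONSUMER_RELAYS instead.
  [ -n "${SKIP_INIT_DB:-}" ] || run make init-db || return 1
  run make up
}
```

Calling first_time_setup with DRY_RUN=1 set only prints the commands, which is a cheap way to check the order before committing to the multi-hour wait-for-plc step.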

Additional commands

  • make status - will show container status and resource usage
  • make psql - starts up SQL shell inside the postgres container
  • make logs - streams container logs into your terminal
  • make sqltop - will show you currently running queries
  • make sqldu - will show disk space usage for each table and index

Tweaking the number of indexer threads at runtime

The record indexer exposes a simple HTTP handler that lets you do this:

curl -s 'http://localhost:11003/pool/resize?size=10'
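
If you resize the pool often, a small wrapper saves retyping the URL. This is a sketch: the function names and the INDEXER_ADDR variable are invented for this example, and the default address is simply the one from the curl command above.

```shell
# Build the resize URL for the record indexer's pool endpoint.
# INDEXER_ADDR (hypothetical variable) overrides the default host:port.
pool_resize_url() {
  size="$1"
  addr="${INDEXER_ADDR:-localhost:11003}"
  # Reject empty and non-numeric values so a typo never reaches the indexer.
  case "$size" in
    ''|*[!0-9]*) echo "size must be a non-negative integer" >&2; return 1 ;;
  esac
  printf 'http://%s/pool/resize?size=%s\n' "$addr" "$size"
}

# Send the resize request; -f makes curl exit non-zero on an HTTP error status.
pool_resize() {
  url="$(pool_resize_url "$1")" || return 1
  curl -fsS "$url"
}
```

With this sourced into your shell, pool_resize 10 is equivalent to the curl command above.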

Advanced topics

Table partitioning

With partitioning by collection you can have separate indexes for each record type. Heavy processing on a particular record type will also be faster, because all of those records will be in a separate table and PostgreSQL will just read them sequentially instead of checking the collection column for each row.
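
To make the idea concrete, here is a sketch that prints the kind of DDL PostgreSQL list partitioning uses. It is not the script shipped with this repository. The parent table name records is an assumption, as is the partition naming scheme, and the parent table would have to be declared with PARTITION BY LIST (collection) for these statements to apply.

```shell
# Emit hypothetical DDL for one list partition of a table partitioned by collection.
# "records" is an assumed parent table name, not necessarily the one this repo uses.
partition_ddl() {
  collection="$1"
  # Dots are not allowed in unquoted identifiers, so map them to underscores.
  suffix="$(printf '%s' "$collection" | tr '.' '_')"
  printf "CREATE TABLE records_%s PARTITION OF records FOR VALUES IN ('%s');\n" \
    "$suffix" "$collection"
}
```

For example, partition_ddl app.bsky.feed.post prints a statement creating records_app_bsky_feed_post, which could then carry indexes that only make sense for posts.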

You can do the partitioning at any point, but the more data you already have in the database, the longer it will take.

Before doing this you need to run lister at least once in order to create the tables (make init-db does this for you as well).

  • Stop all containers except for postgres.
  • Run the SQL script in psql.
  • Check the migrations dir for any additional migrations you might be interested in.
  • Once all is done, start the other containers again.
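
The first step can be scripted roughly like this. It is a sketch that assumes Docker Compose v2 (the docker compose subcommand) and that the database service is named postgres in docker-compose.yml; check both before relying on it. The function names are made up for this example.

```shell
# Read service names on stdin and print every name except the one to keep running.
# -x matches whole lines and -F disables regex, so "postgres" does not also
# filter out a service such as "postgres-exporter".
services_except() {
  grep -v -x -F -- "$1"
}

# Stop everything but the database. The service name "postgres" is an assumption.
stop_all_but_postgres() {
  services="$(docker compose config --services | services_except postgres)"
  # Word splitting is intended here: one argument per service name.
  # shellcheck disable=SC2086
  [ -z "$services" ] || docker compose stop $services
}
```

After the SQL script and any extra migrations have finished, make up from the setup steps should bring the stopped containers back.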