# Bluesky indexer

This is a bunch of code that can download all of Bluesky into a giant table in
PostgreSQL.

The structure of that table is roughly `(repo, collection, rkey) -> JSON`, and
it is a good idea to partition it by collection.
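
As a sketch, that layout could look something like this in SQL. The table and column names here are made up for illustration; the actual schema in this repo may differ:

```sql
-- Hypothetical sketch, not the actual schema used by this repo.
CREATE TABLE records (
    repo       text NOT NULL,  -- DID of the account that owns the record
    collection text NOT NULL,  -- record type, e.g. 'app.bsky.feed.post'
    rkey       text NOT NULL,  -- record key within the collection
    content    jsonb,          -- the record itself
    -- On a partitioned table the primary key must include the partition column.
    PRIMARY KEY (repo, collection, rkey)
) PARTITION BY LIST (collection);
```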
## System requirements

NOTE: all of this is valid as of April 2024, when Bluesky had ~5.5M accounts,
~1.2B records total, and an average daily peak of ~100 commits/s.

* One decent SATA SSD is plenty fast to keep up. Preferably a dedicated one
  (definitely not the same one your system is installed on). There will be a
  lot of writes happening, so the total durability of the disk will be used up
  at a non-negligible rate.
* 16GB of RAM, but the more the better, obviously.
* ZFS with compression enabled is highly recommended, but not strictly
  necessary.
  * Compression will cut down on IO bandwidth quite a bit, as well as on used
    disk space. On a compressed FS the whole database takes up about 270GB;
    without compression, almost 3 times as much.
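
If you go the ZFS route, a dedicated dataset for the database data might be set up like this. The pool name, dataset name, and mountpoint are just examples, not anything this repo requires:

```shell
# Example only: create a compressed dataset for PostgreSQL data.
# 'tank' is a placeholder pool name - substitute your own.
zfs create -o compression=zstd -o mountpoint=/var/lib/bluesky-postgres tank/bluesky-postgres

# Later on, check how well the data is compressing:
zfs get compression,compressratio tank/bluesky-postgres
```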

## Overview of components

### Lister

Once a day, gets a list of all repos from all known PDSs and adds any that are
missing to the database.

### Consumer

Connects to the firehose of each PDS and stores all received records in the
database.

### Record indexer

Goes over all repos that might have missing data, gets a full checkout from the
PDS, and adds all missing records to the database.

### PLC mirror

Syncs the PLC operations log into a local table and allows other components to
resolve `did:plc:` DIDs without putting strain on https://plc.directory and
hitting rate limits.

## Setup

* Decide where you want to store the data.
* Copy `example.env` to `.env` and edit it to your liking.
  * `POSTGRES_PASSWORD` can be anything; it will be used on the first start of
    the `postgres` container to initialize the database.
* Optional: copy `docker-compose.override.yml.example` to
  `docker-compose.override.yml` to change some parts of `docker-compose.yml`
  without actually editing it (and introducing the possibility of merge
  conflicts later on).
* `make start-plc`
  * This will start PostgreSQL and the PLC mirror.
* `make wait-for-plc`
  * This will wait until the PLC mirror has fully replicated the operations
    log.
* `make init-db`
  * This will add the initial set of PDS hosts into the database.
  * You can skip this if you're specifying `CONSUMER_RELAYS` in
    `docker-compose.override.yml`.
* `make up`
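
Put together, a first-time setup boils down to:

```shell
cp example.env .env   # then edit it, at minimum setting POSTGRES_PASSWORD
make start-plc        # start PostgreSQL and the PLC mirror
make wait-for-plc     # block until the PLC operations log is replicated
make init-db          # seed the PDS list (skip if using CONSUMER_RELAYS)
make up               # start everything else
```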

## Additional commands

* `make status` - shows container status and resource usage
* `make psql` - starts an SQL shell inside the `postgres` container
* `make logs` - streams container logs into your terminal
* `make sqltop` - shows currently running queries
* `make sqldu` - shows disk space usage for each table and index

## Tweaking the number of indexer threads at runtime

The record indexer exposes a simple HTTP handler that allows you to do this:

`curl -s 'http://localhost:11003/pool/resize?size=10'`

## Advanced topics

### Table partitioning

With partitioning by collection you can have separate indexes for each record
type. Doing any kind of heavy processing on a particular record type will also
be faster, because all of those records will be in a separate table and
PostgreSQL will just read them sequentially, instead of checking the
`collection` column for each row.
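
Conceptually, the partitioned layout looks something like this. This is a simplified sketch with made-up table and column names, not the actual migration:

```sql
-- Simplified illustration; the real script is
-- db-migration/migrations/20240217_partition.sql.
CREATE TABLE records_partitioned (
    repo text, collection text, rkey text, content jsonb,
    PRIMARY KEY (repo, collection, rkey)
) PARTITION BY LIST (collection);

-- One partition per record type, plus a catch-all for everything else:
CREATE TABLE records_like PARTITION OF records_partitioned
    FOR VALUES IN ('app.bsky.feed.like');
CREATE TABLE records_default PARTITION OF records_partitioned DEFAULT;
```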

You can do the partitioning at any point, but the more data you already have in
the database, the longer it will take.

Before doing this you need to run `lister` at least once in order to create the
tables (`make init-db` does this for you as well).

* Stop all containers except for `postgres`.
* Run the [SQL script](db-migration/migrations/20240217_partition.sql) in
  `psql`.
* Check the [`migrations`](db-migration/migrations/) dir for any additional
  migrations you might be interested in.
* Once all is done, start the other containers again.