Parallel Indexing Instructions
For larger datasets, single-node indexing becomes a bottleneck and parallelism is required to scale. For example, in one test it took 10 hours to index 100 million small records totalling 60GB. Fortunately, Kyrix can index data in parallel and scales up both with more cores per machine and with more machines - we've tested Kyrix up to 640 cores (20 nodes, 32 cores each) and achieved near-linear scaling, with CREATE INDEX achieving super-linear scaling due to the increased RAM across the cluster. On that system, it took around 12 minutes to index 1 billion records (a ~500x speedup over the extrapolated single-node time). In theory, Kyrix scales indefinitely with this architecture - in practice, many of the cluster administration tools/scripts execute sequentially, and as you scale past 1,000 cores you hit "tail latency" due to random stalls and errors. We currently recommend Citus up to 50-100 machines "comfortably," and perhaps 250-1,000 with difficulty.
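The ~500x figure can be sanity-checked with a little arithmetic: extrapolating the single-node rate quoted above (10 hours per 100 million records) to 1 billion records gives 100 hours, versus roughly 12 minutes on the 640-core cluster:

```shell
# Extrapolate the single-node baseline: 10 hours / 100M records -> 100 hours / 1B records
serial_minutes=$((100 * 60))   # 6000 minutes for 1B records on one node
parallel_minutes=12            # measured on 640 cores (20 nodes x 32 cores)
echo "speedup: ~$((serial_minutes / parallel_minutes))x"   # prints "speedup: ~500x"
```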
At this time, only the "dots" synthetic dataset (dots-pushdown-uniform) supports parallel indexing "out of the box," though you can adapt it to other datasets. There are three issues: (1) the bounding-box computation ("getBBoxCoordinates") is hardcoded - see bboxfunc in PsqlNativeBoxIndexer.java; (2) the reload-dots-pushdown-uniform.sh script needs to be adapted to your dataset; (3) though not directly related to parallelism, the transform function must be carefully coded to avoid bottlenecking in JavaScript - see transforms.js for an example.
Parallel Kyrix is implemented using Kubernetes on Google Cloud for orchestration and the Citus Postgres extension to provide parallel query/update/DDL. It would be straightforward to port this to other Kubernetes providers. To execute the JavaScript transform function inside Postgres (and avoid bottlenecking on the Kyrix middleware), we use plv8, though you could (in theory) run multiple Kyrix middleware servers.
To use parallel Kyrix, you will need a Kubernetes cluster provider - these instructions are for Google Cloud and assume your client is running on Ubuntu 18.04.
- install kubectl (e.g. `sudo snap install kubectl --classic` - see the kubectl docs)
- point kubectl at your cluster (e.g. `gcloud container clusters get-credentials <cluster name>` - see the gcloud install instructions)
- run `cd cluster-scripts; ./redeploy-citus; ./redeploy-kyrix-server`, then wait for "Backend server started..." (see above) and look for "Kyrix running; run 'source cluster-scripts/setup-kyrix-vars.env' for convenience scripts/functions or visit http://:8000/"
- point a browser at that URL - for most Kubernetes providers, no firewall changes should be required.
- to load the larger 'dots' dataset, run a command like `cd cluster-scripts; SCALE=1 KYRIX_DB_RELOAD_FORCE=1 DATA=dots-pushdown-uniform ./restart-kyrix-server.sh` (SCALE multiplies the dataset size - start with SCALE=1, then try SCALE=10, etc.)
- reload your browser; you should see dots. If you don't, the density may be too low - either increase SCALE or modify dots.js to reduce the canvas size.
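If you want to step through increasing scale factors, the load command can be scripted. This is a sketch shown as a dry run - it only echoes each command so you can inspect it first; remove the `echo` to actually execute from the repo's cluster-scripts directory:

```shell
# Dry run: print the load command for each scale factor; remove 'echo' to execute.
# Assumes the cluster-scripts directory layout from the Kyrix repo.
for scale in 1 10 100; do
  echo "SCALE=$scale KYRIX_DB_RELOAD_FORCE=1 DATA=dots-pushdown-uniform ./restart-kyrix-server.sh"
done
```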
Note: the Kyrix indexing pipeline automatically detects the number of machine instances and cores per machine, and sets the number of Citus "shards" (Postgres tables) to one per core; e.g., if you use 96-core machines, you will find 96 PostgreSQL tables on each machine in the one database.
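As a concrete example using the test cluster described above, the total shard count works out to machines times cores per machine. On a live cluster you can cross-check this against Citus's shard metadata table, `pg_dist_shard`, on the coordinator (the pod and database names in the comment below are placeholders, not the actual names used by the cluster scripts):

```shell
# One Citus shard per core: total shards = machines x cores per machine.
nodes=20
cores_per_node=32
echo "total shards: $((nodes * cores_per_node))"   # prints "total shards: 640"
# To verify on a live cluster (placeholder pod/db names), query the Citus metadata:
#   kubectl exec <citus-master-pod> -- psql -d <dbname> -c "SELECT count(*) FROM pg_dist_shard;"
```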