- 2022-09-12: Started
@walldiss
DAS is the process of verifying the availability of block data by sampling chunks or shares of those blocks. The `das` package implements an engine to ensure the availability of the chain's block data via the `Availability` interface.
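For orientation, here is a minimal sketch of what that contract can look like. The exact `Availability` method set in celestia-node may differ between versions, and `Root` below is only a placeholder for the block's Data Availability Header:

```go
package share

import "context"

// Root stands in for the block's Data Availability Header (DAH);
// the real type lives elsewhere in celestia-node.
type Root struct{}

// Availability is the contract the DAS engine works against; the exact
// method set in celestia-node may differ between versions.
type Availability interface {
	// SharesAvailable subjectively validates that the shares committed to
	// by the given Root are retrievable from the network.
	SharesAvailable(context.Context, *Root) error
}
```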
Verifying the availability of block data is core functionality for celestia-node, and its performance benefits significantly from parallelization, which lets it fully utilize network bandwidth.
Share sampling is, by its nature, a network-bound operation that requires multiple network round-trips. The previous implementation of the DASer used a single-threaded, synchronous approach: one process sequentially sampled past headers, blocking on each response.
Using multiple coordinated workers running in parallel drastically improves the DASer's performance through better network bandwidth utilization. On the downside, the proposed solution introduces concurrency complexity.
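To illustrate the gain, here is a minimal sketch of bounded parallel sampling using `golang.org/x/sync/errgroup`. `sampleHeader` is a hypothetical stand-in for the real per-header sampling call, not the DASer's actual code:

```go
package das

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// sampleHeader is a hypothetical stand-in for the network-bound call that
// samples shares for a single header height.
func sampleHeader(ctx context.Context, height uint64) error {
	// one or more network round-trips happen here
	return nil
}

// sampleRange samples heights [from, to] with at most limit requests in
// flight, instead of blocking on one response at a time.
func sampleRange(ctx context.Context, from, to uint64, limit int) error {
	eg, ctx := errgroup.WithContext(ctx)
	eg.SetLimit(limit) // bound concurrency, cf. concurrencyLimit below
	for h := from; h <= to; h++ {
		h := h // capture the loop variable (pre-Go 1.22)
		eg.Go(func() error { return sampleHeader(ctx, h) })
	}
	return eg.Wait()
}
```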
To achieve parallelization, the DASer was split into the following core components (sketched in code after the list):
- The `Coordinator` holds the current state of sampled headers and defines which headers should be sampled next. `Worker`s perform sampling over a range of headers and communicate the results back to the `Coordinator`. Workers are created on demand, when `Job`s are available. The amount of concurrently running workers is limited by the const `concurrencyLimit`. The length of the sampling range is defined by the DASer configuration param `samplingRange`.
- The `Subscriber` subscribes to network head updates. When new headers are found, it notifies the `Coordinator`. Recent network head blocks are prioritized for sampling to increase the availability of the most in-demand blocks.
- The `CheckpointStore` stores/loads the `Coordinator` state as a checkpoint to allow seamless resuming upon restart. The `Coordinator` stores its state as a checkpoint on exit and resumes sampling from the latest stored state. It also periodically persists checkpoints so that a recent checkpoint exists even after a hard shutdown of the node.
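The following sketch shows one way these pieces can fit together. All names and fields are illustrative, not the exact celestia-node types:

```go
package das

import "context"

// Sketch only; names and fields are illustrative, not the exact
// celestia-node types.

// job is a contiguous range of header heights handed to one worker.
type job struct{ from, to uint64 }

// result reports a finished job plus any heights that failed sampling.
type result struct {
	job
	failed []uint64
}

// state is a placeholder for the coordinator's sampled-headers bookkeeping.
type state struct{ next uint64 }

// sample is a hypothetical stand-in for sampling one header's shares.
func sample(ctx context.Context, height uint64) error { return nil }

// coordinator owns the sampling state, forms jobs, and absorbs results.
type coordinator struct {
	state    state
	priority []job       // ranges around recent network heads, served first
	results  chan result // workers report back over this channel
}

// runWorker samples every header in the job and reports the outcome.
func (c *coordinator) runWorker(ctx context.Context, j job) {
	res := result{job: j}
	for h := j.from; h <= j.to; h++ {
		if err := sample(ctx, h); err != nil {
			res.failed = append(res.failed, h)
		}
	}
	c.results <- res
}
```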
Sampling flow (sketched in code after the list):
- The `Coordinator` checks if the worker concurrency limit is reached.
- If the limit is not reached, the `Coordinator` forms a new sampling `Job`:
  - It looks for a `Job` at the top of the priority stack.
  - If the priority stack is empty, it picks the next not-yet-sampled range for the `Job`.
- It launches a new `Worker` with the formed `Job`.
- The `Worker` gets headers for the given range and samples their shares.
- When the `Worker` is done, it communicates the results back to the `Coordinator`.
- The `Coordinator` updates the sampling state according to the worker's results.
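Continuing the sketch of the components above, the whole flow condenses into a single event loop roughly like this. Only `concurrencyLimit = 16` comes from this ADR; `samplingRange = 100` and the head-clamping are assumptions for illustration:

```go
// run drives the sampling flow described above. networkHead would be kept
// up to date by the Subscriber; here it is a plain parameter for brevity.
func (c *coordinator) run(ctx context.Context, networkHead uint64) {
	const (
		concurrencyLimit = 16  // max workers in flight (see below)
		samplingRange    = 100 // headers per job; illustrative value
	)
	inFlight := 0
	for {
		// Launch workers while below the concurrency limit and there is
		// work: either a prioritized range or a not-yet-sampled one.
		for inFlight < concurrencyLimit &&
			(len(c.priority) > 0 || c.state.next <= networkHead) {
			var j job
			if n := len(c.priority); n > 0 {
				// take the Job from the top of the priority stack
				j, c.priority = c.priority[n-1], c.priority[:n-1]
			} else {
				// otherwise pick the next not-yet-sampled range
				j = job{from: c.state.next, to: c.state.next + samplingRange - 1}
				if j.to > networkHead {
					j.to = networkHead // clamp to the known head
				}
				c.state.next = j.to + 1
			}
			inFlight++
			go c.runWorker(ctx, j)
		}
		select {
		case res := <-c.results:
			inFlight--
			c.updateState(res) // fold worker results into the state
		case <-ctx.Done():
			return
		}
	}
}

// updateState records which heights were sampled or failed (stub).
func (c *coordinator) updateState(res result) {}
```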
The maximum amount of concurrently running workers is defined by the const `concurrencyLimit = 16`. This value is an approximation that came from first basic performance tests.
During the tests, the samples/sec rate was observed as a moving average over a 30-second window for a period of 5 minutes. Only sampled headers with width > 2 counted towards the metric.
| Amount of workers | Samples/sec |
|-------------------|-------------|
| 8                 | 8.66        |
| 16                | 11.13       |
| 32                | 11.33       |
| 64                | 11.83       |
Based on these results, values higher than 16 don't bring much benefit, while increased parallelization comes at the cost of higher memory consumption.
Future improvements will be discussed later and are out of the scope of this ADR.
Status: Implemented
Future work:
Several param values that are currently hardcoded in the DASer (`samplingRange`, `concurrencyLimit`, `priorityQueueSize`, `genesisHeight`, `backgroundStoreInterval`) should become configurable, so the node runner can define them based on the specific node setup. Default values should be optimized through performance testing for the most common setups, and could potentially vary for different node types.
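A hypothetical shape for such a config, with field names mirroring the params above:

```go
package das

import "time"

// Parameters is a hypothetical shape for such a config. Only
// concurrencyLimit = 16 is stated in this ADR; other defaults would be
// established by performance testing.
type Parameters struct {
	SamplingRange           uint64        // headers per sampling Job
	ConcurrencyLimit        int           // max concurrently running workers
	PriorityQueueSize       int           // cap on the recent-head priority queue
	GenesisHeight           uint64        // height sampling starts from
	BackgroundStoreInterval time.Duration // how often checkpoints are persisted
}
```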