
Determine file encoding for initial state #1396

Closed

segfault-magnet opened this issue Oct 4, 2023 · 3 comments
Labels: regenesis, SDK team

Comments

@segfault-magnet (Contributor)

Current development on the regenesis feature has split off the initial state from the chain config. This buys us some flexibility to select an appropriate encoding for the initial state file.

Some points to consider:

  • Storage -- how much overhead does the encoding have?
  • Cursor -- how quickly can we continue decoding the file when resuming the regenesis?
  • Performance -- both for decoding and encoding
  • Compression -- encoding overhead can be reduced by compression. On the other hand, keeping a cursor to the last loaded element might get complicated. If compression is used, it ideally shouldn't have system dependencies and should be a well-known algorithm.
  • Human readability -- keeping the file human-readable helps with manipulating it, developing tools to analyze it, and testing.
  • Library maturity -- general support for encoding/decoding, and Rust-specific support in particular.
  • Streaming -- can we encode/decode on the fly? What is the API like? Who handles reading/writing the data?
@segfault-magnet (Contributor, Author) commented Oct 10, 2023

An update:

Wrote a small utility to benchmark our use case.

Our expected data volume is enormous -- a 1TB rocksdb database at the moment of regenesis. Benchmarking with this much data directly became problematic, so I resorted to a mixed approach of benchmarks and linear regression.

Testing was done in memory to remove the storage speed variable.
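
For reference, the extrapolation is just an ordinary least-squares fit over the measured points. A minimal sketch of the idea (the sample numbers below are placeholders, not the actual measurements):

```rust
/// Fit y = a + b*x over measured (entries, value) points and extrapolate.
/// Illustrative only; the benchmark utility is what produced the real numbers.
fn extrapolate(samples: &[(f64, f64)], target_entries: f64) -> f64 {
    let n = samples.len() as f64;
    let sum_x: f64 = samples.iter().map(|(x, _)| x).sum();
    let sum_y: f64 = samples.iter().map(|(_, y)| y).sum();
    let sum_xy: f64 = samples.iter().map(|(x, y)| x * y).sum();
    let sum_xx: f64 = samples.iter().map(|(x, _)| x * x).sum();

    let slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x);
    let intercept = (sum_y - slope * sum_x) / n;

    intercept + slope * target_entries
}

fn main() {
    // Placeholder measurements: (number of entries, bytes on disk).
    let measured = [(10_000.0, 4.0e6), (50_000.0, 20.0e6), (100_000.0, 40.0e6)];
    println!("predicted bytes at 1B entries: {}", extrapolate(&measured, 1.0e9));
}
```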

We were told to expect the contract state to reach on the order of 100M entries, which means a single ContractConfig could take up 6.4GB of memory when loaded (64B per entry, or more depending on decoding overhead).

This made us uneasy about keeping the state and balance collections inside ContractConfig.

We agreed that ContractConfig would need some flattening. We considered encoding the state and balances into separate files, but ultimately decided against it because of the overhead of matching the contract state and balances back to the contract itself: a 'foreign key' would need to be saved with each storage or balance entry. If we don't enforce an ordering while encoding, a columnar file layout would be needed so that we can locate all of a contract's state and balances when parsing a single contract. We briefly considered using a rocksdb snapshot (or sled, or mysql) as the target format itself but didn't pursue the idea further.

An alternative would be introducing a somewhat custom layout to fit our needs exactly. We're currently placing the following requirement on whatever data format is chosen:

  • First all CoinConfigs are encoded,
  • Then all MessageConfigs are encoded,
  • Finally all ContractConfigs are encoded. After each ContractConfig, all of its state and balances follow (in any order) before the next ContractConfig is encoded.

The ContractConfig was broken down into 3 structures:

  1. A ContractState representing one storage slot (a (Bytes32, Bytes32))
  2. A ContractBalance representing the balance of the contract for a particular AssetId (a (Bytes32, u64))
  3. ContractConfig - the original fields minus the above.

An example:

COIN
... (coins)
COIN
MESSAGE
... (messages)
MESSAGE
CONTRACT#1
STATE
STATE
BALANCE
STATE
BALANCE
CONTRACT#2
...
...

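Roughly, the flattened types and the stream they form could be sketched like this (field names are illustrative, not the actual fuel-core definitions):

```rust
// Illustrative layout only; the real fuel-core types carry more fields.
type Bytes32 = [u8; 32];

/// One storage slot of a contract.
struct ContractState {
    key: Bytes32,
    value: Bytes32,
}

/// The balance a contract holds for a particular AssetId.
struct ContractBalance {
    asset_id: Bytes32,
    amount: u64,
}

/// The original ContractConfig fields minus the state/balance collections.
struct ContractConfig {
    contract_id: Bytes32,
    code: Vec<u8>,
    // ... remaining fields, without `state` and `balances`
}

struct CoinConfig {/* owner, amount, asset_id, ... */}
struct MessageConfig {/* sender, recipient, amount, data, ... */}

/// The file is a flat sequence of these entries in the order shown above:
/// all coins, then all messages, then each contract followed by its state
/// and balance entries.
enum Entry {
    Coin(CoinConfig),
    Message(MessageConfig),
    Contract(ContractConfig),
    State(ContractState),
    Balance(ContractBalance),
}
```
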
Test data

Randomly generated Coins, Messages, and Contracts in the above-described ordering.
e.g. 1k test entries = 333 coins, 333 messages, and 333 contracts (each with 10k state entries and 100 balance entries)

Other formats

  • Didn't consider any column-oriented formats (such as Parquet) since we don't benefit from the columnar layout.
  • Didn't consider any formats that need schema code generation (such as Protobuf, Cap'n Proto, etc.).
  • Didn't consider formats whose Rust implementations are poorly maintained.

Compression

Each test is run with and without compression.

Used native Zstd with minimal compression
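
For the compressed runs, the output writer is wrapped in a Zstd stream at the lowest level; roughly like this (a sketch using the zstd crate, not the benchmark code itself):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn compressed_writer(path: &str) -> std::io::Result<impl Write> {
    let file = BufWriter::new(File::create(path)?);
    // Level 1 = minimal compression; `auto_finish` completes the zstd frame on drop.
    Ok(zstd::stream::write::Encoder::new(file, 1)?.auto_finish())
}
```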

Json (serde_json)

Pros:

  • Mature, popular
  • Serde compatible
  • Human readable
  • Can be streamed (JSON Lines)

Cons:

  • We don't need the schema it embeds, which adds encoding overhead

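For context, "streamed (JSON Lines)" here means writing one JSON value per line and decoding line by line; a sketch (not the benchmark utility itself, anyhow is used only for error plumbing):

```rust
use std::io::{BufRead, Write};

use serde::{de::DeserializeOwned, Serialize};

/// Append one entry as a single JSON line.
fn write_entry<T: Serialize>(out: &mut impl Write, entry: &T) -> anyhow::Result<()> {
    serde_json::to_writer(&mut *out, entry)?;
    out.write_all(b"\n")?;
    Ok(())
}

/// Decode entries one line at a time -- resuming is just "skip N lines".
fn read_entries<T: DeserializeOwned>(
    input: impl BufRead,
) -> impl Iterator<Item = anyhow::Result<T>> {
    input
        .lines()
        .map(|line| -> anyhow::Result<T> { Ok(serde_json::from_str(&line?)?) })
}
```
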
Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage

[graph: storage_requirements]

400GB without compression, 255GB with compression.

Encoding performance

[graph: encoding_time]

Around 15m for uncompressed JSON, 1h8m for compressed.

Decoding performance

[graph: decoding_time]

30 minutes for uncompressed JSON, 43 minutes for compressed.

Bincode

Pros:

  • Mature
  • Serde compatible
  • No schema
  • Can be streamed (just encode elements one after another)

Cons:

  • Not human-readable
  • Manual streaming -- collections require their size to be encoded beforehand

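The "just encode elements one after another" streaming amounts to something like this with bincode 1.x (sketch only):

```rust
use std::io::{Read, Write};

use serde::{de::DeserializeOwned, Serialize};

/// Write one entry; no collection length prefix is needed because we never
/// serialize the whole collection at once.
fn write_entry<T: Serialize>(out: &mut impl Write, entry: &T) -> bincode::Result<()> {
    bincode::serialize_into(&mut *out, entry)
}

/// Read the next entry; an UnexpectedEof I/O error signals the end of the stream.
fn read_entry<T: DeserializeOwned>(input: &mut impl Read) -> bincode::Result<T> {
    bincode::deserialize_from(&mut *input)
}
```
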
Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage

[graph: storage_requirements]

Around 290GB uncompressed, 240GB compressed.

Encoding performance

[graph: encoding_time]

Around 13 minutes uncompressed, and around 46 minutes compressed.

Decoding performance

[graph: decoding_time]

27 minutes uncompressed, around 37 minutes compressed.

Bson

Pros:

  • Mature
  • Serde compatible
  • Can be streamed

Cons:

  • Has schema overhead
  • Interface doesn't allow encoding into a writer -- extra allocations
  • No native support for unsigned types; we have to be careful with u64 and larger types since they may not fit into the i64 that BSON supports.
  • Manual streaming -- collections can only be encoded in-memory.

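To illustrate the unsigned-integer caveat, a hypothetical conversion that refuses balances not fitting into BSON's signed Int64 (sketch only, using the bson crate):

```rust
use bson::{doc, spec::BinarySubtype, Binary, Document};

/// BSON has no native u64, so anything above i64::MAX must be rejected
/// (or encoded differently, e.g. as bytes or a string).
fn balance_to_doc(asset_id: [u8; 32], amount: u64) -> Option<Document> {
    let amount = i64::try_from(amount).ok()?;
    Some(doc! {
        "asset_id": Binary { subtype: BinarySubtype::Generic, bytes: asset_id.to_vec() },
        "amount": amount,
    })
}
```
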
Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage

[graph: storage_requirements]

Around 400GB uncompressed, around 250GB compressed.

Encoding performance

[graph: encoding_time]

Around 15m uncompressed, around 45m compressed.

Decoding performance

[graph: decoding_time]

33 minutes uncompressed, 46 minutes compressed.

Bench summary graph

[graph: storage_requirements]
[graph: encoding_time]
[graph: decoding_time]

Compression impact on cursor

Seeking to a location in uncompressed data is fast. This is not the case for compressed data, since you have to decompress even the data you don't care about. There are workarounds should we need them.

The naive approach takes around 13 minutes to decompress and seek to the end of 400GB of compressed data.
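
The naive approach is just decode-and-discard up to the cursor; roughly like this (a sketch using the zstd and bincode crates, with a hypothetical entry type `T`):

```rust
use std::io::Read;

use serde::de::DeserializeOwned;

/// Naive resume: reopen the compressed stream and throw away the first
/// `already_loaded` entries. Everything before the cursor still has to be
/// decompressed (and, here, also decoded).
fn skip_to_cursor<T: DeserializeOwned>(
    compressed: impl Read,
    already_loaded: u64,
) -> anyhow::Result<impl Read> {
    let mut decoder = zstd::stream::read::Decoder::new(compressed)?;
    for _ in 0..already_loaded {
        let _: T = bincode::deserialize_from(&mut decoder)?;
    }
    Ok(decoder)
}
```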

Current summary:
We're going with bincode + optional compression. Starting work on implementing the reading/writing logic.

@MujkicA mentioned this issue Oct 12, 2023
@segfault-magnet (Contributor, Author) commented Nov 7, 2023

Tried one more format, Parquet. Even though we don't benefit from the columnar layout, Parquet has the advantage of encoding data in chunks (solving the cursor + encoding problem).

On the downside, it is not Serde compatible and has poor support for deriving the encoding/decoding code.

Also, the columns of a file should represent one entity (not multiple, as a Rust enum might). That means 5 files: coins, messages, contracts, contract_state, and contract_balance.

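For reference, writing one of the five files with the arrow/parquet crates could look roughly like this (column names and types are illustrative, not the final schema):

```rust
use std::{fs::File, sync::Arc};

use arrow::array::{ArrayRef, FixedSizeBinaryArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

/// Write a chunk of contract balances into a `contract_balance` Parquet file.
/// Batches are buffered and flushed as row groups, which are the chunks that
/// make resuming (the cursor) cheap.
fn write_balance_chunk(
    path: &str,
    rows: &[([u8; 32], u64)],
) -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("asset_id", DataType::FixedSizeBinary(32), false),
        Field::new("amount", DataType::UInt64, false),
    ]));

    let asset_ids =
        FixedSizeBinaryArray::try_from_iter(rows.iter().map(|(id, _)| id.as_slice()))?;
    let amounts = UInt64Array::from_iter_values(rows.iter().map(|(_, amount)| *amount));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(asset_ids) as ArrayRef, Arc::new(amounts) as ArrayRef],
    )?;

    // Compression (e.g. Gzip at its lowest level) would be configured via
    // `parquet::file::properties::WriterProperties` instead of `None`.
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```
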
Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage

[graph: storage_requirements]

Around 140GB without compression, and slightly more (+5GB) with compression (Gzip at its lowest level, matching the minimal compression levels used for the other formats).

Encoding performance

[graph: encoding_time]

Around 300s (uncompressed) and around 1250s (compressed).

Decoding performance

[graph: decoding_time]

480s (uncompressed) and around 720s (compressed).

All compared

Storage

[graph: storage_requirements]

Encoding performance

[graph: encoding_time]

Decoding performance

[graph: decoding_time]

It seems Parquet is a clear winner. Will proceed to use it instead of bincode for the regenesis.

@xgreenx (Collaborator) commented Jan 31, 2024

Done during #1474

@xgreenx closed this as completed Jan 31, 2024