Determine file encoding for initial state #1396
An update: I wrote a small utility to benchmark our use case.

Our expected data volume is enormous -- a 1TB RocksDB database at the moment of regenesis. Benchmarking this much data became problematic, so I resorted to a mixed approach of benchmarks and linear regression. Testing was done in memory to remove the storage speed variable.

Some expectations were given for the contract state in that it might reach numbers of 100M entries, which meant that one... This made us uneasy with regards to keeping the... We agreed that... An alternative would be introducing a somewhat custom layout to fit our needs exactly.

We're currently placing the following requirement on whatever data format is chosen:
The
An example:
Test data: randomly generated Coins, Messages, and Contracts in the above-described ordering.
Other formats: didn't consider any column-oriented formats (such as Parquet) since we don't benefit from them.
Compression: each test is run with and without compression, using native Zstd with minimal compression.

Json (serde_json)

Pros:
Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage: 400GB without compression, 255GB with compression.
Encoding performance: around 15m for uncompressed json, 1h8m for compressed.
Decoding performance: 30 minutes for uncompressed, 43 minutes for compressed json.
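The numbers above come from encoding the randomly generated entries with and without Zstd, as described earlier. Below is a minimal sketch of that kind of encode path, assuming the serde_json and zstd crates, a hypothetical Coin type, and a newline-delimited layout chosen purely for illustration; it is not the benchmark utility or the real fuel-core types.

```rust
use std::io::Write;

use serde::Serialize;

// Hypothetical stand-in for one state entry; the real fuel-core types differ.
#[derive(Serialize)]
struct Coin {
    owner: [u8; 32],
    amount: u64,
    asset_id: [u8; 32],
}

/// Encode entries as newline-delimited JSON, optionally wrapped in a
/// streaming Zstd encoder (level 1, i.e. minimal compression).
fn encode_json(entries: &[Coin], compress: bool) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    if compress {
        let mut enc = zstd::stream::write::Encoder::new(&mut out, 1)?;
        for e in entries {
            serde_json::to_writer(&mut enc, e)?;
            enc.write_all(b"\n")?;
        }
        enc.finish()?;
    } else {
        for e in entries {
            serde_json::to_writer(&mut out, e)?;
            out.write_all(b"\n")?;
        }
    }
    Ok(out)
}
```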
Bincode

Pros:

Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage
Encoding performance
Decoding performance
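The 10k-100k measurements are turned into 1B-entry predictions with a linear fit. A minimal sketch of that kind of least-squares extrapolation follows; the per-entry byte figure is only there to echo the roughly 400GB-per-1B JSON number above and is not a real measurement.

```rust
/// Ordinary least-squares fit of y = a + b*x over the measured points,
/// followed by extrapolation to `target_x`.
fn extrapolate(samples: &[(f64, f64)], target_x: f64) -> f64 {
    let n = samples.len() as f64;
    let sx: f64 = samples.iter().map(|&(x, _)| x).sum();
    let sy: f64 = samples.iter().map(|&(_, y)| y).sum();
    let sxx: f64 = samples.iter().map(|&(x, _)| x * x).sum();
    let sxy: f64 = samples.iter().map(|&(x, y)| x * y).sum();
    let slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    let intercept = (sy - slope * sx) / n;
    intercept + slope * target_x
}

fn main() {
    // Illustrative samples: (entry count, bytes) at 10k, 20k, ..., 100k entries,
    // at roughly 400 bytes per entry to echo the ~400GB-per-1B JSON figure above.
    let samples: Vec<(f64, f64)> = (1..=10)
        .map(|i| {
            let entries = i as f64 * 10_000.0;
            (entries, entries * 400.0)
        })
        .collect();

    // Predicted storage at 1B entries.
    let bytes = extrapolate(&samples, 1_000_000_000.0);
    println!("predicted: {:.0} GB", bytes / 1e9);
}
```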
Bson

Pros:
Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage
Encoding performance
Decoding performance: 33 minutes uncompressed, 46 minutes compressed.

Bench summary graph

Compression impact on cursor

Seeking to a location in uncompressed data is fast. This is not the case for compressed data, since you have to decompress even the data you don't care about. There are workarounds should we need them. The naive approach takes around 13 minutes to decompress and seek to the end of 400GB of compressed data (see the sketch below).

Current summary:
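The sketch referenced above is a minimal illustration of the seek difference, assuming the zstd crate and placeholder paths and offsets. Seeking in an uncompressed file is a single seek call, while the naive approach on a compressed stream has to decode and discard everything before the target.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Uncompressed file: reaching a byte offset is a single constant-time seek.
fn skip_uncompressed(path: &str, offset: u64) -> io::Result<File> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset))?;
    Ok(f)
}

/// Zstd-compressed file: the naive approach decompresses and discards
/// everything before the offset (the ~13 minute case described above).
fn skip_compressed(path: &str, offset: u64) -> io::Result<impl Read> {
    let mut dec = zstd::stream::read::Decoder::new(File::open(path)?)?;
    io::copy(&mut (&mut dec).take(offset), &mut io::sink())?;
    Ok(dec)
}
```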
Tried one more format: Parquet. Even though we don't benefit from the columnar layout, Parquet has the advantage of encoding data in chunks (solving the cursor + encoding problem). On the downside, it is not Serde compatible and has poor support for deriving the encoding/decoding code. Also, the columns of a file should represent one entity (not multiple, as a Rust enum might), so that means 5 files: coins, messages, contracts, contract_state and contract_balance.
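A minimal sketch of what that chunked, per-entity layout could look like for one of the files, assuming the parquet and arrow crates; the coins.parquet name, the two columns, and the 10,000-row chunk size are illustrative rather than the final schema:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

/// Write coins into their own Parquet file, one record batch per chunk,
/// so a single entity never shares a column with another (illustrative schema).
fn write_coins(coins: &[([u8; 32], u64)]) -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("owner", DataType::Binary, false),
        Field::new("amount", DataType::UInt64, false),
    ]));

    // Bounding the row-group size is what keeps both encoding and later
    // cursor-style reads chunked instead of whole-file.
    let props = WriterProperties::builder()
        .set_max_row_group_size(10_000)
        .build();

    let file = File::create("coins.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    for chunk in coins.chunks(10_000) {
        let owners = BinaryArray::from_iter_values(chunk.iter().map(|(o, _)| o.as_slice()));
        let amounts = UInt64Array::from(chunk.iter().map(|(_, a)| *a).collect::<Vec<_>>());
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(owners) as ArrayRef, Arc::new(amounts) as ArrayRef],
        )?;
        writer.write(&batch)?;
    }
    writer.close()?;
    Ok(())
}
```

Writing batch by batch keeps memory bounded during encoding, and the capped row groups are what later reads can target individually.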
Measurements taken for 10k, 20k, ..., 100k entries. Using linear regression to predict usage for up to 1B entries:

Storage
Encoding performance
Decoding performance
All compared

Storage
Encoding performance
Decoding performance

It seems Parquet is a clear winner. Will proceed to use it instead of bincode for the regenesis.
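For completeness, the matching chunked read path under the same assumptions as the write sketch above (the parquet crate's Arrow reader, illustrative file name and batch size):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

/// Read a per-entity Parquet file back in bounded batches; row groups can
/// also be selected individually, which is what makes resuming cheap.
fn read_coins() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("coins.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(10_000)
        // .with_row_groups(vec![...]) would restrict decoding to chosen chunks.
        .build()?;
    for batch in reader {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```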
Done during #1474
Current development on the regenesis feature has split off the initial state from the chain config. This buys us some flexibility to select an appropriate encoding for the initial state file.
Some points to consider: