
change chunk default size to 10MB #925

Merged: 4 commits merged into dev on Aug 8, 2023

Conversation

@bendichter (Contributor)

Motivation

Larger chunks are better for streaming. See https://youtu.be/rcS5vt-mKok?t=621

@bendichter (Contributor, Author)

cc @CodyCBakerPhD

bendichter requested a review from rly on July 30, 2023
@CodyCBakerPhD (Contributor)

@bendichter First off, how on earth did you find that video?? lol

It's the kind of thing I'd expect to see a white paper or at least a blog post about... certainly a nice case study on pagination for a more diffuse file structure than the one we deal with

First time hearing about the page buffer cache too...

Since we've been hearing from more and more people (Jeremy, Alessio, IBL, and now the HDF Group) that we should simply raise our default chunk sizes, I agree we should just go ahead and bump it up so we can shift the conversation from 'these chunks are too small' to 'these chunks are about the right size, but the shape could still be better'.

I do still wish someone (or we) could provide explicit, exhaustive, and thorough performance testing of chunk shape/size on top of pagination (and, importantly, an easy-to-use framework someone could run to evaluate their own ideas on a given architecture), to establish once and for all what the more optimal sizes/shapes would be for a given dataset, or averaged over several datasets.
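
For a very rough local starting point (nothing like the exhaustive framework described above), one could write the same array with a few candidate chunk shapes and time a fixed access pattern against each. The file name, array size, chunk shapes, and slice below are all made up for illustration.

```python
import time

import h5py
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(-2**15, 2**15, size=(500_000, 64), dtype=np.int16)  # ~64 MB toy array

# ~1 MB, ~10 MB, and ~8 MB chunks (the last deliberately long and narrow)
chunk_shapes = [(8_000, 64), (80_000, 64), (500_000, 8)]

with h5py.File("chunk_benchmark.h5", "w") as f:  # hypothetical scratch file
    for i, chunks in enumerate(chunk_shapes):
        f.create_dataset(f"trial_{i}", data=data, chunks=chunks, compression="gzip")

with h5py.File("chunk_benchmark.h5", "r") as f:
    for i, chunks in enumerate(chunk_shapes):
        start = time.perf_counter()
        _ = f[f"trial_{i}"][100_000:150_000, :16]  # example access pattern
        print(chunks, f"{time.perf_counter() - start:.3f} s")
```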

Until that day comes, I'll just leave the rest below as notes to follow up on (you can skip reading everything below here unless very interested)

Point 2 on: https://youtu.be/rcS5vt-mKok?t=773

I really wish they had cited the exact AWS Best Practices reference used for this: the closest thing is https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html, which is phrased as 'the typical', not 'the recommended', and again no performance tests or logs are cited as the basis for that claim. I even dug through their other white papers, such as https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf, https://d1.awsstatic.com/whitepapers/AWS_Cloud_Best_Practices.pdf, https://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf, and https://d0.awsstatic.com/whitepapers/Cost_Optimization_with_AWS.pdf, and couldn't find a source for the exact number ranges listed.

The closest thing to an actual number I could find is the estimate of 5,500 GET requests per second per S3 bucket from https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html. One would then have to do the math to translate a single data access slice pattern (from HDF5) into the number of GET requests it produces (which depends on the chunking of the source file), and assume they are the only person making requests to that bucket (during that one second, anyway).

However, as a blatant heuristic I can see their argument: if viewers/users commonly want to request data slices averaging >5 MB, independent of the exact access slice pattern, then you will inevitably end up with fewer GET requests if the chunking is closer in size to those requests (both a lower chance that the request pattern crosses multiple chunks inefficiently and a lower chance that the request spans a large number of total chunks). So on average, as they say, returning more data with less communication is better than returning less data with more communication. That said, for any specific access pattern I can provide counter-examples of chunk shapes, even large ones, that could trigger slowdowns.
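
To make that heuristic concrete, here is a back-of-the-envelope sketch (not from the PR; the dataset and chunk shapes are hypothetical) counting how many chunks, and hence roughly how many byte-range GET requests, a single slice touches:

```python
import math

def chunks_touched(request_shape, chunk_shape):
    """Upper bound on the number of chunks intersected by a slice starting at the origin."""
    return math.prod(math.ceil(r / c) for r, c in zip(request_shape, chunk_shape))

# Hypothetical ephys-like request: 25,000 samples x 384 channels of int16 (~19 MB).
request = (25_000, 384)

print(chunks_touched(request, (16_000, 32)))    # ~1 MB chunks  -> 24 requests
print(chunks_touched(request, (163_840, 32)))   # ~10 MB chunks -> 12 requests
print(chunks_touched(request, (2_600_000, 1)))  # ~5 MB chunks, badly shaped -> 384 requests
```

The last line is the counter-example flavor mentioned above: chunk size alone is not enough if the shape fights the access pattern.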


codecov bot commented Jul 31, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (64a444f) 88.33% compared to head (9e84f53) 88.33%.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #925   +/-   ##
=======================================
  Coverage   88.33%   88.33%           
=======================================
  Files          45       45           
  Lines        9283     9284    +1     
  Branches     2651     2651           
=======================================
+ Hits         8200     8201    +1     
  Misses        765      765           
  Partials      318      318           
Files Changed                       Coverage Δ
src/hdmf/backends/hdf5/h5tools.py   82.56% <100.00%> (+0.01%) ⬆️
src/hdmf/data_utils.py              90.24% <100.00%> (ø)


@bendichter (Contributor, Author)

First off, how on earth did you find that video?? lol

YouTube recommended it haha. The Algorithm saves the day.

@rly (Contributor) commented Aug 7, 2023

This sounds good to me. However, we should increase the default chunk cache size to 10 MB then. See https://www.star.nesdis.noaa.gov/jpss/documents/HDF5_Tutorial_201509/2-2-Mastering%20Powerful%20Features.pptx.pdf for a pitfall of chunks being larger than the cache size.

It should be fine to increase the chunk cache size. Each dataset has its own cache. NWB files typically do not have many datasets open. netcdf-4 uses a chunk cache size of 32 MB. We could do the same.

See https://stackoverflow.com/a/57278604/20177 for example code.
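
For reference, a minimal sketch of what that looks like with h5py's file-level cache parameters (the file name and dataset path are made up; this is not the exact hdmf change):

```python
import h5py

CHUNK_CACHE_NBYTES = 10 * 1024**2  # 10 MiB, matching the new default chunk size

with h5py.File(
    "example.nwb",                    # hypothetical file
    mode="r",
    rdcc_nbytes=CHUNK_CACHE_NBYTES,   # raw data chunk cache size (per open dataset)
    rdcc_nslots=10_007,               # hash slots; a prime much larger than the number of cached chunks
    rdcc_w0=0.75,                     # eviction preference for fully read/written chunks
) as f:
    data = f["acquisition/ElectricalSeries/data"][:1000]  # hypothetical dataset path
```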

@rly (Contributor) commented Aug 7, 2023

I made the change. I would be curious to see whether increasing the raw data chunk cache size impacts cloud performance, especially when compression is involved.
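
One way to probe that (a sketch only; the URL and dataset path are placeholders, and it assumes fsspec is available to serve byte-range reads over HTTP) would be to time the same remote read with different cache sizes:

```python
import time

import fsspec
import h5py

URL = "https://example-bucket.s3.amazonaws.com/some_file.nwb"  # hypothetical remote file

def time_read(rdcc_nbytes):
    with fsspec.open(URL, mode="rb") as remote:  # file-like object backed by byte-range requests
        with h5py.File(remote, mode="r", rdcc_nbytes=rdcc_nbytes) as f:
            dset = f["acquisition/ElectricalSeries/data"]  # hypothetical dataset path
            start = time.perf_counter()
            _ = dset[:10_000, :]  # cache effects matter most for repeated/overlapping reads of compressed chunks
            return time.perf_counter() - start

for nbytes in (1 * 1024**2, 10 * 1024**2, 32 * 1024**2):
    print(f"rdcc_nbytes={nbytes:>9d}: {time_read(nbytes):.2f} s")
```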

@bendichter (Contributor, Author)

Looks good to me!

rly merged commit 9e194a4 into dev on Aug 8, 2023 (26 checks passed)
rly deleted the change_chunk_default_size branch on August 8, 2023
rly added a commit that referenced this pull request on Aug 8, 2023
rly mentioned this pull request on Aug 8, 2023
rly added a commit that referenced this pull request on Aug 8, 2023