
change chunk default size to 10MB #925

Merged: 4 commits merged into dev on Aug 8, 2023

Conversation

@bendichter (Contributor)

Motivation

Larger chunks are better for streaming. See https://youtu.be/rcS5vt-mKok?t=621

@bendichter (Contributor, Author)

cc @CodyCBakerPhD

bendichter requested a review from rly on July 30, 2023
@CodyCBakerPhD (Contributor)

@bendichter First off, how on earth did you find that video?? lol

It's the kind of thing I'd expect to see a white paper or at least a blog post about... certainly a nice case study on pagination for a more diffuse file structure than the one we deal with

First time hearing about the page buffer cache too...

Since we've been hearing from more and more people (Jeremy, Alessio, IBL, and now the HDF Group) that we should simply raise our default chunk sizes, I agree we should just go ahead and bump it up so we can shift the conversation from 'these chunks are too small' to 'these chunks are about the right size, but the shape could still be better'.

I do still wish someone (or we) could provide explicit, exhaustive, and thorough performance testing of chunk shape/size on top of pagination (and, importantly, an easy-to-use framework someone could run to evaluate their own ideas on a given architecture), to establish once and for all what the more optimal sizes/shapes would be for a given dataset, or averaged over several datasets.
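
For a very rough local starting point (nothing like the exhaustive framework described above), one could write the same array with a few candidate chunk shapes and time a fixed access pattern against each. The file name, array size, chunk shapes, and slice below are all made up for illustration.

```python
import time

import h5py
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(-2**15, 2**15, size=(500_000, 64), dtype=np.int16)  # ~64 MB toy array

# ~1 MB, ~10 MB, and ~8 MB chunks (the last deliberately long and narrow)
chunk_shapes = [(8_000, 64), (80_000, 64), (500_000, 8)]

with h5py.File("chunk_benchmark.h5", "w") as f:  # hypothetical scratch file
    for i, chunks in enumerate(chunk_shapes):
        f.create_dataset(f"trial_{i}", data=data, chunks=chunks, compression="gzip")

with h5py.File("chunk_benchmark.h5", "r") as f:
    for i, chunks in enumerate(chunk_shapes):
        start = time.perf_counter()
        _ = f[f"trial_{i}"][100_000:150_000, :16]  # example access pattern
        print(chunks, f"{time.perf_counter() - start:.3f} s")
```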

Until that day comes, I'll just leave the rest below as notes to follow up on (you can skip reading everything below here unless very interested)

Point 2 on: https://youtu.be/rcS5vt-mKok?t=773

I really wish they had cited the exact AWS Best Practices reference used for this: the closest thing is https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html, which is phrased as 'the typical', not 'the recommended', and again no performance tests or logs are cited as the basis for that claim. I even dug through their other white papers, such as https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf, https://d1.awsstatic.com/whitepapers/AWS_Cloud_Best_Practices.pdf, https://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf, and https://d0.awsstatic.com/whitepapers/Cost_Optimization_with_AWS.pdf, and couldn't find a source for the exact number ranges listed.

The closest thing to an actual number I could find is the estimate of 5,500 GET requests per second per S3 bucket from https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html. One would then have to do the math to translate a single data access slice pattern (from HDF5) into the number of GET requests it produces (which depends on the chunking of the source file), and assume they are the only person making requests to that bucket (during that one second, anyway).

However, as a blatant heuristic I can see their argument: if viewers/users commonly want to request data slices averaging >5 MB, independent of the exact access slice pattern, then you will inevitably end up with fewer GET requests if the chunking is closer in size to those requests (both a lower chance that the request pattern crosses multiple chunks inefficiently and a lower chance that the request spans a large number of total chunks). So on average, as they say, returning more data with less communication is better than returning less data with more communication. That said, for any specific access pattern I can provide counter-examples of chunk shapes, even large ones, that could trigger slowdowns.
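
To make that heuristic concrete, here is a back-of-the-envelope sketch (not from the PR; the dataset and chunk shapes are hypothetical) counting how many chunks, and hence roughly how many byte-range GET requests, a single slice touches:

```python
import math

def chunks_touched(request_shape, chunk_shape):
    """Upper bound on the number of chunks intersected by a slice starting at the origin."""
    return math.prod(math.ceil(r / c) for r, c in zip(request_shape, chunk_shape))

# Hypothetical ephys-like request: 25,000 samples x 384 channels of int16 (~19 MB).
request = (25_000, 384)

print(chunks_touched(request, (16_000, 32)))    # ~1 MB chunks  -> 24 requests
print(chunks_touched(request, (163_840, 32)))   # ~10 MB chunks -> 12 requests
print(chunks_touched(request, (2_600_000, 1)))  # ~5 MB chunks, badly shaped -> 384 requests
```

The last line is the counter-example flavor mentioned above: chunk size alone is not enough if the shape fights the access pattern.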


codecov bot commented Jul 31, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (64a444f) 88.33% compared to head (9e84f53) 88.33%.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #925   +/-   ##
=======================================
  Coverage   88.33%   88.33%           
=======================================
  Files          45       45           
  Lines        9283     9284    +1     
  Branches     2651     2651           
=======================================
+ Hits         8200     8201    +1     
  Misses        765      765           
  Partials      318      318           
Files Changed                       Coverage Δ
src/hdmf/backends/hdf5/h5tools.py   82.56% <100.00%> (+0.01%) ⬆️
src/hdmf/data_utils.py              90.24% <100.00%> (ø)


@bendichter (Contributor, Author)

First off, how on earth did you find that video?? lol

YouTube recommended it haha. The Algorithm saves the day.

@rly (Contributor) commented Aug 7, 2023

This sounds good to me. However, we should increase the default chunk cache size to 10 MB then. See https://www.star.nesdis.noaa.gov/jpss/documents/HDF5_Tutorial_201509/2-2-Mastering%20Powerful%20Features.pptx.pdf for a pitfall of chunks being larger than the cache size.

It should be fine to increase the chunk cache size. Each dataset has its own cache. NWB files typically do not have many datasets open. netcdf-4 uses a chunk cache size of 32 MB. We could do the same.

See https://stackoverflow.com/a/57278604/20177 for example code.
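
For reference, a minimal sketch of what that looks like with h5py's file-level cache parameters (the file name and dataset path are made up; this is not the exact hdmf change):

```python
import h5py

CHUNK_CACHE_NBYTES = 10 * 1024**2  # 10 MiB, matching the new default chunk size

with h5py.File(
    "example.nwb",                    # hypothetical file
    mode="r",
    rdcc_nbytes=CHUNK_CACHE_NBYTES,   # raw data chunk cache size (per open dataset)
    rdcc_nslots=10_007,               # hash slots; a prime much larger than the number of cached chunks
    rdcc_w0=0.75,                     # eviction preference for fully read/written chunks
) as f:
    data = f["acquisition/ElectricalSeries/data"][:1000]  # hypothetical dataset path
```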

@rly (Contributor) commented Aug 7, 2023

I made the change. I would be curious to see whether increasing the raw data chunk cache size impacts cloud performance, especially when compression is involved.
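
One way to probe that (a sketch only; the URL and dataset path are placeholders, and it assumes fsspec is available to serve byte-range reads over HTTP) would be to time the same remote read with different cache sizes:

```python
import time

import fsspec
import h5py

URL = "https://example-bucket.s3.amazonaws.com/some_file.nwb"  # hypothetical remote file

def time_read(rdcc_nbytes):
    with fsspec.open(URL, mode="rb") as remote:  # file-like object backed by byte-range requests
        with h5py.File(remote, mode="r", rdcc_nbytes=rdcc_nbytes) as f:
            dset = f["acquisition/ElectricalSeries/data"]  # hypothetical dataset path
            start = time.perf_counter()
            _ = dset[:10_000, :]  # cache effects matter most for repeated/overlapping reads of compressed chunks
            return time.perf_counter() - start

for nbytes in (1 * 1024**2, 10 * 1024**2, 32 * 1024**2):
    print(f"rdcc_nbytes={nbytes:>9d}: {time_read(nbytes):.2f} s")
```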

@bendichter (Contributor, Author)

Looks good to me!

rly merged commit 9e194a4 into dev on Aug 8, 2023 (26 checks passed)
rly deleted the change_chunk_default_size branch on August 8, 2023
rly added a commit that referenced this pull request on Aug 8, 2023
rly mentioned this pull request on Aug 8, 2023
rly added a commit that referenced this pull request on Aug 8, 2023