change chunk default size to 10MB #925
Conversation
@bendichter First off, how on earth did you find that video?? lol It's the kind of thing I'd expect to see a white paper or at least a blog post about... certainly a nice case study on pagination for a more diffuse file structure than we deal with. First time hearing about the page buffer cache too...

Since we've been hearing from more and more people (Jeremy, Alessio, IBL, now the HDF Group) that we should simply raise our default chunk sizes, I agree we should just go ahead and bump it up so we can shift the conversation from 'these chunks are too small' to 'these chunks are about the right size but the shape could still be better'.

I do still wish someone (or us) could provide explicit, exhaustive, and thorough performance testing of chunk shape/size on top of pagination (and, importantly, an easy-to-use framework someone could run to evaluate their own ideas on a given architecture) to prove once and for all what the more optimal sizes/shapes are for a given dataset, or on average over several datasets. Until that day comes, I'll just leave the rest below as notes to follow up on (you can skip reading everything below here unless very interested).

Point 2 on: https://youtu.be/rcS5vt-mKok?t=773

I really wish they had cited the exact AWS Best Practice reference used for this: the closest thing is https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html, which is phrased as 'the typical', not 'the recommended', and again no performance tests or logs are cited as the basis for that claim. I even dug around their white papers, such as https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf, https://d1.awsstatic.com/whitepapers/AWS_Cloud_Best_Practices.pdf, https://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf, and https://d0.awsstatic.com/whitepapers/Cost_Optimization_with_AWS.pdf, and couldn't find a source for the exact number ranges listed.

The closest to an actual number estimate I could find is the estimate of 5,500 GET requests per second on an S3 bucket from https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html. One would then have to do the math to translate a single data access slice pattern (from HDF5) into the number of GET requests it produces (which depends on the chunking in the source file), and assume they are the only person currently making requests to that bucket (during that one second, anyway).

However, as a rough heuristic I can see their argument: if viewers/users commonly want to request data slices that average >5 MB, independent of the exact access slice pattern, then you will inevitably end up with fewer GET requests if the chunking is closer in size to those requests (both a lower chance that the request pattern crosses multiple chunks inefficiently and a lower chance that the requests span a large number of total chunks). So on average, as they say, returning more data with less communication is better than returning less data with more communication. That said, for any specific access pattern I can provide counter-examples of chunk shapes, even large ones, that could trigger slowdowns.
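To make that "do the math" step concrete, here is a minimal sketch (not from this PR; the dataset shape, slice, and chunk shapes are made-up illustrations) that counts how many chunks, and hence roughly how many byte-range GET requests, a single hyperslab read touches for a given chunk shape:

```python
import math

def chunks_touched(slice_ranges, chunk_shape):
    """Count how many chunks a hyperslab selection intersects.

    Without a page buffer, each intersected chunk is at best one
    byte-range GET request to S3, so fewer touched chunks means
    fewer requests for the same amount of data returned.
    """
    count = 1
    for (start, stop), chunk_len in zip(slice_ranges, chunk_shape):
        first = start // chunk_len
        last = (stop - 1) // chunk_len
        count *= last - first + 1
    return count

# Hypothetical ephys dataset of shape (samples, channels), int16 (2 bytes/element).
# Read 30 s at 30 kHz across all 384 channels (~690 MB of data either way);
# the number of GETs depends only on the chunk shape.
selection = ((0, 900_000), (0, 384))

for chunk_shape in [(16_384, 32), (13_650, 384)]:  # ~1 MB vs ~10 MB chunks
    nbytes = math.prod(chunk_shape) * 2
    print(f"{chunk_shape} (~{nbytes / 1e6:.1f} MB/chunk): "
          f"{chunks_touched(selection, chunk_shape)} chunks touched")
# -> the ~1 MB chunking touches 660 chunks; the ~10 MB chunking touches 66.
```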
Codecov Report
Patch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dev #925 +/- ##
=======================================
Coverage 88.33% 88.33%
=======================================
Files 45 45
Lines 9283 9284 +1
Branches 2651 2651
=======================================
+ Hits 8200 8201 +1
Misses 765 765
Partials 318 318
☔ View full report in Codecov by Sentry.
YouTube recommended it haha. The Algorithm saves the day.
This sounds good to me. However, we should then also increase the default chunk cache size to at least 10 MB. See https://www.star.nesdis.noaa.gov/jpss/documents/HDF5_Tutorial_201509/2-2-Mastering%20Powerful%20Features.pptx.pdf for a pitfall of chunks being larger than the cache size. It should be fine to increase the chunk cache size: each dataset has its own cache, and NWB files typically do not have many datasets open at once. netcdf-4 uses a chunk cache size of 32 MB; we could do the same. See https://stackoverflow.com/a/57278604/20177 for example code.
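For reference, a minimal sketch of what that could look like through h5py (the file name and dataset path below are placeholders; `rdcc_nbytes` is h5py's name for the raw data chunk cache size, and 32 MB simply mirrors the netcdf-4 default mentioned above):

```python
import h5py

# The HDF5 default raw data chunk cache is 1 MB per open dataset, which is
# smaller than a single 10 MB chunk; 32 MB mirrors the netcdf-4 default.
CHUNK_CACHE_NBYTES = 32 * 1024 * 1024

with h5py.File(
    "example.nwb",                   # placeholder file name
    "r",
    rdcc_nbytes=CHUNK_CACHE_NBYTES,  # chunk cache size in bytes, per open dataset
) as f:
    data = f["acquisition/ElectricalSeries/data"][:1000]  # placeholder dataset path
```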
I made the change. I would be curious to see whether increasing the raw data chunk cache size impacts cloud performance, especially when compression is involved.
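If someone wants to try that, here is a rough sketch of how the comparison could be run against a remote file with fsspec + h5py (the URL and dataset path are placeholders, and a proper benchmark would need repeated, cache-cold runs):

```python
import time

import fsspec
import h5py

URL = "https://example-bucket.s3.amazonaws.com/example.nwb"  # placeholder URL
DATASET = "acquisition/ElectricalSeries/data"                # placeholder dataset path

for rdcc_nbytes in (1 * 1024**2, 10 * 1024**2, 32 * 1024**2):
    with fsspec.open(URL, mode="rb") as remote:
        with h5py.File(remote, "r", rdcc_nbytes=rdcc_nbytes) as f:
            dset = f[DATASET]
            t0 = time.perf_counter()
            _ = dset[:100_000]  # time one contiguous read; strided or repeated
                                # reads would exercise the chunk cache more
            elapsed = time.perf_counter() - t0
            print(f"rdcc_nbytes={rdcc_nbytes // 1024**2} MB: {elapsed:.2f} s")
```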
Looks good to me!
Motivation
Larger chunks are better for streaming from cloud storage, since reads translate into fewer, larger range requests. See https://youtu.be/rcS5vt-mKok?t=621
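For context on what a 10 MB default means in practice, here is a small illustration (not the library's actual chunking code; the shapes and dtype are made up) of the chunk shape that falls out of a ~10 MB byte target:

```python
import math

import numpy as np

TARGET_CHUNK_NBYTES = 10 * 1024 * 1024  # the new 10 MB default target

def scale_chunk_shape(data_shape, dtype, target_nbytes=TARGET_CHUNK_NBYTES):
    """Shrink a dataset shape uniformly until one chunk fits the byte target.

    Illustration only: a real heuristic may weight axes differently
    (e.g. keep the full channel extent and chunk only along time).
    """
    itemsize = np.dtype(dtype).itemsize
    max_elements = target_nbytes // itemsize
    total_elements = math.prod(data_shape)
    if total_elements <= max_elements:
        return tuple(data_shape)
    # Scale every axis by the same factor so the chunk keeps the data's aspect ratio.
    factor = (max_elements / total_elements) ** (1 / len(data_shape))
    return tuple(max(1, int(dim * factor)) for dim in data_shape)

# e.g. a (10,000,000 samples x 384 channels) int16 recording:
print(scale_chunk_shape((10_000_000, 384), np.int16))
# -> roughly (369_500, 14), i.e. about 10 MB per chunk
```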