[Feature, Perf] Audit and optimize the embedded files accessing code #714
Comments
Yes, I've been thinking about redesigning the format as well. I thought we could write out each file individually, like we write out individual strings, with no need for nested compression. Basically have a new kind of record, "File", that's literally two strings: one for the file path and the other for the content (uncompressed). Maybe also the timestamp and attributes (or other metadata?). The outer stream would compress anyway. We'd need to measure how the performance of this approach compares to the current one, and what the binlog file size would be. The upside of this approach is that if the build crashes or is interrupted in the middle, we still get all the records, strings and files embedded so far.

Another idea would be to unzip the file archive when opening the binlog, as usual, and then immediately re-zip each file individually and store the compressed byte[] for each file, unpacking each file's text on demand. This would spike memory and CPU when reading the binlog initially, but would settle to lower memory consumption once the viewer is open. This also seems like the cheapest approach.

Or we could store records, strings, name-value lists and files in four separate streams and concatenate them at the end. The downside is that if the build fails in the middle, we lose everything.
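The "File" record idea can be sketched roughly as follows. This is a Python illustration with an invented layout (the record tag and length-prefixed string encoding are assumptions, not the actual binlog format): each record is a tag byte followed by two strings, path and uncompressed content, and the outer stream is compressed as a whole.

```python
import gzip
import io
import struct

FILE_RECORD = 0x0F  # hypothetical record tag, not a real binlog value

def write_string(out, text):
    data = text.encode("utf-8")
    out.write(struct.pack("<I", len(data)))  # 4-byte little-endian length prefix
    out.write(data)

def read_string(inp):
    (length,) = struct.unpack("<I", inp.read(4))
    return inp.read(length).decode("utf-8")

def write_file_record(out, path, content):
    # A "File" record is literally two strings: path, then uncompressed content.
    out.write(bytes([FILE_RECORD]))
    write_string(out, path)
    write_string(out, content)

# Build a stream with two file records; the outer gzip layer compresses everything.
raw = io.BytesIO()
write_file_record(raw, "src/app.csproj", "<Project />")
write_file_record(raw, "Directory.Build.props", "<Project></Project>")
compressed = gzip.compress(raw.getvalue())

# Reading back: each record is self-contained, so records written before a
# crash or interruption remain recoverable.
inp = io.BytesIO(gzip.decompress(compressed))
while inp.read(1) == bytes([FILE_RECORD]):
    print(read_string(inp), "->", read_string(inp))
```

Because each record carries its own lengths, a reader can stop cleanly at the first truncated record instead of losing the whole archive.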
I don't know the binlog file format, but since it's a gzipped stream, may I suggest an idea: you could use the fact that two concatenated gzip streams form a valid gzip stream (see RFC 1952, section 2.2).
The main problem with this approach is that I don't think […] Here are my two cents, hoping they can be useful 🙂
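The RFC 1952 multi-member property mentioned above is easy to verify: concatenating two complete gzip streams yields a stream that conforming decompressors read as the concatenation of the payloads. A quick check in Python (whose stdlib `gzip` module handles multi-member streams):

```python
import gzip

events = gzip.compress(b"...event records...")
files = gzip.compress(b"...embedded files archive...")

# Two complete gzip members, simply byte-concatenated (RFC 1952, section 2.2).
combined = events + files

# A conforming reader decompresses all members back to back.
print(gzip.decompress(combined))  # b'...event records......embedded files archive...'
```

This suggests the events stream and the files archive could be appended as independent members of one physical file without breaking existing gzip readers.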
I made a refactoring that at least moves the logic that reads embedded files to a background thread, and reports the time it took in the bottom info bar after the binlog is loaded.
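The shape of that refactoring might look like the following sketch (Python stand-in for the actual C# viewer code; all names here are invented): the open call returns immediately, a background thread reads the embedded files and records the elapsed time, and the UI later picks both up.

```python
import threading
import time

class BinlogViewerSketch:
    """Illustrative only: defer embedded-file reading to a background thread."""

    def __init__(self, read_embedded_files):
        self.files = None
        self.load_seconds = None       # later shown in the bottom info bar
        self._read = read_embedded_files
        self._done = threading.Event()

    def open(self):
        # The calling (UI) thread returns immediately; files load in the background.
        threading.Thread(target=self._load, daemon=True).start()

    def _load(self):
        start = time.perf_counter()
        self.files = self._read()
        self.load_seconds = time.perf_counter() - start
        self._done.set()

    def wait(self, timeout=None):
        return self._done.wait(timeout)

viewer = BinlogViewerSketch(lambda: {"a.csproj": "<Project />"})
viewer.open()
viewer.wait(5)
print(f"loaded {len(viewer.files)} file(s) in {viewer.load_seconds:.3f}s")
```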
Context
Inspired by the discussion here: #711 (comment)
Embedded files are currently read fully and greedily into memory when the binlog is opened, even though they might never be accessed.
Gotchas
Embedded files are a zip archive nested within the zip stream of a binlog; accessing them later on would require one of these options:
Each of those options has significant downsides. The optimal approach would need to be determined by testing.
Alternative
Redesigning the binlog format.
E.g.: the compressed events stream and the files zip archive would be two independent streams within a single file. The file would have a few empty bytes preallocated at the beginning, which would then be overwritten as the binlog is written:
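The preallocated-bytes idea can be sketched as a small header that is first written as zeros and then patched once the stream offsets are known. This Python illustration invents the header layout (two 8-byte offsets) purely for demonstration:

```python
import os
import struct
import tempfile

HEADER_FMT = "<QQ"  # invented layout: offsets of the events stream and files archive
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def write_binlog(path, events, files_archive):
    with open(path, "wb") as out:
        out.write(b"\x00" * HEADER_SIZE)   # preallocate empty header bytes
        events_offset = out.tell()
        out.write(events)                  # compressed events stream
        files_offset = out.tell()
        out.write(files_archive)           # independent files zip archive
        out.seek(0)                        # overwrite the header now that offsets are known
        out.write(struct.pack(HEADER_FMT, events_offset, files_offset))

def read_offsets(path):
    with open(path, "rb") as inp:
        return struct.unpack(HEADER_FMT, inp.read(HEADER_SIZE))

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.binlog")
    write_binlog(path, b"EVENTS", b"FILES")
    print(read_offsets(path))  # (16, 22)
```

A reader can then seek straight to the files archive without decompressing the events stream, which is what makes on-demand file access possible.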
Other possible alterations to the format could be made at the same time to optimize the compression ratio, enable quicker redaction workflows, etc. - e.g.:
FYI @rokonec - he has been exploring some of these ideas