How to disable block window usage? How about line-boundary instead of fixed bytes? #30
-
I should describe more clearly how this stuff works... That all being said, there's typically a sweet spot in terms of how big (or small) you want these matches to be. Smaller matches increase the metadata size and at some point start to make block compression less effective. However, if larger matches are found, you can very effectively reduce the number of blocks. I've found that matches of less than ~4 KiB typically bloat the metadata without giving additional benefits in compression ratio. I'll definitely document this at some point, but I think that in principle the code as it exists at the moment should already be able to find pretty much all the matches you want it to find if you configure things appropriately. I'd rather not add extra complexity in this particular area unless there's a compelling reason to do so.
-
Almost all. There's a <1% collision rate for the rolling hash, and in case of a collision only one offset is stored. I've actually tried storing all offsets in case of collisions, but it didn't make any practical difference. The main reasons for only storing hash values at distinct intervals are memory usage and performance. The matching cannot easily be parallelized, so on multicore CPUs this piece of the compression pipeline can easily become the bottleneck. In the default configuration, a hash value is stored only every 2048 bytes (window-size=12, i.e. 4096-byte windows, shifted by 1, i.e. every 4096 >> 1 = 2048 bytes). Decreasing this interval means increasing the size of the hash table, increasing the number of writes to the hash table, and decreasing lookup speed.
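To make the arithmetic concrete, here's a minimal C++ sketch of that scheme under the numbers above: a cyclic rolling hash over a 4096-byte window, one offset recorded every 4096 >> 1 = 2048 bytes, and the first offset winning on a hash collision. All identifiers are made up for illustration; this is not the actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Rotate left by one bit.
static uint32_t rotl1(uint32_t v) { return (v << 1) | (v >> 31); }

int main() {
    std::vector<uint8_t> data(1 << 20, 'x');               // stand-in for real input
    constexpr std::size_t kWindow = std::size_t{1} << 12;  // window-size=12 -> 4096 bytes
    constexpr std::size_t kStep = kWindow >> 1;            // increment shift 1 -> 2048 bytes
    std::unordered_map<uint32_t, std::size_t> hash_to_offset;

    uint32_t h = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Cyclic ("buzhash"-style) update; a real implementation would map
        // each byte through a random table instead of using it directly.
        h = rotl1(h) ^ data[i];
        // 4096 % 32 == 0, so the byte leaving the window is removed by a plain XOR.
        if (i >= kWindow) h ^= data[i - kWindow];
        // Store a hash only once per 2048-byte step: ~input_size / 2048 entries
        // total, which is what keeps memory usage and hash-table writes low.
        if (i + 1 >= kWindow && (i + 1 - kWindow) % kStep == 0) {
            hash_to_offset.emplace(h, i + 1 - kWindow);  // on collision, first offset wins
        }
    }
    return 0;
}
```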
-
Is there a way to completely disable the "window block" usage?

I ask because for certain workloads, although there are plenty of duplicates, they aren't properly aligned to a block size, thus `--blockhash-window-sizes` and `--window-increment-shift` are useless (unless one sets the window increment to something extremely small). My particular use-case is a large corpus of emails (~3.8 M, ~54 GiB) of which:
Alternatively, a better solution would be to compute these blocks at line boundaries, i.e. try to fill a block with successive lines, but stop before the last line that would overflow the block buffer; the same with the window increment: if enough bytes have been collected to fill a window increment and the next line would overflow, stop there.
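A minimal sketch of what I mean, assuming the simplest possible semantics (a chunk never splits a line, and a single line longer than the limit becomes its own oversized chunk); the terminator is a parameter, which ties into the configurability point below:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split `data` into chunks of at most `max_chunk` bytes, cutting only at
// line boundaries.
std::vector<std::string_view> chunk_at_line_boundaries(
        std::string_view data, std::size_t max_chunk, char terminator = '\n') {
    std::vector<std::string_view> chunks;
    std::size_t start = 0;
    while (start < data.size()) {
        std::size_t end = start;
        for (;;) {
            std::size_t nl = data.find(terminator, end);
            std::size_t line_end =
                (nl == std::string_view::npos) ? data.size() : nl + 1;
            // Stop before the line that would overflow, once we hold a line.
            if (line_end - start > max_chunk && end > start) break;
            end = line_end;
            if (end - start >= max_chunk || end == data.size()) break;
        }
        chunks.push_back(data.substr(start, end - start));
        start = end;
    }
    return chunks;
}
```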
The line terminator should be configurable:

- `\n` (perhaps treating `\n\r` and `\r\n` as meaning the same thing);
- `\0` (some files are zero-terminated, and perhaps one has a dataset full of zero-terminated files);
- a rolling checksum, as `gzip` or `zstd` do (see the `--rsyncable` option).

These options should work perfectly for text-based datasets, but also (especially the rolling checksum) for other file formats that for some reason prepend information at the beginning, thus shifting the file contents.
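For the rolling-checksum variant, here's a sketch of the general technique behind `--rsyncable` (not gzip's or zstd's exact algorithm): cut wherever a simple checksum over the last few dozen bytes hits a fixed bit pattern. The window length and mask here are assumptions, picked so boundaries land every ~4 KiB on average:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Return positions after which a chunk boundary is declared.
std::vector<std::size_t> content_defined_boundaries(std::string_view data) {
    constexpr std::size_t kWindow = 64;              // rolling-sum window (assumed)
    constexpr std::uint32_t kMask = (1u << 12) - 1;  // ~4 KiB average chunk size
    std::vector<std::size_t> boundaries;
    std::uint32_t sum = 0;  // additive rolling checksum over the last kWindow bytes
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += static_cast<unsigned char>(data[i]);
        if (i >= kWindow) sum -= static_cast<unsigned char>(data[i - kWindow]);
        // Boundary whenever the low 12 bits hit a fixed pattern.
        if ((sum & kMask) == kMask) boundaries.push_back(i + 1);
    }
    return boundaries;
}
```

Because the cut decision depends only on the nearby content, a header prepended to a file only disturbs the boundaries near the start; the chunks further in stay byte-identical and keep matching.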