How to disable block window usage? How about line-boundary instead of fixed bytes? #30
-
I should describe more clearly how this stuff works... That all being said, there's typically a sweet spot in terms of how big (or small) you want these matches to be. Smaller matches increase the metadata size and at some point start to make block compression less effective. However, if larger matches are found, you can very effectively reduce the number of blocks. I've found that matches of less than ~4 KiB typically bloat the metadata without giving additional benefits in compression ratio. I'll definitely document this at some point, but I think that in principle the code as it exists at the moment should already be able to find pretty much all the matches you want it to find if you configure things appropriately. I'd rather not add extra complexity in this particular area unless there's a compelling reason to do so.
-
Almost all. There's a <1% collision rate for the rolling hash, and in case of a collision only one offset is stored. I've actually tried storing all offsets in case of collisions, but it didn't make any practical difference. The main reasons for only storing hash values at distinct intervals are memory usage and performance. The matching cannot easily be parallelized, so on multicore CPUs this piece of the compression pipeline can easily become the bottleneck. In the default configuration, a hash value is stored only every 2048 bytes (window-size=12, i.e. 4096-byte windows, shifted by 1, i.e. every 4096 >> 1 = 2048 bytes). Decreasing this interval means increasing the size of the hash table, increasing the number of writes to the hash table, and decreasing lookup speed.
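To make the arithmetic concrete, here's a minimal C++ sketch of that scheme under the numbers above: a cyclic rolling hash over a 4096-byte window, one offset recorded every 4096 >> 1 = 2048 bytes, and the first offset winning on a hash collision. All identifiers are made up for illustration; this is not the actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Rotate left by one bit.
static uint32_t rotl1(uint32_t v) { return (v << 1) | (v >> 31); }

int main() {
    std::vector<uint8_t> data(1 << 20, 'x');               // stand-in for real input
    constexpr std::size_t kWindow = std::size_t{1} << 12;  // window-size=12 -> 4096 bytes
    constexpr std::size_t kStep = kWindow >> 1;            // increment shift 1 -> 2048 bytes
    std::unordered_map<uint32_t, std::size_t> hash_to_offset;

    uint32_t h = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Cyclic ("buzhash"-style) update; a real implementation would map
        // each byte through a random table instead of using it directly.
        h = rotl1(h) ^ data[i];
        // 4096 % 32 == 0, so the byte leaving the window is removed by a plain XOR.
        if (i >= kWindow) h ^= data[i - kWindow];
        // Store a hash only once per 2048-byte step: ~input_size / 2048 entries
        // total, which is what keeps memory usage and hash-table writes low.
        if (i + 1 >= kWindow && (i + 1 - kWindow) % kStep == 0) {
            hash_to_offset.emplace(h, i + 1 - kWindow);  // on collision, first offset wins
        }
    }
    return 0;
}
```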
-
Is there a way to completely disable the "window block" usage?

I ask because for certain workloads, although there are plenty of duplicates, they aren't properly aligned to a block size, thus `--blockhash-window-sizes` and `--window-increment-shift` are useless (unless one sets the window increment to something extremely small). My particular use-case is a large corpus of emails (~3.8 M, ~54 GiB) of which:
Alternatively, a better solution would be to compute these blocks at line boundaries, i.e. try to fill a block with successive lines, but stop before the last line that would overflow the block buffer; the same with the window increment: if enough bytes have been collected to fill a window increment and the next line would overflow, stop there.
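A minimal sketch of what I mean, assuming the simplest possible semantics (a chunk never splits a line, and a single line longer than the limit becomes its own oversized chunk); the terminator is a parameter, which ties into the configurability point below:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split `data` into chunks of at most `max_chunk` bytes, cutting only at
// line boundaries.
std::vector<std::string_view> chunk_at_line_boundaries(
        std::string_view data, std::size_t max_chunk, char terminator = '\n') {
    std::vector<std::string_view> chunks;
    std::size_t start = 0;
    while (start < data.size()) {
        std::size_t end = start;
        for (;;) {
            std::size_t nl = data.find(terminator, end);
            std::size_t line_end =
                (nl == std::string_view::npos) ? data.size() : nl + 1;
            // Stop before the line that would overflow, once we hold a line.
            if (line_end - start > max_chunk && end > start) break;
            end = line_end;
            if (end - start >= max_chunk || end == data.size()) break;
        }
        chunks.push_back(data.substr(start, end - start));
        start = end;
    }
    return chunks;
}
```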
The line terminator should be configurable:

- `\n` (perhaps treating `\n\r` and `\r\n` as meaning the same thing);
- `\0` (some files are zero-terminated, and perhaps one has a dataset full of zero-terminated files);
- a rolling checksum, as `gzip` or `zstd` do (see the `--rsyncable` option).

These options should work perfectly for text-based datasets, but also (especially the rolling checksum) for other file formats that for some reason prepend information at the beginning, thus shifting the file contents.
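For the rolling-checksum variant, here's a sketch of the general technique behind `--rsyncable` (not gzip's or zstd's exact algorithm): cut wherever a simple checksum over the last few dozen bytes hits a fixed bit pattern. The window length and mask here are assumptions, picked so boundaries land every ~4 KiB on average:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Return positions after which a chunk boundary is declared.
std::vector<std::size_t> content_defined_boundaries(std::string_view data) {
    constexpr std::size_t kWindow = 64;              // rolling-sum window (assumed)
    constexpr std::uint32_t kMask = (1u << 12) - 1;  // ~4 KiB average chunk size
    std::vector<std::size_t> boundaries;
    std::uint32_t sum = 0;  // additive rolling checksum over the last kWindow bytes
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += static_cast<unsigned char>(data[i]);
        if (i >= kWindow) sum -= static_cast<unsigned char>(data[i - kWindow]);
        // Boundary whenever the low 12 bits hit a fixed pattern.
        if ((sum & kMask) == kMask) boundaries.push_back(i + 1);
    }
    return boundaries;
}
```

Because the cut decision depends only on the nearby content, a header prepended to a file only disturbs the boundaries near the start; the chunks further in stay byte-identical and keep matching.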