
Dicts: check maxSymbolValue < 255 failed, returning ERROR(dictionary_corrupted) #3724

Closed
klauspost opened this issue Aug 16, 2023 · 16 comments · Fixed by #3731

@klauspost

Describe the bug

I am building a custom dictionary builder.

When testing it on my current output:
dictionary.bin.gz

To Reproduce

Gunzip the attachment above. Place the file at github_users_sample_set\user.0E6FBohGzg.json

Execute:

```
λ zstd -D dictionary.bin github_users_sample_set\user.0E6FBohGzg.json
zstd: error 11 : Allocation error : not enough memory
```

I have an older debug-compiled zstd, which reports:

```
compress\zstd_compress.c: ZSTD_compress_insertDictionary (dictSize=4272)
compress\zstd_compress.c:4337: ERROR!: check maxSymbolValue < 255 failed, returning ERROR(dictionary_corrupted):
compress\zstd_compress.c:4435: ERROR!: forwarding error in eSize: Dictionary is corrupted: ZSTD_loadCEntropy failed
compress\zstd_compress.c:4835: ERROR!: forwarding error in dictID: Dictionary is corrupted: ZSTD_compress_insertDictionary failed
compress\zstd_cwksp.h: cwksp: freeing workspace
compress\zstd_compress.c:1123: ERROR!: check !dl->cdict failed, returning ERROR(memory_allocation): ZSTD_createCDict_advanced failed
```

So it seems like there is a restriction on the Huffman tables - maxSymbolValue must be 255 in the provided table.

What is the reason for that? I would kind of expect it to be more efficient if not all symbols can be represented. Could you explain a bit why this (undocumented?) limitation exists?

Expected behavior

I would kinda expect it to work?

@Cyan4973 (Contributor) commented Aug 16, 2023

This is a very old implementation limitation.

When maxSymbolValue < 255, that means that not every symbol (byte value) is present in the statistics.
Consequently, if a missing symbol ends up being present in the input, and then in the literals section, it will not be possible to represent it. This can happen if data to compress is different from training set.
If encoding was proceeding blindly, this would lead to corruption errors.

Initially, this choice was made for speed considerations: we want to proceed blindly, i.e. trust the table provided by the dictionary, so we must be sure it will be able to represent any data thrown at it. This results in major speed benefits.

Nowadays, we could use some stronger analysis, deciding to employ dictionary statistics selectively depending on their fit with local statistics. Such analysis would be able to detect that a dictionary statistic is incomplete and unable to represent a symbol to compress, and replace them with locally generated statistics. This is something we already do for inter-block correlation when streaming at higher compression levels.
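A minimal sketch of what such a selective check could look like (function name and layout are hypothetical illustrations of mine, not zstd internals); it also shows why validating partial statistics costs an extra pass over the input:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decide whether a dictionary-provided Huffman table
 * can represent a block of literals. weights[s] == 0 means symbol s is
 * absent from the dictionary table. */
static int table_can_encode(const uint8_t weights[256],
                            const uint8_t *literals, size_t len)
{
    uint32_t histogram[256] = {0};
    size_t i;
    for (i = 0; i < len; i++)       /* one extra pass over the block */
        histogram[literals[i]]++;
    for (i = 0; i < 256; i++)       /* any present symbol with weight 0? */
        if (histogram[i] != 0 && weights[i] == 0)
            return 0;               /* must fall back to fresh statistics */
    return 1;                       /* safe to reuse the dictionary table */
}
```

With complete statistics (all 256 weights non-zero), this check is unnecessary and the encoder can proceed blindly, which is the speed benefit described above.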

However, for fastest levels, there is still a problem regarding time spent during analysis.
Hence, to remain compatible with speed objectives, we would be more likely to just always trust, or always distrust.

Generating new statistics embedded in the frame is of course detrimental to compression ratio and speed, on both the compression and decompression sides. And since we mostly use dictionary compression for high speed applications, this is a situation we want to avoid.
Hence, the reference dictionary builder always generates "complete" statistics.
(note: the same problem happens with statistics for other symbols, such as match length, literal length and offset).

Consequently, our reference encoder has never met this scenario, so we have so far not needed to deal with Zstandard dictionaries containing incomplete statistics (nor to mitigate the side effects).

You are right that incomplete statistics would likely compress better, since we would not waste some probability space for events that will never happen.
But the loss is also generally not that big, so this is just a (small) compression ratio optimization.
On the other hand, the processing requirement for the encoder to deal safely with incomplete statistics is unavoidable, and actually substantial for high speed applications.
The only use case where I could imagine it making sense is in combination with high compression ratio modes, which are already very slow, so such an additional processing stage would probably be negligible there.

Cyan4973 self-assigned this Aug 16, 2023
@klauspost (Author)

Ah, so you are forcing it to use the table on lower levels.

But it only checks if 255 can be represented. Not if all others can. Is this a "missing" check, or is it safe to use?

I generate the Huffman table from the statistics of the actual literal bytes, and discard low (and obviously zero) probability outputs - with the assumption that it can switch to another if the table cannot represent it.

My experience from toying with deflate told me that adding zero-probability codes "just in case they were needed" was less efficient overall than just having these cases generate a fresh table. So I applied that learning here.

Do the FSE tables have the same limitations? I generate these in a similar fashion, which will also lead to "short" tables and some symbols having 0 probability.

@Cyan4973 (Contributor) commented Aug 16, 2023

I don't remember this part of the code base,
so it could very well be that something is incomplete or not done well enough.

But for information, this entry point is fuzzed around the clock, so I don't expect an adversarial dictionary to be able to crash the encoder or make it produce corrupted frames. There are additional checks in the pipeline, and cumulatively they seem to cover the situation well enough to survive heavy fuzzer tests, so no catastrophic scenario is expected. Worst case, the dictionary will be refused.

Nonetheless, some improvements might be possible, if only for clarity. So that can still be worth another look.

As mentioned in the previous message, sure, zeroing some non-probable symbols is likely going to improve compression ratio, if only by a little bit. The problem is not there: the problem is the necessity to validate that the input is compatible with the partial statistics. This verification necessarily costs cpu time. Depending on your use case, you might consider this cost as irrelevant or well worth it, or you might consider it too high for the benefit. This is a trade-off.
In our use cases, we lean towards better speed. And full statistics is a way towards better compression speed.

Now, it doesn't mean that we can't support partial statistics. It's just that there was no such need so far, so the implementation has not needed to support it.

Now let's assume the implementation is updated to support partial statistics in dictionary. A secondary issue is that checking input compatibility with partial statistics is going to cost additional cpu time, but since correctness is not affected, this is a form of silent degradation: nothing crashes, it just silently reduces speed. Hence it can remain unnoticed for a long time. This seems a concern that should also be addressed.

@klauspost (Author)

> The problem is not there: the problem is the necessity to validate that the input is compatible with the partial statistics. This verification necessarily costs cpu time. Depending on your use case, you might consider this cost as irrelevant or well worth it, or you might consider it too high for the benefit. This is a trade-off.

Sure. I see how you can avoid histogramming the input in these cases.

> Now, it doesn't mean that we can't support partial statistics.

There may be something internal I'm missing, since I don't understand how it is fine to have a Huffman table where symbol 254 cannot be represented (weight=0), but symbol 255 must be - or at least the table must produce 256 weight entries when decoded.

> Hence it can remain unnoticed for a long time. This seems a concern that should also be addressed.

I guess it is just a matter of implementation expectations. But I do see how being able to always use the literal dictionary compressor can be an advantage.

In my implementation, a dictionary is used the same as any table from a previous block: 1) generate a histogram of the literals, 2) check if the previous table can be used, 3) check if a new table would be better, or even if uncompressed would be. I don't think this is too far from the logic in zstd.

Since I always (by default) check if a new table will be better, I am already generating the histogram, so the only extra cost is checking the histogram against the dictionary table.
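The comparison in step 3 might be sketched roughly like this (a hypothetical illustration of mine, not actual code from either implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: estimate the cost, in bits, of encoding a literal
 * histogram with a given table's code lengths. Returns SIZE_MAX if some
 * present symbol has no code (length 0), i.e. the table cannot be used. */
static size_t estimate_bits(const uint32_t histogram[256],
                            const uint8_t bitlen[256])
{
    size_t total = 0;
    int s;
    for (s = 0; s < 256; s++) {
        if (histogram[s] == 0)
            continue;
        if (bitlen[s] == 0)
            return SIZE_MAX;        /* table cannot represent symbol s */
        total += (size_t)histogram[s] * bitlen[s];
    }
    return total;
}
```

Reusing the dictionary table then wins when its estimate is below the new table's estimate plus the cost of writing the new table's header; the histogram needed for the check is already available from step 1.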

Sidenote - maybe returning a more "correct" error than `zstd: error 11 : Allocation error : not enough memory` would be nice.

I am not asking you to change anything (other than maybe the error) - but I am trying to understand what is going on.

@Cyan4973 (Contributor)

Agreed, the error message is not reflective of what's going on.
This must be improved.

@terrelln (Contributor)

TLDR: This check is overly restrictive, and we can fix it easily.

I think this line is a holdover from when we were relying on dictionaries being valid. At this point, we ensure that compression never corrupts data, even when a dictionary is corrupted, invalid, or missing symbols in its statistics (as long as it is the same between compression & decompression). And we have fuzzers to verify this.

During dictionary loading, we validate the Huffman / FSE tables and mark them as either valid, meaning they have non-zero probabilities for every possible symbol and can be reused for any source, or check, meaning that they have zero probability for some symbols and can't be reused without histogramming the source.

For Huffman, you can see that logic here:

```c
/* We only set the loaded table as valid if it contains all non-zero
 * weights. Otherwise, we set it to check */
if (!hasZeroWeights)
    bs->entropy.huf.repeatMode = HUF_repeat_valid;
RETURN_ERROR_IF(HUF_isError(hufHeaderSize), dictionary_corrupted, "");
RETURN_ERROR_IF(maxSymbolValue < 255, dictionary_corrupted, "");
```

We should be able to change the logic to `if (!hasZeroWeights && maxSymbolValue == 255) bs->entropy.huf.repeatMode = HUF_repeat_valid;`.

The 3 calls to ZSTD_dictNCountRepeat() in that function are where we validate the FSE tables:

```c
bs->entropy.fse.matchlength_repeatMode = ZSTD_dictNCountRepeat(matchlengthNCount, matchlengthMaxValue, MaxML);

bs->entropy.fse.litlength_repeatMode = ZSTD_dictNCountRepeat(litlengthNCount, litlengthMaxValue, MaxLL);

{   size_t const dictContentSize = (size_t)(dictEnd - dictPtr);
    U32 offcodeMax = MaxOff;
    if (dictContentSize <= ((U32)-1) - 128 KB) {
        U32 const maxOffset = (U32)dictContentSize + 128 KB;  /* The maximum offset that must be supported */
        offcodeMax = ZSTD_highbit32(maxOffset);  /* Calculate minimum offset code required to represent maxOffset */
    }
    /* All offset values <= dictContentSize + 128 KB must be representable for a valid table */
    bs->entropy.fse.offcode_repeatMode = ZSTD_dictNCountRepeat(offcodeNCount, offcodeMaxValue, MIN(offcodeMax, MaxOff));
}
```
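For concreteness, the offcodeMax bound above reduces to a highest-set-bit computation; here is a tiny standalone version (reimplemented for illustration, not zstd's actual ZSTD_highbit32):

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(v)) for v > 0, mirroring what ZSTD_highbit32 computes
 * when deriving the minimum offset code for maxOffset */
static unsigned highbit32(uint32_t v)
{
    unsigned n = 0;
    while (v >>= 1)
        n++;
    return n;
}
```

For the 4272-byte dictionary from this report, maxOffset = 4272 + 128*1024 = 135344, so offcodeMax comes out to 17: offset codes up to 17 must be representable by the dictionary's offset table.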

@terrelln (Contributor) commented Aug 21, 2023

For both FSE & Huffman tables:

  • For lower compression levels, or for very small data, if the table is marked as valid we will always use the dictionary's (or previous block's) tables.
  • For lower compression levels, if the dictionary is marked as check we won't ever reuse its tables.
  • For higher compression levels, if the dictionary is marked as check or valid, we will compute whether it is better to reuse the dictionary's tables, write new tables, or use the default tables (for FSE).
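The level-dependent behavior above could be condensed into a small decision function; this is a hypothetical sketch of mine (my names, not zstd's, and it glosses over the very-small-data case and the cost comparison at higher levels):

```c
#include <assert.h>

/* Hypothetical sketch of how the valid/check marking translates
 * into a per-block table-reuse decision. */
typedef enum { REPEAT_CHECK, REPEAT_VALID } repeat_mode_t;

static int should_reuse_table(repeat_mode_t mode,
                              int high_compression_level,
                              int table_fits_histogram)
{
    if (!high_compression_level)
        return mode == REPEAT_VALID;  /* trust valid blindly, never reuse check */
    /* higher levels: reuse is a candidate whenever the table can encode the
     * block; the real encoder additionally compares estimated costs */
    return table_fits_histogram;
}
```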

If you want your dictionary to play nice with lower compression levels and the reference encoder, you'd have to either:

  • Ensure you have non-zero probabilities for all symbols
  • Add a parameter to the compressor to allow overriding the default behavior, so lower levels can check whether it is better to re-use a dictionary or not. I can give you code pointers.

@terrelln (Contributor)

Also, @klauspost, please let us know about the results of your custom dictionary builder!

The baseline dictionary builder we generally use is ZDICT_optimizeTrainFromBuffer_cover() with these parameters:

```
k=0
d=0
steps=256
compressionLevel=/* compression level you expect to use the dictionary with */
```

If your dictionary builder is competitive, or able to beat that, we'd love to try it out!

@klauspost (Author)

@terrelln Thanks - I will read through that. Much appreciated.

My current work takes a different approach than yours. The backreference dictionary is built with a rather naïve chain of hashes that counts occurrences of 4-8 byte sequences, and then, from the top list, tries to string them together into a set of longer strings until a hash exceeds a threshold - obviously emitting the most popular at the end.
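As a rough illustration of the counting stage described above (a toy sketch of the idea, not klauspost's actual builder): hash every 4-byte window into a fixed-size table of counters, accepting approximate counts due to collisions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 16
#define TABLE_SIZE (1u << TABLE_BITS)

/* Approximate occurrence counts of 4-byte sequences; collisions are
 * ignored, so a counter may aggregate several distinct sequences. */
static uint32_t counts[TABLE_SIZE];

static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);                           /* read 4-byte window */
    return (v * 2654435761u) >> (32 - TABLE_BITS);  /* multiplicative hash */
}

static void count_grams(const uint8_t *data, size_t len)
{
    size_t i;
    for (i = 0; i + 4 <= len; i++)              /* every overlapping window */
        counts[hash4(data + i)]++;
}
```

A real builder would then extract the top counters and try to extend and merge the corresponding sequences into longer dictionary strings, as described above.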

The dictionaries beat zstd on "diverse input". For example, for a collection of Go source code or a collection of HTML files it generates a better dictionary than zstd (even when compressing with zstd). For sparse, very similar content, like "github users", it creates a worse dictionary, since it splits the content up too much. It currently does not build "repeat-friendly" dicts, where several patterns are separated by a few "literal" chars.

The 3 repeat codes are pretty bad ATM. Though it seems like zstd doesn't generate unique repeat codes, and it would maybe save 3-4 bytes at the most. I will probably add some better scanning there.

FSE tables are generated by doing actual compression and normalizing the output. Huffman tables are generated using the literal output of the compression. I ran into a couple of cases where this didn't generate compressible tables, so I need a better secondary strategy.

It is still fairly experimental and can fail on certain input - for example if an FSE table is a single value (RLE).

@klauspost (Author)

@terrelln I just used regular `--fast-cover` to compare. But when compared against `zstd --train-fastcover="k=0,d=0,steps=256"` it is at least competitive:

HTML:

Default:

```
zstd: 1698 files compressed : 19.97% (  95.4 MiB =>   19.0 MiB)
kp:   1698 files compressed : 19.82% (  95.4 MiB =>   18.9 MiB)
```

Level 19:

```
zstd: 1698 files compressed : 16.73% (  95.4 MiB =>   16.0 MiB)
kp:   1698 files compressed : 16.80% (  95.4 MiB =>   16.0 MiB)
```

GO SOURCE:

Default:

```
zstd: 8912 files compressed : 21.86% (  67.0 MiB =>   14.6 MiB)
kp:   8912 files compressed : 22.17% (  67.0 MiB =>   14.8 MiB)
```


Level 19:

```
zstd: 8912 files compressed : 17.95% (  67.0 MiB =>   12.0 MiB)
kp:   8912 files compressed : 18.06% (  67.0 MiB =>   12.1 MiB)
```

Nothing groundbreaking, but I'll take my one win and keep tweaking :D

@terrelln (Contributor)

> The 3 repeat codes are pretty bad ATM. Though it seems like zstd doesn't generate unique repeat codes and it will maybe save 3-4 bytes the most. I will probably add some better scanning there.

Yeah, we don't do anything with this currently. Maybe there is something to get here, but we haven't really found a use case for them.

However, we have found a really interesting pattern in our dictionaries w.r.t. repcodes. If I have the string:

`<common substr 1><random 4 bytes><common substr 2>`

And two dictionary snippets:

Dictionary A: `<common substr 1><common substr 2>`
Dictionary B: `<common substr 1>0000<common substr 2>`

Dictionary B will outperform Dictionary A, sometimes by a large margin, because the match finder will be able to use a repeat offset for `<common substr 2>`.

This is one of the advantages of the cover algorithm: It keeps segments together, even if there is an infrequent sub-segment within it.

@terrelln (Contributor)

Also CC @ot (the original author of the cover algorithm), you may find this discussion interesting

@klauspost (Author) commented Aug 23, 2023

@terrelln Yes. That is pretty much what I refer to with "not repeat friendly". Especially considering how relatively often the lower levels check for repeats, that should also be a factor.

Also, zstd's ability to "splat" versions of similar data is missing. I will never emit a hash that has already been emitted as a substring of a higher-priority previous item. This is fine for "diverse" input, but not optimal for "sparse" input.

So "github-users" only generates a small dictionary, whereas zstd can emit several "versions" to choose from.

@ot (Contributor) commented Aug 23, 2023

@terrelln Yes, the algorithm was originally implemented for LZ4 use cases, so the relative position of substrings in the dictionary was irrelevant, the only optimization was to put the more frequent substrings at the end so they would be available for longer to the sliding window.

@terrelln (Contributor)

> Yes, the algorithm was originally implemented for LZ4 use cases, so the relative position of substrings in the dictionary was irrelevant

Yeah, I think we mostly got lucky, since it wasn't a design consideration going in. When the cover dictionary builder selected parameters, it often ended up selecting larger segment sizes than we expected, like a few hundred bytes.

We only fully realized that the larger segment size was important for utilizing repeat offsets in the dictionary content later, when we were removing that property and saw a regression.

I found the ZSTD_generateSequences() API very useful when debugging dictionary performance. It allowed me to build a heatmap of the dictionary by # of uses, and also inspect things like which positions are referenced using repeat offsets.

@terrelln (Contributor)

If you wanted to be more repcode aware, you could potentially concatenate all the sources, or random pairs of sources or something, then use ZSTD_generateSequences() to check for matches of the pattern `<match><literals><repeat 0>`. You could zero the literals, and treat that whole segment as one unit somehow.
