BuildZStdDict Testing #950
TBH I don't see it happening. I don't have the bandwidth for it at the moment. However, if you can make a simple reproducer for any crash, I will fix it.
Also sorry for being a bit distracted in my last replies... Let me provide some context:
What it mainly does is create a common dictionary for the input samples you provide, deduplicating and truncating to opts.MaxDictSize. Once that is done it calls […]. It will attempt to compress all of […]. There are for sure untested corner cases in this process, and it is mostly a "released PoC": it works some/most of the time, and it is "ok" in terms of (compression) efficiency.
That's good context. I'll try and bypass the higher-level API and use zstd.BuildDict next. In the meantime, here is a test which is failing. There is a divide-by-zero error, which I got around with a hack (see commit); I doubt that is right. But then once the dictionary is created, the […]. I'm out of my element when it comes to compression, but will probably be pecking at this for the next few days. We'd really like to avoid using CGO in our product.
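The divide-by-zero mentioned above is a common failure mode when a scoring step normalizes by a count that can be zero for very small sample sets. The sketch below is purely illustrative of the guard pattern (the function name and scoring scheme are hypothetical, not the dict package's actual code):

```go
package main

import "fmt"

// scoreCandidate illustrates the failure mode: normalizing a candidate's
// hit count by a total divides by zero when the sample set is too small
// for anything to be counted. Guarding the denominator is the usual fix.
func scoreCandidate(hits, total int) int {
	if total == 0 {
		return 0 // nothing counted yet; avoid integer division by zero
	}
	return hits * 100 / total // score as a percentage
}

func main() {
	fmt.Println(scoreCandidate(3, 10)) // prints 30
	fmt.Println(scoreCandidate(0, 0))  // prints 0 instead of panicking
}
```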
Fix in #951. Made a few tweaks to your test.

A) You are writing to the input buffer when decompressing.

```go
package dict

import (
	"bytes"
	"io"
	"math/rand"
	"os"
	"testing"

	"github.com/klauspost/compress/zstd"
)

func TestZStdDict(t *testing.T) {
	for _, level := range []zstd.EncoderLevel{zstd.SpeedFastest, zstd.SpeedDefault, zstd.SpeedBetterCompression, zstd.SpeedBestCompression} {
		testZStdDict(t, level)
	}
}

func testZStdDict(t *testing.T, level zstd.EncoderLevel) {
	out := io.Discard
	if testing.Verbose() {
		out = os.Stdout
	}
	opts := Options{
		MaxDictSize:    2048,
		HashBytes:      4,
		Output:         out,
		ZstdDictID:     0,
		ZstdDictCompat: false,
		ZstdLevel:      level,
	}
	inBuf := make([]byte, 0, 4096)
	outBuf := make([]byte, 0, 4096)
	// This is 32K worth of data, but it's all very similar. Only fits in 4K if compressed with a dictionary.
	samples := generateSimilarByteSlices(42, 32)
	dict, err := BuildZstdDict(samples, opts)
	if err != nil {
		t.Fatal(err.Error())
	}
	totalSize := 0
	for _, blob := range samples {
		compressed, err := zCompressDict(inBuf, dict, blob)
		if err != nil {
			t.Fatal(err.Error())
		}
		totalSize += len(compressed)
		// Check round trip.
		decompressed, err := zDecompressDict(outBuf, dict, compressed)
		if err != nil {
			t.Fatal(err.Error())
		}
		if !bytes.Equal(decompressed, blob) {
			t.Fatal("Round trip failed")
		}
	}
	if totalSize > 4096 {
		t.Fatal("Total compressed size exceeds 4096 bytes")
	}
	t.Log("Total compressed size:", totalSize)
}

func zCompressDict(dst, dict, data []byte) ([]byte, error) {
	encoder, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		return nil, err
	}
	defer encoder.Close()
	result := encoder.EncodeAll(data, dst[:0])
	return result, nil
}

func zDecompressDict(dst, dict, data []byte) ([]byte, error) {
	decoder, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		return nil, err
	}
	defer decoder.Close()
	result, err := decoder.DecodeAll(data, dst[:0])
	if err != nil {
		return nil, err
	}
	return result, nil
}

// Creates a slice of byte slices, each of which has the same random seed, so they are
// very similar. The length of each slice is 1024 + the index of the slice.
func generateSimilarByteSlices(seed int64, count int) [][]byte {
	chks := make([][]byte, count)
	for i := 0; i < count; i++ {
		chks[i] = generateRandomByteSlice(seed, 1024+i)
		if false {
			// Generate a small diff.
			chks[i][i] = byte(seed)
		}
	}
	return chks
}

func generateRandomByteSlice(seed int64, length int) []byte {
	r := rand.NewSource(seed)
	data := make([]byte, length)
	for i := range data {
		data[i] = byte(r.Int63())
	}
	return data
}
```

With your permission I'd like to include that as a test.
I pulled your changes, and they do result in our unit tests passing. Great! When exercising against real data, I ran into another edge case. I pushed a new test to my fork branch: https://github.com/macneale4/compress/tree/zstd-dict-test The tip commit has a speculative change (allowing v == threshold) which allows my testing to continue. I have no idea if that's correct, though. You are welcome to take any of this and stick it into your tests. The data is coming from one of our public data repos.
The threshold is mostly there to filter out candidates in large sets, though everything being discarded could indicate that there is little of actual value. I'll merge your change back into the PR. For production use you'd probably want to set up a fuzz test to check that border cases are handled.
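The boundary change discussed above (allowing v == threshold) can be illustrated with a toy filter; the function and its shape are invented for illustration and are not the dict package's real candidate filter. The point is that with a strict comparison, a small input where every candidate's score lands exactly on the threshold discards everything:

```go
package main

import "fmt"

// filterCandidates keeps candidates whose score passes the threshold.
// Illustrative only. With inclusive=false (strict "v > threshold"), an
// input where all scores equal the threshold keeps nothing; with
// inclusive=true ("v >= threshold"), the boundary cases survive.
func filterCandidates(scores map[string]int, threshold int, inclusive bool) []string {
	var kept []string
	for name, v := range scores {
		if v > threshold || (inclusive && v == threshold) {
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	// Every candidate scores exactly at the threshold.
	scores := map[string]int{"abcd": 5, "bcde": 5, "cdef": 5}
	fmt.Println(len(filterCandidates(scores, 5, false))) // prints 0: all discarded
	fmt.Println(len(filterCandidates(scores, 5, true)))  // prints 3: boundary kept
}
```

This is the kind of border case the fuzz test suggested above would catch automatically.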
Hello!
I'm hoping to use the compress package for packing bytes in our archive format for Dolt (https://github.com/dolthub/dolt). Our data is very amenable to custom dictionaries for small sets (5-50) of binary blobs which average about 4K in size.
We built our POC using the gozstd package which has Dictionary support: https://pkg.go.dev/github.com/valyala/gozstd
Now that I'm convinced this is a good approach for us, I'm curious how we can move the dict.BuildZstdDict function out of the experimental phase, as it's described in dict/README.md. In particular, the README says it doesn't work well with small sets, and indeed I got a panic when using a set of 32 samples.

How far from complete do you think it is? How can I help get this tested and get the bugs ironed out? I have one local test case I can share with you if that's helpful.
FWIW, using the dictionary produced by valyala/gozstd, I can compress/decompress successfully with the compress/zstd package. So I think it's just a matter of generating a correct dict and we'll be able to move forward.
Thanks!