
WIP: S2++ #846

Closed · wants to merge 13 commits into from

Conversation

klauspost commented Aug 7, 2023

Aim

Improve the encoding method of S2 while keeping it read-backwards compatible, with the following properties:

  • Output from previous versions can be decompressed.
  • Output from the new version requires a new version to decompress.
  • Blocks from the new version will always produce an error when decoded with an incompatible version.
| Version | Snappy Decoder | S2 Decoder | S2++ Decoder |
|---|---|---|---|
| Snappy Encoder | ✔️ | ✔️ | ✔️ |
| S2 Encoder | | ✔️ | ✔️ |
| S2++ Encoder | | | ✔️ |

Only changes that provide significant improvements with no decompression speed penalty will be considered.
No reduction in seek functionality is accepted.

Method

Fixes the biggest mistake in Snappy (a feature extremely rarely used in Snappy), and also implements more efficient repeat codes.

If the first bytes of a block are 0x80, 0x00, 0x00 (copy, 2-byte offset = 0),
this indicates that all Copy with 2-byte offset (10)
and Copy with 4-byte offset (11) tags change for the remainder of the block.

There can be no literals before this tag and no repeats before a match as specified above.
This will only trigger on this exact tag.
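
A minimal sketch of how a decoder might detect this mode switch, assuming only the marker bytes described above (`hasS2ppMarker` is a hypothetical helper, not code from this PR):

```go
// hasS2ppMarker reports whether a block starts with the repurposed
// "copy, 2-byte offset, offset = 0" tag. Offset 0 is invalid in
// Snappy/S2, so this exact sequence is free to act as a mode switch
// and will produce an error in decoders that don't understand it.
func hasS2ppMarker(block []byte) bool {
	return len(block) >= 3 && block[0] == 0x80 && block[1] == 0x00 && block[2] == 0x00
}
```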

Discussion

Blocks below 64K do not need this, and for them it is just 3 wasted bytes.
65536 could be added to the base offset value, but having a plain 16MB maximum backreference seems neater.

Using a 3-byte indicator, since a block can start with an initial repeat. Having this as the first tag of a block will always be invalid in current decoders.

It seems the encoder can unconditionally enable this when the block is >64K; see the sketch below. The sizes below are with it enabled for all blocks, using 4MB blocks. Pretty much always better unless just storing.

Consider whether the old repeat codes should be disabled in this mode (probably). Yes.
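
A sketch of the encoder-side gating discussed above; `emitModeMarker` and the threshold placement are illustrative, not this PR's actual encoder:

```go
// emitModeMarker prepends the extended-offset marker when the block
// is large enough to benefit; below 64K the marker would only waste
// 3 bytes, since 2-byte offsets already cover the whole block.
func emitModeMarker(dst, src []byte) []byte {
	if len(src) > 64<<10 {
		dst = append(dst, 0x80, 0x00, 0x00) // copy2 tag with offset 0
	}
	return dst
}
```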

Sizes

Percentages are calculated as the reduction in output size, and as that reduction expressed as a percentage of the input size.
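
For example, for gob-stream level 1 below: (347633082 - 297164561) / 347633082 ≈ 14.52% smaller output, and (347633082 - 297164561) / 1911399616 ≈ 2.64% of the input size.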

| File | Level | Input | Output (before) | Output (after) | Output Δ | Input Δ |
|---|---|---|---|---|---|---|
| gob-stream | 1 | 1911399616 | 347633082 | 297164561 | -14.52% | -2.64% |
| gob-stream | 2 | 1911399616 | 303776251 | 269233350 | -11.37% | -1.81% |
| gob-stream | 3 | 1911399616 | 258013815 | 224782856 | -12.88% | -1.74% |
| silesia.tar | 1 | 211947520 | 96899588 | 91660668 | -5.41% | -2.47% |
| silesia.tar | 2 | 211947520 | 87166102 | 82145738 | -5.76% | -2.37% |
| silesia.tar | 3 | 211947520 | 79612333 | 74518937 | -6.40% | -2.40% |
| enwik9 | 1 | 1000000000 | 487526653 | 460036037 | -5.64% | -2.75% |
| enwik9 | 2 | 1000000000 | 416581621 | 392514719 | -5.78% | -2.41% |
| enwik9 | 3 | 1000000000 | 370860824 | 341953796 | -7.79% | -2.89% |
| github-june-2days-2019.json | 1 | 6273951764 | 1041705230 | 940405663 | -9.72% | -1.61% |
| github-june-2days-2019.json | 2 | 6273951764 | 944873043 | 881830595 | -6.67% | -1.00% |
| github-june-2days-2019.json | 3 | 6273951764 | 826384742 | 764962673 | -7.43% | -0.98% |
| github-ranks-backup.bin | 1 | 1862623243 | 623833007 | 598949133 | -3.99% | -1.34% |
| github-ranks-backup.bin | 2 | 1862623243 | 568441528 | 536791344 | -5.57% | -1.70% |
| github-ranks-backup.bin | 3 | 1862623243 | 553965705 | 508220735 | -8.26% | -2.46% |
| nyc-taxi-data-10M.csv | 1 | 3325605752 | 1093518508 | 937134605 | -14.30% | -4.70% |
| nyc-taxi-data-10M.csv | 2 | 3325605752 | 884711223 | 776582738 | -12.22% | -3.25% |
| nyc-taxi-data-10M.csv | 3 | 3325605752 | 773678211 | 663806572 | -14.20% | -3.30% |
| apache.log | 1 | 2622574440 | 230523580 | 188006334 | -18.44% | -1.62% |
| apache.log | 2 | 2622574440 | 217884490 | 173645540 | -20.30% | -1.69% |
| apache.log | 3 | 2622574440 | 185357903 | 146077255 | -21.19% | -1.50% |
| consensus.db.10gb | 1 | 10737418240 | 4549768015 | 4332822720 | -4.77% | -2.02% |
| consensus.db.10gb | 2 | 10737418240 | 4416692817 | 4299355082 | -2.66% | -1.09% |
| consensus.db.10gb | 3 | 10737418240 | 4210593068 | 4095105829 | -2.74% | -1.08% |
| rawstudio-mint14.tar | 1 | 8558382592 | 4413947468 | 4241234066 | -3.91% | -2.02% |
| rawstudio-mint14.tar | 2 | 8558382592 | 4101956347 | 3962581837 | -3.40% | -1.63% |
| rawstudio-mint14.tar | 3 | 8558382592 | 3905189070 | 3781979945 | -3.16% | -1.44% |
| 10gb.tar | 1 | 10065157632 | 5915543454 | 5733844651 | -3.07% | -1.81% |
| 10gb.tar | 2 | 10065157632 | 5486469704 | 5271029444 | -3.93% | -2.14% |
| 10gb.tar | 3 | 10065157632 | 5192490218 | 4979564326 | -4.10% | -2.12% |
| sofia-air-quality-dataset.tar | 1 | 15464463872 | 4991766468 | 4665176521 | -6.54% | -2.11% |
| sofia-air-quality-dataset.tar | 2 | 15464463872 | 4432998200 | 4032382208 | -9.04% | -2.59% |
| sofia-air-quality-dataset.tar | 3 | 15464463872 | 4017422246 | 3657874273 | -8.95% | -2.32% |

If the first bytes of a block are `0x40, 0x00` (repeat, length 4), this indicates that all [Copy with 4-byte offset (11)](https://github.com/google/snappy/blob/main/format_description.txt#L106) tags are 3 bytes instead for the remainder of the block.

There can be no literals before this tag and no repeats before a match as specified above.
This will only trigger on this exact tag.

> These are like the copies with 2-byte offsets (see previous subsection),
> except that the offset is stored as a 24-bit integer instead of a
> 16-bit integer (and thus will occupy three bytes).

When in this mode, the maximum backreference offset is 16777215.

This *cannot* be combined with dictionaries.
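
A minimal decode sketch for this mode, assuming the layout quoted above (tag byte as for 2-byte-offset copies, followed by a 24-bit little-endian offset); `decodeCopy3` is a hypothetical name, not this PR's decoder:

```go
// decodeCopy3 reads a copy tag in extended mode: a 6-bit length in
// the tag's high bits (storing length-1, as for 2-byte-offset copies)
// followed by a 3-byte little-endian offset, giving a maximum
// backreference offset of 16777215.
func decodeCopy3(src []byte) (length, offset int) {
	length = int(src[0]>>2) + 1
	offset = int(src[1]) | int(src[2])<<8 | int(src[3])<<16
	return length, offset
}
```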
klauspost commented:

Attempted offset delta encoding -16 to 16, length 1-16. Extremely small hit rate. Not worth the complexity.


klauspost commented Sep 18, 2023

Experiment with using 1 bit from the long-offset copy tag to indicate repeats.

Limits long offsets to length 32, down from 64, forcing a repeat.

Repeat lengths are encoded as:

```
// 0-28: Length 1 -> 29
// 29: Length (Read 1) + 1
// 30: Length (Read 2) + 1
// 31: Length (Read 3) + 1
```

Copy lengths are encoded as:

```
// 0-28: Length 4 -> 32
// 29: Length (Read 1) + 4
// 30: Length (Read 2) + 4
// 31: Length (Read 3) + 4
```
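
A sketch of decoding these 5-bit length codes, assuming only the tables above; `decodeExtLength` is a hypothetical helper (base is 1 for repeats, 4 for copies), not this PR's code:

```go
// decodeExtLength maps a 5-bit length code to a length, reading up to
// 3 extra little-endian bytes from src for codes 29-31. base is 1 for
// repeats and 4 for copies, matching the tables above.
func decodeExtLength(code uint8, src []byte, base int) (length, extraBytes int) {
	switch {
	case code < 29:
		return int(code) + base, 0 // direct: lengths base..base+28
	case code == 29:
		return int(src[0]) + base, 1
	case code == 30:
		return (int(src[0]) | int(src[1])<<8) + base, 2
	default: // 31
		return (int(src[0]) | int(src[1])<<8 | int(src[2])<<16) + base, 3
	}
}
```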
| Input | Level | Improvement |
|---|---|---|
| gob-stream | 1 | 1.94% |
| gob-stream | 2 | -0.28% |
| gob-stream | 3 | 0.87% |
| silesia.tar | 1 | 1.10% |
| silesia.tar | 2 | 0.54% |
| silesia.tar | 3 | 0.89% |
| enwik9 | 1 | 0.24% |
| enwik9 | 2 | 0.02% |
| enwik9 | 3 | 0.06% |
| github-june-2days-2019.json | 1 | 0.63% |
| github-june-2days-2019.json | 2 | 0.19% |
| github-june-2days-2019.json | 3 | 0.33% |
| github-ranks-backup.bin | 1 | 0.35% |
| github-ranks-backup.bin | 2 | -0.45% |
| github-ranks-backup.bin | 3 | 0.02% |
| nyc-taxi-data-10M.csv | 1 | 1.27% |
| nyc-taxi-data-10M.csv | 2 | 0.53% |
| nyc-taxi-data-10M.csv | 3 | 0.72% |
| apache.log | 1 | 1.11% |
| apache.log | 2 | 1.21% |
| apache.log | 3 | 1.39% |
| consensus.db.10gb | 1 | 0.85% |
| consensus.db.10gb | 2 | -0.10% |
| consensus.db.10gb | 3 | 0.01% |
| rawstudio-mint14.tar | 1 | 1.23% |
| rawstudio-mint14.tar | 2 | 0.57% |
| rawstudio-mint14.tar | 3 | 0.99% |
| 10gb.tar | 1 | 0.35% |
| 10gb.tar | 2 | 0.20% |
| 10gb.tar | 3 | 0.32% |
| sofia-air-quality-dataset.tar | 1 | -0.31% |
| sofia-air-quality-dataset.tar | 2 | -1.79% |
| sofia-air-quality-dataset.tar | 3 | -3.04% |

So the gains mainly depend on how many repeats there are compared to long offsets (with long lengths). Only sofia has a notable regression. Only inputs with many long offsets of length 32->64 should expect a regression.

This allows removing the change from literals=63, which makes the overall change cleaner.

OP updated.


klauspost commented Nov 17, 2023

Added variable-length encoding to TagCopy2 as well. Good improvement, and it simplifies encoding decisions.


klauspost commented Nov 24, 2023

Using 1 more bit for length in TagCopy4 gives a very reasonable improvement.

Hard to make simple, though.


klauspost commented Dec 6, 2023

Experimenting with using Copy1 length 11 as an indicator for extra length (length-64):

Negative numbers mean smaller output (in percent):

(combined table below)

Undecided...


klauspost commented Dec 6, 2023

Using 10 bits (max 1024) for offset:

Negative numbers mean smaller output (in percent):

(table below)

Again, inconclusive...


klauspost commented Dec 6, 2023

Using 10-bit offsets (max 1024) + 4 bits for length, and at the last length code, read 1 additional byte (base 16):

| File | Level | Extra-len64 | 1024 offset | Both |
|---|---|---|---|---|
| gob-stream | 1 | -0.06% | 0.04% | -0.02% |
| gob-stream | 2 | 0.01% | 0.02% | -0.09% |
| gob-stream | 3 | -0.04% | 0.09% | 0.05% |
| silesia.tar | 1 | 0.15% | 0.03% | 0.01% |
| silesia.tar | 2 | 0.19% | 0.09% | 0.05% |
| silesia.tar | 3 | 0.13% | 0.11% | 0.03% |
| enwik9 | 1 | 0.25% | 0.20% | 0.18% |
| enwik9 | 2 | 0.30% | 0.39% | 0.37% |
| enwik9 | 3 | 0.29% | 0.23% | 0.16% |
| github-june-2days-2019.json | 1 | 0.09% | -1.03% | -1.19% |
| github-june-2days-2019.json | 2 | 0.07% | -1.42% | -1.67% |
| github-june-2days-2019.json | 3 | -0.01% | -1.05% | -1.28% |
| github-ranks-backup.bin | 1 | 1.27% | -1.36% | -1.37% |
| github-ranks-backup.bin | 2 | 0.88% | -1.34% | -1.36% |
| github-ranks-backup.bin | 3 | 0.61% | -1.30% | -1.33% |
| nyc-taxi-data-10M.csv | 1 | -0.26% | 0.45% | 0.14% |
| nyc-taxi-data-10M.csv | 2 | -0.27% | 0.45% | -0.02% |
| nyc-taxi-data-10M.csv | 3 | -0.56% | 0.04% | -0.51% |
| apache.log | 1 | -4.21% | -0.38% | -3.90% |
| apache.log | 2 | -5.29% | -0.79% | -5.61% |
| apache.log | 3 | -5.36% | -0.32% | -4.59% |
| consensus.db.10gb | 1 | -0.38% | -0.02% | -0.44% |
| consensus.db.10gb | 2 | -0.40% | -0.06% | -0.51% |
| consensus.db.10gb | 3 | -0.36% | -0.21% | -0.58% |
| rawstudio-mint14.tar | 1 | 0.08% | -0.12% | -0.16% |
| rawstudio-mint14.tar | 2 | 0.13% | -0.13% | -0.18% |
| rawstudio-mint14.tar | 3 | 0.07% | -0.03% | -0.10% |
| 10gb.tar | 1 | 0.06% | 0.50% | 0.48% |
| 10gb.tar | 2 | 0.07% | 0.33% | 0.30% |
| 10gb.tar | 3 | 0.06% | 0.22% | 0.18% |
| sofia-air-quality-dataset.tar | 1 | 0.05% | 0.39% | 0.39% |
| sofia-air-quality-dataset.tar | 2 | 0.08% | 0.44% | 0.44% |
| sofia-air-quality-dataset.tar | 3 | 0.13% | 0.67% | 0.64% |


klauspost commented Dec 6, 2023

Offset before length extension? Probably faster for decoding, but maybe more tedious for encoding?

klauspost commented:

Minimum offset 1 will eliminate a lot of zero checks.
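
A hypothetical illustration of the idea; the helper names are mine, not from this PR:

```go
// Storing offset-1 on the wire makes an encoded value of 0 mean
// offset 1, so every wire value is valid and the decoder can drop
// its explicit offset == 0 error checks.
func encodeOffset(offset uint32) uint32 { return offset - 1 }
func decodeOffset(stored uint32) uint32 { return stored + 1 }
```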

klauspost commented:

Using one more bit for extra Copy4 length (+28) is just too good to leave out.

klauspost commented:

Length after offset.

klauspost commented:

Allow 0-3 literals in a copy, depending on uncompressed position. Table updated.

klauspost commented:

(project will be published under minio)

klauspost closed this Jul 26, 2024