
Handle interrupted pack #12

Draft
wants to merge 13 commits into main

Conversation

tasleson
Contributor

Initial take at providing a way to put an archive back into a good state if a pack operation gets interrupted.

From git commit 280131e

The most important objective is to prevent the data slab and hashes slab from
getting corrupted and losing archived data.  Incomplete writes to the slabs
during a pack should be the only way for the slabs to get into an inconsistent
state.  To allow us to detect and correct this, we introduce a checkpoint file
at the root of the archive which is written and sync'd to stable storage
before we start the pack operation.  This way, if the pack operation is
interrupted, a repair option can put the slab files back to where they were
before we started.  Moving forward, the idea is to add the ability to
periodically update the checkpoint for long-running operations by quiescing IO
to the data slab, hashes slab, offsets files, and the stream output, and
recording the offset into the input data.  Then we can resume the operation by
checking the files, truncating where needed, and resuming the de-dupe
operation.
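The checkpoint-before-pack idea can be sketched roughly as below. This is a minimal illustration under assumptions, not the PR's actual code: the `Checkpoint` struct, the `checkpoint.bin` file name, and recording exactly two file lengths are all hypothetical.

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::path::Path;

// Hypothetical checkpoint contents: the lengths we would truncate the
// slab files back to if the pack is interrupted.
struct Checkpoint {
    data_slab_len: u64,
    hashes_slab_len: u64,
}

// Write the checkpoint at the archive root and sync it to stable storage
// before the pack operation starts.
fn write_checkpoint(root: &Path, cp: &Checkpoint) -> std::io::Result<()> {
    let path = root.join("checkpoint.bin"); // hypothetical file name
    let mut f = File::create(&path)?;
    f.write_all(&cp.data_slab_len.to_le_bytes())?;
    f.write_all(&cp.hashes_slab_len.to_le_bytes())?;
    f.sync_all()?; // flush file data and metadata to stable storage
    // Sync the containing directory too, so the new entry itself is durable.
    File::open(root)?.sync_all()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir();
    let cp = Checkpoint { data_slab_len: 4096, hashes_slab_len: 512 };
    write_checkpoint(&root, &cp)?;

    let mut buf = Vec::new();
    File::open(root.join("checkpoint.bin"))?.read_to_end(&mut buf)?;
    assert_eq!(buf.len(), 16);
    assert_eq!(u64::from_le_bytes(buf[0..8].try_into().unwrap()), 4096);
    Ok(())
}
```

Syncing the parent directory as well as the file matters here: on a crash, an fsync'd file whose directory entry was never made durable can still disappear.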

Note: If the slab file and the hashes file have no corruption and the number
of slabs matches between the data and hash slabs, the slab files are not
touched!  Thus the archive size could be much larger than a listing of the
archive would indicate, because the data for the interrupted pack operation is
retained but the stream is not.

I guess I could add a statement that the archive could get corrupted from bitrot, but that will be addressed in a future change where we introduce erasure coding support or similar.

Removes some duplicated code.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Currently just checks the data slab & offsets file and the hashes slab &
offsets file.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
At the moment we only fix the offsets file for the slab.
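A truncation-based fix for the offsets file might look like this sketch. The fixed 8-byte entry size and the `repair_offsets` helper are assumptions for illustration only.

```rust
use std::fs::OpenOptions;
use std::path::Path;

// Assumed layout: the offsets file holds one fixed-size entry per slab.
const ENTRY_SIZE: u64 = 8; // assumption for illustration

// If the offsets file holds more bytes than there are complete slabs
// (e.g. a partial entry from an interrupted write), truncate it back so
// the two agree again.
fn repair_offsets(offsets_path: &Path, complete_slabs: u64) -> std::io::Result<()> {
    let f = OpenOptions::new().write(true).open(offsets_path)?;
    let want = complete_slabs * ENTRY_SIZE;
    if f.metadata()?.len() > want {
        f.set_len(want)?; // drop partial/extra entries
        f.sync_all()?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("offsets_repair_demo.bin");
    std::fs::write(&path, vec![0u8; 24])?; // 3 entries, but only 2 slabs exist
    repair_offsets(&path, 2)?;
    assert_eq!(std::fs::metadata(&path)?.len(), 16);
    std::fs::remove_file(&path)?;
    Ok(())
}
```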

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When we read off the end of the stream we get the cryptic error
"failed to fill whole buffer".  Before issuing a read, make sure
we have enough data to fill the request.
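That "failed to fill whole buffer" message comes from `read_exact()` hitting end-of-stream. A guard along these lines could replace it with a clearer error; the `read_checked` helper is a hypothetical sketch, not the commit's actual code.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

// Verify the source still holds `len` bytes past the current position
// before calling read_exact(), and fail with a descriptive error when
// it does not.
fn read_checked<R: Read + Seek>(r: &mut R, len: usize) -> io::Result<Vec<u8>> {
    let pos = r.stream_position()?;
    let end = r.seek(SeekFrom::End(0))?;
    r.seek(SeekFrom::Start(pos))?; // restore position after probing the end
    let remaining = end.saturating_sub(pos);
    if remaining < len as u64 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("requested {len} bytes but only {remaining} remain in stream"),
        ));
    }
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() {
    let mut c = Cursor::new(vec![1u8, 2, 3, 4, 5]);
    assert_eq!(read_checked(&mut c, 3).unwrap(), vec![1, 2, 3]);
    // Only 2 bytes remain; asking for 5 now fails with the descriptive error.
    let err = read_checked(&mut c, 5).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::UnexpectedEof);
}
```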

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
If a checkpoint exists we will raise an error and require the user
to correct before they proceed.
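That pre-pack guard could be as simple as the sketch below; the `checkpoint.bin` name and the exact error text are assumptions.

```rust
use std::io;
use std::path::Path;

// Refuse to start a new pack while a checkpoint from an interrupted run
// still exists; the user must repair (or otherwise resolve) it first.
fn ensure_no_checkpoint(root: &Path) -> io::Result<()> {
    let cp = root.join("checkpoint.bin"); // hypothetical file name
    if cp.exists() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            "checkpoint found: a previous pack was interrupted; repair the archive first",
        ));
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("cp_check_demo");
    std::fs::create_dir_all(&root)?;
    let _ = std::fs::remove_file(root.join("checkpoint.bin"));
    assert!(ensure_no_checkpoint(&root).is_ok());

    std::fs::write(root.join("checkpoint.bin"), b"x")?;
    assert!(ensure_no_checkpoint(&root).is_err());
    Ok(())
}
```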

Signed-off-by: Tony Asleson <tasleson@redhat.com>
The toml file format uses signed 64-bit integers.  Thus we cannot
use it to represent the unsigned 64-bit integers which are needed.

Convert the file to binary and protect the entire file with an
8-byte checksum.

Note: We should investigate using bincode for this to get
automatic ser./des. support.
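A binary encoding along those lines might look like this sketch. The field layout and the simple byte-sum checksum are stand-ins for whatever the real code uses; only the shape (u64 payload fields plus a trailing 8-byte checksum) follows the commit message.

```rust
// Encode two u64 fields followed by an 8-byte checksum over the payload.
// The byte-sum checksum here is a stand-in for the real algorithm.
fn encode(data_len: u64, hashes_len: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(24);
    buf.extend_from_slice(&data_len.to_le_bytes());
    buf.extend_from_slice(&hashes_len.to_le_bytes());
    let sum: u64 = buf.iter().map(|&b| b as u64).sum();
    buf.extend_from_slice(&sum.to_le_bytes());
    buf
}

// Returns None on wrong length or checksum mismatch (e.g. a torn write).
fn decode(buf: &[u8]) -> Option<(u64, u64)> {
    if buf.len() != 24 {
        return None;
    }
    let sum: u64 = buf[..16].iter().map(|&b| b as u64).sum();
    if sum.to_le_bytes() != buf[16..24] {
        return None;
    }
    let data_len = u64::from_le_bytes(buf[0..8].try_into().unwrap());
    let hashes_len = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    Some((data_len, hashes_len))
}

fn main() {
    // A value above i64::MAX cannot be stored in toml's signed integers,
    // but round-trips fine through the binary encoding.
    let big = u64::MAX - 7;
    let buf = encode(big, 42);
    assert_eq!(decode(&buf), Some((big, 42)));

    let mut bad = buf.clone();
    bad[0] ^= 0xff; // simulate corruption
    assert_eq!(decode(&bad), None);
}
```

bincode, as the note suggests, would derive this serialization automatically and avoid hand-rolled offsets.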

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Trading our implementation for library code.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Instead of having a 'verify' and a 'verify-all', we'll remove the
'verify-all' and add a 'validate' command with subcommands for
'all' and 'stream'.  This preserves backwards compatibility on the
command line.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
@tasleson tasleson marked this pull request as draft November 22, 2023 15:28