
Handle interrupted pack #12

Draft
wants to merge 13 commits into main

Conversation

tasleson
Contributor

Initial take at providing a way to put an archive back into a good state if a pack operation gets interrupted.

From git commit 280131e

The most important objective is to prevent the data slab and hashes slab from
getting corrupted and losing archived data.  Incomplete writes to the slabs
during a pack should be the only way for the slabs to get into an inconsistent
state.  To allow us to detect and correct this, we introduce a checkpoint file
at the root of the archive which is written and sync'd to stable storage
before we start the pack operation.  This way, if the pack operation is
interrupted, a repair option can put the slab files back to where they were
before we started.  Moving forward, the idea is to add the ability to
periodically update the checkpoint for long-running operations by quiescing IO
to the data slab, hashes slab, offsets files, and the stream output, and
recording the offset into the input data.  Then we can resume the operation by
checking the files, truncating where needed, and resuming the de-dupe
operation.
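The checkpoint-before-pack idea can be sketched roughly as below. This is a minimal illustration under assumptions, not the PR's actual code: the `Checkpoint` struct, the `checkpoint.bin` file name, and recording exactly two file lengths are all hypothetical.

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::path::Path;

// Hypothetical checkpoint contents: the lengths we would truncate the
// slab files back to if the pack is interrupted.
struct Checkpoint {
    data_slab_len: u64,
    hashes_slab_len: u64,
}

// Write the checkpoint at the archive root and sync it to stable storage
// before the pack operation starts.
fn write_checkpoint(root: &Path, cp: &Checkpoint) -> std::io::Result<()> {
    let path = root.join("checkpoint.bin"); // hypothetical file name
    let mut f = File::create(&path)?;
    f.write_all(&cp.data_slab_len.to_le_bytes())?;
    f.write_all(&cp.hashes_slab_len.to_le_bytes())?;
    f.sync_all()?; // flush file data and metadata to stable storage
    // Sync the containing directory too, so the new entry itself is durable.
    File::open(root)?.sync_all()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir();
    let cp = Checkpoint { data_slab_len: 4096, hashes_slab_len: 512 };
    write_checkpoint(&root, &cp)?;

    let mut buf = Vec::new();
    File::open(root.join("checkpoint.bin"))?.read_to_end(&mut buf)?;
    assert_eq!(buf.len(), 16);
    assert_eq!(u64::from_le_bytes(buf[0..8].try_into().unwrap()), 4096);
    Ok(())
}
```

Syncing the parent directory as well as the file matters here: on a crash, an fsync'd file whose directory entry was never made durable can still disappear.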

Note: If the slab file and the hashes file have no corruption and the number
of slabs matches between the data and hash slabs, the slab files are not
touched!  Thus the archive size could be much larger than a listing of the
archive would indicate, because the data for the interrupted pack operation is
retained but the stream is not.

I guess I could add a statement that the archive could get corrupted from bitrot, but that will be addressed in a future change where we introduce erasure coding support or similar.

Removes some duplicated code.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Currently just checks the data slab & offsets file and the hashes slab &
offsets file.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
At the moment we only fix the offsets file for the slab.
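A truncation-based fix for the offsets file might look like this sketch. The fixed 8-byte entry size and the `repair_offsets` helper are assumptions for illustration only.

```rust
use std::fs::OpenOptions;
use std::path::Path;

// Assumed layout: the offsets file holds one fixed-size entry per slab.
const ENTRY_SIZE: u64 = 8; // assumption for illustration

// If the offsets file holds more bytes than there are complete slabs
// (e.g. a partial entry from an interrupted write), truncate it back so
// the two agree again.
fn repair_offsets(offsets_path: &Path, complete_slabs: u64) -> std::io::Result<()> {
    let f = OpenOptions::new().write(true).open(offsets_path)?;
    let want = complete_slabs * ENTRY_SIZE;
    if f.metadata()?.len() > want {
        f.set_len(want)?; // drop partial/extra entries
        f.sync_all()?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("offsets_repair_demo.bin");
    std::fs::write(&path, vec![0u8; 24])?; // 3 entries, but only 2 slabs exist
    repair_offsets(&path, 2)?;
    assert_eq!(std::fs::metadata(&path)?.len(), 16);
    std::fs::remove_file(&path)?;
    Ok(())
}
```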

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
When we read off the end of the stream we get the cryptic error
"failed to fill whole buffer".  Before issuing a read, make sure
we have enough data to fill the request.
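That "failed to fill whole buffer" message comes from `read_exact()` hitting end-of-stream. A guard along these lines could replace it with a clearer error; the `read_checked` helper is a hypothetical sketch, not the commit's actual code.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

// Verify the source still holds `len` bytes past the current position
// before calling read_exact(), and fail with a descriptive error when
// it does not.
fn read_checked<R: Read + Seek>(r: &mut R, len: usize) -> io::Result<Vec<u8>> {
    let pos = r.stream_position()?;
    let end = r.seek(SeekFrom::End(0))?;
    r.seek(SeekFrom::Start(pos))?; // restore position after probing the end
    let remaining = end.saturating_sub(pos);
    if remaining < len as u64 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("requested {len} bytes but only {remaining} remain in stream"),
        ));
    }
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() {
    let mut c = Cursor::new(vec![1u8, 2, 3, 4, 5]);
    assert_eq!(read_checked(&mut c, 3).unwrap(), vec![1, 2, 3]);
    // Only 2 bytes remain; asking for 5 now fails with the descriptive error.
    let err = read_checked(&mut c, 5).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::UnexpectedEof);
}
```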

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
If a checkpoint exists we will raise an error and require the user
to correct before they proceed.
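That pre-pack guard could be as simple as the sketch below; the `checkpoint.bin` name and the exact error text are assumptions.

```rust
use std::io;
use std::path::Path;

// Refuse to start a new pack while a checkpoint from an interrupted run
// still exists; the user must repair (or otherwise resolve) it first.
fn ensure_no_checkpoint(root: &Path) -> io::Result<()> {
    let cp = root.join("checkpoint.bin"); // hypothetical file name
    if cp.exists() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            "checkpoint found: a previous pack was interrupted; repair the archive first",
        ));
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("cp_check_demo");
    std::fs::create_dir_all(&root)?;
    let _ = std::fs::remove_file(root.join("checkpoint.bin"));
    assert!(ensure_no_checkpoint(&root).is_ok());

    std::fs::write(root.join("checkpoint.bin"), b"x")?;
    assert!(ensure_no_checkpoint(&root).is_err());
    Ok(())
}
```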

Signed-off-by: Tony Asleson <tasleson@redhat.com>
The toml file format uses signed 64-bit integers.  Thus we cannot
use it to represent the unsigned 64-bit integers which are needed.

Convert the file to binary and protect the entire file with an
8-byte checksum.

Note: We should investigate using bincode for this to get
automatic ser./des. support.
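A binary encoding along those lines might look like this sketch. The field layout and the simple byte-sum checksum are stand-ins for whatever the real code uses; only the shape (u64 payload fields plus a trailing 8-byte checksum) follows the commit message.

```rust
// Encode two u64 fields followed by an 8-byte checksum over the payload.
// The byte-sum checksum here is a stand-in for the real algorithm.
fn encode(data_len: u64, hashes_len: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(24);
    buf.extend_from_slice(&data_len.to_le_bytes());
    buf.extend_from_slice(&hashes_len.to_le_bytes());
    let sum: u64 = buf.iter().map(|&b| b as u64).sum();
    buf.extend_from_slice(&sum.to_le_bytes());
    buf
}

// Returns None on wrong length or checksum mismatch (e.g. a torn write).
fn decode(buf: &[u8]) -> Option<(u64, u64)> {
    if buf.len() != 24 {
        return None;
    }
    let sum: u64 = buf[..16].iter().map(|&b| b as u64).sum();
    if sum.to_le_bytes() != buf[16..24] {
        return None;
    }
    let data_len = u64::from_le_bytes(buf[0..8].try_into().unwrap());
    let hashes_len = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    Some((data_len, hashes_len))
}

fn main() {
    // A value above i64::MAX cannot be stored in toml's signed integers,
    // but round-trips fine through the binary encoding.
    let big = u64::MAX - 7;
    let buf = encode(big, 42);
    assert_eq!(decode(&buf), Some((big, 42)));

    let mut bad = buf.clone();
    bad[0] ^= 0xff; // simulate corruption
    assert_eq!(decode(&bad), None);
}
```

bincode, as the note suggests, would derive this serialization automatically and avoid hand-rolled offsets.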

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Trading our implementation for library code.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Instead of having a 'verify' and a 'verify-all', we'll remove the
'verify-all' and add a 'validate' command with subcommands for
'all' and 'stream'.  This preserves backwards compatibility on the
command line.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
@tasleson tasleson marked this pull request as draft November 22, 2023 15:28