Journal
I decided I'd write a backup program, because I was frustrated with slow or failed restores from Duplicity, and with other open source solutions. I love Rust and thought it'd be good to write a real application in it. Also, I learned things both good and bad about storage formats and performance from bzr, and wanted to try to apply them.
Conserve has been a long time coming, with my earliest notes on an idea for it dating back to 2011 or earlier. It'll be a while longer before it's ready, as it only gets a couple of hours a week, not even every week.
My day job has very large systems with complicated requirements and environments, so it's very refreshing to work on something I entirely understand and where I can decide the requirements myself. I do miss simply writing programs.
I decided to write a journal about development, as a record and to enhance the reflective process.
Happy birthday, America! 🇺🇸🎆
Lots of recent progress: all small, but moving.
I changed `Band` from holding a reference to the containing `Archive` to just knowing its own path. It avoids explicit lifetimes and perhaps is a cleaner layering anyhow.
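For illustration, roughly the two shapes (the constructor here is hypothetical):

```rust
use std::path::PathBuf;

struct Archive {
    path: PathBuf,
}

// Before (sketch): holding a borrow forces a lifetime parameter onto
// Band and everything that stores one:
//   struct Band<'a> { archive: &'a Archive }

// After: the band just remembers its own directory.
struct Band {
    path: PathBuf,
}

impl Archive {
    // Hypothetical constructor: compute the band's path up front so the
    // Band owns everything it needs and borrows nothing.
    fn create_band(&self, name: &str) -> Band {
        Band { path: self.path.join(name) }
    }
}
```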
Now have index hunks, so it's getting close to being able to actually make a backup. Next: create a band, and add some files both to the data blocks and to the index.
I moved the blackbox tests away from relying on Python Cram to just running `conserve` as a subprocess from Rust. Seems fine.
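A sketch of the shape of such a test, using only the standard library (the `--version` flag and the output check are illustrative, not necessarily Conserve's real CLI):

```rust
use std::process::Command;

#[test]
fn blackbox_version() {
    // Run the built binary as a subprocess and inspect its output.
    // "target/debug/conserve" is Cargo's default output path.
    let output = Command::new("target/debug/conserve")
        .arg("--version")
        .output()
        .expect("failed to run conserve");
    assert!(output.status.success());
    assert!(String::from_utf8_lossy(&output.stdout).contains("conserve"));
}
```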
Also now have Conserve storing the data for files, though not yet adding them to the index or anything else that would let them actually be retrieved.
The `backup.rs` code is [now storing](https://github.com/sourcefrog/conserve/blob/054e999b25763029a3c856dfe3eee47a126cbcdf/src/backup.rs) very approximately what I need, in roughly the right format. Needs some cleanup to cope with storing multiple files the right way.
Maybe the command should be restricted to `conserve backup ARCHIVE SOURCE`, and if you want to select things inside that source, that can be done with exclusions. That will make it simpler to get the right ordering.
I tried having a common `Report` trait abstracting whether the report is self-synchronizing or not. However it doesn't work, for an interesting reason: the whole point of the `SyncReport` is that it can be updated even through a non-mutable reference, whereas the simple `Report` needs a mutable reference. The trait can't abstract across them. Java and maybe C++ would gloss over that by not having such strict concepts of mutability.
Passing a `&mut report` into the methods does work and seems cleaner, though.
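The mismatch, roughly, in a sketch (the counter field and method are made up):

```rust
use std::sync::Mutex;

struct Report {
    block_writes: u64,
}

impl Report {
    // The plain report needs `&mut self` to bump a field.
    fn count_write(&mut self) {
        self.block_writes += 1;
    }
}

struct SyncReport {
    block_writes: Mutex<u64>,
}

impl SyncReport {
    // The synchronized report updates through `&self`, via interior
    // mutability; that is its whole point.
    fn count_write(&self) {
        *self.block_writes.lock().unwrap() += 1;
    }
}

// A common trait must commit to one receiver type: `&mut self` throws
// away SyncReport's shared-reference updates, while `&self` can't be
// implemented by the plain Report.
```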
Another fun weekend of development: I added comparison and sorting of archive internal paths. That feels very elegant in Rust.
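The exact ordering rule is Conserve's own; purely as a sketch of the mechanism, one could compare paths component by component:

```rust
use std::cmp::Ordering;

/// Hypothetical: order archive paths by their '/'-separated components
/// rather than by raw bytes. Conserve's real rule may differ in detail.
fn apath_cmp(a: &str, b: &str) -> Ordering {
    a.split('/').cmp(b.split('/'))
}
```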
There are also now counters of IO, errors, compression, etc. This brings explicit lifetimes back into the picture, because the `BlockDir` holds a reference to a `SyncReport` constructed by the caller. I really dislike how this forces every function declared in the `impl` to carry `&'a` noise, even if it does nothing with `'a`. But really my unhappiness comes from not knowing whether this is idiomatic Rust or whether there's some simpler way to write it.
Idea: alternatively, just pass a `&mut report` in to every function that uses it. That avoids marking lifetimes.
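The two shapes side by side, as a sketch (the report types abbreviated from the earlier sketch):

```rust
struct SyncReport; // stands in for the synchronized report
struct Report;     // stands in for the plain report

// Holding the report forces a lifetime parameter onto BlockDir and
// `&'a` noise through its whole impl block:
struct BlockDir<'a> {
    report: &'a SyncReport,
}

impl<'a> BlockDir<'a> {
    fn store(&self, _data: &[u8]) {
        // ... updates self.report ...
    }
}

// Alternative: keep the struct lifetime-free and pass the report in to
// each method that needs it:
struct PlainBlockDir;

impl PlainBlockDir {
    fn store(&self, _data: &[u8], _report: &mut Report) {
        // ... updates the report through the parameter ...
    }
}
```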
I made a `SyncReport` that implicitly locks around each update. It's nice that this is possible, and it seems like good layering, but at the moment, with no threads in the picture, it seems expensive. Another option here would be a trait that hides from the receiver whether the `Report` synchronizes or not.
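A sketch of the locking version, with made-up counter fields: one `Mutex` around the whole set of counters, taken on every update, so callers only ever need a shared reference.

```rust
use std::sync::Mutex;

#[derive(Default)]
struct Counters {
    files: u64,
    blocks: u64,
    compressed_bytes: u64,
}

struct SyncReport {
    counters: Mutex<Counters>,
}

impl SyncReport {
    fn new() -> SyncReport {
        SyncReport { counters: Mutex::new(Counters::default()) }
    }

    // Lock, bump, unlock: the cost paid on every update, even when
    // there are no other threads.
    fn increment_blocks(&self) {
        self.counters.lock().unwrap().blocks += 1;
    }
}
```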
I realized also that the approach of combining small files into blocks causes a slight complication in writing the index: we don't know the hash of the block in which a file is stored immediately when the file is stored. We need a pipeline of some number of files that all go into blocks, and then later make their addresses available.
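One possible shape for that pipeline, entirely hypothetical: queue the entries whose combined block is still being filled, then complete them all once the block's hash is known.

```rust
/// An index entry that can't be finished yet: the hash of the combined
/// block holding its data isn't known until the block is written.
struct PendingEntry {
    apath: String,
    offset: u64,
    length: u64,
}

struct CompletedEntry {
    apath: String,
    block_hash: String,
    offset: u64,
    length: u64,
}

struct BlockPipeline {
    pending: Vec<PendingEntry>,
}

impl BlockPipeline {
    /// Called when the combined block is finally written: every file
    /// queued into it learns its address at once.
    fn flush(&mut self, block_hash: &str) -> Vec<CompletedEntry> {
        self.pending
            .drain(..)
            .map(|e| CompletedEntry {
                apath: e.apath,
                block_hash: block_hash.to_string(),
                offset: e.offset,
                length: e.length,
            })
            .collect()
    }
}
```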
Next:
- refactor counters
- start adding an index
- maybe a `conserve cat-block` command?
The contents of backed-up files are stored in compressed, hash-addressed block files. (See `doc/format.md`.) This tries to balance a few considerations:

- For cloud data stores we want files that are not too small (too many round trips) and not too big (too slow to write).
- I want large files that are changed in place, such as VM images or databases, to be incrementally updated and not entirely rewritten when they change. However, the rsync rolling-sum approach used in Duplicity has a security risk, and it is not necessary: insertions into a file, which shift the rest of the data along, are rare for large files. So Conserve uses a degenerate case of rdiff: it matches only data with the same strong hash, aligned on a block boundary.
The start of this is implemented in a simple `BlockWriter` struct that compresses and accumulates the hash of what was written to it.
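A minimal sketch of that shape, assuming gzip via the `flate2` crate and SHA-256 via the `sha2` crate purely as stand-ins for whatever compression and hash Conserve actually uses:

```rust
use std::io::{self, Write};

use flate2::write::GzEncoder;
use flate2::Compression;
use sha2::{Digest, Sha256};

/// Buffers compressed data in memory while hashing the uncompressed
/// bytes written to it.
struct BlockWriter {
    encoder: GzEncoder<Vec<u8>>,
    hasher: Sha256,
}

impl BlockWriter {
    fn new() -> BlockWriter {
        BlockWriter {
            encoder: GzEncoder::new(Vec::new(), Compression::default()),
            hasher: Sha256::new(),
        }
    }

    fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
        self.hasher.update(data);
        self.encoder.write_all(data)
    }

    /// Consumes the writer: returns the compressed bytes and the hex
    /// hash of the uncompressed content.
    fn finish(self) -> io::Result<(Vec<u8>, String)> {
        let compressed = self.encoder.finish()?;
        let hash: String = self
            .hasher
            .finalize()
            .iter()
            .map(|b| format!("{:02x}", b))
            .collect();
        Ok((compressed, hash))
    }
}
```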
The question arises here of whether it's OK to buffer compressed
data in memory, and I decided it is: writing blocks to temporary
files will be tedious, and the size we want for good IO should
fit in memory on typical machines. (Let's say, 1GB or less.)
In writing this I realized the description of versioning semantics was not so clear and consistent, so I updated that in d09d08c.
Next up is to actually write blocks into a block storage directory, and then we can start writing these into bands, the next larger storage structure.
One dilemma here is whether to put blocks into subdirectories, or to have one directory with a potentially very large number of block files. Choosing at runtime makes it harder to know where to look for a particular file, and makes things more complicated. On most modern filesystems there is no direct limit, but I would like the option to back up to flash sticks, which do have a limit on FAT. I'll use the first three hex chars of the hash, giving 4096 direct child directories, each of which can hold many files.
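Finding a block is then just a slice of its hash, as in this hypothetical helper:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical: a block whose hash starts with "d09..." lives in
/// subdirectory "d09", one of at most 4096 fan-out directories.
fn block_path(block_dir: &Path, hash: &str) -> PathBuf {
    block_dir.join(&hash[..3]).join(hash)
}
```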
Am trying the Rust Clippy linter, but it won't work at the moment because this package provides both a library and a binary.
Lifetime management in Rust continues to throw up thought-provoking errors, but
in many cases they are pointing to a real imprecision in the code.
In the case of the `BlockWriter`, the object should be consumed as it finishes. In another language, the object would remain in a state where it can be called, but should not be.
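With a `finish(self)` signature like the sketch above, the compiler enforces that directly:

```rust
fn demo() -> std::io::Result<()> {
    let mut w = BlockWriter::new(); // the sketch type from above
    w.write_all(b"some file contents")?;
    let (_compressed, _hash) = w.finish()?; // consumes `w`
    // w.write_all(b"more")?; // error[E0382]: use of moved value: `w`
    Ok(())
}
```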
Storing blocks into a data directory is now done in 3bf3190. I was going to tell the `BlockWriter` how to store itself, but it turns out cleaner to have a separate `BlockDir` which knows about the directory structure and consumes the `BlockWriter`.
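Continuing the hypothetical `BlockWriter` sketch from above, the split looks roughly like this:

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Knows the directory layout; the BlockWriter only knows how to
/// compress and hash.
struct BlockDir {
    path: PathBuf,
}

impl BlockDir {
    /// Consumes the writer, storing the compressed block under its
    /// hash-derived fan-out path, and returns the hash.
    fn store(&self, writer: BlockWriter) -> io::Result<String> {
        let (compressed, hash) = writer.finish()?;
        let subdir = self.path.join(&hash[..3]);
        fs::create_dir_all(&subdir)?;
        fs::write(subdir.join(&hash), compressed)?;
        Ok(hash)
    }
}
```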
Rust lifetimes, again, are awkward at first but create positive design pressure.
`if let` is nice here for error handling: if `create_dir` returns an error and the error is not `AlreadyExists`, return it.
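That pattern in isolation, as a hypothetical helper rather than Conserve's actual code:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Create `dir`, treating "already exists" as success.
fn ensure_dir(dir: &Path) -> io::Result<()> {
    if let Err(e) = fs::create_dir(dir) {
        if e.kind() != io::ErrorKind::AlreadyExists {
            return Err(e);
        }
    }
    Ok(())
}
```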
I started using a type alias, `type BlockHash = String`, to be more descriptive in the API. It's interchangeable with a regular `String`.