About the BaRJ Cargo backup archive
The data archive of the BaRJ Cargo backup uses the BaRJ Cargo format. Please read the linked page to understand how it stores data. The configuration and the content are detailed in the following sections.
The backup can use the following components.
File component | File name pattern | Occurrence | Relevance |
---|---|---|---|
RSA Key pair | *.p12 | 1 per backup job (optional) | Used for decryption of the encrypted backups |
Backup configuration | *.json | 1 per backup job | Configures the backup |
Backup manifest | *.manifest.cargo | 1 per backup increment | Contains backup metadata |
Backup index | *.index.cargo | 1 per backup increment | BaRJ Cargo archive index |
Backup chunk | *.<sequence>.cargo | N per backup increment | BaRJ Cargo archive content chunk. Sequence starting from 00001. |
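The file name patterns in the table can be used to sort the files found at a backup destination into their components. The following is a minimal sketch, not part of the tool: the function name `classify_backup_file` is hypothetical, and it assumes that the chunk sequence consists of digits only.

```python
import re
from typing import Optional

# File name patterns taken from the component table.
# Assumption: the chunk sequence is made of digits only (e.g. 00001).
_PATTERNS = [
    ("RSA key pair", re.compile(r".+\.p12$")),
    ("Backup configuration", re.compile(r".+\.json$")),
    ("Backup manifest", re.compile(r".+\.manifest\.cargo$")),
    ("Backup index", re.compile(r".+\.index\.cargo$")),
    ("Backup chunk", re.compile(r".+\.\d+\.cargo$")),
]


def classify_backup_file(file_name: str) -> Optional[str]:
    """Return the component name for a file name, or None if nothing matches."""
    for component, pattern in _PATTERNS:
        if pattern.match(file_name):
            return component
    return None
```

For example, `classify_backup_file("test-backup-gzip-1703797376.00001.cargo")` is classified as a backup chunk, while a name such as `notes.txt` matches none of the patterns.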
The full set of metadata captured during the backup process is stored in the Backup manifest. The manifest is stored in 2 copies under the backup destination:
- One under the `.history` folder that contains a GZip archived variant of the JSON manifest without using encryption. This is required in order to allow the scheduled backup to run without the need for the private key of the RSA key pair to be available for the job.
- The second is stored as a `*.manifest.cargo` file. This is encrypted as well in case the backup configuration defines a KEK. This can be safely stored together with the backup files, for example in case they are transferred to an offline location for increased durability of the backup. These are used during the restore process.
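Because the copy under the `.history` folder is a GZip-compressed, unencrypted JSON document, it can be inspected with standard tooling. The sketch below shows one way to read it in Python. The function name and the UTF-8 encoding are assumptions, and the exact file names used inside `.history` are not specified here, so the path must be supplied by the caller.

```python
import gzip
import json
from pathlib import Path


def load_history_manifest(manifest_path: Path) -> dict:
    """Read a GZip-compressed, unencrypted JSON manifest into a dict.

    Assumption: the JSON text is UTF-8 encoded.
    """
    with gzip.open(manifest_path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)
```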
Each manifest contains a long list of metadata, such as:
- The set of backup increment versions represented by the manifest
- The version of the File BaRJ job generating the backup increment
- The timestamp when the backup was started
- The file name prefix of the backup archive files belonging to the current backup increment
- The actual backup type used when the backup was executed (Can differ from the configuration, for example in case of the first full backup of an incremental backup job)
- The backup configuration JSON used for the creation of the backup, including the KEK we need to use for DEK encryption
- The set of encrypted DEKs (AES keys) which are used for data encryption in case the backup is encrypted
- The file metadata for each file in scope. These contain details such as:
  - Id (unique Id of each file)
  - The absolute path
  - File system key (a unique Id defined by the file system, e.g. inode)
  - Original size in bytes
  - Access control permissions
    - The POSIX permissions of the file at the time of the backup (Can be partial in case the file system or the OS does not support POSIX)
    - The owner of the file at the time of the backup (optional)
    - The group of the file at the time of the backup (optional)
  - Time stamps of the file as observed at the time of backup
    - Created
    - Last modified
    - Last accessed
  - Hash value, representing the original hash of the file contents using the hash algorithm selected in the configuration. (optional)
  - Change status, indicating what the status of the file is (NEW, DELETED, etc.)
  - The Id of the related archive entry storing the file contents. (optional)
- The archive entry metadata for each archive entry stored for the backup. This contains details like:
  - The unique Id of the archive entry
  - Archive locator. Uses the following components to help identify the archived content
    - Backup increment, the number of the backup increment storing the file contents
    - Entry name, UUID, the name of the archived entry storing the file contents
  - Archived hash value, the digest of the archived content using the hash algorithm defined by the backup configuration. (optional)
  - Hash value, representing the original hash of the file contents using the hash algorithm selected in the configuration. (optional)
  - Files, containing the set of Ids which are referencing the related file metadata entries
The aforementioned metadata was designed to be helpful with the following use-cases:
Because the file metadata and the archive entry metadata are not in a 1:1 relationship, a single entry in the archive can belong to multiple files. The files might differ in name, permissions, etc., but when their content is exactly the same, it only needs to be stored once, potentially reducing the archive size.
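The one-to-many relationship between archive entries and files can be made visible by grouping file paths by the archive entry they reference. This is a minimal sketch that uses the `files`, `path` and `archive_metadata_id` field names from the manifest JSON; the function name and the two-file sample data are hypothetical.

```python
from collections import defaultdict


def paths_by_archive_entry(manifest: dict) -> dict:
    """Map each archive entry Id to the paths of all files sharing its content."""
    result = defaultdict(list)
    for file_meta in manifest["files"].values():
        # The archive entry Id is optional (for example, it can be absent
        # for a directory), so such files are skipped.
        entry_id = file_meta.get("archive_metadata_id")
        if entry_id is not None:
            result[entry_id].append(file_meta["path"])
    return dict(result)


# Hypothetical sample: two files with identical content share one entry.
sample_manifest = {
    "files": {
        "file-1": {"path": "file:///home/user/dir/a.txt", "archive_metadata_id": "entry-1"},
        "file-2": {"path": "file:///home/user/dir/b.txt", "archive_metadata_id": "entry-1"},
        "dir-1": {"path": "file:///home/user/dir/"},
    }
}
shared = paths_by_archive_entry(sample_manifest)
```

In this sample, `shared["entry-1"]` lists both `a.txt` and `b.txt`, while the directory does not appear at all.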
In addition to this, the metadata of each archive entry specifies a version number (identifying the backup increment) and an entry name (identifying the entry among the entries of the given increment) to locate the archived entry. As a result, a new backup increment can reference any entry from the existing backup increments while it is being created. This way, an incremental backup can point to an already stored copy when a new copy of the same file appears.
Since the BaRJ Cargo archive stores individually compressed and encrypted entries, we can merge multiple backup increments into a single archive without the need to re-encrypt or re-compress them. This is further supported by the fact that each increment has a full metadata snapshot of the files available at the time the increment is created. Using this information, it is easy to find out which entries are needed after the merge and which can be thrown out because they are no longer referenced.
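The selection of entries to keep during a merge can be sketched as a set operation over the latest metadata snapshot. This is an illustration only, not the tool's implementation: the function name is hypothetical, and it assumes that every file in the latest snapshot that carries an `archive_metadata_id` still needs its entry.

```python
def partition_entries_for_merge(latest_files: dict, all_entries: dict) -> tuple:
    """Split archive entry Ids into (keep, drop) sets for a merge.

    latest_files: the "files" map of the most recent increment's manifest.
    all_entries: archive entries collected from every increment being merged.
    """
    referenced = {
        file_meta["archive_metadata_id"]
        for file_meta in latest_files.values()
        if file_meta.get("archive_metadata_id") is not None
    }
    keep = referenced & set(all_entries)
    drop = set(all_entries) - referenced
    return keep, drop
```

Entries in the `drop` set are not referenced by any file of the latest snapshot, so under the assumption above they would not need to be copied into the merged archive.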
The archived and original hash values in the archive metadata are very helpful for the verification steps. By comparing them with freshly calculated digests, we can be sure that the operation was completed and produced exactly the result we expect. These checksum values are also used by the BaRJ Cargo archive itself, so it can perform additional verification steps and raise errors in case a file became corrupted.
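A verification step of this kind boils down to recalculating a digest and comparing it with the recorded value. The sketch below assumes the SHA256 algorithm and the lowercase hexadecimal encoding seen in the example manifest; the function name is hypothetical.

```python
import hashlib


def matches_original_hash(content: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 digest of the content with the recorded hash.

    Assumption: the recorded hash is a hexadecimal SHA-256 digest.
    """
    return hashlib.sha256(content).hexdigest() == expected_hex.lower()


# The zero-byte file of the example manifest has the well-known
# SHA-256 digest of an empty input as its original_hash.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
is_intact = matches_original_hash(b"", EMPTY_SHA256)
```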
Each manifest has a set of DEKs (in case a KEK was provided). These are automatically assigned to the randomly generated archive entry UUIDs. This way, we can expect that each DEK will only be reused for some of the archive entries (~1/16th of all entries in the increment). Also, thanks to the built-in encryption of the BaRJ Cargo format, each entry will use a random IV.
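The exact assignment rule is not described here. The ~1/16th figure would be consistent with a scheme where one of 16 DEKs is picked using a single hexadecimal digit of the entry UUID, so the following sketch illustrates that idea. Both the rule and the function name are assumptions made for illustration, not the tool's documented behavior.

```python
def select_dek_index(entry_name: str, dek_count: int = 16) -> int:
    """Pick a DEK slot for an archive entry.

    Hypothetical rule: interpret the first hexadecimal digit of the
    entry UUID as a number and use it as the index of the DEK.
    """
    first_digit = int(entry_name[0], 16)
    return first_digit % dek_count
```

Since random UUIDs are uniformly distributed, a rule like this would spread the entries roughly evenly, so each of the 16 keys would protect about 1/16th of the entries.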
Compared to the metadata, the file content storage is very simple, as it only relies on the features provided by the BaRJ Cargo archive format. Most of the complexity is in locating the file entries based on the metadata (the Archive locator). This is because, as a result of a conscious decision, the original file names and paths are not used in the archives. Instead, we use a simplified path which consists of the backup increment number as the root folder and the entry name as the file name.
This decision has the following benefits:
- It is harder to identify which entry is which if the metadata part is not available. This can make it harder for an adversary to know which file is which. Still, when the manifest is available and successfully decrypted, we can easily connect the entries with their files and the encryption keys used.
- The chance of file name collisions is significantly reduced as the file names are generated by the tool and not user provided. This remains true even in case of merged archives.
- Path traversal is mitigated as the paths are always absolute in the metadata and the archived entries are using a logical, almost flat representation.
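The simplified path can be derived from the archive locator alone. The sketch below builds it from the `backup_increment` and `entry_name` fields of the manifest JSON. The function name, the `/` separator and the leading slash are assumptions.

```python
def archive_entry_path(archive_location: dict) -> str:
    """Build the in-archive path from an archive locator.

    The backup increment number is the root folder and the entry name
    (a UUID) is the file name. Assumption: "/" is the path separator.
    """
    increment = archive_location["backup_increment"]
    entry_name = archive_location["entry_name"]
    return f"/{increment}/{entry_name}"
```

Because the path is built only from a number and a tool-generated UUID, no user-provided path segment can end up in the archive.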
An unzipped example manifest can be found below:
```json
{
  "backup_versions" : [ 0 ],
  "encryption_keys" : null,
  "app_version" : "0.1.0",
  "start_time_utc_epoch_seconds" : 1703797376,
  "file_name_prefix" : "test-backup-gzip-1703797376",
  "backup_type" : "FULL",
  "job_configuration" : {
    "backup_type" : "FULL",
    "hash_algorithm" : "SHA256",
    "compression_algorithm" : "GZIP",
    "encryption_key" : null,
    "duplicate_strategy" : "KEEP_EACH",
    "chunk_size_mebibyte" : 500,
    "file_name_prefix" : "test-backup-gzip",
    "destination_directory" : "file:///tmp/backup-test/",
    "sources" : [ {
      "path" : "file:///home/user/dir/"
    } ]
  },
  "files" : {
    "92cc58f0-049e-44cb-bc33-a54e8b71fd72" : {
      "id" : "92cc58f0-049e-44cb-bc33-a54e8b71fd72",
      "file_system_key" : "(dev=fd01,ino=2098465)",
      "path" : "file:///home/user/dir/",
      "original_size" : 4096,
      "last_modified_utc_epoch_seconds" : 1632340914,
      "last_accessed_utc_epoch_seconds" : 1703797180,
      "created_utc_epoch_seconds" : 1632340914,
      "permissions" : "rwxrwxr-x",
      "owner" : "user",
      "group" : "user",
      "file_type" : "DIRECTORY",
      "hidden" : false,
      "status" : "NEW"
    },
    "8a8e4296-1ef0-4d46-8119-6c826a255a9a" : {
      "id" : "8a8e4296-1ef0-4d46-8119-6c826a255a9a",
      "file_system_key" : "(dev=fd01,ino=2098468)",
      "path" : "file:///home/user/dir/file",
      "original_hash" : "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "original_size" : 0,
      "last_modified_utc_epoch_seconds" : 1632340914,
      "last_accessed_utc_epoch_seconds" : 1703797180,
      "created_utc_epoch_seconds" : 1632340914,
      "permissions" : "rw-rw-r--",
      "owner" : "user",
      "group" : "user",
      "file_type" : "REGULAR_FILE",
      "hidden" : false,
      "status" : "NEW",
      "archive_metadata_id" : "4179ac43-4971-4257-8d09-a12108ceef5a"
    }
  },
  "archive_entries" : {
    "4179ac43-4971-4257-8d09-a12108ceef5a" : {
      "id" : "4179ac43-4971-4257-8d09-a12108ceef5a",
      "archive_location" : {
        "backup_increment" : 0,
        "entry_name" : "4179ac43-4971-4257-8d09-a12108ceef5a"
      },
      "archived_hash" : "ac73670af3abed54ac6fb4695131f4099be9fbe39d6076c5d0264a6bbdae9d83",
      "original_hash" : "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "files" : [ "8a8e4296-1ef0-4d46-8119-6c826a255a9a" ]
    }
  },
  "index_file_name" : "test-backup-gzip-1703797376.index.cargo",
  "data_file_names" : [ "test-backup-gzip-1703797376.00001.cargo" ]
}
```
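In this example, the index and chunk file names appear to be the manifest's `file_name_prefix` followed by `.index.cargo` or by a five-digit sequence number and `.cargo`. The sketch below derives the expected names under that reading. Generalizing from a single example is an assumption, and the function name and the `chunk_count` parameter are hypothetical.

```python
def expected_archive_file_names(manifest: dict, chunk_count: int) -> dict:
    """Derive the index and chunk file names from the manifest prefix.

    Assumptions: the naming seen in the example manifest holds in general
    and the chunk sequence is zero-padded to five digits, starting at 00001.
    """
    prefix = manifest["file_name_prefix"]
    return {
        "index_file_name": f"{prefix}.index.cargo",
        "data_file_names": [
            f"{prefix}.{sequence:05d}.cargo"
            for sequence in range(1, chunk_count + 1)
        ],
    }
```

Comparing the derived names with the `index_file_name` and `data_file_names` values of a manifest is a quick way to notice missing or unexpected files at the backup destination.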