Backup job configuration tips
The backup configuration has the following options.
Option | Type | Meaning |
---|---|---|
backup_type | Enum | Defines whether the job should always produce full backups or should perform incremental backups after the initial full backup. Possible values: FULL, INCREMENTAL |
hash_algorithm | Enum | Defines which hash algorithm should be used for ensuring file integrity and finding duplicated files. Possible values: NONE, MD5, SHA1, SHA256, SHA512 |
compression_algorithm | Enum | Selects the compression algorithm which should be used for the archive. Possible values: NONE, BZIP2, GZIP |
encryption_key | Base64 String (nullable) | The KEK (RSA public key) that should be used for key encryption. |
duplicate_strategy | Enum | Instructs the backup algorithm either to keep each duplicated file as is, or to store only one copy of the duplicated content per backup, eliminating duplications even across multiple backup increments. Using KEEP_EACH is expected to produce the biggest archives, but the benefit of deduplication depends on the number of duplicated files in the source set. Possible values: KEEP_EACH, KEEP_ONE_PER_BACKUP |
chunk_size_mebibyte | int (larger than 0) | Sets an upper threshold for the size of each archive chunk. Can be useful in case the archive will be transferred to a file system or online storage service where the file size is limited. Also, partial restores can benefit from only reading the relevant chunks if there are multiple of them. |
file_name_prefix | String | The prefix of each backup archive file. Must only use characters which are allowed by the file system and OS. For simplicity, it is a good idea to stick to alphanumeric characters, dashes and underscores. |
destination_directory | String (file:// URI) | The absolute path of a directory where we want to store the backup archives. |
sources | Set (Backup source) | Defines the source folders/files and the relevant match criteria. The backup sources must be mutually exclusive, to guarantee that each file can only be included by matching a single source. A backup source has a path component defining the root of the source, an include_patterns list containing the glob patterns that identify the matching files under the root path, and an optional exclude_patterns list that can be used to exclude some of the matching files using the same glob pattern syntax. |
A simple example configuration can be seen below:
```json
{
  "backup_type" : "FULL",
  "hash_algorithm" : "SHA256",
  "compression_algorithm" : "GZIP",
  "encryption_key" : null,
  "duplicate_strategy" : "KEEP_EACH",
  "chunk_size_mebibyte" : 500,
  "file_name_prefix" : "home-backup-gzip-unencrypted",
  "destination_directory" : "file:///tmp/backup-destination/",
  "sources" : [
    {
      "path" : "file:///home/user/",
      "include_patterns": [ "**" ],
      "exclude_patterns": [ ".m2", ".m2/**" ]
    }
  ]
}
```
If security is important for you, make sure to always provide a KEK (a 4096-bit RSA public key) in the encryption_key property. This will automatically turn on AES DEK generation, as well as the encryption of each archived entry and each piece of metadata.
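As a sketch, an encrypted variant of the example above only needs the encryption_key property populated (the value below is a placeholder standing in for a real Base64-encoded public key, and the file name prefix is adjusted accordingly):

```json
{
  "backup_type" : "FULL",
  "hash_algorithm" : "SHA256",
  "compression_algorithm" : "GZIP",
  "encryption_key" : "<base64-encoded 4096-bit RSA public key>",
  "duplicate_strategy" : "KEEP_EACH",
  "chunk_size_mebibyte" : 500,
  "file_name_prefix" : "home-backup-gzip-encrypted",
  "destination_directory" : "file:///tmp/backup-destination/",
  "sources" : [
    {
      "path" : "file:///home/user/",
      "include_patterns": [ "**" ],
      "exclude_patterns": [ ".m2", ".m2/**" ]
    }
  ]
}
```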
As backups can take a lot of space, it can be useful to reduce the size of the backup archives. Multiple features can be used for this (see the example configuration after this list), such as:
- Using incremental backups can make sure that only the changed files are stored
- Selecting the KEEP_ONE_PER_BACKUP duplication strategy can make sure each file is only stored once across all versions. It is recommended to enable hash calculation as well by selecting a hash algorithm other than NONE.
- Using GZIP or BZIP2 compression can reduce the size of each backup entry
- Regularly merging the increments when we are sure that we no longer need to restore to a particular point in time can eliminate unimportant states of files which change frequently
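Combining these options, a space-optimized job could look like the sketch below (only the properties from the table above are used; the file name prefix and paths are illustrative):

```json
{
  "backup_type" : "INCREMENTAL",
  "hash_algorithm" : "SHA256",
  "compression_algorithm" : "BZIP2",
  "encryption_key" : null,
  "duplicate_strategy" : "KEEP_ONE_PER_BACKUP",
  "chunk_size_mebibyte" : 500,
  "file_name_prefix" : "home-backup-space-optimized",
  "destination_directory" : "file:///tmp/backup-destination/",
  "sources" : [
    {
      "path" : "file:///home/user/",
      "include_patterns": [ "**" ],
      "exclude_patterns": [ ".m2", ".m2/**" ]
    }
  ]
}
```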
The KEEP_EACH setting ignores duplications, meaning that every copy of a file will be stored without considering the fact that we have already stored the same content previously, as illustrated in the picture below. As you can see, the content of the A file is saved in multiple copies.
With the KEEP_ONE_PER_BACKUP strategy, we can try to eliminate duplications across each backup increment globally, making sure that we are not adding the same file twice even in the case of later increments. This is why its illustration shows even more links than the previous one.
Using encryption and compression can slow down backup creation. If performance is more important than security or the size of the archive, you can opt to disable these features as a trade-off. At the same time, the implementation supports multi-threaded backup and restore functionality, which can help you mitigate the overhead caused by these expensive options.
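For instance, a performance-focused job could turn these expensive features off, trading integrity checks, deduplication and smaller archives for speed (again, the file name prefix and paths are only illustrative):

```json
{
  "backup_type" : "FULL",
  "hash_algorithm" : "NONE",
  "compression_algorithm" : "NONE",
  "encryption_key" : null,
  "duplicate_strategy" : "KEEP_EACH",
  "chunk_size_mebibyte" : 500,
  "file_name_prefix" : "home-backup-fast",
  "destination_directory" : "file:///tmp/backup-destination/",
  "sources" : [
    {
      "path" : "file:///home/user/",
      "include_patterns": [ "**" ]
    }
  ]
}
```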
Tip
Since the multi-threaded backup is using temp files in the backup directory, it can make sense to allow only 1 thread for the backup process when the backup destination is a slow disk or is accessible over the network. This is because the single-threaded implementation writes the data only once (without temp files), allowing better efficiency in these I/O-bound scenarios.