# Delta compression
Delta compression (e.g. by librsync) would be a way to avoid transmitting redundant data:
- similar or identical contents between different files in one version
- similar content between different versions of the same file
- renamed files
## Why?
- Storing data has a dollar cost.
- All else being equal, if you write less data when making a backup, it will complete more quickly.
- Storing data may have an opportunity cost: more compact storage may allow you to store more versions in a particular amount of space.
- Binary deltas may be a cheap way to detect file renames or copies even when overlapping with modifications.
However, there are counterarguments. Many of these show up in Duplicity in an extreme form since it computes one binary delta across all files.
- Computing the delta requires reading some information about what's already stored, which tends to make the backup slower. This may be compensated for by spending less time writing the resulting delta than storing the whole new file.
- Similarly, retrieving any single file requires following the whole delta chain, which is slower.
- For many files, a binary delta will not be especially helpful: the file is too small to matter (text files) or it is completely rewritten when it changes (jpgs, compiled binaries). However there are important exceptions: files that grow at the end, databases that have only some files rewritten, etc.
- Storing the data for one file version across multiple blocks in multiple chains tends to make the storage more fragile: more of it has to be present and correct to retrieve the file.
- librsync uses probabilistic compression: there is a very small but non-zero chance of a collision, in which case data will be irretrievably lost. Guarding against this at backup time would require reconstructing the previous version of the file (e.g. by applying all the deltas to date), and if you did this there would be no point in using librsync.
- If we rely on keeping a local cache of what's on the server to get adequate speed for computing deltas, that is a point of fragility. The cache can be stale or wrong; clients can have bugs that depend on the cache state; performance can fall off if the cache misses.
These are apparently in tension with general goals for Conserve:
- Keep things simple.
- Optimize for the best chance of getting data back when it is needed, at any reasonable cost.
- Maximize the chance that at least some data can be retrieved even if the backup is damaged due to Conserve bugs, problems in underlying systems, or external failures.
- Make progress backing up at least some files even in a short run that doesn't have time to do work proportional to the size of the whole filesystem.
- Make storage and retrieval fast, including retrieval of just a subset of files.
For the first version and as a default it seems best just to do file-at-a-time lossless compression: simple, predictable performance, fast random access.
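As a rough illustration of that default, here is a minimal sketch of compressing one file as an independent unit, so it can later be read back without touching any other file or version. The use of the `flate2` crate is only an assumption for the example; Conserve's real storage format and choice of compressor may differ.

```rust
// Minimal sketch of file-at-a-time lossless compression.
// flate2 is used purely for illustration; the compressor and storage
// format Conserve actually uses may differ.
use std::fs::File;
use std::io::{self, Read, Write};

use flate2::write::GzEncoder;
use flate2::Compression;

/// Compress one file's contents as an independent unit, so it can be
/// retrieved later without reading any other file or version.
fn compress_file(path: &str) -> io::Result<Vec<u8>> {
    let mut plain = Vec::new();
    File::open(path)?.read_to_end(&mut plain)?;

    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(&plain)?;
    encoder.finish()
}
```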
We can add some options:
- Track renamed files by looking for identical hashes, and pointing back to a different name in the previous version (sketched after this list).
- File-at-a-time lossless binary delta: retrieve the old version, run xdelta, and store the delta. Do this only for selected files, perhaps chosen by a filename glob and a minimum size. Retrieval is still supposed to be fast: if there is a delta chain we need to read `n_levels * (band_footer + block_index + block)`. Overall this will be somewhat worse than reading the whole file from the server, but reading may be faster than writing, or may be parallelizable on a full-duplex connection. Only actually store the delta if it's significantly smaller than the plain compressed version (see the decision sketch after this list).
- File-at-a-time block-matching compression: store signatures next to the block index. Probably also do this only for files whose size and name indicate it may be worthwhile. There is a small non-zero chance of a collision and consequent loss. This is faster than reading back the entire file, and we only need to read one previous version (signature sketch after this list).
- Look at multiple previous-version files (or current-version files?) and run xdelta or librsync across them to find common data.
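For the rename-tracking option, a minimal sketch follows, under the assumption that we keep an in-memory map from content hash to filename for the previous version. The `sha2` crate and the map itself are illustrative stand-ins, not Conserve's actual index structures.

```rust
// Sketch of rename detection by content hash. The sha2 crate and the
// in-memory map of the previous version are assumptions for
// illustration; Conserve's real index structures differ.
use std::collections::HashMap;
use std::fs;
use std::io;

use sha2::{Digest, Sha256};

/// Hash a file's full contents.
fn file_hash(path: &str) -> io::Result<Vec<u8>> {
    let data = fs::read(path)?;
    Ok(Sha256::digest(&data).to_vec())
}

/// Given a map from content hash to filename in the previous version,
/// report whether a file in the new version looks like a rename:
/// identical content stored under a different name.
fn detect_rename(
    prev_by_hash: &HashMap<Vec<u8>, String>,
    new_path: &str,
) -> io::Result<Option<String>> {
    let hash = file_hash(new_path)?;
    match prev_by_hash.get(&hash) {
        Some(old_name) if old_name.as_str() != new_path => Ok(Some(old_name.clone())),
        _ => Ok(None),
    }
}
```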
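For the file-at-a-time binary delta option, the "only store the delta if it's significantly smaller" decision might look like the following sketch. Here `compute_delta` stands in for an xdelta binding, `compress` for the ordinary per-file compressor, and the threshold is an arbitrary placeholder; all three are hypothetical.

```rust
// Sketch of the store-a-delta-or-a-full-copy decision for one file.
// `compute_delta` stands in for an xdelta binding and `compress` for
// the normal per-file compressor; both, and the 3/4 threshold, are
// hypothetical placeholders.
enum StoredForm {
    /// Complete compressed copy; readable on its own.
    Full(Vec<u8>),
    /// Delta against the previous stored version; reading it requires
    /// following the chain back to a full copy.
    Delta(Vec<u8>),
}

fn choose_form(
    old_version: &[u8],
    new_version: &[u8],
    compress: impl Fn(&[u8]) -> Vec<u8>,
    compute_delta: impl Fn(&[u8], &[u8]) -> Vec<u8>,
) -> StoredForm {
    let full = compress(new_version);
    let delta = compute_delta(old_version, new_version);
    // Only accept the delta if it is significantly smaller than the
    // plain compressed copy; otherwise the longer retrieval chain
    // isn't worth it.
    if delta.len() * 4 < full.len() * 3 {
        StoredForm::Delta(delta)
    } else {
        StoredForm::Full(full)
    }
}
```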
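For the block-matching option, the signatures stored next to the block index could look roughly like the sketch below, in the style of librsync: a cheap rolling checksum to find candidate matches and a strong hash to confirm them. The block size, the Adler-32-style weak sum, and the use of `sha2` are assumptions for illustration, not Conserve's actual format.

```rust
// Sketch of per-block signatures for block-matching compression, in
// the style of librsync: a weak rolling checksum to find candidate
// matches plus a strong hash to confirm them. Block size, the
// Adler-32-style weak sum, and sha2 are illustrative assumptions.
use sha2::{Digest, Sha256};

const BLOCK_SIZE: usize = 64 * 1024;

struct BlockSignature {
    weak: u32,        // rolling checksum: cheap, collision-prone
    strong: Vec<u8>,  // strong hash: confirms a weak-sum match
}

/// Adler-32-style weak checksum over one block.
fn weak_sum(block: &[u8]) -> u32 {
    let mut a: u32 = 0;
    let mut b: u32 = 0;
    for &byte in block {
        a = a.wrapping_add(byte as u32);
        b = b.wrapping_add(a);
    }
    ((b & 0xffff) << 16) | (a & 0xffff)
}

/// Signatures for every block of a stored file, to be written next to
/// the block index so a later backup can find unchanged blocks without
/// reading the old file body back.
fn signatures(data: &[u8]) -> Vec<BlockSignature> {
    data.chunks(BLOCK_SIZE)
        .map(|block| BlockSignature {
            weak: weak_sum(block),
            strong: Sha256::digest(block).to_vec(),
        })
        .collect()
}
```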