Last active
August 12, 2020 22:46
Benchmark to understand the effect of compressed archives on a restic repository
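#!/bin/bash

# Working directories; everything is created below $BASE.
BASE="$(pwd)/temp_test_deduplication"
SOURCE="$BASE/input"
REPO_BASE="$BASE/repo"

# One new file of FILE_SIZE is added per backup run.
NUM_FILES=16
FILE_SIZE="8M"

# Throwaway password for the test repositories.
export RESTIC_PASSWORD="password123"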
############################################################

echo "Starting with clean folder..."
TEMP="$BASE/temp"
rm -rf "$BASE"
rm -rf "$SOURCE" && mkdir -p "$SOURCE"
rm -rf "$TEMP" && mkdir -p "$TEMP"
rm -rf "$REPO_BASE"*

# One repository per archive variant, plus one for the uncompressed input.
echo -e "\nInitializing restic repos..."
restic init --repo="$REPO_BASE-input"
restic init --repo="$REPO_BASE-gzip"
restic init --repo="$REPO_BASE-bzip2"
restic init --repo="$REPO_BASE-xz"
restic init --repo="$REPO_BASE-rsyncable-gzip"
restic init --repo="$REPO_BASE-rsyncable-pigz"
restic init --repo="$REPO_BASE-rsyncable-zstd"

for i in $(seq -f "%03g" 1 "$NUM_FILES")
do
    # Random 8-character base name for the new file.
    INDEX=$(tr -dc 'a-z0-9' < /dev/urandom | head -c 8)
    echo "============================================================"
    echo "Adding file $i under $SOURCE/$INDEX.txt"
    # Fill the new file with random printable text.
    tr -dc '[:alnum:] \n' < /dev/urandom | head -c "$FILE_SIZE" > "$SOURCE/$INDEX.txt"
    ls -lh "$SOURCE"

    # Baseline: back up the uncompressed input directory.
    REPO="$REPO_BASE-input"
    echo -e "\n$REPO"
    restic --repo="$REPO" backup "$SOURCE"

    # Plain compressed archives, using tar's built-in compression options.
    for ALGO in gzip bzip2 xz; do
        echo "------------------------------------------------------------"
        REPO="$REPO_BASE-$ALGO"
        echo -e "\n$REPO"
        /usr/bin/time -f "Compression took %e seconds" \
            tar -cv "--$ALGO" -f "$TEMP/archive.tar.z" "$SOURCE"
        echo
        restic --repo="$REPO" backup "$TEMP"
        rm -rf "$TEMP" && mkdir "$TEMP"
    done

    echo "------------------------------------------------------------"
    REPO="$REPO_BASE-rsyncable-gzip"
    echo -e "\n$REPO"
    #tar -cv $SOURCE | gzip --rsyncable > $TEMP/archive.tar.z
    #GZIP='--rsyncable' tar -cvzf $TEMP/archive.tar.gz $SOURCE
    /usr/bin/time -f "Compression took %e seconds" \
        tar -cv --use-compress-program="gzip --rsyncable" -f "$TEMP/archive.tar.z" "$SOURCE"
    echo
    restic --repo="$REPO" backup "$TEMP"
    rm -rf "$TEMP" && mkdir "$TEMP"

    echo "------------------------------------------------------------"
    REPO="$REPO_BASE-rsyncable-pigz"
    echo -e "\n$REPO"
    #tar -cv $SOURCE | pigz --rsyncable > $TEMP/archive.tar.z
    /usr/bin/time -f "Compression took %e seconds" \
        tar -cv --use-compress-program="pigz --rsyncable" -f "$TEMP/archive.tar.z" "$SOURCE"
    echo
    restic --repo="$REPO" backup "$TEMP"
    rm -rf "$TEMP" && mkdir "$TEMP"

    echo "------------------------------------------------------------"
    # Attention: --rsyncable requires zstd >= 1.3.8, see https://github.com/facebook/zstd/releases/tag/v1.3.8
    REPO="$REPO_BASE-rsyncable-zstd"
    echo -e "\n$REPO"
    #tar -cv $SOURCE | zstd --rsyncable > $TEMP/archive.tar.z
    /usr/bin/time -f "Compression took %e seconds" \
        tar -cv --use-compress-program="zstd --rsyncable" -f "$TEMP/archive.tar.z" "$SOURCE"
    echo
    restic --repo="$REPO" backup "$TEMP"
    rm -rf "$TEMP" && mkdir "$TEMP"
done

rm -rf "$TEMP"
echo -e "\nFinal repo sizes, compared to file input of $NUM_FILES files of $FILE_SIZE each:"
du -hs "$BASE"/*
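
The final `du -hs` call prints human-readable totals, which are awkward to compare across runs. The following is a minimal sketch of how the sizes could be reported relative to the uncompressed baseline instead. The helpers `dir_size_kib` and `report_sizes` are hypothetical names made up for this illustration and are not part of the script.

```bash
#!/bin/bash
# Illustration only: dir_size_kib and report_sizes are hypothetical helpers.

# Disk usage of a directory in KiB (du -k is specified by POSIX).
dir_size_kib() {
    du -sk "$1" | cut -f1
}

# Print the size of every given directory, absolute and as a
# percentage of the first directory (the reference).
report_sizes() {
    local ref_kib size_kib dir
    ref_kib=$(dir_size_kib "$1")
    # Avoid a division by zero if the reference directory is empty.
    [ "$ref_kib" -gt 0 ] || ref_kib=1
    for dir in "$@"; do
        size_kib=$(dir_size_kib "$dir")
        echo "$dir: $size_kib KiB ($(( 100 * size_kib / ref_kib ))% of reference)"
    done
}
```

With the variables of the script above, a call could look like `report_sizes "$REPO_BASE-input" "$REPO_BASE"-*`, which uses the repository of the uncompressed input as the reference.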
The benchmark resembles a typical scenario: an application changes parts of its data over time, and the provided backup command or script creates a compressed archive. A compressed archive is generally preferred because it is easier to handle and uses less storage. Problems arise as soon as deduplicating backup tools like restic are used. These tools identify changes between backup runs and store only the differences. Two compressed archives with little difference in content might nevertheless be identified as completely different, and the backup repository balloons in size.
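
The effect can be observed without restic. The following is a rough sketch, assuming `gzip`, `base64`, `cmp` and `tail` are installed; `WORK`, `TAIL_BYTES` and `tail_diff` are names invented for this illustration. It inserts one line at the front of a text file and then counts how many of the last 64 KiB differ, once for the plain files and once for their gzip-compressed versions. A byte-for-byte comparison of the tails is only a crude stand-in for the chunking a deduplicating backup tool performs.

```bash
#!/bin/bash
# Illustration only: WORK, TAIL_BYTES and tail_diff are hypothetical names.
WORK="$(mktemp -d)"
TAIL_BYTES=65536

# Version A: about 2 MB of random text.
# Version B: the same text with one line inserted at the front.
head -c 1572864 /dev/urandom | base64 > "$WORK/a.txt"
{ echo "a small change"; cat "$WORK/a.txt"; } > "$WORK/b.txt"

# -n keeps file name and timestamp out of the gzip header.
gzip -n -c "$WORK/a.txt" > "$WORK/a.gz"
gzip -n -c "$WORK/b.txt" > "$WORK/b.gz"

# Count the differing byte positions within the last TAIL_BYTES of two files.
tail_diff() {
    tail -c "$TAIL_BYTES" "$1" > "$WORK/tail1"
    tail -c "$TAIL_BYTES" "$2" > "$WORK/tail2"
    # cmp exits non-zero when the files differ, which is expected here.
    { cmp -l "$WORK/tail1" "$WORK/tail2" || true; } | wc -l | tr -d ' '
}

RAW_TAIL_DIFF=$(tail_diff "$WORK/a.txt" "$WORK/b.txt")
GZ_TAIL_DIFF=$(tail_diff "$WORK/a.gz" "$WORK/b.gz")

echo "uncompressed tail: $RAW_TAIL_DIFF of $TAIL_BYTES bytes differ"
echo "gzip tail:         $GZ_TAIL_DIFF of $TAIL_BYTES bytes differ"
rm -rf "$WORK"
```

The uncompressed tails are identical because the insertion only shifts the data, whereas the compressed tails are expected to differ almost everywhere, so nothing after the change can be reused.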
The benchmark script tests the effect of different compression algorithms on the repository size. It adds one new file to the input on every loop iteration, so the archive grows from one backup run to the next. In one test run with 16 loop iterations and a file size of 8 MB, the resulting repositories had the following sizes:
It is therefore highly recommended to use compression algorithms with the "rsyncable" option or send uncompressed backup files to restic.
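
One way to apply this recommendation in a backup script is sketched below, assuming a `tar` that accepts arguments in `--use-compress-program`, as the benchmark script does. The functions `supports_rsyncable`, `pick_compressor` and `make_archive` are hypothetical helpers, not part of restic or of the benchmark. The sketch probes the installed compressors for a working `--rsyncable` option and falls back to an uncompressed archive if none is found.

```bash
#!/bin/bash
# Illustration only: all function names here are hypothetical helpers.

# Succeeds if the given compressor accepts --rsyncable on a tiny input.
supports_rsyncable() {
    printf 'probe' | "$1" --rsyncable -c > /dev/null 2>&1
}

# Print the first compressor command that supports --rsyncable,
# or nothing if none of the candidates does.
pick_compressor() {
    local tool
    for tool in gzip pigz zstd; do
        if command -v "$tool" > /dev/null 2>&1 && supports_rsyncable "$tool"; then
            echo "$tool --rsyncable"
            return 0
        fi
    done
}

# Archive the contents of directory $1 into file $2, compressed only
# if an rsyncable compressor is available.
make_archive() {
    local src="$1" dest="$2" compressor
    compressor="$(pick_compressor)"
    if [ -n "$compressor" ]; then
        tar --use-compress-program="$compressor" -cf "$dest" -C "$src" .
    else
        tar -cf "$dest" -C "$src" .
    fi
}
```

A call such as `make_archive "$SOURCE" "$TEMP/archive.tar.z"` would then replace the individual `tar` invocations. To skip the intermediate file entirely, the uncompressed stream can also be piped into restic, for example `tar -cf - "$SOURCE" | restic --repo="$REPO" backup --stdin --stdin-filename archive.tar`.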
Side note: how fast were the individual compression algorithms? On the test machine and with the generated test data, compressing all 16 files took: