Organizing backups: Combining duplicity and rsync

Posted in archiving backup rsync duplicity bash
Part 1 from the series "Organizing backups"

There are two awesome, weathered tools out there that are all you really need for your personal backups. [1] One is the rsync cli, the other is duplicity.

The former should need no introduction.

The latter operates more like tar, but it still works over ssh like rsync. In fact, it's based on librsync, which implements the rsync delta algorithm (the rolling-checksum scheme, not the rsync wire protocol). The special sauce, however, is encryption.

Backup categories

Let's say, for the sake of argument, that our personal backups can be divided into three categories:

Stuff that can be public

Code snippets, git repositories, public data store states (e.g. blockchain ledgers), copies of OS packages and any other assets without redistribution issues.

For this we will use rsync.

Sensitive stuff

Passwords, keys, contacts, calendars, contracts, invoices, task lists, databases, system configurations, application data.

For this we will use duplicity.

Secret stuff

Long-lived keys, password and volume decryption keys, cryptocurrency keys and meta-information about the backups themselves.

This will not be addressed now.

Why not just one or the other?

Duplicity stores everything in an archive file format. That means that you must first authenticate, decrypt and unpack the archive in order to even browse the files inside.

If there is no reason to keep the files from prying eyes, then it's much more practical to be able to browse the files where they lie, with the regular filesystem tools. In such a case, rsync will scratch your itch.

For the sensitive and secret stuff, there would be no real need to use duplicity if you were only operating on your local host. You'd just use an encrypted volume [2] and rsync everything in there.

But half the point here is to keep remote copies as well as your local ones. You know, in case of fire, hardware-eating locust swarms or some totalitarian minions nabbing all your electronics. Unless "remote" here means some box hidden in some moated leisure castle of yours, you'll want to encrypt everything before you ship it off. And that's where duplicity comes in.

Vive la difference

Of course, it would be too much to hope that duplicity and the rsync cli had aligned the way they parse their invocation parameters.

Here are some examples [3] of how they do not match:

local to local

$ rsync -a src/ /path/to/dst/

$ duplicity src/ file:///path/to/dst/

local to remote, relative path

$ rsync -a src/ user@remotehost:path/to/dst/

$ duplicity src/ scp://user@remotehost/path/to/dst

toggle dotfiles from current path

# include only .foo/foo.txt given the current structure:
$ tree src/ -a
src/
├── .bar
├── baz
└── .foo
    └── foo.txt

$ rsync -a --exclude=".b*" --include=".*/***" --exclude="*" ./ ../dst/

$ duplicity --exclude="./.b*" --include="./.*/***" --exclude="*" ./ file:///home/lash/tmp/dst/

logging

# spill the beans
$ rsync -vv ...

$ duplicity -v debug ...

Batchin'

Since you will want to select up front which tool to use for which sensitivity category, you'll be writing the includes and excludes specifically for the tool anyway.

So the only real issue with the above is the way the remote host is specified.

Let's say we choose to stick to the rsync cli host format. That means we need to make the following translations:

rsync                 duplicity
foo/bar               file://foo/bar
/foo/bar              file:///foo/bar
user@host:foo/bar     scp://user@host/foo/bar
user@host:/foo/bar    scp://user@host//foo/bar

Expressed in bash that could look like this:

to_duplicity_remote() {

        # remote host is defined in rsync format
        # ... and we will only support scp
        # $remote_base is the path we want to parse
        # results are left in the globals $remote_duplicity_base and $remote
        remote_duplicity_base=
        remote=
        remote_base=$1

        # substring up until the first slash
        s_firstslash=${remote_base%%/*}

        # substring up until the first colon
        s_firstcolon=${remote_base%%:*}

        # string index of the first slash
        i_firstslash=${#s_firstslash}

        # string index of the first colon
        i_firstcolon=${#s_firstcolon}

        # if colon is before first slash that most likely means we have a remote host
        # (Exception is if first directory of path happens to have ":" in it. Seriously, don't use ":" in filenames)
        if [ "$i_firstcolon" -gt "0" ]; then

                if [ "$i_firstcolon" -lt "$i_firstslash" ]; then

                        # use the pexpect+scp backend: the private key is not fetched implicitly,
                        # so without pexpect this only works with a key that has no passphrase
                        # https://serverfault.com/questions/982591/duplicity-backup-fails-private-key-file-is-encrypted
                        remote_duplicity_base="pexpect+scp://"

                        # trim url so that colon after hostname is removed
                        # (no support here for setting an alternate port number)
                        remote_duplicity_base=${remote_duplicity_base}${remote_base:0:$i_firstcolon}
                        remote_duplicity_base=${remote_duplicity_base}/${remote_base:$((i_firstcolon+1))}

                        # indicate that we have a remote host
                        remote=1
                fi
        fi

        # If it's not a remote host, treat it as a file
        # (quoted, so the test also holds for paths with spaces)
        if [ -z "$remote_duplicity_base" ]; then
                remote_duplicity_base="file://${remote_base}"
        fi
}

Let's behave and test our code:

if [ -n "$BAK_TEST" ]; then
        # inputs in rsync format, and the duplicity url expected for each
        src=(/foo/bar foo/bar localhost:foo/bar localhost:/foo/bar)
        res=(file:///foo/bar file://foo/bar pexpect+scp://localhost/foo/bar pexpect+scp://localhost//foo/bar)

        i=0
        for case_src in "${src[@]}"; do
                case_res=${res[$i]}

                to_duplicity_remote "$case_src"
                if [ "$remote_duplicity_base" != "$case_res" ]; then
                        >&2 echo "expected $case_res got $remote_duplicity_base from $case_src"
                        exit 1
                elif [ "$remote_base" != "$case_src" ]; then
                        >&2 echo "$case_src got mangled into $remote_base"
                        exit 1
                fi
                i=$((i+1))
        done
fi

# 0 == good!
$ BAK_TEST=1 bash remote.sh && echo $?
0
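
The colon caveat from the comments in the function has a conventional escape hatch: with rsync and scp, the usual way to pass a local path containing a colon is to prefix it with ./ so that a slash comes before the colon. The sketch below shows that the same trick carries over to our parsing rule. It uses a compact, hypothetical variant of to_duplicity_remote (the name rsync_to_duplicity_url is made up for this illustration) that prints the url instead of setting globals. Treat it as a demonstration of the rule, not as a replacement for the function above.

```bash
# Hypothetical helper: print the duplicity url for an rsync-style destination.
# Same rule as to_duplicity_remote: a colon before the first slash means "remote".
rsync_to_duplicity_url() {
        local base="$1"
        local host="${base%%:*}"    # everything before the first colon
        local head="${base%%/*}"    # everything before the first slash

        if [ -n "$host" ] && [ "${#host}" -lt "${#head}" ]; then
                # remote: drop the colon, join host and path with a single slash
                printf 'pexpect+scp://%s/%s\n' "$host" "${base#*:}"
        else
                printf 'file://%s\n' "$base"
        fi
}

rsync_to_duplicity_url "we:ird/dir"      # colon comes first: "we" is taken for a host
rsync_to_duplicity_url "./we:ird/dir"    # slash comes first: treated as a local path
```

The second call yields a file:// url with a relative path, the same shape as the foo/bar row in the translation table.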

Now we can take a single destination in rsync cli format and feed it to a batch of individual backup steps, each of which may use either the rsync cli or duplicity:

to_duplicity_remote localhost:/foo/bar

rsync -avzP pub/ "$remote_base/src/"

duplicity -v info secret/ "$remote_duplicity_base/secret/"
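
A batch of such steps could be sketched roughly as follows. Everything here is hypothetical scaffolding: the run_step function, the DRY_RUN switch and the choice to mirror each directory name on the destination are illustrative only. The two destination variables are hard-coded to the values to_duplicity_remote would produce for localhost:/foo/bar, so that the snippet stands alone. With DRY_RUN left at its default, the commands are printed rather than executed.

```bash
# Values as to_duplicity_remote would leave them for localhost:/foo/bar
remote_base="localhost:/foo/bar"
remote_duplicity_base="pexpect+scp://localhost//foo/bar"

# Hypothetical step runner: pick the tool per sensitivity category.
# DRY_RUN=1 (the default here) prints the command instead of running it.
run_step() {
        local tool="$1"
        local dir="$2"

        # build the command line in the positional parameters
        case "$tool" in
                rsync)
                        set -- rsync -a "$dir/" "$remote_base/$dir/"
                        ;;
                duplicity)
                        set -- duplicity "$dir/" "$remote_duplicity_base/$dir"
                        ;;
                *)
                        >&2 echo "unknown tool: $tool"
                        return 1
                        ;;
        esac

        if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "$*"
        else
                "$@"
        fi
}

run_step rsync pub
run_step duplicity secret
```

Keeping the tool choice in one place means each category maps to exactly one line in the batch, and a dry run shows what would be shipped where before anything leaves the machine.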

See also

[1] Ok, I know, I'm assuming that you are using git in daily life, too.
[2] Provided, of course, that it's an encrypted volume that you don't keep unlocked all the time.
[3] Duplicity needs, at a minimum, a password for symmetric encryption, and will prompt for it unless it's set in the environment. Simply export PASSPHRASE=test for these examples to spare yourself the annoyance.