Backups

Backups are created using Borg Backup and stored externally on Backblaze B2, all orchestrated via Ansible. Backups cover data from servers and from deployed workloads (see notes on backup scope below for more details).

Tools and Services

Borg Backup

Borg is used to take the selection of files that we want to safeguard and actually turn them into “a backup”. It has two key concepts:

A backup - this is a point-in-time snapshot of a collection of files. A backup is an object within Borg that can be named, deleted, subjected to retention rules, and used to restore the input files at a later date. Borg manuals call this an “archive”.
A repo - this is a collection of related backups (or “archives”). We have one repo called repo0 that all backups are stored in. It is the entire repo that is stored off-site (within B2).

Borg is a mature application that provides a lot of features; the ones most relevant and relied on by us are:

Retention - Borg makes it very easy to apply rules in the form “keep the last X hourly backups, and then the last Y daily backups, etc.”.
Deduplication - if very little has changed between backup N and backup N+1 in the same repo, Borg ensures that very little new data is stored. It does this by splitting every input file into chunks and storing only the unique ones. This means that frequent backups are possible without storing unsustainable amounts of data.
Encryption - repos are bound to an encryption key on creation, which is needed for all interaction with that repo. Without the key (and the key’s passphrase) the repo is just a large volume of random bytes, making it easier and safer to store off-site. The backup key is stored in the password manager.
Compression - Borg will compress each file chunk before storing it, so we don’t need to worry about compressing everything we’re putting into the backup (in fact we should explicitly avoid it - see below).
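
As a rough illustration of how these features surface on the command line, the sketch below wraps the relevant Borg 1.x commands in shell functions. Only the repo path comes from this document: the function names, the archive naming scheme, the keyfile encryption mode, the lz4 compression choice and the retention numbers are all assumptions for illustration, not the values used by the Ansible role.

```shell
# Sketch only: apart from the repo path, every value here is illustrative.
REPO="${REPO:-/mnt/backups/borg/repo0}"

# Hypothetical archive naming scheme: a UTC timestamp.
archive_name() {
  date -u +%Y-%m-%dT%H-%M-%S
}

# Encryption: the repo is bound to a key when it is created.
# keyfile mode is an assumption; Borg also offers repokey.
init_repo() {
  borg init --encryption=keyfile "$REPO"
}

# Deduplication and compression both happen inside borg create.
# Arguments are the paths to include in the backup.
create_backup() {
  borg create --stats --compression lz4 "$REPO::$(archive_name)" "$@"
}

# Retention: keep the last 24 hourly, 7 daily and 4 weekly backups.
prune_backups() {
  borg prune --list --keep-hourly 24 --keep-daily 7 --keep-weekly 4 "$REPO"
}
```

Nothing runs until one of the functions is called, so the block can be sourced safely on a host without a repo.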

Note

Borg splits files into chunks to deduplicate them, then compresses new chunks before storing them to disk. To give Borg the best shot at finding duplicate chunks, input files should be presented in their simplest, uncompressed form. For example, two large database dumps where only a few rows have changed will be almost identical files in plain text, but entirely different if they are compressed outside of Borg.

Backblaze B2

Provided by Backblaze, B2 is an S3-compatible object storage service. It is used in place of AWS S3 because it is significantly cheaper - $0.006/GB/month vs. $0.023/GB/month as of August 2023. AWS does have cheaper storage tiers for infrequent access, but those tiers are incompatible with performing regular backups and restoring from a backup as quickly as possible.

Backblaze login credentials are shared in the password manager.

Note

Backups are cheap: 100GB of backed up data costs less than 50p/month. If in doubt, back it up.

Backup Scope

During the backup process target files are split conceptually into two types: static and generated.

Static files are anything that normally exists on the host and just needs to be copied into a backup.
- For example a path like /home/markormesher/Pictures or system config files.
- These files are listed for each host in the Ansible inventory.
Generated files are created on the fly for a specific backup, copied into it, and removed afterwards.
- For example a database dump or a snapshot of a Kubernetes volume’s contents.
- In most cases these are generated by scripts that live in the Ansible repo or by the K8s backup generator.

Running Backups

Note

The Ansible role that runs this process has fairly good in-line documentation.

Note

The storage locations below are owned by root, and the backup workflow mostly runs as root. This makes it possible to gather files from a variety of locations that are owned by various users, and prevents a regular user from accidentally interfering with backup files.

Storage

Backups live on nfs01 - the repo is stored there and all input files need to be copied there before a backup can be created. Most of the workflow below is designed to achieve that copying as efficiently and reliably as possible, followed by a few tasks to create and upload the backup.

As part of this process a few file paths are key:

/mnt/backups/borg on nfs01
- This is where the Borg repo is stored.
/mnt/backups/staging on nfs01
- This is where files are copied to before creating a backup.
/mnt/backup-staging on every host that contributes to the backup, excluding nfs01
- This is an NFS mount to a subfolder under /mnt/backups/staging on nfs01.
  - For example, srv01:/mnt/backup-staging points at nfs01:/mnt/backups/staging/srv01.
  - These mounts are set up in the Ansible script.
- This means we never have to explicitly copy from a contributing host to nfs01; it’s done implicitly for us via NFS. It also means that any backup scripts can write their output directly to /mnt/backup-staging and know that it will be included in the backup.
- One exception to this is nfs01 - backups running on that host write directly to /mnt/backups/staging/nfs01 instead of via an NFS mount back to the same host.
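
The path layout above can be summarised as a small mapping from host name to staging location. This is a minimal sketch using the paths from this section; the function names are hypothetical and the real mounts are set up by the Ansible script, not by this code.

```shell
# Sketch: map a host name to the staging paths described above.
NFS_HOST="nfs01"
STAGING_ROOT="/mnt/backups/staging"

# Directory on nfs01 that holds one host's staged files.
staging_dir_on_nfs() {
  printf '%s/%s\n' "$STAGING_ROOT" "$1"
}

# Path that a backup script running on the given host should write to.
staging_dir_on_host() {
  if [ "$1" = "$NFS_HOST" ]; then
    # nfs01 writes straight into its own subfolder, with no NFS mount.
    staging_dir_on_nfs "$1"
  else
    printf '/mnt/backup-staging\n'
  fi
}

# NFS source that a contributing host would mount at /mnt/backup-staging.
nfs_source_for() {
  printf '%s:%s\n' "$NFS_HOST" "$(staging_dir_on_nfs "$1")"
}
```

For example, nfs_source_for srv01 prints nfs01:/mnt/backups/staging/srv01, matching the srv01 example above.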

Workflow

Every 4 hours, a cron job on srv01 triggers an Ansible playbook that carries out the following workflow. The cron job itself is also managed via Ansible.

Any required packages are installed or updated.
A backup lock is acquired, or the playbook aborts if the lock is already taken. This achieves two things:
- It stops multiple playbook runs from being executed in parallel.
- If the previous backup failed the lock will not be released. This requires manual intervention and is by design: if things are in a broken state, trying to run more backups on top might make things worse.
On each host that contributes to the backup:
- The file paths and NFS mounts described above are set up.
- The folders for generated and static files are created if they don’t exist.
- The folder for generated files is emptied.
- Generated files
  - A backup timestamp is written to the generated file folder (this can be useful to know during restore operations).
  - Any per-host scripts are run to generate one-off backups, which are written to the host’s /mnt/backup-staging/generated folder.
    - As of writing, current scripts include triggering the K8s backup generator and extracting the home router config.
- Static files (excluding nfs01)
  - A symlink to each static input directory for the host is created in /mnt/backup-staging/static-input. Paths are renamed to remove slashes, so /home/markormesher/Pictures becomes __home__markormesher__Pictures.
  - rsync is used to copy everything from /mnt/backup-staging/static-input to /mnt/backup-staging/static, removing anything in static that is no longer in static-input.
  - The symlinks are removed.
  - This step keeps the static files folder up to date without having to copy every file during every backup - most files should already be there from the last backup run. See the note at the bottom for more details.
  - This process is not followed for nfs01 because the target files are already on the host, so this would be a waste of time and storage. Instead, the static files on that host are referenced directly when the backup snapshot is created.
The playbook checks that no non-optional hosts failed to populate their backup staging directory (and aborts if this check fails).
A snapshot of /mnt/backups/staging (which now contains all of the per-host files) is created as a new backup.
Generated backup files are deleted (note that static files are not).
A retention policy is applied to all backups in the repo.
The repo is synced to B2 using rclone.
Successful completion is signalled to the uptime monitoring tool.
- If the monitoring tool doesn’t hear about a successful backup for 6 hours it triggers an alert.
The backup lock is released.
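
The locking behaviour in the workflow above (abort if the lock is taken, and leave it in place after a failure) can be sketched with a lock directory. This is an illustration of the idea only: the lock path and function names are hypothetical, and the playbook may implement its lock differently.

```shell
# Sketch: a directory-based lock. mkdir either creates the directory or
# fails, so only one run can hold the lock at a time.
# LOCK_DIR is a hypothetical path, not the one used by the playbook.
LOCK_DIR="${LOCK_DIR:-/tmp/backup.lock}"

acquire_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    return 0
  fi
  echo "backup lock $LOCK_DIR is already held; aborting" >&2
  return 1
}

release_lock() {
  rmdir "$LOCK_DIR"
}

# Run the given command under the lock.
run_backup() {
  acquire_lock || return 1
  # Release only on success: a failed run leaves the lock behind on purpose,
  # so the next run aborts until someone investigates.
  "$@" && release_lock
}
```

After a failed run, the lock directory has to be removed by hand before run_backup will proceed again, which mirrors the manual intervention described above.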

Note

The process for copying static files to the staging area (create symlinks, rsync, remove the symlinks) is complex for a good reason: to minimise how much data is copied between hosts on each backup.

By leaving the static folder intact and populated after each backup we only have to copy changes and additions in the next run. In contrast, generated files are wiped and copied from scratch in every backup, because they are comparatively tiny. However, we want to make sure files that are no longer on the host are removed from the backup, so we need to gather everything into one place first (the static-input folder) and then update the static folder in a single “snapshot” update via rsync --delete.

Why not just put the symlinks in the static folder? Borg doesn’t follow symlinks.
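
The symlink, rsync and clean-up sequence can be sketched as follows. The function names and the exact rsync flags are assumptions (the Ansible role may use different options); --copy-links is used here so that rsync copies what each symlink points at rather than the link itself.

```shell
# Sketch: turn an absolute path into the flattened symlink name.
flatten_path() {
  printf '%s\n' "$1" | sed 's|/|__|g'
}

# Sketch: refresh the static folder from a list of source directories.
# Arguments: the staging root, then one or more absolute source directories.
sync_static() {
  staging="$1"; shift
  mkdir -p "$staging/static-input" "$staging/static"
  # Gather every input directory into one place via symlinks.
  for src in "$@"; do
    ln -s "$src" "$staging/static-input/$(flatten_path "$src")"
  done
  # --copy-links follows the symlinks; --delete removes anything in static
  # that is no longer present in static-input.
  rsync -a --copy-links --delete "$staging/static-input/" "$staging/static/"
  # Remove the symlinks again, leaving static populated for the next run.
  for src in "$@"; do
    rm "$staging/static-input/$(flatten_path "$src")"
  done
}
```

For example, flatten_path /home/markormesher/Pictures prints __home__markormesher__Pictures. Note that --copy-links also follows any symlinks found inside the source directories, which may or may not be what the real role does.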

Restoring from Backups

Warning

If you restore individual files Borg will automatically create the parent directory tree, which will have the permissions of the user doing the extract, not the permissions they were backed up with.

For example, if you tell Borg to extract foo/bar/fizz/buzz then buzz and all of its children will have the correct permissions, but foo/bar/fizz will have the permissions of the user doing the extract (probably root).

You can get around this by extracting the entire backup (archive), by fixing permissions after a partial extract, or by discarding the automatically created directories.

See borgbackup/borg#1751.
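
One way to discard the automatically created directories is to stop Borg from creating them at all: Borg 1.x has a --strip-components option for borg extract that removes a number of leading path elements. The sketch below is a hedged example rather than a tested procedure for this repo; the function names are hypothetical, and the repo, archive name and path are supplied by the caller.

```shell
# Count the parent directories in a relative archive path (no trailing slash),
# e.g. foo/bar/fizz/buzz has three parents.
parent_depth() {
  printf '%s\n' "$1" | awk -F/ '{ print NF - 1 }'
}

# Extract one path into the current directory without recreating its parents.
# Arguments: repo path, archive (backup) name, path inside the archive.
extract_without_parents() {
  repo="$1"; archive="$2"; path="$3"
  borg extract --strip-components "$(parent_depth "$path")" "$repo::$archive" "$path"
}
```

With the example above, extracting foo/bar/fizz/buzz this way would place buzz directly in the current directory, so foo/bar/fizz is never created with the wrong permissions.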

If the Local Backup is Accessible

Note

These steps assume that the main local backup on nfs01 is intact and accessible. This might be the case if you are just recovering an old version of a file or one that has been accidentally deleted.

Get the passphrase to the Borg repo encryption key - it’s in the password manager. You will need to enter it for every borg command below.
Connect to the nfs01 host, or mount the entire backup folder on a different host. The rest of these steps assume the backup is at /mnt/backups/borg/repo0, which is the path on nfs01.
List the backups present in the repo.
```
sudo borg list /mnt/backups/borg/repo0
```

Pick the correct backup and list the files within it.

```
sudo borg list /mnt/backups/borg/repo0::<the backup name>
```

Recover individual files or directories as required, or omit the paths to extract the entire archive. Note that files and directories will be extracted into the current directory.
```
sudo borg extract --progress /mnt/backups/borg/repo0::<the backup name> <file or directory> <file or directory>
```
Fix file permissions if required.

If Everything is Gone

Note

These steps start from zero, i.e. a freshly installed Linux machine with no access to any local backups. They are tested on Debian but any mainstream OS should work. They can be followed to recover from a total loss or to test the recovery process.

Note

It’s optional, but you might want to start a tmux session before starting the download so you can leave it running in the background. Tmux is included in the install command below, and the following commands/shortcuts will let you use it:

Start a session: tmux new -s main

Detach from the session: Ctrl+b then d

Re-attach to the session: tmux attach

Install the packages that will be needed in the following steps.
```
apt update
apt install rclone borgbackup vim tmux
```
(Optional) Start a tmux session - see above.
Get the key ID and value for the B2 application key. You will need them to download the repo from B2. If you’ve lost these credentials or they don’t work, new ones can be generated by logging in to Backblaze.

Configure Rclone to talk to B2 by creating the config file and setting the contents below.

```
mkdir -p ~/.config/rclone
vim ~/.config/rclone/rclone.conf
```

```
[backblaze-b2]
type = b2
account = <application key ID>
key = <application key value>
```

Download the entire backup repo. Make sure you have enough disk space. Note that this step may fail if you hit the download caps set in Backblaze - they can be updated via their website.
```
mkdir ~/backup-recovery
rclone sync -P backblaze-b2:mormesher-borg-repo0 ~/backup-recovery/.
```
Get the Borg repo encryption key (it’s a long string in the form BORG_KEY <lots of base64 data>) and save it in a file on the host, such as ~/borg-key.
Import the encryption key and link it to the repo that you downloaded:
```
borg key import ~/backup-recovery ~/borg-key
```
The Borg repo is now set up and available to use - follow the steps in the section above, replacing /mnt/backups/borg/repo0 with ~/backup-recovery, or wherever you downloaded the repo to.

Additional Steps when Testing Restore

Below are additional checks to make after extracting a complete backup whilst testing the backup recovery process.

Basic files and directories
- Compare the contents of various directories in the backup and at the origin.
  - e.g. are there the same number of photos on the desktop and in the backup?
- Compare the hashes of specific files in the backup with their origins.
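
The file count and hash comparisons above could be scripted along these lines. This is a minimal sketch assuming GNU coreutils (sha256sum), as found on Debian; the function names are hypothetical.

```shell
# Print "<hash>  <relative path>" for every file under a directory,
# sorted by path so that two trees can be compared line by line.
tree_hashes() {
  (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

# Succeed only if both trees contain the same files with the same contents.
trees_match() {
  a="$(tree_hashes "$1")" || return 1
  b="$(tree_hashes "$2")" || return 1
  [ "$a" = "$b" ]
}

# Count files, e.g. to compare the number of photos at the origin
# and in the extracted backup.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}
```

Call trees_match with the origin directory and the matching directory in the extracted backup; a non-zero exit status means that the two differ or that one of them could not be read.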

Postgres databases

Create a new Postgres container and restore at least one of the database dumps.

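```
# start a throwaway Postgres container named "postgres"
docker run -d --name postgres -e POSTGRES_HOST_AUTH_METHOD=trust postgres:16
# load the dump via stdin (-i only: a TTY cannot be allocated for piped input)
cat ~/backup-restore/.../some-db.sql | docker exec -i postgres psql -U postgres
# explore the DB that was restored
docker exec -it postgres psql -U postgres
```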

K8s volumes
- Explore the contents of at least one volume dump and confirm that it matches actual volume contents.