# Backups

Backups are created using [Borg Backup](https://www.borgbackup.org) and stored externally on [Backblaze B2](https://www.backblaze.com/cloud-storage), all orchestrated via [Ansible][ansible]. Backups cover data from [servers](./hosts.md) and from [deployed workloads](./lab-architecture.md) (see notes on backup scope below for more details).

## Tools and Services

### Borg Backup

Borg is used to take the selection of files that we want to safeguard and actually turn them into "a backup". It has two key concepts:

- A **backup** - this is a point-in-time snapshot of a collection of files. A backup is an object within Borg that can be named, deleted, subjected to retention rules, and used to restore the input files at a later date. Borg manuals call this an "archive".
- A **repo** - this is a collection of related backups (or "archives"). We have one repo called `repo0` that all backups are stored in. It is the entire repo that is stored off-site (within B2).

Borg is a mature application that provides a lot of features; the ones most relevant and relied on by us are:

- **Retention** - Borg makes it very easy to apply rules in the form "keep the last X hourly backups, and then the last Y daily backups, etc.".
- **Deduplication** - if very little has changed between backup N and backup N+1 in the same repo, Borg ensures that very little new data is stored. It does this by splitting every input file into chunks and storing only the unique ones. This means that frequent backups are possible without storing unsustainable amounts of data.
- **Encryption** - repos are bound to an encryption key on creation, which is needed for all interaction with that repo. Without the key (and the key's passphrase) the repo is just a large volume of random bytes, making it easier and safer to store off-site. The backup key is stored in the password manager.
- **Compression** - Borg will compress each file chunk before storing it, so we don't need to worry about compressing everything we're putting into the backup (in fact we should explicitly avoid it - see below).

:::{note}
Borg splits files into chunks to deduplicate them, _then_ compresses new chunks before storing them to disk. To give Borg the best shot at finding duplicate chunks, input files should be presented in their simplest, uncompressed form. For example, two large database dumps where only a few rows have changed will be almost identical files in plain text, but entirely different if they are compressed outside of Borg.
:::
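
The retention rules described above map onto the `borg prune` command. The sketch below is illustrative only: the function name and the keep counts are hypothetical placeholders, not the policy applied by the Ansible role.

```shell
# Hypothetical helper: apply a retention policy to every backup in a repo.
# The keep counts are placeholders, not the real policy.
prune_repo() {
  repo="$1"
  borg prune --list \
    --keep-hourly 24 \
    --keep-daily 7 \
    --keep-weekly 4 \
    --keep-monthly 6 \
    "$repo"
}

# Example (will prompt for the repo passphrase):
#   prune_repo /mnt/backups/borg/repo0
```

On Borg 1.2 and later, disk space is only reclaimed once `borg compact` has been run after pruning.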

### Backblaze B2

Provided by Backblaze, B2 is an S3-compatible object storage service. It is used in place of AWS S3 because it is significantly cheaper - $0.006/GB/month vs. $0.023 as of August 2023. AWS does have cheaper storage tiers for infrequent access, but they are incompatible with performing regular backups and restoring from a backup as quickly as possible.

Backblaze login credentials are shared in the password manager.

:::{note}
**Backups are cheap:** 100GB of backed up data costs less than 50p/month. If in doubt, back it up.
:::
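
The arithmetic behind the price comparison, using only the per-GB prices quoted above (in US dollars). The helper name is made up for this example.

```shell
# Hypothetical helper: monthly storage cost in dollars.
monthly_cost() {
  # $1 = size in GB, $2 = price in dollars per GB per month
  awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb * price }'
}

monthly_cost 100 0.006   # B2:          prints 0.60
monthly_cost 100 0.023   # S3 standard: prints 2.30
```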

## Backup Scope

During the backup process target files are split conceptually into two types: static and generated.

- Static files are anything that normally exists on the host and just needs to be copied into a backup.
  - For example a path like `/home/markormesher/Pictures` or system config files.
  - These files are listed for each host in the Ansible inventory.
- Generated files are created on the fly for a specific backup, copied into it, and removed afterwards.
  - For example a database dump or a snapshot of a Kubernetes volume's contents.
  - In most cases these are generated by scripts that live in the Ansible repo or by the [K8s backup generator](https://gitea.tatsu.casa/tatsu-dev/k8s-backup-generator).

## Running Backups

:::{note}
The Ansible role that runs this process has fairly good in-line documentation.
:::

:::{note}
The storage locations below are owned by root, and the backup workflow mostly runs as root. This makes it possible to gather files from a variety of locations that are owned by various users, and prevents a regular user from accidentally interfering with backup files.
:::

### Storage

Backups live on `nfs01` - the repo is stored there and all input files need to be copied there before a backup can be created. Most of the workflow below is designed to achieve that copying as efficiently and reliably as possible, followed by a few tasks to create and upload the backup.

As part of this process a few file paths are key:

- `/mnt/backups/borg` on `nfs01`
  - This is where the Borg repo is stored.
- `/mnt/backups/staging` on `nfs01`
  - This is where files are copied to before creating a backup.
- `/mnt/backup-staging` on every host that contributes to the backup, excluding `nfs01`
  - This is an NFS mount to a subfolder under `/mnt/backups/staging` on `nfs01`.
    - For example, `srv01:/mnt/backup-staging` points at `nfs01:/mnt/backups/staging/srv01`.
    - These mounts are set up in the Ansible script.
  - This means we never have to explicitly copy from a contributing host to `nfs01`; it's done implicitly for us via NFS. It also means that any backup scripts can write their output directly to `/mnt/backup-staging` and know that it will be included in the backup.
  - One exception to this is `nfs01` - backups running on that host write directly to `/mnt/backups/staging/nfs01` instead of via an NFS mount back to the same host.
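
From a backup script's point of view, the layout above boils down to one rule, sketched here. The function name is hypothetical; the paths are the ones listed above.

```shell
# Hypothetical helper: print the directory a backup script running on the
# given host should write to so that its output is picked up by the backup.
staging_dir_for() {
  case "$1" in
    nfs01) echo "/mnt/backups/staging/nfs01" ;;  # local path, no NFS mount back to itself
    *)     echo "/mnt/backup-staging" ;;         # NFS mount of nfs01:/mnt/backups/staging/<host>
  esac
}

staging_dir_for srv01   # prints /mnt/backup-staging
staging_dir_for nfs01   # prints /mnt/backups/staging/nfs01
```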

### Workflow

A cron job on `srv01` triggers an Ansible playbook every 4 hours that carries out the following workflow. The cron job itself is also managed via Ansible.

- Any required packages are installed or updated.
- A backup lock is acquired, or the playbook aborts if the lock is already taken. This achieves two things:
  - It stops multiple playbook runs from being executed in parallel.
  - If the previous backup failed the lock will not be released. This requires manual intervention and is by design: if things are in a broken state, trying to run more backups on top might make things worse.
- On each host that contributes to the backup:
  - The file paths and NFS mounts described above are set up.
  - The folders for generated and static files are created if they don't exist.
  - The folder for generated files is emptied.
  - **Generated files**
    - A backup timestamp is written to the generated file folder (this can be useful to know during restore operations).
    - Any per-host scripts are run to generate one-off backups, which are outputted into the host's `/mnt/backup-staging/generated` folder.
      - As of writing, current scripts include triggering the K8s backup generator and extracting the [home router](./network.md#core-router) config.
  - **Static files** (excluding `nfs01`)
    - A symlink to each static input directory for the host is created in `/mnt/backup-staging/static-input`. Paths are renamed to remove slashes, so `/home/markormesher/Pictures` becomes `__home__markormesher__Pictures`.
    - `rsync` is used to copy everything from `/mnt/backup-staging/static-input` to `/mnt/backup-staging/static`, removing anything in `static` that is no longer in `static-input`.
    - The symlinks are removed.
    - This step keeps the static files folder up to date without having to copy every file during every backup - most files should already be there from the last backup run. See the note at the bottom for more details.
    - This process is not followed for `nfs01` because the target files are already on the host, so this would be a waste of time and storage. Instead, the static files on that host are referenced directly when the backup snapshot is created.
- The playbook checks that no non-optional hosts failed to populate their backup staging directory (and aborts if this check fails).
- A snapshot of `/mnt/backups/staging` (which now contains all of the per-host files) is created as a new backup.
- Generated backup files are deleted (note that static files are not).
- A retention policy is applied to all backups in the repo.
- The repo is synced to B2 using `rclone`.
- Successful completion is signalled to the [uptime monitoring tool](./observability.md).
  - If the monitoring tool doesn't hear about a successful backup for 6 hours it triggers an alert.
- The backup lock is released.
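
The locking behaviour in the workflow above can be sketched with an atomic `mkdir`. This is an illustration only: how the playbook actually implements its lock is not covered here, and the function names and lock path are hypothetical.

```shell
# Hypothetical lock helpers. mkdir either creates the directory or fails
# because it already exists, so only one caller can ever succeed.
acquire_backup_lock() {
  mkdir "$1" 2>/dev/null
}

release_backup_lock() {
  rmdir "$1"
}

# Usage sketch:
#   lock=/mnt/backups/backup.lock   # hypothetical path
#   acquire_backup_lock "$lock" || { echo "lock already held, aborting" >&2; exit 1; }
#   run_backup || exit 1            # hypothetical; a failure leaves the lock in place
#   release_backup_lock "$lock"     # only reached after a successful run
```

Because the lock is only released on the success path, a failed run blocks every later run until someone removes the lock by hand, which matches the intent described above.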

:::{note}
The process for copying static files to the staging area (create symlinks, `rsync`, remove the symlinks) is complex for a good reason: to minimise how much data is copied between hosts on each backup.

By leaving the `static` folder intact and populated after each backup we only have to copy changes and additions in the next run. In contrast, generated files are wiped and copied from scratch in every backup, because they are comparatively tiny. However, we want to make sure files that are no longer on the host are removed from the backup, so we need to gather everything into one place first (the `static-input` folder) and then update the `static` folder in a single "snapshot" update via `rsync --delete`.

Why not just put the symlinks in the `static` folder? Borg doesn't follow symlinks.
:::
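
The static-file steps can be sketched in shell as follows. This is a minimal illustration, not the Ansible role itself: the function names are hypothetical, and the exact `rsync` flags used by the role are an assumption (`--copy-links` is what makes `rsync` copy the symlink targets rather than the links).

```shell
# Hypothetical helpers illustrating the static-file staging steps.

# Turn an absolute path into a flat name, e.g. /home/a/b -> __home__a__b.
flatten_path() {
  printf '%s\n' "$1" | sed 's|/|__|g'
}

# Create one symlink per input path inside <staging>/static-input.
link_static_inputs() {
  staging="$1"; shift
  mkdir -p "$staging/static-input" "$staging/static"
  for path in "$@"; do
    ln -s "$path" "$staging/static-input/$(flatten_path "$path")"
  done
}

# Copy the symlink targets into <staging>/static, deleting anything that is
# no longer present in static-input, then remove the symlinks.
sync_static() {
  staging="$1"
  rsync -a --copy-links --delete "$staging/static-input/" "$staging/static/"
  find "$staging/static-input" -type l -delete
}

# Usage sketch:
#   link_static_inputs /mnt/backup-staging /home/markormesher/Pictures
#   sync_static /mnt/backup-staging
```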

## Restoring from Backups

:::{warning}
If you restore individual files Borg will automatically create the parent directory tree, which will have the permissions of the user doing the extract, not the permissions they were backed up with.

For example, if you tell Borg to extract `foo/bar/fizz/buzz` then `buzz` and all of its children will have the correct permissions, but `foo/bar/fizz` will have the permissions of the user doing the extract (probably root).

You can get around this by extracting the entire backup, by fixing permissions after a partial extract, or by discarding the automatically created directories.

See [borgbackup/borg#1751](https://github.com/borgbackup/borg/issues/1751).
:::
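
If the original directory still exists on the host, one way to fix permissions after a partial extract is to copy the mode from it. This is a minimal sketch that assumes GNU coreutils; the function name and example paths are hypothetical.

```shell
# Hypothetical helper: give an extracted directory the same mode as its original.
# Relies on the GNU coreutils --reference option.
copy_mode() {
  # $1 = original path to copy the mode from, $2 = extracted path to fix
  chmod --reference="$1" "$2"
}

# Example:
#   copy_mode /path/on/host ./path/from/extract
# Ownership can be copied the same way with `chown --reference`.
```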

### If the Local Backup is Accessible

:::{note}
These steps assume that the main local backup on `nfs01` is intact and accessible. This might be the case if you are just recovering an old version of a file or one that has been accidentally deleted.
:::

1. Get the passphrase to the Borg repo encryption key - it's in the password manager. You will need to enter it for every `borg` command below.

1. Connect to the `nfs01` host, or mount the entire backup folder on a different host. The rest of these steps assume the backup is at `/mnt/backups/borg/repo0`, which is the path on `nfs01`.

1. List the backups present in the repo.

    ```shell
    sudo borg list /mnt/backups/borg/repo0
    ```

1. Pick the correct backup and list the files within it.

    ```shell
    sudo borg list /mnt/backups/borg/repo0::<the backup name>
    ```

1. Recover individual files or directories as required, or omit the paths to extract the entire archive. Note that files and directories will be extracted into the current directory.

    ```shell
    sudo borg extract --progress /mnt/backups/borg/repo0::<the backup name> <file or directory> <file or directory>
    ```

1. Fix file permissions if required.
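
Because `borg extract` always writes into the current directory, it can be convenient to wrap it so that files land in a scratch directory rather than on top of live data. This is a small sketch; the function name and the scratch path are hypothetical.

```shell
# Hypothetical wrapper: extract paths from a backup into a chosen directory.
extract_into() {
  # $1 = target directory, $2 = repo::backup, remaining args = paths to extract
  target="$1"; archive="$2"; shift 2
  mkdir -p "$target" || return 1
  # The subshell keeps the cd from changing the caller's working directory.
  ( cd "$target" && borg extract --progress "$archive" "$@" )
}

# Example (run as root, as with the commands above):
#   extract_into /tmp/restore "/mnt/backups/borg/repo0::<the backup name>" <file or directory>
```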

### If Everything is Gone

:::{note}
These steps start from zero, i.e. a freshly installed Linux machine with no access to any local backups. They are tested on Debian but any mainstream OS should work. They can be followed to recover from a total loss or to test the recovery process.
:::

:::{note}
It's optional, but you might want to start a tmux session before starting the download so you can leave it running in the background. Tmux is included in the install command below, and the following commands/shortcuts will let you use it:

Start a session: `tmux new -s main`

Detach from the session: `Ctrl+b` then `d`

Re-attach to the session: `tmux attach`
:::

1. Install the packages that will be needed in the following steps.

    ```shell
    apt update
    apt install rclone borgbackup vim tmux
    ```

1. (Optional) Start a tmux session - see above.

1. Get the key ID and value for the B2 application key. You will need both to download the repo from B2. If you've lost these credentials or they don't work, new ones can be generated by logging in to Backblaze.

1. Configure Rclone to talk to B2 by creating the config file and setting the contents below.

    ```shell
    mkdir -p ~/.config/rclone
    vim ~/.config/rclone/rclone.conf
    ```

    ```
    [backblaze-b2]
    type = b2
    account = <application key ID>
    key = <application key value>
    ```

1. Download the entire backup repo. Make sure you have enough disk space. Note that this step may fail if you hit the download caps set in Backblaze - they can be updated via their website.

    ```shell
    mkdir ~/backup-recovery
    rclone sync -P backblaze-b2:mormesher-borg-repo0 ~/backup-recovery/.
    ```

1. Get the Borg repo encryption key (it's a long string in the form `BORG_KEY <lots of base64 data>`) and save it in a file on the host, such as `~/borg-key`.

1. Import the encryption key and link it to the repo that you downloaded:

    ```shell
    borg key import ~/backup-recovery ~/borg-key
    ```

1. The Borg repo is now set up and available to use - follow the steps in the section above, replacing `/mnt/backups/borg/repo0` with `~/backup-recovery`, or wherever you downloaded the repo to.

### Additional Steps when Testing Restore

Below are additional checks to make after extracting a complete backup whilst testing the backup recovery process.

- Basic files and directories
  - Compare the contents of various directories in the backup and at the origin.
    - e.g. are there the same number of photos on the desktop and in the backup?
  - Compare the hashes of specific files in the backup with their origins.
- Postgres databases
  - Create a new Postgres container and restore at least one of the database dumps.
      ```shell
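      # start a throwaway Postgres container and give it a few seconds to come up
      docker run -d --name postgres -e POSTGRES_HOST_AUTH_METHOD=trust postgres:16
      # use -i only: a TTY (-t) cannot be allocated when stdin is a pipe
      cat ~/backup-restore/.../some-db.sql | docker exec -i postgres psql -U postgres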
      docker exec -it postgres psql -U postgres
      # explore the DB that was restored
      ```
- K8s volumes
  - Explore the contents of at least one volume dump and confirm that it matches actual volume contents.
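
The file-level checks above can be scripted with a couple of small helpers. This is a sketch; the function names are hypothetical and the paths in the usage comments are placeholders.

```shell
# Hypothetical helpers for spot-checking a restored backup against the origin.

# Succeeds if both files have the same SHA-256 hash.
same_hash() {
  a="$(sha256sum < "$1" | cut -d' ' -f1)"
  b="$(sha256sum < "$2" | cut -d' ' -f1)"
  [ "$a" = "$b" ]
}

# Print the number of regular files under a directory.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Usage sketch:
#   same_hash /path/on/host/photo.jpg /path/in/restore/photo.jpg && echo "hashes match"
#   [ "$(count_files /path/on/host)" = "$(count_files /path/in/restore)" ] && echo "counts match"
```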

[ansible]: https://gitea.tatsu.casa/tatsu-deploy/ansible