btrfs2s3

maintains a tree of differential backups in object storage

What does it do?

btrfs2s3 maintains a tree of differential backups in cloud object storage (anything with an S3-compatible API).

Each backup object is just a native btrfs archive produced by btrfs send [-p].

The root of the tree is a full backup. The other nodes of the tree are differential backups.

The structure of the tree corresponds to a schedule.

It looks like this:

  • Yearly backup (full)
    • Monthly backup A (changes from yearly)
    • Monthly backup B (changes from yearly)
      • Daily backup 1 (changes from monthly B)
      • Daily backup 2 (changes from monthly B)
    • Monthly backup C (changes from yearly)
      • Daily backup 3 (changes from monthly C)
      • Daily backup 4 (changes from monthly C)

The schedule and granularity can be customized. Up-to-the-minute backups can be made, with minimal increase in cloud storage or I/O.

The design and implementation are tailored to minimize cloud costs.

btrfs2s3 will keep one snapshot on disk for each backup in the cloud. This one-to-one correspondence is required for differential backups.
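The tree structure above can be sketched in a few lines. This is not btrfs2s3's actual code, just an illustration of the idea: the yearly node is sent in full, each monthly is sent with `btrfs send -p` against its yearly, and each daily against its monthly. The helper name and period boundaries are my own for illustration.

```python
from datetime import datetime

def send_parent(snapshot_time: datetime):
    """Pick the parent for `btrfs send -p` in a yearly/monthly/daily tree.

    Hypothetical helper: the yearly node has no parent (full backup),
    a monthly backup diffs against its year's full backup, and a daily
    backup diffs against its month's backup.
    """
    year_start = snapshot_time.replace(
        month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    month_start = snapshot_time.replace(
        day=1, hour=0, minute=0, second=0, microsecond=0)
    if snapshot_time == year_start:
        return None          # root of the tree: full backup
    if snapshot_time == month_start:
        return year_start    # monthly diffs against the yearly
    return month_start       # daily diffs against its monthly

print(send_parent(datetime(2024, 1, 1)))   # None (full backup)
print(send_parent(datetime(2024, 3, 1)))   # 2024-01-01 00:00:00
print(send_parent(datetime(2024, 3, 15)))  # 2024-03-01 00:00:00
```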

What problem does this solve?

btrfs2s3 is intended for users who want to self-host irreplaceable data, but are concerned about the risk of self-hosting backups. btrfs2s3's main function is to hand off backups to a third party, and to minimize the cost of doing so.

My hope is that more users (including myself) can self-host more data with confidence.

Non-goals:

  • Self-hosted backups
  • Backups of replaceable data, e.g. an operating system

The case for cloud backups of self-hosted data

Cloud-hosted backups can be a cost-effective alternative to a self-hosted backup system. They might also be the only way to eliminate yourself as a single point of failure.

Self-hosting precious data generally means redundant storage, good security, reliable monitoring and regular maintenance. Self-hosting backups means doing all that twice, ideally on a geographically-distant system.

These aren't hard problems on their own, but each is a new opportunity for human error, which has no upper bound of severity. Personally, I've lost years of data by formatting the wrong volume.

Further, self-hosting primary and backup systems means you have admin powers over both. If one is compromised, the other may be compromised through your access. If bad config affects one, it may affect the other through your administration. How can you protect yourself from yourself?

If you are dedicated to self-hosting backups, btrfs2s3 may not be the best tool. A self-hosted backup system can use the same filesystem as the primary, and take better advantage of native deduplication and direct file access. A tool like btrbk is good for this.

The case for snapshotting filesystems

btrfs2s3 stores native data streams from snapshotting filesystems (currently only btrfs, but more support is planned). It may seem like a backup tool should support all filesystems, and not specialize.

When we specialize in snapshotting filesystems, we can take advantage of native change detection, deduplication and data storage formats. This has several advantages:

  • Backups can be done automatically in the background with little or no interruption, maximizing the chances that backups stay up-to-date
  • Backups can be very frequent, minimizing the chance of data loss
  • Our tool's code is greatly simplified, reducing maintenance costs and bug surface area
  • We're guaranteed to back up all filesystem-specific metadata, whereas a generic backup storage format may need to discard it

It may seem that if your data is on an ext4 volume or a Windows machine, it's a disadvantage if a backup tool doesn't support that.

But if your data is worth backing up, it should be on a filesystem with checksums. This is the same as the argument for ECC memory. And apparently, most or all checksumming filesystems also support snapshots (true of btrfs, zfs, xfs, ceph; I welcome counterexamples). Thus if you need a backup tool, you likely already have native snapshotting features available, and it would be wasteful for a backup tool to ignore these and re-implement all their advantages.

Many believe that btrfs is unstable. While this is a tedious debate, it's always reasonable to believe software has bugs. But backups are the best defense against bugs. To the degree that snapshotting filesystems make backups easier, non-snapshotting filesystems like ext4 incur risk by making backups harder.

One extra risk of relying on native snapshots is that their specialized code paths are less battle-tested than the traditional ones (btrfs send versus read()). There is some increased risk of silent data corruption in backups.

Advantages

  • Atomic snapshot backups.
  • Up-to-the-minute backups are reasonable (even full-filesystem snapshots!)
  • Simple design with no separate state files.
  • Excellent fit with cheap storage classes (e.g. AWS Glacier Deep Archive).
  • Excellent fit with object locking for security.
  • Designed to minimize API usage and other cloud storage costs.
  • Connects directly to S3, no FUSE filesystem required.

Disadvantages

  • Requires btrfs.
  • Individual files can't be accessed directly. A whole sequence of snapshots (from root to leaf) must be restored on a local btrfs filesystem.
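The restore constraint in the second point can be sketched as a walk from leaf to root. This is not btrfs2s3's code, and the backup names and parent map are hypothetical; it only illustrates that every archive on the path must be received, in order, onto a local btrfs filesystem.

```python
def restore_chain(leaf, parent_of):
    """Return the backups to `btrfs receive`, root first.

    `parent_of` maps each backup name to its send-parent (None at the
    root). Names are made up; btrfs2s3's real object keys differ.
    """
    chain = []
    node = leaf
    while node is not None:
        chain.append(node)
        node = parent_of[node]
    return list(reversed(chain))

parents = {
    "2024-yearly": None,
    "2024-03-monthly": "2024-yearly",
    "2024-03-15-daily": "2024-03-monthly",
}
print(restore_chain("2024-03-15-daily", parents))
# ['2024-yearly', '2024-03-monthly', '2024-03-15-daily']
```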

Comparison with other tools

TODO

Installation

btrfs2s3 is distributed on PyPI. You can install the latest version:

pip install btrfs2s3

Versioning

btrfs2s3 adheres to Semantic Versioning v2.0.0. Any breaking changes will result in a major version bump.

As of writing, the documented user-facing API surface consists of:

  • CLI arguments (not the CLI output)
  • The backup object storage and metadata format

There is no publicly-exposed programmatic interface / API as of writing. The programmatic interface should be considered unstable and subject to breaking change without a major version bump.

The v0.x versions are experimental and should not be used.

Config

Minimal example:

timezone: America/Los_Angeles
sources:
  - path: /path/to/your/subvolume
    snapshots: /path/to/your/snapshots
    upload_to_remotes:
      - id: aws
        preserve: 1y 3m 30d 24h
remotes:
  - id: aws
    s3:
      bucket: my-s3-bucket-name
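One way to read the preserve string (my interpretation, not the documented grammar): `1y 3m 30d 24h` keeps 1 yearly, 3 monthly, 30 daily and 24 hourly backups. A hypothetical parser for that reading:

```python
import re

# Hypothetical parser; the real policy grammar may differ.
UNITS = {"y": "yearly", "m": "monthly", "d": "daily", "h": "hourly"}

def parse_preserve(policy: str) -> dict:
    """Turn "1y 3m 30d 24h" into per-tier retention counts."""
    counts = {}
    for count, unit in re.findall(r"(\d+)([ymdh])", policy):
        counts[UNITS[unit]] = int(count)
    return counts

print(parse_preserve("1y 3m 30d 24h"))
# {'yearly': 1, 'monthly': 3, 'daily': 30, 'hourly': 24}
```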

Full reference:

# Your time zone. Changing this affects your preservation policy. Always
# required.
timezone: America/Los_Angeles
# A source is a subvolume which you want to back up. btrfs2s3 will manage
# snapshots and backups of the source. At least one is required.
sources:
    # The path must be a subvolume to which you have write access.
  - path: /path/to/your/subvolume
    # The path where you want btrfs2s3 to store snapshots. btrfs2s3 will
    # automatically manage (create, rename and delete) any snapshots of the
    # source which exist under this path. Any snapshots outside of this path
    # will be ignored by btrfs2s3.
    snapshots: /path/to/your/snapshots
    # upload_to_remotes specifies where btrfs2s3 should store backups of this
    # source, and how they should be managed. At least one is required.
    upload_to_remotes:
        # The id refers to the "id" field of the top-level "remotes" list.
      - id: aws
        # The preservation policy for backing up this source to this remote.
        # This applies to both snapshots and backups.
        preserve: 1y 3m 30d 24h
        # A sequence of commands to pipe the backup stream through. This is
        # useful for compressing or encrypting your backup on the host before
        # storing it in the cloud. The resulting backup will be the result of
        # a command pipeline like "btrfs send | cmd1 | cmd2 | ..."
        pipe_through:
          - [gzip]
          - [gpg, --encrypt, -r, me@example.com]
# A list of places to store backups remotely. At least one is required.
remotes:
    # A unique id for this remote. Required.
  - id: aws
    # S3 configuration. Required.
    s3:
      # The S3 bucket name. Required.
      bucket: my-s3-bucket-name
      # Optional configuration for the S3 service endpoint.
      endpoint:
        # The AWS config profile in ~/.aws/config and ~/.aws/credentials. If
        # not specified, the default config sections are used.
        profile_name: my-profile-name
        # The AWS region name. Required if not configured in ~/.aws
        region_name: us-west-2
        # Access key id and secret access key for accessing the S3 endpoint.
        # Required if not specified in ~/.aws
        aws_access_key_id: ABCXYZ...
        aws_secret_access_key: ABCXYZ...
        # The S3 endpoint URL. Required if not specified in ~/.
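The pipe_through setting corresponds to a shell pipeline like `btrfs send | gzip | gpg --encrypt`, and restoring reverses it (`gpg --decrypt | gunzip | btrfs receive`). A sketch of chaining such commands with subprocess, using harmless `tr` stages as stand-ins for the compression and encryption steps (the function and the stand-in data are mine, not btrfs2s3's internals):

```python
import subprocess

def run_pipeline(commands, data: bytes) -> bytes:
    """Pipe `data` through each command in order: data | cmd1 | cmd2 | ...

    `data` stands in for the `btrfs send` stream; `commands` plays the
    role of pipe_through.
    """
    procs = []
    for cmd in commands:
        if not procs:
            procs.append(subprocess.Popen(
                cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE))
        else:
            prev = procs[-1]
            procs.append(subprocess.Popen(
                cmd, stdin=prev.stdout, stdout=subprocess.PIPE))
            prev.stdout.close()  # let the next stage see EOF
    procs[0].stdin.write(data)
    procs[0].stdin.close()
    out = procs[-1].stdout.read()
    for p in procs:
        p.wait()
    return out

# Stand-ins for gzip/gpg: uppercase, then squeeze repeated spaces.
out = run_pipeline([["tr", "a-z", "A-Z"], ["tr", "-s", " "]],
                   b"btrfs  send  stream")
print(out)  # b'BTRFS SEND STREAM'
```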