backup and restore
this cluster should make the homelab easier to abandon, not harder to leave.
assume the node can be lost, the edge vps can be replaced, and the cluster can be rebuilt from git. only deliberate state gets protected.
guiding model
split every workload into one of four classes:
- stateless / rebuildable: manifests and config live in git. no data backup required.
- stateful small: small databases, config, notes, credentials, and agent state. back up with restic and test restores.
- stateful large: photos, media, object storage, packages, artifacts. protect only what is irreplaceable or explicitly worth the cost.
- disposable: caches, workdirs, generated files, temporary environments, and job scratch space. do not back up.
default backup pattern
for single-node k3s, prefer host-level restic first.
back up the host paths that contain cluster state and persistent volumes instead of deploying a backup product on day one:
- /etc/rancher/k3s/
- /var/lib/rancher/k3s/server/db/ for sqlite cluster state
- /var/lib/rancher/k3s/server/tls/ if full server recovery is desired
- /srv/k3s/volumes/ for app pvc data
- sops age private key(s)
- flux bootstrap metadata and repo deploy key material
- any host-level blocky/caddy/cloudflared/tailscale config that is not in git
kubernetes cronjobs running restic are fine later for app-specific schedules, but host-level restic is simpler and better aligned with this one-node design. velero with restic/kopia is useful if the cluster grows, but is heavier than needed for v1.
restic sketch
choose a real target before production workloads: backblaze b2, s3-compatible storage, a nas, or another restic-compatible destination.
export RESTIC_REPOSITORY="s3:s3.example.com/k3s-one"
export RESTIC_PASSWORD_FILE="/root/.config/restic/k3s-one-password"
restic backup \
  /etc/rancher/k3s \
  /var/lib/rancher/k3s/server/db \
  /var/lib/rancher/k3s/server/tls \
  /srv/k3s/volumes
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
restic check
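the sketch above assumes the repository already exists. a minimal first-run guard, assuming the two `RESTIC_*` variables are exported as shown: `restic cat config` fails when no repository can be read at the target, so it can gate `restic init`. the function name `ensure_repo` is hypothetical.

```shell
set -eu

# initialise the restic repository only when it cannot be read yet.
# `ensure_repo` is a hypothetical helper name, not part of restic.
ensure_repo() {
  if restic cat config >/dev/null 2>&1; then
    echo "repository already initialised"
  else
    restic init
  fi
}
```

if the check fails for another reason, such as a wrong password or an unreachable target, `restic init` should fail as well rather than overwrite an existing repository, so the guard is safe to run before every backup.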
sqlite note
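single-node k3s uses sqlite by default. copying the database files while k3s is writing to them can capture an inconsistent state, so prefer a stop-aware backup or filesystem snapshot. at minimum, schedule backups during low write activity and test the restore. also keep /var/lib/rancher/k3s/server/token with the datastore copy: k3s uses the token to encrypt confidential data in the datastore, so a restore needs the same token.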
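a minimal sketch of a stop-aware backup, assuming k3s runs as the systemd unit `k3s` (what the standard install script creates) and the `RESTIC_*` variables are already exported. `run`, `backup_k3s_stopped`, and the `DRY_RUN` switch are hypothetical names for illustration.

```shell
set -eu

# run a command, or only print it when DRY_RUN=1 (hypothetical switch for rehearsals).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# stop k3s so sqlite is not being written, back up, then always start k3s again.
backup_k3s_stopped() {
  rc=0
  run systemctl stop k3s
  # record a restic failure instead of exiting, so k3s is restarted either way.
  run restic backup \
    /etc/rancher/k3s \
    /var/lib/rancher/k3s/server/db \
    /var/lib/rancher/k3s/server/tls \
    /srv/k3s/volumes || rc=$?
  run systemctl start k3s
  return "$rc"
}
```

`DRY_RUN=1 backup_k3s_stopped` prints the plan without touching the node. stopping the k3s service quiesces the datastore but leaves pod containers running, so files under /srv/k3s/volumes can still change during the backup; apps with their own databases may need an app-level dump or a filesystem snapshot instead.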
state contracts
before adding an app, add a short state contract:
- stateless: yes/no
- pvc paths and expected size
- backup class: none, daily, hourly, manual export, or special
- restore priority: p0, p1, p2, p3
- evacuation value: must-have, nice-to-have, or abandon
- exclusions: caches, generated files, logs, artifacts, transcodes
example:
workload: vaultwarden
stateless: false
pvcs:
- /srv/k3s/volumes/vaultwarden/data
backup_class: daily
restore_priority: p0
evacuation_value: must-have
exclusions: []
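the "contract before app" rule can be checked mechanically. a minimal sketch, assuming a hypothetical layout where each workload has one directory under the volumes path and one `<workload>.yaml` contract file in a contracts directory; neither the layout nor the function name `missing_contracts` comes from this repo.

```shell
set -eu

# list workloads that have data under the volumes dir but no state contract file.
# assumes a hypothetical layout: <contracts_dir>/<workload>.yaml per app.
missing_contracts() {
  volumes_dir=$1
  contracts_dir=$2
  for dir in "$volumes_dir"/*/; do
    [ -d "$dir" ] || continue   # glob did not match: no volumes yet
    workload=$(basename "$dir")
    [ -f "$contracts_dir/$workload.yaml" ] || echo "$workload"
  done
}
```

for example, `missing_contracts /srv/k3s/volumes ./contracts` prints one workload name per line and prints nothing when every volume directory has a contract.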
workload classification
| workload | state class | backup guidance | priority |
|---|---|---|---|
| vaultwarden | stateful small | back up database, attachments, config, and admin token material; test restore before relying on it | p0 |
| hermes | stateful small | if docker-mounted as /opt/data, protect config, .env/auth tokens, skills, memory, sessions/state db, cron jobs, plugin config | p0 |
| gitea repos | stateful variable | protect git repositories and gitea database; size can grow with repo count | p0/p1 |
| gitea lfs/packages/artifacts/actions logs | stateful large/variable | classify separately from source repos; use retention and exclude what is rebuildable | p1/p2 |
| immich originals | stateful large | protect originals and database; thumbnails/transcodes are rebuildable | p1 |
| hedgedoc | stateful small | back up database/uploads if used | p1 |
| wallabag | stateful small | back up database and config | p1 |
| blocky | mostly stateless | keep config in git; back up only if local runtime state matters | p1 |
| hugo sites | stateless output, stateful source | source repos are protected by gitea/offsite git; generated public/ output is disposable | p1/p3 |
| chatmail server in go | disposable state | device mailboxes pull messages as they arrive; losing server state is acceptable | p3 |
| jellyfin media | stateful large | separate irreplaceable media from replaceable movies/shows/transcodes; metadata is convenience | p2 |
| owncast | mixed | app config is small; recordings/media only if explicitly valuable | p2 |
| monitoring stack | mixed | dashboards/config in git; long-term metrics optional with retention | p2/p3 |
| caddy | stateless | config in git; certificates can be re-issued, but may be backed up as a convenience | p3 |
| gitea runner | disposable | do not back up workspaces, caches, or logs beyond short retention | p3 |
| agents running with hermes | mixed | prompts/config in git; agent memory/state only if deliberately valuable | p2/p3 |
| s3 storage deployment | stateful large | treat as a storage system, not just an app; define bucket-level policies and offsite copy | special |
| ephemeral dev containers | disposable | no backup; push useful work to git | p3 |
| random jobs | disposable by default | promote to stateful only when outputs are intentionally retained | p3 |
evacuation priorities
p0: rebuild-critical
- vaultwarden
- hermes state needed to resume operations
- gitea source repositories and database
- sops age keys
- flux deploy keys and bootstrap notes
- cloudflare, dns, registry, and backup repository credentials
p1: important
- immich originals and database
- hedgedoc
- wallabag
- blocky/home dns config
- hugo source repositories
p2: convenience or bulky value
- jellyfin metadata and selected media
- owncast config and selected media
- monitoring history
- gitea packages/artifacts if actually needed
- optional agent state
p3: abandon
- chatmail server state
- gitea runner workdirs
- ci caches
- generated hugo output
- ephemeral dev environments
- temp job outputs
- monitoring metrics beyond retention
restore drill
backup is not done until restore has been tested.
quarterly drill:
- provision a clean debian node or vm.
- install k3s.
- restore /etc/rancher/k3s, sqlite state, and selected pvc data from restic.
- restore sops age key and flux deploy key material.
- let flux reconcile.
- verify p0 apps first: vaultwarden, hermes, gitea.
- verify p1 apps next: immich, hedgedoc, wallabag, blocky.
- document any manual steps that were not already in git.
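the restic part of the drill can be sketched as below, assuming k3s is already installed on the clean node, the `RESTIC_*` variables are exported, and the `kubectl` and `flux` clis can reach the cluster (for example with `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`). `run`, `restore_p0`, and `DRY_RUN` are hypothetical names, and vaultwarden stands in for "selected pvc data".

```shell
set -eu

# run a command, or only print it when DRY_RUN=1 (hypothetical switch for rehearsals).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# restore cluster config, sqlite state, and one app's pvc data from the latest snapshot.
# the optional first argument is the restore target; "/" puts files back in place.
restore_p0() {
  target=${1:-/}
  run systemctl stop k3s
  run restic restore latest --target "$target" \
    --include /etc/rancher/k3s \
    --include /var/lib/rancher/k3s/server/db \
    --include /srv/k3s/volumes/vaultwarden
  run systemctl start k3s
  # quick health checks before verifying apps by hand.
  run kubectl get nodes
  run flux get kustomizations
}
```

restoring to a scratch directory first, for example `restore_p0 /mnt/restore-test`, lets the drill inspect the files before anything on the node is overwritten.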
grab-and-go implication
in an evacuation, do not depend on the cluster being reachable. keep an encrypted offline or cloud-reachable copy of the p0 material outside the house:
- password manager recovery/export path
- sops age key recovery path
- flux/gitea/cloudflare access path
- restic repository credentials and password
- one-page rebuild checklist
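one way to keep the grab-and-go copy honest is to check it on a schedule. a minimal sketch, assuming the bundle is a directory that has been decrypted or mounted for the check; the function name `check_bundle` and every file name passed to it are hypothetical and should match however the p0 material is actually stored.

```shell
set -eu

# report which required p0 files are missing or empty in a bundle directory.
# returns non-zero if anything is missing; file names are supplied by the caller.
check_bundle() {
  bundle_dir=$1
  shift
  missing=0
  for name in "$@"; do
    if [ ! -s "$bundle_dir/$name" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

for example, `check_bundle /mnt/usb age-key.txt restic-password.txt rebuild-checklist.md` prints one line per missing or empty file. it only proves the files are present, not that the keys inside still work, so it complements the restore drill rather than replacing it.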
k3s-one is successful when the physical node is replaceable and the human can leave with identity, keys, and backups.