backup and restore

this cluster should make the homelab easier to abandon, not harder to leave.

assume the node can be lost, the edge vps can be replaced, and the cluster can be rebuilt from git. only deliberate state gets protected.

guiding model

split every workload into one of four classes:

  1. stateless / rebuildable: manifests and config live in git. no data backup required.
  2. stateful small: small databases, config, notes, credentials, and agent state. back up with restic and test restores.
  3. stateful large: photos, media, object storage, packages, artifacts. protect only what is irreplaceable or explicitly worth the cost.
  4. disposable: caches, workdirs, generated files, temporary environments, and job scratch space. do not back up.

default backup pattern

for single-node k3s, prefer host-level restic first.

back up the host paths that contain cluster state and persistent volumes instead of deploying a backup product on day one:

  • /etc/rancher/k3s/
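  • /var/lib/rancher/k3s/server/db/ for sqlite cluster state
  • /var/lib/rancher/k3s/server/token, which k3s uses to encrypt bootstrap data in the datastore and which must match on restore
  • /var/lib/rancher/k3s/server/tls/ if full server recovery is desired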
  • /srv/k3s/volumes/ for app pvc data
  • sops age private key(s)
  • flux bootstrap metadata and repo deploy key material
  • any host-level blocky/caddy/cloudflared/tailscale config that is not in git

kubernetes cronjobs running restic are fine later for app-specific schedules, but host-level restic is simpler and better aligned with this one-node design. velero with restic/kopia is useful if the cluster grows, but is heavier than needed for v1.

restic sketch

choose a real target before production workloads: backblaze b2, s3-compatible storage, a nas, or another restic-compatible destination.

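# placeholder endpoint and bucket; point this at the real target
export RESTIC_REPOSITORY="s3:s3.example.com/k3s-one"
export RESTIC_PASSWORD_FILE="/root/.config/restic/k3s-one-password"
# s3 backends also read AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment

# one-time, before the first backup:
# restic init

# back up cluster config, sqlite state, tls material, and pvc data
restic backup \
  /etc/rancher/k3s \
  /var/lib/rancher/k3s/server/db \
  /var/lib/rancher/k3s/server/tls \
  /srv/k3s/volumes

# apply retention and remove unreferenced data, then verify repository integrity
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
restic check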
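
the three commands above can be wrapped so that a missing path or a failed step stops the run. this is a minimal sketch, not a finished script: run_backup, BACKUP_PATHS, and the overridable RESTIC variable are names invented here, and the repository variables are assumed to be exported already.

```shell
# sketch only: run_backup, BACKUP_PATHS, and RESTIC are illustrative names.
RESTIC="${RESTIC:-restic}"
BACKUP_PATHS="${BACKUP_PATHS:-/etc/rancher/k3s /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/tls /srv/k3s/volumes}"

run_backup() {
  # preflight: refuse to run if any expected path is missing, so a
  # mis-mounted volume is noticed instead of silently skipped.
  for p in $BACKUP_PATHS; do
    if [ ! -e "$p" ]; then
      echo "backup path missing: $p" >&2
      return 1
    fi
  done
  # word splitting of BACKUP_PATHS is intentional; the paths contain no spaces.
  $RESTIC backup $BACKUP_PATHS || return 1
  $RESTIC forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune || return 1
  $RESTIC check
}
```

calling run_backup from a root cron entry or a systemd timer would be one way to schedule it. failing on a missing path is a deliberate choice here: an empty mount point should produce an error, not a quiet snapshot of nothing.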

sqlite note

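single-node k3s uses sqlite by default, stored under /var/lib/rancher/k3s/server/db/. copying a live sqlite file can capture a half-written database, so prefer a stop-aware backup (stop k3s, back up, start k3s) or a filesystem snapshot. at minimum, schedule backups for periods of low write activity and test the restore.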
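
a stop-aware backup could look like the sketch below. it assumes k3s runs as a systemd unit named k3s and that the restic repository variables are already exported; backup_sqlite_stopped and the overridable SYSTEMCTL and RESTIC variables are names invented for this example.

```shell
# sketch only: assumes a systemd unit named "k3s"; function and variable names are illustrative.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
RESTIC="${RESTIC:-restic}"

backup_sqlite_stopped() {
  # stop the server so nothing writes to the sqlite file during the copy.
  $SYSTEMCTL stop k3s || return 1
  $RESTIC backup /var/lib/rancher/k3s/server/db
  rc=$?
  # start k3s again even when the backup failed, so a bad run
  # does not leave the node down.
  $SYSTEMCTL start k3s || return 1
  return "$rc"
}
```

the function returns restic's exit status, so a scheduler can alert on a failed backup while the cluster still comes back up. the api is unavailable for the duration of the backup, which is usually short for a directory this small.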

state contracts

before adding an app, add a short state contract:

  • stateless: true/false
  • pvc paths and expected size
  • backup class: none, daily, hourly, manual export, or special
  • restore priority: p0, p1, p2, p3
  • evacuation value: must-have, nice-to-have, or abandon
  • exclusions: caches, generated files, logs, artifacts, transcodes

example:

workload: vaultwarden
stateless: false
pvcs:
  - /srv/k3s/volumes/vaultwarden/data
backup_class: daily
restore_priority: p0
evacuation_value: must-have
exclusions: []
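
a contract is only useful if it is complete, so a small check can run before an app is merged. the sketch below assumes each contract is a flat file with the same top-level keys as the vaultwarden example; check_contract and the file layout are assumptions, not an existing tool.

```shell
# sketch only: check_contract is an illustrative helper, and the key list
# mirrors the vaultwarden example rather than any formal schema.
check_contract() {
  file="$1"
  rc=0
  for key in workload stateless pvcs backup_class restore_priority evacuation_value exclusions; do
    if ! grep -q "^${key}:" "$file"; then
      echo "$file: missing key: $key" >&2
      rc=1
    fi
  done
  # restore_priority, when present, must be one of p0..p3.
  if grep -q '^restore_priority:' "$file" &&
     ! grep -Eq '^restore_priority: p[0-3]$' "$file"; then
    echo "$file: restore_priority must be p0, p1, p2, or p3" >&2
    rc=1
  fi
  return "$rc"
}
```

for example, check_contract contracts/vaultwarden.yaml (a hypothetical path) returns 0 for the example above and non-zero if a key is dropped. it matches line prefixes only and does not parse yaml.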

workload classification

| workload | state class | backup guidance | priority |
|---|---|---|---|
| vaultwarden | stateful small | back up database, attachments, config, and admin token material; test restore before relying on it | p0 |
| hermes | stateful small | if docker-mounted as /opt/data, protect config, .env/auth tokens, skills, memory, sessions/state db, cron jobs, plugin config | p0 |
| gitea repos | stateful variable | protect git repositories and gitea database; size can grow with repo count | p0/p1 |
| gitea lfs/packages/artifacts/actions logs | stateful large/variable | classify separately from source repos; use retention and exclude what is rebuildable | p1/p2 |
| immich originals | stateful large | protect originals and database; thumbnails/transcodes are rebuildable | p1 |
| hedgedoc | stateful small | back up database/uploads if used | p1 |
| wallabag | stateful small | back up database and config | p1 |
| blocky | mostly stateless | keep config in git; back up only if local runtime state matters | p1 |
| hugo sites | stateless output, stateful source | source repos are protected by gitea/offsite git; generated public/ output is disposable | p1/p3 |
| chatmail server in go | disposable state | device mailboxes pull messages as they arrive; losing server state is acceptable | p3 |
| jellyfin media | stateful large | separate irreplaceable media from replaceable movies/shows/transcodes; metadata is convenience | p2 |
| owncast | mixed | app config is small; recordings/media only if explicitly valuable | p2 |
| monitoring stack | mixed | dashboards/config in git; long-term metrics optional with retention | p2/p3 |
| caddy | stateless | config in git; certificates are restorable but may be backed up as convenience | p3 |
| gitea runner | disposable | do not back up workspaces, caches, or logs beyond short retention | p3 |
| agents running with hermes | mixed | prompts/config in git; agent memory/state only if deliberately valuable | p2/p3 |
| s3 storage deployment | stateful large | treat as a storage system, not just an app; define bucket-level policies and offsite copy | special |
| ephemeral dev containers | disposable | no backup; push useful work to git | p3 |
| random jobs | disposable by default | promote to stateful only when outputs are intentionally retained | p3 |

evacuation priorities

p0: rebuild-critical

  • vaultwarden
  • hermes state needed to resume operations
  • gitea source repositories and database
  • sops age keys
  • flux deploy keys and bootstrap notes
  • cloudflare, dns, registry, and backup repository credentials

p1: important

  • immich originals and database
  • hedgedoc
  • wallabag
  • blocky/home dns config
  • hugo source repositories

p2: convenience or bulky value

  • jellyfin metadata and selected media
  • owncast config and selected media
  • monitoring history
  • gitea packages/artifacts if actually needed
  • optional agent state

p3: abandon

  • chatmail server state
  • gitea runner workdirs
  • ci caches
  • generated hugo output
  • ephemeral dev environments
  • temp job outputs
  • monitoring metrics beyond retention
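
the priority tiers above double as a restore order. as a rough sketch, a list of "priority workload" pairs can be filtered down to everything at or above a cutoff; the input format and the restore_set name are assumptions made for this example.

```shell
# sketch only: reads "priority workload" lines on stdin (an assumed format)
# and prints those at or above the cutoff, most urgent first.
restore_set() {
  cutoff="$1"
  # p0..p3 compare correctly as strings, so a plain comparison is enough.
  awk -v cutoff="$cutoff" '$1 <= cutoff { print }' | sort -k1,1 -k2
}
```

feeding it the p0 and p1 entries plus a few lower ones and calling restore_set p1 prints only the p0 and p1 lines, with p0 first. that is the set to verify before anything else in a drill or an evacuation.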

restore drill

backup is not done until restore has been tested.

quarterly drill:

  1. provision a clean debian node or vm.
  2. install k3s.
  3. stop k3s, restore /etc/rancher/k3s, sqlite state, and selected pvc data from restic, then start k3s again.
  4. restore sops age key and flux deploy key material.
  5. let flux reconcile.
  6. verify p0 apps first: vaultwarden, hermes, gitea.
  7. verify p1 apps next: immich, hedgedoc, wallabag, blocky.
  8. document any manual steps that were not already in git.
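
step 3 of the drill might be sketched as follows. it assumes a systemd unit named k3s, an already exported restic repository, and that the latest snapshot is the one to restore; restore_cluster_state and the overridable RESTIC and SYSTEMCTL variables are names invented here.

```shell
# sketch only: function and variable names are illustrative; the paths
# are the ones backed up earlier in this document.
RESTIC="${RESTIC:-restic}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restore_cluster_state() {
  # target defaults to /, which puts files back at their original paths.
  target="${1:-/}"
  $SYSTEMCTL stop k3s || return 1
  # if the restore fails, k3s is left stopped on purpose so it never
  # starts on half-restored state.
  $RESTIC restore latest --target "$target" \
    --include /etc/rancher/k3s \
    --include /var/lib/rancher/k3s/server/db \
    --include /srv/k3s/volumes || return 1
  $SYSTEMCTL start k3s
}
```

passing a scratch directory, for example restore_cluster_state /mnt/drill, restores into that directory instead of over the live paths, which is useful for inspecting a snapshot first. after a real restore, kubectl get nodes and flux get kustomizations are reasonable first checks before moving on to the p0 apps.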

grab-and-go implication

in an evacuation, do not depend on the cluster being reachable. keep an encrypted offline or cloud-reachable copy of the p0 material outside the house:

  • password manager recovery/export path
  • sops age key recovery path
  • flux/gitea/cloudflare access path
  • restic repository credentials and password
  • one-page rebuild checklist
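
a quick completeness check on the offsite bundle, run before it is encrypted, might look like this. the file names are hypothetical stand-ins for the five items above, and check_p0_bundle is a name invented for this sketch.

```shell
# sketch only: file names are hypothetical placeholders for the five
# grab-and-go items; adjust them to whatever the bundle actually holds.
check_p0_bundle() {
  dir="$1"
  rc=0
  for f in password-manager-recovery.txt sops-age-key.txt access-paths.txt \
           restic-credentials.txt rebuild-checklist.txt; do
    # -s is true only for files that exist and are not empty.
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

the check only confirms that every item is present and non-empty. whether each one still works is for the restore drill to prove.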

k3s-one is successful when the physical node is replaceable and the human can leave with identity, keys, and backups.