backup and restore

this cluster should make the homelab easier to abandon, not harder to leave.

assume the node can be lost, the edge vps can be replaced, and the cluster can be rebuilt from git. only deliberate state gets protected.

guiding model

split every workload into one of four classes:

stateless / rebuildable: manifests and config live in git. no data backup required.
stateful small: small databases, config, notes, credentials, and agent state. back up with restic and test restores.
stateful large: photos, media, object storage, packages, artifacts. protect only what is irreplaceable or explicitly worth the cost.
disposable: caches, workdirs, generated files, temporary environments, and job scratch space. do not back up.

default backup pattern

for single-node k3s, prefer host-level restic first.

back up the host paths that contain cluster state and persistent volumes instead of deploying a backup product on day one:

/etc/rancher/k3s/
/var/lib/rancher/k3s/server/db/ for sqlite cluster state
/var/lib/rancher/k3s/server/token, which k3s needs in order to read a restored datastore
/var/lib/rancher/k3s/server/tls/ if full server recovery is desired
/srv/k3s/volumes/ for app pvc data
sops age private key(s)
flux bootstrap metadata and repo deploy key material
any host-level blocky/caddy/cloudflared/tailscale config that is not in git

kubernetes cronjobs running restic are fine later for app-specific schedules, but host-level restic is simpler and better aligned with this one-node design. velero with restic/kopia is useful if the cluster grows, but is heavier than needed for v1.

restic sketch

choose a real target before production workloads: backblaze b2, s3-compatible storage, a nas, or another restic-compatible destination.

# s3 targets also need AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the environment
export RESTIC_REPOSITORY="s3:s3.example.com/k3s-one"
export RESTIC_PASSWORD_FILE="/root/.config/restic/k3s-one-password"

restic backup \
  /etc/rancher/k3s \
  /var/lib/rancher/k3s/server/db \
  /var/lib/rancher/k3s/server/token \
  /var/lib/rancher/k3s/server/tls \
  /srv/k3s/volumes

restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
restic check
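
the commands above can be wrapped in functions so that a missing path is reported and a failed step stops the run. this is a minimal sketch, not a finished script: `existing_paths` and `run_backup` are hypothetical names, and it assumes the repository variables are already exported and that no backup path contains spaces.

```shell
# sketch of a host-level backup run; assumes RESTIC_REPOSITORY and
# RESTIC_PASSWORD_FILE are already exported.

# paths taken from the sketch above; adjust to the real host layout.
BACKUP_PATHS="/etc/rancher/k3s /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/tls /srv/k3s/volumes"

# print the paths that exist, one per line, and warn on stderr about the rest.
existing_paths() {
  for p in "$@"; do
    if [ -e "$p" ]; then
      printf '%s\n' "$p"
    else
      printf 'warning: skipping missing path %s\n' "$p" >&2
    fi
  done
}

# back up whatever exists, then apply retention and verify the repository.
run_backup() {
  # word splitting is intentional here: the paths contain no spaces.
  paths=$(existing_paths $BACKUP_PATHS)
  [ -n "$paths" ] || { echo "nothing to back up" >&2; return 1; }
  restic backup $paths &&
    restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune &&
    restic check
}
```

once the repository has been created with `restic init`, `run_backup` can be called from a root cron entry or a systemd timer.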

sqlite note

single-node k3s uses sqlite by default. copying a sqlite file while it is being written can capture an inconsistent state, so prefer a stop-aware backup or a filesystem snapshot. at minimum, run backups during low write activity and test the restore.
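
a stop-aware backup could look like the sketch below. it assumes the systemd unit is named `k3s` and that the restic variables are already exported; `backup_k3s_db_stopped` and the `k3s-db` tag are hypothetical names. stopping the k3s service stops the api server and its datastore writes, but running containers keep going, so app volumes may still change during the backup.

```shell
# sketch: stop k3s, back up the sqlite datastore, then always start k3s again.
# assumes a systemd unit named k3s and an already configured restic environment.
K3S_DB_DIR="/var/lib/rancher/k3s/server/db"

backup_k3s_db_stopped() {
  systemctl stop k3s || return 1
  # remember the backup's exit status so k3s is restarted even if it fails.
  status=0
  restic backup --tag k3s-db "$K3S_DB_DIR" || status=$?
  systemctl start k3s || return 1
  return "$status"
}
```

the function returns restic's exit status, so a scheduler can alert on a failed backup even though the service came back up.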

state contracts

before adding an app, add a short state contract:

stateless: yes/no
pvc paths and expected size
backup class: none, daily, hourly, manual export, or special
restore priority: p0, p1, p2, p3
evacuation value: must-have, nice-to-have, or abandon
exclusions: caches, generated files, logs, artifacts, transcodes

example:

workload: vaultwarden
stateless: false
pvcs:
  - /srv/k3s/volumes/vaultwarden/data
backup_class: daily
restore_priority: p0
evacuation_value: must-have
exclusions: []
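
a contract like the one above can be checked mechanically before an app is merged. this is a rough sketch that only greps for top-level keys and does not parse yaml; `check_contract` is a hypothetical name, and storing one contract per file is an assumption.

```shell
# sketch: fail if a state contract file is missing any required top-level key.
# key names follow the vaultwarden example; the one-file-per-app layout is assumed.
REQUIRED_KEYS="workload stateless pvcs backup_class restore_priority evacuation_value exclusions"

check_contract() {
  file=$1
  missing=0
  for key in $REQUIRED_KEYS; do
    # match "key:" at the start of a line, which is how a top-level key appears.
    if ! grep -q "^${key}:" "$file"; then
      printf '%s: missing key %s\n' "$file" "$key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

`check_contract path/to/contract.yaml` returns non-zero and names each missing key, which makes it usable as a pre-commit or ci step.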

workload classification

workload	state class	backup guidance	priority
vaultwarden	stateful small	back up database, attachments, config, and admin token material; test restore before relying on it	p0
hermes	stateful small	if docker-mounted as `/opt/data`, protect config, `.env`/auth tokens, skills, memory, sessions/state db, cron jobs, plugin config	p0
gitea repos	stateful variable	protect git repositories and gitea database; size can grow with repo count	p0/p1
gitea lfs/packages/artifacts/actions logs	stateful large/variable	classify separately from source repos; use retention and exclude what is rebuildable	p1/p2
immich originals	stateful large	protect originals and database; thumbnails/transcodes are rebuildable	p1
hedgedoc	stateful small	back up database/uploads if used	p1
wallabag	stateful small	back up database and config	p1
blocky	mostly stateless	keep config in git; back up only if local runtime state matters	p1
hugo sites	stateless output, stateful source	source repos are protected by gitea/offsite git; generated `public/` output is disposable	p1/p3
chatmail server in go	disposable state	device mailboxes pull messages as they arrive; losing server state is acceptable	p3
jellyfin media	stateful large	separate irreplaceable media from replaceable movies/shows/transcodes; metadata is convenience	p2
owncast	mixed	app config is small; recordings/media only if explicitly valuable	p2
monitoring stack	mixed	dashboards/config in git; long-term metrics optional with retention	p2/p3
caddy	stateless	config in git; certificates are restorable but may be backed up as convenience	p3
gitea runner	disposable	do not back up workspaces, caches, or logs beyond short retention	p3
agents running with hermes	mixed	prompts/config in git; agent memory/state only if deliberately valuable	p2/p3
s3 storage deployment	stateful large	treat as a storage system, not just an app; define bucket-level policies and offsite copy	special
ephemeral dev containers	disposable	no backup; push useful work to git	p3
random jobs	disposable by default	promote to stateful only when outputs are intentionally retained	p3

evacuation priorities

p0: rebuild-critical

vaultwarden
hermes state needed to resume operations
gitea source repositories and database
sops age keys
flux deploy keys and bootstrap notes
cloudflare, dns, registry, and backup repository credentials

p1: important

immich originals and database
hedgedoc
wallabag
blocky/home dns config
hugo source repositories

p2: convenience or bulky value

jellyfin metadata and selected media
owncast config and selected media
monitoring history
gitea packages/artifacts if actually needed
optional agent state

p3: abandon

chatmail server state
gitea runner workdirs
ci caches
generated hugo output
ephemeral dev environments
temp job outputs
monitoring metrics beyond retention

restore drill

backup is not done until restore has been tested.

quarterly drill:

provision a clean debian node or vm.
install k3s, then stop the service so the datastore is not in use during the restore.
restore /etc/rancher/k3s, sqlite state, and selected pvc data from restic, then start k3s again.
restore sops age key and flux deploy key material.
let flux reconcile.
verify p0 apps first: vaultwarden, hermes, gitea.
verify p1 apps next: immich, hedgedoc, wallabag, blocky.
document any manual steps that were not already in git.
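
the restic part of the drill can be sketched as a restore into a staging directory, from which files are copied into place while k3s is stopped. `restore_to_staging` and the `/mnt/restore` target are hypothetical, and `latest` assumes the newest snapshot in the repository is the one wanted.

```shell
# sketch: restore selected paths from the newest snapshot into a staging directory.
# assumes the restic environment is configured; the target is a placeholder.
RESTORE_TARGET="/mnt/restore"

restore_to_staging() {
  # list snapshots first so the drill log records what was available.
  restic snapshots &&
    restic restore latest --target "$RESTORE_TARGET" \
      --include /etc/rancher/k3s \
      --include /var/lib/rancher/k3s/server/db \
      --include /srv/k3s/volumes/vaultwarden/data
}
```

restic recreates the full original path under the target, so the k3s config ends up in `/mnt/restore/etc/rancher/k3s`.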

grab-and-go implication

in an evacuation, do not depend on the cluster being reachable. keep an encrypted offline or cloud-reachable copy of the p0 material outside the house:

password manager recovery/export path
sops age key recovery path
flux/gitea/cloudflare access path
restic repository credentials and password
one-page rebuild checklist
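
one way to produce the encrypted copy is a single passphrase-protected archive. this is only a sketch: it assumes the `age` cli is installed, `make_p0_bundle` is a hypothetical name, and every path passed to it is a placeholder for wherever the p0 material actually lives.

```shell
# sketch: bundle p0 recovery files into one passphrase-encrypted archive.
# assumes the age cli is available; all input paths are supplied by the caller.
make_p0_bundle() {
  out=$1
  shift
  # refuse to build a partial bundle if any input is missing.
  for f in "$@"; do
    [ -e "$f" ] || { printf 'missing: %s\n' "$f" >&2; return 1; }
  done
  # tar streams the archive to stdout; age -p prompts for a passphrase
  # and writes the encrypted result to $out.
  tar -czf - "$@" | age -p -o "$out"
}
```

for example, `make_p0_bundle p0.tar.gz.age keys.txt rebuild-checklist.txt` writes the bundle, and `age -d p0.tar.gz.age | tar -xzf -` unpacks it again; the file names here are illustrative.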

k3s-one is successful when the physical node is replaceable and the human can leave with identity, keys, and backups.