Storage & Data Management¶
The cluster uses multiple storage backends depending on the use case: local ephemeral storage, distributed block storage, shared filesystems, and network file shares.
Storage Classes¶
Different apps need different types of storage. The cluster provides three storage classes:
| Storage Class | Backend | Use Case | Performance | Redundancy |
|---|---|---|---|---|
| ceph-ssd (default) | Rook-Ceph | Persistent block storage | Medium | 3x replication |
| openebs-hostpath | OpenEBS | Local storage (CNPG, ephemeral) | Fast | None (local disk) |
| nfs-media | NFS Server | Media libraries | Varies | External |
Storage Decision Matrix¶
```mermaid
graph LR
    APP[Application Pod] --> PVC[PersistentVolumeClaim]
    PVC --> SSD[Ceph SSD<br/>ceph-ssd<br/>ceph-blockpool]
    PVC --> NFS[Synology NFS<br/>nfs-media<br/>Media Libraries]
    PVC --> LOCAL[OpenEBS<br/>openebs-hostpath<br/>Local Storage]
    SSD --> BACKUP1[VolSync → Backblaze B2]
    NFS --> BACKUP2[Synology Snapshots]
```

| App Type | Storage Class | Reason |
|---|---|---|
| Databases (PostgreSQL via CNPG) | openebs-hostpath | CNPG manages replication, prefers local fast storage |
| Config & app state | ceph-ssd | Critical persistent data, survives node failure |
| Media libraries (Plex, Jellyfin) | nfs-media | Large capacity from Synology NAS |
| Cache (Victoria Logs, temp data) | openebs-hostpath | Fast local storage, ephemeral |
When to Use Each¶
openebs-hostpath
Use for: Cache, temporary data, non-critical state
Data lives on a single node's local disk. If the node dies, the data is lost. Fast, but tied to a single node.
ceph-ssd
Use for: Databases, application state, anything critical
Block storage replicated across 3 nodes via Rook-Ceph. Data survives node failures. This is the default for most apps.
nfs-media
Use for: Large media libraries (movies, photos, etc.)
Network file share from an external NAS. Large capacity but slower than local storage.
How to Choose
- Can you afford to lose this data? → openebs-hostpath
- Is it critical persistent data? → ceph-ssd
- Database (CNPG manages replication)? → openebs-hostpath
- Is it massive media files? → nfs-media
Rook-Ceph: Distributed Storage¶
Rook-Ceph provides distributed block storage across cluster nodes. Each node contributes Samsung SSDs for Ceph OSDs over a dedicated 2.5GbE network. See Infrastructure Architecture and Ceph Network for configuration details.
Ceph replicates data 3x across nodes (host failure domain). Single node failure doesn't impact data availability.
How Replication Works
When you write to a ceph-ssd PVC:
- Data is written to the primary OSD (Object Storage Daemon) on one node
- Ceph replicates it to two other nodes automatically
- Write is acknowledged only after replication completes
If a node fails, Ceph automatically rebalances data to maintain 3 replicas.
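Requesting replicated storage comes down to the storageClassName on the claim. A minimal sketch (the claim name and size are illustrative):

```yaml
# Sketch: a PVC backed by the replicated Ceph pool (the cluster's default class)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-config   # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-ssd   # 3x replicated across nodes
  resources:
    requests:
      storage: 5Gi   # illustrative size
```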
Monitoring Ceph¶
# Check Ceph cluster status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Check OSD status (storage daemons)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd status
# Check storage usage
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
# Access Rook-Ceph dashboard
# URL: https://rook.${SECRET_DOMAIN}
VolSync Backups¶
VolSync backs up PersistentVolumeClaims to cloud storage (Backblaze B2) using Restic. Apps with critical data include the VolSync component.
How VolSync Works¶
The VolSync component (kubernetes/components/volsync/) creates:
- ReplicationSource: Snapshots the PVC on a schedule
- ReplicationDestination: Restores from the latest snapshot
- ExternalSecret: Restic repository password and B2 credentials
- PVC: Used for restores
Apps include it in their ks.yaml:
spec:
  components:
    - ../../../../components/volsync
  postBuild:
    substitute:
      APP: immich
      VOLSYNC_CAPACITY: 100Gi # Size of the PVC
Backup Schedule¶
The schedule comes from kubernetes/components/volsync/replicationsource.yaml: backups run twice daily. Customize per-app by overriding the schedule in the app's resources.
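The component file isn't reproduced here; a minimal sketch of what a twice-daily ReplicationSource looks like under the VolSync restic mover API (field values are illustrative, not the repo's actual settings):

```yaml
# Sketch of a VolSync ReplicationSource; schedule and retention are illustrative
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: ${APP}
spec:
  sourcePVC: ${APP}
  trigger:
    schedule: "0 */12 * * *"   # twice daily (illustrative cron)
  restic:
    copyMethod: Snapshot
    repository: ${APP}-volsync-secret   # Secret holding the Restic repo URL/password
    retain:
      daily: 7   # illustrative retention
```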
Manual Snapshots¶
Force a backup immediately:
# Trigger snapshot for a specific app
just kube snapshot <namespace> <app-name>
# Trigger all snapshots
just kube snapshot-all
Restoring from Backup¶
To restore an app's data:
- List available snapshots for the app
- Restore from a specific snapshot (e.g., the 2nd most recent): just kube volsync-restore <namespace> <app-name> 2
- The app will restart with data from that snapshot
How Restore Works
The restore process:
- Scales down the app (to avoid file conflicts)
- Creates a new PVC with dataSourceRef pointing to the ReplicationDestination
- Kubernetes Volume Populator fills the PVC from the backup
- App is scaled back up with restored data
The old PVC is renamed, not deleted, so you can roll back if needed.
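The dataSourceRef wiring can be sketched as follows (the destination name is a placeholder; the component's actual naming convention applies):

```yaml
# Sketch: PVC populated from a VolSync ReplicationDestination via Volume Populator
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${APP}
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-ssd
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: ${APP}-dst   # placeholder name
  resources:
    requests:
      storage: 10Gi   # illustrative size
```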
Unlocking Repositories¶
If Restic repositories get locked (usually after an interrupted backup), run the unlock task. It unlocks all repositories and is safe to run at any time.
PostgreSQL (CNPG)¶
CloudNativePG provides HA PostgreSQL clusters. Configured in kubernetes/apps/database/cnpg/.
Current Clusters¶
- pgsql-cluster: Main PostgreSQL 17 cluster for most apps (Authentik, Gatus, etc.)
- immich17: PostgreSQL 17 cluster for Immich with vectorchord extension
- Located in the database namespace
- Storage: openebs-hostpath (20Gi per instance, 3 instances)
- Backups: Barman-cloud to Backblaze B2
Automatic User Provisioning¶
Apps using PostgreSQL include the CNPG component (kubernetes/components/cnpg/):
spec:
  components:
    - ../../../../components/cnpg
  postBuild:
    substitute:
      APP: authentik
      CNPG_NAME: pgsql-cluster
This automatically:
- Creates a database user named authentik
- Creates a database named authentik
- Generates a Kubernetes Secret with keys username, password, and uri (connection string)
The app references this secret in its Helm values:
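For example (the secret name and environment variable are illustrative; the component's actual naming convention applies):

```yaml
# Sketch: consuming the generated secret's "uri" key in Helm values
env:
  - name: DATABASE_URI            # variable name illustrative
    valueFrom:
      secretKeyRef:
        name: authentik-pguser-secret   # secret name assumed, follows the component's convention
        key: uri                        # connection string generated by the CNPG component
```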
CNPG Backups¶
CNPG includes built-in backups via Barman Cloud to S3-compatible storage. Defined in the Cluster CRD.
To manually trigger a backup:
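One declarative option is CNPG's Backup resource, which targets a cluster by name (the resource name here is illustrative):

```yaml
# Sketch: on-demand CNPG backup of the main cluster
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pgsql-cluster-manual   # illustrative name
  namespace: database
spec:
  cluster:
    name: pgsql-cluster
```

The cnpg kubectl plugin's backup subcommand achieves the same thing imperatively.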
To restore:
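Recovery in CNPG is declarative: a new Cluster bootstraps from the object-store backup of the old one. A sketch, with bucket, endpoint, and credentials elided as placeholders:

```yaml
# Sketch: CNPG cluster bootstrapped from a Barman object-store backup
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pgsql-cluster
  namespace: database
spec:
  instances: 3
  bootstrap:
    recovery:
      source: source-cluster
  externalClusters:
    - name: source-cluster
      barmanObjectStore:
        destinationPath: s3://<bucket>/<path>   # placeholder
        endpointURL: <b2-s3-endpoint>           # placeholder
        # s3Credentials omitted; supplied via the cluster's ExternalSecret
```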
PostgreSQL 17 Upgrade¶
The cluster was upgraded from PostgreSQL 16 to PostgreSQL 17 in December 2024. Key changes:
- Cluster name: pgsql-cluster (running PostgreSQL 17)
- Image: ghcr.io/cloudnative-pg/postgresql:17
- Migration approach: blue-green migration pattern with restoration from backups
- Located in: kubernetes/apps/database/cnpg/pgsql-cluster/
The immich17 cluster is a separate PostgreSQL 17 instance for Immich-specific database requirements.
For detailed migration procedures, see Issue #1211.
Dragonfly: Redis Cache¶
Dragonfly is a Redis-compatible in-memory cache. Deployed in kubernetes/apps/database/dragonfly/.
Apps connect to it via its in-cluster Service address.
Database Indices¶
Multiple apps share the same Dragonfly instance but use different database indices:
- DB 0: Default
- DB 1: Authentik (unused as of Authentik 2025.10 which removed Redis support)
- DB 2: Immich
- DB 3: Searxng
This prevents key collisions between apps.
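Assuming the Service is named dragonfly in the database namespace (inferred from the deploy path), an app selects its index with the trailing path segment of the Redis URI:

```yaml
# Illustrative env wiring — Service DNS name and variable name are assumptions
env:
  - name: REDIS_URL
    value: redis://dragonfly.database.svc.cluster.local:6379/2   # "/2" selects DB index 2 (Immich)
```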
Storage Operations¶
Browsing PVC Contents¶
To inspect what's inside a PVC:
This mounts the PVC to a debug pod and drops you into a shell. Useful for diagnosing storage issues.
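When the helper isn't available, the same effect can be achieved with a hand-written debug pod (a sketch; names are placeholders):

```yaml
# Sketch: one-off debug pod that mounts an existing PVC at /data
apiVersion: v1
kind: Pod
metadata:
  name: pvc-debug        # placeholder
  namespace: <namespace>
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]   # keep the pod alive; exec in with kubectl
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <claim-name>
```

Then kubectl exec into the pod and inspect /data; delete the pod when done.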
Checking PVC Usage¶
# List all PVCs
kubectl get pvc -A
# Show PVC details (including size and usage)
kubectl describe pvc <claim-name> -n <namespace>
# Check actual disk usage from inside a pod
kubectl exec -it <pod-name> -n <namespace> -- df -h
Expanding a PVC¶
To increase a PVC's size:
- Edit the Helm values or PVC manifest to increase storage
- Apply the changes via Git
- Kubernetes automatically expands the volume (if the StorageClass supports it)
Most storage classes (ceph-ssd, cephfs) support expansion. openebs-hostpath does not.
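Only the requested size changes in the manifest (sizes here are illustrative):

```yaml
# Sketch: bump the request and commit; expansion happens online for ceph-ssd
spec:
  resources:
    requests:
      storage: 20Gi   # was 10Gi
```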
Deleting a PVC¶
Deleting a PVC deletes the underlying data! Always make sure you have backups.
# Scale down the app first
kubectl scale deployment <app> --replicas=0 -n <namespace>
# Delete the PVC
kubectl delete pvc <claim-name> -n <namespace>
# Scale the app back up
kubectl scale deployment <app> --replicas=1 -n <namespace>
The app will recreate the PVC on startup if it's defined in the Helm chart.
Disaster Recovery¶
Full Backup Strategy¶
- Application data: VolSync backs up PVCs to Backblaze B2
- Database dumps: CNPG backs up PostgreSQL to S3
- Configuration: Git (this repository) is the source of truth
- Secrets: Stored in Akeyless (cloud secrets manager)
To restore the entire cluster:
- Rebuild the cluster: just bootstrap default
- Restore application data: just kube volsync-restore <namespace> <app> 1
- Restore databases: kubectl cnpg restore ...
Testing Restores¶
Regularly test restores to ensure backups work:
- Spin up a test namespace
- Restore VolSync backups into test PVCs
- Verify data integrity
Next Steps¶
- Operations Guide: Day-to-day maintenance
- Troubleshooting: Common storage issues