Storage & Data Management

The cluster uses multiple storage backends depending on the use case: local ephemeral storage, distributed block storage, shared filesystems, and network file shares.

Storage Classes

Different apps need different types of storage. The cluster provides three storage classes:

| Storage Class | Backend | Use Case | Performance | Redundancy |
|---|---|---|---|---|
| ceph-ssd (default) | Rook-Ceph | Persistent block storage | Medium | 3x replication |
| openebs-hostpath | OpenEBS | Local storage (CNPG, ephemeral) | Fast | None (local disk) |
| nfs-media | NFS Server | Media libraries | Varies | External |

Storage Decision Matrix

graph LR
    APP[Application Pod] --> PVC[PersistentVolumeClaim]

    PVC --> SSD[Ceph SSD<br/>ceph-ssd<br/>ceph-blockpool]
    PVC --> NFS[Synology NFS<br/>nfs-media<br/>Media Libraries]
    PVC --> LOCAL[OpenEBS<br/>openebs-hostpath<br/>Local Storage]

    SSD --> BACKUP1[VolSync → Backblaze B2]
    NFS --> BACKUP2[Synology Snapshots]

| App Type | Storage Class | Reason |
|---|---|---|
| Databases (PostgreSQL via CNPG) | openebs-hostpath | CNPG manages replication, prefers local fast storage |
| Config & app state | ceph-ssd | Critical persistent data, survives node failure |
| Media libraries (Plex, Jellyfin) | nfs-media | Large capacity from Synology NAS |
| Cache (Victoria Logs, temp data) | openebs-hostpath | Fast local storage, ephemeral |

When to Use Each

openebs-hostpath

Use for: Cache, temporary data, non-critical state

Data lives on a single node's local disk. If the node dies, data is lost. Fast, but not persistent.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cache
spec:
  storageClassName: openebs-hostpath
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi

ceph-ssd

Use for: Databases, application state, anything critical

Block storage replicated across 3 nodes via Rook-Ceph. Data survives node failures. This is the default for most apps.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  storageClassName: ceph-ssd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi

nfs-media

Use for: Large media libraries (movies, photos, etc.)

Network file share from an external NAS. Large capacity but slower than local storage.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-library
spec:
  storageClassName: nfs-media
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 1Ti

How to Choose

  • Can you afford to lose this data? → openebs-hostpath
  • Is it critical persistent data? → ceph-ssd
  • Database (CNPG manages replication)? → openebs-hostpath
  • Massive media files? → nfs-media

Rook-Ceph: Distributed Storage

Rook-Ceph provides distributed block storage across cluster nodes. Each node contributes Samsung SSDs for Ceph OSDs over a dedicated 2.5GbE network. See Infrastructure Architecture and Ceph Network for configuration details.

Ceph replicates data 3x across nodes (host failure domain). Single node failure doesn't impact data availability.

How Replication Works

When you write to a ceph-ssd PVC:

  1. Data is written to the primary OSD (Object Storage Daemon) on one node
  2. Ceph replicates it to two other nodes automatically
  3. Write is acknowledged only after replication completes

If a node fails, Ceph automatically rebalances data to maintain 3 replicas.
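To confirm the replication factor described above, you can query the pool directly from the toolbox pod. This is a sketch: the pool name ceph-blockpool is taken from the decision-matrix diagram and may differ in your cluster.

```shell
# Replica count per object (should print: size: 3)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool get ceph-blockpool size

# Minimum replicas required to acknowledge a write
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool get ceph-blockpool min_size
```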

Monitoring Ceph

# Check Ceph cluster status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

# Check OSD status (storage daemons)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd status

# Check storage usage
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df

# Access Rook-Ceph dashboard
# URL: https://rook.${SECRET_DOMAIN}

VolSync Backups

VolSync backs up PersistentVolumeClaims to cloud storage (Backblaze B2) using Restic. Apps with critical data include the VolSync component.

How VolSync Works

The VolSync component (kubernetes/components/volsync/) creates:

  1. ReplicationSource: Snapshots the PVC on a schedule
  2. ReplicationDestination: Restores from the latest snapshot
  3. ExternalSecret: Restic repository password and B2 credentials
  4. PVC: Used for restores

Apps include it in their ks.yaml:

spec:
  components:
    - ../../../../components/volsync
  postBuild:
    substitute:
      APP: immich
      VOLSYNC_CAPACITY: 100Gi  # Size of the PVC

Backup Schedule

From kubernetes/components/volsync/replicationsource.yaml:

spec:
  trigger:
    schedule: "0 */12 * * *"  # Every 12 hours

Backups run twice daily. Customize per-app by overriding in the app's resources.
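As a sketch of such an override, an app can patch its ReplicationSource with its own cron schedule (the app name and the daily 04:00 schedule here are illustrative, not taken from this repository):

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: immich  # hypothetical app name
spec:
  trigger:
    schedule: "0 4 * * *"  # daily at 04:00 instead of every 12 hours
```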

Manual Snapshots

Force a backup immediately:

# Trigger snapshot for a specific app
just kube snapshot <namespace> <app-name>

# Trigger all snapshots
just kube snapshot-all
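Under the hood, these recipes presumably use VolSync's manual trigger, which fires an immediate sync whenever the trigger string changes. The equivalent hand-written patch to a ReplicationSource would look like this (the trigger value is arbitrary):

```yaml
spec:
  trigger:
    manual: snapshot-2024-12-01  # any new string value triggers an immediate backup
```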

Restoring from Backup

To restore an app's data:

  1. List available snapshots:

    just kube volsync-list <namespace> <app-name>
    
  2. Restore from a specific snapshot (e.g., 2nd most recent):

    just kube volsync-restore <namespace> <app-name> 2
    
  3. The app will restart with data from that snapshot

How Restore Works

The restore process:

  1. Scales down the app (to avoid file conflicts)
  2. Creates a new PVC with dataSourceRef pointing to ReplicationDestination
  3. Kubernetes Volume Populator fills the PVC from the backup
  4. App is scaled back up with restored data

The old PVC is renamed, not deleted, so you can roll back if needed.
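The populated PVC from step 2 looks roughly like the following. Names and sizes are illustrative, and the ReplicationDestination name follows the common VolSync convention of suffixing the app name, which may differ here:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: immich  # hypothetical app PVC
spec:
  storageClassName: ceph-ssd
  accessModes: [ReadWriteOnce]
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: immich-dst  # hypothetical destination name
  resources:
    requests:
      storage: 100Gi
```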

Unlocking Repositories

If Restic repositories get locked (due to interrupted backups):

just kube volsync-unlock

This unlocks all repositories. Safe to run anytime.

PostgreSQL (CNPG)

CloudNativePG provides HA PostgreSQL clusters. Configured in kubernetes/apps/database/cnpg/.

Current Clusters

  • pgsql-cluster: Main PostgreSQL 17 cluster for most apps (Authentik, Gatus, etc.)
  • immich17: PostgreSQL 17 cluster for Immich with vectorchord extension
  • Located in database namespace
  • Storage: openebs-hostpath (20Gi per instance, 3 instances)
  • Backups: Barman-cloud to Backblaze B2

Automatic User Provisioning

Apps using PostgreSQL include the CNPG component (kubernetes/components/cnpg/):

spec:
  components:
    - ../../../../components/cnpg
  postBuild:
    substitute:
      APP: authentik
      CNPG_NAME: pgsql-cluster

This automatically:

  1. Creates a database user named authentik
  2. Creates a database named authentik
  3. Generates a Kubernetes Secret with:
    • username
    • password
    • uri (connection string)

The app references this secret in its Helm values:

envFrom:
  - secretRef:
      name: authentik-pguser-secret

CNPG Backups

CNPG includes built-in backups via Barman Cloud to S3-compatible storage, defined in the Cluster resource.

To manually trigger a backup:

kubectl cnpg backup <cluster-name> -n database

To restore, create a new Cluster that bootstraps from a backup via the bootstrap.recovery stanza; CNPG restores into a fresh cluster rather than in place.
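A hedged sketch of such a recovery cluster, following CNPG's bootstrap.recovery pattern (cluster name, bucket path, endpoint, and secret names are all illustrative, not taken from this repository):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pgsql-cluster-restore  # new cluster that receives the restored data
  namespace: database
spec:
  instances: 3
  storage:
    storageClass: openebs-hostpath
    size: 20Gi
  bootstrap:
    recovery:
      source: pgsql-cluster  # must match an entry in externalClusters
  externalClusters:
    - name: pgsql-cluster
      barmanObjectStore:
        destinationPath: s3://bucket/path  # illustrative B2 path
        endpointURL: https://s3.us-west-004.backblazeb2.com  # illustrative endpoint
        s3Credentials:
          accessKeyId:
            name: cnpg-backup-secret  # hypothetical secret name
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-backup-secret
            key: ACCESS_SECRET_KEY
```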

PostgreSQL 17 Upgrade

The cluster was upgraded from PostgreSQL 16 to PostgreSQL 17 in December 2024. Key changes:

  • Cluster name: pgsql-cluster (running PostgreSQL 17)
  • Image: ghcr.io/cloudnative-pg/postgresql:17
  • Migration approach: Blue-green migration pattern with restoration from backups
  • Located in: kubernetes/apps/database/cnpg/pgsql-cluster/

The immich17 cluster is a separate PostgreSQL 17 instance for Immich-specific database requirements.

For detailed migration procedures, see Issue #1211.

Dragonfly: Redis Cache

Dragonfly is a Redis-compatible in-memory cache. Deployed in kubernetes/apps/database/dragonfly/.

Apps connect to it via:

dragonfly-cluster.database.svc.cluster.local:6379

Database Indices

Multiple apps share the same Dragonfly instance but use different database indices:

  • DB 0: Default
  • DB 1: Authentik (unused as of Authentik 2025.10 which removed Redis support)
  • DB 2: Immich
  • DB 3: Searxng

This prevents key collisions between apps.
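Selecting an index is just the trailing database number on the connection URL. For example, an app pinned to DB 2 might be configured like this (the environment variable name varies by app):

```yaml
env:
  - name: REDIS_URL  # variable name is app-specific
    value: redis://dragonfly-cluster.database.svc.cluster.local:6379/2  # DB index 2 (Immich)
```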

Storage Operations

Browsing PVC Contents

To inspect what's inside a PVC:

just kube browse-pvc <namespace> <claim-name>

This mounts the PVC to a debug pod and drops you into a shell. Useful for diagnosing storage issues.
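Without the just recipe, the same can be done by hand with a throwaway pod that mounts the claim (illustrative manifest; the claim name is a placeholder, and the pod should be deleted when done):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-browser
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]  # keep the pod alive long enough to exec in
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-claim  # hypothetical claim name
```

Then shell in with kubectl exec -it pvc-browser -- sh and inspect /data.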

Checking PVC Usage

# List all PVCs
kubectl get pvc -A

# Show PVC details (including size and usage)
kubectl describe pvc <claim-name> -n <namespace>

# Check actual disk usage from inside a pod
kubectl exec -it <pod-name> -n <namespace> -- df -h

Expanding a PVC

To increase a PVC's size:

  1. Edit the Helm values or PVC manifest to increase storage
  2. Apply the changes via Git
  3. Kubernetes automatically expands the volume (if the StorageClass supports it)

Most storage classes (ceph-ssd, cephfs) support expansion. openebs-hostpath does not.
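Whether a class supports expansion is governed by the allowVolumeExpansion field on its StorageClass, which can be checked directly:

```shell
# Prints "true" if the StorageClass supports online volume expansion
kubectl get storageclass ceph-ssd -o jsonpath='{.allowVolumeExpansion}'
```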

Deleting a PVC

Warning: deleting a PVC deletes the underlying data! Always make sure you have backups first.

# Scale down the app first
kubectl scale deployment <app> --replicas=0 -n <namespace>

# Delete the PVC
kubectl delete pvc <claim-name> -n <namespace>

# Scale the app back up
kubectl scale deployment <app> --replicas=1 -n <namespace>

The app will recreate the PVC on startup if it's defined in the Helm chart.

Disaster Recovery

Full Backup Strategy

  1. Application data: VolSync backs up PVCs to Backblaze B2
  2. Database dumps: CNPG backs up PostgreSQL to S3
  3. Configuration: Git (this repository) is the source of truth
  4. Secrets: Stored in Akeyless (cloud secrets manager)

To restore the entire cluster:

  1. Rebuild the cluster: just bootstrap default
  2. Restore application data: just kube volsync-restore <namespace> <app> 1
  3. Restore databases: kubectl cnpg restore ...

Testing Restores

Regularly test restores to ensure backups work:

  1. Spin up a test namespace
  2. Restore VolSync backups into test PVCs
  3. Verify data integrity

Next Steps