Operations Overview¶

Day-to-day operations are GitOps-driven: Renovate opens PRs for updates, Flux applies changes, and alerts notify when intervention is needed. This guide lists the common workflows and commands you’ll use.

Normal workflow¶

graph LR
    A[Edit YAML] --> B[Commit & Push]
    B --> C[GitHub]
    C --> D[Flux Detects Change]
    D --> E[Flux Applies]
    E --> F[Cluster Updated]

Monitoring¶

Passive: push notifications on alerts
Status page: https://status.t0m.co
AlertManager: https://am.t0m.co (only available w/in home LAN)

Use alerts and the status page as the primary indicators; dig into Grafana/AlertManager when needed.

Common Tasks¶

Updating an Existing App¶

Most app updates come automatically via Renovate, but sometimes you need to change configuration:

Edit the app's Helm values in kubernetes/apps/<namespace>/<app>/app/helmrelease.yaml
Commit and push
Optionally force sync: just kube ks-reconcile <namespace> <app>
Monitor: kubectl rollout status deployment/<app> -n <namespace> -w

Real Example: Increasing Plex Memory

# 1. Edit Helm values
nano kubernetes/apps/media/plex/app/helmrelease.yaml
# Change resources.limits.memory: 8Gi

# 2. Commit and push
git add kubernetes/apps/media/plex/
git commit -m "feat(plex): increase memory limit to 8Gi"
git push

# 3. Force reconcile (optional, otherwise waits up to 1 hour)
just kube ks-reconcile media plex

# 4. Watch rollout
kubectl rollout status deployment/plex -n media -w

Adding a New App¶

Copy an existing app directory as a template:

cp -r kubernetes/apps/default/vaultwarden kubernetes/apps/default/myapp

Update the files:
- ks.yaml: Change app name, dependencies, substitutions
- app/helmrelease.yaml: Configure chart and values
- app/ocirepository.yaml: Point to correct chart repository
- app/externalsecret.yaml: Add secrets from aKeyless (if needed)
- namespace.yaml: Update namespace (if using a new one)
Test locally (optional but recommended):
```
just kube apply-ks <namespace> <app>
```
Commit and push
Force reconcile: just kube ks-reconcile <namespace> <app>

Using Components

Most apps need common patterns. Include components in ks.yaml:

../../../../components/cnpg - If app needs PostgreSQL database
../../../../components/ext-auth-external - If app needs SSO protection
../../../../components/volsync - If app has data to backup

Forcing Updates¶

Sometimes Flux needs a nudge:

# Force reconcile a specific app
just kube ks-reconcile <namespace> <app>

# Force sync all Kustomizations
just kube sync-all-ks

# Force sync all HelmReleases
just kube sync-all-hr

# Restart failed Kustomizations
just kube ks-restart

# Restart failed HelmReleases
just kube hr-restart

Viewing Secrets¶

Secrets are base64-encoded in Kubernetes. To decode them:

just kube view-secret <namespace> <secret-name>

This decodes all keys and shows them in plain text.

Scaling Apps¶

# Scale up/down manually
kubectl scale deployment/<app> --replicas=3 -n <namespace>

# Or edit the Helm values and let Flux apply it

For apps with KEDA auto-scaling, you can suspend/resume:

# Suspend auto-scaling for one app
just kube keda suspend <namespace> <app>

# Resume auto-scaling
just kube keda resume <namespace> <app>

# Suspend/resume all
just kube keda-all suspend
just kube keda-all resume

Backup & Restore¶

VolSync handles PVC snapshots (default: every 6h). Useful commands:

just kube snapshot <ns> <app>      # snapshot one app
just kube snapshot-all             # snapshot all
just kube volsync-list <ns> <app>  # list snapshots
just kube volsync-restore <ns> <app> <n> # restore (1 = latest)

Postgres (CNPG) backups to S3(minio) are automatic. Useful commands:

kubectl cnpg backup pgsql-cluster -n database
kubectl cnpg restore pgsql-cluster --backup <name> -n database

System Upgrades (Talos & Kubernetes)¶

Upgrades are automated by tuppr via Renovate PRs (Talos → Kubernetes). Typical flow:

Renovate opens PRs for tool and platform versions
Merge tool updates (.mise.toml) first
Merge tuppr/Talos PRs; tuppr performs rolling upgrades and health checks

Manual fallback:

# Upgrade Kubernetes manually
just talos upgrade-k8s 1.35.0

# Upgrade Talos on a single node
just talos upgrade-node k8s-1

tuppr can also downgrade if needed—just update the version in the TalosUpgrade/KubernetesUpgrade CRD to an earlier version and commit it.

Node Maintenance¶

Rebooting Nodes¶

For manual reboots (not upgrades), reboot one at a time:

just talos reboot-node k8s-1
kubectl wait --for=condition=ready node/k8s-1 --timeout=10m

just talos reboot-node k8s-2
kubectl wait --for=condition=ready node/k8s-2 --timeout=10m

just talos reboot-node k8s-3
kubectl wait --for=condition=ready node/k8s-3 --timeout=10m

Kubernetes reschedules pods to healthy nodes during reboots automatically.

Observability¶

Dashboards¶

Grafana: https://grafana.t0m.co - Metrics and dashboards
Prometheus: https://prometheus.t0m.co - Metrics database (queried via Grafana)
VictoriaLogs: https://logs.t0m.co - Log search GUI (internal network only)
AlertManager: https://am.t0m.co - Active alerts
Karma: https://karma.t0m.co - AlertManager UI with silencing
Gatus: https://status.t0m.co - Service status page

Checking Logs¶

GUI:

VictoriaLogs at https://logs.t0m.co provides a searchable interface for all cluster logs (accessible from internal network only)

CLI:

# View logs from a deployment
kubectl logs -n <namespace> deployment/<app> --tail=100 -f

# View logs from a specific pod
kubectl logs -n <namespace> <pod-name> -f

# View previous crash logs
kubectl logs -n <namespace> <pod-name> --previous

# View logs from all containers in a pod
kubectl logs -n <namespace> <pod-name> --all-containers

Checking Events¶

Events show what Kubernetes is doing:

# Recent cluster-wide events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Events for a specific namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Events for a specific resource
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

GitOps Workflows¶

Renovate Auto-Updates¶

Renovate runs hourly via GitHub Actions and creates PRs for:

Container image updates
Helm chart updates
Talos/Kubernetes versions (via tuppr)
CLI tool versions (in .mise.toml)
GitHub Actions versions

Auto-merge behavior (configured in .renovate/autoMerge.json5):

Auto-merged:
- Container digest updates for trusted packages
- Minor/patch updates for specific OCI charts (kube-prometheus-stack, grafana)
- GitHub Actions updates for trusted actions (actions/, renovatebot/)
- Minor/patch updates for specific GitHub releases (external-dns, gateway-api, prometheus-operator)
Manual review required:
- Major version updates
- Most application updates
- Talos/Kubernetes upgrades

Check open PRs: https://github.com/tscibilia/home-ops/pulls

Typical PR merge order:

Merge .mise.toml tool updates first (so you have compatible kubectl/talosctl)
Merge Talos upgrade PR (tuppr handles the rollout)
Merge Kubernetes upgrade PR (tuppr handles the rollout)
Merge app updates

Manual Dependency Updates¶

To update a chart version manually:

kubernetes/apps/default/myapp/ks.yaml

postBuild:
  substitute:
    MYAPP_VERSION: "2.1.0"  # Change version here

Renovate watches these substitution values and updates them automatically via annotations:

# renovate: datasource=docker depName=ghcr.io/user/image
MYAPP_VERSION: "2.1.0"

Emergency Procedures¶

Suspend Flux (Stop All Automation)¶

If something's going wrong and you need to stop Flux from making changes:

flux suspend kustomization --all

Resume when ready:

flux resume kustomization --all

Rollback a Broken Change¶

# Revert the Git commit
git revert HEAD
git push

# Force Flux to sync
just kube ks-reconcile <namespace> <app>

Flux applies the revert, rolling back to the previous working state.

Cluster-Wide Restart¶

If the cluster is in a weird state:

# Restart all nodes (one at a time)
just talos reboot-node k8s-1
kubectl wait --for=condition=ready node/k8s-1 --timeout=10m

just talos reboot-node k8s-2
kubectl wait --for=condition=ready node/k8s-2 --timeout=10m

just talos reboot-node k8s-3
kubectl wait --for=condition=ready node/k8s-3 --timeout=10m

Best Practices¶

Always commit changes to Git - Don't kubectl edit manually unless debugging
Test locally with just kube apply-ks before pushing to Git
Watch rollouts after changes (kubectl rollout status)
Trust the self-healing - Don't intervene unless alerts fire
Merge mise PRs first - Keep your local tools compatible with the cluster
tuppr handles upgrades - Don't manually upgrade Talos/Kubernetes unless needed
One node at a time - Never reboot multiple control plane nodes simultaneously
Check dependencies - If an app won't deploy, check dependsOn resources are healthy

Next Steps¶

Troubleshooting Guide: When things go wrong
Task Runner Reference: Full command reference