
Troubleshooting

Organized by area. Each entry: what you see, what to check, how to fix it.

Flux / GitOps

| Symptom | Check | Fix |
|---|---|---|
| App not updating after push | `kubectl get ks -A \| grep -v True` | `just kube ks-reconcile <ns> <app>` |
| HelmRelease stuck | `kubectl get hr -n <ns> <name>` | Delete the HR with `kubectl delete hr <name> -n <ns>`; Flux recreates it |
| `kubectl edit` changes reverted | Expected: Flux owns all state | Edit in Git, push, reconcile |
| Kustomization dependency failed | Check the dependency: `kubectl get ks -n <dep-ns> <dep-name>` | Fix the upstream dependency first |
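
The `grep -v True` check above also prints the header row and drops any row that mentions "True" anywhere in its status message. A slightly stricter filter is sketched below. It assumes the usual `kubectl get` table layout, where a `READY` column appears in the header and the columns before it contain no spaces; `not_ready` is a hypothetical helper name, not part of the repository's `just` recipes.

```shell
# Hypothetical helper: read `kubectl get ks -A` (or `kubectl get hr -A`)
# output on stdin and print namespace/name for rows whose READY column
# is anything other than "True".
not_ready() {
  awk '
    NR == 1 {
      # Locate the READY column from the header instead of hard-coding it.
      for (i = 1; i <= NF; i++) if ($i == "READY") col = i
      next
    }
    col && $col != "True" { print $1 "/" $2 }
  '
}

# Usage (requires cluster access):
#   kubectl get ks -A | not_ready
```

Each printed `namespace/name` pair can then be passed to `just kube ks-reconcile <ns> <app>`.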

Networking

| Symptom | Check | Fix |
|---|---|---|
| LoadBalancer IP unreachable | `kubectl get svc -A -o wide \| grep LoadBalancer` | Check `CiliumL2AnnouncementPolicy` and `CiliumLoadBalancerIPPool` |
| External DNS not resolving | `kubectl logs -n network deploy/external-dns` | Verify the Cloudflare API token in the secret, check that the HTTPRoute exists |
| LAN DNS not resolving | `kubectl logs -n network deploy/unifi-dns` | Check UniFi controller connectivity |
| External access broken | `kubectl get pods -n network -l app=cloudflared` | Verify cloudflared tunnel status, check the Cloudflare dashboard |
| Cert not issuing | `kubectl get cert -A`, `kubectl get challenges -A` | Check cert-manager logs, verify Cloudflare DNS-01 permissions |
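
For the LoadBalancer row, one quick way to separate "no IP was ever assigned" from "IP assigned but not announced" is to look for services whose `EXTERNAL-IP` still shows `<pending>`. The filter below is a minimal sketch under the assumption that the `kubectl get svc -A -o wide` header contains `TYPE` and `EXTERNAL-IP` columns; `lb_pending` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get svc -A -o wide` output on stdin and
# print namespace/name for LoadBalancer services with no external IP yet.
lb_pending() {
  awk '
    NR == 1 {
      # Find the relevant columns from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "TYPE") t = i
        if ($i == "EXTERNAL-IP") ip = i
      }
      next
    }
    t && ip && $t == "LoadBalancer" && $ip == "<pending>" { print $1 "/" $2 }
  '
}

# Usage (requires cluster access):
#   kubectl get svc -A -o wide | lb_pending
```

A service listed here most likely points at the IP pool side (`CiliumLoadBalancerIPPool`); a service with an IP that is still unreachable points more toward the announcement side (`CiliumL2AnnouncementPolicy`).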

Remember: There is no kube-proxy. Cilium is the eBPF replacement. Use cilium CLI or Hubble for network debugging, not iptables.

Storage

| Symptom | Check | Fix |
|---|---|---|
| Ceph cluster degraded | `kubectl get cephcluster -n rook-ceph` | Check OSD pods, node status, and disk health via Scrutiny |
| PVC stuck Pending | `kubectl describe pvc -n <ns> <name>` | Verify the storage class exists and Ceph has capacity |
| NFS mount errors | Check pod events: `kubectl describe pod -n <ns> <pod>` | Verify TrueNAS is reachable and the NFS share exists |
| VolSync backup failing | `kubectl get replicationsource -n <ns>` | Check restic repo locks: `just kube volsync-unlock` |
| VolSync restore needed | `just kube volsync-list <ns> <name>` | `just kube volsync-restore <ns> <name> <previous>` |

CNPG

| Symptom | Check | Fix |
|---|---|---|
| Cluster unhealthy | `kubectl get cluster -n database` | Check `.status.conditions`, inspect pod logs |
| Wrong CNPG image used | Verify cluster name | `pgsql-cluster` = standard PG17, `immich17` = vectorchord |
| Need full recovery | Backups are in B2 | `just bootstrap cnpg`; don't manually recreate clusters |
| Connection refused | Check endpoint: `kubectl get svc -n database \| grep rw` | Always use the `-rw` service for app connections |
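
To illustrate the `-rw` rule: CloudNativePG exposes a `<cluster>-rw` service that always points at the current primary, so apps should connect to that name rather than to a pod or to the read-only services. The sketch below builds the in-cluster hostname, assuming the default cluster domain `cluster.local` and the `database` namespace used above; `cnpg_rw_host` is a hypothetical helper name.

```shell
# Hypothetical helper: print the in-cluster DNS name of a CNPG read-write service.
# Assumes the default Kubernetes cluster domain (cluster.local).
cnpg_rw_host() {
  cluster=$1
  namespace=${2:-database}   # defaults to the namespace used in this guide
  printf '%s-rw.%s.svc.cluster.local\n' "$cluster" "$namespace"
}

# Usage:
#   cnpg_rw_host pgsql-cluster
#   cnpg_rw_host immich17 database
```

If an app's connection string uses any other host and returns "connection refused", switching it to this hostname is the first thing to try.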

Talos

| Symptom | Check | Fix |
|---|---|---|
| Can't SSH to node | Expected: Talos has no SSH | Use `talosctl -n <node-ip>` for all node access |
| Config change needed | Edit templates in `kubernetes/talos/` | Then apply: `just talos apply-node <node>` |
| Upgrade needed | Check current version: `talosctl version` | Upgrade order: Talos first (`just talos upgrade-node`), then K8s (`just talos upgrade-k8s`) |
| Node stuck | `talosctl dmesg -n <node-ip>` | Try `just talos reboot-node <node>` |