# Troubleshooting Guide

## When Do You Actually Need This?
Kubernetes is self-healing. Most of the time, things just work. If a pod crashes, it restarts. If a node goes down, pods move to healthy nodes. If you're updating an app and it fails, Flux rolls back to the previous working version automatically.
You only need to troubleshoot when:
- **Your Pushover alert hasn't cleared.** The cluster sends alerts for issues that persist beyond self-healing timeouts.
- **AlertManager shows active firing alerts.** Check https://am.t0m.co (when connected to the internal network) for currently firing alerts and their details.
- **The status page shows degraded services.** Check https://status.t0m.co for a quick overview of what's working and what's not.
If none of these are showing problems, the cluster is probably fine. Temporary errors during deployments are normal and resolve automatically.
## Quick Health Check
Before digging into specific issues:
```sh
# Check overall cluster health
kubectl get nodes    # all should be Ready

# Check for pods in bad states
kubectl get pods -A | grep -v "Running\|Completed"

# Check Flux status
flux get all -A | grep -v "True"
```
If everything looks good, you're probably overthinking it. Kubernetes handles transient issues on its own.
## Common Scenarios

### Updating an App and It Fails
What happens: You push a change, the app fails to deploy, Flux rolls back automatically.
Symptoms: The HelmRelease shows `Failed`, but pods are still running the old version.
Why it's usually fine: Flux's fail-safe behavior means failed deployments don't take down working apps.
When to intervene:
- AlertManager fires an alert
- The app shows `Failed` for more than 10 minutes
- You need to force the update despite errors
Fix:
```sh
# Check what went wrong
kubectl describe helmrelease <name> -n <namespace>

# If it's a configuration error, fix it in Git and push.
# If you need to force a clean slate:
kubectl delete helmrelease <name> -n <namespace>
just kube ks-reconcile <namespace> <app-name>
```
### Adding a New App and It Won't Deploy

Symptoms: The new app's Kustomization stays in a `Progressing` or `Failed` state.
Common causes:
- Missing dependency (database, secret, CRD)
- Typo in configuration
- Wrong image reference
Fix:
```sh
# Check dependency chain
kubectl describe kustomization <app-name> -n flux-system

# Look for dependency failures
kubectl get kustomization -A | grep False

# Fix dependencies first, then retry
just kube ks-reconcile <namespace> <app-name>
```
### Network Changes Breaking Things
Symptoms: Everything was working, you changed networking config, now apps are unreachable.
What to check:
```sh
# Did you break Cilium?
kubectl get pods -n kube-system | grep cilium

# Are LoadBalancer IPs allocated?
kubectl get svc -A | grep LoadBalancer

# Is the Cloudflared tunnel up?
kubectl get pods -n network | grep cloudflared

# Are HTTPRoutes configured?
kubectl get httproute -A
```
Nuclear option (restart networking stack in order):
This restarts CoreDNS → Cilium → Cloudflared → external-dns → Envoy in dependency order. Only use this if you're confident networking is the problem.
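That sequence can be sketched as a series of rollout restarts. This is an assumption-laden sketch: the workload names and namespaces below (`coredns` and `cilium` in `kube-system`; `cloudflared`, `external-dns`, and `envoy` in `network`) are guesses based on the components named above — adjust them to match the actual manifests.

```shell
# Restart the networking stack in dependency order (names/namespaces assumed)
kubectl rollout restart -n kube-system deployment/coredns
kubectl rollout restart -n kube-system daemonset/cilium
kubectl rollout restart -n network deployment/cloudflared
kubectl rollout restart -n network deployment/external-dns
kubectl rollout restart -n network deployment/envoy

# Wait for each rollout to finish before moving to the next, e.g.:
kubectl rollout status -n kube-system daemonset/cilium --timeout=5m
```

Waiting between restarts matters: restarting Cloudflared before Cilium is healthy again just produces a second round of failures.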
## Alert-Driven Troubleshooting

### Using AlertManager
AlertManager at https://am.t0m.co shows active alerts with:
- Alert name: What's failing
- Labels: Which namespace/app
- Annotations: Description and suggested actions
- Firing duration: How long it's been broken
Alerts are configured to fire only after issues persist beyond self-healing thresholds. If an alert is firing, intervention is likely needed.
**Common Alerts and What They Mean**
- `PodCrashLooping`: Pod restarting repeatedly (check logs: `kubectl logs <pod> -n <namespace> --previous`)
- `PVCFull`: PersistentVolumeClaim out of space (expand the PVC or clean up data)
- `NodeNotReady`: Node unhealthy (check node logs: `talosctl -n <node-ip> logs kubelet`)
- `HelmReleaseFailed`: Deployment failed (check the HelmRelease: `kubectl describe helmrelease <name> -n <namespace>`)
- `KustomizationFailed`: Flux can't apply manifests (check the Kustomization: `kubectl describe kustomization <name> -n flux-system`)
### Using the Status Page
The status page at https://status.t0m.co provides a high-level view of service health. Powered by Gatus, it monitors:
- HTTP endpoint availability
- Response time
- Certificate validity
- DNS resolution
If a service shows as down on the status page, start troubleshooting there.
## Specific Problem Scenarios

### Stuck HelmRelease

When this happens: After an app update, the HelmRelease shows `Progressing` forever, or `Failed` with its retries exhausted.
Check:
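A minimal set of checks, assuming the Flux CLI is installed (resource names are placeholders):

```shell
# Current status and the last error message
flux get helmreleases -n <namespace>
kubectl describe helmrelease <name> -n <namespace>

# Recent events for the release
kubectl get events -n <namespace> --field-selector involvedObject.name=<name>
```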
Fix:
```sh
# Delete the HelmRelease (Flux recreates it)
kubectl delete helmrelease <name> -n <namespace>

# Force reconcile
just kube ks-reconcile <namespace> <app-name>
```
**If That Doesn't Work**
Sometimes Helm gets truly stuck:
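One way to unstick it is with the Helm CLI directly. A sketch, with the release name, namespace, and revision as placeholders; Helm stores release state in Secrets named `sh.helm.release.v1.<name>.v<revision>`:

```shell
# A stuck upgrade shows status pending-install or pending-upgrade
helm list -n <namespace> --all

# Roll back to the last good revision...
helm rollback <name> <revision> -n <namespace>

# ...or, as a last resort, delete the Secret recording the pending
# release state, then let Flux reconcile again
kubectl get secrets -n <namespace> | grep sh.helm.release
kubectl delete secret sh.helm.release.v1.<name>.v<revision> -n <namespace>
```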
### ExternalSecret Won't Sync

Symptoms: `SecretSyncedError` in the ExternalSecret status; the app can't start because its secret is missing.
Check:
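A starting point (the operator deployment name matches the restart command in the Fix below; the secret store check assumes a ClusterSecretStore is in use):

```shell
# Status conditions and the exact sync error
kubectl describe externalsecret <secret-name> -n <namespace>

# Is the backing secret store Ready?
kubectl get clustersecretstore

# Operator logs often name the missing key
kubectl logs deployment/external-secrets -n external-secrets
```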
Fix:
```sh
# Force resync
just kube sync-es <namespace> <secret-name>

# If that doesn't work, restart the operator
kubectl rollout restart deployment/external-secrets -n external-secrets
```
### Database Connection Failures
Symptoms: App logs show "can't connect to database" errors.
Check:
```sh
# Is the database running?
kubectl get pods -n database

# Is the secret available?
kubectl get secret -n <app-namespace> | grep pguser

# Can DNS resolve it?
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never \
  -- nslookup postgresql.database.svc.cluster.local
```
Fix:
```sh
# Check CNPG cluster health
kubectl get cluster -n database
kubectl describe cluster pgsql-cluster -n database

# Restart the app (forces secret refresh)
kubectl rollout restart deployment/<app> -n <namespace>
```
### PVC Out of Space

Symptoms: App logs show "no space left on device", and the `PVCFull` alert fires.
Check:
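A quick way to confirm it, assuming you know which pod mounts the claim (paths are placeholders):

```shell
# How full is the volume, from inside the pod?
kubectl exec -n <namespace> <pod> -- df -h <mount-path>

# Requested capacity vs. what's actually bound
kubectl get pvc -n <namespace>
kubectl describe pvc <claim> -n <namespace>
```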
Fix:
```sh
# Option 1: Expand the PVC (edit in Git, push, Flux expands it)
# Edit Helm values or the PVC manifest to increase the storage size

# Option 2: Clean up old data
just kube browse-pvc <namespace> <claim>
# Then delete old files inside the debug pod
```
### Pod CrashLoopBackOff
Symptoms: Pod repeatedly crashes, Pushover alert fires.
Check:
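The two commands that usually identify the cause (pod and namespace are placeholders):

```shell
# Logs from the crashed (previous) container
kubectl logs <pod> -n <namespace> --previous

# Exit code and reason: look for OOMKilled, failed probes,
# or missing environment variables in the events at the bottom
kubectl describe pod <pod> -n <namespace>
```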
Common causes:
- Application error (fix code/config)
- Missing environment variable (check ExternalSecret)
- OOMKilled (increase memory limits)
- Liveness probe too aggressive (increase `initialDelaySeconds`)
Fix based on cause: Edit configuration in Git, push, Flux applies it.
## Node Issues

### Node NotReady

Symptoms: `kubectl get nodes` shows `NotReady`, pods are being evicted, AlertManager fires `NodeNotReady`.
Check:
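A starting point, using the Talos CLI already referenced elsewhere in this guide (node name/IP are placeholders):

```shell
# Conditions and recent events for the node
kubectl describe node <node-name>

# Kubelet logs via Talos
talosctl -n <node-ip> logs kubelet

# Overall Talos health check
talosctl -n <node-ip> health
```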
Fix:
```sh
# Often fixed by rebooting the node
just talos reboot-node <node-name>

# Wait for it to come back
kubectl wait --for=condition=ready node/<node-name> --timeout=10m
```
## Network Troubleshooting

### Can't Access App from Internet
Check:
```sh
# Is the HTTPRoute created?
kubectl get httproute -n <namespace>

# Is Cloudflared running?
kubectl get pods -n network | grep cloudflared

# Is DNS resolving?
dig <app>.t0m.co
```
Fix:
```sh
# Restart cloudflared
kubectl rollout restart deployment/cloudflared -n network

# Force DNS sync
just kube sync-all-hr
```
### Can't Access App from LAN
Check:
```sh
# Does the service have a LoadBalancer IP?
kubectl get svc <service> -n <namespace>

# Is Cilium healthy?
kubectl get pods -n kube-system | grep cilium
```
Fix:
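A hedged starting point, assuming Cilium's LB-IPAM is what allocates LoadBalancer IPs in this cluster:

```shell
# Restart the Cilium agents (re-runs LoadBalancer IP allocation)
kubectl rollout restart daemonset/cilium -n kube-system

# If no IP is handed out at all, check the LB-IPAM pool
kubectl get ciliumloadbalancerippools
```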
### VPN Network Issues (Multus)

Symptoms: qBittorrent or Prowlarr can't reach the internet, or the VPN isn't working.
Check:
```sh
# Verify the NetworkAttachmentDefinition exists
kubectl get network-attachment-definitions -n network

# Check the pod has both interfaces
kubectl exec -it <pod-name> -n media -- ip addr show

# Verify routing
kubectl exec -it <pod-name> -n media -- ip route show

# Test VPN network connectivity
kubectl exec -it <pod-name> -n media -- ping 192.168.99.1
```
Fix:
```sh
# Restart the pod to re-attach the network
kubectl delete pod <pod-name> -n media

# If the NetworkAttachmentDefinition is missing, check the app's kustomization
kubectl get kustomization -n media <app-name> -o yaml
```
See the VPN Networking Guide for detailed troubleshooting.
## Emergency: Everything is Broken
If the entire cluster is down:
- Check node connectivity: can you ping 192.168.5.211/212/213?
- Check the control plane: `kubectl get pods -n kube-system`
- Reboot nodes one at a time: `just talos reboot-node <node-name>`
- Last resort: rebuild from Git (`just bootstrap default`) and restore data from VolSync backups.
## Remember
- Trust the self-healing: If no alerts are firing, it's probably fine.
- Check alerts first: Pushover → AlertManager → Status page.
- Flux rolls back failures: Broken deployments don't break working apps.
- Logs are your friend: `kubectl logs`, `kubectl describe`, and `talosctl logs` tell the full story.
## Getting More Help
Still stuck?
- Check Flux events: `flux events --for Kustomization/<name>`
- Check all pod logs: `kubectl logs -n <namespace> <pod> --all-containers --previous`
- Consult DeepWiki for AI-generated insights
- Review the Operations Overview for more context