Observability Guide¶
The cluster includes a comprehensive observability stack for monitoring, alerting, and log aggregation. Everything runs in the observability namespace.
Observability Stack Overview¶
The stack is built around Prometheus (metrics collection and storage via kube-prometheus-stack), AlertManager (alert routing via Karma UI), Gatus (status monitoring), and VictoriaLogs (log aggregation).
```mermaid
graph TB
    subgraph "Metrics Collection"
        Prom[Prometheus]
        Exporters[Prometheus Exporters]
        Grafana[Grafana Dashboards]
        Kromgo[Kromgo Badge Generator]
    end
    subgraph "Alerting"
        PromAlert[Prometheus Alerting]
        AM[AlertManager]
        Karma[Karma Dashboard]
        Pushover[Pushover Mobile Notifications]
    end
    subgraph "Status Monitoring"
        Gatus[Gatus Status Page]
    end
    subgraph "Logs"
        FluentBit[Fluent-bit]
        VLogs[VictoriaLogs]
    end

    Exporters --> Prom
    Prom --> Grafana
    Prom --> Kromgo
    Prom --> PromAlert
    PromAlert --> AM
    AM --> Karma
    AM --> Pushover
    FluentBit --> VLogs
```

Web-Accessible Services¶
| Service | URL | Access | Authentication |
|---|---|---|---|
| Gatus (Status page) | status.t0m.co | External (public) | None |
| Karma (AlertManager UI) | am.t0m.co | Internal only | Authentik SSO |
| Prometheus | prometheus.t0m.co | Internal only | Authentik SSO |
| VictoriaLogs | logs.t0m.co | Internal only | Authentik SSO |
| Grafana | grafana.t0m.co | Internal only | Grafana auth + Authentik SSO |
| Kromgo | kromgo.t0m.co | External (public) | None |
What is 'Internal only'?
Services marked "Internal only" use the envoy-internal gateway and are only accessible from:

- Devices on the home network (192.168.5.0/24)
- Devices connected via Tailscale VPN
They are not exposed through the Cloudflared tunnel and cannot be accessed from the public internet.
Components¶
Prometheus (Metrics)¶
Purpose: Metrics collection, storage, and querying
Components (via kube-prometheus-stack):

- Prometheus Server: Metrics storage and querying
- Prometheus Operator: Manages ServiceMonitors and PrometheusRules
- kube-state-metrics: Exports Kubernetes object state as metrics
- prometheus-node-exporter: Exports node/system metrics
- Grafana: Dashboarding (included in the stack)
- AlertManager: Alert routing and notifications
Access: prometheus.t0m.co (internal only, Authentik SSO required)
Storage: Persistent storage on ceph-ssd storage class
Configuration: kube-prometheus-stack/app/helmrelease.yaml
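Retention and storage are set in the HelmRelease values. A minimal sketch of the relevant kube-prometheus-stack values (the retention and size here are illustrative, not the cluster's actual numbers; the authoritative settings live in the helmrelease.yaml above):

```yaml
prometheus:
  prometheusSpec:
    retention: 14d                        # illustrative; actual value is in helmrelease.yaml
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: ceph-ssd      # matches the storage class noted above
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi               # illustrative size
```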
Adding Metrics to an App
Apps expose metrics via ServiceMonitor resources. Prometheus Operator automatically discovers and scrapes them:
```yaml
serviceMonitor:
  app:
    serviceName: myapp
    endpoints:
      - port: metrics
        path: /metrics
```
Prometheus Operator converts ServiceMonitors into Prometheus scrape configs automatically.
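Under the hood, the operator renders each ServiceMonitor into a scrape job. A simplified sketch of what the generated scrape config looks like (job naming follows the operator's `serviceMonitor/<namespace>/<name>/<endpoint-index>` convention; relabeling is abridged):

```yaml
scrape_configs:
  - job_name: serviceMonitor/default/myapp/0   # generated job name
    kubernetes_sd_configs:
      - role: endpoints                        # discover Service endpoints
    relabel_configs:
      # keep only endpoints whose Service matches the ServiceMonitor's selector
      - source_labels: [__meta_kubernetes_service_name]
        regex: myapp
        action: keep
    metrics_path: /metrics
```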
AlertManager + Karma (Alerting)¶
Purpose: Routes alerts from Prometheus to notification channels; Karma provides the web UI
Important: AlertManager itself has no web UI exposed. Karma provides the web interface to AlertManager.
Access: am.t0m.co via Karma (internal only, Authentik SSO required)
Configured receivers:

- Pushover: Critical and warning alerts to mobile devices
- Healthchecks.io: Watchdog heartbeat (confirms monitoring is alive)
- null: InfoInhibitor alerts (silenced)
Alert routing logic:

1. Group alerts by alertname and job
2. Wait 1 minute before sending grouped alerts
3. Send to Pushover for severity=~"warning|critical"
4. Repeat alerts every 12 hours if not resolved
5. Silence lower-severity alerts when critical alerts are firing
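That routing logic corresponds roughly to an Alertmanager configuration like the following (a simplified sketch, not the cluster's actual config; receiver names and the inhibition `equal` labels are assumptions):

```yaml
route:
  group_by: ["alertname", "job"]   # step 1: group alerts
  group_wait: 1m                   # step 2: wait before sending a group
  repeat_interval: 12h             # step 4: re-notify unresolved alerts
  receiver: "null"                 # default receiver; assumed name
  routes:
    - receiver: pushover           # step 3: warning/critical go to mobile
      matchers:
        - severity =~ "warning|critical"
inhibit_rules:
  - source_matchers: [severity = critical]   # step 5: a firing critical alert
    target_matchers: [severity = warning]    # suppresses matching warnings
    equal: ["alertname", "namespace"]        # assumed grouping labels
```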
Karma configuration: Connects to AlertManager internally at http://kube-prometheus-stack-alertmanager.observability.svc.cluster.local:9093
Pushover Notification Format
Alerts are sent to Pushover with:

- Title: [FIRING:2] KubePodCrashLooping
- Message: Alert description + labels
- Priority: 1 (firing), 0 (resolved)
- Sound: gamelan
- URL: "View in Alertmanager" (links to Karma)
Configuration: kube-prometheus-stack/app/helmrelease.yaml
Gatus (Status Monitoring)¶
Purpose: Uptime monitoring and public status page
What it monitors:

- HTTP/HTTPS endpoints (apps exposed via HTTPRoute)
- Kubernetes Services (via sidecar auto-discovery)
- Custom endpoints (configured manually)
Access: status.t0m.co (publicly accessible)
Auto-discovery: The gatus-sidecar automatically creates Gatus endpoints for:

- All HTTPRoute resources with the gatus.home-operations.com/enabled: "true" annotation
- All Service resources with the gatus.home-operations.com/enabled: "true" annotation
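Service-based discovery works the same way as HTTPRoute-based discovery. A minimal sketch of an annotated Service (names and port are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp              # placeholder name
  namespace: default
  annotations:
    gatus.home-operations.com/enabled: "true"   # picked up by the gatus-sidecar
spec:
  selector:
    app: myapp
  ports:
    - port: 8080           # placeholder port
```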
Storage: PostgreSQL database (gatus-pguser secret managed via CNPG component)
Adding an App to Status Page
Apps are automatically monitored if they have an HTTPRoute. To customize the health check:
```yaml
route:
  app:
    annotations:
      gatus.home-operations.com/enabled: "true" # Enable monitoring
      gatus.home-operations.com/endpoint: | # Custom health check
        conditions: ["[STATUS] == 200", "[BODY].status == ok"]
        interval: 60s
    hostnames: ["myapp.${SECRET_DOMAIN}"]
```
The sidecar watches HTTPRoute resources and syncs them to Gatus configuration automatically.
VictoriaLogs (Log Aggregation)¶
Purpose: Centralized log storage and search
Retention: 21 days
Access: logs.t0m.co (internal only, Authentik SSO required)
Log sources:

- All pods via the fluent-bit DaemonSet
- Kubernetes audit logs
- System logs
Storage: 10Gi on openebs-hostpath
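The 21-day retention and 10Gi volume map onto VictoriaLogs settings along these lines (a sketch only; the chart key names are assumptions, while `-retentionPeriod` is the underlying VictoriaLogs flag):

```yaml
server:
  extraArgs:
    retentionPeriod: 21d              # VictoriaLogs -retentionPeriod flag
  persistentVolume:
    enabled: true
    size: 10Gi                        # matches the storage noted above
    storageClassName: openebs-hostpath
```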
Searching Logs
VictoriaLogs uses the LogsQL query language (similar to LogQL):

- Search by app: `{namespace="media", app="plex"}`
- Search for errors: `{namespace="default"} | "error"`
- Time range search: `{namespace="default"} | "error" [5m]`

Access the web UI at logs.t0m.co from the internal network.
Grafana (Dashboards)¶
Purpose: Metrics visualization and dashboarding
Access: grafana.t0m.co (internal only, Grafana auth + Authentik SSO)
Dashboards: Managed declaratively via GrafanaDashboard CRDs. Apps include dashboards in their manifests:
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: cilium-operator
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  url: https://raw.githubusercontent.com/cilium/cilium/main/install/kubernetes/cilium/files/cilium-operator/dashboards/cilium-operator-dashboard.json
```
Datasource: Prometheus at http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090
Creating Custom Dashboards
- Create dashboard in Grafana web UI
- Export as JSON
- Create GrafanaDashboard CRD:
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: my-dashboard
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  json: |
    <paste exported JSON>
```
Grafana Operator syncs dashboards automatically.
Kromgo (Badge Generator)¶
Purpose: Generate status badges from Prometheus queries for README files
Access: kromgo.t0m.co (publicly accessible)
Example badges:

- Cluster uptime
- Pod count
- Storage usage
- Kubernetes version
How it works: Kromgo queries Prometheus internally and renders SVG badges based on configured queries.
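A sketch of what one badge definition in the configmap might look like (the field names and color-threshold syntax here are assumptions about Kromgo's metrics-list config style, not copied from the cluster; the actual file is referenced below):

```yaml
metrics:
  - name: cluster_pod_count        # assumed: badge served at kromgo.t0m.co/cluster_pod_count
    query: sum(kube_pod_info)      # PromQL evaluated against Prometheus
    colors:
      - { color: green, min: 0, max: 1000 }   # assumed threshold syntax
```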
Configuration: kromgo/app/configmap.yaml
Prometheus Exporters¶
Various exporters provide metrics for infrastructure components:
| Exporter | Purpose | Metrics |
|---|---|---|
| node-exporter | System/hardware metrics | CPU, memory, disk, network |
| kube-state-metrics | Kubernetes object state | Pods, deployments, nodes |
| unpoller | Unifi network stats | WiFi clients, bandwidth, devices |
All exporters are scraped by Prometheus automatically via ServiceMonitor CRDs.
Common Observability Tasks¶
Viewing Active Alerts¶
Via Karma (recommended):

1. Navigate to am.t0m.co (requires internal network access)
2. Log in with Authentik SSO
3. View grouped alerts with filters
Via Pushover: Critical and warning alerts are sent to mobile devices automatically.
Searching Logs¶
```
# 1. Navigate to the VictoriaLogs web UI (internal network only)
open https://logs.t0m.co

# 2. Log in with Authentik SSO

# 3. Use LogsQL to search
# Example: Find all errors in the default namespace in the last 5 minutes
{namespace="default"} | "error" [5m]

# Example: Search a specific app
{namespace="media", app="plex"}

# Example: Search across all namespaces
{} | "connection refused"
```
Creating Alert Rules¶
Alert rules are defined as PrometheusRule CRDs (managed by Prometheus Operator):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp
spec:
  groups:
    - name: myapp
      rules:
        - alert: MyAppDown
          annotations:
            summary: MyApp is down
            description: MyApp has been unavailable for 5 minutes
          expr: up{job="myapp"} == 0
          for: 5m
          labels:
            severity: critical
```
Commit and push—Flux applies the rule, Prometheus evaluates it, and AlertManager routes firing alerts to Pushover.
Silencing Alerts¶
Temporary silence:

1. Navigate to am.t0m.co (Karma)
2. Click the alert → "Silence"
3. Set duration and reason
4. Create the silence
Permanent silence: Add silence to silence-operator configuration or adjust alert rule severity/inhibition.
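For the silence-operator route, a permanent silence is itself a CRD. A hedged sketch (API group and field names follow the Giant Swarm silence-operator; this cluster's version may differ):

```yaml
apiVersion: monitoring.giantswarm.io/v1alpha1
kind: Silence
metadata:
  name: silence-myapp-down     # placeholder name
spec:
  matchers:
    - name: alertname
      value: MyAppDown         # alert to silence permanently
      isRegex: false
```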
Checking Status Page¶
Public status page shows app availability:
- Navigate to status.t0m.co
- View real-time status of all monitored endpoints
- Click endpoint for uptime history
Apps automatically appear if they have an HTTPRoute with Gatus annotation enabled.
Alert Rules¶
The cluster includes several categories of alert rules:
Core Kubernetes Alerts¶
- `KubePodCrashLooping`: Pod crash loops
- `KubePodNotReady`: Pods stuck in a non-ready state
- `KubeDeploymentReplicasMismatch`: Deployment replica count issues
- `KubePersistentVolumeFillingUp`: PVC approaching capacity
- `KubeNodeNotReady`: Node in NotReady state
- `KubeAPIErrorBudgetBurn`: Kubernetes API error rate too high
Application-Specific Alerts¶
Apps define custom alerts in prometheusrule.yaml files:
- CNPG: Database cluster health, replication lag
- Cert-Manager: Certificate expiration warnings
- Flux: GitRepository sync failures, HelmRelease failures
- Cilium: Network policy drops, endpoint health
- VolSync: Backup failures
Custom Alerts¶
Additional alerts defined in kube-prometheus-stack HelmRelease:
- `DockerhubRateLimitRisk`: Too many Docker Hub pulls
- `OomKilled`: Pods killed due to OOM
- `KubePodFailed`: Pods in Failed state for >15 minutes
All alerts are viewable in Karma.
Metrics Retention and Storage¶
| Component | Retention | Storage Size | Storage Class |
|---|---|---|---|
| Prometheus | Configurable | Persistent storage | ceph-ssd |
| VictoriaLogs | 21 days | 10Gi | openebs-hostpath |
| Gatus | Unlimited (PostgreSQL) | Shared CNPG cluster | ceph-ssd |
| Grafana | N/A (uses Prometheus) | Shared CNPG cluster | ceph-ssd |
Troubleshooting Observability¶
Alerts Not Firing¶
Symptoms: Expected alert not visible in Karma or Pushover
Diagnosis:
```shell
# 1. Check if the alert rule exists
kubectl get prometheusrule -A | grep <rule-name>

# 2. Check Prometheus logs
kubectl logs -n observability -l app.kubernetes.io/name=prometheus -f

# 3. Query Prometheus directly to test the alert expression
# Navigate to https://prometheus.t0m.co (internal network)
# and run the alert's PromQL expression manually
```
Common causes:

- Alert rule syntax error (check Prometheus logs)
- Alert expression never evaluates to true (test in Prometheus)
- Alert's `for:` duration not yet elapsed
Metrics Missing¶
Symptoms: Grafana dashboard shows "No data" or incomplete metrics
Diagnosis:
```shell
# 1. Check if the ServiceMonitor exists
kubectl get servicemonitor -n <namespace>

# 2. Check Prometheus targets
# Navigate to https://prometheus.t0m.co/targets (internal network)
# and verify the target shows as "UP"

# 3. Check Prometheus logs
kubectl logs -n observability -l app.kubernetes.io/name=prometheus -f
```
Common causes:

- ServiceMonitor not created or misconfigured
- App not exposing a /metrics endpoint
- Network policy blocking the scrape
Logs Not Appearing¶
Symptoms: VictoriaLogs search returns no results
Diagnosis:
```shell
# 1. Check that fluent-bit is running on all nodes
kubectl get pods -n observability -l app.kubernetes.io/name=fluent-bit

# 2. Check fluent-bit logs for errors
kubectl logs -n observability daemonset/fluent-bit -f

# 3. Check VictoriaLogs ingestion
kubectl logs -n observability deployment/victoria-logs -f | grep -i error
```
Common causes:

- Fluent-bit DaemonSet not running on the node
- Pod logs in a non-standard location
- VictoriaLogs storage full
Gatus Status Page Not Updating¶
Symptoms: Status page shows outdated status or missing endpoints
Diagnosis:
```shell
# 1. Check gatus-sidecar logs (auto-discovery)
kubectl logs -n observability deployment/gatus -c gatus-sidecar -f

# 2. Check the main Gatus container logs
kubectl logs -n observability deployment/gatus -c app -f

# 3. Verify the HTTPRoute has the correct annotation
kubectl get httproute -n <namespace> <app> -o yaml | grep gatus.home-operations.com
```
Common causes:

- Missing gatus.home-operations.com/enabled: "true" annotation
- HTTPRoute not created yet
- Gatus database connection issue
Best Practices¶
- Always define alerts for new apps: Include a prometheusrule.yaml for critical failure scenarios
- Use Grafana dashboards: Create dashboards for app-specific metrics
- Enable Gatus monitoring: Add gatus.home-operations.com/enabled: "true" to HTTPRoutes
- Test alert expressions: Verify PromQL queries in the Prometheus web UI before committing
- Monitor alert noise: If alerts fire too frequently, adjust thresholds or add inhibition rules
- Check logs for errors: Use VictoriaLogs to investigate alert root causes
Next Steps¶
- Troubleshooting Guide: Common cluster issues
- Task Runner Reference: Managing apps and resources
- Operations Overview: Day-to-day cluster management