Observability Guide

The cluster includes a comprehensive observability stack for monitoring, alerting, and log aggregation. Everything runs in the observability namespace.

Observability Stack Overview

The stack is built around Prometheus (metrics collection and storage via kube-prometheus-stack), AlertManager (alert routing, with Karma as its web UI), Gatus (status monitoring), and VictoriaLogs (log aggregation).

graph TB
    subgraph "Metrics Collection"
        Prom[Prometheus]
        Exporters[Prometheus Exporters]
        Grafana[Grafana Dashboards]
        Kromgo[Kromgo Badge Generator]
    end

    subgraph "Alerting"
        PromAlert[Prometheus Alerting]
        AM[AlertManager]
        Karma[Karma Dashboard]
        Pushover[Pushover Mobile Notifications]
    end

    subgraph "Status Monitoring"
        Gatus[Gatus Status Page]
    end

    subgraph "Logs"
        FluentBit[Fluent-bit]
        VLogs[VictoriaLogs]
    end

    Exporters --> Prom
    Prom --> Grafana
    Prom --> Kromgo
    Prom --> PromAlert
    PromAlert --> AM
    AM --> Karma
    AM --> Pushover
    FluentBit --> VLogs

Web-Accessible Services

Service | URL | Access | Authentication
Gatus (Status page) | status.t0m.co | External (public) | None
Karma (AlertManager UI) | am.t0m.co | Internal only | Authentik SSO
Prometheus | prometheus.t0m.co | Internal only | Authentik SSO
VictoriaLogs | logs.t0m.co | Internal only | Authentik SSO
Grafana | grafana.t0m.co | Internal only | Grafana auth + Authentik SSO
Kromgo | kromgo.t0m.co | External (public) | None
What is 'Internal only'?

Services marked "Internal only" use the envoy-internal gateway and are only accessible from:

  • Devices on the home network (192.168.5.0/24)
  • Devices connected via Tailscale VPN

They are not exposed through the Cloudflared tunnel and cannot be accessed from the public internet.


Components

Prometheus (Metrics)

Purpose: Metrics collection, storage, and querying

Components (via kube-prometheus-stack):

  • Prometheus Server: Metrics storage and querying
  • Prometheus Operator: Manages ServiceMonitors and PrometheusRules
  • kube-state-metrics: Exports Kubernetes object state as metrics
  • prometheus-node-exporter: Exports node/system metrics
  • Grafana: Dashboarding (included in stack)
  • AlertManager: Alert routing and notifications

Access: prometheus.t0m.co (internal only, Authentik SSO required)

Storage: Persistent storage on ceph-ssd storage class

Configuration: kube-prometheus-stack/app/helmrelease.yaml

Adding Metrics to an App

Apps expose metrics via ServiceMonitor resources; the Prometheus Operator discovers them and configures Prometheus to scrape the app:

kubernetes/apps/default/myapp/app/helmrelease.yaml
serviceMonitor:
  app:
    serviceName: myapp
    endpoints:
      - port: metrics
        path: /metrics

Prometheus Operator converts ServiceMonitors into Prometheus scrape configs automatically.
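For reference, the Helm values above render into a ServiceMonitor resource roughly like the following. This is an illustrative sketch: the exact names, labels, and the scrape interval depend on the chart's templates.

```yaml
# Sketch of the ServiceMonitor the chart renders for myapp.
# Selector labels and interval are assumptions, not the chart's exact output.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  endpoints:
    - port: metrics
      path: /metrics
```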


AlertManager + Karma (Alerting)

Purpose: Routes alerts from Prometheus to notification channels; the web UI is provided by Karma

Important: AlertManager itself has no web UI exposed. Karma provides the web interface to AlertManager.

Access: am.t0m.co via Karma (internal only, Authentik SSO required)

Configured receivers:

  • Pushover: Critical and warning alerts to mobile devices
  • Healthchecks.io: Watchdog heartbeat (confirms monitoring is alive)
  • null: InfoInhibitor alerts (silenced)

Alert routing logic:

  1. Group alerts by alertname and job
  2. Wait 1 minute before sending grouped alerts
  3. Send to Pushover for severity=~"warning|critical"
  4. Repeat alerts every 12 hours if not resolved
  5. Silence lower-severity alerts when critical alerts are firing
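The routing logic above corresponds to an AlertManager configuration along these lines. This is a hedged sketch that approximates the described behavior; the actual values live in the kube-prometheus-stack HelmRelease.

```yaml
# Sketch only: approximates the routing described above, not the exact
# config in kube-prometheus-stack/app/helmrelease.yaml.
route:
  group_by: ["alertname", "job"]
  group_wait: 1m
  repeat_interval: 12h
  routes:
    - receiver: "pushover"
      matchers:
        - severity =~ "warning|critical"
inhibit_rules:
  # Suppress lower-severity alerts while a matching critical alert is firing
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: ["alertname", "namespace"]
```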

Karma configuration: Connects to AlertManager internally at http://kube-prometheus-stack-alertmanager.observability.svc.cluster.local:9093

Pushover Notification Format

Alerts are sent to Pushover with:

  • Title: [FIRING:2] KubePodCrashLooping
  • Message: Alert description + labels
  • Priority: 1 (firing), 0 (resolved)
  • Sound: gamelan
  • URL: "View in Alertmanager" (links to Karma)

Configuration: kube-prometheus-stack/app/helmrelease.yaml


Gatus (Status Monitoring)

Purpose: Uptime monitoring and public status page

What it monitors:

  • HTTP/HTTPS endpoints (apps exposed via HTTPRoute)
  • Kubernetes Services (via sidecar auto-discovery)
  • Custom endpoints (configured manually)

Access: status.t0m.co (publicly accessible)

Auto-discovery: The gatus-sidecar automatically creates Gatus endpoints for:

  • All HTTPRoute resources with the gatus.home-operations.com/enabled: "true" annotation
  • All Service resources with the gatus.home-operations.com/enabled: "true" annotation

Storage: PostgreSQL database (gatus-pguser secret managed via CNPG component)

Adding an App to Status Page

Apps are automatically monitored once their HTTPRoute carries the Gatus annotation. To customize the health check:

kubernetes/apps/default/myapp/app/helmrelease.yaml
route:
  app:
    annotations:
      gatus.home-operations.com/enabled: "true"  # Enable monitoring
      gatus.home-operations.com/endpoint: |      # Custom health check
        conditions: ["[STATUS] == 200", "[BODY].status == ok"]
        interval: 60s
    hostnames: ["myapp.${SECRET_DOMAIN}"]

The sidecar watches HTTPRoute resources and syncs them to Gatus configuration automatically.
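For the route above, the sidecar would emit a Gatus endpoint entry roughly like this. Treat it as a sketch: the endpoint name, group, and URL conventions depend on the sidecar's implementation, and example.com stands in for the real domain.

```yaml
# Sketch of the Gatus endpoint the sidecar could generate for myapp;
# name/url conventions are assumptions based on the annotation above.
endpoints:
  - name: myapp
    url: https://myapp.example.com
    interval: 60s
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == ok"
```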


VictoriaLogs (Log Aggregation)

Purpose: Centralized log storage and search

Retention: 21 days

Access: logs.t0m.co (internal only, Authentik SSO required)

Log sources:

  • All pods via the fluent-bit DaemonSet
  • Kubernetes audit logs
  • System logs

Storage: 10Gi on openebs-hostpath

Searching Logs

VictoriaLogs uses LogsQL query language (similar to LogQL):

Search by app:

{namespace="default", app="authentik"}

Search for errors:

{namespace="default"} | "error" or "ERROR"

Time range search:

{namespace="media", app="plex"} [5m]

Access the web UI at logs.t0m.co from the internal network.


Grafana (Dashboards)

Purpose: Metrics visualization and dashboarding

Access: grafana.t0m.co (internal only, Grafana auth + Authentik SSO)

Dashboards: Managed declaratively via GrafanaDashboard CRDs. Apps include dashboards in their manifests:

Example: kubernetes/apps/network/cilium/app/grafanadashboard.yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: cilium-operator
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  url: https://raw.githubusercontent.com/cilium/cilium/main/install/kubernetes/cilium/files/cilium-operator/dashboards/cilium-operator-dashboard.json

Datasource: Prometheus at http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090

Creating Custom Dashboards
  1. Create dashboard in Grafana web UI
  2. Export as JSON
  3. Create GrafanaDashboard CRD:
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: my-dashboard
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  json: |
    <paste exported JSON>

Grafana Operator syncs dashboards automatically.


Kromgo (Badge Generator)

Purpose: Generate status badges from Prometheus queries for README files

Access: kromgo.t0m.co (publicly accessible)

Example badges:

  • Cluster uptime
  • Pod count
  • Storage usage
  • Kubernetes version

How it works: Kromgo queries Prometheus internally and renders SVG badges based on configured queries.

Configuration: kromgo/app/configmap.yaml
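A badge definition in that ConfigMap might look roughly like the following. The field names (metrics, name, query, colors) are assumed from kromgo's typical configuration format; verify against the actual kromgo/app/configmap.yaml before copying.

```yaml
# Sketch of a kromgo badge definition; field names are assumptions
# based on kromgo's documented config shape, not this cluster's file.
metrics:
  - name: cluster_node_count
    query: count(kube_node_info)
    colors:
      - { color: "green", min: 0, max: 10 }
```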


Prometheus Exporters

Various exporters provide metrics for infrastructure components:

Exporter | Purpose | Metrics
node-exporter | System/hardware metrics | CPU, memory, disk, network
kube-state-metrics | Kubernetes object state | Pods, deployments, nodes
unpoller | UniFi network stats | WiFi clients, bandwidth, devices

All exporters are scraped by Prometheus automatically via ServiceMonitor CRDs.


Common Observability Tasks

Viewing Active Alerts

Via Karma (recommended):

  1. Navigate to am.t0m.co (requires internal network access)
  2. Log in with Authentik SSO
  3. View grouped alerts with filters

Via Pushover: Critical and warning alerts are sent to your mobile device automatically.

Searching Logs

# 1. Navigate to VictoriaLogs web UI (internal network only)
open https://logs.t0m.co

# 2. Log in with Authentik SSO

# 3. Use LogsQL to search
# Example: Find all errors in default namespace in last 5 minutes
{namespace="default"} | "error" [5m]

# Example: Search specific app
{namespace="media", app="plex"}

# Example: Search across all namespaces
{} | "connection refused"

Creating Alert Rules

Alert rules are defined as PrometheusRule CRDs (managed by Prometheus Operator):

kubernetes/apps/default/myapp/app/prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp
spec:
  groups:
    - name: myapp
      rules:
        - alert: MyAppDown
          annotations:
            summary: MyApp is down
            description: MyApp has been unavailable for 5 minutes
          expr: up{job="myapp"} == 0
          for: 5m
          labels:
            severity: critical

Commit and push—Flux applies the rule, Prometheus evaluates it, and AlertManager routes firing alerts to Pushover.

Silencing Alerts

Temporary silence:

  1. Navigate to am.t0m.co (Karma)
  2. Click the alert → "Silence"
  3. Set duration and reason
  4. Create the silence

Permanent silence: Add silence to silence-operator configuration or adjust alert rule severity/inhibition.
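If the silence-operator route is taken, a permanent silence is declared as a Silence CRD along these lines. The apiVersion and matcher fields are assumed from the upstream silence-operator project and may differ from the version deployed here; verify before use.

```yaml
# Sketch of a silence-operator Silence resource; apiVersion and fields
# are assumptions from the upstream project, not this cluster's config.
apiVersion: monitoring.giantswarm.io/v1alpha1
kind: Silence
metadata:
  name: silence-myapp-down
spec:
  matchers:
    - name: alertname
      value: MyAppDown
      isRegex: false
```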

Checking Status Page

Public status page shows app availability:

  1. Navigate to status.t0m.co
  2. View real-time status of all monitored endpoints
  3. Click endpoint for uptime history

Apps automatically appear if they have an HTTPRoute with Gatus annotation enabled.


Alert Rules

The cluster includes several categories of alert rules:

Core Kubernetes Alerts

  • KubePodCrashLooping - Pod crash loops
  • KubePodNotReady - Pods stuck in non-ready state
  • KubeDeploymentReplicasMismatch - Deployment replica count issues
  • KubePersistentVolumeFillingUp - PVC approaching capacity
  • KubeNodeNotReady - Node in NotReady state
  • KubeAPIErrorBudgetBurn - Kubernetes API error rate too high

Application-Specific Alerts

Apps define custom alerts in prometheusrule.yaml files:

  • CNPG: Database cluster health, replication lag
  • Cert-Manager: Certificate expiration warnings
  • Flux: GitRepository sync failures, HelmRelease failures
  • Cilium: Network policy drops, endpoint health
  • VolSync: Backup failures

Custom Alerts

Additional alerts defined in kube-prometheus-stack HelmRelease:

  • DockerhubRateLimitRisk - Too many Docker Hub pulls
  • OomKilled - Pods killed due to OOM
  • KubePodFailed - Pods in Failed state for >15 minutes
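As a concrete example, an OomKilled-style alert can be expressed against a kube-state-metrics series. This is a sketch of the idea, not the exact rule defined in the HelmRelease:

```yaml
# Sketch of an OOM-kill alert using the kube-state-metrics
# last-terminated-reason series; thresholds/labels are illustrative.
- alert: OomKilled
  expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  labels:
    severity: warning
  annotations:
    summary: >-
      Container {{ $labels.container }} in
      {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed
```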

All alerts are viewable in Karma.


Metrics Retention and Storage

Component | Retention | Storage Size | Storage Class
Prometheus | Configurable | Persistent storage | ceph-ssd
VictoriaLogs | 21 days | 10Gi | openebs-hostpath
Gatus | Unlimited (PostgreSQL) | Shared CNPG cluster | ceph-ssd
Grafana | N/A (uses Prometheus) | Shared CNPG cluster | ceph-ssd

Troubleshooting Observability

Alerts Not Firing

Symptoms: Expected alert not visible in Karma or Pushover

Diagnosis:

# 1. Check if alert rule exists
kubectl get prometheusrule -A | grep <rule-name>

# 2. Check Prometheus logs
kubectl logs -n observability -l app.kubernetes.io/name=prometheus -f

# 3. Query Prometheus directly to test alert expression
# Navigate to https://prometheus.t0m.co (internal network)
# Run the alert's PromQL expression manually

Common causes:

  • Alert rule syntax error (check Prometheus logs)
  • Alert expression never evaluates to true (test it in Prometheus)
  • The alert's for duration has not yet elapsed

Metrics Missing

Symptoms: Grafana dashboard shows "No data" or incomplete metrics

Diagnosis:

# 1. Check if ServiceMonitor exists
kubectl get servicemonitor -n <namespace>

# 2. Check Prometheus targets
# Navigate to https://prometheus.t0m.co/targets (internal network)
# Verify target shows as "UP"

# 3. Check Prometheus logs
kubectl logs -n observability -l app.kubernetes.io/name=prometheus -f

Common causes:

  • ServiceMonitor not created or misconfigured
  • App not exposing a /metrics endpoint
  • Network policy blocking the scrape

Logs Not Appearing

Symptoms: VictoriaLogs search returns no results

Diagnosis:

# 1. Check fluent-bit is running on all nodes
kubectl get pods -n observability -l app.kubernetes.io/name=fluent-bit

# 2. Check fluent-bit logs for errors
kubectl logs -n observability daemonset/fluent-bit -f

# 3. Check VictoriaLogs ingestion
kubectl logs -n observability deployment/victoria-logs -f | grep -i error

Common causes:

  • Fluent-bit DaemonSet not running on the node
  • Pod logs in a non-standard location
  • VictoriaLogs storage full

Gatus Status Page Not Updating

Symptoms: Status page shows outdated status or missing endpoints

Diagnosis:

# 1. Check gatus-sidecar logs (auto-discovery)
kubectl logs -n observability deployment/gatus -c gatus-sidecar -f

# 2. Check main Gatus container logs
kubectl logs -n observability deployment/gatus -c app -f

# 3. Verify HTTPRoute has correct annotation
kubectl get httproute -n <namespace> <app> -o yaml | grep gatus.home-operations.com

Common causes:

  • Missing gatus.home-operations.com/enabled: "true" annotation
  • HTTPRoute not created yet
  • Gatus database connection issue


Best Practices

  1. Always define alerts for new apps: Include a prometheusrule.yaml for critical failure scenarios
  2. Use Grafana dashboards: Create dashboards for app-specific metrics
  3. Enable Gatus monitoring: Add gatus.home-operations.com/enabled: "true" to HTTPRoutes
  4. Test alert expressions: Verify PromQL queries in Prometheus web UI before committing
  5. Monitor alert noise: If alerts fire too frequently, adjust thresholds or add inhibition rules
  6. Check logs for errors: Use VictoriaLogs to investigate alert root causes

Next Steps