Multi-Cluster Observability Platform
Built a production-grade multi-cluster observability platform in my homelab
05.2026
Vanilla Prometheus hits a wall quickly once you run it on multiple Kubernetes clusters:
- Each Prometheus only sees its own cluster
- No built-in long-term storage: unless you attach persistent volumes, data disappears when pods restart
- No way to query across clusters from one place
This is fine for a single cluster. But the moment you have two clusters, you need something more.
The solution: Thanos. Thanos sits on top of Prometheus and solves all three problems. Think of it as Prometheus with superpowers:
- Query across all clusters from a single endpoint
- Store metrics indefinitely in cheap object storage (S3-compatible)
- Deduplicate data when you run multiple Prometheus replicas
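To make the deduplication idea concrete, here is a toy Python sketch. It assumes each replica tags its series with a `replica` label (a hypothetical name; the real one is whatever you configure) and merges series that become identical once that label is dropped. Thanos's actual deduplication is more involved, so treat this as an illustration of the principle only.

```python
from collections import defaultdict

REPLICA_LABEL = "replica"  # assumption: the label that distinguishes replicas


def deduplicate(series_list, replica_label=REPLICA_LABEL):
    """Merge series that differ only by the replica label.

    Each series is a dict: {"labels": {...}, "samples": [(timestamp, value), ...]}.
    For a timestamp seen by several replicas, the first value encountered wins.
    """
    merged = defaultdict(dict)  # label set without replica -> {timestamp: value}
    for series in series_list:
        key = tuple(sorted(
            (k, v) for k, v in series["labels"].items() if k != replica_label
        ))
        for ts, value in series["samples"]:
            merged[key].setdefault(ts, value)
    return [
        {"labels": dict(key), "samples": sorted(samples.items())}
        for key, samples in merged.items()
    ]


replica_a = {
    "labels": {"__name__": "up", "cluster": "admin-cluster", "replica": "a"},
    "samples": [(0, 1.0), (30, 1.0)],
}
replica_b = {
    "labels": {"__name__": "up", "cluster": "admin-cluster", "replica": "b"},
    "samples": [(30, 1.0), (60, 1.0)],  # replica b fills a gap left by replica a
}
result = deduplicate([replica_a, replica_b])
```

The two replica series collapse into a single `up` series for `admin-cluster`, with the overlapping sample kept once and the gap filled from the other replica.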
Every Prometheus instance carries a unique external label identifying which cluster it belongs to (cluster=admin-cluster, cluster=workload-cluster). Every 2 hours, the Thanos sidecar uploads a block of metrics to the self-hosted S3-compatible bucket in Ceph.
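Because every series carries a `cluster` label, one query endpoint can answer both per-cluster and cross-cluster questions. The sketch below builds such requests in Python. It assumes Thanos Query is reachable at a hypothetical `http://thanos-query.example` and uses the Prometheus-compatible `/api/v1/query` path with the `dedup` parameter. Only the URL is built; no request is sent.

```python
from urllib.parse import urlencode

# Assumption: hypothetical address of the Thanos Query endpoint.
THANOS_QUERY_URL = "http://thanos-query.example"


def build_query_url(promql, dedup=True, base_url=THANOS_QUERY_URL):
    """Return an instant-query URL for the Prometheus-compatible HTTP API."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url}/api/v1/query?{urlencode(params)}"


# One cluster only: filter on the cluster label.
admin_only = build_query_url('up{cluster="admin-cluster"}')

# All clusters at once: aggregate by the same label.
per_cluster = build_query_url("count by (cluster) (up)")
```

Passing `dedup=False` would return the raw per-replica series instead, which can help when checking whether one replica is missing data.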