Multi-Cluster Observability Platform

Description

Built a production-grade multi-cluster observability platform in my homelab
05.2026

Run vanilla Prometheus on multiple Kubernetes clusters and you quickly hit a wall:

Each Prometheus only sees its own cluster
No built-in long-term storage: local retention is limited (15 days by default), and without persistent volumes data disappears when pods restart
No way to query across clusters from one place

This is fine for a single cluster. But the moment you have two clusters, you need something more.

The solution: Thanos. Thanos sits on top of Prometheus and solves all three problems. Think of it as Prometheus with superpowers:

Query across all clusters from a single endpoint
Store metrics indefinitely in cheap object storage (S3-compatible)
Deduplicate data when you run multiple Prometheus replicas
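
To make the "single endpoint" idea concrete: Thanos Query exposes the Prometheus-compatible HTTP API, so a client only needs one base URL. The sketch below just builds such a request URL. The address thanos-query.example:9090 and the helper name build_instant_query_url are placeholders I made up for illustration, not part of this setup. The dedup parameter is the Thanos Query option that asks for replica deduplication.

```python
from urllib.parse import urlencode

# Hypothetical address of the Thanos Query component; replace with your own.
THANOS_QUERY_URL = "http://thanos-query.example:9090"


def build_instant_query_url(promql: str, dedup: bool = True,
                            base_url: str = THANOS_QUERY_URL) -> str:
    """Build a URL for the Prometheus-compatible instant query endpoint.

    `dedup` asks Thanos Query to merge series that differ only by
    their replica label.
    """
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url}/api/v1/query?{urlencode(params)}"


# One query, answered with data from every cluster, grouped by the cluster label.
url = build_instant_query_url("sum by (cluster) (up)")
print(url)
```

Fetching that URL with any HTTP client should return the usual Prometheus JSON response, with one result per cluster label value.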

Every Prometheus instance carries unique external labels identifying which cluster it belongs to (cluster=admin-cluster, cluster=workload-cluster). Prometheus cuts a new block of metrics every 2 hours, and the Thanos sidecar uploads each completed block to the self-hosted S3 bucket in Ceph.
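
The two mechanics above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Prometheus or Thanos are implemented: block_window assumes 2-hour blocks aligned to even UTC hours, and with_cluster_labels is a hypothetical helper showing how the cluster label keeps otherwise identical series apart once they land in the same bucket.

```python
from datetime import datetime, timedelta, timezone

BLOCK_RANGE = timedelta(hours=2)  # assumed block length, matching the text


def block_window(ts: datetime) -> tuple[datetime, datetime]:
    """Return the [start, end) window of the 2-hour block containing ts."""
    seconds = int(ts.timestamp())
    start = seconds - seconds % int(BLOCK_RANGE.total_seconds())
    start_dt = datetime.fromtimestamp(start, tz=timezone.utc)
    return start_dt, start_dt + BLOCK_RANGE


def with_cluster_labels(series: dict, external: dict) -> dict:
    """Hypothetical helper: attach a cluster's external labels to a series."""
    return {**series, **external}


# A sample scraped at 13:37 UTC falls into the 12:00-14:00 block.
start, end = block_window(datetime(2026, 5, 1, 13, 37, tzinfo=timezone.utc))
print(start.isoformat(), end.isoformat())

# The same metric from two clusters stays distinguishable after upload.
series = {"__name__": "up", "job": "kubelet"}
admin = with_cluster_labels(series, {"cluster": "admin-cluster"})
workload = with_cluster_labels(series, {"cluster": "workload-cluster"})
print(admin != workload)  # True
```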