Multi-Cluster Observability Platform

Description

Vanilla Prometheus hits a wall quickly once you run it on multiple Kubernetes clusters:

  • Each Prometheus only sees its own cluster
  • No built-in long-term storage: metrics live on local disk for a limited retention window, and are lost on pod restart unless a persistent volume is attached
  • No way to query across clusters from one place

This is fine for a single cluster. But the moment you have two clusters, you need something more.

The solution: Thanos. Thanos sits on top of Prometheus and solves all three problems. Think of it as Prometheus with superpowers:

  • Query across all clusters from a single endpoint
  • Store metrics indefinitely in cheap object storage (S3-compatible)
  • Deduplicate data when you run multiple Prometheus replicas
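To make the deduplication idea concrete, here is a deliberately simplified sketch: series that differ only by a replica label are merged into one, and the first replica to report a given timestamp wins. This is an illustration of the concept only, not Thanos's actual algorithm. The label name prometheus_replica and the data layout are assumptions made for the example.

```python
from collections import defaultdict

# Assumed replica label name; use whatever label distinguishes your replicas.
REPLICA_LABEL = "prometheus_replica"


def deduplicate(series_list, replica_label=REPLICA_LABEL):
    """Merge series that differ only by the replica label.

    Each series is a (labels, samples) pair, where labels is a dict and
    samples is a list of (timestamp, value) tuples.
    """
    merged = defaultdict(dict)
    for labels, samples in series_list:
        # Identity of a series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            # Keep the first value seen for each timestamp.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}
```

If two replicas scrape the same target and one of them misses a scrape, the merged series still covers that timestamp as long as the other replica recorded it.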

Every Prometheus instance carries unique external labels identifying which cluster it belongs to (cluster=admin-cluster, cluster=workload-cluster). Roughly every 2 hours, when Prometheus finalizes a block of metrics, the Thanos sidecar uploads that block to the self-hosted S3-compatible bucket in Ceph.
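Because every series carries a cluster label, a single query against the Thanos Query endpoint can compare clusters side by side. The sketch below builds such a request URL for the Prometheus-compatible instant query API. The host name is hypothetical, and the dedup parameter is passed on the assumption that replica deduplication is wanted for this query.

```python
from urllib.parse import urlencode

# Hypothetical address of the Thanos Query service; replace with your own.
THANOS_QUERY_URL = "http://thanos-query.example.internal:10902"


def build_instant_query_url(promql, base_url=THANOS_QUERY_URL, dedup=True):
    """Return the URL for an instant query against the /api/v1/query endpoint."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url}/api/v1/query?{urlencode(params)}"


# Count healthy scrape targets per cluster, across both clusters at once.
url = build_instant_query_url(
    'sum by (cluster) (up{cluster=~"admin-cluster|workload-cluster"})'
)
```

The resulting URL can be fetched with any HTTP client; the response follows the usual Prometheus JSON format, with one result per cluster label value.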