Thanos - a Scalable Prometheus with Unlimited Storage - InfoQ

The Improbable engineering team has open sourced Thanos, a set of components that adds high availability to Prometheus installations by cross-cluster federation, unlimited storage and global querying across clusters.

Improbable, a British technology company that focuses on large-scale simulations in the cloud, has a large Prometheus deployment for monitoring their dozens of Kubernetes clusters. An out-of-the-box Prometheus setup had difficulty meeting their requirements around querying historical data, querying across distributed Prometheus servers in a single API call, and merging replicated data from multiple Prometheus HA setups.

Prometheus has existing high availability features - highly available alerts and federated deployments. In a federation, a global Prometheus server aggregates data across other Prometheus servers, potentially in multiple datacenters. Each server sees only a portion of the metrics. To handle more load per datacenter, multiple Prometheus servers can run in a single datacenter, with horizontal sharding. In a sharding setup, slave servers fetch subsets of the data and a master aggregates them. Querying a specific machine involves querying the specific slave that scraped its data. By default, Prometheus stores time series data for 15 days. To store data for unlimited periods, Prometheus has remote endpoints to write to another datastore along with its regular one. However, de-duplication of data is a problem with this approach. Other solutions like Cortex provide scalable long term storage when used as a remote write endpoint, and a compatible querying API.

Thanos' architecture introduces a central query layer across all the servers via a sidecar component which sits alongside each Prometheus server, and a central Querier component that responds to PromQL queries. This makes up a Thanos deployment. Inter-component communication is via the memberlist gossip protocol. The Querier can scale horizontally since it is stateless and acts as an intelligent reverse proxy, passing on requests to the sidecars, aggregating their responses and evaluating the PromQL query against them.

Thanos solves the storage retention problem by using an object storage as the backend. The sidecar StoreAPI component detects whenever Prometheus writes data to disk, and uploads them to object storage. The same Store component also functions as a retrieval proxy over the gossip protocol, letting the Querier component talk to it to fetch data.

InfoQ got in touch with Bartłomiej Płotka, software engineer at Improbable, to learn more about how query execution happens in Thanos:

In Thanos, the query is always evaluated in a single place - in the root, which listens over HTTP for PromQL queries. The vanilla PromQL engine from Prometheus 2.2.1 is used to evaluate the query, which deduces what time series and for what time ranges we need to fetch the data. We use basic filtering (based on time ranges and external labels) to filter out StoreAPIs (leafs) that will not give us the desired data and then invoke the remaining ones. The results are then merged - append together time-wise - from different sources.

The Querier component can auto-switch between resolutions (e.g. 5 minutes, 1 hour, 24 hours) based on the user zooming in and out. How does Thanos identify which API servers have the data it needs in a query? Płotka explains:

StoreAPIs propagate external labels and the time range they have data for. So we can do basic filtering on this. However if you don't specify any of these in query (you just want all the "up" series) the querier concurrently asks all the StoreAPI servers. Also, there might be some duplication of results between sidecar and store data, which might not be easy to avoid.

The StoreAPI component understands the Prometheus data format, so it can optimize the query execution plan and also cache specific indices of the blocks to respond fast enough to user queries. This absolves it from the need to cache huge amounts of data. In an HA Prometheus setup with Thanos sidecars, would there be issues with multiple sidecars attempting to upload the same data blocks to object storage? Płotka responded:

This is handled by having unique external labels for all Prometheus + sidecar instances (even HA replicas). To indicate that all replicas are storing same targets they differ only in one label. So for example:

First:
"cluster": "prod1"
"replica": "0"

Second:
"cluster":"prod1"
"replica": "1"

There is no problem with storing them since the label sets are unique. The query layer, however is capable of deduplicating by "replica" label, if specified, on the fly.

Image courtesy: https://improbable.io/games/blog/thanos-prometheus-at-scale

Thanos also provides compaction and downsampled storage of time series data. Prometheus has an inbuilt compaction model where existing smaller data blocks are rewritten into larger ones, and restructured to improve query performance. Thanos utilizes the same mechanism in a Compactor component that runs as a batch job and compacts the object storage data. The Compactor downsamples the data too, and "the downsampling interval is not configurable at this point but we have chosen some sane intervals - 5m and 1h", says Płotka. Compaction is a common feature in other time series databases like InfluxDB and OpenTSDB.

Thanos is written in Go, and is available on Github. Visualization tools like Grafana can use Thanos as-is with the existing Prometheus plugin since the API is the same.

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

Thanos - a Scalable Prometheus with Unlimited Storage

InfoQ Article Contest

Rate this Article

This content is in the DevOps topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

The InfoQ Newsletter