Scale Your Prometheus Metrics Indefinitely with Thanos

Prometheus metrics are an essential part of your observability stack. Observability goes hand in hand with monitoring, and is covered extensively in this Essential Observability Techniques article. A well-monitored application with flexible logging frameworks can pay enormous dividends over a long period of sustained growth, but Prometheus has a problem when it comes to scale.

Prometheus Metrics at Scale

Prometheus is an extremely popular choice for collecting and querying real-time metrics, primarily due to its simplicity, ease of integration, and powerful visualizations. It’s perfect for small and medium-sized applications, but what happens when it comes to scaling?

Unfortunately, a traditional Prometheus & Kubernetes combo can struggle at scale, given its heavy reliance on writing in-memory data to local disk. Without specialized knowledge, configuring it for high performance at scale is extremely tough. For example, querying through petabytes of historical data with any degree of speed is extremely challenging in a traditional Prometheus setup, because it relies so heavily on read/write disk operations.

So how do you scale your Prometheus Metrics?

This is where Thanos comes to the rescue. In this article, we will explain what Thanos is and how it lets us scale Prometheus without the memory headache. We will then run through a few real-world examples that show just how effective it can be.

Thanos

Thanos, simply put, is a “highly available Prometheus setup with long-term storage capabilities”. The word “Thanos” comes from the Greek “Athanasios”, meaning immortal in English. True to its name, Thanos offers object storage with unlimited retention and is highly compatible with Prometheus and other tools that support it, such as Grafana.

Thanos allows you to aggregate data from multiple Prometheus instances and query them, all from a single endpoint. Thanos also automatically deals with duplicate metrics that may arise from multiple Prometheus instances.

Let’s say we were running multiple Prometheus instances and wanted to use Thanos to our advantage. We will take a look at the relationship between these two components and then delve into more technical details.

Diagram: How Prometheus metrics are queried across multiple Prometheus instances using Thanos

Storage

The first step to solving memory woes with Thanos is its Sidecar component, which seamlessly uploads metric blocks to object storage on the usual providers (S3, Swift, Azure, etc.). It exposes the StoreAPI as a gateway to that data, and only uses a small amount of local disk space to keep track of remote blocks and keep them in sync. The StoreAPI is a gRPC service that supports SSL/TLS authentication, so what would otherwise be standard HTTP operations are carried as gRPC calls. The Thanos documentation covers the StoreAPI in more detail.
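To make this concrete, here is a minimal sketch of running the Sidecar next to a Prometheus instance. The bucket name, endpoint and paths are placeholders rather than recommendations, and the object storage YAML is shown in comments for readability:

    # bucket.yml, the object storage configuration (S3 shown; Swift, Azure, GCS are similar):
    #
    #   type: S3
    #   config:
    #     bucket: "example-thanos-metrics"          # placeholder bucket name
    #     endpoint: "s3.eu-west-1.amazonaws.com"    # placeholder endpoint
    #
    # Run the Sidecar alongside Prometheus: it ships finished TSDB blocks to the
    # bucket and exposes the StoreAPI (gRPC) so Thanos Query can read from it.
    thanos sidecar \
      --tsdb.path=/var/prometheus/data \
      --prometheus.url=http://localhost:9090 \
      --objstore.config-file=bucket.yml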

Thanos also features Time Based Partitioning, which lets you set flags on the Store Gateway so that it only serves metrics from certain timeframes. For example, --min-time=-6w tells a Store Gateway to serve only data newer than six weeks, filtering out anything older.
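As a small sketch, a Store Gateway restricted to blocks between six and two weeks old would combine both bounds (the values are illustrative):

    # Serve only blocks whose data lies between 6 weeks ago and 2 weeks ago;
    # anything outside that window is left to other Store Gateways.
    thanos store \
      --objstore.config-file=bucket.yml \
      --min-time=-6w \
      --max-time=-2w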

Additionally, Thanos has an Index Cache and a Caching Bucket: powerful features that let you move safely from a local read/write disk dependency to seamless cloud storage. These latency boosters are essential when scaling, because every millisecond counts.
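To give a flavour of how these caches are wired in, here is a hedged sketch of a Store Gateway with an in-memory index cache and a memcached-backed caching bucket; the sizes and addresses are assumptions, not tuning advice:

    # index-cache.yml, keeps postings and series lookups in memory:
    #
    #   type: IN-MEMORY
    #   config:
    #     max_size: 1GB
    #
    # caching-bucket.yml, caches chunks and metadata fetched from object storage:
    #
    #   type: MEMCACHED
    #   config:
    #     addresses: ["memcached.monitoring.svc:11211"]   # placeholder address
    #
    thanos store \
      --objstore.config-file=bucket.yml \
      --index-cache.config-file=index-cache.yml \
      --store.caching-bucket.config-file=caching-bucket.yml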

The Sidecar is also invaluable in the event of an outage: with backups in the cloud, you can always refer back to historical data. With a typical Prometheus setup, an outage could easily cost you important data.

Basic Thanos Query

The basic Thanos setup seen in the diagram above also contains the Thanos Query component, which is responsible for aggregating and deduplicating metrics, as briefly mentioned earlier. Like the storage components, Thanos Query exposes an API, in this case the Prometheus HTTP API, which allows data within a Thanos cluster to be queried via PromQL. It ties in with the previously mentioned StoreAPI by querying the underlying stores and merging the results. The Thanos querier is “fully stateless and horizontally scalable”, per its developers.
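In practice, pointing a querier at a handful of StoreAPI endpoints is a one-liner. Here is a minimal sketch, where the endpoint addresses and the replica label are assumptions about the setup (newer Thanos versions also accept --endpoint in place of --store):

    # Fan PromQL queries out to every StoreAPI endpoint (Sidecars, Store Gateways)
    # and deduplicate series that differ only by the "replica" label.
    thanos query \
      --http-address=0.0.0.0:9090 \
      --store=prometheus-sidecar-0:10901 \
      --store=prometheus-sidecar-1:10901 \
      --store=thanos-store:10901 \
      --query.replica-label=replica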

Scaling Thanos Query

Thanos Query works as an aggregator for multiple Sidecar instances, but we could easily find ourselves running multiple Kubernetes clusters, each with multiple Prometheus instances. That would leave us with multiple Thanos Query nodes, each fronting its own subset of Sidecar and Prometheus instances. Intuitively, this looks like a difficult problem to solve.

The good news is that a Thanos Query node can also aggregate other Thanos Query nodes! Sounds confusing? It’s actually very simple:

Diagram: Querying Prometheus metrics across clusters using Thanos

It doesn’t matter if your project spans multiple Kubernetes clusters: Thanos deduplicates metrics automatically. The “head” Thanos Query node takes care of this for us by running high-performance deduplication over the merged results.
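Because Thanos Query itself exposes the StoreAPI, the “head” querier is configured exactly like any other querier, just pointed at the per-cluster queriers instead of at Sidecars. A rough sketch, with made-up cluster addresses and labels:

    # "Head" querier: each cluster's Thanos Query is treated as just another StoreAPI.
    thanos query \
      --http-address=0.0.0.0:9090 \
      --store=thanos-query.cluster-a.example.internal:10901 \
      --store=thanos-query.cluster-b.example.internal:10901 \
      --query.replica-label=prometheus_replica \
      --query.replica-label=cluster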

The clear advantage of this type of setup is that we end up with a single node from which we can query all of our metrics. But it also has a clear disadvantage: what if our head Thanos Query node goes down? Bye bye metrics?

Luckily, there are options to truly nail the Thanos setup across multiple Kubernetes clusters. This fantastic article runs through “Yggdrasil”, a multi-cluster load-balancing tool. It allows you to run queries against any of your clusters and still access all of your metrics, which is especially important in case of downtime or service failures. With careful planning, you can cut data loss down to practically zero.

Availability

It should be clear by now that the sum of all these parts is a Prometheus setup with highly available data. Thanos Query and Thanos Sidecar together make metrics and their underlying objects far easier to keep available. Given the convenience of a single point from which to query all metrics, alongside unlimited retention in object storage, it’s easy to see why the first four words on the Thanos website are “Highly available Prometheus setup”.

Uses of Thanos for Prometheus Metrics

Nubank

Many large companies are using Thanos to their advantage. Nubank, a Brazilian fintech company that topped a $10B valuation last year, has seen massive increases in operational efficiency after fitting Thanos, among other tools, into its tech stack.

A case study on the company revealed that the Nubank cloud native platform “includes Prometheus, Thanos and Grafana for monitoring”. Although it’s ultimately impossible to attribute this solely to Thanos, the case study explains how Nubank now deploys “700 times a week” and has “gained about a 30% cost efficiency”. It seems even a very large-scale application can benefit deeply from a hybrid Prometheus + Thanos setup.

GiffGaff

Popular UK mobile network provider GiffGaff also makes good use of Thanos. In fact, they have been fairly public about how Thanos fits into their stack and what sort of advantages it has given them.

Diagram: How GiffGaff uses Thanos to scale their Prometheus metrics

The diagram above shows the Thanos Store, Thanos Query and Thanos Bucket, which are critical parts of the monitoring data flow. Objects are constantly being synced and uploaded to an S3 bucket. Compared to the disk operations of a normal Prometheus setup, this scales far better and benefits from the reliability of S3.

In their post, under the “Thanos” section, GiffGaff claims: “As long as at least one server is running at a given time, there shouldn’t be data loss.” This hints at some form of multi-cluster load balancing at the very least.

“… [Thanos] allowed us [GiffGaff] to retain data for very long periods in a cost-efficient way” 

Interestingly, GiffGaff uses the Sidecar to upload objects every 2 hours, hedging against any potential Prometheus downtime.

GiffGaff uses the Thanos Store and allocates a time period to each Thanos Store cluster. This effectively rotates cluster use, keeping availability and reliability very high. The example given by GiffGaff themselves (roughly translated into flags after the list) is:

  • now – 2h: Thanos Sidecars
  • 2h – 1 month: Thanos Store 1
  • 1 month – 2 months: Thanos Store 2
  • 2 months – 3 months: Thanos Store 3
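Translated into flags, a rotation like this is just a set of Store Gateways with non-overlapping time windows. The sketch below is our reading of the scheme, not GiffGaff’s actual configuration (a month is approximated as 30d, and each command would run as its own deployment):

    # Store 1: from ~1 month ago up to ~2 hours ago
    thanos store --objstore.config-file=bucket.yml --min-time=-30d --max-time=-2h
    # Store 2: from ~2 months ago up to ~1 month ago
    thanos store --objstore.config-file=bucket.yml --min-time=-60d --max-time=-30d
    # Store 3: from ~3 months ago up to ~2 months ago
    thanos store --objstore.config-file=bucket.yml --min-time=-90d --max-time=-60d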

Downsampling is another Thanos feature that can save you a great deal of time when querying historical data. To implement it, GiffGaff used the Thanos Compactor, “performing 5m downsampling after 40 hours and 1h downsampling after 10 days.” Impressive to say the least.
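For reference, downsampling is the Compactor’s job, and the 5m-after-40-hours and 1h-after-10-days thresholds quoted above match its standard behaviour; what you typically configure is how long each resolution is retained. A minimal sketch with placeholder retention values:

    # The Compactor compacts blocks in the bucket and downsamples them:
    # to 5m resolution once blocks are ~40 hours old, and to 1h resolution
    # once they are ~10 days old. Retention per resolution is set via flags.
    thanos compact \
      --data-dir=/var/thanos/compact \
      --objstore.config-file=bucket.yml \
      --retention.resolution-raw=30d \
      --retention.resolution-5m=90d \
      --retention.resolution-1h=1y \
      --wait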

Conclusion

Now you know what Thanos is, how it interacts with Prometheus, and the kind of advantages it can give us. In this post, we also ran through some real-life examples which should give some insight into how Thanos is used in production. It should also be clear why Thanos Sidecar and its object storage model are inherently better suited to scaling than a typical Prometheus setup.

Apart from storage capabilities, the value of Thanos Query should also be clear: a single metric collection point is a massive blessing, though it comes with its own responsibilities should you need to balance the load across multiple clusters.

Lastly, downsampling through the Thanos Compactor seems like a performance no-brainer: large datasets become far easier to handle once they are downsampled.

Hopefully you understand Thanos and what it has to offer to make your life easier. If that sounds like a lot of work, Coralogix offers a powerful suite of monitoring tools. 

Overcoming DNS barriers for Kubernetes Scaling

It was a cloudy winter morning when I arrived at the office and found, to our horror, that a Kubernetes cluster was suffering from extremely high CPU and network usage and had become almost completely non-functional.

To make things worse, restarting the nodes (the go-to DevOps solution) seemed to have absolutely no effect on the issue. Something was poisoning the network, and we had to find out what it was, fast.

The first thing we did was look at the logs, and sure enough, we discovered an out-of-memory message.

Now, before resorting to any dark and perhaps satanic rituals that involve sacrificing children, we decided to run “top” (the Linux tool) on one of the nodes to understand it better.

The result really knocked us off our feet: a process named “protokube”, which usually behaves nicely, had suddenly gone berserk and was consuming between 200% and 300% CPU. We also ran “netstat -plan | grep <protokube_pid>” to see which ports it was listening on or communicating over, and found that it was port 3999/TCP.

That was suspicious indeed, but we wanted to be certain we had found the real culprit, so we ran tcpdump to capture some traffic from one of the nodes on port 3999/TCP and then used Wireshark to see how much of the total traffic was generated by Protokube.

From the chart above, it’s evident that most of the node’s traffic actually goes through port 3999, which means it’s being sent and received by the Protokube process. So we knew this process was responsible for the issue, but what does the process do in the first place?
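For anyone wanting to reproduce this kind of check, the capture itself is a one-liner; the interface and output filename below are placeholders:

    # Capture all traffic on port 3999/TCP into a pcap file for analysis in Wireshark.
    tcpdump -nn -i any -w protokube-3999.pcap 'tcp port 3999'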

A quick Google search and we could clearly see the list of Protokube’s tasks:

It soon became clear that we could rule out the first and last tasks, as they were most likely not related to our case. That left us with “Configures DNS for simple discovery”, but how does that actually work?

Our next step was to find out what useful information we could get from the traffic data to better understand how Protokube configures DNS for simple discovery.

While reviewing the raw traffic in Wireshark and looking for clues, we noticed the following communication:

As you can see above, the communication starts with the text “weave..weave”. After some searching, we found that Protokube uses the Weave network library internally for this traffic.

So now we understood why we couldn’t decipher the raw traffic (the Weave network encrypts it), but we were still left with the original question: how does Protokube configure DNS for simple discovery?

A few more searches and a few cups of tea later, we found that Protokube uses Gossip DNS, encrypted with Weave, running on ports 3999/TCP and 3998/TCP. For those who aren’t familiar, Gossip is designed to handle name resolution in a decentralized manner, to keep up with fast-changing containerized environments.

Now that we finally knew what the Protokube process did and how it related to the traffic we saw on port 3999/TCP, the original question remained: why was the Protokube process behaving so badly, and how could we fix it?

To find the answer, we had to look deeper and analyze the actual communication between the nodes on port 3999/TCP. Here’s an example of communication we found:

As you can see from the screenshot above, the node at 1xx.xx.xx.130 tries to connect to the node at 1xx.xx.xx.208 on port 3999/TCP at packet no. 6313, and as soon as it’s ready to send Gossip DNS data, the node at 1xx.xx.xx.208 responds with a TCP window size of zero.

What exactly is a TCP window size? It’s a throttling mechanism built into the TCP protocol to avoid overloading the recipient. To demonstrate, let’s consider the following example: computer A wants to send data to computer B over TCP port 3999. Here’s how the communication typically looks:

  1. Computer A: SYN, source port = X, destination port = 3999 (connection request)
  2. Computer B: SYN-ACK, source port = 3999, destination port = X (acknowledgement of the connection request)
  3. Computer A: ACK, source port = X, destination port = 3999 (acknowledgement of the response, which essentially opens the connection)
  4. Computer A: PSH, source port = X, destination port = 3999 (data pushed to the recipient)

But what happens if, for example, computer B cannot handle the incoming traffic at that particular moment (e.g. due to temporary server overload)?

This is exactly what the TCP window size is for. Every packet sent by the client or the server carries a window size, set by the sender of that packet, indicating how many bytes it is willing to accept in return.

When the advertised window size drops to zero (because of overload, or sometimes a deliberate attack), the peer that wants to transmit will periodically poll the other side to see whether the window has opened up again, and only once it has will the data finally be sent.
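You can actually watch this throttling happen on the wire. As a small sketch (the interface name is a placeholder), the filter below captures only packets on the Gossip port that advertise a zero window, i.e. the “stop sending, I’m full” signal described above:

    # tcp[14:2] addresses the 16-bit window field of the TCP header,
    # so this shows only zero-window packets on the Gossip DNS port.
    tcpdump -nn -i any 'tcp port 3999 and tcp[14:2] = 0'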

It’s absolutely normal for computers to become busy at times and use this mechanism to throttle their communications.

However, because Protokube’s Gossip DNS uses a mesh topology, things can look very different, and that’s exactly what happened in our case:

  1. Computer A=>B: SYN (source port = X, destination port = 3999)
  2. Computer B=>A: SYN-ACK (source port = 3999, destination port = X)
  3. Computer A=>B: ACK (source port = X, destination port = 3999)
  4. Computer A=>B: PSH (source port = X, destination port = 3999) with the data that needs to be transmitted
  5. Computer B=>A: PSH-ACK (source port = 3999, destination port = X) with a TCP window size of 0 because of some unrelated temporary heavy load. Now computer A cannot send its data, so it periodically polls computer B.
    ...
  6. Computer C=>B: SYN (source port = Y, destination port = 3999)
  7. Computer B=>C: SYN-ACK (source port = 3999, destination port = Y)
  8. Computer C=>B: ACK (source port = Y, destination port = 3999)
  9. Computer C=>B: PSH (source port = Y, destination port = 3999) with the data that needs to be transmitted
  10. Computer B=>C: PSH-ACK (source port = 3999, destination port = Y) with the TCP window size set to 0 because of some unrelated temporary heavy load

In our mesh topology, if computer B becomes temporarily busy for any reason, it quickly drags down every connection in the Gossip DNS network: every node trying to publish its DNS cache to computer B ends up stuck polling it, while still being polled by the rest of the mesh. The result was a cascade of overloads that got worse as nodes were added to the network, because in a full mesh the number of connections grows roughly with the square of the node count (n nodes means up to n(n-1)/2 connections).

Because even a small initial overload triggers constant polling, a minor issue eventually spiraled to the point where window sizes across the mesh dropped to zero and the network ground to a halt.

The solution I discovered was to run a script that loops through all the nodes in the cluster, connects to each one and kills the Protokube daemon, forcing Kubernetes to restart every Protokube pod. The script ran quickly, allowing more nodes to become available and respond to Gossip requests. Once every node in the cluster had been through this process, the cluster was back online and the extremely heavy load subsided.
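The sketch below shows the general shape of such a loop; the SSH user and the exact way Protokube is killed are assumptions, not the script we actually ran:

    # Loop over every node in the cluster and kill the Protokube daemon.
    # Its supervisor (systemd / Docker / Kubernetes, depending on the kops setup)
    # restarts it with a clean gossip state.
    for node in $(kubectl get nodes -o name | cut -d/ -f2); do
      echo "Restarting Protokube on ${node}..."
      ssh "admin@${node}" "sudo pkill -f protokube" || echo "WARNING: failed on ${node}"
    done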

We also reduced the number of servers by using more powerful ones. This cut the overall node count and the number of possible connections in the Gossip mesh, lowering the risk of overloading it.

Fewer nodes means less gossip which means less chance of failure.

Although this did indeed solve the problem, the real long-term solution for production environments is to abandon Gossip DNS completely and switch to a better technique, such as AWS’s Route 53 service.

Just as a side note, this very same throttling mechanism can be, and is, used by attackers to mount denial-of-service attacks against servers and services. The idea is that by sending many request packets to a server with the TCP window size set to zero or to a very low value, the attacker forces the server to hold connections open for long periods of time. Such attacks are known as Sockstress. A similar attack that operates at a higher layer of the communication is Slowloris, which requests a resource but never lets the server return it properly, leaving a large number of connections open and idle.

To summarize, we initially thought we needed to add nodes to handle our growing needs, but that was what brought the network to a halt. After investigating, we found the culprit to be the way Gossip DNS is implemented in Protokube. Counter-intuitively, by reducing the number of nodes and forcing Kubernetes to restart all the Protokube pods in the cluster, we were able to scale without sacrificing performance (or any children). Even this, however, was a short-term fix, and we eventually abandoned Gossip DNS for a dedicated DNS server.