
How to Scale Your Alerts Beyond PromQL with Coralogix Flow Alerts

  • Chris Cooney
  • September 27, 2022

When building alerts, engineers aim to make them accurate, timely, and actionable. In pursuit of this goal, many engineers will leverage PromQL throughout their careers. PromQL is the query language that Prometheus uses to query metrics and define alerting rules; Alertmanager then routes the alerts that those rules fire.

PromQL works very well for simple use cases. But as infrastructure scales, architectural patterns grow more complex, and engineering practices accelerate, alerting use cases become more multivariate. Let’s explore the limitations of PromQL and how a low-code alerting solution like Coralogix Flow Alerts can help you scale your alerting to match even the most complex cases.

Why is PromQL holding you back?

PromQL is a fundamental technology in the observability industry and features in almost every reputable platform. It is so ubiquitous that many engineering teams treat it as the default approach for handling metrics and defining alerts, but this approach comes with a series of potential issues.

PromQL requires specialist knowledge

PromQL is a query language that can scale into a complex expression, encompassing statistical, programming, and logical concepts. This means that your non-technical colleagues may struggle to use it, but potentially worse, many of your software engineers will have to navigate a difficult learning curve.

PromQL queries don’t scale very well

The readability of a PromQL query degrades fast. Take, for example, the following expression:

http_requests_total{job="apiserver", handler="/api/comments"}

We can guess what this does. It has a single clause and doesn’t perform any calculations. So what about this?

avg(rate(container_cpu_usage_seconds_total[5m])) / avg(rate(container_cpu_system_seconds_total[5m])) > 0.9 and avg(rate(http_requests_total{job="apiserver", handler="/api/comments"}[5m])) < 150

It all becomes a little difficult when you involve multiple clauses in your queries, and the larger the query becomes, the more difficult it is to understand.

PromQL leads to many small alerts

Because PromQL queries scale poorly, engineers are incentivized to write many small alerts. This means that for a given outage, dozens or perhaps hundreds of alerts might fire. This isn’t useful, nor is it actionable. Worse, it can cause alert fatigue.

As your engineering efforts scale, these issues translate into increased cost and complexity.

So what can we do about PromQL?

We need a tool that gives us the convenience of PromQL at a small scale but doesn’t burden us with the operational complexity of a multivariate PromQL alert. The answer is simple: we need a layer on top of PromQL that can orchestrate our modular alerts in complex ways, enabling us to grow and scale our alerts in response to our system demands without worrying about code complexity.

PromQL vs. Coralogix Flow Alerts – A worked example

Let’s try something more complex. Imagine we want to fire an alert if:

  1. The average CPU utilization increases over 90%
  2. AND within 10 minutes, both the average request latency and the error rate increase

To implement this in PromQL, you would first need to capture the average CPU utilization and check whether it has risen above 90% over the last five minutes:

avg(rate(container_cpu_usage_seconds_total[5m])) / avg(rate(container_cpu_system_seconds_total[5m])) > 0.9
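To make the arithmetic in that expression concrete, here is a minimal Python sketch of what it evaluates. It is a simplification, not how Prometheus is implemented: the real rate() also handles counter resets and extrapolates to the window boundaries. The names simple_rate and ratio_exceeds are hypothetical and exist only for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple

# One sample of a counter: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def simple_rate(samples: Sequence[Sample]) -> float:
    """Per-second increase between the first and last sample in a window.

    Simplified stand-in for PromQL's rate(): it ignores counter resets
    and the extrapolation Prometheus applies at the window edges.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be ordered by increasing timestamp")
    return (v1 - v0) / (t1 - t0)


def ratio_exceeds(
    usage_series: Sequence[Sequence[Sample]],
    system_series: Sequence[Sequence[Sample]],
    threshold: float = 0.9,
) -> bool:
    """Mirror avg(rate(usage[5m])) / avg(rate(system[5m])) > threshold."""
    avg_usage = mean(simple_rate(s) for s in usage_series)
    avg_system = mean(simple_rate(s) for s in system_series)
    if avg_system == 0:
        # Nothing to compare against, so treat the condition as not met.
        return False
    return avg_usage / avg_system > threshold


# Hypothetical five-minute windows for two containers.
usage = [[(0.0, 0.0), (300.0, 150.0)], [(0.0, 10.0), (300.0, 100.0)]]
system = [[(0.0, 0.0), (300.0, 120.0)], [(0.0, 0.0), (300.0, 120.0)]]
alert_should_fire = ratio_exceeds(usage, system)
```

Each inner list stands for one container's samples inside the five-minute window, so the outer avg() becomes a plain mean across containers.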

Next, we need to check if the average request latency has increased sharply AND the error rate has increased. 

increase(http_response_took_cx_avg[5m]) > 1000 and increase(http_error_perc_total[5m]) > 5

We now need to join these together so that they fire in sequence. Unfortunately, this is where we need to draw the line. After all of the work of putting these queries together, we can’t orchestrate PromQL queries over time; the engine simply doesn’t support it.
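To show what "in sequence" means here, the following Python sketch checks the time-ordered condition over a list of alert firings. It illustrates the logic only and is not Coralogix's implementation. The alert names cpu_high, latency_up, and errors_up, the Firing type, and the flow_triggered function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

WINDOW_SECONDS = 10 * 60  # "within 10 minutes" from the scenario


@dataclass(frozen=True)
class Firing:
    alert: str        # hypothetical alert name, e.g. "cpu_high"
    timestamp: float  # unix time in seconds when the alert fired


def flow_triggered(firings: Iterable[Firing], window: float = WINDOW_SECONDS) -> bool:
    """Return True if cpu_high fires and BOTH latency_up and errors_up
    fire at or after it, no later than `window` seconds afterwards."""
    events = sorted(firings, key=lambda f: f.timestamp)
    required = {"latency_up", "errors_up"}
    for cpu in (f for f in events if f.alert == "cpu_high"):
        deadline = cpu.timestamp + window
        followers = {
            f.alert for f in events if cpu.timestamp <= f.timestamp <= deadline
        }
        if required <= followers:
            return True
    return False


# Hypothetical incident: CPU spikes, then latency and errors follow.
incident = [
    Firing("cpu_high", 0.0),
    Firing("latency_up", 120.0),
    Firing("errors_up", 540.0),
]
fired = flow_triggered(incident)
```

The three underlying conditions stay as small, independent alerts; only the ordering and the ten-minute window live in the orchestration step.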

How Flow Alerts make your complex alerting simple

In Flow Alerts, we first need to define the correct alarms. We can do this by breaking up our query into three clear components:

  • avg(rate(container_cpu_usage_seconds_total[5m])) / avg(rate(container_cpu_system_seconds_total[5m])) > 0.9
  • increase(http_response_took_cx_avg[5m]) > 1000
  • increase(http_error_perc_total[5m]) > 5

Next, we declare those as alerts in their own right, on the Coralogix platform. For example, using the CPU usage alarm, that looks something like this:

Link to GIF: Screen-recording-clean (1).gif

We then string those alerts together into a single Flow Alert. A Flow Alert is a simple, low-code alerting interface that lets you create powerful relationships between individual alerts, describing the full story of an incident as it travels through your system.

Link to GIF: Flow Alert Clean.gif

On top of this, as you build your alert, you also get a living process diagram. This makes maintenance and handover far more straightforward. It also greatly simplifies the process of creating these alerts. If the basic building blocks are all in place, your Flow Alert could be built by anyone who can think logically about what your alert needs to do. It removes the stress of trying to break down and understand large, complex PromQL queries.

Feature | PromQL + AlertManager | PromQL + Flow Alerts
Easily declare simple alerting cases
Sequence multiple alerts over time 
Create alerts based on logs, metrics, traces, and security data
Generate a clear diagram of your alert as you build it
Logically connect multiple PromQL statements together, using a visual UI

Flow Alerts offer the flexibility of PromQL and the simplicity of low-code alerting

PromQL offers a remarkable level of flexibility and control over your alerting, and at Coralogix, we understand how important it is for engineers to be able to do what they do best. That’s why PromQL is fully compatible with Flow Alerts. Still, with the Flow Alerts UI, you can safeguard against the weaknesses of PromQL, enabling you to build more complex alerting flows, using features not available in AlertManager, to sequence your alerts into a single, coherent story that describes an incident from end to end.
