As part of Coralogix Alerting, metric alerts serve as your frontline defense, ensuring the uninterrupted performance, reliability, and security of your systems and applications.
Metric alerts are notifications triggered by predefined thresholds being met or exceeded for specific metrics in your Coralogix dashboard.
Metric alerts are designed to keep a vigilant eye on critical performance indicators across your infrastructure and other metrics. When specific thresholds or conditions are breached, these alerts act as an early warning system, instantly notifying your teams of any potential issues that require immediate attention. For instance, they can proactively monitor server CPU utilization, response times, error rates, and resource utilization in cloud environments.
Create PromQL alerts for standard metrics, such as Prometheus or CloudWatch metrics, or for metrics hidden within your logs using Events2Metrics.
STEP 1. In your Coralogix toolbar, go to Alerts > Alert Management. Click ADD NEW ALERT.
STEP 2. Set alert details: Name, Description, and Severity.
STEP 3. Select Alert Type.
STEP 4. Add the PromQL query that you would like to trigger the alert.
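For example, a query like the following alerts on the per-instance error rate (a sketch only: application_error_count, also used in the notes below, is assumed to be a counter metric in your account, and your own metric and label names may differ):
sum by(instance) (rate(application_error_count[5m]))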
Notes:
If your metric of interest is application_error_count and it covered code area x, you can tack on a corresponding metric label: application_error_count{area="x"}.
Use the by aggregation operator to choose which dimensions (metric labels) to aggregate along and how to split your alert notification groups. For instance, the query sum by(instance) (node_filesystem_size_bytes) returns the total node_filesystem_size_bytes for each instance.
STEP 1. Choose whether your alert will be triggered when the query is more or less than a certain value, or more than usual above a minimum threshold. When the query passes the value or minimum threshold in accordance with the conditions set, an alert will be triggered.
Selecting the more than usual condition will trigger an alert when the number of matches is higher than normal and above a minimum threshold. Find out more here.
STEP 2. Enter a percentage (for over x %) and timeframe (of the last x minutes). This determines how much of the timeframe you want to be crossing the threshold in order for the alert to trigger.
Example:
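Suppose (the numbers here are illustrative) the condition is more than 80, the timeframe is the last 10 minutes, and the percentage is over 50%. The alert fires only if the query value exceeds 80 in more than half of the one-minute data points in that window, i.e., in at least 6 of the 10 points.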
Select the percentage (at least x %) of the timeframe which needs to have values for the alert to trigger.
STEP 3. You have the option of replacing missing values with 0.
Why replace missing values with 0? Sometimes data may have missing values as seen in the graph below. When you don’t replace missing values with zero and leave them empty, the data you do have is considered to be 100% of the data.
Let’s say that you query for a time frame of 10 minutes. 6 data points have values and 4 data points have no values. If you haven’t replaced the missing values with 0, the 6 minutes with values will be considered 100% of the timeframe. This can lead to false triggers.
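To make the arithmetic concrete (numbers assumed for illustration): if 3 of those 6 populated points breach the threshold, that counts as 50% of the timeframe, even though only 3 of the full 10 minutes actually breached. Replacing the missing values with 0 makes the same data count as 3 out of 10 points, or 30%.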
STEP 4. If you are using the Less than condition, you will have the option to manage undetected values.
Undetected values occur when a permutation of a Less than alert stops reporting data, causing the alert to trigger repeatedly (once for every timeframe in which no data was sent).
When you view an alert with undetected values, you have the option to retire these values manually, or select a time period after which undetected values will automatically be retired. You can also disable triggering on undetected values to immediately stop sending alerts when an undetected value occurs.
Expand Preview Alert to preview the queried metric and defined threshold over the past 24 hours. The preview is limited to a maximum of 100 time series.
By default, a single notification, aggregating all values matching an alert query and conditions, will be sent to your Coralogix Insights screen.
You have the option of grouping alerts by one or more labels using the Group By feature and sending multiple individual notifications.
Select one or more Keys (a subset of the labels selected in the alert conditions) in the drop-down menu. A separate notification will be sent for each Key selected.
What is Group By? This feature allows you to group alerts by one or more labels that are aggregated into a histogram. An alert is triggered whenever the condition threshold is met for a specific aggregated label within the specified timeframe.
If using 2 labels for Group By, matching metrics will first be aggregated by the parent label (e.g., region), then by the child label (e.g., pod_name). An alert will fire when the threshold is met for a unique combination of both parent and child. Only metrics that include the Group By labels will be included in the count.
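As a sketch (region and pod_name are the example labels from above, and application_error_count is the example metric from the notes; your own names may differ), a query aggregated by both labels could look like the following, with region and pod_name then selected as the Group By keys so that a separate notification is sent for each combination that crosses the threshold:
sum by(region, pod_name) (application_error_count)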
Both notification types allow you to choose the parameters of your notification.
STEP 1. Notify Every. Sets the alert cadence. After an alert is triggered and a notification is sent, the alert will continue to work, but notifications will be suppressed for the duration of the suppression period.
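For instance (the cadence value here is assumed): with Notify Every set to 60 minutes, an alert whose condition keeps being met will send at most one notification per hour; additional matches within that hour are still evaluated, but their notifications are suppressed.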
STEP 2. Notify when resolved. Activate to receive an automatic update once an alert has ceased.
STEP 3. Define additional alert recipient(s) and notification channels by clicking + ADD WEBHOOK.
View your triggered alerts by clicking on the Insights tab in your navigation bar. You can see the name of the alert, the query used, a graph representing the alert, and the aggregation you have chosen.
Alerts Map presents users with a visual representation of each alert status in real-time. Grouping all of your alerts in a scalable, information-dense manner, this feature ensures optimal system monitoring. To access the Alerts Map feature, navigate to Insights in your navigation pane > Alerts Map. Find out more here.
When you first create an alert, it can take up to 15 minutes before it starts to trigger. In most cases, it will be faster.
For timeframes up to 30 minutes, we define steps every 1 minute.
For timeframes up to 12 hours, we define steps every 5 minutes.
For timeframes over 12 hours, we define steps every 10 minutes.
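For example, an alert evaluated over a 10-minute timeframe uses 1-minute steps (10 data points), while one evaluated over 24 hours uses 10-minute steps (144 data points).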
This could be the result of late-arriving data, that is, a lag in ingestion at the time the alert triggered.
Sometimes data may have missing values as seen in the graph below. When you don’t replace missing values with zero and leave them empty, the data you do have is considered to be 100% of the data.
Let’s say that you query for a time frame of 10 minutes. 6 data points have values and 4 data points have no values. If you haven’t replaced the missing values with 0, the 6 minutes with values will be considered 100% of the timeframe. This can lead to false triggers. To avoid this, either replace missing values with 0 or set that at least 100% of the timeframe needs to have values for the alert to trigger. In the latter case, if certain points don’t exist, an alert will not be triggered.
We strongly recommend viewing your metric in Grafana or your Custom Dashboard in real-time. By viewing the metric, you can see if there has been a lag in ingestion or sending.
Our suggestion is to use a PromQL function to return a value.
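One common PromQL pattern for this (a sketch, not necessarily the only approach; application_error_count is the example metric used above) is to append or vector(0) to an aggregated query so that it returns 0 instead of no data:
sum(application_error_count) or vector(0)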
Need help?
Our world-class customer success team is available 24/7 to walk you through your setup and answer any questions that may come up.
Feel free to reach out to us via our in-app chat or by sending us an email at [email protected].