As part of Coralogix Alerting, metric alerts serve as your frontline defense, ensuring the uninterrupted performance, reliability, and security of your systems and applications.
Metric alerts are notifications triggered by predefined thresholds being met or exceeded for specific metrics in your Coralogix dashboard.
Metric alerts are designed to monitor critical performance indicators across your infrastructure and applications. When specific thresholds or conditions are breached, these alerts act as your early warning system, instantly notifying your teams of potential issues requiring immediate attention. For instance, they proactively monitor server CPU utilization, response times, error rates, and resource utilization in cloud environments.
Create PromQL alerts for standard metrics, such as Prometheus or CloudWatch metrics, or for metrics hidden within your logs using Events2Metrics.
STEP 1. In your Coralogix toolbar, go to Alerts > Alert Management. Click ADD NEW ALERT.
STEP 2. Set alert details: Name, Description, and Severity.
STEP 3. Select Alert Type.
STEP 4. Add the PromQL query that you would like to trigger the alert.
Notes:

If, for example, your metric was application_error_count and it covered code area x, you can tack on a corresponding metric label: application_error_count{area="x"}.

Use the by aggregation operator to choose which dimensions (metric labels) to aggregate along and how to split your alert notification groups. For instance, the query sum by(instance) (node_filesystem_size_bytes) returns the total node_filesystem_size_bytes for each instance.

STEP 5. Define the conditions for which your alert will be triggered.
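To illustrate how a by clause splits series into separate notification groups, here is a minimal Python sketch (with hypothetical sample data) of what sum by(instance) does conceptually:

```python
from collections import defaultdict

# Hypothetical samples of node_filesystem_size_bytes,
# one per (instance, mountpoint) label combination.
samples = [
    ({"instance": "node-1", "mountpoint": "/"},     50_000_000_000),
    ({"instance": "node-1", "mountpoint": "/data"}, 200_000_000_000),
    ({"instance": "node-2", "mountpoint": "/"},     80_000_000_000),
]

# sum by(instance): keep only the "instance" label and sum within each group.
totals = defaultdict(int)
for labels, value in samples:
    totals[labels["instance"]] += value

print(dict(totals))
# {'node-1': 250000000000, 'node-2': 80000000000}
```

Each resulting group ("node-1", "node-2") is evaluated and notified on independently.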
Example:
You have the option of replacing missing values with 0.
Why replace missing values with 0? Sometimes data may have missing values as seen in the graph below. When you don’t replace missing values with zero and leave them empty, the data you do have is considered to be 100% of the data.
Let’s say that you query for a time frame of 10 minutes. 6 data points have values and 4 data points have no values. If you haven’t replaced the missing values with 0, the 6 minutes with values will be considered 100% of the timeframe. This can lead to false triggers.
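To make the arithmetic concrete, here is a minimal Python sketch (with hypothetical numbers) of how an average over a 10-minute window differs when missing minutes are dropped versus filled with zero:

```python
# Ten one-minute buckets; None marks minutes with no reported data.
samples = [80, 90, 85, None, None, 95, 88, None, None, 92]

present = [v for v in samples if v is not None]

# Without zero-fill: only the 6 reported minutes count,
# so they are treated as 100% of the window.
avg_without_fill = sum(present) / len(present)

# With zero-fill: missing minutes become 0, diluting the average.
filled = [0 if v is None else v for v in samples]
avg_with_fill = sum(filled) / len(filled)

print(f"without fill: {avg_without_fill:.1f}")  # 88.3 — may falsely breach a threshold
print(f"with fill:    {avg_with_fill:.1f}")     # 53.0
```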
If you are using the Less than condition, you will have the option to manage undetected values.
Undetected values occur when a permutation of a Less than alert stops being sent, causing the alert to trigger repeatedly (once for every timeframe in which the permutation was not sent).
When you view an alert with undetected values, you have the option to retire these values manually, or select a time period after which undetected values will automatically be retired. You can also disable triggering on undetected values to immediately stop sending alerts when an undetected value occurs.
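Conceptually, automatic retirement is a timeout per permutation. The following is a hypothetical Python sketch of the idea (not the actual Coralogix implementation; the retire period is an assumed value):

```python
RETIRE_AFTER_SEC = 3600  # assumed auto-retire period of 1 hour

last_seen = {}  # permutation -> last timestamp it reported a value

def observe(permutation: str, now: float) -> None:
    """Record that a permutation reported a value."""
    last_seen[permutation] = now

def undetected(now: float) -> list[str]:
    """Permutations that stopped reporting but are not yet retired."""
    return [p for p, t in last_seen.items() if 0 < now - t <= RETIRE_AFTER_SEC]

def retire_stale(now: float) -> None:
    """Drop permutations unseen for longer than the retire period,
    so they stop triggering the alert."""
    for p in [p for p, t in last_seen.items() if now - t > RETIRE_AFTER_SEC]:
        del last_seen[p]
```

For example, a permutation observed at t=0 is reported as undetected at t=600, and retire_stale at t=4000 removes it so it no longer triggers.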
STEP 6. Expand Preview Alert to preview the queried metric and defined threshold over the past 24 hours. The preview is limited to a maximum of 100 time series.
STEP 7. Define Notification settings.
In the notification settings, you have different options, depending on whether or not you are using the Group By condition.
When using Group By conditions, you will see the following options:
When not using the Group By condition, a single alert will be triggered and sent to your Incidents Screen when the query meets the condition.
You can define additional alert recipient(s) and notification channels in both cases by clicking + ADD WEBHOOK. Once you add a webhook, you can choose the parameters of your notification:
Notes:
Both notification types allow you to choose the parameters of your notification.
STEP 1. Notify Every. Sets the alert cadence. After an alert is triggered and a notification is sent, the alert will continue to work, but notifications will be suppressed for the duration of the suppression period.
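The cadence logic can be sketched as follows (hypothetical Python, assuming a Notify Every period of 10 minutes):

```python
NOTIFY_EVERY_SEC = 600  # "Notify Every" cadence: 10 minutes (assumed value)

last_notified = None  # timestamp of the last notification sent

def should_notify(triggered: bool, now: float) -> bool:
    """Send a notification only if the alert fired and the suppression
    window since the last notification has elapsed."""
    global last_notified
    if not triggered:
        return False
    if last_notified is not None and now - last_notified < NOTIFY_EVERY_SEC:
        return False  # alert still evaluates, but the notification is suppressed
    last_notified = now
    return True

print(should_notify(True, 0))    # True  — first trigger notifies
print(should_notify(True, 300))  # False — suppressed, within the cadence
print(should_notify(True, 700))  # True  — cadence elapsed
```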
STEP 2. Notify when resolved. Activate to receive an automatic update once an alert has ceased.
STEP 3. Define additional alert recipient(s) and notification channels by clicking + ADD WEBHOOK.
STEP 8. View your triggered alerts.
Our Incidents Screen displays all of your triggered alert events within the Coralogix platform. View all those events that are currently triggered or those triggered within a specific time frame. With easy-to-use functionalities and the ability to drill down into events of interest, the feature ensures top-notch monitoring and analysis. Find out more here.
Alerts Map presents users with a visual representation of each alert status in real time. Grouping all of your alerts in a scalable, information-dense manner, this feature ensures optimal system monitoring. To access the Alerts Map feature, navigate to Alerts > Alert Map. Find out more here.
When an alert is created, it can take up to 15 minutes before it first triggers. In most cases, it will be faster.
For timeframes up to 30 minutes, we define steps every 1 minute.
For timeframes up to 12 hours, we define steps every 5 minutes.
For timeframes over 12 hours, we define steps every 10 minutes.
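The step rules above can be sketched as a simple function (a Python sketch of the documented rules, not the actual implementation):

```python
def step_minutes(timeframe_minutes: int) -> int:
    """Evaluation step used for a given alert timeframe, per the rules above."""
    if timeframe_minutes <= 30:
        return 1          # up to 30 minutes: 1-minute steps
    if timeframe_minutes <= 12 * 60:
        return 5          # up to 12 hours: 5-minute steps
    return 10             # over 12 hours: 10-minute steps

print(step_minutes(15))       # 1
print(step_minutes(120))      # 5
print(step_minutes(24 * 60))  # 10
```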
This could result from late-arriving data, i.e., a lag in ingestion at the time the alert triggered.
Sometimes data may have missing values as seen in the graph below. When you don’t replace missing values with zero and leave them empty, the data you do have is considered to be 100% of the data.
Let’s say that you query for a time frame of 10 minutes. 6 data points have values and 4 data points have no values. If you haven’t replaced the missing values with 0, the 6 minutes with values will be considered 100% of the timeframe. This can lead to false triggers. To avoid this, either replace missing values with 0 or require that at least 100% of the timeframe has values for the alert to trigger. In the latter case, if certain points don’t exist, the alert will not be triggered.
We strongly recommend viewing your metric in Grafana or your Custom Dashboard in real time. By viewing the metric, you can see if there has been a lag in ingestion or sending.
Our suggestion is to use a PromQL function that returns a value.
Need help?
Our world-class customer success team is available 24/7 to walk you through your setup and answer any questions that may come up.
Contact us via our in-app chat or by emailing [email protected].