Alerting has been a fundamental part of operations strategy for the past decade. An entire industry is built around delivering valuable, actionable alerts to engineers and…
Observable and secure platforms use three connected data sets: logs, metrics, and traces. Platforms can link this data to alerting systems that notify system administrators when an event requires intervention. There are nuances to setting up these alerts so that the system stays healthy and administrators are not chasing false positives.
Modern platforms have user interactions around the clock. Observability tools monitor systems in real time as they are used, keeping up with demand where people cannot.
System administrators need to keep watch on their platform in real time, but attempting to do this manually, without tools, is not realistic. When a platform has observability tools plugged in, administrators can use dashboards and visualizations to know when there is a problem. But there are cases where visualizations fall short: when events occur in the middle of the night, or when the visualization of an event is not clear to administrators, issues can be missed. Alerts help system administrators identify issues that need to be actively managed. When alerts are in place, administrators can disengage from watching the dashboard.
Administrators should configure alerts for events requiring human intervention. An alert would notify the administrator of what has occurred, demanding their attention. The administrator can then use their observability dashboard to identify the problem and determine the steps to solve the issue.
Alerts can do more than notify administrators of issues; they can also rectify issues on their own. For example, an alert for CPU usage nearing its limit could notify an administrator of the issue. The same alert could also scale up CPU resources in the platform, fixing the issue sooner than human intervention could.
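Sketched in Python, a remediating alert might look like the following. The `notify` and `scale_up` callbacks are hypothetical stand-ins for a real paging integration and autoscaler API, and the 85% threshold is illustrative:

```python
def evaluate_cpu_alert(cpu_percent, threshold=85.0, notify=print, scale_up=None):
    """Notify on high CPU usage and optionally trigger auto-remediation.

    `notify` and `scale_up` are placeholders for a real paging
    integration and a real autoscaler call.
    """
    if cpu_percent <= threshold:
        return False  # healthy: no alert
    notify(f"CPU usage at {cpu_percent:.1f}% exceeds {threshold:.1f}%")
    if scale_up is not None:
        scale_up()  # e.g. add a node or raise a CPU limit
    return True
```

The same evaluation both pages a human and kicks off remediation, so the platform can begin recovering before anyone even picks up the page.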
Alerts must be customized to the platform they are handling. Every platform has a different architecture, user base, and functional requirements; alerts must be equally unique and tuned to the platform. Customized alerts can be created individually by administrators or can rely on machine learning tools to find events that require attention.
Identify architecture bottlenecks in your platform and ensure data flows through at the expected rate. Bottlenecks appear in different places in the architecture depending on the frameworks used. A Kubernetes cluster could have a bottleneck in Ingress processing, causing external requests to be dropped. A message bus could have a processing bottleneck, causing a queue to overflow. Both are bottlenecks, but detecting and fixing them requires quite different approaches.
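As a minimal sketch of the message-bus case, assuming we can sample the enqueue and dequeue rates plus the current queue depth (the 80% fill fraction is an illustrative choice), a backlog check might look like:

```python
def queue_backlog_alert(enqueue_rate, dequeue_rate, depth, max_depth, fill_fraction=0.8):
    """Flag a processing bottleneck: the queue is filling faster than
    it drains AND its depth is approaching capacity."""
    return enqueue_rate > dequeue_rate and depth > fill_fraction * max_depth
```

Requiring both conditions avoids alarming on a briefly busy but healthy queue.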
Technology-oriented metrics are commonly integrated with alarms so system administrators know when their offering requires maintenance or upgrading. Typical alerts cover server health issues, high latency, and CPU and memory overages. Alerts can also relate to security threats such as DDoS attacks.
Observability platforms can also integrate with business metrics to alert administrators and marketing teams when unexpected user interactions occur. Platforms can identify key functionality and produce alerts when operations deviate from expectations. Key functionality may include sending user emails or push notifications, making sales, collecting survey data, and more. A business-oriented metric alert compares event rates against expectations and fires when they diverge, since human intervention may be needed to determine why.
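A minimal sketch of such a check, assuming an expected event count is known for the window (the 30% tolerance is illustrative):

```python
def business_rate_alert(observed_events, expected_events, tolerance=0.3):
    """Fire when an event rate (emails sent, sales, survey responses)
    deviates from expectation by more than `tolerance`, in either
    direction, so both spikes and dips surface."""
    if expected_events == 0:
        return observed_events > 0
    deviation = abs(observed_events - expected_events) / expected_events
    return deviation > tolerance
```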
Administrators can add service level objectives (SLOs) to their observability solution by creating service level indicator (SLI) metrics. The SLIs show how the platform is behaving and whether SLOs are being met. Common SLI metrics include latency, availability, response time, and error rate.
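For example, an availability SLI can be computed as the fraction of successful requests and compared against an SLO target (the 99.9% target below is illustrative):

```python
def availability_sli(successful, total):
    """Availability SLI: fraction of requests that succeeded."""
    return successful / total if total else 1.0

def slo_met(sli_value, slo_target=0.999):
    """Check an SLI measurement against its SLO target."""
    return sli_value >= slo_target
```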
Alert fatigue is when workers receive so many alerts that they become desensitized to them, resulting in missed or ignored alerts or delayed responses to alerts. Alarms constantly going off in the middle of the night and filling workers’ inboxes during the day can lead to burnout, low productivity, and employee turnover. False or redundant alerts exacerbate the issue of alert fatigue, adding more alarms when they are not necessary.
A 2021 report on web application security found that 45% of alerts seen were false alarms. Beyond that, teams were spending as much time tracking false alarms as actual security breaches, showing just how intrusive false alarms can be. Choosing an observability solution that minimizes false alarms is critical to avoiding alert fatigue and wasted time.
Some solutions have proven they can decrease false positives and time to resolution, showing the right toolset can solve these problems. Coralogix can reduce false positives by using advanced machine-learning-powered alerting techniques. Event noise reduction techniques employ machine learning to identify patterns in the metrics and suppress events that are normal to each platform, allowing only critical events to generate alarms.
Static alerting techniques use fixed, preset thresholds as the limit for a watched metric. For example, CPU usage could be monitored with a static threshold where any value above some percentage is considered unacceptable. Any value above the preset limit triggers an alarm sent to the administration team or an engineer.
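In code, a static alert is little more than a comparison against a hard-coded limit (the sample values and 90% threshold are illustrative):

```python
cpu_samples = [42.0, 55.3, 91.2, 60.1, 95.8]  # recent CPU usage, percent
THRESHOLD = 90.0  # fixed, preset limit

# Any sample above the static threshold triggers an alarm.
breaches = [s for s in cpu_samples if s > THRESHOLD]
for value in breaches:
    print(f"ALERT: CPU at {value}% exceeds static threshold {THRESHOLD}%")
```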
Determining an accurate threshold and maintaining that value is not an easy task. The limit must be determined for more than a single environment; every cluster must monitor its CPU usage, and administrators must set a CPU threshold for each one. As product development continues, the acceptable percentage may change and require a new threshold. The administrator must keep up with these changes and apply them to the observability system so alarms fire only when intended. This job quickly becomes large when considering the potential size of a system and the hundreds or thousands of metrics it may include.
The CPU usage case is typical and relatively simple compared to other metrics that require tracking. Many thresholds in a system must be set relative to 'normal' behavior; if that baseline is not established, the result is long-term alert noise while the administration team searches for the best value. For example, if the latency on a connection exceeds the typical time, the administration team would like to be alerted. Administrators could guess what that typical time is, but several alarms that do not require human intervention will likely fire while the number is tuned.
There are cases where a static alert is optimal. When alerting on a change in metric value, or delta, such as CPU usage jumping by several percent in a short time, a static threshold tracks this well.
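A delta alert needs only the previous and current samples; the 10-point jump below is an illustrative limit:

```python
def delta_alert(previous, current, max_jump=10.0):
    """Static delta alert: fire when a metric jumps by more than
    `max_jump` points between consecutive samples."""
    return (current - previous) > max_jump
```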
Machine-powered dynamic alerting uses existing metric data to set thresholds for alerts. Observability platforms generally calculate thresholds by considering event context such as time of day, day of the week, and seasonal variation. For example, CPU usage may be consistently higher during scheduled maintenance, which a dynamic alert can predict and ignore. Dynamic thresholds relieve administrators of the difficult job of determining what threshold to set, instead showing them what typical values are so they can make system adjustments as needed.
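Production platforms build learned baselines; as a minimal stand-in for the idea, a threshold can be derived from recent history as the mean plus a few standard deviations (the choice of k=3 is illustrative):

```python
from statistics import mean, stdev

def dynamic_threshold(history, k=3.0):
    """Derive an upper alert limit from recent samples: mean + k*stddev.
    A simple stand-in for the learned baselines an observability
    platform builds from historical data."""
    return mean(history) + k * stdev(history)

def dynamic_alert(history, current, k=3.0):
    """Fire when the current sample exceeds the derived threshold."""
    return current > dynamic_threshold(history, k)
```

Because the limit is recomputed from the data itself, it tracks each environment's own normal instead of a guess an administrator has to maintain.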
Specialized alerting is also possible within a machine-learning-powered observability platform. These alerts catch common problems that are impossible to express with a static alert and complex to set up with only basic dynamic alerts.
Time-relative alerts compare metrics across time and notify teams when a change reaches a certain threshold. For example, compare error rates across days, alerting when errors increase day over day. Business teams could also set up alerts for spikes or dips in sales, firing when purchasing trends change.
A ratio alert compares the value of one metric to another and triggers an event when the ratio reaches a certain threshold. For example, compare the number of denied login requests against the total number of login requests to detect attacks on your platform.
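For the login example, a ratio check might look like this (the 20% limit is illustrative):

```python
def login_ratio_alert(denied, total, max_ratio=0.2):
    """Ratio alert: fire when denied logins exceed `max_ratio` of all
    login attempts, a possible sign of a brute-force attack."""
    if total == 0:
        return False
    return denied / total > max_ratio
```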
When a platform has heavy use, the number of events needed to trigger an alarm changes. An API with a small number of users may want to detect when even a single error occurs, such as a returned 500 error code. With thousands or millions of users, that value needs to scale to ensure a real problem exists before an alert reaches administrators. Since the 500 error code lives in the logs rather than in a numerical metric, some systems recommend static alerts here, which can lead to false positives.
A unique count alert fires on the number of unique events that occur in event logs, configured with how many occurrences should be detected and in what timeframe. For the example above, alerts can be sent when more 500 error codes are detected than is typical for the platform.
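A minimal sketch, assuming each log event carries a timestamp and a field of interest (here, counting distinct request paths that produced 500s within a window; the field name and limits are illustrative):

```python
from datetime import datetime, timedelta

def unique_count_alert(events, field, window, now, max_unique):
    """Fire when the number of distinct values of `field`, among events
    inside the last `window`, exceeds `max_unique`."""
    cutoff = now - window
    values = {e[field] for e in events if e["time"] >= cutoff}
    return len(values) > max_unique
```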
We have discussed why alerts are critical to creating an observable platform and why they can be detrimental when they send false positive notifications. Static alerts are helpful in some cases but generally cause false positive alarms and can trigger alert fatigue across administration teams. Dynamic alerting is a powerful, machine-learning-driven tool that can reduce false positives and help administrators set appropriate thresholds. Observability tools like Coralogix's log monitoring can provide dynamic alerting that has been proven to reduce false positive alarms.