Monitoring is a fundamental pillar of modern software development. With the advent of modern software architectures like microservices, high-performance monitoring and alerting has shifted from useful to mandatory. Combine this with an average outage cost of $5,600 per minute, and you’ve got a compelling case for investing in your monitoring capability. However, many organizations still simply react to incidents as they see them, and they never achieve the next stage of operational excellence: proactive monitoring. Let’s explore the difference between reactive and proactive monitoring and how you can move to the next level of modern software resilience.
Proactive vs. Reactive: What’s the difference?
Reactive monitoring is the classic model of software troubleshooting. If the system is working, leave it alone and focus on new features. This represents a long history of monitoring that simply focused on responding quickly to outages and is still the default position for most organizations that are maintaining in-house software.
Proactive monitoring builds on top of your reactive monitoring practices and uses many of the same technical components, with one key difference: rather than waiting for your system to reach out with an alarm, you interrogate your observability data to develop new, ad hoc insights about your platform’s operational and commercial success.
Understanding Reactive Monitoring
Reactive monitoring is the easiest to explain. If your database runs out of disk space, a reactive monitoring platform informs you that your database is no longer working. If a customer calls to say the website is down, you use your reactive monitoring toolset to check for obvious errors.
In practice, reactive monitoring looks like alerts on key metrics. When a metric exceeds a threshold, an alarm fires and tells your engineers that something is broken. You may also capture logs, metrics, and traces, but you never look at them until one of your alarms tells you to. These behaviors are the basic tenets of reactive monitoring.
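To make the threshold model concrete, here is a minimal sketch of a reactive alert check. The metric names and limits are hypothetical, chosen only for illustration, and a real setup would express this in an alerting tool rather than in hand-written code.

```python
from dataclasses import dataclass


@dataclass
class ThresholdAlert:
    """A reactive alert: it fires only once a metric has crossed its limit."""
    metric: str
    threshold: float

    def check(self, value: float) -> bool:
        # True means the limit is already exceeded, so something is likely broken.
        return value > self.threshold


def evaluate(alerts, samples):
    """Return the names of metrics whose latest sample breaches its threshold."""
    return [
        alert.metric
        for alert in alerts
        if alert.metric in samples and alert.check(samples[alert.metric])
    ]


# Hypothetical metrics and limits, for illustration only.
alerts = [
    ThresholdAlert("disk_used_percent", 90.0),
    ThresholdAlert("http_5xx_per_minute", 50.0),
]
latest = {"disk_used_percent": 97.5, "http_5xx_per_minute": 3.0}
firing = evaluate(alerts, latest)
print(firing)
```

Note that the check says nothing until the disk is already above 90 percent, which is exactly the limitation of a reactive-only approach.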
So what are the limitations of reactive monitoring?
The obvious limitation of reactive monitoring is that you’re reacting to incidents rather than preventing them, which leads to service disruptions and customer impact. A reactive-only strategy also has some more subtle consequences. It means more time spent troubleshooting issues: interruptions can constitute up to 6 hours of your time on a typical working day. These interruptions, plus potentially expensive outages, can add up to a lot of lost revenue and reputational damage that may impact your ability to attract talented employees and new customers.
What is Proactive Application Monitoring?
We’ll go further into this, but in practice proactive monitoring looks like this:
- Multiple alert levels for different severity events. For example, Google advises three levels: notify, ticket, and page.
  - Notify is simply a piece of information, such as the database using more CPU than usual.
  - Ticket is an issue that isn’t pressing but should be dealt with soon.
  - Page is an actual alarm, raised when something is broken.
- Predictive alarms tell you when something is going to happen rather than when it has already happened. Prometheus supports this with its predict_linear function, for example, which extrapolates a metric’s recent trend into the future. This is a straightforward implementation but illustrates the idea perfectly.
- Interrogating your observability data regularly to understand how your system is behaving. For example, using Lucene query syntax to search your Elasticsearch cluster or PromQL to generate insights from your Prometheus data.
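The first two ideas in the list combine naturally: forecast a metric, then map the forecast onto the notify, ticket, and page levels. The sketch below fits a least-squares line to recent disk usage samples and extrapolates it, which is similar in spirit to predict_linear but is not Prometheus’s implementation. The function names, the 4-hour, 24-hour, and 7-day horizons, and the sample data are all assumptions made for illustration.

```python
from enum import Enum


class Severity(Enum):
    NONE = "none"
    NOTIFY = "notify"
    TICKET = "ticket"
    PAGE = "page"


def linear_fit(points):
    """Least-squares slope and intercept for (time_seconds, value) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("need samples at two or more distinct times")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def predict(points, seconds_ahead):
    """Extrapolate the fitted line seconds_ahead past the latest sample."""
    slope, intercept = linear_fit(points)
    latest_t = max(t for t, _ in points)
    return slope * (latest_t + seconds_ahead) + intercept


def classify_disk(points, capacity):
    """Map the forecast onto alert levels. The horizons are illustrative."""
    _, latest_value = max(points)  # the sample with the greatest timestamp
    if latest_value >= capacity:
        return Severity.PAGE  # already full: this is the reactive case
    if predict(points, 4 * 3600) >= capacity:
        return Severity.PAGE  # projected to fill within hours
    if predict(points, 24 * 3600) >= capacity:
        return Severity.TICKET  # projected to fill within a day
    if predict(points, 7 * 24 * 3600) >= capacity:
        return Severity.NOTIFY  # projected to fill within a week
    return Severity.NONE


# Hypothetical hourly samples of disk usage in GB on a 100 GB volume.
slow_growth = [(0, 50.0), (3600, 51.0), (7200, 52.0), (10800, 53.0)]
fast_growth = [(0, 50.0), (3600, 60.0), (7200, 70.0), (10800, 80.0)]

print(classify_disk(slow_growth, 100.0))
print(classify_disk(fast_growth, 100.0))
```

Neither volume is full yet, so a plain threshold alert would stay silent in both cases. The forecast separates the one that deserves a heads-up from the one that needs attention now.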
Machine learning is a powerful component of proactive monitoring
AIOps has found its place in the observability tech stack, and machine learning-driven alerts are a cornerstone of proactive monitoring. With a traditional alert, you’re setting alarms around your “known knowns” and your “known unknowns.” These are either issues you already understand, for example, a spike in database CPU usage following increased user traffic, or issues you know about but haven’t diagnosed yet, like a sudden increase in HTTP latency.
With a machine learning alert, you’re looking more broadly for anomalous behavior. You can raise an alert if your system begins to exhibit behavior that it hasn’t shown before, especially around a specific metric or type of log. This is incredibly powerful because it adds a safety net to your existing traditional alerts and can detect new issues that fall into the “unknown unknown” category. An example might be an error that only manifests when a series of other conditions are true, like time of day, number of on-site users, and load on the system.
These issues are challenging to detect, and they tend to drive reactive monitoring behaviors: “we’ll just wait until it happens next time.” With a machine learning alert, you can catch these incidents in their early stages and analyze the anomalous behavior to gain new and powerful insights into the behavior of your system.
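As a rough illustration of anomaly-based alerting, the sketch below flags values that deviate sharply from a rolling baseline. It uses a simple z-score rather than a trained machine learning model, so treat it as a statistical stand-in for the idea and not as how any particular AIOps product works. The window size, the threshold, and the latency data are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=20, z_threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds with one unexpected spike.
latency_ms = [100 + (i % 5) for i in range(40)]
latency_ms[30] = 400

print(find_anomalies(latency_ms))
```

No fixed threshold was configured for latency here. The spike stands out only because it is unlike the system’s own recent behavior, which is what lets this style of alert catch issues nobody thought to write a rule for.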
Proactive monitoring is the difference between firefighting and forward-fixing. A proactive approach gives your system a voice, so it surfaces data without waiting for an incident or outage. When this approach is coupled with a machine learning strategy, you’ve got a system that informs you of undesirable (but not critical) behavior, plus alarms that tell you about new, potentially unwanted behavior that you never considered in the first place. This allows you to leverage your observability data to help you achieve your operational and commercial goals.