Monitoring IT infrastructure and services has always been an essential IT prerequisite. However, your IT monitoring system and security measures need to upgrade with an exponential…
Mean time to repair (MTTR) is an essential metric that represents the average time it takes to repair and restore a component or system to functionality. It is a primary measurement of the maintainability of an organization’s systems, equipment, applications and infrastructure, as well as its efficiency in fixing that equipment when an IT incident occurs.
Key challenges with MTTR arise from just trying to figure out that there is actually a problem. Incorrect diagnosis or inadequate repairs can also lengthen MTTR. A low MTTR indicates that a component or service of a distributed system can be repaired quickly and, consequently, that any IT issues associated with it will probably have a less significant impact on the business.
The following section will describe some of the challenges faced with managing MTTR. In essence trying to show that a high MTTR for an application, device or system failure, can result in a significant service interruption and thus a significant business impact.
Here are 6 common issues that contribute to a high (i.e. poor) MTTR:
To start reducing MTTR, you need to better understand your incidents and failures. Modern enterprise software can help you automatically unite your siloed data to produce a reliable MTTR metric and valuable insights about contributing factors.
By measuring MTTR, you accept that sometimes things will go wrong. It is just a part of development. Once you’ve accepted that the development process is about continuously improving, analyzing and collecting feedback, you will realize that MTTR will lead to better things. Such as faster feedback mechanisms, better logging and processes for making recovery as simple as deployment.
Having a robust incident management action plan, will allow an organization and development teams to have a clear escalation policy that explains what to do if something breaks. The plan will define who to call, how to document what is happening, and how to set things in motion to solve the problem.
It will cover a chain of events that begins with the discovery of an application or infrastructure performance issue, and that ends with learning as much as possible about how to prevent issues from happening again. Thus covering every aspect of a solid strategy for reducing MTTR.
A good monitoring solution will provide you with a continuous stream of real-time data about your system’s performance. It is usually presented in a single, easy-to-digest dashboard interface. The solution will alert you to any issues as they arise and should provide credible metrics.
Having proper visibility into your applications and infrastructure can make or break any incident response process.
Consider an example of a troubleshooting process without monitoring data. A server hosting a critical application goes down, and the only ‘data’ available to diagnose the problem is the lack of a power source on the front of the server. An incident response team is forced to diagnose and solve the problem with a heavy amount of guesswork. This leads to a long and costly repair process and a high MTTR.
If you have a monitoring solution with real-time monitoring data flows from the application, server, and related infrastructure it changes the situation drastically. It gives an incident response team an accurate read on server load, memory and storage usage, response times, and other metrics. The team can formulate a theory about what is causing a problem and how to fix it using hard facts rather than guesswork.
Response teams can use this monitoring data to assess the impact of a solution as it is being applied, and to move quickly from diagnosing to resolving an incident. This is a powerful one-two combination, making monitoring perhaps the single most important way to promote an efficient and effective incident resolution process and reduce MTTR.
When it comes to maintaining a low MTTR, there’s no substitute to a thorough action plan. For most organizations, this will require a conventional ITSM (Information Technology Service Management) approach with clearly delineated roles and responses.
Whatever the plan, make sure it clearly outlines whom to notify when an incident occurs, how to document the incident, and what steps to take as your team starts working to solve it. This will have a major impact on lowering the MTTR.
An action plan needs to follow an incident management policy or strategy. Depending on the dynamics of your organization this can include any of the following approaches.
Smaller agile companies typically use this approach. When an incident occurs, the team figures out who knows that technology or system best and assigns a resource to fix it.
This is the traditional ITSM approach often used by larger, more structured organizations. Information Technology (IT) is generally in charge of incident management in this kind of environment.
Change management concerns are paramount, and response teams must follow very strict procedures and protocols. In this case, structure is not a burden. It is classed as a benefit.
Responses are shaped to the specific nature of individual incidents, and they involve significant cross-functional collaboration and training to solve problems more efficiently. The response processes will continuously evolve over time. A fluid incident response approach allows organizations to channel the right resources and to call upon team members with the right skills, to address situations in which it is often hard to know at first exactly what is happening.
Integrating a cloud-based log management service into an incident management strategy will enable any team to resolve incidents with more immediacy. During an incident, response teams will be able solve a problem under time pressure, and not have to function differently from their day-to-day working activities.
An automated incident management system can send multi-channel alerts via phone calls, text messages and emails, to all designated responders at once. This will significantly save time that would otherwise be wasted attempting to locate and manually contact each person individually.
Using an automated incident management system for monitoring, you have visibility into your infrastructure that can help you diagnose problems more quickly and more accurately.
For example, having real-time data on the volume of a server’s incoming queries and how quickly the server is responding to them will better prepare you to troubleshoot an issue when that server fails. Data also allows you to see how specific actions to repair system components are impacting system performance, allowing you to apply an appropriate solution more quickly.
A new set of technologies has emerged in the past few years that enables incident response teams to harness Artificial Intelligence (AI) and Machine Learning (ML) capabilities, so they can prevent more incidents and respond to them faster.
These capabilities analyze data generated by software systems in order to predict possible problems, determine the root causes, and drive automation to fix them. It complements your monitoring practices by providing an intelligent feed of incident information alongside your telemetry data. When you use that information to analyze and take action on that data, you will be better prepared for troubleshooting and incident resolution.
As you develop incident response procedures and establish monitoring and alerting practices, be sure to document them and if possible ‘automate’ them using an incident management runbook automation tool.
Automating the process allows you to execute runbooks and automated tasks for faster, more repeatable and consistent problem resolution. When configured and enabled, you can associate runbooks with a process that tells incident response team members exactly what to do when a specific problem occurs.
Use runbooks to collect the response team’s knowledge about a given incident-response scenario in one place. In addition to helping you reduce MTTR, runbooks are useful for training new team members, and they are especially helpful when important members of the team leave the organization.
The idea is to use a runbook as a starting point. It will save time and energy when dealing with known issues, and allowing the team to focus on the most challenging and unique aspects of a problem.
Clearly defined roles and responsibilities are crucial for effectively managing incident response and lowering MTTR. This includes the definition of roles for Incident Management, First and Second line support.
When constructing an incident response team be sure it has a designated leader who oversees incident response and ensures strong communication with stakeholders within and outside the team, and that all team members are clear on responsibilities.
The incident team lead is responsible for directing both the engineering and communication responses. The latter involves engagement with customers, both to gather information and to pass along updates about the incident and our response to it. The incident team lead must make sure that the right people are aware of the issue.
Each incident may also require a technical lead who reports to the incident team lead. The technical lead typically dictates the specific technical response to a given incident. They should be an expert on the system(s) involved in an incident, allowing them to make informed decisions and to assess possible solutions so they can speed resolution and optimize the team’s MTTR performance.
Another important role that an incident may require is a communications lead. The communications lead should come from a customer service team. This person understands the likely impact on customers and shares these insights with the incident team lead. At the same time, as information flows in the opposite direction, the communications lead decides the best way to keep customers informed of the efforts to resolve the incident.
Having focused knowledge specialists on your incident response team is invaluable. However, if you rely solely on these specialists for relatively menial issues, you risk overtaxing them, which can diminish the performance of their regular responsibilities and eventually burn them out. It also handcuffs your response team if that specialist simply is not around when an incident occurs.
It makes sense to invest in cross-training for team members, so they can assume multiple incident response roles and functions. Other members of the team should build enough expertise to address most issues, allowing your specialists to focus on the most difficult and urgent incidents. Comprehensive runbooks can be a great resource for gathering and transferring specialized technical knowledge within your team.
Cross training and knowledge transfer also helps you to avoid one of the most dangerous incident response risks. That being a situation in which one person is the only source of knowledge for a particular system or technology. If that person goes on vacation or abruptly leaves the organization, critical systems can turn into black boxes that nobody on the team has the skills or the knowledge to fix.
You ultimately lower your MTTR by making sure all team members have a deep understanding of your system and are trained across multiple functions and incident-response roles. Your team will be positioned to respond more effectively no matter who is on call when a problem emerges.
While MTTR is not a magic number, it is a strong indicator of an organization’s ability to quickly respond to and repair potentially costly problems. Given the direct impact of system downtime on productivity, profitability and customer confidence, an organization’s understanding of MTTR and its functions is essential for any technology centric company.
You can mitigate the challenges identified, and ensure a low MTTR by making sure all team members have a deep understanding of your systems and are trained across multiple functions and incident-response roles. It will make your team to be positioned to respond more effectively no matter who is on call when a problem emerges.