Using Coralogix + StackPulse to Automatically Enrich Alerts and Manage Incidents

Keeping digital services reliable is more important than ever. When something goes wrong in production, on-call teams face significant pressure to identify and resolve the incident quickly in order to keep customers happy. But it can be difficult to get the right signals to the right person in a timely fashion. Most teams use an observability platform (like Coralogix) to identify issues and alert on signals, plus some type of paging or routing system that passes those alerts onward.

Being able to automate this process – so the first time an on-call responder sees an alert, they have all the necessary information to triage the severity of the incident and understand the root cause – saves significant time, helping teams restore services faster and keep customers satisfied.

In this post, we’ll cover how StackPulse and Coralogix can be used together to automatically enrich alerts for faster and better incident management.

What is StackPulse?

StackPulse is an integration-first platform — it easily connects to all the systems you are using today to ingest data. As data is sent to the platform, it’s automatically enriched using the characteristics of the event and by pulling more information from other integrations (e.g., reaching out to a Kubernetes cluster or using a cloud service provider’s API). StackPulse can also bundle this information and deliver it to the communications platforms that your teams are already using.

One core capability of StackPulse is the ability to take an alert and enrich it before an operator starts responding to the event. This helps minimize alert fatigue, as the added real-time context reduces the time invested in triaging, analyzing, and responding to events. With StackPulse you can create automated, code-based playbooks to investigate and remediate events across your infrastructure.

StackPulse Playbooks are built from reusable artifacts called Steps. StackPulse allows you to easily build and link steps together to perform actions that interface with multiple systems, as we’ll do with this example. We’ll cover this in more detail later on, as this adds a lot of power to the integration between Coralogix and StackPulse.

Using StackPulse and Coralogix to Enrich and Alert on Data

StackPulse can communicate bi-directionally with Coralogix, an observability platform that analyzes and prioritizes data before it’s indexed and stored. This allows teams to respond to incidents more effectively and efficiently, as the time needed for manual investigations and for setting up alerts is almost entirely eliminated.

In this example, we’ll spotlight how StackPulse ingests alerts from Coralogix, reacts to an event, and goes on to gather additional context-specific information from the Coralogix platform and a Kubernetes cluster. The enrichment of the initial event is done automatically, without any manual steps — based on the context of the alert.

Along the way, we’ll cover different components of the scenario in detail. In this example, Coralogix is receiving logs from an application — a Sock Shop microservices demo running on a Google Kubernetes Engine cluster.

Coralogix is configured to monitor the health of the application and sends dynamic alerts powered by machine learning to StackPulse via a webhook configuration when critical application or cluster errors are found. We also have a Slack workspace set up with the StackPulse app installed and configured.

StackPulse will ingest the alerts and use the information in the alert payload to provide context for a StackPulse Playbook to perform the alert enrichment and remediation actions.  

Incident Response for a Kubernetes Log Error

Our example begins when an alert is generated in the Coralogix platform after an error is identified in the Kubernetes logs.

[Screenshot: Kubernetes log error alert in Coralogix]

When this happens, the Coralogix platform sends the event to a configured integration in StackPulse. The first view of this data in StackPulse is in the Journal, which is a federated view of all events passing into the StackPulse Platform.

[Screenshot: the Journal in the StackPulse platform]

If we click on the event in the Journal we can see the raw payload coming from Coralogix.

[Screenshot: raw Coralogix alert payload for the Sock Shop application]
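
The exact contents depend on how the Coralogix alert and webhook are configured, but the payload carries the fields the Playbook will key off of. A rough sketch of that kind of payload might look like this (field names and values here are simplified and illustrative, not the exact Coralogix webhook schema):

  # Illustrative sketch of an incoming alert payload (simplified, illustrative field names)
  alert_name: sock-shop-error-spike
  alert_action: trigger
  alert_severity: critical
  application_name: sock-shop
  subsystem_name: orders
  description: Kubernetes log error rate exceeded the learned baseline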

Using the payload from the event, we can build a Trigger based on a set of conditions that will initiate a Playbook. To configure the Trigger, we can use the StackPulse interface to view the payload of the event in the Playbook Planner and easily select the conditions within the payload.

stackpulse trigger settings

Here we can see the Trigger’s definition in YAML. The nice thing is that you don’t have to type any of it out; it’s all built from clicks within the GUI. If you’ve worked with Kubernetes before, this will look similar to a custom resource definition.

For this Trigger, we’re looking first for an event coming in from Coralogix_SockShop. Next, we’re looking for three values within the event payload — the Alert Action is trigger, the Application is sock-shop, and the Alert Severity is critical. When all of these conditions are met, the Playbook will run.

[Screenshot: the Trigger definition in YAML]
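
To give a feel for the shape of that definition, here is a rough sketch of what such a Trigger could look like. The structure and field names below are simplified for illustration and are not the exact StackPulse schema; the GUI generates the real definition for you.

  # Simplified, CRD-style sketch of a Trigger (illustrative, not the exact StackPulse schema)
  kind: Trigger
  metadata:
    name: coralogix-sockshop-critical
  spec:
    integration: Coralogix_SockShop
    conditions:
      - field: alert_action
        equals: trigger
      - field: application_name
        equals: sock-shop
      - field: alert_severity
        equals: critical
    playbook: enrich-sockshop-alert    # hypothetical Playbook name, used again below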

Now that we have the Trigger defined, we can build out the Playbook itself. This Playbook will run when a payload is received from Coralogix matching the conditions in the Trigger above, and it will have a few steps:

  1. Communicate with the Kubernetes cluster to gather statistics and events
  2. Combine that information with the original alert from Coralogix and send it to Slack
  3. Ask the Slack channel whether to escalate to the on-call engineer and, if so, create an incident in PagerDuty
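
Conceptually, the resulting Playbook can be sketched as a short list of linked Steps. The step names and parameters below are illustrative placeholders rather than the actual StackPulse step library, but they show the flow described above.

  # Conceptual sketch of the Playbook flow (step and parameter names are illustrative)
  kind: Playbook
  metadata:
    name: enrich-sockshop-alert
  steps:
    - name: get-kubernetes-events      # roughly equivalent to: kubectl get events -n sock-shop
      with:
        cluster: sock-shop-gke
        namespace: sock-shop
    - name: get-pod-statistics         # roughly equivalent to: kubectl top pods -n sock-shop
      with:
        cluster: sock-shop-gke
        namespace: sock-shop
    - name: post-to-slack              # original Coralogix alert plus the gathered context
      with:
        channel: "#sock-shop-alerts"
    - name: ask-to-escalate            # interactive Yes/No prompt in the channel
      with:
        channel: "#sock-shop-alerts"
        on_yes: create-pagerduty-incident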

We can use the StackPulse Playbook Planner to build out each individual step. Using the library of prebuilt steps, you can simply drag and drop from the planner to your Playbook.

[Screenshot: the library of prebuilt Steps in the StackPulse Playbook Planner]

These first steps gather information from Kubernetes, posting that to Slack along with the original Coralogix alert. Here’s what that looks like:

[Screenshots: the enriched alert in Slack, combining the original Coralogix alert with Kubernetes events and logs]

After we provide the alert enrichment to Slack, StackPulse will ask the channel if they’d like to page the on-call engineer. If a teammate selects Yes, a PagerDuty incident is created to notify them.

[Screenshot: the on-call escalation prompt in Slack]

Here’s the complete picture of the output and interaction within Slack. 

[Screenshots: the complete alert and escalation flow in Slack]

As you can see, StackPulse automatically enriched the alert with relevant information from the cluster. This means the operator responding to the alert has all the context needed to evaluate the health of the cluster without having to perform any manual actions.

Summary

There you have it! Hopefully this post has shown how easy it is to use the StackPulse and Coralogix integration to ingest alerts and automatically react to events using context-specific information.

StackPulse offers a complete, well-integrated solution for managing reliability — including automated alert triggers, playbooks, and documentation helpers. Ready to try? Start a free trial with Coralogix or with StackPulse to see what we have to offer.

Why Your Mean Time to Repair (MTTR) Is Higher Than It Should Be

Mean time to repair (MTTR) is an essential metric that represents the average time it takes to repair and restore a component or system to functionality. It is a primary measurement of the maintainability of an organization’s systems, equipment, applications and infrastructure, as well as its efficiency in fixing that equipment when an IT incident occurs.
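
As a quick illustration of the math (with made-up numbers), MTTR is typically calculated by dividing the total time spent on repairs over a period by the number of repairs in that period:

  MTTR = total time spent on repairs / number of repairs
       = 20 hours of repair work / 8 incidents
       = 2.5 hours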

A key challenge with MTTR starts with simply figuring out that there is actually a problem. Incorrect diagnosis or inadequate repairs can also lengthen MTTR. A low MTTR indicates that a component or service of a distributed system can be repaired quickly and, consequently, that any IT issues associated with it will probably have a less significant impact on the business.

Challenges With Mean Time To Repair (MTTR)

The following section describes some of the challenges involved in managing MTTR. In essence, a high MTTR for an application, device, or system failure can result in a significant service interruption and thus a significant business impact.

Here are 7 common issues that contribute to a high (i.e. poor) MTTR:

1. Lack of Understanding Around Your Incidents

To start reducing MTTR, you need to better understand your incidents and failures. Modern enterprise software can help you automatically unite your siloed data to produce a reliable MTTR metric and valuable insights about contributing factors.

By measuring MTTR, you accept that sometimes things will go wrong; it is just a part of development. Once you’ve accepted that the development process is about continuously improving, analyzing, and collecting feedback, you will realize that tracking MTTR leads to better things, such as faster feedback mechanisms, better logging, and processes that make recovery as simple as deployment.

Having a robust incident management action plan allows an organization and its development teams to have a clear escalation policy that explains what to do if something breaks. The plan will define who to call, how to document what is happening, and how to set things in motion to solve the problem.

It will cover a chain of events that begins with the discovery of an application or infrastructure performance issue and ends with learning as much as possible about how to prevent issues from happening again, thereby covering every aspect of a solid strategy for reducing MTTR.

2. Low-Level Monitoring

A good monitoring solution will provide you with a continuous stream of real-time data about your system’s performance. It is usually presented in a single, easy-to-digest dashboard interface. The solution will alert you to any issues as they arise and should provide credible metrics.

Having proper visibility into your applications and infrastructure can make or break any incident response process.

Consider an example of a troubleshooting process without monitoring data. A server hosting a critical application goes down, and the only ‘data’ available to diagnose the problem is the absence of a power light on the front of the server. The incident response team is forced to diagnose and solve the problem with a heavy amount of guesswork. This leads to a long and costly repair process and a high MTTR.

If you have a monitoring solution with real-time data flowing from the application, server, and related infrastructure, the situation changes drastically. It gives an incident response team an accurate read on server load, memory and storage usage, response times, and other metrics. The team can formulate a theory about what is causing a problem and how to fix it using hard facts rather than guesswork.

Response teams can use this monitoring data to assess the impact of a solution as it is being applied, and to move quickly from diagnosing to resolving an incident. This is a powerful one-two combination, making monitoring perhaps the single most important way to promote an efficient and effective incident resolution process and reduce MTTR.

3. Not Having an Action Plan

When it comes to maintaining a low MTTR, there’s no substitute for a thorough action plan. For most organizations, this will require a conventional ITSM (Information Technology Service Management) approach with clearly delineated roles and responses.

Whatever the plan, make sure it clearly outlines whom to notify when an incident occurs, how to document the incident, and what steps to take as your team starts working to solve it. This will have a major impact on lowering the MTTR.

An action plan needs to follow an incident management policy or strategy. Depending on the dynamics of your organization this can include any of the following approaches.

Ad-hoc Approach

Smaller agile companies typically use this approach. When an incident occurs, the team figures out who knows that technology or system best and assigns a resource to fix it.

Fixed Approach

This is the traditional ITSM approach often used by larger, more structured organizations. Information Technology (IT) is generally in charge of incident management in this kind of environment.

Change management concerns are paramount, and response teams must follow very strict procedures and protocols. In this case, structure is not a burden; it is a benefit.

Fluid Approach

Responses are shaped to the specific nature of individual incidents, and they involve significant cross-functional collaboration and training to solve problems more efficiently. The response processes continuously evolve over time. A fluid incident response approach allows organizations to channel the right resources and call upon team members with the right skills to address situations in which it is often hard to know at first exactly what is happening.

Integrating a cloud-based log management service into an incident management strategy will enable any team to resolve incidents with more immediacy. During an incident, response teams will be able to solve a problem under time pressure without having to function differently from their day-to-day working activities.

4. Not Having an Automated Incident Management System

An automated incident management system can send multi-channel alerts via phone calls, text messages, and emails to all designated responders at once. This saves significant time that would otherwise be wasted attempting to locate and manually contact each person individually.

An automated incident management system used for monitoring also gives you visibility into your infrastructure that can help you diagnose problems more quickly and more accurately.

For example, having real-time data on the volume of a server’s incoming queries and how quickly the server is responding to them will better prepare you to troubleshoot an issue when that server fails. Data also allows you to see how specific actions to repair system components are impacting system performance, allowing you to apply an appropriate solution more quickly.

A new set of technologies has emerged in the past few years that enables incident response teams to harness Artificial Intelligence (AI) and Machine Learning (ML) capabilities, so they can prevent more incidents and respond to them faster. 

These capabilities analyze data generated by software systems in order to predict possible problems, determine the root causes, and drive automation to fix them. They complement your monitoring practices by providing an intelligent feed of incident information alongside your telemetry data. When you use that information to analyze and act on incidents, you will be better prepared for troubleshooting and incident resolution.

5. Not Creating Runbooks

As you develop incident response procedures and establish monitoring and alerting practices, be sure to document them and, if possible, ‘automate’ them using an incident management runbook automation tool.

Automating the process allows you to execute runbooks and automated tasks for faster, more repeatable, and consistent problem resolution. Once configured and enabled, a runbook can be associated with a process that tells incident response team members exactly what to do when a specific problem occurs.
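
As a simple illustration, a runbook for a known scenario might be captured as a structured document along these lines. The format below is generic and hypothetical, not tied to any particular runbook automation tool.

  # Generic, illustrative runbook for a known scenario
  runbook: api-latency-spike
  trigger: p99 latency above 2s for 5 minutes
  steps:
    - check: recent deployments in the last hour
    - check: database connection pool saturation
    - action: scale the API deployment by one replica if CPU usage is above 80%
    - escalate: page the on-call engineer if latency has not recovered within 15 minutes
  owner: platform-team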

Use runbooks to collect the response team’s knowledge about a given incident-response scenario in one place. In addition to helping you reduce MTTR, runbooks are useful for training new team members, and they are especially helpful when important members of the team leave the organization.

The idea is to use a runbook as a starting point. It saves time and energy when dealing with known issues, allowing the team to focus on the most challenging and unique aspects of a problem.

6. Not Designating Response Teams and Roles

Clearly defined roles and responsibilities are crucial for effectively managing incident response and lowering MTTR. This includes defining roles for incident management and for first- and second-line support.

When constructing an incident response team, be sure it has a designated leader who oversees incident response and ensures strong communication with stakeholders within and outside the team, and make sure all team members are clear on their responsibilities.

The incident team lead is responsible for directing both the engineering and communication responses. The latter involves engagement with customers, both to gather information and to pass along updates about the incident and the response to it. The incident team lead must make sure that the right people are aware of the issue.

Each incident may also require a technical lead who reports to the incident team lead. The technical lead typically dictates the specific technical response to a given incident. They should be an expert on the system(s) involved in an incident, allowing them to make informed decisions and to assess possible solutions so they can speed resolution and optimize the team’s MTTR performance.

Another important role that an incident may require is a communications lead. The communications lead should come from a customer service team. This person understands the likely impact on customers and shares these insights with the incident team lead. At the same time, as information flows in the opposite direction, the communications lead decides the best way to keep customers informed of the efforts to resolve the incident.

7. Not Training Team Members For Different Roles

Having focused knowledge specialists on your incident response team is invaluable. However, if you rely solely on these specialists for relatively menial issues, you risk overtaxing them, which can diminish the performance of their regular responsibilities and eventually burn them out. It also handcuffs your response team if that specialist simply is not around when an incident occurs.

It makes sense to invest in cross-training for team members, so they can assume multiple incident response roles and functions. Other members of the team should build enough expertise to address most issues, allowing your specialists to focus on the most difficult and urgent incidents. Comprehensive runbooks can be a great resource for gathering and transferring specialized technical knowledge within your team.

Cross-training and knowledge transfer also help you avoid one of the most dangerous incident response risks: a situation in which one person is the only source of knowledge for a particular system or technology. If that person goes on vacation or abruptly leaves the organization, critical systems can turn into black boxes that nobody on the team has the skills or the knowledge to fix.

You ultimately lower your MTTR by making sure all team members have a deep understanding of your system and are trained across multiple functions and incident-response roles. Your team will be positioned to respond more effectively no matter who is on call when a problem emerges.

Summary

While MTTR is not a magic number, it is a strong indicator of an organization’s ability to quickly respond to and repair potentially costly problems. Given the direct impact of system downtime on productivity, profitability, and customer confidence, an understanding of MTTR and the factors that drive it is essential for any technology-centric company.

You can mitigate the challenges identified above and ensure a low MTTR by making sure all team members have a deep understanding of your systems and are trained across multiple functions and incident-response roles. Your team will then be positioned to respond more effectively no matter who is on call when a problem emerges.