Headed to Las Vegas for AWS re:Invent? Come visit us at booth #118!

Are your customers catching production problems 🔥 before you do?

  • Peter Hammond
  • November 1, 2020

Availability and quality are the biggest differentiators when people opt for a service or product today. You should be aware of the impact of your customers alerting you to your own problems, as well as how to stop this from becoming the norm. To make sure you don’t become an organization known for its bugs, understanding the organizational changes required to deliver a stable service is key. If, as Capers Jones tells us, only as many as 85% of bugs are caught pre-release, it’s important to differentiate yourself with the service you provide.

The Problem

It’s simple to understand why you don’t want unknown bugs to go out to your customers in a release. To truly understand its impact, you need to define the impact of committing problematic code to release, before we look at its solutions.

Problem 1: Your customers’ perception

No one wants to buy a faulty product. You open yourself up to reputational risks – poor reviews, client churn, lack of credibility – when you get a name for having buggy releases. This has three very tangible costs to your business. First, your customers will cease to use your product or service. Second, any new customers will become aware of your pitfalls sooner or later. Lastly, it can have a negative impact on staff morale and direction, and you’ll run the risk of losing your key people.

Problem 2: The road to recovery

Once a customer makes you aware of a bug, you don’t have a choice but to fix it (or you’ll be faced with the problem above). The cost of doing this post-production is only enhanced with the time it takes to detect the problem, or MTTD (mean time to detect). As part of the 2019 State of Devops Report, surveyed “Elite” performing businesses took on average one hour or under to deliver a service restoration or fix a bug, against up to one month for “Low” performing businesses in the same survey. The problem compounds with time: the longer it takes to detect the problem, the more time it takes for your developers to isolate, troubleshoot, fix and then patch. Of all surveyed in the 2019 State of Devops Report, the top performers were at least twice as likely to exceed their organizational SLAs for feature fixes.

Problem 3: Releasing known errors

Releases often go out to customers with “known errors” in them. These are errors that have been assessed to have little impact, or only occur under highly specific conditions, and therefore are unlikely to affect the general release. However, this is just the coding and troubleshooting you’ll have to do later down the line, because you wanted to make a release on time. This notion of technical debt isn’t something new, but with many tech companies doing many releases per day, the compounded work that goes into managing known errors is significant. 

The Solution

Organizations can easily deliver more stable releases to their customers. Analysis indicates that there are a number of things that can greatly enhance your own stability, and keep your overheads down.

Solution 1: What your team should be doing

Revisiting the 2019 State of Devops Report, we can see the growing importance of delivering fast fixes (reduced MTTD) is dependent on two important factors within your team.

Test automation is being viewed as a “golden record” when it comes to validating code for release. It positively impacts continuous integration and continuous deployment, and where deployment is automated, far greater efficiencies are achieved in the fast catching and remediation of bugs, before they hit the customers.

“With automated testing, developers gain confidence that a failure in a test suite denotes an actual failure just as much as a test suite passing successfully means it can be successfully deployed.”

Solution 2: Where you can look for help

The State of DevOps report also tells us that we can’t expect our developers to monitor their own code as well as use the outputs of infrastructure and code monitoring to make decisions. 

This is where Coralogix comes in. Coralogix’s advanced unified UI allows the pooling of log data from applications, infrastructure and networks in one simple view. Not only does this allow your developers to better understand the impact of releases on your system, as well as helping to spot bugs early on. Both of these are critical in reducing your RTTP, which leads to direct savings for your organization.

Coralogix also provides advanced solutions to flag “known errors”, so that if they do go out for release, they aren’t just consigned to a future fix pile. By stopping known errors from slipping through the cracks, you are actively minimizing your technical debt whilst increasing your dev team’s efficiency.

Lastly, Loggregation uses machine learning to benchmark your organization’s code’s performance, building an intelligent baseline that identifies errors and anomalies faster than anyone – even the most eagle-eyed of customers.

Stateful streaming analytics for observability data