The widespread adoption of Agile methodologies in recent years has allowed organizations to significantly increase their ability to push out more high quality software.
Previous development practices revolved heavily around centralized applications and infrequent updates that were shipped maybe once a quarter or even once a year. The new fast-paced CI/CD pipeline and lightweight microservices architecture enable us to introduce new updates at a rate and scale that would have seemed unimaginable just a few years ago.
But with the benefits of new technologies comes a new set of challenges that need to be addressed as well.
While the tools for creating and deploying software have improved significantly, we are still in the process of establishing a more efficient approach to troubleshooting.
This post takes a look at the tools that are currently available for helping teams keep their software functioning as intended, the challenges that they face with advances in the field, and a new way forward that will speed up the troubleshooting process for everyone involved.
Challenges Facing Troubleshooters
On the face of it, the ability to make more software is a positive mark in the ledger. But with new abilities come new issues to solve.
Let’s take a look at three of the most common challenges facing organizations right now.
The Move from the Monolith to Microservices
Our first challenge is that there are now more moving parts where issues may arise that can complicate efforts to fix them quickly and efficiently.
Since making the move to microservices and Kubernetes, we are now dealing with a much more distributed set of environments. We have replaced the “monolith” of the single core app with many smaller, more agile but dispersed apps.
Their high level of distribution makes it more difficult to quickly track down where a change occurred and understand what’s impacting what. As the elements of our apps become more siloed, we lose overall visibility over the system structure.
Teams are Becoming More Siloed
The next challenge is that with more widely distributed teams taking part in the software development process, there are much fewer people who are significantly familiar with the environment when it comes to addressing problems impacting the products.
Each team is proficient in their own domain and segment of the product, but are essentially siloed off from the other groups. This leads to a lack of knowledge sharing across teams that can be highly detrimental when it comes time for someone to get called in to fix an issue stemming from another team’s part of the product.
More Changes, More Often
Then finally for our current review, is the fact that changes are happening far more often than before.
Patches no longer have to wait until it’s Tuesday to come out. New features on the app or adjustments to infrastructure like load balancing can impact the product and require a fix. With all this going on, it is easy for others to not be aware of these changes.
The lack of communication caused by siloes combined with the increase in the number of changes can create confusion when issues arise. Sort of like finding a specific needle in a pile of needles. It’s enough to make you miss the haystack metaphor.
In Search of Context
In examining these challenges, we can quickly understand that they all center around the reality that those tasked with troubleshooting issues lack the needed context for fixing them.
Developers, who are increasingly being called upon to troubleshoot, lack the knowledge of what has changed, by who, what it impacts, or even where to start looking in a potentially unfamiliar environment.
Having made the move to the cloud, developers have at their disposal a wide range of monitoring and observability tools covering needs related to tracing, logs, databases, alerting, security, and topology, providing them with greater visibility into their software throughout the Software Development Lifecycle.
When they get the call that something needs their attention, they often have to begin investigating essentially from scratch. This means jumping into their many tools and dashboards, pouring over logs to try and figure out the source of the problem. While these monitoring and observability tools can provide valuable insights, it can still be difficult to identify which changes impacted which other components.
Some organizations attempt to use communication tools like Slack to track changes. While we love the scroll as much as anyone else, it is far from a comprehensive solution and still lacks the connection between source and impact. Chances are that they will need to call in for additional help in tracking down the root of the problem from someone else on their team or a different team altogether.
In both cases, the person on-call still needs to spend significant and valuable time on connecting the dots between changes and issues. Time that might be better spent on actually fixing the problem and getting the product back online.
Filling in the Missing Part of the DevOps Toolchain
The tooling available to help identify issues are getting much better at providing visibility, including to provide on-call responders with the context that will help them get to the source of the problem faster.
Moving forward, we want to see developers gain a better, holistic understanding of their Kubernetes environments. So even if their microservices are highly distributed, their visibility over them should be more unified.
Fixing problems faster means getting the relevant information into the hands of whoever is called up to address the issue. It should not matter if the incident is in a product that they themselves built or if they are opening it up for the first time.
Reducing the MTTR relies on providing them with the necessary context from the tools that they are already using to monitor their environment. Hopefully by better utilizing our existing resources, we can cut down on the time and number of people required to fix issues when they pop up — allowing us to get back to actually building better products.