There’s an insidious disease increasingly afflicting DevOps teams. It begins innocuously. A team member suggests adding a new logging tool. The senior dev decides to upgrade the tooling. Then it bites.
You’re spending more time navigating between windows than writing code. You’re scared to make an upgrade because it might break the toolchain.
The disease is tool sprawl. It happens when DevOps teams use so many tools that the time and effort spent navigating the toolchain is greater than the savings made by new tools.
Tool sprawl is not something to be taken lightly. A 2016 DevOps survey found that 53% of large organizations use more than 20 tools. In addition, 53% of teams surveyed don’t standardize their tooling.
Tool sprawl creates what Joep Piscaer calls a “tool tax”, along with increased technical debt and reduced efficiency, all of which can bog down your business and demoralize your team.
With tool sprawl, a DevOps team is more likely to have impaired observability, because data from different tools won’t necessarily be correlated. This reduces the team’s ability to detect anomalous system activity and locate the source of a fault, which in turn increases both Mean Time To Detection (MTTD) and Mean Time To Repair (MTTR).
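To make the two metrics concrete, here is a minimal Python sketch of how a team might compute them from its own incident records. The `Incident` class, its field names, and the sample timestamps are all illustrative assumptions, not data from any survey or tool. MTTR is measured here from detection to resolution; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when the team noticed it
    resolved: datetime   # when service was restored


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from fault start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from detection to resolution (an assumed convention)."""
    return mean_delta([i.resolved - i.detected for i in incidents])


# Made-up incidents, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10), datetime(2024, 1, 1, 10, 10)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30), datetime(2024, 1, 2, 16, 30)),
]
print("MTTD:", mttd(incidents))  # prints MTTD: 0:20:00
print("MTTR:", mttr(incidents))  # prints MTTR: 1:30:00
```

Tracking these two numbers before and after a tooling change is one simple way to check whether the change helped or hurt.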
Also, an overabundance of tools can result in increased toil for your DevOps team. Google’s SRE org defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Tool sprawl creates toil by forcing DevOps engineers to switch continually between many different tools, which may or may not be properly integrated. This cuts into the time available for useful, productive work such as coding.
Finally, tool sprawl reduces your system’s scalability. This is a real blocker to businesses that want to go to the next level. They can’t scale their application and may have trouble expanding their user base and developing innovative features.
A good DevOps pipeline is dependent on a well-integrated toolchain. When tool sprawl is unchecked, it can result in a poorly integrated set of tools. DevOps teams are forced to get round this by implementing ad-hoc solutions which decrease the resilience and reliability of the toolchain.
This reduces the rate of innovation and modernization in your DevOps architecture. Engineers are too scared to make potentially beneficial upgrades because they don’t want to risk breaking the existing infrastructure.
Another problem created by tool sprawl is that of data silos. If different DevOps engineers use their own dashboards and monitoring tools, it can be difficult (if not impossible) to pool data. This reduces the overall visibility of the system and consequently reduces the level of insights available to the team.
Data silos also cause a lack of collaboration. If every ops team is looking at a different data set and using their own monitoring tool, they can’t meaningfully communicate.
Engineers add tools to increase productivity, not to reduce it. Yet having too many actually has the opposite effect.
Tool sprawl can seriously disrupt the creative processes of engineers. Being forced to pick their way through a thicket of unstandardized and badly integrated tooling breaks their flow, reducing their ability to problem solve. This makes them less effective as engineers and reduces the team’s operational excellence.
Another impairment to productivity is the toxic culture created by a lack of collaboration and communication between different parts of the team. In the previous section, we saw how data silos resulted in a lack of team collaboration.
At worst, this can lead to a culture of blame. Each part of the team, cognizant only of the information on its part of the system, tries to rationalize that information and treat its view as correct.
This leads to them neglecting other parts of the picture and blaming non-aligned team members for mistakes.
In Star Wars, all living things depended on the Force. Yet the Force was double-edged; it had a light side and a dark side. Similarly, a DevOps pipeline depends on an up-to-date toolchain that can keep pace with the demands of the business.
Yet in trying to keep their toolchain up to date, DevOps teams constantly run the risk of tool sprawl. Tooling is often upgraded organically in response to the immediate needs of the team. As Joep warns, though, poorly planned tooling upgrades can create more problems than they solve, adding complexity and operational burdens.
One way that teams can prevent tool sprawl is by thinking much more carefully about the pros and cons of adding a new tool. As Joep explains, tools have functional and non-functional aspects. Many teams become sold on a new tool based on the functional benefits it brings. These could include allowing the team to visualize data or increasing some aspect of observability.
What they often don’t really think about are the tool’s non-functional aspects. These can include performance, ease of upgrading, and security features.
If a tool were a journey, its function would be the destination and its non-functional aspects would be the route. Many teams are like complacent passengers, saying “wake me when we get there” while taking no heed of potential hazards along the way.
Instead, they need to be like ship captains, navigating the complexities of their new tool with foresight and avoiding potential problems before they sink the ship.
Before incorporating a tool into their toolchain, teams need to think about operational issues. These can be anything from the number of people needed to maintain the tool to the repo new versions are available in.
Teams also need to consider agility. Is the tool modular and extensible? If so, it will be relatively easy to enhance functionality downstream. If not, the team may be stuck with obsolescent tooling that they can’t get rid of.
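One lightweight way to make sure the non-functional aspects get attention is to keep a checklist and hold off on adoption until every item has a written assessment. The Python sketch below is a hypothetical illustration: the aspect names are taken from the discussion above, while the `missing_reviews` helper and the sample assessment are assumptions made up for this example.

```python
# Non-functional aspects named in the text; extend the list to suit your team.
NON_FUNCTIONAL_ASPECTS = [
    "performance",
    "ease_of_upgrading",
    "security",
    "maintainers_needed",
    "modular_and_extensible",
]


def missing_reviews(assessment: dict[str, str]) -> list[str]:
    """Return the non-functional aspects that have no written assessment yet."""
    return [a for a in NON_FUNCTIONAL_ASPECTS if not assessment.get(a, "").strip()]


# Hypothetical, partly completed assessment for a hypothetical tool.
assessment = {
    "performance": "acceptable overhead in our staging test",
    "security": "supports SSO; no audit log",
}

gaps = missing_reviews(assessment)
if gaps:
    print("Not ready to adopt; still to review:", ", ".join(gaps))
else:
    print("All non-functional aspects reviewed.")
```

The point of the checklist is not the code itself but the habit: a tool is only considered once someone has written down an answer for each non-functional aspect.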
Another tool sprawl mitigation strategy is to opt for “all-in-one” tools that let teams achieve more outcomes with less tooling. A recent study advocates for using a platform vendor that possesses multiple monitoring, analytics and troubleshooting capabilities.
Coralogix is a good example of this kind of platform. It’s an observability and monitoring solution that uses a stateful streaming pipeline and machine learning to analyze and extract insights from multiple data sources. Because the platform leverages artificial intelligence to extract patterns from data, it has the ability to combat data silos and the dangers they bring.
Trusting log analytics to machine learning makes it possible to avoid human limitations and ingest data from all over the system. This data can be pooled and algorithmically analyzed to extract insights that human engineers might not have reached.
While we don’t advise paring down your toolchain to just one tool, a platform like Coralogix goes a long way towards mitigating tool sprawl before it becomes a problem.
For those who are currently wrestling with out-of-control tool sprawl, there is a way out! The tool consolidation roadmap shows teams how to go from a fragmented or ad hoc toolchain to one that is modern and carries few unnecessary tools. The roadmap consists of three phases.
The first phase is planning. Before a team starts the work of tool consolidation, they need to plan what they’re going to do. The team first needs to ascertain the architecture of the current toolchain, as well as its costs and benefits to tool users.
Then they must collectively decide what they want to achieve from the roadmap. Each component of the team will have its own desirable outcome and the resulting toolchain needs to cater to everybody’s interests.
Finally, the team should draw up a timeframe outlining the tool consolidation steps and how long they will take to implement.
The second phase is preparation. This requires the team to draw up a comprehensive list of use cases and map them onto a list of potential solutions. The aim of this phase is to really hash out what high-level requirements the final solution needs to satisfy and flesh these requirements out with lots of use cases.
For example, the DevOps team might want higher visibility into database instance performance. They may then construct use cases around this: “as an engineer, I want to see the CPU utilization of an instance”.
The team can then research and inventory possible solutions that can enable those use cases.
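The mapping of use cases onto candidate solutions can be kept as simple as a coverage table. The Python sketch below assumes a set of use cases and a set of candidates, each listing the use cases it supports. All of the tool names and use cases here are hypothetical placeholders, and coverage is only one input into the decision.

```python
# Hypothetical use cases and candidate tools, for illustration only.
use_cases = {
    "see CPU utilization of a database instance",
    "alert on slow queries",
    "search application logs",
}

candidates = {
    "tool_a": {"see CPU utilization of a database instance", "alert on slow queries"},
    "tool_b": {"search application logs"},
    "platform_c": set(use_cases),
}


def coverage(supported: set[str], required: set[str]) -> float:
    """Fraction of required use cases a candidate supports."""
    return len(supported & required) / len(required)


# Rank candidates by how many of the required use cases they cover.
ranked = sorted(candidates, key=lambda name: coverage(candidates[name], use_cases), reverse=True)
for name in ranked:
    missing = sorted(use_cases - candidates[name])
    print(f"{name}: {coverage(candidates[name], use_cases):.0%} covered, missing: {missing}")
```

A table like this also makes gaps visible: any use case that no candidate covers is a requirement the team needs to revisit before moving on.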
In the third and final phase, the team puts its plan into action. This phase involves several steps. Having satisfied itself that the chosen solution best meets its objectives, the team needs to deploy it.
This requires testing to make sure it works as intended and deploying to production. The team needs to use the solution to implement any alerting and event management strategies they outlined in the plan.
As an example, Coralogix has dynamic alerting. This helps teams by alerting them to anomalies without requiring them to set a threshold explicitly.
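To give a feel for how alerting without a fixed threshold can work in general, here is a minimal Python sketch that flags a value when it strays far from the recent history of the same metric. This is not Coralogix’s algorithm, which is not described here; it is a simple rolling mean and standard deviation check. The function name, the `window` and `k` parameters, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def anomalies(values: list[float], window: int = 5, k: float = 3.0) -> list[int]:
    """Return indexes whose value deviates more than k standard deviations
    from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat history (sigma == 0), where the deviation test is undefined.
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


# Made-up error counts per minute; the spike at index 7 is the anomaly.
errors_per_minute = [4, 5, 6, 5, 4, 5, 6, 40, 5, 4]
print(anomalies(errors_per_minute))  # prints [7]
```

Nobody had to decide in advance that “more than 20 errors per minute” is bad: the alert level is derived from the metric’s own recent behavior, so it adapts as normal traffic changes.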
Last but not least, the team needs to document its experience to inform future upgrades and train all team members on how to get the best out of the new solution. (Coralogix has a tutorials page to help with this.)
A DevOps toolchain is a double-edged sword. Used well, upgraded tooling can reduce toil and enhance the capacity of DevOps engineers to solve problems. However, ad hoc upgrades that don’t take the non-functional aspects of new tools into account lead to tool sprawl.
Tool sprawl reverses all the benefits of a good toolchain. Toil increases, and DevOps teams spend so much time navigating the intricacies of their toolchain that they cannot do their jobs properly.
Luckily, tool sprawl is solvable. Systems like Coralogix go a long way towards fixing a fragmented toolchain, by consolidating observability and monitoring into one platform. We’ve seen how teams in the thick of tool sprawl can extricate themselves through the tool consolidation roadmap.
Tooling, like candy, can be good in moderation but bad in excess.