How to Mitigate DevOps Tool Sprawl in Enterprise Organizations

There’s an insidious disease increasingly afflicting DevOps teams. It begins innocuously. A team member suggests adding a new logging tool. The senior dev decides to upgrade the tooling. Then it bites. 

You’re spending more time navigating between windows than writing code. You’re scared to make an upgrade because it might break the toolchain.

The disease is tool sprawl.  It happens when DevOps teams use so many tools that the time and effort spent navigating the toolchain is greater than the savings made by new tools.  

Tool Sprawl: What’s the big problem?

Tool sprawl is not something to be taken lightly.  A 2016 DevOps survey found that 53% of large organizations use more than 20 tools.  In addition, 53% of teams surveyed don’t standardize their tooling.

It creates what Joep Piscaer calls a “tool tax”: increased technical debt and reduced efficiency, which can bog down your business and demoralize your team.

Reduced speed of innovation

With tool sprawl, a DevOps team is more likely to have impaired observability, because data from different tools won’t necessarily be correlated. This reduces the team’s ability to detect anomalous system activity and locate the source of a fault, which in turn increases both Mean Time To Detection (MTTD) and Mean Time To Repair (MTTR).
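To make the two metrics concrete, here is a minimal sketch of how they might be computed from incident records. The incident timestamps are invented for illustration, and the sketch assumes MTTD is measured from fault start to detection and MTTR from detection to repair; teams define these intervals in different ways.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (fault started, fault detected, fault repaired).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 20), datetime(2024, 1, 5, 10, 20)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 40), datetime(2024, 1, 9, 16, 40)),
]


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


# Assumption: MTTD runs from fault start to detection,
# and MTTR runs from detection to repair.
mttd = mean_minutes([detected - started for started, detected, _ in incidents])
mttr = mean_minutes([repaired - detected for _, detected, repaired in incidents])

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```

If uncorrelated data adds even ten minutes to each detection, that delay shows up directly in the first number.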

Also, an overabundance of tools can result in increased toil for your DevOps team. Google’s SRE org defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Tool sprawl creates toil by forcing DevOps engineers to switch continually between many different tools, which may or may not be properly integrated. This cuts into the time available for useful, productive work such as coding.

Finally, tool sprawl reduces your system’s scalability. This is a real blocker to businesses that want to go to the next level. They can’t scale their application and may have trouble expanding their user base and developing innovative features.

Lack of integration and data silos

A good DevOps pipeline is dependent on a well-integrated toolchain.  When tool sprawl is unchecked, it can result in a poorly integrated set of tools.  DevOps teams are forced to get round this by implementing ad-hoc solutions which decrease the resilience and reliability of the toolchain.

This reduces the rate of innovation and modernization in your DevOps architecture. Engineers are too scared to make potentially beneficial upgrades because they don’t want to risk breaking the existing infrastructure.

Another problem created by tool sprawl is that of data silos. If different DevOps engineers use their own dashboards and monitoring tools, it can be difficult (if not impossible) to pool data. This reduces the overall visibility of the system and consequently reduces the level of insights available to the team.

Data silos also cause a lack of collaboration.  If every ops team is looking at a different data set and using their own monitoring tool, they can’t meaningfully communicate.

Reduced team productivity

Engineers add tools to increase productivity, not to reduce it. Yet having too many actually has the opposite effect. 

Tool sprawl can seriously disrupt the creative processes of engineers. Being forced to pick their way through a thicket of unstandardized and badly integrated tooling breaks their flow, reducing their ability to problem solve. This makes them less effective as engineers and reduces the team’s operational excellence.

Another impairment to productivity is the toxic culture created by a lack of collaboration and communication between different parts of the team. In the previous section, we saw how data silos resulted in a lack of team collaboration.

At worst, this can lead to a culture of blame. Each part of the team, cognizant only of the information on its part of the system, tries to rationalize that information and treat its view as correct.

This leads to them neglecting other parts of the picture and blaming non-aligned team members for mistakes.

The “Dark Side” of the toolchain

In Star Wars, all living things depended on the Force. Yet the Force was double-edged; it had a light side and a dark side. Similarly, a DevOps pipeline depends on an up-to-date toolchain that can keep pace with the demands of the business.

Yet in trying to keep their toolchain beefed-up, DevOps teams constantly run the risk of tool sprawl. Tooling is often upgraded organically in response to the immediate needs of the team. As Joep warns though, poorly upgrading tooling can create more problems than it solves. It adds complexity and operational burdens.

Solving the problem of tool sprawl

Consider your options carefully

One way that teams can prevent tool sprawl is by thinking much more carefully about the pros and cons of adding a new tool.  As Joep explains, tools have functional and non-functional aspects. Many teams become sold on a new tool based on the functional benefits it brings. These could include allowing the team to visualize data or increasing some aspect of observability.

What they often don’t really think about are the tool’s non-functional aspects.  These can include performance, ease of upgrading, and security features.

If a tool were a journey, its function would be the destination and its non-functional aspects would be the route it takes. Many teams are like complacent passengers, saying “wake me when we get there” while taking no heed of potential hazards along the way.

Instead, they need to be like ship captains, navigating the complexities of their new tool with foresight and avoiding potential problems before they sink the ship.

Before incorporating a tool into their toolchain, teams need to think about operational issues. These can be anything from the number of people needed to maintain the tool to the repo new versions are available in.

Teams also need to consider agility. Is the tool modular and extensible? If so, it will be relatively easy to enhance functionality downstream. If not, the team may be stuck with obsolescent tooling that they can’t get rid of.

Toolchain detox

Another tool sprawl mitigation strategy is to opt for “all-in-one” tools that let teams achieve more outcomes with less tooling. A recent study advocates for using a platform vendor that possesses multiple monitoring, analytics and troubleshooting capabilities.

Coralogix is a good example of this kind of platform.  It’s an observability and monitoring solution that uses a stateful streaming pipeline and machine learning to analyze and extract insights from multiple data sources.  Because the platform leverages artificial intelligence to extract patterns from data, it has the ability to combat data silos and the dangers they bring.

Trusting log analytics to machine learning makes it possible to avoid human limitations and ingest data from all over the system.  This data can be pooled and algorithmically analysed to extract insights that human engineers might not have reached.

In addition, Coralogix can be integrated with a range of external platforms and solutions.  These range from cloud providers like AWS and GCP to CI/CD solutions such as Jenkins and CircleCI.

While we don’t advise paring down your toolchain to just one tool, a platform like Coralogix goes a long way toward optimizing IT costs and mitigating tool sprawl before it becomes a problem.

The tool consolidation roadmap

For those who are currently wrestling with out-of-control tool sprawl, there is a way out! The tool consolidation roadmap shows teams how to go from a fragmented or ad hoc toolchain to one that is modern and uses few unnecessary tools. The roadmap consists of three phases.

Phase 1 – Plan

Before a team starts the work of tool consolidation, they need to plan what they’re going to do. The team needs first to ascertain the architecture of the current toolchain as well as the costs and benefits to tool users.

Then they must collectively decide what they want to achieve from the roadmap. Each component of the team will have its own desirable outcome and the resulting toolchain needs to cater to everybody’s interests.

Finally, the team should draw up a timeframe outlining the tool consolidation steps and how long they will take to implement.

Phase 2 – Prepare

The second phase is preparation. This requires the team to draw up a comprehensive list of use cases and map them onto a list of potential solutions. The aim of this phase is to really hash out what high-level requirements the final solution needs to satisfy and flesh these requirements out with lots of use cases.

For example, the DevOps team might want higher visibility into database instance performance.  They may then construct use cases around this: “as an engineer, I want to see the CPU utilization of an instance”.

The team can then research and inventory possible solutions that can enable those use cases.
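The mapping from use cases to candidate solutions can be sketched as a simple coverage check. Everything in this sketch is hypothetical: the use cases, the capability labels, and the tool names are placeholders, and a real evaluation would also weigh the non-functional aspects discussed earlier.

```python
# Hypothetical use cases gathered in the Prepare phase,
# each mapped to the capability it requires.
use_cases = {
    "see CPU utilization of a database instance": "metrics",
    "search application logs by request ID": "log search",
    "get paged when error rates spike": "alerting",
}

# Hypothetical candidate solutions and the capabilities each one offers.
candidates = {
    "tool_a": {"metrics", "alerting"},
    "tool_b": {"log search"},
    "platform_c": {"metrics", "log search", "alerting"},
}


def coverage(capabilities):
    """Return the use cases a candidate can satisfy."""
    return [uc for uc, needed in use_cases.items() if needed in capabilities]


# Rank candidates by how many use cases they cover, most first.
ranked = sorted(candidates, key=lambda name: len(coverage(candidates[name])), reverse=True)
for name in ranked:
    covered = coverage(candidates[name])
    print(f"{name}: {len(covered)}/{len(use_cases)} use cases")
```

A table like this makes it easier to see when one platform could replace several single-purpose tools.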

Phase 3 – Execute

Finally, the team can put its plan into action. This step involves several different components. Having satisfied themselves that the chosen solution best enables their objectives, the team needs to deploy the chosen solution.

This requires testing to make sure it works as intended and deploying to production.  The team needs to use the solution to implement any alerting and event management strategies they outlined in the plan.

As an example, Coralogix has dynamic alerting. This enables teams by alerting them to anomalies without requiring them to set a threshold explicitly.
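To illustrate the general idea of alerting without a fixed threshold, here is a minimal sketch that compares each data point with a baseline learned from recent history. This is not Coralogix’s algorithm, which the article does not describe; the function name, the window size, and the sensitivity multiplier are all assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, sensitivity=3.0):
    """Flag points that deviate strongly from the preceding window.

    No absolute threshold is configured; the baseline is learned from
    the most recent `window` observations.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline, spread = mean(history), stdev(history)
        # Guard against a perfectly flat window, where stdev is zero.
        if spread == 0:
            if values[i] != baseline:
                anomalies.append(i)
        elif abs(values[i] - baseline) > sensitivity * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical errors-per-minute series with one obvious spike at index 7.
errors_per_minute = [4, 5, 6, 5, 4, 5, 6, 40, 5, 4]
print(find_anomalies(errors_per_minute))
```

The sensitivity multiplier is relative to recent behavior, so the same setting can work for a quiet service and a busy one without anyone choosing an absolute limit for each.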

Last but not least, the team needs to document its experience to inform future upgrades, and train all team members on how to get the best out of the new solution. (Coralogix has a tutorials page to help with this.)

Wrapping Up

A DevOps toolchain is a double-edged sword. Used well, upgraded tooling can reduce toil and enhance the capacity of DevOps engineers to solve problems. However, ad hoc upgrades that don’t take the non-functional aspects of new tools into account lead to tool sprawl.

Tool sprawl reverses all the benefits of a good toolchain. Toil is increased and DevOps teams spend so much time navigating the intricacies of their toolchain that they literally cannot do their job properly.

Luckily, tool sprawl is solvable. Systems like Coralogix go a long way towards fixing a fragmented toolchain, by consolidating observability and monitoring into one platform.  We’ve seen how teams in the thick of tool sprawl can extricate themselves through the tool consolidation roadmap.

Tooling, like candy, can be good in moderation but bad in excess. 

Why Are SaaS Observability Tools So Far Behind?

Salesforce was the first of many SaaS-based companies to succeed and see massive growth. Since Salesforce started out in 1999, Software-as-a-Service (SaaS) tools have taken the IT sector and, well, the world by storm. For one, they mitigate bloatware by moving applications from the client’s computer to the cloud. Plus, the sheer ease of use brought by cloud-based, plug-and-play software solutions has transformed all sorts of sectors.

Given the SaaS paradigm’s success in everything from analytics to software development itself, it’s natural to ask whether its Midas touch could improve the current state of data observability tools.

Heroku and the Rise of SaaS

Let’s start with a system that we’ve previously talked about, Heroku. Heroku is one of the most popular platforms for deploying cloud-based apps. 

Using a Platform-as-a-Service approach, Heroku lets developers deploy apps in managed containers with maximum flexibility. Instead of apps being hosted on traditional servers, Heroku provides something called dynos.

Dynos are like cradles for applications. They utilize the power of containerization to provide a flexible architecture that takes the hassle of on-premises configuration away from the developer. (We’ve previously talked about the merits of SaaS vs Hosted solutions.)

Heroku’s dynos make scalability effortless. If developers want to scale their app horizontally, they can simply add more dynos. Vertical scaling can be achieved by upgrading dyno types, a process Heroku facilitates through its intuitive dashboard and CLI.

Heroku can even take scaling issues off the developer’s hands completely with its auto-scaling feature. This means that software companies can focus on their mission, providing high-quality software at scale without worrying about the ‘how’ of scalability or configuration.

Systems like Heroku give us a tantalizing glimpse of the power and convenience a SaaS approach can bring to DevOps. The hassles of resource management, configuration, and deployment are abstracted away, allowing developers to focus solely on coding.

SaaS is making steady inroads into DevOps. For example, Coralogix (which integrates with Heroku and is also available as a Heroku add-on) operates with a SaaS approach, allowing users to analyze logs without worrying about configuration details.

Not So SaaS-y Tooling

It might seem that nothing is stopping SaaS from being applied to all aspects of observability tooling. After all, Coralogix already offers a SaaS log analytics solution, so why not just make all logging as SaaS-y as possible?

Log collection is the fly in this particular ointment.  Logging data is often stored in a variety of formats, reflecting the fact that logs may originate from very different systems.  For example, a Linux server will probably store logs as text data while Kubernetes can use a structured logging format or store the logs as JSON.
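The format problem is easier to see with an example. The sketch below normalizes one plain-text line and one JSON line into the same shape. The text layout, the JSON field names (ts, pod, msg), and the sample lines are assumptions made for illustration; real systems vary, which is exactly the difficulty.

```python
import json
import re

# Assumed plain-text layout: "<timestamp> <host> <process>: <message>".
TEXT_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) (?P<process>[^:]+): (?P<message>.*)$"
)


def normalize(line):
    """Parse a JSON or plain-text log line into a dict with common keys."""
    try:
        record = json.loads(line)
        if isinstance(record, dict):
            return {
                "timestamp": record.get("ts"),
                "source": record.get("pod"),
                "message": record.get("msg"),
            }
    except json.JSONDecodeError:
        pass
    match = TEXT_PATTERN.match(line)
    if match:
        return {
            "timestamp": match["timestamp"],
            "source": match["host"],
            "message": match["message"],
        }
    # Keep unparseable lines rather than dropping them.
    return {"timestamp": None, "source": None, "message": line}


text_line = "2024-01-05T09:00:00Z web-01 sshd: Accepted publickey for deploy"
json_line = '{"ts": "2024-01-05T09:00:01Z", "pod": "api-7d9f", "msg": "request served"}'
print(normalize(text_line))
print(normalize(json_line))
```

Every additional source format means another branch like these, which is part of why teams are tempted to keep log collection close to the systems that produce the logs.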

Because every system has its own logging format, organizations tend to collect their logs on-premises, which is a big roadblock to the smooth uptake of SaaS. In reality, the variety of systems, in addition to the option to build your own system, is symptomatic of a slower move toward observability in the enterprise. However, this range of options doesn’t mean that log analysis is limited to on-prem systems.

What’s important to note is that organizations are really missing out on SaaS observability tooling. Why is this the case, when SaaS tools and platforms are so widespread? The perceived complexity of varying formats, combined with potential cloud-centric security concerns, might have a role to play.

Moving to Cloud-Based Log Storage with S3 Bucket

To pave the way to Software as a Service log collection, we need to stop storing logs on-prem and move them to the cloud.  Cloud computing is the keystone of SaaS. Applications can be hosted on centralized computing resources and piped to thousands of clients.

AWS lets you store logs in the cloud with S3, which is short for Simple Storage Service. As the name implies, S3 is a service provided by AWS that is specifically designed to let you store and access data quickly and easily. Data is stored as objects inside containers called buckets.

Pushing Logs to S3 with Logstash and FluentD

For those who aren’t already using AWS, output plugins allow users to push existing log records to S3. Two of the most popular logging solutions are FluentD and Logstash, so we’ll look at those here. (Coralogix integrates with both FluentD and Logstash.)

FluentD Plugin

FluentD contains a plugin called out_s3. This enables users to write pre-existing log records to the S3 Bucket.  Out_s3 has several cool features.

For one, it splits files according to the time at which the log events were created. This means the S3 file structure accurately reflects the original time ordering of log records and not just when they were uploaded to the bucket.
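The idea of naming files after event time rather than upload time can be pictured with a short sketch. The key layout, the prefix, and the function name here are hypothetical and are not taken from the plugin’s implementation; the point is only that the hour in the key comes from the event itself.

```python
from datetime import datetime, timezone


def object_key(prefix, event_time, index=0):
    """Build an object key from the hour in which the log event occurred."""
    time_slice = event_time.astimezone(timezone.utc).strftime("%Y%m%d%H")
    return f"{prefix}{time_slice}_{index}.gz"


# The event happened at 09:42 UTC, even if it is uploaded hours later.
event_time = datetime(2024, 1, 5, 9, 42, tzinfo=timezone.utc)
print(object_key("logs/", event_time))
```

Because the key depends only on the event timestamp, re-uploading old records later would still place them alongside other records from the same hour.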

Another thing out_s3 allows users to do is incorporate metadata into the log records.  This means each log record contains the name of its S3 Bucket along with the object key. Downstream systems like Coralogix can then use this info to pinpoint where each log record came from.

At this point, I should mention something that could catch new users out. FluentD’s plugin automatically creates files on an hourly basis. This can mean that when you first upload log records, a new file isn’t created immediately, as it would be with most systems.

While you can’t rely on new files being created immediately, you can make them appear more or less frequently by configuring the plugin’s time key (timekey) setting.

Logstash Plugin

Logstash’s output plugin is open source and comes under an Apache 2.0 license, meaning you can use, modify, and distribute it freely as long as you follow the license’s attribution and notice conditions. It uploads batches of Logstash events in the form of temporary files, which by default are stored in the operating system’s temporary directory.

If you don’t like the default save location, Logstash gives you a temporary_directory option that lets you stipulate a preferred save location.

Securing Your Logs

Logs contain sensitive information. A crucial task for those taking the S3 log storage route is making sure their S3 Buckets are secure. Amazon S3 default encryption lets users ensure that new log file objects are encrypted by default.
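As a rough sketch, default encryption can be switched on programmatically. The example below assumes the boto3 library and uses its put_bucket_encryption call with SSE-S3 (AES-256); the function name and the bucket name are placeholders. The client is passed in as an argument so the function itself has no dependency on AWS credentials.

```python
def enable_default_encryption(s3_client, bucket_name):
    """Turn on SSE-S3 (AES-256) default encryption for a bucket.

    `s3_client` is expected to be a boto3 S3 client, for example
    boto3.client("s3"); it is passed in so the function can be
    exercised without AWS credentials.
    """
    config = {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }
    s3_client.put_bucket_encryption(
        Bucket=bucket_name,
        ServerSideEncryptionConfiguration=config,
    )
    return config


# Usage (assumes boto3 is installed and credentials are configured;
# the bucket name is a placeholder):
#   import boto3
#   enable_default_encryption(boto3.client("s3"), "my-log-bucket")
```

Default encryption only applies to objects written after it is enabled, so logs that are already in the bucket need separate treatment.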

If you’ve already got some logs in an S3 Bucket and they aren’t yet encrypted, don’t worry. S3 has a couple of tools that let you encrypt existing objects quickly and easily.

Encryption through Batch Operations

One tool is S3 Batch Operations. Batch Operations is S3’s mechanism for performing operations on billions of objects at a time. Simply provide S3 Batch Operations with a list of the log files you want to encrypt and the operation to perform, and it makes the appropriate API calls for you.

Encryption can be achieved by using the copy operation to copy unencrypted files to encrypted files in the same S3 Bucket location.

Encryption through Copy Object API

An alternative tool is the Copy Object API. This tool works by copying a single object back to itself using server-side encryption (SSE) and can be run using the AWS CLI.

Although Copy Object is a powerful tool, it’s not without risks. You’re effectively replacing your existing log files with encrypted versions so make sure all the requisite information and metadata is preserved by the encryption. 

For example, if you are copying log files larger than the multipart_threshold value, the Copy Object API won’t copy the metadata by default. In this case, you need to specify what metadata you want using the parameter --metadata.
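The article describes running this through the AWS CLI. As an alternative illustration, here is a minimal sketch of the same in-place copy using boto3’s copy_object call, which is an assumption about tooling rather than the method described above. The function name is hypothetical, and a single copy_object request only handles objects up to 5 GB, so larger log files would need a multipart copy.

```python
def encrypt_in_place(s3_client, bucket_name, key):
    """Copy an object onto itself so the new copy is stored with SSE-S3."""
    return s3_client.copy_object(
        Bucket=bucket_name,
        Key=key,
        CopySource={"Bucket": bucket_name, "Key": key},
        ServerSideEncryption="AES256",
        # Ask S3 to carry the existing user metadata over to the new copy.
        MetadataDirective="COPY",
    )


# Usage (assumes boto3 is installed and credentials are configured;
# bucket and key are placeholders):
#   import boto3
#   encrypt_in_place(boto3.client("s3"), "my-log-bucket", "logs/app.log.gz")
```

It is worth checking the metadata of a few copied objects afterwards, since the copy replaces the original log file.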

Integrating S3 Buckets with Coralogix

Hooray! Your logs are now firmly in the cloud with S3. Now, all you need to do is analyze them.  Coralogix can help you do this with the S3 to Coralogix Lambda.

This is a Lambda function that lets you send log data from your S3 Bucket to Coralogix, where the full power of machine learning can be applied to uncover insights. To use it, you need to define five parameters.

S3BucketName specifies the name of the S3 bucket storing the CloudTrail logs.

ApplicationName is a mandatory metadata field that is sent with each log and helps to classify it.

CoralogixRegion is the region in which your Coralogix account is located. CoralogixRegion can be Europe, US or India, depending on whether your Coralogix URL ends with .com, .us or .in.

PrivateKey is a parameter that can be found in your Coralogix account under Settings -> Send your logs. It is located in the upper left corner.

SubsystemName is a mandatory metadata field that is sent with each log and helps to classify it.

The S3 to Coralogix Lambda can be integrated with AWS’s automation framework through the Serverless Application Model. SAM is an AWS framework that provides resources for creating serverless applications, such as shorthand syntax for APIs and functions.
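The five parameters above can be gathered into one place before deployment. The sketch below builds them as a dictionary and formats them as space-separated Key=Value pairs, the shorthand that the SAM CLI’s --parameter-overrides option accepts as far as I know; whether you deploy the Lambda this way is an assumption. All values, including the hostname, are placeholders.

```python
# Region names as listed above, keyed by the ending of the Coralogix URL.
REGION_BY_SUFFIX = {".com": "Europe", ".us": "US", ".in": "India"}


def region_for(hostname):
    """Pick the CoralogixRegion value from a Coralogix hostname."""
    for suffix, region in REGION_BY_SUFFIX.items():
        if hostname.endswith(suffix):
            return region
    raise ValueError(f"unrecognized Coralogix hostname: {hostname}")


def parameter_overrides(params):
    """Format parameters as space-separated Key=Value pairs."""
    return " ".join(f"{name}={value}" for name, value in params.items())


# All values below are placeholders, not real account details.
params = {
    "S3BucketName": "my-log-bucket",
    "ApplicationName": "checkout",
    "CoralogixRegion": region_for("example.coralogix.us"),
    "PrivateKey": "REPLACE_ME",
    "SubsystemName": "payments",
}
print(parameter_overrides(params))
```

In practice the private key should come from a secrets store or an environment variable rather than being written into a script.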

The code for the Lambda is also available at the S3 to Coralogix Lambda GitHub. As with Logstash, it’s open source under the Apache 2.0 License, so you can use, modify, and distribute it freely as long as you follow the license’s attribution and notice conditions.

To Conclude

Software as a Service is a paradigm that is transforming every part of the IT sector, including DevOps. It replaces difficult-to-configure on-premises architecture with uniform and consistent services that remove scalability from the list of an end user’s concerns.

Unfortunately, SaaS observability tooling is still falling behind the curve, largely because organizations are still maintaining a plethora of systems (and therefore a variety of formats) on-prem.

Storing your logs in S3 lets you bring the power and convenience of SaaS to log collection. Once your logs are in S3, you can leverage Coralogix’s machine learning analytics to extract insights and predict trends.