How to Mitigate DevOps Tool Sprawl in Enterprise Organizations

There’s an insidious disease increasingly afflicting DevOps teams. It begins innocuously. A team member suggests adding a new logging tool. The senior dev decides to upgrade the tooling. Then it bites. 

You’re spending more time navigating between windows than writing code. You’re scared to make an upgrade because it might break the toolchain.

The disease is tool sprawl.  It happens when DevOps teams use so many tools that the time and effort spent navigating the toolchain is greater than the savings made by new tools.  

Tool Sprawl: What’s the big problem?

Tool sprawl is not something to be taken lightly.  A 2016 DevOps survey found that 53% of large organizations use more than 20 tools.  In addition, 53% of teams surveyed don’t standardize their tooling.

It creates what Joep Piscaer calls a “tool tax”, along with increased technical debt and reduced efficiency, all of which can bog down your business and demoralize your team.

Reduced speed of innovation

With tool sprawl, a DevOps team is more likely to have impaired observability, since data from different tools won’t necessarily be correlated. This reduces the team’s ability to detect anomalous system activity and locate the source of a fault, increasing both Mean Time To Detection (MTTD) and Mean Time To Repair (MTTR).

Also, an overabundance of tools can result in increased toil for your DevOps team. Google’s SRE org defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Tool sprawl creates toil by forcing DevOps engineers to continually switch between lots of different tools, which may or may not be properly integrated. This cuts into the time available for useful, productive work such as coding.

Finally, tool sprawl reduces your system’s scalability. This is a real blocker to businesses that want to go to the next level. They can’t scale their application and may have trouble expanding their user base and developing innovative features.

Lack of integration and data silos

A good DevOps pipeline depends on a well-integrated toolchain. When tool sprawl goes unchecked, it can result in a poorly integrated set of tools. DevOps teams are forced to get around this by implementing ad hoc workarounds, which decrease the resilience and reliability of the toolchain.

This reduces the rate of innovation and modernization in your DevOps architecture. Engineers are too scared to make potentially beneficial upgrades because they don’t want to risk breaking the existing infrastructure.

Another problem created by tool sprawl is that of data silos. If different DevOps engineers use their own dashboards and monitoring tools, it can be difficult (if not impossible) to pool data. This reduces the overall visibility of the system and consequently reduces the level of insights available to the team.

Data silos also cause a lack of collaboration.  If every ops team is looking at a different data set and using their own monitoring tool, they can’t meaningfully communicate.

Reduced team productivity

Engineers add tools to increase productivity, not to reduce it. Yet having too many actually has the opposite effect. 

Tool sprawl can seriously disrupt the creative processes of engineers. Being forced to pick their way through a thicket of unstandardized and badly integrated tooling breaks their flow, reducing their ability to problem solve. This makes them less effective as engineers and reduces the team’s operational excellence.

Another impairment to productivity is the toxic culture created by a lack of collaboration and communication between different parts of the team. In the previous section, we saw how data silos resulted in a lack of team collaboration.

The worst case of this is that it can lead to a culture of blame. Each part of the team, cognizant only of the information on its part of the system, tries to rationalize that information and treat its view as correct.

This leads to them neglecting other parts of the picture and blaming non-aligned team members for mistakes.

The “Dark Side” of the toolchain

In Star Wars, all living things depended on the Force. Yet the Force was double-edged; it had a light side and a dark side. Similarly, a DevOps pipeline depends on an up-to-date toolchain that can keep pace with the demands of the business.

Yet in trying to keep their toolchain beefed up, DevOps teams constantly run the risk of tool sprawl. Tooling is often upgraded organically in response to the immediate needs of the team. As Joep warns, though, upgrading tooling carelessly can create more problems than it solves. It adds complexity and operational burdens.

Solving the problem of tool sprawl

Consider your options carefully

One way that teams can prevent tool sprawl is by thinking much more carefully about the pros and cons of adding a new tool.  As Joep explains, tools have functional and non-functional aspects. Many teams become sold on a new tool based on the functional benefits it brings. These could include allowing the team to visualize data or increasing some aspect of observability.

What they often don’t really think about are the tool’s non-functional aspects.  These can include performance, ease of upgrading, and security features.

If a tool were a journey, its function would be the destination and its non-functional aspects would be the route it takes. Many teams are like complacent passengers, saying “wake me when we get there” while taking no heed of potential hazards along the way.

Instead, they need to be like ship captains, navigating the complexities of their new tool with foresight and avoiding potential problems before they sink the ship.

Before incorporating a tool into their toolchain, teams need to think about operational issues. These can be anything from the number of people needed to maintain the tool to the repository where new versions are published.

Teams also need to consider agility. Is the tool modular and extensible? If so, it will be relatively easy to enhance functionality downstream. If not, the team may be stuck with obsolescent tooling that they can’t get rid of.

Toolchain detox

Another tool sprawl mitigation strategy is to opt for “all-in-one” tools that let teams achieve more outcomes with less tooling. A recent study advocates for using a platform vendor that possesses multiple monitoring, analytics and troubleshooting capabilities.

Coralogix is a good example of this kind of platform.  It’s an observability and monitoring solution that uses a stateful streaming pipeline and machine learning to analyze and extract insights from multiple data sources.  Because the platform leverages artificial intelligence to extract patterns from data, it has the ability to combat data silos and the dangers they bring.

Trusting log analytics to machine learning makes it possible to avoid human limitations and ingest data from all over the system. This data can be pooled and algorithmically analyzed to extract insights that human engineers might not have reached.

In addition, Coralogix can be integrated with a range of external platforms and solutions.  These range from cloud providers like AWS and GCP to CI/CD solutions such as Jenkins and CircleCI.

While we don’t advise paring down your toolchain to just one tool, a platform like Coralogix goes a long way toward optimizing IT costs and mitigating tool sprawl before it becomes a problem.

The tool consolidation roadmap

For those who are currently wrestling with out-of-control tool sprawl, there is a way out! The tool consolidation roadmap shows teams how to go from a fragmented, ad hoc toolchain to one that is modern and free of unnecessary tools. The roadmap consists of three phases.

Phase 1 – Plan

Before a team starts the work of tool consolidation, they need to plan what they’re going to do. The team needs first to ascertain the architecture of the current toolchain as well as the costs and benefits to tool users.

Then they must collectively decide what they want to achieve from the roadmap. Each part of the team will have its own desired outcomes, and the resulting toolchain needs to cater to everybody’s interests.

Finally, the team should draw up a timeframe outlining the tool consolidation steps and how long they will take to implement.

Phase 2 – Prepare

The second phase is preparation. This requires the team to draw up a comprehensive list of use cases and map them onto a list of potential solutions. The aim of this phase is to really hash out what high-level requirements the final solution needs to satisfy and flesh these requirements out with lots of use cases.

For example, the DevOps team might want higher visibility into database instance performance.  They may then construct use cases around this: “as an engineer, I want to see the CPU utilization of an instance”.

The team can then research and inventory possible solutions that can enable those use cases.

Phase 3 – Execute

Finally, the team can put its plan into action. This step involves several different components. Having satisfied themselves that the chosen solution best enables their objectives, the team needs to deploy it.

This requires testing to make sure it works as intended and deploying to production.  The team needs to use the solution to implement any alerting and event management strategies they outlined in the plan.

As an example, Coralogix has dynamic alerting. This enables teams by alerting them to anomalies without requiring them to set a threshold explicitly.

Last but not least, the team needs to document its experience to inform future upgrades, as well as training all team members on how to get the best out of the new solution. (Coralogix has a tutorials page to help with this.)

Wrapping Up

A DevOps toolchain is a double-edged sword. Used well, upgraded tooling can reduce toil and enhance the capacity of DevOps engineers to solve problems. However, ad hoc upgrades that don’t take the non-functional aspects of new tools into account lead to tool sprawl.

Tool sprawl reverses all the benefits of a good toolchain. Toil increases, and DevOps teams spend so much time navigating the intricacies of their toolchain that they cannot do their jobs properly.

Luckily, tool sprawl is solvable. Systems like Coralogix go a long way towards fixing a fragmented toolchain, by consolidating observability and monitoring into one platform.  We’ve seen how teams in the thick of tool sprawl can extricate themselves through the tool consolidation roadmap.

Tooling, like candy, can be good in moderation but bad in excess. 

Why Are SaaS Observability Tools So Far Behind?

Salesforce was the first of many SaaS-based companies to succeed and see massive growth. Since the company started out in 1999, Software-as-a-Service (SaaS) tools have taken the IT sector and, well, the world by storm. For one, they mitigate bloatware by moving applications from the client’s computer to the cloud. Plus, the sheer ease of use brought by cloud-based, plug-and-play software solutions has transformed all sorts of sectors.

Given the SaaS paradigm’s success in everything from analytics to software development itself, it’s natural to ask whether its Midas touch could improve the current state of data observability tools.

Heroku and the Rise of SaaS

Let’s start with a system that we’ve previously talked about, Heroku. Heroku is one of the most popular platforms for deploying cloud-based apps. 

Using a Platform-as-a-Service approach, Heroku lets developers deploy apps in managed containers with maximum flexibility. Instead of hosting apps on traditional servers, Heroku provides something called dynos.

Dynos are like cradles for applications. They utilize the power of containerization to provide a flexible architecture that takes the hassle of on-premises configuration away from the developer. (We’ve previously talked about the merits of SaaS vs Hosted solutions.)

Heroku’s dynos make scalability effortless. If developers want to scale their app horizontally, they can simply add more dynos. Vertical scaling can be achieved by upgrading dyno types, a process Heroku facilitates through its intuitive dashboard and CLI.
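
For instance, scaling with the Heroku CLI is a couple of commands. The app and process names below are purely illustrative:

```bash
# Horizontal scaling: run three web dynos instead of one
heroku ps:scale web=3 --app my-example-app

# Vertical scaling: move the web process onto a larger dyno type
heroku ps:resize web=standard-2x --app my-example-app

# Check the result
heroku ps --app my-example-app
```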

Heroku can even take scaling issues off the developer’s hands completely with its auto-scaling feature. This means that software companies can focus on their mission: providing high-quality software at scale, without worrying about the ‘how’ of scalability or configuration.

Systems like Heroku give us a tantalizing glimpse of the power and convenience a SaaS approach can bring to DevOps. The hassles of resource management, configuration, and deployment are abstracted away, allowing developers to focus solely on coding.

SaaS is making steady inroads into DevOps. For example, Coralogix (which integrates with Heroku and is also available as a Heroku add-on), operates with a SaaS approach, allowing users to analyze logs without worrying about configuration details.

Not So SaaS-y Tooling

It might seem that nothing is stopping SaaS from being applied to all aspects of observability tooling. After all, Coralogix already offers a SaaS log analytics solution, so why not just make all logging as SaaS-y as possible?

Log collection is the fly in this particular ointment.  Logging data is often stored in a variety of formats, reflecting the fact that logs may originate from very different systems.  For example, a Linux server will probably store logs as text data while Kubernetes can use a structured logging format or store the logs as JSON.
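
To make that concrete, here are two invented log lines describing the same request, first in a plain-text syslog style and then as structured JSON:

```
# Plain-text line as a Linux server might write it (illustrative)
Mar 12 14:32:07 web-01 nginx[1234]: 10.0.0.5 - - "GET /health HTTP/1.1" 200 612

# The same event as structured JSON from a containerized workload (illustrative)
{"time":"2024-03-12T14:32:07Z","level":"info","service":"nginx","client":"10.0.0.5","path":"/health","status":200,"bytes":612}
```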

Because every system has its own logging format, organizations tend to collect their logs on-premises, and this is a big roadblock to the smooth uptake of SaaS. In reality, the variety of systems, in addition to the option to build your own, is symptomatic of a slower move toward observability in the enterprise. However, this range of options doesn’t mean that log analysis is limited to on-prem systems.

What’s important to note is that organizations are really missing out on SaaS observability tooling. Why is this the case, when SaaS tools and platforms are so widespread? The perceived complexity of varying formats, combined with potential cloud-centric security concerns, might have a role to play.

Moving to Cloud-Based Log Storage with S3 Bucket

To pave the way to Software as a Service log collection, we need to stop storing logs on-prem and move them to the cloud.  Cloud computing is the keystone of SaaS. Applications can be hosted on centralized computing resources and piped to thousands of clients.

AWS lets you store logs in the cloud with S3 buckets. S3 is short for Simple Storage Service. As the name implies, it’s a service provided by AWS that is specifically designed to let you store and access data quickly and easily.
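
Getting log files into S3 can be as simple as a couple of AWS CLI calls. A minimal sketch, with an example bucket name:

```bash
# Create a bucket for log storage (name and region are examples)
aws s3 mb s3://my-example-log-bucket --region us-east-1

# Upload a local log file into the bucket
aws s3 cp /var/log/app/app.log s3://my-example-log-bucket/logs/app.log

# Confirm it arrived
aws s3 ls s3://my-example-log-bucket/logs/
```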

Pushing Logs to S3 with Logstash and FluentD

For those whose logging pipelines aren’t already AWS-native, output plugins allow users to push existing log records to S3. Two of the most popular logging solutions are FluentD and Logstash, so we’ll look at those here. (Coralogix integrates with both FluentD and Logstash.)

FluentD Plugin

FluentD contains a plugin called out_s3, which enables users to write pre-existing log records to an S3 bucket. Out_s3 has several cool features.

For one, it splits files using the time event logs were created. This means the S3 file structure accurately reflects the original time ordering of log records and not just when they were uploaded to the bucket.

Another thing out_s3 allows users to do is incorporate metadata into the log records.  This means each log record contains the name of its S3 Bucket along with the object key. Downstream systems like Coralogix can then use this info to pinpoint where each log record came from.

At this point, I should mention something that could catch new users out. FluentD’s plugin automatically creates files on an hourly basis. This can mean that when you first upload log records, a new file isn’t created immediately, as it would be with most systems.

While you can’t rely on new files being created immediately, you can control how frequently they are created by configuring the time key condition.
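
As a rough sketch, an out_s3 match block with an hourly time key might look like this. The tag pattern, credentials, and bucket are placeholders, and the buffer section uses FluentD v1 syntax:

```conf
<match app.**>
  @type s3
  aws_key_id  YOUR_AWS_KEY_ID
  aws_sec_key YOUR_AWS_SECRET_KEY
  s3_bucket   my-example-log-bucket
  s3_region   us-east-1
  path        logs/

  <buffer time>
    @type file
    path /var/log/fluent/s3
    timekey 3600         # split files by hour of event time
    timekey_wait 10m     # wait before flushing so late events land in the right file
    timekey_use_utc true
  </buffer>
</match>
```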

Logstash Plugin

Logstash’s output plugin is open source and comes under the Apache 2.0 license, a permissive license that places very few restrictions on how you use it. It uploads batches of Logstash events in the form of temporary files, which by default are stored in the operating system’s temporary directory.

If you don’t like the default save location, Logstash gives you a temporary_directory option that lets you stipulate a preferred save location.
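
A minimal Logstash output block along these lines might look like the sketch below; the bucket, directory, and rotation values are illustrative rather than recommendations:

```conf
output {
  s3 {
    region              => "us-east-1"
    bucket              => "my-example-log-bucket"
    size_file           => 5242880                # rotate the temporary file after ~5 MB
    time_file           => 5                      # or after 5 minutes, whichever comes first
    temporary_directory => "/opt/logstash/s3_tmp" # override the OS default temp directory
    codec               => "json_lines"
  }
}
```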

Securing Your Logs

Logs contain sensitive information, so a crucial task for anyone taking the S3 log storage route is making sure their buckets are secure. Amazon S3 default encryption enables users to ensure that new log file objects are encrypted by default.
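
Enabling default encryption is a single API call. A sketch using the AWS CLI, with an example bucket name:

```bash
# Turn on default encryption (SSE-S3 / AES-256) for the bucket
aws s3api put-bucket-encryption \
  --bucket my-example-log-bucket \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Confirm it took effect
aws s3api get-bucket-encryption --bucket my-example-log-bucket
```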

If you’ve already got some logs in an S3 bucket and they aren’t yet encrypted, don’t worry. S3 has a couple of tools that let you encrypt existing objects quickly and easily.

Encryption through Batch Operations

One tool is S3 Batch Operations. Batch Operations is S3’s mechanism for performing operations on billions of objects at a time. Simply provide S3 Batch Operations with a list of the log files you want to encrypt, and the API performs the appropriate operation on each one.

Encryption can be achieved by using the copy operation to copy unencrypted files to encrypted files in the same S3 Bucket location.
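
As a rough sketch, creating such a job with the AWS CLI looks something like the following. The account ID, role, bucket, manifest, and ETag are all placeholders, and the example assumes default encryption is already enabled on the bucket so the copied objects land encrypted:

```bash
aws s3control create-job \
  --account-id 111122223333 \
  --priority 10 \
  --role-arn arn:aws:iam::111122223333:role/s3-batch-ops-role \
  --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820","Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::my-example-log-bucket/manifest.csv","ETag":"example-etag"}}' \
  --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::my-example-log-bucket"}}' \
  --report '{"Bucket":"arn:aws:s3:::my-example-log-bucket","Format":"Report_CSV_20180820","Enabled":true,"Prefix":"batch-reports","ReportScope":"FailedTasksOnly"}'
```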

Encryption through Copy Object API

An alternative tool is the Copy Object API. This tool works by copying a single object back to itself using SSE encryption and can be run using the AWS CLI.

Although Copy Object is a powerful tool, it’s not without risks. You’re effectively replacing your existing log files with encrypted versions, so make sure all the requisite information and metadata is preserved during the copy.

For example, if you are copying log files larger than the multipart_threshold value, the Copy Object API won’t copy the metadata by default. In this case, you need to specify the metadata you want to keep using the --metadata parameter.
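
A hedged example using the AWS CLI, with placeholder bucket and object names:

```bash
# Copy a small log object onto itself, adding SSE-S3 encryption
aws s3 cp s3://my-example-log-bucket/logs/app.log \
          s3://my-example-log-bucket/logs/app.log \
          --sse AES256

# For objects above the multipart_threshold, re-specify any metadata you need to keep
aws s3 cp s3://my-example-log-bucket/logs/big.log \
          s3://my-example-log-bucket/logs/big.log \
          --sse AES256 \
          --metadata-directive REPLACE \
          --metadata retention=90d,source=web-01
```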

Integrating S3 Buckets with Coralogix

Hooray! Your logs are now firmly in the cloud with S3. Now, all you need to do is analyze them.  Coralogix can help you do this with the S3 to Coralogix Lambda.

This is a Lambda function that sends log data from your S3 bucket to Coralogix, where the full power of machine learning can be applied to uncover insights. To use it, you need to define five parameters:

  • S3BucketName: the name of the S3 bucket storing the CloudTrail logs.
  • ApplicationName: a mandatory metadata field that is sent with each log and helps to classify it.
  • CoralogixRegion: the region in which your Coralogix account is located. This can be Europe, US or India, depending on whether your Coralogix URL ends with .com, .us or .in.
  • PrivateKey: found in your Coralogix account under Settings -> Send your logs, in the upper left corner.
  • SubsystemName: a mandatory metadata field that is sent with each log and helps to classify it.

The S3 to Coralogix Lambda can be integrated with AWS’s automation framework through the Serverless Application Model (SAM). SAM is an AWS framework that provides resources for creating serverless applications, such as shorthand syntax for APIs and functions.
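
Assuming you deploy the template with the SAM CLI, passing the five parameters might look like the sketch below. The values are placeholders, and the exact parameter handling depends on the published template:

```bash
sam deploy \
  --stack-name s3-to-coralogix \
  --capabilities CAPABILITY_IAM \
  --resolve-s3 \
  --parameter-overrides \
    S3BucketName=my-example-log-bucket \
    ApplicationName=prod-payments \
    SubsystemName=checkout-api \
    CoralogixRegion=Europe \
    PrivateKey=00000000-0000-0000-0000-000000000000
```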

The code for the Lambda is also available in the S3 to Coralogix Lambda GitHub repository. As with Logstash, it’s open source under the Apache 2.0 license, so you’re free to use and adapt it.

To Conclude

Software as a Service is a paradigm that is transforming every part of the IT sector, including DevOps. It replaces difficult-to-configure on-premises architecture with uniform and consistent services that remove scalability from the list of an end user’s concerns.

Unfortunately, SaaS observability tooling is still falling behind the curve, but largely because organizations are still maintaining a plethora of systems (and therefore a variety of formats) on-prem. 

Storing your logs in S3 lets you bring the power and convenience of SaaS to log collection. Once your logs are in S3, you can leverage Coralogix’s machine learning analytics to extract insights and predict trends.

SaaS vs Hosted Solutions: Which Should You Choose and Why?

A key decision that must be made in a product’s lifecycle is SaaS vs Hosted. Should we rely solely on hosted solutions, or are we going to utilize SaaS offerings?

Hosted solutions are services from providers like AWS and Azure, that take away the operational burden from some well-known pieces of tech, like Kafka or Grafana. SaaS solutions on the other hand are platforms like Coralogix, that offer proprietary value. They each offer benefits and drawbacks. Let’s get into it.

What is SaaS?

‘Software as a service’ is a complete, proprietary solution including all of the infrastructure, software, and components. It is defined as software that is made available on-demand and is most often provided from the Cloud. A user can simply create an account, purchase a subscription, and never have to worry about backend infrastructure or deploying the application.

Great examples of these services are Office365 and G-Suite, which allow users to easily edit and save documents in the Cloud.

What is a Hosted Solution?

A hosted solution takes an open-source tool and bundles it into a managed, core product. For example, AWS offers a hosted Kafka service. The user pays for its usage of the CPU and a fixed overhead for the maintenance of the Kafka instance. In exchange, the end-user doesn’t need to worry about the patching and maintenance of the service.

The Key Differences Between SaaS vs Hosted

There are some key considerations to make when choosing between a SaaS offering and a hosted service. These criteria are key to ensuring that you’re making the best architectural decision for your project.

The Support Model

When you pay for a hosted service, you’re paying for the bare minimum. You’re paying for the CPU and the assumption is that you can take it from there. That means you will get little help understanding optimal use of the software and maximizing efficiency.

SaaS offerings can vary in this respect, but they almost uniformly provide methods for optimizing your use of the platform. For example, Coralogix’s support team is available 24/7 to help customers normalize their log data and reduce storage costs. This service and functionality are wrapped into your subscription fee. Rather than simply paying to host and use a piece of software, you’re paying for a full service.

And who gets up at 4am?

If your hosted service malfunctions because of your usage, you’re still the one getting out of bed. This is because the line of responsibility is firmly on your side. You may not be responsible for the NFRs of your hosted service, but if you misconfigure your instance, it is very much on you.

The Pricing Model

In most cases, the pricing models of both hosted solutions and SaaS offerings are similar. You pay for usage, plus some fixed overhead paid as a service charge to the provider. SaaS offerings are typically a little more expensive here because they offer more features and have taken on more of the support burden.

But what if you’re not ‘most cases’?

If you’re one of the few clients who are operating at a large scale, your operational cost will come down to two things:

  • Your ability to optimize your use of expensive resources, such as 3rd party services
  • How well you can haggle a great deal with a provider

With many hosted services, the former is an option but the latter is very rarely on the table. You will pay a price on a fixed scale, with no debate or wiggle room. This means that if you need a large scale engagement with your hosted service, you should prepare for large scale spending.

For SaaS services, beyond a certain size (i.e., enterprise scale), providers actually encourage you to get in contact to review pricing. Remember, they have greater control of their system, which means they can make special provisions to cut costs and optimize your engagement with them. For example, they might spin up a separate cluster for you, rather than putting you onto a less optimized, shared cluster.

The Learning Curve

Open-source software comes with learning curves of varying degrees. In some cases, the software is highly intuitive and simple. In others, it is complex and niche. For example, it is very difficult to intuit the best possible practices for your Kafka instance. Kafka has a great many configuration options that all require time, effort, and experimentation to understand and optimize.

SaaS offerings are all about ease. Rather than offering you a thousand different options, they will most commonly abstract this functionality behind a much more user-friendly interface. This is because they are creating an entirely new product – an abstraction – that is built to serve a wider community of consumers.

Features and Vendor Lock-in

Features comparisons are common when buyers are considering products, and it’s possible that features may be largely similar when weighing up a SaaS vs hosted product or service. However, there are a few key things to consider before making your decision.

Feature Development Agility

As a general rule, hosted solutions or services only have the features of the core product. Enhancements are often reliant on in-house expertise specific to the product. In some cases, new features being released may offer needed benefits, but this is typically less frequent and more of a manual process with hosted solutions. Plus, any newer features will not deviate much from the original product.

Conversely, with SaaS products or services, the developers have much more freedom, enabled by things like cloud-native application development and open source technologies. This trickles down to the customer, giving them faster and varied feature releases, a greater degree of out-of-the-box customization, and often direct feedback to the product development team.

As a SaaS customer, you may even be given early access to a beta feature before it goes to general release. This integrated customer experience is a major reason why the popularity of SaaS is on the rise. 

Vendor Lock-in

Vendor lock-in is something typically associated with monolithic on-prem solutions, but it can be just as common with modern hosted and SaaS solutions. It typically occurs when a software or service manufacturer develops specific features or architectures in a way which makes it either too costly or time-consuming to move to another platform.

The issue of vendor lock-in is far more common with hosted solutions, just by nature of their installation and maintenance requirements. However, there is a danger with SaaS providers as well.

A benefit of SaaS solutions is that the mechanisms for avoiding vendor lock-in are easier and more accessible than with hosted solutions. It’s far easier to build system abstractions around a third-party solution if it’s a SaaS product. Working with open standards and portability in mind will serve you well. More often than not, the proprietary features from a SaaS provider give you the resources to carry out such portability work.

Typically, a customer’s biggest concern is that the SaaS provider will hold their data hostage if they choose to leave or migrate to another platform. There are two ways to mitigate against this happening. First, choose well-reviewed and robust enterprise-grade SaaS solutions. Second, always have a disaster recovery solution in place. If your DR strategy can’t cope with a vendor holding onto your data, then it certainly can’t cope with a ransomware attack.

Conclusion

When it comes down to it, there are a number of reasons why the vast majority of customers will have a majority-SaaS workspace or architecture in the coming years. The support model for SaaS solutions may add to the price, but it greatly reduces what you’d otherwise spend on on-call staff or 24/7 NOCs.

Built on cloud-native, open-source principles, SaaS products are designed for ease. Functionality is dictated by customer requirements, as opposed to hosted solutions, whose near-infinite number of configuration options come with their own interdependencies and considerations. With a SaaS vendor, you’re much more likely to get a say in what that next feature might look like.

SaaS solutions are designed for easy consumption and are centered around customer success. If you’re concerned about vendor lock-in, you’re in a much better position with a SaaS solution than a hosted one.

All in all, the choice between SaaS vs Hosted should be an easy one.