
Cloud Configuration Drift: What Is It and How to Mitigate it

  • Thomas Russell
  • July 5, 2022

More organizations than ever run on Infrastructure-as-Code (IaC) cloud environments. While the migration brings unparalleled scale and flexibility advantages, it also introduces unique security and ops issues that many don’t foresee.

So what is the major IaC ops and security vulnerability? Configuration drift.

Cloud config drift isn’t a niche concern. Both global blue-chips and local SMEs have harnessed coded infrastructure. However, many approach their system security and performance monitoring the same way they would for a traditional, hardware-based system.

Knowing how to keep the deployed state of your cloud environment in line with your planned configuration is vital. Without tools and best practices to mitigate drift, the planned infrastructure and the as-is deployed code inevitably diverge. This creates performance issues and security vulnerabilities.

Luckily, IaC integrity doesn’t have to be an uphill struggle. Keep reading if you want to keep config drift at a glacial pace.  

What is config drift?

In simple terms, configuration drift is when the current state of your infrastructure no longer matches the state defined in your IaC code.
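
To make that concrete, here’s a minimal sketch of what drift looks like in practice. Every resource name and attribute below is a hypothetical example; the point is simply that the state your code declares and the state actually running no longer agree:

```python
# A minimal sketch of what "drift" means: the declared (coded) state and the
# observed (deployed) state are compared key by key.
# All resource names and attributes here are hypothetical examples.

declared = {
    "web-server": {"instance_type": "t3.medium", "port": 443},
    "db": {"instance_type": "db.t3.small", "multi_az": True},
}

observed = {
    "web-server": {"instance_type": "t3.large", "port": 443},  # resized via the console
    "db": {"instance_type": "db.t3.small", "multi_az": True},
    "debug-box": {"instance_type": "t3.micro", "port": 22},    # never added to the code
}

def find_drift(declared: dict, observed: dict) -> list[str]:
    """Report attributes that differ and resources that exist outside the code."""
    issues = []
    for name, attrs in observed.items():
        if name not in declared:
            issues.append(f"{name}: exists in the environment but not in the IaC")
            continue
        for key, value in attrs.items():
            if declared[name].get(key) != value:
                issues.append(f"{name}.{key}: code says {declared[name].get(key)!r}, live value is {value!r}")
    return issues

for issue in find_drift(declared, observed):
    print(issue)
```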

Even the most carefully coded infrastructure changes after a day of real-world use. Every action alters the deployed environment. This is manageable at a small scale, but it becomes a constant battle when you have 100+ engineers, as many enterprise-level teams do. Every engineer making console-based changes causes drift. While each change may be small, they quickly add up at the operational scale of most businesses. The same flexibility that prompted the great enterprise migration to the cloud can also become a vulnerability.

Config changes in your environment are constant, both deliberate and accidental, especially in large organizations where multiple teams (of varying levels of expertise) work on and in the same IaC environment. Over time, these changes mount up and lead to drift.

Why does cloud-config drift happen?

Cloud infrastructure allows engineers to do more with fewer human hours and fewer pairs of hands. Environments and assets can be created and deployed daily, in the thousands if scale demands it. Many update automatically, pulling new config files and code from external sources. Cloud environments are constantly growing and adapting, with or without human input.

However, this semi-automated state of flux creates a new problem. Think of it as time travel in cinema: a small action in the past produces an entirely different version of the present. With IaC, a slight change in the code can leave the deployed as-is system radically different from the planned configuration your engineers are working from.

Here’s the problem: small changes in IaC code always happen. Cloud environments are flexible and create business agility precisely because the coded infrastructure is so malleable. Drift is inevitable, or at least it can feel that way if you don’t have a solution that adequately mitigates it.

What makes config drift a performance and security risk

Traditional monitoring approaches don’t work for cloud environments. Monitoring stacks could be mapped to config designs with minimal issues in a monolithic system. It would be difficult for a new machine or database to appear without engineers noticing when they’d require both physical hardware and human presence to install. The same can’t be said for coded infrastructure.

If your system visibility reflects plans and designs instead of the actual deployed state, the gap between what your engineers see and what’s actually happening widens every hour. Unchecked config drift doesn’t just create blind spots; it creates deep, invisible chasms.

For performance, this causes problems. Cloud systems aren’t nearly as disrupted by high-activity peaks as the physical systems of decades ago, but they’re not entirely immune. The buildup of unoptimized assets and processes leads to noticeable performance issues, no matter how airtight your initial config designs are.

Security without full system visibility is a risk that shouldn’t need explaining, yet that’s exactly what config drift leads to. Config drift doesn’t just open a back door for cybercriminals; it hands them the keys to your digital property.

Common causes of IaC config drift

Configuration drift in IaC can feel unavoidable. However, key areas are known to create drift if best practices and appropriate tooling aren’t in place.

Here are some of the most common sources of config drift in cloud environments. If your goal is to maintain a good security posture and an IaC system free of drift-related disruption, addressing the following is an excellent place to start.

Automated pipelines need automated discovery

Automation goes hand-in-hand with IaC. While automated pipelines bring the flexibility and scale necessary for a 21st-century business, they’re a vulnerability in a cloud environment if you rely on manual discovery and system mapping.

Once established, a successful automated pipeline will generate and deploy new assets with little-to-no human oversight. That’s great for productivity, but potentially a nightmare if those assets are misconfigured or there are no free engineering hours to confirm new infrastructure is visible to your monitoring and security stacks.

IaC monitoring stacks need to incorporate AI-driven automated discovery, which reduces the need for manual system mapping. Manual discovery is tedious on a small scale; it becomes a full-time commitment in a large cloud environment that changes daily.

More importantly, automated discovery ensures new assets are visible from the moment they’re deployed. There’s no vulnerable period in which a recently deployed asset is active but still undiscovered by your monitoring and security stacks. Automated discovery doesn’t just save time; it delivers better results and a more secure environment.

An automated pipeline is only one poorly written line of code away from being a conveyor belt of misconfigured assets. Automated discovery ensures pipelines aren’t left to drag your deployed systems further and further from the configured state your security and monitoring stacks operate by.
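
To sketch the idea (assuming an AWS environment and the boto3 SDK, with a hypothetical inventory standing in for whatever your IaC state actually exports), automated discovery boils down to regularly comparing what’s running against what’s declared:

```python
# A rough sketch of automated discovery: periodically list what is actually
# running and flag anything your IaC state doesn't know about.
# Assumes AWS and boto3; KNOWN_INSTANCE_IDS is a hypothetical stand-in for an
# inventory exported from your IaC state.
import boto3

KNOWN_INSTANCE_IDS = {"i-0abc123example", "i-0def456example"}  # hypothetical

def discover_unmanaged_instances(region: str = "us-east-1") -> list[str]:
    """Return EC2 instance IDs that are live but absent from the declared inventory."""
    ec2 = boto3.client("ec2", region_name=region)
    unmanaged = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["InstanceId"] not in KNOWN_INSTANCE_IDS:
                    unmanaged.append(instance["InstanceId"])
    return unmanaged

if __name__ == "__main__":
    for instance_id in discover_unmanaged_instances():
        print(f"Undeclared instance discovered: {instance_id}")
```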

Resource tagging isn’t just there to make sysadmins’ lives easier

The nature of automated deployment means untagged assets are an ever-present risk. Vigilance is vital, especially when operating at scale. This is where real-time environment monitoring becomes a security essential.

Every incorrect or absent tag drifts your deployed state further from the planned config. Given the scale of automated deployment, it’s rarely just a single asset, either. Over time, the volume of these unaccounted-for “ghost” resources in your system multiplies.

This creates both visibility and governance issues. Ghost resources are almost impossible to monitor and pose significant challenges for optimization and policy updates. Unchecked, these clusters of invisible, unoptimized resources create large security blind spots and environment-wide config drift.

A real-time monitoring function that scans for untagged assets is crucial. Platforms like Coralogix alert your engineers to untagged resources as they’re deployed. From there, they can be de-ghosted with AI/ML-automated tagging or removed entirely. Either way, they’re no longer left to build up into a source of drift or a slack security posture.
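
As a rough illustration of that kind of scan (again assuming AWS and boto3, with a hypothetical set of required tag keys standing in for your actual tagging policy):

```python
# A rough sketch of an untagged-asset scan: flag EC2 instances that are
# missing the tags your governance policy requires.
# Assumes AWS and boto3; REQUIRED_TAG_KEYS is a hypothetical policy.
import boto3

REQUIRED_TAG_KEYS = {"owner", "environment", "cost-center"}  # hypothetical policy

def find_untagged_instances(region: str = "us-east-1") -> dict[str, set[str]]:
    """Map instance IDs to the required tag keys they are missing."""
    ec2 = boto3.client("ec2", region_name=region)
    missing_by_instance = {}
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                present = {tag["Key"] for tag in instance.get("Tags", [])}
                missing = REQUIRED_TAG_KEYS - present
                if missing:
                    missing_by_instance[instance["InstanceId"]] = missing
    return missing_by_instance

if __name__ == "__main__":
    for instance_id, missing in find_untagged_instances().items():
        print(f"{instance_id} is missing tags: {', '.join(sorted(missing))}")
```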

Undocumented changes invite configuration drift, no exceptions

Change is constant in coded infrastructure. Documenting every change, no matter how small or trivial, is critical.

One undocumented change probably won’t bring your systems to a halt (although this can and has happened). However, a culture of lax adherence to good practice rarely means just one undocumented alteration. Over time all these unregistered manual changes mount up.

Effective system governance is predicated on updates finding assets in a certain state. If the state is different, those updates won’t be applied correctly (if at all). As you can imagine, an environment containing code that doesn’t match what systems expect to find means the deployed state moves further from the predefined configuration with every update.

A simple but effective solution? AI/ML-powered alerting. Engineers can easily find and rectify undocumented changes if your stack includes functionality to bring them to their attention. Best practice and due diligence are key, but they rely on people. For the days when human error rears its head, real-time monitoring and automated alerts stop undocumented manual changes from building up to drift-making levels.
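
One simple way to automate that check is a scheduled drift scan. The sketch below assumes Terraform-managed infrastructure and uses the documented behavior of terraform plan -detailed-exitcode (exit code 2 means the live state differs from the code); the notify() hook is a hypothetical placeholder for your alerting integration:

```python
# A simple sketch of scheduled drift detection: run `terraform plan` against
# the live environment and raise an alert when it reports pending changes.
# With -detailed-exitcode, terraform plan exits 0 for no changes, 1 on error,
# and 2 when the live state differs from the code.
# notify() is a hypothetical placeholder for your alerting hook.
import subprocess

def notify(message: str) -> None:
    """Hypothetical alerting hook; wire this to your observability platform."""
    print(f"ALERT: {message}")

def check_for_drift(working_dir: str = ".") -> None:
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color"],
        cwd=working_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 2:
        notify("Terraform reports differences between the code and the live environment.")
    elif result.returncode == 1:
        notify(f"terraform plan failed: {result.stderr.strip()}")

if __name__ == "__main__":
    check_for_drift()
```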

That being said, allow your IaC to become living documentation

While AI/ML-powered alerting should still be part of your stack, a culture shift away from overreliance on documentation also goes a long way toward mitigating IaC drift. With coded infrastructure, you can always ask yourself, “do I need this documented outside the code itself?”

Manually documenting changes was essential in traditional systems. Since IaC cloud infrastructure is codified, you can drive any changes directly through the code. Your IaC assets contain their own history; their code can record every change and alteration made since deployment. What’s more, these records are always accurate and up to date.

Driving changes through the IaC allows you to harness code as living documentation of your cloud infrastructure, one that’s more accurate and up to date than a manual record. Not only does this save time, but it also reduces the drift risk that comes with manual documentation. There’s no chance of human error leaving a change documented incorrectly (or not at all).
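
For example (a small sketch; the file path is a hypothetical), the version-control history of your IaC repository already answers “who changed what, and when” without any separate record-keeping:

```python
# A small sketch of "code as living documentation": because every change is
# driven through the IaC repository, the version-control history already
# records each alteration. The file path here is a hypothetical example.
import subprocess

def change_history(path: str = "main.tf") -> str:
    """Return the commit history touching a given IaC file."""
    result = subprocess.run(
        ["git", "log", "--oneline", "--follow", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(change_history())
```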

Does config drift make IaC cloud environments more hassle than they’re worth?

No, not even remotely. Despite config drift and other IaC concerns (such as secret exposure through code), cloud systems are still vastly superior to the setups they replaced.

IaC is essential to multiple technologies that make digital transformations and cloud adoption possible. Beyond deployment, infrastructures built and managed with code bring increased flexibility, scalability, and lower costs. By 2022 these aren’t just competitive advantages; entire economies are reliant on businesses operating at a scale only possible with them.

Config drift isn’t a reason to turn our back on IaC. It just means that coded infrastructure requires a contextual approach. The vulnerabilities of an IaC environment can’t be fixed with a simple firewall. You need to understand config drift and adapt your cybersecurity and engineering approaches to tackle the problem head-on.

Observability: the essential concept for stopping IaC config drift

What’s the key takeaway? Config drift leads to security vulnerabilities and performance problems because it creates blind spots. If your monitoring stacks can’t keep up with the speed and scale of an IaC cloud environment, they’ll soon be overwhelmed.

IaC environments are guaranteed to become larger and more complex over time. Drift is an inevitable by-product of use. Every action generates new code and changes code that already exists. Any robust security or monitoring solution for an IaC setting needs to move and adapt at the same pace. An AI/ML-powered observability and visibility platform, like Coralogix, is a vital component of any meaningful IaC solution, whether for security, performance, or both.

In almost every successful cyberattack, the exploited vulnerabilities sat outside of engineer visibility. Slowing drift and keeping the gap between your planned config and your deployed systems narrow keeps these vulnerabilities to a manageable, mitigated minimum. Prioritizing automated, AI-driven observability of your IaC that grows and changes as your systems do is the first step towards keeping them drift-free, secure, and operating smoothly.
