Chaos Engineering is one of the hottest approaches in DevOps. Netflix pioneered it back in 2008, and since then it’s been adopted by thousands of companies, from the biggest names in tech to small software shops.
In our age of highly distributed cloud-based systems, Chaos Engineering promotes resilient system architectures by applying scientific principles. In this article, I’ll explain exactly what Chaos Engineering is and how you can make it work for your team.
A good way to sum up Chaos Engineering would be with Facebook’s (sadly) defunct motto “move fast and break things”.
Chaos engineering is based on procedures called Chaos Experiments. These typically involve setting up two versions of the same system, an ‘experimental’ version and a ‘control’ version. Engineers then define a ‘steady state’, the characteristic output expected when the system is functioning normally.
When the experiment starts, both the experimental and control systems display the same steady state. Engineers then throw as many stones as possible at the experimental system in an attempt to knock it away from that state.
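To make the idea of a steady state concrete, here is a minimal Python sketch of the comparison at the heart of an experiment. Everything in it is an assumption for illustration: the metric (successful requests per second), the sample values, and the 5% tolerance are made up, and a real experiment would pull these numbers from your monitoring system.

```python
from statistics import mean


def steady_state_holds(control_samples, experimental_samples, tolerance=0.05):
    """Return True if the experimental mean stays within `tolerance`
    (as a relative deviation) of the control mean."""
    control_mean = mean(control_samples)
    experimental_mean = mean(experimental_samples)
    relative_deviation = abs(experimental_mean - control_mean) / control_mean
    return relative_deviation <= tolerance


# Hypothetical metric: successful requests per second.
control = [1000, 1010, 990, 1005]
healthy_experiment = [995, 1002, 1008, 990]
degraded_experiment = [700, 720, 680, 710]

if __name__ == "__main__":
    print(steady_state_holds(control, healthy_experiment))   # True
    print(steady_state_holds(control, degraded_experiment))  # False
```

If the function returns False after you start throwing stones, the experimental system has been knocked out of its steady state and you have found a weakness worth fixing.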
This stone-throwing could involve anything from randomly causing functions to throw exceptions to simulating the failure of a data center. Two well-known Chaos Engineering tools are Netflix’s Chaos Monkey and Chaos Kong. Chaos Monkey randomly terminates instances in Netflix’s live systems to test their resiliency, while Chaos Kong simulates the failure of an entire cloud region. If the end users complained, a Netflix engineer had their work cut out!
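The simplest kind of stone-throwing mentioned above, randomly causing functions to throw exceptions, can be sketched in a few lines of Python. This is not how Chaos Monkey works internally. The decorator, the exception class, and the `fetch_recommendations` function are all hypothetical names I've invented for illustration.

```python
import functools
import random


class InjectedFailure(Exception):
    """Raised deliberately by the chaos decorator, never by real code."""


def chaotic(failure_rate, rng=None):
    """Decorator that makes a function fail with probability `failure_rate`."""
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFailure(f"chaos injected into {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaotic(failure_rate=0.2, rng=random.Random(42))
def fetch_recommendations(user_id):
    # Stand-in for a call to a downstream service.
    return [f"title-{user_id}-{i}" for i in range(3)]


def fetch_with_fallback(user_id):
    """A resilient caller degrades gracefully instead of crashing."""
    try:
        return fetch_recommendations(user_id)
    except InjectedFailure:
        return ["popular-title-1", "popular-title-2"]


if __name__ == "__main__":
    for user in range(5):
        print(fetch_with_fallback(user))
```

The point of the exercise is the caller: if `fetch_with_fallback` didn't exist, the injected failures would surface as errors for users, which is exactly the kind of gap an experiment is meant to reveal.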
Probably the most controversial feature of Chaos Engineering is that experiments are performed on production traffic. While this is highly risky, it’s the only way to ensure real-world validity.
The key benefit of chaos engineering is that it promotes resilient system architectures, especially in cloud-based applications.
Traditional observability strategies use logs, traces and metrics to provide engineers with information about the internal state of an enterprise system. Engineers typically analyze this information and use their knowledge to interpret it.
While this approach works with older applications that are hosted on fixed servers, it’s a much bigger challenge in more modern systems. These contain hundreds or thousands of elements whose interactions are nearly impossible to analyze using traditional methods. Many also use serverless architectures, which hamper observability because there is no fixed server to inspect.
This is where Chaos Engineering can fill the gap. It doesn’t require detailed knowledge about a system’s internal vagaries, and it bypasses the obstacles that block traditional observability. That means it will work for arbitrarily complex enterprise applications.
The upshot is that when programmers incorporate chaos into their design processes, they design for resiliency. Since 2008, Netflix has been utilizing Chaos Engineering to foster resilient system architectures. Between 2010 and 2017, they only had one disruption of service due to an instance failure.
Adopting Chaos Engineering can be the best thing you ever do. Companies large and small are using it to promote resilient system design, leading to applications that rarely let their users down. Nevertheless, it’s an approach that must be used wisely. Before you jump in and turn your DevOps team into a giant Chaos Experiment, you need to consider the following issues carefully.
Chaos Engineering is ideal for highly distributed systems with thousands of instances. Not everyone has embraced the cloud yet. Many companies are still wrestling with conventional server architectures and the perils of vertical scaling.
In this context, traditional observability strategies still have plenty of mileage, especially when you throw data-driven approaches and machine learning into the mix. There’s no excuse for not squeezing every drop out of the existing approaches. Tools such as Grafana and the ELK stack can kickstart your DevOps practice when used well.
That said, the cloud is fast becoming the arena of choice for application deployment and may well become the norm for the majority of businesses. When you’re ready to move into the cloud, Chaos Engineering will be there waiting.
Most DevOps teams (I hope!) are familiar with the benefits that automation can provide through CI/CD and build pipelines. In Chaos Engineering, automation is essential. Due to the complexity of modern software, there is no way to know in advance which experiments are likely to yield positive results.
Chaos experiments need to be able to test the entire space of possible failures for a given system. It’s not possible to do this manually, so a DevOps team that is serious about Chaos Engineering needs to bake a high level of automation into its experimental procedures.
If you’re looking to create automated experiments, Chaos Toolkit is a good place to start. Created in 2017, it’s an open source project that provides a suite of tools for automating chaos experiments. Written in Python, Chaos Toolkit allows you to run experiments through its own CLI. The Chaos Toolkit is declarative: experiments are encoded in easy-to-edit JSON files that you can customize as circumstances require.
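To give a feel for what a declarative experiment looks like, here is a hedged Python sketch that builds one as a dictionary and serializes it to JSON. The field names follow the Chaos Toolkit experiment format as I understand it, so check them against the official documentation before relying on them. The health URL, the `systemctl` commands, and the `checkout-worker@1` service name are placeholders I've made up.

```python
import json

# Hypothetical experiment: every URL, command and service name is a placeholder.
experiment = {
    "version": "1.0.0",
    "title": "Checkout stays available when one worker is stopped",
    "description": "Illustrative only; adapt the probe and actions to your system.",
    "steady-state-hypothesis": {
        "title": "Health endpoint responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "checkout-health",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-one-worker",
            "provider": {
                "type": "process",
                "path": "systemctl",
                "arguments": "stop checkout-worker@1",
            },
            "pauses": {"after": 5},  # seconds to wait before re-checking
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-worker",
            "provider": {
                "type": "process",
                "path": "systemctl",
                "arguments": "start checkout-worker@1",
            },
        }
    ],
}


def to_json():
    """Serialize the experiment as indented JSON text."""
    return json.dumps(experiment, indent=2)


if __name__ == "__main__":
    with open("experiment.json", "w", encoding="utf-8") as handle:
        handle.write(to_json())
```

With the toolkit installed, a file like this would typically be executed with `chaos run experiment.json`. The steady-state hypothesis is checked before and after the method runs, which maps neatly onto the experiment structure described earlier.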
Chaos Engineering is not for the faint-hearted. When you run chaos experiments you’re intentionally sabotaging live systems. You’re gambling with real user experiences. In the worst case, you could cause an outage for your entire user base.
That’s why engineers carrying them out need to work in a way that produces as little collateral damage as possible. Chaos engineering suggests carrying out a series of experiments, each with a larger “blast radius” than the last.
The least risky experiment to try involves testing the client-device functionality by injecting failures into a small number of devices and user-flows. This will reveal any glaring faults in functionality and won’t harm any users. That said, it also won’t shed light on any deeper issues.
If the previous experiment didn’t reveal any problems, a small-scale diffuse experiment is the next thing to try. It’s small-scale because it impacts a small fraction of production traffic. It’s diffuse because that production traffic ends up evenly routed through your production servers. By filtering your success metrics for the affected traffic, you can assess the results of your experiment.
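One common way to pick a small fraction of traffic and then filter your metrics for it is to hash each user ID into a bucket. The Python sketch below shows the idea. It assumes requests are available as dictionaries with `user_id` and `ok` fields, which is a made-up shape for illustration, and it says nothing about how your load balancer actually routes the traffic.

```python
import hashlib


def in_experiment(user_id, fraction):
    """Deterministically place roughly `fraction` of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(fraction * 2**64)


def success_rate(requests, fraction, affected):
    """Success rate for the affected (or unaffected) slice of traffic.

    Returns None when the slice contains no requests.
    """
    selected = [
        r for r in requests if in_experiment(r["user_id"], fraction) == affected
    ]
    if not selected:
        return None
    return sum(1 for r in selected if r["ok"]) / len(selected)


if __name__ == "__main__":
    # Hypothetical request log; in practice this comes from your metrics store.
    log = [{"user_id": f"user-{i}", "ok": i % 50 != 0} for i in range(10_000)]
    print("affected:  ", success_rate(log, 0.01, affected=True))
    print("unaffected:", success_rate(log, 0.01, affected=False))
```

Because the assignment is a pure function of the user ID, the same users stay in the experiment for its whole duration, and comparing the two success rates tells you whether the injected failure moved the needle.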
If the small-scale diffuse experiment passes without a hitch, you can try a small-scale concentrated experiment. Like the previous experiment, this involves a small subset of your traffic. However, instead of allowing this traffic to be ‘diffused’ throughout your servers, you funnel the subject traffic into a set of boxes.
This is where things get a bit hairy. The high traffic concentration puts plenty of load on the boxes, allowing hidden resource constraints to surface. There may also be increased latency. Any problems revealed by this experiment will affect the end users involved in it but should not spread to the rest of your system.
The final option may have a perverse appeal to Die Hard 4.0 fans. In case you haven’t seen it, a group of hackers manages to shut down the entire internet and throw America into chaos. This experiment takes a similar gamble with your system. It involves testing a large fraction of your production traffic while letting it diffuse through the production servers.
There are two reasons you shouldn’t try this at home. First, any problems revealed by your experiment will affect a significant percentage of your user base, who won’t be very happy about the service disruption they experience. Second, because the experiment is diffuse, it has the potential to cause an outage for your entire user base, particularly if it affects global resource constraints.
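The progression described above, from client-device tests through to a large-scale diffuse experiment, can be sketched as a simple escalation loop that only moves on when the previous stage passes. The traffic fractions below are my own illustrative assumptions, not recommended values, and `run_stage` is a hypothetical callback standing in for whatever actually runs your experiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_fraction: float  # share of production traffic involved (assumed)
    concentrated: bool       # True if traffic is funnelled into a few boxes


# Ordered from smallest to largest blast radius.
STAGES: List[Stage] = [
    Stage("client-device failure injection", 0.0, False),
    Stage("small-scale diffuse", 0.01, False),
    Stage("small-scale concentrated", 0.01, True),
    Stage("large-scale diffuse", 0.25, False),
]


def escalate(stages: List[Stage],
             run_stage: Callable[[Stage], bool]) -> Optional[Stage]:
    """Run stages in order; return the first stage that fails, or None."""
    for stage in stages:
        if not run_stage(stage):
            return stage  # stop here: never widen the blast radius after a failure
    return None


if __name__ == "__main__":
    # Pretend the system copes with everything except concentrated load.
    failed = escalate(STAGES, lambda stage: not stage.concentrated)
    print("first failing stage:", failed.name if failed else "none")
```

The design choice worth copying is the early return: a failure at a small blast radius is a cheap lesson, so there is no reason to go looking for the same lesson at a larger one.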
Because chaos experiments have the potential to cause large amounts of collateral damage, they need to incorporate a kill switch in case things go sideways. Moreover, I’d highly recommend building in automated termination procedures, particularly given that your experiments are likely to be automated and running continuously.
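Here is a minimal Python sketch of a kill switch combined with automated termination. It assumes you can sample an error rate while the experiment runs. The 5% threshold and the `inject` and `rollback` callbacks are hypothetical stand-ins for your own tooling.

```python
import threading


class KillSwitch:
    """Trips manually via trip() or automatically when the error rate is too high."""

    def __init__(self, max_error_rate):
        self.max_error_rate = max_error_rate
        self._stop = threading.Event()  # safe to set from another thread

    def trip(self):
        """Manual override: an engineer can stop the experiment at any time."""
        self._stop.set()

    def tripped(self):
        return self._stop.is_set()

    def check(self, error_rate):
        """Trip automatically if the observed error rate exceeds the limit."""
        if error_rate > self.max_error_rate:
            self._stop.set()
        return self._stop.is_set()


def run_experiment(error_rates, kill_switch, inject, rollback):
    """Inject one fault per healthy sample; always roll back at the end.

    Returns the number of faults injected before stopping.
    """
    injected = 0
    try:
        for rate in error_rates:
            if kill_switch.check(rate):
                break
            inject()
            injected += 1
    finally:
        rollback()  # runs whether the loop finished, broke out, or raised
    return injected


if __name__ == "__main__":
    switch = KillSwitch(max_error_rate=0.05)
    observed = [0.01, 0.02, 0.20, 0.01]  # hypothetical error-rate samples
    count = run_experiment(observed, switch,
                           inject=lambda: print("injecting fault"),
                           rollback=lambda: print("rolling back"))
    print("faults injected:", count, "| tripped:", switch.tripped())
```

Putting the rollback in a `finally` block means the system is restored even if the experiment code itself crashes, which matters when experiments run unattended.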
At first glance, chaos is the antithesis of the intricate webs of logic that are software’s life blood. Yet in today’s landscape of massively distributed cloud-based systems, it just isn’t possible to design out failure. Instead, engineers need to cope with failure by designing resiliency into their systems. Chaos engineering is an approach explicitly designed to roll with failure and promote resilient system design.
Chaos engineering’s tolerance of failure makes it a dangerous game. But it’s a game more and more programmers are learning to play. And win.