How Netflix Uses Fault Injection To Truly Understand Their Resilience

Distributed systems such as microservices have defined software engineering over the last decade. Most of the advances have focused on increasing resilience, flexibility, and deployment speed at ever larger scales.

For streaming giant Netflix, the migration to a complex, cloud-based microservices architecture would not have been possible without a revolutionary testing method known as fault injection.

With tools like Chaos Monkey, Netflix employs a cutting-edge testing toolkit. Chaos engineering is quickly becoming a staple of many site reliability engineering (SRE) strategies, and when you look at Netflix’s capabilities and their understanding of their own systems (and the faults those systems can and can’t tolerate), it is easy to see why.

What Is Fault Injection – Chaos Testing And Experimentation

The goal of fault injection is to increase system resiliency by deliberately introducing faults into your architecture. By experimenting with failure scenarios that are likely to be encountered, but in a controlled environment, it becomes clear where effort needs to be focused to ensure an adequate quality of service. Most importantly, it becomes clear before the fact. It is better to find out, for example, that your new API gateway can’t handle traffic spikes in pre-production rather than in production.

Fault injection is one way Netflix has maintained an almost uninterrupted service during sharp spikes in usage. By stress testing their components against traffic spikes and other simulated faults, Netflix ensured their systems could take the strain without a service-wide outage.

Confidence vs Complexity

Adopting distributed systems has led to an increase in the complexity of the systems we work with. It is difficult to have full confidence in our microservices architecture when there are many components that could potentially fail.

Even the best QA testing can’t predict every real world deployment scenario. Even when every service is functioning properly, unpredictable outcomes still occur. Interactions between healthy services can be disrupted in unimagined ways by real world events and user input.

Netflix has mitigated this using chaos engineering and fault injection. By deliberately injecting failure into their ecosystem, Netflix is able to test the resilience of their systems and learn from where they fail. Using a suite of tools creatively dubbed the Simian Army, Netflix keeps its complex, highly scalable microservice architecture resilient.

A Tolerant Approach

Netflix’s approach to their infrastructure when migrating to the cloud was pioneering. For Netflix, a microservices cloud architecture (in AWS) had to be centered around fault tolerance: 100% uptime is an impossible expectation for any single component, so Netflix wanted an architecture in which a single component could fail without affecting the whole system.

Designing a fault-tolerant architecture was a challenge. However, system design was only one part of Netflix’s strategy. It was implementing tools and processes to test these systems’ efficacy in worst-case scenarios that gave Netflix the edge. Their highly resilient, fault-tolerant system would not exist today without them.

Key Components

Netflix has several components within their architecture which they consider key to their system resilience.

Preventing Cascading Failure

Hystrix, for example, is a latency and fault tolerance library. Netflix utilizes it to isolate points of access to remote systems and services. In their microservices architecture Hystrix enables Netflix to stop cascading failure, keeping their complex distributed system resilient by ensuring errors remain isolated.
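
Hystrix itself is a Java library, but the circuit-breaker pattern it popularized is easy to illustrate. The following is a minimal Python sketch, not Netflix’s implementation: repeated failures of a remote call trip the breaker, and callers get a fallback instead of piling more load onto a struggling dependency. The service and fallback names are hypothetical.

import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after repeated failures,
    then allows a trial call once a cool-down period has passed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, remote_call, fallback):
        # While the circuit is open, short-circuit straight to the fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = remote_call()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result

# Hypothetical usage: wrap a flaky recommendations call and degrade gracefully.
# breaker = CircuitBreaker()
# titles = breaker.call(fetch_recommendations, lambda: POPULAR_TITLES)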

Open Source, Scalable Memcache Caching

Then there’s EVCache, an extremely scalable, memcached-based caching solution. Developed internally by Netflix, EVCache has played a critical role in tracing requests during fault injection testing.
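
EVCache itself is a Java client for memcached clusters running on AWS, so the snippet below is only a generic Python sketch of the look-aside caching pattern such a layer provides, written with the pymemcache library. The host, key format, and expiry are illustrative assumptions.

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # assumed local memcached, for illustration only

def get_profile(user_id, load_from_db):
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: skip the slower backing store
    value = load_from_db(user_id)  # value must serialize to str/bytes for memcached
    cache.set(key, value, expire=300)  # keep it for five minutes
    return value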

There are of course many more, and the list of resilience-building components in the Netflix stack is always growing. Most are open source and developed in-house by Netflix. However, when it comes to chaos engineering, fault injection, and using chaos testing to validate systems, it is the Simian Army project which shines through as Netflix’s key achievement.

The Chaos Monkey

While tools like Hystrix and EVCache improve resilience by enabling visibility and traceability during failure scenarios, it is the Simian Army toolbox that Netflix relies on to carry out fault injection testing at scale. The first tool in the box, Chaos Monkey, embodies Netflix’s approach to chaos engineering and fault injection as a testing method.

Monitored Disruption

Chaos Monkey randomly disables production instances. This can occur at any time of day, although Netflix does ensure that the environment is carefully monitored, with engineers at the ready to fix any issues the testing may cause beyond the intended scope of the test. To the uninitiated this may seem like a high risk, but the rewards justify the process.
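
Chaos Monkey itself is an open-source service with its own scheduling and opt-in configuration, but its core behavior is simple to sketch. Here is a rough, hypothetical Python version using boto3: pick one running instance from an opted-in group and terminate it. The tag name and region are assumptions for illustration.

import random

import boto3

def terminate_random_instance(region="us-east-1", opt_in_tag="chaos-opt-in"):
    """Terminate one randomly chosen running instance that opted in to chaos testing."""
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{opt_in_tag}", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None  # nothing opted in; do no harm
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim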

Automating Recovery

Experiments with Chaos Monkey enable engineers at Netflix to build recovery mechanisms into their architecture. This means that when the tested scenario occurs in real life, components in the system can rely on automated processes to resume operation as quickly as possible.

Critical insight gained using fault injection allows engineers to accurately anticipate how components in the complex architecture will respond during genuine failure, and build their systems to be resilient to the impacts of these events.
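
One of the simplest automated recovery mechanisms this kind of testing encourages is retrying a failed dependency call with backoff rather than waiting for a human. The sketch below is a generic Python illustration of that idea, not a Netflix component; the retry count and delays are arbitrary assumptions.

import time

def call_with_retries(operation, attempts=4, base_delay=0.5):
    """Retry a failing operation with exponential backoff before giving up."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # automated recovery failed; surface the error
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying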

The Evolution Of Chaos

The success of Chaos Monkey in increasing the fault tolerance of Netflix systems quickly led to further development of the fault injection testing method. Soon tools like Chaos Gorilla were implemented. Chaos Gorilla works in a similar fashion to Chaos Monkey; however, rather than generating outages of individual instances, it takes out an entire Amazon availability zone.

Large Scale Service Verification

Netflix aimed to verify that there was no user-visible impact if an entire availability zone went dark. The only way to ensure services rebalance automatically, without manual intervention, was to deliberately instigate that scenario.

From Production To Server

Server-side fault injection was also implemented. Latency Monkey was developed to simulate service degradation in the RESTful communication layer. It does this by deliberately creating delays and failures on the service side of a request. Latency Monkey then measures whether upstream services respond as anticipated.

Latency Monkey provides Netflix with insight into the behavior of their calling applications. How a calling application will respond to a dependency slowdown is no longer an unknown. Using Latency Monkey, Netflix’s engineers can build their systems to withstand network congestion and thread pile-up.
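
Latency Monkey lives inside Netflix’s own service stack, so the snippet below is only an illustrative Python sketch of the same idea: a piece of WSGI middleware that delays a fraction of requests and fails another fraction, so you can watch how upstream callers cope. The probabilities and delay are assumptions.

import random
import time

class LatencyInjectionMiddleware:
    """Delay or fail a configurable fraction of requests to simulate a degraded dependency."""

    def __init__(self, app, delay_rate=0.10, delay_seconds=2.0, error_rate=0.02):
        self.app = app
        self.delay_rate = delay_rate
        self.delay_seconds = delay_seconds
        self.error_rate = error_rate

    def __call__(self, environ, start_response):
        roll = random.random()
        if roll < self.error_rate:
            # Simulate a hard failure on the service side of the request.
            start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
            return [b"injected failure"]
        if roll < self.error_rate + self.delay_rate:
            time.sleep(self.delay_seconds)  # simulate a slow downstream response
        return self.app(environ, start_response)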

Given the nature of Netflix’s key product (a streaming platform), server-side resilience is integral to delivering a consistent quality of service.

Fault Injection As A Platform

Netflix operates an incredibly large and complex distributed system. Their cloud-based AWS microservices architecture contains an ever-increasing number of individual components. Not only does this mean there is always a need to test and reinforce system resiliency, it also means that doing so at a wide enough scale to have a meaningful impact on the ecosystem becomes ever more difficult.

To address this, Netflix developed the FIT (Failure Injection Testing) platform, their solution to the challenge of propagating deliberately induced faults across the architecture consistently. FIT has taken automated fault injection at Netflix from a tool used in isolated tests to a commonplace practice. Using FIT, Netflix can run chaos exercises across large chunks of the system, and engineers can access tests as a self-service.

Netflix found that deliberately introducing faults into their system was not without risk or impact. The FIT platform limits the impact of fault and failure testing on adjacent components and the wider system, while still allowing serious faults, paralleling those encountered during actual runtime, to be introduced.
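
Netflix has described FIT as attaching failure metadata to requests at the edge and letting injection points along the call path consult that context. The sketch below is a loose Python illustration of that idea, not Netflix’s actual mechanism: the header name and context format are invented for the example. The payoff is that only requests carrying a fault plan are affected, which is what keeps the blast radius contained.

import json

FAILURE_HEADER = "X-Fault-Context"  # hypothetical header carrying the fault plan

class InjectedFailure(Exception):
    pass

def maybe_inject(headers, injection_point):
    """Call at each dependency boundary; only requests selected for this
    injection point fail, so the rest of the traffic is untouched."""
    raw = headers.get(FAILURE_HEADER)
    if not raw:
        return
    fault_plan = json.loads(raw)  # e.g. {"fail": ["ratings-service"]}
    if injection_point in fault_plan.get("fail", []):
        raise InjectedFailure(f"fault injected at {injection_point}")

# Hypothetical usage inside a service handler:
# maybe_inject(request.headers, "ratings-service")
# ratings = ratings_client.get(user_id)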

Confident Engineers Build Better Systems

Using chaos engineering principles and techniques, Netflix’s engineers have built a highly complex, distributed cloud architecture they can be confident in.

With every component expected to fail at some point, failure is no longer an unknown. It is the uncertainty of failure, more than anything, which fills engineers with dread when it comes to the resilience of their systems.

Netflix’s systems have proven able to withstand all but the most catastrophic failures, and Netflix’s engineers can be highly confident in their resilience. This means they will be more creative in their output, more open to experimentation, and will spend less of their energy worrying about impending service-wide failure.

The benefits of Netflix’s approach are numerous, and their business success is testament to its effectiveness. Using chaos engineering and fault injection, Netflix maintains an application-based service built on a microservice architecture that is complex, scalable, and at the same time resilient.

What is Chaos Engineering and How to Implement It

Chaos Engineering is one of the hottest new approaches in DevOps. Netflix pioneered it back in 2008, and since then it’s been adopted by thousands of companies, from the biggest names in tech to small software shops.

In our age of highly distributed cloud-based systems, Chaos Engineering promotes resilient system architectures by applying scientific principles.  In this article, I’ll explain exactly what Chaos Engineering is and how you can make it work for your team. 

What is Chaos Engineering?

A good way to sum up Chaos Engineering would be with Facebook’s (sadly) defunct motto “move fast and break things”. 

Chaos engineering is based on procedures called Chaos Experiments. These typically involve setting up two versions of the same system, an ‘experimental’ version and a ‘control’ version. Engineers then define a ‘steady state’, the characteristic output expected when the system is functioning normally.

When the experiment starts, both the experimental and control systems display the same steady state. Engineers then throw as many stones as possible at the experimental system in an attempt to knock it away from that state.

This stone-throwing could involve anything from randomly causing functions to throw exceptions to simulating the failure of an entire data center. Two famous Chaos Engineering tools are Netflix’s Chaos Monkey and Chaos Kong, which were programmed to randomly turn off instances in Netflix’s live systems to test their resiliency. If end users complained, a Netflix engineer had their work cut out!

Probably the most controversial feature of Chaos Engineering is that experiments are performed on production traffic. While this is highly risky, it’s the only way to ensure real-world validity.
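
The comparison of control and experimental groups boils down to checking whether the steady-state metric stays within a tolerance while faults are being injected. Here is a minimal, generic Python sketch of that check; the metric, the 5% tolerance, and the helper names are illustrative assumptions rather than any particular tool’s API.

def steady_state_holds(control_metric, experiment_metric, tolerance=0.05):
    """Return True if the experimental group stays within `tolerance`
    of the control group's steady-state output."""
    return abs(control_metric - experiment_metric) <= tolerance

# Hypothetical usage with a success-rate metric sampled from both groups:
# if not steady_state_holds(measure(control_group), measure(experiment_group)):
#     abort_experiment()  # the injected faults knocked the system off its steady state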

Why Choose Chaos Engineering?

The key benefit of chaos engineering is that it promotes resilient system architectures, especially in cloud-based applications. 

Traditional observability strategies use logs, traces and metrics to provide engineers with information about the internal state of an enterprise system. Engineers typically analyze this information and use their knowledge to interpret it.

While this approach works with older applications hosted on fixed servers, it’s a much bigger challenge in more modern systems. These contain hundreds or thousands of elements whose interactions are nearly impossible to analyze using traditional methods. They also often use serverless architectures, which hamper observability because there is no fixed server to inspect.

This is where Chaos Engineering can fill the gap. It doesn’t require detailed knowledge of a system’s inner workings, and it bypasses the obstacles that block traditional observability. That means it will work for arbitrarily complex enterprise applications.

The upshot is that when programmers incorporate chaos into their design processes, they design for resiliency. Netflix has been using Chaos Engineering to foster resilient system architectures since 2008, and between 2010 and 2017 they had only one service disruption caused by an instance failure.

Making Chaos Engineering Work for You

Adopting Chaos Engineering can be the best thing you ever do. Companies large and small are using it to promote resilient system design, leading to applications that never disappoint their users. Nevertheless, it’s an approach that must be used wisely.  Before you jump in and turn your DevOps team into a giant Chaos Experiment, you need to consider the following issues carefully.

Is Chaos Right for You?

Chaos Engineering is ideal for highly distributed systems with thousands of instances. Not everyone has embraced the cloud yet. Many companies are still wrestling with conventional server architectures and the perils of vertical scaling.

In this context, traditional observability strategies still have plenty of mileage, especially when you throw data-driven approaches and machine learning into the mix. There’s no excuse for not squeezing every drop out of the existing approaches. Tools such as Grafana and the ELK stack, combined with machine learning, can kickstart your DevOps practice when used well.

That said, the cloud is fast becoming the arena of choice for application deployment and may well become the norm for the majority of businesses. When you’re ready to move into the cloud, Chaos Engineering will be there waiting.

Automate Experiments

Most DevOps teams (I hope!) are familiar with the benefits automation can provide through CI/CD and build pipelines. In Chaos Engineering, automation is essential: because modern software is so complex, there is no way to know in advance which experiments are likely to yield useful results.

Chaos experiments need to be able to test the entire space of possible failures for a given system.  It’s not possible to do this manually so a DevOps team that is serious about Chaos Engineering needs to bake a high level of automation into its experimental procedures.

If you’re looking to create automated experiments, Chaos Toolkit is an ideal resource. Created in 2017, it’s an open source project that provides a suite of tools for automating chaos experiments. Written in Python, Chaos Toolkit lets you run experiments through its own CLI. It is declarative: experiments are encoded in easy-to-edit JSON files that you can customize as circumstances require.
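
As a rough illustration, here is what a small Chaos Toolkit experiment might look like, written as a Python dict and saved to the JSON file the chaos CLI consumes. Treat it as a hedged sketch: the URL, container name, and commands describe an imaginary system, so check the field names against the current Chaos Toolkit documentation before relying on them.

import json

experiment = {
    "version": "1.0.0",
    "title": "Service stays responsive when one instance dies",
    "description": "Kill one instance and verify the steady state still holds.",
    "steady-state-hypothesis": {
        "title": "The API answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "api-responds",
                "tolerance": 200,
                "provider": {"type": "http", "url": "http://localhost:8080/health"},
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-one-instance",
            "provider": {"type": "process", "path": "docker", "arguments": "stop api-instance-1"},
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-instance",
            "provider": {"type": "process", "path": "docker", "arguments": "start api-instance-1"},
        }
    ],
}

# Write the declarative experiment file the CLI expects.
with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)

# Then run it with the Chaos Toolkit CLI:
#   chaos run experiment.json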

Playing with Chaos

Chaos Engineering is not for the faint-hearted. When you run chaos experiments you’re intentionally sabotaging live systems. You’re gambling with real user experiences. In the worst case, you could cause an outage for your entire user base.

That’s why engineers carrying them out need to work in a way that produces as little collateral damage as possible. Chaos engineering suggests carrying out a series of experiments with successively increasing “blast radius.”

The least risky experiment to try involves testing the client-device functionality by injecting failures into a small number of devices and user-flows.  This will reveal any glaring faults in functionality and won’t harm any users. That said, it also won’t shed light on any deeper issues.

Upping the Ante

If the previous experiment didn’t reveal any problems, a small-scale diffuse experiment is the next thing to try. It’s small-scale because it impacts only a small fraction of production traffic, and diffuse because that traffic is routed evenly through your production servers. By filtering your success metrics for the affected traffic, you can assess the results of your experiment.

If the small-scale diffuse experiment passes without a hitch, you can try a small-scale concentrated experiment. Like the previous experiment, this involves a small subset of your traffic. However, instead of allowing this traffic to be ‘diffused’ throughout your servers, you funnel the subject traffic into a set of boxes.

This is where things get a bit hairy. The high traffic concentration puts plenty of load on those boxes, allowing hidden resource constraints to surface. There may also be increased latency. Any problems revealed by this experiment will affect the end users involved in it, but the impact remains localized and will not take down the wider system.

Going Nuclear

The final option may have a perverse appeal to Die Hard 4.0 fans. In case you haven’t seen it, a group of hackers manages to shut down the entire internet and throw America into chaos. This experiment takes a similar gamble with your system. It involves testing a large fraction of your production traffic while letting it diffuse through the production servers.

There are two reasons you shouldn’t try this at home. First, any problems revealed by your experiment will affect a significant percentage of your user base, who won’t be very happy about the service disruption they experience. Second, because the experiment is diffuse, it has the potential to cause an outage for your entire user base, particularly if it affects global resource constraints.

Because chaos experiments have the potential to cause large amounts of collateral damage, they need to incorporate a kill switch in case things go sideways. Moreover, I’d highly recommend building in automated termination procedures, particularly given that your experiments are likely to be automated and running continuously.
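
A minimal automated termination loop might look like the Python sketch below: poll a key health metric while the experiment runs and abort as soon as it breaches a threshold. The metric source, threshold, and start/stop hooks are illustrative assumptions, not any specific tool’s interface.

import time

def run_with_kill_switch(start_experiment, stop_experiment, read_error_rate,
                         max_error_rate=0.05, check_interval=10, max_duration=600):
    """Run a chaos experiment, aborting automatically if the error rate spikes."""
    start_experiment()
    started = time.monotonic()
    try:
        while time.monotonic() - started < max_duration:
            if read_error_rate() > max_error_rate:
                break  # blast radius exceeded: terminate immediately
            time.sleep(check_interval)
    finally:
        stop_experiment()  # always roll the injected faults back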

Fostering Resiliency

At first glance, chaos is the antithesis of the intricate webs of logic that are software’s lifeblood. Yet in today’s landscape of massively distributed cloud-based systems, it just isn’t possible to design out failure. Instead, engineers need to cope with failure by designing resiliency into their systems. Chaos engineering is an approach explicitly designed to roll with failure and promote resilient system design.

Chaos engineering’s tolerance of failure makes it a dangerous game. But it’s a game more and more programmers are learning to play. And win.