How Netflix Uses Fault Injection To Truly Understand Their Resilience

Thomas Russell
April 6, 2021

Distributed systems such as microservices have defined software engineering over the last decade. Most of the advances have come in resilience, flexibility, and speed of deployment at ever larger scales.

For streaming giant Netflix, the migration to a complex cloud-based microservices architecture would not have been possible without a revolutionary testing method known as fault injection.

With tools like Chaos Monkey, Netflix employs a cutting-edge testing toolkit. Chaos engineering is quickly becoming a staple of many site reliability engineering (SRE) strategies. When looking at Netflix’s capabilities and understanding of their own systems (and the faults those systems can and can’t tolerate), it is easy to see why.

What Is Fault Injection – Chaos Testing And Experimentation

The goal of fault injection is to increase system resiliency by deliberately introducing faults into your architecture. By experimenting with failure scenarios that are likely to be encountered, but in a controlled environment, it becomes clear where efforts need to be focused to ensure adequate quality of service. Most importantly, it becomes clear before the fact. It is better to find out, for example, that your new API gateway can’t handle traffic spikes in pre-production rather than in production.
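To make the idea concrete, here is a minimal sketch in Python of injecting faults into a single dependency call and checking that the caller degrades gracefully. Everything in it (`with_fault_injection`, `fetch_recommendations`, `homepage`, the 30% failure rate) is a hypothetical illustration, not code from Netflix.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    # Hypothetical dependency; a real one would make a network call.
    return [f"title-{user_id}-{i}" for i in range(3)]


def homepage(user_id, fetch):
    """Caller under test: falls back to a static list if the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return ["popular-1", "popular-2"]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the experiment is repeatable
    flaky_fetch = with_fault_injection(fetch_recommendations, 0.3, rng)
    results = [homepage(7, flaky_fetch) for _ in range(1000)]
    fallbacks = sum(1 for r in results if r[0].startswith("popular"))
    print(f"{fallbacks} of {len(results)} requests served from the fallback")
```

The point of the experiment is the caller, not the wrapper: if `homepage` had no fallback, the injected faults would surface as errors in a controlled run rather than during a real outage.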

Fault injection is one way Netflix maintained an almost uninterrupted service during sharp spikes in usage. By stress testing their components against traffic spikes and other faults using fault injection, Netflix’s systems were able to take the strain without a service-wide outage.

Confidence vs Complexity

Adopting distributed systems has led to an increase in the complexity of the systems we work with. It is difficult to have full confidence in our microservices architecture when there are many components that could potentially fail.

Even the best QA testing can’t predict every real-world deployment scenario. Even when every service is functioning properly, unpredictable outcomes still occur. Interactions between healthy services can be disrupted in unimagined ways by real-world events and user input.

Netflix has mitigated this using chaos engineering and fault injection. By deliberately injecting failure into their ecosystem, Netflix is able to test the resilience of their systems and learn from where they fail. Using a suite of tools creatively dubbed The Simian Army, Netflix maintains an application-based service built on a microservice architecture that is complex, scalable, and at the same time resilient.

A Tolerant Approach

Netflix’s approach to their infrastructure when migrating to the cloud was pioneering. For Netflix, a microservices cloud architecture (in AWS) had to be centered around fault tolerance. 100% uptime is an impossible expectation for any component. Netflix wanted an architecture wherein a single component could fail without affecting the whole system.

Designing a fault-tolerant architecture was a challenge. However, system design was only one part of Netflix’s strategy. What gave Netflix the edge was implementing tools and processes to test how well these systems hold up in worst-case scenarios. Their highly resilient, fault-tolerant system would not exist today without them.

Key Components

Netflix has several components within their architecture which they consider key to their system resilience.

Preventing Cascading Failure

Hystrix, for example, is a latency and fault tolerance library. Netflix utilizes it to isolate points of access to remote systems and services. In their microservices architecture Hystrix enables Netflix to stop cascading failure, keeping their complex distributed system resilient by ensuring errors remain isolated.
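The pattern usually associated with this kind of isolation is the circuit breaker. The sketch below shows the general idea in Python; it is not Hystrix’s API (Hystrix is a Java library), and the class name, thresholds, and injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is isolated")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def failing_service():
    # Hypothetical remote call that is currently down.
    raise ConnectionError("remote service unavailable")


if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, reset_after=5.0)
    for attempt in range(4):
        try:
            breaker.call(failing_service)
        except CircuitOpenError:
            print(attempt, "short-circuited")
        except ConnectionError:
            print(attempt, "real failure")
```

In this run the first two attempts reach the failing service and the next two are short-circuited. Once the circuit is open, callers fail fast instead of queuing behind a dead dependency, which is what keeps one failure from cascading.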

Open Source, Scalable Memcache Caching

Then there’s EVCache. EVCache is an extremely scalable caching solution built on memcached. Developed internally by Netflix, EVCache has played a critical role in tracing requests during fault injection testing.

There are of course many more, and the list of resilience-building components in the Netflix stack is always growing. Most are open source and developed in-house by Netflix themselves. However, when it comes to chaos engineering, fault injection, and utilizing chaos testing to validate systems, it is the Simian Army project which shines through as Netflix’s key achievement.

The Chaos Monkey

While tools like Hystrix and EVCache improve resilience by enabling visibility and traceability during failure scenarios, it is the Simian Army toolbox Netflix relies on to carry out fault injection testing at scale. The first tool in the box, Chaos Monkey, embodies Netflix’s approach to chaos engineering and fault injection as a testing method.

Monitored Disruption

Chaos Monkey randomly disables production instances. This can occur at any time of day, although Netflix does ensure that the environment is carefully monitored. Engineers will be at the ready to fix any issues the testing may cause outside of the test environment. To the uninitiated this may seem like a high-risk practice, but the rewards justify the process.
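The selection step can be pictured with a short Python sketch. This is not Chaos Monkey’s implementation: the `InstanceGroup` type, the opt-out flag, the `probability` parameter, and the `terminate` stand-in are all assumptions for illustration, and no real cloud API is called.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstanceGroup:
    name: str
    instances: list = field(default_factory=list)
    opted_out: bool = False  # hypothetical flag for groups excluded from testing


def pick_victims(groups, rng, probability=1.0):
    """Choose at most one random instance per eligible group."""
    victims = []
    for group in groups:
        if group.opted_out or not group.instances:
            continue
        if rng.random() < probability:
            victims.append((group.name, rng.choice(group.instances)))
    return victims


def terminate(group, instance_id):
    # Stand-in for a cloud provider call; here it only updates local state.
    group.instances.remove(instance_id)


if __name__ == "__main__":
    rng = random.Random()
    groups = [
        InstanceGroup("api", ["i-001", "i-002", "i-003"]),
        InstanceGroup("billing", ["i-101"], opted_out=True),
    ]
    by_name = {g.name: g for g in groups}
    for group_name, instance_id in pick_victims(groups, rng):
        print(f"terminating {instance_id} in {group_name}")
        terminate(by_name[group_name], instance_id)
```

Limiting the damage to one instance per group per run is a design choice in this sketch: it keeps each experiment small enough that monitoring can attribute any impact to a single termination.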

Automating Recovery

Experiments with Chaos Monkey enable engineers at Netflix to build recovery mechanisms into their architecture. This means that when the tested scenario occurs in real life, components in the system can rely on automated processes to resume operation as quickly as possible.

Critical insight gained using fault injection allows engineers to accurately anticipate how components in the complex architecture will respond during genuine failure, and build their systems to be resilient to the impacts of these events.

The Evolution Of Chaos

The success of Chaos Monkey in increasing the fault tolerance of Netflix systems quickly led to further development of the fault injection testing method. Soon tools like Chaos Gorilla were implemented. Chaos Gorilla works in a similar fashion to Chaos Monkey; however, rather than generating outages of individual instances, Chaos Gorilla takes out an entire Amazon availability zone.

Large Scale Service Verification

Netflix aimed to verify there was no user-visible impact if a functional availability zone went dark. The only way to ensure services re-balance automatically without manual interference was to deliberately instigate that scenario.
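Before running such an exercise, it helps to see the capacity arithmetic behind it: if one zone disappears, can the remaining zones carry peak load? The Python sketch below works that out for made-up numbers. The zone names, capacities, and the `zones_at_risk` helper are hypothetical, and a real exercise verifies this against live traffic rather than on paper.

```python
def zones_at_risk(capacity_by_zone, peak_load):
    """Return the zones whose loss would leave less capacity than peak_load."""
    total = sum(capacity_by_zone.values())
    return [
        zone
        for zone, capacity in capacity_by_zone.items()
        if total - capacity < peak_load
    ]


if __name__ == "__main__":
    # Requests per second each zone can serve (illustrative figures only).
    capacity = {"zone-a": 400, "zone-b": 400, "zone-c": 400}

    # Losing any one zone leaves 800 rps, so a peak of 800 is survivable...
    print(zones_at_risk(capacity, peak_load=800))  # []
    # ...but a peak of 900 is not, whichever zone goes dark.
    print(zones_at_risk(capacity, peak_load=900))  # ['zone-a', 'zone-b', 'zone-c']
```

The calculation only shows whether headroom exists. Whether traffic actually re-balances without manual intervention is what the deliberate outage is there to prove.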

From Production To Server

Server-side fault injection was also implemented. Latency Monkey was developed, which simulates service degradation in the RESTful communication layer. It does this by deliberately creating delays and failures on the service side of a request. Latency Monkey then measures whether upstream services respond as anticipated.

Latency Monkey provides Netflix with insight into the behavior of their calling applications. How calling applications will respond to a dependency slowdown is no longer an unknown. Using Latency Monkey, Netflix’s engineers can build their systems to withstand network congestion and thread pile-up.

Given the nature of Netflix’s key product (a streaming platform), server-side resilience is integral to delivering consistent quality of service.
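A small Python sketch shows the kind of question such an experiment answers: when a dependency slows down, does the caller time out and fall back, or does it hang? The handler, delay values, and helper names below are hypothetical and are not taken from Latency Monkey itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_injected_latency(handler, delay_seconds):
    """Wrap a service handler so every call is delayed before it runs."""
    def slow_handler(*args, **kwargs):
        time.sleep(delay_seconds)
        return handler(*args, **kwargs)
    return slow_handler


def get_title(title_id):
    # Hypothetical service-side handler.
    return {"id": title_id, "name": "Example Title"}


def call_with_timeout(executor, func, timeout, fallback, *args):
    """Caller under test: give up after `timeout` seconds and use a fallback."""
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # The worker thread stays busy until the slow call finishes; with
        # enough slow calls the pool fills up, which is the thread pile-up
        # this kind of test is meant to expose.
        return fallback


if __name__ == "__main__":
    slow_get_title = with_injected_latency(get_title, delay_seconds=0.5)
    with ThreadPoolExecutor(max_workers=4) as pool:
        result = call_with_timeout(pool, slow_get_title, 0.1, None, 42)
        print("fallback used" if result is None else result)
```

Because the injected delay (0.5 s) exceeds the caller’s timeout (0.1 s), the caller returns its fallback instead of waiting on the slow dependency.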

Fault Injection As A Platform

Netflix operates an incredibly large and complex distributed system. Their cloud-based AWS microservices architecture contains an ever-increasing number of individual components. Not only does this mean there is always a need to test and reinforce system resiliency, it also means that doing so at a wide enough scale to have a meaningful impact on the ecosystem becomes ever more difficult.

To address this, Netflix developed the FIT platform. FIT (Failure Injection Testing) is Netflix’s solution to the challenge of propagating deliberately induced faults across the architecture consistently. With FIT, automated fault injection at Netflix has gone from something used in isolated testing to a commonplace occurrence. Using FIT, Netflix can run chaos exercises across large chunks of the system, and engineers can access tests as a self-service.

Netflix found that their methods of deliberately introducing faults to their system weren’t without risk or impact. The FIT platform limits the impact of fault and failure testing on adjacent components and the wider system. However, it does so in a way which still allows for serious faults paralleling those encountered during actual runtime to be introduced.
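One way to picture limiting impact while still injecting a serious fault is to attach the failure scenario to individual requests and restrict it to an explicit scope. The Python sketch below illustrates that idea only; it is not FIT’s API, and `FailureScenario`, `injection_point`, the `ratings-service` name, and the user ids are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    target: str       # name of the injection point that should fail
    scope: frozenset  # ids of the users whose requests may be impacted


class InjectedFailure(Exception):
    pass


def injection_point(name, request_context):
    """Fail only when the request carries a scenario targeting this point."""
    scenario = request_context.get("failure_scenario")
    if (
        scenario is not None
        and scenario.target == name
        and request_context["user_id"] in scenario.scope
    ):
        raise InjectedFailure(name)


def handle_request(request_context):
    """Hypothetical handler with a fallback around its dependency call."""
    try:
        injection_point("ratings-service", request_context)
        return "live ratings"
    except InjectedFailure:
        return "ratings unavailable"


if __name__ == "__main__":
    scenario = FailureScenario("ratings-service", frozenset({"test-user"}))
    for user in ("test-user", "real-user"):
        context = {"user_id": user, "failure_scenario": scenario}
        print(user, "->", handle_request(context))
```

Here the dependency fails completely for the in-scope user, while every other request passes through untouched. Widening the scope widens the experiment without changing the code at the injection point.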

Confident Engineers Build Better Systems

Using chaos engineering principles and techniques, Netflix’s engineers have built a highly complex, distributed cloud architecture they can be confident in.

With every component built on the expectation that it will fail at some point, failure is no longer an unknown. It is the uncertainty of failure more than anything which fills engineers with dread when it comes to the resilience of their systems.

Netflix’s systems are proven to withstand all but the most catastrophic failures. The engineers employed by Netflix can be highly confident in that resilience. This means that they can be more creative in their output, more open to experimentation, and spend less of their energy worrying about impending service-wide failure.

The benefits of Netflix’s approach are numerous, and their business success is testament to its effectiveness. Using chaos engineering and fault injection, Netflix maintains an application-based service built on a microservice architecture that is complex, scalable, and at the same time resilient.