5 Ways Scrum Teams Can Be More Efficient

With progressive delivery, DevOps, scrum, and agile methodologies, the software delivery process has become faster and more collaborative than ever before. Scrum has emerged as a ubiquitous framework for agile collaboration, instilling some basic meetings and roles into a team and enabling them to begin iterating on product increments quickly. However, as scrum teams grow and systems become more complex, it can be difficult to maintain productivity levels in your organization. 

Let’s dive into five ways you can tweak your company’s scrum framework to drive ownership, optimize cloud costs, and increase your team’s overall productivity.

1. Product Backlog Optimization

Scrum is an iterative process: based on feedback from stakeholders, the product backlog (the list of implementable features) is continually adjusted. Prioritizing and refining the tasks on this backlog ensures that your team delivers the right features to your customers, boosting both efficiency and morale.

However, it’s not as easy as it sounds. Each project has multiple stakeholders, and getting them on the same page about feature priority can sometimes prove tricky. That’s why selecting the right product owner and conducting pre-sprint and mid-sprint meetings are essential. 

These meetings help create a shared understanding of the project scope and the final deliverable early on. Using prioritization frameworks to categorize features by value and complexity can also help eliminate guesswork and bias when deciding priority.
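As a toy illustration (the scoring model, features, and numbers here are hypothetical, not part of any standard scrum artifact), a value-versus-complexity framework can be as simple as ranking backlog items by their value-to-effort ratio:

```python
# Toy value-vs-complexity prioritization: rank backlog items by
# value divided by estimated effort (both on a 1-10 scale).
backlog = [
    {"feature": "SSO login", "value": 8, "effort": 5},
    {"feature": "Dark mode", "value": 4, "effort": 2},
    {"feature": "CSV export", "value": 6, "effort": 3},
]

for item in backlog:
    item["priority"] = item["value"] / item["effort"]

# Highest value per unit of effort comes first
for item in sorted(backlog, key=lambda i: i["priority"], reverse=True):
    print(f"{item['feature']}: {item['priority']:.2f}")
```

Even a crude score like this turns a debate about opinions into a debate about two numbers, which is usually easier to settle in a pre-sprint meeting.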

At the end of each product backlog meeting, document the discussions and send them to the entire team, including stakeholders. That way, as the project progresses, there is less scope for misunderstandings, rework, or missing features. With a refined backlog, you’ll be able to rapidly deliver new changes to your software; however, this gives rise to a new problem.

2. Observability

As software systems become more distributed, an application failure rarely has a single, obvious cause. Identifying and fixing the broken link in the chain can add hours to a sprint, reducing the team’s overall productivity. A solid observability system, with monitoring of logs, traces, and metrics, thus becomes crucial to improving product quality.

However, under constant pressure to meet sprint deadlines, it can be challenging to maintain and monitor logs. That’s precisely where monitoring platforms like Coralogix can help: you can effectively analyze even the most complex of your applications, as your security and log data can be visualized in a single, centralized dashboard.

Machine learning algorithms in observability platforms continually search this data for anomalies and raise alerts automatically. Bottlenecks and security issues in a sprint can thus be identified and prioritized before they become critical. Collaboration across teams also becomes streamlined, as everyone can access the application analytics data securely without the headache of maintaining an observability stack.
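For intuition only, here is a minimal sketch of the kind of check such platforms automate (real platforms use far more sophisticated models; the error counts below are made up):

```python
import statistics

# Hypothetical per-minute error counts parsed from application logs
error_counts = [2, 3, 1, 2, 4, 3, 2, 3, 2, 25]

history = error_counts[:-1]          # recent history
mean = statistics.mean(history)
stdev = statistics.stdev(history)

latest = error_counts[-1]
# Flag anything more than 3 standard deviations above the mean
if latest > mean + 3 * stdev:
    print(f"ALERT: error count {latest} is anomalous (mean {mean:.1f})")
```

The point is not the math but the automation: nobody on the team has to stare at a dashboard for the spike to be caught.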

This new information becomes the fuel for continuous improvement within the team. This is brilliant, but the data alone isn’t enough to drive that change. You need to tap into the most influential meeting in the scrum framework: the retrospective.

3. Efficient Retrospectives

Even though product delivery is usually at the forefront of every scrum meeting, retrospectives are arguably more important as they directly impact both productivity and the quality of the end product.

Retrospectives at the end of the sprint are powerful opportunities to improve workflows and processes. If done right, these can reduce time waste, speed up future projects, and help your team collaborate more efficiently.

During a retrospective, especially if it’s your team’s first one, it is important to set ground rules to allow constructive criticism. Retrospectives are not about taking the blame but rather about solving issues collectively.

To make the retrospective actionable, you can choose a structure for the meeting. For instance, some companies opt for a “Start, Stop, Continue” format, where employees jot down what the team should start doing, what it should stop doing, and what is working well enough to continue. Another popular format is the “5 Whys,” which encourages team members to introspect and think critically about improving the project workflow.

As sprint retrospectives happen regularly, sticking to a single format can get repetitive. Instead, switch things up by changing the meeting’s duration, the retrospective style, and which members are required to attend. No matter which format or style you choose, the key is to engage the entire team.

At the end of a retrospective, document what was discussed and plan how to address both the positive and negative feedback. This list will help you pick out and prioritize the changes with the most impact and implement them from the next sprint. Along the way, you may find that some of these actions can only be picked up by one specific person or group. This is a “single point of failure,” and the following tip can solve it.

4. Cross-Training

Cross-training helps employees upskill, understand the moving parts of the business, and see how their work fits into the larger scheme of things. The idea is to train employees on the most critical or foundational tasks across the organization, enabling better resource allocation.

One reason cross-training succeeds in scrum teams is pair programming, which boosts collaboration while spreading knowledge across the team. If there’s an urgent product delivery or one of the team members is unavailable, others can step in to complete the task. Cross-functional teams can also iterate more quickly than their siloed counterparts, as they have the skills to rapidly prototype and test minimum viable products within the team.

However, the key to cross-training is not to overdo it. Having a developer handle the server side or support defects for some time is reasonable, but if it becomes a core part of their day, it won’t fit with their career goals. Cross-functional doesn’t mean that everyone should do everything; rather, it helps balance work and allocate tasks more efficiently.

When engineers move between tech stacks and support one another, it does come with a cost. The team will need to think hard about how they work and build the necessary collaborative machinery, such as CI/CD pipelines. Together, these tools form the developer workflow, and with cross-functional teams, an optimal workflow is essential to team success.

5. Workflow Optimization

Manual work and miscommunication cause the most significant drain on a scrum team’s productivity. Choosing the right tools can help cut down this friction and boost process efficiency and sprint velocity. Different tools that can help with workflow optimization include project management tools like Jira, collaboration tools like Slack and Zoom, productivity tools like StayFocused, and data management tools like Google Sheets.

Many project management tools have built-in features specific to agile teams, such as customizable scrum boards, progress reports, and backlog management on simple drag-and-drop interfaces. For example, tools like Trello or Asana help manage and track user stories, improve visibility, and identify blockers effectively through transparent deadlines. 

You can also use automation tools like Zapier and Butler to automate repetitive tasks within platforms like Trello. For example, you can set up rules on Zapier that trigger whenever a particular action is performed: every time you add a new card on Trello, it can create a new Drive folder or schedule a meeting. This cuts down unnecessary switching between applications and saves man-hours. With routine tasks automated, the team can focus on more critical areas of product delivery.

It’s also important to keep software costs in check as you adopt new tools. Track the workflow tools you implement and trim those that are redundant or don’t lead to a performance increase.

Final Thoughts

While scrum itself allows for speed, flexibility, and energy from teams, incorporating these five tips can help your team become even more efficient. However, you should always remember that the Scrum framework is not one-size-fits-all. Scrum practices that work in one scenario might fail completely in the next.

Thus, your scrum implementations should always allow flexibility and experimentation to find the best fit for the team and project. After all, that’s the whole idea behind being agile, isn’t it?

Five Tricks that Senior Engineers Use When They’re Debugging

Debugging is a fundamental skill in the arsenal of any engineer. Mistakes happen, and bugs are inevitable, but a skilled debugger can catch a bug early and find an elegant solution. 

But wait, what exactly is debugging?

It’s tempting to think of debugging as solving a problem: you’re fixing it so that it doesn’t come up again. But that’s not quite right. You’re not at the point of a solution yet. Right now, you’re investigating; you’re trying to work out the problem before committing to a solution.

“Debugging is like being the detective in a crime movie where you are also the murderer.” – Filipe Fortes

So let’s get into it. Here are five debugging techniques that will supercharge your debugging skills and give you insight into even the most complex problems.

1. Use the IDE

A typical pattern is for engineers to rely heavily on their own heads. That works, but it’s not optimal. Your IDE comes packed with tools that will help you analyze a problem, chief among them the debugger, which lets you step through the code line by line. Crucially, the debugger can also tell you the value of every single variable, how many times a loop iterates, whether an if-statement is entered, and more. When you’re debugging, this tool is indispensable.
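If you work in Python, for instance, the same step-through workflow is available even without a full IDE via the language’s built-in debugger (a minimal sketch; the function and data are invented):

```python
def total_price(items):
    total = 0
    for item in items:
        breakpoint()  # drops into pdb: inspect item and total, step with "n"
        total += item["price"] * item["qty"]
    return total

print(total_price([{"price": 9.99, "qty": 2}, {"price": 4.50, "qty": 1}]))
```

From the debugger prompt you can print any variable, count loop iterations, and confirm whether a branch is actually taken: exactly the information a graphical IDE surfaces in its debug panels.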

Remember though – the IDE won’t tell you the problem…

All the IDE can do is surface as much information as possible. It’s up to you to do two things. The first is to set up your IDE to suit your working style. Some people like to have every window and dial open to see all the metrics as they work. Others prefer a clean, minimalistic working space. The second is to use the information put in front of you effectively. Your IDE gives you everything; it’s up to you to filter that down into insights. When you get the tooling right, it’s time to think more methodically about the problem in front of you.

2. Write out your assumptions

Debugging is fundamentally a testing activity. You’re poking your creation to truly understand how it behaves and, most importantly, why it behaves that way. When you’re testing, you have to make a series of assumptions. It’s a natural part of the testing process, but sometimes, those assumptions can be incorrect. 

Our assumptions can be so fundamental that we overlook them, but it’s valuable to be thoroughly conversant with the assumptions you’re making for complex issues. 

BDD is a great tool here

Behavior-driven development (BDD) is a powerful technique for surfacing assumptions, describing expected behavior, and flushing out any inconsistent thinking you have. It is most commonly utilized in the software design phase, but it’s equally valuable when trying to understand existing software. 

Try this workflow. 

  1. Write out a BDD case. For example: given that I have already logged in and my account has admin powers, when I navigate to the home page, I should be able to edit the page’s title. (A sketch of this case as an executable test follows this list.)
  2. Test it by navigating and trying it out. 
  3. You know your BDD test case is wrong if it doesn’t work. At that point, you can ask the following questions: Are you logged in? Does the account have admin powers? Do admins have the ability to edit pages?
  4. Create a new BDD case and start again, but this time more refined. Iterate towards the true issue, learning each time.
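Here is that first case as a minimal, self-contained Python test. Everything in it is hypothetical stand-in code; in a real project, the stubs would be replaced by your application and a BDD framework such as behave or pytest-bdd:

```python
# Hypothetical stand-ins for the application under test
class Page:
    def __init__(self, editable):
        self.title_is_editable = editable

class Session:
    def __init__(self, is_admin):
        self.is_authenticated = True
        self.user_is_admin = is_admin

    def navigate(self, path):
        # Real code would drive a browser; here, admins get an editable title
        return Page(editable=self.user_is_admin)

def login(user, is_admin):
    return Session(is_admin)

def test_admin_can_edit_home_page_title():
    # Given: I have already logged in and my account has admin powers
    session = login(user="alice", is_admin=True)
    assert session.is_authenticated and session.user_is_admin

    # When: I navigate to the home page
    page = session.navigate("/home")

    # Then: I should be able to edit the page's title
    assert page.title_is_editable

test_admin_can_edit_home_page_title()
print("BDD case passed")
```

If the final assert fails, the earlier asserts tell you exactly which assumption broke: the login, the admin flag, or the page behavior. That is step 3 of the workflow made mechanical.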

Mastering your use of language to surface your assumptions is one of the vital debugging techniques, but sometimes it’s not just about your language; it’s about your thinking. Sometimes you need someone, or rather something, to talk to.

3. Rubber duck debugging

Sometimes, it’s a good idea to completely abandon your sanity and talk to inanimate objects when you’re stuck. This may sound a little wild, but it has a long and convincing pedigree in software engineering. The beauty of speaking to a rubber duck is that the conversation is agonizingly one-sided, forcing you to articulate every detail. 

Speaking about a problem often helps you find inconsistencies or assumptions in your thinking. These assumptions can lead to new test cases that will help you focus on the issue. There are dozens of senior engineers who swear by rubber duck debugging. 

People work too!

You could also have a conversation with a colleague. Asking for help is not a sign of weakness; it’s the sign of an efficient, collaborative engineer. If there is an available colleague nearby, have a chat! But sometimes, there’s nothing better than the vacant, plastic stare of a rubber duck.

Conversing is an essential method for understanding the problem, but sometimes the problem is so complicated that you need to take a different tack. If the code is complex or difficult to understand, it’s time to change its behavior to test your understanding.

4. Mess with the code a little

Okay, so you’ve tried a bunch of stuff, and it’s not working how you expect. You don’t know why, and every input you’ve been giving the code just isn’t making any sense. Your next step is to change the code a little. Here’s what you do (a small sketch follows the list):

  1. Predict how the behavior will change if you change the code in a certain way.
  2. Change the code.
  3. Measure the outcome and see if your prediction is correct.
  4. Make a new prediction and start again.
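A contrived Python example of one turn of this loop (the function and the prediction are invented for illustration):

```python
def normalize(scores):
    # The mystery line under investigation: why max(..., 1)?
    top = max(max(scores), 1)
    return [s / top for s in scores]

def normalize_changed(scores):
    # Prediction: dropping the guard changes nothing
    top = max(scores)
    return [s / top for s in scores]

print(normalize([0, 0, 0]))  # [0.0, 0.0, 0.0]
try:
    print(normalize_changed([0, 0, 0]))
except ZeroDivisionError:
    print("Prediction wrong: the guard protects against all-zero input")
```

The failed prediction is the payoff: you now know precisely why that one line exists.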

For very complicated algorithms, this is a brilliant way of understanding why each piece is in place. Change a little, and investigate why the result doesn’t match your expectations. Now you’re not investigating the whole algorithm; you’re just looking at one little step. When the code is complex, making micro-changes and predicting their outcomes is an excellent way of practically assessing the code’s behavior, and it can break up those long, monotonous staring competitions that engineers often find themselves in.

However, if all else fails, there is one age-old debugging technique that has been aiding engineers since we first started building things.

5. Get up and go outside

Newton didn’t discover gravity when an apple hit him on the head, but the myth has a sensible moral. Sometimes, ideas just hit you on the head, and no amount of brute force thinking will speed up that discovery. When you focus for a long time, the assumptions you make begin to pile up. It can become difficult to see through tunnel vision.

To get away from this, take yourself away from the problem for 10 minutes. Make a coffee, go for a walk, go to the gym – just do something that lets your mind work on the problem in the background. Your mind has a startling capacity for passive problem solving, and it is one of the most underutilized debugging skills.

Now you know what to do

Debugging can be one of the most rewarding engineering experiences; that warm sensation of enlightenment when you finally discover the actual workings of your solution is hard to beat. So the next time you run into a problem, don’t panic. Just remember to be methodical, focused, and relaxed. You’ve got this.

Adding Observability to Your CI/CD Pipeline in CircleCI

The simplest CI/CD pipeline consists of three stages: build, test, and deploy.

In modern software systems, it is common for several developers to work on the same project simultaneously. Siloed working with infrequent merging of code in a shared repository often leads to bugs and conflicts that are difficult and time-consuming to resolve. To solve this problem, we can adopt continuous integration.  

Continuous integration is the practice of writing code in short, incremental bursts and pushing it to a shared project repository frequently so that automated build and testing can be run against it. This ensures that when a developer’s code gets merged into the overall project codebase, any integration problems are detected as early as possible. The automatic build and testing are handled by a CI server.

If passing the automated build and testing results in code being automatically deployed to production, that is called continuous deployment. 

All the sequential steps that need to be executed automatically, from the moment a developer commits a change to that change being shipped to production, are referred to as a CI/CD pipeline. CI/CD pipelines can range from very simple to very complex, depending on the needs of the application.

Important considerations when developing CI/CD pipelines

Building a CI/CD pipeline is no simple task. It presents numerous challenges, some of which include:

Automating the wrong processes

The whole premise of CI/CD is to increase developer productivity and optimize time-to-market. That goal is defeated when the pipeline contains steps that aren’t necessary or that could be done faster manually.

When developing a CI/CD pipeline, you should:

  • consider how long a task takes to perform manually and whether it is worth automating
  • evaluate all the steps in the CI/CD pipeline and only include those that are necessary
  • analyze performance metrics to determine whether the pipeline is improving productivity
  • understand the technologies you are working with and their limitations, as well as how they can be optimized, so that you can speed up the build and testing stages

Ineffective testing

Tests are written to find and remove bugs and ensure that code behaves in the desired manner. You can have a great CI/CD pipeline in place but still get bug-ridden code in production because of poorly written, ineffective tests. 

To improve the effectiveness of a CI/CD pipeline, you should:

  • write automated tests during development, ideally by practicing test-driven development (TDD)
  • examine the tests to ensure that they are of high quality and suitable for the application
  • ensure that the tests have decent code coverage and cover all the appropriate edge cases (illustrated in the sketch below)
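As a small illustration of that last point, here is a hypothetical pytest suite for an invented discount function. The happy-path test alone would pass some buggy implementations; the edge cases are what give the pipeline teeth:

```python
import pytest

def apply_discount(price, percent):
    """Return price reduced by percent (hypothetical function under test)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_happy_path():
    assert apply_discount(100.0, 20) == 80.0

# Edge cases catch the bugs the happy path misses
def test_zero_and_full_discount():
    assert apply_discount(59.99, 0) == 59.99
    assert apply_discount(59.99, 100) == 0.0

def test_invalid_percent_rejected():
    with pytest.raises(ValueError):
        apply_discount(10.0, 150)
```

Once the pipeline below is in place, these tests run on every commit, so a regression in any edge case blocks the merge rather than reaching production.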

Lack of observability in CI/CD pipelines

Continuous integration and continuous deployment underpin agile development. Together they ensure that features are developed and released to users quickly while maintaining high quality standards.  This makes the CI/CD pipeline business-critical infrastructure.

The more complex the software being built, the more complex the CI/CD pipeline that supports it. What happens when one part of the pipeline malfunctions? How do you discover an issue that is causing the performance of the CI/CD pipeline to degrade?

It is important that developers and the platform team are able to obtain data that answers these critical questions right from the CI/CD pipeline itself so that they can address issues as they arise.

Making a CI/CD pipeline observable means collecting quality and performance metrics on each stage of the CI/CD pipeline and thus proactively working to ensure the reliability and optimal performance of this critical piece of infrastructure. 

Quality metrics

Quality metrics help you identify how good the code being pushed to production is. While the whole premise of a CI/CD pipeline is to increase the speed at which software is shipped to get fast feedback from customers, it is also important to not be shipping out buggy code.

By tracking things like test pass rate, deployment success rate, and defect escape rate, you can more easily identify where to improve the quality of the code being produced.
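As a concrete (and entirely hypothetical) illustration of how these rates are typically derived from sprint counts:

```python
# Hypothetical counts gathered over a sprint
tests_run, tests_passed = 420, 415
deploys, failed_deploys = 30, 2
defects_pre_release, defects_in_production = 18, 3

test_pass_rate = tests_passed / tests_run
deployment_success_rate = (deploys - failed_deploys) / deploys
# Share of all defects that escaped the pipeline into production
defect_escape_rate = defects_in_production / (
    defects_pre_release + defects_in_production
)

print(f"Test pass rate:          {test_pass_rate:.1%}")
print(f"Deployment success rate: {deployment_success_rate:.1%}")
print(f"Defect escape rate:      {defect_escape_rate:.1%}")
```

Tracked sprint over sprint, trends in these three numbers show whether quality is improving or quietly eroding.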

Productivity metrics

An effective CI/CD pipeline is a performant one. You should be able to build, test, and ship code as quickly as possible. Tracking performance-related metrics can give you insight into how performant your CI/CD pipeline is and enable you to identify and fix any bottlenecks causing performance issues.

Performance-based metrics include time-to-market, defect resolution time, deployment frequency, build/test duration, and the number of failed deployments. 

Observability in your CI/CD pipeline

The first thing needed to make a CI/CD pipeline observable is to use the right observability tool, such as Coralogix, a stateful streaming analytics platform.

The observability tool you choose can then be configured to track and report on the observability metrics most pertinent to your application.

When an issue is discovered, the common practice is to have the person who committed the offending change investigate the problem and find a solution. The benefit of this approach is that it gives team members a sense of complete end-to-end ownership of any task they take on, as they have to ensure it gets shipped successfully.

Another good practice is to conduct a post-mortem review of the incident to identify what worked to resolve it and how things can be done better next time. The feedback from the post-mortem can also be used to identify where the CI/CD pipeline can be improved to prevent future issues.

Example of a simple CircleCI CI/CD pipeline

There are a number of CI servers you can use to build your CI/CD pipeline. Popular ones include Jenkins, CircleCI, GitLab, and the newcomer GitHub Actions.

Coralogix provides integrations with CircleCI, Jenkins, and GitLab that enable you to quickly and easily send logs and metrics to Coralogix from these platforms.

The general principle of most CI servers is that you define your CI/CD pipeline in a YAML file as a workflow consisting of sequential jobs. Each job defines a particular stage of your CI/CD pipeline and can consist of multiple steps.

An example of a CircleCI CI/CD pipeline for building and testing a Python application is shown in the code snippet below.
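Here is a minimal sketch of such a configuration. It assumes a pip-managed project with a pytest test suite; the orb version and Docker image are illustrative:

```yaml
version: 2.1

orbs:
  # CircleCI's Python convenience orb (version pin is illustrative)
  python: circleci/python@2.1

jobs:
  build-and-test:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      # Installs dependencies from requirements.txt via pip
      - python/install-packages:
          pkg-manager: pip
      - run:
          name: Run tests
          command: pytest

workflows:
  main:
    jobs:
      - build-and-test
```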

To add a deploy stage, you can use any one of the deployment orbs CircleCI provides. An orb is simply a reusable configuration package CircleCI makes available to help simplify your deployment configuration. There are orbs for most of the common deployment targets, including AWS and Heroku. 

The completed CI/CD pipeline with deployment to Heroku is shown in the code snippet below.
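A sketch of the completed pipeline using CircleCI’s Heroku orb follows. The heroku/deploy-via-git job ships the code only once build-and-test succeeds; the orb versions and branch name are illustrative, and the orb expects HEROKU_API_KEY and HEROKU_APP_NAME to be set in your project settings:

```yaml
version: 2.1

orbs:
  python: circleci/python@2.1
  heroku: circleci/heroku@2.0

jobs:
  build-and-test:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - python/install-packages:
          pkg-manager: pip
      - run:
          name: Run tests
          command: pytest

workflows:
  main:
    jobs:
      - build-and-test
      # Deploy only after build-and-test passes on the main branch
      - heroku/deploy-via-git:
          requires:
            - build-and-test
          filters:
            branches:
              only: main
```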

Having created this CI/CD pipeline you would think that you are done, but in fact, you have only done half the job. The above CI/CD pipeline is missing a critical component to make it truly effective: observability. 

Making the CI/CD pipeline observable

Coralogix provides an orb that makes it simple to integrate your CircleCI CI/CD pipeline with Coralogix. This enables you to send pipeline data to Coralogix in real time for analysis of the health and performance of your pipeline.

The Coralogix orb provides four endpoints:

  • coralogix/stats for sending the final report of a workflow job to Coralogix
  • coralogix/logs for sending the logs of all workflow jobs to Coralogix for debugging
  • coralogix/send for sending 3rd party logs generated during a workflow job to Coralogix
  • coralogix/tag for creating a tag and a report for the workflow in Coralogix

To add observability to your CircleCI pipeline:

  1. In your Coralogix account, navigate to Project Settings -> Advanced Settings -> Pipelines and turn Pipelines on
  2. Add the Coralogix orb stanza at the top of your CircleCI configuration file
  3. Use the desired Coralogix endpoint in your existing pipeline

The example below shows how you can use Coralogix to debug a CircleCI workflow. Adding the coralogix/logs job at the end of the workflow means that all the logs generated by CircleCI during the workflow will be sent to your Coralogix account, which will allow you to debug all the different jobs in the workflow. 
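A sketch of what that looks like in the workflow definition (the orb stanza and version are illustrative; check the Coralogix entry in the CircleCI orb registry for the exact reference):

```yaml
version: 2.1

orbs:
  python: circleci/python@2.1
  # Coralogix orb stanza added at the top of the configuration file
  coralogix: coralogix/coralogix@1.0

workflows:
  main:
    jobs:
      - build-and-test
      # Sends the logs of all workflow jobs to Coralogix for debugging
      - coralogix/logs:
          requires:
            - build-and-test
```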

Conclusion

CI/CD pipelines are a critical piece of infrastructure. By making your CI/CD pipeline observable you turn it into a source of real-time actionable insight into its health and performance.

Observability of CI/CD pipelines should not come as an afterthought but rather be incorporated into the design of the pipeline from the outset. Coralogix provides integrations for CircleCI and Jenkins that make it a reliable partner for introducing observability to your CI/CD pipeline.

What is Chaos Engineering and How to Implement It

Chaos Engineering is one of the hottest new approaches in DevOps. Netflix first pioneered it back in 2008, and since then it’s been adopted by thousands of companies, from the biggest names in tech to small software companies.

In our age of highly distributed cloud-based systems, Chaos Engineering promotes resilient system architectures by applying scientific principles.  In this article, I’ll explain exactly what Chaos Engineering is and how you can make it work for your team. 

What is Chaos Engineering?

A good way to sum up Chaos Engineering would be with Facebook’s (sadly) defunct motto “move fast and break things”. 

Chaos engineering is based on procedures called Chaos Experiments. These typically involve setting up two versions of the same system, an ‘experimental’ version and a ‘control’ version. Engineers then define a ‘steady state’, the characteristic output expected when the system is functioning normally.

When the experiment starts, both the experimental and control systems display the same steady state. Engineers then throw as many stones as possible at the experimental system in an attempt to knock it away from that state.

This stone-throwing could involve anything from randomly causing functions to throw exceptions to simulating the failure of a data center. Two famous Chaos Engineering tools are Netflix’s Chaos Monkey and Chaos Kong. These were programmed to randomly turn off instances in Netflix’s live systems to test their resiliency. If the end users complained, a Netflix engineer had their work cut out!
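To make the pattern concrete, here is a toy sketch of the Chaos Monkey idea. This is not Netflix’s code; it just randomly “terminates” one member of a hypothetical fleet:

```python
import random

def chaos_monkey(instances):
    """Randomly terminate one instance; the system should survive it."""
    victim = random.choice(instances)
    instances.remove(victim)
    print(f"Chaos Monkey terminated {victim}")
    return victim

# Hypothetical fleet of instance IDs
fleet = ["i-01", "i-02", "i-03", "i-04"]
chaos_monkey(fleet)
print(f"Surviving fleet: {fleet}")
```

If losing any single instance degrades the steady state, the architecture, not the monkey, is the problem.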

Probably the most controversial feature of Chaos Engineering is that experiments are performed on production traffic. While this is highly risky, it’s the only way to ensure real-world validity.

Why Choose Chaos Engineering?

The key benefit of chaos engineering is that it promotes resilient system architectures, especially in cloud-based applications. 

Traditional observability strategies use logs, traces and metrics to provide engineers with information about the internal state of an enterprise system. Engineers typically analyze this information and use their knowledge to interpret it.

While this approach works with older applications hosted on fixed servers, it’s a much bigger challenge in more modern systems. These contain hundreds or thousands of elements whose interactions are nearly impossible to analyze using traditional methods. They may also use serverless architectures, which hamper observability because there is no fixed server to watch.

This is where Chaos Engineering can fill the gap. It doesn’t require detailed knowledge about a system’s internal vagaries, and it bypasses the obstacles that block traditional observability.  That means it will work for arbitrarily complex enterprise applications.

The upshot is that when programmers incorporate chaos into their design processes, they design for resiliency. Since 2008, Netflix has been utilizing Chaos Engineering to foster resilient system architectures. Between 2010 and 2017, they only had one disruption of service due to an instance failure.

Making Chaos Engineering Work for You

Adopting Chaos Engineering can be the best thing you ever do. Companies large and small are using it to promote resilient system design, leading to applications that never disappoint their users. Nevertheless, it’s an approach that must be used wisely.  Before you jump in and turn your DevOps team into a giant Chaos Experiment, you need to consider the following issues carefully.

Is Chaos Right for You?

Chaos Engineering is ideal for highly distributed systems with thousands of instances. Not everyone has embraced the cloud yet. Many companies are still wrestling with conventional server architectures and the perils of vertical scaling.

In this context, traditional observability strategies still have plenty of mileage, especially when you throw data-driven approaches and machine learning into the mix. There’s no excuse for not squeezing every drop out of these existing approaches; tools such as Grafana and the ELK stack, used well, can kickstart your DevOps practice.

That said, the cloud is fast becoming the arena of choice for application deployment and may well become the norm for the majority of businesses. When you’re ready to move into the cloud, Chaos Engineering will be there waiting.

Automate Experiments

Most DevOps teams (I hope!) are familiar with the benefits that automation can provide through CI/CD and build pipelines. In Chaos Engineering, automation is essential. Due to the complexity of modern software, there is no way to know in advance which experiments are likely to yield positive results.

Chaos experiments need to be able to test the entire space of possible failures for a given system. It’s not possible to do this manually, so a DevOps team that is serious about Chaos Engineering needs to bake a high level of automation into its experimental procedures.

If you’re looking to create automated experiments, Chaos Toolkit is the ideal resource. Created in 2017, it’s an open-source project that provides a suite of tools for automating chaos experiments. Written in Python, Chaos Toolkit allows you to run experiments through its own CLI. The toolkit is declarative: experiments are encoded in easy-to-edit JSON files, which you can customize as circumstances require.
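A minimal sketch of what such an experiment file can look like (the health-check URL and process name are hypothetical; consult the Chaos Toolkit documentation for the full schema):

```json
{
  "title": "Service survives losing one worker",
  "description": "Hypothetical experiment: kill a worker process and expect the health check to keep passing.",
  "steady-state-hypothesis": {
    "title": "Application responds",
    "probes": [
      {
        "type": "probe",
        "name": "api-is-healthy",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://localhost:8080/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-one-worker",
      "provider": {
        "type": "process",
        "path": "pkill",
        "arguments": ["-f", "worker-process"]
      }
    }
  ],
  "rollbacks": []
}
```

You then execute it from the CLI with chaos run experiment.json; the toolkit verifies the steady-state hypothesis before and after applying the method.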

Playing with Chaos

Chaos Engineering is not for the faint-hearted. When you run chaos experiments, you’re intentionally sabotaging live systems. You’re gambling with real user experiences. In the worst case, you could cause an outage for your entire user base.

That’s why engineers carrying them out need to work in a way that produces as little collateral damage as possible. Chaos engineering suggests carrying out a series of experiments with successively increasing “blast radius.”

The least risky experiment involves testing client-device functionality by injecting failures into a small number of devices and user flows. This will reveal any glaring faults in functionality and won’t harm any users. That said, it also won’t shed light on any deeper issues.

Upping the Ante

If the previous experiment didn’t reveal any problems, a small-scale diffuse experiment is the next thing to try. It’s small-scale because it impacts a small fraction of production traffic. It’s diffuse because that production traffic ends up evenly routed through your production servers. By filtering your success metrics for the affected traffic, you can assess the results of your experiment.

If the small-scale diffuse experiment passes without a hitch, you can try a small-scale concentrated experiment. Like the previous experiment, this involves a small subset of your traffic. However, instead of allowing this traffic to be ‘diffused’ throughout your servers, you funnel the subject traffic into a set of boxes.

This is where things get a bit hairy. The high traffic concentration puts plenty of load on the boxes, allowing hidden resource constraints to surface. There may also be increased latency. Any problems revealed by this experiment will affect the end users involved in it but will not impact the rest of your system.

Going Nuclear

The final option may have a perverse appeal to Die Hard 4.0 fans. In case you haven’t seen it, a group of hackers manages to shut down the entire internet and throw America into chaos. This experiment takes a similar gamble with your system. It involves testing a large fraction of your production traffic while letting it diffuse through the production servers.

There are two reasons you shouldn’t try this at home. First, any problems revealed by your experiment will affect a significant percentage of your user base, who won’t be very happy about the service disruption they experience. Second, because the experiment is diffuse, it has the potential to cause an outage for your entire user base, particularly if it affects global resource constraints.

Because chaos experiments have the potential to cause large amounts of collateral damage, they need to incorporate a kill switch in case things go sideways. Moreover, I’d highly recommend building in automated termination procedures, particularly given that your experiments are likely to be automated and running continuously.

Fostering Resiliency

At first glance, chaos is the antithesis of the intricate webs of logic that are software’s lifeblood. Yet in today’s landscape of massively distributed cloud-based systems, it just isn’t possible to design out failure. Instead, engineers need to cope with failure by designing resiliency into their systems. Chaos engineering is an approach explicitly designed to roll with failure and promote resilient system design.

Chaos engineering’s tolerance of failure makes it a dangerous game. But it’s a game more and more programmers are learning to play. And win.