5 Common Distractions that Risk Breaking up Your Product Focus

Maintaining product focus is the best way to guarantee a successful business. As the late great Steve Jobs put it:

“If you keep an eye on the profits, you’re going to skimp on the product… but if you focus on making really great products, the profits will follow.”

A wide variety of statistics are available on how much time developers actually spend writing code, with figures ranging anywhere from 25% to 32%. Whichever is true, these studies show that your developers could (and should) be spending more time actually developing your product. In this post, we’ll examine some of the common roadblocks developers face, how you can help them overcome those roadblocks, and how doing so optimizes your IT costs.

Distraction #1 – Sifting Through the Noise

A complex system or product can produce millions of logs a day. Some of them hold important information that may need to be addressed, but for the most part they simply create a ton of noise. Noise is a killer for productivity. In order to keep things running smoothly, with minimal maintenance from developers, you’ll need a solution that can provide succinct error analysis.

One way to do this is by clustering common logs based on shared attributes. Automated clustering of similar logs is the first step to giving your developers more time in their day to focus on product development. Coralogix uses new Streama technology to automatically analyze your logs and prioritize the ones that hold the most value for your business.
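To make the idea concrete, here is a toy sketch of clustering logs by shared attributes: variable tokens (numbers, IDs, IP addresses) are masked so that lines produced by the same template collapse into a single cluster. This is purely illustrative and is not Coralogix’s Streama implementation.

```python
import re
from collections import Counter

def template_of(line: str) -> str:
    """Reduce a log line to a rough template by masking variable tokens."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)   # IPv4 addresses
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<HEX>", line)             # long hex ids/hashes
    line = re.sub(r"\b\d+\b", "<NUM>", line)                      # plain numbers
    return line

def cluster(lines):
    """Count how many raw lines collapse into each template."""
    return Counter(template_of(line) for line in lines)

logs = [
    "user 1042 logged in from 10.0.0.7",
    "user 2211 logged in from 10.0.0.9",
    "payment 77 failed: timeout after 30s",
]
for template, count in cluster(logs).most_common():
    print(count, template)
```

Even this crude approach turns thousands of near-identical lines into a handful of templates a developer can scan in seconds.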

Distraction #2 – Trying to Understand What’s ‘Normal’

Understanding what constitutes ‘normal’ behavior in complicated systems is tricky. When dealing with multiple releases per day and the multiple factors involved, it gets even more complicated. 

One common way developers are alerted to the effects of system changes is through threshold alerts. Threshold alerts rely on manually set parameters, which are time-consuming to maintain and commonly lead to false positives (i.e. more noise). Replacing these alerts with a machine learning solution that understands your system’s and product’s baselines and adjusts dynamically will keep your team on task and only alert them when they’re needed. No more wild goose chases.
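As a rough illustration of the difference, here is a minimal sketch of a dynamically learned baseline: instead of a hand-tuned threshold, the alert fires only when a metric drifts well outside its own recent history. Real anomaly-detection systems use far richer models; the window size and sensitivity below are arbitrary.

```python
from collections import deque
from statistics import mean, stdev

class DynamicBaseline:
    """Alert only when a metric drifts far from its own rolling baseline,
    instead of comparing it against a hand-tuned static threshold."""

    def __init__(self, window: int = 60, k: float = 3.0):
        self.history = deque(maxlen=window)  # recent observations
        self.k = k                           # sensitivity (std deviations)

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to recent history."""
        anomalous = False
        if len(self.history) >= 10:          # wait for enough data to learn a baseline
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        self.history.append(value)
        return anomalous

baseline = DynamicBaseline()
for errors_per_minute in [4, 5, 3, 6, 4, 5, 4, 3, 5, 4, 42]:
    if baseline.observe(errors_per_minute):
        print(f"alert: {errors_per_minute} errors/min is outside the learned baseline")
```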

Distraction #3 – Staying on Top of Technical Debt

Nobody wants to worry about technical debt (that’s why it exists in the first place), but it’s important to recognize that it only becomes more troublesome as time passes. Every known error that’s released to production can mean a week of troubleshooting waiting for your developers. That’s a lot of time being pulled away from product development.

Plus, it’s expensive! Estimates for the cost of remediating technical debt from known errors range from $3.61 to $5.42 per line of code. That cost comes from the time taken to fix issues caused by known errors in production.
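A quick back-of-the-envelope calculation shows how fast that adds up. The codebase size below is hypothetical; the per-line figures are the ones quoted above.

```python
# Back-of-the-envelope estimate using the per-line figures quoted above.
lines_of_code = 250_000          # hypothetical codebase size
low, high = 3.61, 5.42           # remediation cost per line of code (USD)

print(f"${lines_of_code * low:,.0f} - ${lines_of_code * high:,.0f}")
# -> $902,500 - $1,355,000
```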

Distraction #4 – Upgrading Supporting Systems

Monitoring and logging systems are incredibly important for any organization. Upgrading, managing and optimizing these systems is equally important, but it is also a huge distraction from your product.

When your engineers upgrade your ELK stack, they aren’t focused on product development. Your ELK stack forms a powerful part of your technology, but it isn’t your product. Your development team won’t have a product-focused mindset when they are applying a fix to each index following the latest Elasticsearch upgrade. Switching to a managed log and monitoring solution will help. Focus on your product and leave log management to the specialists.

Distraction #5 – Responding to Releases

Once you release a feature, a development team with a product focus ought to be working on the next release. However, releases sometimes have unpredictable or unforeseen impacts on your overall build quality and system performance. 

The ability to carry out post-release reports on overall system performance is key to maintaining product focus. It allows your developers to move on to the next big thing.
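A simplistic sketch of such a check might compare a key metric across the release boundary and flag the release only when it degrades beyond a chosen tolerance. The 10% threshold and the sample numbers below are arbitrary, and this is not how Coralogix’s benchmark reports work internally.

```python
def release_impact(before: list[float], after: list[float]) -> str:
    """Compare a performance metric (e.g. error rate) across a release boundary
    and flag the release if the post-release average degrades by more than 10%."""
    avg_before = sum(before) / len(before)
    avg_after = sum(after) / len(after)
    change = (avg_after - avg_before) / avg_before
    return "investigate release" if change > 0.10 else "move on to the next feature"

# Error rates sampled before and after a deploy (illustrative numbers only)
print(release_impact(before=[0.8, 1.1, 0.9], after=[2.4, 2.1, 2.6]))  # investigate release
```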

Product Focus and Centricity with Coralogix

What’s the bottom line? A strong and independent logging and monitoring system underpins a development team’s ability to do what they do best. Coralogix provides functional, scalable, out-of-the-box logging solutions driven by expertise, reducing the time spent on activities that detract from product focus.

Loggregation gives your developers a succinct overview of all your log outputs, clustering them based on commonalities and saving them time when troubleshooting. Combine this with an advanced, unified UI and an intelligent open search functionality, and the time your developers spend on tasks not related to product development will reduce dramatically.

Coralogix deals with billions of logs per day for customers big and small by offering a fully managed service. Using intelligent machine learning, Coralogix can establish performance baselines and identify known error patterns automatically. This scalable solution gives tangible relief on development resources, allowing product development to take priority.

Automated benchmark reports give your team total awareness of the impact that new releases will have on build quality. This Coralogix feature means that you’ll always be up to date on your system’s health.

These offerings, together with Coralogix’s overall innovative spirit, ensure that your developers will have more product focus.

What is Chaos Engineering and How to Implement It

Chaos Engineering is one of the hottest new approaches in DevOps. Netflix first pioneered it back in 2008, and since then it’s been adopted by thousands of companies, from the biggest names in tech to small software companies.

In our age of highly distributed cloud-based systems, Chaos Engineering promotes resilient system architectures by applying scientific principles.  In this article, I’ll explain exactly what Chaos Engineering is and how you can make it work for your team. 

What is Chaos Engineering?

A good way to sum up Chaos Engineering would be with Facebook’s (sadly) defunct motto “move fast and break things”. 

Chaos engineering is based on procedures called Chaos Experiments. These typically involve setting up two versions of the same system, an ‘experimental’ version and a ‘control’ version. Engineers then define a ‘steady state’, the characteristic output expected when the system is functioning normally.

When the experiment starts, both the experimental and control systems display the same steady state. Engineers then throw as many stones as possible at the experimental system in an attempt to knock it away from that state.

This stone-throwing could involve anything from randomly causing functions to throw exceptions to simulating the failure of a data center. Two of the best-known Chaos Engineering tools are Netflix’s Chaos Monkey and Chaos Kong. These were programmed to randomly turn off instances in Netflix’s live systems to test their resiliency. If the end users complained, a Netflix engineer had their work cut out!

Probably the most controversial feature of Chaos Engineering is that experiments are performed on production traffic. While this is highly risky, it’s the only way to ensure real-world validity.
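To make the structure of a chaos experiment concrete, here is a minimal sketch of the control-versus-experimental comparison described above. The metric, the injected failure, and the tolerance are all placeholders, not part of any specific tool.

```python
import random

def measure_error_rate(inject_failures: bool) -> float:
    """Stand-in for measuring the steady-state metric on one group of hosts."""
    baseline = 0.01
    penalty = 0.20 * random.random() if inject_failures else 0.0
    return baseline + penalty

def chaos_experiment(tolerance: float = 0.05) -> bool:
    """Compare a control group against an experimental group that has failures
    injected into it; the hypothesis holds if the steady-state metric stays
    within tolerance of the control."""
    control = measure_error_rate(inject_failures=False)
    experimental = measure_error_rate(inject_failures=True)
    return abs(experimental - control) <= tolerance

print("hypothesis held" if chaos_experiment() else "resiliency gap found")
```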

Why Choose Chaos Engineering?

The key benefit of chaos engineering is that it promotes resilient system architectures, especially in cloud-based applications. 

Traditional observability strategies use logs, traces and metrics to provide engineers with information about the internal state of an enterprise system. Engineers typically analyze this information and use their knowledge to interpret it.

While this approach works with older applications that are hosted in fixed servers, it’s a much bigger challenge in more modern systems. These contain hundreds or thousands of elements whose interactions are nearly impossible to analyze using traditional methods. They also use serverless architectures which hamper observability with their lack of a fixed server.

This is where Chaos Engineering can fill the gap. It doesn’t require detailed knowledge about a system’s internal vagaries, and it bypasses the obstacles that block traditional observability.  That means it will work for arbitrarily complex enterprise applications.

The upshot is that when programmers incorporate chaos into their design processes, they design for resiliency. Since 2008, Netflix has been utilizing Chaos Engineering to foster resilient system architectures. Between 2010 and 2017, they only had one disruption of service due to an instance failure.

Making Chaos Engineering Work for You

Adopting Chaos Engineering can be the best thing you ever do. Companies large and small are using it to promote resilient system design, leading to applications that never disappoint their users. Nevertheless, it’s an approach that must be used wisely.  Before you jump in and turn your DevOps team into a giant Chaos Experiment, you need to consider the following issues carefully.

Is Chaos Right for You?

Chaos Engineering is ideal for highly distributed systems with thousands of instances. Not everyone has embraced the cloud yet. Many companies are still wrestling with conventional server architectures and the perils of vertical scaling.

In this context, traditional observability strategies still have plenty of mileage, especially when you throw data-driven approaches and machine learning into the mix. There’s no excuse for not squeezing every drop out of the existing approaches. Techniques like machine learning, and tools such as Grafana and the ELK stack, can kickstart your DevOps practice when used well.

That said, the cloud is fast becoming the arena of choice for application deployment and may well become the norm for the majority of businesses. When you’re ready to move into the cloud, Chaos Engineering will be there waiting.

Automate Experiments

Most DevOps teams (I hope!) are familiar with the benefits that automation can provide through CI/CD and build pipelines. In Chaos Engineering, automation is essential. Due to the complexity of modern software, there is no way to know in advance which experiments are likely to yield positive results.

Chaos experiments need to be able to test the entire space of possible failures for a given system. It’s not possible to do this manually, so a DevOps team that is serious about Chaos Engineering needs to bake a high level of automation into its experimental procedures.

If you’re looking to create automated experiments, Chaos Toolkit is the ideal resource. Created in 2017, it’s an open-source project that provides a suite of tools for automating chaos experiments. Written in Python, Chaos Toolkit allows you to run experiments through its own CLI. The Chaos Toolkit is declarative: experiments are encoded in easy-to-edit JSON files which can be customized as circumstances require.
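To give a feel for the declarative style, here is a sketch of an experiment built as a Python dict and dumped to the JSON file you would hand to the CLI. The field names mirror the structure described in Chaos Toolkit’s documentation, but the service URL and the failure-injection script are placeholders.

```python
import json

# Sketch of a declarative chaos experiment, loosely following Chaos Toolkit's
# JSON experiment format. The probe URL and shell script are placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Checkout survives the loss of one cache node",
    "description": "Steady state: the checkout health endpoint keeps answering 200.",
    "steady-state-hypothesis": {
        "title": "Checkout responds normally",
        "probes": [{
            "type": "probe",
            "name": "checkout-responds",
            "tolerance": 200,
            "provider": {"type": "http", "url": "https://example.com/checkout/health"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "kill-one-cache-node",
        "provider": {"type": "process", "path": "scripts/kill_cache_node.sh"},
    }],
    "rollbacks": [],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
# then run it with the Chaos Toolkit CLI: chaos run experiment.json
```

Because the experiment is just data, it can be versioned, reviewed, and scheduled like any other artifact in your pipeline.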

Playing with Chaos

Chaos Engineering is not for the faint-hearted. When you run chaos experiments you’re intentionally sabotaging live systems. You’re gambling with real user experiences. In the worst case, you could cause an outage for your entire user base.

That’s why engineers carrying them out need to work in a way that produces as little collateral damage as possible. Chaos engineering suggests carrying out a series of experiments with successively increasing “blast radius.”

The least risky experiment to try involves testing the client-device functionality by injecting failures into a small number of devices and user-flows.  This will reveal any glaring faults in functionality and won’t harm any users. That said, it also won’t shed light on any deeper issues.

Upping the Ante

If the previous experiment didn’t reveal any problems, a small-scale diffuse experiment is the next thing to try. It’s small-scale because it impacts a small fraction of production traffic. It’s diffuse because that production traffic ends up evenly routed through your production servers. By filtering your success metrics for the affected traffic, you can assess the results of your experiment.

If the small-scale diffuse experiment passes without a hitch, you can try a small-scale concentrated experiment. Like the previous experiment, this involves a small subset of your traffic. However, instead of allowing this traffic to be ‘diffused’ throughout your servers, you funnel the subject traffic into a set of boxes.

This is where things get a bit hairy. The high traffic concentration puts plenty of load on the boxes, allowing hidden resource constraints to surface. There may also be increased latency. Any problems revealed by this experiment will affect the end users involved in it, but they will not impact the rest of your system.
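The sketch below illustrates the mechanical difference between the two set-ups: a small, stable slice of users is selected for the experiment, and that slice is either spread across every production server (diffuse) or funnelled into a couple of boxes (concentrated). The server names and percentages are illustrative only.

```python
import hashlib
import random

SERVERS = [f"prod-{i}" for i in range(20)]
EXPERIMENT_BOXES = SERVERS[:2]      # the 'concentrated' target set
EXPERIMENT_FRACTION = 0.02          # 2% of production traffic

def in_experiment(user_id: str) -> bool:
    """Deterministically pick a small, stable slice of users for the experiment."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < EXPERIMENT_FRACTION * 10_000

def route(user_id: str, concentrated: bool) -> str:
    """Diffuse: experiment traffic spreads across all servers.
    Concentrated: experiment traffic is funnelled into a few boxes."""
    if in_experiment(user_id) and concentrated:
        return random.choice(EXPERIMENT_BOXES)
    return random.choice(SERVERS)

print(route("user-1234", concentrated=True))
```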

Going Nuclear

The final option may have a perverse appeal to Die Hard 4.0 fans. In case you haven’t seen it, a group of hackers manages to shut down the entire internet and throw America into chaos. This experiment takes a similar gamble with your system. It involves testing a large fraction of your production traffic while letting it diffuse through the production servers.

There are two reasons you shouldn’t try this at home. First, any problems revealed by your experiment will affect a significant percentage of your user base, who won’t be very happy about the service disruption they experience. Second, because the experiment is diffuse, it has the potential to cause an outage for your entire user base, particularly if it affects global resource constraints.

Because chaos experiments have the potential to cause large amounts of collateral damage, they need to incorporate a kill switch in case things go sideways. Moreover, I’d highly recommend building in automated termination procedures, particularly given that your experiments are likely to be automated and running continuously.
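A kill switch can be as simple as a loop that polls a guard metric while the experiment runs and terminates it the moment the blast radius exceeds what you agreed to accept. The sketch below assumes hypothetical current_error_rate and stop_experiment helpers wired into your own monitoring and traffic-routing systems.

```python
import time

ERROR_RATE_LIMIT = 0.05   # abort once more than 5% of experiment requests fail

def current_error_rate() -> float:
    """Hypothetical hook: query your monitoring system for the guard metric."""
    return 0.0  # stand-in value for this sketch

def stop_experiment() -> None:
    """Hypothetical hook: revert injected failures, re-route traffic, page the team."""
    print("kill switch tripped - experiment terminated")

def run_with_kill_switch(duration_s: int = 600, check_every_s: int = 10) -> None:
    """Poll the guard metric while the experiment runs and terminate it
    automatically the moment the blast radius exceeds what was agreed."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if current_error_rate() > ERROR_RATE_LIMIT:
            stop_experiment()
            return
        time.sleep(check_every_s)
```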

Fostering Resiliency

At first glance, chaos is the antithesis of the intricate webs of logic that are software’s lifeblood. Yet in today’s landscape of massively distributed cloud-based systems, it just isn’t possible to design out failure. Instead, engineers need to cope with failure by designing resiliency into their systems. Chaos engineering is an approach explicitly designed to roll with failure and promote resilient system design.

Chaos engineering’s tolerance of failure makes it a dangerous game. But it’s a game more and more programmers are learning to play. And win.