ECS Monitoring Metrics that Help Optimize and Troubleshoot Tasks

Workloads running on Amazon’s Elastic Container Service (ECS) require regular monitoring to ensure containerized applications on AWS run and are managed properly – in short, ECS monitoring is a must. ECS can run containers on either EC2 or Fargate compute. Both are compute services, but EC2 allows users to configure virtually every functional aspect, while Fargate offers fewer settings and is simpler to set up. Before building a solution with ECS, determine if this service best meets your needs.

When setting up an ECS service, you also need to configure monitoring and observability tools. AWS provides tools to collect and process ECS data to help monitor your solution. Developers can use Coralogix’s Log Analytics and AWS Observability platforms to provide more detailed metrics.

ECS Monitoring Infrastructure

AWS provides metrics useful for ECS monitoring. Let’s go over several recommended metrics and evaluate how they will help ensure the health of your ECS infrastructure. 

Task Count

Task count measures how many tasks are active on the cluster. This metric helps analyze how busy your cluster is. Users can define the desired number of tasks to keep active when defining a service. ECS will automatically ensure that the desired number of tasks of a given container will run. The desired count is also available in the service metrics from ECS. Users can compare the desired count with the running task count to ensure their services are executing as expected.

Along with the number of running tasks, AWS tracks the number of pending tasks. Pending tasks are in a transitional state: ECS is waiting on the container to start before the task becomes active. Tasks can get stuck in a pending state, causing customer-facing outages. An unresponsive Docker daemon or an ECS container agent that has lost connectivity with the ECS service can leave tasks stuck during launch.

AWS uses services to launch tasks automatically. Set up automatic alerts for tasks stuck in a pending state or for a running task count that falls below the desired count, as in the sketch below. Alerts allow development teams to intervene and limit outages on your platform in cases where automatic launches fail.
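As a rough sketch of what such a check might look like, the snippet below uses boto3 (the AWS SDK for Python) to compare a service’s desired, running, and pending task counts; the cluster and service names are placeholders, not values from this article.

```python
import boto3

# Assumed names -- replace with your own cluster and service.
CLUSTER = "my-ecs-cluster"
SERVICE = "my-ecs-service"

ecs = boto3.client("ecs")

def check_task_counts(cluster: str, service: str) -> None:
    """Compare desired vs. running vs. pending task counts for one service."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    desired = svc["desiredCount"]
    running = svc["runningCount"]
    pending = svc["pendingCount"]

    print(f"desired={desired} running={running} pending={pending}")
    if running < desired or pending > 0:
        # Hook this into your alerting channel of choice (SNS, Slack, etc.).
        print("WARNING: service is below its desired count or has pending tasks")

check_task_counts(CLUSTER, SERVICE)
```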

CPU Reservation and Utilization

The total CPU reservation measures the total CPU units reserved by all the tasks running on a single ECS cluster. You can have multiple clusters in your AWS account, and each would specify its CPU reservation value. 

Active tasks (Fargate) or container instances (EC2) on a cluster consume reserved CPU units. Each registers a certain number of units based on its task definition. Instances in either the ‘Active’ or ‘Draining’ state affect the results.

CPU utilization measures how many CPU units the tasks running on your cluster are actually using. Comparing CPU reservation to CPU utilization shows how much spare CPU capacity is available at any given time for launching new tasks, and whether new EC2 instances are needed. If you attempt a launch without enough available units, it may fail.
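For example, a minimal sketch of pulling these two cluster-level metrics from CloudWatch with boto3 might look like the following (the cluster name and time window are placeholders):

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

def cluster_cpu_stats(cluster_name: str, hours: int = 3):
    """Fetch average CPUReservation and CPUUtilization for an ECS cluster."""
    end = datetime.utcnow()
    start = end - timedelta(hours=hours)
    results = {}
    for metric in ("CPUReservation", "CPUUtilization"):
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/ECS",
            MetricName=metric,
            Dimensions=[{"Name": "ClusterName", "Value": cluster_name}],
            StartTime=start,
            EndTime=end,
            Period=300,          # 5-minute data points
            Statistics=["Average"],
        )
        results[metric] = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return results

stats = cluster_cpu_stats("my-ecs-cluster")  # placeholder cluster name
```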

Users can also enhance metrics on EC2 instances, so CPU usage is logged per instance instead of only per cluster. AWS provides CPU capacity, reserved capacity, and usage per instance. Users must set up these metrics separately from the other cluster-based metrics. 

To troubleshoot a CPU utilization close to 100%, AWS recommends rebooting the instance. If the CPU requirements are higher than you have reserved for the cluster, restarting will not fix the problem. In this case, you may need to revise your CPU requirements or convert to a new instance type with better performance.

Memory Reservation and Utilization

Memory reservation measures how much memory is allocated for use in the cluster by all tasks. The reservation value is used to calculate the percentage of memory available. Cluster memory reservation is the percent ratio of the total mebibytes (MiB) of memory reserved by tasks to the total MiB of memory registered by container instances in the same cluster. 

The memory utilization value shows the percentage of memory used in the cluster. It is the percent ratio of the total MiB of memory used by tasks divided by the total MiB of memory registered by container instances in the same cluster. 

Cluster Reservation and Utilization

Cluster reservation metrics are available only for clusters with tasks using EC2 launch types. Fargate launch types don’t reserve CPU or memory the same way, so they are measured differently. 

Before running a task, users create a task definition describing the containers the task will run. ECS uses the task definition to reserve the appropriate amount of CPU, GPU, and memory. A defined task’s reserved values do not count toward this calculation unless the task is running.

The cluster CPU reservation is the percent ratio of total CPU units reserved by tasks to total CPU units registered by container instances in the cluster. In other words, it is the CPU reserved by running tasks divided by the total CPU capacity registered to the cluster. Users cannot use more CPU to run tasks than the cluster has available. Cluster CPU utilization instead compares the percent ratio of total CPU units used by tasks in the cluster to the total CPU units registered by container instances. By comparing CPU reservation to CPU utilization, developers can understand how much capacity remains on their cluster and when they need to increase the size of their cluster or container definitions.

Similarly, the cluster memory reservation is the percent ratio of the total MiB of memory reserved by tasks to the total MiB of memory registered by container instances in the cluster. It is the memory reserved by running tasks divided by the total memory allocated to the cluster. Developers can compare a cluster’s memory reservation to its utilization to determine if the size of the cluster should be increased. Further, graphical analysis of the memory utilization may show memory leaks in code.
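Written out, the two cluster-level reservation ratios described above are:

\[
\text{Cluster CPU reservation} = \frac{\text{CPU units reserved by running tasks}}{\text{CPU units registered by container instances}} \times 100
\]

\[
\text{Cluster memory reservation} = \frac{\text{MiB of memory reserved by running tasks}}{\text{MiB of memory registered by container instances}} \times 100
\]

The corresponding utilization metrics substitute “used” for “reserved” in the numerator.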

Service Utilization

Service utilization uses calculations similar to the cluster memory and CPU utilization calculations discussed above. The service CPU utilization is the percent ratio of the total CPU units used by tasks in the service to the total CPU units specified in the task definition, multiplied by the number of tasks in the service. The service utilization calculations depend on the number of tasks in the service and not on the cluster itself. The memory calculation is the same, replacing CPU units with the MiB of memory used by tasks.
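As a formula, the service-level CPU figure described above works out to:

\[
\text{Service CPU utilization} = \frac{\text{CPU units used by tasks in the service}}{\text{CPU units in the task definition} \times \text{number of tasks in the service}} \times 100
\]

with the memory version obtained by swapping CPU units for MiB of memory.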

Service utilization is allowed to go over 100% when CPU or memory capacity is defined at the container level. When these units are defined at the task level, going over-limit is not allowed, and the task will fail.

ECS Monitoring in your Application Level Tasks

With AWS Lambda, several metrics are automatically created for each function. For example, when an error is logged, it is added to a metric automatically. That metric, in turn, can be used to set an alarm and notify the development team of issues in the running function. Such metrics are not automatically created for tasks in ECS monitoring.

Error log metrics are beneficial for tracking the health of compute workloads. Other metrics specific to your platform’s use case may also be necessary. To meet any metric needs inside your tasks, you can create custom metrics and alarms in AWS CloudWatch or send logs to third-party systems. The benefit of third-party systems is that you can send logs and use their existing analytics to detect issues without predefining everything you may need to track. Coralogix’s log analytics platform detects anomalies in your logs and alerts you based on its findings.
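If you stay within CloudWatch, a minimal sketch of publishing a custom error-count metric from a task might look like this (the namespace and dimension names are illustrative, not an AWS convention):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_error(service_name: str, count: int = 1) -> None:
    """Publish a custom error-count data point that a CloudWatch alarm can watch."""
    cloudwatch.put_metric_data(
        Namespace="MyApp/ECS",                      # illustrative namespace
        MetricData=[{
            "MetricName": "TaskErrors",
            "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
            "Value": count,
            "Unit": "Count",
        }],
    )

# Call this from your task's error-handling path (the service name is illustrative).
report_error("orders-service")
```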

Automatic Alerting and ECS Monitoring

The metrics listed above are critical for monitoring and maintaining the health of your ECS infrastructure and ensuring users are not the ones who discover outages. Developers need to be alerted when problems arise so they can limit issues successfully.

Alerting can be done using different methods, either in AWS or with third-party services. AWS CloudWatch provides alarms that can trigger alerts on known issues. Metric data is sent to CloudWatch, and if it meets the criteria set in the alarm, a notification is sent to the predefined destination. Third-party systems like Coralogix’s AWS Observability platform use machine learning to detect issues with little customization. Coralogix provides a method of ECS monitoring by configuring an endpoint in the ECS task definition.
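For instance, a hedged sketch of a CloudWatch alarm that notifies an SNS topic when cluster CPU utilization stays high might look like this (the topic ARN, threshold, and cluster name are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ecs-cluster-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "my-ecs-cluster"}],  # placeholder
    Statistic="Average",
    Period=300,                 # evaluate 5-minute averages...
    EvaluationPeriods=3,        # ...for 15 minutes in a row
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
)
```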

Summary

AWS provides standard metrics for monitoring ECS deployments. These metrics differ when using EC2 versus Fargate launch types. Generally, teams will need to watch for CPU usage, memory usage, and the number of running tasks in a cluster. Application-level metrics such as tracking error logs in a task need to be set up manually in CloudWatch when using AWS observability tools. 


ECS monitoring data can also be sent to third-party tools for analysis and to gain observability into your platform. Coralogix’s AWS Observability platform can track ECS metrics and alert users when issues arise.

What is eBPF and Why is it Important for Observability?

Observability is one of the most popular topics in technology at the moment, and that isn’t showing any sign of changing soon. Agentless log collection, automated analysis, and machine learning insights are all features and tools that organizations are investigating to optimize their systems’ observability. However, there is a new kid on the block that has been gaining traction at conferences and online: the Extended Berkeley Packet Filter, or eBPF. So, what is eBPF?

Let’s take a deep dive into some of the hype around eBPF, why people are so excited about it, and how best to apply it to your observability platform. 

What came out of Cloud Week 2021?

Cloud Week, for the uninitiated, is a week-long series of talks and events where major cloud service providers (CSPs) and users get together and discuss hot topics of the day. It’s an opportunity for vendors to showcase new features and releases, but this year observability stole the show.

Application Performance Monitoring

Application Performance Monitoring, or APM, is not particularly new when it comes to observability. However, Cloud Week brought a new perception of APM: using it for infrastructure. Putting both applications and infrastructure under the APM umbrella in your observability approach not only streamlines operations but also gives you top-to-bottom observability for your stack.

Central Federated Observability

Whilst we at Coralogix have been enabling centralized and federated observability for some time (just look at our data visualization and cloud integration options), it was a big discussion topic at Cloud Week. Federated observability is vital for things like multi-cloud management and cluster management, and centralizing this just underpins one of the core tenets of observability. Simple, right?

eBPF

Now, not to steal the show, but eBPF was a big hit at Cloud Week 2021. This is because its traditional use (in security engineering) has been reimagined and reimplemented to address gaps in observability. We’ll dig deeper into what eBPF is later on!

What is eBPF – an Overview and Short History

The Berkeley Packet Filter (BPF), introduced in the early 1990s, was designed to filter network packets and collect them based on predetermined rules. The filters took the form of small programs that run on an in-kernel virtual machine. However, BPF gradually became dated as 64-bit processors became the norm. So what is eBPF and how is it different?

It wasn’t until 2014 that eBPF was introduced. eBPF is aligned to modern hardware standards (64-bit registers). It’s a Linux kernel technology (version 4.x and above) that lets you bridge traditional observability and security gaps. It does this by allowing programs that assist with security and/or monitoring to run without altering the kernel source code or loading kernel modules, essentially by running a sandboxed virtual machine inside the kernel.
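To make that concrete, here is the canonical “hello world” style of eBPF program, written with the BCC Python front end; it attaches a small C program to the clone() syscall and prints a line each time the syscall fires. This is a minimal sketch: it assumes a Linux host with BCC installed and must run as root.

```python
from bcc import BPF

# The eBPF program itself is written in restricted C and compiled at runtime.
prog = r"""
int hello(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

b = BPF(text=prog)
# Resolve the kernel's syscall symbol name, which varies between kernel versions.
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
print("Tracing clone() ... hit Ctrl-C to end.")
b.trace_print()
```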

Where can you use eBPF?

As we’ve covered, eBPF isn’t brand new, but it is fairly nuanced when applied to a complex observability scenario. 

Network Observability

Network observability is fundamental for any organization seeking total system observability. Traditionally, network or SRE teams would have to deploy myriad data collection tools and agents. This is because, in complex infrastructure, organizations will likely have a variety of on-premise and cloud servers from different vendors, with different code levels and operating systems for virtual machines and containers. Therefore, every variation could need a different monitoring agent. 

Implementing eBPF does away with these complexities. By installing a program at the kernel level, network and SRE teams gain total visibility into the network operations of everything running on that particular server.
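As a small illustration of that visibility, the BCC sketch below attaches a kprobe to the kernel’s tcp_v4_connect function and logs the PID behind every outbound IPv4 TCP connection on the host, regardless of which process or container initiated it (again assuming BCC and root access):

```python
from bcc import BPF

prog = r"""
#include <net/sock.h>

int trace_connect(struct pt_regs *ctx, struct sock *sk) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;   // upper 32 bits hold the PID
    bpf_trace_printk("tcp_v4_connect() by pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")
print("Tracing outbound TCP connects ... hit Ctrl-C to end.")
b.trace_print()
```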

Kubernetes Observability

Kubernetes presents an interesting problem for observability because of the number of nodes, each with its own kernel and operating system version, that you might be running across your system. As mentioned above, this makes monitoring things like their network usage and requirements exceptionally difficult. Fortunately, there are several eBPF applications that make Kubernetes observability a lot easier.

Dynamic Network Control

At the start, we discussed how eBPF uses predetermined rules to monitor and trace things like network performance. Combine this with network observability above, and we can see how this makes life a lot simpler. However, these rules are still constants (until they’re manually changed), which can make your system slow to react to network changes.

Cilium is an open-source project that seeks to help with the more arduous side of eBPF administration: rule management. On a packet-by-packet basis, Cilium can analyze network traffic usage and requirements and automatically adjust the eBPF rules to accommodate container-level workload requirements. 

Pod-level Network Usage

eBPF can be used to carry out socket filtering at the cgroup level. So, by installing an eBPF program that monitors pod-level statistics, you can get granular information that would only normally be accessible in the /sys Linux directory. Because the eBPF program has kernel access, it can deliver more accurate information with context from the kernel.

What is eBPF best at – the Pros and Cons of eBPF for Observability

So far, we’ve explored what eBPF is and what it can mean for your system observability. Sure, it can be a great tool when utilized in the right way, but that doesn’t mean it’s without its drawbacks. 

Pro: Unintrusive 

eBPF is a very light-touch tool for monitoring anything that runs on a Linux kernel. Whilst the eBPF program sits within the kernel, it doesn’t alter any source code, which makes it a great companion for extracting monitoring data and for debugging. What eBPF is great at is enabling agentless monitoring across complex systems.

Pro: Secure

As above, because an eBPF program doesn’t alter the kernel at all, you can preserve your access management rules for code-level changes. The alternative is using a kernel module, which brings with it a raft of security concerns. Additionally, eBPF programs have a verification phase that prevents resources from being over-utilized. 

Pro: Centralized

Using an eBPF program gives you monitoring and tracing data with more granular detail and kernel context than other options. This data can easily be exported to user space and ingested by an observability platform for visualization.

Con: It’s very new

Whilst eBPF has been around since 2014, it certainly isn’t battle-tested for more complex requirements like cgroup-level port filtering across millions of pods. Whilst this is an aspiration for the open-source project, there is still some way to go.

Con: Linux restrictions 

eBPF is only available on newer versions of the Linux kernel, which could be prohibitive for an organization that is a little behind on version updates. If you aren’t running Linux kernels, then eBPF simply isn’t for you.

Conclusion – eBPF and Observability

There’s no denying that eBPF is a powerful tool, and has been described as a “Linux superpower.” Whilst some big organizations like Netflix have deployed it across their estate, others still show hesitancy due to the infancy and complexity of the tool. eBPF certainly has applications beyond those listed in this article, and new uses are still being discovered. 

One thing’s for certain, though. If you want to explore how you can supercharge your observability and security, with or without tools like eBPF, then look to Coralogix. Not only are we trusted by enterprises across the world, but our cloud and platform-agnostic solution has a range of plugins and ingest features designed to handle whatever your system throws at it. 

The world of observability is only going to get more complex and crowded as tools such as eBPF come along. Coralogix offers simplicity.

Istio Log Analysis Guide

Istio has quickly become a cornerstone of most Kubernetes clusters. As your container orchestration platform scales, Istio embeds functionality into the fabric of your cluster that makes log monitoring, observability, and flexibility much more straightforward. However, it leaves us with our next question – how do we monitor Istio? This Istio log analysis guide will help you get to the bottom of what your Istio platform is doing.

What is Istio Service Mesh?

Before we understand Istio, we’ll need to understand what a service mesh is. Imagine you have lots of applications running on your platform. Each application does something different, yet they all share a common set of problems: for example, authentication, traffic monitoring, rerouting traffic, performing seamless deployments, and so on. You could solve these problems in each application, but this would take a long time.

So you solve the problem once, and let the service mesh handle it.

Instead, a service mesh creates a fabric that sits in between every application. You can adjust your mesh centrally, and those changes will be rolled out across the fabric to your various applications. Rather than solving the problem in every application individually, your solutions sit on the service mesh and, each time you configure a change in your mesh, your applications don’t know the difference.

Istio comes with a wide variety of these common features. A few of the popular uses of Istio are:

  • Traffic management using the Istio VirtualService
  • Generation of tracing metrics
  • Intelligent network segmentation
  • Implementing cluster-wide policies for security
  • Generating consistent system logs for every application on the cluster
  • Implementing mutual TLS for encrypted traffic within the cluster

As you can imagine, software this complex has many moving parts and needs to be monitored closely to ensure it is functioning properly.

How do you monitor Istio?

The most common mode of installation for Istio in modern clusters is to use the Istio operator. This approach makes upgrading more straightforward and allows you to simply declare which version of Istio you would like, rather than having to wire up all of the underlying components.

Monitoring the Istio Operator with Istio Log Analysis

The Istio operator produces logs and metrics that give you incredibly powerful insight into its health. These logs are broken down into scopes, which split the logs by their constituent functionality and let you view specific components within the operator pods. If you wish to understand one particular sub-component of the Istio operator, you can use the scopes to query its logs separately.

Istio log analysis needs centralized logs

Istio will produce a lot of logs, and if you’re trying to parse all of them by hand, you’re going to find yourself with more information than you can work with. Centralized log analytics make it easy for you to slice and query your log information, so you can quickly gain insights into your system without drowning in log files. In short, while you’re analyzing your Istio logs, Coralogix can handle your Istio log management.

The Envoy Access Logs are your best friend

One of the greatest benefits of Istio log analysis is the insight that comes from the Envoy logs. Istio automatically injects an Envoy sidecar proxy alongside your applications that filters and analyzes traffic. This proxy can be configured to block or allow traffic, depending on what you’d like to do. Out of the box, it also provides some powerful access logs.

What can you do with logs?

Istio log analysis offers a whole new dimension of observability into your cluster. While the metrics that Istio produces are very diverse and impressive, they only give part of the story. For example, your metrics may tell you that 3% of your traffic is failing for one of your microservices, but it won’t tell you that the same IP address is the source of all of those failures. For that, you need Istio log analysis.
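As a rough sketch of the kind of analysis that surfaces such patterns, the snippet below tallies server-error responses per client IP from Envoy access logs. It assumes you have switched the mesh’s access logs to JSON encoding and that keys such as response_code and downstream_remote_address are present; key names and fields can vary with your access log format configuration, and the input file name is just an example.

```python
import json
from collections import Counter

def failing_ips(log_lines, status_min=500):
    """Count server-error responses per client IP from JSON-encoded
    Envoy access logs (key names are an assumption about your log format)."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. Envoy's own startup logs)
        code = int(entry.get("response_code", 0))
        if code >= status_min:
            # downstream_remote_address is "ip:port"; keep only the IP part.
            addr = entry.get("downstream_remote_address", "unknown")
            counts[addr.rsplit(":", 1)[0]] += 1
    return counts

# Example input file: saved sidecar logs, e.g. piped from kubectl logs.
with open("envoy-access.log") as f:
    for ip, n in failing_ips(f).most_common(10):
        print(f"{ip}: {n} failed requests")
```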

Logging provides the context to your data and gives you real-world insights that you can use to immediately tackle the problems you’re trying to solve. Orchestration platforms typically scale fast. When engineers realize how easy it is to deploy new microservices, the number of services on the cluster grows and, with that growth, comes new operational challenges. Istio log analysis provides the context you need to understand issues as they arise and respond accordingly. 

But what if I prefer to use Metrics?

Metrics have their own power, of course. The ability to perform calculations on your metric values allows you to visualize and draw correlations between disparate measurements. Fortunately, Coralogix offers the Logs2Metrics service to unlock this power. You can input your logs into Coralogix and parse values out of them. These may include error count, request count, or latency.

Dive deeper with tracing

Istio also generates powerful tracing metrics. These metrics enable you to track the full lifecycle of a request, as it moves between your applications. This is a vital component of observability when you’re working with distributed systems in a microservices architecture. When this is enabled, you’ll be able to see traffic flowing through your systems and spot problem areas. You will be able to see that a whole request, through several microservices, took 10 seconds, but 5 seconds of that was caused by latency in a single service.

Sounds great! Why don’t we enable this for everything?

The simple answer is this – tracing can impact your performance. If you have a system that needs to process millions of requests every minute, the tiny latency overhead of your tracing metrics becomes expensive. For this reason, you should seek to enable tracing for those systems that really need it and can afford the extra overhead.

Summary

Istio provides a great deal of power to the user, but also comes with its own operational challenges. Istio log analysis is an essential part of your observability stack and will provide you with the context you need to get the job done. By focusing on logs and metrics, deploying your Istio instance using the Istio operator, centralizing your log analytics, and taking advantage of tracing and proxying, you’ll be able to make full use of your service mesh and focus on the problems that really matter.

Discovering the Differences Between Log Observability and Monitoring

Log observability and log monitoring are terms often used interchangeably, but they really describe two different approaches to understanding and solving different problems.

Observability refers to the ability to understand the state of a complex system (or series of systems) without needing to make any changes or deploy new code. 

Monitoring is the collection, aggregation, and analysis of data (from applications, networks, and systems) which allows engineers to both proactively and reactively deal with problems in production.

It’s easy to see why they’re treated as interchangeable terms, as they are deeply tied to each other. Without monitoring, there would be no observability (because you need all of that data that you’re collecting and aggregating in order to gain system observability). That said, there’s a lot more to observability than passively monitoring systems in case something goes wrong.

In this article, we will examine the different elements that make up monitoring and observability and see how they overlap. 

Types of Monitoring

Monitoring is a complex and diverse field. There are a number of key elements and practices that should be employed for effective monitoring. Monitoring means looking at a series of processes: how they are conducted and whether they complete successfully and efficiently. To build your monitoring practice, you should be aware of the following types of monitoring.

Black and White Box Monitoring

Black box monitoring, also known as server-level monitoring, refers to the monitoring of specific metrics on the server such as disk space, health, CPU metrics, and load. At a granular level, this means aggregating data from network switches and load balancers, looking at disk health, and tracking many other metrics that you may traditionally associate with system administration.

White box monitoring refers more specifically to what is running on the server. This can include things like queries to databases, application performance versus user requests, and what response codes your application is generating. White box monitoring is critical for understanding application and web-layer vulnerabilities.

White and black box monitoring shouldn’t be practiced in isolation. Previously, more focus may have been given to black box or server-level monitoring. However, with the rise of the DevOps and DevSecOps methodologies, they are more frequently carried out in tandem. When using black and white box monitoring harmoniously, you can use the principles of observability to gain a better understanding of total system health and performance. More on that later!

Real-Time vs Trend Analysis

Real-time monitoring is critical for understanding what is going on in your system. It covers the active status of your environment, with log and metric data relating to things like availability, response time, CPU usage, and latency. Strong real-time analysis is important for setting accurate and useful alerts, which may notify you of critical events such as outages and security breaches. Log observability and monitoring depend heavily on real-time analysis.

Think of trend analysis as the next stage of real-time analysis. If you’re collecting data and monitoring events in your system in real-time, trend analysis is helpful for gaining visibility into patterns of events. This can be accomplished with a visualization tool, such as Kibana or native Coralogix dashboards.

Trend analysis allows organizations to correlate information and events from disparate systems which may together paint a better picture of system health or performance. Thinking back to the introduction of this piece, we can see where this might link into observability.

Performance Monitoring

Performance monitoring is pretty self-explanatory. It is a set of processes that enable you to understand either network, server, or application performance. This is closely linked to system monitoring, which may be the combination of multiple metrics from multiple sources. 

Performance monitoring is particularly important for organizations with customer-facing applications or platforms. If your customers catch problems before you do, then you risk reputational or financial impact. 

Analyzing Metrics

Good monitoring relies on the collection, aggregation, and analysis of metrics. How these metrics are analyzed will vary from organization to organization, or on a more granular level, from team to team.

There is no “one size fits all” for analyzing metrics. However, there are two powerful tools at your disposal when considering metric analysis. 

Visualization 

Data visualization is nothing particularly new. However, its value in the context of monitoring is significant. Depending on what you choose to plot on a dashboard, you can cross-pollinate data from different sources which enhances your overall system understanding.

For example, you might see on a single dashboard with multiple metrics that your response time is particularly high during a specific part of the day. When this is overlaid with network latency, CPU performance, and third-party outages, you can gain context.

Context is key here. Visualization gives your engineers the context to truly understand events in your system, not as isolated incidents, but interconnected events.

Machine Learning

The introduction of machine learning to log and metric analysis is an industry-wide game changer. Machine learning allows predictive analytics based on your current system health and status and past events. Log observability and monitoring are taken to the next level by machine learning practices.

Sifting through logs for log observability and monitoring is an often time-consuming task. However, tools like Loggregation effectively filter and promote logs based on precedent, without needing user intervention. Not only does this save time in analysis, which is particularly important after security events, but it also means your logging system stays lean and accurate.

Defining Rules

Monitoring traditionally relies on rules which trigger alerts. These rules often need to be fine-tuned over time, because setting rules to alert you of things that you don’t know are going to happen in advance is difficult.

Additionally, rules are only as good as your understanding of the system they relate to. Alerts and rules require a good amount of testing, to prepare you for each possible eventuality. While machine learning (as discussed above) can make this a lot easier for your team, it’s important to get the noise-to-signal ratio correct.

The Noise-to-Signal Ratio

This is a scientific term (backed up by a formula) that helps define the acceptable level of background noise for clear signals or, in this case, insights. In terms of monitoring, rules, and alerts, we’re talking about how many false or acceptable error messages there are in combination with unhelpful log data. Coralogix has a whole set of features that help filter out the noise, while ensuring the important signals reach their target, to help defend your log observability and monitoring against unexpected changes in data.
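For reference, the underlying formula is usually written the other way round, as a signal-to-noise ratio (SNR): the power of the signal over the power of the noise, often expressed in decibels:

\[
\mathrm{SNR} = \frac{P_{\text{signal}}}{P_{\text{noise}}},
\qquad
\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}\!\left(\frac{P_{\text{signal}}}{P_{\text{noise}}}\right)
\]

In a monitoring context, think of “signal” as actionable alerts and “noise” as false or unhelpful ones; the higher the ratio, the more trustworthy your alerting.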

From Monitoring to Observability

So what is the difference then? 

Monitoring is the harvesting and aggregation of data and metrics from your system. Observability builds on this and turns the harvested data into insights and actionable intelligence about your system. If monitoring provides visibility, then observability provides context.

A truly observable system provides all the data that’s needed in order to understand what’s going on, without the need for more data. Ultimately, an observability platform gives you the ability to see trends and abnormalities as they emerge, instead of waiting for alerts to be triggered. A cornerstone of your observability is log observability and monitoring. 

In this way, you can use marketing metrics as a diagnostic tool for system health, or even understand the human aspect of responses to outages by pulling in data from collaboration tools.

Log Observability and Monitoring

Monitoring and observability shouldn’t be viewed in isolation: the former is a precursor to the latter. Observability has taken monitoring up a few notches, meaning that you don’t need to know every question you’ll ask of your system before implementing the solution.

True observability is heterogeneous, allowing you to cross-analyze data from your Kubernetes cluster, your firewall, and your load balancer in a single pane of glass. Why? Well, you might not know why you need it yet, but the beauty of a truly observable system is that it’s there when you need to query it. 

As systems grow ever more advanced, and there are increasing numbers of variables in play, a robust observability platform will give you the information and context you need to stay in the know.

Adding Observability to Your CI/CD Pipeline in CircleCI

The simplest CI/CD pipeline consists of three stages: build, test, and deploy.

In modern software systems, it is common for several developers to work on the same project simultaneously. Siloed working with infrequent merging of code in a shared repository often leads to bugs and conflicts that are difficult and time-consuming to resolve. To solve this problem, we can adopt continuous integration.  

Continuous integration is the practice of writing code in short, incremental bursts and pushing it to a shared project repository frequently so that automated build and testing can be run against it. This ensures that when a developer’s code gets merged into the overall project codebase, any integration problems are detected as early as possible. The automatic build and testing are handled by a CI server.

If passing the automated build and testing results in code being automatically deployed to production, that is called continuous deployment. 

All the sequential steps that need to be automatically executed, from the moment a developer commits a change to the moment it ships to production, are referred to as a CI/CD pipeline. CI/CD pipelines can range from very simple to very complex, depending on the needs of the application.

Important considerations when developing CI/CD pipelines

Building a CI/CD pipeline is no simple task. It presents numerous challenges, some of which include:

Automating the wrong processes

The whole premise of CI/CD is to increase developer productivity and optimize time-to-market. This goal gets defeated when the CI/CD pipeline has many steps in it that aren’t necessary or that could be done faster manually. 

When developing a CI/CD pipeline, you should:

  • consider how long a task takes to perform manually and whether it is worth automating
  • evaluate all the steps in the CI/CD pipeline and only include those that are necessary
  • analyze performance metrics to determine whether the pipeline is improving productivity
  • understand the technologies you are working with and their limitations as well as how they can be optimized so that you can speed up the build and testing stages.

Ineffective testing

Tests are written to find and remove bugs and ensure that code behaves in the desired manner. You can have a great CI/CD pipeline in place but still get bug-ridden code in production because of poorly written, ineffective tests. 

To improve the effectiveness of a CI/CD pipeline, you should:

  • write automated tests during development, ideally by practicing test-driven development (TDD)
  • examine the tests to ensure that they are of high quality and suitable for the application
  • ensure that the tests have decent code coverage and cover all the appropriate edge cases

Lack of observability in CI/CD pipelines

Continuous integration and continuous deployment underpin agile development. Together they ensure that features are developed and released to users quickly while maintaining high quality standards.  This makes the CI/CD pipeline business-critical infrastructure.

The more complex the software being built, the more complex the CI/CD pipeline that supports it. What happens when one part of the pipeline malfunctions? How do you discover an issue that is causing the performance of the CI/CD pipeline to degrade?

It is important that developers and the platform team are able to obtain data that answers these critical questions right from the CI/CD pipeline itself so that they can address issues as they arise.

Making a CI/CD pipeline observable means collecting quality and performance metrics on each stage of the CI/CD pipeline and thus proactively working to ensure the reliability and optimal performance of this critical piece of infrastructure. 

Quality metrics

Quality metrics help you identify how good the code being pushed to production is. While the whole premise of a CI/CD pipeline is to increase the speed at which software is shipped to get fast feedback from customers, it is also important to not be shipping out buggy code.

By tracking things like test pass rate, deployment success rate, and defects escape rate you can more easily identify where to improve the quality of code being produced. 

Productivity metrics

An effective CI/CD pipeline is a performant one. You should be able to build, test, and ship code as quickly as possible. Tracking performance-related metrics can give you insight into how performant your CI/CD pipeline is and enable you to identify and fix any bottlenecks causing performance issues.

Performance-based metrics include time-to-market, defect resolution time, deployment frequency, build/test duration, and the number of failed deployments. 

Observability in your CI/CD pipeline

The first thing needed to make a CI/CD pipeline observable is to use the right observability tool. Coralogix is a stateful streaming analytics platform that allows you to:

The observability tool you choose can then be configured to track and report on the observability metrics most pertinent to your application.

When an issue is discovered, the common practice is to have the person who committed the change that caused the issue investigate the problem and find a solution. The benefit of this approach is that it gives team members a sense of complete end-to-end ownership of any task they take on, as they have to ensure it gets shipped successfully.

Another good practice is to conduct a post-mortem reviewing the incident to identify what worked to resolve it and how things can be done better next time. The feedback from the post-mortem can also be used to identify where the CI/CD pipeline can be improved to prevent future issues.

Example of a simple CircleCI CI/CD pipeline

There are a number of CI servers you can use to build your CI/CD pipeline. Popular ones include Jenkins, CircleCI, GitLab, and the newcomer GitHub Actions.

Coralogix provides integrations with CircleCI, Jenkins, and GitLab that enable you to quickly and easily send logs and metrics to Coralogix from these platforms.

The general principle of most CI servers is that you define your CI/CD pipeline in a yml file as a workflow consisting of sequential jobs. Each job defines a particular stage of your CI/CD pipeline and can consist of multiple steps. 

An example of a CircleCI CI/CD pipeline for building and testing a python application is shown in the code snippet below.

To add a deploy stage, you can use any one of the deployment orbs CircleCI provides. An orb is simply a reusable configuration package CircleCI makes available to help simplify your deployment configuration. There are orbs for most of the common deployment targets, including AWS and Heroku. 

The completed CI/CD pipeline with deployment to Heroku is shown in the code snippet below.

Having created this CI/CD pipeline you would think that you are done, but in fact, you have only done half the job. The above CI/CD pipeline is missing a critical component to make it truly effective: observability. 

Making the CI/CD pipeline observable

Coralogix provides an orb that makes it simple to instrument your CircleCI CI/CD pipeline. This enables you to send pipeline data to Coralogix in real time for analysis of the health and performance of your pipeline.

The Coralogix orb provides four endpoints:

  • coralogix/stats for sending the final report of a workflow job to Coralogix
  • coralogix/logs for sending the logs of all workflow jobs to Coralogix for debugging
  • coralogix/send for sending 3rd party logs generated during a workflow job to Coralogix
  • coralogix/tag for creating a tag and a report for the workflow in Coralogix

To add observability to your CircleCI pipeline:

  1. In your Coralogix account, go ahead and enable Pipelines by navigating to Project Settings -> Advanced Settings -> Pipelines and turn it on
  2. Add the Coralogix orb stanza at the top of your CircleCI configuration file
  3. Use the desired Coralogix endpoint in your existing pipeline

The example below shows how you can use Coralogix to debug a CircleCI workflow. Adding the coralogix/logs job at the end of the workflow means that all the logs generated by CircleCI during the workflow will be sent to your Coralogix account, which will allow you to debug all the different jobs in the workflow. 

Conclusion

CI/CD pipelines are a critical piece of infrastructure. By making your CI/CD pipeline observable you turn it into a source of real-time actionable insight into its health and performance.

Observability of CI/CD pipelines should not come as an afterthought but rather something that is incorporated into the design of the pipeline from the outset. Coralogix provides integrations for CircleCI and Jenkins that make it a reliable partner for introducing observability to your CI/CD pipeline.

How Netflix Uses Fault Injection To Truly Understand Their Resilience

Distributed systems such as microservices have defined software engineering over the last decade. The majority of advancements have been in increasing resilience, flexibility, and rapidity of deployment at increasingly larger scales.

For streaming giant Netflix, the migration to a complex cloud-based microservices architecture would not have been possible without a revolutionary testing method known as fault injection.

With tools like chaos monkey, Netflix employs a cutting-edge testing toolkit. Chaos engineering is quickly becoming a staple of many site reliability engineering (SRE) strategies. Looking at Netflix’s capabilities and its understanding of its own systems (and the faults they can and can’t tolerate), it is easy to see why.

What Is Fault Injection – Chaos Testing And Experimentation

The goal of fault injection is increasing system resiliency by deliberately introducing faults into your architecture. By experimenting using failure scenarios that are likely to be encountered, but in a controlled environment, it becomes clear where efforts need to be focused to ensure adequate quality of service. Most importantly, it becomes clear before the fact. It is better to find out, for example, that your new API gateway can’t handle traffic spikes in pre-production rather than production.

Fault injection is one way Netflix maintained an almost uninterrupted service during sharp spikes in usage. By stress testing their components against traffic spikes and other faults using fault injection, Netflix’s systems were able to take the strain without service wide outage.

Confidence vs Complexity

Adopting distributed systems has led to an increase in the complexity of the systems we work with. It is difficult to have full confidence in our microservices architecture when there are many components that could potentially fail.

Even the best QA testing can’t predict every real-world deployment scenario. Even when every service is functioning properly, unpredictable outcomes still occur. Interactions between healthy services can be disrupted in unimagined ways by real-world events and user input.

Netflix has mitigated this using chaos engineering and fault injection. By deliberately injecting failure into its ecosystem, Netflix is able to test the resilience of its systems and learn from where they fail. Using a suite of tools creatively dubbed the Simian Army, Netflix maintains an application-based service built on a microservice architecture that is complex, scalable, and at the same time resilient.

A Tolerant Approach

Netflix’s approach to infrastructure when migrating to the cloud was pioneering. For Netflix, a microservices cloud architecture (in AWS) had to be centered around fault tolerance. 100% uptime is an impossible expectation for any component. Netflix wanted an architecture in which a single component could fail without affecting the whole system.

Designing a fault tolerant architecture was a challenge. However, system design was only one part of Netflix’s strategy. It was implementing tools and processes to test these systems’ efficacy in worst case scenarios which gave Netflix the edge. Their highly resilient, fault tolerant system would not exist today without them.

Key Components

Netflix has several components within their architecture which they consider key to their system resilience.

Preventing Cascading Failure

Hystrix, for example, is a latency and fault tolerance library. Netflix utilizes it to isolate points of access to remote systems and services. In their microservices architecture Hystrix enables Netflix to stop cascading failure, keeping their complex distributed system resilient by ensuring errors remain isolated.
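Hystrix itself is a Java library, but the circuit-breaker pattern it popularized is easy to sketch. The minimal Python example below (illustrative only, not Netflix’s implementation) fails fast once a dependency has produced too many consecutive errors, which is how cascading failure gets contained:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: trip after N consecutive failures,
    reject calls while open, and allow a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

In practice you would wrap each remote dependency call, e.g. breaker.call(fetch_recommendations, user_id) for some hypothetical client function, so a failing dependency is cut off quickly instead of tying up every caller.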

Open Source, Scalable Memcache Caching

Then there’s EVCache, an extremely scalable memcached-based caching solution. Developed internally by Netflix, EVCache has played a critical role in tracing requests during fault injection testing.

There are, of course, many more, and the list of resilience-building components in the Netflix stack is always growing. Most are open source and developed in-house by Netflix. However, when it comes to chaos engineering, fault injection, and using chaos testing to validate systems, it is the simian army project that shines through as Netflix’s key achievement.

The Chaos Monkey

While tools like Hystrix and EVCache improve resilience by enabling visibility and traceability during failure scenarios, it is the simian army toolbox Netflix relies on to carry out fault injection testing at scale. The first tool in the box, chaos monkey, embodies Netflix’s approach to chaos engineering and fault injection as a testing method.

Monitored Disruption

Chaos monkey randomly disables production instances. This can occur at any time of day, although Netflix does ensure that the environment is carefully monitored. Engineers are at the ready to fix any issues the testing may cause outside of the test environment. To the uninitiated this may seem like a high risk, but the rewards justify the process.
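The core idea is simple enough to sketch. The toy example below (not Netflix’s actual tooling; the Auto Scaling group name and region are placeholders) terminates one random in-service EC2 instance and relies on the group to replace it:

```python
import random
import boto3

def terminate_random_instance(asg_name: str, region: str = "us-east-1"):
    """Chaos-Monkey-style experiment: kill one random in-service instance
    from an Auto Scaling group and let the group replace it.
    (Toy sketch only -- not Netflix's implementation.)"""
    autoscaling = boto3.client("autoscaling", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    instances = [
        i["InstanceId"]
        for g in groups
        for i in g["Instances"]
        if i["LifecycleState"] == "InService"
    ]
    if not instances:
        return None

    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])  # the ASG should replace it
    return victim
```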

Automating Recovery

Experiments with chaos monkey enable engineers at Netflix to build recovery mechanisms into their architecture. This means that when the tested scenario occurs in real life, components in the system can rely on automated processes to resume operation as quickly as possible.

Critical insight gained using fault injection allows engineers to accurately anticipate how components in the complex architecture will respond during genuine failure, and build their systems to be resilient to the impacts of these events.

The Evolution Of Chaos

The success of chaos monkey in increasing the fault tolerance of Netflix systems quickly led to further development of the fault injection testing method. Soon, tools like chaos gorilla were implemented. Chaos gorilla works in a similar fashion to chaos monkey; however, rather than generating outages of individual instances, chaos gorilla takes out an entire Amazon availability zone.

Large Scale Service Verification

Netflix aimed to verify there was no user-visible impact if a functional availability zone went dark. The only way to ensure services re-balance automatically without manual interference was to deliberately instigate that scenario.

From Production To Server

Server-side fault injection was also implemented. Latency monkey was developed to simulate service degradation in the RESTful communication layer. It does this by deliberately creating delays and failures on the service side of a request. Latency monkey then measures whether upstream services respond as anticipated.

Latency monkey provides Netflix with insight into the behavior of its calling applications. How calling applications will respond to a dependency slowdown is no longer an unknown. Using latency monkey, Netflix’s engineers can build their systems to withstand network congestion and thread pile-up.

Given the nature of Netflix’s key product (a streaming platform), server side resilience is integral to delivering consistent quality of service.

Fault Injection As A Platform

Netflix operates an incredibly large and complex distributed system. Its cloud-based AWS microservices architecture contains an ever-increasing number of individual components. Not only does this mean there is always a need to test and reinforce system resiliency, it also means that doing so at a wide enough scale to have a meaningful impact on the ecosystem becomes ever more difficult.

To address this, Netflix developed the FIT platform. FIT (Failure Injection Testing) is Netflix’s solution to the challenge of propagating deliberately induced faults across the architecture consistently. It is by utilizing FIT that automated fault injection in Netflix has gone from an application used in isolated testing to a commonplace occurrence. Using FIT, Netflix can run chaos exercises across large chunks of the system, and engineers can access tests as a self service.

Netflix found that its methods of deliberately introducing faults into the system weren’t without risk or impact. The FIT platform limits the impact of fault and failure testing on adjacent components and the wider system. However, it does so in a way that still allows serious faults, paralleling those encountered during actual runtime, to be introduced.

Confident Engineers Build Better Systems

Using chaos engineering principles and techniques, Netflix’s engineers have built a highly complex, distributed cloud architecture they can be confident in.

With every component built to fail at some point, failure is no longer an unknown. It is the uncertainty of failure more than anything which fills engineers with dread when it comes to the resilience of their systems.

Netflix’s systems are proven to withstand all but the most catastrophic failures. The engineers employed by Netflix can all be highly confident in their resilience. This means they will be more creative in their output, more open to experimentation, and spend less of their energy worrying about impending service-wide failure.

The benefits of Netflix’s approach are numerous, and their business success is testament to its effectiveness. Using chaos engineering and fault injection, Netflix maintains an application-based service built on a microservice architecture that is complex, scalable, and at the same time resilient.

Why Your Mean Time to Repair (MTTR) Is Higher Than It Should Be

Mean time to repair (MTTR) is an essential metric that represents the average time it takes to repair and restore a component or system to functionality. It is a primary measurement of the maintainability of an organization’s systems, equipment, applications and infrastructure, as well as its efficiency in fixing that equipment when an IT incident occurs.
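As a formula, MTTR is simply the total time spent repairing a system divided by the number of repairs in the same period:

\[
\text{MTTR} = \frac{\text{total time spent on repairs}}{\text{number of repairs}}
\]

For example, four incidents that took a combined eight hours to resolve give an MTTR of two hours.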

Key challenges with MTTR arise from just trying to figure out that there is actually a problem. Incorrect diagnosis or inadequate repairs can also lengthen MTTR. A low MTTR indicates that a component or service of a distributed system can be repaired quickly and, consequently, that any IT issues associated with it will probably have a less significant impact on the business. 

Challenges With Mean Time To Repair (MTTR)

The following section describes some of the challenges of managing MTTR. In essence, it shows that a high MTTR for an application, device, or system failure can result in a significant service interruption and thus a significant business impact.

Here are 6 common issues that contribute to a high (i.e. poor) MTTR:

1. Lack of Understanding Around Your Incidents

To start reducing MTTR, you need to better understand your incidents and failures. Modern enterprise software can help you automatically unite your siloed data to produce a reliable MTTR metric and valuable insights about contributing factors.

By measuring MTTR, you accept that sometimes things will go wrong. It is just a part of development. Once you’ve accepted that the development process is about continuously improving, analyzing, and collecting feedback, you will realize that MTTR leads to better things, such as faster feedback mechanisms, better logging, and processes that make recovery as simple as deployment.

Having a robust incident management action plan allows an organization and its development teams to have a clear escalation policy that explains what to do if something breaks. The plan will define who to call, how to document what is happening, and how to set things in motion to solve the problem.

It will cover a chain of events that begins with the discovery of an application or infrastructure performance issue, and that ends with learning as much as possible about how to prevent issues from happening again. Thus covering every aspect of a solid strategy for reducing MTTR. 

2. Low-Level Monitoring

A good monitoring solution will provide you with a continuous stream of real-time data about your system’s performance. It is usually presented in a single, easy-to-digest dashboard interface. The solution will alert you to any issues as they arise and should provide credible metrics.

Having proper visibility into your applications and infrastructure can make or break any incident response process.

Consider an example of a troubleshooting process without monitoring data. A server hosting a critical application goes down, and the only ‘data’ available to diagnose the problem is the lack of a power source on the front of the server. An incident response team is forced to diagnose and solve the problem with a heavy amount of guesswork. This leads to a long and costly repair process and a high MTTR.

If you have a monitoring solution with real-time monitoring data flowing from the application, server, and related infrastructure, the situation changes drastically. It gives an incident response team an accurate read on server load, memory and storage usage, response times, and other metrics. The team can formulate a theory about what is causing a problem and how to fix it using hard facts rather than guesswork.

Response teams can use this monitoring data to assess the impact of a solution as it is being applied, and to move quickly from diagnosing to resolving an incident. This is a powerful one-two combination, making monitoring perhaps the single most important way to promote an efficient and effective incident resolution process and reduce MTTR.

3. Not Having an Action Plan

When it comes to maintaining a low MTTR, there’s no substitute for a thorough action plan. For most organizations, this will require a conventional ITSM (Information Technology Service Management) approach with clearly delineated roles and responses.

Whatever the plan, make sure it clearly outlines whom to notify when an incident occurs, how to document the incident, and what steps to take as your team starts working to solve it. This will have a major impact on lowering the MTTR.

An action plan needs to follow an incident management policy or strategy. Depending on the dynamics of your organization this can include any of the following approaches.

Ad-hoc Approach

Smaller agile companies typically use this approach. When an incident occurs, the team figures out who knows that technology or system best and assigns a resource to fix it.

Fixed Approach

This is the traditional ITSM approach often used by larger, more structured organizations. Information Technology (IT) is generally in charge of incident management in this kind of environment.

Change management concerns are paramount, and response teams must follow very strict procedures and protocols. In this case, structure is not a burden; it is a benefit.

Fluid Approach

Responses are shaped to the specific nature of individual incidents, and they involve significant cross-functional collaboration and training to solve problems more efficiently. The response processes will continuously evolve over time. A fluid incident response approach allows organizations to channel the right resources and to call upon team members with the right skills, to address situations in which it is often hard to know at first exactly what is happening.

Integrating a cloud-based log management service into an incident management strategy enables any team to resolve incidents with more immediacy. During an incident, response teams will be able to solve a problem under time pressure without having to work differently from their day-to-day activities.

4. Not Having an Automated Incident Management System

An automated incident management system can send multi-channel alerts via phone calls, text messages, and emails to all designated responders at once. This saves significant time that would otherwise be wasted locating and manually contacting each person individually.

An automated incident management system used for monitoring also gives you visibility into your infrastructure that helps you diagnose problems more quickly and more accurately.

For example, having real-time data on the volume of a server’s incoming queries and how quickly the server is responding to them will better prepare you to troubleshoot an issue when that server fails. Data also allows you to see how specific actions to repair system components are impacting system performance, allowing you to apply an appropriate solution more quickly.

A new set of technologies has emerged in the past few years that enables incident response teams to harness Artificial Intelligence (AI) and Machine Learning (ML) capabilities, so they can prevent more incidents and respond to them faster. 

These capabilities analyze the data generated by software systems to predict possible problems, determine root causes, and drive automation to fix them. This complements your monitoring practices by providing an intelligent feed of incident information alongside your telemetry data. When you analyze and act on that information, you will be better prepared for troubleshooting and incident resolution.

5. Not Creating Runbooks

As you develop incident response procedures and establish monitoring and alerting practices, be sure to document them and if possible ‘automate’ them using an incident management runbook automation tool.

Automating the process allows you to execute runbooks and automated tasks for faster, more repeatable and consistent problem resolution. When configured and enabled, you can associate runbooks with a process that tells incident response team members exactly what to do when a specific problem occurs. 

Use runbooks to collect the response team’s knowledge about a given incident-response scenario in one place. In addition to helping you reduce MTTR, runbooks are useful for training new team members, and they are especially helpful when important members of the team leave the organization.

The idea is to use a runbook as a starting point. It saves time and energy when dealing with known issues, allowing the team to focus on the most challenging and unique aspects of a problem.

6. Not Designating Response Teams and Roles

Clearly defined roles and responsibilities are crucial for effectively managing incident response and lowering MTTR. This includes the definition of roles for Incident Management, First and Second line support.

When constructing an incident response team, be sure it has a designated leader who oversees incident response, ensures strong communication with stakeholders within and outside the team, and makes sure all team members are clear on their responsibilities.

The incident team lead is responsible for directing both the engineering and communication responses. The latter involves engagement with customers, both to gather information and to pass along updates about the incident and the response to it. The incident team lead must make sure that the right people are aware of the issue.

Each incident may also require a technical lead who reports to the incident team lead. The technical lead typically dictates the specific technical response to a given incident. They should be an expert on the system(s) involved in an incident, allowing them to make informed decisions and to assess possible solutions so they can speed resolution and optimize the team’s MTTR performance.

Another important role that an incident may require is a communications lead. The communications lead should come from a customer service team. This person understands the likely impact on customers and shares these insights with the incident team lead. At the same time, as information flows in the opposite direction, the communications lead decides the best way to keep customers informed of the efforts to resolve the incident.

7. Not Training Team Members For Different Roles

Having focused knowledge specialists on your incident response team is invaluable. However, if you rely solely on these specialists for relatively menial issues, you risk overtaxing them, which can diminish the performance of their regular responsibilities and eventually burn them out. It also handcuffs your response team if that specialist simply is not around when an incident occurs.

It makes sense to invest in cross-training for team members, so they can assume multiple incident response roles and functions. Other members of the team should build enough expertise to address most issues, allowing your specialists to focus on the most difficult and urgent incidents. Comprehensive runbooks can be a great resource for gathering and transferring specialized technical knowledge within your team.

Cross-training and knowledge transfer also help you avoid one of the most dangerous incident response risks: a situation in which one person is the only source of knowledge for a particular system or technology. If that person goes on vacation or abruptly leaves the organization, critical systems can turn into black boxes that nobody on the team has the skills or knowledge to fix.

You ultimately lower your MTTR by making sure all team members have a deep understanding of your system and are trained across multiple functions and incident-response roles. Your team will be positioned to respond more effectively no matter who is on call when a problem emerges.

Summary

While MTTR is not a magic number, it is a strong indicator of an organization’s ability to quickly respond to and repair potentially costly problems. Given the direct impact of system downtime on productivity, profitability, and customer confidence, an understanding of MTTR and the factors that drive it is essential for any technology-centric company.

You can mitigate the challenges identified and ensure a low MTTR by making sure all team members have a deep understanding of your systems and are trained across multiple functions and incident-response roles. Your team will then be positioned to respond effectively no matter who is on call when a problem emerges.

Key Differences Between Observability and Monitoring – And Why You Need Both

Monitoring is asking your system questions about its current state. Usually these questions are performance related, and there are many open source enterprise monitoring tools available, many of them specialized. There are tools catering specifically to application monitoring, cloud monitoring, container monitoring, network infrastructure… the list is endless, regardless of the languages and tools in your stack.

Observability is taking the data from your monitoring stack and using it to ask new questions of your system. Examples include finding and highlighting key problem areas or identifying parts of your system that can be optimized. A system performance management stack built with observability as a focus enables you to then apply the answers to those questions in the form of a refined monitoring stack.

Observability and Monitoring are viewed by many as interchangeable terms. This is not the case. While they are interlinked, they are not interchangeable. There are actually very clear and defined differences between them.

Why You Need Both Observability and Monitoring

You can see why Observability and Monitoring are so often grouped together. When done properly, your Observability and Monitoring stacks operate in a fluid cycle: your Monitoring stack provides you with detailed system data, your Observability stack turns that data into insights, and those insights are in turn applied to your Monitoring stack so it produces better data.

What you gain from this process is straightforward – visibility.

System visibility in any kind of modern software development is absolutely vital. With development life-cycles increasingly defined by CI/CD, containerization, and complex architectures like microservices, contemporary engineers are tasked with keeping watch over an unprecedented number of potential sources of error.

The best Observability tools in the world can provide little to no useful analysis if they’re not being fed by data from a solid Monitoring stack. Conversely, a Monitoring stack does little but clog up your data storage with endless streams of metrics if there’s no Observability tooling present to collect and refine the data.

It is only by combining the two that system visibility reaches a level which can provide a quick picture of where your system is and where it could be. You need this dual perspective, too, as in today’s forever advancing technical landscape no system is the same for long.

More Progress, More Problems

Without strong Observability and Monitoring, you could be only 1-2 configurations or updates away from a system-wide outage. This only becomes more true as your system expands and new processes are implemented. Growth brings progress, and in turn, progress brings further growth.

From a business perspective this is what you want. If your task is to keep the tech your business relies on running smoothly, however, growth and progress mean more systems and services to oversee. They also mean change. Not only will there be more to observe and monitor, but it will be continually different. The stack your enterprise relies on in the beginning will be radically different five years into operation.

Understanding where monitoring and observability differ is essential if you don’t want exponential growth and change to cause huge problems in the future of your development life-cycle or BAU operations.

Only through an established and healthy monitoring practice can your observability tools provide the insights needed to preempt how impending changes and implementations will affect the overall system. Conversely, only with solid observability tooling can you ensure the metrics you’re tracking are relevant to the current state of the system infrastructure.

A Lesson From History

Consider the advent of commercially available cloud computing and virtual networks. Moving over to Azure or AWS brought many obvious benefits. However, for some, it also brought a lot of hurdles and headaches.

Why?

Their Observability and Monitoring stacks were flawed. Some weren’t monitoring those parts of their system which would be put under strain from the sudden spike in internet usage, or were still relying on Monitoring stacks built at a time when most activity happened on external servers. Others refashioned their monitoring stacks accordingly but, due to lack of Observability tools, had little-to-no visibility over parts of their system that extended into the cloud.

Continuous Monitoring/Continuous Observability

DevOps methodology has spread rapidly over the last decade. As such, the implementation of CI/CD pipelines has become synonymous with growth and scaling. There are many advantages to Continuous Integration/Continuous Delivery, but at their core, CI/CD pipelines are favored because they enable a rapid-release approach to development.

Rapid release is great from a business perspective. It allows the creation of more products, faster, which can be updated and re-released with little-to-no disruption to performance. On the technical side, this means constant changes. If a CI/CD pipeline process isn’t correctly monitored and observed then it can’t be controlled. All of these changes need to be tracked, and their implications on the wider system need to be fully understood.

Rapid Release – Don’t Run With Your Eyes Shut

There are plenty of continuous monitoring/observability solutions on the market. Investing in specific tooling for this purpose in a CI/CD environment is highly recommended. In complex modern development, Observability and Monitoring mean more than simply tracking APM metrics.

Metrics such as how many builds are run, and how frequently, must be tracked. Vulnerability becomes a higher priority as more components equal more weak points. The pace at which vulnerability-creating faults are detected must keep up with the rapid deployment of new components and code. Preemptive alerting and real-time traceability become essential, as your development life-cycle inevitably becomes tied to the health of your CI/CD pipeline.

These are just a few examples of Observability and Monitoring challenges that a high-frequency CI/CD environment can create. The ones which are relevant to you will entirely depend on your specific project. These can only be revealed if both your Observability and Monitoring stacks are strong. Only then will you have the visibility to freely interrogate your system at a speed that keeps up with quick-fire changes that rapid-release models like CI/CD bring.

Back to Basics

Complex system architectures like cloud-based microservices are now common practice, and rapid release automated CI/CD pipelines are near industry standards. With that, it’s easy to overcomplicate your life cycle. Your Observability and Monitoring stack should be the foundation of simplicity that ensures this complexity doesn’t plunge your system into chaos.

While there is much to consider, it doesn’t have to be complicated. Stick to the fundamental principles. Whatever shape your infrastructure takes, as long as you are able to maintain visibility and interrogate your system at any stage in development or operation, you’ll be able to reduce outages and predict the challenges any changes bring.

The specifics of how this is implemented may be complex. The goal, the reason you’re implementing them, is not. You’re ensuring you don’t fall behind change and complexity by creating means to take a step back and gain perspective.

Observability and Monitoring – Which Do You Need To Work On?

As you can see, Observability and Monitoring both need to be considered individually. While they work in tandem, each has a specific function. If the function of one is substandard, the other will suffer.

To summarize, monitoring is the process of harvesting quantitative data from your system. This data takes the form of many different metrics (queries, errors, processing times, events, traces, etc.). Monitoring is asking questions of your system. If you know what you want to know, but your data isn’t providing the answer, it is your Monitoring stack that needs work.

Observability is the process of transforming collected data into analysis and insight. It is the process of using existing data to discover and inform what changes may be needed and which metrics should be tracked in the future. If you are unsure what it is you should be asking when interrogating your system, Observability should be your focus.

Remember, as with all technology, there is never a ‘job done’ moment when it comes to Observability and Monitoring. It is a continuous process, and your stacks, tools, platforms, and systems relating to both should be constantly evolving and changing. This piece lists but a few of the factors software companies of all sizes should be considering during the development life-cycle.

System Traceability: What is It and How Can You Implement It?

System traceability is one of the three pillars of the observability stack. The basic operations of observability include logging, monitoring, tracing, and displaying metrics.

Tracing is intuitively useful: identify specific points in an application, proxy, framework, library, runtime, middleware, or anything else in the path of a request that represent either ‘forks’ in execution flow or a hop or fan-out across network or process boundaries.

As tracing is a major component of monitoring, it is becoming even more useful in modern system design built on microservice architectures. This means the role of tracing has evolved to follow a more distributed pattern.

Key Pillars of System Traceability

The Purpose of System Tracing in the Observability Stack

The ‘Observability Stack’ helps developers understand multi-layered architectures. You need to understand what is slow and what is not working. Your observability stack is there to do just that.

Tracing is most common in a microservices environment. Although less common, any sufficiently complex application can benefit from the advantages that tracing provides. When your architecture is distributed, it can be difficult to work out the overall latency of a request. Tracing solves this problem.

Traces are a critical part of observability. They provide context for other telemetry. Traces help define which metrics would be most valuable in a given situation, or which logs are most relevant.

The Challenges of System Traceability

Tracing is, by far, the hardest to retrofit into an existing infrastructure, because for tracing to work, every component needs to comply. In modern microservice applications, this is especially true when dealing with multiple connected components. 

Metrics and Logs for your Microservices

Increasing use of microservices is introducing additional complexity from a system monitoring perspective. Metrics fail to connect the dots across all the services, and this is where distributed tracing shines. 

Minimize your Overhead

With distributed tracing, a core challenge is to minimize the overhead during any collection process. Creating a trace, propagating this, and storing additional metadata can cause issues for the application. If it is already busy, the addition of this new logic may be enough to impact performance. 

TIP: A typical approach to mitigate this is to sample the traces. For example, only instrument one in one thousand requests. Consider the volume of your traffic and what a representative sample might be.

Application-Level Code

A further problem with tracing instrumentation is that it is usually not enough for developers to instrument only their own code. Many applications are built using open source frameworks that might require additional instrumentation. Tracing is most successfully deployed in organizations that are consistent in their use of frameworks.

Distributed Tracing

Modern microservice architectures bring advantages to application development, but at the cost of reduced visibility. Distributed tracing can provide end-to-end visibility and reveal service dependencies, showing how the services respond to each other. By comparing anomalous traces against well-performing ones, you can see the differences in behavior, structure, and timing. This information helps you identify the culprit behind the observed symptoms and jump straight to the performance bottlenecks in your systems.

Trace Latency Across your Entire System

Distributed tracing is a critical component of observability in connected systems and focuses on performance monitoring and troubleshooting. Your individual components can easily report how long their API calls took, but traceability aggregates and stores each of these values. This enables your teams to think about latency in a holistic way.

Understanding the performance of every service enables your development teams to quickly identify and resolve issues.

System Traceability is Essential

Modern cloud-based services and cloud-based log management solutions need to embed tracing with the logs. It gives you a way to understand what has happened and, perhaps more importantly, why it happened.

This is an effective way for development and DevOps teams to understand what caused an issue and how to fix it efficiently, which ultimately makes them much faster. Traceability has become popular because of its effectiveness. In the world of microservices, what we gain in flexibility, we lose in visibility. Traceability allows us to reclaim that visibility and monitor with confidence.

Are Your Customers Catching Your Software Problems For You?

Software problems have a very real impact on your bottom line. Availability and quality are the biggest differentiators when people opt for a service or product today. You should be aware of the cost of your customers alerting you to your own problems. To make sure you don’t become an organization known for its bugs, you must understand the organizational changes required to deliver a stable service. If, as Capers Jones tells us, at best only around 85% of bugs are caught pre-release, it’s important to differentiate yourself with the service you provide.

The Problem

It’s simple to understand why you don’t want unknown bugs to go out to your customers in a release. To understand the full impact, though, you need to define what it costs to commit problematic code to a release.

Problem 1: Your Customers’ Perception

No one wants to buy a tool full of software problems. You open yourself up to reputation risks: poor reviews, client churn, lack of credibility. You gain a name for buggy releases. This has three very tangible costs to your business. First, your customers will cease to use your product or service. Second, any new customers will become aware of your pitfalls sooner or later. Lastly, it can have a negative impact on staff morale and direction; you run the risk of losing your key people.

Problem 2: The Road to Recovery

Once a customer makes you aware of a bug, you must fix it (or deal with the problem above). The cost of doing this post-production is compounded by the time it takes for you to detect the problem, or MTTD (mean time to detect). In the 2019 State of DevOps Report, surveyed “Elite” performing businesses took on average one hour or less to deliver a service restoration or fix a bug, against up to one month for “Low” performing businesses in the same survey. The problem compounds with time: the longer it takes to detect the problem, the more time it takes for your developers to isolate, troubleshoot, fix, and then patch. Of all those surveyed in the 2019 State of DevOps Report, the top performers were at least twice as likely to exceed their organizational SLAs for feature fixes.

Problem 3: Releasing Known Software Problems

Releases often go out to customers with “known software problems” in them. These are errors that have been assessed to have little impact, and therefore are unlikely to affect the general release. However, this is just the coding and troubleshooting you’ll have to do later down the line, because you wanted to make a release on time. This notion of technical debt isn’t something new, but with many tech companies doing many releases per day, the compounded work that goes into managing known errors is significant.

The Solution

Organizations can take concrete steps to deliver more stable releases to their customers. Analysis indicates that a number of practices can greatly enhance your stability.

Solution 1: What Your Team Should be Doing to Limit Software Problems

Revisiting the State of DevOps Report, we can see that delivering fast fixes (a reduced MTTD) depends on two important factors within your team.

Test automation is the “gold standard” when it comes to validating code for release. It positively impacts continuous integration and continuous deployment, and automated deployment enhances these efficiencies.

“With automated testing, developers gain confidence that a failure in a test suite denotes an actual failure”

However, testing isn’t the whole solution. Errors will still make it through, and you need to know when they do.

Solution 2: How Coralogix Can Help Detect Software Problems

Coralogix’s advanced unified UI allows the pooling of log data from applications, infrastructure, and networks in one simple view. This allows your developers to better understand the impact of releases on your system and helps them spot bugs early on. Both are critical in reducing your time to repair, which leads to direct savings for your organization.

Coralogix also provides advanced solutions to flag “known errors”, so that if they do go out for release, they aren’t just consigned to a future fix pile. By stopping known errors from slipping through the cracks, you are actively minimizing your technical debt. This increases your dev team’s efficiency.

Lastly, Loggregation uses machine learning to benchmark your code’s performance, building an intelligent baseline that identifies errors and anomalies faster than anyone, even the most eagle-eyed of customers.

Force Multiply Your Observability Stack with a Platform Thinking Strategy

Platform thinking is a term that has spread throughout the business and technology ecosystem. But what is platform thinking, and how can a platform strategy force multiply the observability capabilities of your team?

Platform thinking is an evolution from the traditional pipeline model. In this model, we have the provider/producer at one end and the consumer at the other, with value traveling in one direction. Platform thinking turns this on its head, allowing groups to derive value from each other regardless of whether they are the users or creators of the product.

In this article, we will unpack what platform thinking is, how it fits into the world of software engineering, and the ways in which using platform thinking can revolutionize the value of any observability stack.

Platform Thinking – A Simple Explanation

Traditionally, value is sent from the producer to the consumer and takes the form of the utility or benefit gained upon engagement with whatever the product/service happens to be in each case. Financial value obviously then travels back up the pipeline to the owner of said product/service, but this isn’t the kind of value we’re concerned with here.

It should go without saying that any successful change to a business model (be it technological or organizational) should lead to an increase in financial value. Otherwise, what would be the point in implementing it?

Platform thinking is certainly no exception to the end goal being profit, but the value we’re discussing here is a little more intangible. When we say ‘value’ we are referring to the means by which financial ends are achieved, rather than the ends themselves.

So How Does Platform Thinking Apply to Engineering?

The above explanation obviously makes a lot of sense if your curiosity about platform thinking stems from financial or business concerns. However, that’s probably not why you’re reading this. You want to know how platform thinking can be applied to your technical operations. 

It’s great that platform thinking could generate more revenue for your company or employer, but in your mind this is more of a convenient by-product of your main aim. You’ll be pleased to hear that the benefits of implementing a platform thinking approach to your operational processes will be felt by your engineers and analysts before they’re noticed by the company accountant. 

As we covered in the previous section, the value added by platform thinking comes in the enabling of collaboration and free movement of value between groups on the platform. Financial value then comes from the inevitable increase in productivity and quality of delivery that results.

Platform Thinking in a Technical Context

A technical ecosystem founded on platform thinking principles means that everybody involved, be they an individual engineer or an entire development team, has access to a shared stack that has been built upon by everybody else. 

Engineers work and build on a foundation of shared tooling which is continuously being honed and refined by their peers and colleagues. Anybody joining enters with access to a pre-built toolbox containing tools already tailored to suit the unique needs of the project. 

The productivity implications of this should go without saying, but they’re astronomical. Any engineer will be able to tell you that they spend a disproportionately large amount of their time on configuring and implementing tooling. Platform thinking can significantly reduce, or even remove entirely, the time spent on these activities.  

Observability – Why It’s Useful

Observability and monitoring are essential components of any technical ecosystem. Be it project-based development or BAU operations and system administration, a healthy observability stack is often the only thing between successful delivery and system wide outage. 

A well-executed observability solution prevents bottlenecks, preempts errors and bugs, and provides you with the visibility needed to ensure everything works as it should. Without observability being a high priority, troubleshooting and endless investigations of the logs to find the origin of errors will define your development lifecycle.

In our highly competitive technology market, the speed and efficiency observability enables can often be the difference between success and failure. Understandably your observability stack is something you’ll want to get right.

Freeing Yourself From Unicorns 

Here’s the thing – not everybody is an observability mastermind. The front-end JavaScript developer you hired to work on the UI of your app isn’t going to have the same level of observability knowledge as your back-end engineers or systems architects. It’s going to take them longer to implement observability tooling, as it’s not their natural forte. 

Rather than attempting to replace your front-end UI dev with a unicorn who understands both aesthetic design and systems functionality, you could instead implement a platform thinking strategy for your observability stack. 

Shared Strength & Relieved Pressure

In any project or team, the most skilled or experienced members often struggle to avoid becoming the primary resource upon which success rests. Engineers enjoy a challenge, and it’s not uncommon to find that the ambitious ones take more and more under their belt if it means a chance to get some more time with a new tool, language, or system.

This is a great quality, and one that should be applauded. However, it also means that when your superstar engineer is out for vacation or leaves for new horizons, the hole they leave behind can be catastrophic. This is especially true when the skills and knowledge they take with them are intrinsically linked to the functionality of your systems and processes.

By implementing a platform thinking approach, your observability stack transforms into a platform of functionality and centralized knowledge which all engineers involved can tap into. Not only does this reduce pressure on your strongest team members, it also means that if they leave, you don’t have to find a fabled unicorn to replace them.

A Platform Observability Approach

A platform thinking approach to the observability of your ecosystem enables every developer, engineer, analyst, and architect to contribute to and benefit from your observability stack.

Every participant will have access to pre-implemented tooling which is ready to integrate with their contributions. What’s more, when new technology is introduced, their deployment and configurations will be accessible so that others can implement the same without a substantial time investment.

This in turn significantly increases the productivity of everyone in your ecosystem. The front-end UI developers will be free to focus on front-end UI development. Your systems engineers and analysts can focus on new and creative ways to optimize instead of fixing errors caused by ineffective tracing or logging.

In short, a collectively owned observability platform enables everyone to focus on what they’re best at.

Force Multiplied Observability

The aggregate time saved and pooling of both resources and expertise will have benefits felt by everyone involved. From a system-wide perspective, it will also close up those blind spots caused by poor or inconsistent implementation of observability tools and solutions.  

You can loosely measure the efficacy of your observability stack by how often outages, bottlenecks, or major errors occur. If your observability stack is producing the intended results, then these will be few and far between. 

With a platform thinking strategy the efficacy of your observability stack is multiplied as many times as there are active participants. Every single contributor is also a beneficiary, and each one increases the range and strength of your stack’s system-wide visibility and effectiveness. Each new participant brings new improvements.

By creating your observability process with a platform-thinking-led approach, you’ll find yourself in possession of a highly effective observability stack. Everybody in your ecosystem will benefit from access to the tools it contains, and the productivity of your technical teams will leap to levels they have never seen before.

A Crash Course in Kubernetes Monitoring

Kubernetes monitoring can be complex. Doing it successfully requires several components to be monitored simultaneously. First, it’s important to understand what those components are, which metrics should be monitored, and what tools are available to do so.

In this post, we’ll take a close look at everything you need to know to get started with monitoring your Kubernetes-based system.

Monitoring Kubernetes Clusters vs. Kubernetes Pods

Monitoring Kubernetes Clusters

Monitoring at the cluster level gives you a full view across all areas, providing a good impression of the health of all pods, nodes, and apps.

Key areas to monitor at the cluster level include:

  • Node load: Tracking the load on each node is integral to monitoring efficiency. Some nodes are used more than others. Rebalancing the load distribution is key to keeping workloads fluid and effective. Per-node load data can be collected via DaemonSets.
  • Unsuccessful pods: Pods fail and abort; this is a normal part of Kubernetes processes. But when a pod fails repeatedly, runs less efficiently than it should, or sits inactive, it is essential to investigate the reason behind the anomaly.
  • Cluster usage: Monitoring cluster infrastructure allows you to adjust the number of nodes in use and the allocation of resources to power workloads efficiently. Visibility into how resources are distributed lets you scale up or down and avoid the cost of additional infrastructure. It is also important to set each container’s memory and CPU limits appropriately. The example commands after this list show a quick way to check node load and resource usage.
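
As a quick, hedged illustration, the following kubectl commands give a cluster-level view of node load, resource allocation, and failed pods (kubectl top requires the Metrics Server add-on, and the node name is a placeholder):

kubectl top nodes                                                        # CPU and memory usage per node (needs Metrics Server)
kubectl describe node <node-name>                                        # Allocatable capacity plus current resource requests and limits
kubectl get pods --all-namespaces --field-selector status.phase=Failed   # Unsuccessful pods across the cluster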

Monitoring Kubernetes Pods

Cluster monitoring provides a global view of the Kubernetes environment, but collecting data from individual pods is also essential. It reveals the health of individual pods and the workloads they are hosting, providing a clearer picture of pod performance at a granular level, beyond the cluster. 

Key areas to monitor at the pod level include:

  • Total pod instances: There need to be enough instances of a pod to ensure high availability, but not so many ‘extra’ instances that hosting capacity is wasted.
  • Actual pod instances: Comparing the number of running instances of each pod with the number expected to be running reveals how to redistribute resources to reach the desired state. ReplicaSets can be misconfigured, so it’s important to analyze these figures regularly; the commands after this list show one way to compare desired and actual counts.
  • Pod deployment: Monitoring pod deployments lets you spot misconfigurations that might be diminishing the availability of pods. It’s critical to monitor how resources are distributed across nodes.
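
As a rough sketch, these commands compare desired and actual pod counts and surface deployment misconfigurations (namespace and deployment names are placeholders):

kubectl get deployments --all-namespaces        # READY column shows actual vs. desired replicas
kubectl get replicasets -n <namespace>          # DESIRED, CURRENT, and READY counts per ReplicaSet
kubectl describe deployment <deployment-name>   # Events and conditions explaining unavailable replicas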

Important Metrics for Kubernetes Monitoring

To gain deeper visibility into a Kubernetes installation, there are several metrics that will provide valuable insight into how the apps are running.

Common metrics

These are metrics collected from the Kubernetes code itself, which is written in Go. They allow you to understand performance of the platform at a low level and display the state of the Go processes.

Node metrics –

Monitoring the standard metrics from the operating systems that power Kubernetes nodes provides insight into the health of each node.

Each Kubernetes node has a finite capacity of memory and CPU that can be utilized by the running pods, so these two metrics need to be monitored carefully. Other common node metrics to monitor include CPU load, memory consumption, filesystem activity and usage, and network activity.

One approach to monitoring all cluster nodes is to create a special kind of Kubernetes workload called a DaemonSet. Kubernetes ensures that every node runs a copy of the DaemonSet pod, which effectively lets one deployment watch each machine in the cluster. As nodes are destroyed, their DaemonSet pods are terminated too.
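
A minimal way to check which DaemonSets are running and whether every node has its copy is sketched below (the node-exporter name and monitoring namespace are only an example of a typical node-level agent):

kubectl get daemonsets --all-namespaces                 # DESIRED vs. CURRENT vs. READY pods per DaemonSet
kubectl describe daemonset node-exporter -n monitoring  # Example: inspect a node-level monitoring agent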

Kubelet metrics –

To ensure the Control Plane is communicating efficiently with each individual node that a Kubelet runs on, it is recommended to monitor the Kubelet agent regularly. Beyond the common Go metrics described above, the Kubelet exposes some internals about its actions that are useful to track as well.
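
One hedged way to peek at a Kubelet’s Prometheus-format metrics without logging into the node is to proxy through the API server, assuming your user has the required permissions (the node name is a placeholder):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | head            # The Kubelet’s own metrics
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | head   # Container-level cAdvisor metrics from that node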

Controller manager metrics –

To ensure that workloads are orchestrated effectively, monitor the requests that the Controller is making to external APIs. This is critical in cloud-based Kubernetes deployments.

Scheduler metrics –

To identify and prevent delays, monitor latency in the scheduler. This will ensure Kubernetes is deploying pods smoothly and on time.

The main responsibility of the scheduler is to choose which nodes to start newly launched pods on, based on resource requests and other conditions.

The scheduler logs are not very helpful on their own. Most of the scheduling decisions are available as Kubernetes events, which can be logged easily in a vendor-independent way, thus are the recommended source for troubleshooting. The scheduler logs might be needed in the rare case when the scheduler is not functioning, but a kubectl logs call is usually sufficient.
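
For example, scheduling problems usually surface as events, which can be filtered without touching the scheduler logs at all (the scheduler pod name below varies by distribution and node name):

kubectl get events --field-selector reason=FailedScheduling    # Pods the scheduler could not place
kubectl logs -n kube-system kube-scheduler-<node-name>         # Scheduler logs, if you do need them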

etcd metrics –

etcd stores all the configuration data for Kubernetes. etcd metrics will provide essential visibility into the condition of the cluster.

Container metrics –

Looking specifically into individual containers allows monitoring of exact resource consumption rather than more general Kubernetes metrics. cAdvisor analyzes the resource usage happening inside containers.

API Server metrics –

The Kubernetes API server is the interface to all the capabilities that Kubernetes provides. The API server controls all the operations that Kubernetes can perform. Monitoring this critical component is vital to ensure a smooth running cluster.
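
As a quick sketch, the API server’s Prometheus-format metrics can be pulled directly through kubectl, assuming you have permission to read the /metrics endpoint (exact metric names vary somewhat between Kubernetes versions):

kubectl get --raw /metrics | grep apiserver_request_total | head   # Request counts by verb, resource, and response code
kubectl get --raw /metrics | grep etcd_request | head              # etcd helper request metrics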

The API server metrics are grouped into a few major categories:

  • Request Rates and Latencies
  • Performance of controller work queues
  • etcd helper cache work queues and cache performance
  • General process status (File Descriptors/Memory/CPU Seconds)
  • Golang status (GC/Memory/Threads)

kube-state-metrics –

kube-state-metrics is a service that makes cluster state information easily consumable. Where the Metrics Server exposes metrics on resource usage by pods and nodes, kube-state-metrics listens to the Control Plane API server for data on the overall status of Kubernetes objects (nodes, pods, Deployments, etc.) as well as the resource limits and allocations for those objects. It then generates metrics from that data and exposes them for scraping.

kube-state-metrics is an optional add-on. It is very easy to use and exports its metrics through an HTTP endpoint in a plain-text format. The metrics are designed to be easily consumed and scraped by open source tools like Prometheus.
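
A hedged way to check the raw output once kube-state-metrics is installed (the service name and namespace depend on how it was deployed in your cluster):

kubectl port-forward svc/kube-state-metrics 8080:8080 -n kube-system &      # Expose the service locally
curl -s http://localhost:8080/metrics | grep kube_pod_status_phase | head   # Plain-text metrics about pod state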

In Kubernetes, the user can fetch system-level metrics from various out of the box tools like cAdvisor, Metrics Server, and Kubernetes API Server. It is also possible to fetch application level metrics from integrations like kube-state-metrics and Prometheus Node Exporter.

Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It locally stores all scraped samples and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API tools can be used to visualize the collected data.

Prometheus, Grafana and Alertmanager

One of the most popular Kubernetes monitoring solutions is the open-source Prometheus, Grafana and Alertmanager stack, deployed alongside kube-state-metrics and node_exporter to expose cluster-level Kubernetes object metrics as well as machine-level metrics like CPU and memory usage.

What is Prometheus?

Prometheus is a pull-based tool used specifically for containerized environments like Kubernetes. It is primarily focused on the metrics space and is more suited to operational monitoring. Exposing and scraping Prometheus metrics is straightforward, and they are human readable, in a self-explanatory format. The metrics are published using a standard HTTP transport and can be checked using a web browser.
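
For instance, if node_exporter is running with its default port, its metrics page can be read with nothing more than curl (the host names here are assumptions about your setup):

curl -s http://<node-ip>:9100/metrics | head           # Plain-text, self-describing node metrics
curl -s http://<prometheus-host>:9090/metrics | head   # Prometheus also exposes metrics about itself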

Apart from application metrics, Prometheus can collect metrics related to:

  • Node exporter, for the classical host-related metrics: cpu, mem, network, etc.
  • Kube-state-metrics for orchestration and cluster level metrics: deployments, pod metrics, resource reservation, etc.
  • Kube-system metrics from internal components: kubelet, etcd, scheduler, etc.

Prometheus can be configured with rules that trigger alerts using PromQL; Alertmanager is then in charge of managing alert notifications, grouping, inhibition, and so on. A minimal example rule is sketched below.
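
Below is a minimal, hedged sketch of such a rule (the metric comes from kube-state-metrics; the file name and thresholds are illustrative only). Writing the rule to a file and validating it with promtool catches syntax errors before Prometheus loads it:

cat <<'EOF' > k8s-alert-rules.yml
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodRestartingFrequently
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
EOF
promtool check rules k8s-alert-rules.yml    # Validates the rule file before Prometheus loads it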

Using Prometheus with Alertmanager and Grafana

PromQL (Prometheus Query Language) lets the user choose time-series data to aggregate and then view the results as tabular data or graphs in the Prometheus expression browser. Results can also be consumed by external systems via an API.
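
As a small, hedged example, the same queries can be sent to Prometheus’ HTTP query API from the command line (the host name is an assumption, and the cAdvisor metric used here is just one common choice):

curl -g 'http://<prometheus-host>:9090/api/v1/query?query=sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'   # CPU usage per namespace as JSON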

How does Alertmanager fit in? The Alertmanager component configures the receivers, the gateways that deliver alert notifications. It handles alerts sent by client applications such as the Prometheus server and takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.

Grafana can pull metrics from any number of Prometheus servers and display panels and dashboards. It also has the added ability to register multiple different backends as data sources and render them all on the same dashboard. This makes Grafana an outstanding choice for monitoring dashboards.

Useful Log Data for Troubleshooting

Logs are useful to examine when a problem is revealed by metrics. They give exact and invaluable information which provides more details than metrics. There are many options for logging in most of Kubernetes’ components. Applications also generate log data.

Digging deeper into the cluster requires logging into the relevant machines.

The locations of the relevant log files are:

  • Master

/var/log/kube-apiserver.log – API Server, responsible for serving the API

/var/log/kube-scheduler.log – Scheduler, responsible for making scheduling decisions

/var/log/kube-controller-manager.log – Controller that manages replication controllers

  • Worker nodes

/var/log/kubelet.log – Kubelet, responsible for running containers on the node

/var/log/kube-proxy.log – Kube Proxy, responsible for service load balancing

  • etcd logs

etcd uses the capnslog library for logging application output, categorized into levels.

A log message’s level is determined according to these conventions:

  • Error: Data has been lost, a request has failed for a bad reason, or a required resource has been lost.
  • Warning: Temporary conditions that may cause errors, but may work fine.
  • Notice: Normal, but important (uncommon) log information.
  • Info: Normal, working log information, everything is fine, but helpful notices for auditing or common operations.
  • Debug: Everything is still fine, but even common operations may be logged, producing a higher volume of less useful notices.

kubectl

When it comes to troubleshooting the Kubernetes cluster and the applications running on it, understanding and using logs are crucial. Like most systems, Kubernetes maintains thorough logs of activities happening in the cluster and applications, which highlight the root causes of any failures.

Logs in Kubernetes can give an insight into resources such as nodes, pods, containers, deployments, and replica sets. This insight allows you to observe the interactions between those resources and see the effects that one action has on another. Generally, logs in the Kubernetes ecosystem can be divided into the cluster level (logs output by components such as the kubelet, the API server, and the scheduler) and the application level (logs generated by pods and containers).

Use the following syntax to run kubectl commands from your terminal window:

kubectl [command] [TYPE] [NAME] [flags]

Where:

  • command: the operation to perform on one or more resources, e.g. create, get, describe, delete.
  • TYPE: the resource type.
  • NAME: the name of the resource.
  • flags: optional flags.

Examples:

kubectl get pod pod1    # Lists resources of the pod ‘pod1’
kubectl logs pod1    # Returns snapshot logs from the pod ‘pod1’ 
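
A few more variations that are often useful when troubleshooting (pod and container names here are placeholders):

kubectl logs -f pod1                  # Streams (follows) logs from the pod ‘pod1’
kubectl logs pod1 -c container1       # Logs from a specific container in a multi-container pod
kubectl logs --previous pod1          # Logs from the previous, crashed instance of the pod
kubectl describe pod pod1             # Events, conditions, and restart counts for the pod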

Kubernetes Events

Kubernetes Events capture all the events and resource state changes happening in your cluster, so they allow past activities to be analyzed. They are objects that show what is happening inside a cluster, such as the decisions made by the scheduler or why some pods were evicted from a node. They are the first thing to inspect for application and infrastructure operations when something is not working as expected; the commands below show common ways to pull them.
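
As a quick, hedged example of inspecting events with kubectl:

kubectl get events --sort-by=.metadata.creationTimestamp   # All recent events, oldest first
kubectl get events --field-selector type=Warning           # Only warning-level events
kubectl get events -n <namespace>                          # Events scoped to a single namespace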

Unfortunately, Kubernetes events are limited in the following ways:

  • Kubernetes Events can generally only be accessed using kubectl.
  • The default retention period of Kubernetes Events is one hour.
  • The retention period can be increased, but this can cause issues with the cluster’s key-value store.
  • There is no built-in way to visualize these events.

To address these issues, open source tools like Kubewatch, Eventrouter and Event-exporter have been developed.

Summary

Kubernetes monitoring is performed to maintain the health and availability of containerized applications built on Kubernetes. When you are creating the monitoring strategy for Kubernetes-based systems, it’s important to keep in mind the top metrics to monitor along with the various monitoring tools discussed in this article.