AWS Service Observability using OpenTelemetry 

Effective AWS observability tools can help teams proactively solve problems.

The efficient use of observability data is essential to any microservice architecture. OpenTelemetry is a project supported by the Cloud Native Computing Foundation (CNCF) to enhance the observability of microservice projects. AWS Distro for OpenTelemetry (ADOT) is an AWS-supported distribution of the OpenTelemetry project specifically designed to improve the observability of AWS projects.

Observability Data

Observability data includes metrics, traces, and logs, and each of these data types is integral in understanding the status of your system. 

Observability Standards in Industry

Cloud applications are complicated. Whether you run a small company or a large service, this complexity isn’t something to underestimate. If you cannot efficiently troubleshoot your system, issues will affect users.

Researchers have published several industrial surveys since distributed systems have become the go-to model for new and upgraded software solutions. Distributed systems have improved aspects of software creation by separating pieces of the infrastructure, making them simpler to edit without affecting the entire system. However, distributed systems have also brought about new issues with observability. These include:

Scale and complexity of data: this is primarily an issue with logs, which tend to be more verbose than metrics and traces. However, large systems that produce traces with hundreds of spans, or metrics for complicated services, can also be complex.

Instrumentation: depending on whether you are adding monitoring to an existing cloud system, rebuilding monitoring that isn’t working, or creating a new system with monitoring in mind, instrumentation requires different amounts of work. 

Incomplete Information: troubleshooting microservices is especially challenging without a complete picture of the problem, including logs, metrics, and traces. 

Studies have shown that the three pillars of observability (logs, metrics, and traces) combined provide a method for preventing outages and predicting failures before they occur.  In one study, researchers interviewed individuals from ten companies ranging in size, domain, and deployment type; eight of the ten companies had already implemented a distributed tracing model to augment their existing logging and metric infrastructure. The other two companies were either planning or beginning the implementation of such a model. 

Another industry study, which interviewed individuals from sixteen companies, found that company culture and an awareness of how observability infrastructure benefits the business model are essential to success when running a distributed system. Companies often react to issues only after users experience failures. In these situations, developers focus on troubleshooting and implement ad-hoc fixes without fully realizing the extent of the technical problems. Observability infrastructure implemented early can prevent the need for such reactionary work. 

Observability with OpenTelemetry

The OpenTelemetry project enables cloud users to collect all three types of observability data about their cloud services and analyze them. However, the ADOT project currently supports a stable integration only for metrics and traces; log integration is still in beta. Developers can use other tools like Fluent Bit to fill in the log data so that all three data types can be used to analyze the health of your system. 

Without analytics, observability data is less useful. Developers and DevOps teams need a way to quickly read this data and assess the system’s health. ADOT can send telemetry data to several observability tools within AWS, such as X-Ray, OpenSearch, or CloudWatch. Data can also be sent, directly or via these tools, to external services such as Coralogix’s observability engine.

OpenTelemetry Specifications

OpenTelemetry is an open-source set of APIs, SDKs, tools, and integrations used to create and manage observability data. The project defines a vendor-neutral specification, so telemetry is produced in a consistent format regardless of your cloud vendor. OpenTelemetry is not a back-end for analyzing observability data, but it provides a means to export data from any cloud vendor to any such back-end. It is a pluggable architecture that AWS has used to create the AWS Distro for OpenTelemetry. 

OpenTelemetry has defined several data sources supported by its infrastructure. 

Log

Logs are the most commonly-used observability tool, spanning all architecture types. A log is a timestamped record of an event in your software. They can be structured (typically JSON) or unstructured. Logs can be used to determine the root cause of an issue. 

In OpenTelemetry, a log is an independent event but may be associated with a span (or a single operation within a trace). OpenTelemetry defines a log as anything that is not part of a trace or metric. 

Metric

A metric represents system performance data measured over time as a numerical value. Metric data includes the measurement itself, the time it was captured, and any relevant metadata. Metrics can be used to indicate both the availability and performance of services. Custom metrics can also be used to measure the performance of more platform-specific data. 

OpenTelemetry defines three metric types: 

  1. Counter: a metric value summed over time
  2. Measure: a value aggregated over time
  3. Observer: a set of values measured at a moment in time
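
A minimal sketch of recording each type with the OpenTelemetry Python SDK follows; note that the current SDK names these instruments Counter, Histogram, and ObservableGauge, which correspond roughly to the Counter, Measure, and Observer types listed above. The meter, metric, and attribute names are illustrative.

  # Recording the three instrument types with the OpenTelemetry Python SDK.
  from opentelemetry import metrics
  from opentelemetry.sdk.metrics import MeterProvider
  from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

  reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
  metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
  meter = metrics.get_meter("demo.service")

  # Counter: a value summed over time
  request_counter = meter.create_counter("requests.total", unit="1")
  request_counter.add(1, {"route": "/orders"})

  # Histogram (Measure): individual values aggregated over time
  latency = meter.create_histogram("request.latency", unit="ms")
  latency.record(42.7, {"route": "/orders"})

  # ObservableGauge (Observer): a value sampled at collection time
  def read_queue_depth(options):
      yield metrics.Observation(17, {"queue": "orders"})

  meter.create_observable_gauge("queue.depth", callbacks=[read_queue_depth])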

Trace

Traces are used to track the progression of a single request as it flows through services in a platform. Distributed tracing is helpful in microservices since these records track a command from its initialization to its termination.

A span is a unit of work in a trace, and a trace is composed of a tree of spans completed throughout an application. A span shows the work done by a single service within a complete request. Each span includes request, error, and duration data that can be used to troubleshoot an application or feature. 
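
As a rough illustration, the following sketch uses the OpenTelemetry Python SDK to create a trace with a parent span and a nested child span; the service, span, and attribute names are placeholders.

  # Creating a trace: a parent span for the whole request and a child span
  # for one unit of work performed by a single service.
  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

  trace.set_tracer_provider(TracerProvider())
  trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
  tracer = trace.get_tracer("order.service")

  with tracer.start_as_current_span("handle-order") as parent:
      parent.set_attribute("order.id", "12345")
      with tracer.start_as_current_span("charge-payment") as child:
          child.set_attribute("payment.retries", 0)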

Baggage

Though the three pillars of observability include only logs, metrics, and traces, OpenTelemetry also defines a type called baggage. Baggage propagates name/value pairs so that events in one service can be indexed with attributes provided by an upstream service, establishing relationships between those events. For example, baggage can be used to add the API user or token that triggered a series of events in a SaaS platform.
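
A minimal sketch of setting and reading baggage with the OpenTelemetry Python API is shown below; the key and value are illustrative.

  # Upstream service: attach the API user that triggered the request.
  from opentelemetry import baggage, context

  ctx = baggage.set_baggage("api.user", "user-1234")
  token = context.attach(ctx)

  # Downstream service (receiving the propagated context): read the value
  # and use it to index or annotate its own events.
  print(baggage.get_baggage("api.user"))  # "user-1234"

  context.detach(token)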

ADOT with AWS Compute Services

Since its initial preview release in October 2020, the project has been made generally available with new features added according to the AWS roadmap. AWS supports the ADOT distribution, providing a means to record observability data across different AWS compute services, including tasks deployed using ECS and Lambdas. 

ADOT uses a collector, a set of components combined to handle events. The collector is made up of three components:

Receiver: the component that receives telemetry data and transforms it into an internal format recognized by the processor.

Processor: the component that transforms the received data, for example by adding attributes or dropping records, and then forwards it to the exporter.

Exporter: the component that forwards the data to its destination, either a service internal to AWS (e.g., X-Ray) or a file that an external analyzer can then use (e.g., Coralogix Analytics Platform). 

The ADOT collector will forward trace events to AWS X-Ray and metric events to AWS CloudWatch, though these destinations are customizable. From both of these services, data can be sent to third-party tools for further analysis and to predict when issues will arise in the platform.
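
As an illustration, an application running alongside the collector can export its spans to it over OTLP; the sketch below assumes the collector is listening on its default gRPC endpoint (localhost:4317), which is typical for a sidecar deployment.

  # Pointing an application's trace exporter at a local ADOT collector.
  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor
  from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

  exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
  provider = TracerProvider()
  provider.add_span_processor(BatchSpanProcessor(exporter))
  trace.set_tracer_provider(provider)

  tracer = trace.get_tracer("my.service")
  with tracer.start_as_current_span("sample-operation"):
      pass  # the collector receives this span and forwards it on (e.g., to X-Ray)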

AWS Distro for OpenTelemetry with AWS Lambda

AWS Distro for OpenTelemetry supports exporting metrics and traces for AWS Lambda using a Lambda layer plugin. Lambda layers provide an easy setup where developers do not need to configure the ADOT code directly; traces are automatically sent to AWS X-Ray. Metrics are not yet available for Lambda through the OpenTelemetry Distro, though many Lambda metrics are already automatically available through CloudWatch. 

AWS Distro for OpenTelemetry with AWS ECS

AWS ECS allows you to export metrics and trace data to several services using the ADOT sidecar container built and maintained by AWS as an open-source service. The container will collect and route observability data to your chosen destination. Recent updates to the AWS console allow developers to specify observability data collection during task definition creation. 

Customizations of the ECS task definition and the ADOT configuration are available if custom configurations are required for your use case. 

Collecting Trace Data

Trace data can be turned on in ECS tasks by configuring it in the ECS console. There is no need to add any logic or configurations to the code to get started. Traces appear by default in AWS X-Ray. Customized configurations are available for traces by configuring the ADOT setup. Traces may also be sent to other services or stored in files for export to third-party vendors for analysis.

Collecting Metric Data

Currently, the collection of metric data, or Container Insights, within ECS is in preview. The preview is available to all AWS accounts, but users should be prepared to handle changes to the service before it is made generally available. 

Container Insights can be turned on directly when running a task in ECS. This setting has been integrated into the console for easy access. The same insights are also available if using Amazon Elastic Kubernetes Service (EKS) and Kubernetes on EC2. 

Container Insights includes information not typically collected with AWS CloudWatch. It provides further diagnostic information as metrics to help detect and recover from container issues like restart failures. Collected metrics include CPU, memory, network, and disk usage. Detection can be done using a CloudWatch alarm or by sending metric data to third-party services for analysis and alerting. 
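
As a rough sketch, such an alarm could be created with boto3; the cluster, service, threshold, and SNS topic below are placeholders, and the ECS/ContainerInsights namespace assumes Container Insights is enabled for the cluster.

  # Alarm when a service's CPU usage (in CPU units) stays high for two
  # consecutive five-minute periods, then notify an SNS topic.
  import boto3

  cloudwatch = boto3.client("cloudwatch")
  cloudwatch.put_metric_alarm(
      AlarmName="orders-service-high-cpu",
      Namespace="ECS/ContainerInsights",
      MetricName="CpuUtilized",
      Dimensions=[
          {"Name": "ClusterName", "Value": "production-cluster"},
          {"Name": "ServiceName", "Value": "orders-service"},
      ],
      Statistic="Average",
      Period=300,
      EvaluationPeriods=2,
      Threshold=900,
      ComparisonOperator="GreaterThanThreshold",
      AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
  )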

ECS Monitoring Metrics that Help Optimize and Troubleshoot Tasks

Compute workloads that run on Amazon’s Elastic Container Service (ECS) require regular monitoring to ensure containerized tasks on AWS run and are managed properly – in short, ECS monitoring is a must. ECS can manage containers with either EC2 or Fargate compute. Both are compute services, but EC2 allows users to configure virtually every functional aspect, while Fargate is more limited in its available settings and is simpler to set up. Before setting up a solution with ECS, determine if this service best meets your needs.

When setting up an ECS service, you need to configure any monitoring and observability tools. AWS provides tools to collect and process ECS data to help monitor your solution. Developers can use Coralogix’s Log Analytics and AWS Observability platforms to provide more detailed metrics.

ECS Monitoring Infrastructure

AWS provides metrics useful for ECS monitoring. Let’s go over several recommended metrics and evaluate how they will help ensure the health of your ECS infrastructure. 

Task Count

Task count measures how many tasks are active on the cluster. This metric helps analyze how busy your cluster is. Users can define the desired number of tasks to keep active when defining a service. ECS will automatically ensure that the desired number of tasks of a given container will run. The desired count is also available in the service metrics from ECS. Users can compare the desired count with the running task count to ensure their services are executing as expected.

Along with the number of running tasks, AWS tracks the number of pending tasks. Pending tasks are in a transition state, where ECS is waiting on the container to activate the task. Tasks can get stuck in a pending state, causing customer-facing outages. An unresponsive Docker daemon or the ECS container agent losing connectivity with the ECS service may cause these failures during task launch. 
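
A minimal sketch of checking these counts for a service with boto3 follows; the cluster and service names are placeholders.

  # Compare desired, running, and pending task counts for a service.
  import boto3

  ecs = boto3.client("ecs")
  response = ecs.describe_services(
      cluster="production-cluster",
      services=["orders-service"],
  )

  for service in response["services"]:
      desired = service["desiredCount"]
      running = service["runningCount"]
      pending = service["pendingCount"]
      if running < desired or pending > 0:
          print(f"{service['serviceName']}: desired={desired} "
                f"running={running} pending={pending} -- investigate")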

AWS uses services to launch tasks automatically. Automatic alerts for tasks stuck in a pending state, or for a running task count lower than the desired count, are also beneficial. Alerts allow development teams to intervene and limit outages on your platform in cases where automatic launches fail.

CPU Reservation and Utilization

The total CPU reservation measures the total CPU units reserved by all the tasks running on a single ECS cluster. You can have multiple clusters in your AWS account, and each would specify its CPU reservation value. 

Active tasks (Fargate) or container instances (EC2) on a cluster will use reserved CPU units. Each active instance will register a certain number of units based on its task definition. Instances that are ‘Active’ or ‘Draining’ will affect the results. 

CPU utilization measures how many CPU units are used by the EC2 tasks running on your cluster. By comparing CPU reservation to utilization, it is possible to see how many spare CPU units are available to launch new EC2 instances at any given time. If you attempt to launch an instance without enough reserved units, a failure may occur. 
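
As an illustration, both values can be read from the AWS/ECS CloudWatch namespace with boto3; the cluster name and time window below are placeholders.

  # Compare average CPU reservation and utilization over the last hour.
  from datetime import datetime, timedelta
  import boto3

  cloudwatch = boto3.client("cloudwatch")

  def average_percent(metric_name):
      stats = cloudwatch.get_metric_statistics(
          Namespace="AWS/ECS",
          MetricName=metric_name,  # "CPUReservation" or "CPUUtilization"
          Dimensions=[{"Name": "ClusterName", "Value": "production-cluster"}],
          StartTime=datetime.utcnow() - timedelta(hours=1),
          EndTime=datetime.utcnow(),
          Period=300,
          Statistics=["Average"],
      )
      points = stats["Datapoints"]
      return sum(p["Average"] for p in points) / len(points) if points else 0.0

  print("CPU reserved: %.1f%%  utilized: %.1f%%"
        % (average_percent("CPUReservation"), average_percent("CPUUtilization")))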

Users can also enhance metrics on EC2 instances, so CPU usage is logged per instance instead of only per cluster. AWS provides CPU capacity, reserved capacity, and usage per instance. Users must set up these metrics separately from the other cluster-based metrics. 

To troubleshoot a CPU utilization close to 100%, AWS recommends rebooting the instance. If the CPU requirements are higher than you have reserved for the cluster, restarting will not fix the problem. In this case, you may need to revise your CPU requirements or convert to a new instance type with better performance.

Memory Reservation and Utilization

Memory reservation measures how much memory is allocated for use in the cluster by all tasks. The reservation value is used to calculate the percentage of memory available. Cluster memory reservation is the percent ratio of the total mebibytes (MiB) of memory reserved by tasks to the total MiB of memory registered by container instances in the same cluster. 

The memory utilization value shows the percentage of memory used in the cluster. It is the percent ratio of the total MiB of memory used by tasks divided by the total MiB of memory registered by container instances in the same cluster. 

Cluster Reservation and Utilization

Cluster reservation metrics are available only for clusters with tasks using EC2 launch types. Fargate launch types don’t reserve CPU or memory the same way, so they are measured differently. 

Before running a task, users create a task definition describing the containers that will run it. ECS uses the task definition to reserve the appropriate amount of CPU, GPU, and memory. Even if a task has been defined, its reserved values are not part of this calculation if the task is not running. 

The cluster CPU reservation is the percent ratio of total CPU units reserved to total CPU units registered by container instances in the cluster. In other words, it is the total CPU needed over the total CPU units allocated for all tasks running on the cluster. Users cannot use more CPU to run tasks than the cluster has available. Cluster CPU utilization instead compares the percent ratio of total CPU units used by tasks in the cluster to the total CPU units registered by container instances. By comparing the CPU reservation to the CPU utilization, developers can understand how much capacity is remaining on their cluster and when they need to increase the sizes of their cluster or container definitions. 

Similarly, the cluster memory reservation is the percent ratio of the total MiB of memory reserved by tasks to the total MiB of memory registered by container instances in the cluster. It is the memory reserved by running tasks divided by the total memory allocated to the cluster. Developers can compare a cluster’s memory reservation to its utilization to determine if the size of the cluster should be increased. Further, graphical analysis of the memory utilization may show memory leaks in code.
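
A small worked example of these cluster ratios, using illustrative numbers:

  # Cluster reservation and utilization as percent ratios.
  registered_cpu_units = 4096     # total CPU units registered by container instances
  reserved_cpu_units = 3072       # CPU units reserved by running tasks
  used_cpu_units = 1536           # CPU units actually consumed by tasks

  cpu_reservation_pct = 100 * reserved_cpu_units / registered_cpu_units   # 75.0
  cpu_utilization_pct = 100 * used_cpu_units / registered_cpu_units       # 37.5

  registered_memory_mib = 16384   # total MiB registered by container instances
  reserved_memory_mib = 12288     # MiB reserved by running tasks
  memory_reservation_pct = 100 * reserved_memory_mib / registered_memory_mib  # 75.0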

Service Utilization

Service utilization uses calculations similar to the cluster-level memory and CPU utilization calculations discussed above. The service CPU utilization is the percent ratio of the total CPU units used by tasks in the service to the total CPU units specified in the task definition, multiplied by the number of tasks in the service. The service utilization calculations depend on the number of tasks in the service and not on the cluster itself. The memory calculation is the same, replacing CPU units with the MiB of memory used by tasks. 

Service utilization is allowed to go over 100% when CPU or memory capacity is defined at the container level. When these units are defined at the task level, going over-limit is not allowed, and the task will fail.
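
A small worked example of the service CPU utilization calculation, with illustrative numbers showing how the value can exceed 100% when CPU is defined at the container level:

  # Service CPU utilization: CPU used by the service's tasks divided by the
  # CPU specified in the task definition times the number of running tasks.
  task_definition_cpu_units = 256
  running_task_count = 4
  cpu_units_used_by_tasks = 1280   # tasks bursting above a container-level soft limit

  service_cpu_utilization_pct = 100 * cpu_units_used_by_tasks / (
      task_definition_cpu_units * running_task_count
  )
  print(service_cpu_utilization_pct)   # 125.0 -- above 100% is possible here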

ECS Monitoring in your Application Level Tasks

With AWS Lambda, several metrics are automatically created for each function. For example, when an error is logged, it is added to a metric automatically. This metric, in turn, can be used to set an alarm and notify the development team of issues in the running function. Such metrics are not automatically created with ECS monitoring for tasks.

Error log metrics are beneficial for tracking the health of compute functions. Other metrics specific to your platform’s use case may also be necessary. To meet any metric needs inside your function, you can create custom metrics and alarms in AWS CloudWatch or send logs to third-party systems. The benefit of third-party systems is that you can send logs and use their existing analytics to detect issues without predefining everything you may need to track. Coralogix’s log analytics platform detects anomalies in your logs and alerts you based on its findings. 
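
As a rough sketch, a custom error-count metric could be published from a task with boto3; the namespace, metric name, and dimension below are placeholders.

  # Publish an error count that a CloudWatch alarm (or a third-party tool
  # consuming the metric) can use to notify the team.
  import boto3

  cloudwatch = boto3.client("cloudwatch")

  def record_error(service_name):
      cloudwatch.put_metric_data(
          Namespace="MyPlatform/ECS",
          MetricData=[{
              "MetricName": "TaskErrors",
              "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
              "Value": 1,
              "Unit": "Count",
          }],
      )

  record_error("orders-service")  # call wherever the task logs an error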

Automatic Alerting and ECS Monitoring

The metrics listed above are critical for monitoring and maintaining the health of your ECS infrastructure and ensuring users do not discover outages. Developers need to be alerted when problems arise so they can successfully limit their impact.

Alerting can be done using different methods, either in AWS or with third-party services. AWS CloudWatch provides alarms that can trigger alerts on known issues. Metric data is sent to CloudWatch, and if it meets the criteria set in the alarm, a notification is sent to the predefined location. Third-party systems like Coralogix’s AWS Observability platform use machine learning to detect issues with little customization. Coralogix provides a method of ECS monitoring by configuring an endpoint in the ECS task definition.

Summary

AWS provides standard metrics for monitoring ECS deployments. These metrics differ when using EC2 versus Fargate launch types. Generally, teams will need to watch for CPU usage, memory usage, and the number of running tasks in a cluster. Application-level metrics such as tracking error logs in a task need to be set up manually in CloudWatch when using AWS observability tools. 


ECS monitoring data can also be sent to third-party tools for analysis and to gain observability into your platform. Coralogix’s AWS Observability platform can track ECS metrics and alert users when issues arise.