
AWS Service Observability using OpenTelemetry 

  • Joanna Wallace
  • May 19, 2022

Efficient use of observability statistics is essential to any microservice architecture. OpenTelemetry is a project supported by the Cloud Native Computing Foundation (CNCF) to enhance the observability of microservice projects. AWS Distro for OpenTelemetry (ADOT) is an AWS-supported distribution of the OpenTelemetry project specifically designed to improve the observability of AWS projects. 

Observability Data

Observability data includes metrics, traces, and logs, and each of these data types is integral in understanding the status of your system. 

Observability Standards in Industry

Cloud applications are complicated. Whether you are a small company or a large service, this isn’t something to underestimate. If you cannot efficiently troubleshoot your system, issues will affect users. Effective observability tools can help teams proactively solve problems.

Researchers have published several industrial surveys since distributed systems have become the go-to model for new and upgraded software solutions. Distributed systems have improved aspects of software creation by separating pieces of the infrastructure, making them simpler to edit without affecting the entire system. However, distributed systems have also brought about new issues with observability. These include:

Scale and complexity of data: this is primarily an issue with logs, which tend to be more verbose than metrics and traces. However, large systems that produce traces with hundreds of spans, or metrics for complicated services, can also be hard to manage.

Instrumentation: the amount of work required depends on whether you are adding monitoring to existing cloud systems, rebuilding monitoring that isn’t working, or creating a new system with monitoring in mind.

Incomplete Information: troubleshooting microservices is especially challenging without a complete picture of the problem, including logs, metrics, and traces. 

Studies have shown that the three pillars of observability (logs, metrics, and traces), when combined, provide a method for preventing outages and predicting failures before they occur. In one study, researchers interviewed individuals from ten companies varying in size, domain, and deployment type; eight of the ten companies had already implemented a distributed tracing model to augment their existing logging and metric infrastructure. The other two companies were either planning or beginning the implementation of such a model.

Another industry study, which interviewed individuals from sixteen companies, has shown that company culture and awareness of how observability infrastructure fits the business model are essential to success when running a distributed system. Companies often act reactively, responding to issues only once users experience failures. In these situations, developers focus on troubleshooting and implement ad-hoc solutions without fully realizing the extent of the technical problems. Observability infrastructure implemented early can prevent the need for such reactionary work.

Observability with OpenTelemetry

The OpenTelemetry project enables cloud users to collect all three types of observability data about their cloud services and analyze them. However, the ADOT project currently supports a stable integration only for metrics and traces; log integration is still in the beta stage. Developers can use other tools like FluentBit to fill in the log data so that all three data types can be used to analyze the health of the system.

Without analytics, observability data is less useful. Developers and DevOps teams need a way to quickly read this data and see the system’s health. ADOT can send telemetry data to several observability tools within AWS, such as X-Ray, OpenSearch, or CloudWatch. Data can also be sent, directly or via these other tools, to external services such as Coralogix’s observability engine.

OpenTelemetry Specifications

OpenTelemetry is an open-source set of APIs, SDKs, tools, and integrations used to create and manage observability data. The project defines a common way of producing this data, so telemetry arrives in a similar format regardless of your cloud vendor. OpenTelemetry is not a back-end for analyzing observability data; instead, it provides a means to export data from any cloud vendor to any such back-end. It is a pluggable architecture that AWS has used to create the AWS Distro for OpenTelemetry.

OpenTelemetry has defined several data sources supported by its infrastructure. 

Log

Logs are the most commonly used observability tool, spanning all architecture types. A log is a timestamped record of an event in your software. Logs can be structured (typically JSON) or unstructured, and they can be used to determine the root cause of an issue.

In OpenTelemetry, a log is an independent event but may be associated with a span (a single operation within a trace). OpenTelemetry defines a log as anything that is not part of a trace or metric.
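As a rough illustration of the difference between a standalone log and one associated with a span, the sketch below builds structured (JSON-ready) log records in plain Python. The helper `make_log_record`, its field names, and the sample IDs are assumptions made for illustration; they are not the OpenTelemetry SDK API.

```python
import json
from datetime import datetime, timezone


def make_log_record(message, severity="INFO", trace_id=None, span_id=None, **attributes):
    """Build a structured log record as a plain dict (illustrative field names)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "body": message,
        "attributes": attributes,
    }
    # A log is an independent event; the link to a span is optional.
    if trace_id is not None and span_id is not None:
        record["trace_id"] = trace_id
        record["span_id"] = span_id
    return record


# A log with no trace context.
standalone = make_log_record("cache warmed", component="cache")

# A log tied to one operation (span) inside a trace; the IDs are sample values.
correlated = make_log_record(
    "payment declined",
    severity="ERROR",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    order_id="A-1001",
)

print(json.dumps(correlated, indent=2))
```

Carrying the trace and span identifiers on the record is what lets a back-end jump from an error log to the request that produced it.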

Metric

A metric represents system performance data measured over time as a numerical value. Metric data includes the measurement itself, the time it was captured, and any relevant metadata. Metrics can be used to indicate both the availability and performance of services. Custom metrics can also be used to measure the performance of more platform-specific data. 
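To make the shape of metric data concrete, here is a minimal sketch of a data point carrying a measurement, a capture time, and metadata, along with two simple calculations for availability and performance. The `MetricPoint` class, the metric name, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    name: str
    value: float          # the measurement itself
    timestamp: float      # when it was captured (seconds since epoch)
    metadata: dict = field(default_factory=dict)


# Sample request-duration measurements for a hypothetical "checkout" service.
points = [
    MetricPoint("request.duration_ms", 120.0, 1000.0, {"service": "checkout", "ok": True}),
    MetricPoint("request.duration_ms", 80.0, 1001.0, {"service": "checkout", "ok": True}),
    MetricPoint("request.duration_ms", 400.0, 1002.0, {"service": "checkout", "ok": False}),
    MetricPoint("request.duration_ms", 100.0, 1003.0, {"service": "checkout", "ok": True}),
]


def availability(points):
    """Fraction of requests that succeeded."""
    ok = sum(1 for p in points if p.metadata.get("ok"))
    return ok / len(points)


def mean_value(points):
    """Average of the recorded measurements."""
    return sum(p.value for p in points) / len(points)
```

With the sample data above, `availability(points)` is 0.75 and `mean_value(points)` is 175.0.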

OpenTelemetry defines three metric types: 

  1. Counter: a metric value summed over time
  2. Measure: a value aggregated over time
  3. Observer: a set of values measured at a moment in time
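The three types listed above can be modeled in a few lines of plain Python. This is only a sketch of the behavior described here, not the OpenTelemetry SDK: the class and method names are assumptions, and the instrument names in the current OpenTelemetry API may differ from the ones used in this article.

```python
class Counter:
    """A value summed over time; it only ever increases."""

    def __init__(self):
        self.total = 0

    def add(self, amount=1):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.total += amount


class Measure:
    """Individual values recorded over time and aggregated on demand."""

    def __init__(self):
        self.values = []

    def record(self, value):
        self.values.append(value)

    def summary(self):
        return {
            "count": len(self.values),
            "sum": sum(self.values),
            "min": min(self.values),
            "max": max(self.values),
        }


class Observer:
    """A set of values read from a callback at the moment of collection."""

    def __init__(self, callback):
        self.callback = callback

    def observe(self):
        return dict(self.callback())


requests_served = Counter()
for _ in range(3):
    requests_served.add()

latency_ms = Measure()
for sample in (12, 48, 30):
    latency_ms.record(sample)

# Hypothetical callback returning current queue depths.
queue_depth = Observer(lambda: {"orders": 4, "emails": 0})
```

A counter suits totals such as requests served, a measure suits values where the distribution matters (latency), and an observer suits state that is cheaper to sample than to track on every change.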

Trace

Traces are used to track the progression of a single request as it flows through services in a platform. Distributed tracing is helpful in microservices since these records track a command from its initialization to its termination.

A span is a unit of work in a trace, and a trace is composed of a tree of spans completed throughout an application. Each span shows the work done by a single service within a complete request. Each span includes request, error, and duration data that can be used to troubleshoot an application or feature.
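The tree structure of a trace can be sketched with a small recursive data type. The `Span` class, its fields, and the sample request below are hypothetical and only meant to show how error and duration data on each span help narrow down a failure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    service: str
    start_ms: int
    end_ms: int
    error: bool = False
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def walk(span, depth=0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def failing_spans(root):
    """Names of all spans in the trace that recorded an error."""
    return [span.name for _, span in walk(root) if span.error]


# One request flowing through three hypothetical services.
trace = Span("GET /checkout", "gateway", 0, 250, error=True, children=[
    Span("authorize", "auth", 5, 40),
    Span("charge card", "payments", 45, 240, error=True, children=[
        Span("call bank API", "payments", 60, 230, error=True),
    ]),
])

for depth, span in walk(trace):
    status = "ERROR" if span.error else "ok"
    print(f"{'  ' * depth}{span.name} [{span.service}] {span.duration_ms} ms {status}")
```

Reading the tree from the root down, the deepest failing span ("call bank API" in this sample) is usually the best starting point for troubleshooting.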

Baggage

Though the three pillars of observability include only logs, metrics, and traces, OpenTelemetry also defines a type called baggage. Baggage propagates name/value pairs between services, so that events in one service can be indexed with attributes provided by an upstream service and relationships between those events can be established. For example, baggage can be used to attach the API user or token that triggered a series of events in a SaaS platform.
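A minimal sketch of the idea: an upstream service encodes name/value pairs into a single header value, and a downstream service decodes them and attaches them to its own events. The encoding is loosely modeled on the comma-separated `key=value` form of the W3C Baggage header but is simplified; the function names and sample keys are assumptions.

```python
from urllib.parse import quote, unquote


def encode_baggage(items):
    """Serialize name/value pairs into one header value."""
    return ",".join(
        f"{quote(str(key), safe='')}={quote(str(value), safe='')}"
        for key, value in items.items()
    )


def decode_baggage(header):
    """Parse a header value back into a dict, skipping malformed members."""
    items = {}
    for member in header.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue
        key, value = member.split("=", 1)
        value = value.split(";", 1)[0]  # ignore optional ";property" suffixes
        items[unquote(key.strip())] = unquote(value.strip())
    return items


# Upstream service: record who triggered the request.
header = encode_baggage({"api.user": "alice", "plan": "team tier"})

# Downstream service: attach the propagated attributes to its own event.
event = {"name": "invoice.created", **decode_baggage(header)}
```

Because the pairs travel with the request, the downstream event can be indexed by `api.user` even though that service never authenticated the user itself.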

ADOT with AWS Compute Services

Since its initial preview release in October 2021, the project has been made generally available with new features added according to the AWS roadmap. AWS supports the ADOT distribution, providing a means to record observability data across different AWS compute services, including tasks deployed using ECS and Lambdas. 

ADOT uses a collector, a set of components combined to handle events. The collector is made up of a:

Receiver: the component that receives telemetry data and transforms it into the internal format recognized by the processor.

Processor: the component that transforms the received data, by adding to it or dropping it, and then forwards it to the exporter.

Exporter: the component that forwards the data to its destination, either a service internal to AWS (e.g., X-Ray) or a file that an external analyzer can then use (e.g., Coralogix Analytics Platform).

The ADOT collector will forward trace events to AWS X-Ray and metric events to AWS CloudWatch, though these destinations are customizable. From both of these services, data can be sent to third-party tools for further analysis and to predict when issues will arise in the platform.
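The receiver, processor, and exporter stages can be sketched as a small pipeline in plain Python. This is a conceptual model only, not the ADOT collector or its configuration format: every function and class name is hypothetical, and the in-memory exporters merely stand in for destinations such as X-Ray and CloudWatch.

```python
import json


def receiver(raw_payload):
    """Parse an incoming JSON payload into the pipeline's internal format."""
    event = json.loads(raw_payload)
    return {
        "kind": event["type"],  # "trace" or "metric"
        "name": event["name"],
        "attributes": dict(event.get("attrs", {})),
    }


def processor(event, extra_attributes, drop_names):
    """Drop unwanted events and add attributes to the rest."""
    if event["name"] in drop_names:
        return None
    enriched = dict(event)
    enriched["attributes"] = {**event["attributes"], **extra_attributes}
    return enriched


class ListExporter:
    """Stand-in for a real destination; it just remembers what was sent."""

    def __init__(self):
        self.sent = []

    def export(self, event):
        self.sent.append(event)


def run_pipeline(payloads, exporters, extra_attributes, drop_names=()):
    for raw in payloads:
        event = processor(receiver(raw), extra_attributes, drop_names)
        if event is None:
            continue
        # Route by data type: one exporter for traces, another for metrics.
        exporters[event["kind"]].export(event)


trace_exporter = ListExporter()   # plays the role of X-Ray
metric_exporter = ListExporter()  # plays the role of CloudWatch

run_pipeline(
    payloads=[
        '{"type": "trace", "name": "GET /orders", "attrs": {"status": 200}}',
        '{"type": "trace", "name": "GET /health"}',
        '{"type": "metric", "name": "cpu.utilization", "attrs": {"value": 41.5}}',
    ],
    exporters={"trace": trace_exporter, "metric": metric_exporter},
    extra_attributes={"env": "staging"},
    drop_names={"GET /health"},
)
```

Keeping the destination behind an exporter interface is what makes the destinations customizable: swapping the back-end means swapping one object, while the receiver and processor are unchanged.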

AWS Distro for OpenTelemetry with AWS Lambda

AWS Distro for OpenTelemetry supports exporting metrics and traces for AWS Lambda using a Lambda layer plugin. Lambda layers provide an easy setup where developers do not need to configure the ADOT code directly; traces are sent to AWS X-Ray automatically. Metrics are not yet available for Lambda through ADOT, though many Lambda metrics are already automatically available through CloudWatch.

AWS Distro for OpenTelemetry with AWS ECS

AWS ECS allows you to export metrics and trace data to several services using the ADOT sidecar container built and maintained by AWS as an open-source service. The container will collect and route observability data to your chosen destination. Recent updates to the AWS console allow developers to specify observability data collection during task definition creation. 

Both the ECS task definition and the ADOT configuration can be customized if your use case requires it.
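As a rough sketch of what a customized task definition might look like, the snippet below appends a collector sidecar to an application's container list and prints the resulting JSON. The application task is a placeholder, and the sidecar's image name and `--config` path are assumptions based on recollection of AWS documentation; verify both against the current ADOT documentation before using them.

```python
import copy
import json

# Assumption: image and command should be confirmed against the ADOT docs.
ADOT_SIDECAR = {
    "name": "aws-otel-collector",
    "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
    "essential": True,
    "command": ["--config=/etc/ecs/ecs-default-config.yaml"],
}


def with_adot_sidecar(task_definition):
    """Return a copy of the task definition with the collector sidecar appended."""
    updated = copy.deepcopy(task_definition)
    containers = updated.setdefault("containerDefinitions", [])
    # Avoid adding the sidecar twice if it is already present.
    if not any(c["name"] == ADOT_SIDECAR["name"] for c in containers):
        containers.append(copy.deepcopy(ADOT_SIDECAR))
    return updated


# Hypothetical application task; the family, name, and image are placeholders.
app_task = {
    "family": "example-app",
    "containerDefinitions": [
        {"name": "app", "image": "example/app:1.0", "essential": True},
    ],
}

task_with_sidecar = with_adot_sidecar(app_task)
print(json.dumps(task_with_sidecar, indent=2))
```

Pointing the `command` at a different configuration file is one way a custom collector configuration could be supplied, again assuming the sidecar accepts the flag shown.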

Collecting Trace Data

Trace collection can be turned on for ECS tasks from the ECS console. There is no need to add any logic or configuration to your code to get started. Traces appear by default in AWS X-Ray. Trace handling can be customized by changing the ADOT configuration. Traces may also be sent to other services or stored in files for export to third-party vendors for analysis.

Collecting Metric Data

Currently, the collection of metric data within ECS, known as Container Insights, is in preview. The preview is available to all AWS accounts, but users should be prepared to handle changes to the service before it is made generally available.

Container Insights can be turned on directly when running a task in ECS. This setting has been integrated into the console for easy access. The same insights are also available when using Amazon Elastic Kubernetes Service (EKS) and Kubernetes on EC2.

Container Insights include information not typically collected with AWS CloudWatch. They provide further diagnostic information as metrics to help detect and recover from container issues like restart failures. Collected metrics include CPU, memory, network, and disk usage. Detection can be done using a CloudWatch alarm or by sending metric data to third-party services for analysis and alerting.
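The kind of threshold check an alarm performs on these metrics can be sketched as follows. This is a simplified model that assumes an "at least M breaching datapoints in the last N periods" rule; it is not CloudWatch's actual implementation, and the function name, parameters, and sample values are hypothetical.

```python
def should_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return True when enough of the most recent datapoints exceed the threshold."""
    if evaluation_periods < 1 or datapoints_to_alarm < 1:
        raise ValueError("periods and datapoints_to_alarm must be at least 1")
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm


# Sample memory utilization (percent) for one container, one value per period.
memory_utilization = [62, 71, 88, 91, 93]

alarm_now = should_alarm(
    memory_utilization, threshold=85, evaluation_periods=3, datapoints_to_alarm=2
)
alarm_earlier = should_alarm(
    memory_utilization[:3], threshold=85, evaluation_periods=3, datapoints_to_alarm=2
)
```

Requiring more than one breaching datapoint trades a slightly slower alert for fewer false alarms caused by a single spike.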
