Effective AWS observability tools can help teams proactively solve problems.
The efficient use of observability data is essential to any microservice architecture. OpenTelemetry is a project supported by the Cloud Native Computing Foundation (CNCF) to enhance the observability of microservice projects. AWS Distro for OpenTelemetry (ADOT) is an AWS-supported distribution of OpenTelemetry designed specifically to improve the observability of AWS workloads.
Observability data includes metrics, traces, and logs, and each of these data types is integral in understanding the status of your system.
Cloud applications are complicated. Whether you are a small company or run a large-scale service, this complexity isn't something to underestimate. If you cannot troubleshoot your system efficiently, issues will affect users.
Researchers have published several industrial surveys since distributed systems have become the go-to model for new and upgraded software solutions. Distributed systems have improved aspects of software creation by separating pieces of the infrastructure, making them simpler to edit without affecting the entire system. However, distributed systems have also brought about new issues with observability. These include:
Scale and complexity of data: this is primarily an issue with logs, which tend to be more verbose than metrics and traces. However, large systems that include traces with hundreds of points or metrics for complicated services can also become complex.
Instrumentation: depending on whether you are adding monitoring to existing cloud systems, rebuilding monitoring that isn't working, or creating a new system with monitoring in mind, instrumentation requires different amounts of work.
Incomplete information: troubleshooting microservices is especially challenging without a complete picture of the problem, including logs, metrics, and traces.
Studies have shown that the three pillars of observability (logs, metrics, and traces) combined provide a method for preventing outages and predicting failures before they occur. In one study, researchers interviewed individuals from ten companies ranging in size, domain, and deployment type; eight of the ten companies had already implemented a distributed tracing model to augment their existing logging and metric infrastructure. The other two companies were either planning or beginning the implementation of such a model.
Another industry study, which interviewed individuals from sixteen companies, showed that company culture and awareness of how observability infrastructure supports the business model are essential to success when running a distributed system. Companies often act reactively to issues that only surface when users experience failures. In these situations, developers focus on troubleshooting and implement ad-hoc fixes without fully understanding the extent of the technical problems. Observability infrastructure implemented early can prevent the need for such reactionary work.
The OpenTelemetry project enables cloud users to collect all three types of observability data about their cloud services and analyze them. However, the ADOT project currently supports a stable integration for metrics and traces; log integration is still in the beta stage. Developers can use other tools like Fluent Bit to fill in the log data so all three data types can be used to analyze the health of your system.
Without analytics, observability data is far less useful. Developers and DevOps teams require a way to quickly read and see the system's health using this data. ADOT can send telemetry data to several observability tools within AWS, such as X-Ray, OpenSearch, or CloudWatch. Data can also be sent, directly or via these other tools, to external services such as Coralogix's observability engine.
OpenTelemetry is an open-source set of APIs, SDKs, tools, and integrations used to create and manage observability data. The project defines a vendor-neutral specification and implementation, so telemetry is produced in a consistent format regardless of your cloud vendor. OpenTelemetry is not a back-end for analyzing observability data but provides a means to export data from any cloud vendor to any such back-end. It is a pluggable architecture that AWS has used to create the AWS Distro for OpenTelemetry.
OpenTelemetry has defined several data sources supported by its infrastructure.
Logs are the most commonly used observability tool, spanning all architecture types. A log is a timestamped record of an event in your software. They can be structured (typically JSON) or unstructured. Logs can be used to determine the root cause of an issue.
In OpenTelemetry, a log is an independent event but may be associated with a span (or a single operation within a trace). OpenTelemetry defines a log as anything that is not part of a trace or metric.
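As a minimal sketch using the OpenTelemetry Python SDK (the tracer and logger names are illustrative), a log record can carry the IDs of the span it was emitted under so a back-end can associate the two:

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Basic tracer setup so there is an active span to correlate with.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("payments")

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")

with tracer.start_as_current_span("checkout") as span:
    ctx = span.get_span_context()
    # The log line is an independent event, but attaching the surrounding
    # trace and span IDs lets a back-end link it to the operation.
    log.info(
        "payment accepted trace_id=%032x span_id=%016x",
        ctx.trace_id,
        ctx.span_id,
    )
```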
A metric represents system performance data measured over time as a numerical value. Metric data includes the measurement itself, the time it was captured, and any relevant metadata. Metrics can be used to indicate both the availability and performance of services. Custom metrics can also be used to measure the performance of more platform-specific data.
OpenTelemetry defines several metric instrument types, including counters, gauges, and histograms.
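As an illustration (not ADOT-specific), the following sketch records a counter and a histogram with the OpenTelemetry Python SDK; the meter, metric, and attribute names are made up, and a console exporter stands in for an exporter pointed at a collector:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export metrics every 10 seconds; in production this would typically be
# an OTLP exporter pointed at a collector instead of the console.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=10_000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")

# Counter: a value that only goes up, e.g. number of processed orders.
orders = meter.create_counter("orders_processed", unit="1")

# Histogram: the distribution of a measurement, e.g. request latency.
latency = meter.create_histogram("request_latency", unit="ms")

orders.add(1, {"payment_method": "card"})
latency.record(42.7, {"route": "/checkout"})
```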
Traces are used to track the progression of a single request as it flows through the services in a platform. Distributed tracing is helpful in microservices since these records follow a request from its initiation to its termination.
A span is a unit of work in a trace, and a trace is composed of a tree of spans completed throughout an application. Spans show the work done by a single service within a complete request. Each span includes request, error, and duration data that can be used to troubleshoot an application or feature.
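A short sketch with the OpenTelemetry Python SDK shows this parent/child span structure for a single request; the span and attribute names are invented for illustration:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

# The outer span represents the whole request; each nested span is one
# unit of work done while handling it.
with tracer.start_as_current_span("handle_order") as parent:
    parent.set_attribute("order.id", "12345")
    with tracer.start_as_current_span("charge_card"):
        ...  # call the payment service
    with tracer.start_as_current_span("reserve_stock"):
        ...  # call the inventory service
```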
Though the three pillars of observability include only logs, metrics, and traces, OpenTelemetry also defines a type called baggage. Baggage propagates name/value pairs so that events in one service can be indexed with attributes provided by an upstream service, establishing relationships between those events. For example, baggage can be used to add the API user or token that triggered a series of events in a SaaS platform.
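A minimal sketch with the OpenTelemetry Python SDK, using a hypothetical api.user entry, shows baggage being set at the edge and read by downstream work in the same request context:

```python
from opentelemetry import baggage, context, trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("api-gateway")

# Attach the API user that triggered the request as baggage; the value
# travels with the context into downstream spans.
ctx = baggage.set_baggage("api.user", "acme-corp")
token = context.attach(ctx)
try:
    with tracer.start_as_current_span("downstream_work") as span:
        # Downstream work can read the baggage and index its own events
        # with the attribute provided upstream.
        span.set_attribute("api.user", baggage.get_baggage("api.user"))
finally:
    context.detach(token)
```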
Since its initial preview release in October 2021, the project has been made generally available with new features added according to the AWS roadmap. AWS supports the ADOT distribution, providing a means to record observability data across different AWS compute services, including tasks deployed using ECS and Lambdas.
ADOT uses a collector, a set of components combined to handle events. The collector is made up of the following components:
Receiver: the component that accepts incoming telemetry data and transforms it into an internal format recognized by the processor.
Processor: the component that transforms the received data, for example by enriching or dropping it, and then forwards it to the exporter.
Exporter: the component that forwards the data to its destination, either a service within AWS (e.g., X-Ray) or a file that an external analyzer can then consume (e.g., the Coralogix analytics platform).
The ADOT collector will forward trace events to AWS X-Ray and metric events to AWS CloudWatch, though these destinations are customizable. From both of these services, data can be sent to third-party tools for further analysis and to predict when issues will arise in the platform.
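As a rough sketch of how an application feeds the collector, the snippet below exports spans over OTLP/gRPC; it assumes an ADOT collector is reachable on the default port 4317 next to the application, and the service name is illustrative:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans over OTLP/gRPC to an ADOT collector assumed to be listening
# on the default OTLP port alongside the application.
provider = TracerProvider(resource=Resource.create({"service.name": "orders"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("orders")
with tracer.start_as_current_span("list_orders"):
    ...  # application work; the collector routes the span to X-Ray or elsewhere
```

When X-Ray is the destination, the ADOT documentation also recommends using its X-Ray-compatible trace ID generator and propagator so that X-Ray accepts the generated trace IDs.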
AWS Distro for OpenTelemetry supports exporting metrics and traces for AWS Lambda using a Lambda layer plugin. Lambda layers provide an easy setup where developers do not need to configure the ADOT code directly; traces are automatically sent to AWS X-Ray. Metrics are not yet available for Lambda through the OpenTelemetry Distro, though many Lambda metrics are already automatically available through CloudWatch.
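A hedged sketch with boto3 of attaching the layer might look like the following; the function name and layer ARN are placeholders (the real per-region, per-runtime layer ARNs are published in the ADOT documentation), and AWS_LAMBDA_EXEC_WRAPPER points at the wrapper script shipped in the layer:

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder ARN: look up the actual ADOT layer ARN for your region and runtime.
ADOT_LAYER_ARN = "arn:aws:lambda:us-east-1:123456789012:layer:aws-otel-python:1"

lambda_client.update_function_configuration(
    FunctionName="orders-handler",  # placeholder function name
    Layers=[ADOT_LAYER_ARN],
    Environment={
        "Variables": {
            # The wrapper script in the ADOT layer auto-instruments the
            # function so traces reach X-Ray without code changes.
            "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
        }
    },
)
```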
AWS ECS allows you to export metrics and trace data to several services using the ADOT sidecar container built and maintained by AWS as an open-source service. The container will collect and route observability data to your chosen destination. Recent updates to the AWS console allow developers to specify observability data collection during task definition creation.
Customizations of the ECS task definition and the ADOT configuration are available if custom configurations are required for your use case.
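As one possible sketch with boto3, a custom task definition can run the application container alongside the AWS-maintained collector image; the family, image, and role names below are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Register a Fargate task definition with the application container plus
# the ADOT collector image as a sidecar.
ecs.register_task_definition(
    family="orders-with-otel",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    taskRoleArn="arn:aws:iam::123456789012:role/adot-task-role",
    containerDefinitions=[
        {
            "name": "orders",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders:latest",
            "essential": True,
        },
        {
            "name": "aws-otel-collector",
            "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
            "essential": True,
            # A custom collector configuration can be supplied via the
            # container command or environment; otherwise the image's
            # built-in defaults are used.
        },
    ],
)
```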
Trace collection can be turned on for ECS tasks by configuring it in the ECS console. There is no need to add any logic or configurations to the code to get started. Traces appear by default in AWS X-Ray. Customized configurations are available for traces by configuring the ADOT setup. Traces may also be sent to other services or stored in files for export to third-party vendors for analysis.
Currently, the collection of metric data, or Container Insights, within ECS is in preview. The preview is fully available to all AWS accounts, but users should be prepared to handle changes to the service before it is made generally available.
Container Insights can be turned on directly when running a task in ECS. This setting has been integrated into the console for easy access. The same insights are also available if using Amazon Elastic Kubernetes Service (EKS) or Kubernetes on EC2.
Container Insights includes information not typically collected with AWS CloudWatch, providing further diagnostic metrics that help detect and recover from container issues such as restart failures. Collected metrics include CPU, memory, network, and disk usage. Detection can be done using a CloudWatch alarm or by sending metric data to third-party services for analysis and alerting.
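A sketch with boto3 of both steps might look like this; the cluster, service, and alarm names are placeholders, and the threshold is in the CPU units reported under the ECS/ContainerInsights namespace:

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

# Turn on Container Insights for an existing cluster (name is a placeholder).
ecs.update_cluster_settings(
    cluster="orders-cluster",
    settings=[{"name": "containerInsights", "value": "enabled"}],
)

# Alarm on high CPU reported by Container Insights for one service.
cloudwatch.put_metric_alarm(
    AlarmName="orders-service-high-cpu",
    Namespace="ECS/ContainerInsights",
    MetricName="CpuUtilized",
    Dimensions=[
        {"Name": "ClusterName", "Value": "orders-cluster"},
        {"Name": "ServiceName", "Value": "orders"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=400,  # CPU units, not a percentage
    ComparisonOperator="GreaterThanThreshold",
)
```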