OpenTelemetry is a data collection framework for monitoring applications. OpenTelemetry metrics allow the collection and reporting of raw measurements along with contextual information like time, labels, and resources. They serve as a foundational tool for monitoring and observability in applications, providing insights into performance, usage, and health.
Metrics in OpenTelemetry are designed to be versatile, supporting various types of data, including counts, gauges, and histograms. They accommodate diverse monitoring needs, from a simple count of requests to complex measurements of system resource utilization.
Implementing OpenTelemetry Metrics facilitates better understanding and analysis of system behavior. This enables developers and operators to detect anomalies, optimize performance, and ensure reliability and scalability.
OpenTelemetry represents metrics using the following data types, which build on one another:
Events in OpenTelemetry represent instances of measurements taken at specific points in time. The framework provides a flexible structure to capture not only the value but also the contextual information that accompanies the measurement. This concept is useful for pinpointing exact moments of interest within a system, such as spikes in load or errors during execution.
Using OpenTelemetry events, developers can track the occurrence and details of specific system events. This granularity is critical for in-depth analysis and troubleshooting. By aggregating these events over time, patterns can emerge, offering insights into system behavior and performance trends.
A data stream aggregates events over time, creating a continuous stream of measurement data. This model simplifies the tracking of metrics by providing a unified view of measurements that evolve over a period. It is particularly useful for monitoring metrics that represent ongoing activities, such as requests per second or CPU utilization. This helps in identifying long-term issues and assessing the impact of changes made to the system.
OpenTelemetry expands on the concept of data streams by associating each data stream with a time series. This approach captures the evolution of metrics over time and preserves the sequential order of events. It is suitable for metrics that require detailed historical analysis and forecasting future trends.
Here is an overview of the metric collection process in OpenTelemetry, showing the main components in the order they are used.
The Meter Provider acts as the entry point for the OpenTelemetry Metrics API. It is responsible for creating and managing Meter instances, which are used to capture metrics. The provider ensures that all Meters operate within a consistent configuration, facilitating the standardized collection of metric data across the application.
Working with the Meter Provider, developers can configure global settings for metric collection. This includes defining which metrics to collect, setting collection intervals, and specifying the destination for exported metrics.
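For example, here is a minimal sketch of configuring the global Meter Provider with a periodic reader that prints metrics to the console. The five-second interval and the console exporter are illustrative choices, not requirements:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Collect and export all metrics every 5 seconds (illustrative interval)
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(),
    export_interval_millis=5000,
)

# Every meter created from this provider shares this configuration
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))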
A Meter is an instrument within the OpenTelemetry Metrics API that facilitates the capture of metrics. It provides various methods to record measurements, supporting different types of metrics such as counters, gauges, and histograms. Each Meter is associated with a specific component or library, allowing metrics to be collected in a granular and organized manner.
Utilizing Meters enables developers to precisely define what to measure and how. This fine-grained control over metric collection allows for tailored monitoring solutions, focusing on critical components and disregarding irrelevant data.
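As a brief sketch, each component can obtain its own named meter so that its metrics carry that component's scope. The component names and version below are hypothetical:

from opentelemetry import metrics

# One meter per component; names and version are hypothetical
checkout_meter = metrics.get_meter("shop.checkout", "1.0.0")
inventory_meter = metrics.get_meter("shop.inventory", "1.0.0")

# Instruments created from a meter are attributed to that meter's scope
orders_counter = checkout_meter.create_counter(
    "orders_placed_total",
    description="Number of orders placed",
)
orders_counter.add(1)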
Instruments are the tools used within the OpenTelemetry Metrics API to record specific measurements. They support various metric types, ensuring that developers can capture the data most relevant to their monitoring objectives. Instruments provide a simple interface for recording measurements, abstracting away the complexities of metric aggregation and export.
Leveraging different types of instruments, developers can capture a range of metrics, from basic counts and gauges to complex histograms and summaries.
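The sketch below creates one instrument of each synchronous type; the instrument names and recorded values are illustrative:

from opentelemetry import metrics

meter = metrics.get_meter("instrument_examples")

# Counter: a monotonically increasing total
requests = meter.create_counter("requests_total")

# UpDownCounter: a value that can rise and fall, such as active sessions
active_sessions = meter.create_up_down_counter("active_sessions")

# Histogram: a distribution of recorded values, such as latencies
latency = meter.create_histogram("latency_ms", unit="ms")

requests.add(1)
active_sessions.add(1)   # a session starts
active_sessions.add(-1)  # a session ends
latency.record(42)       # one latency sample, in milliseconds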
Here are some of the metrics collected by OpenTelemetry.
Counters are a type of metric instrument used to capture a cumulative total, typically representing the number of occurrences of an event. They are best suited for tracking increments, such as the number of requests received or tasks completed. Counters provide a simple way to monitor system activity and workload.
Using counters, developers can gauge system throughput and detect anomalies in operational flow. For example, a sudden drop in the counter for processed transactions may indicate a bottleneck or failure in the system.
Example: Counting HTTP requests received by a server
To illustrate the use of counters in OpenTelemetry, consider a scenario where you want to track the number of HTTP requests received by a server.
Here’s a simple example in Python using the OpenTelemetry API. Before running this and the following examples, install the opentelemetry-api and opentelemetry-sdk packages, for example with pip install opentelemetry-api opentelemetry-sdk.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider

# Set the global MeterProvider
metrics.set_meter_provider(MeterProvider())

# Obtain a meter for this component
meter = metrics.get_meter("http_requests_meter")

# Define a counter instrument
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total number of HTTP requests received",
)

# Function to track requests
def track_request():
    request_counter.add(1, {"method": "GET", "endpoint": "/api/data"})

# Simulate receiving a request
track_request()
In this example, a counter named http_requests_total is created to track the total number of HTTP requests received by the server. The track_request function increments the counter by 1 each time it is called, simulating a new request. Attributes such as method and endpoint provide additional context for each measurement.
Gauges measure the current value of a particular attribute, such as memory usage or queue depth. Unlike counters, gauges can increase or decrease, providing a snapshot of a system’s state at a given point in time. They are crucial for assessing the health and performance of resources.
Implementing gauges allows operators to monitor resource utilization and capacity. For example, tracking the gauge of available memory helps in preventing out-of-memory errors by facilitating timely scaling or optimization.
Example: Monitoring the size of a job queue
Here is how you could monitor the current size of a job queue in Python:
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider

# Set the global MeterProvider
metrics.set_meter_provider(MeterProvider())

# Obtain a meter for this component
meter = metrics.get_meter("job_queue_meter")

# Data structure whose size we will monitor
queue = []

def get_current_queue_size():
    return len(queue)

# Callback invoked on each collection; it yields the current queue size
def callback_function(options: CallbackOptions):
    yield Observation(get_current_queue_size())

# Define an observable gauge that reads its value from the callback
queue_size_gauge = meter.create_observable_gauge(
    "job_queue_size",
    callbacks=[callback_function],
    description="Current size of the job queue",
)
In this example, an observable gauge named job_queue_size is created to monitor the size of a job queue. The callback runs at each collection interval and yields the queue's current size as an Observation, so the gauge always reflects the latest state of the queue.
Histograms in OpenTelemetry collect and categorize data points into distinct buckets, enabling the visualization of the distribution of measured values over a period. This type of metric is useful for understanding the variability and outliers in system performance metrics, such as request latency or size of payloads.
By analyzing histograms, developers can identify performance bottlenecks and optimize system response times. For example, a histogram of request latencies might reveal a long tail of slow requests, prompting investigation and remediation measures.
Example: Using histograms to track latency
The following code shows how to use histograms to track request latencies:
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider

# Set the global MeterProvider
metrics.set_meter_provider(MeterProvider())

# Obtain a meter for this component
meter = metrics.get_meter("request_latency_meter")

# Define a histogram instrument
latency_histogram = meter.create_histogram(
    "request_latency",
    unit="ms",
    description="Distribution of request latencies",
)

# Function to track latency
def track_latency(latency):
    latency_histogram.record(latency, {"endpoint": "/api/data"})

# Simulate tracking a request latency
track_latency(123)  # latency in milliseconds
This example shows how to track the distribution of request latencies using a histogram named request_latency. The track_latency function records the latency of a request, with attributes such as the endpoint providing additional context.
Here are some best practices for making the most out of metrics in OpenTelemetry.
Labels and attributes enrich metric data with contextual details, making them more specific and informative. However, using too many labels can lead to an explosion of metric dimensions, complicating analysis and storage. To strike a balance, select labels that provide meaningful differentiation without overwhelming the dataset.
Applying attributes thoughtfully enhances the utility of metrics. For example, adding a label for error codes to a counter of failed requests enables finer analysis of failure causes, facilitating targeted troubleshooting and resolution.
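As a sketch of this trade-off, the snippet below attaches bounded attributes to a failure counter while deliberately omitting unbounded values such as user IDs; the instrument and attribute names are illustrative:

from opentelemetry import metrics

meter = metrics.get_meter("attribute_example")

failed_requests = meter.create_counter(
    "http_requests_failed_total",
    description="Number of failed HTTP requests",
)

# Good: status code and method come from a small, fixed set of values
failed_requests.add(1, {"http.status_code": 503, "http.method": "GET"})

# Risky: a raw user or request ID creates a new time series per value,
# exploding cardinality; keep such details in logs or traces instead.
# failed_requests.add(1, {"user_id": "8f3a9c"})  # avoid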
Adopting a consistent naming convention for metrics is essential for avoiding confusion and ensuring easy identification and aggregation. Names should be descriptive, concise, and follow a predictable pattern across the application. This consistency aids in the discovery and analysis of metrics, enhancing the effectiveness of monitoring efforts.
A systematic approach to naming metrics simplifies their management, making it easier to correlate related metrics and interpret their significance. For example, using a standard prefix for all metrics related to database operations facilitates quick isolation of database-related performance issues.
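For instance, the sketch below gives all database metrics a shared db. prefix so they sort and filter together; the specific names are illustrative:

from opentelemetry import metrics

meter = metrics.get_meter("db_metrics")

# A common "db." prefix groups all database metrics together
db_queries = meter.create_counter(
    "db.queries_total",
    description="Total database queries executed",
)
db_errors = meter.create_counter(
    "db.errors_total",
    description="Total database errors",
)
db_latency = meter.create_histogram(
    "db.query_duration_ms",
    unit="ms",
    description="Distribution of query durations",
)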
When recording metrics, precision is important for capturing accurate and meaningful data. This involves selecting appropriate metric types and instruments, configuring suitable collection intervals, and ensuring reliable measurement methods. At the same time, reporting metrics should be focused and purposeful, prioritizing the most relevant and actionable information.
Precise recording and purposeful reporting maximize the value of collected metrics. For example, accurately tracking request latencies at a fine granularity supports detailed performance analysis. Focusing reports on key percentile values can highlight areas that require attention.
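One way to control this precision in the Python SDK is a View with explicit histogram buckets, sketched below. The bucket boundaries are illustrative and should be chosen to match your own latency profile:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import (
    ExplicitBucketHistogramAggregation,
    View,
)

# Finer buckets at the low end, where most request latencies fall
latency_view = View(
    instrument_name="request_latency",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[5, 10, 25, 50, 100, 250, 500, 1000]
    ),
)

metrics.set_meter_provider(MeterProvider(views=[latency_view]))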
Ensuring the efficiency and reliability of the telemetry pipeline is crucial for the effective use of OpenTelemetry metrics. This involves monitoring the pipeline for bottlenecks, data loss, or delays and optimizing its performance and scalability. Regularly reviewing and adjusting configuration, such as sampling rates and aggregation strategies, can enhance pipeline capabilities.
Active monitoring and continuous optimization of the telemetry pipeline ensure high-quality metric collection and reporting. This proactive approach helps in maintaining the responsiveness and accuracy of monitoring systems, enabling swift detection and resolution of issues.
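As a sketch, the Python SDK exposes some of these knobs directly: the export interval on the reader, and per-instrument Views that can drop noisy metrics entirely. The one-minute interval and the instrument name below are hypothetical:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import DropAggregation, View

# Export less frequently to reduce pipeline load (illustrative interval)
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(),
    export_interval_millis=60_000,
)

# Drop a high-volume, low-value instrument entirely (hypothetical name)
drop_debug = View(
    instrument_name="debug.cache_lookups",
    aggregation=DropAggregation(),
)

metrics.set_meter_provider(
    MeterProvider(metric_readers=[reader], views=[drop_debug])
)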
Coralogix sets itself apart in observability with its modern architecture, enabling real-time insights into logs, metrics, and traces with built-in cost optimization. Coralogix’s straightforward pricing covers all its platform offerings including APM, RUM, SIEM, infrastructure monitoring and much more. With unparalleled support that features less than 1 minute response times and 1 hour resolution times, Coralogix is a leading choice for thousands of organizations across the globe.