OpenTelemetry Span Metrics
What is Span Metrics?
Span Metrics offers an automated method of transforming and aggregating trace data into metrics outside Coralogix using the OpenTelemetry Span Metrics Connector. By sending the metrics to Coralogix, you can utilize our cutting-edge APM features, save on costs, and gain comprehensive insights into your data.
Span Metrics is currently available in beta for early adopters.
This document provides a step-by-step guide to setting up Span Metrics for your APM:
- Configuring the collector for Span Metrics
- Enabling tail sampling for optimized traces
- Defining buckets
- Switching APM UI from Events2Metrics to Span Metrics (only after configuring the collector)
- Troubleshooting
Benefits
Use Span Metrics for any of the following:
- Cost savings. Sending Span Metrics reduces the volume and frequency of the data sent, helping you cut costs dramatically when compared to sending us 100% of your spans. Compare and contrast our data pipelines, as detailed here.
- Easy startup. Span Metrics is particularly valuable when your system lacks traditional metrics but implements distributed tracing. It allows you to obtain metrics from your tracing pipeline without additional setup.
- Comprehensive insights. Even if your system is already equipped with metrics, leveraging Span Metrics can offer a deeper level of monitoring. The generated metrics provide insights at the application level, showing how tracing information propagates throughout your applications.
- Secure migration. Easily migrate from the Events2Metrics to Span Metrics data pipeline, while retaining E2M data during a defined retention period.
Span Metrics Generation
For customers transitioning from Events2Metrics (E2M) to Span Metrics, the default method is maintaining both Span Metrics and E2M. This allows E2M-based metrics to be generated alongside Span Metrics, enabling a fallback if necessary. Costs will be incurred for both pipelines. The user can update the default to a single method at any stage.
Collector configuration
Before configuring Span Metrics, consider the following key points.
- Data visibility: You cannot view Events2Metrics and Span Metrics data simultaneously in the UI. If both Span Metrics and E2M data are sent, use the API commands provided in this document to toggle between the two methods.
- Metrics dimensions: Using Span Metrics allows integration of more dimensions directly from the collector. Remember that each dimension counts toward your quota. We recommend incorporating only essential dimensions into the collector configuration. For optimal performance, limit the total number of permutations to 300,000 per metric during any selected time frame.
- SLO and Apdex: These settings are not automatically migrated when transitioning from Events2Metrics to Span Metrics. Define the buckets that represent your latency thresholds during the Span Metrics setup. Then, create the actual SLO and Apdex per service within the Service Catalog UI.
- Adjustments in Grafana, Alerts, and Custom Dashboards: After migrating from Events2Metrics (E2M) to Span Metrics, if E2M metrics are no longer being generated, any custom dashboards, Grafana alerts, or other configurations that previously relied on Service Catalog E2M metrics must be updated to use Span Metrics instead.
Updating the collector
Update the collector either manually (using your own OpenTelemetry) or via Helm (using the Kubernetes extension). See the relevant sections below.
Creating Span Metrics with the Kubernetes extension for OTel
Enabling Span Metrics
- If you have not yet done so, deploy the Coralogix Kubernetes extension package. Navigate to Data Flow > Extensions > Kubernetes from your Coralogix toolbar.
- Manually upgrade the Helm chart used with your Kubernetes integration to its latest version to enable the creation of Span Metrics. Span Metrics is disabled by default and can be enabled by setting `spanMetrics.enabled` to `true` in the values.yaml file:
```yaml
spanMetrics:
  enabled: true
  collectionInterval: "{{.Values.global.collectionInterval}}"
  metricsExpiration: 5m
  histogramBuckets: [1ms, 4ms, 10ms, 20ms, 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s]
  extraDimensions:
    - name: http.method
    - name: cgx.transaction
    - name: cgx.transaction.root
```
Note
Enabling the feature will create additional metrics, whose volume may increase significantly depending on how you instrument your applications. This is especially true when span names include specific values, such as user IDs or UUIDs; this instrumentation practice is strongly discouraged. In such cases, we recommend correcting your instrumentation or using the `spanMetrics.spanNameReplacePattern` parameter to replace the problematic values with a generic placeholder.
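As a rough, hypothetical sketch only (the exact field names should be confirmed against your chart version's values.yaml and the linked span-name replacement guide), such a pattern might look like:

```yaml
spanMetrics:
  enabled: true
  # Hypothetical example: replace numeric user IDs in span names with a placeholder.
  # Verify the exact spanNameReplacePattern schema for your chart version.
  spanNameReplacePattern:
    - regex: "user-[0-9]+"
      replacement: "user-{id}"
```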
Enabling tail sampling
We recommend complementing your setup with tail sampling. Enabling this feature grants you additional APM capabilities while optimizing costs. Tail sampling lets you view traces, service connections, and maps in the Coralogix platform. Find out more here.
The following example demonstrates how to employ tail sampling for trace reduction using the tail sampling processor. Install the `otel-integration` with the `tail-sampling-values.yaml` configuration. For instance:

```bash
helm repo add coralogix-charts-virtual https://cgx.jfrog.io/artifactory/coralogix-charts-virtual
helm upgrade --install otel-coralogix-integration coralogix-charts-virtual/otel-integration \
  --render-subchart-notes -f tail-sampling-values.yaml
```
This adjustment sets up the `otel-agent` pods to transmit span data to the `coralogix-opentelemetry-gateway` deployment through the load balancing exporter. Ensure adequate replica configuration and resource allocation to handle the anticipated load. You must then configure the tail sampling processor policies according to your specific tail sampling requirements.
When operating in an OpenShift environment, ensure the `distribution: "openshift"` parameter is set in your `values.yaml`. In Windows environments, use the `values-windows-tailsampling.yaml` configuration file. Find out more here.
Creating Span Metrics using your own OpenTelemetry
Enabling Span Metrics
When using your own OpenTelemetry or Prometheus, add the following to your configuration file:
```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 4ms, 6ms, 10ms, 100ms, 250ms]
    dimensions:
      - name: http.method
      - name: cgx.transaction
      - name: cgx.transaction.root
    exemplars:
      enabled: true
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 15s
    metrics_expiration: 5m
    events:
      enabled: true
      dimensions:
        - name: exception.type
        - name: exception.message
```
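For the connector to generate metrics, it must also be wired into your pipelines: the traces pipeline exports to `spanmetrics`, and a metrics pipeline receives from it. A minimal sketch, assuming an `otlp` receiver and the `coralogix` exporter used elsewhere in this guide (names are illustrative; adapt them to your configuration):

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      # Spans are exported both to Coralogix and into the spanmetrics connector.
      exporters: [coralogix, spanmetrics]
    metrics:
      # The spanmetrics connector feeds the generated metrics into this pipeline.
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [coralogix]
```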
Note
- `dimensions` may be modified, but should not be removed.
- Adjust `buckets` to best fit your usage. Read more below.
- Enabling this feature may create additional metrics, which can increase significantly depending on how you instrument your applications. This is especially true when span names include specific values, such as user IDs or UUIDs. Such instrumentation practice is strongly discouraged. In such cases, we recommend correcting your instrumentation or normalizing span names before they reach the spanmetrics connector, for example as sketched below.
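One way to normalize span names in your own collector is the transform processor; a minimal sketch, assuming span names such as `user-1234` (the regex and placeholder are illustrative), placed in the traces pipeline before the spanmetrics connector:

```yaml
processors:
  transform/span_names:
    trace_statements:
      - context: span
        statements:
          # Replace numeric user IDs in span names with a generic placeholder.
          - replace_pattern(name, "user-[0-9]+", "user-{id}")
```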
Enabling tail sampling
We recommend configuring tail sampling alongside your setup. Enabling this feature enhances APM capabilities while optimizing costs. Tail sampling allows you to view traces, service dependencies, and maps within the Coralogix platform.
This section demonstrates how to send traces with errors using tail sampling. We recommend creating multiple tracing pipelines for each type of filtering.
- Add the `tail_sampling` processor definition under `processors`. In this example, it is named `errors`, but you can choose any name.
- Include this processor in the pipeline, ensuring it follows state-using processors like `k8sattributes`.
```yaml
processors:
  tail_sampling/errors:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: only-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
service:
  pipelines:
    traces/errors:
      exporters:
        - coralogix
      processors:
        # - state-based processors like k8sattributes
        - tail_sampling/errors
        - batch
      receivers:
        - otlp
```
Validating the metrics
Validate your metrics to ensure that all of them are sent. The following metrics are generated by OpenTelemetry and, by default, are sent to enable Span Metrics. The metrics and their labels should not be removed.
Service Catalog (mandatory)
Metric | Label |
---|---|
duration_ms_sum | span_name, service_name, span_kind, status_code, http_method |
duration_ms_bucket | span_name, service_name, span_kind, status_code, http_method, le |
calls_total | span_name, service_name, span_kind, status_code, http_method |
duration_ms_count | span_name, service_name, span_kind, status_code, http_method |
Databases Catalog (mandatory)
The following labels are used to enable the Databases Catalog using Span Metrics. Currently, this is only supported for Kubernetes users who are using the Coralogix Helm chart and the `values.yaml` file.
Metric | Label |
---|---|
db_calls_total | status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name |
db_duration_ms_sum | status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name |
db_duration_ms_count | status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name |
db_duration_ms_bucket | status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name, le |
Service flows (optional)
A service flow denotes a singular logical unit of work in a software application. More precisely, it encompasses the function and method calls constituting that unit of work. Each flow consists of a root span, an operation that serves as its entry point and triggers all other related operations. Using custom instrumentation, the service flow tags are added to spans and should accordingly be converted into metric labels.
Metric | Label |
---|---|
duration_ms_sum | cgx_transaction, cgx_transaction_root |
duration_ms_bucket | cgx_transaction, cgx_transaction_root |
calls_total | cgx_transaction, cgx_transaction_root |
duration_ms_count | cgx_transaction, cgx_transaction_root |
API error tracking (optional)
To enable API error tracking for span metrics, the following labels are added by default:
Metric | Label |
---|---|
duration_ms_sum | rpc.grpc.status_code, http.response.status_code |
duration_ms_bucket | rpc.grpc.status_code, http.response.status_code |
calls_total | rpc.grpc.status_code, http.response.status_code |
duration_ms_count | rpc.grpc.status_code, http.response.status_code |
Span Metrics buckets for percentiles, SLO and Apdex
Configuring collector buckets
To ensure accurate calculation of SLOs, Apdex, latency, and latency percentiles, you must manually define the appropriate bucket thresholds in the collector YAML file, as they are not configured by default. Choose buckets that best align with your data and verify their correct configuration.
Apdex settings
The Span Metrics connector must explicitly include buckets for both 'T' and '4T' to ensure correct Apdex threshold calculation. In the example below, the Apdex threshold can be set to 1ms because both 'T' (1ms) and '4T' (4ms) are specified.
```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 4ms, 6ms, 10ms, 100ms, 250ms]
```
Modifying buckets used for active SLOs or Apdex calculations
Modifying buckets that are actively used in existing SLOs or Apdex calculations will immediately halt the current processing of those SLOs or Apdex scores. This will trigger an error in the UI, indicating the disruption. To restore functionality, the affected SLOs or Apdex calculations must be reconfigured with the updated threshold (according to the current bucket settings). Note that this will restart the SLO calculations, treating them as new SLOs.
Selecting buckets
Span Metrics do not have default buckets. You must define the buckets based on your environment and the distribution of service request durations over time.
As a best practice, start by analyzing your most critical services:
- What is the acceptable latency threshold ('T') in terms of duration? Add this as a bucket or a series of buckets.
- Consider adding another bucket for '4T' to account for Apdex calculations.
- Review service requests. What are the maximum and minimum durations? Are these durations anomalies or recurring patterns? If they are recurring, define additional buckets to capture these cases.
- Add intermediate buckets to cover a broader range of durations for more granular insights. An example is sketched below.
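As an illustrative sketch (the threshold values below are assumptions, not recommendations): if the acceptable latency 'T' for a service is 200ms, include both 200ms and 800ms ('4T') buckets, plus intermediate and boundary buckets that match your observed durations:

```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        # T = 200ms and 4T = 800ms are included explicitly for Apdex;
        # the remaining buckets add granularity around observed durations.
        buckets: [50ms, 100ms, 200ms, 400ms, 800ms, 1s, 2s, 5s]
```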
Configure different buckets per application
To use a Span Metrics connector with different buckets for each application in a Kubernetes environment, you must use the `spanMetricsMulti` preset. For example:
```yaml
presets:
  spanMetricsMulti:
    enabled: true
    defaultHistogramBuckets: [1ms, 4ms, 10ms, 20ms, 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s]
    configs:
      - selector: route() where attributes["service.name"] == "one"
        histogramBuckets: [1s, 2s]
      - selector: route() where attributes["service.name"] == "two"
        histogramBuckets: [5s, 10s]
```
For every selector, you must write an OTTL statement. Find out more here.
Enable `spanMetricsMulti` if you want to define metrics for each service. Generally, it's better to have a broad bucket definition that covers the whole system; `spanMetricsMulti` allows for more detailed per-service metrics.
- This feature is available only for Kubernetes extension users.
- Update your Helm chart values by setting `spanMetricsMulti.enabled` to `true` instead of `spanMetrics.enabled`.
Note
If you're using multiple collectors, ensure that the bucket configuration is consistent across all of them.
Exemplars
Exemplars are an OpenTelemetry feature that enhances issue investigation by allowing you to navigate seamlessly from metrics to the spans dashboard. This integration enriches the metrics with span and trace correlation, providing better visibility and insights.
For APM with Span Metrics, exemplars are enabled in the connector configuration above; you can set `exemplars.enabled` to `false` if needed, as shown below.
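For example, to turn exemplars off in the connector configuration shown earlier:

```yaml
connectors:
  spanmetrics:
    exemplars:
      enabled: false
```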
Using multiple OTel agents
When using multiple OpenTelemetry (OTel) collector agents, each performs span metrics aggregation separately. Without a unique label value, Coralogix receives the metrics individually and cannot effectively aggregate them. For example, Kubernetes users who run a collector on each node may see metrics from the same service on different nodes overwriting each other. Adding `k8s.pod.name` as a label resolves this issue by providing a unique identifier, which differentiates the metrics and enables accurate querying and aggregation, as sketched below.
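A minimal sketch, assuming the raw spanmetrics connector configuration shown earlier (with the Kubernetes extension, the equivalent would be an entry under `spanMetrics.extraDimensions`):

```yaml
connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: cgx.transaction
      - name: cgx.transaction.root
      # Unique per-collector label so metrics from different nodes do not collide.
      # k8s.pod.name must be present on spans (e.g., added by the k8sattributes processor).
      - name: k8s.pod.name
```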
Full configuration with Database Catalog
Span Metrics
- Currently, the Database Catalog integration with Span Metrics is available for Kubernetes only.
- Read the following instructions and use the provided values.yaml file.
Events2Metric
If you want to use Span Metrics for the Service Catalog while continuing to use E2M for the Database Catalog:
- Define the filter for the Database Catalog under `processors`.
- Add the following processor to the pipeline to filter the spans.
```yaml
processors:
  filter/dbcatalog:
    error_mode: ignore
    traces:
      span:
        - 'attributes["db.system"] == nil'
service:
  pipelines:
    traces/dbcatalog:
      exporters:
        - coralogix
      processors:
        - filter/dbcatalog
      receivers:
        - otlp
```
Enabling API error tracking
Service error data is extracted from span metrics within the time interval selected in the time picker, based on HTTP or gRPC status codes. To enable API error tracking using span metrics, make sure these attributes are included within your error spans, and follow the instructions below.
Note
`errorTracking` works only with OpenTelemetry SDKs that support OpenTelemetry semantic conventions above v1.21.0. If you're using an older version, you may need to modify certain attributes. Read more here.
Using OTel Kubernetes extension
- Ensure that the Coralogix Helm repository is up to date with the latest version.
- Obtain the latest values.yaml from the Coralogix OpenTelemetry Integration repository.
- Once you've updated the config in the `values.yaml` file, apply the changes using `helm upgrade`. Be sure to replace `<namespace>` with the correct Kubernetes namespace where the extension is deployed.
- After the upgrade, verify that all pods are running correctly. If the collector does not roll out after the change, initiate a manual rollout. These steps are sketched below.
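A sketch of these steps, assuming the release and chart names used earlier in this guide and that the agent runs as a DaemonSet named `coralogix-opentelemetry-agent` (an assumption; adjust names and `<namespace>` to your deployment):

```bash
# Apply the updated values (replace <namespace> with the extension's namespace).
helm upgrade --install otel-coralogix-integration coralogix-charts-virtual/otel-integration \
  --render-subchart-notes -f values.yaml -n <namespace>

# Verify that all pods are running correctly.
kubectl get pods -n <namespace>

# If the collector does not roll out after the change, trigger a manual rollout.
kubectl rollout restart daemonset/coralogix-opentelemetry-agent -n <namespace>
```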
Switching APM UI from Events2Metrics to Span Metrics
After setting up the span metrics collector, you can update the APM UI using the following API command to utilize the metrics collected via span metrics. Note that Events2Metrics and Span Metrics data cannot be displayed simultaneously in the UI.
Switch from Events2Metrics (E2M) to Span Metrics via the API using the following command:
```bash
grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/ReplaceApmSettings <<EOF
{
  "apm_settings": {
    "catalog_settings": [
      {
        "source": "APM_SOURCE_SPAN_METRICS",
        "catalog": "SERVICE_CATALOG"
      },
      {
        "source": "APM_SOURCE_SPAN_METRICS",
        "catalog": "DATABASE_CATALOG"
      }
    ]
  }
}
EOF
```
Make sure you're using the correct gRPC endpoint (`<env url>`).
Going forward, both E2M and Span Metrics will be collected (with data ingestion charges applied accordingly), but the UI will display data based on Span Metrics. Historical Events2Metrics metric data can still be accessed via Custom Dashboards and Grafana, subject to its retention period.
Reverting to Events2Metrics collection
If needed, you can switch back to E2M, as detailed below.
```bash
grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/ReplaceApmSettings <<EOF
{
  "apm_settings": {
    "catalog_settings": [
      {
        "source": "APM_SOURCE_E2M",
        "catalog": "SERVICE_CATALOG"
      },
      {
        "source": "APM_SOURCE_E2M",
        "catalog": "DATABASE_CATALOG"
      }
    ]
  }
}
EOF
```
Disabling Events2Metrics
To stop collecting data from the Events2Metrics pipeline and rely solely on Span Metrics (or vice versa), run the following command. This command also specifies the date on which you want Events2Metrics to stop metrics generation.
```bash
grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/ReplaceApmSettings <<EOF
{
  "apm_settings": {
    "catalog_settings": [
      {
        "source": "APM_SOURCE_SPAN_METRICS",
        "catalog": "DATABASE_CATALOG",
        "migration_period_end_date": {
          "nanos": 0,
          "seconds": "1731109760"
        }
      },
      {
        "source": "APM_SOURCE_SPAN_METRICS",
        "catalog": "SERVICE_CATALOG",
        "migration_period_end_date": {
          "nanos": 0,
          "seconds": "1731109760"
        }
      }
    ]
  }
}
EOF
```
- nanos: Always set to zero.
- seconds: Paste the Epoch timestamp representing the chosen date, as in the example below.
- The `migration_period_end_date` also allows you to define a specific period during which both Events2Metrics and Span Metrics data are generated and retained. After this period, only Span Metrics remain. Events2Metrics-based metrics will continue to be generated as long as spans are sent to Coralogix during the defined period, enabling you to revert if needed. Once the retention period ends, Events2Metrics-based metrics will no longer be created.
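For example, on Linux with GNU date (an assumption; BSD/macOS date uses different flags), you can produce the epoch value for a chosen end date like this:

```bash
# Epoch seconds for the end of the migration period (UTC), e.g. 2024-11-08 23:59:59.
date -d "2024-11-08 23:59:59 UTC" +%s
```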
Note
If you decide to migrate back to Events2Metrics in the future:
- Contact our support team to re-create APM Events2Metrics rules.
- Redefine your SLO and Apdex settings from scratch, as they are not automatically restored.
- Once Events2Metrics is disabled, it will no longer be possible to view Events2Metrics-based data for the period after its deactivation, except for historical data prior to disabling, which will still be available according to its retention period.
Validating your data source
Validate your data source using the following command:
```bash
grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/GetApmSettings <<EOF
{
  "catalog": "SERVICE_CATALOG"
}
EOF
```
Present Lambda functions with Span Metrics for Service Catalog
The Service Catalog will function seamlessly with both Span Metrics and Events2Metric, as long as all instructions in the documentation are followed correctly.
To display services based on AWS Lambda in the Service Catalog, your organization must send spans or Span Metrics. With Events2Metric (E2M), this is done automatically. For Span Metrics, the only requirement is to ensure that Lambda-generated spans are routed through the collector.
Note
This ensures that Lambda-based services, along with their metrics, are displayed in the Service Catalog. The Serverless catalog is supported only when using Events2Metrics.
Troubleshooting
- High cardinality (over 300K). High cardinality occurs when metrics or spans contain labels with numerous unique values, such as user IDs, UUIDs, or session-specific data. This creates a large number of metric combinations, often exceeding practical limits. For example, using user-specific values in span names or labels can lead to exponentially growing cardinality, complicating metric analysis and visualization. In cases of high cardinality caused by overly unique span names, we recommend adjusting your instrumentation or using the `spanMetrics.spanNameReplacePattern` parameter to replace the problematic values with a generic placeholder. For example, if your span names follow a template such as `user-1234`, you can use a pattern that replaces the user ID with a generic placeholder, resulting in the generalized name `user-{id}`. Learn how to replace a specific `span.name` with a generic one as detailed here.
- Metrics expiration. Use `metrics_expiration` when you want to control how long unexported metrics are kept in memory, as sketched below. See here.
- Reduce data volume. Remove the Events2Metrics rules and stop generating Events2Metrics metrics when you have fully transitioned to Span Metrics and no longer require dual-pipeline data collection. This step is suitable for reducing metric ingestion and storage costs, as well as simplifying system configurations. However, note that any data generated during the period when Events2Metrics rules are removed will not be accessible if you later decide to revert to Events2Metrics. This action helps decrease the overall data volume but does not directly address or affect cardinality issues.
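For instance, a minimal sketch of setting the expiration in the spanmetrics connector configuration used earlier:

```yaml
connectors:
  spanmetrics:
    # Drop metric series that have not been updated for 5 minutes.
    metrics_expiration: 5m
```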
Permissions
In your Coralogix application, go to Settings > Roles > Compare Roles > APM - Manage Service Catalog Services and verify that the following permissions exist.
Permission Group | Resource | Action | Permission Name |
---|---|---|---|
APM | Manage Service Catalog Services | UpdateConfig | SERVICE-CATALOG:UPDATE |