OpenTelemetry Span Metrics

What is Span Metrics?

Span Metrics offers an automated method of transforming and aggregating trace data into metrics outside Coralogix using the OpenTelemetry Span Metrics Connector. By sending the metrics to Coralogix, you can utilize our cutting-edge APM features, save on costs, and gain comprehensive insights into your data.

Span Metrics is currently available in beta for early adopters.

This document provides a step-by-step guide to setting up Span Metrics for your APM:

  • Configuring the collector for Span Metrics
  • Enabling tail sampling for optimized traces
  • Defining buckets
  • Switching APM UI from Events2Metrics to Span Metrics (only after configuring the collector)
  • Troubleshooting

Benefits

Use Span Metrics for any of the following:

  • Cost savings. Sending Span Metrics reduces the volume and frequency of the data sent, helping you cut costs dramatically when compared to sending us 100% of your spans. Compare and contrast our data pipelines, as detailed here.
  • Easy startup. Span Metrics is particularly valuable when your system lacks traditional metrics but implements distributed tracing. It allows you to obtain metrics from your tracing pipeline without additional setup.
  • Comprehensive insights. Even if your system is already equipped with metrics, leveraging Span Metrics offers a deeper level of monitoring. The generated metrics provide insights at the application level, showing how tracing information propagates throughout your applications.
  • Secure migration. Easily migrate from the Events2Metrics to Span Metrics data pipeline, while retaining E2M data during a defined retention period.

Span Metrics Generation

For customers transitioning from Events2Metrics (E2M) to Span Metrics, the default is to maintain both Span Metrics and E2M. This allows E2M-based metrics to be generated alongside Span Metrics, enabling a fallback if necessary. Costs are incurred for both pipelines. You can switch from the default to a single method at any stage.

Collector configuration

Before configuring Span Metrics, consider the following key points.

  • Data visibility: You cannot view Events2Metrics and Span Metrics data simultaneously in the UI. If both Span Metrics and E2M data are sent, use the API commands provided in this document to toggle between the two methods.
  • Metrics dimensions: Using Span Metrics allows integration of more dimensions directly from the collector. Remember that each dimension counts toward your quota. We recommend incorporating only essential dimensions into the collector configuration. For optimal performance, limit the total number of permutations to 300,000 per metric during any selected time frame.
  • SLO and Apdex: These settings are not automatically migrated when transitioning from Events2Metrics to Span Metrics. Define the buckets that represent your latency thresholds during the Span Metrics setup. Then, create the actual SLO and Apdex per service within the Service Catalog UI.
  • Adjustments in Grafana, Alerts, and Custom Dashboards: After migrating from Events2Metrics (E2M) to Span Metrics, if E2M metrics are no longer being generated, any custom dashboards, Grafana alerts, or other configurations that previously relied on Service Catalog E2M metrics must be updated to use Span Metrics instead.

Updating the collector

Update the collector either manually (using your own OpenTelemetry) or via Helm (using the Kubernetes extension). See the relevant sections below.

Creating Span Metrics with the Kubernetes extension for OTel

Enabling Span Metrics
  1. If you have not yet done so, deploy the Coralogix Kubernetes extension package. Navigate to Data Flow > Extensions > Kubernetes from your Coralogix toolbar.
  2. Manually upgrade the Helm chart used with your Kubernetes integration to its latest version to enable the creation of Span Metrics. Span Metrics is disabled by default and can be enabled by setting the spanMetrics.enabled value to true in the values.yaml file.
    spanMetrics:
      enabled: true
      collectionInterval: "{{.Values.global.collectionInterval}}"
      metricsExpiration: 5m
      histogramBuckets:
        [1ms, 4ms, 10ms, 20ms, 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s]
      extraDimensions:
        - name: http.method
        - name: cgx.transaction
        - name: cgx.transaction.root

Note

Enabling the feature will create additional metrics, the number of which may increase significantly depending on how you instrument your applications. This is especially true when the span name includes specific values, such as user IDs or UUIDs. This instrumentation practice is strongly discouraged. In such cases, we recommend correcting your instrumentation or using the spanMetrics.spanNameReplacePattern parameter to replace the problematic values with a generic placeholder.

Enabling tail sampling

We recommend following your setup with tail sampling. Enabling this feature grants you additional APM capabilities while optimizing costs. Tail sampling lets users view traces, service connections, and maps in the Coralogix platform. Find out more here.

The following example demonstrates how to employ tail sampling for trace reduction using the tail sampling processor. Incorporate the otel-integration by installing it with the tail-sampling-values.yaml configuration. For instance:

helm repo add coralogix-charts-virtual https://cgx.jfrog.io/artifactory/coralogix-charts-virtual

helm upgrade --install otel-coralogix-integration coralogix-charts-virtual/otel-integration \
  --render-subchart-notes -f tail-sampling-values.yaml

This adjustment will set up the otel-agent pods to transmit span data to the coralogix-opentelemetry-gateway deployment through the load balancing exporter. Ensure adequate replica configuration and resource allocation to handle the anticipated load. Subsequently, you must configure tail-sampling processor policies according to your specific tail sampling requirements.

When operating in an Openshift environment, ensure the distribution: "openshift" parameter is set in your values.yaml. In Windows environments, utilize the values-windows-tailsampling.yaml configuration file. Find out more here.

Creating Span Metrics using your own OpenTelemetry

Enabling Span Metrics

When using your own OpenTelemetry or Prometheus, add the following to your configuration file:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 4ms, 6ms, 10ms, 100ms, 250ms]
    dimensions:
      - name: http.method
      - name: cgx.transaction
      - name: cgx.transaction.root
    exemplars:
      enabled: true
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 15s
    metrics_expiration: 5m
    events:
      enabled: true
      dimensions:
        - name: exception.type
        - name: exception.message

Note

  • dimensions may be modified, but should not be removed.

  • Adjust buckets to fit your usage best. Read more below.

  • Enabling this feature may create additional metrics, which can increase significantly depending on how you instrument your applications. This is especially true for cases where the span name includes specific values, such as user IDs or UUIDs. Such instrumentation practice is strongly discouraged. In such cases, we recommend adding the following code snippet to the configuration file:

  transform/span_name:
    trace_statements:
      - context: span
        statements:
          # Replace high-cardinality values in the span name with a generic
          # placeholder, for example user-1234 -> user-{id}.
          - replace_pattern(name, "user-[0-9]+", "user-{id}")

Enabling tail sampling

We recommend configuring tail sampling alongside your setup. Enabling this feature enhances APM capabilities while optimizing costs. Tail sampling allows you to view traces, service dependencies, and maps within the Coralogix platform.

This section demonstrates how to send traces with errors using tail sampling. We recommend creating multiple tracing pipelines for each type of filtering.

  1. Add the tail_sampling processor definition under processors. In this example, it is named errors, but you can choose any name.
  2. Include this processor in the pipeline, ensuring it follows state-using processors like k8sattributes.
processors:
  tail_sampling/errors:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: only-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
service:
  pipelines:
    traces/errors:
      exporters:
        - coralogix
      processors:
        # - state-based processors like k8sattributes
        - tail_sampling/errors
        - batch
      receivers:
        - otlp

Validating the metrics

Validate your metrics to ensure that all of them are sent. The following metrics are generated by OpenTelemetry and, by default, are sent to enable Span Metrics. The metrics and their labels should not be removed.

Service Catalog (mandatory)

Metric               Labels
duration_ms_sum      span_name, service_name, span_kind, status_code, http_method
duration_ms_bucket   span_name, service_name, span_kind, status_code, http_method, le
calls_total          span_name, service_name, span_kind, status_code, http_method
duration_ms_count    span_name, service_name, span_kind, status_code, http_method

Databases Catalog (mandatory)

The following labels are used to enable the Databases Catalog using Span Metrics. Currently, this is only supported for Kubernetes users who are using the Coralogix Helm chart and the values.yaml file.

Metric                  Labels
db_calls_total          status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name
db_duration_ms_sum      status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name
db_duration_ms_count    status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name
db_duration_ms_bucket   status_code, db_system, db_namespace, span_name, db_operation_name, db_collection_name, service_name, le

Service flows (optional)

A service flow denotes a single logical unit of work in a software application. More precisely, it encompasses the function and method calls that constitute that unit of work. Each flow consists of a root span, an operation that serves as its entry point and triggers all other related operations. The service flow tags are added using custom instrumentation and should therefore be converted into metric labels.

Metric               Labels
duration_ms_sum      cgx_transaction, cgx_transaction_root
duration_ms_bucket   cgx_transaction, cgx_transaction_root
calls_total          cgx_transaction, cgx_transaction_root
duration_ms_count    cgx_transaction, cgx_transaction_root

API error tracking (optional)

To enable API error tracking for span metrics, the following labels are added by default:

Metric               Labels
duration_ms_sum      rpc.grpc.status_code, http.response.status_code
duration_ms_bucket   rpc.grpc.status_code, http.response.status_code
calls_total          rpc.grpc.status_code, http.response.status_code
duration_ms_count    rpc.grpc.status_code, http.response.status_code

Span Metrics buckets for percentiles, SLO and Apdex

Configuring collector buckets

To ensure accurate calculation of SLOs, Apdex, latency, and latency percentiles, you must manually define the appropriate bucket thresholds in the collector YAML file, as they are not configured by default. Choose buckets that best align with your data and verify their correct configuration.

Apdex settings

The Span Metrics connector must explicitly include buckets for both 'T' and '4T' to ensure correct Apdex threshold calculation. In the example below, the Apdex threshold can be set to 1ms because both 'T' (1ms) and '4T' (4ms) are specified.

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [ 100us, 1ms, 2ms, 4ms, 6ms, 10ms, 100ms, 250ms ]

Modifying buckets used for active SLOs or Apdex calculations

Modifying buckets that are actively used in existing SLOs or Apdex calculations will immediately halt the current processing of those SLOs or Apdex scores. This will trigger an error in the UI, indicating the disruption. To restore functionality, the affected SLOs or Apdex calculations must be reconfigured with the updated threshold (according to the current bucket settings). Note that this will restart the SLO calculations, treating them as new SLOs.

Selecting buckets

Span Metrics do not have default buckets. You must define the buckets based on your environment and the distribution of service request durations over time.

As a best practice, start by analyzing your most critical services (a worked example follows this list):

  • What is the acceptable latency threshold ('T') in terms of duration?

    • Add this as a bucket or a series of buckets.

    • Consider adding another bucket for ‘4T’ to account for Apdex calculations.

  • Review service requests. What are the maximum and minimum durations?

    • Are these durations anomalies or recurring patterns? If they are recurring, define additional buckets to capture these cases.

    • Add intermediate buckets to cover a broader range of durations for more granular insights.
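
For illustration, here is a minimal sketch of a bucket list for a service whose acceptable latency threshold ('T') is 200ms. The values are examples only; adapt them to your own latency distribution:

  connectors:
    spanmetrics:
      histogram:
        explicit:
          # T = 200ms and 4T = 800ms are included so Apdex can be calculated.
          # Buckets below T add resolution for fast requests; buckets above 4T
          # capture recurring slow requests.
          buckets: [50ms, 100ms, 200ms, 400ms, 800ms, 2s, 5s]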

Configure different buckets per application

To use a Span Metrics connector with different buckets for each application in a Kubernetes environment, use the spanMetricsMulti preset. For example:

  presets:
    spanMetricsMulti:
      enabled: true
      defaultHistogramBuckets: [1ms, 4ms, 10ms, 20ms, 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s]
      configs:
        - selector: route() where attributes["service.name"] == "one"
          histogramBuckets: [1s, 2s]
        - selector: route() where attributes["service.name"] == "two"
          histogramBuckets: [5s, 10s]

For every selector, you must write an OTTL statement. Find out more here.

Enable spanMetricsMulti if you want to define buckets per service. Generally, it’s better to have a broad bucket definition that covers the whole system; spanMetricsMulti allows for more detailed per-service metrics.

  • This feature is available only for Kubernetes extension users.
  • Update your Helm chart values by enabling spanMetricsMulti instead of spanMetrics:

    spanMetrics:
      enabled: false
    spanMetricsMulti:
      enabled: true
    

Note

If you're using multiple collectors, ensure that the bucket configuration is consistent across all of them.

Exemplars

Exemplars are an OpenTelemetry feature that enhances issue investigation by allowing you to navigate seamlessly from metrics to the spans dashboard. This integration enriches the metrics with span and trace correlation, providing better visibility and insights.

For APM with Span Metrics, exemplars are enabled in the recommended configuration; you can set the value to false if needed.

exemplars: true

Using multiple OTel agents

When using multiple OpenTelemetry (OTel) collector agents, each performs span metrics aggregation separately. Without a unique label value, Coralogix receives the metrics individually and cannot effectively aggregate them. For example, Kubernetes users who implement a collector on each node may experience metrics from the same service on different nodes overwriting each other. Adding k8s.pod.name as a label resolves this issue by providing a unique identifier, which differentiates the metrics and enables accurate querying and aggregation.
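
For example, a minimal sketch for the Kubernetes extension that adds k8s.pod.name to the extraDimensions list shown earlier (when using your own OpenTelemetry, add it under the connector's dimensions instead):

  spanMetrics:
    enabled: true
    extraDimensions:
      # Unique per-pod label, so metrics produced by collectors on different
      # nodes do not overwrite each other.
      - name: k8s.pod.name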

Full configuration with Database Catalog

If you want to use Span Metrics for the Service Catalog while continuing to use E2M for the Database Catalog:

  1. Define the filter for the Database Catalog under processors.
  2. Add the following processor to the pipeline to filter the spans.
processors:
  filter/dbcatalog:
    error_mode: ignore
    traces:
      span:
        - 'attributes["db.system"] == nil'
service:        
  pipelines:        
    traces/dbcatalog:
      exporters:
        - coralogix
      processors:
        - filter/dbcatalog
      receivers:
        - otlp

Enabling API error tracking

Service error data is extracted from span metrics within the time interval selected in the time picker, based on HTTP or gRPC status codes. To enable API error tracking using span metrics, make sure these attributes are included within your error spans, and follow the instructions below.
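
If you generate Span Metrics with your own OpenTelemetry collector, one way to expose these status codes as metric labels is to add them as connector dimensions. A minimal sketch, assuming your spans carry these attributes per the semantic conventions referenced in the note below:

  connectors:
    spanmetrics:
      dimensions:
        # Status-code attributes used by API error tracking.
        - name: http.response.status_code
        - name: rpc.grpc.status_code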

Note

errorTracking works only with OpenTelemetry SDKs that support OpenTelemetry semantic conventions v1.21.0 or later. If you're using an older version, you may need to modify certain attributes. Read more here.

Using OTel Kubernetes extension

  1. Ensure that the Coralogix Helm repository is up-to-date with the latest version.
  2. Obtain the latest values.yaml from the Coralogix OpenTelemetry Integration repository.
  3. Once you've updated the config in the values.yaml file, apply the changes using helm upgrade. Be sure to replace <namespace> with the correct Kubernetes namespace where the extension is deployed.

    helm upgrade --install otel-integration coralogix-charts-virtual/otel-integration -f values.yaml -n <namespace>
    
  4. After the upgrade, verify that all pods are running correctly:

    kubectl get pods -n <namespace>
    

If the collector does not roll out after the change, initiate a manual rollout.
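
For example, a manual rollout could look like this (the DaemonSet name is a placeholder; use the name reported by kubectl get daemonsets in your namespace):

  # Restart the collector pods so they pick up the new configuration.
  kubectl rollout restart daemonset <collector-daemonset-name> -n <namespace>
  kubectl rollout status daemonset <collector-daemonset-name> -n <namespace>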

Switching APM UI from Events2Metrics to Span Metrics

After setting up the span metrics collector, you can update the APM UI using the following API command to utilize the metrics collected via span metrics. Note that Events2Metrics and Span Metrics data cannot be displayed simultaneously in the UI.

  1. Switch from Events2Metrics (E2M) to Span Metrics via the API using the following command:

    grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/ReplaceApmSettings <<EOF
    {
        "apm_settings": {
            "catalog_settings": [
                {
                    "source": "APM_SOURCE_SPAN_METRICS",
                    "catalog": "SERVICE_CATALOG"
                },
                {
                    "source": "APM_SOURCE_SPAN_METRICS",
                    "catalog": "DATABASE_CATALOG"
                }
            ]
        }
    }
    EOF
    
  2. Make sure you're using the correct gRPC endpoint (<env url>).

Going forward, both E2M and Span Metrics will be collected (with data ingestion charges applied accordingly), but the UI will display data based on Span Metrics. Historical Events2Metrics metric data can still be accessed via Custom Dashboards and Grafana, subject to its retention period.

Reverting to Events2Metrics collection

If needed, you can switch back to E2M, as detailed below.

grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/ReplaceApmSettings <<EOF
{
    "apm_settings": {
        "catalog_settings": [
            {
                "source": "APM_SOURCE_E2M",
                "catalog": "SERVICE_CATALOG"
            },
            {
                "source": "APM_SOURCE_E2M",
                "catalog": "DATABASE_CATALOG"
            }
        ]
    }
}
EOF

Disabling Events2Metrics

To stop collecting data from the Events2Metrics pipeline and rely solely on Span Metrics (or vice versa), run the following command. This command also specifies the date on which Events2Metrics should stop generating metrics.

grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/ReplaceApmSettings <<EOF 

{
    "apm_settings": {
        "catalog_settings": [
            {
                "source": "APM_SOURCE_SPAN_METRICS",
                "catalog": "DATABASE_CATALOG",
                "migration_period_end_date": {
                    "nanos": 0,
                    "seconds": "1731109760"
                }
            },
            {
                "source": "APM_SOURCE_SPAN_METRICS",
                "catalog": "SERVICE_CATALOG",
                "migration_period_end_date": {
                    "nanos": 0,
                    "seconds": "1731109760"
                }
            }
        ]
    }
}

EOF

  • nanos: Always set to zero.
  • seconds: The Epoch timestamp representing the chosen date (see the example after this list).
  • The migration_period_end_date also allows you to define a specific period during which both Events2Metrics and Span Metrics data are generated and retained. After this period, only Span Metrics remain. Events2Metrics-based metrics will continue to be generated as long as spans are sent to Coralogix during the defined period, enabling you to revert if needed. Once the retention period ends, Events2Metrics-based metrics will no longer be created.
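
For reference, one way to generate the Epoch timestamp in a GNU/Linux shell (the date below is only an example):

  # Prints the Unix epoch seconds for the chosen cut-off date (UTC).
  date -u -d "2024-11-08 00:00:00" +%s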

Note

If you decide to migrate back to Events2Metrics in the future:

  • Contact our support team to re-create APM Events2Metrics rules.
  • Redefine your SLO and Apdex settings from scratch, as they are not automatically restored.
  • Once Events2Metrics is disabled, it will no longer be possible to view Events2Metrics-based data for the period after its deactivation, except for historical data prior to disabling, which will still be available according to its retention period.

Validating your data source

Validate your data source using the following command:

grpcurl -H "Authorization: Bearer <token>" -d @ ng-api-grpc.<env url> com.coralogixapis.service_catalog.v1.ApmSettingsService/GetApmSettings <<EOF
{
    "catalog": "SERVICE_CATALOG"
}
EOF

Present Lambda functions with Span Metrics for Service Catalog

The Service Catalog functions seamlessly with both Span Metrics and Events2Metrics, as long as all instructions in the documentation are followed correctly.

To display services based on AWS Lambda in the Service Catalog, your organization must send spans or Span Metrics. With Events2Metrics (E2M), this is done automatically. For Span Metrics, the only requirement is to ensure that Lambda-generated spans are routed through the collector.

Note

This ensures that Lambda-based services, along with their metrics, are displayed in the Service Catalog. Serverless catalog is supported only using Events2Metrics.

Troubleshooting

  • High cardinality (over 300K permutations) occurs when metrics or spans contain labels with numerous unique values, such as user IDs, UUIDs, or session-specific data. This creates a large number of metric combinations, often exceeding practical limits. For example, using user-specific values in span names or labels can cause cardinality to grow exponentially, complicating metric analysis and visualization. In cases of high cardinality caused by overly unique span names, we recommend adjusting your instrumentation or using the spanMetrics.spanNameReplacePattern parameter to replace the problematic values with a generic placeholder. For example, if your span names follow the template user-1234, a pattern like the one shown after this list replaces the user ID with a generic placeholder, resulting in the generalized span name user-{id}. Learn how to replace a specific span.name with a generic one as detailed here.

  • Use metrics_expiration when you want to control how long unexported metrics are kept in memory. See here.

  • Reduce data volume. Remove the Events2Metrics rules and stop generating Events2Metrics metrics once you have fully transitioned to Span Metrics and no longer require dual-pipeline data collection. This step reduces metric ingestion and storage costs and simplifies system configuration. However, note that any data generated during the period when the Events2Metrics rules are removed will not be accessible if you later decide to revert to Events2Metrics. This action decreases the overall data volume but does not directly address or affect cardinality issues.
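
A minimal sketch of the spanMetrics.spanNameReplacePattern parameter for the span name example above. The exact structure of this value may differ between chart versions, so check the values reference of the otel-integration chart:

  spanMetrics:
    enabled: true
    # Illustrative pattern only: replaces the numeric user ID in span names
    # such as user-1234 with a generic placeholder, producing user-{id}.
    spanNameReplacePattern:
      - regex: "user-[0-9]+"
        replacement: "user-{id}"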

Permissions

In your Coralogix application, go to Settings > Roles > Compare Roles > APM - Manage Service Catalog Services and verify that the following permissions exist.

Permission Group   Resource                          Action         Permission Name
APM                Manage Service Catalog Services   UpdateConfig   SERVICE-CATALOG:UPDATE