Span Metrics Cardinality Limiting

Span metrics can generate extremely high cardinality when dynamic values, such as UUIDs or timestamps, are used in label fields like span_name (see the example after this list). This can result in:

  • Excessive time series creation and a degraded APM experience
  • Significantly higher data volume being sent, which consumes your units quota
  • Performance degradation
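
For illustration, if span names embed per-request identifiers, every request creates a new label combination and therefore a new time series. The series below are hypothetical, using the calls_total metric produced by the spanmetrics connector:

    calls_total{service_name="order-service", span_name="GET /orders/9f1c2a7e-1d44-4c2b-b7a0-3e5f8c921d10"}
    calls_total{service_name="order-service", span_name="GET /orders/4b8d0e31-7a6f-4f19-92c3-0d2e6b84a5c7"}
    calls_total{service_name="order-service", span_name="GET /orders/c2a9f6d4-58e0-41bb-8f72-19d3c0ab64e2"}

Each line is a separate series even though all three represent the same logical operation, so the series count grows with traffic instead of with the number of distinct operations.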

To mitigate this, Coralogix adopts a mechanism similar to the OpenTelemetry Metrics SDK cardinality limits. This feature introduces an automatic, configurable cardinality control mechanism within the spanmetrics pipeline of the OpenTelemetry Collector.

Coralogix detects and exposes when services exceed their cardinality limits, giving users visibility into dropped series so they can take corrective action early. Detection currently happens in the backend; a frontend UI representation will be added in future versions.

How it works

Per-service-per-metric cardinality limit

  • A threshold (e.g., 100,000) is applied per service per metric.

    • For example, calls_total{service_name="order-service"} will have a 100,000 series cap.
  • This ensures that high-cardinality services do not impact others globally (a query for checking a service's current series count is sketched below).
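
To see how close a service is to the cap, you can count how many series it currently produces for a given metric. A minimal PromQL sketch, assuming the default calls_total metric name:

    count by (service_name) (calls_total)

Comparing the result with the configured limit (100,000 by default) highlights which services are at risk of overflowing.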

Overflow redirection with a fallback label

  • Once the time series limit is reached, new unique combinations of labels are no longer tracked individually.
  • Instead, the corresponding spans are aggregated into a fallback series that includes the special label:

    otel_metric_overflow="true"
    
  • Example: Cardinality limit of 3, with 5 time series sent (each with 50 spans)

    calls_total{service_name="A", span_name="uuid1"} 
    calls_total{service_name="A", span_name="uuid2"} 
    calls_total{service_name="A", span_name="uuid3"} 
    calls_total{service_name="A", otel_metric_overflow="true"} 
    
    • The first 3 series are preserved.
    • The spans from the remaining 2 series (100 spans in total) are collapsed into a single time series tagged with otel_metric_overflow="true".

Configuration options

Cardinality limit settings

To set the cardinality limit with aggregation_cardinality_limit, ensure you are using OpenTelemetry Collector version 0.130.0 or later.

Coralogix Kubernetes integration

With the Coralogix Kubernetes Complete Observability integration, the cardinality limit is automatically enabled and set to 100,000 by default as of Helm chart version v0.0.203. No additional configuration is required.

  • To disable the cardinality limit, override the default value by adding an aggregationCardinalityLimit field under the SpanMetrics connector and setting it to 0 (see the short example after the configuration block below).
  • To edit the cardinality limit, set the aggregationCardinalityLimit field to the desired value, as follows:
spanMetrics:
  enabled: true
  collectionInterval: "{{.Values.global.collectionInterval}}"
  metricsExpiration: 5m
  histogramBuckets:
    [1ms, 4ms, 10ms, 20ms, 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s]
  aggregationCardinalityLimit: 100000
  extraDimensions:
    - name: http.method
    - name: cgx.transaction
    - name: cgx.transaction.root
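
To disable the limit instead (as described above), the same block can be overridden with the value 0; a minimal sketch:

    spanMetrics:
      enabled: true
      aggregationCardinalityLimit: 0   # 0 disables the cardinality limit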

For users not using Kubernetes or the Coralogix Kubernetes integration

Add the aggregation_cardinality_limit setting under the spanmetrics connector in your OpenTelemetry Collector configuration, and set the limit you want.

  • To disable the cardinality limit, either set the aggregation_cardinality_limit field to 0 or remove it entirely.
  • If you are using the dbMetric connector, ensure that the aggregation_cardinality_limit field is specified under this connector as well.
connectors:
  spanmetrics:
    namespace: ""
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 4ms, 6ms, 10ms, 100ms, 250ms]
    aggregation_cardinality_limit: 100000
    dimensions:
      - name: http.method
      - name: cgx.transaction
      - name: cgx.transaction.root
    exemplars:
      enabled: true
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 15s
    metrics_expiration: 5m
    events:
      enabled: true
      dimensions:
        - name: exception.type
        - name: exception.message
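
The connector only produces metrics once it is wired into the collector's service pipelines, acting as an exporter of the traces pipeline and a receiver of the metrics pipeline. A minimal sketch, assuming an otlp receiver and a coralogix exporter are already configured (adjust the component names to match your setup):

    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [spanmetrics, coralogix]
        metrics:
          receivers: [spanmetrics]
          exporters: [coralogix]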

Retention scope

  • Tracked time series are stored in memory only and are cleared when the OpenTelemetry Collector or the sending pod restarts; no persistent state is maintained.
  • If a service stops sending data for 5 minutes, its cache is reset automatically.
  • If the service is redeployed without stopping data flow, the cache persists; to reset it, either restart the collector or allow the service to idle for 5 minutes.

Best practices for alerting

  • Set up alerts based on the presence of the label otel_metric_overflow="true".
  • This allows early detection of cardinality issues as soon as overflow begins, even if only a single series has overflowed.

Example PromQL expression:

sum by (service_name) (duration_ms_bucket{otel_metric_overflow="true"}) > 0

This expression:

  • Groups metrics by service_name.
  • Triggers the alert if any overflowed time series exist.
  • Useful as an early warning before overflow volume becomes significant; a rate-based variant for measuring overflow volume is sketched below.
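
To gauge how much traffic is being collapsed into the overflow series, rather than only whether overflow has started, a rate-based variant can be used. A sketch assuming the default calls_total metric name:

    sum by (service_name) (rate(calls_total{otel_metric_overflow="true"}[5m]))

This shows, per service, the per-second rate of calls landing in the overflow series, and can be alerted on with a higher threshold once some amount of overflow is expected.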