Span Metrics Cardinality Limiting

Span metrics can generate extremely high cardinality when dynamic values—such as UUIDs or timestamps—are used in label fields like span_name. This can result in:

  • Excessive time series creation and a degraded APM experience
  • Significantly higher data volume being sent, which consumes your units quota faster
  • Performance degradation

To mitigate this, Coralogix adopts a mechanism similar to the OpenTelemetry Metrics SDK cardinality limits. This feature introduces an automatic, configurable cardinality control mechanism within the spanmetrics pipeline of the OpenTelemetry Collector.

Coralogix will detect and expose when services exceed their cardinality limits, ensuring users have visibility into dropped series and can take corrective action early. Detection currently happens in the backend; a frontend UI representation will be added in future versions.

How it works

Per-service-per-metric cardinality limit

  • A threshold (e.g., 100,000) is applied per service per metric.

    • For example, calls_total{service_name="order-service"} will have a 100,000 series cap.
  • This ensures high-cardinality services do not impact others globally.

Overflow redirection with a fallback label

  • Once the time series limit is reached, new unique combinations of labels are no longer tracked individually.
  • Instead, the corresponding spans are aggregated into a fallback series that includes the special label:

    otel_metric_overflow="true"
    
  • Example: Cardinality limit of 3, with 5 time series sent (each with 50 spans)

    calls_total{service_name="A", span_name="uuid1"} 
    calls_total{service_name="A", span_name="uuid2"} 
    calls_total{service_name="A", span_name="uuid3"} 
    calls_total{service_name="A", otel_metric_overflow="true"} 
    
    • The first 3 series are preserved.
    • The remaining 100 spans (50 from each of the two series over the limit) are collapsed into a single time series tagged with otel_metric_overflow="true". A query for gauging how many series a service is currently producing is sketched below.
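
To gauge how close a service is to the limit, you can count the series it currently produces for a given metric. The query below is a rough sketch: it uses the calls_total metric from the example above, and it only approximates the collector's view, since the limit is enforced against the collector's in-memory cache rather than against stored series.

count by (service_name) (calls_total)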

Configuration options

Collector configuration (for all environments)

You can add the cardinality limit protection manually in the OpenTelemetry Collector configuration:

spanmetrics:
  aggregation_cardinality_limit: 100000
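
For context, here is a minimal sketch of where this setting sits in a collector configuration, assuming the spanmetrics connector is used. The otlp receiver and exporter (and its endpoint) are placeholders for whatever your pipeline already uses; only the aggregation_cardinality_limit line relates to this feature.

receivers:
  otlp:
    protocols:
      grpc: {}

connectors:
  spanmetrics:
    # Per-service, per-metric cap on tracked label combinations
    aggregation_cardinality_limit: 100000

exporters:
  otlp:
    endpoint: "metrics-backend:4317"  # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlp]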

K8s - Helm configuration (values.yaml)

The following preset is provided by default:

spanMetrics:
  aggregationCardinalityLimit: 100000

To change the limit, adjust the value; to disable the protection entirely, set it to 0:

spanMetrics:
  aggregationCardinalityLimit: 0

Retention scope

  • Tracked time series are stored in-memory only and are cleared when the OpenTelemetry Collector or sending pod restarts—no persistent state is maintained.
  • If a service stops sending data for 5 minutes, its cache is reset automatically.
  • If the service is redeployed without stopping data flow, the cache persists; to reset it, either restart the collector or allow the service to idle for 5 minutes.

Best practices for alerting

  • Set up alerts based on the presence of the label otel_metric_overflow="true".
  • This allows early detection of cardinality issues as soon as overflow begins, even if only a single series has been redirected to the overflow series.

Example PromQL expression:

sum by (service_name) (duration_ms_bucket{otel_metric_overflow="true"}) > 0

This expression:

  • Groups metrics by service_name.
  • Triggers the alert if any overflowed time series exist.
  • Useful for early warning before overflow volume becomes significant.
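
If you manage alerts as Prometheus-style rule files, the expression can be wrapped in a rule like the sketch below. The group name, alert name, for duration, and severity label are illustrative placeholders, not values required by Coralogix.

groups:
  - name: span-metrics-cardinality
    rules:
      - alert: SpanMetricsCardinalityOverflow
        # Fires when any service has at least one series redirected to the overflow series.
        expr: sum by (service_name) (duration_ms_bucket{otel_metric_overflow="true"}) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Span metrics cardinality overflow for {{ $labels.service_name }}"
          description: "Service {{ $labels.service_name }} hit the spanmetrics cardinality limit; new label combinations are being aggregated into the overflow series."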