Span Metrics Cardinality Limiting
Span metrics can generate extremely high cardinality when dynamic values, such as UUIDs or timestamps, are used in label fields like `span_name`. This can result in:
- Excessive time series creation and a degraded APM experience
- Significantly higher data volume being sent, which consumes your unit quota
- Performance degradation
To mitigate this, Coralogix adopts a mechanism similar to the OpenTelemetry Metrics SDK cardinality limits. This feature introduces an automatic, configurable cardinality control mechanism within the `spanmetrics` pipeline of the OpenTelemetry Collector.
Coralogix detects and exposes when services exceed their cardinality limits, giving users visibility into dropped series so they can take corrective action early. This currently covers backend detection; a frontend UI representation will be added in future versions.
How it works
Per-service-per-metric cardinality limit
A threshold (e.g., 100,000) is applied per service per metric.
- For example, `calls_total{service="order-service"}` will have a 100,000 series cap.
- This ensures high-cardinality services do not impact others globally.
Overflow redirection with a fallback label
- Once the time series limit is reached, new unique combinations of labels are no longer tracked individually. Instead, the corresponding spans are aggregated into a fallback series that includes the special label `otel_metric_overflow="true"`.
Example: Cardinality limit of 3, with 5 time series sent (each with 50 spans)
- `calls_total{service_name="A", span_name="uuid1"}`
- `calls_total{service_name="A", span_name="uuid2"}`
- `calls_total{service_name="A", span_name="uuid3"}`
- `calls_total{service_name="A", otel_metric_overflow="true"}`
- The first 3 series are preserved.
- The remaining 2 series (100 spans) are collapsed into a single time series tagged with `otel_metric_overflow="true"`.
Configuration options
Collector configuration (for all environments)
You can manually add the cardinality limit protection in the OpenTelemetry Collector:
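A minimal sketch of what this could look like, assuming a recent opentelemetry-collector-contrib release in which the `spanmetrics` connector exposes an `aggregation_cardinality_limit` option (verify the option name and availability against your collector version); the receiver and exporter names are illustrative:

```yaml
connectors:
  spanmetrics:
    # Maximum number of unique label combinations tracked per metric.
    # Once exceeded, additional combinations are folded into an overflow
    # series labeled otel_metric_overflow="true".
    aggregation_cardinality_limit: 100000

service:
  pipelines:
    traces:
      receivers: [otlp]          # illustrative receiver
      exporters: [spanmetrics]   # route spans into the connector
    metrics:
      receivers: [spanmetrics]   # span metrics emitted by the connector
      exporters: [coralogix]     # illustrative exporter
```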
K8s - Helm configuration (values.yaml)
The following preset is provided by default:
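As a rough sketch (the exact key name is an assumption; consult the chart's values reference for your version), the default could look like:

```yaml
presets:
  spanMetrics:
    enabled: true
    # Hypothetical key name: per-service, per-metric cap on tracked series.
    aggregationCardinalityLimit: 100000
```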
To disable it, set the value to 0; to change the limit, set a different value:
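Using the same hypothetical key as above:

```yaml
presets:
  spanMetrics:
    enabled: true
    # 0 disables the cardinality limit protection entirely.
    aggregationCardinalityLimit: 0
```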
Retention scope
- Tracked time series are stored in-memory only and are cleared when the OpenTelemetry Collector or sending pod restarts—no persistent state is maintained.
- If a service stops sending data for 5 minutes, its cache is reset automatically.
- If the service is redeployed without stopping data flow, the cache persists; to reset it, either restart the collector or allow the service to idle for 5 minutes.
Best practices for alerting
- Set up alerts based on the presence of the label `otel_metric_overflow="true"`.
- This allows early detection of cardinality issues as soon as overflow begins, even if only a single value is dropped.
Example PromQL expression:
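A sketch of such an expression, assuming the span metrics are exposed under the `calls_total` name used elsewhere on this page (your metric names may differ depending on naming conventions):

```promql
# Fires when any service has produced overflow series in the last 5 minutes.
sum by (service_name) (
  increase(calls_total{otel_metric_overflow="true"}[5m])
) > 0
```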
This expression:
- Groups metrics by `service_name`.
- Triggers the alert if an overflowed time series exists.
- Is useful as an early warning before the overflow volume becomes significant.