How to use it
Metrics
Please refer to the following documentation for the full list of metrics and their labels, collected from various sources:
Additionally, the k8sattributes processor and the resource detection processor are used to add more metadata labels.
The Prometheus receiver is used to scrape the Kubernetes API Server and Kubelet cAdvisor endpoints for display in the Kubernetes Dashboard.
!!! note
    OpenTelemetry metrics are converted to Prometheus format following the [OpenTelemetry specification](https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#otlp-metric-points-to-prometheus).
Custom Metrics
In addition to standard metrics, the OpenTelemetry Integration provides the following custom metrics:
kube_pod_status_qos_class
Provides information about the Pod QoS class.
| Metric Type | Value | Labels |
|---|---|---|
| Gauge | 1 | qos_class |
kube_pod_status_reason
Provides information about the Kubernetes Pod Status.
| Metric Type | Value | Labels |
|---|---|---|
| Gauge | 1 | reason |
Example reason label values: Evicted, NodeAffinity, NodeLost, Shutdown, UnexpectedAdmissionError
kube_node_info
Provides information about the Kubernetes Node.
| Metric Type | Value | Labels |
|---|---|---|
| Gauge | 1 | k8s.kubelet.version |
k8s.container.status.last_terminated_reason
Provides information about the container's last termination reason.
| Metric Type | Value | Labels |
|---|---|---|
| Gauge | 1 | reason |
Example reason label values: OOMKilled
kubernetes_build_info
Provides information about the Kubernetes version.
Container Filesystem usage metrics
- container_fs_writes_total
- container_fs_reads_total
- container_fs_writes_bytes_total
- container_fs_reads_bytes_total
- container_fs_usage_bytes
CPU throttling metrics
- container_cpu_cfs_periods_total
- container_cpu_cfs_throttled_periods_total
Available Endpoints
Applications can send OTLP metrics, as well as Jaeger, Zipkin, and OTLP traces, to the local node, as the otel-agent uses hostNetwork.
| Protocol | Port |
|---|---|
| Zipkin | 9411 |
| Jaeger GRPC | 6832 |
| Jaeger Thrift binary | 6832 |
| Jaeger Thrift compact | 6831 |
| Jaeger Thrift http | 14268 |
| OTLP GRPC | 4317 |
| OTLP HTTP | 4318 |
Example application environment configuration
The following code creates a new environment variable (NODE) containing the node's IP address and then uses that IP in the OTEL_EXPORTER_OTLP_ENDPOINT environment variable. This ensures that each instrumented pod will send data to the local OTEL collector on the node it is currently running on.
env:
  - name: NODE
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(NODE):4317"
About global collection interval
The global collection interval (global.collectionInterval) is the interval at which the collector collects metrics from the configured receivers. For the best default experience, we recommend keeping the 30 second interval set by the chart. However, if you'd prefer to collect metrics more (or less) often, you can adjust the interval by changing the global.collectionInterval value in the values.yaml file. The minimum recommended global interval is 15s. If you wish to use the default value set internally by the collector for each component, you can remove the collection interval parameter from the presets completely.
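For example, to collect metrics every 60 seconds instead of every 30, you could set the following in your values.yaml (a minimal sketch showing only the relevant key):

global:
  collectionInterval: "60s"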
Beware that using a lower interval will result in more metric data points being sent to the backend, and thus higher costs. Note that the choice of interval also affects the behavior of rate functions; for more, see here.
About batch sizing
The batch processor ensures that the telemetry sent to the Coralogix backend is batched into bigger requests, reducing networking overhead and improving performance. The batch processor is enabled by default and we strongly recommend using it. By default, the otel-integration chart uses the following recommended settings for batch processors in all collectors:
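The exact defaults shipped by the chart may vary between versions; the following sketch shows the standard OpenTelemetry Collector batch processor options involved, with the 2048 hard limit described below and otherwise assumed typical values:

batch:
  send_batch_size: 1024       # assumed typical value; check the chart defaults for the actual number
  send_batch_max_size: 2048   # hard upper limit on batch size referenced below
  timeout: 1s                 # assumed typical flush timeout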
These settings impose a hard limit of 2048 units (spans, metrics, logs) on the batch size, striking a balance between the recommended batch size and networking overhead.
You may adjust these settings according to your needs, but when configuring the batch processor yourself, be mindful of the size limits imposed by the Coralogix endpoints (currently max. 10 MB after decompression - see documentation).
More information on how to configure the batch processor can be found here.
About span metrics
The collector provides a possibility to synthesize R.E.D (Request, Error, Duration) metrics based on the incoming span data. This can be useful to obtain extra metrics about the operations you have instrumented for tracing. For more information, please refer to the OpenTelemetry Collector documentation.
This feature is enabled by default and can be disabled by setting the spanMetrics.enabled value to false in the values.yaml file.
Beware that enabling the feature will result in creation of additional metrics. Depending on how you instrument your applications, this can result in a significant increase in the number of metrics. This is especially true for cases where the span name includes specific values, such as user IDs or UUIDs. Such instrumentation practice is strongly discouraged.
In such cases, we recommend either correcting your instrumentation or using the spanMetrics.spanNameReplacePattern parameter to replace the problematic values with a generic placeholder. For example, if your span names follow the template user-1234, you can use a pattern that replaces the user ID with a generic placeholder. See the following configuration:
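A sketch of what such a configuration could look like; the regex/replacement schema shown for spanNameReplacePattern is an assumption, so consult the chart's values reference for the authoritative format:

presets:
  spanMetrics:
    enabled: true
    spanNameReplacePattern:
      # assumed schema: each entry pairs a regex with its replacement
      - regex: "user-[0-9]+"
        replacement: "user-{id}"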
This will result in your spans having the generalized name user-{id}.
SpanMetrics Error Tracking
Once you enable the Span Metrics preset, the errorTracking configuration will automatically be enabled.
This is how you can disable the errorTracking option:
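A minimal sketch, assuming errorTracking sits under the spanMetrics preset and exposes an enabled flag:

presets:
  spanMetrics:
    enabled: true
    errorTracking:
      enabled: false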
[!NOTE] The `errorTracking` feature works only with OpenTelemetry SDKs that support OpenTelemetry Semantic Conventions version v1.21.0 or later. If you are using an older SDK version, you may need to transform certain attributes (for example, `http.status_code` to `http.response.status_code`). To perform this transformation, add the following configuration:
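A sketch of such a transformation, using the same transformStatements pattern shown for the database attributes further below (adjust the statement to the attributes your SDK actually emits):

presets:
  spanMetrics:
    enabled: true
    transformStatements:
      # copy the old semantic-convention attribute into the new one when the new one is missing
      - set(attributes["http.response.status_code"], attributes["http.status_code"]) where attributes["http.response.status_code"] == nil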
SpanMetrics Database Monitoring
Once you enable the Span Metrics preset, the `dbMetrics` configuration will automatically be enabled. It generates RED (Request, Errors, Duration) metrics for database spans. For example, query `db_calls_total` to view the generated request metrics.
This is needed to enable the Database Monitoring feature inside Coralogix APM.
This is how you can disable the dbMetrics option:
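A minimal sketch, assuming dbMetrics sits under the spanMetrics preset as in the examples below:

presets:
  spanMetrics:
    enabled: true
    dbMetrics:
      enabled: false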
The dbMetrics preset also supports transform statements that apply only to database traces. Here's how you can use them:
presets:
  spanMetrics:
    enabled: true
    dbMetrics:
      enabled: true
      transformStatements:
        - replace_pattern(attributes["db.query.text"], "\\d+", "?") # removes potential IDs for the attribute
        - set(attributes["span.duration_ns"], span.end_time_unix_nano - span.start_time_unix_nano) # stores the span duration in ns in an attribute
Note on Semantic Conventions for old OTEL SDKs
The dbMetrics preset only works with OpenTelemetry SDKs that support OpenTelemetry Semantic Conventions v1.26.0 or later.
| Language | SDK version with dbMetrics support |
|---|---|
| Go | v1.28.0+ |
| Java | v1.41.0+ |
| JavaScript | v1.26.0+ |
| Python | v1.26.0+ |
| .NET | v1.10.0+ |
| C++ | v1.16.0+ |
| PHP | v1.0.0+ |
| Ruby | v1.4.0+ |
| Rust | v0.25.0+ |
| Swift | v1.10.0+ |
| Erlang/Elixir | v1.3.0+ |
If you are using older versions, you might need to transform some attributes, such as:
- `db.sql.table` => `db.collection.name`
- `db.mongodb.collection` => `db.collection.name`
- `db.cosmosdb.container` => `db.collection.name`
- `db.cassandra.table` => `db.collection.name`
To do that, you can add the configuration below. It takes care of defining the transform/spanmetrics processor with those transform statements and adding it to the end of the traces pipeline, just before batching. This ensures that the transformations are applied to all spans before they are routed to the spanmetrics or forward/db connectors, so that all spans follow the same semantic convention.
[!IMPORTANT] Correlation might be broken if the transform statements below are applied only at the `dbMetrics` level.
presets:
  spanMetrics:
    enabled: true
    transformStatements:
      - set(attributes["db.namespace"], attributes["db.name"]) where attributes["db.namespace"] == nil
      - set(attributes["db.namespace"], attributes["server.address"]) where attributes["db.namespace"] == nil
      - set(attributes["db.namespace"], attributes["network.peer.name"]) where attributes["db.namespace"] == nil
      - set(attributes["db.namespace"], attributes["net.peer.name"]) where attributes["db.namespace"] == nil
      - set(attributes["db.namespace"], attributes["db.system"]) where attributes["db.namespace"] == nil
      - set(attributes["db.operation.name"], attributes["db.operation"]) where attributes["db.operation.name"] == nil
      - set(attributes["db.collection.name"], attributes["db.sql.table"]) where attributes["db.collection.name"] == nil
      - set(attributes["db.collection.name"], attributes["db.cassandra.table"]) where attributes["db.collection.name"] == nil
      - set(attributes["db.collection.name"], attributes["db.mongodb.collection"]) where attributes["db.collection.name"] == nil
      - set(attributes["db.collection.name"], attributes["db.redis.database_index"]) where attributes["db.collection.name"] == nil
      - set(attributes["db.collection.name"], attributes["db.elasticsearch.path_parts.index"]) where attributes["db.collection.name"] == nil
      - set(attributes["db.collection.name"], attributes["db.cosmosdb.container"]) where attributes["db.collection.name"] == nil
      - set(attributes["db.collection.name"], attributes["aws_dynamodb.table_names"]) where attributes["db.collection.name"] == nil
    dbMetrics:
      enabled: true
Span metrics with different buckets per application
If you want to use the Span Metrics connector with different buckets per application, you need to use the spanMetricsMulti preset. For example:
presets:
  spanMetricsMulti:
    enabled: true
    defaultHistogramBuckets: [1ms, 4ms, 10ms, 20ms, 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s]
    configs:
      - selector: route() where attributes["service.name"] == "one"
        histogramBuckets: [1s, 2s]
      - selector: route() where attributes["service.name"] == "two"
        histogramBuckets: [5s, 10s]
For the selector you need to write an OTTL statement; more information is available in the routing connector docs.
Multi-line log configuration
This helm chart supports multi-line configurations for different namespace, pod, and/or container names. The following example configuration applies a specific firstEntryRegex for a given container, based on the namespace, Pod, and container name it belongs to:
presets:
  logsCollection:
    enabled: true
    multilineConfigs:
      - namespaceName:
          value: kube-system
        podName:
          value: app-a.*
          useRegex: true
        containerName:
          value: http
        firstEntryRegex: ^[^\s].*
        combineWith: ""
      - namespaceName:
          value: kube-system
        podName:
          value: app-b.*
          useRegex: true
        containerName:
          value: http
        firstEntryRegex: ^[^\s].*
        combineWith: ""
      - namespaceName:
          value: default
        firstEntryRegex: ^[^\s].*
        combineWith: ""
This feature uses the filelog receiver's router and recombine operators.
Alternatively, you can add a recombine filter at the end of the log collection operators using the extraFilelogOperators field. The following example adds a single recombine operator for all Kubernetes logs:
presets:
  logsCollection:
    enabled: true
    extraFilelogOperators:
      - type: recombine
        combine_field: body
        source_identifier: attributes["log.file.path"]
        is_first_entry: body matches "^(YOUR-LOGS-REGEX)"
Integrating Kube State Metrics
You can configure otel-integration to collect Kube State Metrics. This is useful when metrics or labels are missing from the Kubernetes Cluster Receiver. Kube State Metrics collects Kubernetes cluster-level metrics that are crucial for monitoring resource states, such as pods, deployments, and HorizontalPodAutoscalers (HPAs). To integrate with Kube State Metrics, create a file called values-ksm.yaml and configure the metrics and labels that you wish to collect:
metricAllowlist:
- kube_horizontalpodautoscaler_labels
- kube_horizontalpodautoscaler_spec_max_replicas
- kube_horizontalpodautoscaler_status_current_replicas
- kube_pod_info
- kube_pod_labels
- kube_pod_container_status_waiting
- kube_pod_container_status_waiting_reason
metricLabelsAllowlist:
- pods=[app,environment]
- horizontalpodautoscalers=[app,environment]
Then install Kube State Metrics:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics --values values-ksm.yaml
These commands add the Prometheus community Helm repository, update it, and install Kube State Metrics using the values you've configured.
Next, configure the opentelemetry-cluster-collector to scrape Kube State Metrics via the Prometheus receiver.
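As a rough illustration of what values-cluster-ksm.yaml could contain (the opentelemetry-cluster-collector.config override key, the kube-state-metrics service address, and the pipeline wiring are assumptions here; adapt them to your chart version and namespace):

opentelemetry-cluster-collector:
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: kube-state-metrics        # standard Prometheus receiver scrape config
              scrape_interval: 30s
              static_configs:
                - targets:
                    - kube-state-metrics.default.svc.cluster.local:8080   # assumed service address and port
    service:
      pipelines:
        metrics:
          receivers:
            - prometheus   # how the receiver is wired into the pipeline may differ per chart version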
helm upgrade --install otel-coralogix-integration coralogix-charts-virtual/otel-integration --values values-cluster-ksm.yaml
Once the installation is complete, verify that the Kube State Metrics are being scraped and ingested inside Coralogix.
Connecting to Coralogix fleet management
The integration connects to the Coralogix fleet management server through the fleetManagement preset. This connection happens through the OpAMP extension of the Collector, and the endpoint used is https://ingress.<CORALOGIX_DOMAIN>/opamp/v1. This feature is enabled by default. You can disable it by setting the presets.fleetManagement.enabled property to false.
[!NOTE] Important security considerations when enabling this feature:
- Because this extension shares your Collector's configuration with the fleet management server, it's important to ensure that any secret contained in it uses the environment variable expansion syntax.
- The default capabilities of the OpAMP extension do not include remote configuration or packages.
- By default, the extension will poll the server every 2 minutes. Additional network requests might be made between the server and the Collector, depending on the configuration on both sides.
To enable this feature, set the presets.fleetManagement.enabled property to true. Here is an example values.yaml:
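A minimal sketch showing only the relevant preset toggle (the rest of your values.yaml stays unchanged):

presets:
  fleetManagement:
    enabled: true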
Known errors
When running on Windows, you might see the "failed getting host info" error. This is expected behavior because the collector attempts to retrieve Windows metadata from the Windows Registry, which is only possible when running from HostProcess Windows containers. This error has no negative impact on the functionality of the Collector or OpAMP in any way.
Example:
"msg":"failed getting host info","otelcol.component.id":"opamp","otelcol.component.kind":"Extension","error":"The system cannot find the file specified.","