Quick Start Observability for OpenTelemetry Collector
Coralogix Extension For OpenTelemetry Collector Includes:
Dashboards - 1
Gain instantaneous visualization of all your OpenTelemetry Collector data.
Alerts - 7
Stay on top of OpenTelemetry Collector key performance metrics. Keep everyone in the know through integrations with Slack, PagerDuty, and more.
Down
This alert monitors the availability of the OpenTelemetry Collector, specifically detecting when it is down or unavailable. A downed Collector disrupts the flow of telemetry data, leading to gaps in observability and potential delays in diagnosing system issues. The alert is triggered when the OpenTelemetry Collector is unreachable or non-responsive for more than 10 minutes. Monitoring this metric helps ensure continuous telemetry data collection and delivery, which is critical for maintaining system visibility and troubleshooting capabilities.

Customization Guidance:
- Threshold: Adjust the downtime threshold based on your tolerance for temporary outages versus critical failures.
- Monitoring Period: Align the monitoring period with the expected uptime requirements of your environment.
- Notification Frequency: Set notification frequency to ensure timely responses while avoiding alert fatigue during transient issues.

Action: If this alert is triggered, verify the health and status of the Collector service, check for infrastructure issues (e.g., network or host failures), and review logs for error messages or configuration issues. Restart the service if necessary and ensure redundancy to minimize future downtime.
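If you want a quick way to confirm from the command line that the Collector is reachable, a minimal sketch like the one below can help. It assumes the Collector runs the health_check extension on its default port (13133); the host, port, and availability of that extension depend on your deployment.

```python
# A minimal liveness-check sketch, assuming the Collector runs the health_check
# extension on its default port (13133). Adjust host/port for your deployment.
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:13133/"  # health_check extension endpoint (assumed default)


def collector_is_up(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True if the Collector's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("Collector up:", collector_is_up())
```

A probe like this complements the alert rather than replacing it: the alert covers sustained outages, while an ad-hoc check is useful during incident triage.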
Refusing Logs
This alert monitors instances where the OpenTelemetry Collector refuses to process incoming log data. Such refusals can indicate resource limitations, misconfigurations, or an inability to handle the current log volume, potentially leading to gaps in observability. The alert is triggered when the OpenTelemetry Collector rejects incoming logs for more than 10 consecutive minutes or exceeds a predefined number of refusals within a specified monitoring period. Monitoring this metric helps ensure reliable log ingestion, maintaining the integrity of observability data and enabling effective system diagnostics.

Customization Guidance:
- Threshold: Tailor the refusal duration or count threshold based on your system’s capacity and acceptable limits for transient log processing issues.
- Monitoring Period: Adjust the monitoring period to align with typical traffic fluctuations and logging peaks.
- Notification Frequency: Balance responsiveness with alert fatigue by setting notification intervals that reflect your operational needs.

Action: If this alert is triggered, review the Collector's resource allocations (CPU, memory, and disk), check pipeline configurations for bottlenecks, and validate log batching or rate-limiting settings. If necessary, scale the infrastructure or adjust workloads to ensure sustained performance.
Refusing Metrics
This alert monitors instances where the OpenTelemetry Collector refuses to process incoming metric data. Such refusals may indicate resource exhaustion, misconfigurations, or an inability to handle the current metric volume, potentially impacting observability and performance monitoring. The alert is triggered when the OpenTelemetry Collector rejects incoming metrics for more than 10 consecutive minutes. Monitoring this metric helps ensure the reliable collection and processing of metrics, which is critical for maintaining visibility into system health and performance.

Customization Guidance:
- Threshold: Adjust the refusal duration or count threshold based on your system's capacity and acceptable limits for temporary metric processing delays.
- Monitoring Period: Set the monitoring period to reflect your environment's typical workload and peak activity times.
- Notification Frequency: Configure notification intervals to balance timely responses with minimizing alert fatigue during transient issues.

Action: If this alert is triggered, investigate the Collector's resource allocations (CPU, memory), review pipeline configurations for bottlenecks, and optimize metric batching or rate-limiting settings. Scale infrastructure if necessary or adjust configurations to accommodate higher metric ingestion volumes.
Refusing Spans
This alert monitors instances where the OpenTelemetry Collector refuses to process incoming span data. Such refusals may indicate resource constraints, misconfigurations, or an inability to handle the current trace volume, potentially disrupting distributed tracing and application observability. The alert is triggered when the OpenTelemetry Collector rejects incoming spans for more than 10 consecutive minutes. Monitoring this metric helps ensure the reliable collection and processing of spans, which is essential for maintaining end-to-end visibility and diagnosing performance or dependency issues.

Customization Guidance:
- Threshold: Adjust the refusal duration or count threshold based on your system's capacity and acceptable tolerance for temporary delays in span processing.
- Monitoring Period: Set the monitoring period to align with your typical tracing patterns, particularly during peak traffic times.
- Notification Frequency: Configure notification intervals to provide timely responses while avoiding alert fatigue for short-term issues.

Action: If this alert is triggered, review the Collector’s resource allocations (CPU, memory), check for pipeline bottlenecks, and optimize trace batching or rate-limiting settings. Consider scaling infrastructure or adjusting configurations to accommodate higher span ingestion rates if necessary.
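The three refusal alerts above (logs, metrics, and spans) are driven by the Collector's own internal telemetry. As a rough, hedged sketch of how you might spot-check those counters yourself, the snippet below scrapes the Collector's internal Prometheus endpoint (default port 8888) and sums the receiver refusal counters. The port and exact metric names (for example, a `_total` suffix in newer releases) are assumptions that may vary with your Collector version and telemetry configuration.

```python
# A rough sketch for spot-checking receiver refusals, assuming the Collector's
# internal Prometheus metrics are exposed at the default 0.0.0.0:8888/metrics.
# Metric names may vary by Collector version (e.g. a `_total` suffix).
import urllib.request

METRICS_URL = "http://localhost:8888/metrics"  # service.telemetry default (assumption)
REFUSAL_COUNTERS = (
    "otelcol_receiver_refused_log_records",
    "otelcol_receiver_refused_metric_points",
    "otelcol_receiver_refused_spans",
)


def refusal_totals(url: str = METRICS_URL) -> dict:
    """Sum each refusal counter across all receiver/transport label sets."""
    totals = {name: 0.0 for name in REFUSAL_COUNTERS}
    with urllib.request.urlopen(url, timeout=5.0) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    for line in body.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        for name in REFUSAL_COUNTERS:
            if line.startswith(name):
                try:
                    totals[name] += float(line.rsplit(" ", 1)[-1])
                except ValueError:
                    pass  # tolerate lines this crude parser cannot read
    return totals


if __name__ == "__main__":
    for counter, value in refusal_totals().items():
        print(f"{counter}: {value:.0f}")
```

Steadily growing refusal counters usually point at the same remedies the alerts describe: memory or CPU pressure, an undersized pipeline, or batching and rate-limiting settings that need tuning.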
Logs Send Failures
This alert monitors failures in the OpenTelemetry Collector when sending log data to its configured destinations. Log send failures can indicate connectivity issues, misconfigurations, or destination overload, potentially resulting in data loss and incomplete observability. The alert is triggered when the OpenTelemetry Collector encounters a significant number of log send failures over a 10-minute period. Monitoring this metric helps ensure reliable delivery of log data, which is essential for effective monitoring, troubleshooting, and compliance requirements.

Customization Guidance:
- Threshold: Adjust the failure count threshold based on your system's tolerance for temporary send issues and the criticality of the logs.
- Monitoring Period: Modify the monitoring period to reflect the expected log transmission intervals and traffic peaks in your environment.
- Notification Frequency: Balance timely notifications with minimizing alert fatigue by setting appropriate notification intervals.

Action: If this alert is triggered, investigate the Collector's connectivity to its log destinations, verify configuration settings (e.g., endpoint URLs, credentials), and check for destination-side issues such as throttling or overload. Implement retry mechanisms and consider scaling resources on either side to handle increased log traffic.
Metrics Send Failures
This alert monitors failures in the OpenTelemetry Collector when sending metric data to its configured destinations. Metric send failures can indicate connectivity issues, misconfigurations, or destination-side limitations, potentially resulting in gaps in performance monitoring and observability. The alert is triggered when the OpenTelemetry Collector experiences a significant number of metric send failures over a 10-minute period. Monitoring this metric helps ensure reliable delivery of metric data, which is critical for tracking system health and performance trends.

Customization Guidance:
- Threshold: Adjust the failure count threshold based on your system's tolerance for temporary send issues and the criticality of the metrics being monitored.
- Monitoring Period: Configure the monitoring period to match the expected metric transmission frequency and traffic patterns in your environment.
- Notification Frequency: Set notification intervals that provide timely awareness while minimizing alert fatigue during transient issues.

Action: If this alert is triggered, check the Collector's connectivity to its metric destinations, validate configuration settings (e.g., endpoints, credentials), and review logs for specific error details. Investigate destination-side issues like rate limits or processing bottlenecks, and optimize retry policies or scale resources to ensure sustained delivery.
Span Send Failures
This alert monitors failures in the OpenTelemetry Collector when sending span data to its configured destinations. Span send failures may indicate connectivity issues, misconfigurations, or destination-side limitations, potentially disrupting distributed tracing and impacting application observability. The alert is triggered when the OpenTelemetry Collector encounters a significant number of span send failures over a 10-minute period. Monitoring this metric helps ensure reliable delivery of span data, which is essential for maintaining complete and accurate tracing information for system diagnostics and performance optimization.

Customization Guidance:
- Threshold: Adjust the failure count threshold based on your system's tolerance for temporary send issues and the criticality of the spans for tracing workflows.
- Monitoring Period: Set the monitoring period to align with typical span transmission frequencies and traffic patterns, especially during peak loads.
- Notification Frequency: Configure notification intervals to ensure timely responses while minimizing alert fatigue for transient issues.

Action: If this alert is triggered, verify the Collector’s connectivity to its span destinations, review configuration settings (e.g., endpoints, authentication), and check for errors in logs or traces. Investigate destination-side issues like throttling or capacity constraints, and adjust retry policies or scale resources to ensure sustained span delivery.
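Similarly, the three send-failure alerts above correspond to exporter counters in the Collector's internal telemetry. The sketch below compares failed sends against successful sends per signal to approximate a failure ratio during triage; again, the port and metric names are assumptions that may differ by Collector version and configuration.

```python
# A rough sketch for estimating exporter failure ratios per signal, assuming the
# Collector's internal Prometheus metrics at the default port 8888. Metric names
# (and any `_total` suffix) may vary by Collector version and configuration.
import re
import urllib.request

METRICS_URL = "http://localhost:8888/metrics"  # assumed default telemetry endpoint
SIGNALS = {  # signal -> (failed counter, sent counter)
    "logs": ("otelcol_exporter_send_failed_log_records", "otelcol_exporter_sent_log_records"),
    "metrics": ("otelcol_exporter_send_failed_metric_points", "otelcol_exporter_sent_metric_points"),
    "spans": ("otelcol_exporter_send_failed_spans", "otelcol_exporter_sent_spans"),
}


def scrape_totals(url: str = METRICS_URL) -> dict:
    """Sum every sample value by metric family name (labels collapsed)."""
    totals: dict = {}
    with urllib.request.urlopen(url, timeout=5.0) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    for line in text.splitlines():
        match = re.match(r"^([A-Za-z_:][\w:]*)(?:\{.*\})?\s+(\S+)$", line)
        if match:
            name, value = match.groups()
            try:
                totals[name] = totals.get(name, 0.0) + float(value)
            except ValueError:
                pass  # skip values this crude parser cannot read
    return totals


if __name__ == "__main__":
    totals = scrape_totals()
    for signal, (failed_name, sent_name) in SIGNALS.items():
        failed = totals.get(failed_name, 0.0)
        sent = totals.get(sent_name, 0.0)
        ratio = failed / (failed + sent) if (failed + sent) else 0.0
        print(f"{signal}: {failed:.0f} failed sends, failure ratio {ratio:.2%}")
```

A persistently rising failure ratio usually points to the destination side (throttling, expired credentials, capacity limits) or to exporter retry and queue settings that need tuning, as described in the alert actions above.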
Integration
Learn more about Coralogix's out-of-the-box integration with OpenTelemetry Collector in our documentation.