Quick Start Observability for Amazon MSK
Coralogix Extension For Amazon MSK Includes:
Dashboards - 1
Gain instantaneous visualization of all your Amazon MSK data.
Alerts - 10
Stay on top of key Amazon MSK performance metrics. Keep everyone in the know through integrations with Slack, PagerDuty, and more.
High CPU Utilization Alert
This alert monitors the CPU utilization of AWS MSK brokers. High CPU utilization can indicate resource exhaustion and affect Kafka performance. The alert is triggered when CPU utilization exceeds 80% for a sustained period. Monitoring this metric ensures that brokers can handle workloads effectively.
Customization Guidance:
- Threshold: Adjust based on expected workloads.
- Monitoring Period: Modify the period to suit cluster activity patterns.
- Notification Frequency: Balance responsiveness with alert fatigue.
Action: Investigate workloads, consider scaling up brokers, or optimize configurations.
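To illustrate what "exceeds 80% for a sustained period" can mean in practice, here is a minimal Python sketch. It is not Coralogix's evaluation logic; the function name `breaches_sustained`, the one-minute sample spacing, and the sample values are all assumptions made for illustration.

```python
from typing import Sequence


def breaches_sustained(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if at least `min_consecutive` consecutive samples are strictly above `threshold`."""
    run = 0  # length of the current streak of samples above the threshold
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical one-minute CPU utilization samples (percent) for a single broker.
cpu_samples = [72.0, 81.5, 83.0, 79.0, 84.2, 86.1, 88.0, 85.5, 90.3]

# Fires only if CPU stayed above 80% for five consecutive samples.
print(breaches_sustained(cpu_samples, threshold=80.0, min_consecutive=5))
```

Requiring several consecutive samples above the threshold is one way to trade responsiveness against alert fatigue: a single spike (such as the 81.5 and 83.0 readings followed by 79.0) resets the streak and does not fire on its own.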
High Disk Utilization Alert
This alert monitors the disk utilization of AWS MSK brokers. High disk usage can lead to partition unavailability and performance degradation. The alert is triggered when disk utilization exceeds 85%. Monitoring this metric ensures sufficient storage for Kafka data and logs.
Customization Guidance:
- Threshold: Adjust based on storage capacity and usage patterns.
- Monitoring Period: Modify to reflect data growth trends.
- Notification Frequency: Immediate notification for critical issues.
Action: Add storage or redistribute partitions across brokers.
Under-Replicated Partitions Alert
This alert monitors the under-replicated partitions in an AWS MSK cluster. Under-replicated partitions increase the risk of data loss. The alert is triggered when the count of under-replicated partitions exceeds zero. Monitoring this metric helps maintain data replication and cluster reliability.
Customization Guidance:
- Threshold: Always set at > 0 for critical monitoring.
- Monitoring Period: Ensure frequent checks to detect issues promptly.
- Notification Frequency: Immediate notification for critical issues.
Action: Investigate broker issues or replication lag and resolve promptly.
Offline Partitions Alert
This alert monitors the count of offline partitions in an AWS MSK cluster. Offline partitions indicate potential broker issues or cluster instability. The alert is triggered when any partition is offline. Monitoring this metric ensures data availability and cluster reliability.
Customization Guidance:
- Threshold: Always set at > 0 for critical monitoring.
- Monitoring Period: Frequent checks for rapid detection.
- Notification Frequency: Immediate notification for critical issues.
Action: Investigate and resolve broker or partition issues promptly.
Broker Not Available Alert
This alert monitors the active controller broker in an AWS MSK cluster. A missing controller broker can indicate cluster issues. The alert is triggered when no active controller broker is available. Monitoring this metric ensures cluster stability and availability.
Customization Guidance:
- Threshold: Set at < 1 to detect missing controllers.
- Monitoring Period: Frequent checks for rapid detection.
- Notification Frequency: Immediate for critical issues.
Action: Restart or replace the affected broker to restore functionality.
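The "< 1" threshold can be read as a check on the cluster-wide number of active controllers. The Python sketch below assumes each broker reports an active-controller count of 0 or 1 and that the values are summed across brokers; the function name `controller_missing` and the broker labels are hypothetical, and this is not the extension's actual query.

```python
from typing import Mapping


def controller_missing(active_controller_by_broker: Mapping[str, int]) -> bool:
    """Return True when the summed active-controller count across brokers is below 1."""
    return sum(active_controller_by_broker.values()) < 1


# Hypothetical per-broker readings: exactly one broker holds the controller role.
healthy = {"broker-1": 0, "broker-2": 1, "broker-3": 0}
# Hypothetical readings taken while no broker is acting as controller.
degraded = {"broker-1": 0, "broker-2": 0, "broker-3": 0}

print(controller_missing(healthy))
print(controller_missing(degraded))
```

Summing across brokers, rather than checking each broker individually, avoids firing on the brokers that are simply not the controller at that moment.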
High Memory Usage Alert
This alert monitors the memory usage of AWS MSK brokers. High memory usage can impact broker performance and lead to crashes or throttling. The alert is triggered when memory usage exceeds 85% of total capacity for a sustained period. Monitoring this metric helps ensure stable broker performance and prevents memory-related failures.
Customization Guidance:
- Threshold: Set based on operational tolerance and memory capacity (e.g., > 85%).
- Monitoring Period: Adjust to capture sustained high usage (e.g., 10-minute intervals).
- Notification Frequency: Immediate for critical thresholds.
Action: Investigate memory-intensive workloads, optimize Kafka configurations, or consider scaling up broker instances.
High Heap Memory Usage Alert
This alert monitors the heap memory usage of AWS MSK brokers. High heap memory usage can lead to out-of-memory errors and negatively impact broker performance. The alert is triggered when heap memory usage exceeds 85% of the maximum heap size for a sustained period. Monitoring this metric helps ensure efficient memory management and prevents heap-related failures.
Customization Guidance:
- Threshold: Set based on operational tolerance and expected memory usage (e.g., > 85%).
- Monitoring Period: Adjust to capture sustained high usage (e.g., 10-minute intervals).
- Notification Frequency: Immediate for critical thresholds.
Action: Analyze broker heap memory usage, optimize garbage collection, increase heap size, or scale up broker instances as needed.
Network Receive Errors Alert
This alert monitors the network receive errors of AWS MSK brokers. Network receive errors indicate potential issues with broker connectivity or underlying infrastructure. The alert is triggered when the count of network receive errors exceeds acceptable limits. Monitoring this metric helps ensure reliable network communication for brokers.
Customization Guidance:
- Threshold: Set based on operational tolerance for network errors (e.g., > 0).
- Monitoring Period: Frequent checks to quickly detect issues (e.g., 1-minute intervals).
- Notification Frequency: Immediate for critical issues.
Action: Investigate the network configuration, broker connectivity, and underlying infrastructure for potential issues.
Network Transmit Errors Alert
This alert monitors the network transmit errors of AWS MSK brokers. Network transmit errors indicate potential issues with broker connectivity or underlying infrastructure. The alert is triggered when the count of network transmit errors exceeds acceptable limits. Monitoring this metric helps ensure reliable data transmission from brokers.
Customization Guidance:
- Threshold: Set based on operational tolerance for network errors (e.g., > 0).
- Monitoring Period: Frequent checks to quickly detect issues (e.g., 1-minute intervals).
- Notification Frequency: Immediate for critical issues.
Action: Investigate the network configuration, broker connectivity, and underlying infrastructure for potential issues.
Maximum Offset Lag Alert
This alert monitors the maximum offset lag across all partitions in a topic for AWS MSK. High offset lag indicates that consumers are falling behind producers and may lead to delays in data processing. The alert is triggered when the maximum offset lag exceeds a defined threshold for a sustained period. Monitoring this metric ensures timely data consumption and helps prevent delays in processing.
Customization Guidance:
- Threshold: Set based on application-specific SLAs for acceptable lag (e.g., > 1000).
- Monitoring Period: Adjust based on the criticality of real-time data processing.
- Notification Frequency: Immediate for critical thresholds.
Action: Scale consumer instances, optimize consumer configurations, or investigate high producer rates causing the lag.
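As a rough illustration of the quantity behind this alert, the Python sketch below computes per-partition lag as the difference between the latest offset and the consumer group's committed offset, then takes the maximum across a topic's partitions. The function name `max_offset_lag`, the offset values, and the choice to treat a partition with no committed offset as committed at 0 are assumptions for this example, not the extension's implementation.

```python
from typing import Mapping


def max_offset_lag(end_offsets: Mapping[int, int], committed_offsets: Mapping[int, int]) -> int:
    """Return the largest per-partition lag (end offset minus committed offset, floored at 0)."""
    lags = (
        # Assumption: a partition without a committed offset is treated as committed at 0.
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )
    return max(lags, default=0)


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 900, 2: 4200}
committed_offsets = {0: 1490, 1: 900, 2: 3000}

lag = max_offset_lag(end_offsets, committed_offsets)
print(lag, lag > 1000)  # compare against the example threshold of 1000
```

Using the maximum rather than the sum means one stalled partition is enough to fire the alert, even when the other partitions are fully caught up.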
Integration
Learn more about Coralogix's out-of-the-box integration with Amazon MSK in our documentation.