Quick Start Observability for AWS EC2

AWS EC2

Coralogix Extension For AWS EC2 Includes:

Dashboards - 1

Gain instantaneous visualization of all your AWS EC2 data.

Amazon EC2

Alerts - 6

Stay on top of key AWS EC2 performance metrics. Keep everyone in the know through integrations with Slack, PagerDuty, and more.

Amazon EC2 - High CPU Utilization

This alert is designed to monitor the CPU utilization of Amazon EC2 instances to ensure optimal performance and avoid potential resource exhaustion. The alert is triggered when the CPU utilization exceeds a predefined critical threshold, indicating that the instance is under high load, which could lead to performance degradation or application instability.

Trigger Condition: The alert is activated when the CPU utilization of an EC2 instance exceeds a critical threshold of 80%. This high CPU usage can signify that the instance is nearing its processing capacity, potentially leading to slower response times, throttling, or even system failures.

Customization Guidance:

Threshold: The default critical threshold is set at 80%. However, this can be adjusted based on the instance type, workload, and performance requirements of your application. Some high-compute workloads may require a higher threshold (e.g., 90%), while sensitive applications may need a lower limit (e.g., 70%).

Granularity: For more precise monitoring, consider breaking down the CPU utilization by process or service to identify which specific tasks or applications are consuming the most CPU. This could help pinpoint whether it’s a specific application, service, or external factor causing the spike.

Alert Tiering: Implement multi-level alerts to catch CPU utilization spikes early. For instance, set a warning threshold at 60% to allow proactive steps, such as scaling or load balancing, before reaching critical levels.

Notification Frequency: Configure the alert frequency to avoid excessive noise while maintaining awareness of sustained CPU utilization issues. For example, you might want to trigger notifications only after the threshold is exceeded for a sustained period (e.g., 5 minutes) rather than for brief spikes.

Action: When the alert is triggered, immediately analyze the workload on the EC2 instance. Consider whether vertical scaling (increasing instance size) or horizontal scaling (adding instances) is needed. Check for inefficient processes, high computational demands, or underlying service issues. Other potential optimizations include balancing the load across multiple instances, adjusting auto-scaling policies, or offloading tasks to less busy instances.
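
The tiering and sustained-period guidance for the CPU alert can be illustrated with a small sketch. This is not how Coralogix evaluates the alert; it is a minimal Python example that assumes one CPU sample per minute and uses a hypothetical helper name, classify_cpu. The 60% and 80% values come from the text above; the five-sample window mirrors the 5-minute example.

```python
from typing import Sequence

WARNING_PCT = 60.0   # warning tier suggested in the text
CRITICAL_PCT = 80.0  # default critical threshold from the text


def classify_cpu(samples: Sequence[float], sustained_points: int = 5) -> str:
    """Return "critical", "warning", or "ok" for a series of CPU % samples.

    A tier fires only when the most recent `sustained_points` samples all
    exceed its threshold, so a brief spike is ignored. One sample per
    minute is an assumption of this sketch, not a product setting.
    """
    if len(samples) < sustained_points:
        return "ok"  # not enough data to call anything sustained
    recent = samples[-sustained_points:]
    if all(s > CRITICAL_PCT for s in recent):
        return "critical"
    if all(s > WARNING_PCT for s in recent):
        return "warning"
    return "ok"
```

For example, five consecutive samples above 80% classify as critical, while a single 95% spike surrounded by 50% readings stays ok.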

Amazon EC2 - System Status Checks Failed

This alert is designed to monitor the health of the underlying infrastructure supporting an Amazon EC2 instance. The alert is triggered when system status checks fail, indicating a potential issue with the AWS infrastructure hosting the instance. System checks evaluate the health of the instance's physical host, network, and underlying hardware.

Trigger Condition: The alert is activated when the EC2 system status check fails. This failure often points to problems with the physical hardware, networking, or other AWS-related issues that prevent the instance from functioning properly. These checks are beyond the control of the user but critical for the instance's operational integrity.

Customization Guidance:

Threshold: The alert should be triggered immediately upon the system status check failure. Since this indicates an infrastructure-level issue, quick detection is essential to minimize downtime.

Granularity: While system checks monitor the health of the AWS infrastructure supporting your instance, consider combining this alert with instance-level health checks to get a full picture of any potential issues (e.g., hardware, networking, or OS-level issues).

Alert Tiering: You can set multi-tiered alerts. For example, a warning alert can be issued if an instance is experiencing occasional failures, while a critical alert is triggered when the system check fails for an extended period (e.g., 5 minutes), signaling a more serious underlying issue.

Notification Frequency: Immediate notification is recommended since system check failures typically require AWS intervention or instance recovery actions. Configure the alert to repeat until the issue is resolved to ensure quick response.

Action: Upon activation of this alert, investigate the system status check failure by performing the following steps:

Check AWS Health Dashboard: Visit the AWS Health Dashboard to see if there are any ongoing infrastructure issues in the AWS region where your instance is hosted.

Stop/Start or Reboot the Instance: In many cases, a stop/start cycle resolves the issue by moving your instance to healthy hardware within the AWS infrastructure. A reboot alone typically keeps the instance on the same host, so it helps mainly with transient problems.

Contact AWS Support: If the issue persists, contact AWS Support to investigate further. Since system checks are AWS-managed, any prolonged failure will likely need AWS intervention.

Consider Auto-Recovery: If system checks fail frequently, enable EC2 automatic recovery so that the instance is recovered onto healthy hardware when the underlying infrastructure fails.
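
The warning/critical split described under Alert Tiering (occasional failures versus a failure that persists for about 5 minutes) can be sketched as follows. This is an illustrative Python example only: the function name status_check_tier and the one-result-per-minute input are assumptions, not part of the Coralogix extension.

```python
from typing import Sequence


def status_check_tier(failed: Sequence[bool], critical_run: int = 5) -> str:
    """Classify per-minute status-check results, oldest first.

    failed[i] is True when the check failed at minute i. A failure streak
    of `critical_run` minutes ending at the newest sample is critical; any
    other failure in the window is treated as an intermittent warning.
    """
    # Length of the failure streak that ends at the newest sample.
    run = 0
    for is_failed in reversed(failed):
        if not is_failed:
            break
        run += 1
    if run >= critical_run:
        return "critical"
    if any(failed):
        return "warning"
    return "ok"
```

The same shape works for the instance status check, since both checks report a simple pass/fail result.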

Amazon EC2 - Instance Status Checks Failed

This alert is designed to monitor the health of the EC2 instance itself by tracking the instance status check. The alert is triggered when the instance status check fails, indicating an issue with the instance’s operating system, boot process, or underlying configuration that prevents it from functioning properly.

Trigger Condition: The alert is activated when the EC2 instance status check fails. This failure generally signals problems within the instance, such as OS corruption, software misconfigurations, or insufficient resources (CPU, memory, disk space) preventing the instance from operating correctly.

Customization Guidance:

Threshold: The alert should be triggered immediately upon a status check failure, as this often indicates that the instance is unresponsive or malfunctioning. Immediate notification is critical to minimizing downtime.

Granularity: Instance status checks focus on problems internal to the instance, such as OS-level issues. You may want to combine this with system status checks (which monitor the AWS infrastructure) to get comprehensive visibility into both the host environment and instance health.

Alert Tiering: Consider setting a warning alert if the status check fails intermittently, and a critical alert for sustained failure (e.g., more than 5 minutes) or when multiple consecutive status checks fail, suggesting a serious problem with the instance.

Notification Frequency: Immediate notifications are recommended as instance-level failures can impact the availability of your applications. Consider setting repeat notifications until the issue is resolved to ensure swift action.

Action: Upon activation of this alert, it is essential to investigate the root cause of the instance status check failure and take appropriate corrective actions:

Reboot the Instance: Attempt to reboot the instance to resolve transient issues, such as kernel crashes or misconfigurations.

Check Logs: Investigate the instance system logs (e.g., dmesg, syslog) to identify any software errors, misconfigurations, or resource exhaustion (CPU, memory, disk).

Verify Instance Resources: Check if the instance is running out of key resources like memory or disk space and consider scaling up or modifying the instance.

Stop/Start the Instance: If the reboot doesn't fix the issue, perform a full stop/start cycle to move the instance to new hardware and reload the instance from scratch.

Perform OS-Level Troubleshooting: If necessary, connect to the instance using a recovery method such as EC2 Instance Recovery or attach the root EBS volume to another instance for detailed troubleshooting.
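
For the "Verify Instance Resources" step, a quick disk check can be run on the instance itself. The sketch below uses Python's standard shutil.disk_usage; the helper names and the 90% limit are assumptions chosen for illustration, not values from the extension.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems reporting zero size
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path: str = "/", limit_pct: float = 90.0) -> bool:
    """True when usage is at or above `limit_pct` (90% is an assumed default)."""
    return disk_used_percent(path) >= limit_pct


if __name__ == "__main__":
    print(f"root filesystem: {disk_used_percent('/'):.1f}% used")
```

The figure can differ slightly from df -h output, which accounts for blocks reserved for the root user.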

Amazon EC2 - Attached EBS Status Checks Failed

This alert is designed to monitor the health and availability of Amazon EBS volumes attached to an EC2 instance. The alert is triggered when the EBS volume status check fails, indicating an issue with the attached EBS volume that may prevent the instance from reading from or writing to the volume.

Trigger Condition: The alert is activated when the status check for an attached EBS volume fails. This failure could indicate several underlying problems, including data corruption, I/O performance issues, or connectivity problems between the EC2 instance and the EBS volume.

Customization Guidance:

Threshold: The alert should be triggered immediately upon EBS volume status check failure, as this issue can lead to application outages or data unavailability. EBS volumes are critical for persistent storage, so rapid detection is essential.

Granularity: Consider setting separate alerts for I/O performance degradation and volume state (e.g., impaired or stalled) to differentiate between temporary I/O slowdowns and complete volume failures.

Alert Tiering: You can create multi-tier alerts based on the nature of the failure:
Warning Alerts for temporary I/O performance issues or latency (e.g., exceeding 90% I/O capacity).
Critical Alerts when the EBS volume is marked as "impaired" or "failed" by AWS, as this could lead to complete data inaccessibility.

Notification Frequency: Immediate notifications are highly recommended for this alert since EBS status check failures can directly affect application performance or availability. Set the alert to repeat until the issue is resolved to ensure continuous awareness.

Action: When the alert is triggered, it's crucial to quickly identify and resolve the issue with the EBS volume to prevent data loss or performance degradation:

Check EBS Volume Status in AWS Console: Go to the EC2 Dashboard > Volumes section to inspect the status of the attached EBS volume. AWS may provide further details if the volume is marked as "impaired" or "failed."

Investigate CloudWatch Metrics for EBS: Review CloudWatch metrics for the volume, such as IOPS, throughput, and latency, to determine if the issue is related to heavy usage or throttling.

Check Disk Utilization: Log into the EC2 instance and check the disk space usage (df -h) and volume health (dmesg, lsblk) to detect if there’s any corruption or unmounted file systems.

Detach and Reattach the EBS Volume: If the volume is not responding, detach it from the instance and reattach it. This may resolve issues with connectivity or I/O errors.

Restore from Snapshot: If the volume is severely impaired or corrupted, and recovery efforts fail, restore the volume from a recent EBS snapshot.

Contact AWS Support: If AWS indicates an issue with the volume that cannot be resolved through the above steps, contact AWS Support for assistance in recovering or replacing the volume.
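
The two EBS tiers described above (warning for I/O pressure, critical for an impaired or failed volume) can be expressed as a small decision function. This is a hedged sketch: ebs_alert_tier is a hypothetical name, and how the volume state and I/O utilization are obtained (console, API, or an exporter) is left outside the example.

```python
from typing import Optional

# Volume states the text treats as critical.
CRITICAL_STATES = {"impaired", "failed"}
IO_WARNING_FRACTION = 0.90  # "exceeding 90% I/O capacity" from the text


def ebs_alert_tier(volume_state: str,
                   io_utilization: Optional[float] = None) -> str:
    """Map a volume state and optional I/O utilization (0.0-1.0) to a tier.

    Both inputs are assumed to be collected elsewhere; this function only
    encodes the tiering rule.
    """
    state = volume_state.strip().lower()
    if state in CRITICAL_STATES:
        return "critical"
    if io_utilization is not None and io_utilization > IO_WARNING_FRACTION:
        return "warning"
    return "ok"
```

Checking the volume state first means an impaired volume is always reported as critical, even when its I/O utilization looks low.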

Amazon EC2 - High Server Errors

This alert is designed to monitor the occurrence of HTTP 5xx server-side errors on your Amazon EC2 instance. 5xx errors indicate that the server encountered an issue while processing requests, which could lead to degraded user experience or application downtime. Common 5xx errors include 503 (Service Unavailable) and 504 (Gateway Timeout).

Trigger Condition: The alert is triggered when the number of HTTP 5xx errors exceeds a specified threshold, indicating an ongoing issue with the server. For example, you may set the alert to trigger when the count of 5xx errors exceeds 10 errors in 5 minutes.

Customization Guidance:

Threshold: Set a specific count threshold for 5xx errors. For example, a critical alert can be triggered when there are 10 or more 5xx errors in a 10-minute window. Adjust these thresholds based on your application's traffic volume and tolerance for errors.

Granularity: You can customize the alert to differentiate between specific 5xx errors (for example, 503 versus 504). This can help identify the root cause more effectively.

Alert Tiering: Implement multi-level alerting:
Warning alerts for a lower count of errors (e.g., 5 in 10 minutes).
Critical alerts for a higher count of errors (e.g., 10 or more in 10 minutes), indicating a more severe issue.

Notification Frequency: Set the alert to notify immediately when the error count crosses the threshold, but adjust the frequency to avoid alert fatigue for transient issues.

Action: Upon activation of the alert, take immediate action to investigate and resolve the underlying cause:

Check Server Logs: Review system and application logs for error messages that could explain the spike in 5xx errors.

Monitor Resource Usage: Verify whether the instance is running low on CPU, memory, or disk resources, which could lead to 503 errors.

Investigate Upstream Services: If you encounter 504 errors, check the availability and performance of any upstream services or databases the instance depends on.

Review Application Health: Ensure that the application running on the instance is stable and not encountering issues such as unhandled exceptions or resource leaks.
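
The count-in-window thresholds for the server error alert can be sketched as a sliding-window check. The example below is illustrative only: the function name server_error_tier and the (timestamp, status) input format are assumptions, and the 5 and 10 error counts over 10 minutes are the example values quoted above.

```python
from typing import Iterable, Tuple

WINDOW_SECONDS = 600   # 10-minute window from the examples in the text
WARNING_COUNT = 5      # warning tier example
CRITICAL_COUNT = 10    # critical tier example


def server_error_tier(events: Iterable[Tuple[float, int]], now: float) -> str:
    """Classify recent traffic by the number of 5xx responses.

    `events` holds (unix_timestamp, http_status) pairs; only 5xx statuses
    inside the last WINDOW_SECONDS before `now` are counted.
    """
    cutoff = now - WINDOW_SECONDS
    count = sum(
        1
        for ts, status in events
        if cutoff < ts <= now and 500 <= status <= 599
    )
    if count >= CRITICAL_COUNT:
        return "critical"
    if count >= WARNING_COUNT:
        return "warning"
    return "ok"
```

The same function works for client errors if the status range is changed to 400-499.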

Amazon EC2 - High Client Errors

This alert is designed to monitor the occurrence of HTTP 4xx client-side errors on your Amazon EC2 instance. 4xx errors indicate that the client sent a bad request that the server could not process. These errors are often caused by misconfigured applications, incorrect API requests, or authentication issues. Common 4xx errors include 401 (Unauthorized) and 403 (Forbidden).

Trigger Condition: The alert is triggered when the number of HTTP 4xx errors exceeds a specified threshold, indicating a potential issue with how the client interacts with the EC2 instance. For example, you may set the alert to trigger when the count of 4xx errors exceeds 10 errors in 10 minutes.

Customization Guidance:

Threshold: Set a specific count threshold for 4xx errors. For example, a warning alert can be triggered when there are five 4xx errors in a 10-minute window, and a critical alert when there are 10 or more 4xx errors in a 10-minute window. Adjust these thresholds based on your application's traffic and tolerance for client errors.

Granularity: Customize the alert to differentiate between specific 4xx errors (for example, 401 versus 403). This helps in identifying the exact root cause more efficiently.

Alert Tiering: Implement multi-level alerting:
Warning alerts for a lower count of errors (e.g., 5 in 10 minutes).
Critical alerts for a higher count of errors (e.g., 10 or more in 10 minutes), indicating a more significant client-side issue.

Notification Frequency: Configure the alert to notify immediately when the error count crosses the threshold, but fine-tune the frequency to avoid excessive noise for temporary spikes.

Action: Upon activation of the alert, investigate the cause of high client errors:

Review API Requests: Check for malformed or incorrect API requests coming from the client, such as incorrect parameters, invalid URLs, or unsupported operations.

Check Authentication Issues: Investigate any 401 or 403 errors to ensure that clients have the correct permissions and tokens to access resources.

Verify Application Configuration: Look for misconfigurations in the client application or issues in the communication between the client and the EC2 instance.
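
The Granularity advice for client errors (separating individual 4xx codes) can be illustrated by grouping status codes before alerting. This is a minimal sketch with hypothetical helper names; it assumes the status codes have already been extracted from access logs.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def client_error_breakdown(statuses: Iterable[int]) -> Dict[int, int]:
    """Count occurrences of each 4xx status code, ignoring everything else."""
    counts = Counter(s for s in statuses if 400 <= s <= 499)
    return dict(counts)


def dominant_client_error(statuses: Iterable[int]) -> Optional[int]:
    """Return the most frequent 4xx code, or None if there are none."""
    breakdown = client_error_breakdown(statuses)
    if not breakdown:
        return None
    return max(breakdown, key=breakdown.get)
```

A breakdown dominated by 401 or 403 points toward the authentication checks listed above, while other codes suggest reviewing the API requests themselves.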

Integration

Learn more about Coralogix's out-of-the-box integration with AWS EC2 in our documentation.
