Quick Start Observability for Amazon API Gateway

Coralogix Extension For Amazon API Gateway Includes:

Dashboards - 1

Gain instantaneous visualization of all your Amazon API Gateway data.

Amazon API Gateway

Alerts - 5

Stay on top of key Amazon API Gateway performance metrics. Keep everyone informed through integrations with Slack, PagerDuty, and more.

Availability Less Than 90%

This alert monitors API availability percentages across regions to help keep Amazon API Gateway services highly available and reliable. It triggers when API availability in any specified region drops below 90% over the last 10 minutes. API availability is a critical indicator of the health and operational effectiveness of services delivered through Amazon API Gateway, and a drop can significantly degrade user experience, particularly in applications that depend heavily on the API.

Customization Guidance:

Threshold: The default threshold is 90% availability. Raise or lower it depending on how critical the API is and how much downtime you can tolerate.
Monitoring Period: The default monitoring period is 10 minutes, which gives near real-time insight into availability. Shorten it to catch issues sooner, or lengthen it to analyze longer-term trends, based on your operational needs and expected traffic patterns.
Region Specificity: Specify which regions the alert should monitor more closely, based on where your key consumers are located or where critical operations occur.
Notification Frequency: Tune how often the alert notifies to balance responsiveness against noise, according to how critical the API's uninterrupted operation is.

Action: When this alert triggers, immediately investigate the cause of the reduced availability. Check for reported outages, configuration errors, or unusual traffic spikes that could be overwhelming the service. If the issue stems from the infrastructure layer, you may need to coordinate with AWS Support.
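The availability check described above can be sketched in Python. This is an illustration only, not Coralogix's alert definition: it assumes that availability means the share of requests in the window that did not return a 5xx response, and the names `RegionWindow`, `availability_pct`, and `regions_below_threshold` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegionWindow:
    """Request counts for one region over one monitoring window (hypothetical shape)."""
    region: str
    total_requests: int   # all requests seen in the window, e.g. the last 10 minutes
    server_errors: int    # requests in the same window that returned a 5xx status


def availability_pct(window: RegionWindow) -> float:
    """Assumed definition: percentage of requests that did not fail with a 5xx."""
    if window.total_requests == 0:
        return 100.0  # no traffic: treat as available rather than alerting on nothing
    ok = window.total_requests - window.server_errors
    return 100.0 * ok / window.total_requests


def regions_below_threshold(windows, threshold_pct=90.0):
    """Return the regions whose availability dropped below the threshold."""
    return [w.region for w in windows if availability_pct(w) < threshold_pct]


windows = [
    RegionWindow("us-east-1", total_requests=1000, server_errors=50),
    RegionWindow("eu-west-1", total_requests=200, server_errors=30),
]
print(regions_below_threshold(windows))  # ['eu-west-1']
```

Passing a different `threshold_pct` corresponds to the Threshold customization, and limiting the list of windows to selected regions corresponds to Region Specificity.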

High Integration Latency Detected

This alert monitors delays in backend communication that can affect the overall responsiveness and performance of the API Gateway. It triggers when the integration latency (the time the API Gateway takes to receive a response from the backend service) exceeds a critical threshold of 500 milliseconds. High integration latency can indicate issues within the backend services, such as resource constraints, network delays, or inefficient code. Keeping latency low is crucial to giving API consumers a seamless and efficient experience.

Customization Guidance:

Threshold: The default critical threshold is 500 milliseconds. Adjust it based on your application's performance requirements and typical backend response times.
Monitoring Scope: Configure the alert to monitor latency across different API stages, or for specific methods that are critical to your business operations.
Alert Sensitivity: Consider a tiered alert system that issues warnings at a lower threshold (e.g., 300 milliseconds) to give early notice before latency reaches the critical level.
Notification Frequency: Tune how often the alert notifies to balance responsiveness against noise, according to how critical the API's uninterrupted operation is.

Action: If this alert triggers, investigate immediately to identify the source of the delay. Review backend service performance and configuration, network conditions, and any recent changes that might have affected response times. Optimization may involve scaling up backend resources, improving network configuration, or refining backend code.
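The tiered approach suggested under Alert Sensitivity can be sketched as a small classifier. This is a minimal illustration that uses the 300 ms and 500 ms values from the text as defaults; the function name `classify_integration_latency` and the severity labels are hypothetical and not part of any Coralogix API.

```python
def classify_integration_latency(latency_ms: float,
                                 warning_ms: float = 300.0,
                                 critical_ms: float = 500.0) -> str:
    """Map an integration latency sample to a severity tier.

    A tier is reached only when the latency exceeds its threshold,
    so a sample exactly at a threshold stays in the lower tier.
    """
    if warning_ms > critical_ms:
        raise ValueError("warning threshold must not exceed critical threshold")
    if latency_ms > critical_ms:
        return "critical"
    if latency_ms > warning_ms:
        return "warning"
    return "ok"


for sample in (250, 350, 620):
    print(sample, classify_integration_latency(sample))
```

Routing the `warning` tier to a low-urgency channel and the `critical` tier to paging is one way to get early notice without adding noise.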

Critical Overall Latency

This alert monitors total latency, which includes both integration latency and API Gateway overhead, to help keep the API Gateway fast and responsive. It triggers when the total latency of the request-response cycle exceeds a critical threshold of 500 milliseconds. Total latency is the overall time taken to process an API request and return a response, covering time spent both within the API Gateway and in the backend services. Exceeding the threshold suggests significant delays that could affect user experience and system efficiency.

Customization Guidance:

Threshold: The default critical threshold is 500 milliseconds. Adjust it to match your application's sensitivity, performance benchmarks, and expected response times.
Granularity: Consider configuring the alert to distinguish between the two sources of latency, integration and Gateway overhead, so you can identify and address the cause of delays more precisely.
Alert Tiering: Implement multi-level alerting that issues preliminary warnings at a lower threshold (e.g., 300 milliseconds) to enable early detection and intervention before latency reaches critical levels.
Notification Frequency: Tune how often the alert notifies to balance responsiveness against noise, according to how critical the API's uninterrupted operation is.

Action: When this alert triggers, promptly investigate the components contributing to the latency. Analyze network performance, backend service efficiency, and API Gateway configuration. Optimization might include scaling backend resources, improving network paths, or adjusting caching strategies within the API Gateway.
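Because total latency is described above as integration latency plus API Gateway overhead, the Granularity suggestion can be sketched by subtracting one from the other. The function names below are hypothetical, and the sketch assumes both values are measured in milliseconds over the same window.

```python
def gateway_overhead_ms(total_latency_ms: float, integration_latency_ms: float) -> float:
    """Overhead added by the gateway itself: total latency minus integration latency.

    Clamped at zero in case the two measurements are slightly out of sync.
    """
    return max(total_latency_ms - integration_latency_ms, 0.0)


def dominant_latency_source(total_latency_ms: float, integration_latency_ms: float) -> str:
    """Report which component contributes the larger share of total latency."""
    overhead = gateway_overhead_ms(total_latency_ms, integration_latency_ms)
    return "integration" if integration_latency_ms >= overhead else "gateway_overhead"


print(gateway_overhead_ms(620, 580))      # 40
print(dominant_latency_source(620, 580))  # integration
```

A result of `integration` points the investigation at the backend service, while `gateway_overhead` points at API Gateway configuration such as caching.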

5xx Error Rate Exceeds 5% in Last 10 Minutes

This alert detects significant increases in server-side errors, represented by 5xx HTTP status codes, within the API Gateway. Monitoring these errors is crucial for maintaining the reliability and availability of the API services. The alert triggers when 5xx errors exceed 5% of total API requests within the last 10 minutes. 5xx errors indicate server-side issues that can result from internal configuration errors, backend failures, or other conditions that prevent the server from fulfilling requests. A high error rate can severely damage user experience and trust in the API services.

Customization Guidance:

Threshold: The default threshold is a 5% error rate. Adjust it based on how critical your API's uptime is and the error rates you have observed historically. Lower thresholds may suit high-availability environments.
Monitoring Period: The default monitoring period is 10 minutes, which balances timely response against meaningful error data. Shorten or lengthen it based on API traffic patterns and operational requirements.
Error Specificity: Consider refining the alert to trigger on specific 5xx error codes (e.g., 503 Service Unavailable vs. 500 Internal Server Error) if particular errors are more critical to your operations.
Notification Frequency: Tune how often the alert notifies to balance responsiveness against noise, according to how critical the API's uninterrupted operation is.

Action: When this alert triggers, immediately investigate the root causes of the 5xx errors. Review logs and metrics for the API Gateway and backend services to identify and fix configuration issues or resource bottlenecks. Engage technical support if necessary to resolve infrastructure-related problems quickly.
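The error-rate condition and the Error Specificity suggestion can be sketched over a list of HTTP status codes collected during one monitoring window. This is a minimal illustration with hypothetical function names; it is not how Coralogix evaluates the alert internally.

```python
from collections import Counter


def error_rate_pct(status_codes, family: int) -> float:
    """Percentage of responses whose status is in the given family (5 for 5xx)."""
    if not status_codes:
        return 0.0  # no traffic in the window, so no error rate to report
    matching = sum(1 for code in status_codes if code // 100 == family)
    return 100.0 * matching / len(status_codes)


def error_breakdown(status_codes, family: int) -> Counter:
    """Count each individual status code within the family, e.g. 500 vs. 503."""
    return Counter(code for code in status_codes if code // 100 == family)


# Hypothetical 10-minute window: 90 successes, four 500s, six 503s.
window = [200] * 90 + [500] * 4 + [503] * 6

rate = error_rate_pct(window, family=5)
print(rate, rate > 5.0)                   # 10.0 True
print(error_breakdown(window, family=5))  # Counter({503: 6, 500: 4})
```

The breakdown shows whether the rate is driven by one code, which helps decide whether a code-specific alert is worthwhile.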

4xx Error Rate Exceeds 5% in Last 10 Minutes

This alert monitors the rate of 4xx HTTP errors, which typically indicate client-side issues but can also point to API misconfiguration or misuse. Monitoring these errors is essential for maintaining the usability and security of the API. The alert triggers when 4xx errors exceed 5% of total API requests within the last 10 minutes. A high rate of 4xx errors could indicate widespread issues such as endpoint misuse, authentication failures, or client errors, which can hurt user experience and potentially overload the server with unnecessary requests.

Customization Guidance:

Threshold: The default threshold is a 5% error rate. Adjust it based on the API's operational norms and your tolerance for client errors. APIs that support critical operations may need a lower threshold.
Monitoring Period: The default monitoring period is 10 minutes. Shorten or lengthen it depending on how dynamic the API traffic is, so that issues are identified promptly without generating noise from normal fluctuations.
Error Specificity: Consider configuring the alert to distinguish between types of 4xx errors (e.g., 401 Unauthorized vs. 404 Not Found) to target specific problems and speed up diagnosis.
Notification Frequency: Tune how often the alert notifies to balance responsiveness against noise, according to how critical the API's uninterrupted operation is.

Action: If this alert triggers, analyze the logs to determine the common causes of the 4xx errors. Look for patterns such as specific endpoints generating errors or particular client behaviors. Review API documentation and configuration to confirm they match how clients actually interact with the API. Implement rate limiting or blocking policies if necessary to manage misuse or overly aggressive use of the API.
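The log analysis recommended in the Action step, looking for specific endpoints that generate errors, can be sketched as a simple grouping. The input shape (pairs of request path and status code) and the function name `top_4xx_sources` are assumptions made for illustration; real API Gateway access logs would need to be parsed into this form first.

```python
from collections import Counter


def top_4xx_sources(requests, limit: int = 3):
    """Return the most frequent (path, status) pairs among 4xx responses.

    `requests` is an iterable of (path, status_code) tuples.
    """
    counts = Counter(
        (path, status) for path, status in requests if 400 <= status <= 499
    )
    return counts.most_common(limit)


# Hypothetical parsed log entries.
entries = [
    ("/login", 401), ("/login", 401), ("/login", 401),
    ("/orders", 404), ("/orders", 200),
    ("/items", 404), ("/items", 404),
]

print(top_4xx_sources(entries, limit=2))
# [(('/login', 401), 3), (('/items', 404), 2)]
```

A cluster of 401 responses on a single path suggests an authentication problem, while scattered 404 responses are more likely to come from outdated client URLs.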

Integration

Learn more about Coralogix's out-of-the-box integration with Amazon API Gateway in our documentation.