Quick Start Observability for AWS Aurora
Coralogix Extension For AWS Aurora Includes:
Dashboards - 1
Gain instant visibility into all your AWS Aurora data.
Alerts - 5
Stay on top of key AWS Aurora performance metrics. Keep everyone in the know through integrations with Slack, PagerDuty, and more.
Amazon Aurora - Global DB Replication Lag Higher Than 1.5 seconds
This alert monitors replication lag in an Amazon Aurora Global Database. Replication lag is the delay between an update on the primary Aurora cluster and its application on secondary (replica) clusters in other regions. Replication lag above 1.5 seconds can leave replica clusters with outdated or inconsistent data, potentially affecting application performance and user experience, especially in read-heavy, globally distributed applications.
Trigger Condition: The alert is triggered when the Aurora Global DB replication lag exceeds 1.5 seconds for more than a specified duration (e.g., 10 minutes). Persistent replication lag can indicate network issues, resource contention, or high write activity on the primary cluster.
Customization Guidance:
Threshold: Set a specific replication lag threshold. For example, a warning alert can be triggered when replication lag exceeds 1 second, and a critical alert when replication lag exceeds 1.5 seconds for more than 10 minutes.
Granularity: Customize the alert to track replication lag for individual Aurora Global DB regions. This isolates lag to specific regions and allows targeted troubleshooting.
Alert Tiering: Implement multi-level alerts. Use warning alerts when replication lag is slightly elevated but manageable (e.g., between 1.0 and 1.5 seconds), and critical alerts when replication lag exceeds 1.5 seconds, indicating a significant delay in replication.
Notification Frequency: Configure the alert to notify immediately when replication lag exceeds the threshold, and to repeat if the lag persists, to ensure prompt attention.
Action: Upon activation of the alert, investigate and resolve the cause of the replication lag:
Check Network Latency: Review the network connection between the primary and replica regions to ensure that no latency or bandwidth issues are delaying data replication.
Monitor Database Workload: Check for high write volumes, heavy read/write operations, or complex queries on the primary cluster that could contribute to increased replication lag.
Review Resource Utilization: Examine CPU, memory, and disk I/O utilization on both the primary and replica Aurora clusters. Resource exhaustion can slow down the replication process.
Evaluate Replication Configuration: Ensure that the replication settings are properly configured for high availability and performance across regions. Adjust settings or replication strategies if necessary.
Investigate Read/Write Traffic Patterns: Check whether an unusual spike in database traffic is contributing to the replication lag, and consider scaling the Aurora cluster if needed.
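The "exceeds the threshold for more than a specified duration" condition can be illustrated with a minimal Python sketch. It is not Coralogix's alert engine; the `LagSample` type, the `classify_lag` function, and the constants are hypothetical names chosen for this example, and the values mirror the example thresholds in the text (warning above 1.0 second, critical above 1.5 seconds sustained for more than 10 minutes).

```python
from dataclasses import dataclass
from typing import List

# Example values taken from the guidance text; adjust to your own policy.
WARNING_LAG_S = 1.0
CRITICAL_LAG_S = 1.5
CRITICAL_HOLD_S = 600  # 10 minutes


@dataclass
class LagSample:
    timestamp: float    # seconds since epoch
    lag_seconds: float  # replication lag observed at that time


def classify_lag(samples: List[LagSample]) -> str:
    """Return 'critical', 'warning', or 'ok' for samples sorted by timestamp."""
    if not samples:
        return "ok"
    latest = samples[-1]
    # Walk backwards to find where the current unbroken critical breach began.
    breach_start = None
    for sample in reversed(samples):
        if sample.lag_seconds > CRITICAL_LAG_S:
            breach_start = sample.timestamp
        else:
            break
    if breach_start is not None and latest.timestamp - breach_start > CRITICAL_HOLD_S:
        return "critical"
    if latest.lag_seconds > WARNING_LAG_S:
        return "warning"
    return "ok"
```

A single dip below the critical threshold resets the sustained-breach clock in this sketch, which is one reasonable reading of "persistent" lag; a production rule might tolerate brief dips instead.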
Amazon Aurora - Global DB Process Lag Higher Than 1.5 seconds
This alert monitors process lag in an Amazon Aurora Global Database. Process lag is the time taken for changes (writes/updates) to be processed and applied on the Aurora Global DB replicas. Process lag above 1.5 seconds may indicate delays in applying changes, which can affect the availability of the latest data in the secondary regions of the Aurora Global Database.
Trigger Condition: The alert is triggered when the Aurora Global DB process lag exceeds 1.5 seconds. Persistent process lag can result in inconsistent data between the primary and secondary clusters, leading to potential performance issues for read-heavy applications or those that require up-to-date data.
Customization Guidance:
Threshold: Set a specific threshold for process lag. For example, a warning alert can be triggered when process lag exceeds 1.0 second for a short period, and a critical alert when process lag exceeds 1.5 seconds for more than 5 minutes.
Granularity: Customize the alert to track process lag per replica region in the Aurora Global DB setup. This helps identify the specific regions where the lag is occurring.
Alert Tiering: Implement multi-level alerts. Use warning alerts when process lag is elevated but still manageable (e.g., between 1.0 and 1.5 seconds), and critical alerts when process lag exceeds 1.5 seconds, indicating more significant delays in processing updates.
Notification Frequency: Configure the alert to notify immediately when process lag exceeds the threshold, and set repeated notifications for sustained lag to ensure timely action.
Action: Upon activation of the alert, investigate the root cause of the high process lag:
Check Database Workload: Investigate whether high write volumes or complex queries are contributing to processing delays.
Monitor Resource Utilization: Check CPU, memory, and disk I/O utilization on both the primary and replica clusters to ensure they are not resource-constrained.
Check Network Latency: Review any network-related issues that might be delaying the application of updates across regions.
Review Replication Configuration: Ensure that replication settings are optimized and that there are no bottlenecks in the way Aurora handles replication across the global clusters.
Amazon Aurora - High Aborted Clients Count
This alert monitors the aborted clients count in an Amazon Aurora database. An aborted client is a client connection to the Aurora database that was terminated unexpectedly because of network issues, incorrect configuration, timeouts, or authentication problems. A high number of aborted client connections may indicate problems with the database's configuration, network instability, or client-side faults, and can affect the availability and performance of your application.
Trigger Condition: The alert is triggered when the aborted clients count exceeds a defined threshold (e.g., 100 aborted connections within 5 minutes). A persistently high aborted client count may indicate an underlying problem that needs immediate attention.
Customization Guidance:
Threshold: Set specific thresholds for the number of aborted clients. For example, a warning alert can be triggered when the count exceeds 10 aborted connections in a 10-minute window, and a critical alert at 15 or more aborted connections in a 10-minute window.
Granularity: Track aborted client connections per Aurora instance or per cluster, depending on your setup. This helps identify whether specific instances are experiencing more aborted connections.
Alert Tiering: Implement multi-level alerting. Use warning alerts for a moderate increase in aborted client connections and critical alerts for a high number of aborted client connections.
Notification Frequency: Configure the alert to notify immediately when the number of aborted clients exceeds the threshold. Set up repeated notifications if the issue persists over a prolonged period to ensure timely resolution.
Action: Upon activation of the alert, take the following steps to investigate and resolve the issue:
Check Network Connectivity: Investigate any network instability or packet loss between the clients and the Aurora database. Network issues are a common cause of aborted client connections.
Review Database Configuration: Verify that the Aurora instance and database settings, such as connection timeouts and max connections, are set so that client connections are not terminated prematurely.
Monitor Application Logs: Look for client-side logs that indicate the reasons for connection termination, such as incorrect credentials, timeouts, or misconfigurations in the application.
Check Resource Utilization: Ensure that the Aurora instance has enough resources (CPU, memory, I/O) to handle client connections. Resource exhaustion can lead to abrupt disconnection of clients.
Review Authentication Issues: Examine whether frequent authentication failures or incorrect credentials are causing client connections to be terminated before they can be established.
Scale Database Resources: If the number of connections to the Aurora cluster exceeds capacity, consider scaling the instance or adjusting the max connection settings to handle the increase.
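Counting aborted connections "in a 10-minute window" can be sketched as follows. This example assumes the metric arrives as a cumulative counter (as the MySQL `Aborted_clients` status variable does), which may not match how your metric source reports it. The function names are hypothetical, and the tiers use the example values from the text: warning above 10, critical at 15 or more.

```python
from typing import List, Tuple

WINDOW_S = 600       # 10-minute window from the example thresholds
WARNING_COUNT = 10   # warning when the count exceeds this
CRITICAL_COUNT = 15  # critical at this count or more


def aborted_in_window(readings: List[Tuple[float, int]], now: float,
                      window_s: float = WINDOW_S) -> int:
    """Sum counter increases inside the trailing window.

    readings: (timestamp, cumulative_counter) pairs sorted by timestamp.
    """
    in_window = [(t, c) for t, c in readings if now - window_s <= t <= now]
    total = 0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter was reset (e.g. an instance restart),
        # so everything counted since the reset is new.
        total += cur - prev if cur >= prev else cur
    return total


def tier(count: int) -> str:
    """Map a windowed count to 'critical', 'warning', or 'ok'."""
    if count >= CRITICAL_COUNT:
        return "critical"
    if count > WARNING_COUNT:
        return "warning"
    return "ok"
```

Working from counter deltas rather than raw values keeps the result meaningful across restarts, at the cost of needing at least two readings inside the window.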
Amazon Aurora - Memory Num Declined Sql Count
This alert monitors the number of memory decline events in an Amazon Aurora database, focusing on the memory used for SQL operations. A memory decline event occurs when the database encounters memory pressure, potentially leading to failed queries, slower response times, or even throttling of operations. Monitoring the count of these events helps identify whether the database is running out of memory to handle SQL workloads effectively.
Trigger Condition: The alert is triggered when the number of memory decline events exceeds a specific count-based threshold (e.g., more than 10 memory decline events in 10 minutes). Frequent memory declines can indicate underlying issues with resource allocation, query optimization, or workload distribution.
Customization Guidance:
Threshold (Count-Based): Set the alert to trigger when the count of memory declines crosses specific thresholds. For example, a warning alert can be triggered when 5 memory decline events are detected in a 10-minute window, and a critical alert when 10 or more memory decline events occur within a 10-minute window.
Granularity: Track memory decline events per Aurora instance by db_instance_identifier. This helps identify which Aurora instances are experiencing more frequent memory issues.
Alert Tiering: Implement tiered alerts based on the number of memory decline events. Use a warning alert when the number of decline events reaches 5 within 5 minutes, indicating that the instance may be approaching memory exhaustion, and a critical alert when the number exceeds 10 within 5 minutes, signaling a significant memory issue that may affect database performance.
Notification Frequency: Configure notifications to trigger when the count of memory decline events exceeds the threshold, and to repeat if the count continues to rise over time.
Action: Upon activation of the alert, take the following steps to investigate and mitigate memory issues:
Monitor SQL Workload Patterns: Investigate the types of queries or operations that may be consuming memory. Focus on optimizing large or complex queries that consume significant memory resources.
Check Resource Allocation: Review the current memory configuration of the Aurora instance. Ensure that sufficient memory is allocated to handle the workload. If necessary, consider scaling the instance to provide additional memory.
Analyze Memory Usage Trends: Examine memory usage trends over time to determine whether the memory pressure is part of a growing pattern, which may require optimization or resource scaling.
Optimize Queries: Check for inefficient queries or workloads that may cause memory spikes. Optimize those queries to reduce their memory footprint.
Scale Resources: If the workload has increased beyond the capacity of the current instance, consider scaling up the instance to provide more memory for SQL operations.
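The per-instance granularity described above can be sketched in Python as a grouped count. This is only an illustration: the event shape `(timestamp, db_instance_identifier)` and the function name `tier_by_instance` are assumptions made for this example, and the defaults follow the count-based example in the text (warning at 5 events, critical at 10 or more, in a 10-minute window).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tier_by_instance(events: Iterable[Tuple[float, str]], now: float,
                     window_s: float = 600.0,
                     warning_at: int = 5,
                     critical_at: int = 10) -> Dict[str, str]:
    """Return the alert tier for each instance that crosses a threshold.

    events: (timestamp, db_instance_identifier) pairs, one per memory
    decline event. Instances below the warning threshold are omitted.
    """
    # Count only the events that fall inside the trailing window.
    counts = Counter(
        instance for ts, instance in events if now - window_s < ts <= now
    )
    tiers: Dict[str, str] = {}
    for instance, n in counts.items():
        if n >= critical_at:
            tiers[instance] = "critical"
        elif n >= warning_at:
            tiers[instance] = "warning"
    return tiers
```

Grouping before thresholding means one noisy instance cannot hide behind a healthy fleet-wide average, which is the point of tracking by db_instance_identifier.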
Amazon Aurora - Replica Lag Higher Than 1 seconds
This alert monitors replication lag in Amazon Aurora replicas. Replica lag occurs when there is a delay in applying updates from the primary (writer) instance to the replica (reader) instances. High replica lag leaves stale data on the replica instances, which can impact read-heavy applications that rely on up-to-date data from the replicas.
Trigger Condition: The alert is triggered when the Aurora replica lag exceeds a specified threshold (e.g., more than 1 second) for a defined period (e.g., 5 minutes). Persistent replica lag may indicate issues with resource allocation, networking, or high write activity on the primary instance.
Customization Guidance:
Threshold: Set the alert to trigger when replication lag crosses specific thresholds. For example, a warning alert can be triggered when replication lag exceeds 1 second, and a critical alert when replication lag exceeds 5 seconds for more than 5 minutes.
Granularity: Customize the alert to track replication lag for specific replica instances by db_instance_identifier. This lets you pinpoint the replica instances experiencing higher lag.
Alert Tiering: Implement multi-level alerts. Use a warning alert when replication lag is elevated but still manageable (e.g., 1-5 seconds), and a critical alert when replication lag becomes significant (e.g., more than 5 seconds for an extended period), indicating potential performance degradation.
Notification Frequency: Configure notifications to trigger immediately when replication lag crosses the threshold, and to repeat if the lag persists, to ensure timely action.
Action: Upon activation of the alert, take the following steps to investigate and resolve the cause of replication lag:
Check Write Activity on the Primary Instance: Monitor the write load on the primary Aurora instance. High write throughput increases the amount of data that needs to be replicated, leading to increased lag on replicas.
Analyze Network Performance: Review the network connectivity between the primary and replica instances to ensure that no latency issues are delaying data replication.
Monitor Resource Utilization: Check resource usage (CPU, memory, I/O) on both the primary and replica instances. If resources are constrained, the replica may struggle to keep up with changes from the primary instance.
Optimize Query and Workload Distribution: Ensure that queries are efficiently distributed between the primary and replica instances, especially for read-heavy workloads. Offload read operations to the replicas to reduce the load on the primary instance and reduce replication lag.
Consider Scaling: If the Aurora replica lag is consistently high due to increased workload, consider scaling up the Aurora instance or adding more replicas to distribute the read load and replication.
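The "notify immediately, then repeat while the lag persists" behavior can be sketched as a small state machine. This is a hypothetical illustration, not how Coralogix schedules notifications; the class name `RepeatNotifier` and the 5-minute repeat interval are assumptions, while the 1-second threshold comes from the warning example in the text.

```python
from typing import Optional


class RepeatNotifier:
    """Decide when to send a notification for a lag breach."""

    def __init__(self, threshold_s: float = 1.0, repeat_every_s: float = 300.0):
        self.threshold_s = threshold_s        # lag above this is a breach
        self.repeat_every_s = repeat_every_s  # assumed repeat interval
        self._last_sent: Optional[float] = None

    def should_notify(self, now: float, lag_seconds: float) -> bool:
        if lag_seconds <= self.threshold_s:
            # Recovered: forget the last send so the next breach
            # notifies immediately.
            self._last_sent = None
            return False
        if self._last_sent is None or now - self._last_sent >= self.repeat_every_s:
            self._last_sent = now
            return True
        return False
```

For example, with the defaults a breach at t=0 notifies at once, stays quiet at t=60, and notifies again at t=300 if the lag is still above the threshold.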
Integration
Learn more about Coralogix's out-of-the-box integration with AWS Aurora in our documentation.