[Live Webinar] Unlocking real-time AI Observability with Coralogix's AI Center

Register Now

Quick Start Observability for AWS Backup

thank you

Thank you!

We got your information.

AWS Backup
AWS Backup icon

Coralogix Extension For AWS Backup Includes:

Dashboards - 1

Gain instantaneous visualization of all your AWS Backup data.

AWS Backup
AWS Backup

Alerts - 10

Stay on top of AWS Backup key performance metrics. Keep everyone in the know with integration with Slack, PagerDuty and more.

High Number of Backup Jobs Pending

This alert monitors the number of pending backup jobs in AWS Backup. A high number of pending jobs can indicate resource constraints, misconfigurations, or performance bottlenecks, potentially delaying critical backups and increasing the risk of data loss. The alert is triggered when the number of backup jobs in a pending state exceeds a defined threshold (e.g., 10 jobs) for more than 10 minutes. Monitoring this metric helps ensure timely execution of backup jobs, maintaining data protection and compliance with recovery objectives. Customization Guidance: - Threshold: Adjust the threshold based on the expected volume of backup jobs and your tolerance for delays in backup execution. - Monitoring Period: Configure the monitoring period to reflect your backup schedules and peak activity times. - Notification Frequency: Set notification intervals to provide timely responses while avoiding alert fatigue during transient spikes in pending jobs. Action: If this alert is triggered, investigate the status of AWS Backup resources, verify backup policies for misconfigurations, and ensure sufficient compute, network, or storage resources are available to process pending jobs. Consider staggering backup schedules or increasing resource capacity to address bottlenecks.

High Number of Running Backup Jobs

This alert monitors the number of running backup jobs in AWS Backup. A high number of running jobs may indicate heavy backup activity, potential resource contention, or inefficiencies in job execution, which could impact backup performance and other workloads. The alert is triggered when the number of running backup jobs exceeds a defined threshold (e.g., 20 jobs) for more than 10 minutes. Monitoring this metric helps ensure efficient backup operations and prevents resource contention that could affect system performance. Customization Guidance: - Threshold: Adjust the threshold based on your environment’s typical backup workload and resource capacity. - Monitoring Period: Set the monitoring period to align with your backup schedules and expected activity levels. - Notification Frequency: Configure notifications to balance responsiveness with minimizing alert fatigue during high-activity periods. Action: If this alert is triggered, review active backup jobs for inefficiencies or overlapping schedules, ensure adequate resources are available to handle the backup load, and consider optimizing or staggering backup schedules to reduce contention. Investigate if certain jobs are taking longer than expected and address potential bottlenecks.

Backup Jobs Failed

This alert monitors failed backup jobs in AWS Backup. Backup job failures can indicate issues such as resource unavailability, misconfigurations, or connectivity problems, potentially compromising data protection and recovery capabilities. The alert is triggered when the number of failed backup jobs exceeds a defined threshold (e.g., 5 failures) within a 10-minute period. Monitoring this metric helps ensure the reliability of backup processes, which is critical for maintaining data integrity and meeting recovery objectives. Customization Guidance: - Threshold: Adjust the failure count threshold based on your tolerance for transient issues and the criticality of the data being backed up. - Monitoring Period: Configure the monitoring period to align with your backup schedules and expected frequency of jobs. - Notification Frequency: Set notification intervals to ensure timely responses while minimizing alert fatigue during intermittent failures. Action: If this alert is triggered, review error messages and logs for failed jobs, verify backup configurations (e.g., IAM permissions, destination settings), and ensure resources (e.g., storage, network) are available. Reattempt failed backups where appropriate and address underlying issues to prevent recurring failures.

No Backup Jobs Completed

This alert monitors the completion of backup jobs in AWS Backup, specifically detecting when no backup jobs have been successfully completed within a given period. A lack of completed backup jobs may indicate systemic issues, such as resource unavailability, misconfigurations, or service disruptions, potentially compromising data protection. The alert is triggered when no backup jobs are completed within a defined period (e.g., 30 minute). Monitoring this metric helps ensure backups are completed as scheduled, maintaining data protection and compliance with recovery objectives. Customization Guidance: - Threshold: Adjust the time window based on your backup schedules and acceptable tolerance for periods without completed jobs. - Monitoring Period: Configure the monitoring period to reflect the frequency of your backup jobs. - Notification Frequency: Set notification intervals to balance timely responses with avoiding unnecessary alerts during low activity periods. Action: If this alert is triggered, check for system-wide issues affecting AWS Backup, verify that backup schedules are active and correctly configured, and ensure sufficient resources (e.g., storage, network) are available. Investigate logs for errors or delays and re-initiate backup jobs as necessary.

Recovery Points Expired

This alert monitors the expiration of recovery points in AWS Backup. Expired recovery points indicate that backup data has exceeded its retention period and is no longer available for recovery, which can affect compliance with data retention policies or recovery objectives. The alert is triggered when the number of expired recovery points exceeds a defined threshold (e.g., 10 recovery points) within a 24-hour period. Monitoring this metric helps ensure alignment with retention policies and verifies that critical data remains available for recovery when needed. Customization Guidance: - Threshold: Adjust the threshold based on the criticality of the recovery points and your organization’s data retention requirements. - Monitoring Period: Set the monitoring period to align with the lifecycle policies of your backup data. - Notification Frequency: Configure notification intervals to provide timely awareness of significant expirations without overwhelming users with frequent alerts. Action: If this alert is triggered, review lifecycle policies and retention settings for backups, confirm that critical data has not been unintentionally deleted, and adjust retention periods as needed. Ensure that new backups are created and available to replace expired recovery points if necessary.

Partial Recovery Points Detected

This alert monitors the creation of partial recovery points in AWS Backup. Partial recovery points indicate incomplete or failed backup operations, which could compromise data integrity and recovery reliability. The alert is triggered when the number of partial recovery points exceeds a defined threshold (e.g., 5 partial recovery points) within a 24-hour period. Monitoring this metric helps ensure the completeness and reliability of backup data, which is critical for meeting recovery objectives and maintaining data integrity. Customization Guidance: - Threshold: Adjust the threshold based on your environment's backup volume and tolerance for incomplete backups. - Monitoring Period: Set the monitoring period to match the frequency and schedule of backup jobs. - Notification Frequency: Configure notification intervals to ensure timely awareness of partial backups without causing alert fatigue. Action: If this alert is triggered, investigate logs and error messages associated with the partial recovery points, verify resource availability (e.g., storage, network), and check for configuration issues in backup policies. Reattempt incomplete backups and address underlying issues to prevent future occurrences.

Recovery Points Deletion Delayed

This alert monitors delays in the deletion of recovery points in AWS Backup. Delayed deletions may indicate issues with lifecycle policy enforcement, resource constraints, or system performance, potentially affecting storage utilization and compliance with retention policies. The alert is triggered when the deletion of recovery points is delayed beyond a defined threshold (e.g., 24 hours) for more than 0 recovery points. Monitoring this metric helps ensure proper enforcement of lifecycle policies and efficient management of backup storage resources. Customization Guidance: - Threshold: Adjust the threshold based on your organization’s retention policies and tolerance for delays in recovery point deletions. - Monitoring Period: Configure the monitoring period to align with your backup lifecycle schedules. - Notification Frequency: Set notification intervals to balance timely responses with minimizing alert fatigue for transient delays. Action: If this alert is triggered, review lifecycle policies and ensure they are correctly configured and applied. Check for system performance or resource issues that might delay deletions, and investigate logs for errors related to deletion tasks. Consider scaling resources or contacting AWS support if delays persist.

Recovery Points Not Created

This alert monitors the creation of recovery points in AWS Backup, specifically detecting when expected recovery points are not being created. A failure to create recovery points may indicate misconfigurations, insufficient resources, or system issues, potentially leaving critical data unprotected. The alert is triggered when no recovery points are created within a defined period (e.g., 30 minute) despite scheduled backup jobs. Monitoring this metric helps ensure backups are successfully completed, maintaining data protection and compliance with recovery policies. Customization Guidance: - Threshold: Adjust the time window based on your backup schedules and tolerance for delays in recovery point creation. - Monitoring Period: Set the monitoring period to reflect the frequency of your backup jobs and recovery point creation. - Notification Frequency: Configure notification intervals to provide timely responses while avoiding unnecessary alerts during low activity periods. Action: If this alert is triggered, verify the backup schedules and policies to ensure they are active and correctly configured. Check for resource availability (e.g., storage, network) and investigate logs for errors or delays. Address any identified issues and ensure backups are re-initiated as necessary.

Partial Recovery Points Detected

This alert monitors the occurrence of partial recovery points in AWS Backup. Partial recovery points indicate that backup operations were not fully successful, potentially compromising the integrity of the backup data and the ability to recover it reliably. The alert is triggered when the number of partial recovery points exceeds a defined threshold (e.g., 3 partial recovery points) within a 30 minute period. Monitoring this metric helps ensure that backup operations are completed successfully, maintaining the reliability of recovery processes and compliance with data protection objectives. Customization Guidance: - Threshold: Adjust the threshold based on your environment's backup volume and acceptable tolerance for incomplete backups. - Monitoring Period: Set the monitoring period to reflect the schedule and frequency of your backup jobs. - Notification Frequency: Configure notification intervals to ensure timely responses without overwhelming users with excessive alerts. Action: If this alert is triggered, review the logs associated with the partial recovery points to identify the root cause, such as resource unavailability or configuration issues. Verify the backup policies and ensure that necessary resources (e.g., storage, compute) are available. Reattempt the

Restore Jobs Failed

This alert monitors failed restore jobs in AWS Backup. Restore job failures may indicate issues such as corrupted backups, misconfigurations, or resource unavailability, potentially preventing the recovery of critical data and impacting business continuity. The alert is triggered when the number of failed restore jobs exceeds a defined threshold (e.g., 3 failures) within a 10-minute period. Monitoring this metric helps ensure the reliability of restore operations, which is critical for meeting recovery objectives and minimizing downtime during incidents. Customization Guidance: - Threshold: Adjust the failure count threshold based on your organization’s tolerance for transient restore issues and the criticality of the data being restored. - Monitoring Period: Configure the monitoring period to align with your expected frequency of restore operations. - Notification Frequency: Set notification intervals to ensure timely responses while avoiding unnecessary alerts for isolated failures. Action: If this alert is triggered, review the error logs and messages associated with the failed restore jobs to determine the cause. Verify the integrity of the backup data, ensure the availability of required resources (e.g., storage, compute), and check the restore configurations (e.g., permissions, destinations). Reattempt the restores where appropriate and address underlying issues to prevent further failures.

Integration

Learn more about Coralogix's out-of-the-box integration with AWS Backup in our documentation.

Read More
Schedule Demo

Enterprise-Grade Solution