Service Level Objectives
Note
SLOs and SLO alerts are in Beta mode. Features may change, and some functionality may be limited.
Overview
Service Level Objectives (SLOs) help you define clear, measurable performance targets for any object that emits metrics—whether it's a service, application, infrastructure component, or custom-defined entity. These targets can be based on key indicators such as availability, latency, error rates, or other relevant metrics.
Once your SLOs are in place, you can monitor them in real time using dashboards and alerts that surface potential issues early. Over time, you can track how well each object is meeting its targets, with full visibility into error budgets and burn rates to support data-driven reliability decisions.
Use SLOs for:
- Proactive issue detection: SLOs act as benchmarks for measuring whether a service or any metric-generating object is performing well or degrading. By monitoring the metrics that contribute to SLOs, identify early signs of degradation or failure and take corrective action before issues escalate.
- Error budget tracking: Track error budgets to determine whether failures stay within the limits the SLO allows. Use them to decide how much downtime, degraded performance, or error rate is acceptable before corrective action is necessary.
- Automated alerts and remediation: Trigger alerts when SLOs are at risk or breached to enable teams to proactively resolve issues before they affect end-users, reducing downtime and ensuring services consistently meet customer expectations.
Core concepts
A service level indicator (SLI) is the measured value of an object’s performance, such as request success rate, response time, or uptime. Comparing the SLI with the SLO target shows, in real time, whether the object is compliant.
A service level objective (SLO) defines the target threshold for an SLI. It communicates reliability goals by setting a clear expectation—for example, “99.9% success rate over 30 days.”
An SLO group is a subset of an SLO, defined by a unique combination of label values (using group-by logic). Each group independently tracks its own compliance and budget consumption.
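To make the group-by idea concrete, here is a minimal sketch that splits labeled events into groups and computes each group's SLI and compliance independently. The event shape, the label names (`service`, `region`), and the function `group_compliance` are hypothetical illustrations, not the Coralogix data model.

```python
from collections import defaultdict


def group_compliance(events, group_by, target):
    """Return {group_key: (sli, meets_target)} for each unique label combination.

    Hypothetical event shape: a dict of label values plus a boolean "good" flag.
    """
    counts = defaultdict(lambda: [0, 0])  # group_key -> [good, total]
    for event in events:
        # One group per unique combination of the group-by label values.
        key = tuple(event[label] for label in group_by)
        counts[key][1] += 1
        if event["good"]:
            counts[key][0] += 1
    result = {}
    for key, (good, total) in counts.items():
        sli = good / total
        result[key] = (sli, sli >= target)
    return result


events = [
    {"service": "checkout", "region": "eu", "good": True},
    {"service": "checkout", "region": "eu", "good": True},
    {"service": "checkout", "region": "us", "good": True},
    {"service": "checkout", "region": "us", "good": False},
]
result = group_compliance(events, ["region"], target=0.99)
print(result)  # {('eu',): (1.0, True), ('us',): (0.5, False)}
```

Grouping by `region` yields two groups: the `eu` group is compliant while the `us` group is not, even though both belong to the same SLO.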
An error budget is the acceptable margin of failure for an SLO. For example, an SLO with a 99% availability target over a 7-day time frame allows a 1% failure rate over a rolling 7-day window. Once the budget is depleted, the SLO is breached.
The remaining budget is the portion of the error budget that has not yet been consumed by bad events, expressed as a percentage of the total error budget.
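The arithmetic behind the error budget and remaining budget can be sketched as follows. This is an illustration of the general idea, assuming the budget is simply `total * (1 - target)`; the function names are hypothetical, and Coralogix's exact computation may differ.

```python
def error_budget(total, target):
    """Number of bad units (events or time windows) the SLO allows over its time frame."""
    return total * (1 - target)


def remaining_budget_pct(bad, total, target):
    """Share of the error budget not yet consumed, as a percentage (floored at 0)."""
    budget = error_budget(total, target)
    if budget == 0:
        return 0.0
    return max(0.0, (budget - bad) / budget * 100)


# Event-based: 100,000 requests at a 99% target allow about 1,000 bad requests.
print(round(error_budget(100_000, 0.99), 1))  # 1000.0
# After 250 bad requests, about 75% of the budget remains.
print(round(remaining_budget_pct(250, 100_000, 0.99), 1))  # 75.0
# 7 days of 1-minute windows (10,080) at 99% allow about 100.8 bad minutes.
print(round(error_budget(7 * 24 * 60, 0.99), 1))  # 100.8
```

Results are rounded because `1 - 0.99` is not exactly `0.01` in floating point.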
An SLO time frame is the entire duration over which the SLO is defined and evaluated.
A rolling time window is a continuously updated time frame (for example, past 1 hour) used for evaluating metrics like burn rate. It ensures responsiveness to recent changes.
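A rolling window can be pictured as a queue of timestamped samples from which old entries are dropped as time advances. The sketch below assumes samples arrive in timestamp order and carry good/total counts; the class `RollingWindow` is a hypothetical illustration, not part of any Coralogix API.

```python
from collections import deque


class RollingWindow:
    """Keeps (timestamp, good, total) samples from the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()

    def add(self, timestamp, good, total):
        # Assumes samples are added in non-decreasing timestamp order.
        self.samples.append((timestamp, good, total))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop samples that fall outside the window (now - window_seconds, now].
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()

    def success_ratio(self):
        good = sum(s[1] for s in self.samples)
        total = sum(s[2] for s in self.samples)
        # None signals "no data in the window"; how no-data is handled is an assumption.
        return good / total if total else None


w = RollingWindow(window_seconds=3600)  # "past 1 hour"
w.add(0, 90, 100)
w.add(1800, 100, 100)
w.add(3700, 50, 100)  # the sample at t=0 is now older than 1 hour and is evicted
print(w.success_ratio())  # 0.75
```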
SLO types
Event-based SLOs track the success rate of individual requests. An event-based SLO is based on an SLI defined as the ratio of successful (good) events to the total number of events. The SLO is met when this ratio meets or exceeds the target objective over a specified compliance period.
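The event-based SLI ratio and the "meets or exceeds" check can be sketched in a few lines. The function name `event_based_slo` is hypothetical.

```python
def event_based_slo(good_events, total_events, target):
    """SLI = good / total; the SLO is met when the SLI meets or exceeds the target."""
    if total_events == 0:
        raise ValueError("no events in the compliance period")
    sli = good_events / total_events
    return sli, sli >= target


sli, met = event_based_slo(good_events=99_950, total_events=100_000, target=0.999)
print(f"SLI={sli:.4%} met={met}")  # SLI=99.9500% met=True
```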
Time window SLOs evaluate performance over fixed intervals. This SLO type is best suited for use cases that prioritize reliability across consistent, minute-by-minute time windows. It ensures that quality remains steady throughout the entire SLO period by evaluating performance in each discrete time window. Any time window that fails to meet the defined criteria is treated as a failure, thereby contributing to the depletion of the SLO error budget.
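The time window approach can be sketched as follows: each interval is judged good or bad on its own, and the SLO compares the share of good windows with the target. This sketch assumes a window is good when its success ratio meets a per-window threshold, and it treats windows with no traffic as good. Both are illustrative assumptions, since the actual per-window criteria depend on how the SLO is defined. The function name is hypothetical.

```python
def time_window_slo(windows, window_threshold, target):
    """
    windows: list of (good, total) counts, one per fixed interval (e.g. one minute).
    A window fails when its success ratio is below window_threshold; each failed
    window consumes error budget.
    Returns (share_of_good_windows, slo_met, failed_windows).
    """
    if not windows:
        raise ValueError("no windows to evaluate")
    failed = 0
    for good, total in windows:
        # Assumption: a window with no traffic is treated as good.
        if total and good / total < window_threshold:
            failed += 1
    share_good = (len(windows) - failed) / len(windows)
    return share_good, share_good >= target, failed


# One day of 1-minute windows: 1,430 healthy minutes and 10 degraded ones.
day = [(100, 100)] * 1430 + [(90, 100)] * 10
share, met, failed = time_window_slo(day, window_threshold=0.95, target=0.99)
print(round(share, 4), met, failed)  # 0.9931 True 10
```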
SLO alerts
Define notifications that help monitor and maintain a service's reliability by tracking whether the service is meeting its predefined SLOs. SLO alerts notify teams when service performance deviates from the agreed-upon targets for availability, response times, error rates, or other key metrics.
Error budget
Error budget alerts notify you when the remaining error budget drops to or below a percentage you define. You can configure alerts at specific thresholds — such as warning at 50% remaining and critical at 10% — to catch issues early and avoid breaching your SLO.
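The threshold logic described above, including the "drops to or below" comparison, can be sketched as a small severity function. The defaults mirror the 50% warning and 10% critical example; the function name is hypothetical.

```python
def error_budget_alert(remaining_pct, warning_pct=50.0, critical_pct=10.0):
    """Return the alert severity for a remaining-budget percentage, or None."""
    # Check the stricter threshold first so a low budget reports "critical".
    if remaining_pct <= critical_pct:
        return "critical"
    if remaining_pct <= warning_pct:
        return "warning"
    return None


for pct in (75.0, 50.0, 30.0, 10.0):
    print(pct, error_budget_alert(pct))
# 75.0 None
# 50.0 warning
# 30.0 warning
# 10.0 critical
```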
Burn rate
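Burn rate alerts detect when the error budget is being consumed faster than the pace that would let it last the full compliance window. A burn rate of 1 consumes the budget exactly over the compliance window; higher values exhaust it sooner. When a burn rate threshold is exceeded, an alert is triggered.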
Burn rate alerts support two detection modes:
- Single Mode: Monitors the burn rate over the evaluation window you define. Fires when the budget is being depleted faster than the pace that would let it last the compliance window.
- Dual Mode: Adds a rapid quality signal on top of the slow burn check. Detects sudden collapses in service quality evaluated over a fixed 10-minute window, alongside the burn rate check.
You can define a burn rate multiplier that triggers an alert when the observed burn rate exceeds a specified multiple of the sustainable rate. For example, a multiplier of 2 fires when the budget is burning twice as fast as it should to last through the compliance window.
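A common way to express burn rate is the observed error rate divided by the error rate the SLO allows (`1 - target`). The sketch below uses that definition to show how a multiplier check works in Single Mode. It is an illustration under that assumption, not Coralogix's implementation, and the function names are hypothetical.

```python
def burn_rate(bad_events, total_events, target):
    """Observed error rate divided by the allowed error rate (1 - target).

    1.0 means the budget lasts exactly through the compliance window.
    """
    if total_events == 0:
        return 0.0
    allowed_error_rate = 1 - target
    return (bad_events / total_events) / allowed_error_rate


def burn_rate_alert(bad_events, total_events, target, multiplier):
    """Fire when the burn rate exceeds `multiplier` times the sustainable rate."""
    return burn_rate(bad_events, total_events, target) > multiplier


# 30 bad requests out of 10,000 in the evaluation window, 99.9% target.
rate = burn_rate(30, 10_000, 0.999)
print(round(rate, 2))  # 3.0
print(burn_rate_alert(30, 10_000, 0.999, multiplier=2))  # True
# At a burn rate of 3, a 30-day budget would be exhausted in about 30 / 3 = 10 days.
```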
Create SLOs
Create event-based and time window SLOs to track and maintain your service's reliability based on user-centric performance goals.
Manage SLOs
View and manage your SLOs in Coralogix’s SLO Center to track SLO compliance and ensure each object’s performance stays within its allocated error budget.
Monitor service performance with SLO alerts
Monitor and maintain a stable error budget by configuring SLO alerts that notify you when performance or error rates exceed defined thresholds.
Resources
This guide provides a comprehensive walkthrough of Coralogix SLOs. Here is what you will find:
- Create SLOs:
  - Safe Use of Recording Rule-Based Metrics
- View and Manage SLOs
- Configure SLO Alerts
- Permissions
- Using the Underlying SLO Metrics
- SLO Management API
- Infrastructure as Code
