
Create Event-Based SLOs

Event-based SLOs measure reliability as the ratio of successful (good) events to total events, making them ideal for high-traffic systems where every request matters.

They’re especially useful when individual outcomes matter, such as successful requests or completed transactions, because even small error rates can significantly affect user experience.

Overview

An event-based SLO is based on an SLI defined as the ratio of successful (good) events to the total number of events. The SLO is met when this ratio meets or exceeds the target objective over a specified compliance period.

Note

The metrics used in the queries under good and total events can either be sent directly to Coralogix or derived from logs and spans using Event2Metrics.

SLO components

When defining an event-based SLO, you must specify:

  • A good events query that defines what counts as a successful event
  • A total events query representing the full population of events to measure
  • A time frame against which the SLO compliance is evaluated (e.g., 7 days)
  • A target percentage of successful events (e.g., 99.9%)

This structure ensures that your SLO tracks meaningful outcomes that are aligned with entity goals and user experience.

Good events

The good events query defines what constitutes a successful or healthy event. These events are a subset of total events that meet your success criteria (e.g., fast response time, no errors).

This query determines the numerator in your SLO calculation.

For example:

  • HTTP responses with a status code of 200 or 204.
  • Transactions completed in under 2 seconds.
  • Requests not containing error messages or exceptions.

Total events

The total events query defines the full set of events that the SLO should measure. This query determines the denominator in your SLI calculation.

For example:

  • All requests to a given service endpoint.
  • Transactions excluding health checks or irrelevant noise.
  • Log entries that represent meaningful user actions.

SLI calculation

To calculate the SLI, use two PromQL queries: one for counting good events and one for total events. Then use the following formula:

\[ \text{SLI} = \frac{\text{Good events}}{\text{Total events}} \quad \text{(measured over the SLO time frame)} \]


For example, if your system logs 9,800 successful responses out of 10,000 total requests:

\[ \text{SLI} = \frac{9800}{10000} = 0.98\ (\text{or } 98\%) \]

This value is compared against your SLO target (e.g., 99%) to determine compliance. In this example, an SLI of 98% falls short of a 99% target, so the SLO is not met.
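The calculation and the compliance check can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Coralogix computes the SLI internally; the function names sli and meets_objective are hypothetical.

```python
def sli(good_events: int, total_events: int) -> float:
    """Return the SLI: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return good_events / total_events


def meets_objective(good_events: int, total_events: int, target: float) -> bool:
    """True when the SLI meets or exceeds the target (0.99 means 99%)."""
    return sli(good_events, total_events) >= target


# Worked example from the text: 9,800 good responses out of 10,000 requests.
print(sli(9800, 10000))                    # 0.98
print(meets_objective(9800, 10000, 0.99))  # False
```

The guard clauses mirror the requirement that good events are a subset of total events: a ratio above 1 would indicate a query problem rather than exceptional reliability.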

SLO setup

To create a new Service Level Objective (SLO), go to APM > SLO Center and click Create SLO. Define the SLO details.
| Field | Description |
| --- | --- |
| Name | The unique name identifying your SLO. |
| Owner | Ownership defaults to the SLO creator. |
| Entity labels | (Optional) Metadata used for filtering SLOs. |
| Description | (Optional) Additional context or purpose of the SLO to clarify its scope and intent for other users. |

Select the SLO type

Select the event-based SLO type.

Query configuration

Configure an event-based SLO by defining PromQL queries for good events and total events. These queries determine the percentage of successful events over time and are used to evaluate whether the SLO objective is being met.

Grouping SLOs

You can apply grouping to event-based SLOs by using group by in your queries. This creates a grouped SLO, where each unique value in the grouping field (e.g., service name) results in an individual SLO tracked under the same definition.

When grouping is applied, each combination of grouping values becomes a distinct SLO instance with its own compliance evaluation, alerts, and burn rate tracking. For example, if you have 100 services and define an SLO grouped by service_name, the SLO Center displays a single high-level SLO definition. Clicking into it reveals a detailed drilldown with 100 individual SLOs, one per service, each monitored independently.

This approach is especially useful for managing large-scale systems with shared SLO logic but multiple units of accountability, such as services, teams, or environments.
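To make the "one definition, many independent instances" idea concrete, here is a minimal Python sketch that evaluates each group separately against a shared target. The service names and counts are hypothetical, and treating a group with no events as not met is an assumption of this sketch, not documented Coralogix behavior.

```python
from typing import Dict, Tuple

# Hypothetical per-group counts keyed by the grouping label value
# (service_name here): group -> (good events, total events).
GroupCounts = Dict[str, Tuple[int, int]]


def evaluate_groups(counts: GroupCounts, target: float) -> Dict[str, bool]:
    """Evaluate every group as an independent SLO instance."""
    results: Dict[str, bool] = {}
    for group, (good, total) in counts.items():
        # Assumption for this sketch: a group with no events is reported as not met.
        results[group] = total > 0 and good / total >= target
    return results


counts = {
    "checkout": (9_995, 10_000),  # SLI 99.95%
    "search": (4_900, 5_000),     # SLI 98.00%
}
print(evaluate_groups(counts, 0.999))  # {'checkout': True, 'search': False}
```

The failing search group does not change the result for checkout, which is why each grouped instance can carry its own alerts and burn rate.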

Query examples

The following examples show how to define good and total event queries for event-based SLOs. Each query uses group by to evaluate reliability across specific parts of your system, such as services, routes, or infrastructure instances. Grouping enables more targeted monitoring, alerting, and troubleshooting.

Example: HTTP status codes grouped by service

This example measures request success rates for each service by excluding calls whose status code is STATUS_CODE_ERROR.

Good events

sum(increase(calls_total{status_code != "STATUS_CODE_ERROR"}[1m])) by (service_name)

Total events

sum(increase(calls_total[1m])) by (service_name)

This query pair measures request success by filtering out error codes. Grouping by service_name creates separate SLOs for each service, allowing you to monitor performance across your architecture.

Example: HTTP requests grouped by service and method

This example tracks successful versus total HTTP requests for each unique combination of service and HTTP method.

Good events

sum by (service_name, http_method) (
  http_requests_total{status_code!~"5.."}
)

Total events

sum by (service_name, http_method) (
  http_requests_total
)
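The good events query relies on the regex matcher status_code!~"5..". PromQL regex matchers are fully anchored, so the pattern must match the entire label value. The Python sketch below mirrors that behavior with re.fullmatch to show which codes this query pair counts as good; the function is_good is hypothetical and only models the label filter, not the query itself.

```python
import re

# PromQL regex label matchers are fully anchored, so "5.." must match the whole
# label value. re.fullmatch reproduces that; re.search would not.
SERVER_ERROR = re.compile(r"5..")


def is_good(status_code: str) -> bool:
    """Mirror of the matcher status_code!~"5.." applied to one label value."""
    return SERVER_ERROR.fullmatch(status_code) is None


for code in ["200", "204", "404", "500", "503"]:
    print(code, "good" if is_good(code) else "bad")
```

Note that 404 counts as a good event here: this query pair treats only 5xx server errors as failures, so client errors do not consume the error budget.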

Example: Redis commands grouped by host, db, and command

This example monitors Redis command reliability by grouping total and error-free command counts by instance, database, and command.

Good events

sum (
  redis_commands_processed_total{result!="err"}
) by (instance, redis_db, cmd)

Total events

sum (
  redis_commands_processed_total
) by (instance, redis_db, cmd)

Example: Postgres commits and total transactions grouped by database

This example compares committed transactions (good events) against the total number of transaction attempts (commits + rollbacks) in each Postgres database. It demonstrates how to combine multiple metrics into a single total query using a consistent aggregation (group by).

Good events

sum by (datname) (pg_stat_database_xact_commit_total)

Total events

sum by (datname) (
  pg_stat_database_xact_commit_total +
  pg_stat_database_xact_rollback_total
)
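The total events query adds two counters with +, which in PromQL pairs series one-to-one on identical label sets (ignoring the metric name) before sum by (datname) aggregates them. The Python sketch below models that matching and aggregation under simplifying assumptions (single instant, no on/ignoring modifiers); the helper names and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

# A series is identified by its label set (metric name already dropped).
Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def labels(**pairs: str) -> Labels:
    return frozenset(pairs.items())


def add(a: Vector, b: Vector) -> Vector:
    """One-to-one a + b: only label sets present on both sides survive."""
    return {ls: a[ls] + b[ls] for ls in a.keys() & b.keys()}


def sum_by(vector: Vector, label: str) -> Dict[str, float]:
    """sum by (label) (...): series without the label share the "" group."""
    out: Dict[str, float] = defaultdict(float)
    for ls, value in vector.items():
        out[dict(ls).get(label, "")] += value
    return dict(out)


# Hypothetical sample values for one database on two instances.
commits = {labels(datname="orders", instance="db-1"): 9_990.0,
           labels(datname="orders", instance="db-2"): 5_000.0}
rollbacks = {labels(datname="orders", instance="db-1"): 10.0,
             labels(datname="orders", instance="db-2"): 0.0}

good = sum_by(commits, "datname")                   # {'orders': 14990.0}
total = sum_by(add(commits, rollbacks), "datname")  # {'orders': 15000.0}
print(good["orders"] / total["orders"])
```

One consequence of one-to-one matching: if an instance exposes commits but no rollback series, it drops out of the total while still contributing to good events, which can push good events above total events.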

Query validation

During SLO creation, the system performs built-in validations. If any validation fails, an error message is displayed, and you must adjust your queries before proceeding. When queries are valid, Good events and Total events are visualized in the Preview section.
| Validation type | Description | Troubleshooting |
| --- | --- | --- |
| Invalid query syntax | Query contains a syntax error that prevents it from running. | Check for typos, missing brackets, or invalid operators. Use autocomplete to correct issues. |
| Good events exceed total events | Good events query returns more results than the total events query. | Make sure the good events query is a subset of the total events. Check for incorrect filters or overly broad logic. |
| Excessive query cardinality | Query includes too many unique time series and exceeds system limits. | Add filters or reduce grouping dimensions. If needed, split the SLO into multiple smaller ones. |
| Grouping mismatch | Good and total events use different grouping fields. | Ensure both queries use the same group_by dimensions. Compare fields side by side to align them. |
| Recording rule detected | Metric based on a recording rule. | To ensure reliable SLO results, add a 2m offset so that the data has time to fully update. |
| SLO group limitation | Users may create SLOs with a maximum of 10,000 groups. If you modify an SLO and it produces more than 10,000 groups, the system will stop tracking the SLO. | Reduce the number of group-by values or add filters to your SLO query to exclude groups. |
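Several of these checks can be reproduced locally before creating the SLO. The following Python sketch is a hypothetical pre-flight check over already-aggregated query results, not the validation Coralogix runs; the 10,000 group limit comes from the table, while the data shapes and function name are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

MAX_GROUPS = 10_000  # group limit from the validation table

Group = Tuple[str, ...]


def validate(good: Dict[Group, float], total: Dict[Group, float],
             good_by: Sequence[str], total_by: Sequence[str]) -> List[str]:
    """Return the names of the failed checks (empty list means valid)."""
    errors: List[str] = []
    if set(good_by) != set(total_by):
        errors.append("Grouping mismatch")
    if len(total) > MAX_GROUPS:
        errors.append("SLO group limitation")
    for group, good_value in good.items():
        # A group absent from the total query counts as zero total events here.
        if good_value > total.get(group, 0.0):
            errors.append(f"Good events exceed total events: {group}")
    return errors


good = {("checkout",): 990.0, ("search",): 510.0}
total = {("checkout",): 1000.0, ("search",): 500.0}
print(validate(good, total, ["service_name"], ["service_name"]))
```

Checking good against total per group, rather than only in aggregate, catches a single misfiltered group that an overall sum could hide.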

Set the SLO target and time frame

After defining your queries, configure:

  • Target success rate – The percentage of good events required to meet the SLO (e.g., 99.9% of requests should be successful)
  • Time frame – The compliance period over which the SLI is evaluated (e.g., 7, 14, 21, 28, or 90 days)

These parameters define the bounds of your error budget and ensure reliability goals are aligned with operational requirements.
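For an event-based SLO, the error budget is naturally expressed in events: the number of bad events the target tolerates over the time frame. A minimal Python sketch of that arithmetic, with hypothetical function names and request counts:

```python
def error_budget(total_events: int, target: float) -> float:
    """Bad events the SLO can absorb: (1 - target) x total events."""
    return (1.0 - target) * total_events


def budget_remaining(good_events: int, total_events: int, target: float) -> float:
    """Fraction of the error budget left; negative once it is overspent."""
    budget = error_budget(total_events, target)
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    bad_events = total_events - good_events
    return (budget - bad_events) / budget


# Hypothetical window: 1,000,000 requests with a 99.9% target.
print(f"{error_budget(1_000_000, 0.999):.0f}")              # 1000
print(f"{budget_remaining(999_400, 1_000_000, 0.999):.2f}")  # 0.40
```

With 600 bad events against a budget of about 1,000, roughly 40% of the budget remains.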

Limitations

When an SLO is first created, data collection for its associated metric begins at that point in time. As a result, initial calculations are based solely on data from the moment of creation onward. Once the age of the SLO exceeds its defined time frame, the system begins using the full SLO time frame as a rolling window—continuously evaluating data across the entire duration of the time frame.
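The behavior described above amounts to an evaluation window whose start is the later of the creation time and "now minus the time frame". The Python sketch below illustrates that rule with a hypothetical creation date; it models the description, not the Coralogix implementation.

```python
from datetime import datetime, timedelta
from typing import Tuple


def evaluation_window(created_at: datetime, now: datetime,
                      time_frame: timedelta) -> Tuple[datetime, datetime]:
    """Return the (start, end) range of data used for the SLI at `now`."""
    # Until the SLO is as old as its time frame, only data since creation
    # exists; after that the window rolls forward with a fixed length.
    start = max(created_at, now - time_frame)
    return start, now


created = datetime(2024, 1, 1)  # hypothetical creation date
seven_days = timedelta(days=7)
print(evaluation_window(created, datetime(2024, 1, 4), seven_days))
print(evaluation_window(created, datetime(2024, 1, 10), seven_days))
```

Three days after creation, a 7-day SLO is still evaluated over only those three days; from day seven onward, the window always spans the full seven days.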

Next steps

Click Save to store the SLO and return to the SLO Center.
To configure an SLO-based alert, click Save & create alert.

Additional resources

Find out how to safely use recording rule–based metrics in your SLO creation with this guide.