Policy-Based Service Health

Policy-based Service Health enhances the Service Health feature by automatically evaluating predefined health policies against your monitored services. This provides a real-time, traffic-light overview of service reliability—without needing to manually configure alerts.

Note

This feature is enabled by default only when using Compact span metric.

Glossary

Term | Description
Policy | A rule that determines a service’s health based on a specific data source (for example, avg. latency or error rate). Each policy includes conditions that define when a service is considered Healthy, Warning, or Critical.
Condition | A clause within a policy that evaluates a metric against defined thresholds. For example: “Warning if error rate > 1%,” “Critical if error rate > 5%,” or “Otherwise Healthy”.
Time Window | The period of data used for evaluation, set in the policy editor. For example, evaluating an error rate over a 1-minute window versus a 15-minute window can produce different results. Policies are continuously recalculated based on the latest data within this time window.
Health Indicator | The visual traffic-light symbol - 🟩 Healthy (Green), 🟨 Warning (Yellow), 🟥 Critical (Red), ⬛ Unavailable (Gray) - that represents the service’s overall health. The most severe active policy result always takes precedence (Critical > Warning > Healthy).

Key benefits

  • Immediate visibility: Instantly see which services are healthy, warning, or critical—no manual alert setup required.
  • Simplified setup: Policies are predefined and automatically applied to your monitored services.
  • Customizable thresholds: You can edit default policy values to match your service’s behavior or performance expectations.
  • Proactive monitoring: Detect potential issues from latency or error rate trends before they trigger incidents.
  • Real-time preview: See how your changes affect service health before saving.

View service health at a glance

The Service Health indicator provides an immediate, visual traffic-light assessment of your service’s operational state.

Coralogix offers two modes of Service Health evaluation:

  • Alert-based Service Health: available for all users and based on active incidents or alerts.
  • Policy-based Service Health: available to customers using Compact span metric and the OpenTelemetry (OTel) integration, starting from version v0.0.230.

Policy-based Service Health extends the alert-based version: incidents and alerts become one part of a broader set of predefined health policies (Incidents, Latency, Span error rate, and Logs error rate) that together determine whether a service is Healthy, Warning, or Critical.

These health policies run automatically in the background, using real-time APM metrics, and provide a unified health status for each service.

How health policies determine service status

Each service’s health state is automatically determined based on predefined health policies:

Policy | Metric | Warning threshold | Critical threshold
Incidents | Incident priority | ≥ P5 | > P3
Latency | Avg / P90 / P95 / P99 | > 400 ms | > 800 ms
Span error rate | Error rate (%) | > 1% | > 5%
Logs error rate | Error rate (%) | > 1% | > 5%

Note

The Logs error rate counts only Critical and Error logs.
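
The threshold logic for the numeric policies can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the `classify` function and the dictionary keys are hypothetical names, and the values are the defaults listed in the table. The Incidents policy is left out because it compares priorities rather than numeric values.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


# Default (warning, critical) thresholds taken from the table.
# The keys are illustrative names, not product identifiers.
DEFAULT_THRESHOLDS = {
    "latency_ms": (400.0, 800.0),
    "span_error_rate_pct": (1.0, 5.0),
    "logs_error_rate_pct": (1.0, 5.0),
}


def classify(policy: str, value: float) -> Health:
    """Return the health state for one metric value under one policy."""
    warning, critical = DEFAULT_THRESHOLDS[policy]
    # Check the Critical threshold first so it takes precedence over Warning.
    if value > critical:
        return Health.CRITICAL
    if value > warning:
        return Health.WARNING
    # Neither threshold is breached, so the state is Healthy.
    return Health.HEALTHY


print(classify("latency_ms", 350))           # Health.HEALTHY
print(classify("span_error_rate_pct", 2.5))  # Health.WARNING
print(classify("logs_error_rate_pct", 7.0))  # Health.CRITICAL
```

Because the comparisons are strict (`>`), a value exactly at a threshold, such as a latency of 400 ms, still counts as Healthy in this sketch.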

The system evaluates all policies simultaneously and displays the most severe active state:

Indicator | Meaning | Condition
🟩 Healthy (Green) | All policies within thresholds | No policy breaches
🟨 Warning (Yellow) | At least one policy breaching its Warning threshold | No Critical breaches
🟥 Critical (Red) | At least one policy breaching its Critical threshold | One or more Critical breaches
⬛ Unavailable (Gray) | No policies or data available | No metrics or configuration detected

Note

  • If any policy breaches a Critical threshold, the service’s overall health immediately becomes 🟥 Critical (Red).

  • When viewing service health across multiple environments, the overall status may appear as 🟩 Healthy (Green) even if some environments contain 🟨 Warning (Yellow) or 🟥 Critical (Red) issues. In this case, a banner appears above the Service Overview indicating that issues were detected in one or more environments, but the overall service is healthy.
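
The "most severe state wins" rule can be expressed as a maximum over ordered states. The sketch below is a hypothetical model, not the product's code: the `Health` enum, its numeric ordering, and the `overall_health` function are assumptions made for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    # A higher value means a more severe state.
    # UNAVAILABLE sits lowest so that any evaluated policy outranks it.
    UNAVAILABLE = 0
    HEALTHY = 1
    WARNING = 2
    CRITICAL = 3


def overall_health(policy_results: dict[str, Health]) -> Health:
    """Return the most severe state among all policy results.

    With no results at all, the service is reported as UNAVAILABLE.
    """
    return max(policy_results.values(), default=Health.UNAVAILABLE)


results = {
    "Incidents": Health.HEALTHY,
    "Latency": Health.WARNING,
    "Span error rate": Health.CRITICAL,
    "Logs error rate": Health.HEALTHY,
}

print(overall_health(results).name)  # CRITICAL
print(overall_health({}).name)       # UNAVAILABLE
```

A single Critical result is enough to make the overall state Critical, regardless of how many other policies are Healthy.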

Configuring policies

Health policies are predefined and automatically applied to all services.

You do not need to create new policies—but you can edit existing ones to align thresholds and time windows with your service’s behavior.

Optionally, you can disable a policy for a specific service or for all services.

Edit directly from the Service Catalog

You can edit a policy for any service directly from the Service Catalog or Overview page:

  1. Hover over the Health card for a specific metric (for example, Latency or Logs error rate).
  2. Select the three-dot menu that appears in the top-right corner of the card.
  3. Select Edit policy from the dropdown menu.

This opens the Policy editor, where you can view and adjust predefined thresholds.

Use the Policy editor

Inside the Policy editor, you can:

  • Review the policy name and optionally add a description.
  • View or refine entity filters (for example, apply a policy only to a specific service).
  • Edit health thresholds and choose the evaluation time window (1, 5, 10, or 15 minutes).
  • Apply the change to a specific service or globally (to Any service). Selecting a specific service updates the thresholds only for that service, while choosing Any applies the updated policy to all current and future services in your account.

Example configuration:

  1. Select the metric you want to evaluate (for example, entity.latency).
  2. Set warning and critical thresholds, such as:
    • Warning if latency > 400 ms
    • Critical if latency > 800 ms
  3. Define the time window (for example, average latency in the last 1 second, evaluated over 5 minutes).
  4. The Otherwise clause automatically sets the state to Healthy when thresholds are not breached.
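
To see why the time window matters, consider the following sketch of the example configuration. It is not the product's evaluation engine: the `LatencyPolicy` class is hypothetical, the sample data is made up, and the sketch assumes one average-latency sample per minute, averaged again over the window.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class LatencyPolicy:
    """Hypothetical model of a latency policy with its time window."""
    warning_ms: float = 400.0
    critical_ms: float = 800.0
    window_minutes: int = 5

    def evaluate(self, per_minute_avg_ms: list[float]) -> str:
        """Evaluate the most recent `window_minutes` samples (oldest first)."""
        window = per_minute_avg_ms[-self.window_minutes:]
        if not window:
            return "Unavailable"  # no data to evaluate
        value = mean(window)
        if value > self.critical_ms:
            return "Critical"
        if value > self.warning_ms:
            return "Warning"
        return "Healthy"  # the Otherwise clause


# One average-latency sample per minute, oldest first (made-up data).
samples = [200, 220, 210, 230, 900]

print(LatencyPolicy(window_minutes=1).evaluate(samples))  # Critical: last sample is 900 ms
print(LatencyPolicy(window_minutes=5).evaluate(samples))  # Healthy: mean is 352 ms
```

The same data yields Critical over a 1-minute window and Healthy over a 5-minute window. Shorter windows react faster to spikes, while longer windows smooth them out.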

Save the policy

When you’re satisfied with your changes:

  1. Select Save policy to apply the new values.
  2. The policy updates in real time, and the affected service’s health indicator automatically reflects the latest evaluation results.

Note

Policies are evaluated continuously using the most recent data within the selected time window. If you adjust thresholds, allow a few seconds for the updated evaluation to propagate.

Disable policy

Health policies can be disabled and re-enabled at any time using the Enable / Disable toggle. When disabling a policy, you can choose the scope of the change from the Entity filters field in the Policy editor:

  • Disable for a specific service: Select a specific service in the Entity filters section, then disable the policy. The policy is disabled only for that service, while it remains active for all other services.
  • Disable globally (Any service): Select Any in the Entity filters section, then disable the policy. This disables the policy for all current and future services monitored under your account.

After a policy is disabled:

  • The policy is displayed as ⬛ Unavailable (Gray) to indicate it is inactive.
  • The policy is no longer evaluated as part of Service Health.

You can re-enable a policy at any time to resume health evaluation.
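
The effect of the two disable scopes can be modeled as follows. This is a simplified, hypothetical sketch: how the product actually stores disabled policies is not documented here, and the function names, the `ANY` constant, and the service names are assumptions for illustration.

```python
ANY = "Any"  # stands in for the "Any" entity filter


def is_policy_active(policy: str, service: str,
                     disabled: set[tuple[str, str]]) -> bool:
    """A policy is inactive if it is disabled for this service or for Any service."""
    return (policy, service) not in disabled and (policy, ANY) not in disabled


def active_results(service: str, results: dict[str, str],
                   disabled: set[tuple[str, str]]) -> dict[str, str]:
    """Keep only the results of policies that still count toward Service Health."""
    return {
        policy: state
        for policy, state in results.items()
        if is_policy_active(policy, service, disabled)
    }


# Latency is disabled for one service; Logs error rate is disabled globally.
disabled = {("Latency", "checkout"), ("Logs error rate", ANY)}
results = {
    "Latency": "Critical",
    "Span error rate": "Warning",
    "Logs error rate": "Critical",
}

print(active_results("checkout", results, disabled))
# {'Span error rate': 'Warning'}
print(active_results("payments", results, disabled))
# {'Latency': 'Critical', 'Span error rate': 'Warning'}
```

In this model, disabling Latency for `checkout` removes a Critical result from that service's evaluation only, while `payments` is still evaluated against the Latency policy.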

Best practices

  • Review policies regularly after deployments or scaling events.
  • Tune thresholds to reflect real user experience rather than system noise.
  • Use Preview before saving changes to confirm health impact.
  • Remember: Critical overrides all other states—a single breach sets the service to Critical.