Policy-Based Service Health

Policy-based Service Health enhances the Service Health feature by automatically evaluating predefined health policies against your monitored services. This provides a real-time, traffic-light overview of service reliability—without needing to manually configure alerts.

Note

This feature is enabled by default only when using Compact span metric.
After compact metrics start being sent, it may take up to 12 hours for policy-based health status to appear in the Service Catalog.

Glossary

Term	Description
Policy	A rule that determines a service’s health based on a specific data source (for example, average latency or error rate). Each policy includes conditions that define when a service is considered Healthy, Warning, or Critical.
Condition	A clause within a policy that evaluates a metric against defined thresholds. For example: “Warning if error rate > 1%,” “Critical if error rate > 5%,” or “Otherwise Healthy”.
Time Window	The time window in the policy editor determines the period of data used for evaluation. For example, evaluating an error rate over a 1-minute window versus a 15-minute window can produce different results. Policies are continuously recalculated based on the latest data within this time window.
Health Indicator	The visual traffic-light symbol - 🟩 Healthy (Green), 🟨 Warning (Yellow), 🟥 Critical (Red), ⬛ Unavailable (Gray) - that represents the service’s overall health. The most severe active policy result always takes precedence (Critical > Warning > Healthy).

Key benefits

Immediate visibility: Instantly see which services are healthy, warning, or critical—no manual alert setup required.
Simplified setup: Policies are predefined and automatically applied to your monitored services.
Customizable thresholds: You can edit default policy values to match your service’s behavior or performance expectations.
Proactive monitoring: Detect potential issues from latency or error rate trends before they trigger incidents.

View service health at a glance

The Service Health indicator provides an immediate, visual traffic-light assessment of your service’s operational state.

Coralogix offers two modes of Service Health evaluation:

Alert-based Service Health: available for all users and based on active incidents or alerts.
Policy-based Service Health: available to customers using Compact span metric and the OpenTelemetry (OTel) integration, starting from version v0.0.230.

Policy-based Service Health enhances the alert-based version by including incidents and alerts as part of a broader set of predefined health policies—such as latency, error rate, and log error rate—to determine whether a service is Healthy, Warning, or Critical.

These health policies run automatically in the background, using real-time APM metrics, and provide a unified health status for each service.

How health policies determine service status

Each service’s health state is automatically determined based on predefined health policies:

Policy	Metric	Warning threshold	Critical threshold
Incidents	Incident priority	≥ P5	> P3
Latency	Avg / P90 / P95 / P99	> 400 ms	> 800 ms
Span error rate	Error rate (%)	> 1 %	> 5 %
Logs error rate	Error rate (%)	> 1 %	> 5 %

Note

The Logs error rate policy counts only logs with Critical or Error severity.
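
As a rough illustration of the note above, the following Python sketch computes a log error rate in which only Critical and Error logs count as errors. The function name, the severity strings, and the input shape are assumptions made for this example and are not part of the product's API.

```python
from collections import Counter

# Assumption: severities arrive as plain strings; only these two count as errors.
ERROR_SEVERITIES = {"Critical", "Error"}


def log_error_rate(severities):
    """Return the percentage of log entries whose severity is Critical or Error.

    Returns None when there are no logs, since no rate can be computed.
    """
    counts = Counter(severities)
    total = sum(counts.values())
    if total == 0:
        return None
    errors = sum(counts[s] for s in ERROR_SEVERITIES)
    return 100.0 * errors / total


# Hypothetical sample: 96 Info logs, 3 Error logs, 1 Critical log.
sample = ["Info"] * 96 + ["Error"] * 3 + ["Critical"]
rate = log_error_rate(sample)
```

With this sample the rate is 4%, which is above the default 1% Warning threshold and below the 5% Critical threshold. Warning-severity logs do not raise the rate.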

The system evaluates all policies simultaneously and displays the most severe active state:

Indicator	Meaning	Condition
🟩 Healthy (Green)	All policies within thresholds	No policy breaches
🟨 Warning (Yellow)	At least one policy breaching its Warning threshold	No Critical breaches
🟥 Critical (Red)	At least one policy breaching its Critical threshold	One or more Critical breaches
⬛ Unavailable (Gray)	No policies or data available	No metrics or configuration detected
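
The "most severe state wins" rule can be sketched in a few lines of Python. This is an illustration only: the `Health` enum and `overall_health` function are hypothetical names, and ranking Unavailable below Healthy is an assumption chosen so that a service with at least one evaluated policy never shows as Unavailable.

```python
from enum import IntEnum


class Health(IntEnum):
    # Higher value means more severe. Placing UNAVAILABLE lowest is an assumption.
    UNAVAILABLE = 0
    HEALTHY = 1
    WARNING = 2
    CRITICAL = 3


def overall_health(policy_results):
    """Return the most severe state among the per-policy results.

    An empty list (no policies or no data) yields UNAVAILABLE.
    """
    return max(policy_results, default=Health.UNAVAILABLE)


# Hypothetical per-policy results for one service.
results = [Health.HEALTHY, Health.WARNING, Health.HEALTHY]
status = overall_health(results)
```

Here `status` is `Health.WARNING`. Adding a single `Health.CRITICAL` result to the list would change the overall status to Critical regardless of the other results.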

Note

If any policy breaches a Critical threshold, the service’s overall health immediately becomes 🟥 Critical (Red).
When viewing service health across multiple environments, the overall status may appear as 🟩 Healthy (Green) even if some environments contain 🟨 Warning (Yellow) or 🟥 Critical (Red) issues. In this case, a banner appears above the Service Overview indicating that issues were detected in one or more environments, but the overall service is healthy.

Configuring policies

Health policies are predefined and automatically applied to all services.

You do not need to create new policies—but you can edit existing ones to align thresholds and time windows with your service’s behavior.

Optionally, you can disable a policy for a specific service or for all services.

Edit directly from the Service Catalog

You can edit a policy for any service directly from the Service Catalog or Overview page:

Hover over the Health card for a specific metric (for example, Latency or Logs error rate).
Select the three-dot menu that appears in the top-right corner of the card.
Select Edit policy from the dropdown menu.

This opens the Policy editor, where you can view and adjust predefined thresholds.

Use the Policy editor

Inside the Policy Editor, you can:

Review the policy name and optionally add a description.
View or refine entity filters (for example, apply a policy only to a specific service).
Edit health thresholds and choose an evaluation time window (1, 5, 10, or 15 minutes).
Apply the change to a specific service or globally (to Any service). Selecting a specific service updates the thresholds only for that service, while choosing Any applies the updated policy to all current and future services in your account.

Example configuration:

Select the metric you want to evaluate (for example, entity.latency).
Set warning and critical thresholds, such as:
- Warning if latency > 400 ms
- Critical if latency > 800 ms
Define the time window (for example, average latency in the last 1 second, evaluated over 5 minutes).
The Otherwise clause automatically sets the state to Healthy when thresholds are not breached.
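
The example configuration above amounts to a small decision rule. The following Python sketch shows one way to express it; the `classify` function and its parameter names are hypothetical, and the default values simply mirror the 400 ms and 800 ms thresholds from the example.

```python
def classify(latency_ms, warning_ms=400.0, critical_ms=800.0):
    """Map a latency value to a health state using strict 'greater than' checks.

    The Critical condition is checked first so that it takes precedence.
    """
    if latency_ms is None:
        return "Unavailable"  # assumption: missing data maps to Unavailable
    if latency_ms > critical_ms:
        return "Critical"
    if latency_ms > warning_ms:
        return "Warning"
    return "Healthy"  # the Otherwise clause


states = [classify(v) for v in (120, 400, 401, 800, 950)]
```

Because the comparisons are strict, a value exactly at a threshold (400 ms or 800 ms) stays in the lower state.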

Save the policy

When you’re satisfied with your changes:

Select Save policy to apply the new values.
The policy updates in real time, and the affected service’s health indicator automatically reflects the latest evaluation results.

Note

Policies are evaluated continuously using the most recent data within the selected time window. If you adjust thresholds, allow a few seconds for the updated evaluation to propagate.
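
To see why the choice of time window matters, consider the following Python sketch. It assumes per-minute `(errors, total)` request counts, which is a simplification made for this example and not a description of how the metrics are stored.

```python
def windowed_error_rate(per_minute, window_minutes):
    """Error rate (%) over the most recent `window_minutes` samples.

    `per_minute` is a list of (errors, total) tuples, oldest first.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    recent = per_minute[-window_minutes:]
    errors = sum(e for e, _ in recent)
    total = sum(t for _, t in recent)
    return 100.0 * errors / total if total else None


# Hypothetical traffic: 14 clean minutes followed by one minute with 10% errors.
samples = [(0, 100)] * 14 + [(10, 100)]
rate_1m = windowed_error_rate(samples, 1)
rate_15m = windowed_error_rate(samples, 15)
```

With these samples the 1-minute window reports 10% (above the default 5% Critical threshold), while the 15-minute window reports roughly 0.67% (below the default 1% Warning threshold). Shorter windows react faster to spikes; longer windows smooth them out.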

Disable policy

Health policies can be disabled and re-enabled at any time using the Enable / Disable toggle. When disabling a policy, you can choose the scope of the change from the Entity filters field in the Policy Editor:

Disable for a specific service: Select a specific service in the Entity filters section, then disable the policy. The policy is disabled only for that service, while it remains active for all other services.
Disable globally (Any service): Select Any in the Entity filters section, then disable the policy. This disables the policy for all current and future services monitored under your account.

After a policy is disabled:

The policy is displayed as ⬛ Unavailable (Gray) to indicate it is inactive.
The policy is no longer evaluated as part of Service Health.

You can re-enable a policy at any time to resume health evaluation.
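
The scoping behavior described above can be pictured with the following Python sketch. It assumes that a setting saved for a specific service takes precedence over the setting saved for Any; the dictionary layout and function names are hypothetical and only illustrate the idea.

```python
ANY = "Any"


def effective_policy(policies, service):
    """Return the settings that apply to `service`.

    Assumption: a service-specific entry overrides the global `Any` entry.
    """
    return policies.get(service, policies.get(ANY))


def is_evaluated(policies, service):
    """A policy contributes to Service Health only when it exists and is enabled."""
    policy = effective_policy(policies, service)
    return policy is not None and policy["enabled"]


# Hypothetical state: the latency policy is enabled globally
# but disabled for the "checkout" service only.
latency_policies = {
    ANY: {"enabled": True},
    "checkout": {"enabled": False},
}
checkout_active = is_evaluated(latency_policies, "checkout")
cart_active = is_evaluated(latency_policies, "cart")
```

In this example the policy is skipped for `checkout` and still evaluated for `cart` and any other service without its own entry.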

Best practices

Review policies regularly after deployments or scaling events.
Tune thresholds to reflect real user experience rather than system noise.
Remember: Critical overrides all other states—a single breach sets the service to Critical.

