The Runtime metrics tab in [Service Catalog](https://coralogix.com/docs/user-guides/apm/features/service-catalog/index.md) displays JVM runtime metrics — heap memory, garbage collection, thread states, CPU usage, and class loading — alongside service latency. It correlates JVM internals with user-facing performance, so you can diagnose memory leaks, GC pressure, and thread contention without leaving APM.

Limited availability

This feature is in preview and is subject to change.

## Why it matters

When a Java service slows down, throws errors, or pegs CPU, the cause is often invisible from traces and span metrics alone. The usual suspects — long GC pauses, heap exhaustion, thread contention, classloader leaks — happen inside the JVM, one layer below where APM normally looks.

The Runtime metrics tab pulls JVM internals onto the same screen as your traces, so you can:

- **Tie latency spikes to GC pauses.** The **Service latency vs pauses** widget overlays GC events on request latency. If a spike lines up with a pause marker, you have your answer.
- **Tell a memory leak apart from normal allocation pressure.** Three separate memory widgets (Heap used vs memory, Live set - memory after GC, Heap by pool) let you read the heap properly instead of guessing from a single line.
- **Distinguish CPU saturation from lock contention.** Viewing CPU usage and thread states side by side shows whether the service is busy doing work (CPU-bound) or stuck waiting on locks.
- **Isolate one bad pod.** The JVM instances selector and the per-instance heatmap let you find the single misbehaving instance that drags down the cluster average.

## What you need

OpenTelemetry semantic conventions only

The Runtime metrics tab reads only JVM metrics that follow the [OpenTelemetry JVM semantic conventions](https://opentelemetry.io/docs/specs/semconv/runtime/jvm-metrics/), for example `jvm.memory.used`, `jvm.gc.duration`, and `jvm.thread.count`. Metrics that use other naming conventions (such as the Micrometer/Prometheus form `jvm_memory_used_bytes`) are not recognized, and the tab does not show them. If your services emit JVM metrics under a different naming convention, you must migrate their instrumentation to the OpenTelemetry conventions before the tab can display data for them.

To make a Java, Scala, or Kotlin service report JVM metrics in the supported format:

- Attach the OpenTelemetry Java agent (`opentelemetry-javaagent.jar`) to the service process. The agent collects all stable `jvm.*` metrics from the OpenTelemetry semantic conventions automatically. See [Java OpenTelemetry instrumentation](https://coralogix.com/docs/opentelemetry/instrumentation-options/java-opentelemetry-instrumentation/index.md) for installation steps, or follow [Send JVM metrics](https://coralogix.com/docs/user-guides/apm/features/runtime-metrics/send-jvm-metrics/index.md) for the end-to-end setup.
- Required JDK: Java 8 or later for the stable metric set.
- The metrics must include the `service.name` resource attribute so they correlate with the service in the Service Catalog.

For services that are not JVM-based (Python, Go, Node.js, and so on — identified by the `telemetry.sdk.language` resource attribute), the Runtime metrics tab is hidden entirely. For JVM-based services that have not started reporting `jvm.*` metrics yet, the tab is shown with an empty state (see [When no JVM metrics are detected](#when-no-jvm-metrics-are-detected)) that points you to instructions for sending the metrics.
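The visibility rules above reduce to a short decision. The following is a minimal sketch, not Coralogix's implementation: it assumes telemetry is available as plain dictionaries and that JVM services are recognized by `telemetry.sdk.language` being `java`, which is the value the OpenTelemetry Java agent reports whether the code is Java, Scala, or Kotlin. The function and enum names are hypothetical.

```python
from enum import Enum


class TabState(Enum):
    HIDDEN = "hidden"    # non-JVM service: the tab is not shown
    EMPTY = "empty"      # JVM service with no jvm.* metric in the window
    WIDGETS = "widgets"  # JVM service with at least one jvm.* metric


def is_otel_jvm_metric(name: str) -> bool:
    # OpenTelemetry names are dot-separated (jvm.memory.used);
    # the Micrometer/Prometheus form (jvm_memory_used_bytes) does not match.
    return name.startswith("jvm.")


def tab_state(resource: dict, metric_names_in_window: list[str]) -> TabState:
    # Assumption: JVM services report telemetry.sdk.language == "java",
    # as the OpenTelemetry Java agent does for Java, Scala, and Kotlin.
    if resource.get("telemetry.sdk.language") != "java":
        return TabState.HIDDEN
    if any(is_otel_jvm_metric(n) for n in metric_names_in_window):
        return TabState.WIDGETS
    return TabState.EMPTY
```

In this sketch, a JVM service that exports only `jvm_memory_used_bytes` lands in the empty state rather than the widget grid, because the Prometheus-style name does not carry the `jvm.` prefix.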

### How JVM metrics reach the tab

```mermaid
flowchart LR
    JVM["Java / Scala / Kotlin<br/>service (JVM process)"]
    Agent["OpenTelemetry Java agent<br/>opentelemetry-javaagent.jar"]
    Coralogix["Coralogix"]
    Tab["Runtime metrics tab<br/>filtered by service.name<br/>+ instance attribute"]

    JVM -->|JMX MBeans| Agent
    Agent -->|jvm.memory.* · jvm.gc.duration<br/>jvm.thread.count · jvm.cpu.*<br/>jvm.class.* + service.name| Coralogix
    Coralogix --> Tab

    class JVM entry
    class Tab success
```

The OpenTelemetry Java agent collects JVM metrics from your service and sends them to Coralogix tagged with `service.name`. The tab uses that attribute to scope every panel to the service, and uses the instance attribute it detects (`k8s.pod.name`, `host.name`, and so on) to split the data by JVM instance.
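One way to picture the per-instance split is to key every data point by `service.name` plus the first instance attribute found on it. In the sketch below, the attribute precedence (`k8s.pod.name` before `host.name`), the `unknown-instance` fallback, and all function names are hypothetical; the tab's actual detection order is not documented here.

```python
from collections import defaultdict

# Hypothetical precedence: Kubernetes pod first, then host.
INSTANCE_ATTRIBUTES = ("k8s.pod.name", "host.name")


def instance_key(resource: dict) -> tuple[str, str]:
    # service.name is required to correlate with the Service Catalog
    service = resource["service.name"]
    for attr in INSTANCE_ATTRIBUTES:
        if attr in resource:
            return service, resource[attr]
    return service, "unknown-instance"  # hypothetical fallback label


def group_by_instance(points: list[dict]) -> dict[tuple[str, str], list[float]]:
    # Each point: {"resource": {...resource attributes...}, "value": float}
    series: dict[tuple[str, str], list[float]] = defaultdict(list)
    for p in points:
        series[instance_key(p["resource"])].append(p["value"])
    return dict(series)
```

Two pods on the same Kubernetes node stay separate because the pod name wins, while a VM outside Kubernetes falls back to `(service.name, host.name)`.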

## Access the Runtime metrics tab

1. In your Coralogix toolbar, navigate to **APM**, then **Service Catalog**.
1. Select a JVM-based service to open the service drilldown.
1. Select the **Runtime metrics** tab.

The tab loads with the default time range and the JVM Metrics layout.

## When no JVM metrics are detected

If you open the Runtime metrics tab for a JVM service that has not yet started sending JVM metrics in the OpenTelemetry format, the tab loads with an empty state instead of the widget grid. From the empty state, you can either start the JVM Observability extension flow directly in the tab, or open the [Send JVM metrics](https://coralogix.com/docs/user-guides/apm/features/runtime-metrics/send-jvm-metrics/index.md) guide to set up instrumentation yourself.

The tab continues to show the empty state until at least one `jvm.*` metric is observed for the service in the current time window. Once metrics start flowing, the widgets render automatically — no configuration change is needed inside the tab itself.

Note

Non-JVM services (such as Python or Go) do not show the Runtime metrics tab at all. The empty state is reserved for JVM-based services that are eligible to report `jvm.*` metrics but have not yet done so.

## Layout

The JVM Metrics view is organized top to bottom as:

1. **JVM instances**: a selector that scopes every panel below it to a subset of running JVM instances.
1. **Instance heatmap**: a row that shows the health of each instance at a glance.
1. **JVM summary**: four stat cards spanning the full width.
1. **Memory**: Heap used vs memory, Live set - memory after GC, Heap by pool.
1. **CPU**: JVM CPU utilization, Thread count by state, Class loading.
1. **GC**: GC pause duration, GC event count, Service latency vs pauses.

The five content sections (Instance heatmap, JVM summary, Memory, CPU, GC) are collapsible. The JVM instances selector is a control bar and is always visible.

When **JVM summary** is collapsed, each card collapses into a chip that still shows its current value, trend arrow, and severity color, so the at-a-glance signal is preserved. When **Memory**, **CPU**, or **GC** is collapsed, the section header shows the widget titles as compact chips.

Hovering any chart syncs the crosshair across the other widgets in the tab, so you can correlate the same point in time across memory, CPU, and GC at once.

## JVM instances selector

The instance selector sits at the top of the view and scopes every widget to a subset of the running JVM instances. The default is **All instances (aggregated)**, which sums or averages across every JVM reporting metrics for the service.

Why per-instance filtering matters

JVM metrics are emitted per JVM process. In a horizontally scaled service, every pod runs its own JVM, and aggregate values hide the most common failure mode — one bad instance. A memory leak on a single pod, GC pauses isolated to one instance, or a thread leak on one node is invisible in the aggregate view until the pod fails. The selector lets you compare one suspect instance against the rest of the fleet.

When you select a single instance, every widget below filters to that instance only. When you select multiple instances, every widget shows one series per instance, color-coded by instance.

## Instance heatmap

Below the instance selector is a collapsible per-instance overview row. It is collapsed by default: keep it collapsed during normal operations and expand it during incident triage.

When expanded, the row renders a compact grid where every running instance has one row and four columns — Heap used %, GC used time %, GC P99, CPU %. Each cell is color-coded by severity (green, amber, red), so an instance that is misbehaving on any single dimension is visually obvious without selecting each instance one by one.

| Column         | Source metric                                                                              | Severity rule                                                                      |
| -------------- | ------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------- |
| Heap used %    | `jvm.memory.used` ÷ `jvm.memory.limit`                                                     | Green at low utilization, amber as it climbs, red at sustained high utilization    |
| GC used time % | `jvm.gc.duration` summed as a fraction of wall time                                        | Green when the JVM spends little wall time paused for GC; red when pauses dominate |
| GC P99         | `jvm.gc.duration` 99th percentile                                                          | Green for short pauses, red for long pauses                                        |
| CPU %          | `rate(jvm.cpu.time)` ÷ `jvm.cpu.count` (or `jvm.cpu.recent_utilization` × 100 as fallback) | Green at low utilization, amber as it climbs, red near saturation                  |

Selecting any row filters every widget below it to that instance. Large fleets are paginated — use the paginator below the heatmap to step through additional instances.
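The arithmetic behind three of the heatmap columns can be sketched from the formulas in the table. This is an illustration, not the tab's code: the amber and red thresholds are hypothetical placeholders because the actual cut-offs are not documented, and the sketch evaluates a single window instead of checking that a level is sustained.

```python
from typing import Optional

# Hypothetical (amber, red) thresholds, in percent.
THRESHOLDS = {
    "heap_pct": (70.0, 90.0),
    "gc_time_pct": (2.0, 10.0),
    "cpu_pct": (70.0, 90.0),
}


def heap_used_pct(memory_used: float, memory_limit: float) -> float:
    # jvm.memory.used / jvm.memory.limit, heap pools only
    return 100.0 * memory_used / memory_limit


def gc_time_pct(gc_seconds_in_window: float, window_s: float) -> float:
    # Increase of the jvm.gc.duration sum over the window, as a share of wall time
    return 100.0 * gc_seconds_in_window / window_s


def cpu_pct(cpu_time_delta_s: Optional[float], window_s: float, cpu_count: int,
            recent_utilization: Optional[float] = None) -> float:
    if cpu_time_delta_s is not None:
        # rate(jvm.cpu.time) is CPU-seconds per second, i.e. cores in use
        return 100.0 * (cpu_time_delta_s / window_s) / cpu_count
    # Fallback: jvm.cpu.recent_utilization is already a 0..1 fraction
    return 100.0 * recent_utilization


def severity(metric: str, value_pct: float) -> str:
    amber, red = THRESHOLDS[metric]
    if value_pct >= red:
        return "red"
    if value_pct >= amber:
        return "amber"
    return "green"
```

For example, an instance whose `jvm.cpu.time` grew by 120 CPU-seconds over a 60-second window on 4 cores is using 2 cores, or 50% CPU.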

## JVM summary

The summary strip displays four headline numbers an on-call engineer checks first. Each card shows an aggregated current value with a short context line and a directional delta versus the previous equivalent window (↗ red = worsening, ↘ green = improving, — neutral).

| Card         | Source metric                                                             | Aggregation (card subtitle)                     |
| ------------ | ------------------------------------------------------------------------- | ----------------------------------------------- |
| Heap used    | `jvm.memory.used` filtered to `jvm.memory.type=heap`, summed across pools | Avg across instances, in GB                     |
| GC overhead  | `jvm.gc.duration` as a fraction of wall time                              | Avg of wall time spent in GC pauses             |
| GC pause P99 | `jvm.gc.duration`                                                         | P99, averaged across instances, last 30 minutes |
| Thread count | `jvm.thread.count`                                                        | Avg platform threads across instances           |
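The trend indicator on each card compares the current value with the previous equivalent window. A minimal sketch, assuming a hypothetical 5% relative tolerance for the neutral band and treating a higher value as worse for every card:

```python
def trend(current: float, previous: float, tolerance: float = 0.05) -> tuple[str, str]:
    """Return (direction, color). "up" renders as the red up-arrow,
    "down" as the green down-arrow, and "flat" as the neutral marker."""
    if previous == 0:
        change = 0.0 if current == 0 else float("inf")
    else:
        change = (current - previous) / previous
    if change > tolerance:
        return "up", "red"        # worsening
    if change < -tolerance:
        return "down", "green"    # improving
    return "flat", "neutral"      # within the tolerance band
```

The tolerance band keeps small window-to-window noise from flipping the arrow back and forth.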

**What to look for:**

- A red ↗ trend arrow on any card means the metric got worse since the previous window — cross-check with the detailed widgets below to see what is driving it.
- **Heap used** climbing toward the heap limit is a pre-OOM warning.
- **GC overhead** above a few percent is unhealthy — the JVM is losing meaningful wall time to GC. Drill into **GC pause duration** and **GC event count** to see whether long pauses or frequent collections are responsible.
- **GC pause P99** rising means tail pauses are stretching, which directly hurts user-facing latency. Confirm against **Service latency vs pauses**.
- **Thread count** drifting up without matching traffic growth is a thread leak. A sudden drop usually means a thread pool was resized at deployment.

Two cards surface contextual badges when a specific signal appears:

- **GC overhead** shows a **Driven by pod** badge when one instance is responsible for most of the GC overhead. The badge flags a single bad pod without requiring you to expand the heatmap.
- **Thread count** shows a **Stable** badge when the thread count is steady across the window, which confirms there is no thread leak or runaway pool growth.
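The two badge conditions can be sketched as simple checks. The documentation says only "most of the GC overhead" and "steady across the window", so the 50% share and 10% spread thresholds below are hypothetical, as are the function names.

```python
from typing import Optional

DRIVEN_BY_POD_SHARE = 0.5     # hypothetical: one instance > 50% of total GC time
STABLE_RELATIVE_SPREAD = 0.1  # hypothetical: max - min within 10% of the mean


def driven_by_pod(gc_seconds_by_instance: dict[str, float]) -> Optional[str]:
    """Return the instance that dominates GC pause time, or None."""
    total = sum(gc_seconds_by_instance.values())
    if total == 0:
        return None
    worst, worst_seconds = max(gc_seconds_by_instance.items(), key=lambda kv: kv[1])
    return worst if worst_seconds / total > DRIVEN_BY_POD_SHARE else None


def thread_count_stable(samples: list[float]) -> bool:
    """True when the thread count barely moves across the window."""
    mean = sum(samples) / len(samples)
    return (max(samples) - min(samples)) <= STABLE_RELATIVE_SPREAD * mean
```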

What GC overhead measures

The GC overhead card reports the percentage of wall time the JVM was paused for garbage collection. It is not the same as CPU consumed by GC. Concurrent collectors such as ZGC and Shenandoah can spend significant CPU on garbage collection while showing low values here, because they do most of their work without stopping application threads.

## Memory

The Memory row answers three distinct questions, each on its own widget: how much memory is the JVM using, how much is retained after each garbage collection, and how is that usage distributed across heap pools.

### Heap used vs memory

Style: area + lines. Y-axis: bytes (auto-scaled GB/MB), single axis. Underlying data: `jvm.memory.*`, split by `jvm.memory.type`.

A toggle switches between **Heap** and **Non-heap (Metaspace)**. The Y-axis automatically rescales on toggle — heap is typically in the GB range, non-heap (Metaspace) in the hundreds of MB — so the chart stays readable in both modes.

| Series    | Source metric          | Style                       | What it shows                                                     |
| --------- | ---------------------- | --------------------------- | ----------------------------------------------------------------- |
| used      | `jvm.memory.used`      | Filled area (blue)          | Currently allocated memory                                        |
| committed | `jvm.memory.committed` | Dashed line (green)         | Memory the OS has reserved for the JVM                            |
| limit     | `jvm.memory.limit`     | Dashed reference line (red) | Memory ceiling (`-Xmx` for heap, `-XX:MaxMetaspaceSize` for non-heap) |

**What to look for:**

- The gap between **used** and **limit** is your headroom before an out-of-memory error.
- The gap between **used** and **committed** is memory the OS has reserved but the JVM is not using yet. A shrinking gap under load is an early warning that the JVM is running out of slack.
- Switch to **Non-heap (Metaspace)** to surface Metaspace, a common source of out-of-memory errors in services that load many classes (plugin-heavy applications, reflection-heavy frameworks).
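Both gaps are plain differences between the series at one timestamp. A minimal sketch for a single instance, assuming byte values; when no limit is reported for a pool (which can happen for Metaspace when `-XX:MaxMetaspaceSize` is not set), headroom is treated as undefined. The function names are hypothetical.

```python
from typing import Optional

GIB = 1024 ** 3


def headroom_bytes(used: int, limit: Optional[int]) -> Optional[int]:
    """Approximate distance to an OutOfMemoryError (limit - used).
    None when no limit is reported for the pool."""
    return None if limit is None else limit - used


def slack_bytes(used: int, committed: int) -> int:
    """Memory committed to the JVM by the OS but not used yet."""
    return committed - used
```

For example, 3 GiB used against a 4 GiB `-Xmx` leaves 1 GiB of headroom. Treat it as approximate: the JVM can fail an allocation before `used` exactly reaches `limit`.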

For leak detection, use **Live set - memory after GC** instead. The `used` line bounces with every GC cycle, which makes trends hard to read.

### Live set - memory after GC

Style: stepped line. Y-axis: bytes, auto-ranged to the data — not zero-based. Underlying data: `jvm.memory.used_after_last_gc` summed across heap pools. Each horizontal segment represents memory between two GC events; each vertical drop is one GC firing.

| Series        | Source                          | What it shows                              |
| ------------- | ------------------------------- | ------------------------------------------ |
| live set size | `jvm.memory.used_after_last_gc` | Memory retained after the most recent GC   |
| GC event      | Derived from `jvm.gc.duration`  | Vertical markers where each GC event fired |

Why this Y-axis is not zero-based

A zero-based Y-axis compresses every step into a narrow band at the top of the chart, which hides the rising-baseline pattern — the primary signal for leak detection. The axis automatically ranges to the data, so each step's change is visible, not just its absolute value.

This widget is the cleanest signal of a memory leak.

**What to look for:**

- A roughly flat baseline means the service is healthy under steady load — the GC is reclaiming whatever is not needed.
- A rising staircase, where each step starts higher than the previous, means more memory is being retained after every GC cycle. The chart highlights this pattern with a "rising baseline = leak pattern" label when it detects sustained growth.
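The staircase pattern can be tested numerically. The sketch below is a hypothetical heuristic, not the detection the chart uses: it fits a least-squares slope to the post-GC values and also requires most consecutive steps to rise, so a single large step or a noisy flat baseline does not trigger it.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (one index per GC)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def rising_baseline(live_set_after_gc: list[float], min_rising_share: float = 0.7) -> bool:
    """live_set_after_gc: jvm.memory.used_after_last_gc, one value per GC cycle."""
    if len(live_set_after_gc) < 3:
        return False  # too few GC cycles to call a trend
    steps = list(zip(live_set_after_gc, live_set_after_gc[1:]))
    rising = sum(1 for prev, cur in steps if cur > prev)
    return slope(live_set_after_gc) > 0 and rising / len(steps) >= min_rising_share
```

The `min_rising_share` default of 0.7 is an arbitrary illustration value; a lower value flags leaks sooner at the cost of more false positives.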

### Heap by pool

Style: stacked area. Y-axis: bytes (GB), single axis. Underlying data: `jvm.memory.used` split by `jvm.memory.pool.name`, filtered to `jvm.memory.type=heap`.

Pool names depend on the active GC algorithm — G1, ZGC, Shenandoah, and Parallel GC each report different pool sets. The widget shows whatever pools the JVM reports.

| Series   | Style                 | What it shows                               |
| -------- | --------------------- | ------------------------------------------- |
| old gen  | Stacked area (purple) | Long-lived objects                          |
| survivor | Stacked area (green)  | Objects that survived at least one minor GC |
| eden     | Stacked area (blue)   | Short-lived allocations                     |

**What to look for:**

- **Eden** is where new objects are allocated. It should rise and fall quickly with each young-generation collection — that is normal.
- **Old Gen** holds long-lived objects. If it climbs steadily and never drops back down, the service is heading toward an out-of-memory error.
- **Survivor** holds objects that survived at least one collection. If it stays unusually large, the JVM is keeping objects around longer than expected before promoting them to Old Gen.
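The stack is `jvm.memory.used` filtered to heap and summed per pool. A minimal sketch, assuming each data point is a dictionary carrying the `jvm.memory.type` and `jvm.memory.pool.name` attributes; under G1, for instance, the pools are named `G1 Eden Space`, `G1 Survivor Space`, and `G1 Old Gen`. The function name is hypothetical.

```python
from collections import defaultdict


def heap_by_pool(points: list[dict]) -> dict[str, float]:
    """Sum jvm.memory.used per heap pool (across instances when aggregated)."""
    stacks: dict[str, float] = defaultdict(float)
    for p in points:
        attrs = p["attributes"]
        if attrs.get("jvm.memory.type") != "heap":
            continue  # Metaspace, code cache, and other non_heap pools are excluded
        stacks[attrs["jvm.memory.pool.name"]] += p["value"]
    return dict(stacks)
```

Because the pool names come straight from the data, the same code works for any collector; only the labels change.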

## CPU

The CPU row separates JVM CPU consumption, thread state composition, and class-loading activity into three widgets so each signal is readable on its own axis.

### JVM CPU utilization

Style: line with reference line. Y-axis: `cores` (CPU usage expressed in core-equivalents). A `100% ceiling` reference line marks the total available cores, so the saturation point is always in view.

| Series       | Style               | What it shows                                        |
| ------------ | ------------------- | ---------------------------------------------------- |
| jvm.cpu.used | Solid line (purple) | CPU consumed by the JVM process, in core-equivalents |
| 100% ceiling | Reference line      | Total available cores — the saturation ceiling       |

**What to look for:**

- Usage hitting the `100% ceiling` line combined with mostly **runnable** threads in the next widget means the service is CPU-bound — typically a hot loop or heavy compute.
- Usage well below the ceiling with a high **blocked** thread share means the service is stuck waiting on locks, not doing work.
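The two readings above combine CPU utilization with the thread-state mix. A minimal sketch of that decision rule, using hypothetical thresholds (90% of cores for saturation, 60% runnable, 20% blocked) chosen only for illustration:

```python
# Hypothetical thresholds, for illustration only.
SATURATED = 0.9        # share of available cores
MOSTLY_RUNNABLE = 0.6  # share of threads in runnable
BLOCKED_HIGH = 0.2     # share of threads in blocked


def diagnose(cores: float, cpu_count: int, threads_by_state: dict[str, int]) -> str:
    """cores: rate(jvm.cpu.time); cpu_count: jvm.cpu.count;
    threads_by_state: jvm.thread.count grouped by jvm.thread.state."""
    total = sum(threads_by_state.values())
    if total == 0 or cpu_count == 0:
        return "no clear signal"
    utilization = cores / cpu_count
    runnable = threads_by_state.get("runnable", 0) / total
    blocked = threads_by_state.get("blocked", 0) / total
    if utilization >= SATURATED and runnable >= MOSTLY_RUNNABLE:
        return "cpu-bound"        # hot loop or heavy compute
    if utilization < SATURATED and blocked >= BLOCKED_HIGH:
        return "lock contention"  # threads queued on monitors
    return "no clear signal"
```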

### Thread count by state

Style: stacked area. Y-axis: thread count (integer), single axis. Underlying data: `jvm.thread.count` grouped by `jvm.thread.state`.

| Series   | Source                                                     | Style                | What it shows                                  |
| -------- | ---------------------------------------------------------- | -------------------- | ---------------------------------------------- |
| runnable | `jvm.thread.count` filtered to `jvm.thread.state=runnable` | Stacked area (blue)  | Threads currently running or ready to run      |
| waiting  | `jvm.thread.count` filtered to `jvm.thread.state=waiting`  | Stacked area (amber) | Threads waiting on another thread or condition |
| blocked  | `jvm.thread.count` filtered to `jvm.thread.state=blocked`  | Stacked area (coral) | Threads waiting to acquire a monitor lock      |

The stacked composition matters as much as the total height.

**What to look for:**

- A healthy service usually shows most threads in **runnable** (doing work) or **waiting** (idle between requests).
- A spike in **blocked** threads means lock contention — threads are queued up waiting on a monitor.
- A growing total stack height over time without matching traffic growth is a thread leak.
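The leak signal compares thread growth with traffic growth. The sketch below is a deliberately simple, hypothetical heuristic: it compares only the first and last samples and assumes thread count would scale roughly with request rate, which real thread pools with fixed maximums do not always do. The `tolerance` parameter is an illustrative allowance for normal pool growth.

```python
def thread_leak_suspected(thread_counts: list[int], requests_per_s: list[float],
                          tolerance: float = 0.2) -> bool:
    """Flag total thread growth that outpaces traffic growth.
    Assumes non-zero first samples."""
    thread_growth = thread_counts[-1] / thread_counts[0]
    traffic_growth = requests_per_s[-1] / requests_per_s[0]
    return thread_growth > traffic_growth * (1 + tolerance)
```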

### Class loading

Style: bars + line. Y-axis: `classes / window` (left), `total` (right). Underlying data: `jvm.class.loaded`, `jvm.class.unloaded`, `jvm.class.count`.

| Series       | Source                     | Style      | What it shows                    |
| ------------ | -------------------------- | ---------- | -------------------------------- |
| load rate    | `rate(jvm.class.loaded)`   | Green bars | Classes loaded per time bucket   |
| unload rate  | `rate(jvm.class.unloaded)` | Coral bars | Classes unloaded per time bucket |
| total loaded | `jvm.class.count`          | Green line | Currently loaded classes         |

**What to look for:**

- The total class count should level off after the application warms up. If it keeps climbing after warm-up, with classes still being loaded and few or none unloaded, you have a classic classloader leak. This is common in OSGi containers, plugin-heavy applications, and services that hot-reload code in production.
- Unload activity is normally near zero in a healthy JVM.
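The widget's series derive from three OpenTelemetry metrics: `jvm.class.loaded` and `jvm.class.unloaded` are cumulative since JVM start, and `jvm.class.count` is the number of classes currently loaded. A minimal sketch of the per-bucket rates and a post-warm-up growth check; the warm-up cut-off and the 500-class growth threshold are hypothetical parameters.

```python
def per_bucket(cumulative: list[int]) -> list[int]:
    """Per-bucket increase of a cumulative counter (ignores counter resets)."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def classloader_leak_suspected(class_count: list[int], warmup_samples: int,
                               min_growth: int = 500) -> bool:
    """class_count: jvm.class.count samples. Flags net growth after warm-up."""
    after_warmup = class_count[warmup_samples:]
    if len(after_warmup) < 2:
        return False
    return after_warmup[-1] - after_warmup[0] >= min_growth
```

Applied to `jvm.class.loaded` samples, `per_bucket` gives the green bars; applied to `jvm.class.unloaded`, it gives the coral bars.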

## GC

The GC row separates pause duration from event frequency, because the two answer different questions and combining them on one chart obscures both signals. A latency overlay completes the row so you can align GC pauses with end-user impact.

### GC pause duration

Style: multi-line. Y-axis: `ms`, single axis. Underlying data: `jvm.gc.duration` histogram. Filterable by `jvm.gc.name` (collector name).

| Series | Style      | What it shows              |
| ------ | ---------- | -------------------------- |
| p50    | Green line | Median pause time          |
| p95    | Blue line  | 95th percentile pause time |
| p99    | Red line   | 99th percentile pause time |

Lines are grouped by GC collector — for example, **G1 Young Generation** or **ZGC**. If the JVM reports a more specific action (such as "end of minor GC"), the widget can break the data down further, but it does not force a minor-vs-major split: modern collectors like ZGC and Shenandoah do not separate collections that way, and the widget reflects whatever the JVM actually emits.

**What to look for:**

- A widening gap between p50 and p99 means pauses are becoming unpredictable — most are short, but some run long. This usually points to a fragmented heap or a GC that is struggling to keep up with allocation.
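Percentiles from a histogram are estimates. The sketch below uses linear interpolation inside the bucket that contains the target rank, the approach Prometheus' `histogram_quantile` takes; the tab's own estimator may differ. Bounds are in seconds. The OpenTelemetry conventions advise boundaries of 0.01, 0.1, 1, and 10 s for `jvm.gc.duration`, so with those buckets the estimate can only be as precise as the bucket a percentile falls in.

```python
import math


def quantile(q: float, bounds: list[float], counts: list[int]) -> float:
    """bounds: ascending upper bounds of the finite buckets (seconds).
    counts: per-bucket, non-cumulative counts; len(counts) == len(bounds) + 1,
    where the last entry is the +Inf overflow bucket."""
    total = sum(counts)
    if total == 0:
        return math.nan
    rank = q * total
    cumulative = 0
    lower = 0.0
    for upper, count in zip(bounds, counts):
        if count and cumulative + count >= rank:
            # Interpolate linearly inside the bucket holding the target rank
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
        lower = upper
    # Rank falls in the +Inf bucket: report the highest finite bound
    return bounds[-1]
```

Multiply the result by 1,000 to get the millisecond values the widget plots. With counts of 900, 90, 9, 1, and 0 across the advised buckets, p99 lands at 0.1 s, the top of the second bucket.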

### GC event count

Style: bars. Y-axis: `events / window`, single axis. Underlying data: `jvm.gc.duration` event count, derived as `rate(jvm.gc.duration_count)`.

| Series               | Source                                                                      | Style              | What it shows                                                            |
| -------------------- | --------------------------------------------------------------------------- | ------------------ | ------------------------------------------------------------------------ |
| minor / major (G1GC) | `rate(jvm.gc.duration_count)`, grouped by `jvm.gc.name` and `jvm.gc.action` | Bars per collector | Collections per time bucket, split by collector and action when reported |

When the JVM reports `jvm.gc.action`, the widget shows it as a secondary breakdown so you can see which kind of collection — young-gen vs full — is driving the rate. As with **GC pause duration**, the widget does not force a minor-vs-major split across collectors that do not have one.

**What to look for:**

- A high event rate with low pause durations (in **GC pause duration**) is healthy GC — collections fire often but finish quickly.
- A low event rate with high pause durations is the dangerous pattern: infrequent but expensive full GCs.
- A step increase in event rate at a deployment timestamp usually means the new code allocates more memory per request than the previous version.
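The bars come from a cumulative count, so each bucket is a difference between consecutive samples. A minimal sketch, assuming samples of `(timestamp, {(jvm.gc.name, jvm.gc.action): cumulative_count})` for one instance; the type aliases, function names, and the 1.5× step-change ratio are hypothetical.

```python
Key = tuple[str, str]                  # (jvm.gc.name, jvm.gc.action)
Sample = tuple[float, dict[Key, int]]  # (timestamp_s, cumulative counts)


def events_per_bucket(samples: list[Sample]) -> list[tuple[float, dict[Key, int]]]:
    buckets = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        counts = {}
        for key, total in cur.items():
            delta = total - prev.get(key, 0)
            # A negative delta means the counter reset (for example, a JVM
            # restart), so the whole current value counts as new events.
            counts[key] = delta if delta >= 0 else total
        buckets.append((ts, counts))
    return buckets


def step_change(rates_before: list[float], rates_after: list[float],
                ratio: float = 1.5) -> bool:
    """True when the mean rate after a deployment is at least `ratio` times higher."""
    before = sum(rates_before) / len(rates_before)
    after = sum(rates_after) / len(rates_after)
    return after >= before * ratio
```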

### Service latency vs pauses

Style: line + vertical markers overlay (one of three widgets in the GC row, same width as the others). Y-axis: P99 latency (ms). Underlying data: P99 request latency from span metrics, with `jvm.gc.duration` events annotated as vertical markers. Both series are scoped to the same service and time window as the rest of the JVM Metrics view.

| Series         | Source                                                                        | Style                        | What it shows                             |
| -------------- | ----------------------------------------------------------------------------- | ---------------------------- | ----------------------------------------- |
| p99 latency    | Span metrics for the service                                                  | Purple line with shaded area | End-user request latency over time        |
| GC pause event | `jvm.gc.duration` observations above the configured threshold (default 50 ms) | Vertical dashed red line     | Each GC pause that exceeded the threshold |

Span metrics and JVM metrics live on the same platform, so no cross-system correlation is needed.

**What to look for:**

- A latency spike that lines up with a GC pause marker is most likely GC-caused. The pause stopped all application threads, so every request in flight during that window accumulated extra wait time.
- A latency spike with no nearby pause marker is not GC-related. Look at downstream calls in [Dependencies](https://coralogix.com/docs/user-guides/apm/features/dependencies/introduction/index.md) or lock contention in **Thread count by state** instead.
- A pause marker with no matching latency spike means requests were short enough, or concurrency low enough, that no request happened to span the pause.
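The reading above can be sketched as a pairing step: mark pauses above the threshold, find latency spikes, and check whether a marker sits near each spike. The 50 ms default comes from the table above. The spike rule (P99 above twice its median), the 15-second alignment tolerance, and the function names are hypothetical.

```python
from statistics import median

PAUSE_THRESHOLD_MS = 50.0  # documented default threshold


def pause_markers(gc_pauses: list[tuple[float, float]]) -> list[float]:
    """gc_pauses: (timestamp_s, pause_ms), jvm.gc.duration converted to ms."""
    return [ts for ts, ms in gc_pauses if ms > PAUSE_THRESHOLD_MS]


def latency_spikes(p99: list[tuple[float, float]], factor: float = 2.0) -> list[float]:
    """Hypothetical spike rule: P99 above `factor` times its median."""
    baseline = median(v for _, v in p99)
    return [ts for ts, v in p99 if v > factor * baseline]


def classify_spikes(p99: list[tuple[float, float]],
                    gc_pauses: list[tuple[float, float]],
                    tolerance_s: float = 15.0) -> dict[float, str]:
    markers = pause_markers(gc_pauses)
    result = {}
    for ts in latency_spikes(p99):
        near_pause = any(abs(ts - m) <= tolerance_s for m in markers)
        result[ts] = "GC-caused" if near_pause else "not GC-related"
    return result
```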

## Common use cases

| Symptom                                                    | Where to look first                                                                                                                                                                                                 |
| ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Intermittent latency spikes, traces look fine              | **Service latency vs pauses**: align spikes with GC pause markers                                                                                                                                                   |
| Service throws an out-of-memory (OOM) error intermittently | **Heap used vs memory**: check `used` approaching `limit`, then confirm in **Live set - memory after GC** whether the live set is rising                                                                            |
| Memory never returns to baseline after deployment          | **Live set - memory after GC**: a rising staircase after the deployment timestamp indicates a leak introduced in the new version                                                                                    |
| Service is slow but spans show low self-time               | **Thread count by state**: high `blocked` share with low CPU in **JVM CPU utilization** is lock contention                                                                                                          |
| CPU pegged at the ceiling but throughput is low            | **JVM CPU utilization** combined with **Thread count by state**: runnable threads dominating with high CPU is a hot loop or a GC pressure spiral; cross-reference with **GC pause duration** and **GC event count** |
| Metaspace or class count keeps growing                     | **Class loading**: total count rising after warm-up indicates a classloader leak; switch the **Heap used vs memory** widget to non-heap to confirm Metaspace pressure                                               |
| GC overhead jumped after a code change                     | **GC event count**: step change in bar height at deployment time means the new code allocates more per request                                                                                                      |

## Limitations

- JVM metrics are emitted at the JVM process level. Multiple deployed applications inside a single JVM (Tomcat, JBoss, WebLogic) cannot be visualized separately — heap usage, GC behavior, and thread counts reflect the entire JVM process.
- On hosts running more than one JVM outside Kubernetes, instances are distinguished by the combination of `host.name` and `service.name`.
- Instances that have stopped reporting (for example, terminated pods) are removed from the JVM instances selector after a short grace period.

## Next steps

If your service is not yet sending JVM metrics, follow [Send JVM metrics](https://coralogix.com/docs/user-guides/apm/features/runtime-metrics/send-jvm-metrics/index.md) to enable the OpenTelemetry agent's metrics exporter.

## Related resources

- [Send JVM metrics](https://coralogix.com/docs/user-guides/apm/features/runtime-metrics/send-jvm-metrics/index.md)
- [Service Catalog](https://coralogix.com/docs/user-guides/apm/features/service-catalog/index.md)
- [Java OpenTelemetry instrumentation](https://coralogix.com/docs/opentelemetry/instrumentation-options/java-opentelemetry-instrumentation/index.md)
- [OpenTelemetry JVM semantic conventions](https://opentelemetry.io/docs/specs/semconv/runtime/jvm-metrics/)
