Runtime metrics
The Runtime metrics tab in Service Catalog displays JVM runtime metrics — heap memory, garbage collection, thread states, CPU usage, and class loading — alongside service latency. It correlates JVM internals with user-facing performance, so you can diagnose memory leaks, GC pressure, and thread contention without leaving APM.
Limited availability
This feature is in preview and is subject to change.
Why it matters
When a Java service slows down, throws errors, or pegs CPU, the cause is often invisible from traces and span metrics alone. The usual suspects — long GC pauses, heap exhaustion, thread contention, classloader leaks — happen inside the JVM, one layer below where APM normally looks.
The Runtime metrics tab pulls JVM internals onto the same screen as your traces, so you can:
- Tie latency spikes to GC pauses. The Service latency vs pauses widget overlays GC events on request latency. If a spike lines up with a pause marker, you have your answer.
- Tell a memory leak apart from normal allocation pressure. Three separate memory widgets (Heap used vs memory, Live set - memory after GC, Heap by pool) let you read the heap properly instead of guessing from a single line.
- Distinguish CPU saturation from lock contention. Looking at CPU usage and thread states side-by-side makes it obvious whether the service is busy doing work (CPU-bound) or stuck waiting (locks).
- Isolate one bad pod. The JVM instances filter and the per-instance heatmap let you find the single misbehaving instance dragging down the cluster average.
What you need
OpenTelemetry semantic conventions only
The Runtime metrics tab reads only JVM metrics that follow the OpenTelemetry JVM semantic conventions — for example, jvm.memory.used, jvm.gc.duration, jvm.thread.count. Metrics that use other naming conventions (such as the Micrometer/Prometheus form jvm_memory_used_bytes) are not recognized and the tab does not show them. If your services emit JVM metrics under a different naming convention, migrate the instrumentation to the OpenTelemetry conventions before the tab can render data for them.
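To audit which of your existing metric names would be picked up, a quick name-shape check can help. The sketch below is only an illustration of the naming rule described above: the helper names are hypothetical, and the regular expression checks only the dotted jvm.* shape, not membership in the full list of OpenTelemetry JVM semantic conventions.

```python
import re

# OpenTelemetry JVM metric names are dotted and start with "jvm.",
# for example jvm.memory.used. Prometheus-style names such as
# jvm_memory_used_bytes use underscores throughout and do not match.
_OTEL_JVM_NAME = re.compile(r"^jvm(\.[a-z][a-z0-9_]*)+$")


def follows_otel_jvm_convention(metric_name: str) -> bool:
    """Return True if the name has the dotted jvm.* shape (hypothetical helper)."""
    return bool(_OTEL_JVM_NAME.match(metric_name))


def partition_metric_names(names):
    """Split names into (recognized, unrecognized) lists, preserving order."""
    recognized = [n for n in names if follows_otel_jvm_convention(n)]
    unrecognized = [n for n in names if not follows_otel_jvm_convention(n)]
    return recognized, unrecognized
```

Running partition_metric_names over an export of your metric names gives a first list of candidates to migrate; names in the unrecognized list are the ones to rename or re-instrument.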
To make a Java, Scala, or Kotlin service report JVM metrics in the supported format:
- Attach the OpenTelemetry Java agent (opentelemetry-javaagent.jar) to the service process. The agent collects all stable jvm.* metrics from the OpenTelemetry semantic conventions automatically. See Java OpenTelemetry instrumentation for installation steps, or follow Send JVM metrics for the end-to-end setup.
- Required JDK: Java 8 or later for the stable metric set.
- The metrics must include the service.name resource attribute so they correlate with the service in the Service Catalog.
For services that are not JVM-based (Python, Go, Node.js, and so on — identified by the telemetry.sdk.language resource attribute), the Runtime metrics tab is hidden entirely. For JVM-based services that have not started reporting jvm.* metrics yet, the tab is shown with an empty state (see When no JVM metrics are detected) that points you to instructions for sending the metrics.
How JVM metrics reach the tab
flowchart LR
JVM["Java / Scala / Kotlin<br/>service (JVM process)"]
Agent["OpenTelemetry Java agent<br/>opentelemetry-javaagent.jar"]
Coralogix["Coralogix"]
Tab["Runtime metrics tab<br/>filtered by service.name<br/>+ instance attribute"]
JVM -->|JMX MBeans| Agent
Agent -->|jvm.memory.* · jvm.gc.duration<br/>jvm.thread.count · jvm.cpu.*<br/>jvm.class.* + service.name| Coralogix
Coralogix --> Tab
class JVM entry
class Tab success
The OpenTelemetry Java agent collects JVM metrics from your service and sends them to Coralogix tagged with service.name. The tab uses that tag, plus the instance attribute it detects (k8s.pod.name, host.name, and so on), to render the right panels.
Access the Runtime metrics tab
- In your Coralogix toolbar, navigate to APM, then Service Catalog.
- Select a JVM-based service to open the service drilldown.
- Select the Runtime metrics tab.
The tab loads with the default time range and the JVM Metrics layout.
When no JVM metrics are detected
If you open the Runtime metrics tab for a JVM service that has not yet started sending JVM metrics in the OpenTelemetry format, the tab loads with an empty state instead of the widget grid. From the empty state, you can either start the JVM Observability extension flow directly in the tab, or open the Send JVM metrics guide to set up instrumentation yourself.
The tab continues to show the empty state until at least one jvm.* metric is observed for the service in the current time window. Once metrics start flowing, the widgets render automatically — no configuration change is needed inside the tab itself.
Note
Non-JVM services (such as Python or Go) do not show the Runtime metrics tab at all. The empty state is reserved for JVM-based services that are eligible to report jvm.* metrics but have not yet done so.
Layout
The JVM Metrics view is organized top to bottom as:
- JVM instances: selector that scopes every panel below it to a subset of running JVM instances.
- Instance heatmap: a row that shows the health of each instance at a glance.
- JVM summary: four stat cards spanning full width.
- Memory: Heap used vs memory, Live set - memory after GC, Heap by pool.
- CPU: JVM CPU utilization, Thread count by state, Class loading.
- GC: GC pause duration, GC event count, Service latency vs pauses.
The five content sections (Instance heatmap, JVM summary, Memory, CPU, GC) are collapsible. The JVM instances selector is a control bar and is always visible.
When JVM summary is collapsed, each card collapses into a chip that still shows its current value, trend arrow, and severity color, so the at-a-glance signal is preserved. When Memory, CPU, or GC is collapsed, the section header shows the widget titles as compact chips.
Hovering any chart syncs the crosshair across the other widgets in the tab, so you can correlate the same point in time across memory, CPU, and GC at once.
JVM instances selector
The instance selector sits at the top of the view and scopes every widget to a subset of the running JVM instances. The default is All instances (aggregated), which sums or averages across every JVM reporting metrics for the service.
Why per-instance filtering matters
JVM metrics are emitted per JVM process. In a horizontally scaled service, every pod runs its own JVM, and aggregate values hide the most common failure mode — one bad instance. A memory leak on a single pod, GC pauses isolated to one instance, or a thread leak on one node is invisible in the aggregate view until the pod fails. The selector lets you compare one suspect instance against the rest of the fleet.
When you select a single instance, every widget below filters to that instance only. When you select multiple instances, every widget shows one series per instance, color-coded by instance.
Instance heatmap
Below the instance selector is a collapsible per-instance overview row. It is collapsed by default, which suits normal operations; expand it during incident triage.
When expanded, the row renders a compact grid where every running instance has one row and four columns — Heap used %, GC used time %, GC P99, CPU %. Each cell is color-coded by severity (green, amber, red), so an instance that is misbehaving on any single dimension is visually obvious without selecting each instance one by one.
| Column | Source metric | Severity rule |
|---|---|---|
| Heap used % | jvm.memory.used ÷ jvm.memory.limit | Green at low utilization, amber as it climbs, red at sustained high utilization |
| GC used time % | jvm.gc.duration summed as a fraction of wall time | Green when the JVM spends little wall time paused for GC; red when pauses dominate |
| GC P99 | jvm.gc.duration 99th percentile | Green for short pauses, red for long pauses |
| CPU % | rate(jvm.cpu.time) ÷ jvm.cpu.count (or jvm.cpu.recent_utilization × 100 as fallback) | Green at low utilization, amber as it climbs, red near saturation |
Selecting any row filters every widget below it to that instance. Large fleets are paginated — use the paginator below the heatmap to step through additional instances.
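The Heap used % and CPU % formulas in the table above can be worked through as follows. This is a minimal sketch, not the tab's implementation: the function names are hypothetical, the amber and red thresholds (70 and 90) are assumptions because the severity rules above do not state numeric cut-offs, and the fallback is assumed to apply when CPU time or CPU count is unavailable.

```python
def heap_used_pct(used_bytes: float, limit_bytes: float) -> float:
    """jvm.memory.used divided by jvm.memory.limit, as a percentage."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return 100.0 * used_bytes / limit_bytes


def cpu_pct(cpu_time_delta_s, window_s, cpu_count, recent_utilization=None):
    """rate(jvm.cpu.time) / jvm.cpu.count as a percentage.

    cpu_time_delta_s is the increase in jvm.cpu.time over the window.
    Falls back to jvm.cpu.recent_utilization * 100 when that data is missing
    (assumed fallback condition).
    """
    if cpu_time_delta_s is not None and cpu_count and window_s > 0:
        cores_in_use = cpu_time_delta_s / window_s  # CPU seconds per second
        return 100.0 * cores_in_use / cpu_count
    if recent_utilization is not None:
        return 100.0 * recent_utilization
    raise ValueError("no CPU data available")


def severity(pct: float, amber_at: float = 70.0, red_at: float = 90.0) -> str:
    """Map a percentage to a color. Thresholds are illustrative assumptions."""
    if pct >= red_at:
        return "red"
    if pct >= amber_at:
        return "amber"
    return "green"


GIB = 1024 ** 3
print(heap_used_pct(3 * GIB, 4 * GIB))  # 75.0
print(cpu_pct(120.0, 60.0, 4))          # 50.0 (2 cores busy out of 4)
```

Note that the sketch evaluates a single point in time, whereas the heap rule above turns red only at sustained high utilization.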
JVM summary
The summary strip displays four headline numbers an on-call engineer checks first. Each card shows an aggregated current value with a short context line and a directional delta versus the previous equivalent window (↗ red = worsening, ↘ green = improving, — neutral).
| Card | Source metric | Aggregation (card subtitle) |
|---|---|---|
| Heap used | jvm.memory.used filtered to jvm.memory.type=heap, summed across pools | Avg across instances, in GB |
| GC overhead | jvm.gc.duration as a fraction of wall time | Avg of wall time spent in GC pauses |
| GC pause P99 | jvm.gc.duration | P99, averaged across instances, last 30 minutes |
| Thread count | jvm.thread.count | Avg platform threads across instances |
What to look for:
- A red ↗ trend arrow on any card means the metric got worse since the previous window — cross-check with the detailed widgets below to see what is driving it.
- Heap used climbing toward the heap limit is a pre-OOM warning.
- GC overhead above a few percent is unhealthy — the JVM is losing meaningful wall time to GC. Drill into GC pause duration and GC event count to see whether long pauses or frequent collections are responsible.
- GC pause P99 rising means tail pauses are stretching, which directly hurts user-facing latency. Confirm against Service latency vs pauses.
- Thread count drifting up without matching traffic growth is a thread leak. A sudden drop usually means a thread pool was resized at deployment.
Two cards surface contextual badges when a specific signal appears:
- GC overhead shows a Driven by pod badge when one instance is responsible for most of the GC overhead. The badge flags a single bad pod without requiring you to expand the heatmap.
- Thread count shows a Stable badge when the thread count is steady across the window. The badge confirms that there is no thread leak or runaway pool growth.
What GC overhead measures
The GC overhead card reports the percentage of wall time the JVM was paused for garbage collection. It is not the same as CPU consumed by GC. Concurrent collectors such as ZGC and Shenandoah can spend significant CPU on garbage collection while showing low values here, because they do most of their work without stopping application threads.
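As a worked example of the definition above, GC overhead is the total pause time in a window divided by the length of that window. The sketch below assumes you already have the individual pause durations for the window; the function name is hypothetical.

```python
import math


def gc_overhead_pct(pause_durations_s, window_s: float) -> float:
    """Percentage of wall time spent paused for GC within the window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # fsum avoids accumulating rounding error over many small pauses.
    return 100.0 * math.fsum(pause_durations_s) / window_s


# Three pauses totalling 1.5 s in a 60 s window.
print(gc_overhead_pct([0.25, 0.5, 0.75], 60.0))  # 2.5
```

Because only pause time is counted, a concurrent collector that burns CPU in the background without stopping application threads contributes little to this number.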
Memory
The Memory row answers three distinct questions, each on its own widget: how much memory is the JVM using, how much is retained after each garbage collection, and how is that usage distributed across heap pools.
Heap used vs memory
Style: area + lines. Y-axis: bytes (auto-scaled GB/MB), single axis. Underlying data: jvm.memory.*, split by jvm.memory.type.
A toggle switches between Heap and Non-heap (Metaspace). The Y-axis automatically rescales on toggle — heap is typically in the GB range, non-heap (Metaspace) in the hundreds of MB — so the chart stays readable in both modes.
| Series | Source metric | Style | What it shows |
|---|---|---|---|
| used | jvm.memory.used | Filled area (blue) | Currently allocated memory |
| committed | jvm.memory.committed | Dashed line (green) | Memory the OS has reserved for the JVM |
| limit | jvm.memory.limit | Dashed reference line (red) | Memory ceiling (-Xmx for heap, -XX:MaxMetaspaceSize for non-heap) |
What to look for:
- The gap between used and limit is your headroom before an out-of-memory error.
- The gap between used and committed is memory the OS has reserved but the JVM is not using yet. A shrinking gap under load is an early warning that the JVM is running out of slack.
- Switch to non-heap to surface Metaspace, a common source of OOM errors in services that load a lot of classes (plugin-heavy apps, frameworks that use a lot of reflection).
For leak detection, use Live set - memory after GC instead. The used line bounces with every GC cycle, which makes trends hard to read.
Live set - memory after GC
Style: stepped line. Y-axis: bytes, auto-ranged to the data (not zero-based). Underlying data: jvm.memory.used_after_last_gc summed across heap pools. Each horizontal segment is the live set between two GC events; each vertical step marks a GC event that changed it.
| Series | Source | What it shows |
|---|---|---|
| live set size | jvm.memory.used_after_last_gc | Memory retained after the most recent GC |
| GC event | Derived from jvm.gc.duration | Vertical markers where each GC event fired |
Why this Y-axis is not zero-based
A zero-based Y-axis compresses every step into a narrow band at the top of the chart, which hides the rising-baseline pattern — the primary signal for leak detection. The axis automatically ranges to the data, so each step's change is visible, not just its absolute value.
This widget is the cleanest signal of a memory leak.
What to look for:
- A roughly flat baseline means the service is healthy under steady load — the GC is reclaiming whatever is not needed.
- A rising staircase, where each step starts higher than the previous, means more memory is being retained after every GC cycle. The chart highlights this pattern with a "rising baseline = leak pattern" label when it detects sustained growth.
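The rising-staircase pattern can be approximated with a simple heuristic over successive post-GC values. This is an illustrative sketch only: how the chart actually detects sustained growth is not documented here, and the function name and all three parameters (min_points, rising_share, min_growth) are assumptions chosen for the example.

```python
def looks_like_rising_baseline(live_set_bytes, min_points=4,
                               rising_share=0.8, min_growth=0.10):
    """Heuristic leak check over successive jvm.memory.used_after_last_gc values.

    Returns True when most steps rise and the last value exceeds the first
    by at least min_growth (a fraction of the first value).
    """
    if len(live_set_bytes) < min_points:
        return False
    steps = list(zip(live_set_bytes, live_set_bytes[1:]))
    rising = sum(1 for prev, curr in steps if curr > prev)
    first, last = live_set_bytes[0], live_set_bytes[-1]
    grew_enough = first > 0 and (last - first) / first >= min_growth
    return rising / len(steps) >= rising_share and grew_enough


MB = 1024 * 1024
staircase = [500 * MB, 520 * MB, 545 * MB, 570 * MB, 600 * MB]
steady = [500 * MB, 510 * MB, 495 * MB, 505 * MB, 500 * MB]
print(looks_like_rising_baseline(staircase))  # True
print(looks_like_rising_baseline(steady))     # False
```

A heuristic like this is sensitive to window length: warm-up growth right after a deployment can look like a leak over a short window, so compare against a longer time range before concluding.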
Heap by pool
Style: stacked area. Y-axis: bytes (GB), single axis. Underlying data: jvm.memory.used split by jvm.memory.pool.name, filtered to jvm.memory.type=heap.
Pool names depend on the active GC algorithm — G1, ZGC, Shenandoah, and Parallel GC each report different pool sets. The widget shows whatever pools the JVM reports.
| Series | Style | What it shows |
|---|---|---|
| old gen | Stacked area (purple) | Long-lived objects |
| survivor | Stacked area (green) | Objects that survived at least one minor GC |
| eden | Stacked area (blue) | Short-lived allocations |
What to look for:
- Eden is where new objects are allocated. It should rise and fall quickly with each young-generation collection — that is normal.
- Old Gen holds long-lived objects. If it climbs steadily and never drops back down, the service is heading toward an out-of-memory error.
- Survivor holds objects that survived at least one collection. If it stays unusually large, the JVM is keeping objects around longer than expected before promoting them to Old Gen.
CPU
The CPU row separates JVM CPU consumption, thread state composition, and class-loading activity into three widgets so each signal is readable on its own axis.
JVM CPU utilization
Style: line with reference line. Y-axis: cores (CPU usage expressed in core-equivalents). A 100% ceiling reference line marks the total available cores, so the saturation point is always in view.
| Series | Style | What it shows |
|---|---|---|
| jvm.cpu.used | Solid line (purple) | CPU consumed by the JVM process, in core-equivalents |
| 100% ceiling | Reference line | Total available cores — the saturation ceiling |
What to look for:
- Usage hitting the 100% ceiling line combined with mostly runnable threads in the next widget means the service is CPU-bound, typically a hot loop or heavy compute.
- Usage well below the ceiling with a high blocked thread share means the service is stuck waiting on locks, not doing work.
Thread count by state
Style: stacked area. Y-axis: thread count (integer), single axis. Underlying data: jvm.thread.count grouped by jvm.thread.state.
| Series | Source | Style | What it shows |
|---|---|---|---|
| runnable | jvm.thread.count filtered to jvm.thread.state=runnable | Stacked area (blue) | Threads currently running or ready to run |
| waiting | jvm.thread.count filtered to jvm.thread.state=waiting | Stacked area (amber) | Threads waiting on another thread or condition |
| blocked | jvm.thread.count filtered to jvm.thread.state=blocked | Stacked area (coral) | Threads waiting to acquire a monitor lock |
The stacked composition matters as much as the total height.
What to look for:
- A healthy service usually shows most threads in runnable (doing work) or waiting (idle between requests).
- A spike in blocked threads means lock contention — threads are queued up waiting on a monitor.
- A growing total stack height over time without matching traffic growth is a thread leak.
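The combined reading of JVM CPU utilization and Thread count by state (CPU-bound versus lock contention) can be sketched as a small decision rule. Every threshold below is an assumption made for illustration; the tab itself does not publish numeric cut-offs, and the function name is hypothetical.

```python
def classify_bottleneck(cpu_pct, thread_counts, cpu_high=90.0, cpu_low=50.0,
                        runnable_share_high=0.5, blocked_share_high=0.3):
    """Classify a snapshot as 'cpu-bound', 'lock-contention', or 'inconclusive'.

    thread_counts maps a jvm.thread.state value (for example 'runnable',
    'waiting', 'blocked') to the jvm.thread.count for that state.
    """
    total = sum(thread_counts.values())
    if total == 0:
        return "inconclusive"
    runnable_share = thread_counts.get("runnable", 0) / total
    blocked_share = thread_counts.get("blocked", 0) / total
    if cpu_pct >= cpu_high and runnable_share >= runnable_share_high:
        return "cpu-bound"          # busy doing work
    if cpu_pct <= cpu_low and blocked_share >= blocked_share_high:
        return "lock-contention"    # stuck waiting on monitors
    return "inconclusive"


print(classify_bottleneck(97.0, {"runnable": 40, "waiting": 8, "blocked": 2}))
# cpu-bound
print(classify_bottleneck(20.0, {"runnable": 5, "waiting": 15, "blocked": 30}))
# lock-contention
```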
Class loading
Style: bars + line. Y-axis: classes / window (left), total (right). Underlying data: jvm.class.loaded, jvm.class.unloaded, jvm.class.count.
| Series | Source | Style | What it shows |
|---|---|---|---|
| load rate | rate(jvm.class.loaded) | Green bars | Classes loaded per time bucket |
| unload rate | rate(jvm.class.unloaded) | Coral bars | Classes unloaded per time bucket |
| total loaded | jvm.class.count | Green line | Currently loaded classes |
What to look for:
- The total class count should level off after the application warms up. If it keeps growing without an increase in load rate, you have a classic classloader leak — common in OSGi containers, plugin-heavy applications, or services that hot-reload code in production.
- Unload activity is normally near zero in a healthy JVM.
GC
The GC row separates pause duration from event frequency — they answer different questions and combining them on one chart obscures both signals — and pairs them with a latency overlay so GC pauses can be aligned with end-user impact.
GC pause duration
Style: multi-line. Y-axis: ms, single axis. Underlying data: jvm.gc.duration histogram. Filterable by jvm.gc.name (collector name).
| Series | Style | What it shows |
|---|---|---|
| p50 | Green line | Median pause time |
| p95 | Blue line | 95th percentile pause time |
| p99 | Red line | 99th percentile pause time |
Lines are grouped by GC collector — for example, G1 Young Generation or ZGC. If the JVM reports a more specific action (such as "end of minor GC"), the widget can break the data down further, but it does not force a minor-vs-major split: modern collectors like ZGC and Shenandoah do not separate collections that way, and the widget reflects whatever the JVM actually emits.
What to look for:
- A widening gap between p50 and p99 means pauses are becoming unpredictable — most are short, but some run long. This usually points to a fragmented heap or a GC that is struggling to keep up with allocation.
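Because jvm.gc.duration is a histogram, percentiles such as p50 and p99 are estimates derived from bucket counts. The sketch below shows one common way to do this, linear interpolation inside the bucket that contains the target rank. It is not necessarily how the widget computes its lines, and the bucket boundaries in the example are made up for illustration.

```python
def histogram_quantile(q, upper_bounds, bucket_counts):
    """Estimate the q-quantile (0 <= q <= 1) of an explicit-bucket histogram.

    upper_bounds: sorted finite bucket upper bounds.
    bucket_counts: per-bucket (not cumulative) counts, one more entry than
    upper_bounds; the last entry is the overflow bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(bucket_counts) != len(upper_bounds) + 1:
        raise ValueError("expected len(upper_bounds) + 1 bucket counts")
    total = sum(bucket_counts)
    if total == 0:
        return None
    rank = q * total
    cumulative = 0
    for i, count in enumerate(bucket_counts):
        if count > 0 and cumulative + count >= rank:
            if i == len(upper_bounds):
                # Overflow bucket has no upper bound: report the highest
                # finite bound as a lower estimate.
                return upper_bounds[-1]
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            upper = upper_bounds[i]
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
    return upper_bounds[-1]


# Hypothetical pause histogram in milliseconds: 100 pauses in total.
bounds_ms = [10, 50, 100, 500]
counts = [80, 15, 4, 1, 0]
p50 = histogram_quantile(0.50, bounds_ms, counts)  # about 6.25 ms
p99 = histogram_quantile(0.99, bounds_ms, counts)  # about 100 ms
```

In this example the p50 sits inside the first bucket while the p99 reaches the top of the third, which is the widening-gap pattern described above. The estimate is only as precise as the bucket boundaries allow.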
GC event count
Style: bars. Y-axis: events / window, single axis. Underlying data: jvm.gc.duration event count, derived as rate(jvm.gc.duration_count).
| Series | Source | Style | What it shows |
|---|---|---|---|
| minor / major (G1GC) | rate(jvm.gc.duration_count), grouped by jvm.gc.name and jvm.gc.action | Bars per collector | Collections per time bucket, split by collector and action when reported |
When the JVM reports jvm.gc.action, the widget shows it as a secondary breakdown so you can see which kind of collection — young-gen vs full — is driving the rate. As with GC pause duration, the widget does not force a minor-vs-major split across collectors that do not have one.
What to look for:
- A high event rate with low pause durations (in GC pause duration) is healthy GC — collections fire often but finish quickly.
- A low event rate with high pause durations is the dangerous pattern: infrequent but expensive full GCs.
- A step-change in event rate at a deployment timestamp means the new code allocates more memory per request than the previous version.
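The bars in this widget come from a cumulative counter, so each bar is the difference between two successive samples. A minimal sketch of that conversion follows, assuming evenly spaced samples and treating any decrease as a counter reset (for example, a JVM restart); the function name is hypothetical.

```python
def events_per_bucket(cumulative_counts):
    """Convert successive samples of a cumulative event counter into
    per-interval event counts.

    A sample lower than its predecessor is treated as a counter reset,
    so the new value itself is taken as the count for that interval.
    """
    deltas = []
    for prev, curr in zip(cumulative_counts, cumulative_counts[1:]):
        deltas.append(curr - prev if curr >= prev else curr)
    return deltas


# The counter resets between the third and fourth samples.
print(events_per_bucket([100, 104, 109, 3, 10]))  # [4, 5, 3, 7]
```

A step-change after a deployment shows up here as a sustained jump in the per-interval values, not as a single tall bar.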
Service latency vs pauses
Style: line + vertical markers overlay (one of three widgets in the GC row, same width as the others). Y-axis: P99 latency (ms). Underlying data: P99 request latency from span metrics, with jvm.gc.duration events annotated as vertical markers. Both series are scoped to the same service and time window as the rest of the JVM Metrics view.
| Series | Source | Style | What it shows |
|---|---|---|---|
| p99 latency | Span metrics for the service | Purple line with shaded area | End-user request latency over time |
| GC pause event | jvm.gc.duration observations above the configured threshold (default 50 ms) | Vertical dashed red line | Each GC pause that exceeded the threshold |
Span metrics and JVM metrics live on the same platform, so no cross-system correlation is needed.
What to look for:
- A latency spike that lines up with a GC pause marker is GC-caused. The pause stopped all application threads, so any in-flight request piled up wait time during that window.
- A latency spike with no nearby pause marker is not GC-related. Look at downstream calls in Dependencies or lock contention in Thread count by state instead.
- A pause marker with no matching latency spike means requests were short enough, or concurrency low enough, that no request happened to span the pause.
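The alignment logic described above can be sketched in a few lines. The 50 ms marker threshold comes from the table above; the 5-second tolerance for how close a marker must be to a spike, the GcPause type, and the function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GcPause:
    start_s: float       # timestamp of the pause, in epoch seconds
    duration_ms: float   # observed jvm.gc.duration value, in milliseconds


def pause_markers(pauses, threshold_ms=50.0):
    """Keep only pauses long enough to be drawn as markers."""
    return [p for p in pauses if p.duration_ms > threshold_ms]


def explain_spike(spike_ts_s, pauses, threshold_ms=50.0, tolerance_s=5.0):
    """Return 'gc-related' if a pause marker falls within tolerance_s
    of the latency spike, otherwise 'not-gc-related'."""
    for pause in pause_markers(pauses, threshold_ms):
        if abs(pause.start_s - spike_ts_s) <= tolerance_s:
            return "gc-related"
    return "not-gc-related"


pauses = [GcPause(start_s=1000.0, duration_ms=30.0),
          GcPause(start_s=1060.0, duration_ms=220.0)]
print(explain_spike(1062.0, pauses))  # gc-related
print(explain_spike(1001.0, pauses))  # not-gc-related (30 ms is below threshold)
```

For spikes classified as not GC-related, the next places to look are the ones named above: downstream calls in Dependencies and lock contention in Thread count by state.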
Common use cases
| Symptom | Where to look first |
|---|---|
| Intermittent latency spikes, traces look fine | Service latency vs pauses: align spikes with GC pause markers |
| Service throws an out-of-memory (OOM) error intermittently | Heap used vs memory: check used approaching limit, then confirm in Live set - memory after GC whether the live set is rising |
| Memory never returns to baseline after deployment | Live set - memory after GC: a rising staircase after the deployment timestamp indicates a leak introduced in the new version |
| Service is slow but spans show low self-time | Thread count by state: high blocked share with low CPU in JVM CPU utilization is lock contention |
| CPU pegged at the ceiling but throughput is low | JVM CPU utilization combined with Thread count by state: runnable threads dominating with high CPU is a hot loop or a GC pressure spiral; cross-reference with GC pause duration and GC event count |
| Metaspace or class count keeps growing | Class loading: total count rising after warm-up indicates a classloader leak; switch the Heap used vs memory widget to non-heap to confirm Metaspace pressure |
| GC overhead jumped after a code change | GC event count: step change in bar height at deployment time means the new code allocates more per request |
Limitations
- JVM metrics are emitted at the JVM process level. Multiple deployed applications inside a single JVM (Tomcat, JBoss, WebLogic) cannot be visualized separately — heap usage, GC behavior, and thread counts reflect the entire JVM process.
- On hosts running more than one JVM outside Kubernetes, instances are distinguished by the combination of host.name and service.name.
- Instances that have stopped reporting (terminated pods) are excluded from the filter selector after a short grace period.
Next steps
If your service is not yet sending JVM metrics, follow Send JVM metrics to enable the OpenTelemetry agent's metrics exporter.
Related resources
- Send JVM metrics
- Service Catalog
- Java OpenTelemetry instrumentation
- OpenTelemetry JVM semantic conventions