
Runtime metrics

The Runtime metrics tab in Service Catalog displays JVM runtime metrics — heap memory, garbage collection, thread states, CPU usage, and class loading — alongside service latency. It correlates JVM internals with user-facing performance, so you can diagnose memory leaks, GC pressure, and thread contention without leaving APM.

Limited availability

This feature is in preview and is subject to change.

Why it matters

When a Java service slows down, throws errors, or pegs CPU, the cause is often invisible from traces and span metrics alone. The usual suspects — long GC pauses, heap exhaustion, thread contention, classloader leaks — happen inside the JVM, one layer below where APM normally looks.

The Runtime metrics tab pulls JVM internals onto the same screen as your traces, so you can:

  • Tie latency spikes to GC pauses. The Service latency vs pauses widget overlays GC events on request latency. If a spike lines up with a pause marker, you have your answer.
  • Tell a memory leak apart from normal allocation pressure. Three separate memory widgets (Heap used vs memory, Live set - memory after GC, Heap by pool) let you read the heap properly instead of guessing from a single line.
  • Distinguish CPU saturation from lock contention. Looking at CPU usage and thread states side-by-side makes it obvious whether the service is busy doing work (CPU-bound) or stuck waiting (locks).
  • Isolate one bad pod. The JVM instances filter and the per-instance heatmap let you find the single misbehaving instance dragging down the cluster average.

What you need

OpenTelemetry semantic conventions only

The Runtime metrics tab reads only JVM metrics that follow the OpenTelemetry JVM semantic conventions, for example jvm.memory.used, jvm.gc.duration, and jvm.thread.count. Metrics that use other naming conventions (such as the Micrometer/Prometheus form jvm_memory_used_bytes) are not recognized, and the tab does not show them. If your services emit JVM metrics under a different naming convention, the tab cannot render data for them until you migrate the instrumentation to the OpenTelemetry conventions.
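To make the naming difference concrete, here is a minimal sketch that checks whether a metric name has the dotted OpenTelemetry shape. The regular expression and the function name `follows_otel_jvm_convention` are illustrative assumptions. They are not how the tab matches metrics internally.

```python
import re

# Hypothetical rule: starts with "jvm." and uses lowercase, dot-separated
# segments. This mirrors the examples above; it is not Coralogix's matching logic.
OTEL_JVM_NAME = re.compile(r"^jvm(\.[a-z][a-z0-9_]*)+$")


def follows_otel_jvm_convention(metric_name: str) -> bool:
    """Return True for OpenTelemetry-style names such as jvm.memory.used."""
    return OTEL_JVM_NAME.fullmatch(metric_name) is not None


if __name__ == "__main__":
    for name in ("jvm.memory.used", "jvm.gc.duration", "jvm_memory_used_bytes"):
        print(f"{name}: {follows_otel_jvm_convention(name)}")
```

The Micrometer/Prometheus-style name fails the check because it uses underscores between segments and carries a unit suffix.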

To make a Java, Scala, or Kotlin service report JVM metrics in the supported format:

  • Attach the OpenTelemetry Java agent (opentelemetry-javaagent.jar) to the service process. The agent collects all stable jvm.* metrics from the OpenTelemetry semantic conventions automatically. See Java OpenTelemetry instrumentation for installation steps, or follow Send JVM metrics for the end-to-end setup.
  • Required JDK: Java 8 or later for the stable metric set.
  • The metrics must include the service.name resource attribute so they correlate with the service in the Service Catalog.

For services that are not JVM-based (Python, Go, Node.js, and so on — identified by the telemetry.sdk.language resource attribute), the Runtime metrics tab is hidden entirely. For JVM-based services that have not started reporting jvm.* metrics yet, the tab is shown with an empty state (see When no JVM metrics are detected) that points you to instructions for sending the metrics.
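The visibility rules above can be summarized as a small decision function. This is a hypothetical sketch: `runtime_tab_state` and its return labels are illustrative names, not a Coralogix API. It assumes the service's resource carries `telemetry.sdk.language`. The OpenTelemetry Java agent reports the value `java` for every JVM language it instruments, including Scala and Kotlin services.

```python
# Hypothetical sketch of the tab visibility rules; names are illustrative.
# The OpenTelemetry Java agent reports telemetry.sdk.language="java" for any
# JVM-based service it instruments.
JVM_SDK_LANGUAGES = {"java"}


def runtime_tab_state(resource: dict, metric_names_in_window: set) -> str:
    """Return "hidden", "empty_state", or "widgets" for one service."""
    language = resource.get("telemetry.sdk.language", "")
    if language not in JVM_SDK_LANGUAGES:
        return "hidden"        # non-JVM service: the tab is not rendered
    if not any(name.startswith("jvm.") for name in metric_names_in_window):
        return "empty_state"   # JVM service with no jvm.* metric in the window yet
    return "widgets"           # at least one jvm.* metric observed
```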

How JVM metrics reach the tab

```mermaid
flowchart LR
    JVM["Java / Scala / Kotlin<br/>service (JVM process)"]
    Agent["OpenTelemetry Java agent<br/>opentelemetry-javaagent.jar"]
    Coralogix["Coralogix"]
    Tab["Runtime metrics tab<br/>filtered by service.name<br/>+ instance attribute"]

    JVM -->|JMX MBeans| Agent
    Agent -->|jvm.memory.* · jvm.gc.duration<br/>jvm.thread.count · jvm.cpu.*<br/>jvm.class.* + service.name| Coralogix
    Coralogix --> Tab

    class JVM entry
    class Tab success
```

The OpenTelemetry Java agent collects JVM metrics from your service and sends them to Coralogix tagged with service.name. The tab uses that tag, plus the instance attribute it detects (k8s.pod.name, host.name, and so on), to render the right panels.
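As a rough sketch of how an instance might be identified from resource attributes, the function below prefers the pod name and falls back to `host.name`, paired with `service.name` as described under Limitations. The attribute precedence and the function name `instance_key` are assumptions for illustration, not the tab's documented detection order.

```python
# Hypothetical precedence: the pod name on Kubernetes, otherwise host.name.
# The order is an assumption; the tab's actual detection logic may differ.
INSTANCE_ATTRIBUTES = ("k8s.pod.name", "host.name")


def instance_key(resource: dict) -> tuple:
    """Return (service.name, attribute, value) identifying one JVM instance."""
    service = resource["service.name"]  # required to correlate with the Service Catalog
    for attribute in INSTANCE_ATTRIBUTES:
        value = resource.get(attribute)
        if value:
            return (service, attribute, value)
    return (service, None, None)  # no instance attribute: cannot split per instance
```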

Access the Runtime metrics tab

  1. In your Coralogix toolbar, navigate to APM, then Service Catalog.
  2. Select a JVM-based service to open the service drilldown.
  3. Select the Runtime metrics tab.

The tab loads with the default time range and the JVM Metrics layout.

When no JVM metrics are detected

If you open the Runtime metrics tab for a JVM service that has not yet started sending JVM metrics in the OpenTelemetry format, the tab loads with an empty state instead of the widget grid. From the empty state, you can either start the JVM Observability extension flow directly in the tab, or open the Send JVM metrics guide to set up instrumentation yourself.

The tab continues to show the empty state until at least one jvm.* metric is observed for the service in the current time window. Once metrics start flowing, the widgets render automatically — no configuration change is needed inside the tab itself.

Note

Non-JVM services (such as Python or Go) do not show the Runtime metrics tab at all. The empty state is reserved for JVM-based services that are eligible to report jvm.* metrics but have not yet done so.

Layout

The JVM Metrics view is organized top to bottom as:

  1. JVM instances: a selector that scopes every panel below it to a subset of running JVM instances.
  2. Instance heatmap: a row that shows the health of each instance at a glance.
  3. JVM summary: four stat cards spanning full width.
  4. Memory: Heap used vs memory, Live set - memory after GC, Heap by pool.
  5. CPU: JVM CPU utilization, Thread count by state, Class loading.
  6. GC: GC pause duration, GC event count, Service latency vs pauses.

The five content sections (Instance heatmap, JVM summary, Memory, CPU, GC) are collapsible. The JVM instances selector is a control bar and is always visible.

When JVM summary is collapsed, each card collapses into a chip that still shows its current value, trend arrow, and severity color, so the at-a-glance signal is preserved. When Memory, CPU, or GC is collapsed, the section header shows the widget titles as compact chips.

Hovering any chart syncs the crosshair across the other widgets in the tab, so you can correlate the same point in time across memory, CPU, and GC at once.

JVM instances selector

The instance selector sits at the top of the view and scopes every widget to a subset of the running JVM instances. The default is All instances (aggregated), which sums or averages across every JVM reporting metrics for the service.

Why per-instance filtering matters

JVM metrics are emitted per JVM process. In a horizontally scaled service, every pod runs its own JVM, and aggregate values hide the most common failure mode — one bad instance. A memory leak on a single pod, GC pauses isolated to one instance, or a thread leak on one node is invisible in the aggregate view until the pod fails. The selector lets you compare one suspect instance against the rest of the fleet.

When you select a single instance, every widget below filters to that instance only. When you select multiple instances, every widget shows one series per instance, color-coded by instance.
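A short worked example shows why the aggregated view can look healthy while one pod is close to its limit. The heap percentages below are made-up illustrative values, not real data.

```python
from statistics import mean

# Illustrative heap-used percentages for four pods of one service (made-up values).
heap_used_pct = {"pod-a": 41.0, "pod-b": 44.0, "pod-c": 43.0, "pod-d": 96.0}

fleet_average = mean(heap_used_pct.values())
worst_pod, worst_value = max(heap_used_pct.items(), key=lambda item: item[1])

print(f"All instances (aggregated): {fleet_average:.1f}%")  # looks moderate
print(f"Worst instance: {worst_pod} at {worst_value:.1f}%")  # close to the limit
```

The fleet average sits in the mid-50s while one pod is above 95%, which is exactly the failure mode that per-instance filtering is meant to expose.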

Instance heatmap

Below the instance selector is a collapsible per-instance overview row. It is collapsed by default so it stays out of the way during normal operations; expand it during incident triage.

When expanded, the row renders a compact grid where every running instance has one row and four columns — Heap used %, GC used time %, GC P99, CPU %. Each cell is color-coded by severity (green, amber, red), so an instance that is misbehaving on any single dimension is visually obvious without selecting each instance one by one.
Column | Source metric | Severity rule
Heap used % | jvm.memory.used ÷ jvm.memory.limit | Green at low utilization, amber as it climbs, red at sustained high utilization
GC used time % | jvm.gc.duration summed as a fraction of wall time | Green when the JVM spends little wall time paused for GC; red when pauses dominate
GC P99 | jvm.gc.duration 99th percentile | Green for short pauses, red for long pauses
CPU % | rate(jvm.cpu.time) ÷ jvm.cpu.count (or jvm.cpu.recent_utilization × 100 as fallback) | Green at low utilization, amber as it climbs, red near saturation
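The cell coloring can be sketched as a threshold lookup per column. The table only describes the bands qualitatively ("low", "climbs", "sustained high"), so every cut-off in `THRESHOLDS` is a hypothetical placeholder, not a product value. The sketch also ignores the "sustained" time dimension and colors a single snapshot.

```python
# Hypothetical thresholds: placeholders for illustration, not product values.
THRESHOLDS = {
    "heap_used_pct": (70.0, 85.0),  # (amber at or above, red at or above)
    "gc_time_pct": (2.0, 5.0),
    "gc_p99_ms": (100.0, 500.0),
    "cpu_pct": (70.0, 90.0),
}


def heap_used_pct(used_bytes: float, limit_bytes: float) -> float:
    """jvm.memory.used divided by jvm.memory.limit for the heap, as a percentage."""
    return 100.0 * used_bytes / limit_bytes


def severity(column: str, value: float) -> str:
    amber, red = THRESHOLDS[column]
    if value >= red:
        return "red"
    if value >= amber:
        return "amber"
    return "green"


def heatmap_row(values: dict) -> dict:
    """Color every cell of one instance's row."""
    return {column: severity(column, value) for column, value in values.items()}


row = heatmap_row({
    "heap_used_pct": heap_used_pct(3_000_000_000, 4_000_000_000),  # 75%
    "gc_time_pct": 1.2,
    "gc_p99_ms": 620.0,
    "cpu_pct": 35.0,
})
```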

Selecting any row filters every widget below it to that instance. Large fleets are paginated — use the paginator below the heatmap to step through additional instances.

JVM summary

The summary strip displays four headline numbers an on-call engineer checks first. Each card shows an aggregated current value with a short context line and a directional delta versus the previous equivalent window (↗ red = worsening, ↘ green = improving, — neutral).
Card | Source metric | Aggregation (card subtitle)
Heap used | jvm.memory.used filtered to jvm.memory.type=heap, summed across pools | Avg across instances, in GB
GC overhead | jvm.gc.duration as a fraction of wall time | Avg of wall time spent in GC pauses
GC pause P99 | jvm.gc.duration | P99, averaged across instances, last 30 minutes
Thread count | jvm.thread.count | Avg platform threads across instances
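The directional delta compares the current value with the previous equivalent window. A minimal sketch, assuming higher values are worse for every card and that changes within a hypothetical 1% relative band count as neutral (the text does not say how small a change is treated as neutral):

```python
def trend(current: float, previous: float, neutral_band: float = 0.01) -> str:
    """Classify the change versus the previous equivalent window.

    Assumes higher is worse for every card; neutral_band is a hypothetical
    relative tolerance, not a documented product setting.
    """
    if previous == 0:
        return "neutral" if current == 0 else "worsening"
    change = (current - previous) / previous
    if change > neutral_band:
        return "worsening"  # rendered as a red up arrow
    if change < -neutral_band:
        return "improving"  # rendered as a green down arrow
    return "neutral"
```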

What to look for:

  • A red ↗ trend arrow on any card means the metric got worse since the previous window — cross-check with the detailed widgets below to see what is driving it.
  • Heap used climbing toward the heap limit is a pre-OOM warning.
  • GC overhead above a few percent is unhealthy — the JVM is losing meaningful wall time to GC. Drill into GC pause duration and GC event count to see whether long pauses or frequent collections are responsible.
  • GC pause P99 rising means tail pauses are stretching, which directly hurts user-facing latency. Confirm against Service latency vs pauses.
  • Thread count drifting up without matching traffic growth is a thread leak. A sudden drop usually means a thread pool was resized at deployment.

Two cards surface contextual badges when a specific signal appears:

  • GC overhead shows a Driven by pod badge when one instance is responsible for most of the GC overhead. This flags a single bad pod without your having to expand the heatmap.
  • Thread count shows a Stable badge when the thread count stays steady across the window, which confirms there is no thread leak or runaway pool growth.
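The two badge conditions can be approximated as follows. This is a sketch under stated assumptions: "most of the GC overhead" is read as a share above 50%, and "steady" as every sample staying within 5% of the window mean. Both thresholds and both function names are hypothetical.

```python
def driven_by_pod(gc_pause_seconds_by_pod: dict, share: float = 0.5):
    """Return the pod responsible for more than `share` of total GC pause time, else None."""
    total = sum(gc_pause_seconds_by_pod.values())
    if total == 0:
        return None
    pod, seconds = max(gc_pause_seconds_by_pod.items(), key=lambda item: item[1])
    return pod if seconds / total > share else None


def is_stable(thread_counts: list, tolerance: float = 0.05) -> bool:
    """True if every jvm.thread.count sample stays within `tolerance` of the mean."""
    average = sum(thread_counts) / len(thread_counts)
    return all(abs(count - average) <= tolerance * average for count in thread_counts)
```

Comparing pause seconds per pod is equivalent to comparing GC overhead here because every pod is measured over the same wall-time window.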

What GC overhead measures

The GC overhead card reports the percentage of wall time the JVM was paused for garbage collection. It is not the same as CPU consumed by GC. Concurrent collectors such as ZGC and Shenandoah can spend significant CPU on garbage collection while showing low values here, because they do most of their work without stopping application threads.
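The wall-time arithmetic behind the card fits in one line. A minimal sketch, assuming each jvm.gc.duration observation is a stop-the-world pause measured in seconds (the unit OpenTelemetry uses for this metric):

```python
def gc_overhead_pct(pause_durations_s: list, window_s: float) -> float:
    """Percentage of wall time spent paused, from jvm.gc.duration observations.

    Assumes each observation is a stop-the-world pause in seconds.
    """
    return 100.0 * sum(pause_durations_s) / window_s


# Three pauses totaling 1 s inside a 50 s window -> 2% of wall time.
overhead = gc_overhead_pct([0.25, 0.5, 0.25], window_s=50.0)
```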

Memory

The Memory row answers three distinct questions, each on its own widget: how much memory is the JVM using, how much is retained after each garbage collection, and how is that usage distributed across heap pools.

Heap used vs memory

Style: area + lines. Y-axis: bytes (auto-scaled GB/MB), single axis. Underlying data: jvm.memory.*, split by jvm.memory.type.

A toggle switches between Heap and Non-heap (Metaspace). The Y-axis automatically rescales on toggle — heap is typically in the GB range, non-heap (Metaspace) in the hundreds of MB — so the chart stays readable in both modes.
Series | Source metric | Style | What it shows
used | jvm.memory.used | Filled area (blue) | Currently allocated memory
committed | jvm.memory.committed | Dashed line (green) | Memory the OS has reserved for the JVM
limit | jvm.memory.limit | Dashed reference line (red) | Memory ceiling (-Xmx for heap, -XX:MaxMetaspaceSize for non-heap)

What to look for:

  • The gap between used and limit is your headroom before an out-of-memory error.
  • The gap between used and committed is memory the OS has reserved but the JVM is not using yet. A shrinking gap under load is an early warning that the JVM is running out of slack.
  • Switch to non-heap to surface Metaspace, a common source of OOM errors in services that load a lot of classes (plugin-heavy apps, frameworks that use a lot of reflection).
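The two gaps described above are simple differences between the three series. A minimal sketch in bytes, with the assumption that `limit` may be missing (for example, when a pool has no configured maximum), in which case headroom is left undefined:

```python
GIB = 1024 ** 3


def memory_gaps(used: int, committed: int, limit) -> dict:
    """Headroom and slack from jvm.memory.used, committed, and limit, in bytes.

    Assumption: limit is None when no ceiling is reported for the pool.
    """
    gaps = {"slack": committed - used}  # reserved by the OS but not yet used
    if limit is None:
        gaps["headroom"] = None
        gaps["headroom_pct"] = None
    else:
        gaps["headroom"] = limit - used  # distance to an out-of-memory error
        gaps["headroom_pct"] = 100.0 * (limit - used) / limit
    return gaps


gaps = memory_gaps(used=3 * GIB, committed=7 * GIB // 2, limit=4 * GIB)
```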

For leak detection, use Live set - memory after GC instead. The used line bounces with every GC cycle, which makes trends hard to read.

Live set - memory after GC

Style: stepped line. Y-axis: bytes, auto-ranged to the data (not zero-based). Underlying data: jvm.memory.used_after_last_gc summed across heap pools. Each horizontal segment represents the memory retained between two GC events; each vertical step marks one GC event.
Series | Source | What it shows
live set size | jvm.memory.used_after_last_gc | Memory retained after the most recent GC
GC event | Derived from jvm.gc.duration | Vertical markers where each GC event fired

Why this Y-axis is not zero-based

A zero-based Y-axis compresses every step into a narrow band at the top of the chart, which hides the rising-baseline pattern — the primary signal for leak detection. The axis automatically ranges to the data, so each step's change is visible, not just its absolute value.

This widget is the cleanest signal of a memory leak.

What to look for:

  • A roughly flat baseline means the service is healthy under steady load — the GC is reclaiming whatever is not needed.
  • A rising staircase, where each step starts higher than the previous, means more memory is being retained after every GC cycle. The chart highlights this pattern with a "rising baseline = leak pattern" label when it detects sustained growth.
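The staircase pattern can be detected by collapsing repeated post-GC samples into one value per GC and checking whether the most recent steps keep rising. This is a sketch: the function names and the `min_rising_steps` threshold are hypothetical, and two consecutive GCs that retain exactly the same amount would be merged into one step.

```python
def post_gc_steps(samples: list) -> list:
    """Collapse repeated jvm.memory.used_after_last_gc samples into one value per GC."""
    steps = []
    for value in samples:
        if not steps or value != steps[-1]:
            steps.append(value)
    return steps


def rising_baseline(samples: list, min_rising_steps: int = 4) -> bool:
    """True when each of the last `min_rising_steps` steps starts higher than the one before."""
    steps = post_gc_steps(samples)
    if len(steps) < min_rising_steps + 1:
        return False
    tail = steps[-(min_rising_steps + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))
```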

Heap by pool

Style: stacked area. Y-axis: bytes (GB), single axis. Underlying data: jvm.memory.used split by jvm.memory.pool.name, filtered to jvm.memory.type=heap.

Pool names depend on the active GC algorithm — G1, ZGC, Shenandoah, and Parallel GC each report different pool sets. The widget shows whatever pools the JVM reports.
Series | Style | What it shows
old gen | Stacked area (purple) | Long-lived objects
survivor | Stacked area (green) | Objects that survived at least one minor GC
eden | Stacked area (blue) | Short-lived allocations
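Because pool names vary by collector, a display layer has to map them onto the eden, survivor, and old gen series. The sketch below buckets names by substring. It assumes typical HotSpot pool names such as "G1 Eden Space", "PS Survivor Space", "G1 Old Gen", or "Tenured Gen". Single-pool collectors (for example, a pool named "ZHeap") fall into "other". The mapping is illustrative, not the widget's actual logic.

```python
def pool_bucket(pool_name: str) -> str:
    """Map a jvm.memory.pool.name value to a stacked-area series (hypothetical rules)."""
    name = pool_name.lower()
    if "eden" in name:
        return "eden"
    if "survivor" in name:
        return "survivor"
    if "old" in name or "tenured" in name:
        return "old gen"
    return "other"


def stack_by_bucket(used_by_pool: dict) -> dict:
    """Sum jvm.memory.used per bucket for a stacked-area view."""
    stacked = {}
    for pool, used in used_by_pool.items():
        bucket = pool_bucket(pool)
        stacked[bucket] = stacked.get(bucket, 0) + used
    return stacked
```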

What to look for:

  • Eden is where new objects are allocated. It should rise and fall quickly with each young-generation collection — that is normal.
  • Old Gen holds long-lived objects. If it climbs steadily and never drops back down, the service is heading toward an out-of-memory error.
  • Survivor holds objects that survived at least one collection. If it stays unusually large, the JVM is keeping objects around longer than expected before promoting them to Old Gen.

CPU

The CPU row separates JVM CPU consumption, thread state composition, and class-loading activity into three widgets so each signal is readable on its own axis.

JVM CPU utilization

Style: line with reference line. Y-axis: cores (CPU usage expressed in core-equivalents). A 100% ceiling reference line marks the total available cores, so the saturation point is always in view.
Series | Style | What it shows
jvm.cpu.used | Solid line (purple) | CPU consumed by the JVM process, in core-equivalents
100% ceiling | Reference line | Total available cores, the saturation ceiling

What to look for:

  • Usage hitting the 100% ceiling line combined with mostly runnable threads in the next widget means the service is CPU-bound — typically a hot loop or heavy compute.
  • Usage well below the ceiling with a high blocked thread share means the service is stuck waiting on locks, not doing work.
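Core-equivalents come from the growth of the cumulative jvm.cpu.time counter (CPU seconds) per second of wall time. The sketch below computes that value and then combines it with thread-state counts to separate the two patterns above. The cut-offs in `diagnose` (90% busy, 50% idle, 20% blocked) are hypothetical.

```python
def cpu_cores(cpu_time_start_s: float, cpu_time_end_s: float, interval_s: float) -> float:
    """Core-equivalents: growth of the jvm.cpu.time counter per second of wall time."""
    return (cpu_time_end_s - cpu_time_start_s) / interval_s


def diagnose(cores_used, cores_available, runnable, blocked, total_threads,
             busy=0.9, idle=0.5, blocked_share=0.2):
    """Hypothetical thresholds for CPU-bound versus lock-contention patterns."""
    utilization = cores_used / cores_available
    if utilization >= busy and runnable / total_threads > 0.5:
        return "cpu-bound"
    if utilization < idle and blocked / total_threads >= blocked_share:
        return "lock contention"
    return "inconclusive"


# 90 CPU-seconds consumed over a 60 s interval -> 1.5 cores.
cores = cpu_cores(1200.0, 1290.0, 60.0)
```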

Thread count by state

Style: stacked area. Y-axis: thread count (integer), single axis. Underlying data: jvm.thread.count grouped by jvm.thread.state.
Series | Source | Style | What it shows
runnable | jvm.thread.count filtered to jvm.thread.state=runnable | Stacked area (blue) | Threads currently running or ready to run
waiting | jvm.thread.count filtered to jvm.thread.state=waiting | Stacked area (amber) | Threads waiting on another thread or condition
blocked | jvm.thread.count filtered to jvm.thread.state=blocked | Stacked area (coral) | Threads waiting to acquire a monitor lock

The stacked composition matters as much as the total height.

What to look for:

  • A healthy service usually shows most threads in runnable (doing work) or waiting (idle between requests).
  • A spike in blocked threads means lock contention — threads are queued up waiting on a monitor.
  • A growing total stack height over time without matching traffic growth is a thread leak.
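Two of these checks reduce to simple ratios over jvm.thread.count. In this sketch, a thread leak is suspected when relative thread growth outpaces relative traffic growth by a hypothetical margin of 20 percentage points. The blocked share is the fraction of threads in the `blocked` state. Both function names are illustrative.

```python
def thread_leak_suspected(threads_start, threads_end, rps_start, rps_end,
                          margin=0.2):
    """Flag thread growth that outpaces request-rate growth (hypothetical margin)."""
    thread_growth = (threads_end - threads_start) / threads_start
    traffic_growth = (rps_end - rps_start) / rps_start
    return thread_growth - traffic_growth >= margin


def blocked_share(counts_by_state: dict) -> float:
    """Fraction of threads in the blocked state, from jvm.thread.count by jvm.thread.state."""
    total = sum(counts_by_state.values())
    return counts_by_state.get("blocked", 0) / total if total else 0.0
```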

Class loading

Style: bars + line. Y-axis: classes / window (left), total (right). Underlying data: jvm.class.loaded, jvm.class.unloaded, jvm.class.count.
Series | Source | Style | What it shows
load rate | rate(jvm.class.loaded) | Green bars | Classes loaded per time bucket
unload rate | rate(jvm.class.unloaded) | Coral bars | Classes unloaded per time bucket
total loaded | jvm.class.count | Green line | Currently loaded classes

What to look for:

  • The total class count should level off after the application warms up. If it keeps growing long after warm-up, with no matching change in workload, you have a classic classloader leak. This is common in OSGi containers, plugin-heavy applications, and services that hot-reload code in production.
  • Unload activity is normally near zero in a healthy JVM.
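The load and unload bars are per-bucket differences of cumulative counters, and the leak check looks at whether jvm.class.count keeps climbing after warm-up. A minimal sketch follows. `per_bucket` does not handle counter resets, and the warm-up length and `min_net_growth` are hypothetical parameters.

```python
def per_bucket(counter_samples: list) -> list:
    """Turn a cumulative counter (jvm.class.loaded or jvm.class.unloaded) into per-bucket counts."""
    return [later - earlier for earlier, later in zip(counter_samples, counter_samples[1:])]


def classloader_leak_suspected(class_count_samples: list, warmup_samples: int,
                               min_net_growth: int) -> bool:
    """True if jvm.class.count never drops after warm-up and grows by at least min_net_growth."""
    after_warmup = class_count_samples[warmup_samples:]
    if len(after_warmup) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(after_warmup, after_warmup[1:]))
    return never_drops and after_warmup[-1] - after_warmup[0] >= min_net_growth
```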

GC

The GC row separates pause duration from event frequency, because they answer different questions and combining them on one chart obscures both signals. It pairs them with a latency overlay so GC pauses can be aligned with end-user impact.

GC pause duration

Style: multi-line. Y-axis: ms, single axis. Underlying data: jvm.gc.duration histogram. Filterable by jvm.gc.name (collector name).
Series | Style | What it shows
p50 | Green line | Median pause time
p95 | Blue line | 95th percentile pause time
p99 | Red line | 99th percentile pause time

Lines are grouped by GC collector — for example, G1 Young Generation or ZGC. If the JVM reports a more specific action (such as "end of minor GC"), the widget can break the data down further, but it does not force a minor-vs-major split: modern collectors like ZGC and Shenandoah do not separate collections that way, and the widget reflects whatever the JVM actually emits.

What to look for:

  • A widening gap between p50 and p99 means pauses are becoming unpredictable — most are short, but some run long. This usually points to a fragmented heap or a GC that is struggling to keep up with allocation.
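Since jvm.gc.duration is a histogram, percentiles are estimated from bucket counts rather than read from individual pauses. The sketch below uses linear interpolation within the bucket that contains the target rank, the same approach as Prometheus `histogram_quantile`. This is an illustration, not necessarily how the widget computes its lines. The example bucket bounds of 0.01, 0.1, 1, and 10 seconds match the boundaries the OpenTelemetry semantic conventions advise for this metric, to the best of our knowledge. The counts are made up.

```python
def histogram_quantile(q: float, bounds: list, counts: list) -> float:
    """Estimate a quantile (in the bounds' unit) from an explicit-bucket histogram.

    bounds: ascending upper bounds of the finite buckets.
    counts: observations per bucket, len(bounds) + 1 entries (last is +Inf).
    """
    total = sum(counts)
    if total == 0:
        return float("nan")
    rank = q * total
    cumulative, lower = 0, 0.0
    for upper, count in zip(bounds, counts):
        if cumulative + count >= rank:
            if count == 0:
                return upper
            # Linear interpolation inside the bucket that holds the rank.
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
        lower = upper
    return bounds[-1]  # rank lies in the +Inf bucket: report the highest finite bound


bounds_s = [0.01, 0.1, 1.0, 10.0]
counts = [50, 40, 9, 1, 0]  # illustrative pause counts per bucket
p99_ms = 1000 * histogram_quantile(0.99, bounds_s, counts)
```

The result depends heavily on bucket width: with coarse buckets, p95 and p99 can only be located to within a bucket, which is worth keeping in mind when reading the gap between the lines.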

GC event count

Style: bars. Y-axis: events / window, single axis. Underlying data: jvm.gc.duration event count, derived as rate(jvm.gc.duration_count).
Series | Source | Style | What it shows
minor / major (G1GC) | rate(jvm.gc.duration_count), grouped by jvm.gc.name and jvm.gc.action | Bars per collector | Collections per time bucket, split by collector and action when reported

When the JVM reports jvm.gc.action, the widget shows it as a secondary breakdown so you can see which kind of collection — young-gen vs full — is driving the rate. As with GC pause duration, the widget does not force a minor-vs-major split across collectors that do not have one.
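Event counts per bucket come from differences of the cumulative count series, grouped by collector and action. A minimal sketch, treating any drop in the cumulative value as a counter reset (for example, after a JVM restart). The collector and action strings in the test are examples of values HotSpot reports, used only for illustration.

```python
def events_per_bucket(cumulative_counts: list) -> list:
    """Convert cumulative jvm.gc.duration_count samples into GC events per bucket."""
    events = []
    for previous, current in zip(cumulative_counts, cumulative_counts[1:]):
        # A drop means the counter restarted, so the current value is the
        # number of events since the reset.
        events.append(current - previous if current >= previous else current)
    return events


def events_by_collector(series: dict) -> dict:
    """Apply events_per_bucket to each (jvm.gc.name, jvm.gc.action) series."""
    return {key: events_per_bucket(samples) for key, samples in series.items()}
```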

What to look for:

  • A high event rate with low pause durations (in GC pause duration) is healthy GC — collections fire often but finish quickly.
  • A low event rate with high pause durations is the dangerous pattern: infrequent but expensive full GCs.
  • A step-change in event rate at a deployment timestamp means the new code allocates more memory per request than the previous version.

Service latency vs pauses

Style: line + vertical markers overlay (one of three widgets in the GC row, same width as the others). Y-axis: P99 latency (ms). Underlying data: P99 request latency from span metrics, with jvm.gc.duration events annotated as vertical markers. Both series are scoped to the same service and time window as the rest of the JVM Metrics view.
Series | Source | Style | What it shows
p99 latency | Span metrics for the service | Purple line with shaded area | End-user request latency over time
GC pause event | jvm.gc.duration observations above the configured threshold (default 50 ms) | Vertical dashed red line | Each GC pause that exceeded the threshold

Span metrics and JVM metrics live on the same platform, so no cross-system correlation is needed.

What to look for:

  • A latency spike that lines up with a GC pause marker is GC-caused. The pause stopped all application threads, so any in-flight request piled up wait time during that window.
  • A latency spike with no nearby pause marker is not GC-related. Look at downstream calls in Dependencies or lock contention in Thread count by state instead.
  • A pause marker with no matching latency spike means requests were short enough, or concurrency low enough, that no request happened to span the pause.
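The reading rules above can be sketched as two steps: keep only pauses longer than the marker threshold (default 50 ms), then label each latency spike by whether a marker falls near it. The 5-second matching tolerance and the function names are hypothetical choices for illustration.

```python
PAUSE_THRESHOLD_MS = 50.0  # default marker threshold described above


def pause_markers(pauses: list, threshold_ms: float = PAUSE_THRESHOLD_MS) -> list:
    """Timestamps (s) of pauses above the threshold; pauses are (timestamp_s, duration_ms)."""
    return [timestamp for timestamp, duration_ms in pauses if duration_ms > threshold_ms]


def classify_spikes(spike_times: list, markers: list, tolerance_s: float = 5.0) -> dict:
    """Label each latency spike by whether a pause marker falls within tolerance_s."""
    return {
        spike: "gc-caused" if any(abs(spike - m) <= tolerance_s for m in markers)
        else "not gc-related"
        for spike in spike_times
    }
```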

Common use cases

Symptom | Where to look first
Intermittent latency spikes, traces look fine | Service latency vs pauses: align spikes with GC pause markers
Service throws an out-of-memory (OOM) error intermittently | Heap used vs memory: check used approaching limit, then confirm in Live set - memory after GC whether the live set is rising
Memory never returns to baseline after deployment | Live set - memory after GC: a rising staircase after the deployment timestamp indicates a leak introduced in the new version
Service is slow but spans show low self-time | Thread count by state: high blocked share with low CPU in JVM CPU utilization is lock contention
CPU pegged at the ceiling but throughput is low | JVM CPU utilization combined with Thread count by state: runnable threads dominating with high CPU is a hot loop or a GC pressure spiral; cross-reference with GC pause duration and GC event count
Metaspace or class count keeps growing | Class loading: total count rising after warm-up indicates a classloader leak; switch the Heap used vs memory widget to non-heap to confirm Metaspace pressure
GC overhead jumped after a code change | GC event count: step change in bar height at deployment time means the new code allocates more per request

Limitations

  • JVM metrics are emitted at the JVM process level. Multiple deployed applications inside a single JVM (Tomcat, JBoss, WebLogic) cannot be visualized separately — heap usage, GC behavior, and thread counts reflect the entire JVM process.
  • On hosts running more than one JVM outside Kubernetes, instances are distinguished by the combination of host.name and service.name.
  • Instances that have stopped reporting (terminated pods) are excluded from the filter selector after a short grace period.

Next steps

If your service is not yet sending JVM metrics, follow Send JVM metrics to enable the OpenTelemetry agent's metrics exporter.