
Optimize AI costs

The Cost section in AI Center pairs two complementary views. The monitoring widgets — KPIs, model breakdown, token distribution, and top spenders — give you the overall picture: what you're spending, how it's trending, and which apps, models, and users drive the cost. Optimization insights works the other side: it scans live span data for costly patterns — truncated responses, runaway tool loops, untapped caching — and surfaces ranked, span-linked suggestions for reducing spend.

AI Center Overview Cost section showing Total cost, Cost change, Avg cost per span, and Cache hit rate KPIs, plus Token distribution and Cost over time charts

The Cost section surfaces overall AI spend with the four headline KPIs, a Token distribution bar, and a Cost over time area chart, so you can see the trend at a glance.

What you need

Every AI span must carry the standard OpenTelemetry GenAI usage attributes — token counts, the model, the provider, and the cache-read and cache-write variants for the cache metrics. See the Span attribute inventory for the full list and per-attribute notes.
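As a rough illustration of what "carrying the usage attributes" means, the sketch below assembles them for one span. The helper name `build_usage_attributes` is hypothetical, and the token-count and model attribute names are assumed from the OpenTelemetry GenAI semantic conventions; the Span attribute inventory remains the authoritative list.

```python
from typing import Optional


def build_usage_attributes(
    provider: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: Optional[int] = None,
    cache_creation_tokens: Optional[int] = None,
) -> dict:
    """Collect GenAI usage attributes for one AI span (hypothetical helper)."""
    attrs = {
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,  # assumed attribute name
        "gen_ai.usage.input_tokens": input_tokens,  # assumed attribute name
        "gen_ai.usage.output_tokens": output_tokens,  # assumed attribute name
    }
    # Cache counts are optional: skip them when the provider reports nothing,
    # but keep an explicit 0, which is a valid count.
    if cache_read_tokens is not None:
        attrs["gen_ai.usage.cache_read.input_tokens"] = cache_read_tokens
    if cache_creation_tokens is not None:
        attrs["gen_ai.usage.cache_creation.input_tokens"] = cache_creation_tokens
    return attrs
```

With the OpenTelemetry Python API, the resulting dictionary could be passed to `span.set_attributes(...)` on the active span.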

If your spans don't yet emit cache attributes, jump to Send cache and cost data to Coralogix.

Where cost lives

Cost is available in two scopes, with the same widget set:
| View | Scope | Extra widget |
| --- | --- | --- |
| AI Center Overview | All applications in your organization | Most expensive applications |
| Application Drilldown | A single application | High-spending users |

Both views respect the time range picker.

To open Cost:

  1. In the Coralogix UI, select AI Center, then Overview.
  2. To see organization-wide cost, select Cost in the sidebar.
  3. To see cost for one application, open Application Catalog, select an application, then select Cost in the drilldown sidebar.

Optimization insights

Optimization insights runs six deterministic rules against your live span data. The rules that fire on your data show up as cards, ranked by estimated savings.

Take Responses are getting cut off as an example. The card fires when at least 5% of your responses stop at the model's token limit instead of finishing naturally. You pay for those calls in full, but users see truncated answers, so the spend is wasted. The card reports the exact share, links to a representative span in AI Explorer so you can see the cut-off in context, and points at the fix: raise the max-token limit or shorten prompts so calls finish.
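To make the threshold concrete, here is a minimal sketch of the truncated-share check. It assumes you already have one finish reason per response as a plain list of strings; the function names are hypothetical and this is not Coralogix's implementation.

```python
from typing import List


def truncated_share(finish_reasons: List[str]) -> float:
    """Fraction of responses that stopped at the token limit."""
    if not finish_reasons:
        return 0.0
    truncated = sum(1 for reason in finish_reasons if reason == "length")
    return truncated / len(finish_reasons)


def truncation_rule_fires(finish_reasons: List[str], threshold: float = 0.05) -> bool:
    """True when at least `threshold` of responses were cut off."""
    return truncated_share(finish_reasons) >= threshold
```

For example, 5 truncated responses out of 100 is exactly the 5% default, so the rule would fire; 4 out of 100 would not.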

The six rules:
| # | Insight | Fires when | Default threshold |
| --- | --- | --- | --- |
| 1 | Responses are getting cut off | At least 5% of responses stop with finish_reason="length" (truncated) | 5% truncated share |
| 2 | Conversations are running very long | p95 conversation length is at least 20 turns | 20 turns, minimum 20 distinct conversations |
| 3 | Tool-call chains are spiraling | p95 tool calls per trace exceeds 25 (the agent guardrail) | 25 calls, minimum 20 traces |
| 4 | Rarely-used tools are bloating every prompt | A declared tool appears in fewer than 5% of calls but ships on every request | 5% usage share, minimum 100 tool calls |
| 5 | Large prompts are a caching opportunity | At least 5% of calls send a prompt larger than 2048 tokens (big repeated prefix → turn on caching) | 2048 tokens, 5% share, minimum 100 spans |
| 6 | Prompt caching is barely hitting | A cache-capable model has a cache hit rate under 10% | 10% hit rate, minimum 100 spans with cache data |

Every card exposes the same See a span example → action so you can confirm the pattern before changing anything. Cards are qualitative — they tell you what looks wrong and where to investigate, not what to change automatically. Rules with minimum-sample gates (long-conversation, tool-chain, unused-tool, large-prompt, low-cache-hit-rate) wait for enough traffic to be reliable, so low-traffic applications may show no insights at all.
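The interplay between a p95 threshold and a minimum-sample gate can be sketched as follows, using the tool-call rule's defaults. The nearest-rank percentile method and the function names are assumptions for illustration; the source does not state how Coralogix computes p95.

```python
from typing import List


def p95(values: List[int]) -> int:
    """Nearest-rank 95th percentile (assumed method)."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def tool_chain_rule_fires(
    tool_calls_per_trace: List[int], limit: int = 25, min_traces: int = 20
) -> bool:
    """Fire only when there is enough traffic AND p95 exceeds the limit."""
    if len(tool_calls_per_trace) < min_traces:
        return False  # sample gate: too few traces to be reliable
    return p95(tool_calls_per_trace) > limit
```

The gate runs first, so an application with only a handful of traces stays silent even if every one of them spirals.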

Optimization insights panel with cost-saving suggestion cards such as "Rarely-used tools are bloating every prompt" and "Large prompts are a caching opportunity", each linking to a representative span example

Monitoring widgets

KPI strip

Four headline metrics tell you at a glance whether AI spend is on track and where to dig if it isn't.

  • Total cost — Your AI bill for the selected range. Scoped to one application on the Application Drilldown or the whole organization on the AI Center Overview.
  • Cost change — Absolute and percentage change in total cost between the selected time range and the equivalent preceding period. Looking at the last 7 days compares against the 7 days before it, so you can tell at a glance whether spend is trending up or down.
  • Avg cost / span — Total cost divided by AI span count. Catches the expensive-per-call pattern when totals look flat.
  • Cache hit rate — Percentage of input tokens served from cache. Tells you whether caching is paying off; a low rate on a cache-capable model is money on the table (and Insight #6 flags this automatically).
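The Cost change and Avg cost / span arithmetic can be sketched in a few lines. The function names are hypothetical, and returning `None` for the percentage when the preceding period had zero cost is an assumption; the source does not say how that case is displayed.

```python
from datetime import datetime
from typing import Optional, Tuple


def preceding_period(start: datetime, end: datetime) -> Tuple[datetime, datetime]:
    """Window of equal length that ends where the selected range starts."""
    return start - (end - start), start


def cost_change(current: float, previous: float) -> Tuple[float, Optional[float]]:
    """Absolute and percentage change between two period totals."""
    absolute = current - previous
    # Percentage is undefined when the preceding period cost nothing (assumption).
    percent = None if previous == 0 else absolute * 100 / previous
    return absolute, percent


def avg_cost_per_span(total_cost: float, span_count: int) -> float:
    """Total cost divided by AI span count; 0.0 when there are no spans."""
    return total_cost / span_count if span_count else 0.0
```

Selecting 8 to 15 January, for instance, compares against 1 to 8 January, and a move from 100 to 120 in total cost reads as +20 and +20%.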

Token distribution

A stacked bar splitting tokens in the selected range into Input, Output, and Cached, so you can see where token volume concentrates.

Cost over time

A stacked area chart of cost across the selected window, split by input, output, and cached tokens — shows when a spike started and which token type drove it.

Cost by model

A table ranking every model in use by total cost, with Avg $ / span, % of spend, and Cache hit rate alongside (Cache hit rate reads N/A when a model's spans carry no cache attributes).
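A ranking like this is a group-by over spans. The sketch below assumes each span has already been reduced to a dictionary with hypothetical `model` and `cost` keys; it only illustrates how the columns relate to each other.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cost_by_model(spans: Iterable[Dict]) -> List[Dict]:
    """Rank models by total cost, with average cost per span and share of spend."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for span in spans:
        totals[span["model"]] += span["cost"]
        counts[span["model"]] += 1
    grand_total = sum(totals.values())
    rows = [
        {
            "model": model,
            "total_cost": total,
            "avg_cost_per_span": total / counts[model],
            "pct_of_spend": total * 100 / grand_total if grand_total else 0.0,
        }
        for model, total in totals.items()
    ]
    # Most expensive model first, matching the table's ranking.
    return sorted(rows, key=lambda row: row["total_cost"], reverse=True)
```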

High-spending users

Ranks users by total cost in the selected range; visible on the Application Drilldown.

Cost by model and High-spending users tables side by side — models and users ranked by total cost, with cost, avg cost per span, percentage of spend, and cache hit rate columns

Most expensive applications

Ranks applications by total cost in the selected range; visible on the AI Center Overview.

How Coralogix derives cost and cache metrics

  • Total cost sums the per-span price tags gen_ai.prompt_price, gen_ai.response_price, gen_ai.read_cache_price, and gen_ai.write_cache_price across the selected range and scope.
  • Cache hit rate is cache-read tokens divided by input tokens. Cached reads count as a subset of input tokens.
  • Cost change compares total cost for the selected range against total cost for the equivalent preceding period (the last 7 days versus the 7 days before), reported as both an absolute difference and a percentage.
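The first two derivations above can be sketched as follows, assuming each span is available as a dictionary of its attributes (a hypothetical shape). Treating a missing price attribute as zero and returning `None` when there are no input tokens are both assumptions.

```python
from typing import Dict, Iterable, Optional

PRICE_ATTRIBUTES = (
    "gen_ai.prompt_price",
    "gen_ai.response_price",
    "gen_ai.read_cache_price",
    "gen_ai.write_cache_price",
)


def total_cost(spans: Iterable[Dict[str, float]]) -> float:
    """Sum the four per-span price attributes across all spans in scope."""
    return sum(span.get(attr, 0.0) for span in spans for attr in PRICE_ATTRIBUTES)


def cache_hit_rate(cache_read_tokens: int, input_tokens: int) -> Optional[float]:
    """Cache-read tokens as a fraction of input tokens (reads are a subset)."""
    if input_tokens == 0:
        return None  # nothing to divide by; shown here as "no data"
    return cache_read_tokens / input_tokens
```

For example, 250 cache-read tokens out of 1,000 input tokens is a 25% hit rate.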

Send cache and cost data to Coralogix

Cache hit rate and accurate cost on cache-writing providers rely on two span attributes, both defined by the OpenTelemetry GenAI semantic conventions:
| Attribute | Meaning | Used for |
| --- | --- | --- |
| gen_ai.usage.cache_read.input_tokens | Input tokens served from cache | Cache hit rate |
| gen_ai.usage.cache_creation.input_tokens | Input tokens written to cache (Anthropic and Bedrock only) | Total cost (cache-write billing) |

Note

Always also send gen_ai.provider.name so Coralogix applies the correct hit-rate formula per provider. Send the raw value exactly as the provider returns it — do not pre-normalize it.
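One reason the provider matters is that providers count cached tokens differently. OpenAI reports `cached_tokens` as part of `prompt_tokens`, while Anthropic reports `input_tokens` excluding cache reads and cache writes. The sketch below illustrates why the same raw numbers need a provider-aware denominator. It is not Coralogix's actual formula, the function names are hypothetical, and treating every other provider as OpenAI-style is an assumption made only to keep the example short.

```python
from typing import Optional


def total_input_tokens(
    provider: str, input_tokens: int, cache_read: int = 0, cache_creation: int = 0
) -> int:
    """Full prompt size, reconstructed from raw provider counts."""
    if provider == "anthropic":
        # Anthropic's input_tokens excludes cached reads and writes.
        return input_tokens + cache_read + cache_creation
    # OpenAI-style: the input count already includes cached tokens (assumed default).
    return input_tokens


def hit_rate(
    provider: str, input_tokens: int, cache_read: int = 0, cache_creation: int = 0
) -> Optional[float]:
    """Cache reads as a share of the full prompt."""
    total = total_input_tokens(provider, input_tokens, cache_read, cache_creation)
    return cache_read / total if total else None
```

If you add cache reads to the input count yourself before export, a provider-aware backend would count them twice, which is why the raw values are the safer thing to send.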

Providers that return cache usage

Confirm your provider returns cache data before relying on Cache hit rate.
| Provider | Returns cache | Reads | Writes |
| --- | --- | --- | --- |
| OpenAI / Azure OpenAI | Yes | Yes | No (automatic) |
| Anthropic | Yes | Yes | Yes |
| AWS Bedrock | Yes (model-dependent) | Yes | Yes |
| Google Gemini | Yes (cachedContentTokenCount) | Yes | Implicit or explicit |
| DeepSeek | Yes | Yes | No (automatic) |
| Self-hosted (vLLM, Ollama, …) | No | | |

Emit the cache tags

Use an open-source GenAI instrumentation library — for example, OpenLLMetry or the Pydantic-AI auto-instrumentor — which sets these attributes on the span for you. See Send GenAI data to Coralogix for the full setup.

If your instrumentation does not set the cache attributes yet, add a response hook. Most libraries expose a callback that runs while the provider response is still in scope — read the cache fields off it and write the canonical attributes on the span before export.

Note

Hook names vary by library: response_hook, responseHook, on_response. The principle is the same — the hook fires while the response is in scope, so you can copy the cache fields onto the span. AWS Bedrock Converse names the write field cacheWriteInputTokens. For whichever shape your SDK returns, set the canonical attribute name on the span.

Python hook

Register this response_hook with your instrumentation library to copy each provider's cache fields onto the canonical gen_ai.usage.cache_read.input_tokens and gen_ai.usage.cache_creation.input_tokens attributes. The branches cover OpenAI, Azure, DeepSeek, Anthropic, AWS Bedrock, and Google Gemini in one function.

def response_hook(span, request, response):
    def read(obj, key):
        if obj is None:
            return None
        return obj.get(key) if isinstance(obj, dict) else getattr(obj, key, None)

    def emit(attr, value):
        if value is not None:  # 0 is a valid count, so guard on None
            span.set_attribute(attr, value)

    READ = "gen_ai.usage.cache_read.input_tokens"
    WRITE = "gen_ai.usage.cache_creation.input_tokens"

    usage = read(response, "usage")  # object (OpenAI/Anthropic) or dict (Bedrock)
    if usage is None:
        # Gemini reports cache under usage_metadata, not usage
        emit(READ, read(read(response, "usage_metadata"), "cached_content_token_count"))
        return

    # OpenAI / Azure / DeepSeek (via OpenAI-compatible wrappers)
    emit(READ, read(read(usage, "prompt_tokens_details"), "cached_tokens"))

    # Anthropic (snake_case) — reads and writes
    emit(READ, read(usage, "cache_read_input_tokens"))
    emit(WRITE, read(usage, "cache_creation_input_tokens"))

    # AWS Bedrock Converse (camelCase) — reads and writes
    emit(READ, read(usage, "cacheReadInputTokens"))
    emit(WRITE, read(usage, "cacheWriteInputTokens"))


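# Register per your instrumentation API. For example, assuming your library's
# OpenAIInstrumentor is already imported and accepts a response_hook argument: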
OpenAIInstrumentor().instrument(response_hook=response_hook)
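
Before wiring a hook into production, it can help to run it against sample payloads. The sketch below is one way to do that with a stand-in span that records attributes instead of exporting them. `RecordingSpan`, `run_hook`, and the sample payloads are hypothetical test helpers, not part of any instrumentation library.

```python
class RecordingSpan:
    """Stand-in for a span: records attributes instead of exporting them."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def run_hook(hook, response, request=None):
    """Call a (span, request, response) hook and return what it set."""
    span = RecordingSpan()
    hook(span, request, response)
    return span.attributes


# Sample payloads shaped like provider responses (values are made up).
OPENAI_STYLE = {
    "usage": {"prompt_tokens": 1200, "prompt_tokens_details": {"cached_tokens": 1024}}
}
ANTHROPIC_STYLE = {
    "usage": {
        "input_tokens": 50,
        "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 900,
    }
}
```

Calling `run_hook(response_hook, OPENAI_STYLE)` with the hook above should yield only the cache-read attribute, set to 1024. The Anthropic-style payload should yield a cache-read count of 0 and a cache-creation count of 900, which confirms that zero counts are kept.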

TypeScript / Node hook

Same logic in TypeScript: pass this responseHook to OpenLLMetry's instrumentation (or any library exposing an equivalent callback) and it maps each provider's cache fields onto the canonical attributes. Handles the same OpenAI-compatible, Anthropic, AWS Bedrock, and Gemini response shapes as the Python version.

import { OpenAIInstrumentation } from '@traceloop/instrumentation-openai';
import type { Span } from '@opentelemetry/api';

const responseHook = (span: Span, response: unknown): void => {
  const obj = (v: unknown): Record<string, unknown> =>
    (typeof v === 'object' && v !== null ? v : {}) as Record<string, unknown>;
  const set = (attr: string, value: unknown): void => {
    if (typeof value === 'number') span.setAttribute(attr, value); // 0 is valid
  };

  const r = obj(response);
  const usage = r['usage']; // object (OpenAI/Anthropic) or dict (Bedrock)
  if (usage === undefined || usage === null) {
    // Gemini reports cache under usageMetadata, not usage
    set(
      'gen_ai.usage.cache_read.input_tokens',
      obj(r['usageMetadata'])['cachedContentTokenCount'],
    );
    return;
  }
  const u = obj(usage);

  // OpenAI / Azure / DeepSeek (via OpenAI-compatible wrappers)
  set(
    'gen_ai.usage.cache_read.input_tokens',
    obj(u['prompt_tokens_details'])['cached_tokens'],
  );

  // Anthropic (snake_case) — reads and writes
  set('gen_ai.usage.cache_read.input_tokens', u['cache_read_input_tokens']);
  set(
    'gen_ai.usage.cache_creation.input_tokens',
    u['cache_creation_input_tokens'],
  );

  // AWS Bedrock Converse (camelCase) — reads and writes
  set('gen_ai.usage.cache_read.input_tokens', u['cacheReadInputTokens']);
  set(
    'gen_ai.usage.cache_creation.input_tokens',
    u['cacheWriteInputTokens'],
  );
};

new OpenAIInstrumentation({ responseHook });

Next steps

Drill into the specific span behind any cost insight with AI Explorer.