How the Coralogix CLI Adds Production Intelligence to Any Agent for Any Use Case
The new interface into production telemetry is a tool call, made from whichever agent runtime the operator happens to be using at that moment. A finance lead in Claude Code, a product manager in Cursor, an engineer in Codex. Three different jobs, three different agents, three different reasoning loops. The thing they have in common is the data layer underneath. If that layer was built for dashboards, the agent overflows its context, hallucinates the schema, or burns thousands of dollars in tokens retrieving raw data it then has to summarize. If that layer was built for agents, the agent gets a small structured answer and moves on.
The Coralogix CLI (cx) is that data layer, exposed as a binary any agent can call. Below are three workflows run against the same Coralogix tenant this morning. Together they show how the Coralogix CLI adds both production intelligence and autonomous remediation to the Claude Code agent.
The FinOps agent analyzing thirty-six million logs with a few hundred tokens
The FinOps lead’s standing question is straightforward. Where is observability cost going, and which applications are pulling more weight than they should? Two years ago that question started in a UI. Today it starts in Claude Code.
> Show me log and span volumes for the last 24 hours by application.
I'll run two server-side aggregations against the Coralogix backbone. The
DataPrime queries return one row per application, so the tokens stay small
even though the underlying corpus is in the tens of millions.
cx logs 'groupby $l.applicationname aggregate count() as log_count |
orderby log_count desc | limit 10' --start now-24h
{ "applicationname": "default", "log_count": 34029239 }
{ "applicationname": "kube-system", "log_count": 890432 }
{ "applicationname": "074157727657", "log_count": 611398 }
...
cx spans 'groupby $l.applicationName aggregate count() as span_count |
orderby span_count desc | limit 10' --start now-24h
{ "applicationName": "default", "span_count": 37392548 }
{ "applicationName": "074157727657", "span_count": 477752 }
...
Total: ~36.6M log lines and ~37.5M spans in 24 hours. The "default" app
is responsible for 93% of log volume and 99% of span volume. kube-system
is the second-largest log producer at ~890k lines/day, almost all
INFO-severity infrastructure noise.
Underneath, this is server-side aggregation through DataPrime and the Distributed Query Engine. The agent never holds thirty-six million log lines in its context window. It sends a groupby, gets ten rows back, and spends maybe two hundred tokens reading them. Published agent benchmarks report agent systems consuming ten to one hundred times as many tokens as the underlying model alone when they retrieve raw data. Aggregating before returning is what closes that gap. Compressed ingestion is the only billing dimension at Coralogix, so the FinOps agent can issue thousands of queries a day with no per-query charge to worry about.
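The effect of aggregating before returning can be sketched in a few lines of Python. This is an illustration only, not how DataPrime or the Distributed Query Engine is implemented: the helper names, the sample rows, and the rough four-characters-per-token heuristic are all assumptions made for the example.

```python
import json
from collections import Counter

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token (an assumed heuristic)."""
    return max(1, len(text) // 4)

def top_applications(log_lines, limit=10):
    """In-process stand-in for: groupby applicationname, count(), orderby desc, limit."""
    counts = Counter(line["applicationname"] for line in log_lines)
    return [
        {"applicationname": app, "log_count": n}
        for app, n in counts.most_common(limit)
    ]

# Hypothetical raw logs; a real tenant would hold tens of millions of these.
raw_logs = (
    [{"applicationname": "default", "message": "GET /health 200"}] * 900
    + [{"applicationname": "kube-system", "message": "lease renewed"}] * 90
    + [{"applicationname": "billing", "message": "invoice created"}] * 10
)

summary = top_applications(raw_logs)

# Compare what the agent would read in each case.
raw_tokens = estimate_tokens("\n".join(json.dumps(line) for line in raw_logs))
summary_tokens = estimate_tokens("\n".join(json.dumps(row) for row in summary))

print(summary)
print(f"raw: ~{raw_tokens} tokens, aggregated: ~{summary_tokens} tokens")
```

The aggregated answer grows with the number of applications, not with the number of log lines, which is why the context cost stays roughly flat as the corpus grows.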
Thanks to in-stream processing, the FinOps agent can immediately remediate without losing coverage or blocking crucial data. The kube-system data is currently being ingested in Frequent Search. Coralogix’s TCO Optimizer routes data into three priority tiers: Frequent Search (hot, OpenSearch-backed, fastest queries), Monitoring (customer-owned S3, queryable via DataPrime in seconds with no rehydration, with full alerting and dashboarding), and Compliance (customer-owned S3, long retention). Moving kube-system from Frequent Search to Monitoring keeps full coverage, full DataPrime querying, and full alerting. What changes is how often the data is touched at hot-storage prices, and the agent does the rebalance directly:
> Move kube-system to the Monitoring tier and confirm the policy is active.
I'll send the policy definition over stdin to `cx tco create`. Priority is
PRIORITY_TYPE_MEDIUM (Monitoring), source is logs, and the application rule
matches the kube-system app exactly.
cx tco create --from-file - <<'EOF'
{
  "name": "kube-system-monitoring",
  "priority": "PRIORITY_TYPE_MEDIUM",
  "sourceType": "SOURCE_TYPE_LOGS",
  "applicationRule": {
    "ruleTypeId": "RULE_TYPE_ID_IS",
    "name": "kube-system"
  }
}
EOF
Created TCO policy 'kube-system-monitoring' (ID: pol_8d2f...) in profile 'default'.
cx tco list -o text
ID           NAME                    PRIORITY  SOURCE TYPE  SEVERITY
pol_8d2f...  kube-system-monitoring  MEDIUM    LOGS         -
...
Estimated daily reclaim: ~890k events × ~2.4 KB avg × (1 - 1/2.5)
= ~1.3 GB of Frequent-Search-equivalent throughput.
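The estimate above is plain arithmetic and can be checked by hand. The sketch below reproduces it in Python, assuming decimal units (1 KB = 1,000 bytes) and taking the 2.5 in the transcript as given, since the source does not say what that ratio represents. The variable names are illustrative.

```python
# Inputs copied from the agent's estimate; names are illustrative.
events_per_day = 890_000        # ~890k kube-system log lines per day
avg_event_bytes = 2.4 * 1_000   # ~2.4 KB average, assuming 1 KB = 1,000 bytes
ratio = 2.5                     # the 2.5 in "(1 - 1/2.5)"; its meaning is not stated in the source

daily_bytes = events_per_day * avg_event_bytes
reclaimed_bytes = daily_bytes * (1 - 1 / ratio)

daily_gb = daily_bytes / 1e9
reclaimed_gb = reclaimed_bytes / 1e9

print(f"daily volume: ~{daily_gb:.2f} GB, reclaim: ~{reclaimed_gb:.2f} GB")
```

Under these assumptions the daily volume comes to about 2.14 GB and the reclaim to about 1.28 GB, which rounds to the ~1.3 GB the agent reports.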
Coverage holds. DataPrime queries still run against the kube-system data, alerts still fire, and the data still sits in the customer’s own S3 bucket in open Parquet where it always did. What changed is the price of touching it.
The product manager, figuring out what to prioritize
The product manager’s standing question is utilization. How many real users hit the site this week, where do they bounce, and why? The honest version of that question requires Real User Monitoring data, and learning a new query language to ask it has historically been the friction.
> How many unique users hit our site over the past week, and what's the
health picture?
cx logs "filter \$l.subsystemname == 'cx_rum' | groupby true aggregate
approx_count_distinct(\$d.cx_rum.session_context.user_id) as users,
approx_count_distinct(\$d.cx_rum.session_context.session_id) as sessions,
count() as events" --start now-7d
{ "users": 302, "sessions": 72352, "events": 4742314 }
302 distinct users across 72k sessions and 4.7M RUM events. Now error and
performance health against the same dataset.
cx logs "filter ... && \$d.cx_rum.event_context.severity:num == 5 | groupby
rum_template_id aggregate count() as error_count, ..." --start now-7d
...
Top error: 30,461 occurrences of an S3 onclick failure on the staging
amplify deployment. Several smaller errors trace to dynamodb access
failures and a translation-rendering bug.
cx logs "filter ... && web_vitals_context.name == 'LCP' | groupby
page_fragments aggregate percentile(0.75, value) as LCP_p75_ms..." --start now-7d
{ "page": "/docs/dataprime/.../replace/", "LCP_p75_ms": 50159 }
{ "page": "/docs/integrations/aws/aws-eks-fargate/", "LCP_p75_ms": 41887 }
{ "page": "/docs/dataprime/.../formattimestamp/", "LCP_p75_ms": 41392 }
...
LCP p75 is between roughly 41 and 50 seconds on the slowest reference pages. Google's
"good" threshold is 2.5 seconds. The pages affected are the ones search
indexes link into, so new users land on the slowest renders.
Recommendation: latency on the /dataprime reference pages is likely the
largest single cause of bounce. I'd prioritize a fix here over the open
S3 onclick error, which is concentrated on a single internal staging session.
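To put the LCP figures from the transcript in perspective, here is a small Python sketch that rates the three p75 values against Google's published Web Vitals buckets for LCP (good up to 2.5 s, poor above 4.0 s). The function name and the dictionary layout are assumptions for illustration, and the page paths keep the elisions from the transcript.

```python
GOOD_LCP_MS = 2_500   # Google's "good" LCP threshold
POOR_LCP_MS = 4_000   # above this, LCP is rated "poor"

def rate_lcp(p75_ms: float) -> str:
    """Classify a 75th-percentile LCP value using the Web Vitals buckets."""
    if p75_ms <= GOOD_LCP_MS:
        return "good"
    if p75_ms <= POOR_LCP_MS:
        return "needs improvement"
    return "poor"

# p75 values copied from the transcript (paths left elided as in the source).
pages = {
    "/docs/dataprime/.../replace/": 50_159,
    "/docs/integrations/aws/aws-eks-fargate/": 41_887,
    "/docs/dataprime/.../formattimestamp/": 41_392,
}

# For each page: its rating, and how many times over the "good" threshold it is.
report = {
    page: (rate_lcp(ms), round(ms / GOOD_LCP_MS, 1))
    for page, ms in pages.items()
}

for page, (rating, multiple) in report.items():
    print(f"{page}: {rating}, {multiple}x the 'good' threshold")
```

All three pages land in the "poor" bucket, at roughly 17 to 20 times the "good" threshold, which is what makes latency the stronger candidate for the backlog.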
The rum skill is what makes that conversation work. It teaches the agent the RUM data model: event types, severity 5 as the universal error marker, rum_template_id as the error fingerprint, and Web Vitals names like Largest Contentful Paint (LCP) and Load Time (LT) as the things to ask about. It teaches the agent how to ask, not what is wrong. Underneath, RUM is regular logs in a cx_rum subsystem, queried with the same DataPrime that drives every other surface. The semantic layer answers low-volume metadata queries cheaply, so the agent composes the more expensive aggregate against confirmed field paths instead of fishing. The Agentic Context Engineering paper from Stanford, SambaNova, and Berkeley measured a 17.1 percentage point accuracy gain on AppWorld from context engineering alone, with no model change. The skills are that engineering, packaged as recipes, added with npx skills add coralogix/cx-cli, that the agent loads on demand.
The product manager reprioritizes the backlog. The latency is doing more damage than the noisy error count suggested, and the answer was reachable in one conversation.
A DevOps engineer, catching the database problem before a release
The engineer is two hours from running an online migration on the orders database. The checklist has one open item: confirm the database is not already on fire. The agent will compose the queries; the engineer reads the answer.
> I'm about to run a migration on the orders DB. Are there any active
database errors I should know about before I start?
cx logs 'filter $m.severity == ERROR && ($d.message ~ "database"
|| $d.message ~ "sql" || $d.message ~ "postgres"
|| $d.message ~ "connection") | groupby $l.subsystemname aggregate
count() as errors | orderby errors desc | limit 8' --start now-24h
{ "subsystemname": "frontend", "errors": 2746 }
{ "subsystemname": "cll-lambda-python", "errors": 708 }
{ "subsystemname": "databaserelatederrorsfunction-faas", "errors": 705 }
...
The faas subsystem is the most operationally suspect of the three. Looking
at message distribution next.
cx logs "filter ... 'databaserelatederrorsfunction-faas' | groupby
\$d.message aggregate count() as occurrences | orderby occurrences desc
| limit 3" --start now-24h
{ "message": "err: database error for customer 12345: err: query timeout error",
"occurrences": 705 }
...
All 705 errors are the same template, scoped to customer 12345, averaging
about 29 per hour across the full 24-hour window. The migration
will not introduce this issue, but it will run against the same database
while the issue is happening.
cx logs "filter ... | groupby roundTime(\$m.timestamp, 1h) as hour aggregate
count() as errors | orderby hour asc" --start now-24h
[24 hourly buckets, errors holding between 26 and 70 per hour]
Recommendation: hold the migration. The query-timeout pattern on customer
12345 indicates a long-running query, a runaway transaction, or a saturated
connection pool. The migration's lock acquisition will queue behind whatever
this customer's query is doing.
The mechanism behind this is cx search-fields, which lets the agent find the right field path semantically before it filters, so the wildcard match against $d.message is a fallback rather than a fishing expedition. The agents output mode (-o agents) returns TOON-encoded results with metadata stripped and auto-spill to a temp file when the payload exceeds 100 KiB. The engineer’s agent can run twenty queries in series, follow a hypothesis, and not blow its context.
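The auto-spill idea is simple enough to sketch. The Python below is not the cx implementation; it is a minimal illustration, assuming a hypothetical helper `emit_or_spill` that returns small payloads inline and writes anything over 100 KiB to a temporary file, handing back a short pointer instead.

```python
import os
import tempfile

SPILL_LIMIT_BYTES = 100 * 1024  # 100 KiB, the threshold quoted in the text

def emit_or_spill(payload: str, limit: int = SPILL_LIMIT_BYTES) -> str:
    """Return the payload inline if it fits; otherwise write it to a temp
    file and return a short pointer the agent can follow up on."""
    data = payload.encode("utf-8")
    if len(data) <= limit:
        return payload
    fd, path = tempfile.mkstemp(prefix="query-result-", suffix=".txt")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return f"[{len(data)} bytes written to {path}]"

# A small result comes back inline; an oversized one becomes a one-line pointer.
small = emit_or_spill("app,count\ndefault,34029239\n")
large = emit_or_spill("x" * (SPILL_LIMIT_BYTES + 1))
print(small)
print(large)
```

The design choice is that the agent's context only ever receives a bounded amount of text per call. When a result is too large, the agent sees a one-line pointer and can decide whether to open the file or narrow the query.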
The migration does not run until the customer 12345 issue is understood. The terminal got the engineer to that conclusion in under a minute.
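The hourly check in the transcript separates a sustained problem from a one-off burst. A rough Python equivalent is sketched below, assuming that bucketing floors each timestamp to the start of its hour (the exact rounding behavior of DataPrime's roundTime is not stated in the text). The timestamps, the start date, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def hourly_counts(timestamps):
    """Count events per hour, flooring each timestamp to the start of its hour."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(sorted(buckets.items()))

def is_sustained(counts, window_start, hours):
    """True if every hour in the window has at least one event."""
    expected = [window_start + timedelta(hours=h) for h in range(hours)]
    return all(counts.get(hour, 0) > 0 for hour in expected)

# Hypothetical timestamps: 30 errors in each of 24 consecutive hours.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = [
    start + timedelta(hours=h, minutes=2 * i)
    for h in range(24)
    for i in range(30)
]

counts = hourly_counts(events)
print(len(counts), min(counts.values()), max(counts.values()))
print("sustained:", is_sustained(counts, start, 24))
```

A burst would leave most buckets empty and fail the `is_sustained` check, which is the distinction that matters before taking locks on a live database.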
What sits underneath each one
The Coralogix CLI gives any agent full, token-efficient access to all customer intelligence in the Coralogix platform. It is powered by the Distributed Query Engine for server-side aggregation, the semantic layer for cheap field discovery, agent-shaped output through TOON, multi-profile fan-out for cross-region answers, and the skills bundle that ships these recipes to forty-plus coding agents. The data lives in customer-owned object storage in open Parquet and Open TSDB, and Coralogix is the query and control plane for it.
And the research supports this
Quesma’s analysis of SWE-Bench Pro found a 22 percentage point swing between basic and optimized agent scaffolds with the same model weights. The CLI is part of that scaffold. So is the skill that taught the agent how to compose the query in the first place. Giving your models access to the correct data, in an effective and token-efficient way, is the most impactful change you can make to agent performance today.
Try it today
The Coralogix CLI can be installed by following our documentation, with prebuilt skills for over 40 agents and access to every Coralogix API. A single tool call brings production reality and autonomous remediation to all of your agents, regardless of the use case, with a minimal token footprint.