
Aggregating data

Goal

By the end of this guide, you should be able to use aggregate, groupby, avg, count_if, and other functions to calculate metrics such as averages, totals, and conditional counts from your logs or traces.

Why it matters

Understanding what your systems are doing at scale often requires more than just viewing raw logs; it requires summarization. Aggregations let you answer questions like “Which endpoints are the slowest?”, “How many error traces exceeded 1s?”, or “What is the average request size per service?” without manually inspecting individual events.

Count all logs or traces

Start with the most basic aggregation: a total count of documents.

aggregate count() as total_logs

Output:
total_logs
42

This gives you a single summary document showing how many logs or traces are in the current time range. It's often used as a quick validation before applying filters or grouping.
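
A quick way to use this for validation is to pair count() with a filter and check that the filter matches anything at all. A minimal sketch, reusing the duration threshold from the filtering example later in this guide:

source spans
| filter duration > 500
| aggregate count() as matching_spans

If matching_spans comes back as 0, the condition (or the selected time range) is probably the issue rather than your aggregation.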


Group by a field and calculate an average

To understand how a metric varies across categories—like average response time per route—you can combine groupby with avg.

source spans 
| groupby cloud_region aggregate avg(duration) as avg_duration

Output
cloud_region   avg_duration
eu-north-1     577441.4083135392
eu-west-1      2090822.126859143
us-east-2      301.00681648269494

This result shows that us-east-2 is significantly faster than the other regions. Grouping by key fields like path, service, or subsystem can reveal performance bottlenecks across your system.
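
The same pattern works for any scalar key. As a sketch, assuming your spans carry a service_name field (substitute whichever field identifies a service, route, or subsystem in your data):

source spans
| groupby service_name aggregate avg(duration) as avg_duration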


Add multiple aggregations at once

You can compute multiple metrics per group in a single query.

source spans
| groupby cloud_region aggregate
  avg(duration) as avg_duration,
  max(duration) as max_duration,
  count() as request_count

Output
cloud_region   avg_duration        max_duration   request_count
eu-north-1     621596.129809023    9843640        3613
eu-west-1      2116719.928825623   60559729       1124
us-east-2      365.99741605926     6496948        66952

This result shows not just the average but also the worst-case response time and how much traffic each region receives. It helps you correlate performance issues with traffic volume.
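
You can keep extending the same groupby with more aggregations, as long as each one gets its own alias. For example, a sketch that also captures the best-case duration, assuming a min() aggregation is available alongside avg, max, and count:

source spans
| groupby cloud_region aggregate
  avg(duration) as avg_duration,
  min(duration) as min_duration,
  max(duration) as max_duration,
  count() as request_count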


Filter before aggregating

It's best practice to apply filters before aggregating to focus your analysis. This makes your query more performant. For example, counting only slow requests:

source spans
| filter duration > 500
| groupby cloud_region aggregate count() as slow_request_count

Output
cloud_region   slow_request_count
eu-north-1     3408
eu-west-1      1067
us-east-2      5880

This output highlights that most slow requests are concentrated in the us-east-2 region. You can adjust the threshold or use other fields (e.g., status_code, severity) to customize your filters.
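
As a sketch of filtering on a different field, assuming your spans carry a numeric status_code, you could count only failed requests per region:

source spans
| filter status_code >= 500
| groupby cloud_region aggregate count() as error_count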


Use count_if for conditional metrics

count_if lets you tally only the events that meet a condition. Unlike filtering first, the rest of the query still sees the full dataset, so other aggregations like count() run over every event.

source spans
| groupby cloud_region aggregate
  count() as total_requests,
  count_if($m.duration > 500) as slow_requests

Output
cloud_region   total_requests   slow_requests
eu-north-1     4064             3784
eu-west-1      1134             1086
us-east-2      75219            4194

This lets you compute derived metrics like error rates or slow-request ratios. For example, you could later calculate count_if($m.duration > 500) / count() as slow_request_ratio to get the fraction of problematic requests per region.
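
Putting that together, here is a sketch of the ratio query, assuming aggregation results can be combined arithmetically inside the same aggregate clause:

source spans
| groupby cloud_region aggregate
  count() as total_requests,
  count_if($m.duration > 500) as slow_requests,
  count_if($m.duration > 500) / count() as slow_request_ratio

A slow_request_ratio close to 1 means nearly every request in that region exceeds the threshold.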


Group by calculated severity level

You’re not limited to grouping by raw fields—groupby can also use expressions. One common use case is grouping logs into custom severity bands based on a numeric field like duration.

For example, let’s say any request over 1000ms is super slow, over 500ms is slow, and anything faster is ok. You can express this logic inline using a case expression:

source spans
| groupby case {
  duration > 1000 -> 'super slow',
  duration > 500 -> 'slow',
  _ -> 'ok'
} as latency_level
aggregate count() as request_count

Output
latency_level   request_count
ok              160687
slow            65991
super slow      191280

This gives you a simple breakdown of how many requests fall into each latency tier—ideal for tracking SLO compliance or surfacing anomalies in dashboards. You can tweak the thresholds or logic to match any internal definition of "slowness."
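
As a sketch of tweaking that logic, you could add another tier for extreme outliers using the same case construct:

source spans
| groupby case {
  duration > 5000 -> 'critical',
  duration > 1000 -> 'super slow',
  duration > 500 -> 'slow',
  _ -> 'ok'
} as latency_level
aggregate count() as request_count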


Aggregate over spans (traces)

Aggregations work just as well on trace data. Using the spans source, you can run the same types of calculations across every span in the selected time range.

source spans
| aggregate count() as trace_count, max($m.duration) as max_trace_duration

Output
trace_count   max_trace_duration
75590         360440745

This summary shows how many traces were recorded in the selected time window and what the longest span duration was—useful for understanding system throughput and outliers in trace performance.


Common pitfalls or gotchas

  • groupby removes all fields not explicitly included. Add fields as grouping keys or aggregations if you want to preserve them (see the sketch after this list).
  • Aggregations like avg, count_if, and max ignore null values. Make sure fields you're aggregating on are populated.
  • You can only group by scalar values—arrays and objects are not supported unless flattened or transformed first.
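
For example, to keep a hypothetical service_name field visible in a per-region aggregation, add it as a second grouping key; a sketch assuming groupby accepts multiple comma-separated keys:

source spans
| groupby cloud_region, service_name aggregate
  count() as request_count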