Aggregating data
Goal
By the end of this guide, you should be able to use `aggregate`, `groupby`, `avg`, `count_if`, and other functions to calculate metrics such as averages, totals, and conditional counts from your logs or traces.
Why it matters
Understanding what your systems are doing at scale often requires more than viewing raw logs; it requires summarization. Aggregations let you answer questions like “Which endpoints are the slowest?”, “How many error traces exceeded 1s?”, or “What is the average request size per service?” without manually inspecting individual events.
Count all logs or traces
Start with the most basic aggregation: a total count of documents.
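A query along these lines produces that count (the standalone `aggregate` form shown here is a sketch; your query language's exact syntax for an ungrouped aggregation may differ):

source logs
| aggregate count() as total_logs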
Output:
| total_logs |
| --- |
| 42 |
This gives you a single summary document showing how many logs or traces are in the current time range. It's often used as a quick validation before applying filters or grouping.
Group by a field and calculate an average
To understand how a metric varies across categories, such as average response time per route, you can combine `groupby` with `avg`.
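The output below comes from a query along these lines (shown as a sketch; `duration` is assumed to be the span duration field used throughout this guide):

source spans
| groupby cloud_region aggregate avg(duration) as avg_duration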
Output
| cloud_region | avg_duration |
| --- | --- |
| eu-north-1 | 577441.4083135392 |
| eu-west-1 | 2090822.126859143 |
| us-east-2 | 301.00681648269494 |
This result shows that `us-east-2` is significantly faster than the other regions. Grouping by key fields like `path`, `service`, or `subsystem` can reveal performance bottlenecks across your system.
Add multiple aggregations at once
You can compute multiple metrics per group in a single query.
source spans
| groupby cloud_region aggregate
    avg(duration) as avg_duration,
    max(duration) as max_duration,
    count() as request_count
Output
| cloud_region | avg_duration | max_duration | request_count |
| --- | --- | --- | --- |
| eu-north-1 | 621596.129809023 | 9843640 | 3613 |
| eu-west-1 | 2116719.928825623 | 60559729 | 1124 |
| us-east-2 | 365.9974160592664 | 96948 | 66952 |
This result shows not just the average but also the worst-case response time and how often each region is hit, which helps you correlate performance issues with traffic volume.
Filter before aggregating
It's best practice to apply filters before aggregating to focus your analysis; it also improves query performance, since less data reaches the aggregation stage. For example, you can count only slow requests.
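A sketch of such a query, assuming a `filter` stage runs before the aggregation (the duration threshold below is illustrative; pick a cutoff that matches your definition of slow):

source spans
| filter duration > 1000000
| groupby cloud_region aggregate count() as slow_request_count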
Output
| cloud_region | slow_request_count |
| --- | --- |
| eu-north-1 | 3408 |
| eu-west-1 | 1067 |
| us-east-2 | 5880 |
This output highlights that most slow requests are concentrated in the `us-east-2` cloud region. You can adjust the threshold or use other fields (e.g., `status_code`, `severity`) to customize your filters.
Use `count_if` for conditional metrics
`count_if` lets you tally only the events that meet a condition, while still keeping the rest of the dataset available for other aggregations.
source spans
| groupby cloud_region aggregate
    count() as total_requests,
    count_if($m.duration > 500) as slow_requests
Output
| cloud_region | total_requests | slow_requests |
| --- | --- | --- |
| eu-north-1 | 4064 | 3784 |
| eu-west-1 | 1134 | 1086 |
| us-east-2 | 75219 | 4194 |
This pattern lets you compute derived metrics like error rates or slow-request ratios. For example, you could later calculate `count_if($m.duration > 500) / count() as slow_request_ratio` to get the fraction of problematic requests per region.
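As a sketch, assuming aggregate expressions can be combined arithmetically in a single clause:

source spans
| groupby cloud_region aggregate
    count_if($m.duration > 500) / count() as slow_request_ratio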
Group by calculated severity level
You’re not limited to grouping by raw fields: `groupby` can also take expressions. One common use case is grouping logs into custom severity bands based on a numeric field like `duration`.
For example, let’s say any request over 1000ms is super slow, over 500ms is slow, and anything faster is ok. You can express this logic inline using a `case` expression:
source spans
| groupby case {
    duration > 1000 -> 'super slow',
    duration > 500 -> 'slow',
    _ -> 'ok'
  } as latency_level
  aggregate count() as request_count
Output
| latency_level | request_count |
| --- | --- |
| ok | 160687 |
| slow | 65991 |
| super slow | 191280 |
This gives you a simple breakdown of how many requests fall into each latency tier—ideal for tracking SLO compliance or surfacing anomalies in dashboards. You can tweak the thresholds or logic to match any internal definition of "slowness."
Aggregate over spans (traces)
Aggregations work just as well on trace data. You can switch to the `spans` source and run the same types of calculations.
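For example, a query along these lines yields the summary below (the standalone `aggregate` form is again a sketch rather than exact syntax):

source spans
| aggregate count() as trace_count, max(duration) as max_trace_duration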
Output
| trace_count | max_trace_duration |
| --- | --- |
| 755903 | 60440745 |
This summary shows how many traces were recorded in the selected time window and what the longest span duration was—useful for understanding system throughput and outliers in trace performance.
Common pitfalls or gotchas
- `groupby` removes all fields not explicitly included. Add fields as grouping keys or aggregations if you want to preserve them (see the sketch after this list).
- Aggregations like `avg`, `count_if`, and `max` ignore `null` values. Make sure the fields you're aggregating on are populated.
- You can only group by scalar values; arrays and objects are not supported unless flattened or transformed first.
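To illustrate the first pitfall, here is a sketch that preserves `subsystem` by adding it as a second grouping key (assuming grouping keys can be comma-separated):

source spans
| groupby cloud_region, subsystem aggregate count() as request_count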