groupby - Perform aggregations on expressions
The groupby
operator is a fundamental component of DataPrime, unlocking the ability for users to perform aggregations, like sum
, average
, max
, count
and more on a working set of documents.
The groupby
operator supports grouping by one or more expressions. This means either keypaths or the results of some calculation.
Syntax
groupby <grouping_expression> [as <alias>] [, <grouping_expression_2> [as <alias_2>], ...] [calculate|calc|aggregate|agg]
<aggregate_function> [as <result_keypath>]
[, <aggregate_function_2> [as <result_keypath_2], ...]
NOTE: calculate
& calc
have now been deprecated. Users should default to aggregate
or agg
.
Dynamic aggregation expressions
While it is most common to aggregate on an existing keypath, DataPrime also supports aggregations on the results of expressions too. For example:
In this example, this will set all values of the username
keypath to its lower case variant, ensuring that username
values that are the same with different cases will still be grouped together. Please see examples for more complex use cases.
Example - Computing average HTTP Request Latency for Each Path
In this example, we're going to group our HTTP traffic logs by a given URL, which is available in a field path
. We will then perform an aggregation to calculate the average duration for each of the path values.
NOTE: This type of aggregation removes all fields besides the grouping key and any calculated aggregations. In effect, our resultant document looks like this:
Example - Grouping Kubernetes OTel traces by IP subnet
Consider the following traces:
{
"action": "/home/purchase",
"process": {
"tags": {
"k8s.pod.ip": "172.31.99.5"
}
}
},
{
"action": "/home/purchase",
"process": {
"tags": {
"k8s.pod.ip": "172.31.98.3"
}
}
},
{
"action": "/home/purchase",
"process": {
"tags": {
"k8s.pod.ip": "172.31.99.1"
}
}
},
We can see different IP addresses here. An easier way to view this data is to cluster the IP addresses into their appropriate subnet ranges. Assuming a subnet range size of 255
, the following command demonstrates how to do this using groupby
and ipPrefix
:
source spans
| groupby ipPrefix(process.tags['k8s.pod.ip'], 24) as subnet_range agg count() as activity_count
This results in the following documents:
{
"subnet_range": "172.31.99.0/24",
"activity_count": 2
},
{
"subnet_range": "172.31.98.0/24",
"activity_count": 1
}
Example - Converting summed bytes into megabytes on the fly
The groupby
command also supports aggregating on the results of calculations. Consider the following documents:
{
"ip": "172.31.99.0",
"bytes_returned": 245987
},
{
"ip": "172.31.98.0",
"bytes_returned": 126234
}
We can use groupby
to compute the average value for bytes_returned
for each IP address, but we can also perform calculations, such that the grouping is done as the result of an expression:
source logs
| groupby ip aggregate (avg(bytes_returned) / 1024 / 1024).round(2) as bytes_returned_avg_mb
This will result in the following documents:
{
"ip": "172.31.99.0",
"bytes_returned_avg_mb": 0.31
},
{
"ip": "172.31.98.0",
"bytes_returned_avg_mb": 0.42
}
Example - Tracking high duration HTTP requests in traces
count_if
is very useful when we're looking to aggregate on a specific field, like operationName
, but we are only interested in a subset of the values. Consider the following traces:
{
"operationName": "/home",
"duration": 567
},
{
"operationName": "/about",
"duration": 512
},
{
"operationName": "/store",
"duration": 33
}
Our goal is to count the number of requests, by operationName
, that exceed 500ms
. To do this, we can use count_if
:
Example - Calculating 99th percentile from trace Data
Let's say we're interested in the 99th percentile of a lambda function. We have trace data, and the lambda function's operation name is Lambda.Invoke
. We can then run the following query to compute its 99th percentile:
source spans
| filter $l.operationName == 'Lambda.Invoke'
| groupby $l.operationName agg percentile(0.99, $m.duration)