Skip to content

groupby - Perform aggregations on expressions

The groupby operator is a fundamental component of DataPrime, unlocking the ability for users to perform aggregations, like sum, average, max, count and more on a working set of documents.

The groupby operator supports grouping by one or more expressions. This means either keypaths or the results of some calculation.

Syntax

groupby <grouping_expression> [as <alias>] [, <grouping_expression_2> [as <alias_2>], ...] [calculate|calc|aggregate|agg]
  <aggregate_function> [as <result_keypath>]
  [, <aggregate_function_2> [as <result_keypath_2], ...]

NOTE: calculate & calc have now been deprecated. Users should default to aggregate or agg.

Dynamic aggregation expressions

While it is most common to aggregate on an existing keypath, DataPrime also supports aggregations on the results of expressions too. For example:

groupby username.toLowerCase() agg count() as activity_count

In this example, this will set all values of the username keypath to its lower case variant, ensuring that username values that are the same with different cases will still be grouped together. Please see examples for more complex use cases.

Example - Computing average HTTP Request Latency for Each Path

In this example, we're going to group our HTTP traffic logs by a given URL, which is available in a field path. We will then perform an aggregation to calculate the average duration for each of the path values.

groupby $d.path agg avg($d.duration) as average_duration_for_path

NOTE: This type of aggregation removes all fields besides the grouping key and any calculated aggregations. In effect, our resultant document looks like this:

{
    "path": "/home",
    "average_duration_for_path": 500
}

Example - Grouping Kubernetes OTel traces by IP subnet

Consider the following traces:

{
  "action": "/home/purchase",
  "process": {
    "tags": {
      "k8s.pod.ip": "172.31.99.5"
    }
  }
},
{
  "action": "/home/purchase",
  "process": {
    "tags": {
      "k8s.pod.ip": "172.31.98.3"
    }
  }
},
{
  "action": "/home/purchase",
  "process": {
    "tags": {
      "k8s.pod.ip": "172.31.99.1"
    }
  }
},

We can see different IP addresses here. An easier way to view this data is to cluster the IP addresses into their appropriate subnet ranges. Assuming a subnet range size of 255, the following command demonstrates how to do this using groupby and ipPrefix:

source spans
| groupby ipPrefix(process.tags['k8s.pod.ip'], 24) as subnet_range agg count() as activity_count

This results in the following documents:

{
  "subnet_range": "172.31.99.0/24",
  "activity_count": 2
},
{
  "subnet_range": "172.31.98.0/24",
  "activity_count": 1
}

Example - Converting summed bytes into megabytes on the fly

The groupby command also supports aggregating on the results of calculations. Consider the following documents:

{
  "ip": "172.31.99.0",
  "bytes_returned": 245987
},
{
  "ip": "172.31.98.0",
  "bytes_returned": 126234
}

We can use groupby to compute the average value for bytes_returned for each IP address, but we can also perform calculations, such that the grouping is done as the result of an expression:

source logs 
| groupby ip aggregate (avg(bytes_returned) / 1024 / 1024).round(2) as bytes_returned_avg_mb

This will result in the following documents:

{
  "ip": "172.31.99.0",
  "bytes_returned_avg_mb": 0.31
},
{
  "ip": "172.31.98.0",
  "bytes_returned_avg_mb": 0.42
}

Example - Tracking high duration HTTP requests in traces

count_if is very useful when we're looking to aggregate on a specific field, like operationName, but we are only interested in a subset of the values. Consider the following traces:

{
    "operationName": "/home",
    "duration": 567
},
{
    "operationName": "/about",
    "duration": 512
},
{
    "operationName": "/store",
    "duration": 33
}

Our goal is to count the number of requests, by operationName, that exceed 500ms. To do this, we can use count_if:

groupby $l.operationName calculate count_if($m.duration > 500) as high_duration_request_count

Example - Calculating 99th percentile from trace Data

Let's say we're interested in the 99th percentile of a lambda function. We have trace data, and the lambda function's operation name is Lambda.Invoke. We can then run the following query to compute its 99th percentile:

source spans
| filter $l.operationName == 'Lambda.Invoke'
| groupby $l.operationName agg percentile(0.99, $m.duration)