Conditional aggregations
In addition to basic aggregation, DataPrime offers more advanced techniques such as conditional aggregations. Conditional aggregations allow you to apply functions based on specific conditions within your dataset, making it easier to group and analyze your data flexibly.
What you’ll learn in this guide:
- What is conditional aggregation?
- Primary uses
- Conditional functions as the basis for conditional aggregations
- Supported conditional aggregations
What is conditional aggregation?
Conditional aggregation enables you to apply aggregation functions (e.g., `count`, `sum`, `avg`) selectively, based on specific conditions. Unlike standard aggregation, which processes all data without filtering, conditional aggregation evaluates subsets of data that meet defined criteria. This makes it ideal for complex scenarios requiring finer control over data summarization.
What can I use it for?
Conditional aggregation is essential when you need to:
- Analyze subsets of data without splitting the dataset into multiple queries.
- Simplify operations by aggregating based on custom conditions, such as `count_if`, in one step.
- Ensure data consistency when certain values are missing from the dataset. In datasets with missing or null values, conditional aggregation allows you to handle these gaps by applying default values or specific rules for missing data. This can help prevent inconsistencies when performing analyses, as you can control how missing values are treated.
- Generate compact, easy-to-read outputs that still provide detailed insights.
if conditional function
At the basis of conditional aggregations is the `if` function, the syntax for which is as follows:
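if(<condition>, <value1>, <value2>)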
If the `<condition>` expression evaluates to TRUE, the function returns `<value1>`; otherwise, it returns `<value2>`.
Basic examples
This example counts the number of error responses (status code 500) for each region. The `if` function checks whether the status code is equal to 500 and, if so, counts it as an error.
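The query might look like the following (the field paths `$d.region` and `$d.status_code` are illustrative; use the paths that match your log schema):
Query:
source logs
| groupby $d.region aggregate sum(if($d.status_code == 500, 1, 0)) as error_count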
The query groups the logs by region and uses the `if` function to check the `status_code` for each log entry. If the `status_code` is 500 (indicating an error), the function returns `1`; otherwise, it returns `0`. The `sum()` aggregation then adds up all the `1`s and `0`s, resulting in the total number of error responses per region, which is labeled as `error_count`.
Supported conditional aggregations
Conditional counting with count_if
This technique counts occurrences of specific conditions in the data. It’s ideal for scenarios where you want to track how often each condition is met within each group.
Syntax:
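count_if(<condition>) as <alias>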
Consider a situation where you are analyzing system logs and want to count how many times each severity level (such as `ERROR`, `WARNING`, `INFO`, etc.) appears within each subsystem. Instead of filtering the logs for each severity level individually, you can use `count_if` to count each severity condition in a single query.
Query:
source logs
| groupby $l.subsystemname aggregate
count_if($m.severity == ERROR) as error_count,
count_if($m.severity == WARNING) as warning_count,
count_if($m.severity == DEBUG) as debug_count,
count_if($m.severity == INFO) as info_count,
count_if($m.severity == VERBOSE) as verbose_count
In this query, the `count_if` function applies a condition (e.g., `$m.severity == ERROR`) and counts only the rows that satisfy it. Each `count_if` generates a separate column for its respective condition.
Output:
[
{
"subsystemname": "074157727657",
"debug_count": 47,
"error_count": 321,
"info_count": 1612,
"verbose_count": 22,
"warning_count": 12
},
{
"subsystemname": "aws-cloudtrail",
"debug_count": 0,
"error_count": 82,
"info_count": 11734,
"verbose_count": 0,
"warning_count": 0
},
{
"subsystemname": "prometheus",
"debug_count": 56,
"error_count": 0,
"info_count": 56,
"verbose_count": 0,
"warning_count": 0
}
]
The query's result will give you a table in which each row represents a subsystem and each column contains the count of logs for each severity level.
Now consider the following query, which builds upon the previous one. The user previously aggregated the logs by `subsystemname`, counting the total occurrences of different severity levels (such as `ERROR`, `DEBUG`, etc.) for each subsystem. However, the user may now want to determine whether a subsystem has an unusually high error rate. Specifically, the user wants to flag any subsystem where more than 10% of its log entries are marked as `ERROR`.
Query:
source logs
| groupby $l.subsystemname agg count() as total_count,
count_if($m.severity == ERROR) as error_count,
count_if($m.severity == DEBUG) as debug_count
| create analysis from if(error_count / total_count > 0.1, 'Problem', 'OK')
| filter analysis == 'Problem'
In the query, the `if` function plays a key role in evaluating whether the error rate in a subsystem is high. Specifically, it checks whether the ratio of `error_count` to `total_count` exceeds 0.1 (10%). If this condition is met, the function assigns the value `'Problem'` to the newly created field `analysis`; otherwise, it assigns `'OK'`. This enables the query to flag subsystems with a high error rate, effectively categorizing them as "Problem" or "OK." The subsequent `filter` then leaves a refined dataset that only includes subsystems with a problematic error rate, allowing for more targeted analysis.
Output:
[
{
"subsystemname": "074157727657",
"total_count": 2000,
"error_count": 321,
"debug_count": 47,
"analysis": "Problem"
},
{
"subsystemname": "aws-cloudtrail",
"total_count": 11800,
"error_count": 118,
"debug_count": 0,
"analysis": "Problem"
}
]
The output of this query is a document for each subsystem that includes `total_count`, `error_count`, `debug_count`, and `analysis`. For subsystems with an error rate greater than 10%, the `analysis` field will be labeled `'Problem'`.
Compact summarization with collect
`collect` collects the result of some expression and produces an array. Use it to aggregate multiple values into a compact format, combining related data into a single row for better readability.
Syntax:
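collect(<expression>) as <alias>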
Query:
source logs
| groupby $l.applicationname, $m.severity agg count() as cnt
| groupby applicationname aggregate
collect (`{severity}={cnt}`) as severities_with_counts
- The `collect` function groups related data into an array.
- Each entry in the array combines severity levels with their respective counts, reducing redundancy and enabling a quick overview.
Output:
[
{
"applicationname": "074157727657",
"severities_with_counts": [
"Debug=51",
"Verbose=24",
"Info=1709",
"Warning=31",
"Error=337"
]
},
{
"applicationname": "aws-cloudtrail",
"severities_with_counts": [
"Info=11066",
"Error=84"
]
},
{
"applicationname": "prometheus",
"severities_with_counts": [
"Debug=57",
"Info=57"
]
}
]
Notice that the array will contain only the severities that actually exist in the data. This means that if a particular severity level (like `ERROR` or `WARNING`) does not appear in the logs for a given `applicationname`, it will not appear in the aggregated results for that application, keeping the output compact and relevant.
Sorting results alphabetically with arraySort
Adding `arraySort()` to the previous example ensures consistent ordering of aggregated values, making results easier to interpret.
Syntax:
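<array>.arraySort()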
Query:
source logs
| groupby $l.applicationname, $m.severity agg count() as cnt
| groupby applicationname aggregate
collect (`{severity}={cnt}`).arraySort() as severities_with_counts
The `arraySort()` function sorts the collected array alphabetically, ensuring that severity levels always appear in the same order across rows. Because it returns a new, sorted array rather than modifying the input (see the note below), your data transformations remain predictable and safe.
Note
The `arraySort()` function does not modify the original array. Instead, it creates and returns a new array containing the sorted elements. The original array remains unchanged, which makes for predictability, immutability, and error prevention.
Output:
[
{
"applicationname": "074157727657",
"severities_with_counts": [
"Debug=51",
"Error=360",
"Info=1679",
"Verbose=21",
"Warning=30"
]
},
{
"applicationname": "aws-cloudtrail",
"severities_with_counts": [
"Error=86",
"Info=11944"
]
},
{
"applicationname": "prometheus",
"severities_with_counts": [
"Debug=58",
"Info=58"
]
}
]
Notice that the severity levels are consistently ordered alphabetically in the output of the `arraySort()` function. This ensures that, regardless of the order in which data is processed, the results will always display severity levels in the same sequence. For instance, for `applicationname = 074157727657`, the severity levels are listed as `Debug`, `Error`, `Info`, `Verbose`, and `Warning`, in that specific order.
By using `arraySort()`, we ensure uniformity in the output, making it easier to compare severity counts across applications, as the data is consistently arranged. For example, the severity levels `Error` and `Info` in `aws-cloudtrail` are sorted alphabetically, and no severity types that are not present in the data (like `Verbose`) are included.
Logical sorting using severity_value
When performing aggregations based on categories such as severity levels, the default order might not always reflect their inherent logical hierarchy. For example, you may want to sort severities from least to most critical, where a severity level like `Verbose` should appear before `Debug`, and `Error` should come after `Warning`. To achieve this logical sorting, we can assign each severity level a numeric value that represents its priority or urgency.
This is done by using `create` to add a new field, such as `severity_value`, in which each severity level is mapped to a corresponding numeric value. These numeric values allow you to sort the data logically, from least to most severe. Once this numeric field is created, you can use sorting functions to display the severity levels in the desired order.
This technique assigns numeric values to conditions for logical ordering (e.g., sorting severities from least to most critical). It does so by creating a new field called `severity_value` with the `case_equals` function, which maps each severity level to a corresponding numeric value. Find out more about `case_*` constructs here.
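The general shape of `case_equals`, as used in the query below, is:
case_equals { <expression>, <match_1> -> <result_1>, ..., <match_n> -> <result_n> }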
Query:
source logs
| groupby $l.applicationname, $m.severity agg count() as cnt
| create severity_value from case_equals { severity, VERBOSE -> 0, DEBUG -> 1, INFO -> 2, WARNING -> 3, ERROR -> 4, CRITICAL -> 5 }
| groupby applicationname agg
collect (`{severity_value}:{severity}={cnt}`).arraySort() as severities_with_counts
- `case_equals` assigns numerical values to severity levels.
- The combination of `severity_value` and `arraySort()` ensures that results are displayed in logical severity order, regardless of their natural order in the dataset. The numeric value added at the start of each string drives this sorted display, even if the severity levels in the original dataset aren't in the desired order.
Output:
[
{
"applicationname": "074157727657",
"severities_with_counts": [
"0:Verbose=23",
"1:Debug=55",
"2:Info=1692",
"3:Warning=26",
"4:Error=401"
]
},
{
"applicationname": "aws-cloudtrail",
"severities_with_counts": [
"2:Info=11933",
"4:Error=70"
]
},
{
"applicationname": "prometheus",
"severities_with_counts": [
"1:Debug=58",
"2:Info=59"
]
}
]
Notice that the `case_equals` function assigns a numeric value to each severity level, enabling logical sorting. By adding this numeric value at the start of the severity string, we ensure that the data is ordered in a way that reflects the intended hierarchy, from least severe to most critical. Without this numeric prefix, the severity levels would be sorted alphabetically, which doesn't align with their actual priority.
Also, the use of `arraySort()` plays a key role in maintaining consistency across the results. It ensures that regardless of how the data is processed, the severity levels will always appear in the same logical order. This approach guarantees that critical severity levels like `Error` and `Warning` are correctly placed after less severe levels like `Verbose` and `Debug`, providing a more meaningful and easily interpretable output.
Conclusion
These advanced aggregation techniques offer flexible ways to analyze and summarize log data. By understanding their strengths and limitations, you can select the best approach for your needs:
- `count_if` simplifies conditional counting but can lead to verbose queries.
- `collect` condenses results into a compact format, suitable for quick summaries.
- Sorting methods (`arraySort`, `severity_value`) add predictability and logical ordering to aggregated data.