join - Join two queries based on a condition

join merges the results of the current (left) query with a second (right) query based on a specified condition. At its core, it creates all possible combinations of the left and right results (called the Cartesian product). For each combination, the event of the right side will be added into the event of the left side at the keypath provided after into. Furthermore, some extra logic is applied depending on the modifier.

Overview

This document describes usage for join and its variations:

join: Functions as a left join by default. It selects matching rows from the right query for each row in the left query based on the specified condition. If no match is found, all rows from the left query are still included, with unmatched columns from the right query set to null.

join inner: Only returns rows where there are non-null results from both queries. inner joins have no additional logic, returning the filtered Cartesian product only.

join full: Returns a row for every event, including those that may not have a match in either query (left or right), filling missing values with null. In the case of full, in addition to the behavior of left, all rows from the right side not present in the filtered Cartesian product are also returned, inside of the into keypath, with no other fields surrounding it.

join cross: Pairs each row from the left query with every row from the right query, generating the full Cartesian product. Unlike other join types, join cross does not support on or using conditions—it functions similarly to an inner join but without any filtering, returning all possible row combinations.

Note

For left (default), inner, and full joins, you can specify either a join condition using the on keyword or a keypath using the using keyword. The Cartesian product is filtered to retain only rows where the condition holds true or where the keypath values match on both sides.
Since all joins—regardless of the modifier—are based on the Cartesian product, duplicate results may occur if the join condition matches multiple times. To prevent unintended duplication, consider preprocessing subqueries, such as by using distinct.

Syntax

<left_side_query> | join [left/inner/full] (<right_side_query>) on <condition> into <right_side_target>
<left_side_query> | join [left/inner/full] (<right_side_query>) using <join_keypath_1> [, <join_keypath_2>, ...] into <right_side_target>
<left_side_query> | join cross (<right_side_query>) into <right_side_target>

Basic join concepts

There are some key concepts to cover when thinking about utilizing joins in DataPrime.

Right-side query - The right_side_query denotes the new query to join with.

Left-side query - The left_side_query denotes the initial query, for example, in the query source logs | filter x != null | join ..., the left-hand side query is source logs | filter x != null.

Join condition - The condition for whether results from both queries should be joined.

Join keypath - join_keypath_N as a join key means to join results where a given keypath is equal in the results of both the left-side query and the right-side query.

Right-side target - The keypath in which the joined data will be added to the current query.

Note

A query can begin directly with join as source logs | is optional.

The condition

In the condition, you can use the left=> and right=> prefixes to refer to the events of the left and right queries, respectively. However, it is not required if a keypath exists in only one of the queries.

When using the == (equality) operator in your condition, it must compare a keypath from the left query with a keypath from the right query. However, given that the keypaths must be unique or prefixed with left=> or right=>, the ordering of the operands is not important.

Limitations of `join`

There are a number of limitations to be mindful of when writing a join query:

The join condition only supports keypath equality (==). If multiple equality conditions are needed, they can be combined with && (logical AND).

One side of the join (either the current query or the join query) must be small (< 200MB). To bring down the size of the query, make use of filter and remove.

Left outer joins require all columns in the condition to be non-null. Any null columns will not be joined. To include right-hand query null columns, opt for join full. To exclude all null columns produced by the left and right joins, use join inner.

Example usage

Basic usage

Consider the following documents from a custom enrichment table named users:

{ "id": "111", "name": "John" }
{ "id": "222", "name": "Emily" }
{ "id": "333", "name": "Alice" }

We can see that they are documents pertaining to a user ID and a name. Now, consider this set of documents:

{ "userid": "111", "timestamp": "2022-01-01T12:00:00Z" }
{ "userid": "111", "timestamp": "2022-01-01T12:30:00Z" }
{ "userid": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "userid": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "userid": "222", "timestamp": "2022-01-01T13:00:00Z" }

These documents indicate user login events, but we have a problem. There is no name in these documents, meaning it's difficult to see who logged in. This is where we can leverage join.

source users | join (source logs | countby userid) on id == userid into logins

Breaking down this query into stages, we:

source the custom enrichment table entitled users.
Within the join, we generate a count by the userid field. This gives us our count stats.
We compare the id field in the custom enrichment table with the userid field in the logs.
We declare that the result should be pushed into logins. If the logins key already exists on the left hand query, it will be overwritten.

This results in the following documents:

{ "id": "111", "name": "John", "logins": { "userid": "111", "_count": 2 } }
{ "id": "222", "name": "Emily", "logins": { "userid": "222", "_count": 3 } }
{ "id": "333", "name": "Alice", "logins": null }

Notice how the result of the right side query now sits inside the logins field. Also notice that there were no logins for user ID 333 (Alice), so the logins field is null because there was no result matched by the join condition.

Introducing the `using` keyword

If we consider the same dataset, but this time, the logs look like this:

{ "id": "111", "timestamp": "2022-01-01T12:00:00Z" }
{ "id": "111", "timestamp": "2022-01-01T12:30:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }

Now, we have two fields entitled id on both sides. We can leverage the using keyword, to quickly join these results:

source users | join (source logins | countby id) using id into logins

This will produce the same result, but instead of a field userid, it will be id:

{ "id": "111", "name": "John", "logins": { "id": "111", "_count": 2 } }
{ "id": "222", "name": "Emily", "logins": { "id": "222", "_count": 3 } }
{ "id": "333", "name": "Alice", "logins": null }

Note

If you have two fields that are named differently but which to simplify your join query, you can move one of the fields to match the other, to make the keypaths match on both sides.

Leveraging left=> and right=>

If we do not wish to use the using keyword, we can also use left=> and right=> to signify the correct keypaths. Assuming the same dataset as the previous example, we can write the following query:

source users | join (source logins | countby id) on left=>id == right=>id into logins

This was necessary because both documents contain a field with the same name, meaning in order for DataPrime to uniquely identify a field, it must know which side of the field to which we are referring. This query will result in the same output as the previous using keyword.

Using `join full`

Assume that we have the following documents from a custom enrichment table named users:

{ "id": "111", "name": "John" }
{ "id": "222", "name": "Emily" }
{ "id": "333", "name": "Alice" }

Now, consider this set of documents, as before:

{ "id": "001", "timestamp": "2022-01-01T12:00:00Z" }
{ "id": "111", "timestamp": "2022-01-01T12:00:00Z" }
{ "id": "111", "timestamp": "2022-01-01T12:30:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }

The second set of documents (right query) includes a log entry with "id": "001" that does not exist in the first set of documents (left query). If you use a standard join, this entry from the right query will be ignored and will not appear in the result. To ensure that every "id" field is included in the output—whether it appears in the left or right query—you can use join full as shown below:

source users | join full (source logins | countby id) using id into logins

Result:

{ "id": "001", "name": "null", "logins": { "id": "001", "_count": 1 } }
{ "id": "111", "name": "John", "logins": { "id": "111", "_count": 2 } }
{ "id": "222", "name": "Emily", "logins": { "id": "222", "_count": 3 } }
{ "id": "333", "name": "Alice", "logins": null }

By using join full, all "id" fields from both datasets are preserved, and any missing values are filled with null.

Hint: join full is particularly valuable when the results of both queries include time buckets. For example, if the left query's results are missing a time bucket for a specific hour (e.g., XX:XX:XX), join full ensures this data point is included. This is especially useful for comparing two time series on a graph, ensuring consistency in time bucket representation.

Using `join inner`

If the user wants to remove any rows if column results for the left or right queries produce a null value, use join inner.

The following query removes rows with unmatched data from either side:

source users | join inner (source logins | countby id) using id into logins

The left query source users retrieves the users dataset containing fields like "id" and "name".
The right query (source logins | countby id) retrieves the logins dataset, grouping by "id" and counting occurrences for each "id".
join inner matches rows where "id" exists in both datasets and merges their data into a single record.
Rows with no match in either dataset are excluded from the final results.

In this case, for the two document sets above, results will be as follows:

{ "id": "111", "name": "John", "logins": { "id": "111", "_count": 2 } }
{ "id": "222", "name": "Emily", "logins": { "id": "222", "_count": 3 } }

Using `join cross`

join cross combines every row from the left query with every row from the right query, producing a Cartesian product of the two sets.

Assume that we have the following documents from a custom enrichment table named users:

{ "id": "111", "name": "John" }
{ "id": "222", "name": "Emily" }
{ "id": "333", "name": "Alice" }

Now, consider this set of documents, as before:

{ "id": "111", "timestamp": "2022-01-01T12:00:00Z" }
{ "id": "111", "timestamp": "2022-01-01T12:30:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }
{ "id": "222", "timestamp": "2022-01-01T13:00:00Z" }

The following query will produce a Cartesian product of the users and logs datasets because join cross pairs every row from the left query with every row from the right query, regardless of any matching conditions.

source users | join cross (source logs) into logins

{ "id": "111", "name": "John", "logins": { "id": "111", "timestamp": "2022-01-01T12:00:00Z" } }
{ "id": "111", "name": "John", "logins": { "id": "111", "timestamp": "2022-01-01T12:30:00Z" } }
{ "id": "111", "name": "John", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "111", "name": "John", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "111", "name": "John", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "222", "name": "Emily", "logins": { "id": "111", "timestamp": "2022-01-01T12:00:00Z" } }
{ "id": "222", "name": "Emily", "logins": { "id": "111", "timestamp": "2022-01-01T12:30:00Z" } }
{ "id": "222", "name": "Emily", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "222", "name": "Emily", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "222", "name": "Emily", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "333", "name": "Alice", "logins": { "id": "111", "timestamp": "2022-01-01T12:00:00Z" } }
{ "id": "333", "name": "Alice", "logins": { "id": "111", "timestamp": "2022-01-01T12:30:00Z" } }
{ "id": "333", "name": "Alice", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "333", "name": "Alice", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }
{ "id": "333", "name": "Alice", "logins": { "id": "222", "timestamp": "2022-01-01T13:00:00Z" } }

The query results in each user being paired with every log entry: 3 rows from users multiplied by 5 rows from logs, resulting in 15 rows. Each user (John, Emily, Alice) is paired with every log entry.

Hint: Using join cross is especially useful when you are interested in seeing a complete picture of your data, then adding a left or right join to these results.

multigroupby

Next limit

join - Join two queries based on a condition

Overview

Syntax

Basic join concepts

The condition

Limitations of join

Example usage

Basic usage

Introducing the using keyword

Leveraging left=> and right=>

Using join full

Using join inner

Using join cross

Limitations of `join`

Introducing the `using` keyword

Using `join full`

Using `join inner`

Using `join cross`