stitch
Description
The stitch
command performs a horizontal union of two datasets, combining them side-by-side. It aligns rows from one dataset with rows from another and concatenates their columns, creating a single, unified dataset.
When using the stitch
command:
- Datasets must be ordered, as rows are combined in sequence (e.g., row 1 of dataset A is stitched with row 1 of dataset B).
- If one dataset has more rows than the other, unmatched rows will have
null
values in the stitched columns. - The resulting dataset will contain all columns from both datasets.
How stitch
differs from union
stitch
combines datasets horizontally by appending columns row-by-row.union
appends rows vertically, stacking datasets on top of each other.
Syntax
Usage
Consider the following datasets created using custom enrichments:
sales
dataset:
{ "product": "Widget", "sales": 100 }
{ "product": "Gadget", "sales": 200 }
{ "product": "Dashboard", "sales": 150 }
revenue
dataset:
{ "product": "Widget", "revenue": 5000 }
{ "product": "Gadget", "revenue": 8000 }
{ "product": "Dashboard", "revenue": 6000 }
The goal is to combine these datasets side-by-side, ensuring that each row from one dataset aligns with the corresponding row from the other.
Query
Query breakdown
-
source sales
Fetches all rows from the
sales
dataset, which contains products and their corresponding sales figures. -
orderby product
Sorts the
sales
dataset by theproduct
field, ensuring a consistent order for row alignment. -
stitch (source revenue | orderby product)
- Fetches rows from the
revenue
dataset and sorts them by theproduct
field. - Combines the
sales
andrevenue
datasets horizontally, aligning rows based on their order after sorting.
- Fetches rows from the
-
into combined_data
Stores the stitched dataset in a variable called
combined_data
. The resulting dataset includes columns from both thesales
andrevenue
datasets.
Result
The resulting dataset aligns rows from both datasets horizontally:
{ "product": "Widget", "sales": 100, "combined_data": { "product": "Widget", "revenue": 5000 } }
{ "product": "Gadget", "sales": 200, "combined_data": { "product": "Gadget", "revenue": 8000 } }
{ "product": "Dashboard", "sales": 150, "combined_data": { "product": "Dashboard", "revenue": 6000 } }
Handling unequal row counts
If the datasets have unequal rows, the stitch
command fills missing values with null
.
Consider the following datasets:
sales
dataset: (3 rows)
{ "product": "Widget", "sales": 100 }
{ "product": "Gadget", "sales": 200 }
{ "product": "Dashboard", "sales": 150 }
revenue
dataset: (2 rows)
Query
Result
{ "product": "Widget", "sales": 100, "combined_data": { "product": "Widget", "revenue": 5000 } }
{ "product": "Gadget", "sales": 200, "combined_data": { "product": "Gadget", "revenue": 8000 } }
{ "product": "Dashboard", "sales": 150, "combined_data": { "product": "Dashboard", "revenue": null } }
Guiding notes
- Rows must correlate logically for stitching to produce meaningful results. Ensure that rows from both datasets represent the same entities and are in the same order. For example, if the
product
field in thesales
dataset does not match theproduct
field in therevenue
dataset for corresponding rows, stitching will not work as expected. - If datasets differ in row count, the result will include
null
values for missing data in the shorter dataset.