dedupeby
Description
The dedupeby command removes duplicate documents based on one or more expressions, keeping only N events for each unique combination of the specified fields. This is especially useful for sampling representative data from large datasets without aggregation.
Conceptually, it functions like a smart filter: it doesn’t modify event content or compute summaries—it simply trims redundancy by retaining a limited number of examples per group.
Use the optional orderby clause to control which events are kept within each group—for example, the most recent entries by sorting on $m.timestamp desc, or the slowest requests by sorting on a latency field. Without orderby, the choice of which events to retain per group is not deterministic.
Note
The content of each retained document remains unchanged. dedupeby only limits how many documents are kept for each unique grouping.
Syntax
dedupeby <expression1> [, <expression2> ...] keep N [orderby <expression> [asc|desc] [, <expression> [asc|desc] ...]]
Example 1
Use case: Sample unique requests per operation name
Suppose your application receives many repeated requests across endpoints, such as /index and /healthcheck. You want to inspect only a few examples of each to spot anomalies or patterns without processing every event. dedupeby can keep just a fixed number of samples for each unique operation.
Example data
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "index", "latency": 110 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 },
{ "operationName": "healthcheck", "latency": 350 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "index", "latency": 135 },
{ "operationName": "healthcheck", "latency": 109 },
{ "operationName": "healthcheck", "latency": 4150 }
Example query
Example output
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 }
The dedupeby command keeps two events for each unique operationName, trimming duplicates while preserving the original event content. This provides a quick, representative sample for inspection or debugging.
Example 2
Use case: Keep the slowest requests per operation name
Add an orderby clause to control which events are retained within each group. Sorting by latency in descending order makes dedupeby keep the two highest-latency events for each unique operationName, producing a deterministic sample focused on the slowest requests.
Example data
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "index", "latency": 110 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 },
{ "operationName": "healthcheck", "latency": 350 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "index", "latency": 135 },
{ "operationName": "healthcheck", "latency": 109 },
{ "operationName": "healthcheck", "latency": 4150 }
Example query
Example output
{ "operationName": "index", "latency": 135 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "healthcheck", "latency": 4150 },
{ "operationName": "healthcheck", "latency": 4000 }
Without the orderby clause, the same query would still return two events per operationName, but which specific events are kept would be non-deterministic.
Example 3
Use case: Keep the latest status per entity
A common pattern is collapsing a stream of state changes down to the most recent record per entity. Combine keep 1 with orderby $m.timestamp desc to keep only the latest event for each unique incident_id.
Example data
{ "incident_id": "INC-1", "state": "open", "$m": { "timestamp": "2026-04-25T09:00:00Z" } },
{ "incident_id": "INC-1", "state": "ack", "$m": { "timestamp": "2026-04-25T09:05:00Z" } },
{ "incident_id": "INC-1", "state": "resolved", "$m": { "timestamp": "2026-04-25T09:30:00Z" } },
{ "incident_id": "INC-2", "state": "open", "$m": { "timestamp": "2026-04-25T10:00:00Z" } },
{ "incident_id": "INC-2", "state": "ack", "$m": { "timestamp": "2026-04-25T10:15:00Z" } }
Example query
Example output
{ "incident_id": "INC-1", "state": "resolved", "$m": { "timestamp": "2026-04-25T09:30:00Z" } },
{ "incident_id": "INC-2", "state": "ack", "$m": { "timestamp": "2026-04-25T10:15:00Z" } }
For each unique incident_id, only the event with the latest $m.timestamp is retained, giving you a current-state view of every incident.
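The keep-N-per-group semantics described across the three examples, written out as an added Python reference model (not the product's implementation): events are grouped on the dedupe fields, optionally sorted per group on the orderby keys, and then trimmed to N with their contents untouched. The sample data is constructed from the page's examples; the input-order fallback when no orderby is given is one fixed choice standing in for the non-deterministic selection the page describes.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Iterable


def resolve(event: dict[str, Any], path: str) -> Any:
    """Resolve a dotted field path such as "$m.timestamp" against a nested event."""
    value: Any = event
    for part in path.split("."):
        value = value[part]
    return value


def dedupeby(events: Iterable[dict[str, Any]], by: list[str], keep: int,
             orderby: list[tuple[str, str]] | None = None) -> list[dict[str, Any]]:
    """Keep at most `keep` events per unique combination of the `by` fields.

    Event contents are returned unchanged. Without `orderby` the first `keep`
    events seen in input order are kept (an assumed stand-in for the
    non-deterministic choice). With `orderby` each group is sorted on the
    given (field, "asc"|"desc") keys before trimming.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    groups: dict[tuple[Any, ...], list[dict[str, Any]]] = defaultdict(list)
    for event in events:  # groups are emitted in first-seen order
        groups[tuple(resolve(event, f) for f in by)].append(event)
    kept: list[dict[str, Any]] = []
    for members in groups.values():
        # sorted() is stable, so applying the minor keys first and the
        # primary key last yields a correct multi-key ordering.
        for field, direction in reversed(orderby or []):
            if direction not in ("asc", "desc"):
                raise ValueError(f"bad direction {direction!r}")
            members = sorted(members, key=lambda e: resolve(e, field),
                             reverse=(direction == "desc"))
        kept.extend(members[:keep])
    return kept


# Constructed sample data mirroring the page's three examples.
REQUESTS = [{"operationName": op, "latency": ms} for op, ms in [
    ("index", 120), ("index", 98), ("index", 110), ("healthcheck", 4000),
    ("healthcheck", 200), ("healthcheck", 350), ("index", 125), ("index", 135),
    ("healthcheck", 109), ("healthcheck", 4150)]]
# ISO-8601 UTC strings of equal width, so string order is time order.
INCIDENTS = [{"incident_id": i, "state": s, "$m": {"timestamp": t}} for i, s, t in [
    ("INC-1", "open", "2026-04-25T09:00:00Z"), ("INC-1", "ack", "2026-04-25T09:05:00Z"),
    ("INC-1", "resolved", "2026-04-25T09:30:00Z"), ("INC-2", "open", "2026-04-25T10:00:00Z"),
    ("INC-2", "ack", "2026-04-25T10:15:00Z")]]
```

Calling `dedupeby(REQUESTS, ["operationName"], 2)`, `dedupeby(REQUESTS, ["operationName"], 2, [("latency", "desc")])` and `dedupeby(INCIDENTS, ["incident_id"], 1, [("$m.timestamp", "desc")])` reproduces the three example outputs above in order.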