# `dedupeby`

## Description

The `dedupeby` command removes duplicate documents based on one or more expressions, keeping only *N* events for each unique combination of the specified fields. This is especially useful for sampling representative data from large datasets without aggregation.

Conceptually, it functions like a smart filter: it doesn’t modify event content or compute summaries—it simply trims redundancy by retaining a limited number of examples per group.

Use the optional `orderby` clause to control *which* events are kept within each group—for example, the most recent entries by sorting on `$m.timestamp desc`, or the slowest requests by sorting on a latency field. Without `orderby`, the choice of which events to retain per group is not deterministic.

Note

The content of each retained document remains unchanged. `dedupeby` only limits how many documents are kept for each unique grouping.

## Syntax

```dataprime
dedupeby <expression1> [, <expression2> ...] keep N [orderby <expression> [asc|desc] [, <expression> [asc|desc] ...]]
```

## Example 1

**Use case: Sample unique requests per operation name**

Suppose your application receives many repeated requests across endpoints, such as `/index` and `/healthcheck`. You want to inspect only a few examples of each to spot anomalies or patterns without processing every event. `dedupeby` can keep just a fixed number of samples for each unique operation.

### Example data

```json
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "index", "latency": 110 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 },
{ "operationName": "healthcheck", "latency": 350 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "index", "latency": 135 },
{ "operationName": "healthcheck", "latency": 109 },
{ "operationName": "healthcheck", "latency": 4150 }
```

### Example query

```dataprime
dedupeby operationName keep 2
```

### Example output

```json
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 }
```

The `dedupeby` command keeps two events for each unique `operationName`, trimming duplicates while preserving the original event content. This provides a quick, representative sample for inspection or debugging.

## Example 2

**Use case: Keep the slowest requests per operation name**

Add an `orderby` clause to control which events are retained within each group. Sorting by `latency` in descending order makes `dedupeby` keep the two highest-latency events for each unique `operationName`, producing a deterministic sample focused on the slowest requests.

### Example data

```json
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "index", "latency": 110 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 },
{ "operationName": "healthcheck", "latency": 350 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "index", "latency": 135 },
{ "operationName": "healthcheck", "latency": 109 },
{ "operationName": "healthcheck", "latency": 4150 }
```

### Example query

```dataprime
dedupeby operationName keep 2 orderby latency desc
```

### Example output

```json
{ "operationName": "index", "latency": 135 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "healthcheck", "latency": 4150 },
{ "operationName": "healthcheck", "latency": 4000 }
```

Without the `orderby` clause, the same query would still return two events per `operationName`, but which specific events are kept would be non-deterministic.

## Example 3

**Use case: Keep the latest status per entity**

A common pattern is collapsing a stream of state changes down to the most recent record per entity. Combine `keep 1` with `orderby $m.timestamp desc` to keep only the latest event for each unique `incident_id`.

### Example data

```json
{ "incident_id": "INC-1", "state": "open",       "$m": { "timestamp": "2026-04-25T09:00:00Z" } },
{ "incident_id": "INC-1", "state": "ack",        "$m": { "timestamp": "2026-04-25T09:05:00Z" } },
{ "incident_id": "INC-1", "state": "resolved",   "$m": { "timestamp": "2026-04-25T09:30:00Z" } },
{ "incident_id": "INC-2", "state": "open",       "$m": { "timestamp": "2026-04-25T10:00:00Z" } },
{ "incident_id": "INC-2", "state": "ack",        "$m": { "timestamp": "2026-04-25T10:15:00Z" } }
```

### Example query

```dataprime
dedupeby incident_id keep 1 orderby $m.timestamp desc
```

### Example output

```json
{ "incident_id": "INC-1", "state": "resolved", "$m": { "timestamp": "2026-04-25T09:30:00Z" } },
{ "incident_id": "INC-2", "state": "ack",      "$m": { "timestamp": "2026-04-25T10:15:00Z" } }
```

For each unique `incident_id`, only the event with the latest `$m.timestamp` is retained, giving you a current-state view of every incident.
