dedupeby
Description
The dedupeby command removes duplicate documents based on one or more
expressions, keeping only N events for each unique combination of the
specified fields. This is especially useful for sampling representative data
from large datasets without aggregation.
Conceptually, it functions like a smart filter: it doesn’t modify event content or compute summaries—it simply trims redundancy by retaining a limited number of examples per group.
Use the optional orderby clause to control which events are kept within
each group—for example, the most recent entries by sorting on $m.timestamp desc, or the slowest requests by sorting on a latency field. Without
orderby, the choice of which events to retain per group is not
deterministic.
The content of each retained document remains unchanged. dedupeby only
limits how many documents are kept for each unique grouping.
Syntax
dedupeby <expression1> [, <expression2> ...] keep N [orderby <expression> [asc|desc] [, <expression> [asc|desc] ...]]
Example 1
Use case: Sample unique requests per operation name
Suppose your application receives many repeated requests across endpoints,
such as /index and /healthcheck. You want to inspect only a few examples
of each to spot anomalies or patterns without processing every event.
dedupeby can keep just a fixed number of samples for each unique operation.
Example data
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "index", "latency": 110 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 },
{ "operationName": "healthcheck", "latency": 350 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "index", "latency": 135 },
{ "operationName": "healthcheck", "latency": 109 },
{ "operationName": "healthcheck", "latency": 4150 }
Example query
dedupeby operationName keep 2
Example output
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 }
The dedupeby command keeps two events for each unique operationName,
trimming duplicates while preserving the original event content. This provides
a quick, representative sample for inspection or debugging.
Example 2
Use case: Keep the slowest requests per operation name
Add an orderby clause to control which events are retained within each
group. Sorting by latency in descending order makes dedupeby keep the
two highest-latency events for each unique operationName, producing a
deterministic sample focused on the slowest requests.
Example data
{ "operationName": "index", "latency": 120 },
{ "operationName": "index", "latency": 98 },
{ "operationName": "index", "latency": 110 },
{ "operationName": "healthcheck", "latency": 4000 },
{ "operationName": "healthcheck", "latency": 200 },
{ "operationName": "healthcheck", "latency": 350 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "index", "latency": 135 },
{ "operationName": "healthcheck", "latency": 109 },
{ "operationName": "healthcheck", "latency": 4150 }
Example query
dedupeby operationName keep 2 orderby latency desc
Example output
{ "operationName": "index", "latency": 135 },
{ "operationName": "index", "latency": 125 },
{ "operationName": "healthcheck", "latency": 4150 },
{ "operationName": "healthcheck", "latency": 4000 }
Without the orderby clause, the same query would still return two events
per operationName, but which specific events are kept would be
non-deterministic.
Example 3
Use case: Keep the latest status per entity
A common pattern is collapsing a stream of state changes down to the most
recent record per entity. Combine keep 1 with orderby $m.timestamp desc
to keep only the latest event for each unique incident_id.
Example data
{ "incident_id": "INC-1", "state": "open", "$m": { "timestamp": "2026-04-25T09:00:00Z" } },
{ "incident_id": "INC-1", "state": "ack", "$m": { "timestamp": "2026-04-25T09:05:00Z" } },
{ "incident_id": "INC-1", "state": "resolved", "$m": { "timestamp": "2026-04-25T09:30:00Z" } },
{ "incident_id": "INC-2", "state": "open", "$m": { "timestamp": "2026-04-25T10:00:00Z" } },
{ "incident_id": "INC-2", "state": "ack", "$m": { "timestamp": "2026-04-25T10:15:00Z" } }
Example query
dedupeby incident_id keep 1 orderby $m.timestamp desc
Example output
{ "incident_id": "INC-1", "state": "resolved", "$m": { "timestamp": "2026-04-25T09:30:00Z" } },
{ "incident_id": "INC-2", "state": "ack", "$m": { "timestamp": "2026-04-25T10:15:00Z" } }
For each unique incident_id, only the event with the latest $m.timestamp
is retained, giving you a current-state view of every incident.