- we have no exactly-once delivery guarantees
- user-side software can send events more than once
- we have to rely on flawed algorithms
There are four strategies planned for incorporating de-duplication mechanisms into RDB Shredder:
| Strategy | Batch? | Same event ID? | Same event fingerprint? | Availability |
| --- | --- | --- | --- | --- |
| In-batch natural de-duplication | In-batch | Yes | Yes | R76 Changeable Hawk-Eagle |
| In-batch synthetic de-duplication | In-batch | Yes | No | R86 Petra |
| Cross-batch natural de-duplication | Cross-batch | Yes | Yes | R88 Angkor Wat |
| Cross-batch synthetic de-duplication | Cross-batch | Yes | No | Planned |
In-batch natural de-duplication
As of the R76 Changeable Hawk-Eagle release, RDB Shredder de-duplicates “natural duplicates”, i.e. events which share the same event ID (`event_id`) and the same event payload (based on `event_fingerprint`), meaning that they are semantically identical to each other. For a given ETL run (batch) of events being processed, RDB Shredder keeps only the first out of each group of natural duplicates; all others are discarded.
To enable this functionality you need to have the Event Fingerprint Enrichment enabled in order to correctly populate the `event_fingerprint` field.
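For illustration, here is a minimal sketch of that grouping logic in Scala. The simplified Event case class and field names are assumptions for the example, not the actual RDB Shredder data model (which operates on Spark RDDs of enriched events):

```scala
// Minimal sketch of in-batch natural de-duplication. The Event case class is
// a simplified stand-in for the enriched event model used by RDB Shredder.
case class Event(eventId: String, eventFingerprint: String, payload: String)

def dedupeNatural(batch: Seq[Event]): Seq[Event] =
  batch
    // Natural duplicates share both the event ID and the fingerprint
    .groupBy(e => (e.eventId, e.eventFingerprint))
    .values
    // Keep only the first event of each group; discard the rest
    .map(_.head)
    .toSeq
```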
In-batch synthetic de-duplication
As of the R86 Petra release, RDB Shredder de-duplicates “synthetic duplicates”, i.e. events which share the same event ID (`event_id`) but have a different event payload (based on `event_fingerprint`), meaning that they can be either semantically independent events (caused by the flawed algorithms discussed above) or the same event with slightly different payloads (caused by third-party software). For a given ETL run (batch) of events being processed, RDB Shredder uses the following strategy (sketched in code after the list):
- Collect all the events with identical `event_id` which are left after natural de-duplication
- Generate a new random `event_id` for each of them
- Create a `duplicate` context with the original `event_id` for each event where the duplicated `event_id` was encountered
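Here is a hedged sketch of this strategy, reusing the simplified Event type from the earlier example; the DuplicateContext shape is illustrative, not the exact self-describing context that RDB Shredder attaches:

```scala
import java.util.UUID

// Illustrative shape of the duplicate context; RDB Shredder emits it as a
// self-describing JSON context attached to the event.
case class DuplicateContext(originalEventId: String)

// Sketch of in-batch synthetic de-duplication, run after natural
// de-duplication has already collapsed semantically identical events.
def dedupeSynthetic(batch: Seq[Event]): Seq[(Event, Option[DuplicateContext])] =
  batch.groupBy(_.eventId).values.toSeq.flatMap {
    // A unique event_id needs no treatment
    case Seq(single) => Seq((single, None))
    // Several events share an event_id but differ in fingerprint: give each a
    // fresh random ID and record the original event_id in a duplicate context
    case group =>
      group.map { e =>
        (e.copy(eventId = UUID.randomUUID().toString), Some(DuplicateContext(e.eventId)))
      }
  }
```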
No configuration is required for this functionality: de-duplication is performed automatically in RDB Shredder. However, it is highly recommended to use the Event Fingerprint Enrichment in order to correctly populate the `event_fingerprint` field.
Cross-batch natural de-duplication
With cross-batch natural de-duplication, we face a new issue: we need to track events across multiple ETL batches to detect duplicates. We don’t need to store the whole event, just the `event_id` and the `event_fingerprint` metadata. We also need to store these in a database that allows fast random access; we chose Amazon DynamoDB, a fully managed NoSQL database service.
Cross-batch natural de-duplication is implemented in both RDB Shredder and Snowflake Transformer on top of the Snowplow Events Manifest Scala library.
DynamoDB table design
We store the event metadata in a DynamoDB table with the following attributes:
- `eventId`, a String
- `fingerprint`, a String
- `etlTime`, a Date
- `ttl`, a Date

A lookup into this table will tell us if the event we are looking for has been seen before, based on the `eventId` and `fingerprint` pair.
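As an illustration, the table could be created as follows with the AWS SDK for Java (v1) from Scala. The table name and the composite key layout (`eventId` as partition key, `fingerprint` as sort key) are assumptions consistent with the lookup just described:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model._

object ManifestTable {
  // Sketch of the manifest table creation. Only key attributes need to be
  // declared up front; etlTime and ttl are plain non-key attributes.
  def create(): Unit = {
    val client = AmazonDynamoDBClientBuilder.defaultClient()
    client.createTable(
      new CreateTableRequest()
        .withTableName("snowplow-event-manifest") // illustrative name
        .withAttributeDefinitions(
          new AttributeDefinition("eventId", ScalarAttributeType.S),
          new AttributeDefinition("fingerprint", ScalarAttributeType.S))
        .withKeySchema(
          new KeySchemaElement("eventId", KeyType.HASH),      // partition key
          new KeySchemaElement("fingerprint", KeyType.RANGE)) // sort key
        // 100 write units matches the default mentioned below; the read
        // capacity here is a placeholder
        .withProvisionedThroughput(new ProvisionedThroughput(100L, 100L)))
  }
}
```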
We store the `etl_timestamp` to prevent issues in the case of a failed run.
If a run fails and is then rerun, we don’t want the rerun to consider rows in the DynamoDB table
which were written as part of the prior failed run; otherwise all events in the rerun would be
rejected as dupes!
WARNING: Due to the algorithm used in cross-batch de-duplication, we strictly discourage anyone from deleting the enriched/good folder as a pipeline recovery step after the RDB Shredder job has started. Reprocessing known fingerprints will mark events as duplicates and therefore result in data loss.
It is clear when we need to read the event metadata from DynamoDB: during the RDB Shredder process. But when do we write the event metadata for this run back to DynamoDB? Instead of doing all the reads and then all the writes, we decided to use DynamoDB’s conditional updates to perform a check-and-set operation inside RDB Shredder, on a per-event basis.
The algorithm is simple (a code sketch follows the list):
- Attempt to write the `event_id`-`event_fingerprint`-`etl_timestamp` triple to DynamoDB, but only if the `event_id`-`event_fingerprint` pair cannot be found with an earlier `etl_timestamp` than the provided one
- If the write fails, we have a natural duplicate
- If the write succeeds, we know we have an event which is not a natural duplicate (it could still be a synthetic duplicate, however)
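Here is a minimal sketch of that check-and-set write, assuming the table layout above and the AWS SDK for Java (v1). Attribute names, the table name, the epoch-number representation of timestamps, and the exact condition expression are illustrative assumptions, not the actual Snowplow Events Manifest implementation:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, ConditionalCheckFailedException, PutItemRequest}
import scala.jdk.CollectionConverters._

object DedupCheck {
  private val client = AmazonDynamoDBClientBuilder.defaultClient()

  // Returns true if the event is fresh, false if it is a natural duplicate.
  def putIfAbsent(eventId: String, fingerprint: String, etlTime: Long, ttl: Long): Boolean = {
    val item = Map(
      "eventId"     -> new AttributeValue().withS(eventId),
      "fingerprint" -> new AttributeValue().withS(fingerprint),
      "etlTime"     -> new AttributeValue().withN(etlTime.toString),
      "ttl"         -> new AttributeValue().withN(ttl.toString)
    ).asJava

    val request = new PutItemRequest()
      .withTableName("snowplow-event-manifest") // illustrative name
      .withItem(item)
      // Succeed only if no run with an earlier etl_timestamp has already
      // recorded this eventId/fingerprint pair; rows with the same or a later
      // etlTime come from this run or a failed rerun and may be overwritten
      .withConditionExpression("attribute_not_exists(eventId) OR etlTime >= :etl")
      .withExpressionAttributeValues(
        Map(":etl" -> new AttributeValue().withN(etlTime.toString)).asJava)

    try { client.putItem(request); true }
    catch { case _: ConditionalCheckFailedException => false } // natural duplicate
  }
}
```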
If we discover a natural duplicate, we delete it. We know that we have an “original” of this event already safely in Redshift (because we have found it in DynamoDB).
In the code, we perform this check after we have grouped the batch by `event_fingerprint`; this ensures that all check-and-set requests for a specific `event_id`-`event_fingerprint` pair in DynamoDB will come from a single mapper.
To enable cross-batch natural de-duplication you must provide a DynamoDB table configuration to EmrEtlRunner and grant the necessary rights in IAM. If this is not provided, cross-batch natural de-duplication will be disabled; in-batch de-duplication will still work, however.
To avoid “cold start” problems you may want to use the Event-manifest-populator Spark job, which back-populates duplicate storage with events from a specified point in time.
To make sure the DynamoDB table does not become over-populated, we use the DynamoDB Time-to-Live feature, which provides automatic cleanup after the specified time. For event manifests this time is the ETL timestamp plus 180 days, stored in the `ttl` attribute.
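For clarity, the TTL computation amounts to the following, expressed in epoch seconds as DynamoDB's Time-to-Live feature expects (the helper name is ours, for illustration):

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// The ttl attribute: ETL timestamp plus 180 days, in epoch seconds.
def manifestTtl(etlTime: Instant): Long =
  etlTime.plus(180, ChronoUnit.DAYS).getEpochSecond
```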
Costs and performance penalty
Cross-batch de-duplication uses DynamoDB as transient storage and therefore has associated AWS costs. The default write capacity is 100 units, which means that no matter how powerful your EMR cluster is, the whole RDB Shredder job can be throttled by DynamoDB. The rough cost of the default setup is 50 USD per month; however, throughput can be tweaked according to your needs.