This one-off job solves the “cold start” problem of identifying cross-batch natural duplicates in Snowplow’s Relational Database Shredder step.
In other words, without running this job you will still be able to deduplicate events across batches, but if Relational Database Shredder encounters a duplicate of an event that was shredded before you enabled cross-batch deduplication, it will land in your storage target.
In order to use Event Manifest Populator, you need to have Boto 2 installed:
$ pip install boto
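If you want to double-check that Boto 2 (and not boto3) is what got installed, a quick sanity check like the one below should print a 2.x version string:

$ python -c "import boto; print(boto.__version__)"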
Next, you need to grab the run.py file, which contains the instructions to run the job on AWS EMR.
You can download it directly from GitHub:
$ wget https://raw.githubusercontent.com/snowplow/snowplow/master/5-data-modeling/event-manifest-populator/run.py
Now you can run Event Manifest Populator with a single command (inside the directory with run.py):
$ python run.py $ENRICHED_ARCHIVE_S3_PATH $STORAGE_CONFIG_PATH $IGLU_RESOLVER_PATH
The task has three required arguments:
- Path to the enriched events archive. It can be found in the aws.s3.buckets.enriched.archive setting in your config.yml.
- Local path to the Duplicate storage configuration JSON (see the sketch after this list).
- Local path to the Iglu resolver configuration JSON (see the sketch after this list).
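For reference, the Duplicate storage configuration is a self-describing JSON. The sketch below assumes the amazon_dynamodb_config 1-0-0 schema; every value is a placeholder you would replace with your own credentials, AWS region, and DynamoDB table name:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-0",
  "data": {
    "name": "AWS DynamoDB duplicates storage",
    "accessKeyId": "YOUR_ACCESS_KEY_ID",
    "secretAccessKey": "YOUR_SECRET_ACCESS_KEY",
    "awsRegion": "us-east-1",
    "dynamodbTable": "snowplow-event-manifest"
  }
}

The Iglu resolver configuration is the standard resolver-config JSON used across Snowplow components; a minimal example pointing at Iglu Central looks like this:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}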
Optionally, you can also pass the following arguments:
- --since to reduce the amount of data to be stored in DynamoDB.
If this option is passed, Manifest Populator will process only the enriched events generated after the specified date.
Input date supports two formats: YYYY-MM-dd and YYYY-MM-dd-HH-mm-ss.
- --log-path to store EMR job logs on S3. Normally, Manifest Populator does not
produce any logs or output, but if an error occurred you will be able to
inspect it in the EMR logs stored at this path.
- --profile to specify the AWS profile used to create this EMR job.
- --jar to specify the S3 path to a custom JAR.
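Putting it all together, a full invocation with the optional arguments might look like the following; the bucket names, config file names, date, and profile name here are hypothetical and should be replaced with your own:

$ python run.py s3://acme-snowplow/enriched/archive \
    dynamodb_config.json iglu_resolver.json \
    --since 2017-01-01 \
    --log-path s3://acme-snowplow/emr-logs \
    --profile my-aws-profile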
Note that Event Manifest Populator must be used only with run IDs produced by Snowplow versions newer than R73 Cuban Macaw, as the format of the TSV files has changed.