Commands
Run command
The most useful command is the run command, which allows you to actually run your EMR job:
$ ./snowplow-emr-etl-runner run
The available options are as follows:
Usage: run [options]
    -c, --config CONFIG              configuration file
    -n, --enrichments ENRICHMENTS    enrichments directory
    -r, --resolver RESOLVER          Iglu resolver file
    -t, --targets TARGETS            targets directory
    -d, --debug                      enable EMR Job Flow debugging
    -f {enrich,shred,elasticsearch,archive_raw,rdb_load,analyze,archive_enriched,archive_shredded,staging_stream_enrich},
        --resume-from                resume from the specified step
    -x {staging,enrich,shred,elasticsearch,archive_raw,rdb_load,consistency_check,analyze,load_manifest_check,archive_enriched,archive_shredded,staging_stream_enrich},
        --skip                       skip the specified step(s)
    -i, --include {vacuum}           include additional step(s)
    -l, --lock PATH                  where to store the lock
        --ignore-lock-on-start       ignore the lock if it is set when starting
        --consul ADDRESS             address to the Consul server
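For example, a typical invocation might look like the sketch below; the file paths are placeholders for your own configuration, resolver and enrichments locations, and the --skip flag is included purely for illustration:
$ ./snowplow-emr-etl-runner run \
    -c /path/to/config.yml \
    -r /path/to/iglu_resolver.json \
    -n /path/to/enrichments \
    --skip analyze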
Note that the config and resolver options are mandatory.
Note that in Stream Enrich mode you can neither skip nor resume from the staging, enrich and archive_raw steps. Instead of staging and enrich, a single special staging_stream_enrich step is used in Stream Enrich mode.
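For instance, resuming a failed run in Stream Enrich mode would use this special step name rather than staging or enrich; the paths below are again placeholders:
$ ./snowplow-emr-etl-runner run \
    -c /path/to/config.yml \
    -r /path/to/iglu_resolver.json \
    --resume-from staging_stream_enrich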
Lint commands
Other useful commands include the lint commands, which allow you to check the validity of your resolver or enrichments with respect to their respective schemas.
If you want to lint your resolver:
$ ./snowplow-emr-etl-runner lint resolver
The mandatory options are:
Usage: lint resolver [options]
    -r, --resolver RESOLVER          Iglu resolver file
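For example, a full invocation might look like this, where the resolver path is a placeholder for your own Iglu resolver file:
$ ./snowplow-emr-etl-runner lint resolver -r /path/to/iglu_resolver.json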
If you want to lint your enrichments:
$ ./snowplow-emr-etl-runner lint enrichments
The mandatory options are:
Usage: lint enrichments [options]
    -r, --resolver RESOLVER          Iglu resolver file
    -n, --enrichments ENRICHMENTS    enrichments directory
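For example, a full invocation might look like this, where both paths are placeholders for your own resolver file and enrichments directory:
$ ./snowplow-emr-etl-runner lint enrichments -r /path/to/iglu_resolver.json -n /path/to/enrichments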
Checking the results
Once you have run EmrEtlRunner, you should be able to manually inspect the folder in S3 specified by the :out: parameter in your config.yml file and see the newly generated files. These contain the cleaned data, either for uploading into a storage target (e.g. Redshift) or for analysing directly using Hive, Spark or some other querying tool on EMR.
Note: most Snowplow users run the ‘spark’ version of the ETL process, in which case the data generated is saved into subfolders with names of the form part-000.... If, however, you are running the legacy ‘hive’ ETL (because e.g. you want to use Hive as your storage target, rather than Redshift, which is the only storage target the ‘spark’ ETL currently supports), the subfolder names will be of the format dt=....
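As a quick sanity check, you could list the output location with the AWS CLI; the bucket name and prefix below are hypothetical and should be replaced with the values from your own config.yml:
$ aws s3 ls --recursive s3://my-snowplow-out-bucket/enriched/good/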
Next steps
Comfortable using EmrEtlRunner? Then schedule it so that it regularly takes new data generated by Stream Enrich, shreds it (using RDB Shredder) and loads it into Redshift (using RDB Loader).
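One simple way to do this is with a cron entry. The sketch below assumes a hypothetical installation under /opt/snowplow and a daily run at 03:00; adjust the paths, schedule and log location to suit your own setup:
# Hypothetical crontab entry: run EmrEtlRunner daily at 03:00 and append its output to a log file
0 3 * * * /opt/snowplow/snowplow-emr-etl-runner run -c /opt/snowplow/config.yml -r /opt/snowplow/iglu_resolver.json -n /opt/snowplow/enrichments >> /var/log/snowplow-emr-etl-runner.log 2>&1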