1. Overview
Once you have the EmrEtlRunner process working smoothly, you can schedule it to automate the regular shredding and loading of data into Redshift.
We run our daily ETL jobs at 3 AM UTC so that we are sure that we have processed all of the events from the day before (CloudFront logs can take some time to arrive).
The scheduling options are covered in turn below.
2. cron
Note: Running EmrEtlRunner as a Ruby (rather than JRuby) app is no longer actively supported. The latest version of EmrEtlRunner is available from our Bintray.
The recommended way of scheduling the ETL process is as a daily cronjob.
0 4 * * * root cronic /path/to/eer/snowplow-emr-etl-runner run -c config.yml
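This will run the ETL job daily at 4 AM in the server's time zone. cronic wraps the command and stays silent unless the job fails, so cron only emails you when something goes wrong.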
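To make the "email only on failure" behaviour concrete, here is a minimal shell sketch of the idea behind a wrapper such as cronic. It is not cronic's actual implementation: the function name `run_quiet` is hypothetical, and the sketch only looks at the exit code, whereas cronic's own rules for detecting a failure may differ.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of cronic or Snowplow: run a command,
# print nothing if it succeeds, and print its captured output plus the
# exit code if it fails. cron emails whatever a job prints, so a silent
# success means no email.
run_quiet() {
    # Capture stdout and stderr together so nothing reaches cron on success.
    rq_output=$("$@" 2>&1)
    rq_rc=$?
    if [ "$rq_rc" -ne 0 ]; then
        printf 'Command failed with exit code %s: %s\n' "$rq_rc" "$*"
        printf '%s\n' "$rq_output"
    fi
    # Propagate the original exit code to the caller.
    return "$rq_rc"
}
```

If this function were saved in a script that forwards its arguments to it (for example by ending with `run_quiet "$@"`), that script could stand in for cronic in the crontab entry above. In practice cronic is the better choice, since it is already written and tested for this purpose.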
3. Jenkins
Some developers use the Jenkins continuous integration server (or Hudson, which is very similar) to schedule their Hadoop and Hive jobs.
Describing how to do this is out of scope for this guide, but the blog post Lowtech Monitoring with Jenkins is a great tutorial on using Jenkins for non-CI-related tasks, and could be easily adapted to schedule EmrEtlRunner.