Getting started on Snowplow Open Source

  1. Home
  2. Docs
  3. Getting started on Snowplow Open Source
  4. Setup Snowplow Open Source on AWS
  5. Setup Destinations
  6. Redshift
  7. Setup EmrEtlRunner
  8. Schedule EmrEtlRunner

Schedule EmrEtlRunner

1. Overview

Once you have the EmrEtlRunner process working smoothly, you can schedule it to automate the regular load shredding and loading of data into Redshift.

We run our daily ETL jobs at 3 AM UTC so that we are sure that we have processed all of the events from the day before (CloudFront logs can take some time to arrive).

To consider your different scheduling options in turn:

2. cron

Running EmrEtlRunner as Ruby (rather than JRuby apps) is no longer actively supported. The latest version of the EmrEtlRunner is available from our Bintray here.

The recommended way of scheduling the ETL process is as a daily cronjob.

0 4 * * * root cronic /path/to/eer/snowplow-emr-etl-runner run -c config.yml

This will run the ETL job daily at 4 AM, emailing any failures to you via cronic.

3. Jenkins

Some developers use the Jenkins continuous integration server (or Hudson, which is very similar) to schedule their Hadoop and Hive jobs.

Describing how to do this is out of scope for this guide, but the blog post Lowtech Monitoring with Jenkins is a great tutorial on using Jenkins for non-CI-related tasks, and could be easily adapted to schedule EmrEtlRunner.