Downloading and Running the Google Cloud Storage Loader

Cloud Storage Loader is available as a ZIP archive, a Docker image or a Cloud Dataflow template; choose whichever best fits your use case.

Template

You can run Dataflow templates using a variety of means:

  • Using the GCP console
  • Using gcloud
  • Using the REST API

Refer to the documentation on executing templates to learn more.

Here, we provide an example using gcloud:

gcloud dataflow jobs run [JOB-NAME] \
  --gcs-location gs://sp-hosted-assets/4-storage/snowplow-google-cloud-storage-loader/0.1.0/SnowplowGoogleCloudStorageLoaderTemplate-0.1.0 \
  --parameters \
inputSubscription=projects/[PROJECT]/subscriptions/[SUBSCRIPTION],\
outputDirectory=gs://[BUCKET]/YYYY/MM/dd/HH/,\
outputFilenamePrefix=output,\
shardTemplate=-W-P-SSSSS-of-NNNNN,\
outputFilenameSuffix=.txt,\
windowDuration=5,\
compression=none,\
numShards=1

where:

  • outputDirectory partitions the output by date
  • outputFilenamePrefix, shardTemplate, outputFilenameSuffix and numShards are optional and control how the output files are named
  • windowDuration is optional and is expressed in minutes
  • compression is optional and can be gzip, bz2 or none
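As an alternative to gcloud, the same template can be launched through the Dataflow REST API (the templates.launch method). The curl sketch below assumes you are authenticated via gcloud and have picked a [REGION]; only the required parameters are shown:

curl -X POST \
  "https://dataflow.googleapis.com/v1b3/projects/[PROJECT]/locations/[REGION]/templates:launch?gcsPath=gs://sp-hosted-assets/4-storage/snowplow-google-cloud-storage-loader/0.1.0/SnowplowGoogleCloudStorageLoaderTemplate-0.1.0" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "jobName": "[JOB-NAME]",
    "parameters": {
      "inputSubscription": "projects/[PROJECT]/subscriptions/[SUBSCRIPTION]",
      "outputDirectory": "gs://[BUCKET]/YYYY/MM/dd/HH/"
    }
  }'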

ZIP archive

You can find the archive hosted on our Bintray.
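For example, a typical download-and-unzip sequence looks like the following (the URL and file name here are placeholders; copy the actual download link from Bintray):

wget [ARCHIVE-URL]
unzip snowplow-google-cloud-storage-loader-0.1.0.zip
cd snowplow-google-cloud-storage-loader-0.1.0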

Once unzipped, the artifact can be run as follows:

./bin/snowplow-google-cloud-storage-loader \
  --runner=DataflowRunner \
  --project=[PROJECT] \
  --streaming=true \
  --zone=europe-west2-a \
  --inputSubscription=projects/[PROJECT]/subscriptions/[SUBSCRIPTION] \
  --outputDirectory=gs://[BUCKET]/YYYY/MM/dd/HH/ \
  --outputFilenamePrefix=output \
  --shardTemplate=-W-P-SSSSS-of-NNNNN \
  --outputFilenameSuffix=.txt \
  --windowDuration=5 \
  --compression=none \
  --numShards=1

The outputDirectory path partitions the output by date; the remaining output options are optional and have the same meanings as in the template example above.

To display the help message:

./bin/snowplow-google-cloud-storage-loader --help

To display documentation about Cloud Storage Loader-specific options:

./bin/snowplow-google-cloud-storage-loader --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options
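After submitting the job, you can check that it is running by listing active Dataflow jobs with gcloud:

gcloud dataflow jobs list --status=active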

Docker image

You can also find the image on our Bintray.

A container can be run as follows:

docker run \
  -v $PWD/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --runner=DataflowRunner \
  --jobName=[JOB-NAME] \
  --project=[PROJECT] \
  --streaming=true \
  --zone=[ZONE] \
  --inputSubscription=projects/[PROJECT]/subscriptions/[SUBSCRIPTION] \
  --outputDirectory=gs://[BUCKET]/YYYY/MM/dd/HH/ \
  --outputFilenamePrefix=output \
  --shardTemplate=-W-P-SSSSS-of-NNNNN \
  --outputFilenameSuffix=.txt \
  --windowDuration=5 \
  --compression=none \
  --numShards=1

The -v and -e options are only needed when running outside GCP, to make a service account key available inside the container; the output options are optional and have the same meanings as in the template example above.

To display the help message:

docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --help

To display documentation about Cloud Storage Loader-specific options:

docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options
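Whichever deployment method you choose, a quick sanity check that the loader is writing output is to list the contents of your bucket with gsutil, for example:

gsutil ls gs://[BUCKET]/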