S3 Loader

Overview

The Snowplow S3 Loader consumes records from an Amazon Kinesis stream and writes them to S3. A typical Snowplow pipeline uses the S3 Loader in several places:

  • Load collector payloads from the “raw” stream, to maintain an archive of the original data, before enrichment.
  • Load enriched events from the “enriched” stream. These serve as input for the RDB loader when loading to a warehouse.
  • Load failed events from the “bad” stream.

Records that can’t be successfully written to S3 are instead written to a second Kinesis stream, together with the error message.

Output Formats

LZO

Records are treated as raw byte arrays. Elephant Bird’s BinaryBlockWriter class is used to serialize them as a Protocol Buffers array (so it is clear where one record ends and the next begins) before compressing them.

The compression process generates both compressed .lzo files and small .lzo.index files (splittable LZO). Each index file contains the byte offsets of the LZO blocks in the corresponding compressed file, meaning that the blocks can be processed in parallel.

LZO encoding is generally used for raw data produced by Snowplow Collector.
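
To illustrate how an index of byte offsets lets blocks be processed independently, here is a minimal Python sketch. It assumes the .lzo.index file follows the hadoop-lzo layout of consecutive 8-byte big-endian offsets; that layout is an assumption, not something this page states. The function names read_lzo_index and block_ranges are hypothetical helpers, not part of the S3 Loader.

```python
import struct
from typing import List, Tuple


def read_lzo_index(data: bytes) -> List[int]:
    """Decode an index given as raw bytes into a list of block offsets.

    Assumption: the index is a flat sequence of 8-byte big-endian
    signed integers, one per LZO block (the hadoop-lzo convention).
    """
    if len(data) % 8 != 0:
        raise ValueError("index length is not a multiple of 8 bytes")
    count = len(data) // 8
    return list(struct.unpack(f">{count}q", data))


def block_ranges(offsets: List[int], file_size: int) -> List[Tuple[int, int]]:
    """Pair each block offset with the next one to get (start, end) byte ranges.

    The last block is assumed to run to the end of the compressed file.
    """
    ends = offsets[1:] + [file_size]
    return list(zip(offsets, ends))
```

Each (start, end) range could then be handed to a separate worker, which is the idea behind processing the blocks in parallel.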

Gzip

The records are treated as byte arrays containing UTF-8 encoded strings (whether CSV, JSON or TSV). Newlines are used to separate the records written to a file. This format can be used with the Snowplow Kinesis Enriched stream, among other streams.

Gzip encoding is generally used for both enriched data and bad data.
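
Because a Gzip output file is just newline-separated UTF-8 records, it can be read back with standard tooling. The following Python sketch shows one way to do that; the helper names write_records and read_records and the sample rows are illustrative assumptions, and write_records exists only to produce a sample file, not to mirror how the loader writes its output.

```python
import gzip
import os
import tempfile
from typing import Iterable, Iterator


def write_records(path: str, records: Iterable[str]) -> None:
    """Write records as newline-separated UTF-8 text inside a gzip file."""
    with gzip.open(path, mode="wt", encoding="utf-8", newline="\n") as handle:
        for record in records:
            handle.write(record + "\n")


def read_records(path: str) -> Iterator[str]:
    """Yield one record per line from a gzip file of newline-separated records."""
    # newline="\n" splits on "\n" only, so a stray "\r" inside a record is kept.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Round-trip two made-up TSV rows through a temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.gz")
        write_records(sample, ["app1\tweb\tpage_view", "app1\tweb\tpage_ping"])
        for record in read_records(sample):
            print(record.split("\t"))
```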

Running

Available on Terraform Registry

A Terraform module is available which deploys the Snowplow S3 Loader on AWS EC2 for use with Kinesis. For installing in other environments, please see the other installation options below.

Docker image

docker run \
  -d \
  --name snowplow-s3-loader \
  --restart always \
  --log-driver awslogs \
  --log-opt awslogs-group=snowplow-s3-loader \
  --log-opt awslogs-stream=`ec2metadata --instance-id` \
  --network host \
  -v $(pwd):/snowplow/config \
  -e 'JAVA_OPTS=-Xms512M -Xmx1024M -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN' \
  snowplow/snowplow-s3-loader:2.0.0 \
  --config /snowplow/config/config.hocon

Jar

java -jar snowplow-s3-loader-2.0.0.jar --config config.hocon

The JAR is attached to the GitHub release.

Running the JAR requires the native LZO binaries to be installed. On Debian, for example, they can be installed with:

sudo apt-get install lzop liblzo2-dev
