- Load collector payloads from the “raw” stream, to maintain an archive of the original data, before enrichment.
- Load enriched events from the “enriched” stream. These serve as input for the RDB loader when loading to a warehouse.
- Load failed events from the “bad” stream.
Records that can’t be successfully written to S3 are written to a second Kinesis stream with the error message.
Records are treated as raw byte arrays. Elephant Bird’s BinaryBlockWriter class is used to serialize them as a Protocol Buffers array (so it is clear where one record ends and the next begins) before compressing them.
The compression process generates both compressed .lzo files and small .lzo.index files (splittable LZO). Each index file contains the byte offsets of the LZO blocks in the corresponding compressed file, meaning that the blocks can be processed in parallel.
LZO encoding is generally used for raw data produced by the Snowplow Collector.
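To illustrate why raw byte arrays need explicit framing before they are concatenated and compressed, here is a minimal sketch using a simple length prefix. This is not Elephant Bird's actual on-disk format (which wraps records in Protocol Buffers blocks); the function names `frame_records` and `unframe_records` are hypothetical and exist only for this example.

```python
import struct


def frame_records(records):
    """Concatenate byte records, each prefixed with a 4-byte big-endian length.

    Illustrative only: the loader itself uses Elephant Bird's Protocol Buffers
    block format, not this length-prefix scheme.
    """
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def unframe_records(blob):
    """Recover the original records from a length-prefixed blob."""
    records = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        records.append(blob[offset:offset + length])
        offset += length
    return records


# Raw payloads may contain newlines or arbitrary bytes, so a plain
# newline separator would not be enough to split them back apart.
payloads = [b"first\npayload", b"", b"\x00\x01binary"]
assert unframe_records(frame_records(payloads)) == payloads
```

Because each record carries its own boundary information, the payload bytes never need escaping, which is the property the Protocol Buffers array provides for the raw stream.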
The records are treated as byte arrays containing UTF-8 encoded strings (CSV, JSON or TSV). Newlines separate the records written to a file. This format can be used with the Snowplow Kinesis Enriched stream, among other streams.
Gzip encoding is generally used for both enriched data and bad data.
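The gzip layout described above (UTF-8 records, one per line) can be read with standard tooling. The following is a minimal sketch using Python's standard library; the helper names, the sample file name, and the sample TSV rows are made up for illustration and do not come from the loader itself.

```python
import gzip
import os
import tempfile


def write_records(path, records):
    """Write UTF-8 string records to a gzip file, one record per line."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as f:
        for record in records:
            f.write(record + "\n")


def read_records(path):
    """Read newline-separated UTF-8 records back from a gzip file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical TSV rows standing in for enriched events.
sample = ["app1\tweb\tpage_view", "app1\tweb\tpage_ping"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")  # hypothetical file name
    write_records(path, sample)
    assert read_records(path) == sample
```

This only works because the records are text that does not itself contain newlines, which is why the raw collector payloads use the LZO format with explicit framing instead.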
Available on Terraform Registry
A Terraform module which deploys the Snowplow S3 Loader on AWS EC2 for use with Kinesis. For installing in other environments, please see the other installation options below.
docker run \
  -d \
  --name snowplow-s3-loader \
  --restart always \
  --log-driver awslogs \
  --log-opt awslogs-group=snowplow-s3-loader \
  --log-opt awslogs-stream=`ec2metadata --instance-id` \
  --network host \
  -v $(pwd):/snowplow/config \
  -e 'JAVA_OPTS=-Xms512M -Xmx1024M -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN' \
  snowplow/snowplow-s3-loader:2.0.0 \
  --config /snowplow/config/config.hocon
java -jar snowplow-s3-loader-2.0.0.jar --config config.hocon
The JAR is attached to the GitHub release.
Running the JAR requires the native LZO binaries to be installed. For example, on Debian they can be installed with:
sudo apt-get install lzop liblzo2-dev