S3 loader configuration reference

This is a complete list of the options that can be configured in the S3 loader HOCON config file. The example configs on GitHub show how to prepare this file.

| Option | Description |
| --- | --- |
| `purpose` | Required. Use `RAW` to sink data exactly as-is. Use `ENRICHED_EVENTS` to also enable event latency metrics. Use `SELF_DESCRIBING` to enable partitioning self-describing data by its schema. |
| `input.appName` | Required. Kinesis Client Library app name (corresponds to the DynamoDB table name). |
| `input.streamName` | Required. Name of the Kinesis stream from which to read inputs. |
| `input.position` | Required. Use `TRIM_HORIZON` to start streaming at the last untrimmed record in the shard, which is the oldest data record in the shard. Use `LATEST` to start streaming just after the most recent record in the shard. |
| `input.customEndpoint` | Optional. Overrides the default endpoint for Kinesis client API calls. |
| `input.maxRecords` | Required. How many records the client should pull from Kinesis each time. |
| `output.s3.path` | Required. Full path to output data, e.g. `s3://acme-snowplow-output/raw/` |
| `output.s3.dateFormat` | Optional. Configures a time-based partitioning structure in S3 directories, e.g. `date=%Y-%M-%d` |
| `output.s3.filenamePrefix` | Optional. Adds a prefix to output files. |
| `output.s3.compression` | Required. Either `LZO` or `GZIP`. |
| `output.s3.maxTimeout` | Required. Maximum time for which the application is allowed to keep failing, e.g. in case of an S3 outage. |
| `output.s3.customEndpoint` | Optional. Overrides the default endpoint for S3 client API calls. |
| `region` | Optional. When used with the `output.s3.customEndpoint` option, this sets the region of the bucket. Also sets the region of the DynamoDB table. Defaults to the current region. |
| `output.bad.streamName` | Required. Name of a Kinesis stream to which failures are written. |
| `buffer.byteLimit` | Required. Maximum bytes to read from Kinesis before flushing a file to S3. |
| `buffer.recordLimit` | Required. Maximum records to read from Kinesis before flushing a file to S3. |
| `buffer.timeLimit` | Required. Maximum time to wait, in milliseconds, between writing files to S3. |
| `monitoring.snowplow.collector` | Optional. URI of a Snowplow collector, e.g. `https://snplow.acme.ru`. Used for monitoring application lifecycle and failure events. |
| `monitoring.snowplow.appId` | Required only if the collector URI is also configured. Sets the `appId` field of the Snowplow events. |
| `monitoring.sentry.dsn` | Optional. For tracking uncaught runtime exceptions. |
| `monitoring.metrics.cloudwatch` | Optional boolean, default `true`. Set to `false` to disable sending metrics to CloudWatch. |
| `monitoring.metrics.hostname` | Optional. Hostname of a StatsD server to which loading metrics (latency and event counts) are sent. |
| `monitoring.metrics.port` | Optional. Port of the StatsD server. |
| `monitoring.metrics.tags` | E.g. `{ "key1": "value1", "key2": "value2" }`. Tags are used to annotate the StatsD metric with any contextual information. |
| `monitoring.metrics.prefix` | Optional, default `snowplow.s3loader`. Configures the prefix of StatsD metric names. |
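
As a rough illustration of how the required options fit together, a minimal config might look like the sketch below. The nesting simply follows the dotted option names listed above. Every stream name, app name, and numeric limit here is a hypothetical placeholder rather than a recommended value; only the S3 path is the example given in this reference. Optional settings are left out, so the loader falls back to its defaults for them.

```hocon
# Minimal sketch, assuming the key paths listed in this reference.
# All names and numbers are hypothetical placeholders.

purpose = "RAW"

input {
  appName = "acme-s3-loader"        # hypothetical; corresponds to the DynamoDB table name
  streamName = "acme-raw-stream"    # hypothetical Kinesis stream to read from
  position = "LATEST"               # or "TRIM_HORIZON" to start from the oldest record
  maxRecords = 100                  # hypothetical batch size per pull
}

output {
  s3 {
    path = "s3://acme-snowplow-output/raw/"
    compression = "GZIP"            # or "LZO"
    maxTimeout = 2000               # hypothetical; the unit is not stated in this reference
  }
  bad {
    streamName = "acme-bad-stream"  # hypothetical Kinesis stream for failures
  }
}

buffer {
  byteLimit = 1048576               # hypothetical: flush after about 1 MiB
  recordLimit = 100                 # hypothetical: flush after 100 records
  timeLimit = 60000                 # milliseconds; hypothetical: flush at least once a minute
}
```

A file is flushed to S3 when any one of the three `buffer` limits is reached, so lower values here would produce smaller, more frequent files.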