RDB Transformer configuration reference

The configuration reference on this page is written for RDB Transformer 4.0.0.

The configuration reference pages for previous versions can be found here.

An example of the minimal required config for the Spark transformer can be found here and a more detailed one here.

An example of the minimal required config for the stream transformer can be found here and a more detailed one here.

This is a complete list of the options that can be configured:

Spark transformer only

| Parameter | Description |
|---|---|
| input | Required. S3 URI of the enriched archive. It must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories. |
| runInterval.* | Specifies the interval to process. |
| runInterval.sinceTimestamp | Optional. Start processing after this timestamp. |
| runInterval.sinceAge | Optional. A duration that specifies the maximum age of folders that should get processed. If sinceAge and sinceTimestamp are both specified, then the latest value of the two determines the earliest folder that will be processed. |
| runInterval.until | Optional. Process until this timestamp. |
| monitoring.sentry.dsn | Optional. For tracking runtime exceptions. |
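
As a rough illustration of how the Spark-only options fit together, the fragment below sketches them in HOCON. The bucket name, timestamps, duration and Sentry DSN are placeholders. The timestamp format used for sinceTimestamp and until (the same YYYY-MM-DD-hh-mm-ss pattern as the run folders) is an assumption, not something this page states.

```hocon
{
  # Enriched archive containing run=YYYY-MM-DD-hh-mm-ss folders (placeholder bucket)
  "input": "s3://example-bucket/enriched/archive/",

  "runInterval": {
    # Assumed format, mirroring the run folder names
    "sinceTimestamp": "2021-10-12-14-55-22",
    # Ignore folders older than this duration (placeholder value)
    "sinceAge": "14 days",
    # Assumed format, as above
    "until": "2021-12-10-18-34-52"
  },

  "monitoring": {
    "sentry": {
      # Placeholder DSN
      "dsn": "https://sentry.example.com/1"
    }
  }
}
```

With both sinceTimestamp and sinceAge set like this, whichever of the two resolves to the later point in time sets the earliest folder that is processed.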

Stream transformer only

| Parameter | Description |
|---|---|
| input.type | Optional. The only supported values are kinesis and file. The default is kinesis. |
| input.appName | Optional. KCL app name. The default is snowplow-rdb-transformer. |
| input.streamName | Required for kinesis. Enriched Kinesis stream name. |
| input.region | AWS region of the Kinesis stream. Optional if it can be resolved with the AWS region provider chain. |
| input.position | Optional. Kinesis position: LATEST or TRIM_HORIZON. The default is LATEST. |
| windowing | Optional. How often to emit the shredding complete message. The default is 10 minutes. |
| monitoring.metrics.* | Send metrics to a StatsD server or stdout. |
| monitoring.metrics.statsd.* | Optional. For sending metrics (good and bad event counts) to a StatsD server. |
| monitoring.metrics.statsd.hostname | Required if the monitoring.metrics.statsd section is configured. The host name of the StatsD server. |
| monitoring.metrics.statsd.port | Required if the monitoring.metrics.statsd section is configured. Port of the StatsD server. |
| monitoring.metrics.statsd.tags | Optional. Tags used to annotate the StatsD metric with any contextual information. |
| monitoring.metrics.statsd.prefix | Optional. Configures the prefix of StatsD metric names. The default is snowplow.transformer. |
| monitoring.metrics.stdout.* | Optional. For sending metrics to stdout. |
| monitoring.metrics.stdout.prefix | Optional. Overrides the default metric prefix. |
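
The stream-only options could be combined along the following lines. This is a sketch in HOCON, not a tested configuration: the stream name, region, StatsD host and port, and metric prefix are placeholders, and writing the windowing duration as a string is an assumption. The statsd.tags option is left out because this page does not describe its shape.

```hocon
{
  "input": {
    "type": "kinesis",
    "appName": "snowplow-rdb-transformer",
    # Placeholder stream name and region
    "streamName": "enriched-events",
    "region": "eu-central-1",
    "position": "LATEST"
  },

  # Duration written as a string (assumed syntax)
  "windowing": "10 minutes",

  "monitoring": {
    "metrics": {
      "statsd": {
        # hostname and port become required once this section is present
        "hostname": "localhost",
        "port": 8125
      },
      "stdout": {
        # Placeholder prefix overriding the default
        "prefix": "example.transformer"
      }
    }
  }
}
```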

Common settings

| Parameter | Description |
|---|---|
| output.path | Required. S3 URI of the transformed output. |
| output.compression | Optional. One of NONE or GZIP. The default is GZIP. |
| output.region | AWS region of the S3 bucket. Optional if it can be resolved with the AWS region provider chain. |
| queue.type | Required. Type of the message queue. Can be either sqs or sns. |
| queue.queueName | Required if queue type is sqs. Name of the SQS queue. |
| queue.topicArn | Required if queue type is sns. ARN of the SNS topic. |
| queue.region | AWS region of the SQS queue or SNS topic. Optional if it can be resolved with the AWS region provider chain. |
| formats.* | Schema-specific format settings. |
| formats.transformationType | Required. Type of transformation, either shred or widerow. See Shredded data and Wide row format. |
| formats.fileFormat | Optional. Output file format produced when the transformation is widerow. Either JSON or PARQUET. The default is JSON. |
| formats.default | Optional. Data format produced by default when the transformation is shred. Either TSV or JSON. The default is TSV. TSV is recommended as it enables table autocreation, but requires an Iglu Server to be available with known schemas (including Snowplow schemas). JSON does not require an Iglu Server, but requires Redshift JSONPaths to be configured and does not support table autocreation. |
| formats.tsv | Optional. List of Iglu URIs. The default is an empty list []. If default is set to JSON, the schemas in this list will still be shredded into TSV. |
| formats.json | Optional. List of Iglu URIs. The default is an empty list []. If default is set to TSV, the schemas in this list will still be shredded into JSON. |
| formats.skip | Optional. List of Iglu URIs. The default is an empty list []. Schemas for which loading can be skipped. |
| validations.* | Optional. Criteria to validate events against. |
| validations.minimumTimestamp | This is currently the only validation criterion. It checks that all timestamps in the event are older than a specific point in time, e.g. 2021-11-18T11:00:00.00Z. |
| featureFlags.* | Optional. Enable features that are still in beta, or which aim to enable smoother upgrades. |
| featureFlags.legacyMessageFormat | This is currently the only feature flag. Setting this to true allows you to use a new version of the transformer with an older version of the loader. |
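
To make the interplay of the common settings more concrete, here is a hedged HOCON sketch for a shred transformation that reports to SQS. The bucket, queue name, region and the com.example Iglu URI are hypothetical placeholders; only the keys and the allowed values come from the table.

```hocon
{
  "output": {
    # Placeholder bucket and region
    "path": "s3://example-bucket/transformed/",
    "compression": "GZIP",
    "region": "eu-central-1"
  },

  "queue": {
    "type": "sqs",
    # Placeholder name; with "type": "sns", set topicArn instead
    "queueName": "example-queue",
    "region": "eu-central-1"
  },

  "formats": {
    "transformationType": "shred",
    # TSV by default, with one hypothetical schema kept as JSON
    "default": "TSV",
    "tsv": [],
    "json": [
      "iglu:com.example/legacy_event/jsonschema/1-0-0"
    ],
    "skip": []
  },

  "validations": {
    "minimumTimestamp": "2021-11-18T11:00:00.00Z"
  },

  "featureFlags": {
    # Set to true only when pairing with an older version of the loader
    "legacyMessageFormat": false
  }
}
```

For a widerow transformation, transformationType would be widerow and fileFormat would select JSON or PARQUET. The default, tsv, json and skip keys are described above in terms of shredding.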

Deduplication (Spark transformer only)

The settings below exist for benchmarking purposes only, and we strongly discourage changing the preset defaults:

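| Parameter | Description |
|---|---|
| deduplication.synthetic.type | One of NONE (disables synthetic deduplication), BROADCAST (the default) or JOIN. BROADCAST and JOIN are different low-level implementations. |
| deduplication.synthetic.cardinality | Pairs with a cardinality less than or equal to this value are not deduplicated. The default is 1. |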
