RDB Transformer 3.0.x

An example of the minimal required config for the Spark transformer can be found here and a more detailed one here.

An example of the minimal required config for the stream transformer can be found here and a more detailed one here.

This is a complete list of the options that can be configured:

Spark transformer only

input: Required. S3 URI of the enriched archive. It must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories.
runInterval.*: Specifies the interval to process.
runInterval.sinceTimestamp: Optional. Start processing after this timestamp.
runInterval.sinceAge: Optional. A duration that specifies the maximum age of folders that should be processed. If sinceAge and sinceTimestamp are both specified, the later of the two determines the earliest folder that will be processed.
runInterval.until: Optional. Process until this timestamp.
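
As a rough illustration of how the Spark transformer options above map onto a config file, the fragment below sets two of them using HOCON dotted paths. The bucket name is a hypothetical placeholder, and writing sinceAge as a HOCON duration string is an assumption; the example configs linked above remain the reference for exact value formats.

```hocon
# Sketch only: the bucket name is a hypothetical placeholder.
input = "s3://example-bucket/enriched/archive/"

# Assumption: sinceAge accepts a HOCON duration string such as "14 days".
# Folders older than this age would not be processed.
runInterval.sinceAge = "14 days"
```

The sinceTimestamp and until options are left out here because their expected timestamp format is not stated in this list.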

Stream transformer only

input.type: Optional. The only supported values are kinesis and file. The default is kinesis.
input.appName: Optional. KCL app name. The default is snowplow-rdb-transformer.
input.streamName: Required for kinesis. Name of the enriched Kinesis stream.
input.region: AWS region of the Kinesis stream. Optional if it can be resolved with the AWS region provider chain.
input.position: Optional. Kinesis position: LATEST or TRIM_HORIZON. The default is LATEST.
windowing: Optional. How often to emit the shredding complete message. The default is 10 minutes.
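
The stream transformer options above could be written out as the following HOCON fragment. This is a sketch, not a tested config: the stream name and region are hypothetical placeholders, and expressing windowing as a HOCON duration string is an assumption. Optional keys are shown with their documented defaults purely for clarity.

```hocon
# Sketch only: stream name and region are hypothetical placeholders.
input {
  type = "kinesis"                        # documented default
  appName = "snowplow-rdb-transformer"    # documented default KCL app name
  streamName = "example-enriched-stream"  # required when type is kinesis
  region = "eu-central-1"                 # may be omitted if the region provider chain resolves it
  position = "LATEST"                     # or "TRIM_HORIZON"
}

# Assumption: windowing accepts a HOCON duration string.
windowing = "10 minutes"
```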

Common settings

output.path: Required. S3 URI of the transformed output.
output.compression: Optional. One of NONE or GZIP. The default is GZIP.
output.region: AWS region of the S3 bucket. Optional if it can be resolved with the AWS region provider chain.
queue.type: Required. Type of the message queue. Can be either sqs or sns.
queue.queueName: Required if queue type is sqs. Name of the SQS queue.
queue.topicArn: Required if queue type is sns. ARN of the SNS topic.
queue.region: AWS region of the SQS queue or SNS topic. Optional if it can be resolved with the AWS region provider chain.
formats.*: Schema-specific format settings.
formats.transformationType: Required. Type of transformation, either shred or widerow. See Shredded data and Wide row format.
formats.default: Required. Either TSV or JSON. Data format produced by default. TSV is recommended as it enables table autocreation, but it requires an Iglu Server to be available with known schemas (including Snowplow schemas). JSON does not require an Iglu Server, but it requires Redshift JSONPaths to be configured and does not support table autocreation.
formats.tsv: Required. List of Iglu URIs; can be set to an empty list [], which is the default. If default is set to JSON, the schemas in this list will still be shredded into TSV.
formats.json: Required. List of Iglu URIs; can be set to an empty list [], which is the default. If default is set to TSV, the schemas in this list will still be shredded into JSON.
formats.skip: Required. List of Iglu URIs; can be set to an empty list [], which is the default. Schemas for which loading can be skipped.
monitoring.sentry.dsn: Optional. For tracking runtime exceptions.
validations.*: Optional. Criteria to validate events against.
validations.minimumTimestamp: This is currently the only validation criterion. It checks that no timestamp in the event is older than a specific point in time, e.g. 2021-11-18T11:00:00.00Z.
featureFlags.*: Optional. Enable features that are still in beta, or which aim to enable smoother upgrades.
featureFlags.legacyMessageFormat: This is currently the only feature flag. Setting this to true allows you to use a new version of the transformer with an older version of the loader.
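
To show how the common settings nest in practice, here is a hedged HOCON sketch covering the required keys plus one optional validation. The bucket, queue name, and region are hypothetical placeholders; the key names and allowed values are taken from the list above.

```hocon
# Sketch only: bucket, queue name, and region are hypothetical placeholders.
output {
  path = "s3://example-bucket/transformed/"
  compression = "GZIP"   # documented default; NONE is the alternative
  region = "eu-central-1"
}

queue {
  type = "sqs"           # with "sns", set topicArn instead of queueName
  queueName = "example-loader-queue"
  region = "eu-central-1"
}

formats {
  transformationType = "shred"   # or "widerow"
  default = "TSV"
  tsv = []    # Iglu URIs to force into TSV
  json = []   # Iglu URIs to force into JSON
  skip = []   # Iglu URIs whose loading can be skipped
}

# Optional; the timestamp is the example value given in the list above.
validations.minimumTimestamp = "2021-11-18T11:00:00.00Z"
```

Dotted paths and nested blocks are interchangeable in HOCON, so validations.minimumTimestamp could equally be written as a validations { ... } block.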

Deduplication (Spark transformer only)

The settings below exist for benchmarking purposes only, and we strongly discourage changing the preset defaults:

deduplication.synthetic.type: One of NONE (disabled), BROADCAST (the default), or JOIN. BROADCAST and JOIN are different low-level implementations.
deduplication.synthetic.cardinality: Do not deduplicate pairs with cardinality less than or equal to this value. The default is 1.
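
For reference, the preset defaults would look like this if written out explicitly in HOCON. This is only a sketch of the key layout; leaving the block out entirely should give the same behaviour, assuming the defaults are as listed above.

```hocon
# Preset defaults written out explicitly; changing them is discouraged.
deduplication.synthetic {
  type = "BROADCAST"
  cardinality = 1
}
```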