RDB loader configuration reference

The shredder and the loader share a single HOCON configuration file. An example can be found here.

The following is a complete list of the options that can be configured:

name: Required. Human-readable identifier, can be random. This does nothing, even though it is a required field.
id: Required. Machine-readable unique identifier, must be a UUID. This does nothing, even though it is a required field.
region: Required. AWS region of the S3 bucket.
messageQueue: Required. Name of the SQS queue used by the shredder and loader to communicate.
shredder.type: Required. Set this to “batch”. (The “stream” mode is in beta and beyond the scope of these docs.)
shredder.input: Required. S3 URL of the enriched archive. It must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories.
shredder.output.path: Required. S3 URL of the shredded output.
shredder.output.compression: Required. One of “NONE” or “GZIP”.
formats.default: Required, either TSV or JSON. Data format produced by default by the shredder. TSV is recommended as it enables table autocreation, but requires Iglu Server to be available with known schemas (including Snowplow schemas). JSON does not require Iglu Server, but requires Redshift JSONPaths to be configured and does not support table autocreation.
formats.tsv: Required, list of Iglu URIs, but can be set to the empty list []. If default is set to JSON, the schemas in this list will still be shredded into TSV.
formats.json: Required, list of Iglu URIs, but can be set to the empty list []. If default is set to TSV, the schemas in this list will still be shredded into JSON.
formats.skip: Required, list of Iglu URIs, but can be set to the empty list []. Schemas for which loading can be skipped.
jsonpaths: Optional. An S3 URI that holds JSONPath files.
storage.type: Required, must be “redshift”.
storage.host: Required. Host name of the Redshift cluster.
storage.port: Required. Port of the Redshift cluster.
storage.roleArn: Required. AWS role ARN allowing Redshift to load data from S3.
storage.schema: Required. Redshift schema name, e.g. “atomic”.
storage.username: Required. DB user with permission to load data.
storage.password: Required. Password of the DB user.
storage.jdbc.blockingRows: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.disableIsValidQuery: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.dsiLogLevel: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.filterLevel: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.loginTimeout: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.logLevel: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.socketTimeout: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.ssl: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.sslMode: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.sslRootCert: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.tcpKeepAlive: Optional. Refer to the Redshift JDBC driver reference.
storage.jdbc.tcpKeepAliveMinutes: Optional. Refer to the Redshift JDBC driver reference.
storage.maxError: Required. Configures the Redshift MAXERROR load option.
steps: Required, list of strings. Can be “analyze”, “vacuum”, “transit_copy”. Use the empty list [] for no extra steps.
monitoring.snowplow.appId: Optional. When using Snowplow tracking, set this appId in the event.
monitoring.snowplow.collector: Optional. Set to a collector URL to turn on Snowplow tracking.
monitoring.sentry.dsn: Optional. For tracking runtime exceptions.
monitoring.statsd.hostname: Optional, for sending loading metrics (latency and event counts) to a statsd server.
monitoring.statsd.port: Optional, port of the statsd server.
monitoring.statsd.tags: Tags used to annotate the statsd metric with any contextual information, e.g. { "key1": "value1", "key2": "value2" }.
monitoring.statsd.prefix: Optional, default “snowplow.rdbloader”. Configures the prefix of statsd metric names.
monitoring.folders.staging: Optional, configuration for periodic unloaded/corrupted folders checks. Path where the loader can store auxiliary logs. The loader must be able to write here, and Redshift must be able to load from here.
monitoring.folders.period: Optional. How often to check for unloaded/corrupted folders.
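
To show how the dotted option names above map onto nested HOCON objects, here is a minimal sketch that sets only the options marked as required. Every value is a placeholder invented for illustration (the bucket names, host, queue name, UUID, role ARN, credentials, and the maxError value are all assumptions), so treat it as a starting shape, not a working configuration.

```hocon
{
  # Placeholder values throughout; replace each with your own.
  "name": "Example Redshift loader",
  "id": "123e4567-e89b-12d3-a456-426655440000",   # must be a UUID
  "region": "us-east-1",                           # region of the S3 bucket
  "messageQueue": "example-loader-queue",          # hypothetical SQS queue name

  "shredder": {
    "type": "batch",
    # Enriched archive containing run=YYYY-MM-DD-hh-mm-ss directories
    "input": "s3://example-bucket/enriched/archive/",
    "output": {
      "path": "s3://example-bucket/shredded/",
      "compression": "GZIP"                        # or "NONE"
    }
  },

  "formats": {
    "default": "TSV",   # TSV requires Iglu Server with known schemas
    "tsv": [],          # lists of Iglu URIs; empty lists are allowed
    "json": [],
    "skip": []
  },

  "storage": {
    "type": "redshift",
    "host": "redshift.example.com",
    "port": 5439,
    "roleArn": "arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
    "schema": "atomic",
    "username": "example_loader",
    "password": "change-me",
    "maxError": 10      # passed to the Redshift MAXERROR load option
  },

  "steps": []           # e.g. ["analyze", "vacuum"]; empty for no extra steps
}
```

Optional blocks such as jsonpaths, storage.jdbc, and monitoring are omitted here; under the same nesting rule, monitoring.statsd.hostname, for example, would live at "monitoring": { "statsd": { "hostname": ... } }.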