
RDB Loader configuration reference

The configuration reference on this page is written for RDB Loader 4.0.0.

The configuration reference pages for previous versions can be found here.

An example of the minimal required config for the Redshift loader can be found here and a more detailed one here.

An example of the minimal required config for the Snowflake loader can be found here and a more detailed one here.

An example of the minimal required config for the Databricks loader can be found here and a more detailed one here.

All loader applications use a common module for core functionality, so only the storage sections differ in their configs.

This is a complete list of the options that can be configured:

Redshift Loader storage section

type: Optional. The only valid value is the default: redshift.
host: Required. Host name of the Redshift cluster.
port: Required. Port of the Redshift cluster.
database: Required. Redshift database which the data will be loaded to.
roleArn: Required. AWS role ARN allowing Redshift to load data from S3.
schema: Required. Redshift schema name, eg “atomic”.
username: Required. DB user with permissions to load data.
password: Required. Password of the DB user.
maxError: Optional. Configures the Redshift MAXERROR load option. The default is 10.
jdbc.*: Optional. Custom JDBC configuration. The default value is {"ssl": true}.
jdbc.BlockingRowsMode: Optional. Refer to the Redshift JDBC driver reference.
jdbc.DisableIsValidQuery: Optional. Refer to the Redshift JDBC driver reference.
jdbc.DSILogLevel: Optional. Refer to the Redshift JDBC driver reference.
jdbc.FilterLevel: Optional. Refer to the Redshift JDBC driver reference.
jdbc.loginTimeout: Optional. Refer to the Redshift JDBC driver reference.
jdbc.loglevel: Optional. Refer to the Redshift JDBC driver reference.
jdbc.socketTimeout: Optional. Refer to the Redshift JDBC driver reference.
jdbc.ssl: Optional. Refer to the Redshift JDBC driver reference.
jdbc.sslMode: Optional. Refer to the Redshift JDBC driver reference.
jdbc.sslRootCert: Optional. Refer to the Redshift JDBC driver reference.
jdbc.tcpKeepAlive: Optional. Refer to the Redshift JDBC driver reference.
jdbc.TCPKeepAliveMinutes: Optional. Refer to the Redshift JDBC driver reference.
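Putting the required Redshift options together, a minimal storage section might look like the following sketch in HOCON (the format the loader configs use). All values here are illustrative placeholders; substitute your own cluster details and credentials:

```hocon
"storage": {
  # Optional; "redshift" is the only valid value and the default
  "type": "redshift",

  # Connection details for your own Redshift cluster (placeholders)
  "host": "example.redshift.amazonaws.com",
  "port": 5439,
  "database": "snowplow",
  "schema": "atomic",

  # Placeholder role ARN allowing Redshift to load data from S3
  "roleArn": "arn:aws:iam::123456789012:role/RedshiftLoadRole",

  # Placeholder credentials for a DB user with load permissions
  "username": "loader",
  "password": "Supersecret1",

  # Optional settings shown with their documented defaults
  "maxError": 10,
  "jdbc": { "ssl": true }
}
```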

Snowflake Loader storage section

type: Optional. The only valid value is the default: snowflake.
snowflakeRegion: Required. AWS region used by Snowflake to access its endpoint.
username: Required. Snowflake user with the necessary role granted to load data.
role: Optional. Snowflake role with permission to load data. If it is not provided, the default role in Snowflake will be used.
password: Required. Password of the Snowflake user. Can be plain text, or read from the EC2 parameter store (see below).
password.ec2ParameterStore.parameterName: Alternative way of passing in the user password.
account: Required. Target Snowflake account.
warehouse: Required. Snowflake warehouse on which the SQL statements submitted by the Snowflake Loader will run.
database: Required. Snowflake database which the data will be loaded to.
schema: Required. Target schema.
transformedStage: Required. Snowflake stage for transformed events.
folderMonitoringStage: Required if the monitoring.folders section is configured. Snowflake stage used to load folder monitoring entries into a temporary Snowflake table.
appName: Optional. Name passed as the ‘application’ property when creating the Snowflake connection. The default is Snowplow_OSS.
maxError: Optional. A table copy statement will skip an input file when the number of errors in it exceeds the specified number. This setting is used during initial loading and can therefore only filter out invalid JSONs (an impossible situation when used with the Transformer).
jdbcHost: Optional. Host for the JDBC driver that takes priority over automatically derived hosts. If it is not given, the host is derived automatically from the given snowflakeRegion.
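A minimal Snowflake storage section covering the required options might look like this sketch; every value is an illustrative placeholder, and the password here is shown using the EC2 parameter store alternative rather than plain text:

```hocon
"storage": {
  # Optional; "snowflake" is the only valid value and the default
  "type": "snowflake",

  # Placeholder account and region details
  "snowflakeRegion": "us-west-2",
  "account": "acme",

  # Placeholder user; "role" is optional and falls back to the
  # user's default Snowflake role if omitted
  "username": "loader",
  "role": "loader_role",

  # Password read from a hypothetical EC2 parameter store entry
  "password": {
    "ec2ParameterStore": { "parameterName": "snowplow.snowflake.password" }
  },

  # Placeholder warehouse, database, schema and stage names
  "warehouse": "snowplow_wh",
  "database": "snowplow",
  "schema": "atomic",
  "transformedStage": "snowplow_stage"
}
```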

Databricks Loader storage section

type: Optional. The only valid value is the default: databricks.
host: Required. Hostname of the Databricks cluster.
password: Required. Databricks access token. Can be plain text, or read from the EC2 parameter store (see below).
password.ec2ParameterStore.parameterName: Alternative way of passing in the access token.
schema: Required. Target schema.
port: Required. Port of the Databricks cluster.
httpPath: Required. HTTP path of the Databricks cluster. Get it from the JDBC connection details after the cluster has been created.
catalog: Optional. Databricks Unity Catalog name. The default is hive_metastore.
userAgent: Optional. User agent name for the Databricks connection. The default is snowplow-rdbloader-oss.
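The Databricks options above can be assembled into a storage section like the following sketch. The host, HTTP path, and parameter store name are placeholders; the optional fields are shown with their documented defaults:

```hocon
"storage": {
  # Optional; "databricks" is the only valid value and the default
  "type": "databricks",

  # Placeholder cluster details; take httpPath from the cluster's
  # JDBC connection details
  "host": "abc-123.cloud.databricks.com",
  "port": 443,
  "httpPath": "/sql/protocolv1/o/0/placeholder-path",

  # Access token read from a hypothetical EC2 parameter store entry
  "password": {
    "ec2ParameterStore": { "parameterName": "snowplow.databricks.token" }
  },

  "schema": "atomic",

  # Optional settings shown with their documented defaults
  "catalog": "hive_metastore",
  "userAgent": "snowplow-rdbloader-oss"
}
```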

Common loader settings

region: Optional if it can be resolved with the AWS region provider chain. AWS region of the S3 bucket.
messageQueue: Required. The name of the SQS queue used by the transformer and loader to communicate.
jsonpaths: Optional. An S3 URI that holds JSONPath files.
schedules.*: Optional. Periodic schedules to stop loading, eg for a Redshift maintenance window.
schedules.noOperation.[*]: Required if the schedules section is configured. Array of objects specifying no-operation windows.
schedules.noOperation.[*].name: Human-readable name of the no-op window.
schedules.noOperation.[*].when: Cron expression with second granularity.
schedules.noOperation.[*].duration: For how long the loader should be paused.
retryQueue.*: Optional. Additional backlog of recently failed folders that can be retried automatically. The retry queue saves a failed folder and then re-reads the info from the shredding_complete S3 file. (Despite the legacy name of the message, which is required for backward compatibility, this also works with wide row format data.)
retryQueue.period: Required if the retryQueue section is configured. How often a batch of failed folders should be pulled into the discovery queue.
retryQueue.size: Required if the retryQueue section is configured. How many failures should be kept in memory. After the limit is reached, new failures are dropped.
retryQueue.maxAttempts: Required if the retryQueue section is configured. How many attempts to make for each folder. After the limit is reached, new failures are dropped.
retryQueue.interval: Required if the retryQueue section is configured. Artificial pause after each failed folder is added to the queue.
retries.*: Optional. Unlike retryQueue, these retries happen immediately, without proceeding to another message.
retries.backoff: Required if the retries section is configured. Starting backoff period, eg ‘30 seconds’.
retries.strategy: The only possible value is EXPONENTIAL.
retries.attempts: Optional. How many attempts to make before sending the message into the retry queue. If missing, cumulativeBound will be used.
retries.cumulativeBound: Optional. When the backoff reaches this delay, eg ‘1 hour’, the loader will stop retrying. If neither this nor attempts is set, the loader will retry indefinitely.
timeouts.loading: Optional. How long, eg ‘1 hour’, COPY statement execution can take before Redshift is considered unhealthy. If there is no progress (ie, moving to a different subfolder) within this period, the loader will abort the transaction.
timeouts.nonLoading: Optional. How long, eg ’10 mins’, non-loading steps such as ALTER TABLE can take before Redshift is considered unhealthy.
timeouts.sqsVisibility: Optional. The time window in which a message must be acknowledged, otherwise it is considered abandoned. If a message has been pulled but not acked, the time before it becomes available to consumers again is equal to this value, eg ‘5 mins’. As a consequence, if the loader fails while processing a message, the next message it receives from the queue will be delayed by this amount.
readyCheck.*: Optional. Check the target destination to make sure it is ready.
readyCheck.backoff: Optional. Starting backoff period. The default is 15 seconds.
readyCheck.strategy: Optional. Backoff strategy used during retry. The possible values are JITTER, CONSTANT, EXPONENTIAL, FIBONACCI. The default is CONSTANT.
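The common settings above might be combined like this; queue names, durations, and the cron expression are illustrative placeholders, not recommended values:

```hocon
# Placeholder region and SQS queue shared with the transformer
"region": "eu-central-1",
"messageQueue": "rdb-loader-queue",

# Example no-op window: pause loading daily at 12:00 UTC for 1 hour
# (cron expression has second granularity)
"schedules": {
  "noOperation": [
    {
      "name": "Maintenance window",
      "when": "0 0 12 * * ?",
      "duration": "1 hour"
    }
  ]
},

# Backlog of recently failed folders to retry automatically
"retryQueue": {
  "period": "30 minutes",
  "size": 64,
  "maxAttempts": 3,
  "interval": "5 seconds"
},

# Immediate retries with exponential backoff, capped at 1 hour
"retries": {
  "backoff": "30 seconds",
  "strategy": "EXPONENTIAL",
  "attempts": 3,
  "cumulativeBound": "1 hour"
},

"timeouts": {
  "loading": "1 hour",
  "nonLoading": "10 minutes",
  "sqsVisibility": "5 minutes"
}
```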

Common monitoring settings

monitoring.webhook.endpoint: Optional. An HTTP endpoint where monitoring alerts should be sent.
monitoring.webhook.tags: Optional. Custom key-value pairs added to the monitoring webhooks, eg {“tag1”: “label1”}.
monitoring.snowplow.appId: Optional. When using Snowplow tracking, set this appId in the event.
monitoring.snowplow.collector: Optional. Set to a collector URL to turn on Snowplow tracking.
monitoring.sentry.dsn: Optional. For tracking runtime exceptions.
monitoring.metrics.*: Send metrics to a StatsD server or stdout.
monitoring.metrics.period: Optional. Period for metrics emitted periodically. The default is 5 minutes.
monitoring.metrics.statsd.*: Optional. For sending loading metrics (latency and event counts) to a StatsD server.
monitoring.metrics.statsd.hostname: Required if the monitoring.metrics.statsd section is configured. The host name of the StatsD server.
monitoring.metrics.statsd.port: Required if the monitoring.metrics.statsd section is configured. Port of the StatsD server.
monitoring.metrics.statsd.tags: Optional. Tags used to annotate the StatsD metric with contextual information.
monitoring.metrics.statsd.prefix: Optional. Configures the prefix of StatsD metric names. The default is snowplow.rdbloader.
monitoring.metrics.stdout.*: Optional. For sending metrics to stdout.
monitoring.metrics.stdout.prefix: Optional. Overrides the default metric prefix.
monitoring.folders.*: Optional. Configuration for periodic checks for unloaded / corrupted folders.
monitoring.folders.staging: Required if the monitoring.folders section is configured. Path where the loader can store auxiliary logs for folder monitoring. The loader should be able to write here, and the storage target should be able to load from here.
monitoring.folders.period: Required if the monitoring.folders section is configured. How often to check for unloaded / corrupted folders.
monitoring.folders.since: Optional. Specifies from when folder monitoring should start monitoring. Note that this is a duration, eg 7 days, relative to when the loader is launched.
monitoring.folders.until: Optional. Specifies until when folder monitoring should monitor. Note that this is a duration, eg 7 days, relative to when the loader is launched.
monitoring.folders.transformerOutput: Required if the monitoring.folders section is configured. Path to the transformed archive.
monitoring.folders.failBeforeAlarm: Required if the monitoring.folders section is configured. How many times the check can fail before raising an alarm. Within this tolerance, failures are logged as WARNING instead.
monitoring.healthCheck.*: Optional. Periodic DB health check, raising a warning if the DB hasn’t responded to SELECT 1.
monitoring.healthCheck.frequency: Required if the monitoring.healthCheck section is configured. How often to run the periodic DB health check.
monitoring.healthCheck.timeout: Required if the monitoring.healthCheck section is configured. How long to wait for a health check response.
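A monitoring section that exercises most of these options could look like the following sketch; every endpoint, DSN, bucket path, and tag value is a made-up placeholder:

```hocon
"monitoring": {
  # Placeholder webhook for alerts, with custom tags
  "webhook": {
    "endpoint": "https://webhook.example.com",
    "tags": { "pipeline": "production" }
  },

  # Optional Snowplow tracking of the loader's own events
  "snowplow": {
    "appId": "rdb-loader",
    "collector": "collector.example.com"
  },

  # Placeholder Sentry DSN for runtime exceptions
  "sentry": { "dsn": "https://sentry.example.com/42" },

  # Metrics to a local StatsD server every 5 minutes (the default period)
  "metrics": {
    "period": "5 minutes",
    "statsd": {
      "hostname": "localhost",
      "port": 8125,
      "tags": { "app": "rdb-loader" }
    }
  },

  # Periodic check for unloaded / corrupted folders; paths are placeholders
  "folders": {
    "staging": "s3://acme-snowplow/loader/logs/",
    "period": "1 hour",
    "transformerOutput": "s3://acme-snowplow/transformed/",
    "failBeforeAlarm": 3
  },

  # Periodic SELECT 1 health check against the warehouse
  "healthCheck": {
    "frequency": "20 minutes",
    "timeout": "15 seconds"
  }
}
```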
