Relational Database Loader (RDB Loader) can refer to either a family of applications intended to load enriched data to Amazon Redshift or an application within this family, performing data discovery and executing SQL statements. This document refers to the application.
RDB Loader was previously known as StorageLoader JRuby app, which it is replaced in Snowplow R90 release.
The RDB Loader gets submitted to EMR via EmrEtlRunner after RDB Shredder Spark job. EmrEtlRunner passes to it configuration file (minus AWS credentials) and storage targets configuration.
The RDB Loader role in ETL process
The enriched files contain the tab-separated values contributing to atomic.events
and custom tables. The shredding process:
- reads Snowplow enriched events from enriched good files (produced and temporary stored in HDFS as a result of enrichment process);
- extracts any unstructured (self-describing) event JSONs and contexts JSONs found;
- validates that these JSONs conform to the corresponding schemas located in Iglu registry;
- adds metadata to these JSONs to track their origins;
- writes these JSONs out to nested folders on S3 dependent on their schema.
As a result the enriched good file is “shredded” into a few shredded good files (provided the event file contained data from at least one of the following: custom self-describing events, contexts, configurable enrichments):
- a TSV formatted file containing the data for
atomic.events
table; - possibly one or more JSON files related to custom user specific (self-describing) events extracted from
unstruct_event
field of the enriched good file; - possibly one or more JSON files related to custom contexts extracted from
contexts
filed of the enriched good file; - possibly one or more JSON files related to configurable enrichments (if any was enabled) extracted from
derived_contexts
field of the enriched good file.
Those files end up in S3 and are used to load the data into Redshift tables dedicated to each of the above files under the RDB Loader orchestration.
The whole process could be depicted with the following dataflow diagram.

Setup
RDB Loader itself does not require any configuration and will be submitted automatically by EmrEtlRunner for each configuration file in targets
directory. However, each particular storage target can requires specifi setup to be used with RDB Loader.
Usage
RDB Loader need to be used as an EMR step submitted by EmrEtlRunner, which exactly knows what arguments need to be passed.
In classic batch pipeline there’s no need to know about any RDB Loader internals, everything should be handled by EmrEtlRunner.
However in some cases it is possible to use it from local machine.
RDB Loader accepts following options:
--config <config.yml>
accepts full base64-encoded Snowplowconfig.yml
withou AWS credentials. AWS Credentials Chain used instead to get access. Required--target <target.json>
accepts full base64-encoded Snowplow Storage Target JSON configuration. RDB Loader can load only one target at time. Required--resolver <resolver.json>
accepts full base64-encoded Iglu Resolver JSON configuration. Used to validate Storage Target JSON config. Required--logkey <s3key>
accepts path to S3 key, where end output will be dumbed and later can be grabbed by EmrEtlRunner. Optional since 0.14.0--include <step>
list of optional steps to include--skip <step>
list of enabled by-default steps to skip--folder <s3folder>
accepts path to particular run folder on S3 to load--dry-run
makes RDB Loader to only print load SQL statements and not actually perform any DB IO
Available steps to skip:
analyze
– applicaple only to Redshift. Do not performANALYZE
after loadconsistency_check
– don’t wait until S3 becomes consistent. Can save up to 1 hour on loads with big shredded types cardinality, but increases possibility to fail to actual load.
Available steps to include:
vacuum
– applicaple only to Redshift. PerformVACUUM
after load
To manually pass --config
, --target
and --resolver
it is required to encode content of these files with Base64.
EmrEtlRunner cannot pass --dry-run
and --folder
. These options are exclusively for manual use.
Load process
RDB Loader internally run has four steps:
- Configuration parsing. Parse all arguments. If this step did not succeed – it won’t dump data to
logkey
path and EmrEtlRunner won’t be able to print it to stdout. Debug output however can be retrieved from EMRstdout
. - Discovering. Find run folders in
shredded.good
, discover shredded types and atomic events there. Only non-empty files are taken into account. If no files inatomic-events
found – RDB Loader will exit with error. - Loading. Connect to database and perform
COPY
,ANALYZE
andVACUUM
statements for atomic and shredded events. Insert entity intomanifest
table. - Log file dumping and monitoring.