Relational Database Loader (RDB Loader) can refer to either a family of applications intended to load enriched data to Amazon Redshift or an application within this family, performing data discovery and executing SQL statements. This document refers to the application.
The RDB Loader gets submitted to EMR via EmrEtlRunner after RDB Shredder Spark job. EmrEtlRunner passes to it configuration file (minus AWS credentials) and storage targets configuration.
The RDB Loader role in ETL process
The enriched files contain the tab-separated values contributing to
atomic.events and custom tables. The shredding process:
- reads Snowplow enriched events from enriched good files (produced and temporary stored in HDFS as a result of enrichment process);
- extracts any unstructured (self-describing) event JSONs and contexts JSONs found;
- validates that these JSONs conform to the corresponding schemas located in Iglu registry;
- adds metadata to these JSONs to track their origins;
- writes these JSONs out to nested folders on S3 dependent on their schema.
As a result the enriched good file is “shredded” into a few shredded good files (provided the event file contained data from at least one of the following: custom self-describing events, contexts, configurable enrichments):
- a TSV formatted file containing the data for
- possibly one or more JSON files related to custom user specific (self-describing) events extracted from
unstruct_eventfield of the enriched good file;
- possibly one or more JSON files related to custom contexts extracted from
contextsfiled of the enriched good file;
- possibly one or more JSON files related to configurable enrichments (if any was enabled) extracted from
derived_contextsfield of the enriched good file.
Those files end up in S3 and are used to load the data into Redshift tables dedicated to each of the above files under the RDB Loader orchestration.
The whole process could be depicted with the following dataflow diagram.
RDB Loader itself does not require any configuration and will be submitted automatically by EmrEtlRunner for each configuration file in
targets directory. However, each particular storage target can requires specifi setup to be used with RDB Loader. e.g. Redshift
RDB Loader need to be used as an EMR step submitted by EmrEtlRunner, which exactly knows what arguments need to be passed.
In classic batch pipeline there’s no need to know about any RDB Loader internals, everything should be handled by EmrEtlRunner.
However in some cases it is possible to use it from local machine.
RDB Loader accepts following options:
--config <config.yml>accepts full base64-encoded Snowplow
config.ymlwithou AWS credentials. AWS Credentials Chain used instead to get access. Required
--target <target.json>accepts full base64-encoded Snowplow Storage Target JSON configuration. RDB Loader can load only one target at time. Required
--resolver <resolver.json>accepts full base64-encoded Iglu Resolver JSON configuration. Used to validate Storage Target JSON config. Required
--logkey <s3key>accepts path to S3 key, where end output will be dumbed and later can be grabbed by EmrEtlRunner. Optional since 0.14.0
--include <step>list of optional steps to include
--skip <step>list of enabled by-default steps to skip
--folder <s3folder>accepts path to particular run folder on S3 to load
--dry-runmakes RDB Loader to only print load SQL statements and not actually perform any DB IO
Available steps to skip:
analyze– applicaple only to Redshift. Do not perform
consistency_check– don’t wait until S3 becomes consistent. Can save up to 1 hour on loads with big shredded types cardinality, but increases possibility to fail to actual load.
Available steps to include:
vacuum– applicaple only to Redshift. Perform
To manually pass
--resolver it is required to encode content of these files with Base64.
EmrEtlRunner cannot pass
--folder. These options are exclusively for manual use.
RDB Loader internally run has four steps:
- Configuration parsing. Parse all arguments. If this step did not succeed – it won’t dump data to
logkeypath and EmrEtlRunner won’t be able to print it to stdout. Debug output however can be retrieved from EMR
- Discovering. Find run folders in
shredded.good, discover shredded types and atomic events there. Only non-empty files are taken into account. If no files in
atomic-eventsfound – RDB Loader will exit with error.
- Loading. Connect to database and perform
VACUUMstatements for atomic and shredded events. Insert entity into
- Log file dumping and monitoring.