Configuration reference

This is a complete list of the options that can be configured in the Snowplow BigQuery Loader HOCON config file. The example configs on GitHub show how to prepare an input file.

Required options

projectId: Required. The GCP project in which all required Pub/Sub, BigQuery and GCS resources are hosted, e.g. my-project.
loader.input.subscription: Required. Enriched events subscription consumed by Loader and StreamLoader, e.g. enriched-sub.
loader.output.good.datasetId: Required. The dataset to which the events table belongs, e.g. snowplow.
loader.output.good.tableId: Required. The name of the events table, e.g. events.
loader.output.bad.topic: Required. The name of the topic where bad rows will be written, e.g. bad-topic.
loader.output.types.topic: Required. The name of the topic where observed types will be written, e.g. types-topic.
loader.output.failedInserts.topic: Required. The name of the topic where failed inserts will be written, e.g. failed-inserts-topic.
mutator.input.subscription: Required. A subscription on the loader.output.types.topic, e.g. types-sub.
mutator.output.good.*: Required. Equivalent to loader.output.good.*. Can be specified in detail or as ${loader.output.good}.
repeater.input.subscription: Required. Failed inserts subscription consumed by Repeater. Must be attached to the loader.output.failedInserts.topic, e.g. failed-inserts-sub.
repeater.output.good.*: Required. Equivalent to loader.output.good.*. Can be specified in detail or as ${loader.output.good}.
repeater.output.deadLetters.bucket: Required. Failed inserts that repeatedly fail to be inserted into BigQuery are stored on GCS in this bucket, e.g. gs://dead-letter-bucket.
monitoring.*: Required. Can be left blank, i.e. {}, to disable this functionality. See below for details.
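
To show how the dotted option names above might map onto a nested HOCON file, here is a minimal sketch that uses only the example values from this list. The nesting is inferred from the dotted paths and is an assumption, not a copy of an official example config. Depending on the version, the fixed type hints described under "Config parser hints" may also need to be present.

```
{
  "projectId": "my-project"

  "loader": {
    "input": {
      "subscription": "enriched-sub"
    }

    "output": {
      "good": {
        "datasetId": "snowplow"
        "tableId": "events"
      }
      "bad": {
        "topic": "bad-topic"
      }
      "types": {
        "topic": "types-topic"
      }
      "failedInserts": {
        "topic": "failed-inserts-topic"
      }
    }
  }

  "mutator": {
    "input": {
      # Must be a subscription on loader.output.types.topic
      "subscription": "types-sub"
    }
    "output": {
      # Reuse the loader's good output via a HOCON substitution
      "good": ${loader.output.good}
    }
  }

  "repeater": {
    "input": {
      # Must be attached to loader.output.failedInserts.topic
      "subscription": "failed-inserts-sub"
    }
    "output": {
      "good": ${loader.output.good}
      "deadLetters": {
        "bucket": "gs://dead-letter-bucket"
      }
    }
  }

  # Left blank to disable monitoring
  "monitoring": {}
}
```

The ${loader.output.good} substitution keeps the dataset and table defined in one place, so Mutator and Repeater cannot drift from the Loader's target table.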

Monitoring options

monitoring.statsd.*: Optional. If set up, metrics will be emitted from StreamLoader and Repeater using the StatsD protocol.
monitoring.statsd.hostname: Optional, e.g. statsd.acme.gl.
monitoring.statsd.port: Optional, e.g. 1024.
monitoring.statsd.tags: Optional. You can use env vars, e.g. {"worker": ${HOST}}.
monitoring.statsd.period: Optional, e.g. 10 sec.
monitoring.statsd.prefix: Optional, e.g. snowplow.monitoring.
monitoring.dropwizard.*: Optional. If set up, metrics will be emitted from Loader using the Dropwizard protocol.
monitoring.dropwizard.period: Optional, e.g. 10000 ms.
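
As a rough illustration, the monitoring options above could be combined into a block like the following. The values are the examples from this list. The nested layout and the quoting of the period values are assumptions.

```
"monitoring": {
  # Metrics from StreamLoader and Repeater
  "statsd": {
    "hostname": "statsd.acme.gl"
    "port": 1024
    # ${HOST} is resolved from the environment; resolution fails if it is unset
    "tags": { "worker": ${HOST} }
    "period": "10 sec"
    "prefix": "snowplow.monitoring"
  }

  # Metrics from Loader
  "dropwizard": {
    "period": "10000 ms"
  }
}
```

If the HOST environment variable may be missing, HOCON's optional form ${?HOST} omits the key instead of failing. Whether the applications accept a tags object without that key is not covered here.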

Advanced options

The defaults should suit the overwhelming majority of deployments, and you should rarely, if ever, need to change them.

loader.loadMode.*: BigQuery supports two loading APIs: the Streaming inserts API and the Load jobs API. This setting configures which one will be used. StreamLoader only supports the Streaming inserts API. Loader supports both, but using the Load jobs API has experimental status.
loader.loadMode.type: Defaults to StreamingInserts. The only other possible option is FileLoads.
loader.loadMode.retry: Defaults to false. Specifies whether failed inserts should be retried indefinitely or sent straight to the failedInserts topic. When set to true, a row that cannot be inserted will be retried indefinitely, which can throttle the whole load. In that case a restart might be required. This setting is only supported by the Streaming inserts API.
loader.loadMode.frequency: Defaults to null. Specifies how often the load job should be performed, in seconds. Unlike the near-real-time Streaming inserts API, load jobs are more batch-oriented. This setting is only supported by the Load jobs API. An example value is 60000.
loader.consumerSettings.*: Settings for the PubsubGoogleConsumer object in the StreamLoader code. For more details see here.
loader.sinkSettings.good.*: Settings for the good sink value in the StreamLoader code. For more details see here. For the recommended number of records in each request, see here. For the HTTP request size limit, see here.
loader.sinkSettings.bad.*: Settings for the bad sink value in the StreamLoader code. For more details see here.
loader.sinkSettings.types.*: Settings for the type sink value in the StreamLoader code. For more details see here.
loader.sinkSettings.failedInserts.*: Settings for the failed insert sink value in the StreamLoader code. For more details see here.
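
The loader.loadMode.* settings described above might be written as in the sketch below. It shows the documented defaults for the Streaming inserts API, with the experimental Load jobs variant commented out as an alternative. The nested layout is an assumption based on the dotted option names, and only one of the two variants should be active at a time.

```
"loader": {
  # Default: Streaming inserts API (the only mode StreamLoader supports)
  "loadMode": {
    "type": "StreamingInserts"
    # false sends failed inserts straight to the failedInserts topic
    "retry": false
  }

  # Alternative for Loader only: Load jobs API (experimental)
  # "loadMode": {
  #   "type": "FileLoads"
  #   "frequency": 60000
  # }
}
```

The retry key applies only to StreamingInserts and frequency only to FileLoads, so the two are kept in separate variants rather than mixed in one block.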

Config parser hints

These settings exist only as hints to the config parsing library we use, so that the configuration can be represented as Scala code. Each has only one possible value and should never be changed.

loader.input.type: PubSub
loader.output.good.type: BigQuery
loader.output.bad.type: PubSub
loader.output.types.type: PubSub
loader.output.failedInserts.type: PubSub
mutator.input.type: PubSub
repeater.input.type: PubSub
repeater.output.deadLetters.type: Gcs
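
For orientation, the fragment below sketches where two of these hints would sit next to the user-supplied options. The nesting is an assumption derived from the dotted names, and the subscription, dataset and table values are the examples used earlier on this page.

```
"loader": {
  "input": {
    "type": "PubSub"        # parser hint, fixed value
    "subscription": "enriched-sub"
  }

  "output": {
    "good": {
      "type": "BigQuery"    # parser hint, fixed value
      "datasetId": "snowplow"
      "tableId": "events"
    }
  }
}
```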