Pipeline Components and Applications

1.0.x upgrade guide

Configuration

The only breaking change from the 0.6.x series is the new format of the configuration file. The file used to be a self-describing JSON, but is now HOCON. Additionally, some app-specific command-line arguments have been incorporated into the config, such as Repeater’s --failedInsertsSub option. For more details, see the setup guide and configuration reference.

Using Repeater as an example, if your configuration for 0.6.x looked like this:

{ "schema": "iglu:com.snowplowanalytics.snowplow.storage/bigquery_config/jsonschema/1-0-0", "data": { "name": "Alpha BigQuery test", "id": "31b1559d-d319-4023-aaae-97698238d808", "projectId": "com-acme", "datasetId": "snowplow", "tableId": "events", "input": "enriched-sub", "typesTopic": "types-topic", "typesSubscription": "types-sub", "badRows": "bad-topic", "failedInserts": "failed-inserts-topic", "load": { "mode": "STREAMING_INSERTS", "retry": false }, "purpose": "ENRICHED_EVENTS" } }

it will now look like this:

{ "projectId": "com-acme" "loader": { "input": { "subscription": "enriched-sub" } "output": { "good": { "datasetId": "snowplow" "tableId": "events" } "bad": { "topic": "bad-topic" } "types": { "topic": "types-topic" } "failedInserts": { "topic": "failed-inserts-topic" } } } "mutator": { "input": { "subscription": "types-sub" } "output": { "good": ${loader.output.good} # will be automatically inferred } } "repeater": { "input": { "subscription": "failed-inserts-sub" } "output": { "good": ${loader.output.good} # will be automatically inferred "deadLetters": { "bucket": "gs://dead-letter-bucket" } } } "monitoring": {} # disabled }

And instead of running it like this:

$ ./snowplow-bigquery-repeater \
    --config=$CONFIG \
    --resolver=$RESOLVER \
    --failedInsertsSub="failed-inserts-sub" \
    --deadEndBucket="gs://dead-letter-bucket" \
    --desperatesBufferSize=20 \
    --desperatesWindow=20 \
    --backoffPeriod=900 \
    --verbose

you will run it like this:

$ ./snowplow-bigquery-repeater \
    --config=$CONFIG \
    --resolver=$RESOLVER \
    --bufferSize=20 \
    --timeout=20 \
    --backoffPeriod=900 \
    --verbose

New events table field

The first time you deploy Mutator 1.0.0, it will add a new column to your events table: load_tstamp. This column records the exact moment when the row was inserted into BigQuery. It shows you when events arrived in the warehouse, which makes it possible to process newly arrived data incrementally in your downstream data modelling.
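
For example, a downstream model can select only the rows that arrived since its last run. The query below is a minimal sketch, assuming the com-acme project and the snowplow dataset and events table from the example configuration above; it counts events loaded in the last hour using the bq CLI.

$ bq --project_id=com-acme query --use_legacy_sql=false \
    'SELECT COUNT(*) AS newly_loaded_events
     FROM `com-acme.snowplow.events`
     -- load_tstamp is populated by the loader at insert time
     WHERE load_tstamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)'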

Depending on your traffic volume and pattern, there might be a short time period in which the loader app cannot write to BigQuery because the new column hasn’t propagated and is not yet visible to all workers. For that reason, we recommend that you upgrade Mutator first.

Migrating to StreamLoader

StreamLoader has been built as a standalone application. It replaces the Apache Beam implementation and no longer requires you to use Dataflow.
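
As a sketch of what this looks like in practice, you run StreamLoader directly on your own infrastructure instead of submitting a Dataflow job. The example below assumes the executable is named snowplow-bigquery-streamloader and accepts the same --config and --resolver options as the Repeater examples above; check the setup guide for the exact invocation.

$ ./snowplow-bigquery-streamloader \
    --config=$CONFIG \
    --resolver=$RESOLVER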

Depending on your data volume and traffic patterns, this might lead to significant cost reductions. However, by migrating away from Dataflow, you no longer benefit from its exactly-once processing guarantees. As such, there could be a slight increase in the number of duplicate events loaded into BigQuery.

Duplicate events are generally to be expected in a Snowplow pipeline, which provides an at-least-once guarantee.

In our tests, duplicates arose mainly during extreme autoscaling of the loader, e.g. when the pipeline experienced a sudden, extreme spike in events. Outside of such autoscaling events, we found the number of duplicate rows to be very low, although this depends on the type of worker infrastructure you use.
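
If duplicates matter for your use case, you can deduplicate downstream. The query below is an illustrative sketch (not part of the loader), again assuming the com-acme.snowplow.events table from the example configuration above; it keeps a single row per event_id.

$ bq --project_id=com-acme query --use_legacy_sql=false \
    'SELECT * EXCEPT (row_num)
     FROM (
       -- keep the earliest-loaded copy of each event_id
       SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY load_tstamp) AS row_num
       FROM `com-acme.snowplow.events`
     )
     WHERE row_num = 1'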