Getting started on Snowplow Open Source

Setup Validation and Enrich

Stream Enrich is an application which:

  1. Reads raw Snowplow events off a stream populated by the Scala Stream Collector
  2. Validates each raw event
  3. Enriches each event (e.g. infers the location of the user from their IP address)
  4. Writes the enriched Snowplow event to another stream
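
The four steps above can be pictured with a small Python sketch. This is an illustration only, not Stream Enrich's actual code: the event shape (a plain dict), the required-field check, the `tag_enriched` enrichment and the separate list for rejected events are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Event = dict  # hypothetical event shape; real Snowplow events are richer


def validate(raw: Event) -> Optional[Event]:
    """Return the event if it carries the (assumed) required fields, else None."""
    required = ("event_id", "user_ipaddress")
    return raw if all(key in raw for key in required) else None


def enrich(event: Event, enrichments: Iterable[Callable[[Event], Event]]) -> Event:
    """Apply each enrichment function in the fixed order given."""
    for fn in enrichments:
        event = fn(event)
    return event


def run(raw_stream: Iterable[Event],
        enrichments: List[Callable[[Event], Event]]) -> Tuple[List[Event], List[Event]]:
    """Read raw events, validate, enrich, and collect the results."""
    enriched, rejected = [], []
    for raw in raw_stream:              # 1. read from the raw stream
        valid = validate(raw)           # 2. validate
        if valid is None:
            rejected.append(raw)
            continue
        # 3. enrich a copy, 4. write it to the output "stream"
        enriched.append(enrich(dict(valid), enrichments))
    return enriched, rejected


def tag_enriched(event: Event) -> Event:
    """Hypothetical enrichment that only marks the event as processed."""
    event["enriched"] = True
    return event


raw_events = [
    {"event_id": "e1", "user_ipaddress": "203.0.113.42"},
    {"event_id": "e2"},  # missing user_ipaddress, so it fails validation
]
enriched, rejected = run(raw_events, [tag_enriched])
```

In the real application the input and output are streams rather than in-memory lists.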

This guide covers how to set up Stream Enrich.

Install, run and configure Stream Enrich

First, install the Stream Enrich application and get it running. See the Setup documentation for how to do this.

Add any desired Enrichments

Snowplow offers a large number of enrichments that can be used to enhance your events. As of Enrich 1.x.x, the order of enrichments is hard-coded and cannot be changed. The table below lists the available enrichments in the order in which they are executed.

| Enrichment | Description |
|---|---|
| IAB | Use the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a user or a robot/spider, based on its IP address and user agent. |
| User Agent utils | Deprecated; please consider switching to YAUAA. |
| UA parser | Parse the user agent and attach detailed user agent information to each event. |
| Currency conversion | Convert the values of all transactions to a specified base currency using Open Exchange Rates. To use it, you need an Open Exchange Rates account. |
| Referer parser | Extract attribution data from referer URLs. |
| Campaign attribution | Choose which querystring parameters will be used to generate the marketing campaign fields. If you do not enable the campaign_attribution enrichment, those fields will not be populated. |
| Event fingerprint | Generate a fingerprint for the event using a hash of client-set fields. Helpful for deduplicating events. |
| Cookie extractor | Specify cookies that you want to extract if found. |
| HTTP Header extractor | Specify headers that you want to extract via a regex pattern. Each extracted header that is found is attached to your event. |
| Weather Enrichment | Pull weather information for the location where the event took place (not working as of Enrich 1.4.x). |
| YAUAA | Parse and analyze the user agent string of an event and extract as many relevant attributes as possible using the YAUAA API. |
| IP lookups | Look up useful data based on a user’s IP address using the MaxMind database. |
| JavaScript script | Write a JavaScript function which is executed for each event. |
| SQL Query | Perform dimension widening on a Snowplow event via your own internal relational database. |
| API Request | Perform dimension widening on a Snowplow event via your own or a third-party proprietary http(s) API. |
| IP anonymization | Anonymize the IP addresses found in the user_ipaddress field by replacing a certain number of octets with “x”s. |
| PII Pseudonymization | Better protect the privacy rights of data subjects by pseudonymizing collected data. |
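
To make the IP anonymization row concrete, here is a minimal Python sketch of replacing octets with "x". It is not the enrichment's own code: the function name `anonymize_ip` is hypothetical, it handles IPv4 only, and it assumes the octets are replaced starting from the right-hand end of the address.

```python
def anonymize_ip(ip: str, octets: int) -> str:
    """Replace the last `octets` octets of an IPv4 address with "x".

    Assumption: anonymization proceeds from the rightmost octet.
    """
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    if not 0 <= octets <= 4:
        raise ValueError("octets must be between 0 and 4")
    kept = parts[: 4 - octets]
    return ".".join(kept + ["x"] * octets)


print(anonymize_ip("203.0.113.42", 2))  # prints 203.0.x.x
```

The more octets are replaced, the less precisely the stored address identifies a user.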

Each enrichment is enabled through its own JSON configuration file (one per enrichment). Load these files into DynamoDB, then pass their DynamoDB location to Stream Enrich at startup using the --enrichments argument, as documented.
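
As a rough idea of what one of these per-enrichment JSON files looks like, the Python sketch below builds a configuration for the IP anonymization enrichment and prints it as JSON. The helper `enrichment_config` is hypothetical, and the schema URI, version and the `anonOctets` parameter name are assumptions that should be checked against the documentation for the specific enrichment before use.

```python
import json


def enrichment_config(schema: str, name: str, vendor: str,
                      enabled: bool, parameters: dict) -> dict:
    """Wrap enrichment settings in a self-describing JSON structure."""
    return {
        "schema": schema,
        "data": {
            "name": name,
            "vendor": vendor,
            "enabled": enabled,
            "parameters": parameters,
        },
    }


# Assumed schema URI and parameter name; verify against the enrichment docs.
anon_ip = enrichment_config(
    schema="iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-0",
    name="anon_ip",
    vendor="com.snowplowanalytics.snowplow",
    enabled=True,
    parameters={"anonOctets": 2},
)

config_json = json.dumps(anon_ip, indent=2)
print(config_json)
```

Each enrichment you enable would get its own file of this kind.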

Sink the enriched data to S3 from Kinesis

Now that you have Stream Enrich running, you should have validated, enriched data being output into a Kinesis stream.

The next step is to set up the Snowplow S3 loader to sink this data to S3.

Instructions on how to load the data into other data stores, e.g. Redshift, SnowflakeDB and Elastic, can be found under Destinations.