Stream Enrich is an application which:
- Reads raw Snowplow events off a stream populated by the Scala Stream Collector
- Validates each raw event
- Enriches each event (e.g. infers the location of the user from their IP address)
- Writes the enriched Snowplow event to another stream
This guide covers how to set up Stream Enrich.
Install, run and configure Stream Enrich
First, install the Stream Enrich application and get it running. See the Setup documentation for how to do this.
Add any desired Enrichments
Snowplow offers a large number of enrichments that can be used to enhance your events. As of Enrich 1.x.x, the order of enrichments is hard-coded and cannot be changed. The table below lists the available enrichments in the order in which they are executed.
|IAB||Use the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a user or a robot/spider based on its IP address and user agent.|
|User Agent utils||Deprecated – please consider switching to YAUAA.|
|UA parser||Parse the user agent and attach detailed user agent information to each event.|
|Currency conversion||Convert the values of all transactions to a specified base currency using Open Exchange Rates. To use it, you need an Open Exchange Rates account.|
|Referer parser||Extracts attribution data from referer URLs.|
|Campaign attribution||Choose which querystring parameters will be used to generate the marketing campaign fields. If you do not enable the campaign_attribution enrichment, those fields will not be populated.|
|Event fingerprint||Generate a fingerprint for the event using a hash of client-set fields. Helpful for deduplicating events.|
|Cookie extractor||Specify cookies that you want to extract if found.|
|HTTP Header extractor||Specify headers that you want to extract via a regex pattern. Each matching header that is found will be attached to your event.|
|Weather Enrichment||Pull weather information for the location where the event took place (not working as of Enrich 1.4.x).|
|YAUAA||Parse and analyze the user agent string of an event and extract as many relevant attributes as possible using the YAUAA API.|
|IP lookups||Look up useful data based on a user’s IP address using the MaxMind database.|
|SQL Query||Perform dimension widening on a Snowplow event via your own internal relational database.|
|API Request||Perform dimension widening on a Snowplow event via your own or a third-party proprietary HTTP(S) API.|
|IP anonymization||Anonymize the IP addresses found in the user_ipaddress field by replacing a certain number of octets with “x”s.|
|PII Pseudonymization||Better protect the privacy rights of data subjects by pseudonymizing collected data.|
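
To make the IP anonymization row above more concrete, here is a minimal Python sketch of the idea of replacing octets with "x". It is not the enrichment's actual implementation: the function name `anonymize_ipv4` is hypothetical, it handles IPv4 only, and it assumes that the octets being replaced are the trailing ones.

```python
def anonymize_ipv4(ip: str, octets: int) -> str:
    """Replace the last `octets` octets of a dotted-quad IPv4 address with "x".

    Illustrative only: assumes trailing octets are the ones anonymized.
    """
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    if not 0 <= octets <= 4:
        raise ValueError("octets must be between 0 and 4")
    keep = 4 - octets  # number of leading octets left untouched
    return ".".join(parts[:keep] + ["x"] * octets)


# Example: anonymize the last two octets of a documentation-range address.
masked = anonymize_ipv4("203.0.113.42", 2)
print(masked)
```

Under these assumptions, anonymizing more octets trades geographic precision for stronger privacy, which is why the number of octets is a configurable setting rather than a fixed value.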
Each enrichment is enabled by writing a JSON config file (one per enrichment), loading these files into DynamoDB, and then passing the location of the configs in DynamoDB to Stream Enrich at startup using the --enrichments argument, as documented.
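
As a rough illustration of what one of these per-enrichment JSON files might look like, the Python sketch below builds a config for the IP anonymization enrichment and prints it. The helper `build_enrichment_config` is hypothetical, and the schema URI, the `anon_ip` name and the `anonOctets` parameter are assumptions that should be checked against the reference for the enrichment you are enabling. Loading the file into DynamoDB is not shown.

```python
import json


def build_enrichment_config(schema: str, name: str, vendor: str,
                            parameters: dict, enabled: bool = True) -> dict:
    """Wrap enrichment settings in a self-describing JSON envelope.

    Hypothetical helper: the envelope shape (schema + data) is an assumption
    based on how Snowplow self-describing JSONs are usually laid out.
    """
    return {
        "schema": schema,
        "data": {
            "name": name,
            "vendor": vendor,
            "enabled": enabled,
            "parameters": parameters,
        },
    }


# Assumed values: verify the schema URI and parameter names before use.
anon_ip_config = build_enrichment_config(
    schema="iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-0",
    name="anon_ip",
    vendor="com.snowplowanalytics.snowplow",
    parameters={"anonOctets": 2},
)

# One JSON document per enrichment; this string is what would be stored.
config_json = json.dumps(anon_ip_config, indent=2)
print(config_json)
```

Generating the files from a small helper like this keeps the envelope consistent across enrichments, so only the schema and parameters differ from one file to the next.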
Sink the enriched data to S3 from Kinesis
Now that you have Stream Enrich running, you should have validated, enriched data being output into a Kinesis stream.
The next step is to set up the Snowplow S3 loader to sink this data to S3.
Instructions on how to load the data into other data stores, e.g. Redshift, SnowflakeDB and Elastic, can be found under Destinations.