In order to analyze Snowplow data, it is important to understand how it is structured. We have tried to make the structure of Snowplow data as simple, logical, and easy-to-query as possible.
- Each line represents one event. Each line in the Snowplow events table represents a single event, be that a page view, add to basket, play video, like etc.
- Structured data. Snowplow data is structured: individual fields are stored in their own columns, making writing sophisticated queries on the data easy, and making it straightforward for analysts to plugin any kind of analysis tool into their Snowplow data to compose and execute queries
- Extensible schema. Snowplow started life as a web analytics data warehousing platform, and has a basic schema suitable for performing web analytics, with a wide range of web-specific dimensions (related to page URLs, browsers, operating systems, devices, IP addresses, cookie IDs) and web-specfic events (page views, page pings, transactions). All of these fields can be found in the
atomic.eventstable, which is a “fat” (many columns) table. As Snowplow has evolved into a general purpose event analytics platform, we’ve enabled Snowplow users to define additional event types (we call these custom unstructured events) and define their own entities (we call these custom contexts) so that they can extend the schema to suit their own businesses. For Snowplow users running Amazon Redshift, each custom unstructured event and custom context will be stored in its own dedicated table, again with one line per event. These additional tables can be joined back to the core
atomic.eventstable, by joining on th e
root_idfield in the custom unstructured event / custom context table with the
- Single table. All the events are effectively stored in a single table, making running queries across the data very easy. Even if you’re running Snowplow with Redshift and have extended the schema as described above, you can still query the data as if it were in a single fat table. This is because:
- The joins from the additional tables to the core
atomic.eventstable are one-to-one
- The field joined on is the distribution key for both tables, so queries are as fast as if the data were in a single table
- The joins from the additional tables to the core
- Immutable log. The Snowplow data table is designed to be immutable: the data in each line should not change over time. Data points that we would expect to change over time (e.g. what cohort a particular user belongs to, how we classify a particular visitor) can be derived from Snowplow data. However, our recommendation is that these derived fields should be defined and calculated at analysis time, stored in a separate table and joined to the Snowplow events table when performing any analysis
2. Understanding the individual fields
2.1 Common fields (platform and event independent)
2.1.1 Application fields
The application ID is used to distinguish different applications that are being tracked by the same Snowplow stack, e.g.
The platform ID is used to distinguish the same app running on different platforms, e.g.
2.1.2 Date / time fields
|timestamp||Time stamp for the event recorded by the collector||Yes||‘2013-11-26 00:02:05’|
|timestamp||Timestamp event was recorded on the client device||No||‘2013-11-26 00:03:57.885’|
|timestamp||When the event was sent by the client device||No||‘2013-11-26 00:03:58.032’|
|timestamp||Timestamp event began ETL||No||‘2017-01-26 00:01:25.292’|
|text||Client operating system timezone||No||‘Europe/London’|
|timestamp||Timestamp making allowance for innaccurate device clock||No||‘2013-11-26 00:02:04’|
|timestamp||User-set “true timestamp” for the event||No||‘2013-11-26 00:02:04’|
2.1.3 Event / transaction fields
|text||The type of event recorded||Yes||‘page_view’|
|text||A UUID for each event||Yes||‘c6ef3124-b53a-4b13-a233-0088f79dcbcb’|
|int||Transaction ID set client-side, used to de-dupe records||No||421828|
|text||Hash client-set event fields||No||AADCE520E20C2899F4CED228A79A3083|
A complete list of event types is given here.
2.1.4 Snowplow version fields
|text||Collector version||Yes||‘clj-tomcat-0.1.0’, ‘cf’|
|text||JSON of tags for this ETL run||No||“[‘prod’]”|
Some Snowplow Trackers allow the user to name each specific Tracker instance.
name_tracker corresponds to this name, and can be used to distinguish which tracker generated which events.
|text||Unique ID set by business||No||‘email@example.com‘|
|text||User ID set by Snowplow using 1st party cookie||No||‘bc2e92ec6c204a14’|
|text||User ID set by Snowplow using 3rd party cookie||No||‘ecdff4d0-9175-40ac-a8bb-325c49733607’|
|text||User IP address||No||‘18.104.22.168’|
|int||A visit / session index||No||3|
|text||A visit / session identifier||No||‘c6ef3124-b53a-4b13-a233-0088f79dcbcb’|
domain_sessionidx is the number of the current user session. For example, an event occurring during a user’s first session would have
2.1.6 Device and operating system fields
|text||Type of device||No||‘Computer’|
|boolean||Is the device mobile?||No||1|
|int||Screen height in pixels||No||1024|
|int||Screen width in pixels||No||1900|
|text||Name of operating system||No||‘Android’|
|text||Operating system family||No||‘Linux’|
|text||Company responsible for OS||No||‘Apple’|
2.1.7 Location fields
|text||ISO 3166-1 code for the country the visitor is located in||No||‘GB’, ‘US’|
|text||ISO-3166-2 code for country region the visitor is in||No||‘I9’, ‘TX’|
|text||City the visitor is in||No||‘New York’, ‘London’|
|text||Postcode the visitor is in||No||‘94109’|
|text||Visitor location latitude||No||37.443604|
|text||Visitor location longitude||No||-122.4124|
|text||Visitor region name||No||‘Florida’|
|text||Visitor timezone name||No||‘Europe/London’|
|text||Visitor’s ISP||No||‘FDN Communications’|
|text||Organization associated with the visitor’s IP address – defaults to ISP name if none is found||No||‘Bouygues Telecom’|
|text||Second level domain name associated with the visitor’s IP address||No||‘nuvox.net’|
|text||Visitor’s connection type||No||‘Cable/DSL’|
2.1.9 Metadata fields
Fields containing information about the event type.
|text||Who defined the event||No||‘com.acme’|
|text||Format for event||No||‘jsonschema’|
|text||Version of event schema||No||‘1-0-2’|
2.2.1 Web-specific fields
|text||The page URL||Yes||‘http://www.example.com‘|
|text||Scheme aka protocol||Yes||‘https’|
|text||Host aka domain||Yes||‘“www.snowplowanalytics.com’|
|int||Port if specified, 80 if not||80|
|text||Path to page||No||‘/product/index.html’|
|text||Fragment aka anchor||No||‘4-conclusion’|
|text||URL of the referrer||No||‘http://www.referrer.com‘|
|text||Web page title||No||‘Using ChartIO to visualize and interrogate Snowplow data – Snowplow Analytics’|
|text||Referer page path||No||‘/images/search’|
|text||Referer URL querystring||No||‘q=psychic+oracle+cards’|
|text||Referer URL fragment||No|
|text||Type of referer||No||‘search’, ‘internal’|
|text||Name of referer if recognised||No||‘Bing images’|
|text||Keywords if source is a search engine||No||‘psychic oracle cards’|
|text||The Snowplow domain_userid of the referring website||No||‘bc2e92ec6c204a14’|
|timestamp||The time of attaching the domain_userid to the inbound link||No||‘2013-11-26 00:02:05’|
|text||The page’s character encoding||No||, ‘UTF-8’|
|int||The page’s width in pixels||No||1024|
|int||The page’s height in pixels||No||3000|
|Marketing / traffic source fields|
|text||Type of traffic source||No||‘cpc’, ‘affiliate’, ‘organic’, ‘social’|
|text||The company / website where the traffic came from||No||‘Google’, ‘Facebook’|
|text||Any keywords associated with the referrer||No||‘new age tarot decks’|
|text||The content of the ad. (Or an ID so that it can be looked up.)||No||13894723|
|text||The campaign ID||No||‘diageo-123’|
|text||The click ID||No||‘ac3d8e459’|
|text||The ad network to which the click ID belongs||No||‘DoubleClick’|
|int||A user fingerprint generated by looking at the individual browser features||No||2161814971|
|text||Type of internet connection||No||No|
|boolean||Does the browser support persistent cookies?||No||1|
|text||Browser name||No||‘Firefox 12’|
|text||Browser rendering engine||No||‘GECKO’|
|text||Language the browser is set to||No||‘en-GB’|
|boolean||Whether the browser recognizes PDFs||No||1|
|boolean||Whether Flash is installed||No||1|
|boolean||Whether Java is installed||No||1|
|boolean||Whether Adobe Shockwave is installed||No||1|
|boolean||Whether QuickTime is installed||No||1|
|boolean||Whether RealPlayer is installed||No||1|
|boolean||Whether mplayer2 is installed||No||1|
|boolean||Whether Google Gears is installed||No||1|
|boolean||Whether Microsoft Silverlight is installed||No||1|
|boolean||Whether cookies are enabled||No||1|
|int||Bit depth of the browser color palette||No||24|
2.3 Event-specific fields
Snowplow includes specific fields to capture data associated with specific events.
Note that to date, all event types have been defined by Snowplow. Also note that
event_vendor values follow the Java package naming convention.
Snowplow currently supports (or will support in the near future) the following event types:
|Event type||Value of |
|2.3.3||Ecommerce transactions||‘transaction’ and ‘transaction_item’|
|2.3.4||Custom structured events||‘struct’|
|2.3.5||Custom unstructured events||‘unstruct’|
Details of which fields are available for which events are given below.
2.3.1 Page views
There are currently no fields that are specific to
page_view events: all the fields that are required are part of the standard fields available for any web-based event e.g.
2.3.2 Page pings
There are four additional fields included with page pings that indicate how a user has scrolled over a web page since the last page ping:
|integer||Minimum page x offset seen in the last ping period||No||0|
|integer||Maximum page x offset seen in the last ping period||No||100|
|integer||Minimum page y offset seen in the last ping period||No||0|
|integer||Maximum page y offset seen in the last ping period||No||200|
2.3.3 Ecommerce transactions
There are a large number of fields specifically for transaction events.
Fields that start
tr_ relate to the transaction as a whole. Fields that start
ti_ refer to the specific item included in the transaction. (E.g. a product in the basket.) Single transactions typically span multiple lines of data: there will be a single line where
transaction, where the
tr_ fields are set, and multiple lines (one for each product included) where
transaction_item and the
ti_ fields are set.
|text||Transaction affiliation (e.g. store where sale took place)||No||‘web’|
|decimal||Total transaction value||Yes||12.99|
|decimal||Total tax included in transaction value||No||3.00|
|decimal||Delivery cost charged||No||0.00|
|decimal||Total in base currency||No||12.99|
|decimal||Total tax in base currency||No||3.00|
|decimal||Delivery cost in base currency||No||0.00|
|text||Delivery address, city||No||‘London’|
|text||Delivery address, state||No||‘Washington’|
|text||Delivery address, country||No||‘France’|
|text||Product name||No||‘Cone pendulum’|
|text||Product category||No||‘New Age’|
|decimal||Product unit price||Yes||9.99|
|decimal||Price in base currency||No||9.99|
|integer||Number of product in transaction||Yes||2|
* Set exclusively by the Currency conversion enrichment.
2.3.4 Custom structured events
If you wish to track an event that Snowplow does not recognise as a first class citizen (i.e. one of the events listed above), then you can track them using the generic ‘custom structured events’. There are five fields that are available to store data related to custom events:
|text||Category of event||Yes||‘ecomm’, ‘video’|
|text||Action performed / event name||Yes||‘add-to-basket’, ‘play-video’|
|text||The object of the action e.g. the ID of the video played or SKU of the product added-to-basket||No||‘pbz00123’|
|text||A property associated with the object of the action||No||‘HD’, ‘large’|
|decimal||A value associated with the event / action e.g. the value of goods added-to-basket||No||9.99|
2.3.5 Custom unstructured events
Custom unstructured events are a flexible tool that enable Snowplow users to define their own event types and send them into Snowplow.
When a user sends in a custom unstructured event, they do so as a JSON of name-value properties that conforms to a JSON schema defined for the event earlier.
The unstructured event is not part of the
atomic.events table; instead, for users running on Redshift, it is shredded into its own table. The fields in this table will be determined by the JSON schema defined for the event in advance. Users can query just the table for that particular unstructured event, if that’s all that’s required for their analysis, or join that table back to the
atomic.events table by
atomic.my_example_unstructured_event_table.root_id = atomic.events.event_id
Contexts enable Snowplow users to define their own entities that are related to events, and fields that are related to each of those entities. For example, an online retailer may choose to define a
user context, to store information about a particular user, which might include data points like the users Facebook ID, age, membership details etc. In addition, they may also define a
product context, with product data e.g. SKU, name, created date, description, tags etc.
An event can have any number of custom contexts attached. Each context is passed into Snowplow as a JSON. Additionally, the Snowplow Enrichment process can derive additional contexts.
Contexts are not part of the
atomic.events table; instead, for users running on Redshift, Snowplow will shred each context JSON into a dedicated table in the
atomic schema, making it much more efficient for analysts to query data passed in in any one of the contexts. Those contexts can be joined back to the core
atomic.events table on
atomic.my_custom_context_table.root_id = atomic.events.event_id, which is a one-to-one join or a many-to-one join.
2.4 Specific unstructured events and custom contexts
These are also a variety of unstructured events and custom contexts defined by Snowplow. You can find their schemas here.
3. A note about storage data formats
- Currently, Snowplow data is stored in S3 (for processing in Apache Hive, Pig, and / or Mahout), Redshift and PostgreSQL (for processing by any SQL-compatible analytics package).
- There are minor differences between the structure of data in both formats. These relate to data structures that Hive and PostgreSQL support (e.g. JSONs) that Redshift does not
- Nevertheless, the structure of both is similar: representing a fat table
- Going forwards our intention is to move the storage format for data on S3 from the current flatfile structure to Avro. This will become the ‘canonical Snowplow data structure’. Other formats (e.g. Redshift, PostgreSQL etc.) will simply be ‘flattened’ versions of the same data. We have outlined some of our plans in this blog post.