1. Home
  2. Docs
  3. Understanding your pipeline
  4. Understanding the structure of Snowplow data

Understanding the structure of Snowplow data

1. Overview

In order to analyze Snowplow data, it is important to understand how it is structured. We have tried to make the structure of Snowplow data as simple, logical, and easy-to-query as possible.

  • Each line represents one event. Each line in the Snowplow events table represents a single event, be that a page view, add to basket, play video, like etc.
  • Structured data. Snowplow data is structured: individual fields are stored in their own columns, making writing sophisticated queries on the data easy, and making it straightforward for analysts to plugin any kind of analysis tool into their Snowplow data to compose and execute queries
  • Extensible schema. Snowplow started life as a web analytics data warehousing platform, and has a basic schema suitable for performing web analytics, with a wide range of web-specific dimensions (related to page URLs, browsers, operating systems, devices, IP addresses, cookie IDs) and web-specfic events (page views, page pings, transactions). All of these fields can be found in the atomic.events table, which is a “fat” (many columns) table. As Snowplow has evolved into a general purpose event analytics platform, we’ve enabled Snowplow users to define additional event types (we call these custom unstructured events) and define their own entities (we call these custom contexts) so that they can extend the schema to suit their own businesses. For Snowplow users running Amazon Redshift, each custom unstructured event and custom context will be stored in its own dedicated table, again with one line per event. These additional tables can be joined back to the core atomic.events table, by joining on th e root_id field in the custom unstructured event / custom context table with the event_id in the atomic.events table
  • Single table. All the events are effectively stored in a single table, making running queries across the data very easy. Even if you’re running Snowplow with Redshift and have extended the schema as described above, you can still query the data as if it were in a single fat table. This is because:
    • The joins from the additional tables to the core atomic.events table are one-to-one
    • The field joined on is the distribution key for both tables, so queries are as fast as if the data were in a single table
  • Immutable log. The Snowplow data table is designed to be immutable: the data in each line should not change over time. Data points that we would expect to change over time (e.g. what cohort a particular user belongs to, how we classify a particular visitor) can be derived from Snowplow data. However, our recommendation is that these derived fields should be defined and calculated at analysis time, stored in a separate table and joined to the Snowplow events table when performing any analysis

2. Understanding the individual fields

2.1 Common fields (platform and event independent)

2.1.1 Application fields

FieldTypeDescriptionReqd?Example
app_idtextApplication IDYes‘angry-birds’
platformtextPlatformYes‘web’

The application ID is used to distinguish different applications that are being tracked by the same Snowplow stack, e.g. production versus dev.

The platform ID is used to distinguish the same app running on different platforms, e.g. iOS vs web.

2.1.2 Date / time fields

FieldTypeDescriptionReqd?Example
collector_tstamptimestampTime stamp for the event recorded by the collectorYes‘2013-11-26 00:02:05’
dvce_created_tstamptimestampTimestamp event was recorded on the client deviceNo‘2013-11-26 00:03:57.885’
dvce_sent_tstamptimestampWhen the event was sent by the client deviceNo‘2013-11-26 00:03:58.032’
etl_tstamptimestampTimestamp event began ETLNo‘2017-01-26 00:01:25.292’
os_timezonetextClient operating system timezoneNo‘Europe/London’
derived_tstamptimestampTimestamp making allowance for innaccurate device clockNo‘2013-11-26 00:02:04’
true_tstamptimestampUser-set “true timestamp” for the eventNo‘2013-11-26 00:02:04’

2.1.3 Event / transaction fields

FieldTypeDescriptionReqd?Example
eventtextThe type of event recordedYes‘page_view’
event_idtextA UUID for each eventYes‘c6ef3124-b53a-4b13-a233-0088f79dcbcb’
txn_idintTransaction ID set client-side, used to de-dupe recordsNo421828
event_fingerprinttextHash client-set event fieldsNoAADCE520E20C2899F4CED228A79A3083

A complete list of event types is given here.

2.1.4 Snowplow version fields

FieldTypeDescriptionReqd?Example
v_trackertextTracker versionYes‘no-js-0.1.0’
v_collectortextCollector versionYes‘clj-tomcat-0.1.0’, ‘cf’
v_etltextETL versionYes‘serde-0.5.2’
name_trackertextTracker namespaceNo‘cloudfront-1’
etl_tagstextJSON of tags for this ETL runNo“[‘prod’]”

Some Snowplow Trackers allow the user to name each specific Tracker instance. name_tracker corresponds to this name, and can be used to distinguish which tracker generated which events.

2.1.5 User-related fields

FieldTypeDescriptionReqd?Example
user_idtextUnique ID set by businessNojon.doe@email.com
domain_useridtextUser ID set by Snowplow using 1st party cookieNo‘bc2e92ec6c204a14’
network_useridtextUser ID set by Snowplow using 3rd party cookieNo‘ecdff4d0-9175-40ac-a8bb-325c49733607’
user_ipaddresstextUser IP addressNo‘92.231.54.234’
domain_sessionidxintA visit / session indexNo3
domain_sessionidtextA visit / session identifierNo‘c6ef3124-b53a-4b13-a233-0088f79dcbcb’

domain_sessionidx is the number of the current user session. For example, an event occurring during a user’s first session would have domain_sessionidx set to 1. The JavaScript Tracker calculates this field by storing a visit count in a first-party cookie. Whenever the Tracker fires an event, if more than 30 minutes have elapsed since the last event, the visitor count is increased by 1. (Whenever an event is fired, a “session cookie” is created and set to expire in 30 minutes. This is how the Tracker can tell whether the visit count should be incremented.) Thirty minutes is the default value and can be changed using the setSessionCookieTimeout method.

2.1.6 Device and operating system fields

FieldTypeDescriptionReqd?Example
useragenttextRaw useragentYes
dvce_typetextType of deviceNo‘Computer’
dvce_ismobilebooleanIs the device mobile?No1
dvce_screenheightintScreen height in pixelsNo1024
dvce_screenwidthintScreen width in pixelsNo1900
os_nametextName of operating systemNo‘Android’
os_familytextOperating system familyNo‘Linux’
os_manufacturertextCompany responsible for OSNo‘Apple’

2.1.7 Location fields

FieldTypeDescriptionReqd?Example
geo_countrytextISO 3166-1 code for the country the visitor is located inNo‘GB’, ‘US’
geo_regiontextISO-3166-2 code for country region the visitor is inNo‘I9’, ‘TX’
geo_citytextCity the visitor is inNo‘New York’, ‘London’
geo_zipcodetextPostcode the visitor is inNo‘94109’
geo_latitudetextVisitor location latitudeNo37.443604
geo_longitudetextVisitor location longitudeNo-122.4124
geo_region_nametextVisitor region nameNo‘Florida’
geo_timezonetextVisitor timezone nameNo‘Europe/London’

2.1.8 IP address-based fields

FieldTypeDescriptionReqd?Example
ip_isptextVisitor’s ISPNo‘FDN Communications’
ip_organizationtextOrganization associated with the visitor’s IP address – defaults to ISP name if none is foundNo‘Bouygues Telecom’
ip_domaintextSecond level domain name associated with the visitor’s IP addressNo‘nuvox.net’
ip_netspeedtextVisitor’s connection typeNo‘Cable/DSL’

2.1.9 Metadata fields

Fields containing information about the event type.

FieldTypeDescriptionReqd?Example
event_vendortextWho defined the eventNo‘com.acme’
event_nametextEvent nameNo‘link_click’
event_formattextFormat for eventNo‘jsonschema’
event_versiontextVersion of event schemaNo‘1-0-2’

2.2 Platform-specific fields

2.2.1 Web-specific fields

FieldTypeDescriptionReqd?Example
Page fields
page_urltextThe page URLYeshttp://www.example.com
page_urlschemetextScheme aka protocolYes‘https’
page_urlhosttextHost aka domainYes‘“www.snowplowanalytics.com’
page_urlportintPort if specified, 80 if not80
page_urlpathtextPath to pageNo‘/product/index.html’
page_urlquerytextQuerystringNo‘id=GTM-DLRG’
page_urlfragmenttextFragment aka anchorNo‘4-conclusion’
page_referrertextURL of the referrerNohttp://www.referrer.com
page_titletextWeb page titleNo‘Using ChartIO to visualize and interrogate Snowplow data – Snowplow Analytics’
refr_urlschemetextReferer schemeNo‘http’
refr_urlhosttextReferer hostNo‘www.bing.com’
refr_urlportintReferer portNo80
refr_urlpathtextReferer page pathNo‘/images/search’
refr_urlquerytextReferer URL querystringNo‘q=psychic+oracle+cards’
refr_urlfragmenttextReferer URL fragmentNo
refr_mediumtextType of refererNo‘search’, ‘internal’
refr_sourcetextName of referer if recognisedNo‘Bing images’
refr_termtextKeywords if source is a search engineNo‘psychic oracle cards’
refr_domain_useridtextThe Snowplow domain_userid of the referring websiteNo‘bc2e92ec6c204a14’
refr_dvce_tstamptimestampThe time of attaching the domain_userid to the inbound linkNo‘2013-11-26 00:02:05’
Document fields
doc_charsettextThe page’s character encodingNo, ‘UTF-8’
doc_widthintThe page’s width in pixelsNo1024
doc_heightintThe page’s height in pixelsNo3000
Marketing / traffic source fields
mkt_mediumtextType of traffic sourceNo‘cpc’, ‘affiliate’, ‘organic’, ‘social’
mkt_sourcetextThe company / website where the traffic came fromNo‘Google’, ‘Facebook’
mkt_termtextAny keywords associated with the referrerNo‘new age tarot decks’
mkt_contenttextThe content of the ad. (Or an ID so that it can be looked up.)No13894723
mkt_campaigntextThe campaign IDNo‘diageo-123’
mkt_clickidtextThe click IDNo‘ac3d8e459’
mkt_networktextThe ad network to which the click ID belongsNo‘DoubleClick’
Browser fields
user_fingerprintintA user fingerprint generated by looking at the individual browser featuresNo2161814971
connection_typetextType of internet connectionNoNo
cookiebooleanDoes the browser support persistent cookies?No1
br_nametextBrowser nameNo‘Firefox 12’
br_versiontextBrowser versionNo‘12.0’
br_familytextBrowser familyNo‘Firefox’
br_typetextBrowser typeNo‘Browser’
br_renderenginetextBrowser rendering engineNo‘GECKO’
br_langtextLanguage the browser is set toNo‘en-GB’
br_features_pdfbooleanWhether the browser recognizes PDFsNo1
br_features_flashbooleanWhether Flash is installedNo1
br_features_javabooleanWhether Java is installedNo1
br_features_directorbooleanWhether Adobe Shockwave is installedNo1
br_features_quicktimebooleanWhether QuickTime is installedNo1
br_features_realplayerbooleanWhether RealPlayer is installedNo1
br_features_windowsmediabooleanWhether mplayer2 is installedNo1
br_features_gearsbooleanWhether Google Gears is installedNo1
br_features_silverlightbooleanWhether Microsoft Silverlight is installedNo1
br_cookiesbooleanWhether cookies are enabledNo1
br_colordepthintBit depth of the browser color paletteNo24
br_viewheightintViewport heightNo1000
br_viewwidthintViewport widthNo1000

2.3 Event-specific fields

Snowplow includes specific fields to capture data associated with specific events.

Note that to date, all event types have been defined by Snowplow. Also note that event_vendor values follow the Java package naming convention.

Snowplow currently supports (or will support in the near future) the following event types:

Event typeValue of event field in model
2.3.1Page views‘page_view’
2.3.2Page pings‘page_ping’
2.3.3Ecommerce transactions‘transaction’ and ‘transaction_item’
2.3.4Custom structured events‘struct’
2.3.5Custom unstructured events‘unstruct’

Details of which fields are available for which events are given below.

2.3.1 Page views

There are currently no fields that are specific to page_view events: all the fields that are required are part of the standard fields available for any web-based event e.g. page_urlscheme, page_title.

2.3.2 Page pings

There are four additional fields included with page pings that indicate how a user has scrolled over a web page since the last page ping:

FieldTypeDescriptionReqd?Example
pp_xoffset_minintegerMinimum page x offset seen in the last ping periodNo0
pp_xoffset_maxintegerMaximum page x offset seen in the last ping periodNo100
pp_yoffset_minintegerMinimum page y offset seen in the last ping periodNo0
pp_yoffset_maxintegerMaximum page y offset seen in the last ping periodNo200

2.3.3 Ecommerce transactions

There are a large number of fields specifically for transaction events.

Fields that start tr_ relate to the transaction as a whole. Fields that start ti_ refer to the specific item included in the transaction. (E.g. a product in the basket.) Single transactions typically span multiple lines of data: there will be a single line where event = transaction, where the tr_ fields are set, and multiple lines (one for each product included) where event = transaction_item and the ti_ fields are set.

FieldTypeDescriptionReqd?Example
tr_orderidtextOrder IDYes‘#134’
tr_affiliationtextTransaction affiliation (e.g. store where sale took place)No‘web’
tr_totaldecimalTotal transaction valueYes12.99
tr_taxdecimalTotal tax included in transaction valueNo3.00
tr_shippingdecimalDelivery cost chargedNo0.00
tr_total_base*decimalTotal in base currencyNo12.99
tr_tax_base*decimalTotal tax in base currencyNo3.00
tr_shipping_base*decimalDelivery cost in base currencyNo0.00
tr_citytextDelivery address, cityNo‘London’
tr_statetextDelivery address, stateNo‘Washington’
tr_countrytextDelivery address, countryNo‘France’
tr_currencytextCurrencyNo‘USD’
ti_orderidtextOrder IDYes‘#134’
ti_skutextProduct SKUYes‘pbz00123’
ti_nametextProduct nameNo‘Cone pendulum’
ti_categorytextProduct categoryNo‘New Age’
ti_pricedecimalProduct unit priceYes9.99
ti_price_base*decimalPrice in base currencyNo9.99
ti_quantityintegerNumber of product in transactionYes2
ti_currencytextCurrencyNo‘EUR’
base_currency*textReporting currencyNo‘GBP’

* Set exclusively by the Currency conversion enrichment.

2.3.4 Custom structured events

If you wish to track an event that Snowplow does not recognise as a first class citizen (i.e. one of the events listed above), then you can track them using the generic ‘custom structured events’. There are five fields that are available to store data related to custom events:

FieldTypeDescriptionReqd?Example
se_categorytextCategory of eventYes‘ecomm’, ‘video’
se_actiontextAction performed / event nameYes‘add-to-basket’, ‘play-video’
se_labeltextThe object of the action e.g. the ID of the video played or SKU of the product added-to-basketNo‘pbz00123’
se_propertytextA property associated with the object of the actionNo‘HD’, ‘large’
se_valuedecimalA value associated with the event / action e.g. the value of goods added-to-basketNo9.99

2.3.5 Custom unstructured events

Custom unstructured events are a flexible tool that enable Snowplow users to define their own event types and send them into Snowplow.

When a user sends in a custom unstructured event, they do so as a JSON of name-value properties that conforms to a JSON schema defined for the event earlier.

The unstructured event is not part of the atomic.events table; instead, for users running on Redshift, it is shredded into its own table. The fields in this table will be determined by the JSON schema defined for the event in advance. Users can query just the table for that particular unstructured event, if that’s all that’s required for their analysis, or join that table back to the atomic.events table by

atomic.my_example_unstructured_event_table.root_id = atomic.events.event_id

2.3.6 Contexts

Contexts enable Snowplow users to define their own entities that are related to events, and fields that are related to each of those entities. For example, an online retailer may choose to define a user context, to store information about a particular user, which might include data points like the users Facebook ID, age, membership details etc. In addition, they may also define a product context, with product data e.g. SKU, name, created date, description, tags etc.

An event can have any number of custom contexts attached. Each context is passed into Snowplow as a JSON. Additionally, the Snowplow Enrichment process can derive additional contexts.

Contexts are not part of the atomic.events table; instead, for users running on Redshift, Snowplow will shred each context JSON into a dedicated table in the atomic schema, making it much more efficient for analysts to query data passed in in any one of the contexts. Those contexts can be joined back to the core atomic.events table on atomic.my_custom_context_table.root_id = atomic.events.event_id, which is a one-to-one join or a many-to-one join.

2.4 Specific unstructured events and custom contexts

These are also a variety of unstructured events and custom contexts defined by Snowplow. You can find their schemas here.

3. A note about storage data formats

  • Currently, Snowplow data is stored in S3 (for processing in Apache Hive, Pig, and / or Mahout), Redshift and PostgreSQL (for processing by any SQL-compatible analytics package).
  • There are minor differences between the structure of data in both formats. These relate to data structures that Hive and PostgreSQL support (e.g. JSONs) that Redshift does not
  • Nevertheless, the structure of both is similar: representing a fat table
  • Going forwards our intention is to move the storage format for data on S3 from the current flatfile structure to Avro. This will become the ‘canonical Snowplow data structure’. Other formats (e.g. Redshift, PostgreSQL etc.) will simply be ‘flattened’ versions of the same data. We have outlined some of our plans in this blog post.