Once your data is stored in S3 and Redshift, the basic setup is complete. You now have access to the event stream: a long list of packets of data, where each packet represents a single event. While it is possible to do analysis directly on this event stream, it is common to:
- Join event-level data with other data sets (e.g. customer data)
- Aggregate event-level data into smaller data sets (e.g. sessions)
- Apply business logic (e.g. user segmentation)
We call this process data modeling. The data model is usually developed in three stages.
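For example, a session-level aggregation (a sketch only; the `atomic.events` table and `domain_sessionid` field are assumed from Snowplow's canonical event model, and your column names may differ) might look like:

```sql
-- Aggregate event-level data into sessions (illustrative sketch)
SELECT
  domain_userid,
  domain_sessionid,
  MIN(derived_tstamp) AS session_start,
  MAX(derived_tstamp) AS session_end,
  COUNT(*)            AS event_count
FROM atomic.events
GROUP BY 1, 2;
```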
Step 1: Implement the standard data model
First, the basic Snowplow data model is set up in so-called full mode. This creates a set of derived tables in Redshift that is recomputed each time the pipeline runs, which allows for rapid iteration as long as data volumes are relatively small. The basic data model can be downloaded from GitHub, and the underlying logic is explained in more detail in the An introduction to event data modeling post on the Snowplow website.
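In full mode, each derived table is simply dropped and rebuilt from the complete event stream on every run. A minimal sketch (the `derived` schema and column names are assumptions, not the exact queries shipped with the model):

```sql
-- Full mode: drop and recreate the derived table on every pipeline run
DROP TABLE IF EXISTS derived.sessions;

CREATE TABLE derived.sessions AS (
  SELECT
    domain_userid,
    domain_sessionid,
    MIN(derived_tstamp) AS session_start,
    MAX(derived_tstamp) AS session_end,
    COUNT(*)            AS event_count
  FROM atomic.events
  GROUP BY 1, 2
);
```

Because the table is rebuilt from scratch, changing the query logic requires no migration of existing derived data, which is what makes iteration fast at small data volumes.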
Step 2: Customize the SQL data model
The basic model can then be modified to include business-specific logic. This could mean adding e-commerce fields, aggregating events in different ways, or joining Snowplow data with other data sets (e.g. customer data).
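A typical customization joins a derived table with a business-owned table. In this sketch, `customer_data.customers` and its `customer_segment` column are hypothetical names standing in for whatever customer data set you maintain:

```sql
-- Enrich sessions with business-specific customer attributes
-- (customer_data.customers is a hypothetical customer table)
SELECT
  s.domain_sessionid,
  s.session_start,
  s.event_count,
  c.customer_segment
FROM derived.sessions AS s
LEFT JOIN customer_data.customers AS c
  ON s.domain_userid = c.domain_userid;
```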
Step 3: Move to an incremental approach
Once the custom data model is in a stable state, i.e. all data needed to build reports has been added, the queries can be migrated from full to incremental mode. The incremental model updates the derived tables using only the most recent events, rather than recomputing the tables each time using all events. The difference between the two modes is described in more detail in the Making SQL data models incremental post on our Discourse forum.
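The incremental pattern can be sketched as a delete-and-reinsert of only the affected rows. This is an illustrative outline, not the model's actual queries: the `derived.last_run` watermark table and the choice of `collector_tstamp` as the high-water mark are assumptions.

```sql
-- Incremental mode (sketch): recompute only sessions that received
-- new events since the last run, leaving all other rows untouched
BEGIN;

-- Remove sessions touched by new events so they can be rebuilt
DELETE FROM derived.sessions
WHERE domain_sessionid IN (
  SELECT domain_sessionid
  FROM atomic.events
  WHERE collector_tstamp > (SELECT last_tstamp FROM derived.last_run)
);

-- Rebuild those sessions from all of their events
INSERT INTO derived.sessions
SELECT
  domain_userid,
  domain_sessionid,
  MIN(derived_tstamp) AS session_start,
  MAX(derived_tstamp) AS session_end,
  COUNT(*)            AS event_count
FROM atomic.events
WHERE domain_sessionid IN (
  SELECT domain_sessionid
  FROM atomic.events
  WHERE collector_tstamp > (SELECT last_tstamp FROM derived.last_run)
)
GROUP BY 1, 2;

COMMIT;
```

Wrapping the delete and insert in a single transaction keeps the derived table consistent for concurrent readers.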
The data model consists of a set of SQL queries that are executed in sequence using our SQL Runner application. The order in which the queries are executed is determined by a config file, an example of which can be found on GitHub.
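As a rough sketch of what such a config looks like (the exact playbook syntax, keys, and file paths below are assumptions; refer to the example on GitHub for the authoritative format):

```yaml
# Illustrative SQL Runner playbook sketch -- keys and paths are
# assumptions; see the example config on GitHub for exact syntax
:targets:
  - :name: "My Redshift"
    :type: redshift
    :database: snowplow
:steps:
  - :name: 01-sessions
    :queries:
      - :name: sessions
        :file: sql/sessions.sql
  - :name: 02-users
    :queries:
      - :name: users
        :file: sql/users.sql
```

Steps run in the order listed, which is how the config file controls the execution sequence of the model's queries.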