1. Home
  2. Docs
  3. Modeling your data
  4. The Snowplow mobile data model

The Snowplow mobile data model

Overview

The Snowplow mobile data model processes and aggregates data collected by the Snowplow iOS and Android mobile trackers into a set of derived tables – screen views, sessions and users – with each table generated by its respective module.

In addition to these ‘standard’ modules there are optional modules that can be enabled depending on your tracker implementation. Currently there is only one such optional module, Application Errors, which provides application crash reporting. These modules produce their own derived output table, as well as being aggregated and joined into ‘standard’ modules such as sessions for high level KPIs. As the tracker’s functionality continues to expand, so will this library of optional modules.

Unlike the web model the level of aggregation is not linear, as shown in the diagram below. This allows the sessions module to consume events other than just screen views, such as app installs or crashes.

Web vs Mobile

The model runs incrementally, only processing new events (and events that have previously been modeled but are part of screen views, sessions or users for which/whom there is new information) with every run. The incremental logic is separate from the logic that calculates the tables so as to make customization of the model easier. More information on the model structure and customization options can be found below.

The Snowplow mobile data model v1.0.0 can be found in the snowplow/data-models GitHub repository.

Currently Redshift is the only data warehouse supported by this model however Google BigQuery and Snowflake versions will be released soon.

Requirements

  • Snowplow Android or iOS mobile tracker version 1.1.0 or later implemented.
  • Mobile session context enabled.
  • Screen view events enabled.

Quick start

Prerequisites

  • SQL-runner is installed.
  • Snowplow Android or iOS mobile tracker implemented with mobile events dataset present in a database.

Playbook Configuration

Setting Schemas

For all playbooks within mobile/v1/redshift/sql-runner/playbooks/standard, configure the following:

  • input_schema – Set to the schema containing your Snowplow events data.
  • scratch_schema – Set to the scratch/staging schema to contain the intermediate tables generated by this model. Create if required.
  • output_schema – Set to a schema to contain the model’s final output tables (mobile_screen_views, mobile_sessions etc.)
Base Module

Within the base module’s main playbook, 01-base-main.yml.tmpl:

  1. Set the start_date as desired.
  2. Enable the following contexts as desired by setting to True:
    • Mobile – Device type, OS etc.
    • Geolocation – Device latitude, longitude, bearing etc.
    • Application – App version and build.
    • Screen – Screen details associated with a mobile event.
  3. Adjust filtering of events if required:
    • platform_filters: Default – platform = 'mob'. Override if required.
    • app_id: Default – No filter on app_id. Add if required.
# 01-base-main.yml.tmpl :variables: ... :start_date: 2020-01-01 #Set as required ... #Enable contexts if desired :mobile_context: false :geolocation_context: false :application_context: false :screen_context: false :platform_filters: [] #Override default if required e.g. ['mob','tv'] :app_id_filters: [] #Add app_id filter if required e.g ['my_app']
Code language: YAML (yaml)

The remaining variables are set to the recommended defaults. For more information please see the base module’s README within the playbooks directory.

Optional Modules

Optional modules are disabled by default. Please enable where appropriate by setting enabled: true in the main playbook of each module.

Running the model

Authentication
  1. Set environmental variable, REDSHIFT_PASSWORD, to your database password.
  2. Fill in the template .scripts/templates/redshift.yml.tmpl with your database details, leaving PASSWORD_PLACEHOLDER as is. This placeholder will be replaced with the credentials from env var REDSHIFT_PASSWORD at run time.
Execution

For a full run of the model:

bash .scripts/run_config.sh -b ~/pathTo/sql-runner -c mobile/v1/redshift/sql-runner/configs/datamodeling.json;
Code language: Bash (bash)

For more details please see the README within the .scripts directory.

Authentication

For each playbook within mobile/v1/redshift/sql-runner/playbooks/standard fill in your database connection details. For more details please refer to the SQL Runner Docs.

Execution

There are many tools available to schedule and run SQL jobs such as the mobile model. While these tools differ in operation the basic principles to running the mobile model should be similar.

A series of tasks will need to be set to execute SQL Runner against each individual playbook of the model in turn. For example, to run the main playbook of the base module:

bash {sql-runner-path} -playbook sql-runner/playbooks/standard/01-base/01-base-main.yml.tmpl -sqlroot sql-runner/sql
Code language: Bash (bash)

The correct order of playbook execution and the dependancies between them can be determined from the datamodeling config file within the repo.

Testing

The mobile model comes with a suite of data validation checks to ensure the model is running as intended. These checks are performed by Great Expectations. The tests were used during the model’s development and are not required to run the mobile model. If desired, they can however be used during the deployment of the mobile model on your own Snowplow dataset to validate the output. For further details please refer to the README in the .scripts directory.

Advanced Configurations

This quick start guide covers the implementation process for the majority of use cases. Further configurations can be made such as:

  • Running specific modules more frequently than others.
  • Running specific modules only.
  • Adjusting table scan limits.

For further details please refer to the READMEs within each module’s playbook directory.

Technical architecture

The entire model

This model consists of a series of modules, each is idempotent and can be run on independent schedules, and each produces a table which serves as the input to the next module.

The individual modules

The ‘standard’ modules can be thought of as source code for the core logic of the model, which Snowplow maintains. These modules carry out the incremental logic in such a way as custom modules can be written to plug into the model’s structure, without needing to write a parallel incremental logic. We recommend that all customisations are written in this way, which allows us to safely maintain and roll out updates to the model, without impact on dependent custom SQL. For more information on this, jump to the section on customizing the model below.

Each module produces a table which acts as the input to the subsequent module (the _staged tables), and updates a production table – with the exception of the base module, which takes atomic data as its input, and does not update a production table.

Each module comes with a 99-{module}-complete playbook, which marks that module complete by updating any relevant manifests, and truncating the input, and cleaning up temporary tables. The complete steps may be run at the end of each module, or at the end of the run.

More detail on each module can be found in the relevant READMEs in the GitHub repository.

Architecture of an individual mobile module

Customizing the model

Custom modules can fit into the incremental structure by consuming the same inputs, and running before the 99-{module}-complete playbook runs. Custom modules may also consume and intermediary tables of the standard module, which will not be dropped until the 99-{module}-complete playbook runs.

Any custom SQL that depends on a _staged table as its input should run before the complete step of the module which handles that same input. For example, custom logic which takes mobile_events_staged as an input should run before the 99-sessions-complete playbook which truncates mobile_events_staged.

An example custom module has been included in mobile/v1/redshift/sql-runner/sql/custom. In this module we:

  1. Read screen views from scratch.mobile_screen_views_staged.
  2. Aggregate screen views to one row per session_id.
  3. Delete and insert into the output table derived.session_goals. This table can then be joined onto derived.mobile_sessions on session_id.

The playbooks should then be run in the following order:

standard/01-base/01-base-main standard/01-base/99-base-complete standard/02-screen-views/01-screen-views-main standard/02-screen-views/99-screen-views-complete standard/03-optional-modules/01-app-errors/01-app-errors-main standard/03-optional-modules/01-app-errors/99-app-errors-complete standard/04-sessions/01-sessions-main custom/04-session-goals/01-session-goals-main custom/04-session-goals/01-session-goals-complete standard/04-sessions/01-sessions-complete
Code language: JSON / JSON with Comments (json)

An example of this can be seen in the datamodeling_custom_module.json file within the config directory.

Custom modules can also be created with their own independent manifesting logic, in which case they are more complex, but don’t rely on the standard modules.

More information on customizing the Snowplow mobile data model can be found in the custom module README.

Metadata

Metadata is logged in the {{.output_schema}}.datamodel_metadata{{.entropy}} table, per-module and per-run. A persistent ID is created, so that separate modules within the same run may be identified.

If you’d like to learn more about Snowplow Insights you can book a demo with our team, or if you’d prefer, you can try Snowplow technology for yourself quickly and easily.