Event and Entity Definition

Introduction

When it comes to collecting data into Snowplow, a large number of predefined events and entities are already available for use. You can find these in the console under the relevant sources, or in the full list of publicly available schemas.

Snowplow users can, however, define their own event and entity (context) types as well. To do this, users create a schema for each event or entity; the schema defines which fields are recorded with it. Once this is done, the user can record the new event or entity type into Snowplow and view the data, e.g. in their data warehouse.

In this guide, we will walk you through the process of defining a new event type, recording that event, and reading that event data from your data warehouse.

Defining events and entities involves the following steps:

  1. Create a self-describing JSON schema
  2. Test your schema by sending events to your test pipeline
  3. Edit the schema
  4. Publish the schema to production

1. Event & Context Definitions (Writing Schemas)

Snowplow's predefined fields that capture contextual information for events like page views or ecommerce transactions are a great starting place, but eventually there will come a time when your organization has a unique set of contexts, or a unique event, for which there are no predefined schemas.

When that time comes, here are the steps to get your data into your pipeline:

  1. Log in to the Schemas section of the Snowplow Insights console.
  2. Click the "Create new schema" button.
  3. Give your event a name that makes it clear to anyone else consuming the data what the event or entity is.
  4. Out of the box, the version number of a new schema is 1-0-0. You’ll change this version in the future if you need to update your schema. For more info on versioning see here.
  5. Update the example schema template to add the specific fields you want to collect. See below for a form example.

[Image: "Add new schema" button]

Let’s take a look at an example JSON schema to talk about its constituent parts:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for an example event",
  "self": {
    "vendor": "com.snowplowanalytics",
    "name": "example_event",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "example_field_1": {
      "type": "string",
      "maxLength": 128
    },
    "example_field_2": {
      "type": [
        "string",
        "null"
      ],
      "maxLength": 128
    },
    "example_field_3": {
      "type": [
        "string",
        "null"
      ],
      "maxLength": 128
    }
  },
  "additionalProperties": false
}

“$schema” - This argument instructs the Snowplow pipeline on how to read self-describing schemas and, in most circumstances, should be left as shown in the example.

“description” - This argument is intended as a place to put detailed information on the purpose of the schema. This is particularly helpful for others when they want to know whether a schema already exists for something they want to track.

“self” - This section of arguments contains metadata which makes the schema “self-describing”.

“vendor” - This normally refers to the company that authored the schema; most of the time it will be your company's name. It can also be used to organize schemas from different groups in your organization (e.g. com.acme.android) if you have multiple teams working on different events and contexts. Snowplow uses the reversed company internet domain for vendor names (e.g. com.snowplowanalytics).

“name” - This is the name you want to give your schema. Much like the description above, this is a good chance to help others, like a data analyst who might be consuming this data, know exactly what your schema is meant to capture.

“format” - This field simply states the format of the schema which will always be jsonschema.

“version” - As your needs for data evolve, so too will your need to update the events and contexts you are collecting through schemas. Rather than always creating brand new schemas, Snowplow allows you to increment versions. Snowplow uses SchemaVer, which is defined as MODEL-REVISION-ADDITION. It works like this:

  • New schemas always start at version 1-0-0
  • If you make a backward-compatible change to a schema, you increment the addition, i.e. 1-0-0 -> 1-0-1. A common example is adding an optional field
  • If you make a change that breaks a schema, e.g. adding a new compulsory field or changing a field type, you create a new model, i.e. 1-0-0 -> 2-0-0
  • If you widen a field so that it can have additional types, e.g. making a field that used to be an integer accept either a string or an integer, you increment the revision, i.e. 1-0-0 -> 1-1-0

When drafting and testing a schema, you might not need to keep changing your version number. However, once you've published a version to production, any changes you would like to make will require a version "bump", giving the edited schema a new version number.
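
For illustration, a minimal sketch of a backward-compatible addition to the example schema above: adding a purely illustrative, optional example_field_4 would bump the version to 1-0-1.

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for an example event, adding an optional example_field_4",
  "self": {
    "vendor": "com.snowplowanalytics",
    "name": "example_event",
    "format": "jsonschema",
    "version": "1-0-1"
  },
  "type": "object",
  "properties": {
    "example_field_1": {
      "type": "string",
      "maxLength": 128
    },
    "example_field_2": {
      "type": ["string", "null"],
      "maxLength": 128
    },
    "example_field_3": {
      "type": ["string", "null"],
      "maxLength": 128
    },
    "example_field_4": {
      "type": ["string", "null"],
      "maxLength": 128
    }
  },
  "additionalProperties": false
}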

After the self section the remainder of the schema is where you will begin describing the event or context fields that you will be collecting.

“type” - This should always be set to object.

“properties” - Here is where you describe the fields you intend to collect. This is where the idea of ensuring data quality through schema validation is built: rather than leaving the naming of fields and the values they hold to the interpretation of individuals in different disciplines, the schema clearly defines what is being collected. JSON Schema supports nested data, so you can get as specific as you need. The most common types within properties are: string, number, integer, object, array, boolean and null. For more information on types see here.
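
As a quick sketch of these types in practice (the field names here are purely illustrative), a properties block might mix them like so:

"properties": {
    "page_type": {
        "type": "string",
        "enum": ["home", "product", "checkout"],
        "description": "Which kind of page the event occurred on"
    },
    "items_in_basket": {
        "type": "integer",
        "minimum": 0,
        "maximum": 1000
    },
    "basket_value": {
        "type": ["number", "null"]
    },
    "is_logged_in": {
        "type": "boolean"
    }
}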

Example

If we take an example of collecting data for the filling out of a form there are several data points that we want to make sure we get right and therefore want to be explicit in defining their schema.

Example form on a website:

[Image: example form]

Take the first field, “First Name”. Let’s say that the form owner decides that:

  1. The name of the field should be ‘form1_first_name’ (so everyone knows how to find it in the database later)
  2. Values entered should be a string (text)
  3. The string should have a minimum of 2 characters (to avoid initials) and a maximum of 100 characters.
  4. This field is required.

Therefore, in the schema we would express these decisions as properties of the field, like so:

"properties": {
    "form1_first_name": {
        "type": "string",
        “minLength”: 2,
        "maxLength": 100
    },

For the second form field the form owner might decide:

  1. The name of the field should be “form1_contact_number”
  2. The values entered must be a number
  3. The minimum and maximum number length should be 10 digits
  4. This is an optional field

So we would add the properties to the schema like so:

"form1_contact_number": {
    "type": ["number",”null”]
    “minLength”: 10,
    "maxLength": 10
},

By adding the "null" type above, we ensure that if no data value is sent for this field, the event will still pass schema validation.

Finally, for the third form field, the decisions are as follows:

  1. The name of the field should be “opt_into_marketing”
  2. The values sent are True/False; Yes = true, No = false
  3. This field is required.

Our schema definition would be as follows:

"opt_into_marketing": {
    "type": "boolean"
},

Putting it all together, our schema for capturing the individual form field values as additional context on the event of someone submitting this form might look something like this:

{
   "$schema" : "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
   "description": "Schema for individual form field values of Form1 found on acme.com/form1",
    "self": {
        "vendor": "com.acme",
        "name": "form1_fields",
        "format": "jsonschema",
        "version": "1-0-0"
    },
    "type": "object",
    "properties": {
        "form1_first_name": {
            "type": "string",
            "minLength": 2,
            "maxLength": 100
        },
        "form1_contact_number": {
            "type": ["number", "null"],
            "minLength": 10,
            "maxLength": 10
        },
        "opt_into_marketing": {
            "type": "boolean"
        }
    },
    "additionalProperties": false,
    "required": [
       "form1_first_name",
       "opt_into_marketing"
   ]
}

We've added "additionalProperties": false. Setting this to false means any events sent with properties not defined in the schema will fail validation and be written to bad rows rather than to your data warehouse.

If set to true, such events will pass validation, but properties not in the schema will be archived rather than loaded into your data warehouse when using BigQuery or Redshift (Snowflake will load the full event). In cases where you have more control over the data collection, such as first-party sources, you may want to be stricter, whereas with third-party sources you might not want to be as strict.

We've also added the "required" argument, listing form1_first_name and opt_into_marketing in the required array. Fields specified as required must be present in the event for it to pass validation and land in your data warehouse. If an event arrives without a field specified as required, it will be written to bad rows.
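
As an illustration (with hypothetical values), a self-describing event payload like the one below would fail validation against the schema above and be routed to bad rows, because the required opt_into_marketing field is missing:

{
    "schema": "iglu:com.acme/form1_fields/jsonschema/1-0-0",
    "data": {
        "form1_first_name": "Jane",
        "form1_contact_number": 5551234567
    }
}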

Great! We now have a schema written for our form. Next is to make sure we’ve created a valid schema.

Validation

Since the schema is an integral part to validating events before they are written to your data warehouse, it is important to make sure you have written a valid schema.

The validation process checks that the schema is ready to be published. It raises two kinds of issues with the schema:

  1. Schema is invalid and cannot be used. If, for example, it is not valid JSON (e.g. there is a missing " or }), the schema will fail validation. Similarly, if it is not a valid JSON Schema (e.g. a field type is misspelt), it will fail validation. These issues need to be fixed before the schema can be published.
  2. Warnings that may indicate an issue, but can be ignored. For example, the user will be warned if a minimum and maximum value for an integer are not specified. This is useful to specify because, if you load the data into Redshift for example, this information is used to choose the correct integer type (Redshift supports three integer sizes). Providing this information in the schema is therefore best practice, although the warning can be ignored as it is not required; see the sketch after this list.
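
For instance, a minimal sketch of bounding a hypothetical integer field so that the warning is satisfied and a loader such as Redshift can pick a compact integer type:

"retry_count": {
    "type": "integer",
    "minimum": 0,
    "maximum": 32767,
    "description": "Number of times the form submission was retried"
}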

On the schema editor screen at the bottom you will see a validation button (see below). Click the button to run validation on your schema.

[Image: Validate button]

The output of the validation will let you know whether your schema is ready to be published, and may also provide additional messages about formatting.

If we continue with our form example from above and choose to validate the example schema we created we will see the following validation messages:

[Image: validation warning messages]

The warning messages here are letting us know that our properties don't have "descriptions". Descriptions within properties, like the description for the overall schema, help others who might need to use or edit your schema understand the purpose of a field and/or where it is being collected. Although in this case the warnings are not mandatory to fix, you should take them into account if they are relevant to your schema.

Let's add descriptions to the properties, like so:

"form1_contact_number": {
     "type": ["number", "null"],
     "minimum": 0,
     "maximum": 9999999999,
     "description": "This is the contact number field from form1 on acme.com/form1"
 },

When we click on the validate button again we should get a message telling us our schema is all ready to publish.

[Image: validation success message]

We can now click on the publish to development button.

[Image: Publish to development button]

Publish success message:

[Image: publish success message]

If you try to create a schema that has the same name and version number as one that already exists, you will be prompted with a warning message:

[Image: duplicate schema warning]

Keep in mind that it is possible to overwrite a schema in the Dev registry; if someone else on the team might be working on the same schema, it's best to check with them first.

Once you have successfully published your schema to the development registry, you can begin sending test events to the Snowplow Mini testing pipeline.

2. Testing Your Schema

Now that you have written your schema and published to the development registry, the next step is to test it to ensure that it works the way you want. To do that, you need to send some data to Snowplow Mini with the new event.

In order to do this, you need to initialize the Snowplow tracker you're using so that it sends data to the Snowplow Mini collector. You then track the event using the "track self describing event" or "track unstructured event" method. Below we give an example using JavaScript and another using Python.

JavaScript

For our form example we would likely be using the JavaScript Tracker to send across the necessary contexts on the submit button click event. It might look something like this:

<script>
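// A minimal sketch, assuming the standard Snowplow JavaScript tracker snippet is already
// loaded on the page: point a tracker at your Snowplow Mini collector for testing.
// 'mini.example.com' and the appId are placeholders; substitute your own Mini endpoint.
window.snowplow('newTracker', 'miniTracker', 'mini.example.com', {
  appId: 'form-test'
});
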
function form1submit(){ 
  window.snowplow('trackSelfDescribingEvent', {
    schema: 'iglu:com.acme/form1_fields/jsonschema/1-0-0',
    data: {
        form1_first_name: first_name_str,
        form1_contact_number: contact_number_int,
        opt_into_marketing: marketing_choice_bool
    }
  });
};
</script>

Once you add this code to your page and send events, you can check whether they validate against your schema by using the Kibana discovery tool.

If your event data is sent incorrectly the event will end up in the “bad rows”:

[Image: Kibana bad rows view]

If your event data is sent correctly it will end up in good:

[Image: Kibana good rows view]

Python

Here is another example using the Python tracker.

Let's say we had an application with pages users could visit to see movie posters; we could track the page view event with custom context information about the poster like this:

from snowplow_tracker import Emitter, Tracker, Subject, SelfDescribingJson

# Create a simple Emitter which will log events to http://d3rkrsqld9gmqf.cloudfront.net/i
e = Emitter("d3rkrsqld9gmqf.cloudfront.net")

# Create a Tracker instance
t = Tracker(emitters=e, namespace="cf", app_id="movieApp")

# Create a Subject corresponding to a user
s1 = Subject()

poster_context = SelfDescribingJson(
  "iglu:com.acme_company/movie_poster/jsonschema/2-1-1",
  {
    "movie_name": "Solaris",
    "poster_country": "JP",
    "poster_year": "1978-01-01"
  }
)

To fire the pageview event we would use:

t.track_page_view("http://www.films.com", "Homepage", context=[poster_context])

3. Editing Schema

If after testing you decide you’d like to make changes to your schema, you can do so by clicking on the edit icon in the list view.

Editing schemas takes on the same workflow as creating new ones as far as writing or pasting in the schema, then validating and publishing. The main difference is we provide a “diff view” showing the difference between the new draft and the previous version (see below).

[Image: schema diff view]

4. Publishing Schemas to Your Production Pipeline

Once you have tested your schema and are ready to move it to your production pipeline so that you can send live events to be validated against it, it’s time to “move” the schema into the production schema registry.

Only “administrator (admin)” users in the console have the ability to migrate a schema to the production registry.

To see which users have admin privileges for your company, check the users section, which can be found by clicking on the person icon in the upper right and then clicking "view, edit and add users". For admins, a migrate icon appears next to the pencil icon in the list of schemas in the development registry.

You will not be allowed to overwrite an existing schema in the production registry, and therefore must increase the version number in order to publish.

Once published successfully your new schema is live and ready to validate events being sent in to your pipeline!