If you were using the secure example scripts unedited in the last section, you will have created a Postgres database where all of your data is stored. Your Postgres database will contain the following standard Snowplow schemas:
- `atomic`: this is your rich, high-quality data
- `atomic_bad`: this is the data that has failed pipeline validation
Step 1. Querying your good data in Postgres
To query the good data in atomic.events, you will first need to connect to your Postgres database.
- Connect to the database using the username and password you provided when creating the pipeline, along with the db_port you noted down after the pipeline was created.
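For example, connecting with the standard psql client might look like the following (the hostname, port, username and database name are placeholders; substitute the values from your own pipeline):

```bash
# Connect to the pipeline's Postgres database; psql will prompt for the password
psql --host <pipeline-host> --port <db_port> --username <your_username> --dbname <your_dbname>
```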
- Run a query against your atomic.events table to take a look at the page view event that you generated in the previous step (where event_name = 'page_view'). You can understand more about each field in the canonical event here.
SELECT * FROM atomic.events WHERE event_name = 'page_view';
By default, there are 5 enrichments enabled, as listed below. These enrichments add extra properties and values to your events in real time as they are being processed by the Enrich application.

- Campaign attribution
- Referer parser
- Event fingerprint
- UA parser
- YAUAA

Some enrichments are legacy and therefore populate your atomic.events table. From the above list, these are the campaign attribution, referer parser and event fingerprint enrichments. The UA parser and YAUAA enrichments also add a separate entity to each event (these are also referred to as contexts, since they add additional context to the events in your atomic.events table). The contexts are loaded into separate tables.
Note: you can join these contexts back to your atomic.events table using root_id = event_id.
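As a sketch, joining the YAUAA context back onto your page view events might look like this (the context table name below is an assumption; check the tables in your atomic schema for the exact name in your pipeline):

```sql
-- Join the YAUAA context entity back to its parent events
-- (atomic.nl_basjes_yauaa_context_1 is an assumed table name)
SELECT e.event_id, e.event_name, c.*
FROM atomic.events e
JOIN atomic.nl_basjes_yauaa_context_1 c
  ON c.root_id = e.event_id
WHERE e.event_name = 'page_view';
```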
Step 2. Querying your bad data in Postgres
Your atomic_bad schema holds events that have failed to be processed by your pipeline. These are called failed events.
You will see in Postgres that you have a table called
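Because failed events are written to their own tables, you can discover what is present in your atomic_bad schema with a standard Postgres catalog query:

```sql
-- List all failed-event tables in the atomic_bad schema
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'atomic_bad';
```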
In the last section, we sent a test event that would fail to be processed by your pipeline (specifically, one that fails to validate against a schema). This is a fundamental aspect of Snowplow: ensuring that only good-quality data reaches your stream, lake and warehouse, and siphoning off poor-quality data so that you have the ability to correct and recover it.
As the custom product_view event passed through your pipeline, the Enrich application fetched the schema for the event. It does this so it can validate that the structure of the event conforms to what was defined up front, therefore ensuring it is of the quality expected. Since the schema for the product_view event doesn't yet exist in your Iglu schema registry, the event failed to validate and landed in your atomic_bad schema.
In the next section, we guide you through creating a custom schema so that your custom event validates against it and no longer becomes a failed event.
Note: you might also see adapter failure failed events in Postgres. Many adapter failures are caused by bot traffic, so do not be surprised to see some of them in your pipeline. Find out more here.
Step 3. Querying your data on S3
S3 provides an important backup of your data and can also serve as your data lake.
- Navigate to the AWS management console, search for S3 and select the S3 service.
- If you have multiple buckets on S3 already, you can navigate to the correct one by searching for the S3 bucket name that you entered when spinning up your pipeline.
When you created your pipeline, you also created three directories in your S3 bucket:

- The enriched/ directory holds your enriched data, and the bad/ directory holds the data that has failed to be validated by your pipeline. We took a look at this data in Postgres in the last step.
- The raw/ directory holds the events that come straight out of your collector and have not yet been validated (i.e. quality checked) or enriched by the Enrich application. They are Thrift records and are therefore a little tricky to decode. There are not many reasons to use this data, but backing it up gives you the flexibility to replay it should something go wrong further downstream in the pipeline.
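You can also inspect these directories from the command line with the AWS CLI (the bucket name below is a placeholder for the one you chose when spinning up your pipeline):

```bash
# List the contents of each pipeline directory in your S3 bucket
aws s3 ls s3://<your-bucket-name>/raw/ --recursive
aws s3 ls s3://<your-bucket-name>/enriched/ --recursive
aws s3 ls s3://<your-bucket-name>/bad/ --recursive
```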