
Developer FAQ

Is Snowplow real-time?

Not at present. The current pipeline is batch-based: CloudFront access logs can take up to 24 hours to arrive in S3 (see below), and the Enrichment process is typically run on a schedule ranging from hourly to daily.

Does implementing Snowplow on my site affect site performance, e.g. page load times?

Snowplow will have an impact on site performance, just as implementing any JavaScript-based tracking will impact site performance.

However, we have done everything we can to minimise the effect on site performance: by default the Snowplow JavaScript tracker is minified and hosted on Amazon CloudFront. We also recommend using the JavaScript tracker’s asynchronous tags to minimise the impact on page load times.
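
To illustrate why the asynchronous tags help, here is a minimal sketch of the command-queue pattern that async tracker tags generally rely on. All names below (`_trackerQueue`, `drainQueue`, the fake tracker) are illustrative assumptions for this sketch, not the Snowplow tracker’s actual API:

```javascript
// Sketch of the async "command queue" pattern behind asynchronous
// tracker tags (names illustrative, not the real Snowplow API).
// The page pushes commands onto a plain array immediately, so tracking
// calls never block rendering; once the minified tracker script
// (served from CloudFront) finishes loading, it drains the queue.
var _trackerQueue = _trackerQueue || [];
_trackerQueue.push(['trackPageView']); // queued before the tracker loads

// What the tracker script does when it eventually arrives:
function drainQueue(queue, tracker) {
  var results = [];
  while (queue.length > 0) {
    var cmd = queue.shift(); // e.g. ['trackPageView']
    results.push(tracker[cmd[0]].apply(tracker, cmd.slice(1)));
  }
  return results;
}

var fakeTracker = { trackPageView: function () { return 'page view tracked'; } };
var tracked = drainQueue(_trackerQueue, fakeTracker); // → ['page view tracked']
```

Because the page only ever touches a plain array, the tracking calls cost almost nothing at page-load time, however long the tracker script itself takes to download.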

Does Snowplow have a graphical user interface?

No, currently Snowplow does not have a GUI. Analysts who want to query data collected by Snowplow can use any third-party tool, such as Tableau, Chartio or PowerPivot.

We have written tutorials on using Tableau and Chartio to analyze Snowplow data.

Does Snowplow use first- or third-party cookies?

The Snowplow JavaScript tracker uses first-party cookies to track a unique user ID and the user’s session information. The CloudFront collector simply logs this data.

However, if you use the Clojure-based collector then this first-party user ID is overwritten with a unique user ID which is set server-side by the collector (i.e. a third-party cookie on the collector’s own domain). This is extremely useful for tracking users across multiple domains.

Does Snowplow scale?

Yes! In fact we designed Snowplow primarily with extreme scalability in mind. In particular:

  • All Snowplow components are designed to be horizontally scalable – e.g. to Enrich more events, just add more instances to your Elastic MapReduce cluster
  • Snowplow is architected as a loosely coupled system, to minimize the chance of performance bottlenecks
  • Snowplow is a protocol-first solution – meaning that an under-performing implementation of any component can be replaced by a more-performant version, as long as it respects Snowplow’s input/output protocols

Does Snowplow support custom variables/properties for events?

In Snowplow language, we refer to this as adding “custom context” to events (see this blog post for details).

This has not yet been implemented; our current thinking is that we will re-use our unstructured event support to allow custom context to be added to all event types in the form of arbitrary name:value properties. We are still exploring how scoping for custom context should work – for example, for the JavaScript Tracker we have identified three scopes of interest:

  1. Session-common context – context shared by all events in a session
  2. Page-common context – context shared by all events on a page (e.g. the title and URL of that page)
  3. Event-specific context – context specific to one event (e.g. time of that event)

For other trackers, there will be other scopes of interest (e.g. for a mobile app tracker, install-common context).

Because our ideas for custom context are dependent on unstructured event support, it only makes sense to add this to Snowplow after unstructured event support is finalized. Please see the related answer “When will support for unstructured events be completed?” for information on timings.

In the meantime, two successful workarounds for the lack of custom context support are:

  1. Fire additional custom structured events containing the custom context you want to track
  2. Load the custom context into your event warehouse as a separate table (e.g. a data extract from your CMS). You can then JOIN this context to your Snowplow event data using common IDs (e.g. page URLs)

How reliable is the CloudFront collector?

To write.

How long do CloudFront access logs take to arrive in S3?

Thanks to Gabor Ratky for this answer:

CloudFront logs arrive at varying times, and it is normal for them to arrive with some delay.

As a rule of thumb (one that others have stated as well), 95% of the logs arrive within 3 hours and ~100% arrive within 24 hours, so you should take this into consideration when scheduling your ETL process and querying the resulting data.

If you run your daily ETL at 6am UTC, you will have near-100% of the events for the previous day (UTC). It is recommended that you do not query or use data from the same day unless it is for investigation purposes.
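
The scheduling arithmetic above can be sketched as a small helper that, given the roughly 24-hour worst case and a 6am UTC ETL, reports the most recent UTC day whose data should be treated as complete. The function and its cutoff are illustrative, not part of Snowplow:

```javascript
// Sketch: which UTC day is safe to query, given a daily ETL at
// 06:00 UTC and the ~24-hour worst-case log delay quoted above?
// (Illustrative helper, not part of Snowplow.)
function latestCompleteDay(nowUtc) {
  var etlHour = 6; // the daily ETL run time, in UTC
  var d = new Date(nowUtc.getTime());
  if (d.getUTCHours() < etlHour) {
    // Today's ETL has not run yet, so the freshest fully-processed
    // day is the one covered by yesterday's run: two days back.
    d.setUTCDate(d.getUTCDate() - 2);
  } else {
    // Today's ETL has run, so yesterday (UTC) is near-complete.
    d.setUTCDate(d.getUTCDate() - 1);
  }
  return d.toISOString().slice(0, 10); // 'YYYY-MM-DD'
}

latestCompleteDay(new Date('2013-03-10T07:00:00Z')); // → '2013-03-09'
```

Note that this treats a day as complete only once the following day’s ETL has run, giving straggling log files their full window to land in S3.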

Is Snowplow IPv6 compliant?

IPv6 (Internet Protocol version 6) is a revision of the Internet Protocol (IP) which allows for far more addresses to be assigned than with the current IPv4.

At the moment the CloudFront-based collector is not IPv6 compliant, because Amazon CloudFront is not yet IPv6 compliant. However, the Clojure-based collector running on Elastic Beanstalk is IPv6 compliant.

How often can I run the Enrichment process?

Many Snowplow users simply schedule the Enrichment process to run overnight, so that they have yesterday’s latest data ready for them when they get to the office.

However, if you require better data recency, you can run the Enrichment process more often. Some users run the job every 4 or 6 hours, and we know of at least one company running the process every hour.

As you increase run frequency towards the every-hour mark, there are some important things to bear in mind:

  • Do make sure that your Enrichment process can comfortably finish within the 1-hour window: currently, if the next Enrichment process starts before the last one has finished, things will break (see #195 for details)
  • Be aware that more frequent runs increase the chance of running into the Elastic MapReduce “failing to launch” issue every few days, which is not yet resolved (see #195 for details)

What’s next on the roadmap?

Plenty! Check out our Discourse for details.

How can I contribute to Snowplow?

The Snowplow team welcomes contributions! The core team (Snowplow Analytics Ltd) is small, so we would love more people to join in and help realise our objective of building the world’s most powerful analytics platform. Stay tuned for a more detailed update on how best you can contribute to Snowplow.

Question not on this list?

Get in touch with us and ask it! See our website for contact details.