An introduction to Snowplow

Snowplow self-identifies as a "best in-class data collection platform". We explore what this means, discussing what Snowplow does do and what it does not do. Get to know Snowplow's architecture, and whether it is a tool you need.

11 Mar 2020 · 16 min read · 3142 words

Snowplow truck at night Image by skeeze from Pixabay

Here’s a common scenario: you run a website and would like to know how your users use it. You look into Google Analytics 360. Do you want to give your business the tools you need to better understand its customers? Of course you do! Access to raw event data, insights based on all information as opposed to just a fraction of it? Sounds good! As you are getting ready to sign up, you notice Analytics 360’s price tag: starting at $150k / year. Gulp. What now? Certainly not all businesses can justify such a flabbergastingly high price.

Google Analytics 360 pricing screenshot

An alternative to Google Analytics that often pops up is Snowplow. Snowplow is a data collection platform actively developed by Snowplow Analytics. Snowplow can “collect” many kinds of telemetry data, but has a special place in its heart for clickstream data, offering many features relevant for web tracking out of the box.

Snowplow Analytics provides its platform in various formats, including as a managed service called Snowplow Insights. All of its core components are open-source and can be used for free in a build-it-yourself, self-managed way. You can have your own production-ready, scalable real-time data ingestion pipeline running in the public cloud for around $200 per month. Smaller deployments or dev deployments are possible too (using Snowplow Mini) at around $40 per month.

Of course, such a simple price-based comparison of Snowplow and Google Analytics is not useful. The overlap between these products is actually surprisingly small. As someone who just deployed Snowplow to Google Cloud, I know it took me a while to figure out whether Snowplow fulfilled our needs. This story introduces you to what Snowplow is, and whether it can be of use to you.

Compared to other posts, such as the Snowplow for Media series, this story emphasises what Snowplow actually does by diving into its architecture and deployment, instead of focussing on the value of the data that its trackers generate. With a good understanding of this architecture you can make up your own mind, and decide whether Snowplow should be a part of your data analytics efforts too!

Data collection platform

What exactly is a data collection platform? At its core, Snowplow consists of a processing pipeline capturing, cleaning, enriching and saving all the information that is presented to it through calls to an HTTP endpoint. GET and POST requests go in the pipeline, and out comes structured data in blob storage or a queryable database. Snowplow also comes with a number of utilities such as a JavaScript web tracker and tracking SDKs that will generate these HTTP calls in response to actions taken by your users on your website or in your app, for example in response to page views or page clicks.
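
To make the "HTTP requests go in" part a bit more concrete, here is a minimal sketch of the kind of GET request a tracker could send for a page view. It assumes the collector's /i endpoint and a handful of parameters from the Snowplow tracker protocol (e, p, tv, aid, url, page). The collector host, the app id and the tracker version string are made up for this example, and a real tracker sends many more fields.

```javascript
// Sketch: build a page view request URL in the style of the Snowplow tracker protocol.
// The host, app id and tracker version used below are hypothetical.
function buildPageViewUrl(collectorHost, { appId, pageUrl, pageTitle }) {
  const params = new URLSearchParams({
    e: "pv", // event type: page view
    p: "web", // platform
    tv: "sketch-0.0.1", // tracker version (made up)
    aid: appId, // application id
    url: pageUrl, // URL of the page being viewed
    page: pageTitle, // page title
  });
  return `https://${collectorHost}/i?${params.toString()}`;
}

const requestUrl = buildPageViewUrl("collector.example.com", {
  appId: "my-app-id",
  pageUrl: "https://www.example.com/pricing",
  pageTitle: "Pricing",
});
console.log(requestUrl);
```

Everything the pipeline later knows about this page view has to be derived from a request like this one, plus its HTTP headers.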

Platforms concept illustration Remember: platforms ≠ platformers, pipes ≠ pipelines

What Snowplow does do

Snowplow started out as a web analytics platform, only supporting tracking with their web tracker and their own tracker protocol. Now, it aims to be a one-stop shop for all your event data collection efforts. Want to capture subscribe events to your mailing list backed by MailChimp? Snowplow has you covered! Capturing custom events generated by your own application? That’s no problem either! Website tracking almost seems boring when you think of all the internet of things (IoT) data you could be getting your hands on. Snowplow allows you to define and use custom event schemata. It accepts data from everywhere, while providing direct access to every bit of raw data it collects. Snowplow easily scales up to thousands of collected events per second.
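
As an illustration of what a custom event could look like: Snowplow's custom events are self-describing JSON, meaning the payload carries a reference to the schema it claims to follow. The sketch below assumes the Iglu schema URI layout (iglu:vendor/name/format/version). The vendor com.example, the newsletter_subscribe event and its fields are hypothetical.

```javascript
// Sketch: wrap custom event data in a self-describing JSON envelope.
// The vendor "com.example" and the event "newsletter_subscribe" are hypothetical.
function selfDescribing(vendor, name, version, data) {
  return { schema: `iglu:${vendor}/${name}/jsonschema/${version}`, data };
}

// Split a schema URI back into its parts; returns null if it does not match.
function parseSchemaUri(uri) {
  const match = /^iglu:([^/]+)\/([^/]+)\/([^/]+)\/(\d+)-(\d+)-(\d+)$/.exec(uri);
  if (!match) return null;
  const [, vendor, name, format, model, revision, addition] = match;
  return {
    vendor,
    name,
    format,
    model: Number(model),
    revision: Number(revision),
    addition: Number(addition),
  };
}

const subscribeEvent = selfDescribing("com.example", "newsletter_subscribe", "1-0-0", {
  listId: "weekly-digest",
  source: "footer-form",
});
console.log(JSON.stringify(subscribeEvent));
console.log(parseSchemaUri(subscribeEvent.schema));
```

The schema reference is what allows the pipeline to validate the data part later on, without knowing anything about your application up front.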

Vintage stamp collection Collecting stamps is out, collecting data is in.

What Snowplow does not do

Obtaining “insights” from this raw, structured data, however, requires you to put in some elbow grease. Snowplow does little to process the data it ingests for you, and if you expect it to present you with easily-interpretable graphs or enlightening statistics you will probably be disappointed. A basic installation of open-source Snowplow collects data really well, but does little beyond collecting, cleaning and saving data in a structured format.

Elbow grease product advertisement While out for elbow grease, don’t forget to stock up on headlight fluid.

Snowplow does not come with a GUI. When collecting clickstream data, it simply provides you with atomic events, such as page click, page view or page ping events; no “user flows” or derived statistics like scroll depth. This makes sense knowing that Snowplow is not limited to ingesting clickstream data. Snowplow is therefore also not a true drop-in replacement for Google Analytics. In fact, it totally makes sense to run Snowplow and Google Analytics in parallel!

You’re not left completely to your own devices either. Snowplow does have a number of analytics SDKs in place that can help you analyse the data it generates. There’s also the Snowplow web data model project helping specifically with the analysis of clickstream data by grouping atomic tracking events into browsing sessions.

Snowplow’s processing pipeline

So, Snowplow collects all kinds of event data, processes them and then saves them by letting the data flow through “a pipeline”. Let’s make this process a bit more tangible. The Snowplow processing pipeline looks like this:

Snowplow processing pipeline diagram Snowplow processing pipeline.

This section discusses how data flows through the pipeline, and what interfaces connect the different components of the pipeline.

Collector component

Data generated by a Snowplow tracker (e.g. the Snowplow JavaScript tracker), a webhook or a call from one of the Snowplow tracker SDKs hits the collector component. The collector is a basic web server open to HTTP requests that encodes and publishes all incoming data on a message bus. If the request does not include a cookie identifying the user, the collector also generates a random identifier and returns it as a cookie in the Set-Cookie HTTP response header.
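
The cookie logic described above can be sketched in a few lines. This is a simplified stand-in, not the collector's actual code: the cookie name sp, the one-year lifetime and the way the identifier is generated are all assumptions for this example.

```javascript
// Sketch of the collector's cookie handling. The cookie name "sp", the
// lifetime and the id generator are assumptions made for this example.
function parseCookies(header = "") {
  return Object.fromEntries(
    header
      .split(";")
      .map((pair) => pair.trim())
      .filter((pair) => pair.includes("="))
      .map((pair) => {
        const index = pair.indexOf("=");
        return [pair.slice(0, index), pair.slice(index + 1)];
      })
  );
}

// Toy identifier, good enough for a demo but not cryptographically strong.
function randomId() {
  return Math.random().toString(16).slice(2) + Date.now().toString(16);
}

// Returns the user id to attach to the event, plus any response headers to send.
function identifyUser(requestHeaders, cookieName = "sp", generateId = randomId) {
  const existing = parseCookies(requestHeaders.cookie)[cookieName];
  if (existing) return { userId: existing, responseHeaders: {} };
  const userId = generateId();
  return {
    userId,
    responseHeaders: {
      "Set-Cookie": `${cookieName}=${userId}; Max-Age=31536000; Path=/`,
    },
  };
}

console.log(identifyUser({})); // new visitor: gets a Set-Cookie header
console.log(identifyUser({ cookie: "sp=abc123" })); // returning visitor: id is reused
```

Because the cookie is set by the server, it ends up on the domain the collector runs on, which is what makes tracking across sessions possible.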

Enrichment component

The enrichment component is a subscriber to this message bus implementing the Snowplow enrichment process. During this process, Snowplow validates incoming data, verifying that it is specified in a protocol that it understands. It then extracts event properties and enriches events. At the end of the enrichment process, events adhere to the Snowplow canonical event model. Enriched events are published on another message bus.
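
In code, the validate, extract and enrich steps could look roughly like the sketch below. It is a toy: the only validation is a check on the event type, and the only enrichment is deriving the host from the page URL. The field names (event, app_id, page_url, page_urlhost) follow the canonical event model, while the shape of the result objects is my own invention.

```javascript
// Sketch of the enrichment step: validate, extract, enrich, route.
// Only four tracker protocol event types are recognised in this toy version.
const EVENT_TYPES = new Map([
  ["pv", "page_view"],
  ["pp", "page_ping"],
  ["se", "struct"],
  ["ue", "unstruct"],
]);

function enrich(payload) {
  const eventType = EVENT_TYPES.get(payload.e);
  if (!eventType) {
    return { ok: false, error: `unknown event type: ${payload.e}`, payload };
  }
  const event = {
    event: eventType,
    app_id: payload.aid ?? null,
    page_url: payload.url ?? null,
  };
  // Toy enrichment: derive the host from the page URL, if there is a valid one.
  try {
    event.page_urlhost = new URL(payload.url).hostname;
  } catch (error) {
    event.page_urlhost = null;
  }
  return { ok: true, event };
}

// Route results to a "good" and a "bad" stream, like the two output buses.
function routeAll(payloads) {
  const good = [];
  const bad = [];
  for (const payload of payloads) {
    const result = enrich(payload);
    if (result.ok) good.push(result.event);
    else bad.push(result);
  }
  return { good, bad };
}

console.log(
  routeAll([
    { e: "pv", aid: "my-app-id", url: "https://www.example.com/pricing" },
    { e: "not-an-event-type" },
  ])
);
```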

The enrichment component is without a doubt the most complex and interesting part of Snowplow. If you want to know more about it, check out my follow-up story: Enrichment and batch processing in Snowplow.

Storage component

The storage component subscribes to the message bus to which the enrichment component publishes. It persists messages in blob storage or a queryable data store such as BigQuery or Redshift. If the target storage is a (structured) database, event properties map onto columns.

Component implementations

Components have multiple compatible implementations, allowing you to use those that best suit your needs. Some implementations use cloud-native technology specific to a particular public cloud, such as the storage component Snowplow BigQuery Loader interfacing with BigQuery on GCP and running on Cloud Dataflow. Others are built on open-source technology and can easily be deployed on your own hardware, such as the Scala Stream Collector which runs on the JVM and can push messages not only to AWS Kinesis or Cloud Pub/Sub but also to Apache Kafka.

Component interfaces

What do the messages used for communication between components look like?

The collector’s interface is arguably the most important one, as it is outward-facing. The collector accepts all HTTP requests, but only requests that implement a known protocol make it through enrichment. A known protocol is either the Snowplow tracker protocol, one of the formats natively supported by Snowplow-provided collector adapters, or any format for which you implemented your own remote HTTP adapter. Among others, Snowplow provides adapters for self-describing JSON and for Google’s Measurement Protocol, used by Google Analytics.
The interface between the collector and the enrichment component consists of HTTP headers and payloads encoded by Apache Thrift using the payload scheme embedded in this self-describing JSON schema.
The interface between the enrichment component and the storage component consists of canonical events encoded in TSV (tab-separated value) format (no Thrift!); some values contain JSON. This interface is mostly undocumented.
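
Since the last interface is the least documented one, here is a rough sketch of how you could read such a TSV line yourself. The real format has far more columns; the five column names below are a shortened, illustrative list, and the sample line is made up.

```javascript
// Sketch: read one enriched event line. The real format has many more columns;
// the five names below are a shortened, illustrative column list.
const COLUMNS = ["app_id", "platform", "collector_tstamp", "event", "contexts"];

function parseEnrichedLine(line, columns = COLUMNS) {
  const values = line.split("\t");
  const event = {};
  columns.forEach((column, index) => {
    const raw = values[index];
    // Empty fields stand for missing values.
    event[column] = raw === undefined || raw === "" ? null : raw;
  });
  // Some fields carry JSON, which needs a second parsing step.
  if (event.contexts !== null) event.contexts = JSON.parse(event.contexts);
  return event;
}

const sampleLine = [
  "my-app-id",
  "web",
  "2020-03-11 10:15:00.000",
  "page_view",
  '{"data":[{"schema":"iglu:com.example/page/jsonschema/1-0-0","data":{"section":"news"}}]}',
].join("\t");

console.log(parseEnrichedLine(sampleLine));
```

The position of a value in the line is the only thing that tells you which property it belongs to, so the column list has to match the version of the pipeline that produced the line.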

To stream or not to stream

Some Snowplow components do not implement the (streaming) interfaces as presented above. These components process data in batches. The short of it is that you should ignore batch implementations and only bother getting to know the streaming processing pipeline presented above, as Snowplow’s batch processing pipeline has been deprecated. If for some reason you would still like to know more about batch processing, check out my story.

Snowplow’s output

Snowplow’s enriched canonical events can be stored in blob storage or a structured database. The precise format of the data that exits your pipeline depends on the storage component you use. Component implementations optimize for the properties of the target data store.

Snowplow’s output in BigQuery

When dumping the data into BigQuery, all events are dumped into a single big events table. After initialization, before even adding a single event, this events table already has about 128 properties!

BigQuery events table schema screenshot Part of the events table’s schema

As you can see, all properties are marked as NULLABLE. (In fact some are legacy and equal NULL for all of my records. Before Snowplow started using an extensible event type scheme, they would just add columns to a fat table with columns relevant for web analytics. The columns of this fat table define what is now an “atomic event”.) Because BigQuery is a columnar data store, columns that are always NULL do not actually cost anything: adding columns does not increase the size of your table or decrease the speed of querying. Adding events of a new type adds even more columns to the events table.

Additional BigQuery columns screenshot Additional columns / field types that were added to the events table after ingesting new types of data (link clicks and focus form events)

New columns will be NULL for all older records, and also for all newly inserted records of a different type.

Snowplow’s output in other data stores

Not all data stores have the same properties as BigQuery. Consequently, not all data stores use the same “fat table”-approach as the BigQuery storage component.

For example, when storing data in Redshift on AWS, every event results in a record in a table atomic.events. Inserting data of a new type results in the creation of a new table; entries in this table can be joined to atomic.events. Snowplow refers to this event-splitting procedure as shredding.
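
To get a feel for what shredding does, consider the simplified sketch below. It splits one event into a parent row and one row per attached entity, linked through the event id. The table naming rule and the root_id join column mimic what I have seen in Redshift, but treat them, and the com.example schema, as assumptions rather than a specification.

```javascript
// Sketch of shredding: one parent row plus one row per attached entity,
// linked by the event id. Table and field names are simplified.
function tableNameFor(schemaUri) {
  const [vendor, name, , version] = schemaUri.replace("iglu:", "").split("/");
  const model = version.split("-")[0];
  return `atomic.${vendor.replace(/\./g, "_")}_${name}_${model}`;
}

function shred(event) {
  const { event_id, contexts = [], ...atomic } = event;
  const tables = { "atomic.events": [{ event_id, ...atomic }] };
  for (const context of contexts) {
    const table = tableNameFor(context.schema);
    tables[table] = tables[table] || [];
    tables[table].push({ root_id: event_id, ...context.data });
  }
  return tables;
}

const shredded = shred({
  event_id: "e-1",
  event: "unstruct",
  app_id: "my-app-id",
  contexts: [
    {
      schema: "iglu:com.example/link_click/jsonschema/1-0-0",
      data: { targetUrl: "https://www.example.com/out" },
    },
  ],
});
console.log(shredded);
```

Getting the full picture of an event back then means joining the type-specific table to atomic.events on the event id.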

Why to plow snow

Snowplow takes HTTP requests and saves them to a data store in near real-time. That does not sound particularly exciting; many cloud datastores have an HTTP front-end too, and can’t you just dump Pub/Sub events to Cloud Storage automagically? Why bother with Snowplow?

Comparing Snowplow to other offerings

Person giving thumbs up gesture Validation is important.

Compared to generic telemetry data gathering offerings, such as Amazon Kinesis (Data Analytics) or Azure Application Insights, Snowplow offers you the advantage of validation and enrichment of incoming data. Only clean data enters your database! When tracking web applications, Snowplow also sets cookies from the server side on your own domain allowing for reliable tracking across browsing sessions. However, Snowplow is more complex to install and keep running, and relatively expensive if you are processing few events.

Compared to applications that offer advanced processing of specific types of data, Snowplow offers you custom extensibility, access to the raw data, the ability to ingest many different kinds of data, and a low price. Examples of such specialized services are Google Analytics or Matomo for web analytics, and applications such as New Relic, Datadog or honeycomb.io for application monitoring. However, unlike these specialized programs, Snowplow does not really further process or analyse your data; it does not even offer you a GUI. Comparing these programs to Snowplow is not always useful.

Some reasons to use Snowplow

You want to gather telemetry data but also remain in control of your processing pipeline. You like that Snowplow is open source and do not want to tie yourself to cloud-specific offerings.
None of the data type-specific offerings do exactly what you want. You want full customization. You want raw data. Snowplow offers you a good base to build on.
You want to perform web tracking using Snowplow’s excellent JavaScript tracker and leverage Snowplow’s event model, which is well-suited for web tracking events.
Snowplow ingests, enriches and stores data in (near-)real-time. This is especially useful if you use the collected data for fraud detection or similar applications.
You want to perform custom enrichment but do not need the full power and complexity of full-fledged stream processing platforms such as Apache Flink or Apache Beam. Snowplow makes it comparatively easy to build your own enrichments by implementing a JavaScript function or by implementing an HTTP (micro)service.
You want to keep on using Google Analytics, but want access to all raw data without paying $150k / year. You can do this very easily by siphoning off all the data you send to Google to your Snowplow collector using Snowplow’s plugin for the Google Analytics tracker, because Google’s Measurement Protocol is one of the protocols supported by a Snowplow adapter.
You like using one of the analytics SDKs, available in many programming languages, to analyse your data.
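
To give an idea of how small a custom JavaScript enrichment can be, here is a sketch. As far as I know, Snowplow's JavaScript enrichment expects a function called process that receives the enriched event and returns an array of self-describing JSONs; I named it processEvent here only so that it does not shadow Node's global process object when you try it locally. The getter getApp_id, the com.example schema and the "-dev" naming rule are assumptions for this example, and the script engine inside the pipeline may only accept older JavaScript syntax than what is used below.

```javascript
// Sketch of a custom JavaScript enrichment. Inside Snowplow the function would
// be called process(event); the schema URI and the "-dev" rule are hypothetical.
function processEvent(event) {
  const appId = event.getApp_id();
  if (appId === null || appId === undefined) return [];
  return [
    {
      schema: "iglu:com.example/app_environment/jsonschema/1-0-0",
      data: {
        environment: appId.endsWith("-dev") ? "development" : "production",
      },
    },
  ];
}

// Local stand-in for the enriched event object that the pipeline would pass in.
const fakeEvent = { getApp_id: () => "my-app-id-dev" };
console.log(JSON.stringify(processEvent(fakeEvent)));
```

Compare this to writing and operating a dedicated streaming job for the same purpose, and the appeal of this extension point becomes clear.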

Some reasons not to use Snowplow

You prefer not to deploy your own infrastructure, either because you do not have the expertise or because you do not want to maintain it. You can still use Snowplow, but should look into using the managed Snowplow platform Snowplow Insights.
You do not have the expertise, time or interest to analyse your data. By itself, the raw data that Snowplow generates is not very useful. You can look into more specialized applications, such as Matomo or Google Analytics for web tracking. There’s also support for analyzing Snowplow data in Looker.
Snowplow only collects event data, not other types of telemetry data. New Relic identifies telemetry data as belonging to one of four categories: metrics, events, logs and traces, or M.E.L.T for short. To collect metrics with Snowplow, you will need to take care of aggregation yourself. To do (distributed) tracing, you will need to add trace contexts to events yourself. Sending raw logs to Snowplow is not recommended.
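
"Taking care of aggregation yourself" can be as simple as the sketch below, which turns raw events into a per-minute count. The event shape ({ event, timestamp }) is a simplification I made up; in practice you would more likely run an equivalent GROUP BY query in your data store.

```javascript
// Sketch: aggregate raw events into a per-minute count metric.
// Timestamps are assumed to be ISO 8601 strings in UTC.
function countPerMinute(events) {
  const counts = new Map();
  for (const { event, timestamp } of events) {
    // Truncate the timestamp to the minute, e.g. "2020-03-11T10:15".
    const minute = new Date(timestamp).toISOString().slice(0, 16);
    const key = `${event}@${minute}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const pageViewCounts = countPerMinute([
  { event: "page_view", timestamp: "2020-03-11T10:15:02Z" },
  { event: "page_view", timestamp: "2020-03-11T10:15:40Z" },
  { event: "page_view", timestamp: "2020-03-11T10:16:01Z" },
]);
console.log(pageViewCounts);
```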

Deploying Snowplow

So, what does a deployment of Snowplow actually look like?

Our deployment to Google Cloud Platform

Our current deployment looks like this:

Snowplow deployment architecture on Google Cloud Platform Our current Snowplow-deployment on GCP

Let’s walk through this figure:

Our only data source (currently) is the Snowplow JavaScript tracker, embedded on one of our websites. This component generates HTTP requests to the collector adhering to the Snowplow Tracker Protocol.

// Using the JavaScript tracker is as simple as embedding a tracking tag on your website
!(function (e, o, n, t, s, r) {
  (e[t] = e[t] || []),
    (r = o.createElement(n)),
    (r.async = 1),
    (r.src = "//cdn.jsdelivr.net/gh/snowplow/sp-js-assets@2.10.2/sp.js"),
    (r.onload = function () {
      e[t].push("snowplow", "newTracker", "cf", "{{COLLECTOR_URL}}", {
        appId: "my-app-id",
      });
    }),
    o.head.appendChild(r);
})(window, document, "script", "snowplow");

We deployed the Scala Stream Collector to Cloud Run, a fully managed, serverless compute platform that allows us not to worry about scaling or availability. Note that when ingesting large numbers of events (millions per day on average), running the collector on App Engine or Compute can be a lot cheaper.
When an HTTP request hits the collector, it publishes messages to a Pub/Sub topic collector_good.
We use the enrichment implementation Beam Enrich, which runs on Dataflow. Dataflow is also a managed service offering auto-scaling out of the box.
Messages that pass validation in Beam Enrich are enriched and then published to enriched_good. Messages read from collector_good that do not pass validation are published to enriched_bad. We currently never read from the bad topic.
We use the storage component implementation BigQuery Loader, which also runs on Dataflow. Loader subscribes to enriched_good and inserts events into a single big table in BigQuery, events. It also publishes the event type (~ the names and types of the properties in the events) onto a Pub/Sub topic types. When insertion in BigQuery fails, events are published on a Pub/Sub topic bq_bad_rows.
Insertion usually fails because events have properties for which no columns exist in the events table. A Scala application running on a compute instance, the BigQuery Mutator, watches the types topic and adds columns to the events table when necessary. We could re-process the events in bq_bad_rows, but currently never do; the first couple of events of a new type are therefore always lost.
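
The interplay between the loader and the mutator is easier to see in a toy model. In the sketch below, plain arrays and a Set stand in for BigQuery and the Pub/Sub topics, and events are flat objects. None of this is the actual Loader or Mutator code; it only mimics the behaviour described above.

```javascript
// Sketch of the loader/mutator interplay, with in-memory stand-ins for the
// events table and the "types" and "bq_bad_rows" topics.
function createPipeline(initialColumns) {
  const state = {
    columns: new Set(initialColumns),
    rows: [],
    types: [],
    badRows: [],
  };

  // Loader: announce the event's properties, then try to insert the event.
  function load(event) {
    const names = Object.keys(event);
    state.types.push(names);
    const unknown = names.filter((name) => !state.columns.has(name));
    if (unknown.length > 0) state.badRows.push(event);
    else state.rows.push(event);
  }

  // Mutator: drain the types topic and add any missing columns.
  function mutate() {
    for (const names of state.types.splice(0)) {
      names.forEach((name) => state.columns.add(name));
    }
  }

  return { state, load, mutate };
}

const pipeline = createPipeline(["event_id", "event"]);
pipeline.load({ event_id: "1", event: "link_click", target_url: "https://example.com/a" });
pipeline.mutate();
pipeline.load({ event_id: "2", event: "link_click", target_url: "https://example.com/b" });
console.log(pipeline.state.badRows.length, pipeline.state.rows.length); // 1 1
```

The first link click arrives before the column exists and ends up in the bad rows; only after the mutator has caught up does the second one get inserted.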

Your own deployment

If you want to deploy Snowplow to GCP yourself, the official installation instructions and this blog post by Simo Ahava are excellent resources. You may also want to check out this or this collection of deployment scripts. I decided to forgo these scripts in favor of Terraform combined with some custom scripts of my own.

There are also guides for installing Snowplow on AWS. Currently, there’s no official support for Azure, and no Azure-specific component implementations.

If our deployment looks complicated or expensive to you, or if you are just looking to explore Snowplow’s capabilities, you should take a look at Snowplow Mini. Mini implements all necessary components (and more) in a single image that can be deployed to a single virtual machine. It allows you to run a full Snowplow stack on GCP for around $40 / month. Snowplow Mini is not recommended for production use as it is neither scalable nor highly available.

Any mistakes or something you feel I missed? Let me know in the comments!

I work at Data Minded, an independent Belgian data analytics consultancy, and this is where I document and share my learnings from deploying Snowplow at Publiq.

Publiq is a non-profit organisation managing a database of activities in Flanders, Belgium. As part of an exciting project that will make Publiq more data-driven, we investigate using clickstream data to improve the quality of recommendations.