Validating Clickstream Events

At Turo, we track user interactions (clickstream events) on our website & mobile apps using Segment.

Segment not only handles the infrastructure and client libraries for us; it also helps us connect data sources to data destinations.

Below is an illustration of their main product:

(source: segment.com)

The data collected is used by multiple teams throughout Turo:

  • Analytics
  • Data Science
  • Product
  • Business Development
  • Finance

This data represents a vast source of knowledge to improve our product. It lets us run A/B tests, perform funnel analysis, and train machine learning models.

Wait, what’s an event?

It’s just a JSON payload; here is an example:

{
  "anonymousId": "23adfd82-aa0f-45a7-a756-24f2a7a4c895",
  "context": {
    "library": {
      "name": "analytics.js",
      "version": "2.11.1"
    },
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
    "ip": "108.0.78.21"
  },
  "event": "search_event",
  "integrations": {},
  "messageId": "ajs-f8ca1e4de5024d9430b3928bd8ac6b96",
  "properties": {
    "city": "San Francisco",
    "country": "USA",
    "pickup_date": "2018-12-12T19:11:01.266Z",
    "drop_off_date": "2018-12-13T17:00:00.266Z",
    "business_class": true
  },
  "receivedAt": "2015-12-12T19:11:01.266Z",
  "sentAt": "2015-12-12T19:11:01.169Z",
  "timestamp": "2015-12-12T19:11:01.249Z",
  "type": "track",
  "userId": "AiUGstSDIg",
  "originalTimestamp": "2015-12-12T19:11:01.152Z"
}

As you can see in the properties object, we track the filters used in a search. Later on, we can analyze how the different filters are used and drill down further (number of searches per market or city, etc.).
The other attributes are metadata collected by the Segment library.

Problem

As the company and our needs grow, it becomes harder for us to maintain these events.

Just to give you an idea, we currently track ~400 distinct events in our product, and ~8 million events are sent to Segment every day.

However, not all of these events have the same impact and importance. A bug introduced in the booking flow will get more attention than a bug on a peripheral feature implemented 2 years ago.

If a product manager wants to revamp a feature but finds that the data has changed and a bug has been introduced, the analysis becomes nearly impossible.

Inconsistency

Humans are prone to errors. An event defined in a certain way could be implemented in a different way.

Here are some examples:

  • Setting 0 or 1 for a boolean doesn’t sound too bad on iOS, but other clients will set True or False. To deal with this issue, Segment will, by default, convert everything to a string so it can still load into the data warehouse. However, every analysis will require some extra work.
  • New releases can introduce side effects that affect the data. We don’t have strong tests on all platforms.

Suddenly, without anyone realizing it, a bug like this impacts multiple teams across the organization, and trust in the data diminishes.
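To make the "extra work" from the first example concrete, here is a minimal sketch of normalizing a boolean property that reached the warehouse as a string. The function name normalize_bool and the accepted spellings are assumptions for illustration, not our actual transformation code.

```python
from typing import Optional

# Spellings assumed for illustration; real clients may send others.
_TRUE_VALUES = {"1", "true"}
_FALSE_VALUES = {"0", "false"}


def normalize_bool(raw: Optional[str]) -> Optional[bool]:
    """Map a stringified boolean ("1", "True", "false", ...) to a real bool.

    Returns None when the value is missing or not a recognized spelling.
    """
    if raw is None:
        return None
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    return None
```

Every query or notebook touching such a column needs an equivalent of this step, which is exactly the kind of repeated effort we wanted to remove.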

Solution

It became a recurring problem, so we needed to take action.

Here is what we wanted to accomplish:

  • refactor the process around how we define/create events as a whole
  • be able to create/update events with confidence
  • detect discrepancies/bugs early on
  • monitor events
  • alert teams

Build vs Buy

After forming a rough idea of what we wanted to do, we explored different technologies:

Apache Beam got our attention because we could reuse its SDK for other data processing use cases. There is no vendor lock-in: it was open-sourced by Google, and you can run it on multiple runners such as Spark, Dataflow, and Flink. The combination of Kinesis and Lambda was an alternative that we looked into as well. Our main concern was the lack of full integration with Segment: we would have to serialize the events into Parquet and then load them into Redshift using batch jobs.

Protocols come into play

Turo has been a Segment customer for a couple of years, and we were invited to the beta of their new product, Protocols. It’s just a fancy name for JSON data validation. Of course, it’s not free, but the price is reasonable. There is always the “build vs buy” dilemma, but in this case, we didn’t think twice.

Data validation wouldn’t be hard to build, but we would be developing and maintaining a product that already exists and is already integrated with Segment. Despite the temptation to develop this feature ourselves, we must allocate our time to more important projects.

The idea behind Protocols is simple: you define your event as a JSON schema, and every incoming event is validated against that schema.

Here is an example of a library doing it in Python:
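The snippet below is a minimal sketch using the open-source jsonschema package. The choice of library is an assumption (any JSON Schema validator would do), and the schema and sample events are made up for illustration.

```python
from jsonschema import ValidationError, validate

# Illustrative schema: "price" must be an integer, "name" a string.
schema = {
    "type": "object",
    "properties": {
        "price": {"type": "integer"},
        "name": {"type": "string"},
    },
}


def is_valid(event: dict) -> bool:
    """Return True if the event matches the schema, False otherwise."""
    try:
        validate(instance=event, schema=schema)
        return True
    except ValidationError:
        return False


print(is_valid({"name": "Eggs", "price": 34}))         # True: integer price
print(is_valid({"name": "Eggs", "price": "Invalid"}))  # False: string price
```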

As you can see, the price property expects an integer, so validation fails if a string is given. You can even specify more granular rules like:

  • regex
  • ranges for numerical values
  • string length
  • nested properties
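A sketch of what such rules can look like, again assuming the jsonschema package. The property names, the pattern, and the bounds are invented for illustration and are not taken from our tracking plan.

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        # regex: exactly three uppercase letters
        "country": {"type": "string", "pattern": "^[A-Z]{3}$"},
        # range for a numerical value
        "price": {"type": "integer", "minimum": 0, "maximum": 10000},
        # string length
        "city": {"type": "string", "minLength": 1, "maxLength": 100},
        # nested properties
        "vehicle": {
            "type": "object",
            "properties": {"year": {"type": "integer", "minimum": 1900}},
            "required": ["year"],
        },
    },
    "required": ["country", "price"],
}

validator = Draft7Validator(schema)


def error_paths(event: dict) -> list:
    """Return the sorted property paths that fail validation."""
    return sorted(
        "/".join(str(part) for part in error.path)
        for error in validator.iter_errors(event)
    )
```

Collecting every failing path with iter_errors, instead of stopping at the first error, makes it easier to report all the problems in an event at once.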

I recommend checking out this website to learn more about JSON validation: https://json-schema.org/understanding-json-schema/index.html

We are enforcing more granular rules on new and existing schemas. For existing events, we can test new constraints without impacting our production environment using repeaters.

Data pipelines

Previous architecture
Every client pushes to a single source in Segment

Current architecture
Multiple sources with repeaters

Each platform has its own Segment source and we regroup them into functional components.

This new architecture brings us more flexibility, and it’s easier for us to monitor, debug, and improve without impacting other sources. The legacy source is the main source from our previous architecture, so we can keep receiving events from people who haven’t updated their mobile app yet. We have the exact same architecture in our sandbox environment so product engineers have a full testing environment with validation.

We use Segment’s Repeaters to connect sources to one or many destinations. However, one drawback is that the message_id is overwritten when events go through repeaters, so the data lineage is not ideal.

Versions

Let’s say you want to update an existing field. It doesn’t sound like a big deal. Then you realize that releases ship every week on Android & iOS and not everyone updates right away, meaning that we would drop all the events from clients that haven’t updated to the latest app.

We thought about two solutions:

  • Have both the new and the old properties in the same JSON schema. We accept both for a while and then drop the old one. The main drawback is that it requires extra coordination between the teams.
  • Create a new version of the event whenever we have a breaking change (a renamed property, a changed data type, etc.). We change the downstream dependencies to use table_v2. For a while we import data from table_v1 into table_v2, and when the events coming into table_v1 become negligible we stop that process.

The second option looks easier for us to handle. It will require less work across the different teams.
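As a rough sketch of the second option, the temporary import from table_v1 into table_v2 amounts to mapping each old row to the new shape. The renamed property and the type change below are hypothetical; the real mapping depends on what changed between the two versions of the event.

```python
from typing import Any, Dict

# Hypothetical breaking changes between v1 and v2 of an event:
# a renamed property, and a boolean that v1 clients sent as 0/1.
RENAMED = {"dropoff_date": "drop_off_date"}
BOOLEAN_FIELDS = {"business_class"}


def upgrade_v1_to_v2(row: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a v1 row to the v2 layout without mutating the input."""
    upgraded: Dict[str, Any] = {}
    for key, value in row.items():
        new_key = RENAMED.get(key, key)
        if new_key in BOOLEAN_FIELDS and value in (0, 1):
            value = bool(value)
        upgraded[new_key] = value
    return upgraded
```

Keeping the mapping in one place means it can simply be deleted once traffic on the old version becomes negligible.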

Continuous integration

We want to be super confident in every change we make on an event schema.

Here is what we test in continuous integration:

  • the schema is a valid JSON file
  • the schema validates against the test events
  • the schema still validates older data when we reprocess it, so we don’t introduce regressions
  • the tracking plan, once pushed to a dev environment in Segment, is valid so Segment can process it
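The first two checks could look roughly like the sketch below. It assumes the jsonschema package and a hypothetical setup where each schema comes with sample events that must pass and sample events that must fail. The function name run_schema_checks is ours for illustration, and the push to Segment is not shown.

```python
import json
from typing import List

from jsonschema import Draft7Validator, SchemaError


def run_schema_checks(schema_text: str, valid_events: List[dict],
                      invalid_events: List[dict]) -> List[str]:
    """Return a list of CI failures; an empty list means the schema passes."""
    try:
        schema = json.loads(schema_text)  # check 1: the file is valid JSON
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    try:
        Draft7Validator.check_schema(schema)  # and a well-formed JSON Schema
    except SchemaError as exc:
        return [f"not a valid JSON Schema: {exc.message}"]

    validator = Draft7Validator(schema)
    failures = []
    for event in valid_events:  # check 2: known-good test events must pass
        if not validator.is_valid(event):
            failures.append(f"rejected a valid test event: {event}")
    for event in invalid_events:  # and known-bad test events must fail
        if validator.is_valid(event):
            failures.append(f"accepted an invalid test event: {event}")
    return failures
```

The regression check against older data can reuse the same function by passing a sample of historical events as valid_events.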

Alerting

We have multiple ideas for alerting that we are still working on.

We will probably end up with some Airflow jobs that compute the ratio of failed to valid events and send exception alerts (we use Bugsnag) when that ratio reaches a specific threshold.
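The check itself is simple. Here is a minimal sketch with a made-up threshold and made-up function names, leaving out the Airflow scheduling and the Bugsnag call. One assumption to flag: it divides failed events by the total rather than by the valid count, so the ratio stays between 0 and 1.

```python
def failure_ratio(failed: int, valid: int) -> float:
    """Share of failed events among all events seen in the window."""
    total = failed + valid
    return failed / total if total else 0.0


def should_alert(failed: int, valid: int, threshold: float = 0.01) -> bool:
    """True when the failure ratio reaches the (hypothetical) threshold."""
    return failure_ratio(failed, valid) >= threshold
```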

Summing up

We’re already seeing very good results. We had around 50 events that did not match across all the different platforms. For at least ten of them, we were sending completely invalid data and didn’t notice it.

Fixing the events is trivial; however, syncing up with the different teams and understanding how we can fix them without impacting other stakeholders is not. It can be time-consuming.

Currently, we are able to catch failures quickly. This allows us to inform teams right away that we have an issue, rather than having to first spend the time cleansing the data or writing transformation rules in SQL. Additionally, since we no longer need our custom SQL transformations, we will be able to delete thousands of lines of code to clean up our codebase.

Even if we receive invalid events, we can still process them because we store them in an S3 bucket, which gives us a fallback mechanism for investigation.

With this new approach we are now able to iterate much faster, with a lot more confidence, and, most importantly, with greater trust in the data.