Unified Analytics Event Schema that Improves Backend, Frontend, and Data Scientists’ Productivity

Yuming Qiao
Tech @shift.com
6 min read · Sep 1, 2021


Journey of the Analytic Events

Analytics Events and Their Importance

Analytics events are essential for understanding how users interact with an application, and they play a crucial role in decision-making. Usually, an application will try to collect as many events as it needs to play back a user’s full interaction.

It’s common for an application to send tens or even hundreds of events while a single user browses a single page, such as Page Load, Viewed xxx Promotion, Experiment xxx Assigned, Clicked xxx Link, Entered xxx Viewport, etc. Most of the time, the events come from both the client side (to capture a user’s interactions) and the server side (to capture the behavior of the APIs that are invoked).

The events are then sent by the application to analytics tools, such as Google Analytics, Segment, etc. These vendors provide APIs for your application to use, and the application usually has its own wrapper to enrich the raw event with more context. Here is an example of a raw event and its enriched version:

# raw event
{
  "event": "Viewed Black Friday Promotion",
  "properties": {
    "url": "demo.com/shopping"
  },
  "userId": "c9fc-3d50-bkase1co2",
  "timestamp": "2021-08-31T09:12:05.045Z"
}

# enriched event
{
  "event": "Viewed Black Friday Promotion",
  "properties": {
    "url": "demo.com/shopping"
  },
  "userId": "c9fc-3d50-bkase1co2",
  "timestamp": "2021-08-31T09:12:05.045Z",
  "context": {
    "app_version": "v2",
    "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko)"
  }
}

The events are then forwarded by the vendor to a storage destination (e.g., Redshift, Snowflake, BigQuery) for further analysis by the data scientists.

Problems We Faced While Organizing the Analytic Events

Now here are the problems we faced:

  1. Events can have name collisions since there is no enforcement at the code level. I can have one event called Link Clicked in shopping_page.js and another one called Link Clicked in careers.js. Events from the two places have different payloads but are recognized as the same event because they share a name. Our event storage gets messy and the data scientists get mad.
  2. The event schema does not evolve with deliberate oversight and input from the data scientists. Engineers can change, update, delete, or rename attributes of an event. Engineers can also name the event and its attributes freely without consulting the data scientists’ naming suggestions (a good naming rule improves productivity significantly; think about the trouble caused when events named xxx Link Clicked, Clicked xxx Link, and Click xxx Link coexist). There is no enforcement at the code level to bring in the oversight of data scientists.
  3. No static typing of the event. The API provided by the analytics vendor is loosely typed: the event is basically a map of map[attribute_name][attribute_value], and the attribute_name is a string (sure, what else could it be?). This means engineers can use whatever string they like for the attribute_name. We ended up with the URL attribute being called URL, url, endpoint, path, and url_path in different places. Although they convey the same information, it leaves more work for data scientists to clean them up. The sketch after this list shows how easily this happens.
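To make the third problem concrete, here is a minimal, self-contained Go sketch of what a loosely typed tracking call looks like. Track here is a stand-in for the vendor API, not its real signature, and the event and attribute names are illustrative:

package main

import "fmt"

// Track stands in for the vendor's loosely typed API: the event is just a
// name plus a free-form map, so nothing prevents attribute-name drift.
func Track(event string, properties map[string]interface{}) {
    fmt.Println(event, properties)
}

func main() {
    // Two call sites describing the same page with different attribute names.
    Track("Clicked Apply Link", map[string]interface{}{"url": "/careers"})
    Track("Clicked Apply Link", map[string]interface{}{"url_path": "/careers"})
}

Both calls compile and ship, and only the downstream tables reveal that the same information now lives under two column names.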

The root cause of these problems is that we don’t have code-level (CI/CD pipeline) enforcement to bring data scientists into the code review when the schema of an event changes. We rely on individuals to bring a data scientist into the review, and that is not reliable because 1. people sometimes forget to do so, and 2. data scientists need to read events defined in both server-side and client-side programming languages, which is unproductive and error-prone.

Unify the Schema of Analytic Events with Code Generation

The solution is to build enforcement that brings data scientists into the code review for every schema change. We create an analytics schema file in plain-text format to describe the schema of each event:

Viewed Black Friday Promotion
url: string
referrer: string
signed_in: bool
# comments to describe the event (optional)
Clicked xxx Button
url: string
time_spent_on_page: int
...

We then enforce that any change to a schema triggers an update to this file. In addition, we make the data science team the only maintainers of this file. The CI pipeline updates the file when schema changes are detected, and the CD pipeline checks whether the updates to this file have been approved by its maintainers (at least one data scientist). I will go through how these triggers are built in the next section.

Solution in Detail

The solution contains:

  1. A wrapper around our analytics API.
  2. A Go struct tag to denote that a struct is an analytics event schema.
  3. A script to parse the source code into an AST.
  4. A code generator that produces a .txt file and .ts/.js code by looking at the AST.
  5. A code owners tool in the deployment pipeline to enforce code review.
  6. Steps (3), (4), and (5) are included in the CI/CD pipeline.

In the wrapper of our analytics API, instead of making it accept a map[string]interface{} (in Go, a map of string to anything), we make it accept an Event interface.

// analytics_wrapper.go
type Event interface {
    // The unexported method prevents implementations outside of this package.
    // So the only way a client can create a type that implements Event is by
    // embedding a BaseEvent in a struct.
    private()
}

type BaseEvent struct{}

func (BaseEvent) private() {}

func Send(event Event) {
    // Analytics API to send the event out
    ...
}

To fulfill this Event interface, the caller needs to embed this `BaseEvent` as a field of the struct it passes:

type BlackFridayEvent struct {
    analytics_wrapper.BaseEvent `event:"Viewed Black Friday Promotion"`
    url       string
    referrer  string
    signed_in bool
}

analytics_wrapper.Send(BlackFridayEvent{
    url:       "/shopping",
    referrer:  "google.com/xxx",
    signed_in: false,
})

Notice the event tag. We enforce it in our static analysis tool (meaning that embedding analytics_wrapper.BaseEvent without this tag will not pass the check).

Static Analysis

Like our other static analysis tests, we have a script that parses the source code into an AST. For those not familiar with this, it means reading the source code text (without executing it) to extract its types, variables, and functions. We add a check that any struct containing analytics_wrapper.BaseEvent must have a Go tag named event with a string as the event name on that field. In addition, we check that all fields of the struct are string, numeric, or boolean types. This way, we enforce that every implementation of Event is the schema of an analytics event.
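As a minimal sketch of what such a check can look like with Go’s standard go/parser and go/ast packages (the function and file names are illustrative, and our actual tool does more, e.g. field-type validation and repository-wide traversal):

package main

import (
    "fmt"
    "go/ast"
    "go/parser"
    "go/token"
    "strings"
)

// checkFile flags any struct that embeds analytics_wrapper.BaseEvent but is
// missing the required event tag. Simplified for illustration.
func checkFile(path string) error {
    fset := token.NewFileSet()
    file, err := parser.ParseFile(fset, path, nil, parser.ParseComments)
    if err != nil {
        return err
    }
    ast.Inspect(file, func(n ast.Node) bool {
        st, ok := n.(*ast.StructType)
        if !ok {
            return true
        }
        for _, field := range st.Fields.List {
            // Embedded fields have no names; look for a BaseEvent selector
            // coming from the analytics_wrapper package.
            sel, ok := field.Type.(*ast.SelectorExpr)
            if !ok || len(field.Names) != 0 || sel.Sel.Name != "BaseEvent" {
                continue
            }
            if pkg, ok := sel.X.(*ast.Ident); !ok || pkg.Name != "analytics_wrapper" {
                continue
            }
            if field.Tag == nil || !strings.Contains(field.Tag.Value, "event:") {
                fmt.Printf("%s: BaseEvent is embedded without an event tag\n",
                    fset.Position(field.Pos()))
            }
        }
        return true
    })
    return nil
}

func main() {
    if err := checkFile("black_friday_event.go"); err != nil {
        fmt.Println(err)
    }
}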

Code Generation

Next, we build a code generation stage that uses the AST parsing result from the previous step. For every struct type that embeds analytics_wrapper.BaseEvent, we generate the corresponding schema text and put it into the schema file. For frontend code written in TypeScript/JavaScript, we require the engineers to define the schema in the backend Go code, and the code generation also produces the TypeScript/JavaScript code (a simple wrapper to send a strongly typed frontend event) when the event is marked as being used in the frontend.
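A minimal sketch of the generation step, assuming the AST pass has already been distilled into a small description of each event (the type names and output path below are illustrative, not the exact shapes in our tool):

package main

import (
    "fmt"
    "os"
    "strings"
)

// Field and EventSchema are illustrative shapes produced by the AST pass.
type Field struct {
    Name string
    Type string
}

type EventSchema struct {
    Name   string
    Fields []Field
}

// writeSchemaFile renders every event into the text format shown earlier so
// the CI pipeline can regenerate the data-scientist-owned schema file.
func writeSchemaFile(path string, events []EventSchema) error {
    var b strings.Builder
    for _, e := range events {
        b.WriteString(e.Name + "\n")
        for _, f := range e.Fields {
            fmt.Fprintf(&b, "%s: %s\n", f.Name, f.Type)
        }
        b.WriteString("\n")
    }
    return os.WriteFile(path, []byte(b.String()), 0o644)
}

func main() {
    events := []EventSchema{{
        Name: "Viewed Black Friday Promotion",
        Fields: []Field{
            {"url", "string"},
            {"referrer", "string"},
            {"signed_in", "bool"},
        },
    }}
    if err := writeSchemaFile("analytics_schema.txt", events); err != nil {
        fmt.Println(err)
    }
}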

Since the schema file is now automatically updated whenever an event changes, and its only code owners are data scientists, we enforce at the code level that every schema update gets reviewed by data scientists.
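If the pipeline uses a GitHub-style CODEOWNERS file, the entry can be as small as this (the path and team handle here are illustrative):

# Only the data science team can approve changes to the analytics event glossary.
/analytics/analytics_schema.txt @shift/data-science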

The New Workflow for an Event Schema Change

An engineer creates the event schema (for both server-side and client-side events) in Go code. The engineer then pushes the change for review and the CI pipeline is triggered. The CI pipeline runs tests and, more importantly, updates the schema file if any event has changed. The CD pipeline, by reading the code owners, now requires the change to be approved by at least one data scientist, and the engineer picks one to review the schema file. The data scientist makes suggestions on the event changes (usually they have synced beforehand, but in case they haven’t, this is the opportunity to catch problems before bad things happen). After the data scientist approves, the CD pipeline allows the change to be deployed to production.

Conclusion

Now, we have the schema file as the glossary of our analytics events. It contains each event’s name and schema. Any change to an analytics event is under the oversight of data scientists. Server side, client side, and downstream analytics see the same schema and thus speak the same language. We now have hundreds of events using this new framework. Since this framework was built, engineers and data scientists work more closely on event schema changes and no longer need to worry about accidental changes to any event. We spend less time firefighting, and our productivity has improved significantly.

Let me know in the comments if you use a flavor of this approach, or have other ideas to share. We are a learning organization and would love to hear! Found this interesting? We are working on a broad array of exciting technical challenges. Come join us at shift.com/careers.
