Developing a Consistent Taxonomy for Behavioral Analytics at CircleCI

Justin Cowperthwaite
CircleCI
13 min read · Dec 6, 2016

This is the second and final post on building analytics at CircleCI. It picks up where the last left off, focusing on the in-house implementations I built to address the problems from Part 1.

The third-party platforms we had decided on (Segment, Amplitude, and Looker) solved the issues of combining demographic (persistent) and behavioral (event-specific) data and of creating org funnels. We solved the problem of data integrity, created by a poor integration with our previous analytics provider, by rewriting our integration code when switching to our new provider, Segment. The one remaining issue this combination of platforms did not alleviate was data cohesion and consistency. As a result, this became my main focus during implementation.

A consistent taxonomy, free of duplicate events, is the bedrock of a robust behavioral analytics infrastructure. This foundation makes the data easier to understand and use effectively, which ultimately accelerates analysis and adoption.

So, it was my job to build a platform that would ensure developers named events in a way that was not only clear to users, but also wouldn't accidentally create analogous or duplicate events. Equally important, the taxonomy had to make it simple for developers to add new events: when event creation becomes too difficult, it is seen as a burden on productivity and never makes it into the development workflow.

Event Schema and Taxonomy

There are two main pieces of behavioral analytics, and therefore two places I needed to focus on ensuring consistency: event names and event data.

Event Names

Without an authoritative schema for event names, our previous implementation had seen issues surrounding analogous event names such as clicked-signup and signup-clicked. To counteract this Wild West of event names, I tackled my first challenge: designing a schema that was both consistent and intuitive.

Looking at the user interactions we care about on the site, it was clear that the majority were impressions, clicks, and state changes triggered by a user's action. After reviewing many of our existing event names, I boiled the necessary structure down to a schema that would address all of these cases. The new convention became: <feature>-<action>.

An example of this breakdown can be seen in upgrade-button-impression, upgrade-button-clicked, and plan-upgraded. Each represents a segment of our plan upgrade funnel: one for when the user sees the upgrade button, another for when they click it, and another for when their plan is upgraded in the database. And each falls within the schema.

What is excluded from the event name is just as important as what is included. Take this beauty: blue-signup-button-header-enterprise-page-clicked. CircleCI is not alone in this crime against brevity; I’ve seen this type of verbosity at many of my former companies. On the one hand, this name is extremely intuitive. It would be nearly impossible for a data consumer to misinterpret this event’s action. However, if event names are too granular, they run the risk of polluting the analytics namespace, becoming a burden on analysis.

From a data analytics perspective, this type of aggressive segmentation creates statistical nightmares. In order to calculate something as basic as the number of signup buttons clicked, it is now necessary to aggregate information across several events. Simplifying your behavioral data allows you to quickly and easily have a high-level view of which actions are being taken on which features across your platform.
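To make this concrete, here is a sketch of what that aggregation can look like in a SQL warehouse. The table names are hypothetical, derived from over-granular event names:

-- Hypothetical tables, one per over-granular event name.
-- Answering "how many signup buttons were clicked?" now requires
-- stitching together one SELECT per event name.
SELECT SUM(clicks) AS total_signup_clicks FROM (
  SELECT COUNT(*) AS clicks FROM blue_signup_button_header_enterprise_page_clicked
  UNION ALL
  SELECT COUNT(*) FROM signup_button_footer_home_page_clicked
  -- ...and so on, for every signup-button event name...
) AS signup_clicks;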

If this example seems a bit hyperbolic, it's because it is. Usually, developers don't try to force all information about an event into its name. However, there is one piece of information that developers are consistently keen on putting into the name: the location of the event. The schema very purposely excludes this information from the event name. Event names should be view- (and component-) agnostic; names should only represent user actions, and exclude where that action took place.

Event Data

Now, I'm not saying that an event's location is unimportant; it absolutely matters. It just shouldn't be stored in the event name. If you see a 30% rise in signups one week, you'll definitely want to know which page generated those signups.

Enter the use of event data.

With our behavioral analytics platforms (Segment + Amplitude), event data is crucial for segmenting events. In our SQL warehouse, Segment creates a table for each unique event name and converts each piece of the event data into a column in that table. This allows us to use SQL statements like WHERE and GROUP BY to partition event tables and gain more granular insights.
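For instance, with a single signup-clicked event carrying a view property, the question from the previous section becomes one query (the snake_cased table and column names are illustrative, not our exact warehouse schema):

-- One table for the signup-clicked event; the view column lets us
-- segment clicks by the page they occurred on.
SELECT view, COUNT(*) AS clicks
FROM signup_clicked
GROUP BY view
ORDER BY clicks DESC;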

In Amplitude, we can use these same event data pieces to segment the data or create user cohorts. For example, we could group users into cohorts depending on how they followed their first project (via our “Add Projects” page or the “Follow Project” button at the top of a build). These cohorts (project-followed.view = add-projects and project-followed.view = build, respectively) could then be used to measure retention.

As with event names, it’s important to have a consistent event data schema. If org refers to the organization name on one event, and the organization’s id on another event, then the event data becomes just as confusing and ineffectual as our event names used to be.

Because each event may require custom data, it’s challenging to completely schematize the event data taxonomy. For example, our teammates-invited event has a num-teammates property which represents how many teammates were sent invites. Other events would rarely use this num-teammates data, so it doesn’t make sense to add it to a defined event data taxonomy.

There are, however, four pieces of event data common to nearly all of CircleCI's events: org, view, user, and repo. This data typically represents a user taking an action on a page (view) under the context of an organization (org), and sometimes under the context of a specific project (repo). Given this global context for our event data, tracking these four properties seemed like the most logical starting point for a global event data schema.

Turning Thoughts into Code

The crux of my plan was to create an infrastructure that enforced the event naming schema I had come up with, and that established a base for a global event data schema.

To enforce this consistency, I used the Plumatic Schema library. For those unfamiliar with Schema: “a Schema is a Clojure(Script) data structure describing a data shape, which can be used to document and validate functions and data.” This validation functionality meant that I could use “schema checking” to ensure that the data passed into the analytics library came from a predefined set of good values.
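As a minimal sketch of that validation in action (simplified, not our production code):

;; A Schema describing a map with a single keyword-valued :event-type key.
(require '[schema.core :as s])
(def Event {:event-type s/Keyword})

(s/validate Event {:event-type :signup-clicked})   ;; returns the map unchanged
(s/validate Event {:event-type "signup-clicked"})  ;; throws a validation exception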

Note 1: Because I am using the plumatic Schema library to enforce my event naming schema, ‘schema’ is a bit of an overloaded term in the following section. For clarity, I will refer to the library and the objects created by that library as proper nouns (Schema), and my event naming schema as a common noun (schema).

Note 2: The code examples in this post are not exact representations of our production code. I’ve done this for two reasons. First, to try and highlight the implementation of the concepts I’ve laid out, and second, to make it more accessible to non-Clojure developers. However, the code snippets are based on our open source frontend, so if you want to see the real thing, feel free to check it out!

Consistent Names

The first thing I used Schema to validate was a central repository of event names. As some of you are aware, creating an event naming schema is one thing; enforcing it is another.

To get started, I created a list of all the events that were currently being fired from within our web app. I then used Schema to validate the event names passed into the track function by adding the following code:

;; Below are the lists of our supported events.
;; Events should NOT be view-specific.
;; They should be view-agnostic and include a view in the properties.
;; Add new events here and keep each list of event types sorted alphabetically.
(def supported-events
  #{:account-settings-clicked
    ;; ...many events elided...
    :web-notifications-permissions-set})

;; Create an enum from the set of supported events.
(def SupportedEvents (apply s/enum supported-events))

;; Create an AnalyticsEvent Schema, which is a merge of the CoreAnalyticsEvent
;; Schema and a map with a key :event-type, whose value must be found in the
;; SupportedEvents enum.
(def AnalyticsEvent
  (merge CoreAnalyticsEvent
         {:event-type SupportedEvents}))

;; Create a track function which expects a map (event-data) in the shape
;; defined by AnalyticsEvent. If the argument passed to track does not have
;; an :event-type key whose value is found in the SupportedEvents enum, it
;; will throw an exception and not fire the event.
(s/defn track [event-data :- AnalyticsEvent]
  ;; track stuff...
  )
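In practice, this fails fast. A hypothetical call (assuming Schema's function validation is enabled during testing, e.g. via s/set-fn-validation!):

;; Fires normally: the event is in supported-events.
;; (Other keys required by CoreAnalyticsEvent are elided here.)
(track {:event-type :account-settings-clicked})

;; Throws a validation error instead of firing: this analogous
;; name is not in the SupportedEvents enum.
(track {:event-type :settings-account-clicked})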

Plumatic Schema ensures that a developer adding a new event must first manually add it to the set of supported events, or they'll see errors in their JS console while testing. This enforcement accomplishes three things:

1. Describes the event-naming schema (<feature>-<action>) the developer should use to add the event, increasing the chance they will follow it.
2. Provides numerous examples that do follow the naming schema, encouraging proper use of the nomenclature.
3. Discourages the creation of analogous events by keeping the events sorted alphabetically, giving developers a chance to see the existing events for a feature.

An additional, personal benefit: as the owner of our analytics, this gives me one place to check that events outside the schema are not being added. Previously, events were scattered throughout the entire code base, so enumerating them was difficult. This list of event names has already allowed me to catch events that fell outside the naming schema.

Consistent Data

The next place I needed to ensure consistency was in the event data, since well-chosen event data provides invaluable insight into the more granular details of user engagement.

Luckily for us at CircleCI, we use Om (a ClojureScript wrapper around Facebook's React framework) for our frontend web app. This means our app is powered by a large state map held in memory and updated by user interactions or responses from API calls. A major benefit of this structure is that at any given time, our app “knows” its operating context, whether that's the current page, the current user, or a specific organization or repository.
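As a rough sketch of what that context looks like (these keys are illustrative; the real state map is much larger):

;; A simplified picture of the app state at a given moment. Our actual
;; key paths live in a dedicated state namespace.
(def example-app-state
  {:current-user {:login "jsmith"}
   :navigation-data {:org "circleci"
                     :repo "frontend"}
   :current-view :build-page})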

To ensure this state information was always included in the event data of analytics calls, I did several things:

First, I extracted this information from the state in a consistent way. Second, I ensured that the state was always passed to the analytics function. Finally, I always added these keys to the event data before firing the track call to our third-party providers.

With those goals in mind, I ended up with the following code:

;; Define an AnalyticsEvent Schema, which requires an :event-type key and
;; a :current-state key. The :event-type value must come from our enum of
;; SupportedEvents (described in the code block above). The :current-state
;; can be a map with any keys and values (this is the app state).
(def AnalyticsEvent
  {:event-type SupportedEvents
   :current-state {s/Any s/Any}})

;; Given the current-state, return a map of the properties that we want
;; to track with every event.
(defn- properties-to-track-from-state
  "Get a map of the mutable properties we want to track out of the state."
  [current-state]
  {:user (get-in current-state state/user-login-path)
   :view (get-in current-state state/current-view-path)
   :repo (get-in current-state state/navigation-repo-path)
   :org (get-in current-state state/navigation-org-path)})

;; A track function which takes a single argument, event-data: a map in the
;; shape described by the AnalyticsEvent Schema. If the input data is not in
;; the correct shape, it throws an exception.
(s/defn track [event-data :- AnalyticsEvent]
  (let [{:keys [event-type current-state]} event-data]
    (segment/track-event event-type
                         (properties-to-track-from-state current-state))))

As you can see by reading the code above, our track function takes a single argument event-data, which will have a key current-state (the app’s state when the event was fired). We then parse the important properties out of the state, and send that dictionary as the event’s data.
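A hypothetical call site then only needs the event name and the app state (assuming, as is typical with Om, that the state lives in an atom):

;; :user, :view, :repo, and :org are all derived from the state,
;; so the call site stays small and consistent.
(track {:event-type :project-followed
        :current-state @app-state})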

BUT: although this implementation ensures that each event fires with a consistent set of populated keys, it’s extremely rigid. What if the developer needs to add extra data, like an A/B test treatment? Or worse, what if there is an edge case where the auto-populated data is wrong?

Consider our dashboard page. While this page has neither an organization nor repository context (it shows all orgs and repos, but isn’t specific to any), it does have links that are specific to an organization or repository. For example, any link on the branch selector component on the left side of the page clearly has an associated organization and repository.

Because of these cases, we need to be able to modify the properties dictionary. I accomplished this by adding a new properties argument to the track function signature, which takes precedence over the auto-generated map. Below is the updated code:

;; Same as the AnalyticsEvent above, but now with an optional key :properties.
;; :properties is a map with Clojure keywords as keys and custom user-set values.
(def AnalyticsEvent
  {:event-type SupportedEvents
   :current-state {s/Any s/Any}
   (s/optional-key :properties) {s/Keyword s/Any}})

;; This function gets the data we want to track automatically out of the
;; current-state, and then merges in the properties passed by the caller.
;; Because properties is the second argument to the merge call, it takes
;; precedence. So, if properties has an :org key, its value will overwrite
;; the value of :org returned from properties-to-track-from-state.
(defn- supplement-tracking-properties
  "Fill in any unsupplied property values with those supplied in the current app state."
  [{:keys [properties current-state]}]
  (-> current-state
      (properties-to-track-from-state)
      (merge properties)))

;; Same as the track function above, but can take a :properties key in its
;; input map. This map of properties takes precedence over the map generated
;; automatically by parsing the app's state.
(s/defn track [event-data :- AnalyticsEvent]
  (let [{:keys [event-type properties current-state]} event-data]
    (segment/track-event event-type
                         (supplement-tracking-properties
                           {:properties properties
                            :current-state current-state}))))

The code now allows a developer to add custom properties to the event data as needed. Also, in the case that the auto-added properties are wrong, it allows the developer to overwrite them with the correct values.
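For the branch-selector case above, a hypothetical call might supply the missing context explicitly:

;; The dashboard's state carries no org/repo context, so the branch
;; link provides its own; these keys overwrite the (empty) values
;; parsed from the state. The event name here is illustrative.
(track {:event-type :branch-clicked
        :current-state @app-state
        :properties {:org "circleci"
                     :repo "frontend"}})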

And That’s a Wrap!

Three third-party platforms and one massive event naming taxonomy overhaul later, we now have an analytics library that ensures some consistency for our data consumers, while allowing for flexibility and ease of use for our developers.

Looking Forward — Areas for Optimization

In the eight months since this analytics library first launched, it has become the primary way we collect behavioral data. During this time, I've learned what I built well and what I built poorly. I don't think this would be an honest blog post if I didn't address what were either mistakes or scaling issues, so here are some things that are top of mind eight months later.

Separation of Concerns

In our current architecture, there is no separation between what is part of the CircleCI frontend analytics and what could be a portable analytics library. Although this isn't a concern when firing events from a single source, as we scale our behavioral analytics to multiple services, a shared infrastructure becomes crucial for validating those events and their data. Having a common library encourages best practices across different code bases.

A place we've already felt this pain is in our server-side events. While the frontend is well schematized, the server side had no rigid schematization. The result is that, among similar events, there are different keys representing the same thing (e.g., org and org-name on different events). Having a proper library would have allowed me to reuse the schema checking and prevent this from happening.
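For instance, a shared library could export the common property Schemas themselves, so the frontend and backend validate against the same keys; a sketch:

;; In a hypothetical shared analytics library: one definition of the
;; common keys, required by every service's event payloads.
(def CommonProperties
  {:org s/Str    ;; always the organization name, never its id
   :user s/Str
   (s/optional-key :repo) s/Str
   (s/optional-key :view) s/Keyword})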

Scaling of the Event Names

The set of event names has been very successful so far, but as we add more and more events, the list is growing rapidly. At this point, I'm starting to wonder whether the list is maintaining its original usefulness, or if there's a better way to layer it for easier consumption.

If you look at the list, you’ll notice the event names are really just combinations of features and actions. So why have them together? Why not have a set of valid features, a set of valid actions, and then have the analytics library autogenerate the event name based on a feature/action combination passed to it by the client? That would be more scalable and make it harder to create event names that fall outside of the taxonomy. It’s something I’d like to try out in the future.
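A sketch of that idea, with illustrative feature and action sets:

;; Valid features and actions, each much shorter than a flat event list.
(def supported-features #{:upgrade-button :plan :project})
(def supported-actions #{:impression :clicked :followed :upgraded})

;; Build a <feature>-<action> event name from validated parts, making it
;; impossible to construct a name outside the taxonomy.
(s/defn event-name :- s/Keyword
  [feature :- (apply s/enum supported-features)
   action :- (apply s/enum supported-actions)]
  (keyword (str (name feature) "-" (name action))))

;; (event-name :upgrade-button :clicked) => :upgrade-button-clicked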

The Rest of the Event Data

We properly schematize four event data keys, but what about the rest? We run A/B tests, so those treatments should also auto-populate the event data. The more data, the better, but also the more potential for divergence among our event data dictionaries.
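One approach might be to fold treatments into the same auto-population step; a sketch, assuming the active treatments live in the app state under a hypothetical path:

;; Merge the user's active A/B treatments into every event's properties,
;; the same way we merge :user, :view, :repo, and :org.
(defn- properties-to-track-from-state [current-state]
  (merge {:user (get-in current-state state/user-login-path)
          :view (get-in current-state state/current-view-path)}
         (get-in current-state state/ab-test-treatments-path)))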

So how can we ensure that as the complexity of our behavior analytics grows, the schema of our event data stays consistent? This is probably the hardest problem to solve, but one that will be important in scaling our use of behavioral data.

Waiting for Om Next

We are currently in the process of migrating our existing frontend from Om to Om Next (you can read about it here). One of the exciting things about a massive migration like this is it allows developers to address problems that might have slipped through the cracks in the first architected solution.

For example, when we moved to Om, CircleCI did not have data analysts or a growth team, and was less committed to utilizing data. Now that we are more data-dependent, analytics is receiving first class consideration in this migration, which will allow us to fix some issues we’re seeing with our current implementation.

Conclusion

No matter how thorough the plan and design, things will always be forgotten. This implementation of event taxonomy and schema has taken us from a company that didn’t prioritize data to one that runs on data-informed decisions. That said, as we scale, it has some cracks that are beginning to show. I’m looking forward to taking some time to fix these issues and, hopefully, writing another blog post to share future insights.

Interested in hearing more about this topic? RSVP for December Office Hours in San Francisco where Justin will present on all things data, Wednesday December 14th at the Heavybit Clubhouse. Space is limited.

Originally published at circleci.com on December 6, 2016.
