Why the majority of data projects fail: the case for a Universal Data Language
Let’s say you have to do something really hard. Anything really. In our case, it’s creating a data platform, but it could just as easily be taking a car engine apart or even forensic accounting.
The clarity with which you define the moving parts and the way they work together becomes imperative — both for others who help with the project and for your future self. Put very simply: failure to do this results in rapidly compounding complexity.
This is the essence of why so few data projects find value. Many data teams are attempting to unravel a mess that has already been made.
So let’s look at what causes these risks to data productivity and then analyze how a Universal Data Language can mitigate them.
Cause 1: Ad hoc definitions and unenforced data dictionaries
Joan, our front-end dev, is implementing some tracking on a Friday afternoon. It’s not her favorite task, and so it has waited until the end of the current sprint.
When she finally implements the tracking, it’s not uniform, as she’s relatively new and has slightly misunderstood the instructions in the data dictionary.
The Tracking Designer was a contractor, as is often the case, and has recently left the project, so Joan doesn’t have anyone to consult about how to correctly implement the tracking.
The event being tracked was originally intended to fire on search results appearing, not on the button click. This has been recorded in a data dictionary Google Sheet, but not in a format which was clear to Joan.
Marc, the data consumer, now has data that doesn’t make complete sense. He can’t tell whether it actually matches the intent of the Tracking Designer, and the naming conventions used in each field are also unclear, so he isn’t even sure what the event actually describes.
Tracking the lineage of the data back to its source in order to correct these issues is challenging, and long Slack threads ensue as Marc waits for answers from Joan, who is equally confused. All the while, the project has stalled and motivation plummets.
Cause 2: Versioning
Closely tied to the idea of the need for tight definitions is the concept of version control.
Rachel and Tim are both data engineers and sometimes need to make changes to the structure or definitions of the data sets. These can be breaking or non-breaking changes.
Since they work in different time zones, they tend to have communication difficulties, and changes can be missed. This leads to breaking changes going unnoticed and tests being run against old, outdated data sets.
Ultimately, this can render enormous amounts of data useless, or — again — mean picking through old Slack threads and spreadsheets to work out what happened.
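To make the distinction concrete, here is a minimal sketch, in Python with hypothetical field names and version labels, of what separates a non-breaking change from a breaking one:

```python
# Illustrative only: hypothetical event schema fragments showing the difference
# between a non-breaking and a breaking change.

schema_1_0_0 = {
    "type": "object",
    "properties": {
        "element_name": {"type": "string"},
        "value": {"type": ["string", "null"]},
    },
    "required": ["element_name"],
}

# Non-breaking: adding an optional field. Existing events still validate and
# existing queries keep working, so downstream consumers are unaffected.
schema_1_0_1 = {
    **schema_1_0_0,
    "properties": {
        **schema_1_0_0["properties"],
        "element_location": {"type": ["string", "null"]},
    },
}

# Breaking: renaming a required field. Old events fail validation and old
# queries come back empty, which is exactly the kind of change Rachel and Tim
# need to coordinate on (and version explicitly) across time zones.
schema_2_0_0 = {
    "type": "object",
    "properties": {
        "element_id": {"type": "string"},
        "value": {"type": ["string", "null"]},
    },
    "required": ["element_id"],
}
```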
Cause 3: Data Exhaust
Alongside ad hoc definitions and versioning issues, the team is facing another problem: data exhaust.
Rachel, who you’ll remember is one of the data engineers, has to extract the data from many different sources. These are generally:
- Black-box analytics tools, like CDPs
- Packaged analytics solutions (GA)
- Third-party SaaS apps, like Salesforce
Each of these produces data with differing levels of completeness and accuracy. In addition, these tools can create multiple tables for the same data, resulting in complex stateful joins, or nested table structures, leading to overly complicated queries.
Expecting Rachel to account for this complexity perfectly every time while under pressure leads to inevitable human error. Even when she manages to get results, the process is extremely time-consuming and means projects can slow to a crawl.
When Marc — the data consumer — queries the lineage of the data, or asks why it appears in a certain way, Rachel is often stumped as the black-box logic in these tools means implicit assumptions have been made before she even saw the data.
This even extends to not knowing how old the data is (anywhere from a couple of minutes to several days), rendering efforts to create apps that rely on predictable latency useless.
Rachel is aware that their processes aren’t optimal, but she is working incredibly hard to try and drive results with the tools available. Unfortunately, as 50% of her time is spent on data preparation, the return on her data projects is just not forthcoming and her motivation is at an all-time low.
How to drive value from your data projects: Data Creation and a Universal Data Language
Now we’ll take a look at how Joan, Marc, Rachel and Tim’s pains could have been avoided.
Data Creation is the opposite of data exhaust. Rather than extracting data with varying levels of aggregation and accuracy, teams plan and create the data directly from the source.
The major benefit of this approach is that the metrics created can be anything your team can dream up to best describe the product — rather than an approximate fit.
Secondly, each event can be enforced by JSON schemas. These are both machine and human-readable, and pre-validate data before it even hits your storage location. No human-to-human communication is necessary to make this system work, as — after the initial design — the Tracking Designer is taken out of the communication loop.
These rules are tightly controlled and can be versioned and tested in a sandbox environment. Hardcoding data definitions into a rule-based system in this way prevents the domino effect of misunderstandings we saw with data exhaust.
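As a rough sketch of what that enforcement looks like, the snippet below uses the open-source Python jsonschema package to pre-validate an event against a ruleset before it is ever loaded. The schema fragment and the malformed event are illustrative, not a real Snowplow definition:

```python
from jsonschema import Draft7Validator

# A cut-down ruleset for a click event (illustrative fields only).
click_schema = {
    "type": "object",
    "properties": {
        "element_name": {"enum": ["share", "like", "submit_email", "rate"]},
        "click_occured_at": {"type": "string"},
    },
    "required": ["element_name", "click_occured_at"],
    "additionalProperties": False,
}

validator = Draft7Validator(click_schema)

# An event as Joan's Friday-afternoon tracking might have sent it:
# wrong field names, and nothing matching the agreed definition.
event = {"elementName": "search_results_shown", "timestamp": "Friday, 4pm"}

errors = [e.message for e in validator.iter_errors(event)]
print(errors)
# e.g. ["'element_name' is a required property",
#       "Additional properties are not allowed ('elementName', 'timestamp' were unexpected)", ...]
```

Only events that produce no errors ever reach storage, so Marc never has to reverse-engineer what a field was supposed to mean.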
What does a Universal Data Language look like in practice?
Before events are even sent, we can define exactly what they’ll look like by writing a set of rules as JSON schemas. This is the backbone of our Universal Data Language.
For example, the ruleset for a click event:
{"element_name": {"enum": ["share","like","submit_email","rate",..."close_popup”],"description": "The name of the element that is clicked"},"element_location": {"type": ["string", "null"],"description": "Where on the screen is the button shown eg. Top, left"},"value": {"type": ["string", "null"],"description": "Optional value associated with the click"},"click_error_reason": {"type": ["string", "null"],"description": "If the click resulted in an error, what was the reason eg. Invalid character in text field"},"click_occured_at": {"type": "string","format": "date-time""description": "Time when the click happened on the client device",},"position_x": {"type": ["number", "null"],"description": "X coordination position of element when clicked",},"position_y": {"type": ["number", "null"],"description": "Y coordination position of element when clicked",},}
Prior to loading the data to your warehouse, each event is checked to see if it conforms to the rules laid out. There are two ways of doing this:
- If you are using a 3rd-party data collection vendor such as GA — validate client side
- If you have 1st-party data collection such as a home-built pipeline or Snowplow — validate in the data collection pipeline, prior to warehouse loading
Either method means the structure of data in the warehouse is controlled strictly by those consuming it.
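For the second option, a home-built pipeline, the validation step can be as simple as a gate that splits each batch into a good stream and a bad stream before anything is loaded. The sketch below is a generic illustration (the loader and dead-letter functions are hypothetical), not Snowplow’s actual pipeline code:

```python
from typing import Iterable, Tuple
from jsonschema import Draft7Validator

def split_events(events: Iterable[dict], validator: Draft7Validator) -> Tuple[list, list]:
    """Route each incoming event to the good or bad stream based on the ruleset."""
    good, bad = [], []
    for event in events:
        errors = [e.message for e in validator.iter_errors(event)]
        if errors:
            # Failed events are quarantined with their errors for later inspection
            # instead of silently polluting the warehouse.
            bad.append({"event": event, "errors": errors})
        else:
            good.append(event)
    return good, bad

# Hypothetical wiring inside the pipeline:
# good_events, bad_events = split_events(incoming_batch, Draft7Validator(click_schema))
# load_to_warehouse(good_events)
# send_to_dead_letter_queue(bad_events)
```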
With this one change to the setup, introducing an enforced ruleset, your front-end devs can finally QA your analytics in the same way they would QA new website code. Integrated data testing suites catch errors before the domino effect begins — check out the open-source tool Micro for more on this.
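What that QA step could look like, sketched as an ordinary automated test: the capture_test_events helper below is hypothetical, standing in for whatever records the events your tracking actually emits in a test environment (the role a tool like Micro plays).

```python
from jsonschema import Draft7Validator

def test_click_tracking_conforms_to_ruleset():
    """Fail the build if the tracker drifts from the agreed click-event definition."""
    validator = Draft7Validator(click_schema)        # the ruleset shown above
    for event in capture_test_events("click"):       # hypothetical test fixture
        errors = [e.message for e in validator.iter_errors(event)]
        assert not errors, f"Tracking no longer matches the ruleset: {errors}"
```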
Data workflows with a Universal Data Language
Summary
The reason most data projects struggle to find value comes back to two things: whether systems are in place to prevent human error and enforce tracking standards, and the quality of the data sources in question.
Snowplow has pioneered the creation of a Universal Data Language and Data Creation. Armed with these tools, data teams can define their data in a machine- and human-readable format, reducing the time wasted on cleaning and wrangling and vastly improving data quality.
Take a look at the level of detail in Snowplow behavioral data tables to learn more; the data includes hundreds of custom entities and properties, all conforming to a tight but customisable set of definitions.