The Five Stages of Data Modeling
Get structured or get lost.
A kickoff meeting for a new project. Engineering, product management, operations, and marketing get together to define and document key data entities and relationships. The result is the Data Dictionary, a cornerstone of the holistic data view, shared, understood, revision-tracked, and kept up to date by everyone in the company, regardless of role, and… oh who are we kidding?! When was the last time this actually happened?
Software is eating the world. Unfortunately, data is eating software even faster.
Fast-forward a few months. The project appears wildly successful. The iOS, Android and Web versions of the app are highly polished and of course sharing-enabled. The glowing TechCrunch piece is out. Users are signing up like crazy.
Yet something is off. The CEO is gloomy. “I’m flying blind!” she cries.
Data modeling is neither a vitamin nor a painkiller. It’s the healthy lifestyle that helps prevent life-threatening diseases in the first place.
User churn is high. Mixpanel charts contradict New Relic graphs, and Google Analytics disagrees with both. Marketing complains about lopsided engagement numbers. Analysts can’t get anything out of Redis, while DevOps refuse to move to Mongo. Optimizely reports great conversions with A, whereas retention is noticeably higher with B. Engineers explain that exporting data into ElasticSearch will take another quarter.
Why? Why do bad things happen to great teams proficient with the best tools and funded by the wisest investors?!
“I already know what every bit of data means in my code. Do I really have to describe every JSON field and every event in this dictionary thing, keep track of data model versions, and coordinate changes with marketing and ops? This is too much work! I need to ship a new feature tomorrow! Why are you asking me to invest time into things that I know won’t make the app livelier or increase the cuteness of its UI? Don’t I dutifully define new Mixpanel events every time marketing asks? Can’t somebody find a schema inference tool or something? What more do you want from me?”
When did fancy charts become the state of the art in data intelligence?
In the spirit of moving fast, the company in our story chose to postpone the explicit, careful structuring of its data across different departments, roles, modules, codebases, and datastores. Unfortunately, and with remarkable predictability, this classic early-stage bargain leads to failure: by the time the flag of data intelligence is finally raised, it turns out that everyone has their own implicit view of what means what, and different people use different tools to manage their own data silos.
What is our poor company to do?
Hire a Data Science team? Too late. By the time these enlightened creatures ramp up, build the requisite Hadoop cluster, and collate data from various silos into a decent system of record, the users will evaporate, disappointed by the product’s inability to meet their evolving needs once the novelty of the pretty surface wears off. What’s more, tons of invaluable data now sit on third-party servers and can’t be repatriated. That’s the very data that could be actively used to understand the audience and its emerging segments, cater to its collective and individual interests, react to user behavior in real time, and keep the customers happy.
Outsourcing data modeling is stupid. Data divided against itself cannot stand.
But wait, it gets worse: the lack of an explicitly defined data dictionary precludes versioning. Even if carefully collected, logs of user activity and other historical records become devilishly difficult to normalize across multiple implicit schemas. As a result, past data becomes effectively unreadable, and valuable insights are lost forever.
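To make the versioning point concrete, here is a minimal sketch of what an explicit schema version buys you. All field names and version numbers below are hypothetical, invented for illustration: when every logged event carries a schema version, old records can be upgraded mechanically instead of reverse-engineered.

```python
# Sketch (all names hypothetical): tag every event with a schema version
# and keep one small upgrade function per version bump. Historical logs
# can then be replayed through the chain and normalized automatically.

CURRENT_VERSION = 3

def upgrade_v1_to_v2(event):
    # Suppose v2 renamed "uid" to "user_id".
    event = dict(event)
    event["user_id"] = event.pop("uid")
    event["v"] = 2
    return event

def upgrade_v2_to_v3(event):
    # Suppose v3 added a "source" field; default old rows to "unknown".
    event = dict(event)
    event.setdefault("source", "unknown")
    event["v"] = 3
    return event

UPGRADES = {1: upgrade_v1_to_v2, 2: upgrade_v2_to_v3}

def normalize(event):
    """Bring an event of any historical version up to CURRENT_VERSION."""
    while event.get("v", 1) < CURRENT_VERSION:
        event = UPGRADES[event.get("v", 1)](event)
    return event

# A v1 record pulled from an old log:
old = {"v": 1, "uid": 42, "action": "signup"}
normalized = normalize(old)
```

Without the `v` field, each of those renames and additions becomes an implicit schema that someone has to rediscover by squinting at raw logs.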
Instead of designing the product from the data up and explicitly defining the schemas across all modules and deployment targets, the company ends up with badly fragmented data silos. Absent a common data language, engineering, marketing, product management, and operations stop talking to one another. Users leave. Investors bail.
It goes without saying that raw data in and of itself is useless. The goal is to establish and maintain a process that continuously crunches data flowing in from all sources, turning it into knowledge on the fly and keeping the users happy. How? By carefully structuring the data upfront, maintaining a sensible versioning policy, and most importantly, empowering the team to directly translate data insights into quantitatively and qualitatively measurable product improvements. That’s what it means to be data-driven, both as a company and as a software product.
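What does “structuring the data upfront” look like in practice? One plausible sketch, with entirely hypothetical names: a single, explicitly defined, versioned event type that every module imports, instead of each silo inventing its own ad-hoc dicts.

```python
# Sketch (hypothetical names): one shared, versioned event definition
# acts as a living data dictionary entry that all producers import.
from dataclasses import dataclass, asdict

SCHEMA_VERSION = 1

@dataclass(frozen=True)
class SignupEvent:
    user_id: int
    source: str   # e.g. "ios", "android", "web"
    ts: float     # Unix timestamp, seconds
    v: int = SCHEMA_VERSION

    def to_record(self):
        """Serialize for any downstream store or analytics pipeline."""
        return asdict(self)

# Every deployment target emits the same shape:
record = SignupEvent(user_id=7, source="web", ts=0.0).to_record()
```

Because every producer imports the same definition, engineering, marketing, and operations literally speak one data language, and the `v` field keeps future migrations mechanical rather than archaeological.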
Traffic stats and funnel graphs look great but what do they do for the users? To be effective, data insights must be actionable, ideally in real time.
Sure, third-party analytics can help harvest the low-hanging fruit of product improvements. But it’s slow, error-prone, and requires many multidisciplinary meetings.
To expand its appeal beyond early adopters, the product must encompass all the intelligence it has accumulated about each and every user, and utilize it in real time. And to achieve this business-critical goal, engineers must be able to turn real-time data insights into KPI improvements the one and only way they know how: by writing code.
PS. Is there a happy ending to our fictional company’s story, you ask? Did it accept its failings and learn its lessons? Has it found a way out of the data swamp of its own making? We’re happy to report that indeed it has. But that’s the subject of our future posts. Stay tuned!