Data Verticals for the Data Hierarchy: Tips for practical implementation

Natalie Nakamine
4 min read · Sep 27, 2021


Many articles exist on the hierarchy of data needs, a framework modeled after Maslow’s hierarchy of needs. Its simplicity has the potential to make analytics prioritization very consumable, as it’s easy for non-data folks to understand.

This popular Medium article summarizes the concept nicely. The main takeaway is that data can exist at different stages: collect, clean, define & track, analyze, and optimize & predict. Needs at the lower end of the hierarchy must be addressed before you can move up. Ex. you can’t analyze data that isn’t collected.

The practice of implementing these stages may be a little more complicated than the hierarchy suggests. Moving between stages may not be a linear process: there will likely be iteration between steps, and not all data at the company will move through the hierarchy at the same rate. Sounds messy, right? How do we make sense of this?

Data Verticals

Enter data verticals! A data vertical is a concept that can supplement data hierarchies. Think of data verticals as the different types of data at the company: email data, website event data, mobile app data, etc. There’s a good chance these types of data are captured differently at your company. Maybe some data are stored in Segment tables, in tables generated by an Amazon Kinesis Firehose, or with a third-party email service.

Data verticals should be defined with respect to the business. Some examples might be A/B testing, upgrades, email data, etc. These verticals could even be split by product. Data verticals serve as a concept to make the hierarchy of data needs more flexible and tangible.

Example Data Vertical

Let’s go through the data hierarchy with email data as an example.

1. Collect

As an analyst, you likely don’t have a lot of control over which data are collected and end up in the database. Collaboration with software developers or data engineers is incredibly important to ensure that the necessary data are being collected properly. At this stage, the data may not be the most interpretable, and answering business questions with ad hoc queries against these tables can be dangerous to the business, as reproducibility is unlikely.

Using email data as an example, the data here might be in the form of a few tables. Let’s call these tables email_messages, email_opens, and email_clicks. Answering any question about email click rates requires a few joins, as sketched below.
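
To make that concrete, here is a rough sketch of the join logic a single click-rate question might require at this stage. The table names come from the example above, but every column (message_id, user_id, clicked_at, etc.) is an assumption about what your raw tables might contain, and the SQL is Postgres-flavored.

```sql
-- Hypothetical click-rate query against raw collection tables.
-- Assumes email_messages has one row per (message_id, user_id) send
-- and email_clicks has one row per click; neither schema is given
-- in the article, so treat these columns as placeholders.
SELECT
    m.message_id,
    COUNT(DISTINCT c.user_id) * 1.0
        / COUNT(DISTINCT m.user_id) AS click_rate
FROM email_messages AS m
LEFT JOIN email_clicks AS c
    ON c.message_id = m.message_id
   AND c.user_id = m.user_id
GROUP BY m.message_id;
```

Every analyst rewriting this join by hand is a chance for definitions to drift, which is exactly the reproducibility risk described above.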

2. Clean, define and track

I’ve merged the next two steps in the hierarchy for simplicity. At this stage, a data engineer, analytics engineer, or full-stack analyst might begin to create cleaner tables using tools like dbt or Airflow.

Continuing with our email data example, imagine moving from multiple tables to one email table (let’s call this table email_funnel) that tracks email sends, opens, and clicks for each email message. At this point, defining metrics with stakeholders is important. Alignment with product managers and leadership is needed to answer questions like the following (a model encoding these decisions is sketched after the list):

  • What is a click? Ex. Maybe we only want to count clicks on the primary CTA.
  • Do we want to restrict clicks to a certain window after the email is sent so the metric doesn’t keep changing indefinitely? Ex. Maybe we only count clicks within a week of the send, which ensures that email numbers won’t change after a week.
  • Do we want to remove certain types of clicks, such as unsubscribes, from clicks? Ex. Maybe we want to remove unsubscribes and track them in a separate field to monitor our unsubscribe click rate.
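
Since a dbt model is essentially a SELECT statement, these decisions can be encoded once in the model itself. Below is a minimal sketch of what an email_funnel model might look like, reusing the hypothetical columns from the collect stage plus illustrative is_primary_cta and link_type fields on email_clicks (neither comes from the article).

```sql
-- Hypothetical dbt-style model: models/email_funnel.sql
-- One row per (message_id, user_id) send, with the agreed-upon
-- metric definitions baked in. All column names are assumptions.

WITH opens AS (
    SELECT message_id, user_id, MIN(opened_at) AS first_opened_at
    FROM email_opens
    GROUP BY message_id, user_id
),

clicks AS (
    SELECT
        c.message_id,
        c.user_id,
        -- "Click" = primary CTA only, within a week of the send
        COUNT(CASE WHEN c.is_primary_cta
                    AND c.clicked_at <= m.sent_at + INTERVAL '7 days'
              THEN 1 END) AS cta_clicks,
        -- Unsubscribes tracked in their own field, not as clicks
        COUNT(CASE WHEN c.link_type = 'unsubscribe' THEN 1 END)
            AS unsubscribe_clicks
    FROM email_clicks AS c
    JOIN email_messages AS m
        ON m.message_id = c.message_id
       AND m.user_id = c.user_id
    GROUP BY c.message_id, c.user_id
)

SELECT
    m.message_id,
    m.user_id,
    m.sent_at,
    o.first_opened_at,
    COALESCE(cl.cta_clicks, 0) AS cta_clicks,
    COALESCE(cl.unsubscribe_clicks, 0) AS unsubscribe_clicks
FROM email_messages AS m
LEFT JOIN opens AS o
    ON o.message_id = m.message_id AND o.user_id = m.user_id
LEFT JOIN clicks AS cl
    ON cl.message_id = m.message_id AND cl.user_id = m.user_id;
```

Pre-aggregating opens and clicks in their own CTEs before joining avoids fan-out: a user with several opens and several clicks would otherwise multiply the counts.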

3. Analyze

Once clean tables exist, it becomes much easier to answer questions within the relevant business vertical.

With our new email_funnel table, monitoring open rates, click-through rates, unsubscribe rates, etc. can now be done by querying one email_funnel table instead of 3 different email tables. Analyses are also more reproducible when querying from the email_funnel table. We no longer need to be concerned with how different analyses are defining “click” because that definition is embedded in the email_funnel table.
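
As a sketch of what this looks like in practice, the rates mentioned above become a single rollup query; column names follow the hypothetical email_funnel model from the previous stage.

```sql
-- Weekly email funnel rates from the single modeled table.
-- Assumes one email_funnel row per (message_id, user_id) send.
SELECT
    DATE_TRUNC('week', sent_at) AS send_week,
    AVG(CASE WHEN first_opened_at IS NOT NULL THEN 1.0 ELSE 0 END)
        AS open_rate,
    AVG(CASE WHEN cta_clicks > 0 THEN 1.0 ELSE 0 END) AS click_rate,
    AVG(CASE WHEN unsubscribe_clicks > 0 THEN 1.0 ELSE 0 END)
        AS unsubscribe_rate
FROM email_funnel
GROUP BY 1
ORDER BY 1;
```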

4. Optimize and predict

We can now use ML to build on findings from analyses. Questions like “What’s the probability that user x will open this email?” can now be explored. Note that the data may not be in a perfect state to create predictive models at this point. It’s likely that feature engineering will need to take place to create relevant email variables, such as average email metrics for each user. The feature engineering can still use the email_funnel table instead of the 3 tables mentioned in the collect phase, ensuring consistent definitions for email metrics.
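
For instance, a feature like “average email metrics for each user” might be built with a query along these lines, again assuming the hypothetical email_funnel columns sketched earlier.

```sql
-- Hypothetical per-user features for an open-prediction model,
-- derived from email_funnel so metric definitions stay consistent.
SELECT
    user_id,
    COUNT(*) AS emails_received,
    AVG(CASE WHEN first_opened_at IS NOT NULL THEN 1.0 ELSE 0 END)
        AS historical_open_rate,
    AVG(CASE WHEN cta_clicks > 0 THEN 1.0 ELSE 0 END)
        AS historical_click_rate
FROM email_funnel
GROUP BY user_id;
```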

Notice how we only moved through the data hierarchy with email data. Non-email data may still be in the same state it was in prior to this cleanup. The next step would be to define the next data vertical and move through the data hierarchy again!

Final Thoughts

My hope for this article is that the concept of data verticals makes intuitive sense and provides some structure for folks who want to move through the data hierarchy in their own work.

Moving through the data hierarchy one data vertical at a time takes time. It is explicit work that requires collaboration with different stakeholders and should have dedicated resources, such as analytics engineers.

Writing this article also left me with some lingering questions:

  • Are there data verticals that can be universal across different companies?
  • Are there types of data that simply don’t fit the data hierarchy and data vertical frameworks?
  • What’s the best route for defining metrics? This is often an iterative process in the clean, define, and track step. Is there a way to optimize the process?

Would love to hear your thoughts!
