Footnote: Data Science Hierarchy of Needs

Jonathan Burley
Mar 27, 2023

In another post I talk about a data science hierarchy of needs and how it extends to a broader concept of thinking about a mountain range of data products and the roots those mountains should build on. Considering the long-term view of data products as a mountain range is a novel framework but, as I discovered when researching the article, a data science “hierarchy of needs” is not.

Despite data science hierarchy of needs graphics already existing, I made a new mountain-based graphic and changed the associated text from existing formats. Is that worth talking about? And why did I do it?

I am talking about it for two reasons. Partly I want to more explicitly recognise that others used the hierarchy concept before I did (rather than just linking to them in the article), and partly I want to explicitly explain why the changes were made rather than leaving the reader guessing at the underlying reasoning.

To explain the why, I’ll first show my graphic alongside the progenitor that all others copy:

Hierarchy of data science: a single mountain to climb, with successive tiers of profitable facility gained as it is ascended.
Jonathan Burley, Data Science Hierarchy of Needs, 2022 (author). Defines a sequence of capabilities gained by maturing data science organisations.
Monica Rogati’s Data Science Hierarchy of Needs, circa mid-2017. The earliest use of the concept I can find. Insightful enough that it has been copied verbatim across the internet. It defined the requirements of production AI systems.

The changes I’ve made are:

  1. The example texts now start with a descriptive goal that defines the facility gained at this tier (collect raw data, data storage, descriptive analytics, diagnostic analytics, predictive models and prescriptive optimisation). This tells us the intent of the tier in general terms alongside the list of tasks that commonly occur.
  2. The X/Y tier titles have been altered. The less important change is making them alliterative when possible. The more important change is aligning their meaning to the slightly adjusted tiers:
    — What happened? (descriptive analytics | Explore/Exploit)
    — Why did it happen? (diagnostic analytics | Learn/Label)
    — What will happen? (predictive analytics | Predict/Prove)
    — What is the best action we can take? (prescriptive optimisation | AI/Continuous Learning)
  3. Each tier has a clear reason why a full and successful data science product can stop there (successful data science does not need to be, and often should not be, AI). It is clear why a business can be thrilled to act on the understanding gained from diagnostic analytics in Learn/Label; it is less clear how the similar Aggregate/Label tier in the original hierarchy leads to a decision or product with a positive ROI.
  4. I don’t mention data cleaning. This is a philosophical point: since 2017 data cleaning has, in my opinion and others’, become an unclear term that hinders practitioners. Cleaning is an ongoing process of understanding data and applying reusable transformations, and it spans storage (outliers and data typing), descriptive analytics (anomalies), and diagnostic analytics (feature engineering and labelling). Therefore I do not name “data cleaning” as part of any individual tier.
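To make point 4 concrete, here is a minimal Python sketch of cleaning spread across tiers as small reusable transformations rather than one "cleaning" step. Everything in it is an assumption for illustration: the function names, the `store`/`sales` fields, the anomaly threshold and the sales target are all hypothetical and do not come from either hierarchy.

```python
from statistics import mean, pstdev


def enforce_types(rows):
    """Storage tier: cast raw strings to typed values; drop rows that cannot be parsed."""
    typed = []
    for row in rows:
        try:
            typed.append({"store": str(row["store"]), "sales": float(row["sales"])})
        except (KeyError, ValueError, TypeError):
            continue  # unparseable record, excluded at storage time
    return typed


def flag_anomalies(rows, threshold=2.0):
    """Descriptive tier: flag rows more than `threshold` standard deviations from the mean."""
    values = [r["sales"] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    return [
        {**r, "anomaly": sigma > 0 and abs(r["sales"] - mu) / sigma > threshold}
        for r in rows
    ]


def add_label(rows, target=100.0):
    """Diagnostic tier: engineer a simple label (did the store meet a hypothetical target?)."""
    return [{**r, "met_target": r["sales"] >= target} for r in rows]


# Hypothetical raw records, including one value that cannot be typed.
raw = [
    {"store": "A", "sales": "120"},
    {"store": "B", "sales": "95.5"},
    {"store": "C", "sales": "n/a"},
    {"store": "D", "sales": "110"},
]

# Each transformation is reusable on its own and composes with the others.
cleaned = add_label(flag_anomalies(enforce_types(raw)))
```

Each function belongs to a different tier and can be reused by any later product, which is why no single tier is labelled "data cleaning".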

These changes have one major goal: promoting a view of data science ecosystems with multiple versions of success, where valuable applications of data science exist in each tier of the hierarchy and projects can target end-goals in analytics and KPIs.

The most important part of data science is the “so what?” of your discovery. What should be done differently because of this, and how can that change be realised? Seen this way, the entire toolkit of data science is focused on making better decisions and changing outcomes, and it becomes clear that a company can make significant, meaningful improvements through analytics without ever touching artificial intelligence.

I don’t think my hierarchy is a correction to the pre-existing hierarchy views; it is merely a different lens through which we can observe the process. Both should absolutely exist and be used. If your end goal is a deep learning AI deployed at scale, then I’ve added nuance that distracts from your end goal; if you’re working to establish data science products within a business context, then this hierarchy should be a good way to think about it.

That is a partial answer to why I published the modified hierarchy: it helps with decisions about a single product. The rest of the “why” comes from the systematisation we can build on this framework.

The corollary to my data science hierarchy, with its sequential tiers of reusable capabilities, is the “product range of data science”. If each product is a mountain peak, then over time an organisation is building a mountain range with shared roots.

Product range of data science: a suite of data products built on the same mountain roots. Future products and experimentation are progressively cheaper when components are reusable.
Product Range of Data Science: Like a mountain range there are multiple peaks building on the same roots, and to different tier-levels in the hierarchy. Pre-existing core functionality makes future projects and experimentation faster and cheaper. Valuable data projects can succeed with descriptive analytics (what trends are in the data) and diagnostic analytics (why did those happen).

Smart organisations should plan investments accordingly and prioritise items based on the net present value of future initiatives. Experimentation and future products are progressively cheaper when components are reusable.
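As a rough illustration of that prioritisation, the sketch below compares the net present value of the same hypothetical product built two ways: standalone, or on top of existing shared roots. The discount rate and all cash flows are invented for the example, and `npv` is a plain helper written here, not a library function.

```python
def npv(rate, cash_flows):
    """Net present value, where cash_flows[t] occurs t years from now (t = 0 is today)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


DISCOUNT_RATE = 0.10  # assumed annual discount rate

# Hypothetical product that must first build its own collection and storage tiers.
standalone = [-100.0, 60.0, 60.0]

# The same product when those lower tiers already exist and are reused.
reusing_roots = [-40.0, 60.0, 60.0]

standalone_npv = npv(DISCOUNT_RATE, standalone)
reusing_npv = npv(DISCOUNT_RATE, reusing_roots)

print(f"standalone:    {standalone_npv:.2f}")  # 4.13
print(f"reusing roots: {reusing_npv:.2f}")     # 64.13
```

With these made-up numbers the returns are identical, and the whole difference in value comes from the up-front cost that the shared roots have already paid.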

This product range view answers the “why” of publishing a different data science hierarchy. We need a slightly different view of the hierarchy to link into a product range view. And as more companies become familiar with data science and AI and need to build or buy data science tools, it is important for them to have the product range framework to think about those decisions.

Edit: Since publication the Hierarchy has had many thousands of views. This technical footnote, interestingly, has a longer tail of views than the main article. It looks like it might be linked from some company wikis? Feel free to leave a comment or reach out if you have questions.

Jonathan Burley

Head of Data Science at Actif.ai | PhD computational models | Oxford & Cambridge grad