Moving from Data Science to Model Science

Steve Jones
Data & AI Masters
5 min read · Sep 27, 2022


On a recent AI futures call we discussed the impact of Auto-ML, diffusion models and other generative approaches on the career of data scientists. The general consensus was that in future we will see the rise of Model Scientists, who test and prove out the best solution from models that are auto-generated. The role shifts from a Data Scientist creating and proving individual models from data to one who validates and proves across hundreds or thousands of models.

The Data Science Process


The normal role of a data scientist includes a significant amount of data collection, cleaning and building a basic understanding of the data, before progressing to creating and testing models.

CRISP-DM

This is the dominant approach in data science and AI today: understanding the data and manipulating it to create the most effective models. Crafting models and iterating through data preparation take up most of a data scientist’s time.

Will model generation change the job?

How will this change if model generation becomes the ‘simple’ part of the process and model evaluation becomes the challenge? Where evaluation is no longer about data preparation and understanding, but about guiding and informing an Auto-ML solution as it generates models? What if understanding the data isn’t as important as understanding the business and the outcomes?

Excel didn’t kill accounting, Auto-ML won’t kill data science

Way back, a large part of accounting was just adding up columns of numbers. This was one of the first tasks computers were applied to, and then with the PC we got spreadsheets. Does this mean accounting disappeared because adding up financials became automated? Of course not. In many ways it unleashed accountancy into areas that were simply not possible without that automation, some of which have proven ‘problematic’.

The Model Scientist Cycle

So if we have a world where Auto-ML can generate a thousand potential models, how will this alter the role of the data scientist to become a model scientist?

Model Science Cycle: Business Understanding → Data Understanding → Outcome Understanding → Data Prep → Model Generation → Outcome Validation. If the outcomes are wrong, the cycle returns to Outcome Understanding and Data Prep; if they are right, it proceeds to Model Validation. If Model Validation fails, the cycle returns to Business Understanding; otherwise the model moves to Deployment, which feeds back into Outcome Validation.

Because the models themselves are no longer created by the data scientist, the role splits into two elements: first, understanding the outcomes that are required, both to assist with reinforcement learning and to keep that focus separate from model validation; and second, validating the model against the business constraints.
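That two-element loop can be sketched in a few lines of Python. This is purely illustrative: the “models” are plain dicts, and the outcome and trust checks are stand-ins for real validation logic, not any actual Auto-ML API.

```python
# Runnable sketch of the model-science cycle. Everything is illustrative:
# "models" are plain dicts, and the checks are stand-ins for real logic.

def generate_models(n):
    # Stand-in for an Auto-ML step emitting many candidate models.
    return [{"id": i, "outcome_ok": i % 3 == 0, "trusted": i % 6 == 0}
            for i in range(n)]

def model_science_cycle(n_candidates, max_loops=5):
    for _ in range(max_loops):
        candidates = generate_models(n_candidates)

        # Element one: outcome validation. Filter on results, not internals.
        passing = [m for m in candidates if m["outcome_ok"]]
        if not passing:
            continue  # back to outcome understanding and data prep

        # Element two: model validation against business constraints.
        trusted = [m for m in passing if m["trusted"]]
        if not trusted:
            return None  # back to business understanding

        return trusted[0]  # deploy, then keep verifying outcomes

    return None

deployed = model_science_cycle(12)
```

Note the ordering: candidates are filtered on outcomes first, and only the survivors are put through the more expensive trust checks.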

Why outcomes are critical

The biggest change over CRISP-DM is the focus on outcomes, not just the data. Because we are looking at potentially thousands of generated models, we are more interested in validating their fit in driving outcomes than in validating the models themselves. That still needs to be done, but only after we have validated that a model is delivering the right results. This is linked to the need to explain decisions rather than models, and to understanding how you validate outcomes.

With Auto-ML the model scientist needs to be thinking much more about what “good” looks like in terms of outcomes than about preparing data and tuning a specific model. This is particularly true when reinforcement learning is being used, but it is generally the case for all Auto-ML approaches.

This is also an area where automation will become hugely important: a model scientist cannot review a thousand models one by one. If none of the models can produce the required outcomes, then back through the cycle it goes; if some can, it is on to the next stage.
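That automated triage can be sketched as a simple filter: define “good” up front as an outcome threshold and keep only the candidates that clear it. The candidate list and threshold below are simulated, not output from any real Auto-ML tool.

```python
import random

random.seed(0)

# Hypothetical Auto-ML output: 1,000 candidate "models", each already
# scored against the business outcome we care about.
candidates = [{"id": i, "outcome_score": random.random()} for i in range(1000)]

# "Good" is defined up front in outcome terms, not model terms.
OUTCOME_THRESHOLD = 0.95

# Automated triage: no one reviews these one by one.
passing = [m for m in candidates if m["outcome_score"] >= OUTCOME_THRESHOLD]

if not passing:
    print("No model meets the outcome bar: back through the cycle it goes.")
else:
    print(f"{len(passing)} of {len(candidates)} candidates move to model validation.")
```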

Validate the model if it delivers the right outcomes

Trimming the thousands of models down to a few candidates takes us to the second stage: validating the model itself. Here we are looking at Trusted AI and ensuring the model meets our business, regulatory and, importantly, ethical constraints.

In other words is the model reaching the right outcomes, but using the wrong features to get there? Is the model going to be stable under all conditions, when will it become unstable? Do we trust the model to do its job and can we explain the decisions that it is making? We might not understand the details of the model, but we must understand the conditions under which it makes its decisions so we can understand where it can, and therefore will, go wrong.

At this stage we might find that while all of the generated models appear to produce the right outcomes, they are not doing so in a way we can really trust, so we need to go back to understanding the business challenge and around the loop again. If, however, we are confident in the model, we move to the stage where we hold our breath.
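One of those trust checks, stability under small input perturbations, can be sketched as follows. The model here is a toy linear scorer standing in for whatever the Auto-ML system generated, and the epsilon and tolerance values are illustrative choices, not standards.

```python
import random

random.seed(1)

def toy_model(features):
    # Toy linear scorer standing in for a generated model.
    return 2.0 * features[0] - 0.5 * features[1]

def stability_check(model, inputs, epsilon=0.01, tolerance=0.1):
    """Flag inputs where a tiny perturbation swings the output sharply."""
    unstable = []
    for x in inputs:
        base = model(x)
        nudged = model([v + epsilon for v in x])
        if abs(nudged - base) > tolerance:
            unstable.append(x)
    return unstable

inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(100)]
unstable = stability_check(toy_model, inputs)
# A linear model is stable everywhere; a generated model might not be.
```

A real check would run the same perturbation sweep over the candidate models that survived outcome validation, and any model with a non-empty unstable set would go back around the loop.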

Deploy and verify

It is not right to think of “deploy” as the final stage; in reality it is simply a continuation of the cycle. We deploy, but to trust the model we must continue to verify its outcomes, in case it moves beyond its bounds or reacts badly to inputs we had not expected. This is why Trusted AI is so critical: we need to understand this trust across the whole lifecycle.
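A minimal sketch of that ongoing verification: compare live outcomes against the bounds established during validation and flag anything outside them. The bounds and the outcome stream below are invented for illustration.

```python
# Outcome bounds established during model validation (illustrative values).
LOWER, UPPER = 0.0, 1.0

def verify_outcomes(stream, lower=LOWER, upper=UPPER):
    """Return indices of live outcomes that fall outside trusted bounds."""
    return [i for i, y in enumerate(stream) if not (lower <= y <= upper)]

# Simulated production outcomes; a real deployment would stream these in.
live_outcomes = [0.4, 0.7, 1.3, 0.5, -0.2]
alerts = verify_outcomes(live_outcomes)
```

Each alert is a signal that the model may have moved beyond the conditions it was validated under, which feeds straight back into Outcome Validation in the cycle.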

Model Science is an evolution of Data Science

Data Science focused on deriving insight from data, often using models. In Model Science it is the combination of data and models that matters. Newer techniques that generate large numbers of models, or where the focus of influence is on the model itself, require us to evolve the profession and take on new skills and new methods, while remaining focused on the ultimate goal: delivering better outcomes.

Just as Agile is not a removal of discipline from software development, it is not right to think of Model Science as “just” the generation of models and some testing. Just as Agile demands more formalism and discipline, Model Science increases the discipline required to deliver on those outcomes. Anyone can generate a model and deploy it blindly; it takes real discipline and focus to work with complex generative approaches and sculpt a provable, explainable outcome.
