A Data Science Practitioner’s Guide (Part 1: Scoping)

Introduction

In my work, I’ve become acutely aware of a disconnect between the power and potential of AI research and companies' ability to actually harness these technologies to deliver practical business value.

As a data science practitioner, I come across clients daily who have been burned by failed AI projects and are increasingly skeptical of the ability of machine learning (ML) to deliver tangible value. I’ve helped many of these companies turn their data products around and I like to think I’ve learned a thing or two along the way.

In this three-part series, I’ve tried to distill some of these lessons to help other data science practitioners avoid the most common mistakes companies make when undertaking ML projects. The series is broken down into lessons about (1) scoping, (2) modeling, and (3) deployment. This is the first part.

What is ‘Scoping’ for a Data Science Project anyway?

The ‘Scoping’ or ‘Ideation’ phase of data science involves understanding the underlying business problem, linking it to potential analytical solutions, and assessing whether all the building blocks are in place for a solution to actually work! It should be the first step for any potential ML project and, if you get it right, everything downstream will be a lot easier.

Below I’ve summarized five of the most important lessons I’ve learned while scoping data science and ML projects:

1. Defining the business and technical KPIs for a project is so important:

Without understanding the business and technical goals of a project (and how they relate to each other!), it is impossible to make sure an ML project is actually valuable or that its value can be measured. That is why it is so important to engage the relevant business and technical stakeholders as early as possible and to ensure their goals for the project align and can be measured effectively. Once these goals are clear, make sure everyone is on the same page and keep the business goals front and center during the entire project. Your ultimate goal is not to create a high-performing model but to reduce the time/pain/cost for real people doing real things! Whether a technical KPI is good enough (is 80% accuracy a good or bad outcome?) depends entirely on the business goals of the project, and this should be reinforced often.
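To make the link between a technical KPI and a business KPI concrete, here is a minimal sketch based on the ticket-routing example used later in this article. The function name and every number in it (tickets per month, minutes per ticket, the cost of fixing a misrouted ticket) are hypothetical assumptions for illustration, not figures from a real project.

```python
def monthly_minutes_saved(accuracy, tickets_per_month, manual_minutes, rework_minutes):
    """Rough estimate of time saved per month by automating ticket routing.

    Assumption: each correctly routed ticket saves `manual_minutes` of manual
    work, while each misrouted ticket costs `rework_minutes` to find and fix.
    """
    correct = accuracy * tickets_per_month
    wrong = (1 - accuracy) * tickets_per_month
    return correct * manual_minutes - wrong * rework_minutes


# Same model, same 80% accuracy, two hypothetical business contexts.
cheap_errors = monthly_minutes_saved(0.80, 10_000, manual_minutes=2, rework_minutes=3)
costly_errors = monthly_minutes_saved(0.80, 10_000, manual_minutes=2, rework_minutes=10)

print(f"Errors are cheap to fix:  {cheap_errors:.0f} minutes/month")
print(f"Errors are costly to fix: {costly_errors:.0f} minutes/month")
```

Under these made-up numbers, the same 80% accuracy saves about 10,000 minutes a month when mistakes are cheap to fix and loses about 4,000 minutes a month when they are expensive. The technical KPI only means something once it is tied to the business one.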

2. Bubbling up data requirements and privacy issues as early as possible will make your life a lot easier:

All too often the theoretical value of an ML project is mortally undermined by data constraints. If you have the perfect use-case but the training data is not being collected in an appropriate way, then building an ML model is not going to be immediately possible. Similarly, if the data exists but privacy regulations mean it cannot be viewed or used for other purposes, you’ll again find the project stuck. It is best to understand and flag these issues as soon as possible and map out possible solutions before committing to building the actual project. Often this means working with the client to include data collection and anonymization as part of the (pre-)project remit, which can also be a great way to teach companies about best practices in data storage and privacy protection for ML projects.
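As one illustration of what an anonymization step in a pre-project remit might look like, here is a minimal sketch that replaces direct identifiers with a keyed hash using Python's standard `hmac` and `hashlib` modules. The record layout, field names and key are hypothetical. Note that this is pseudonymization rather than full anonymization: free-text fields can still contain personal data, and whether such a step satisfies a given regulation is a question for the client's legal and privacy teams.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same value and key always give the same token, so records can still
    be joined, but the original value is not stored in the output.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_records(records, sensitive_fields, secret_key):
    """Return copies of the records with the sensitive fields pseudonymized."""
    return [
        {
            field: pseudonymize(value, secret_key) if field in sensitive_fields else value
            for field, value in record.items()
        }
        for record in records
    ]


# Hypothetical ticket records; the key would be held by the client, not the modeling team.
tickets = [
    {"email": "jane@example.com", "category": "billing"},
    {"email": "jane@example.com", "category": "login"},
]
safe_tickets = pseudonymize_records(tickets, {"email"}, secret_key=b"hypothetical-key")
```

Because the token is stable for a given key, the two tickets above can still be recognized as coming from the same customer without exposing the email address itself.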

3. There is no substitute for actually seeing how people work:

One of the key things I always do in the scoping phase is to understand what the current (manual/clunky/time-consuming) process is and how a new ML-powered process might improve on it. To do this there is no substitute for seeing directly how the people involved currently work. For example, if you are proposing a model that will automatically map customer-request tickets to the right category instead of the manual way it is currently done, go and speak to the actual people doing that task! It will help you think through what the problem is, and it may surface nuances that simply were not mentioned before. The client (especially if it is someone not directly connected to the actual work) does not always see things as they actually are. You need to go and find out yourself, so set up a call with actual users and start talking to them directly! This will make you more familiar with the real pain points you are solving, and it will also help get buy-in from the people who will ultimately use your solution!

4. Making time and performance promises is stupid:

There is often a temptation when scoping ML projects to make promises about how much time a model will take to build or how well it will perform. Statements like “we will build a model with >99% accuracy” are bullshit and should be treated as such. So are statements like “we will be able to build this model in 2 sprints” or “we will need 1,000 examples of the data input to begin modelling”. The truth is that questions like these can only be answered empirically (i.e. you need to look at the data before you can estimate how accurate a model could be, or how many data inputs you’d need). What you can do instead is promise to explore different options and deliver concrete answers on the viability of those options within a given time. For example, once you have data access, you can write a 2-week gate into the project with the clear deliverable of having evaluated how viable it would be to use ML for the given use-case at all. This is not only much easier to estimate but also gives the client the evidence and options they need to decide whether to continue financing the full ML project. To this end it is also helpful to scope ML projects into multiple stages with ‘gates’ at the end of each, where you aim to answer clear and definable questions like this. Such gates let you and the client know how the project is going in an evidence-based way and whether it makes sense to continue, adapt or stop the project in an agile way. Gates are a win-win and can help reduce the risk of ML projects for both parties, so it is worth discussing where and why gates make sense even during the scoping phase.
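One way a gate can produce an evidence-based answer is to compare a first quick model against a trivial baseline on held-out data. The sketch below is only an assumption about what such a check could look like: the function names, the labels, the 80% figure for the first model and the 15-point uplift threshold are all hypothetical, and the right threshold would be agreed with the client from the business KPIs.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


def passes_gate(model_accuracy, baseline_accuracy, min_uplift):
    """Gate decision: continue only if the model beats the baseline by min_uplift."""
    return model_accuracy - baseline_accuracy >= min_uplift


# Hypothetical ticket categories in which one class dominates.
train = ["billing"] * 6 + ["login"] * 3 + ["other"] * 1
test = ["billing"] * 7 + ["login"] * 2 + ["other"] * 1

baseline = majority_baseline_accuracy(train, test)
continue_project = passes_gate(model_accuracy=0.80, baseline_accuracy=baseline, min_uplift=0.15)
print(f"Baseline accuracy: {baseline:.2f}, continue: {continue_project}")
```

In this made-up case, always guessing "billing" is already right 70% of the time, so a first model at 80% does not clear the agreed uplift. That is exactly the kind of finding a gate should surface before more budget is committed.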

5. Honesty is the best policy:

Finally, and probably most importantly, one of the most valuable lessons I’ve learnt about scoping ML projects is that honesty and openness are the best policy. During a period of relative disillusionment with ML, where many companies have been burnt by expensive investments in failed “AI” projects, clients really value honest and unbiased advice on if and where ML actually makes sense for their business. Indeed, some problems simply don’t require advanced ML solutions, and being honest about this (even if it means descoping a particular project to use a much simpler/cheaper solution) is critical for building trust with clients. The goal of any project is to solve people’s problems, not to develop ‘complex’ AI solutions for the sake of it. Building this trust and understanding with clients is perhaps the most critical foundation for the success of any ML project and leads to much more fruitful and sustainable relationships in the long run.

I’ve found that an early discovery workshop with the client, going through a framework like the AI canvas, can really help in the ideation phase to make sure you have not forgotten important business and technical topics while scoping the project.

Conclusion

Scoping is an absolutely critical step in any successful ML project and, if you get it right, then everything downstream will be a lot easier.

I hope this article was helpful! For more information on scoping ML projects or anything else, feel free to reach out to me directly. I look forward to hearing from you, and I’ll see you in the next part of this series.


Geek Culture

A new tech publication by Start it up (https://medium.com/swlh).

Written by Samuel Dylan Trendler King

Machine Learning | Data Science | Climate Change
