Organizing your data science project

Gordan Kuvac
Data Science at Microsoft
7 min read · May 25, 2021

How would you define data science? More importantly, for this article, how would you describe the process of a data science project? There are many definitions and approaches, but no one-size-fits-all answer. Much depends on the circumstances and task at hand. Literature and academic papers reflect on these differences, and while many parts of data science work have a lot in common, they differ in important ways. For example, the process of building a machine learning model differs from the process of building an ingestion pipeline. Although both are data related — and might even serve the same overall goal — they embody distinct processes.

The job of the data scientist ranges widely, from doing basic but necessary tasks such as data cleansing, to developing advanced predictive models and putting them into production. Given this complexity, is it possible to have a framework that applies to most scenarios?


Planning to plan

Office Space, the iconic 1999 comedy film, features a scene in a conference room with a whiteboard note that says “planning to plan.” It ridicules what can be the never-ending and inefficient process of planning that happens too often in the corporate world, involving a multitude of diagrams, shapes, and arrows. And lists: top 3, top 4, top 5… bulleted, sub-bulleted, numbered. We’ve all been there.

In traditional data mining, one of the most widely used tools for planning is the CRISP-DM process (Cross-Industry Standard Process for Data Mining), a multi-step approach encompassing business understanding, data understanding, data preparation, modeling, evaluation, and deployment. In the data visualization space, a recommended approach involves considering purpose (the why), content (the what), structure (the how), and formatting (everything else). I have found these and many other individual approaches useful in my own data science work, but I have also wanted a list that applies more broadly across data science while remaining straightforward and easy to remember. So I created my own, consisting of four steps: problem, data, analysis, and storytelling.

Problem

What is the business question to be answered? Understanding the business need is critical. Before embarking on a long (and, at times, uncomfortable) journey of stitching data and chasing insights, you must clarify who wants to know what — and why. In an ideal world, requirements would come well formulated and next steps would be well defined. But that’s usually not the case. All business domains have their complexities, and stakeholders don’t always know exactly what they want or need. A certain level of ambiguity is inevitable at the start.

A stakeholder may approach you with a set of questions and their priorities, but it’s up to you to define what’s acceptable before you start working. I like to call this the process of reaching “enoughness,” a sufficient amount of information that enables you to say yes to beginning the project. The threshold may vary depending on domain and circumstances, but at minimum, some basic criteria should always be met: The request must serve company or team goals, and the outcome must be actionable and realistic. A way to validate this is to ask: If my business stakeholders had the insights they are looking for, what would they do with them? Of course, to begin, the data must be available, leading to the next stage.

Data

The building blocks, the ingredients, the meat — these are ways to think about data. Data represents what may easily be the least glorious but most important step in the process. (Data engineers: We love you!) Considering data means thinking about data sources, ingestion pipelines, databases, and tables. All your work depends on data. A house with a fancy balcony and a beach view is not very useful if it’s built with flawed materials. Likewise with data: This is where GIGO, or “garbage in, garbage out,” comes in. Data must be reliable in addition to being timely and accessible. Unfortunately, it rarely comes this way, meaning that a large portion of data science work involves finding and cleansing data so it is ready for use.

Remember, however, that not every data challenge carries the same weight. For instance, having inaccurate data is a much different problem than having incomplete data. Reaching a certain level of completeness may be acceptable, depending on your objective. Maybe a smaller sample size will do to test your hypothesis and be generalizable enough to make some inferences. Inaccurate data, however, whether raw or transformed, is never acceptable. These are the faulty bricks that will bring down the house, so you must take good care of them first. Once you’ve done that, you are ready to showcase your statistics and math skills.
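As an illustration, a completeness threshold of the kind described above can be checked with a few lines of code. This is a minimal sketch only: the record shape, the field names, and the 95 percent threshold are hypothetical, not taken from any particular pipeline.

```python
# Minimal completeness check: flag fields whose share of non-missing
# values falls below a threshold. Records are assumed to arrive as a
# list of dicts; the threshold of 0.95 is an illustrative choice.
def completeness(records, fields, threshold=0.95):
    """Return {field: ratio} for fields below the completeness threshold."""
    failing = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        ratio = present / len(records)
        if ratio < threshold:
            failing[field] = ratio
    return failing

# Hypothetical sample: "country" is missing or empty in half the rows.
rows = [
    {"user_id": 1, "country": "DE"},
    {"user_id": 2, "country": ""},
    {"user_id": 3, "country": "US"},
    {"user_id": 4},
]
print(completeness(rows, ["user_id", "country"]))  # {'country': 0.5}
```

Whether 50 percent completeness is acceptable depends entirely on the objective from step 1 — which is exactly the point of deciding these thresholds with stakeholders up front.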

Analysis

This is the step you have been waiting for — what most of us think about as data science. It involves translating the business problem from step 1, working with the data in step 2, and then moving into statistics, coefficients, and vectors. I will not describe the multitude of specific techniques and methodologies because that is not the purpose of this article, and they are covered extensively elsewhere. The overall process, however, can be summarized in two stages: exploratory analysis and advanced analytics. Exploratory data analysis (or EDA) is a must in any scenario. It provides not only a “feel” of the dataset and its basic statistical properties, but also serves as a preliminary means of communication back to the stakeholder.

Remember the ambiguity issues from step 1? The first insights surfaced in EDA are a great conversation-starter to address these. As you communicate initial findings back to the business, they help you in two ways: 1) To confirm whether your analysis is going in the right direction, and 2) To enable you to speak intelligently about the problem. When you have a fundamental understanding of the problem space, you can engage more deeply in stakeholder conversations and propose next steps. Oftentimes, good EDA is more than enough for stakeholders to take action, meaning you don’t have to go beyond it. But if you establish the need for something more, you are ready to invest your time in further work involving statistical learning and ML modelling. Then when you are finished, you are ready to share your work so it achieves impact.
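A first EDA pass of the kind described above can be sketched with nothing beyond the standard library. The daily_active_users series below is invented purely for illustration; the point is that even crude summary statistics surface conversation-starters (here, the gap between mean and median flags an outlier worth asking the stakeholder about).

```python
# Minimal EDA sketch: basic statistical properties of a (made-up) series.
import statistics

daily_active_users = [120, 135, 128, 410, 131, 127, 133]

summary = {
    "n": len(daily_active_users),
    "mean": round(statistics.mean(daily_active_users), 1),
    "median": statistics.median(daily_active_users),
    "min": min(daily_active_users),
    "max": max(daily_active_users),
}
# mean (169.1) sits well above median (131): a hint that the spike of
# 410 is an outlier — a data issue, or a real event worth discussing?
print(summary)
```

In practice you would reach for a richer toolkit (a dataframe library’s summary and plotting functions, for instance), but the goal is the same: get a feel for the data before committing to heavier methods.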

Storytelling

It’s important to not take this last step for granted or to skip over it: It can make or break the successful delivery of a data science project. Although the output of your analysis is full of useful and fascinating facts and findings, it’s your job to clean up and deliver them in the best possible way so stakeholders can easily consume them.

Primarily, you must know your audience. What do they care about? What kind of terminology are they familiar with? As data scientists, we often get the urge to explain the nuts and bolts of the algorithm powering our analyses. Transparency is critical, and methodologies should be well documented and available. But unless it’s a technical audience who really cares about the how, keep your focus on the message. Nancy Duarte, in her excellent book Resonate, calls this “The Big Idea.” You must also provide the supporting evidence — effective and well-annotated visualizations, whose role is to convey meaning and establish trust. To get to this point, be ready to relentlessly iterate, gather feedback, and discard a lot of your work. Keep only what’s necessary and remove the rest.

[Diagram: the transition from “what you do” to “what you show”]

This is where it comes full circle, where you provide one or more solutions to the problems described in the first step. Remember, your goal is not merely to have the best set of insights. Your goal is to have the best set of insights that someone understands and is willing to use. If this is not the case, the result runs the risk of not making enough of an impact on the business, like having a Lamborghini parked in your garage that you use only occasionally for picking up groceries from a local store.

Conclusion

In this article I’ve described a set of four steps for completing your data science project:

  • Problem describes the goals you are trying to achieve, who will benefit from your analysis, and the business impact you seek to influence. Make sure you have a good understanding of the problem before you invest time in solving it.
  • Data encompasses the main ingredients, including sources, pipelines, and warehouses, along with all the transformations and logic that make the data ready to use. Having full confidence in your data is critical because all of your work depends on it.
  • Analysis represents the core analytics process and data science methodology you conduct to transform data into insights. Sometimes, exploratory analysis is sufficient to address the problem, while in other cases you must employ more sophisticated methods and predictive modelling.
  • Storytelling is the final step, where you package your output into curated, compelling, and consumable evidence that is useful for communicating knowledge to your audience. Be selective in what you show and deliver to make an impact.

You can apply this framework directly or modify it to fit your own situation and style. But in any event, simplifying the end-to-end process, generalizing it, and describing it in simple terms is highly beneficial. It gives you a sense of control and sets you up for success.

Gordan Kuvac is on LinkedIn.
