A taste of data science: how a vegetable gratin can turn your next data project into a success

Julia Suter
Data science at Nesta
12 min read · Apr 13, 2023

With this blend of cooking analogies and practical tips, we provide a step-by-step guide for planning and executing a successful data science project. Whether you’re new to the field of data science or a seasoned data scientist looking for a fun way to explain the bread and butter of your work to others, this recipe for success serves food for thought in a refreshing and easily digestible way. So grab your apron and learn how to turn half-baked ideas into cream-of-the-crop data science projects.

Vegetable gratin

Data scientists often work in interdisciplinary teams and need to explain their projects and ideas to non-technical colleagues and clients. But without a shared vocabulary to help explain and understand the processes of a data project, collaboration is not a piece of cake. Unrealistic expectations (“Can’t you just do a quick analysis?”), technical gibberish (“I’m refactoring the imputation of NaN values”) and endless spirals of “What do you want me to do with this data?” and “Well, what can you do with this data?” quickly lead to misunderstandings and frustration. So how can we break down the steps of a data science project into tasty little bites?

It all boils down to working with the right analogy: cooking and data science use different vocabulary to describe very similar processes. Both start with raw ingredients or raw data, which are transformed into something that is more than just the sum of its parts. We’re using these parallels to introduce six steps to a successful data science project (see the overview below). In this article, we will introduce the first three steps, spiced up with relatable cooking examples and multi-flavoured activities, to help you plan your next data science project and to gather and prepare the necessary data. In a follow-on article, we will sink our teeth into the last three stages, discussing the challenges of data science, presentation options for your findings and best work practices. The tastiest bits are summarised in a doggy bag for you at the end of each section.

Six cooking/data science steps: 1) Planning your meal/Planning your data project, 2) Grocery shopping/Data gathering, 3) Mise en place/Data wrangling, 4) The hot part/The “actual” data analytics, 5) On a silver platter/Presenting outputs, 6) Clean workspace/Best coding practices.

Step 1: Planning your meal and data project

Cutting board with vegetables. Step 1) Planning your meal/Planning your data project.

Planning a data science project is like planning a dinner party. You don’t start by switching on your stove — or opening a terminal. You first have to figure out the basics: What’s the occasion? Who are you inviting? Can you serve your famous honey-crust ham or does anyone have special dietary requirements? How much money can you spend? And how can you make sure everyone will have a great time and fondly remember the party?

The same applies to data science. You need to start asking the right questions in order to shape your project. Asking for ‘some data analysis’ is like asking the waiter in a restaurant to just bring ‘some food’ — it can cause confusion and the client may end up not liking it. That is why data scientists should always sit down with their colleagues and clients to discuss what questions the data science project is aiming to answer. This challenging task is interlinked with many other questions:

  • What data do we have, and what or whom is it about?
  • Why is this research relevant and who is the target audience?
  • How much money and manpower can we spend and when can we realistically expect first results?
  • Are there ethical considerations regarding the data or possible findings?

Activity:
Think back to the latest data science project you were part of, no matter what your role was. What questions do you wish you had asked right at the beginning?

It’s true, many of these questions are difficult to answer right off the bat, but asking them in the first place may already trigger some ideas or actions. Exploratory analysis of the domain and available data helps gain a better understanding of the task and possible resources, identify knowledge and data gaps and design more realistic and impactful projects. Even with a clear project idea, the recipe for the project — the structure, the timelines and the methodologies — may have to be adapted based on the data and resources available and the needs of the target audience.
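
For the data scientists at the table, this exploratory first look doesn’t need to be fancy. A minimal sketch in Python with pandas can be enough to ground the planning conversation; the file name below is just a placeholder for whichever candidate dataset is under discussion.

```python
import pandas as pd

# Placeholder path: swap in whatever candidate dataset is on the table
df = pd.read_csv("candidate_dataset.csv")

print(df.shape)          # how much data do we actually have?
print(df.dtypes)         # which features are numeric, text, dates?
print(df.isna().sum())   # where are the gaps?
print(df.describe())     # ranges and outliers at a glance
print(df.sample(5))      # a few concrete rows to discuss with the team
```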

In the case of our dinner party, we learn that two of our guests are vegetarians, which rules out the honey-crust ham. So how about a lovely asparagus risotto instead? Unfortunately, it’s winter right now so it’s tough to get the main ingredient, and buying out-of-season vegetables is not acceptable… or is it? Analogously, in a data science project, we wouldn’t want to build an elaborate data dashboard if we already know that it wouldn’t appeal to the client’s taste. Furthermore, we can’t analyse data that is not available to us and we have to consider ethical questions regarding our choice of data and methodologies from the very start.

After additional considerations and conversations, we finally decide to serve up a vegetable gratin, a relatively simple yet satisfying dish, which leaves some flexibility in terms of its ingredients. Now that we’ve meticulously selected a recipe, we can start cooking … or can we?

Doggy bag

  • Bring everyone to the table. Project planning shouldn’t be one-sided. Bring data scientists, domain experts and clients together from the very beginning and define a clear problem statement.
  • Sample the ingredients. Exploratory work, including checking the data, helps formulate meaningful research questions and define realistic plans.
  • Get ready for lemonade. The bitter-sweet truth is that your project might not go according to plan. Be ready to adjust your recipe; that’s just how the cookie crumbles.

Step 2: Getting the ingredients

Young woman at a supermarket. Step 2) Grocery shopping/Data gathering.

No, it’s still not time to switch on that stove yet! A quick glance into our fridge tells us that we don’t have the right ingredients at home, so the next task is to go grocery shopping. For a vegetable gratin we need potatoes, crème fraîche, cheese and a bunch of vegetables.

Data is naturally the key ingredient to any data project. Sometimes we have to work with the data we have at hand (cook with what’s in the fridge) — which limits our recipe choices. For other projects, we have to search, select and obtain new datasets that match the requirements and resources available for our project.

The choices we make when selecting data are strikingly similar to the choices we make at the supermarket. Trade-offs between price and quality or effort and convenience are part of our everyday grocery shopping experience, as are ethical decisions about the provenance of the products we buy.

  • Shall I get the fancy expensive Japanese mushrooms or the cheaper local ones?
  • Shall I buy fresh carrots that require peeling and chopping or shall I go with the canned sliced ones?
  • Is it okay to buy products from a company that refuses to use recyclable packaging?
  • And why go through all the trouble in the first place if I could just buy ready meals?

Activity: Below you’ll find shelves filled with ingredients needed to cook a vegetable gratin. We have rated each ingredient based on its cost 💰, the effort it takes to prepare 💪🏽 and its quality ⭐. For some ingredients, we give additional context. Given a budget of 11 💰 and 8 💪🏽, which ingredients would you select? What is the rationale behind your decision? Remember, there is no right or wrong, there are just trade-offs!

Shelves with ingredients for vegetable gratin (1/2).
Shelves with ingredients for vegetable gratin (2/2).

Effort vs quality

Selecting datasets based on their quality and the effort required to process them comes with similar trade-offs to grocery shopping.

  • Raw data: As with the fresh carrots, working with raw data requires more preparation (see next section), but on the other hand you can shape it into any form you prefer and you know exactly how it was processed.
  • Processed datasets: Datasets processed by others (mirrored by the canned carrots) are convenient as they are ready to use. But if parts of the data were deleted or altered, it can be impossible to reconstruct the original data. Furthermore, the processing may not live up to your quality standards or might be poorly documented. Are you sure you want to add those carrots to your gratin…?

Cost factor

Sometimes you get lucky and the perfect data for your project is free, like when your generous neighbour shares mushrooms and vegetables from their own garden. But in other cases data can also be very expensive or difficult to get a licence for. Getting the newest and richest datasets comes at a price, just like the imported mushrooms from Japan. If the budget is tight, you will have to make do with free sources, such as the Census data, even if it is a bit outdated or requires a lot of processing, like our soon-to-expire mushrooms from the supermarket.

Grocery shelves with data equivalents.

What if you can’t get the data that you need?

Unfortunately, the perfect data for your project doesn’t always exist, but there are always options to explore. The ultimate DIY option is growing your own data garden — but keep in mind that collecting and maintaining data is time-intensive and challenging, as you will want to guarantee high data quality and representativeness and document and update your data regularly. Keeping a garden is hard work and not every plant will bear fruit.

By keeping an open and creative mind, you may uncover exciting data in places you didn’t expect: remember that any form of text, image, audio and video counts as data, even if it isn’t stored in a neat spreadsheet. Maybe you can draw some informative insights by analysing competitors’ annual reports, or train a model to identify a product’s weak points based on customer review pictures?

Finally, combining new data with existing data sources can lead to unexpected and interesting results, just like combining two types of food. Prosciutto and melon, anyone?

Once you’ve retrieved your data, make sure to label it and keep it in a backed-up place that only authorised people can access. Keep your data fridge clean, organised and up-to-date. A messy fridge does not inspire new recipes…

Doggy bag

  • You can’t have your cake and eat it, too. When selecting datasets there are often trade-offs in terms of quality, price and effort but we can try to maximise our resources.
  • Something smells fishy. Always check the source of the data and how it has been collected and processed, to avoid data quality and ethical issues down the road.
  • Be a smart and innovative cookie. Look for novel and creative data sources to explore. Take advantage of low-hanging fruit.

Step 3: Preparation

Cutting boards and chopped up ingredients. Step 3) Mise en place/Data wrangling.

Okay, now that we have all our food gathered at home, it surely is time to get that stove running, right? No, still too soon! We have bought a lot of fresh ingredients for our gratin, so we first have to wash the mushrooms, peel the carrots, check the food quality (is this crème fraîche really fresh?) and chop everything into bite-sized pieces. Sounds like a lot of work? That’s because it is.

The process of cleaning and preparing the data is called data wrangling, and it often eats up the biggest chunk of a data project’s time. It’s also the least visible part, although skipping it can have very visible and severe consequences. Bad food, such as mouldy cheese or rotten vegetables, can harm our bodies. Similarly, feeding bad data into our analysis, model or tool can lead to inaccurate and misleading results. Remember: garbage in, garbage out. Or in a more culinary phrasing: crêpe in, crêpe out!

But how does data get so dirty in the first place? And how exactly do we clean it? Let’s look at a toy dataset containing 6 fictional Energy Performance Certificate (EPC) records. An EPC record captures information about a property’s energy efficiency and other characteristics such as property type, floor area and heating system.

Activity: Look at the table and try to identify issues with this dataset. If you don’t know where to start, try to compare the properties by their features. What makes the comparison harder? Which values don’t make sense?

Table demonstrating data cleaning issues.

You probably found a wide variety of values that need cleaning: inconsistent formats for the floor area, different spellings and languages of the term ‘double glazed’ (yes, the last one is Welsh), duplicated entries (building #1) and potentially irrelevant data (blue windows). A first pass at these surface-level issues might look like the sketch below. After that, we will look at three data cleaning challenges in more detail.
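
As a rough illustration, such a first cleaning pass could be written in Python with pandas. The records, column names and values below are made up to mirror the issues in the toy table; they are not the real EPC schema or a production pipeline.

```python
import pandas as pd

# Made-up records mirroring the surface-level issues above
# (column names and values are illustrative, not the real EPC schema)
epc = pd.DataFrame({
    "building_id": [1, 1, 2, 3],
    "floor_area": ["85 m2", "85 m2", "90", "140,000"],
    "glazing": ["double glazed", "Double-glazed", "double glazed", "gwydr dwbl"],
    "window_colour": ["blue", "blue", "white", "blue"],
})

# Harmonise formats: keep the numeric part of the floor area and cast it
epc["floor_area"] = (
    epc["floor_area"]
    .str.extract(r"([\d.,]+)", expand=False)   # keep the number only
    .str.replace(",", "", regex=False)         # drop thousands separators
    .pipe(pd.to_numeric, errors="coerce")
)

# Standardise spellings and translate known variants (including the Welsh one)
glazing_map = {"double glazed": "double glazed",
               "double-glazed": "double glazed",
               "gwydr dwbl": "double glazed"}
epc["glazing"] = epc["glazing"].str.lower().map(glazing_map)

# Drop duplicated entries (building #1 appears twice)
epc = epc.drop_duplicates()

# Drop data that is irrelevant to the analysis (window colour)
epc = epc.drop(columns=["window_colour"])
```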

  • Our dataset contains several missing values across various features. Missing values occur when the property assessor couldn’t determine the correct value or when values get lost along the way (corrupted data). Building #7 has so many missing values that one could consider dropping it altogether, like throwing out a carrot that a mouse has already taken a few bites out of. Urgh! In other cases, we may be able to infer the missing values, which is called data imputation (see the code sketch after this list). For instance, Building #2 has very similar values to Building #1, so we could use it to estimate the missing EPC rating. When cooking, we similarly fill the gaps left by out-of-stock ingredients.
  • Next, we’ll look at unexpected values. Have you found any? Exactly, it is hard to believe that the floor area of Building #3 is really 140,000 square metres. But unreasonable values are not always that obvious and spotting them often requires domain knowledge and a good understanding of what the data represents and how it was collected. For instance, you have to know that EPC ratings only range from A to G to identify that K is not a valid rating. The value 78 also doesn’t seem meaningful, but an expert in home decarbonisation could tell you that this number potentially refers to the EPC score, from which the rating can be derived. This shows that ‘knowing how the sausage is made’ is very important for understanding and cleaning datasets.
  • The last issue, biased data, is more abstract. If we mistakenly assume that this dataset is a representative reflection of the UK housing stock, we could reach the misleading conclusion that Wales has twice as many properties as England, and that all of them are gas-heated flats. Furthermore, we might conclude that only properties with blue windows are suitable for heat pumps, which would indeed be a puzzling finding for the low carbon heating sector. The best way to obtain a more representative view of properties in the UK is to extend the dataset with more EPC records. If that data is not available, we could downsample the Welsh properties (remove some) or upsample the English properties (duplicate existing ones or add new ones) to imitate the right distribution. This process mirrors adding more liquid to your sauce when you have accidentally put in too much salt: you try to restore the right balance.
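
To make these three challenges a little more tangible, here is a minimal pandas sketch of how one might handle them. Again, the records, column names and the simple imputation rule are illustrative assumptions rather than the actual EPC data or methodology.

```python
import numpy as np
import pandas as pd

# Made-up records mirroring the three challenges above (illustrative only)
epc = pd.DataFrame({
    "building_id": [1, 2, 3, 7],
    "country":     ["Wales", "Wales", "England", "Wales"],
    "floor_area":  [85.0, 88.0, 140_000.0, np.nan],
    "epc_rating":  ["C", np.nan, "K", np.nan],
    "heating":     ["gas boiler", "gas boiler", "gas boiler", np.nan],
})

# 1) Missing values: drop records that are mostly empty (like building #7) ...
epc = epc.dropna(thresh=3)  # keep rows with at least 3 non-missing values

# 2) Unexpected values: enforce simple domain rules before imputing anything
epc.loc[~epc["epc_rating"].isin(list("ABCDEFG")), "epc_rating"] = np.nan
epc.loc[epc["floor_area"] > 1_000, "floor_area"] = np.nan  # implausible area

# 1b) ... then impute what we can, e.g. fill a missing rating with the most
# common rating among otherwise similar (here: same heating type) properties
most_common = epc.groupby("heating")["epc_rating"].transform(
    lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
)
epc["epc_rating"] = epc["epc_rating"].fillna(most_common)

# 3) Biased data: rebalance by downsampling the over-represented country
n_target = epc["country"].value_counts().min()
balanced = epc.groupby("country").sample(n=n_target, random_state=0)
```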

Inspecting and cleaning your data is often a lengthy and tedious process. The good news is that you only have to do it once. If you do it properly now, you can lean back and (mostly) skip this step the next time you work with the same dataset.

Doggy bag

  • Crêpe in, crêpe out. Don’t expect reliable results from bad data.
  • Good data, like good wine, takes time. Plan in sufficient time for cleaning the data, and involve domain experts for identifying and fixing errors.
  • Know how the sausage was made. Be aware of how the data was collected and how potential biases might influence the results.

We have now covered half of our data science and cooking steps — and we’re still not cooking with fire. Did you expect that? If so, you’re already becoming an expert in culinary data science. Stay tuned for the next part of the blog, where we will finally expose our ingredients to heat and discuss how to deal with hair in your data science soup. We will also learn how to present your findings in an appetising way and how to keep a clean data kitchen for cooking up reproducible and high-standard data projects.

It all boils down to this

If data science you shall need,
No matter tool or map or chart
Think about who it will feed,
Then start designing à la carte.

When picking out a dataset,
Trade-offs are the key
Weigh budget/effort in your head
And max out quality.

Data cleaning is a pain
And as with making wine
You need to know ‘bout your domain
And give it plenty time.

When in doubt, just think of what
It takes to cook a meal
Try out fun, new recipes — but
Don’t reinvent the wheel.

Join the kitchen, have a byte,
There’s so much to explore.
We hope we’ve whet your appetite
Please come back for more.
