Becoming a Data Cook: What Data Preparation Means for Data Scientists

Part of a Datalogue series on data preparation

Sonia Sen
Datalogue

--

Data science is a lot like making dinner. While raw ingredients may be interesting at first, the fun doesn’t begin until you’re actually slicing, dicing, and eventually serving up something delicious to devour. Most of the time that something is dinner; in the data science world, it’s data insights.

Data preparation to serve up your data insights

But imagine that, before cooking dinner, you didn’t know whether your grocery store was open, whether the aisle names accurately represented what was in them, or whether the food was even edible. Fortunately for our stomachs, getting groceries is usually a simple process, so we can get to cooking and, more importantly, eating faster.

Unfortunately, in data science, this isn’t the case.

The metaphorical grocery shopping for and cooking of your data assets is often a long, intensive process that prevents you from actually experimenting and, ultimately, acting on what your data is telling you.

Data preparation is the accessing, organizing, and structuring of unprocessed data assets to be used for data analysis.

Every data-driven businessperson knows that where there’s data, there are potential answers to their questions. But these questions can’t be answered, or even experimented on, until the data is ready for analysis, and getting it ready is commonly said to account for about 80% of a data scientist’s time.

While data science as a profession is one of the most intellectually rewarding and challenging careers, not enough attention is given to the “janitorial work” that is required before delivering the beautiful and insightful visualizations and analytics that drive data-informed business decisions.

This conversion, massaging, and wrangling of data assets into usable and accessible formats doesn’t provide value to the business until the data actually starts answering business questions. Moreover, data, just like food, is a consumable that is always being replenished and augmented for further richness. And while grocery shopping is, for the most part, a straightforward process, data preparation can be painful: simply accessing your data assets, understanding them, and ensuring they are consistent can take real effort before any actual analysis begins.

Getting Your Data

The data grocery store

The current technology landscape of data formats and stores is vast, to say the least. From good ol’ comma-separated values (CSV) files stored on Amazon S3 to more complex formats like MongoDB’s BSON, your data can, and does, live everywhere.

While dealing with your own localized data store may be familiar enough, incorporating new data from external sources (or even the department down the hall) means ensuring that the new data is ready for your processing tools. The act of simply seeing your data means overcoming two major potential barriers:

  • how the data is stored (the format)
  • where the data is stored (the storage)

Large-scale enterprises and individuals alike have to jump through the authentication and compatibility hoops presented by the ways their data is stored. Oftentimes, gaining access means getting the correct permissions to the organization’s data lake or converting between file formats via external packages. And that is just to start looking at the data!
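As a small illustration of the format barrier, here is a minimal sketch of converting one format into another. It uses only Python’s standard library and assumes the CSV has a header row; the function name `csv_to_json_lines` and the sample data are hypothetical, invented for this example.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text (with a header row) to JSON Lines, one object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every value stays a string here; deciding on types is a later step.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data, invented for illustration.
sample_csv = "item,aisle,price\napple,produce,0.50\nbread,bakery,2.25\n"
json_lines = csv_to_json_lines(sample_csv)
print(json_lines)
```

Real data stores add authentication and far larger files on top of this, so treat it as the simplest possible case rather than a recipe for production.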

There are a multitude of ways to centralize where and how your data is stored, which is why understanding where to even start your data “shopping” is an essential, and obvious, first step in any data science process.

Knowing Your Data

The data aisles

Once you’re able to actually get your hands on the data, you need to be able to make sense of it. Column headers are for humans, right? I can just read them, right? Hopefully, yes. However, anyone who has seen enough column headers is sure to have run into some that were less than helpful.

no empanadas?

Who’s to say whether or not a column entitled “emp_no” contains employee_numbers or instead signifies empanadas_none?

The only way to really understand what’s going on in the data is to look beyond the schema and quickly analyze the types and values that make up your data points. Joining a Businesses table to a grocery store’s Vegetable Inventory table will not be fruitful just because both happen to have a column called Name. For a single table, a quick glance-over is usually enough for a data scientist to establish what the data actually contains.
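One way to look beyond the schema is to profile the values themselves before trusting a column name. The sketch below is a rough heuristic, not a full profiler: the helper names (`infer_kind`, `profile_column`, `key_overlap`) and the sample columns are hypothetical, and it assumes every raw value arrives as a string.

```python
from collections import Counter


def infer_kind(value: str) -> str:
    """Roughly classify a raw string value as 'empty', 'int', 'float', or 'text'."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "text"


def profile_column(values):
    """Count the kinds of values in a column and its distinct non-empty values."""
    kinds = Counter(infer_kind(v) for v in values)
    distinct = {v.strip() for v in values if v.strip()}
    return {"kinds": dict(kinds), "distinct": len(distinct)}


def key_overlap(left, right):
    """Fraction of distinct left values that also appear on the right."""
    left_set, right_set = set(left), set(right)
    if not left_set:
        return 0.0
    return len(left_set & right_set) / len(left_set)


# Hypothetical columns, invented for illustration.
emp_no = ["10001", "10002", "", "10004"]
business_names = ["Acme Corp", "Globex", "Initech"]
vegetable_names = ["Carrot", "Kale", "Leek"]

print(profile_column(emp_no))  # mostly integers: more likely employee numbers
print(key_overlap(business_names, vegetable_names))  # no shared values
```

A column of mostly integers points toward employee numbers rather than empanadas, and a key overlap of zero is a strong hint that two columns named Name should not be joined.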

At scale, however, as databases grow in both schema and contents, it quickly becomes unwieldy to understand all the fine details of every database. You need to know what’s in your data because it informs how you can actually start performing data analysis.

Transforming Your Data

Is your data edible?

Once you access and understand your data, you can finally start brainstorming the types of questions and insights that could potentially be baked. Yet, often you’re still not ready for the data analysis process.

Depending on how in-depth your data analysis goals are, from simple visualizations with out-of-the-box tooling to customized machine learning models, your data may need to be transformed in ways that make it ingestible for your data analysis tools. Transformations range from quality checks, like cleaning out invalid values, to filtering down to only the columns that interest you, or even limiting the amount of data you work with because of how computationally (not to mention financially) expensive it may be. Making sure the data isn’t rotten by filling in empty slots, and taking just the right amount of it by looking only at the interesting columns, is crucial to getting your data ready for cooking.
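The transformations described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive pipeline: rows are plain dictionaries, “invalid” simply means a required column is empty, and the function name `prepare_rows`, its parameters, and the sample rows are all hypothetical.

```python
def is_empty(value) -> bool:
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def prepare_rows(rows, keep_columns, required, defaults, limit=None):
    """Drop invalid rows, fill empty slots, keep chosen columns, cap the row count."""
    prepared = []
    for row in rows:
        # Limit the amount of data: stop once enough rows have been kept.
        if limit is not None and len(prepared) >= limit:
            break
        # Quality check: skip rows with a missing value in a required column.
        if any(is_empty(row.get(col)) for col in required):
            continue
        cleaned = {}
        for col in keep_columns:
            value = row.get(col)
            # Fill empty slots with the column's default (None if there is none).
            cleaned[col] = defaults.get(col) if is_empty(value) else value
        prepared.append(cleaned)
    return prepared


# Hypothetical raw rows, invented for illustration.
raw_rows = [
    {"item": "apple", "aisle": "produce", "price": "0.50", "notes": "organic"},
    {"item": "", "aisle": "bakery", "price": "2.25", "notes": ""},
    {"item": "kale", "aisle": "", "price": "1.75", "notes": ""},
    {"item": "leek", "aisle": "produce", "price": "0.90", "notes": ""},
]

prepared = prepare_rows(
    raw_rows,
    keep_columns=["item", "aisle", "price"],
    required=["item"],
    defaults={"aisle": "unknown"},
    limit=2,
)
print(prepared)
```

Which columns count as required, and what a sensible default is, depends entirely on the analysis you are prepping for, so those choices are left as parameters.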

What Are You Prepping For?

Finally, food puns aside, this three-step process sets you up for the fun stuff:

Data science!

Everything that gets you from raw data to being able to start your data analytics process falls under these steps, all while wearing your data preparation hat:

  • Accessing the data… getting to the store
  • Understanding what’s in it… going to the correct aisles
  • Making it usable for your data analytics tools… ensuring edibility

All of the above are key processes in setting your data table. The ingredients that set you up for your tooling and actionable insights are crucial, even if preparing them seems tedious. Understanding why data preparation matters and how it works is key to making the best use of your data scientists, understanding what your data is telling you, and, most importantly, getting to data dining.

Make data scientists chefs, not grocery shoppers.

It’s time to eat better, faster, smarter. 👩‍🍳 🍳

All prepped up and ready to go

Interested in getting to cooking faster? Message me or check out our website to see how we can work together on your data science recipes.

