Five “Secret” Ingredients to Become a Data Scientist

Tons of resources, and you don’t know where to start? My goal is to give you a personal Data Science roadmap using a Pizza recipe as an analogy.

Omar Valdez
Published in Geek Culture
7 min read · Jun 15, 2021

If you’re asking yourself what this has to do with Data Science, keep reading and you’ll see how the magic begins.

While I was walking down my street to pick up some tortillas (Yes, I’m Mexican, haha!), I stumbled upon a new Pizza Restaurant near my home. To be honest, I was surprised because the logo and the place looked really fantastic; I was glad for them!

So when I arrived at my house with my Tortillas, I asked my wife if she could make a Pizza for us.

Her “yes” surprised me for two reasons:

  1. She said that I have no idea how to cook
  2. She told me that it was easy

It may seem like a coincidence, but those two points from my wife are exactly what I can confidently say nowadays to anyone in the Data Science field.

Anyway, just because I made a homemade pizza recently, I will explain a possible roadmap to becoming a data scientist using a pizza recipe analogy.

NOTE: This is a homemade pizza/personal roadmap, so maybe you have a better way to create one, or perhaps I’m missing some ingredients; please let me know!

Ingredient #1: Water, Sugar, and Yeast

As far as I know, these are somehow the base, the pillar, the core; translated into the roadmap, this must be Programming.

You can do Data Science with many tools: Python, Java, VBA, Scala, C, Julia, R, SQL, etc.

Sounds daunting at first glance, but this will mark the base of our Data Scientist road. We really need to dive deep into these topics, and please don’t overwhelm yourself; take it easy and at your own pace.

Programming

Whichever programming language you choose, you must learn it remarkably well because it will be your daily bread. Personally, I’m not married to any programming language; you can use whatever you want as long as you get the desired result.

When you finish your first ingredient, you will understand the common data structures, manage libraries and packages, and understand object-oriented programming.
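
To make those three skills a bit more concrete, here is a minimal sketch in Python (just one of the languages listed above, picked for illustration). The `Pizza` class, the toppings, and the prices are all made up for this example; they only show a few common data structures, an import from the standard library, and a tiny bit of object-oriented programming.

```python
from collections import Counter  # importing a tool from the standard library

# Common data structures (all values are made up for illustration)
toppings = ["pepperoni", "mushrooms", "pepperoni"]  # list: ordered, allows duplicates
unique_toppings = set(toppings)                     # set: keeps unique items only
prices = {"pepperoni": 1.5, "mushrooms": 1.0}       # dict: key -> value lookup
topping_counts = Counter(toppings)                  # counts how often each item appears


class Pizza:
    """A hypothetical class used only to illustrate object-oriented programming."""

    def __init__(self, base_price, toppings):
        self.base_price = base_price
        self.toppings = list(toppings)

    def total_price(self, price_table):
        # Add the price of every topping to the base price
        return self.base_price + sum(price_table[t] for t in self.toppings)


pizza = Pizza(base_price=8.0, toppings=toppings)
print(unique_toppings)
print(topping_counts["pepperoni"])
print(pizza.total_price(prices))
```

If you can read this snippet and explain why a list, a set, and a dict each behave differently, you are already on the right track with this ingredient.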

Again, this is totally personal, so you probably know some better resources for your roadmap. Ultimately, you can use your own configuration!

Resources

Ingredient #2: Salt and Flour

These ingredients give the pizza its shape, and Mathematics and Statistics do the same in the Data Science world.

Most of the roadmaps that I’ve seen on the internet do not give proper importance to Math & Stats. I know that some people start coding fancy machine learning algorithms without any idea of what is going on behind the scenes!

Mathematics and Statistics

Possibly you have learned math and stats during your school days, but if not, don’t worry! It is not a big deal because the following site makes it really simple and interactive.

Resources

  • Statistics and Probability: the Khan Academy course track will introduce you to both descriptive and inferential statistics.
  • Math: Linear Algebra and Calculus will give you the core of some machine learning algorithms that you will look at in the future. Don’t worry if you don’t know them from head to toe. When you handle ML algorithms later, you will review these topics again, and you will already have a notion of how to attack those problems.
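
To show the difference between descriptive and inferential statistics, here is a small sketch that uses only Python’s standard library. The delivery times are made-up numbers, and the confidence interval relies on a normal approximation, which is an assumption of mine; with a sample this small, a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample: pizza delivery times in minutes (made-up numbers)
delivery_minutes = [28, 31, 25, 35, 30, 27, 33, 29, 32, 30]

# Descriptive statistics: summarize the sample you actually have
sample_mean = statistics.mean(delivery_minutes)
sample_median = statistics.median(delivery_minutes)
sample_stdev = statistics.stdev(delivery_minutes)  # sample standard deviation (n - 1)

# Inferential statistics: say something about the population behind the sample.
# Approximate 95% confidence interval for the mean (normal approximation).
z = statistics.NormalDist().inv_cdf(0.975)
standard_error = sample_stdev / math.sqrt(len(delivery_minutes))
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"mean={sample_mean}, median={sample_median}, stdev={sample_stdev:.2f}")
print(f"approximate 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first half only describes the ten deliveries; the second half makes a cautious guess about all deliveries, which is where the statistics courses really pay off.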

BONUS: If you are coming from an R background, you should take a look at this site

Ingredient #3: Tomato Sauce

Who has ever seen a Pizza without tomato sauce? As far as I know, this ingredient is not very visible to the person who’s going to eat it, but it is essential, and so is Data Cleaning.

Data Cleaning

According to many forums and Twitter posts I’ve seen over the years, Data Wrangling is approximately 80% of Data Science work. Think about it: you have a lot of data from different sources. You must collect it, clean it, and reshape it before you end up with data you can actually analyze.

That is, you cannot expect data to arrive clean and formatted the way you want when you obtain it from a source, so you should take this “ingredient” seriously.
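
As a rough illustration of what cleaning can look like, here is a sketch using only Python’s standard library. The raw rows and the `clean_rows` helper are hypothetical, invented for this example; real projects often use a dedicated library for this, but the steps are similar: trim whitespace, make casing consistent, mark missing values, and drop duplicates.

```python
import csv
import io

# Hypothetical raw export with typical problems (made-up rows)
raw_csv = """name,city,amount
 Ana ,mexico city,120.50
Luis,GUADALAJARA,
 Ana ,mexico city,120.50
Sofia,Monterrey,n/a
Pedro, puebla ,89
"""


def clean_rows(text):
    """Trim text, normalize casing, parse numbers, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip()
        city = row["city"].strip().title()
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            amount = None  # mark missing or unparseable values explicitly
        key = (name, city, amount)
        if key in seen:
            continue  # skip rows we have already kept
        seen.add(key)
        cleaned.append({"name": name, "city": city, "amount": amount})
    return cleaned


rows = clean_rows(raw_csv)
for row in rows:
    print(row)
```

Notice that the missing amounts are kept as `None` instead of being silently replaced; deciding what to do with them is a separate step that depends on your data.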

Resources

Ingredient #4: Mozzarella Cheese

Who doesn’t love cheese? Most people I know want a little bit of cheese in their lives — cheddar, Feta, Parmesan, no matter what you like.

Okay, let me tell you something: Exploratory Data Analysis is like the cheese in the pizza!

Exploratory Data Analysis

As I mentioned above, you can choose whatever cheese you like, right? Well, when you are doing your Exploratory Data Analysis (EDA), you can select histograms, box plots, scatter plots, heat maps, etc.

This is the section where you taste your data. This is the part where you “play” with analysis, and you try to get some insights.
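
Here is a small taste of EDA without any plotting library, as a sketch using Python’s standard library. The pizza orders are made-up numbers and `pearson` is a helper I wrote for this example. It prints a text histogram, the quartiles that a box plot would draw, and the correlation you would look for in a scatter plot.

```python
import statistics
from collections import Counter

# Hypothetical data: made-up pizza orders (diameter in cm, price in dollars)
diameters = [25, 30, 30, 35, 25, 40, 30, 35, 30, 25]
prices = [8.0, 10.0, 10.5, 12.5, 8.5, 15.0, 10.0, 13.0, 9.5, 8.0]

# Histogram: how often does each diameter appear?
size_counts = Counter(diameters)
for size, count in sorted(size_counts.items()):
    print(f"{size} cm | {'#' * count}")

# Box plot ingredients: quartiles of the price
q1, q2, q3 = statistics.quantiles(prices, n=4)
print(f"min={min(prices)}, Q1={q1}, median={q2}, Q3={q3}, max={max(prices)}")


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Scatter plot companion: do bigger pizzas cost more?
r = pearson(diameters, prices)
print(f"correlation between diameter and price: {r:.2f}")
```

Once the numbers tell you something interesting, that is usually the moment to draw the real chart.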

Resources

Ingredient #5: Pepperoni (Or anything you want as a topping)

I love Pepperoni, but you could choose another kind of cheese (If you are a cheese lover), or meatballs, or mushrooms, or anything you want.

I have some bad news: this is the end of my personal roadmap, and so of my personal homemade pizza :( But it turns out you can finally eat it! Of course, first you have to bake it.

This is the section where you stumble upon Machine Learning, the famous, trendy, hot topic you hear about every day if you follow Data Science news.

Machine Learning

Take Machine Learning as a toolbox, not as an end. What I mean is that people sometimes confuse Data Science with Machine Learning, yet sometimes you can solve and communicate a problem without it.

Occasionally, you won’t need the fanciest ML algorithm. In fact, it’s often better to give simple results because they are easier for anyone to interpret.
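
As an example of a simple, interpretable result, here is a sketch of a straight-line fit using ordinary least squares in plain Python. The advertising and sales numbers are made up, and `fit_line` is a helper written for this example, not a library function.

```python
# Hypothetical data: made-up weekly ad spend (in hundreds) vs. pizzas sold
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
pizzas_sold = [12.0, 14.0, 16.0, 18.0, 20.0]


def fit_line(xs, ys):
    """Fit y = intercept + slope * x with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ad_spend, pizzas_sold)
prediction = intercept + slope * 6.0  # predicted sales for a spend of 6

print(f"each extra unit of ad spend adds about {slope:.1f} pizzas")
print(f"with no ad spend we would expect about {intercept:.1f} pizzas")
print(f"prediction for a spend of 6: {prediction:.1f} pizzas")
```

The whole model is two numbers, and you can explain both of them in one sentence each. That is the kind of interpretability a fancier algorithm rarely gives you for free.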

Resources

Conclusion

This is a personal roadmap; if you like, you can find other resources that are better than mine.

As I said before, you can use your own configuration to achieve whatever goal you have in mind.

By the way, I have to tell you that there are some topics which I didn’t cover because those are tougher and you need to have some experience before jumping into them:

  • Feature Engineering
  • Data Engineering
  • Deep Learning

I hope this personal roadmap can help you reach some of your goals in the Data Science world, and I also hope you can make your own homemade pizza after reviewing this tutorial :)

P.S. I guess you noticed that Python has better resources than R, and my opinion regarding whether you choose R or Python is the following:

  • If you come from a statistical background, go with R
  • If you come from a computer science background, go with Python

Those are the most used languages, and keep in mind that they are not mutually exclusive.

My recommendation is to learn both because you will use them at some point in your work.

Simple as that!
