Five “Secret” Ingredients to Become a Data Scientist
Tons of resources, and you don’t know where to start? My goal is to give you a personal Data Science roadmap using a Pizza recipe as an analogy.
If you’re asking yourself what this has to do with Data Science, keep reading and you’ll see how the magic begins.
While I was walking down my street to pick up some tortillas (yes, I’m Mexican, haha!), I stumbled upon a new Pizza Restaurant near my home. To be honest, I was surprised because the logo and the place looked really fantastic; I was happy for them!
So when I arrived at my house with my Tortillas, I asked my wife if she could make a Pizza for us.
I was surprised to get a “yes” as an answer, for two reasons:
- She said I don’t have any idea of how to cook
- My wife told me that it was easy
It seems like a coincidence, but the two points mentioned by my wife are something that nowadays I can confidently say to anyone in the Data Science area.
Anyway, since I recently made a homemade pizza, I will explain a possible roadmap to becoming a data scientist using a pizza recipe as an analogy.
NOTE: This is a homemade pizza/personal roadmap, so maybe you have a better way to create one, or perhaps I’m missing some ingredients; please let me know!
Ingredient #1: Water, Sugar, and Yeast
As far as I know, these form the base, the pillar, the core of the dough; translated into the roadmap, that is Programming.
You can do Data Science with many tools: Python, Java, VBA, Scala, C, Julia, R, SQL, etc.
That list sounds daunting at first glance, but programming is the foundation of our Data Scientist road. We really need to dive deep into it, but please don’t overwhelm yourself; take it easy and go at your own pace.
Programming
Whichever programming language you choose, you must learn it remarkably well because it will be your daily bread. Personally, I’m not married to any programming language; you can use whatever you want as long as you get the desired result.
When you finish your first ingredient, you will understand the common data structures, know how to manage libraries and packages, and be comfortable with object-oriented programming.
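To make those three skills a bit more concrete, here is a minimal sketch in Python. The `Pizza` class and the toppings are made up for this post; the point is only to show a list, a set, and a dict, an import from the standard library, and a tiny class in one place.

```python
from collections import Counter  # using a package: Counter lives in the standard library
from dataclasses import dataclass, field


@dataclass
class Pizza:
    """Hypothetical class for illustration: data (toppings) plus behavior."""
    name: str
    toppings: list[str] = field(default_factory=list)  # a list keeps order and duplicates

    def add_topping(self, topping: str) -> None:
        self.toppings.append(topping)

    def topping_counts(self) -> dict[str, int]:
        # A dict maps each topping to how many times it was added.
        return dict(Counter(self.toppings))


pizza = Pizza("homemade")
for topping in ["cheese", "pepperoni", "cheese"]:
    pizza.add_topping(topping)

unique_toppings = set(pizza.toppings)  # a set drops the duplicates
counts = pizza.topping_counts()

print(sorted(unique_toppings))
print(counts)
```

If you can read this without looking anything up, the first ingredient is probably ready.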
Again, this is totally personal, so you may know better resources for your own roadmap. Ultimately, you can use your own configuration!
Resources
- Python — Interactive Python Tutorial
- R — Interactive R Tutorial
- SQL — Beginner and Advanced
- Git — Crash Course Git and GitHub by FreeCodeCamp.org
Ingredient #2: Salt and Flour
These ingredients give the pizza its shape, and Mathematics and Statistics do the same in the Data Science world.
Most of the roadmaps that I’ve seen on the internet do not give Math & Stats the importance they deserve. I know that some people start coding fancy machine learning algorithms without any idea of what is going on behind the scenes!
Mathematics and Statistics
Possibly you have learned math and stats during your school days, but if not, don’t worry! It is not a big deal because the following site makes it really simple and interactive.
Resources
- Statistics and Probability — The Khan Academy Course Track will introduce you to the world of both descriptive and inferential statistics.
- Math — Linear Algebra and Calculus will give you the core of some machine learning algorithms that you will look at in the future. Don’t worry if you don’t know them inside out: when you handle ML algorithms later, you will review these topics again, and you will already have a notion of how to attack those problems.
BONUS: If you are coming from an R background, you should take a look at this site
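To give a taste of what descriptive and inferential statistics look like in practice, here is a small sketch using only Python's standard library. The baking times are made-up numbers, and the 95% interval uses a normal approximation, which is rough for a sample this small; treat it as an illustration, not a recipe.

```python
import math
import statistics

# Made-up sample: minutes each homemade pizza spent in the oven.
baking_minutes = [12, 14, 13, 15, 11, 14, 13, 12]

# Descriptive statistics: summarize the sample you have.
mean = statistics.mean(baking_minutes)
median = statistics.median(baking_minutes)
stdev = statistics.stdev(baking_minutes)  # sample standard deviation (n - 1)

# Inferential statistics: say something about the unseen "true" mean.
# Assumption: a normal approximation for the 95% confidence interval.
standard_error = stdev / math.sqrt(len(baking_minutes))
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
ci_low = mean - z * standard_error
ci_high = mean + z * standard_error

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"approx. 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first half only describes the eight pizzas; the second half is where the math starts making a claim about pizzas you have not baked yet.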
Ingredient #3: Tomato Sauce
Who has ever seen a Pizza without tomato sauce? As far as I know, this ingredient is not very visible to the person who’s going to eat it, but it is essential, and so is Data Cleaning.
Data Cleaning
According to many forums and Twitter posts I’ve seen over the years, Data Wrangling makes up approximately 80% of Data Science work. Think about it; you have a lot of data from different sources. You must collect it and clean it before you can shape it into the final dataset you will analyze.
That is, you cannot expect data to arrive clean and formatted the way you want when you obtain it from a source, so you should take this “ingredient” seriously.
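Here is a minimal sketch of what that cleaning can look like. The CSV content and the `clean_rows` helper are made up for illustration, and I stick to Python's standard library so it runs anywhere; in a real project you would more likely reach for a library such as pandas.

```python
import csv
import io

# Made-up raw data with typical problems: stray spaces, inconsistent
# capitalization, a duplicate row, and missing or non-numeric prices.
raw_csv = """name,city,price
 Margherita ,mexico city,120
Pepperoni,Mexico City,150
pepperoni,Mexico City,150
Hawaiian,GUADALAJARA,
Veggie,guadalajara,n/a
"""


def clean_rows(text):
    """Hypothetical helper: normalize text, parse prices, drop duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        try:
            price = float(row["price"])
        except ValueError:
            price = None  # keep the row, but mark the price as missing
        key = (name, city, price)
        if key in seen:
            continue  # skip rows that are identical after normalization
        seen.add(key)
        cleaned.append({"name": name, "city": city, "price": price})
    return cleaned


cleaned = clean_rows(raw_csv)
for row in cleaned:
    print(row)
```

Notice that the duplicate only shows up after the names are normalized, which is a good reason to clean before you count anything.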
Resources
- Python — Data Cleaning at Kaggle
- R — Data Cleaning Tutorial in R. Not interactive, but it shows you the tidyverse world!
Ingredient #4: Mozzarella Cheese
Who doesn’t love cheese? Most people I know want a little bit of cheese in their lives: Cheddar, feta, Parmesan, whatever you like.
Okay, let me tell you something: Exploratory Data Analysis is like the cheese in the pizza!
Exploratory Data Analysis
As I mentioned above, you can choose whatever cheese you like, right? Well, when you are doing your Exploratory Data Analysis (EDA), you can select histograms, box plots, scatter plots, heat maps, etc.
This is the section where you taste your data. This is the part where you “play” with analysis, and you try to get some insights.
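As a small taste, here is a sketch of two of those tools without any plotting library: the quartiles behind a box plot and a text histogram. The delivery times are made up, and the 1.5 × IQR rule for flagging outliers is just one common convention.

```python
import statistics
from collections import Counter

# Made-up data: pizza delivery times in minutes.
delivery_minutes = [22, 25, 27, 31, 24, 29, 35, 41, 26, 28, 33, 58]

# The numbers a box plot is drawn from: quartiles and the interquartile range.
q1, q2, q3 = statistics.quantiles(delivery_minutes, n=4)
iqr = q3 - q1
outliers = [
    x for x in delivery_minutes
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr
]

# A text histogram: count how many values fall into each 10-minute bin.
bin_width = 10
histogram = Counter((x // bin_width) * bin_width for x in delivery_minutes)

print(f"Q1={q1}, median={q2}, Q3={q3}, outliers={outliers}")
for start in sorted(histogram):
    print(f"{start}-{start + bin_width - 1}: {'#' * histogram[start]}")
```

Even this tiny example raises a question worth chasing: what happened with that one very slow delivery?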
Resources
- Python — Data Analysis with Python by FreeCodeCamp is a very detailed tutorial where you will see a lot of libraries like pandas and seaborn.
- R — Here is a link where you could take a look if you prefer R.
Ingredient #5: Pepperoni (Or anything you want as a topping)
I love Pepperoni, but you could choose another kind of cheese (if you are a cheese lover), or meatballs, or mushrooms, or anything you want.
I have some bad news: this is the end of my personal roadmap, and so of my personal homemade pizza :( The good news is that you can finally eat it! Of course, first you have to bake it.
This is the section where you stumble upon Machine Learning, the famous, trendy, and hot topic you hear about every day if you follow Data Science news.
Machine Learning
Take Machine Learning as a toolbox, not as an end. What I mean is that people sometimes confuse Data Science with Machine Learning, but you can often solve a problem and communicate the result without it.
Sometimes you won’t need the fanciest ML algorithm. In fact, simple results are often better because anyone can interpret them.
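To show what a simple, interpretable result can look like, here is a sketch of ordinary least squares for a straight line, written from scratch. The `fit_line` helper and the prices are made up for this post; it is one of the simplest models you can fit, not a recommendation for every problem.

```python
def fit_line(xs, ys):
    """Hypothetical helper: least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: number of toppings versus pizza price.
toppings = [0, 1, 2, 3, 4]
price = [100, 116, 129, 146, 159]

slope, intercept = fit_line(toppings, price)
predicted = intercept + slope * 5  # estimated price of a 5-topping pizza

print(f"each extra topping adds about {slope:.1f} to the price")
print(f"a plain pizza costs about {intercept:.1f}")
print(f"a 5-topping pizza should cost about {predicted:.1f}")
```

The whole model is two numbers that you can explain in one sentence each, which is exactly why simple results are easy to communicate.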
Resources
- Python — Machine Learning from Scratch with Python
- R — Tidy Modeling with R truly is a fantastic source; you should check it out even if you come from a different programming language.
Conclusion
This is a personal roadmap; if you like, you can look for other resources that suit you better than mine.
As I said before, you can use your own configuration to achieve whatever goal you have in mind.
By the way, I have to tell you that there are some topics which I didn’t cover because those are tougher and you need to have some experience before jumping into them:
- Feature Engineering
- Data Engineering
- Deep Learning
I hope this personal roadmap helps you reach some of your goals in the Data Science world, and I also hope you can make your own homemade pizza after reviewing this tutorial :)
P.S. You may have noticed that the Python resources are better than the R ones. My opinion on whether to choose R or Python is the following:
- If you come from a statistical background, go with R
- If you come from a computer science background, go with Python
Those are the most used languages, and keep in mind that they are not mutually exclusive.
My recommendation is to learn both because you will use them at some point in your work.
Simple as that!