How to Learn and Master Data Science
To rapidly master data science, you need to …
- Break it down
- Figure out what to do, and what not to do
- Design a plan
Let’s dive into each of these.
Break it down
To learn anything very quickly, you need to break it down into small components.
To rapidly master data science, you need to break data science down into smaller sub-disciplines. Furthermore, those sub-disciplines can be broken down into a set of skills and techniques. Going a step further, all of those techniques can be broken down into small learnable units that you can practice (we’ll talk about practice later).
At a very high level, you can break down basic data science into the following sub-disciplines:
- Data visualization
- Data manipulation
- Data analysis
These categories account for ‘the basics’ that you need to know.
Find out what to do, and what not to do
After you break things down, you need to figure out what to do, but also what not to do. To rapidly learn data science, it’s critical to select the right material: you need to distinguish between what’s really important and what is unimportant.
Figuring out what not to do is perhaps the more important of the two. When most people begin learning, they try to learn way too much. More often than not, this leaves people feeling overwhelmed, and it often causes them to spend time on topics that are not necessary.
Let me give you an example. As of 2017, there are over 10,000 R packages. You read that right. 10,000. Realistically, you will never learn all 10,000 packages, let alone master all 10,000 packages.
I should also point out that there is a lot of redundancy among these packages. In R, there’s often more than one way to do things. For example, to perform data visualization you can use the plotting functions from base R, but there are also several other packages and tools for visualizing and plotting data. Do you know which tools are the ‘best’? Do you know which packages you need to learn, and which ones you should skip?
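To make the redundancy concrete, here is a minimal sketch that draws the same scatter plot two ways. The choice of plot() as the base R function, the built-in mtcars dataset, and its wt and mpg columns are assumptions picked for illustration; they are not a recommendation of one tool over the other.

```r
# Base R graphics: a single function call draws the scatter plot.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# ggplot2 (a separate package): the same chart, built from layers.
# The guard only keeps this script runnable where ggplot2 is not installed.
scatter_gg <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  scatter_gg <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
    geom_point()
  print(scatter_gg)
}
```

Both snippets produce a comparable chart, which is exactly why you need to decide which toolkit to invest your time in.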
To master R quickly and efficiently, you need to be able to select a very small number of packages among these 10,000. You need to choose what to learn and what to ignore.
Additionally, once you select the best packages to learn, you need to choose what to learn within those packages. Even if you select the best R packages, within those packages there are tools that you absolutely need to know, but other things that you probably don’t need to learn right now. Some tools and techniques are things that you don’t really need, and it would be better to wait a few months or years to learn them. Again, you need to know what to learn and what not to learn.
Focus on foundations
In the context of ‘selecting’ the right topics to learn, it is very useful to focus on foundational skills.
To master data science, you need to do this. You always need to master foundational techniques before you move on to advanced topics.
So as you’re learning data science, this means that you need to master 3 critical areas:
- Data visualization
- Data manipulation
- Data analysis (AKA, exploratory data analysis)
A big mistake that beginners make is that they jump into advanced topics too soon, before they’ve mastered these foundations. For example, new data science students get excited about machine learning and want to start there. These students would be much better served by first mastering the critical, foundational tools, like data visualization and data manipulation.
To put this another way, by focusing on the foundations, you put yourself in a position to rapidly master more advanced topics later. If you master the critical foundations first, you will be better equipped to learn more advanced topics later, and you will do so at a much faster rate.
Do you want to be a top-performing data scientist? Master the foundations first.
That invites the question: What are the foundational data science skills in R?
The following is a quick list:
- Basic visualizations: bar charts, line charts, histograms
- Manipulating colors in charts
- Visualization formatting (i.e., how to format your charts to make them look good)
- String manipulation
- Date manipulation
- Data reshaping (i.e., transforming from ‘wide’ format to ‘long’ format and vice versa)
- Adding variables
- Removing variables
- Aggregating data
- Reading in data (from external sources)
- Working with factor variables (e.g., ordering factors, re-naming factor levels, re-categorizing factor variables, etc.)
If you can’t do these, do not move on to more advanced topics. Don’t get shiny object syndrome. This is a fairly high-level list but a good list of the things that you absolutely need to know. To be a top performer, you should be able to do these things without even thinking about them. Top-level data scientists can do these things ‘in their sleep.’
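As a rough illustration of what a few of the items on that list look like in practice, here is a minimal sketch in base R. The sales data frame and its column names are hypothetical, invented only for this example, and base R is used just to keep the sketch self-contained; other packages can do the same jobs.

```r
# Hypothetical toy data, invented for illustration only.
sales <- data.frame(
  region  = c("north", "south", "north", "south"),
  date    = c("2017-01-15", "2017-01-20", "2017-02-03", "2017-02-11"),
  revenue = c(100, 150, 120, 90),
  cost    = c(60, 80, 70, 75),
  stringsAsFactors = FALSE
)

# String manipulation: capitalize the first letter of each region name.
sales$region <- paste0(toupper(substr(sales$region, 1, 1)),
                       substr(sales$region, 2, nchar(sales$region)))

# Date manipulation: parse the text into Date objects, then extract the month.
sales$date  <- as.Date(sales$date)
sales$month <- format(sales$date, "%Y-%m")

# Adding a variable.
sales$profit <- sales$revenue - sales$cost

# Removing a variable.
sales$cost <- NULL

# Working with factor variables: set an explicit level order.
sales$region <- factor(sales$region, levels = c("South", "North"))

# Aggregating data: total profit per region.
profit_by_region <- aggregate(profit ~ region, data = sales, FUN = sum)
print(profit_by_region)
```

Each step is only a line or two, which is the point: these are small units that you can drill until you can write them from memory.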
Design a learning plan
Once you select the right topics, you need a learning plan. Specifically, you need to sequence the topics in the best order.
There are some topics that are dependent on other topics. For example, I have routinely said that the prerequisites for machine learning are data visualization, data manipulation, and data analysis. To effectively learn ML, you need to be able to wrangle a data set, clean it, and visualize it. So, if you want to eventually learn ML, you need to start with visualization and manipulation first.
Sequencing also matters within the foundations. It’s best not to start with data manipulation, because data manipulation is mainly needed for more complex and messy data. There are much better data science topics to start with. In order to rapidly master data science, you need to be able to sequence the material in the optimal order and learn the right things at the right time.
Once you have a plan, you can start learning. Learning data science topics can be challenging. Many topics can be very confusing. How quickly you learn can depend critically on the quality of your learning materials.
Having said that, learning is not the final step.
Once you learn the basic concepts and techniques, you need to practice. You need to practice techniques and review concepts until they are ‘second nature.’
This is an extremely important point. There is absolutely a difference between learning something once, and remembering it in the long run.
Let me give you an example. If I show you a video right now that explains the ggplot() function, you’ll probably understand how it works. The syntax is fairly easy to understand once someone breaks it down and explains each piece of the syntax.
Next, if I ask you to write some simple ggplot() code, you’ll probably be able to do that too. For example, let’s say I ask you to create a simple scatter plot:
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()
If I ask you to do something simple, like typing the code into RStudio, you’ll probably be able to do it.
But what if I ask you to do it again 3 hours later? If I ask you 3 hours later to write that code from memory, there’s a good chance that you won’t be able to do it.
Why? Because we forget. The human brain naturally forgets.
However, there’s a way to fight this. You can halt this forgetting process by practicing. Specifically, you need to repeat and review what you’ve learned.
Practicing techniques and repeating what you learn will enable you to remember those things in the long run. Moreover, as you practice, you will become more ‘fluent’ in those techniques. You will execute those techniques more quickly and with less hesitation the more you practice.
An added benefit of practice is that effective practice methods help you become a “top performer.” In fact, research has shown that elite levels of performance are strongly tied to deliberate practice. If you want to be a top performer, practice is critical.
I’ve said several times that to be a top-performing data scientist, you need to be able to execute the basic techniques ‘in your sleep.’ You should be able to do essential data visualization and data manipulation ‘with your eyes closed.’
You can achieve this level of mastery by practicing the right way, using good practice systems.
Effective learning has large benefits
Learning data science quickly can be a massive benefit.
Let’s put some numbers around it.
Let’s say that two people are learning data science: you and someone else. The other person learns extremely inefficiently, and takes 1000 hours to master the basics. But you learn the basics much faster, in about 200 hours.
The difference, 800 hours, is a really big difference. Again, we can put some numbers around this.
If your free time is worth only $20 an hour, that time savings of 800 hours translates to $16,000.
But let’s say that you really value your time. (You should value your time. Time is the only resource you can’t get back.) If you value your time at $50 an hour, the time you save by learning more efficiently amounts to a staggering $40,000.
Now these are just example numbers for illustration, but you get the idea.
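If you want to rerun this arithmetic with your own numbers, here is a minimal sketch in R. The variable names are made up for this example, and the hour counts and hourly values are the illustrative figures from above, not measurements.

```r
# Illustrative figures from the example above.
hours_slow <- 1000   # hours the inefficient learner needs
hours_fast <- 200    # hours the efficient learner needs

# Time saved by learning efficiently.
hours_saved <- hours_slow - hours_fast

# Value of the saved time at two different hourly valuations (in dollars).
hourly_values <- c(20, 50)
savings <- hours_saved * hourly_values
names(savings) <- paste0("$", hourly_values, "/hour")

print(hours_saved)
print(savings)
```

Swap in your own estimate for what an hour of your free time is worth to see what efficient learning saves you.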
Being highly effective and efficient in learning data science has massive benefits.
There’s actually another benefit of being an effective learner. If you really know how to learn, you’ll not only learn more quickly, but you’ll attain higher levels of proficiency and mastery.
If you are highly effective at learning data science, it becomes much easier to become a ‘top performer.’
It really pays to be a top performer. The reason for this is that the best people in tech often receive outsized gains. The best people disproportionately get the best jobs, highest salaries, and best perks. You’ve probably heard about the mythical ‘10X developer’ … people who are 10 times more productive. These top performers often get the lion’s share of rewards in the tech industry, while less skilled performers make dramatically less.
Get expert help
It definitely pays to learn data science as quickly as possible and to master the techniques.
Let’s review how you can do that:
- Break it down
- Figure out what to do, and what not to do
- Sequence the material
- Practice until the techniques are second nature
At a high level, that’s really it (although the key to getting it right is in the details).
If you can apply this learning process to data science, you’ll accelerate your learning and increase your chances of success.
But if you really want to accelerate your progress and learn as quickly as possible, there’s one more thing you can do.
You can get guidance from an expert.
Top performers understand that they can save massive amounts of time by getting advice from people who have already mastered the topic.
Learning a new subject is time-consuming, because you need to figure out what to learn, design a learning plan, sequence the material, and do all of the other things I’ve already talked about. And you need to do these things without a clear understanding of the subject. It’s like trying to find your way through a jungle, alone, without knowledge of the terrain. You’d be well served by getting a guide … someone to safely and quickly get you to your destination.
A data science mentor can tell you exactly what to do: “learn this first, learn this second, focus on x-y-z, don’t bother learning that topic, etc.” A good teacher can dramatically accelerate your learning, because they remove the burden of having to find the path on your own.
If you want to rapidly master data science, you should do what top performers do and get expert guidance. While it is possible to learn data science on your own, you can learn much, much faster with help. That might mean finding a data science mentor, but it could also mean taking a good data science course.