I plan to keep this journal as a weekly thing, and in each entry I want to talk about:
- What progress I made in the past week
- One discovery that I think is worth sharing
- What's next
For the second part, I want to write about how what I learned applies to real-world problems, or about a concept that is important but easily overlooked by newbies like me.
This week, I want to talk about how Random Forest lets us understand customers better, beyond just classifying new customers for us. I don't intend to use technical terms and formulas to explain literally how the method works; rather, I want a person from a non-technical background (a salesperson, a marketer, a manager, or a business student) to understand what this tool is and how it is helpful.
Progress update: In my previous journal, I mentioned that I planned to start learning non-linear supervised models this week. So during the first 3–4 days of this week, I learned Decision Trees (including Random Forest) and Naive Bayes classification.
The resources I used were:
- Codecademy Data Science Path
- StatQuest’s Machine Learning Series to understand the concepts better.
- Introduction to Machine Learning with Python (O'Reilly)
- An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani.
The way I did it was watching StatQuest's videos to understand the concepts first. Then I used Codecademy to code and see exactly how it is done in Python. After these two steps, I moved on to the two books, because they give me more technical insight and do a very good job of comparing different models.
I also found that my previous study of recursive algorithms and of stats concepts like bootstrapping helped a lot when it came to understanding Random Forest.
Other than learning the two algorithms, I also did lectures 1–10 of MIT 6.0002, which helped me review basic but important stats concepts in machine learning, like confidence intervals. I also found that listening to different interpretations often gives me a new angle on an old piece of knowledge.
For the rest of the week, I focused on summarizing all the models I have learned in supervised learning. I tried to make sure I really understand how each of them differs, why each exists, what makes them good or bad, and when they should be used.
On top of this, I kept up my 2-hour daily Leetcode SQL practice, and I found myself now able to help others understand why their solutions didn't work!! I answered a couple of questions on Leetcode this week. Super happy about it.
Now let's talk about Random Forest. First, briefly, what is Random Forest at a high level? Random Forest is an upgrade of the Decision Tree algorithm. See Image 1 as an example. It's simple, right? And we use this kind of thinking daily.
So why do we need a computer for this and how does this relate to machine learning?
There's one big puzzle when using the tree method. Imagine we don't know John. After closely observing him, we figure out that "Will Miley join?" and "Any girls?" are the relevant questions to ask when predicting whether he will go to a party. But how do we decide whether to use "Will Miley join?" or "Any girls?" as the first filter? It might be easy with only 2 questions. What if we have 5 relevant questions, including "Does John have time?", "Does the party have free alcohol?" and "Is John in a good mood?" How do we decide now?
And here's where an algorithm can help us decide. The algorithm decides by asking two questions in sequence:
1. Which question gives me the most confident result? For example, if "Does the party have free alcohol?" tells us that 85% of the time John will go, vs. an 80% chance he will go if "Miley will be there", then we pick the alcohol question to ask first.
2. After selecting the question, how often do we get the right prediction by using it alone? When we still get some predictions wrong, that's when we want to add a follow-up question to increase our accuracy.
The idea is that if there is one question that can give us a 100% certain answer, and it's the right answer, then there is no need to ask further. In reality, that rarely happens, so we want to get to the most certain level by asking the fewest questions!
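For readers who want to peek under the hood, here is a tiny sketch of how a tree picks its first question. The party records below are made up for illustration, and I'm using Gini impurity, one common way tree algorithms score "confidence": the lower the impurity after a split, the more certain the answers on each side.

```python
# Hypothetical observations of John: (miley_joins, free_alcohol, john_went)
records = [
    (1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0),
    (1, 1, 1), (0, 1, 1), (1, 0, 0), (0, 0, 0),
]

def gini(rows):
    """Impurity of a group: 0.0 means perfectly certain, 0.5 means a coin flip."""
    if not rows:
        return 0.0
    p = sum(r[2] for r in rows) / len(rows)  # share of "John went" outcomes
    return 1 - p**2 - (1 - p) ** 2

def split_impurity(rows, feature_index):
    """Weighted impurity left over after splitting on one yes/no question."""
    yes = [r for r in rows if r[feature_index] == 1]
    no = [r for r in rows if r[feature_index] == 0]
    return (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)

# Lower score = more confident split, so that question gets asked first.
for name, idx in [("Will Miley join?", 0), ("Free alcohol?", 1)]:
    print(name, round(split_impurity(records, idx), 3))
```

On this made-up data, "Free alcohol?" scores 0.188 and "Will Miley join?" scores 0.438, so a tree would ask about the alcohol first.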
So, a tree sounds logical enough. Why do we need a forest and a random one?
The problem with a tree is that it focuses on producing a result as fast as possible, and sometimes it neglects possible variations. For example, say we now want to expand this algorithm to 500 boys who are also Miley's friends and predict whether each of them will go to the party, and John just happened to be the sample we picked. Even if we really believe John is a typical guy, his priorities might not be the priorities of the others. You might think we can collect more samples, but the same uncertainty will remain unless we ask all the boys.
And this is why we need a forest.
The idea of having a forest is that we have thousands and thousands of possible trees. We might only know John, but we can imagine many, many Johns in our mind and vary their priorities in all the possible ways. Then we look at the result of having 1,000 Johns with different priorities: if more than 500 of them say they will go to the party, we predict that a new boy will go too. Now we have more confidence that our algorithm is useful even when we are not predicting for that one single John.
So a forest is great. But what's so random about it?
The random part happens when we create those imaginary Johns and picture them having different priorities. Look at the bag: in there, we have 5 balls carrying our 5 questions. We close our eyes, pick out 1 ball, put it back, pick out another ball, put it back... We do this again and again blindly, creating many imaginary Johns: some use only 2 questions, some use 3, and some use them all.
That's why it is a Random. Forest. And as you can tell, the more trees we have, the more possibilities we cover and the more generally applicable our predictions will be.
Using this algorithm, we can predict things like: Will our customers react positively to this campaign? Will our targeted customers like this new car model? And so on.
Ok, great! But we already know algorithms can give us magic predictions; that's what they do. Where does Random Forest give us more information compared to other methods?
Personally, I think this is the most amazing part of the Decision Tree method compared to other classification methods in supervised learning: it can tell you which feature (or question, to stick with the previous example) contributes the most to the final decision.
In other words, if you are studying the churn rate of your customers on product X, Random Forest will not only tell you whether a given customer will leave or stay, but also which reason is causing most customers to leave.
How? Let's look at a simple example. Say you want to develop an app, and you have four features that relate to whether users will keep using the app: "pay or free", "customizable or not", "has a share function" (so people can share on their social media platforms), and "has great customer service".
You run the decision tree algorithm and get the best possible tree, shown in Image 4. By looking at the tree, you now have 3 pieces of information.
- You can predict what kind of app will keep your targeted customers: one that is either free with a share function, or paid but customizable. You might think your best salesperson, who knows the customers really well, could tell you this too. True, but a tree can also tell you which feature matters more and by how much! Let's look at the other two pieces of information.
- You know that the "customer service" feature plays no role at all in keeping your targeted customers, and you learn that the "pay" feature contributes the most to deciding whether a customer will stay.
- You can also quantify each feature's usefulness by looking at its prediction accuracy. After you divide the customers by the "pay" feature, you get the final prediction right 60% of the time. After asking the two follow-up questions, you get the right prediction 90% of the time. However, when you try to divide your customers further using "customer service", your accuracy doesn't increase at all.
Now the product team knows which feature to improve first, and the choice is backed by numbers. That's something very useful, especially when you have 5 or 10 or even more features, and it's something other algorithms do not reveal.
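If you want to see this in practice, scikit-learn's Random Forest exposes exactly this information through its `feature_importances_` attribute. The sketch below trains on an invented churn dataset that mimics the app example (free apps retain users when there is a share function, paid apps retain users when the app is customizable, and customer service is pure noise), so the numbers are illustrative only.

```python
import random

from sklearn.ensemble import RandomForestClassifier

random.seed(42)

# Hypothetical app records: [pay (1 = paid), customizable, share, customer service]
rows, stayed = [], []
for _ in range(40):  # free apps: users stay when there is a share function
    share = random.random() < 0.75
    rows.append([0, random.random() < 0.5, share, random.random() < 0.5])
    stayed.append(share)
for _ in range(40):  # paid apps: users stay only when the app is customizable
    custom = random.random() < 0.25
    rows.append([1, custom, random.random() < 0.5, random.random() < 0.5])
    stayed.append(custom)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(rows, stayed)

names = ["pay", "customizable", "share", "customer service"]
for name, score in zip(names, model.feature_importances_):
    print(f"{name}: {score:.2f}")
```

On data built this way, "customer service" should come out near the bottom of the importance list, which is how the product team would spot that it is not worth improving first.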
Now, you might notice that instead of talking about a whole forest, I used a single decision tree to illustrate how a tree method can give us more information about our customers. The reason is that a Random Forest goes through a more complicated process, since it contains so many different trees, but it shares the same underlying principle.
Of course, this introduction only scratches the surface of the algorithm. But as I said at the beginning, my purpose is to demonstrate how algorithms are helpful in a real-world setting using non-technical language. Hope this helps :)
What’s next for me?
Next week, my focus will continue to be reviewing and summarizing all the linear models in supervised learning, as well as putting the other components together: feature engineering, subset selection, and model selection and validation. Though I cannot wait to get to clustering in unsupervised learning, over the past two days I have found this "mid-term study" session very useful for consolidating what I have learned so far.
The other two goals are:
- Do a project on customer segmentation using random forest.
- Study MIT 6.041 Probabilistic Systems Analysis and Applied Probability, ideally through lecture 15, or at least lecture 10.
Have a great week everyone!