A Week of Machine Learning: Day 4

Karan Jakhar · Published in Analytics Vidhya · 6 min read · Sep 18, 2019

Decision Trees

I will try to keep it simple. Imagine a tree: it has different branches, and each branch divides further into new branches. The same concept is used here. We call it a Decision Tree because at each node we decide which branch to move towards. Moving like this, we reach the end, and there we have our answer.

I am going to explain every part of the code. Trees have proved to be very effective on tabular (structured) data. Below we build a Decision Tree step by step to solve a classification problem. You should write the whole code yourself; it will help you understand it better and also make you comfortable with coding.

First we import the required libraries: pandas to load and work with the data, and the classifier itself.
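In code, this looks roughly as follows (a minimal sketch; DecisionTreeClassifier lives in scikit-learn's tree module):

```python
import pandas as pd  # to load and manipulate the data
from sklearn.tree import DecisionTreeClassifier  # our Decision Tree classifier
```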

Next we load the data. In pandas, we can either provide the path to a file that is already downloaded locally, or pass the URL of the data; pandas will download it and load it into a DataFrame. Link to the dataset is here.
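For example (the file name below is a placeholder; point read_csv at your local copy or the dataset URL):

```python
# Load the dataset into a DataFrame. Replace the path with your own copy or a URL.
df = pd.read_csv("nba_rookies.csv")
```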

This prints the first 5 rows; we always need to look at our data first.
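In pandas that is a one-liner:

```python
df.head()  # show the first 5 rows of the DataFrame
```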

This gives basic information about our data.
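Again a one-liner:

```python
df.info()  # column names, dtypes, and non-null counts for each column
```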

From info() we found that ‘3P%’ has null values, so we need to fill them. We use the fillna() function to fill them with 0.
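Something like:

```python
# '3P%' has missing values; replace them with 0.
df["3P%"] = df["3P%"].fillna(0)
```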

Here we plot our target values to check how many true and how many false values we have.
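A quick way to do this (I am assuming the target column is called 'TARGET_5Yrs', as in the NBA rookies dataset; adjust the name to your data):

```python
import matplotlib.pyplot as plt

# Count the samples in each target class and plot the counts as a bar chart.
df["TARGET_5Yrs"].value_counts().plot(kind="bar")
plt.show()
```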

Next we put the target and the feature values into separate variables.
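A sketch, again assuming the 'TARGET_5Yrs' target column and a non-numeric 'Name' column that we drop from the features:

```python
# Features: everything except the target (and the player name, which is not numeric).
X = df.drop(columns=["Name", "TARGET_5Yrs"])
y = df["TARGET_5Yrs"]
```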

Then we split our data into a training set and a testing set.
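Using scikit-learn's helper (the 80/20 split ratio here is a common choice, not a requirement):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```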

Creating the classifier.
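With default hyperparameters, that is just:

```python
clf = DecisionTreeClassifier()  # default settings are a fine starting point
```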

We train the classifier and test it on the testing data. score() returns the accuracy.
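Roughly:

```python
clf.fit(X_train, y_train)          # train on the training set
print(clf.score(X_test, y_test))   # mean accuracy on the test set
```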

Ensemble Learning

A single model cannot capture all the diversity in the data, so we need more than one model working together. This technique is called Ensemble Learning: we combine the results from different models to produce our final result, and every model contributes to it. It is like combining the opinions of different people, each with a different view of the problem.

Ensemble Learning is used to increase the power of Decision Trees. We have two methods:

  1. Bagging
  2. Boosting

Bagging

In this technique, we create different trees which are independent of each other. Then at the end, we combine their results to get our final result. Random Forest is a great example of bagging, and it performs very well. I am not going deep into the maths behind it because for now it is not required. As a beginner, you should first learn how to use things and their basic concepts, and only later go deep into them one by one. This keeps learning interesting.

Random Forest

It is a type of bagging. In this algorithm, both the samples and the features are selected randomly: each tree is trained on a random sample of the data, and at each split only a random subset of the features is considered. In a plain Decision Tree, the split is chosen with an impurity score, and the feature with the lowest impurity score is used to split the tree. In a Random Forest, that choice is made only among the random subset. By doing so we get many weaker trees; we combine the results of all the weaker trees, and that is how we cover a lot of diversity.

The rest of the code stays the same; only two sections change. We import the RandomForestClassifier to create our model. Using sklearn is pretty simple: we can easily create and train a model.
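The new import (RandomForestClassifier lives in scikit-learn's ensemble module):

```python
from sklearn.ensemble import RandomForestClassifier
```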

Here we create an instance of the RandomForestClassifier, train it with the fit() method, and check the accuracy with score().
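Reusing the train/test split from before, something like:

```python
rf = RandomForestClassifier(n_estimators=100)  # 100 trees is the scikit-learn default
rf.fit(X_train, y_train)                       # train the whole forest
print(rf.score(X_test, y_test))                # accuracy on the test set
```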

Boosting

In this, we create many different trees, but each one depends on the previous trees: a tree tries to minimize the error of the trees before it. The maths behind all this is quite simple, but for now I am not going deep into it. Just keep in mind that in boosting, each tree takes input from the previous tree and tries to minimize its error, and like this we build many different trees. In bagging, by contrast, we create trees that work independently, and at the end we average their results (in case of regression) or take the majority vote (in case of classification).

LightGBM

It is a very famous and widely used algorithm. It uses the boosting concept and is very simple and easy to use. LightGBM works great on tabular data most of the time. Below is a code example.

We need to install the LightGBM library. If you are using Google Colab, you don't have to worry about it; if you are working on your local machine, do install LightGBM first. It is a very simple and easy process. We assign the name lgb to LightGBM so that we don't need to type the full name every time; the as keyword lets you assign any name you want.
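A sketch of the setup:

```python
# On a local machine (Google Colab already ships with it):
#   pip install lightgbm
import lightgbm as lgb  # 'as lgb' gives the library a short alias
```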

The steps are simple: create an instance of the LightGBM classifier, train it with the fit() method, and check the accuracy with the score() method.
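Using LightGBM's scikit-learn interface, and reusing the earlier train/test split:

```python
model = lgb.LGBMClassifier()         # boosted-tree classifier with default settings
model.fit(X_train, y_train)          # each new tree corrects the previous ones
print(model.score(X_test, y_test))   # accuracy on the test set
```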

There are other boosting algorithms as well; some famous ones are XGBoost and CatBoost. But LightGBM uses less memory and is generally more efficient than XGBoost.

You will learn more if you write the code yourself. As you are a beginner, don't just copy and paste the code. If you want to copy a piece of code, look at it and type it out yourself; it doesn't matter how many times you have to look at it. Once your work is done, play with that code.

If you have some tabular data and you don't know which algorithm to use, I would say go for trees. Start with Random Forest and then try LightGBM.

Happy Learning!!
