Decision Trees using Sklearn

Anjali Pal · Published in Analytics Vidhya · Jan 24, 2021

“Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.” — Andrew Ng

With decision trees, a machine can make well-reasoned decisions without human intervention. A decision tree is a supervised learning algorithm used to solve both regression and classification problems. It is a predictive modelling approach that gives a graphical representation of decisions and every possible outcome of those decisions.

A typical decision tree starts with a root node, splits the data at each internal decision node based on an attribute test, and ends in leaf nodes that hold the final prediction.

The major challenge in building a decision tree lies in selecting the attribute to split on at each node. This selection is driven by one of the following two metrics:

  1. Entropy (Information Gain): Entropy is a measure of randomness (impurity) among the data points at a node. Information gain is the reduction in entropy achieved by a split, so the attribute whose split lowers entropy the most yields the highest information gain and is chosen (see the sketch after this list). This criterion tends to favour attributes with many distinct values, which produce partitions with small counts.
  2. Gini Index: A measure of how often a randomly chosen element would be incorrectly labelled if it were labelled randomly according to the class distribution at the node. The attribute with the lower Gini index is preferred. It avoids the logarithm computation, so it is cheaper and well suited for larger partitions.
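To make these metrics concrete, here is a minimal sketch computing both criteria on a toy label array. It assumes NumPy, and the helper names entropy, gini, and information_gain are mine, not from the article:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy: -sum(p * log2(p)) over the class proportions p.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity: 1 - sum(p^2) over the class proportions p.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child partitions.
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

labels = ["setosa"] * 2 + ["versicolor"] * 2
print(entropy(labels))  # 1.0 (a 50/50 split is maximally impure for two classes)
print(gini(labels))     # 0.5
print(information_gain(labels, [["setosa"] * 2, ["versicolor"] * 2]))  # 1.0, a perfect split
```

A pure node scores 0 on both metrics, which is why splits that isolate a single class are preferred.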

Now, I’ll explain the decision tree classifier with the help of the Iris dataset.

Step One

Always get to know the dataset before starting (I’ve used iris.DESCR for this). It helps you understand what the data includes and its data types.
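The article reads the dataset’s description via iris.DESCR; a sketch of that step, assuming the data is loaded with scikit-learn’s load_iris:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.DESCR)          # full description: features, classes, summary statistics
print(iris.feature_names)  # ['sepal length (cm)', ..., 'petal width (cm)']
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```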

Step two and three

The next step is always to figure out your independent and target variables. Once this is done, you can use the sklearn library to build a decision tree classifier.
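The article’s exact split and hyperparameters aren’t shown, so the test size and random_state below are assumptions; a minimal sketch of separating the variables and fitting the classifier:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = iris.data, iris.target  # independent variables and the species target

# Hold out a test set; random_state is an assumption, for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = DecisionTreeClassifier()  # the default criterion is 'gini'
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out set
```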

Decision Tree
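The original post shows the fitted tree as an image. One way to reproduce such a plot, assuming matplotlib is available, is scikit-learn’s plot_tree:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,  # colour each node by its majority class
)
plt.show()
```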

To understand how the tree above makes predictions, let’s walk through some examples.

Case 1:

Take sepal_length = 2.5, sepal_width = 1, petal_length = 1.5, petal_width = 2. The root node question is petal_length <= 2.45, which is True, and hence the class is setosa.

Case 2:

Take sepal_length = 2.5, sepal_width = 1, petal_length = 2.46, petal_width = 2. The root node question is petal_length <= 2.45, which is False, so we move to the next question, petal_width <= 1.75, which is also False. The next question is petal_length <= 4.85, which is True. Then comes the question sepal_length <= 5.95, which is also True, and hence the class is versicolor.

To test the predictions, I’m using the predict method below.
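A sketch of that step for the two cases above (the names samples and predictions are mine; the expected classes follow the walkthrough, though a tree trained on a different split could answer differently):

```python
# Feature order: [sepal_length, sepal_width, petal_length, petal_width]
samples = [
    [2.5, 1.0, 1.50, 2.0],  # Case 1
    [2.5, 1.0, 2.46, 2.0],  # Case 2
]
predictions = clf.predict(samples)
print(iris.target_names[predictions])  # expected per the walkthrough: ['setosa' 'versicolor']
```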

Results

Running predict on the two cases returns setosa for Case 1 and versicolor for Case 2, matching the walkthrough above.

Hope this helps you. If you have any questions, feel free to add a comment or ping me on LinkedIn.

To check out my projects and other articles, visit my website at https://anjali001.github.io/ ; my bot “Grey” is there to welcome you.
