Published in MLearning.ai

Breast Cancer Detection with Decision Trees

A guide on how to find out the best parameter of the decision tree algorithm using scikit-learn.

Photo by Johannes Plenio on Unsplash

The decision tree is one of the most widely used machine learning algorithms. In this post, I’ll cover the following topics:

  • What are decision trees?
  • Some advantages and disadvantages of decision trees
  • Data preprocessing
  • Building the model
  • Model evaluation
  • Hyperparameter tuning with grid search

For more content about machine learning, you can follow us on Tirendaz Academy YouTube channel.

Let’s dive in!

What are decision trees?

Decision trees are a non-parametric supervised learning method. This technique is widely used for classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules from the features. In other words, decision trees encode a series of if-then-else rules. Each internal node in a tree contains a condition, and each leaf holds a prediction.

Decision tree for iris dataset
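To make the if-then-else idea concrete, here is a small hand-written sketch of what a shallow tree for the iris dataset might look like. The function name and the thresholds are illustrative assumptions, not values taken from a model fitted in this post.

```python
def classify_iris(petal_length_cm: float, petal_width_cm: float) -> str:
    """Hand-written stand-in for a depth-2 decision tree (illustrative thresholds)."""
    # Root node: the first condition splits off one class completely
    if petal_length_cm <= 2.45:
        return "setosa"
    # Internal node: a second condition separates the remaining two classes
    if petal_width_cm <= 1.75:
        return "versicolor"
    # Leaf reached when both conditions above are false
    return "virginica"


print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.5))
print(classify_iris(6.0, 2.2))
```

A fitted tree works the same way, except that the algorithm chooses the features and thresholds from the training data.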

Some advantages and disadvantages of decision trees

Like other machine learning estimators, decision trees have advantages and disadvantages. You can build a good decision tree model by taking the following points into account.

First, let’s take a look at some advantages of decision trees.

  • Decision trees are simple to understand and interpret.
  • You can easily visualize trees.
  • Decision trees require little data preprocessing.
  • You can deal with both numerical and categorical data using this technique.

Of course, decision trees also have some disadvantages. Let’s take a look at them.

  • Decision tree learners can create over-complex trees that do not generalize well to new data. To overcome this problem, you can use methods such as setting the maximum depth of the tree, setting the minimum number of samples required at a leaf node, and pruning.
  • Decision trees can be unstable. To avoid this problem, you can use decision trees within an ensemble.
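The remedies in the list above map onto scikit-learn parameters roughly as follows. This is a minimal sketch, assuming scikit-learn's built-in copy of the breast cancer data (load_breast_cancer) and arbitrary illustrative parameter values; it is not a tuned setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in dataset stands in for any tabular classification data
X, y = load_breast_cancer(return_X_y=True)

# Unconstrained tree, for comparison
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Limit complexity while growing: maximum depth and minimum samples per leaf
shallow_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# Cost-complexity pruning: a larger ccp_alpha removes more of the grown tree
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

# Ensemble of many trees, which averages out the instability of a single tree
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for name, tree in [("full", full_tree), ("shallow", shallow_tree), ("pruned", pruned_tree)]:
    print(f"{name:>8}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
print("Trees in the forest:", len(forest.estimators_))
```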

Decision trees with scikit-learn

To show how to implement the decision tree algorithm, I’m going to use the Breast Cancer Wisconsin dataset. Before loading the dataset, let me import pandas.

import pandas as pd

Let’s load the dataset.

df = pd.read_csv("Breast Cancer Wisconsin.csv")

You can find the dataset here. Let’s take a look at the first five rows of the dataset.

df.head()
The first rows of the breast cancer dataset

This dataset consists of examples of malignant and benign tumor cells. The first column shows the unique ID numbers and the second column shows the diagnoses, where M indicates malignant and B indicates benign. The rest of the columns are our features. Let’s take a look at the shape of the dataset.

df.shape
#Output:
(569, 33)

Data Preprocessing

Now, let’s create the input and output variables. To do this, I’m going to use the loc indexer and the drop method. First, let me create our target variable.

y = df.loc[:,"diagnosis"].values

Beautiful. We created the target variable. Let’s create our feature variable. To do this, I’m going to use the drop method. Let me remove the target variable and unnecessary columns.

X = df.drop(["diagnosis","id","Unnamed: 32"],axis=1).values

Note that our target variable has two categories, M and B, stored as strings. Scikit-learn works with NumPy arrays, and integer labels are convenient to work with, so let’s encode the target variable with a label encoder. First, I’m going to import this class.

from sklearn.preprocessing import LabelEncoder

Now, I’m going to create an object from this class.

le = LabelEncoder()

Let’s fit and transform our target variable.

y = le.fit_transform(y)
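It helps to check which integer each label received. The sketch below uses a tiny hand-made label array (an assumption for illustration, not the real column) to show that LabelEncoder sorts the classes alphabetically, so B becomes 0 and M becomes 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Assumption: a toy label array standing in for the "diagnosis" column
labels = np.array(["M", "B", "B", "M"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # classes in sorted order; index = encoded value
print(encoded)                        # integer code for each original label
print(le.inverse_transform([0, 1]))  # map integer codes back to the original labels
```

With this encoding, a prediction of 1 means malignant and 0 means benign.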

Before building the model, let’s split the dataset into training and test sets. To do this, I’m going to use the train_test_split function. First, let me import this function.

from sklearn.model_selection import train_test_split

Let’s split our dataset using this function.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
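Two details of this call are worth checking: with no test_size given, train_test_split holds out 25% of the rows, and stratify=y keeps the class proportions nearly the same in both parts. The sketch below verifies this, assuming scikit-learn's built-in copy of the dataset (load_breast_cancer) in place of the CSV file. Note that the built-in copy encodes the classes the other way around (0 = malignant, 1 = benign).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: built-in copy of the data instead of the CSV used in this post
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Default split: 75% of the rows for training, 25% for testing
print(X_train.shape, X_test.shape)

# Share of class 1 in the full data and in each part of the split
print(f"Class-1 share, full/train/test: "
      f"{y.mean():.3f}/{y_train.mean():.3f}/{y_test.mean():.3f}")
```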

Building the decision tree model

Let’s go ahead and take a look at how to build the decision tree model. First of all, I’m going to import the decision tree classifier class.

from sklearn.tree import DecisionTreeClassifier

Let’s create an object from this class. First, I want to use the default values. So, I’m going to only use the random_state parameter.

dt = DecisionTreeClassifier(random_state = 42)

Let’s train the model on the training set.

dt.fit(X_train, y_train)

Awesome. We built our model. Now, let’s predict the training and the test values with this model.

y_train_pred=dt.predict(X_train)
y_test_pred=dt.predict(X_test)

Now, let’s take a look at the performance of the model on the training and test set. To do this, I’m going to use the accuracy_score function. First, let me import this function.

from sklearn.metrics import accuracy_score

Now let’s take a look at accuracy scores for training and test sets.

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)

Now, let’s print these scores.

print(f"Decision tree train/test accuracies: {tree_train:.3f}/{tree_test:.3f}")
#Output:
Decision tree train/test accuracies: 1.000/0.951

As you can see, the score on the training set is 100%, but the score on the test set is about 95%. This gap is a sign of overfitting. With the default settings, the tree keeps splitting until it classifies every training example correctly, so it effectively memorizes the training set and generalizes less well to unseen data. Overfitting typically happens when the model is too complex for the data.
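To see how complex the unconstrained tree actually is, you can inspect its depth and number of leaves. This is a minimal sketch that assumes scikit-learn's built-in copy of the dataset (load_breast_cancer) rather than the CSV, so the exact numbers may differ from the ones in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: built-in copy of the data instead of the CSV used in this post
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Default settings: no limit on depth or leaf size
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("Depth:", dt.get_depth())
print("Leaves:", dt.get_n_leaves())
# score() returns mean accuracy for classifiers
print("Train accuracy:", dt.score(X_train, y_train))
print("Test accuracy:", dt.score(X_test, y_test))
```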

To overcome the overfitting problem, we control the complexity of a tree. To do this, we have multiple ways. First, let’s specify the max_depth parameter which controls the maximum number of levels. The default value for the max_depth parameter is None, which means that the tree can grow as large as possible. We can try a smaller value and compare the results. Let me specify the max_depth parameter.

dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)

Now, let’s take a look at the performance of this model on the training and the test set again.

y_train_pred=dt.predict(X_train)
y_test_pred=dt.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f"Decision tree train/test accuracies: {tree_train:.3f}/{tree_test:.3f}")
#Output:
Decision tree train/test accuracies: 0.951/0.923

As you can see, the accuracy on the training set dropped from 100% to about 95%. It means that the model can no longer memorize all the outcomes from the training set. However, the accuracy on the test set also dropped, from 95.1% to 92.3%, so making the tree this shallow did not improve its ability to generalize.

In other words, another problem has appeared: the model is now too simple. To make it better, we need to tune the model using different parameter values. To do this, I’m going to use the grid search technique.
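One way to see the trade-off between a tree that is too complex and one that is too simple is to sweep max_depth and print both accuracies. The sketch below assumes scikit-learn's built-in copy of the dataset (load_breast_cancer) and an arbitrary list of depths; treat it as an exploration aid, since choosing a depth by looking at test accuracy would leak information from the test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: built-in copy of the data instead of the CSV used in this post
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

results = []
for depth in [1, 2, 3, 4, 5, 7, 10, None]:  # None means no depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    # Store (depth, train accuracy, test accuracy)
    results.append((depth,
                    model.score(X_train, y_train),
                    model.score(X_test, y_test)))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth!s:>4}  train={train_acc:.3f}  test={test_acc:.3f}")
```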

Hyperparameter tuning with grid search

You can find out the best parameters for your model with the grid search technique. Let’s import GridSearchCV class.

from sklearn.model_selection import GridSearchCV

First, I’m going to create an object from DecisionTreeClassifier.

dt = DecisionTreeClassifier(random_state = 42)

Now, let me create a parameters variable that holds the candidate values for max_depth and min_samples_leaf, which is another important parameter.

parameters = {"max_depth": [1, 2, 3, 4, 5, 7, 10],
              "min_samples_leaf": [1, 3, 6, 10, 20]}

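Awesome. We specified the values of the parameters. To find the best parameters, I’m going to create an object from GridSearchCV. It evaluates every combination of these values with cross-validation (5-fold by default) on the data passed to fit.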

clf = GridSearchCV(dt, parameters, n_jobs= 1)

Our model is ready to train. Next, I’m going to fit it on the training set.

clf.fit(X_train, y_train)

Finally, to see the best parameters, I’m going to use the best_params_ attribute.

print(clf.best_params_)
#Output:
{'max_depth': 3, 'min_samples_leaf': 1}

When we execute this cell, you can see the best parameters. These are 3 for max_depth and 1 for min_samples_leaf.
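Besides best_params_, the fitted search object keeps the cross-validated score of every combination. The sketch below shows one way to inspect them; it assumes scikit-learn's built-in copy of the dataset (load_breast_cancer), so the ranking may differ from the result above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: built-in copy of the data instead of the CSV used in this post
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

parameters = {"max_depth": [1, 2, 3, 4, 5, 7, 10],
              "min_samples_leaf": [1, 3, 6, 10, 20]}
clf = GridSearchCV(DecisionTreeClassifier(random_state=42), parameters, n_jobs=1)
clf.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
columns = ["param_max_depth", "param_min_samples_leaf",
           "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = pd.DataFrame(clf.cv_results_)[columns].sort_values("rank_test_score")

print(cv_table.head())
print(f"Best mean CV accuracy: {clf.best_score_:.3f}")
```

Here "test score" in cv_results_ refers to the held-out folds inside the training set, not to X_test. If several combinations score almost the same, the simpler one is often a reasonable choice.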

Evaluating the model

Now, I’m going to make predictions with the model trained with these parameters. Note that we don’t need to train our model again: by default (refit=True), GridSearchCV refits the best model on the whole training set once the search is finished. Let’s predict the training and the test values.

y_train_pred=clf.predict(X_train)
y_test_pred=clf.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f"Decision tree train/test accuracies: {tree_train:.3f}/{tree_test:.3f}")
#Output:
Decision tree train/test accuracies: 0.974/0.958

Here you go. These accuracy scores come from the model with the best parameters. Note that the score on the training set is now close to the score on the test set, and both are close to 1. The test accuracy (0.958) is also slightly higher than that of the default tree (0.951). So, we found the best parameters in our grid and used them to predict the values in the training and test sets.
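Accuracy is a single number, so it can help to see which kinds of mistakes remain. The sketch below builds a confusion matrix for a tree with the parameters found above. It assumes scikit-learn's built-in copy of the dataset (load_breast_cancer), where 0 means malignant and 1 means benign, so the counts may differ from the CSV-based results in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: built-in copy of the data instead of the CSV used in this post
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Tree with the parameters reported by the grid search above
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1, random_state=42)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_test_pred)
print(cm)

# Accuracy is the share of predictions on the diagonal
accuracy_from_cm = cm.trace() / cm.sum()
print(f"Accuracy from the confusion matrix: {accuracy_from_cm:.3f}")
```

The off-diagonal cells show the two error types separately, which matters in a medical setting where a missed malignant case and a false alarm have different costs.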

Conclusion

Decision trees are a non-parametric supervised learning method. You can perform both classification and regression tasks with the decision tree algorithm. In this post, I talked about decision trees and how to implement this technique with scikit-learn. Finally, I showed how to find the best parameters with the grid search technique. You can find this notebook here.

That’s it. Thanks for reading. I hope you enjoy it. Don’t forget to follow us on YouTube | GitHub | Twitter | Kaggle | LinkedIn.

