Published in Analytics Vidhya

LOGISTIC REGRESSION FOR CLASSIFYING CANCER BEHAVIORAL RISK

Introduction

Linear models are a class of models that are widely used in practice and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make a prediction using a linear function of the input features.

Linear models are also used extensively for classification. One of the most commonly used linear models for classification is logistic regression.

Logistic regression is a technique that machine learning borrowed from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).
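To make the idea concrete, here is a minimal sketch of how a logistic regression model turns a linear function of the features into a class prediction. The weights, bias, and input values are hypothetical numbers chosen only for illustration; they do not come from the dataset used later.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Probability of class 1 for feature vector x, given weights w and bias b."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical weights, bias, and input (illustration only).
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

# Linear part: 0.8*2.0 - 0.5*1.0 + 0.1 = 1.2, then squashed by the sigmoid.
p = predict_proba(x, w, b)

# Predict class 1 when the probability is at least 0.5, otherwise class 0.
label = int(p >= 0.5)

print("Probability of class 1: {:.3f}".format(p))
print("Predicted class:", label)
```

The linear function alone can produce any real number; the sigmoid squashes it into a probability, and the 0.5 threshold turns that probability into one of the two class values.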

In this paper, we will use logistic regression to classify cervical cancer behavioral risk, which is a binary classification problem.

Disclaimer

The author is not an expert in the health sector, so this paper should not be the main reference material.

Import the Required Libraries

First, we import the libraries that we will need.

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

About The Dataset

We download the dataset and inspect it.

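# Location of the dataset in the UCI Machine Learning Repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv'

data = pd.read_csv(url)

# Number of rows and columns
data.shape

# Column names, dtypes, and non-null counts
data.info()

# True if any value in the dataset is missing
data.isnull().values.any()

# Bar chart of how many samples fall in each class
sns.countplot(x=data.ca_cervix)
plt.title('Count Plot for Label')
plt.show()

# Count the samples in each class
label_0 = len(data[data.ca_cervix == 0])
label_1 = len(data[data.ca_cervix == 1])

total = label_1 + label_0

# Share of each class as a percentage
pc_of_0 = label_0*100/total
pc_of_1 = label_1*100/total

print('Percentage without cancer: {:.0f}'.format(pc_of_0))
print('Percentage with cancer: {:.0f}'.format(pc_of_1))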

From the output above, we know that the dataset has 72 samples and 20 columns.

One column (ca_cervix) is the label: a value of 1 means the sample has cervical cancer and a value of 0 means it does not. About 71 percent of the samples do not have cervical cancer, while about 29 percent do.

The output also shows that the dataset has no missing values, so we can move on to the next step.
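As a side note, the class percentages can also be computed in one step with pandas. The sketch below uses a stand-in label column instead of downloading the data: the counts of 51 and 21 are an assumption, picked because they are consistent with the 72 samples and the 71/29 percent split reported above.

```python
import pandas as pd

# Stand-in for data.ca_cervix: 51 zeros and 21 ones (72 rows).
# These counts are assumed; they match the percentages reported in the text.
labels = pd.Series([0] * 51 + [1] * 21, name="ca_cervix")

# Share of each class as a percentage of all samples, rounded to whole numbers.
class_pct = labels.value_counts(normalize=True).mul(100).round(0)

print("Percentage without cancer: {:.0f}".format(class_pct.loc[0]))
print("Percentage with cancer: {:.0f}".format(class_pct.loc[1]))

# True if any label is missing.
has_missing = labels.isnull().any()
print("Missing values:", has_missing)
```

With the real dataset, replacing labels with data.ca_cervix should give the same kind of summary.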

Split the Dataset

We separate the data into a training set and a test set so that we can check how well the model generalizes. It is not uncommon for a classification model to classify the training set very well but perform poorly on new, unseen data.

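# Convert the DataFrame to a NumPy array
datasets = data.values

# All columns except the last are features; the last column is the label
X = datasets[:, :-1]
y = datasets[:, -1]

# By default, 25% of the samples are held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print('Shape of Training Set is', (X_train.shape, y_train.shape))
print('Shape of Test Set is', (X_test.shape, y_test.shape))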
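The sketch below shows what the split does to the sample counts, using random stand-in data of the same size (72 rows, 19 feature columns) rather than the real dataset. It also passes stratify=y, which is not used in the code above; it is included here only as an optional variation that keeps the class proportions similar in both sets, which can help when the dataset is small and imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same size as the dataset (values are not real).
rng = np.random.default_rng(0)
X = rng.normal(size=(72, 19))
y = np.array([0] * 51 + [1] * 21)  # assumed class counts, roughly 71% / 29%

# Default test_size is 0.25, so 72 samples become 54 for training and 18 for testing.
# stratify=y is an optional addition that preserves the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)
print("Share of class 1 in training set: {:.2f}".format(y_train.mean()))
print("Share of class 1 in test set: {:.2f}".format(y_test.mean()))
```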

Classification Using Logistic Regression

Logistic regression is a statistical technique that models the relationship between a categorical response variable (nominal or ordinal) and one or more explanatory variables, which may be categorical or continuous. When the response variable has exactly two categories, for example Y = 1 for “success” and Y = 0 for “failure”, the model is called binary logistic regression.

In this paper, we will fit logistic regression twice: once with the default parameter values and once with parameter values chosen by grid search. In both cases we will use the liblinear solver, because the task is a binary classification and the dataset is small.

The main parameter of logistic regression is the regularization parameter, called C. In scikit-learn, C is the inverse of the regularization strength, so small values of C mean stronger regularization and simpler models. Tuning this parameter is therefore quite important.
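A small sketch of this effect, assuming synthetic data from make_classification rather than the cervical cancer dataset: the same model is fitted with three values of C, and the size of the coefficient vector is compared. The sample and feature counts are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the dataset used in this article).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

coef_norms = {}
for C in [0.001, 1, 100]:
    model = LogisticRegression(solver="liblinear", C=C)
    model.fit(X, y)
    # Euclidean norm of the learned coefficients: a rough measure of model complexity.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print("C={:<7} coefficient norm: {:.3f}".format(C, coef_norms[C]))
```

Smaller values of C pull the coefficients toward zero, which is what "simpler model" means here.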

The other decision you have to make is whether you want to use L1 regularization or L2 regularization. If you assume that only a few of your features are actually important, you should use L1. Otherwise, you should default to L2. L1 can also be useful if interpretability of the model is important. As L1 will use only a few features, it is easier to explain which features are important to the model, and what the effects of these features are.
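The difference between the two penalties can be seen by counting how many coefficients end up exactly zero. This is a sketch on synthetic data in which only 3 of 20 features are informative; the data, the value C=0.05, and the feature counts are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 20 features, only 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

zero_counts = {}
for penalty in ["l1", "l2"]:
    # Same solver and C for both models; only the penalty differs.
    model = LogisticRegression(solver="liblinear", penalty=penalty, C=0.05)
    model.fit(X, y)
    # Number of coefficients that are exactly zero (features the model ignores).
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))
    print("{}: {} of {} coefficients are zero".format(
        penalty, zero_counts[penalty], model.coef_.size))
```

L1 tends to drop features entirely, while L2 shrinks coefficients without setting them to zero, which is why L1 models are easier to interpret.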

So, in the second approach we will tune only two parameters, namely C and penalty.

We use logistic regression with the liblinear solver and leave the other parameters at their default values.

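# Logistic regression with the liblinear solver and default C and penalty
logreg = LogisticRegression(solver='liblinear')

logreg.fit(X_train, y_train)

# Accuracy on the training set and on the held-out test set
print('Score Training Set: {:.3f}'.format(logreg.score(X_train, y_train)))
print('Score Test Set: {:.3f}'.format(logreg.score(X_test, y_test)))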

With the default values, logistic regression reaches 100% accuracy on the training set and 83.3% accuracy on the test set. The gap between the two suggests that the model is overfitting.

Next, we tune the C and penalty parameters. We look for the best combination using a grid search with 5-fold cross-validation.

# Candidate values: 6 values of C x 2 penalties = 12 combinations
param_grid = {'C':[0.001, 0.01, 0.1, 1, 10, 100], 'penalty':['l1', 'l2']}

# Evaluate every combination with 5-fold cross-validation on the training set
grid_search = GridSearchCV(logreg, param_grid, cv=5)

grid_search.fit(X_train, y_train)

print('Best parameters: {}'.format(grid_search.best_params_))
print('Best cross-validation score: {:.3f}'.format(grid_search.best_score_))
print('Test set score: {:.3f}'.format(grid_search.score(X_test, y_test)))

The best cross-validation score is 91%, lower than the training accuracy above, probably because each fold trains the model on less data (only four fifths of X_train). However, the score on the test set, which is the score that actually tells us how well we generalize, becomes 88.9%, better than before.
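To see where the cross-validation score comes from, it can help to look at the score of every parameter combination rather than only the best one. The sketch below repeats the grid search on synthetic stand-in data of the same size (72 samples, 19 features), so the numbers it prints will differ from the results reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the cervical cancer dataset).
X, y = make_classification(
    n_samples=72, n_features=19, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}
grid_search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# One row per parameter combination, with the mean accuracy over the 5 folds.
results = pd.DataFrame(grid_search.cv_results_)
summary = results[["param_C", "param_penalty", "mean_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score").head())

# After the search, the best combination is refitted on the whole training set,
# and that refitted model is the one scored on the test set.
test_score = grid_search.score(X_test, y_test)
print("Test set score: {:.3f}".format(test_score))
```

Note that mean_test_score in this table refers to the held-out fold inside cross-validation, not to X_test; the test set is used only once, at the end.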

Conclusion

Logistic regression can be used to classify cervical cancer risk in the dataset above.

Using the default values (with the liblinear solver), we get an accuracy score of 83.3% on the test set. If we instead use the parameters found by grid search with 5-fold cross-validation (with the liblinear solver), we get an accuracy score of 88.9%.

References

  1. https://archive.ics.uci.edu/ml/datasets/Cervical+Cancer+Behavior+Risk
  2. Andreas C. Müller and Sarah Guido. Introduction to Machine Learning with Python.
  3. https://machinelearningmastery.com/logistic-regression-for-machine-learning/

For complete code you can visit here
