Logistic Regression: Predict The Malignancy of Breast Cancer

  1. Pravangasta Suihangya Balqis W.


Supervised learning, also known as supervised machine learning, is defined by the use of labeled data sets to train algorithms that classify data or predict outcomes accurately. In data mining, supervised learning problems fall into two types: classification and regression. Classification is commonly used for business problems such as churn analysis and risk management. Its target variable is categorical, and the model assigns each observation to one of the classes. Regression is similar, but instead of sorting observations into classes it looks for a pattern and assigns a numeric value to it. Regression analysis models the relationship between one or more independent variables and a dependent variable, and it can be used to predict future values. Despite its name, logistic regression is a classification method: it estimates the probability that an observation belongs to a class.


Import library:

Import the libraries
Import the data frame and recode the target variable
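
A minimal sketch of these two steps follows. The column name `diagnosis` and its labels `M` (malignant) and `B` (benign) are assumptions, and the small inline data frame with made-up values stands in for the real file, which would normally be loaded with `pd.read_csv`.

```python
import pandas as pd

# Tiny stand-in for the real data set (hypothetical values).
# With the actual file you would call pd.read_csv(...) instead.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M", "B"],  # assumed target column and labels
    "radius_mean": [17.99, 13.54, 12.45, 20.57, 11.42],
    "area_mean": [1001.0, 566.3, 477.1, 1326.0, 386.1],
})

# Recode the target variable: malignant -> 1, benign -> 0
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

# Show how many rows fall into each class
print(df["diagnosis"].value_counts())
```

Recoding the target as 0 and 1 lets the later correlation and modeling steps treat it as a numeric column.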

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of performing initial investigations on data to identify patterns, find anomalies, test hypotheses, and check assumptions through summary statistics and graphical (visual) representations. EDA can help detect errors, identify outliers in data sets, understand relationships within the data, explore key factors, and provide new insights, which makes it very useful for statistical analysis. In this article the EDA is divided into two parts: univariate and multivariate.

Using df.info() to inspect the data frame
Checking for null values in each column
Replacing null values with the mean of the column
Visualization of the independent variables
Heatmap of Spearman correlations
Finding the columns with the highest correlation with the dependent variable
df.describe() for the 10 columns with the highest correlation
Pairplot of the 10 columns with the highest correlation with the dependent variable
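
The non-plotting steps listed above can be sketched as follows. This is only an illustration on a small made-up data frame: the target column name `diagnosis` is an assumption, and `k = 2` is used here only because the toy frame has three features, whereas the article keeps the top 10.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; 'diagnosis' is the assumed 0/1 target column.
df = pd.DataFrame({
    "diagnosis":   [1, 0, 0, 1, 0, 1],
    "radius_mean": [18.0, 12.1, np.nan, 20.5, 11.4, 19.2],
    "texture_se":  [1.2, 0.9, 1.1, 0.8, 1.3, 1.0],
    "area_mean":   [1001.0, 450.0, 477.0, 1326.0, 386.0, 1100.0],
})

df.info()                  # column types and non-null counts
print(df.isnull().sum())   # number of missing values per column

# Replace each missing value with the mean of its column
df = df.fillna(df.mean(numeric_only=True))

# Spearman rank correlation of every feature with the target
corr = df.corr(method="spearman")["diagnosis"].drop("diagnosis")

# Keep the k features with the largest absolute correlation
k = 2
top_features = corr.abs().sort_values(ascending=False).head(k).index.tolist()
print(top_features)

# Summary statistics for the selected columns
print(df[top_features].describe())
```

Ranking by absolute value keeps strongly negative correlations as well as strongly positive ones. Whether the article ranked by absolute or signed correlation is not stated.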

Machine Learning Theory

In the machine learning phase, we must first separate the dependent variable from the independent variables. Because this data frame has quite a lot of columns, we choose the independent variables based on the EDA phase: the 10 columns with the highest correlation with the dependent variable. We divide the data into x and y, and then perform a train test split.
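
As a rough sketch of the split described above, the example below separates x and y and calls scikit-learn's `train_test_split`. The data is synthetic, the target name `diagnosis` is assumed, and `test_size`, `random_state`, and `stratify` are illustrative choices that the article does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with a 0/1 target named 'diagnosis' (assumed name).
rng = np.random.default_rng(0)
n = 100
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "diagnosis": target,
    "radius_worst": 14 + 6 * target + rng.normal(0, 1.5, size=n),
    "area_worst": 600 + 700 * target + rng.normal(0, 150, size=n),
})

# The article uses its 10 most correlated columns; two are enough for a sketch.
selected = ["radius_worst", "area_worst"]
x = df[selected]       # independent variables
y = df["diagnosis"]    # dependent variable

# test_size, random_state and stratify are illustrative, not taken from the article.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y
)
print(x_train.shape, x_test.shape)
```

Passing `stratify=y` keeps the class proportions similar in the training and test sets, which helps when one class is rarer than the other.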

Train test split of the data
Building a KNN classifier for comparison
Making y_pred from the KNN classifier
Confusion matrix and accuracy score of the KNN method
Visualization of the confusion matrix of the KNN method
Building the logistic regression model
Confusion matrix and accuracy score of the logistic regression
Visualization of the confusion matrix of the logistic regression method
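
The modeling steps above might look roughly like the sketch below, which fits both classifiers on the same split and prints each confusion matrix and accuracy score. The data is synthetic, `n_neighbors=5` and `max_iter=1000` are assumed settings, and the `StandardScaler` step is an addition of this sketch rather than something the article states.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the selected columns.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
x = np.column_stack([
    14 + 6 * y + rng.normal(0, 2, size=n),
    600 + 700 * y + rng.normal(0, 200, size=n),
])
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y
)

# Scaling is this sketch's own choice; hyperparameters are assumed values.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, model in models.items():
    model.fit(x_train, y_train)              # train on the training split
    y_pred = model.predict(x_test)           # predict the held-out labels
    cm = confusion_matrix(y_test, y_pred)    # rows: true class, columns: predicted
    acc = accuracy_score(y_test, y_pred)     # share of correct predictions
    results[name] = (cm, acc)
    print(name, "accuracy:", acc)
    print(cm)
```

Evaluating both models on the same test split is what makes their accuracy scores directly comparable.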


  1. Logistic regression has a higher accuracy score than KNN. The maximum accuracy score is 1, and the closer a score is to 1, the larger the share of test samples the method classifies correctly. In our experiment, the accuracy score of KNN is 0.8951048951048951 while that of logistic regression is 0.9230769230769231. On this test split, logistic regression therefore predicts better than the KNN method.
  2. The 10 columns with the highest correlation with the dependent variable are ['perimeter_worst', 'radius_worst', 'area_worst', 'concave points_worst', 'concave points_mean', 'perimeter_mean', 'area_mean', 'radius_mean', 'concavity_mean', 'area_se'], so these columns become the independent variables in the train test split.
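
To put the two accuracy scores from point 1 in perspective, recall that accuracy is the number of correct predictions divided by the number of predictions. The counts below are an inference, not figures reported in the article: a test set of 143 samples with 128 and 132 correct predictions is one combination consistent with the printed scores.

```python
def accuracy(correct: int, total: int) -> float:
    """Share of predictions that match the true labels."""
    return correct / total

# Assumed counts (inferred from the printed scores, not stated in the article).
total = 143
knn_correct = 128
logreg_correct = 132

knn_acc = accuracy(knn_correct, total)
logreg_acc = accuracy(logreg_correct, total)

print("KNN accuracy:", knn_acc)
print("Logistic regression accuracy:", logreg_acc)
print("Extra samples classified correctly:", logreg_correct - knn_correct)
```

Under that assumption the gap between the two methods comes down to a handful of test samples, so the ranking could change with a different split.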


IBM Cloud Education. (n.d.). What is Supervised Learning? IBM. https://www.ibm.com/cloud/learn/supervised-learning#:~:text=Supervised%20learning%2C%20also%20known%20as,data%20or%20predict%20outcomes%20accurately.


