Logistic Regression: Predict The Malignancy of Breast Cancer
- A. A. N. Ananda Surya Wedhana
- Pravangasta Suihangya Balqis W.
Breast cancer is a disease in which cells in the breast grow out of control. There are different types of breast cancer. The type of breast cancer depends on which cells in the breast turn into cancer. Breast cancer can start from any part of the breast. The breast is an organ that sits above the upper ribs and chest muscles. There are left and right breasts and each has mostly glands, ducts, and fatty tissue. In women, the breasts make and provide milk to feed newborns and infants. The amount of fatty tissue in the breasts determines the size of each breast.
Breast cancer results from the abnormal growth of cells in the breast tissue, commonly called a tumor, which arises when DNA mutations instruct cells to keep growing. Most breast lumps are benign rather than malignant (cancerous). Non-cancerous breast tumors are abnormal growths, but they do not spread outside the breast. Breast cancer can affect both males and females.
Many women want to learn more about their breast cancer. Doctors and scientists weigh many factors when judging how malignant a patient's breast cancer is. This project is aimed at people who want to try machine learning methods, people who are interested in knowing more about breast cancer, and people who want to predict the malignancy of someone’s breast cancer.
Supervised learning, also known as supervised machine learning, is defined by the use of labeled data sets to train algorithms that classify data or predict outcomes accurately. Supervised learning problems fall into two types: classification and regression. Classification is commonly used in business problems such as churn analysis and risk management, and the data typically goes through a preparation process first so that the results are more accurate. In classification, the target variable is categorical, so the model assigns each observation to a class. Regression is similar, but instead of assigning observations to classes it looks for a pattern and assigns a numeric value to it. Regression techniques can be used to predict future values, and regression analysis models the relationship between one or more independent variables and the dependent variable.
One classification method is logistic regression, a regression technique that separates the dataset into two groups. According to Field Hair (2009:265), “Logistic regression is multiple regression but with an outcome variable that is a categorical variable and predictors variables that are continuous or categorical.” Logistic regression is used to predict whether something is true or false, rather than predicting a continuous value as linear regression does.
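As a minimal sketch of how logistic regression turns a numeric score into a true/false answer, the snippet below passes a weighted sum of the features through the logistic (sigmoid) function and applies a threshold. The weights, bias, and 0.5 threshold are hypothetical values chosen for illustration; they are not fitted to the breast cancer data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= threshold else 0


# Hypothetical parameters, for illustration only (not fitted to any data).
example_weights = [0.8, -0.4]
example_bias = -1.0

print(predict_label([3.0, 1.0], example_weights, example_bias))
print(predict_label([0.0, 0.0], example_weights, example_bias))
```

In a fitted model the weights and bias are learned from the labeled training data rather than written by hand.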
The KNN (K-nearest neighbors) algorithm is a classification algorithm that uses the K closest data points (neighbors) as a reference to determine the class of a new data point. It classifies data based on similarity, or proximity, to other data.
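The idea can be sketched in a few lines of plain Python: measure the Euclidean distance from the new point to every labeled point, keep the K closest, and take a majority vote. The function name knn_predict and the toy points are assumptions made for this illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: three points near (1, 1) labeled 0, three near (8, 8) labeled 1.
toy_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, (2, 2), k=3))
print(knn_predict(toy_points, toy_labels, (8.5, 8.5), k=3))
```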
Import the dataset and change the diagnosis labels M (malignant) and B (benign) to 1 and 0.
Why do we change the target variable to 1 and 0? The diagnosis column is the dependent (target) variable of this dataset, and encoding its labels M and B as 1 and 0 makes it easier to analyze. The df variable holds the dataframe that we imported.
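A minimal sketch of this step with pandas is shown below. The small inline dataframe stands in for the imported file, and the file name in the comment is hypothetical; only the diagnosis column and the M/B to 1/0 mapping come from the text.

```python
import pandas as pd

# Tiny stand-in for the real dataset. With the actual file you would call
# pd.read_csv("data.csv"), where "data.csv" is a hypothetical file name.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
})

# Encode the target: malignant (M) becomes 1, benign (B) becomes 0.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

print(df["diagnosis"].tolist())
```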
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data to identify patterns, find anomalies, test hypotheses, and check assumptions through summary statistics and graphical (visual) representations. EDA can help detect errors, identify outliers in data sets, understand relationships between data, explore key factors, find patterns in data, and provide new insights. EDA is very useful for statistical analysis. We divide this EDA into two parts: univariate and multivariate.
Start by checking the data for null values.
df.info() reports the number of columns, the column labels, the column data types, the memory usage, the range index, and the number of non-null values in each column.
After that, we look for the columns that contain null values.
The eq() method compares each value in the DataFrame to check whether it is equal to the specified value, or the value of the specified DataFrame object, and returns a DataFrame with booleans True/False for each comparison.
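A minimal sketch of this check, assuming the value being compared against is 0 (which would explain why the encoded benign labels are counted as well). The toy dataframe is made up for the example.

```python
import pandas as pd

# Toy stand-in for the dataset; the values are illustrative only.
df = pd.DataFrame({
    "diagnosis": [1, 0, 0, 1],
    "concavity_mean": [0.30, 0.0, 0.05, 0.0],
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
})

# eq(0) yields a True/False dataframe; summing it counts the zeros per column.
zero_counts = df.eq(0).sum()

# Keep only the columns that contain at least one zero.
columns_with_zeros = zero_counts[zero_counts > 0]
print(columns_with_zeros)
```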
From this check we can see that six columns contain 13 null values each and one column contains 357. The affected columns are [‘diagnosis’, ‘concavity_mean’, ‘concave points_mean’, ‘concavity_se’, ‘concave points_se’, ‘concavity_worst’, ‘concave points_worst’]. The diagnosis column shows 357 because we already changed the benign label to 0, so the check counts those zeros as nulls.
Now that we know which columns contain nulls, we replace each null with the mean of its column. We do this for every feature column identified in the previous step.
Now we check again whether any column still contains null values.
After rechecking, we can see that there are no more null values in the six columns that we just filled. The diagnosis column still shows 357 because it is the target variable, where 0 is a valid label (benign), so it must not be changed.
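One way to sketch this replacement, assuming the nulls are stored as zeros and the mean is taken over the whole column (the text does not say whether the zeros were excluded from the mean). The column list and values are illustrative.

```python
import pandas as pd

# Toy stand-in for the dataset; the values are illustrative only.
df = pd.DataFrame({
    "diagnosis": [1, 0, 0, 1],
    "concavity_mean": [0.5, 0.0, 0.25, 0.0],
})

# Only feature columns are filled; the target column is left untouched.
feature_columns = ["concavity_mean"]

for column in feature_columns:
    column_mean = df[column].mean()  # assumption: mean over all rows
    df[column] = df[column].replace(0, column_mean)

print(df)
```

If the zeros should not pull the mean down, the mean could instead be computed over the non-zero rows only; that is a design choice the text leaves open.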
Now we compare the classes of the target (dependent) column using a countplot.
Before making the visualization, we first count the number of observations in each category of the dependent (target) variable. We then draw the countplot shown above.
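A minimal sketch of the counting step is shown below, using a made-up diagnosis column. The plotting call is left as a comment because it needs seaborn and a display; it is included as an assumption about how the countplot could be drawn.

```python
import pandas as pd

# Toy target column; the real data has many more rows.
df = pd.DataFrame({"diagnosis": [1, 0, 0, 1, 0]})

# Count how many observations fall into each class.
class_counts = df["diagnosis"].value_counts()
print(class_counts)

# With seaborn installed, the same comparison can be drawn with:
#   import seaborn as sns
#   sns.countplot(x="diagnosis", data=df)
```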
Now we move on to the multivariate EDA, using a pairplot together with the Spearman correlation. The Spearman rank correlation is closely related to the Pearson correlation: both take values from -1 to 1 and indicate the strength and direction of the correlation between two variables.
After that, we find the 10 columns with the largest correlation to the dependent variable.
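A sketch of this selection with pandas is given below. The helper name top_correlated_features is hypothetical, the toy columns are made up, and ranking by absolute correlation is an assumption (the text only says "largest correlation").

```python
import pandas as pd


def top_correlated_features(df, target, k=10):
    """Return the k columns most strongly Spearman-correlated with `target`."""
    correlations = df.corr(method="spearman")[target].drop(target)
    # Rank by absolute value so strong negative correlations also count.
    ranked = correlations.abs().sort_values(ascending=False)
    return ranked.head(k).index.tolist()


# Toy data: feature_a rises with the diagnosis, feature_b is mostly unrelated.
df = pd.DataFrame({
    "diagnosis": [0, 0, 1, 1, 0, 1],
    "feature_a": [1, 2, 5, 6, 3, 7],
    "feature_b": [5, 1, 4, 2, 6, 3],
})

print(top_correlated_features(df, "diagnosis", k=1))
```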
Univariate EDA using df.describe(). The target and the 10 columns with the highest correlation to it are [‘diagnosis’, ‘perimeter_worst’, ‘radius_worst’, ‘area_worst’, ‘concave points_worst’, ‘concave points_mean’, ‘perimeter_mean’, ‘area_mean’, ‘radius_mean’, ‘concavity_mean’, ‘area_se’]. We summarize these columns with df.describe() and then visualize them with a pairplot.
Next we visualize the same columns using a pairplot. In this diagram, every yellow point has a diagnosis of 1 (malignant) and every blue point has a diagnosis of 0 (benign).
Machine Learning Theory
In the machine learning phase, we first separate the dependent variable from the independent variables. Because this dataframe has many columns, we use the selection from the EDA phase as the independent variables: the 10 columns with the highest correlation to the dependent variable. We divide the data into x and y and then perform a train-test split.
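The split can be sketched with scikit-learn's train_test_split, which is an assumption about the tooling since the text does not name the library. The test_size and random_state values are also assumptions, and the toy dataframe uses only two of the selected columns with made-up values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset; the values are illustrative only.
df = pd.DataFrame({
    "diagnosis": [0, 1] * 10,
    "radius_worst": [float(i) for i in range(20)],
    "area_worst": [float(i * 10) for i in range(20)],
})

# In the article this list would hold the 10 columns chosen during EDA.
selected_columns = ["radius_worst", "area_worst"]

X = df[selected_columns]  # independent variables
y = df["diagnosis"]       # dependent (target) variable

# test_size and random_state are assumed values, not taken from the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test))
```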
After that, we use the KNN method as a point of comparison for the confusion matrix and accuracy score. Our KNN model uses the Euclidean distance and the 5 nearest neighbors.
The last step of the KNN method is to compute the confusion matrix and accuracy score on the train-test split from the previous step.
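A minimal sketch of these KNN steps, assuming scikit-learn's KNeighborsClassifier, confusion_matrix, and accuracy_score. The two well-separated toy clusters are made up so the example is self-contained, which means the scores it produces say nothing about the real dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data: one feature, two well-separated classes (illustrative only).
X = [[float(i)] for i in range(10)] + [[float(100 + i)] for i in range(10)]
y = [0] * 10 + [1] * 10

# Split settings are assumptions; stratify keeps both classes in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Euclidean distance and the 5 nearest neighbors, as described in the text.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

# Rows of the matrix are true classes, columns are predicted classes.
knn_matrix = confusion_matrix(y_test, knn_predictions)
knn_accuracy = accuracy_score(y_test, knn_predictions)

print(knn_matrix)
print(knn_accuracy)
```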
To make the result easier to read, we visualize the confusion matrix of the KNN method as a heatmap.
Next, we use logistic regression and compute its confusion matrix and accuracy score as well.
To make the result easier to read, we visualize the confusion matrix of the logistic regression as a heatmap.
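The logistic regression step can be sketched the same way, assuming scikit-learn's LogisticRegression. The max_iter value and the split settings are assumptions, and the toy data is made up for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy data: one feature, two well-separated classes (illustrative only).
X = [[float(i)] for i in range(10)] + [[float(100 + i)] for i in range(10)]
y = [0] * 10 + [1] * 10

# Split settings are assumptions; stratify keeps both classes in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# max_iter is raised as a precaution so the solver has room to converge.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
logreg_predictions = logreg.predict(X_test)

# Rows of the matrix are true classes, columns are predicted classes.
logreg_matrix = confusion_matrix(y_test, logreg_predictions)
logreg_accuracy = accuracy_score(y_test, logreg_predictions)

print(logreg_matrix)
print(logreg_accuracy)
```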
- Logistic regression has a higher accuracy score than KNN. The maximum accuracy score is 1, and the closer a score is to 1, the more accurate the method's predictions. In our experiment, the accuracy score of KNN is 0.8951048951048951 while that of logistic regression is 0.9230769230769231. We therefore conclude that, on this test split, logistic regression predicts more accurately than the KNN method.
- The 10 columns with the highest correlation to the diagnosis are [‘perimeter_worst’, ‘radius_worst’, ‘area_worst’, ‘concave points_worst’, ‘concave points_mean’, ‘perimeter_mean’, ‘area_mean’, ‘radius_mean’, ‘concavity_mean’, ‘area_se’], so these columns are used as the independent variables in the train-test split.
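To put the two accuracy scores in perspective, the short calculation below shows that they are consistent with 128 and 132 correct predictions out of 143 test samples. These counts are inferred from the reported decimals and are not stated in the article, so treat them as an assumption.

```python
def accuracy(correct: int, total: int) -> float:
    """Accuracy is the share of test samples that were predicted correctly."""
    return correct / total


# Inferred counts (an assumption): they reproduce the reported scores.
test_samples = 143
knn_accuracy = accuracy(128, test_samples)
logreg_accuracy = accuracy(132, test_samples)

# Under this assumption the gap between the two models is four test samples.
extra_correct = 132 - 128

print(knn_accuracy, logreg_accuracy, extra_correct)
```

Seen this way, the difference between the two models amounts to a handful of test samples, so the ranking could change with a different split.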
We hope this machine learning project is useful to anyone who needs it. We also hope that our prediction results can help medical personnel diagnose the degree of malignancy of breast cancer, and obtain an accurate and fast diagnosis, through supervised machine learning classification. For those who are learning data science, we hope this project can serve as a tutorial and a reference in later practice.
Thank you, and we hope this is useful.
IBM Cloud Education. (n.d.). What is Supervised Learning? IBM. https://www.ibm.com/cloud/learn/supervised-learning#:~:text=Supervised%20learning%2C%20also%20known%20as,data%20or%20predict%20outcomes%20accurately.
Mehreen Saeed. (2021, August). Calculating Spearman’s Rank Correlation Coefficient in Python with Pandas.
Lutfia Afifah. (n.d.). Algoritma K-Nearest Neighbor (KNN) untuk Klasifikasi.
Datasans. (2018, September 16). Logistic Regression Concept.
Doshisujay. (2022, April 20). Breast-Cancer Prediction.
Sudarno. (2017). Data Analysis. Semarang: Departemen Statistika Fakultas Sains dan Matematika UNDIP.