# Logistic Regression: Predict The Malignancy of Breast Cancer

1. Pravangasta Suihangya Balqis W.

# INTRODUCTION

Supervised learning is also known as supervised machine learning, it is defined by the use of labeled data sets to train algorithms that classify data or predict outcomes accurately. Guided learning can be separated into two types of problems when mining data, namely classification and regression. Classification method is usually used in business problems such as churn analysis and risk management, classification will go through the data process in order to get more accurate data. In Classification there is a target category variable in the classification. The regression is actually almost the same as the classification but the regression cannot find a structure that is classified into classes. The Regression method looks for a pattern and assigns a numeric value to it. Regression techniques can be used to predict the future. The relationship between one or more independent variables and the dependent variable can be modeled using regression analysis.

Import library:

# Exploratory Data Analysis

Exploratory Data Analysis covers the critical process of initial investigative testing on data to identify patterns, find anomalies, test hypotheses, and check assumptions through summary statistics and graphical (visual) representations. EDA can help detect errors, identify outliers in data sets, understand relationships between data, explore key factors, find patterns in data, and provide new insights. EDA is very useful for statistical analysis. In this EDA we divide into two parts, namely Univariate and Multivariate

# Machine Learning Theory

In the machine learning phase, we must first separate the dependent variable from the independent variable. For independent variables, because there are quite a lot of columns in this dataframe, we will take it according to the EDA phase, which is 10 columns with the highest correlation to the independent variable. we will divide into x and y then we will do train test split data.

# Conclusion

1. Logistic regression has a higher accuracy score than KNN. Maximum accuracy score is one. If it is closer to one, the method has a more accurate predictive value. In our experiment, it can be seen that the accuracy score of KNN is 0.8951048951048951 while the logistic regression is 0.9230769230769231. So it can be concluded that Logistic regression has a higher accuracy score, which means the prediction is more precise and better than using the KNN method.
2. The 10 columns with the highest influence on the prediction are [‘perimeter_worst’, ’radius_worst’, ’area_worst’, ’concave points_worst’, ’concave points_mean’, ’perimeter_mean’, ’area_mean’, ’radius_mean’, ’concavity_mean’, ’area_se’ ] so that these columns become independent variables in the train test and split data.

# Reference

IBM Cloud Education. (n.d.). What is Supervised Learning? IBM. https://www.ibm.com/cloud/learn/supervised-learning#:~:text=Supervised%20learning%2C%20also%20known%20as,data%20or%20predict%20outcomes%20accurately.

--

--

## More from Pravangasta suihangya balqis wahyudi

Love podcasts or audiobooks? Learn on the go with our new app.