End to End Deployment of Breast Cancer Prediction Through Machine Learning using Flask

VAIBHAV HARIRAMANI
GEEKY BAWA
11 min read · Feb 22, 2021

This post aims to build a prediction model on the Breast Cancer dataset using different machine learning algorithms and classifiers, and then helps you get started with putting your trained machine learning model into production using a Flask API.

When a data scientist/machine learning engineer develops a machine learning model using Scikit-Learn, TensorFlow, Keras, PyTorch, etc., the ultimate goal is to make it available in production. Oftentimes, when working on a machine learning project, we focus a lot on Exploratory Data Analysis (EDA), feature engineering, tweaking hyperparameters, and so on. But we tend to forget our main goal, which is to extract real value from the model's predictions.

Building and training a model using various algorithms on a large dataset is one part of machine learning. Using that model from another application is the second part: deploying machine learning in the real world. To put the model to use for predicting on new data, we have to deploy it over the internet so that the outside world can use it. Deployment of machine learning models, or putting models into production, means making your models available to end users or systems.

However, there is complexity in the deployment of machine learning models. This post aims to get you started with putting your trained machine learning models into production using a Flask API. In this article, we will walk through how to train a machine learning model and then create a web application on top of it using Flask.

Artificial intelligence in healthcare is the use of complex algorithms and software, in other words artificial intelligence (AI), to emulate human cognition in the analysis, interpretation, and comprehension of complicated medical and healthcare data. Specifically, AI is the ability of computer algorithms to approximate conclusions without direct human input.

The aim of health-related AI applications is to analyze relationships between prevention or treatment techniques and patient outcomes.

Before starting on the code and the theory part of machine learning, let's first learn about Flask and the deployment side of things.

Flask is a web application framework written in Python. It has multiple modules that make it easier for a web developer to write applications without having to worry about the details like protocol management, thread management, etc.

Flask gives us a variety of choices for developing web applications, and it provides the necessary tools and libraries that allow us to build one.

Installing Flask on your Machine

Installing Flask is simple and straightforward. Here, I am assuming you already have Python 3 and pip installed. To install Flask, run one of the following commands (apt on Debian/Ubuntu, or pip in any Python environment):

sudo apt-get install python3-flask
pip install flask

That's it! You're all set to dive into the problem statement and take one step closer to deploying your machine learning model through Flask.
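To quickly verify that the installation worked, you can run a minimal "hello world" app. This snippet is just a sanity check of my own, not part of the final project:

# app.py - a minimal Flask app to verify the installation
from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    return 'Flask is up and running!'

if __name__ == '__main__':
    app.run(debug=True)

Run it with python app.py and open http://localhost:5000 in your browser.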

Using Breast Cancer Dataset

Data Collection

I have taken the dataset from Kaggle. This dataset consists of several features such as mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, and so on. Let's see how to read the dataset into a Jupyter Notebook. You can download the dataset from Kaggle in CSV format.

We can also get the dataset from scikit-learn's built-in datasets. Yup! It's available in the sklearn datasets module.

Let's see how we can retrieve the dataset from the sklearn datasets module.

# Load the breast cancer dataset bundled with scikit-learn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()  # a Bunch object with .data, .target and .feature_names
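If you go this route, here is a quick sketch (my own addition, not the article's original code) for turning the loader's output into the same kind of DataFrame used below:

# Build a DataFrame with a 'target' column from the sklearn loader
import pandas as pd
cancer_df = pd.DataFrame(data.data, columns=data.feature_names)
cancer_df['target'] = data.target  # 0 = malignant, 1 = benign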

Starting the implementation

Here is the folder structure of the machine learning deployment model through Flask.
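The exact layout is in the GitHub repository linked at the end of this post; a typical structure for a project like this looks something like the sketch below (the file names here are illustrative, not the repo's exact contents):

breast-cancer-prediction/
├── breast_cancer_detector.pickle   # trained model (we dump this later)
├── templates/
│   └── index.html                  # input form for the web app
├── app.py                          # Flask application (covered in Part 2)
└── requirements.txt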

We will be implementing this code in Jupyter and the Sublime Text editor. To implement the machine learning models, let's start by importing the libraries.

# import libraries
import pandas as pd # for data manipulation and analysis
import numpy as np # for numeric calculation
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization
import pickle # for dumping the model (we could also use the joblib library)

Now the next step is to load the data through pandas.

You can download the dataset as breast-cancer.csv.

cancer_df = pd.read_csv('breast-cancer.csv')

The next step is to look at the head of the DataFrame.

# Head of cancer DataFrame
cancer_df.head(6)

Info about the DataFrame (shows the non-null count and dtype of each column, which helps spot null values and non-numeric columns):

# Information of cancer Dataframe
cancer_df.info()

Numerical description of the data (mean, median, the 25% quartile, interquartile range, and many other values for each feature):

# Numerical distribution of data
cancer_df.describe()

Data Visualization

# Count the target class
sns.countplot(cancer_df['target'])

Heatmap

# heatmap of DataFrame
plt.figure(figsize=(16,9))
# visualize the raw feature values (drop the label column, 'target', so its scale doesn't skew the plot)
sns.heatmap(cancer_df.drop(['target'], axis=1))

Heatmap of a correlation matrix

cancer_df.corr()  # gives the pairwise correlation between features
# Heatmap of Correlation matrix of breast cancer DataFrame
plt.figure(figsize=(20,20))
sns.heatmap(cancer_df.corr(), annot = True, cmap ='coolwarm', linewidths=2)

Correlation Barplot

# create second DataFrame by dropping target
cancer_df2 = cancer_df.drop(['target'], axis = 1)
print("The shape of 'cancer_df2' is : ", cancer_df2.shape)

The shape of ‘cancer_df2’ is : (569, 30)

Let's visualize the correlation of each feature with the target as a barplot.

# visualize correlation barplot
plt.figure(figsize = (16,5))
target_corr = cancer_df2.corrwith(cancer_df.target)  # correlation of each feature with the target
ax = sns.barplot(x = target_corr.index, y = target_corr.values)
ax.tick_params(labelrotation = 90)

Split DataFrame in Train and Test

Input variable

# input variable
X = cancer_df.drop(['target'], axis = 1)
X.head(6)

Output variable

# output variable
y = cancer_df['target']
y.head(6)

Split dataset for training and testing

# split dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state= 5)

Feature scaling of data

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train) # fit the scaler on the training data only
X_test_sc = sc.transform(X_test) # reuse the same scaling for the test data

Machine Learning Model Building

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

1. Support Vector Classifier

# Train with Standard scaled Data
from sklearn.svm import SVC
svc_classifier2 = SVC()
svc_classifier2.fit(X_train_sc, y_train)
y_pred_svc_sc = svc_classifier2.predict(X_test_sc)
accuracy_score(y_test, y_pred_svc_sc)

Output 0.9649122807017544

2. Logistic Regression

from sklearn.linear_model import LogisticRegression
# Train with Standard scaled Data
lr_classifier2 = LogisticRegression(random_state = 51, C=1, penalty='l1', solver='liblinear')
lr_classifier2.fit(X_train_sc, y_train)
y_pred_lr_sc = lr_classifier2.predict(X_test_sc)
accuracy_score(y_test, y_pred_lr_sc)

Output 0.9649122807017544

3. Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB
# Train with Standard scaled Data
nb_classifier2 = GaussianNB()
nb_classifier2.fit(X_train_sc, y_train)
y_pred_nb_sc = nb_classifier2.predict(X_test_sc)
accuracy_score(y_test, y_pred_nb_sc)

Output 0.9385964912280702

4. K-Nearest Neighbor Classifier

# K-Nearest Neighbor Classifier
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier.fit(X_train, y_train)
y_pred_knn = knn_classifier.predict(X_test)
accuracy_score(y_test, y_pred_knn)

Output 0.9385964912280702

5. Decision Tree Classifier

# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 51)
dt_classifier.fit(X_train, y_train)
y_pred_dt = dt_classifier.predict(X_test)
accuracy_score(y_test, y_pred_dt)

Output 0.9473684210526315

6. Random Forest Classifier

# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy', random_state = 51)
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)
accuracy_score(y_test, y_pred_rf)

Output 0.9736842105263158

7. AdaBoost Classifier

# Adaboost Classifier
from sklearn.ensemble import AdaBoostClassifier
adb_classifier = AdaBoostClassifier(DecisionTreeClassifier(criterion = 'entropy', random_state = 200),
                                    n_estimators = 2000,
                                    learning_rate = 0.1,
                                    algorithm = 'SAMME.R',
                                    random_state = 1)
adb_classifier.fit(X_train, y_train)
y_pred_adb = adb_classifier.predict(X_test)
accuracy_score(y_test, y_pred_adb)

Output 0.9473684210526315

8. XGBoost Classifier

# XGBoost Classifier
from xgboost import XGBClassifier
xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train, y_train)
y_pred_xgb = xgb_classifier.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(accuracy_xgb)

Output 0.9824561403508771

As you can see, I have applied different machine learning classifiers:

  • Support Vector Classifier
  • Logistic Regression
  • Naive Bayes Classifier
  • K-Nearest Neighbor Classifier
  • Decision Tree Classifier
  • Random Forest Classifier
  • AdaBoost Classifier
  • XGBoost Classifier

Similarly, we evaluate each model on the test data to check that there is no overfitting or underfitting; a good model should have low bias and low variance.

Here is the accuracy of each classifier on the test data (the code is the same as above):

# accuracy of all the classifiers on test data
Accuracy of Support Vector Classifier - 0.9649122807017544
Accuracy of Logistic Regression - 0.9649122807017544
Accuracy of Naive Bayes Classifier - 0.9473684210526315
Accuracy of K-Nearest Neighbor Classifier - 0.9385964912280702
Accuracy of Decision Tree Classifier - 0.9473684210526315
Accuracy of Random Forest Classifier - 0.9473684210526315
Accuracy of AdaBoost Classifier - 0.9473684210526315
Accuracy of XGBoost Classifier - 0.9824561403508771

We can conclude that the XGBoost classifier gives the best result on the test data, with low bias and low variance.

For further improvement, we should apply tuning methods such as randomized search and grid search to XGBoost, because we want the accuracy to be as high as possible while also keeping an eye on metrics like precision, recall, the F-beta score, and support, which are important for controlling Type I and Type II errors.
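These metrics are easy to check with scikit-learn. Here is a quick sketch of my own, using the y_pred_xgb predictions from the XGBoost cell above:

# Sketch: checking precision, recall and an F-beta score alongside accuracy
from sklearn.metrics import precision_score, recall_score, fbeta_score
print('Precision:', precision_score(y_test, y_pred_xgb))
print('Recall:', recall_score(y_test, y_pred_xgb))
# beta > 1 weights recall more heavily, which suits this problem because
# false negatives (missed cancers) are costlier than false positives
print('F2 score:', fbeta_score(y_test, y_pred_xgb, beta = 2))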

XGBoost Parameter Tuning

Randomized search

Randomized search tunes the model by trying only a random sample of the parameter combinations, so it works much faster than an exhaustive search method.

params = {
    "learning_rate" : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth" : [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight" : [1, 3, 5, 7],
    "gamma" : [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree" : [0.3, 0.4, 0.5, 0.7]
}

# Randomized Search

from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(xgb_classifier, param_distributions=params, scoring='roc_auc', n_jobs=-1, verbose=3)
random_search.fit(X_train, y_train)

Finding the best, most optimized parameters:

random_search.best_params_

#output
{'min_child_weight': 1,
'max_depth': 12,
'learning_rate': 0.3,
'gamma': 0.3,
'colsample_bytree': 0.7}

random_search.best_estimator_

#output
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.7, gamma=0.3,
learning_rate=0.3, max_delta_step=0, max_depth=12,
min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)

# training XGBoost classifier with best parameters

xgb_classifier_pt = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.4, gamma=0.2,
learning_rate=0.1, max_delta_step=0, max_depth=15,
min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)

xgb_classifier_pt.fit(X_train, y_train)
y_pred_xgb_pt = xgb_classifier_pt.predict(X_test)

Accuracy of the tuned model:

accuracy_score(y_test, y_pred_xgb_pt)

#output - 0.9824561403508771

Grid search

Grid search tunes the model by exhaustively trying every combination in the parameter grid, so it works on the whole search space (and is slower).

Training the model

from sklearn.model_selection import GridSearchCV 
grid_search = GridSearchCV(xgb_classifier, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=3)
grid_search.fit(X_train, y_train)

Now comes training the model with the best parameters found by grid search:

xgb_classifier_pt_gs = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.3, gamma=0.0,
learning_rate=0.3, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)

xgb_classifier_pt_gs.fit(X_train, y_train)
y_pred_xgb_pt_gs = xgb_classifier_pt_gs.predict(X_test)
accuracy_score(y_test, y_pred_xgb_pt_gs)

#output 0.9824561403508771

As we are getting nearly the same accuracy after applying both tuning methods, we will use the grid search result. Now comes the part about the classification report and the types of error.

Confusion matrix

It gives the counts of true positives, false positives, false negatives, and true negatives, which helps us judge how well the model is optimized for prediction.

from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred_xgb_pt)
plt.title('Heatmap of Confusion Matrix', fontsize = 15)
sns.heatmap(cm, annot = True)
plt.show()

The model gives 0 Type II errors (false negatives), which is the best outcome, and only 2 misclassifications out of 114 test samples (an error rate of about 0.018). It means the chance of a wrong prediction is very low, close to zero.
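Here is a small sketch of reading those error counts directly off the confusion matrix computed above (scikit-learn orders the binary matrix as [[tn, fp], [fn, tp]], with class 1 treated as positive):

# Sketch: reading the error counts off the confusion matrix from above
tn, fp, fn, tp = cm.ravel()
print('Type I errors (false positives):', fp)
print('Type II errors (false negatives):', fn)
print('Overall error rate:', (fp + fn) / cm.sum())  # 2/114, about 0.018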

Classification report of the model

print(classification_report(y_test, y_pred_xgb_pt))

Output

              precision    recall  f1-score   support

         0.0       1.00      0.96      0.98        48
         1.0       0.97      1.00      0.99        66

   micro avg       0.98      0.98      0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

Cross-validation of the ML model

# Cross validation
from sklearn.model_selection import cross_val_score
cross_validation = cross_val_score(estimator = xgb_classifier_pt, X = X_train_sc,y = y_train, cv = 10)
print("Cross validation accuracy of XGBoost model = ", cross_validation)
print("\nCross validation mean accuracy of XGBoost model = ", cross_validation.mean())

Output

Cross validation accuracy of XGBoost model =  [0.9787234  0.97826087 0.97826087 0.97826087 0.93333333 0.91111111
1. 1. 0.97777778 0.88888889]
Cross validation mean accuracy of XGBoost model = 0.9624617124062083

Saving model for deployment

pickle.dump(xgb_classifier_pt, open('breast_cancer_detector.pickle', 'wb'))
# load model
breast_cancer_detector_model = pickle.load(open('breast_cancer_detector.pickle', 'rb'))
# predict the output
y_pred = breast_cancer_detector_model.predict(X_test)
# confusion matrix
print('Confusion matrix of XGBoost model:\n', confusion_matrix(y_test, y_pred), '\n')
# show the accuracy
print('Accuracy of XGBoost model = ',accuracy_score(y_test, y_pred))
Output
Confusion matrix of XGBoost model:
[[46 2]
[ 0 66]]
Accuracy of XGBoost model = 0.9824561403508771
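As the import comment mentioned earlier, the joblib library works just as well for this step. A quick sketch of the same save and load with joblib:

# Alternative sketch: persisting the model with joblib instead of pickle
import joblib
joblib.dump(xgb_classifier_pt, 'breast_cancer_detector.joblib')
breast_cancer_detector_model = joblib.load('breast_cancer_detector.joblib')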

Now our model is dumped into a pickle file. It's time to create the Flask application that will serve the model.
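The full application is covered in Part 2, but as a preview, a minimal app.py that loads the pickled model and serves predictions could look something like this (the /predict route and the JSON input format are illustrative assumptions of mine, not the final app):

# Sketch only: a minimal Flask app serving the pickled model
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open('breast_cancer_detector.pickle', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    # expects a JSON body like {"features": [30 numeric values]}
    features = np.array(request.json['features']).reshape(1, -1)
    prediction = int(model.predict(features)[0])
    return jsonify({'prediction': prediction})

if __name__ == '__main__':
    app.run(debug=True)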

This is the first part of a two-part series. You can read the second part here: End to End Deployment of Breast Cancer Prediction Through Machine Learning using Flask Part-2

You can clone the Github repository via the link below:


Wrapping Up

There are lots of ways to deploy ML models into production, numerous ways to store them, and different ways to manage predictive models after deployment. Choosing the best approach for a use case can be challenging, but the team's technical and analytics maturity, along with the overall organization structure and its interactions, can help in selecting the right approach for deploying predictive models to production.

Connect with me on Twitter and LinkedIn.

Do find time to check out my other articles and the further reading in the reference section. Kindly remember to follow me so as to get notified of my publications.

Thank You for reading

Please give 👏🏻 Claps if you like the blog.

GEEKY BAWA

Just a silly geek who loves to seek out new technologies and experience cool stuff.

Do Checkout My other Blogs

Do Checkout My Youtube channel

Do Checkout My M.L. Model

If you want to get in touch (and, by the way, if you know a good joke), you can connect with me on Twitter or LinkedIn.

Thanks for reading!😄 🙌

Made with ❤️by Vaibhav Hariramani

Don’t forget to tag us

If you find this blog beneficial, don't forget to share it with your friends and mention us as well. And don't forget to share us on LinkedIn, Instagram, Facebook, Twitter, and GitHub.

More Resources

To learn more about these resources, you can refer to some of these articles written by me:

Download THE VAIBHAV HARIRAMANI APP

The Vaibhav Hariramani App (Latest Version)

THE VAIBHAV HARIRAMANI APP consists of tutorials, projects, blogs, and vlogs from our site, developed using Android Studio with WebView. Try installing it on your Android device.

Follow me

on LinkedIn, Instagram, Facebook, Twitter, and GitHub

Happy coding ❤️ .
