Diabetes Prediction
After doing a little side project last time (if you haven’t checked it out yet, here is the link), I realized that working on a project on my own helps a ton. So “Oops, I did it again.” I once more downloaded a dataset from Kaggle, this time about diabetes (thank you, Kaggle, for your open datasets), and ran the entire process from data cleaning to prediction. Note that the goal here is not to find the best prediction model, but to show the steps of how it can be done. And since we learned about Principal Component Analysis just this week, I wanted to incorporate it.
Data Cleaning
At first glance, this dataset did not have any missing values; however, there were tons of 0 values that make it incomplete. Since this is such a small dataset, I decided dropping those rows would not be a good idea. Instead, I created a little box plot for each column to check for outliers before deciding whether to replace the 0 values with the median or the mean.
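The box-plot code itself is not shown in this post; a minimal sketch of that check (assuming seaborn and matplotlib, and 'Outcome' as the only non-feature column) could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# one box plot per feature column to eyeball outliers
features = [col for col in df.columns if col != 'Outcome']
fig, axes = plt.subplots(3, 3, figsize = (12, 10))
for ax, col in zip(axes.flatten(), features):
    sns.boxplot(y = df[col], ax = ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()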
I concluded that the median would be the better imputation method, since each column showed irregularities like the ones in the plot above. I calculated the median for each group (diabetic and non-diabetic) and replaced the 0s with those medians.
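The imputation code is not shown here either; a minimal sketch of that group-wise median replacement (assuming the 0s live in the usual Pima columns Glucose, BloodPressure, SkinThickness, Insulin, and BMI) could be:

import numpy as np

# columns where a 0 is not a real measurement but a placeholder for "missing" (assumed list)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

for col in zero_as_missing:
    df[col] = df[col].replace(0, np.nan)
    # fill each group's missing values with that group's own median
    df[col] = df.groupby('Outcome')[col].transform(lambda s: s.fillna(s.median()))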
Exploratory Data Analysis
As mentioned earlier, this is a very small dataset (9 columns and 768 rows), so there is not much to show. But we cannot skip this step; I created a few graphs to take a quick look.
# pairgrid
import matplotlib.pyplot as plt
import seaborn as sns

columns = [col for col in df.columns if col != 'Outcome']  # every feature except the target
g = sns.PairGrid(df[columns])
g = g.map_lower(sns.regplot, scatter_kws = {'color': 'midnightblue'}, line_kws = {'color': 'mediumvioletred'})
g = g.map_upper(sns.kdeplot, cmap = 'cividis', shade = True, shade_lowest = False)
g = g.map_diag(plt.hist)
plt.show();
# distribution of each column
import plotly.express as px

columns = [col for col in df.columns if col != 'Outcome']
for col in columns:
    fig = px.histogram(df, x = col, color = 'Outcome', marginal = 'box', hover_data = columns)
    fig.show()
Prediction
Now, let’s dive into the part that I have not shared in blog posts before. Since week 3 of my program at General Assembly, I have been taught to predict over and over again, whether the target is a price, a duration, or a class. This dataset’s “Outcome” column indicates whether somebody has diabetes or not, which makes this a classification problem. Like I said at the beginning, I planned to use PCA, so that is what I did here.
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# train test split
X = df.drop(columns = ['Outcome'], axis = 1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2020, test_size = 0.66, stratify = y)

# establish baseline
y.value_counts(normalize = True) # --> our model should do better than this (0.65 (non-diabetic) to 0.35 (diabetic))

# instantiate standardscaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# instantiate PCA
pca = PCA(random_state = 2020) # n_components = None
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
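With n_components = None, PCA keeps all of the components, so at this point it acts more as a rotation of the features than a reduction. As a side note, a quick way to check how much variance each component explains (something I did not do here, but which would help pick n_components later) would be:

import numpy as np

# cumulative share of variance explained by the first k components
print(np.cumsum(pca.explained_variance_ratio_))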
(Maybe I will use a pipeline next time for StandardScaler and PCA.)
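For reference, a rough sketch of what such a pipeline could look like (with a Logistic Regression on the end, and fed the raw, unscaled train/test split) is:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# chain scaler, PCA, and classifier so the transforms are fit on the training data only
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pca', PCA(random_state = 2020)),
    ('logreg', LogisticRegression(random_state = 2020))
])
pipe.fit(X_train, y_train)    # expects the raw (unscaled) training split
pipe.score(X_test, y_test)    # accuracy on the test split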
After this, I used two classification models (there are so many more): Logistic Regression and Support Vector Machine.
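The actual fitting code did not make it into this post, but a minimal sketch of it (using default hyperparameters; in the notebook, y_preds was recomputed for each model right before scoring) would be:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# logistic regression on the scaled, PCA-transformed features
logreg = LogisticRegression(random_state = 2020)
logreg.fit(X_train, y_train)
y_preds = logreg.predict(X_test)

# support vector classifier on the same features
svc = SVC(random_state = 2020)
svc.fit(X_train, y_train)
y_preds = svc.predict(X_test)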
# accuracy score with logistic regression
accuracy_score(y_test, y_preds) # --- 0.7869822485207101

# accuracy score with SVC
accuracy_score(y_test, y_preds) # --- 0.8086785009861933
Logistic Regression scored 78.7% accuracy and the Support Vector Machine scored 80.9%. Neither is a great score, but both did better than the 65% baseline.
Here we are, I did it. It might sound super nerdy, but I like working on little projects like this. It is not very challenging yet, but it is confirmation that I am actually learning cool things. And as a bonus, I was able to do the whole thing in not much time.
Next time, I will do something with natural language processing. So, if you are interested in seeing what I am able to do, please stay tuned!