A Guide to Building Your First Data Science Project

End-to-end, from the beginning

Harika Panuganty
Analytics Vidhya
15 min read · May 4, 2021


“Stepping” into the realm of data science

We’ve heard the buzzwords: data science, machine learning, predictive modeling. But what do they mean? How can we use this technology in the real world to make impactful decisions?

This end-to-end project was created to showcase just that.

Data science can be defined as the combination of scientific methods, mathematics, specialized programming, advanced analytics, AI and storytelling to uncover business insights buried in data. Let’s simplify that definition: I’d describe data science as the process of grabbing useful pieces of information from a larger data source.

Well, how do we know what information is useful? This is where machine learning comes in. Machine learning gives systems and users (like us) the ability to uncover hidden insights in data using algorithms. An algorithm uses statistical modeling to take in input data and predict an output value.

There are two kinds of machine learning tasks, and they’re separated by whether the training data includes labeled outputs. Supervised learning uses labeled outputs when developing a model to show the relationship between the input and output data. Unsupervised learning does not have clearly labeled outputs, so a model is developed from the given data points alone.

We can further split supervised learning based on outcome type: classification, where the predicted outcome is categorical (for example, binary 0/1), or regression, where the predicted outcome is continuous. When working on a machine learning problem, it’s important to understand the type of outcome, since the thought process and methodology differ between the two problem types.

This article provides a detailed walkthrough of the steps for an end-to-end project using the Framingham Heart Study dataset. The data comes from an ongoing cardiovascular study of individuals from Framingham, Massachusetts, in which study participants are monitored for the risk of Coronary Heart Disease (CHD) based on 15 different variables. With this dataset we’ll determine the variables most relevant to the outcome and predict the overall risk of being diagnosed with CHD.

Let’s get started.

Note: This article assumes working knowledge of Python IDEs. All code and visualizations in this article were created in a Jupyter Notebook and can be found on my GitHub.

Step 1: Defining the problem

Since the output label is provided to us (TenYearCHD), we know that this is a supervised problem. We’re looking to classify individuals into two (binary) categories: those who develop CHD (1) and those who do not (0); therefore, this is a classification problem.

Step 2: Data Loading

We will be using the Python libraries NumPy, Pandas and Seaborn for data loading, exploration and visualization. Seaborn builds on Matplotlib and produces clean, easy-to-read visualizations.

After reading in the data (the dataset can be downloaded from Kaggle as a CSV file), the next step is to inspect the data before moving on to data cleaning — we want to understand the shape of the dataset, the different data types, the variables included and the target variable.

Description of the 15 variables plus the target, in the dataset
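
A minimal sketch of loading and inspecting the data, assuming the Kaggle CSV was saved locally as framingham.csv (the file name and path are assumptions):

    import pandas as pd
    import seaborn as sns

    # Read the Kaggle CSV (assumed file name)
    df = pd.read_csv("framingham.csv")

    print(df.shape)    # number of rows and columns
    print(df.dtypes)   # data type of each variable
    print(df.head())   # first few rows, including the target TenYearCHD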

Step 3: Data Cleaning

In real-world datasets and projects, we aren’t given a neat and clean CSV file — there will be inconsistent data types, missing/null values and duplicates. As the saying goes, “garbage in, garbage out”: a machine learning model can only be as good as the input data it’s given.

This Kaggle dataset is relatively clean but we will be checking for and handling null values.

We notice that several columns have one or more missing values. The two most popular ways to deal with missing data are removing the affected rows (or the column altogether) and imputing the missing values. Based on the column and the number of missing values, I chose to alternate between dropping rows, filling in null values with the mean, and interpolating values.
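
As a rough illustration of this step (the per-column choices below are examples; the exact decisions in my notebook may differ):

    # Count missing values in each column
    print(df.isnull().sum())

    # Illustrative handling choices
    df = df.dropna(subset=["BPMeds"])                            # drop rows missing BPMeds
    df["totChol"] = df["totChol"].fillna(df["totChol"].mean())   # fill with the column mean
    df["cigsPerDay"] = df["cigsPerDay"].interpolate()            # interpolate remaining gaps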

Step 4: Exploratory Data Analysis

Previously, we did some brief exploration to better understand our dataset. In this step, using six different kinds of graphs and plots, we’ll dive deep into each variable and its relationship with our outcome.

  • Boxplots

We can identify outliers by plotting a boxplot. Any data points that fall outside the upper and lower whiskers of the box are clear outliers (like the extreme data points in the totChol and sysBP columns) and need to be removed.
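
For example, a quick boxplot of one affected column (a sketch; each column of interest can be plotted the same way):

    import matplotlib.pyplot as plt

    # Points beyond the whiskers are candidate outliers
    sns.boxplot(x=df["totChol"])
    plt.show()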

  • Correlation Heatmaps

Heatmaps show the correlation between every pair of variables, including the target variable. Reading a heatmap is simple: all we need to do is compare the color of a square in the grid to the value on the color bar.

When the value on the color bar is:

  • closer to 0, there is no linear correlation between the two variables
  • closer to +1, there is a positive correlation between the two variables
  • closer to -1, there is a negative correlation between the two variables

For example, the color of the square where sysBP on the y-axis meets TenYearCHD on the x-axis is a light pinkish purple and corresponds to roughly 0.3 on the color bar. This indicates that sysBP is positively correlated with TenYearCHD.
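
A minimal version of such a heatmap (the figure size and color palette here are arbitrary choices):

    plt.figure(figsize=(12, 10))
    sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # pairwise correlations, including TenYearCHD
    plt.show()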

  • Distplots

Distplots show the frequency distribution and potential skew of each variable, and this information will come in handy in later steps. A variable’s distribution plays a role in how we select final features and in how we scale them.

An example of a roughly normally distributed variable is sysBP, whereas cigsPerDay is highly skewed, with a long tail to the right.
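
A sketch of these plots; sns.distplot is deprecated in recent Seaborn releases, so sns.histplot with a KDE overlay is used here instead:

    sns.histplot(df["sysBP"], kde=True)       # roughly bell-shaped
    plt.show()

    sns.histplot(df["cigsPerDay"], kde=True)  # heavily skewed, long right tail
    plt.show()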

  • Barplots

Barplots are generally used to plot the relationship between a categorical variable and an outcome. Take for example the variables gender and TenYearCHD: we can clearly see that males have a slightly higher risk of developing CHD compared to females.
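
A sketch, using the male column (1 = male, 0 = female) as the gender variable:

    # Bar height is the mean of TenYearCHD per group, i.e. the proportion who develop CHD
    sns.barplot(x="male", y="TenYearCHD", data=df)
    plt.show()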

  • Countplots

Countplots are effective at showing the count of observations in each categorical ‘bin’ using bars. These plots can be used to show the relationship between a numerical variable and a categorical variable.

From this plot we observe that cigsPerDay (numerical) is positively correlated with TenYearCHD (categorical), i.e., the more cigarettes a person smokes in a day, the more likely they are to develop CHD.
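
A sketch of such a countplot:

    # Counts per cigsPerDay value, split by outcome
    sns.countplot(x="cigsPerDay", hue="TenYearCHD", data=df)
    plt.show()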

  • Regplots

Regplots plot the data together with a linear regression model fit. These plots take in one numerical variable and one categorical variable and output a trend line showcasing the relationship between the two.

Looking at sysBP (numerical) and TenYearCHD (categorical), we can see a linearly increasing line that indicates a positive relationship between the two variables: the ten-year risk of developing CHD increases as sysBP increases.
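
And a sketch of the regplot:

    sns.regplot(x="sysBP", y="TenYearCHD", data=df)  # the fitted line slopes upward
    plt.show()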

Step 5: Feature Selection

Now that we’ve explored our variables and their relationship with the outcome, we’re ready to choose features for our machine learning model. As the graphs and plots above show, not every variable directly influences the outcome, and we want to be certain that the variables we include in the model will contribute positively to model performance.

There are various feature selection techniques we can use, but for this dataset we’ll limit ourselves to two methods (a short scikit-learn sketch follows the list):

  • SelectKBest: Calculates the chi² statistic between each feature of X and the class labels y, and returns the k features with the highest scores.
10 Features with the highest SelectKBest scores
  • Mutual Information Classification: Measures the dependency between each feature and the target variable; a higher score indicates a stronger dependency.
10 Features with the highest Mutual Information Classification scores
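
A sketch of both techniques with scikit-learn, choosing k = 10 to match the score tables above:

    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

    X = df.drop("TenYearCHD", axis=1)
    y = df["TenYearCHD"]

    # Chi-squared scores (chi2 requires non-negative features, which holds here:
    # every variable is a count, a measurement or a 0/1 flag)
    selector = SelectKBest(score_func=chi2, k=10).fit(X, y)
    print(pd.Series(selector.scores_, index=X.columns).nlargest(10))

    # Mutual information between each feature and the target
    print(pd.Series(mutual_info_classif(X, y), index=X.columns).nlargest(10))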

The final features in our model will be a combination of the top features from the results of SelectKBest and Mutual Information Classification: sysBP, age, totChol, diaBP, prevalentHyp, diabetes, BPMeds and male.

Columns included in the final dataset

Step 6: Data pre-processing

Data pre-processing is the process of converting the data into a form that the machine learning model can work with. This step includes splitting the dataset into train and test sets, scaling the features and balancing the imbalanced target; a code sketch follows the list below.

  • Train-test split: We divide our dataset into two subsets: the first (the training dataset) is used to fit the model, and the second (the test dataset) is used to evaluate how well the trained model predicts on data it has not seen. If we don’t split the dataset, the model will “see” all of the data and we can’t accurately estimate its performance on new data.
  • Feature scaling: We want each feature to carry comparable weight. The scaling method depends on the distribution of our data; in our case we will be using MinMaxScaler, which rescales each feature into the [0, 1] range.
  • Resampling the imbalanced target: Taking a look at the class counts before resampling, we see that our target variable, TenYearCHD, is highly imbalanced. If we train on it as-is, our models will favor the majority class and ignore the minority class, resulting in models with high accuracy but low recall. There are a few ways to tackle this, but we’ll use SMOTE, which oversamples the minority class by generating synthetic samples from existing ones.
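
A sketch of all three steps, assuming the imbalanced-learn package is installed for SMOTE (the test size and random_state values are arbitrary choices):

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from imblearn.over_sampling import SMOTE   # from the imbalanced-learn package

    final_features = ["sysBP", "age", "totChol", "diaBP", "prevalentHyp", "diabetes", "BPMeds", "male"]

    # 1. Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        df[final_features], df["TenYearCHD"], test_size=0.2, random_state=42)

    # 2. Scale every feature into the [0, 1] range, fitting the scaler on the training set only
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # 3. Oversample the minority class in the training set with SMOTE
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)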

So far we’ve explored the dataset, identified and removed outliers, analyzed our categorical and numerical variables in depth, selected our features, appropriately divided our data into testing and training datasets, scaled our features, and balanced our target variable.

We are now ready for the machine learning algorithms.

Step 7: Predictive Modeling

There are several algorithms that are well suited to classification problems. This project implements four of them (and their hypertuned counterparts): Logistic Regression, Random Forest, K-Nearest Neighbors and Support Vector Machines. A minimal fitting sketch follows the list.

  • Logistic Regression: A widely used classification algorithm when the expected output is binary (yes/no or 0/1). It is helpful for understanding the influence of one or more independent variables on a single outcome variable.
  • Random Forest: Effective for both classification and regression problems, a random forest is several decision trees put together. What’s a decision tree? Similar in appearance to a flowchart, a decision tree breaks the data down into smaller and smaller subsets until it finds the smallest tree that fits the data. Although individual trees are easy to interpret and handle data well, they are prone to overfitting and can produce low-accuracy results. Combining multiple trees into one model, i.e., a Random Forest, turns many weak individual trees into one strong model.
  • K-Nearest Neighbors: This algorithm operates under the assumption that similar data points exist near each other. KNN uses this idea of ‘closeness’ by calculating the distance between data points. For a specific value of K, say K = 5, we consider the 5 data points closest to the unknown point and assign it the majority label among those neighbors.
  • Support Vector Machine: This algorithm finds the separating boundary (a hyperplane) that best divides the data points into two classes (for classification), classifying each data point into one of the two classes.
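
A minimal sketch of fitting the four baseline models with their default hyperparameters:

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=42),
        "K-Nearest Neighbors": KNeighborsClassifier(),
        "Support Vector Machine": SVC(),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)               # train on the resampled, scaled training set
        print(name, model.score(X_test, y_test))  # accuracy on the held-out test set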

Step 8: Hyperparameter Tuning

With most models, we can also tune hyperparameters (think of them as settings for an algorithm) to optimize model performance. Scikit-learn ships with default hyperparameter values for every model, but these values are not guaranteed to produce the best results. GridSearch and RandomizedSearch are common tuning methods used to find optimal values; all the hypertuned models in this project use RandomizedSearch, which samples hyperparameter combinations at random from a range of values.

Hypertuned Random Forest:

Adjusted hyperparameters:

  • n_estimators: number of trees in the forest
  • max_features: maximum number of features considered when looking for the best split
  • max_depth: maximum number of levels in each tree
  • min_samples_split: minimum number of samples needed to split a node
  • min_samples_leaf: minimum number of samples required at each leaf node
  • bootstrap: whether bootstrap samples are used when building each tree

Once we’ve created the grid, we can instantiate the search object and fit it just like any other scikit-learn model.
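
A sketch of the random forest search with an illustrative parameter grid (the exact ranges used in the notebook may differ):

    from sklearn.model_selection import RandomizedSearchCV

    rf_grid = {
        "n_estimators": [100, 200, 400, 800],
        "max_features": ["sqrt", "log2"],
        "max_depth": [10, 20, 40, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "bootstrap": [True, False],
    }

    rf_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions=rf_grid,
                                   n_iter=50, cv=3, n_jobs=-1, random_state=42)
    rf_search.fit(X_train, y_train)
    print(rf_search.best_params_)
    print(rf_search.best_estimator_.score(X_test, y_test))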

Hypertuned K-Nearest Neighbors

Adjusted hyperparameters:

  • leaf_size: affects the speed and memory usage of queries; passed to the underlying tree algorithm (BallTree in this case)
  • n_neighbors: number of neighbors
  • p: power parameter for the Minkowski metric (p = 1 is Manhattan distance, p = 2 is Euclidean)

As before, once we’ve created the grid, we instantiate the search object and fit it like any other scikit-learn model.
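
The same pattern for KNN, again with an illustrative grid:

    knn_grid = {
        "leaf_size": list(range(1, 50)),
        "n_neighbors": list(range(1, 30)),
        "p": [1, 2],   # 1 = Manhattan, 2 = Euclidean
    }

    knn_search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions=knn_grid,
                                    n_iter=50, cv=3, random_state=42)
    knn_search.fit(X_train, y_train)
    print(knn_search.best_params_)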

Hypertuned Support Vector Machine

Adjusted hyperparameters:

  • C: regularization parameter
  • kernel: kernel type (can be linear, poly, rbf, sigmoid, precomputed or a callable)
  • gamma: kernel coefficient (for the rbf, poly and sigmoid kernels)
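
And the same pattern for the support vector machine:

    svc_grid = {
        "C": [0.1, 1, 10, 100],
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
        "gamma": ["scale", "auto", 0.1, 0.01],
    }

    svc_search = RandomizedSearchCV(SVC(), param_distributions=svc_grid,
                                    n_iter=20, cv=3, random_state=42)
    svc_search.fit(X_train, y_train)
    print(svc_search.best_params_)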

Step 9: Model Evaluation

To evaluate our models, we’ll use accuracy and the confusion matrix. Classification accuracy is the ratio of correct predictions to the total number of samples, and it works best when each class has an equal number of samples (both of ours do, since we resampled to balance them in an earlier step). The ultimate goal is the highest accuracy possible, ideally a rare 100%; the accuracy for all of our models falls between 83% and 84%, which looks pretty good at first glance.

One problem with accuracy is that it doesn’t clearly indicate how samples are misclassified, and depending on the type and goal of your project, this can become a problem. Take our dataset as an example: we’re trying to predict the possibility of an individual developing CHD based on several variables. There are two misclassification errors that could potentially occur:

  • Type 1/False Positive: model indicates individual will develop CHD when in actuality, they will not. We reject the null hypothesis when it is actually true.
  • Type 2/False Negative: model indicates individual will not develop CHD when in actuality, they will. We fail to reject the null hypothesis when it is actually false.

Going back to our case, which error is worse? They’re both bad, but probably Type 2: we don’t want to tell study participants that they’re clear of CHD only for them to come back a couple of years later with advanced-stage CHD.

The confusion matrix is a performance measurement for machine learning classification problems that (unlike accuracy) takes into account true positive (TP), false positive (FP/Type 1 error), false negative (FN/Type 2 error) and true negative (TN) values.

Source: Understanding Confusion Matrix by Sarang Narkhede in Towards Data Science

Looking at the confusion matrix above through the lens of our dataset, we want our model to output high numbers of true positives (TP) and true negatives (TN) and low numbers of false negatives (FN) and false positives (FP).
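
A sketch of computing both metrics for one of the models (note that scikit-learn’s confusion_matrix uses the ordering [[TN, FP], [FN, TP]] by default):

    from sklearn.metrics import accuracy_score, confusion_matrix

    y_pred = rf_search.best_estimator_.predict(X_test)

    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))   # [[TN, FP], [FN, TP]]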

Here are our results with each model’s accuracy and confusion matrix [[TP, FP] [FN, TN]]. Although all of the models generated similar results in terms of overall accuracy and number of false negatives, I would consider the Hypertuned Random Forest to be the model that represented our dataset and outcome the best.

Final Thoughts

Applying our hypertuned random forest model to this dataset, it predicts the correct outcome roughly 84% of the time on the test set. The comparatively low number of false negatives adds to our confidence. Models like this one can certainly be put into production and used in the real world to help cardiac specialists make health decisions based on the output.

There are ways to improve the model’s accuracy. Since the Framingham Heart Study is still ongoing, we can feed the machine learning models more data as we receive it; models generally make better predictions with more data. From a technical standpoint, we could try more advanced machine learning techniques and algorithms, like ensembling and deep learning. This is what I love most about machine learning: the possibilities are endless :)

Thank you for reading!
