A Guide to Building Your First Data Science Project

End-to-end, from the beginning

Harika Panuganty
Analytics Vidhya
15 min read · May 4, 2021


“Stepping” into the realm of data science

We’ve heard the buzzwords: data science, machine learning, predictive modeling. But what do they mean? How can we use this technology in the real world to make impactful decisions?

This end-to-end project was created to showcase just that.

Data science can be defined as the combination of scientific methods, mathematics, specialized programming, advanced analytics, AI and storytelling to uncover business insights buried in data. Let’s simplify that definition: I’d describe data science as the process of grabbing useful pieces of information from a larger data source.

Well, how do we know what information is useful? This is where machine learning comes in. Machine learning gives systems and users (like us) the ability to uncover hidden insights in data using algorithms. An algorithm uses statistical modeling to take in input data and predict an output value.

There are two kinds of machine learning tasks, and they’re separated by whether the training data includes labeled outputs. Supervised learning uses labeled outputs when developing a model to show the relationship between the input and output data. Unsupervised learning does not have clearly labeled outputs, so a model is developed from the given data points alone.

We can further split supervised learning based on outcome type: classification, where the predicted outcome is categorical (for example, binary 0/1), or regression, where the predicted outcome is continuous. When working on a machine learning problem, it’s important to understand the type of outcome, since the thought process and methodology differ between the two problem types.

This article provides a detailed walkthrough of the steps for an end-to-end project using the Framingham Heart Study dataset. The data comes from an ongoing cardiovascular study of individuals from Framingham, Massachusetts, in which study participants are monitored for the risk of Coronary Heart Disease (CHD) based on 15 different variables. With this dataset we’ll determine the variables most relevant to the outcome and predict the overall risk of being diagnosed with CHD.

Let’s get started.

Note: This article assumes working knowledge of Python IDEs. All code and visualizations in this article were created in a Jupyter Notebook and can be found on my GitHub.

Step 1: Defining the problem

Since the output label is provided to us (TenYearCHD), we know that this is a supervised problem. We’re looking to classify individuals into two (binary) categories: those who develop CHD (1) and those who do not (0); therefore, this is a classification problem.

Step 2: Data Loading

We will be using the Python libraries NumPy, Pandas and Seaborn for data loading, exploration and visualization. Seaborn builds on Matplotlib and produces clean, easy-to-read visualizations.

After reading in the data (the dataset can be downloaded from Kaggle as a CSV file), the next step is to inspect the data before moving on to data cleaning — we want to understand the shape of the dataset, the different data types, the variables included and the target variable.

Description of the 15 variables plus the target, in the dataset
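
A minimal sketch of loading and inspecting the data, assuming the Kaggle CSV was saved locally as framingham.csv (the file name and path are assumptions):

    import pandas as pd
    import seaborn as sns

    # Read the Kaggle CSV (assumed file name)
    df = pd.read_csv("framingham.csv")

    print(df.shape)    # number of rows and columns
    print(df.dtypes)   # data type of each variable
    print(df.head())   # first few rows, including the target TenYearCHD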

Step 3: Data Cleaning

In real-world datasets and projects, we aren’t given a neat and clean CSV file — there will be inconsistent data types, missing/null values and duplicates. As the saying goes, “garbage in, garbage out”: a machine learning model can only be as good as the input data it’s given.

This Kaggle dataset is relatively clean but we will be checking for and handling null values.

We notice that several columns have one or more missing values. The two most popular ways to deal with missing data are removing the affected rows (or the column altogether) and imputing the missing values. Based on the column and the number of missing values, I chose to alternate between dropping rows, filling in null values with the mean, and interpolating values.
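
As a rough illustration of this step (the per-column choices below are examples; the exact decisions in my notebook may differ):

    # Count missing values in each column
    print(df.isnull().sum())

    # Illustrative handling choices
    df = df.dropna(subset=["BPMeds"])                            # drop rows missing BPMeds
    df["totChol"] = df["totChol"].fillna(df["totChol"].mean())   # fill with the column mean
    df["cigsPerDay"] = df["cigsPerDay"].interpolate()            # interpolate remaining gaps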

Step 4: Exploratory Data Analysis

Previously, we did some brief exploration to better understand our dataset. In this step, using six different kinds of graphs and plots, we’ll dive deep into each variable and its relationship with our outcome.

  • Boxplots

We can identify outliers by plotting a boxplot. Any data points that fall outside the upper and lower whiskers of the box are clear outliers (like the extreme data points in the totChol and sysBP columns) and need to be removed.
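
For example, a quick boxplot of one affected column (a sketch; each column of interest can be plotted the same way):

    import matplotlib.pyplot as plt

    # Points beyond the whiskers are candidate outliers
    sns.boxplot(x=df["totChol"])
    plt.show()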

  • Correlation Heatmaps

Heatmaps show the correlation between every pair of variables, including the target variable. Reading a heatmap is simple: all we need to do is compare the color of a square in the grid to the value on the color bar.

When the value on the color bar is:

  • closer to 0, there is no linear correlation between the two variables
  • closer to +1, there is a positive correlation between the two variables
  • closer to -1, there is a negative correlation between the two variables

For example, the color of the square where sysBP on the y-axis meets TenYearCHD on the x-axis is a light pinkish purple and corresponds to roughly 0.3 on the color bar. This indicates that sysBP is positively correlated with TenYearCHD.
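
A minimal version of such a heatmap (the figure size and color palette here are arbitrary choices):

    plt.figure(figsize=(12, 10))
    sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # pairwise correlations, including TenYearCHD
    plt.show()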

  • Distplots

Distplots show the frequency distribution and potential skew of each variable, and this information will come in handy in later steps. A variable’s distribution plays a role in how we select final features and in how we scale them.

An example of a roughly normally distributed variable is sysBP, whereas cigsPerDay is highly skewed, with a long tail to the right.
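
A sketch of these plots; sns.distplot is deprecated in recent Seaborn releases, so sns.histplot with a KDE overlay is used here instead:

    sns.histplot(df["sysBP"], kde=True)       # roughly bell-shaped
    plt.show()

    sns.histplot(df["cigsPerDay"], kde=True)  # heavily skewed, long right tail
    plt.show()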

  • Barplots

Barplots are generally used to plot the relationship between a categorical variable and an outcome. Take for example the variables gender and TenYearCHD: we can clearly see that males have a slightly higher risk of developing CHD compared to females.
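
A sketch, using the male column (1 = male, 0 = female) as the gender variable:

    # Bar height is the mean of TenYearCHD per group, i.e. the proportion who develop CHD
    sns.barplot(x="male", y="TenYearCHD", data=df)
    plt.show()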

  • Countplots

Countplots are effective at showing the count of observations in each categorical ‘bin’ using bars. These plots can be used to show the relationship between a numerical variable and a categorical variable.

From this plot we observe that cigsPerDay (numerical) is positively correlated with TenYearCHD (categorical), i.e., the more cigarettes a person smokes in a day, the more likely they are to develop CHD.
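
A sketch of such a countplot:

    # Counts per cigsPerDay value, split by outcome
    sns.countplot(x="cigsPerDay", hue="TenYearCHD", data=df)
    plt.show()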

  • Regplots

Regplots plot the data together with a linear regression model fit. These plots take in one numerical variable and one categorical variable and output a trend line showcasing the relationship between the two.

Looking at sysBP (numerical) and TenYearCHD (categorical), we can see a linearly increasing line that indicates a positive relationship between the two variables: the ten-year risk of developing CHD increases as sysBP increases.
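
And a sketch of the regplot:

    sns.regplot(x="sysBP", y="TenYearCHD", data=df)  # the fitted line slopes upward
    plt.show()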

Step 5: Feature Selection

Now that we’ve explored our variables and their relationship with the outcome, we’re ready to choose features for our machine learning model. As the graphs and plots above show, not every variable directly influences the outcome, and we want to be certain that the variables we include in the model will contribute positively to model performance.

There are various feature selection techniques we can use, but for this dataset we’ll limit ourselves to two methods (a short scikit-learn sketch follows the list):

  • SelectKBest: Calculates the chi² statistic between each feature of X and the class labels y, and returns the k features with the highest scores.
10 Features with the highest SelectKBest scores
  • Mutual Information Classification: Measures the dependency between each feature and the target variable; a higher score indicates a stronger dependency.
10 Features with the highest Mutual Information Classification scores
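
A sketch of both techniques with scikit-learn, choosing k = 10 to match the score tables above:

    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

    X = df.drop("TenYearCHD", axis=1)
    y = df["TenYearCHD"]

    # Chi-squared scores (chi2 requires non-negative features, which holds here:
    # every variable is a count, a measurement or a 0/1 flag)
    selector = SelectKBest(score_func=chi2, k=10).fit(X, y)
    print(pd.Series(selector.scores_, index=X.columns).nlargest(10))

    # Mutual information between each feature and the target
    print(pd.Series(mutual_info_classif(X, y), index=X.columns).nlargest(10))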

The final features in our model will be a combination of the top features from the results of SelectKBest and Mutual Information Classification: sysBP, age, totChol, diaBP, prevalentHyp, diabetes, BPMeds and male.

Columns included in the final dataset

Step 6: Data pre-processing

Data pre-processing is the process of converting the data into a form that the machine learning model can work with. This step includes splitting the dataset into train and test sets, scaling the features and balancing the imbalanced target; a code sketch follows the list below.

  • Train-test split: We divide our dataset into two subsets: the first (the training dataset) is used to fit the model, and the second (the test dataset) is used to evaluate how well the trained model predicts on data it has not seen. If we don’t split the dataset, the model will “see” all of the data and we can’t accurately estimate its performance on new data.
  • Feature scaling: We want each feature to carry comparable weight. The scaling method depends on the distribution of our data; in our case we will be using MinMaxScaler, which rescales each feature into the [0, 1] range.
  • Resampling the imbalanced target: Taking a look at the class counts before resampling, we see that our target variable, TenYearCHD, is highly imbalanced. If we train on it as-is, our models will favor the majority class and ignore the minority class, resulting in models with high accuracy but low recall. There are a few ways to tackle this, but we’ll use SMOTE, which oversamples the minority class by generating synthetic samples from existing ones.
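
A sketch of all three steps, assuming the imbalanced-learn package is installed for SMOTE (the test size and random_state values are arbitrary choices):

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from imblearn.over_sampling import SMOTE   # from the imbalanced-learn package

    final_features = ["sysBP", "age", "totChol", "diaBP", "prevalentHyp", "diabetes", "BPMeds", "male"]

    # 1. Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        df[final_features], df["TenYearCHD"], test_size=0.2, random_state=42)

    # 2. Scale every feature into the [0, 1] range, fitting the scaler on the training set only
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # 3. Oversample the minority class in the training set with SMOTE
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)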

So far we’ve explored the dataset, identified and removed outliers, analyzed our categorical and numerical variables in depth, selected our features, appropriately divided our data into testing and training datasets, scaled our features, and balanced our target variable.

We are now ready for the machine learning algorithms.

Step 7: Predictive Modeling

There are several algorithms that are well suited to classification problems. This project implements four of them (and their hypertuned counterparts): Logistic Regression, Random Forest, K-Nearest Neighbors and Support Vector Machines. A minimal fitting sketch follows the list.

  • Logistic Regression: A widely used classification algorithm when the expected output is binary (yes/no or 0/1). It is helpful for understanding the influence of one or more independent variables on a single outcome variable.
  • Random Forest: Effective for both classification and regression problems, a random forest is several decision trees put together. What’s a decision tree? Similar in appearance to a flowchart, a decision tree breaks the data down into smaller and smaller subsets until it finds the smallest tree that fits the data. Although individual trees are easy to interpret and handle data well, they are prone to overfitting and can produce low-accuracy results. Combining multiple trees into one model, i.e., a Random Forest, turns many weak individual trees into one strong model.
  • K-Nearest Neighbors: This algorithm operates under the assumption that similar data points exist near each other. KNN uses this idea of ‘closeness’ by calculating the distance between data points. For a specific value of K, say K = 5, we consider the 5 data points closest to the unknown point and assign it the majority label among those neighbors.
  • Support Vector Machine: This algorithm finds the separating boundary (a hyperplane) that best divides the data points into two classes (for classification), classifying each data point into one of the two classes.
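
A minimal sketch of fitting the four baseline models with their default hyperparameters:

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=42),
        "K-Nearest Neighbors": KNeighborsClassifier(),
        "Support Vector Machine": SVC(),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)               # train on the resampled, scaled training set
        print(name, model.score(X_test, y_test))  # accuracy on the held-out test set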

Step 8: Hyperparameter Tuning

With most models, we can also tune hyperparameters (think of them as settings for an algorithm) to optimize model performance. Scikit-learn ships with default hyperparameter values for every model, but these values are not guaranteed to produce the best results. GridSearch and RandomizedSearch are common tuning methods used to find optimal values; all the hypertuned models in this project use RandomizedSearch, which samples hyperparameter combinations at random from a range of values.

Hypertuned Random Forest:

Adjusted hyperparameters:

  • n_estimators: number of trees in the forest
  • max_features: maximum number of features considered when looking for the best split
  • max_depth: maximum number of levels in each tree
  • min_samples_split: minimum number of samples needed to split a node
  • min_samples_leaf: minimum number of samples required at each leaf node
  • bootstrap: whether bootstrap samples are used when building each tree

Once we’ve created the grid, we can instantiate the search object and fit it just like any other scikit-learn model.
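
A sketch of the random forest search with an illustrative parameter grid (the exact ranges used in the notebook may differ):

    from sklearn.model_selection import RandomizedSearchCV

    rf_grid = {
        "n_estimators": [100, 200, 400, 800],
        "max_features": ["sqrt", "log2"],
        "max_depth": [10, 20, 40, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "bootstrap": [True, False],
    }

    rf_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions=rf_grid,
                                   n_iter=50, cv=3, n_jobs=-1, random_state=42)
    rf_search.fit(X_train, y_train)
    print(rf_search.best_params_)
    print(rf_search.best_estimator_.score(X_test, y_test))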

Hypertuned K-Nearest Neighbors

Adjusted hyperparameters:

  • leaf_size: affects the speed and memory usage of queries; passed to the underlying tree algorithm (BallTree in this case)
  • n_neighbors: number of neighbors
  • p: power parameter for the Minkowski metric (p = 1 is Manhattan distance, p = 2 is Euclidean)

As before, once we’ve created the grid, we instantiate the search object and fit it like any other scikit-learn model.
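
The same pattern for KNN, again with an illustrative grid:

    knn_grid = {
        "leaf_size": list(range(1, 50)),
        "n_neighbors": list(range(1, 30)),
        "p": [1, 2],   # 1 = Manhattan, 2 = Euclidean
    }

    knn_search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions=knn_grid,
                                    n_iter=50, cv=3, random_state=42)
    knn_search.fit(X_train, y_train)
    print(knn_search.best_params_)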

Hypertuned Support Vector Machine

Adjusted hyperparameters:

  • C: regularization parameter
  • kernel: kernel type (can be linear, poly, rbf, sigmoid, precomputed or a callable)
  • gamma: kernel coefficient (for the rbf, poly and sigmoid kernels)
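
And the same pattern for the support vector machine:

    svc_grid = {
        "C": [0.1, 1, 10, 100],
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
        "gamma": ["scale", "auto", 0.1, 0.01],
    }

    svc_search = RandomizedSearchCV(SVC(), param_distributions=svc_grid,
                                    n_iter=20, cv=3, random_state=42)
    svc_search.fit(X_train, y_train)
    print(svc_search.best_params_)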

Step 9: Model Evaluation

To evaluate our models, we’ll use accuracy and the confusion matrix. Classification accuracy is the ratio of correct predictions to the total number of samples, and it works best when each class has an equal number of samples (both of ours do, since we resampled to balance them in an earlier step). The ultimate goal is the highest accuracy possible, ideally a rare 100%; the accuracy for all of our models falls between 83% and 84%, which looks pretty good at first glance.

One problem with accuracy is that it doesn’t clearly indicate how samples are misclassified, and depending on the type and goal of your project, this can become a problem. Take our dataset as an example: we’re trying to predict the possibility of an individual developing CHD based on several variables. There are two misclassification errors that could potentially occur:

  • Type 1/False Positive: model indicates individual will develop CHD when in actuality, they will not. We reject the null hypothesis when it is actually true.
  • Type 2/False Negative: model indicates individual will not develop CHD when in actuality, they will. We fail to reject the null hypothesis when it is actually false.

Going back to our case, which error is worse? They’re both bad, but probably Type 2: we don’t want to tell study participants that they’re clear of CHD only for them to come back a couple of years later with advanced-stage CHD.

The confusion matrix is a performance measurement for machine learning classification problems that (unlike accuracy) takes into account true positive (TP), false positive (FP/Type 1 error), false negative (FN/Type 2 error) and true negative (TN) values.

Source: Understanding Confusion Matrix by Sarang Narkhede in Towards Data Science

Looking at the confusion matrix above through the lens of our dataset, we want our model to output high numbers of true positives (TP) and true negatives (TN) and low numbers of false negatives (FN) and false positives (FP).
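
A sketch of computing both metrics for one of the models (note that scikit-learn’s confusion_matrix uses the ordering [[TN, FP], [FN, TP]] by default):

    from sklearn.metrics import accuracy_score, confusion_matrix

    y_pred = rf_search.best_estimator_.predict(X_test)

    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))   # [[TN, FP], [FN, TP]]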

Here are our results with each model’s accuracy and confusion matrix [[TP, FP] [FN, TN]]. Although all of the models generated similar results in terms of overall accuracy and number of false negatives, I would consider the Hypertuned Random Forest to be the model that represented our dataset and outcome the best.

Final Thoughts

Applying our hypertuned random forest model to this dataset, it predicts the correct outcome roughly 84% of the time on the test set. The comparatively low number of false negatives adds to our confidence. Models like this one can certainly be put into production and used in the real world to help cardiac specialists make health decisions based on the output.

There are ways to improve the model’s accuracy. Since the Framingham Heart Study is still ongoing, we can feed the machine learning models more data as we receive it; models generally make better predictions with more data. From a technical standpoint, we could try more advanced machine learning techniques and algorithms, like ensembling and deep learning. This is what I love most about machine learning: the possibilities are endless :)

Thank you for reading!
