My data science template for Python

Albert Sanchez Lafuente
Published in Saturdays.AI · Apr 14, 2019

I've been learning data science and AI for the past year. During this time, my way of working was to search for the code I needed at every step of my data science projects, copy-paste it and adapt it to my project. I thought it would be really useful to have some kind of template containing all the code I could need for a data science project.

In this post I will show my data science template. It is a Python file with most of the code needed for a data science project, structured in a way that makes it super easy to follow along.

Let's begin at the end: you can find this template on my GitHub:

Now that you have easy access to the code, I'll explain how it is structured. Keep in mind that I'll keep updating the template on GitHub but I won't update this Medium article, so some parts of what I write here might become outdated.

First of all, I followed the structure for a data science project that you can find in Appendix B of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurelien Geron (https://amzn.to/2WIfsmk).

After creating an empty file, following the structure outlined in the book and adding most of the text of Appendix B as comments to structure the code, I started filling every part of the document with relevant code snippets (I'm still working on it). The snippets come from many different sources: code I wrote for competitions I participated in, code from friends, examples found on the internet, books, etc.

While making it I was participating in the CareerCon 2019 — Help Navigate Robots Kaggle competition. For my first tests, I decided to go for the fast.ai strategy of launching a model as quickly as possible to get a baseline metric. I tested a random forest model and got 39% accuracy. Then I started following this template and achieved 65%!

Right now I'm looking to add even more snippets to the template and make it useful for different kinds of data (at the moment it only has code for tabular data).

Let’s dive deeper into the code.

This is intended to be a summary of the structure, with comments on the most important parts:

As always, we’ll begin with the necessary imports:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
sns.set()

Most of them are the typical data science imports. The ones worth talking about are seaborn, a data visualization library that works on top of matplotlib and adds extra functionality, different kinds of plots and overall prettier visuals, and tqdm, a library that gives you progress bars so you can see how long your functions are taking to run.

A quick tqdm example
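Something along these lines (a minimal sketch, not code from the template; the sleep call just stands in for real work):

from tqdm import tqdm
import time
#Wrap any iterable in tqdm() to get a live progress bar
for i in tqdm(range(100)):
    time.sleep(0.01) #Placeholder for the real work of each iteration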

Then we load our data. There can be many variations of this step depending on how your data is structured; we won't cover those here.

df = pd.read_csv('file.csv')

Now we visualize our data in order to get a quick glimpse of what we have in our hands:

#Visualize data
df.head()
df.describe()
df.info()
df.columns
#For a categorical dataset we want to see how many instances of each category there are
df['categorical_var'].value_counts()
#Exploratory Data Analysis (EDA)
sns.pairplot(df)
sns.distplot(df['column'])
sns.countplot(df['column'])

Example of a pairplot result

Data pre-processing

The first step after loading and visualizing the data is to pre-process it and give it an appropriate format for passing it to the machine learning models.

First let's check for errors in our dataset and fix them: NaNs, infinite numbers, duplicated values, etc.

#Fix or remove outliers
plt.boxplot(df['feature1'])
plt.boxplot(df['feature2'])
#Check for missing data
total_null = df.isna().sum().sort_values(ascending=False)
percent = (df.isna().sum()/df.isna().count()).sort_values(ascending=False)
missing_data = pd.concat([total_null, percent], axis=1, keys=['Total', 'Percent'])
#Generate new features with missing data
df['feature1_nan'] = df['feature1'].isna()
df['feature2_nan'] = df['feature2'].isna()
#Also look for infinite data, recommended to check it also after feature engineering
df.replace(np.inf,0,inplace=True)
df.replace(-np.inf,0,inplace=True)
#Check for duplicated data
df.duplicated().value_counts()
df['duplicated'] = df.duplicated() #Create a new feature
#Fill missing data or drop columns/rows (pick whichever suits your data)
df.fillna(0, inplace=True) #Or fill with the mean/median of each column
df.drop('column_full_of_nans', axis=1, inplace=True)
df.dropna(how='any', inplace=True)

Then we move on to a feature engineering phase. I'm not going to copy any code here because this section will be totally different for every project you work on. The template contains all the feature engineering code I've used in previous projects, but of course these are only a few examples: the amount of different feature engineering that can be done is almost infinite and will vary completely depending on your project and kind of data.
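As a tiny illustration though, here is one common pattern (a hypothetical sketch that assumes the dataset has a 'date' column and two numeric features; it is not taken from the template):

#Derive simple date-based features from a datetime column
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['dayofweek'] = df['date'].dt.dayofweek
#Combine two numeric features into a ratio
df['feature_ratio'] = df['feature1'] / (df['feature2'] + 1e-9)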

Model selection and evaluation

After data pre-processing is done and we have the data in the required format, we can start working with models.

We must define a validation strategy, such as K-Fold Cross Validation or dividing the dataset into train/validation sets. Depending on your dataset and your objectives you might opt for one option or the other. Here's the code for some of them:

#Define Validation method
#Train and validation set split
from sklearn.model_selection import train_test_split
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.4, stratify = y.values, random_state = 101)
#Cross validation
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
#StratifiedKFold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
for train_index, val_index in skf.split(X, y):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

Finally we jump to the model fitting section. We can try many different models, evaluate their performance and compare them to one another, so we can choose the most promising ones. The template shows implementations of many different algorithms; I'm not going to show them all here since that would be 100+ lines of code. However, as an example, I will show the implementation of Random Forest, which is one of the most versatile algorithms used in Machine Learning.

#########
# Random Forest
#########
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=200, random_state=101, n_jobs=-1, verbose=3)
rfr.fit(X_train, y_train)
#Use model to predict
y_pred = rfr.predict(X_val)
#Evaluate the model (for a regressor, score() returns the R² coefficient of determination)
score_rf = round(rfr.score(X_val, y_val) * 100, 2)
#Evaluate feature importance
importances = rfr.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfr.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
feature_importances = pd.DataFrame(rfr.feature_importances_, index=X_train.columns, columns=['importance']).sort_values('importance', ascending=False)
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices], yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.show()

We should decide what performance metrics we will use to evaluate the model. There are many different metrics and, as always, depending on your problem you might choose one, another, or maybe several of them. I won't go through them all here since there is a huge number of them.
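Just to give a flavour, here is a small sketch with a couple of the usual scikit-learn regression metrics (assuming the y_val and y_pred variables from the Random Forest example above):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
#Typical regression metrics
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
#For classification you would use accuracy_score, f1_score, confusion_matrix, etc.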

To end up with a great algorithm, we can add hyper-parameter tuning on top of the chosen algorithms. Here's an example of doing so using Grid Search.

from sklearn.model_selection import GridSearchCV
param_grid = {'C':[0.1,1,10,100,1000], 'gamma':[1,0.1,0.01,0.001,0.0001]} #This particular grid suits an SVM with an RBF kernel; adapt it to your model
grid = GridSearchCV(model, param_grid, verbose=3)
grid.fit(X_train, y_train)
grid.best_params_
grid.best_estimator_
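By default GridSearchCV refits the best parameter combination on the training data, so the tuned model can be used directly afterwards (a small usage sketch, assuming the same validation split as above):

#Predict with the tuned model
y_pred = grid.best_estimator_.predict(X_val)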

Conclusion

With this I've covered all the steps you'll need for most of your data science projects. Every section should be expanded with code to treat your specific dataset, and you should use your expertise to decide which steps you should follow and which you shouldn't.

Albert Sanchez Lafuente
Saturdays.AI

Industrial Engineer and Data Scientist. Passionate about building things. Love going to hackathons.