Principal steps of a Machine Learning project

Khuyen Le

Machine learning engineers need more than good programming skills: they also need some of the skills of a data scientist, such as collecting and managing data, the skills of a statistician for analyzing data, and the skills of a mathematician. This is because a machine learning project involves many steps, from managing data, to building and evaluating machine learning models, to applying the model to predict new data in the testing set. In this article, we are going to walk through all of these steps, including:

  1. Gathering data
  2. Exploratory Data Analysis (EDA)
  3. Preprocessing data
  4. Modeling and evaluation
  5. Hyperparameter tuning
  6. Prediction.

All of these tasks will be illustrated in detail through a project that predicts house prices in King County. The dataset is available on Kaggle.

I. Gathering data

As I presented in the previous article, data is essential for building a machine learning model, because the model learns from the dataset that you provide to it. Hence, collecting and managing data play a key role in a machine learning project.

There are many sources for collecting data; here are some open sources from which you can download a variety of datasets to feed to your model:

  • Kaggle is an online community of data scientists where you can find many Machine learning and Deep learning projects as well as open datasets. It is a useful place to learn Machine learning by doing real-life projects or participating in competitions.
  • UCI Machine Learning Repository provides a collection of databases for the Machine learning community. It currently maintains around 560 datasets.
  • Open Data on AWS is a place where users can share datasets that are available via AWS resources.
  • OpenDataSoft is a source that contains more than 2600 open data portals around the world.

Our dataset is saved in a tabular (.csv) format. It can be read with the read_csv function from the pandas package:

import pandas as pd

df = pd.read_csv('kc_house_data.csv')

Checking the data:

df.head()

II. Exploratory Data Analysis (EDA)

The objective of this step is to understand the dataset as much as possible so that we can figure out a quick strategy for the modeling step.

Since machine learning often works with structured data saved in .csv or .xlsx formats, in this post we focus on analyzing this type of data.

Firstly, we can start with some basic analysis, such as identifying the target variable, the number of rows and columns, and the data type of each column, and checking whether the dataset contains any NaN values.

For our dataset kc_house_data.csv:

  • The target feature in our case is the price column.
  • This dataset includes 21613 rows and 21 columns.
print(df.shape)
>>> (21613, 21)
  • Discover the type of each column:
df.dtypes

This result shows that almost all our features have a numerical type (integer or float). Only the “date” column has an object type.

  • Visualizing the proportion of each data type with a pie plot:
df.dtypes.value_counts().plot.pie()
  • Verifying if the dataset includes any NaN value:
df.isna().sum()

This result shows that none of the columns contain any missing values.

  • Describing the dataset statistically with the function df.describe(). This function computes some basic statistics for each numerical column, such as the number of data points, the min, max, mean, standard deviation (std), and quantiles.
df.describe().transpose()
  • To better understand the dataset, we can also use histograms or bar plots to discover the distribution of each feature.

The following figure visualizes the distribution of the price column:

plt.figure(figsize = (10,8))
sns.distplot(df['price'], hist = True, label = 'Price')
plt.show()

Based on the figure above, we see that

  • Most house prices are distributed between 0 and 1 million dollars.
  • Prices around 0.5 million dollars appear most frequently.
  • There are some outlier values that we can drop to prevent them from influencing our ML model.

This plot not only helps us figure out where the data is most concentrated, but it is also useful for detecting outliers.

The house price obviously depends on the number of floors, bedrooms, and bathrooms. So, it is also interesting to visualize these features:

import matplotlib.pyplot as plt
import seaborn as sns

Floors:

plt.figure()
sns.countplot(df['floors'])
plt.show()

Bathrooms:

plt.figure(figsize = (12,5))
sns.countplot(df['bathrooms'])
plt.show()

Bedrooms:

plt.figure(figsize = (7,5))
sns.countplot(df['bedrooms'])
plt.show()
  • Furthermore, discovering the correlation between variables is also very important. Thanks to this analysis, we can select the variables which are most correlated with our target feature and ignore the weakly correlated ones.
plt.figure(figsize = (14,8))
sns.heatmap(df.corr(), linewidths = 0.5, annot = True)

Find the correlations of the price with other features:

df.corr()["price"].sort_values(ascending = False)

The above result shows that the price is highly correlated with variables such as sqft_living, grade, sqft_above, sqft_living15, and bathrooms. The correlations of id and zipcode with the price are very weak.

III. Preprocessing data

As you know, machine learning algorithms learn from the data that you provide to them. If the dataset is not good (it contains missing values or outliers, or the features are not presented in a correct format), then the machine learning model built from this data will be poor. Therefore, preparing the data before feeding it to the model is very important; it takes around 80% of a data scientist's working time.

Commonly used preprocessing techniques include encoding, normalization, imputation, outlier and NaN rejection, variable selection, and variable extraction. Let's see how these techniques work before applying them to preprocess our dataset.

1. Encoding

Encoding converts categorical data into numbers before it is used to fit the model. The two most popular techniques are Ordinal Encoding and One-Hot Encoding.

In Ordinal Encoding, each categorical value is encoded as an integer. For example, suppose we have three categorical values [“dog”, “cat”, “bird”]. Ordinal encoding converts these values into the three integers 0, 1, and 2, according to their alphabetical (dictionary) order. Hence, “bird” is assigned to 0, “cat” to 1, and “dog” to 2. In Python, it is done by OrdinalEncoder from the module sklearn.preprocessing:

from sklearn.preprocessing import OrdinalEncoder
from numpy import asarray

example = asarray([['dog'], ['cat'], ['bird']])
encoder = OrdinalEncoder()
encode_example = encoder.fit_transform(example)
print(encode_example)
>>> [[2.]
 [1.]
 [0.]]

We can also reverse the transformation to find the original categorical value of an encoded value:

import numpy as np

encoder.inverse_transform(np.array([[0],[2],[2]]))
>>> array([['bird'], ['dog'], ['dog']], dtype='<U4')

When there is no ordinal relationship between the categories, integer encoding may not be suitable. It can be replaced by the One-Hot Encoding technique, which converts each value into a vector whose length equals the number of categories. If a data point belongs to the iᵗʰ category, then the iᵗʰ component of this vector is set to 1 and the others are set to 0.

from sklearn.preprocessing import OneHotEncoder
from numpy import asarray

example = asarray([['dog'], ['cat'], ['bird']])
encoder = OneHotEncoder(sparse = False)
encode_example = encoder.fit_transform(example)
print(encode_example)
>>> [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

In this example, “dog” is assigned to vector [0, 0, 1], “cat” to [0, 1, 0] and “bird” to [1, 0, 0].

2. Normalization

Sometimes, the ranges of the numerical features in the raw data vary widely, so it is necessary to normalize them before fitting the data to our machine learning model. The goal of normalization is to bring the values of numerical columns to a common scale without distorting the differences between them. Now, let's discover the most popular techniques in sklearn for normalizing features:

  • Min-max normalization:

This method rescales each feature to the new range [0, 1]. The general formula is given by:

X’ = (X − min(X)) / (max(X) − min(X))

where X is the original value and X’ is the normalized value.

In Python, this formula is computed by MinMaxScaler from the module sklearn.preprocessing:

Example: The vector (1, 2, 3) is rescaled to (0, 0.5, 1)

from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[1],[2],[3]])
scaler = MinMaxScaler()
scaler.fit_transform(X)
>>> array([[0. ],
 [0.5],
 [1. ]])
  • Standardization (Z-score normalization):

This method rescales each feature so that it has zero mean and a standard deviation of 1. The formula of this technique is given by:

X’ = (X − mean(X)) / σ

where mean(X) and σ denote the average value and the standard deviation of X, respectively.

The function StandardScaler belongs to the sklearn.preprocessing module.

Example: The new scale of the vector (1,2,3) by this method is (-1.22474487, 0, 1.22474487)

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1],[2],[3]])
scaler = StandardScaler()
scaler.fit_transform(X)
>>> array([[-1.22474487],
 [ 0.        ],
 [ 1.22474487]])
  • Robust scaler:

The two methods above are not really suitable when the dataset includes outliers. This drawback can be overcome by the Robust Scaler method, which takes the median and the interquartile range into account. The normalization formula of this technique is given by:

X’ = (X − median(X)) / IQR(X)

where median(X) and IQR(X) denote the median and the interquartile range of the data X, respectively.
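This scaler is available as RobustScaler in sklearn.preprocessing. As a minimal sketch (the toy array below is made up for illustration), note how the outlier 100 barely affects the scaling of the other values:

from sklearn.preprocessing import RobustScaler
import numpy as np

# toy data with one large outlier
X = np.array([[1], [2], [3], [4], [100]])
scaler = RobustScaler()
print(scaler.fit_transform(X))
# the median (3) and IQR (2) are used, so the regular values map close to [-1, 1]
# while the outlier stays far away instead of squashing the rest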

3. NaN rejections

When the dataset includes missing values, they need to be rejected or replaced before fitting this data into the model. Pandas provides some useful functions to deal with this problem.

  • pandas.DataFrame.isna() flags the missing values in the DataFrame (combine it with .sum() to count them per column)
  • pandas.DataFrame.fillna(α) replaces the missing values in the DataFrame with a given value α
  • pandas.DataFrame.dropna() drops the rows (by default) that contain missing values

You can see this reference for more detail.
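As a quick illustration (a minimal sketch on a made-up DataFrame, not from our project data):

import pandas as pd
import numpy as np

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
print(toy.isna().sum())    # number of missing values per column
print(toy.fillna(0))       # replace missing values with 0
print(toy.dropna())        # keep only the rows without missing values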

4. Imputation

Sometimes, dropping missing values loses valuable data. A better strategy is to impute the missing values with some statistic such as the mean, the median, or the most frequent value, or with a constant. This can be done with the SimpleImputer class from the sklearn.impute module.
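For example, a minimal sketch (on a made-up array) that replaces a NaN with the column mean:

from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
imputer = SimpleImputer(strategy = 'mean')
print(imputer.fit_transform(X))
# the NaN is replaced by the mean of the observed values, (1 + 2 + 4) / 3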

5. Variable selection

Selecting the most relevant variables is very important for constructing a machine learning model, for several reasons:

  • to simplify the model, make it easier to interpret
  • to reduce the training time
  • to reduce overfitting

Popular techniques for selecting variables include the chi-squared test, Pearson correlation selection, Lasso, and Recursive Feature Elimination; a small sketch of one of them follows.
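As a minimal sketch (on synthetic data, not our housing dataset), a univariate selector such as SelectKBest from sklearn.feature_selection keeps the k features most related to the target:

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.datasets import make_regression

# toy regression data: 100 samples, 10 features, only 3 of them informative
X, y = make_regression(n_samples = 100, n_features = 10, n_informative = 3, random_state = 0)
selector = SelectKBest(score_func = f_regression, k = 3)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)   # (100, 3): only the 3 highest-scoring features are kept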

6. Variable extraction

Sometimes, our dataset consists of unstructured data such as text or images. It is then necessary to extract features from this data into a format supported by machine learning algorithms. sklearn.feature_extraction is a useful module for this problem.
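For instance, a minimal sketch (the two sentences are made up) that turns raw text into a bag-of-words matrix with CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the house has three bedrooms', 'the house has two bathrooms']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))   # the words found in the corpus
print(X.toarray())                      # word counts for each sentence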

7. Split data into training and testing sets

Separating the dataset into training and testing sets is an important part of preprocessing the data. The training set is used for training the model, and the test set is used for testing the accuracy of the model. Therefore, the training set should be large enough so that the model can “learn” correctly. In practice, most of the data is used for training and a small portion is used for testing.

This task can be helped by the function train_test_split from the module sklearn.model_selection.

Now, it’s time to return to our project! :-)

8. Application to our house price prediction project

Since our dataset includes neither categorical data nor NaN values, we only need to perform a few tasks: rejecting outliers, selecting the most relevant variables, splitting the data into training and testing sets, and normalizing the features.

a. Outlier rejection

Based on the distribution of the price, only a few values are larger than 2.5 million. Therefore, we can take 𝜏 = 2.5 million as a threshold for filtering the outliers: all the houses whose prices are larger than 𝜏 will be dropped from the dataset.

t = 2.5*10**6
df_new = df[df['price'] <= t]

The distribution of the price column in the new dataset:

plt.figure(figsize=(10,7))
sns.distplot(df_new['price'])
plt.xlabel('price', fontsize = 16)
plt.ylabel('Density', fontsize = 16)
plt.show()

b. Variable selection

Some features such as “date”, “id”, and “zipcode” are not really correlated with our target (price), hence they can be dropped to simplify the model.

df_new = df_new.drop(['id','date', 'zipcode'], axis = 1)
df_new.head()

Visualizing the correlation of remaining variables:

plt.figure(figsize = (8,8))
sns.clustermap(df_new.corr())

c. Splitting data into training/testing sets

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df_new, test_size = 0.2, random_state = 0)
print('Train size: ', train_set.shape[0], 'Test size: ', test_set.shape[0])
>>> Train size:  17212 Test size:  4304

Visualizing the price distribution over the country:

plt.figure()
df_new.plot(kind = 'scatter', x = 'long', y = 'lat', alpha = 0.8, c = 'price',
            cmap = plt.get_cmap('jet'), figsize = (12,8))
plt.legend()
plt.show()

d. Normalization

X_train = train_set.drop('price', axis = 1)
y_train = train_set['price']
X_test = test_set.drop('price', axis = 1)
y_test = test_set['price']

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Checking sizes of training and testing sets:

print(X_train.shape, X_test.shape)
>>> (17212, 17) (4304, 17)

Here we have 17,212 observations in the training set and 4,304 observations in the test set.

IV. Modeling and evaluation

There are various machine learning models that you can choose according to the objective, such as Linear Regression, Support Vector Machine, Decision Tree, Random Forest, K-Nearest Neighbors, Neural Network, K-means, …

Conveniently, the implementations of these algorithms in sklearn are all similar; they involve 3 main steps:

  • Step 1: Initializing the model
  • Step 2: Fitting the model on the training set
  • Step 3: Evaluating the model on the test set

Example:

  • Linear regression model:
from sklearn.linear_model import LinearRegression

# initialize the model
model = LinearRegression()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)
  • Logistic regression model:
from sklearn.linear_model import LogisticRegression

# initialize the model
model = LogisticRegression()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)
  • Support Vector Machine model:
from sklearn.svm import SVC

# initialize the model
model = SVC()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)
  • Random forest
from sklearn.ensemble import RandomForestClassifier

# initialize the model
model = RandomForestClassifier()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)

Choosing a metric for evaluating your machine learning algorithm is also very important. In supervised learning, depending on whether your objective is classification or regression, you can choose different metrics (a small sketch follows the list below):

  • Classification metrics: accuracy, loss, ROC curve, confusion matrix, classification report.
  • Regression metrics: Mean Absolute Error, Mean Squared Error (MSE), Root Mean Square Error (RMSE), R² metric.
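As a minimal sketch (with made-up true and predicted values), the regression metrics above can be computed with sklearn.metrics:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# made-up values, just to show the API
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(mae, mse, rmse, r2)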

Application to our project:

Coming back to our house price prediction project, we are going to try different models: Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Since our objective is a regression problem, we select the Root Mean Squared Error (RMSE) to evaluate our models:

RMSE = √( (1/n) ∑ᵢ ( y^{(i)} − f(x^{(i)}) )² )

where n is the number of observations in the test set, and y^{(i)} and f(x^{(i)}) are respectively the true and the predicted targets of x^{(i)}. This metric can be computed with the sklearn.metrics module.

from sklearn.metrics import mean_squared_error
  1. Linear regression
from sklearn.linear_model import LinearRegression

# Initialize the model
lin_reg = LinearRegression()
# Fit the model on the training set
lin_reg.fit(X_train,y_train)
# Evaluate the model on the test set:
y_pred_lin = lin_reg.predict(X_test)
mse_lin = mean_squared_error(y_test,y_pred_lin)
rmse_lin = np.sqrt(mse_lin)
print('RMSE of Linear Regression is: ', round(rmse_lin,1))
RMSE of Linear Regression is: 167347.9

2. Decision tree

from sklearn.tree import DecisionTreeRegressor

# Initialize the model
tree_reg = DecisionTreeRegressor()
# Fit the model on the training set
tree_reg.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred_tree = tree_reg.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)
rmse_tree = np.sqrt(mse_tree)
print('RMSE of Decision Tree is: ', rmse_tree)
RMSE of Decision Tree is: 156869.0

We can see that the result given by the Decision Tree algorithm is better than that of Linear Regression. Let's try the Random Forest algorithm to see whether it beats these two.

3. Random forest

from sklearn.ensemble import RandomForestRegressor

# Initialize the model
forest_reg = RandomForestRegressor()
# Fit the model on the training set
forest_reg.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred_forest = forest_reg.predict(X_test)
mse_forest = mean_squared_error(y_test, y_pred_forest)
rmse_forest = np.sqrt(mse_forest)
print('RMSE of Random Forest method is: ', round(rmse_forest,1))
RMSE of Random Forest method is: 114766.6

Conclusion: among the three methods above, Random Forest gives the best result. In the next section, we are going to improve this algorithm by tuning its hyperparameters so that the model achieves higher performance.

V. Hyperparameter tuning

The GridSearchCV class from the sklearn.model_selection module allows us to train the model with different hyperparameter combinations, and it automatically determines the combination that gives the best performance.

Import the GridSearchCV function:

from sklearn.model_selection import GridSearchCV

First, let's look at the parameters of the current random forest model:

forest_reg
>>> RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                          max_depth=None, max_features='auto', max_leaf_nodes=None,
                          max_samples=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=100, n_jobs=None, oob_score=False,
                          random_state=None, verbose=0, warm_start=False)

There are many parameters that we can change. Now, let's try some of them: bootstrap, max_features, min_samples_split, and n_estimators. The candidate values of these parameters are saved in the dictionary params_grid:

params_grid = [{'bootstrap': [False, True],
'min_samples_split': [2,4,5],
'n_estimators': [100,150,200],
'max_features': [8,10,12]}]

We have 2 × 3 × 3 × 3 = 54 combinations of hyperparameters bootstrap, max_features, min_samples_split, n_estimators.

# Initialize the model
forest_reg = RandomForestRegressor()
# Apply GridSearchCV on our model with all parameters in params_grid
grid_search = GridSearchCV(forest_reg, params_grid, cv = 5,
scoring = 'neg_mean_squared_error',
return_train_score = True)
# Fit all models in the training set
grid_search.fit(X_train, y_train)

Each combination is trained 5 times, corresponding to the cross-validation value cv = 5. Hence, we have in total 54 × 5 = 270 training runs.

Once the training is finished, we can determine the parameters which give us the best estimated model.

The best parameters:

grid_search.best_params_
{'bootstrap': True,
'max_features': 10,
'min_samples_split': 2,
'n_estimators': 150}

So, among the tested values, the model works best when bootstrap = True, max_features = 10, min_samples_split = 2, and n_estimators = 150. The model corresponding to these parameters is given by:

final_model = grid_search.best_estimator_
final_model
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,criterion='mse',
max_depth=None, max_features=10, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)

VI. Prediction.

Once the best model is determined, it can be used to predict the new samples in the testing set:

y_pred_final = final_model.predict(X_test)
mse_final = mean_squared_error(y_test, y_pred_final)
rmse_final = np.sqrt(mse_final)
print('RMSE of final model is: ', round(rmse_final,1))
RMSE of final model is: 110775.0

The following figure visualizes the true and the predicted values of the first 100 data points in the testing set.
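The original figure is not reproduced here; a minimal sketch of how such a plot can be generated (using the variables defined above) is:

plt.figure(figsize = (12,6))
plt.plot(y_test.values[:100], label = 'True price')
plt.plot(y_pred_final[:100], label = 'Predicted price')
plt.xlabel('Test sample index')
plt.ylabel('Price')
plt.legend()
plt.show()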

It seems that our predictions are close to the true values. But the model can still be improved by trying more hyperparameters. (This task is left to you as a practice exercise. :-))

Conclusion:

In this article, we have walked through the main steps of a Machine Learning project. They include gathering, analyzing, and preprocessing data, modeling, evaluating the model, hyperparameter tuning, and finally using the model to predict new data points. Sometimes, we have to try different models with different parameters to choose the one with the highest accuracy. All these models are supported by the sklearn package, which is a powerful tool for Machine learning. You can see this reference for more detail on this package.

I hope that this article helps you plan your project. If you have any questions, please let me know in the comments. All contributions are welcome. ^.^

Thank you for reading! :-)
