Machine learning engineers need more than good programming skills: they also need the skills of a data scientist for collecting and managing data, the skills of a statistician for analyzing data, and the skills of a mathematician. This is because a machine learning project involves many steps, from managing data and building and evaluating models, to applying the final model to predict new data in the testing set. In this article, we are going to walk through all of these steps, including:

- Gathering data
- Exploratory Data Analysis (EDA)
- Preprocessing data
- Modelization and evaluation
- Hyperparameter tuning
- Prediction.

All of these tasks will be illustrated in detail through a project that predicts house prices in King County. The dataset is collected from Kaggle.

# I. Gathering data

As I presented in the previous article, data is essential for building a machine learning model, because the model learns from the dataset that you provide to it. Hence, collecting and managing data play a key role in a machine learning project.

There are many sources for collecting data; here are some open sources from which you can download a variety of datasets to feed your model:

- Kaggle is an online community of data scientists where you can find many Machine learning, Deep learning projects as well as open data sources. It is a useful page to learn Machine learning through doing real-life projects or participating in competitions.
- UCI Machine Learning Repository provides a collection of databases for the machine learning community. It currently maintains around 560 datasets.
- Open Data on AWS is a place where users can share datasets that are available via AWS resources.
- OpenDataSoft is a source that contains more than 2600 open data portals around the world.
- …

Our dataset is saved in a table (.csv) format. It can be read by the **read_csv** function from the **pandas** package:

```python
import pandas as pd

df = pd.read_csv('kc_house_data.csv')
```

Checking the data:

`df.head()`

# II. Exploratory Data Analysis (EDA)

The objective of this step is to understand the dataset as much as possible so that we can figure out a quick strategy for the modelization step.

Since machine learning often works on structured data, saved in .csv or .xlsx formats, this post focuses on analyzing that type of data.

Firstly, we can start with some basic analysis, such as discovering the target variable, the number of rows and columns, the data type of each column, and checking whether any NaN values exist in the dataset.

For our dataset **kc_house_data.csv:**

- The target feature in our case is the price column.
- This dataset includes 21613 rows and 21 columns.

```python
print(df.shape)
>>> (21613, 21)
```

- Discover the type of each column:

`df.dtypes`

This result shows that almost all our features have a numerical type (integer or float); only the “date” column has an object type.

- Visualizing the ratio of types by pie plot:

`df.dtypes.value_counts().plot.pie()`

- Verifying if the dataset includes any NaN value:

`df.isna().sum()`

This result shows that no column contains any missing values.

- Describing statistical values of the dataset using the function **df.describe()**. This function computes some basic statistics for each numerical column, such as the number of data points, min, max, mean, standard deviation (std), and quantiles.

`df.describe().transpose()`

- To better understand the dataset, we can also use histograms or bar plots to discover the distribution of each feature.

The following figure visualizes the distribution of the price column:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.distplot(df['price'], hist=True, label='Price')
plt.show()
```

Based on the figure above, we see that:

- Most house prices are distributed between 0 and 1 million dollars.
- Prices around 0.5 million dollars appear most frequently.
- There exist some outlier values that we can remove to prevent their influence on our ML model.

This plot not only helps us figure out where the data values are most concentrated, it is also useful for detecting outliers.

The house price obviously depends on the number of floors, bedrooms, and bathrooms. So it is also interesting to visualize these features:

```python
import matplotlib.pyplot as plt
import seaborn as sns
```

**Floors:**

```python
plt.figure()
sns.countplot(df['floors'])
plt.show()
```

**Bathrooms:**

```python
plt.figure(figsize=(12, 5))
sns.countplot(df['bathrooms'])
plt.show()
```

**Bedrooms:**

```python
plt.figure(figsize=(7, 5))
sns.countplot(df['bedrooms'])
plt.show()
```

- Furthermore, discovering the correlation between variables is also very important. Thanks to this analysis, we can select the variables that are most correlated with our target feature and ignore the weakly correlated ones.

```python
plt.figure(figsize=(14, 8))
sns.heatmap(df.corr(), linewidths=0.5, annot=True)
```

Find the correlations of the price with other features:

`df.corr()["price"].sort_values(ascending = False)`

The above result shows that the price is highly correlated with variables such as **sqft_living**, **grade**, **sqft_above**, **sqft_living15**, and **bathrooms**. The correlations of **id** and **zipcode** with the price are very weak.

# III. Preprocessing data

As you know, machine learning algorithms learn from the data you provide. If the dataset is not good (it contains missing values or outliers, or the features are not presented in a correct format), then the model built from this data will perform badly. Therefore, preparing the data before feeding it to the model is very important; it takes around 80% of a data scientist’s working time.

Among preprocessing techniques, commonly used methods include encoding, normalization, imputation, outlier and NaN rejection, variable selection, and variable extraction. Let’s discover how these techniques work before applying them to preprocess our dataset.

## 1. Encoding

**Encoding** is a method to convert categorical data into numbers before using it to fit the model. The two most popular techniques are **Ordinal Encoding** and **One-Hot Encoding**.

In **Ordinal Encoding**, each categorical value is encoded by an integer. For example, given the three categorical values [“dog”, “cat”, “bird”], ordinal encoding converts them into the three integers 0, 1, and 2, according to alphabetical (dictionary) order. Hence, “bird” is assigned to 0, “cat” to 1, and “dog” to 2. In Python, it is done by **OrdinalEncoder** from the module **sklearn.preprocessing**:

```python
from sklearn.preprocessing import OrdinalEncoder
from numpy import asarray

example = asarray([['dog'], ['cat'], ['bird']])
encoder = OrdinalEncoder()
encode_example = encoder.fit_transform(example)
print(encode_example)
>>> [[2.]
 [1.]
 [0.]]
```

We can also reverse the transformation to find the original categorical value of an encoded value:

```python
import numpy as np

encoder.inverse_transform(np.array([[0], [2], [2]]))
>>> array([['bird'], ['dog'], ['dog']], dtype='<U4')
```

In cases where there is no ordinal relationship between values, integer encoding may not be suitable. It can be replaced by the one-hot encoding technique, which converts each value into a vector whose length equals the number of categories. If a data point belongs to the iᵗʰ category, then the iᵗʰ component of this vector is set to 1, and the others are set to 0.

```python
from sklearn.preprocessing import OneHotEncoder
from numpy import asarray

example = asarray([['dog'], ['cat'], ['bird']])
encoder = OneHotEncoder(sparse = False)
encode_example = encoder.fit_transform(example)
print(encode_example)
>>> [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
```

In this example, “dog” is assigned to vector [0, 0, 1], “cat” to [0, 1, 0] and “bird” to [1, 0, 0].

## 2. Normalization

Sometimes, the ranges of numerical features in the raw data vary widely, so it is necessary to normalize them before fitting the data to our machine learning model. The goal of normalization is to bring the values of numerical columns to a common scale, without losing the differences between values. Now, let’s discover the most popular techniques in **sklearn** for normalizing features:

**Min-max normalization:**

This method rescales the range of features to the new range [0, 1]. The general formula is given by:

X′ = (X − min(X)) / (max(X) − min(X))

where X is the original value and X′ is the normalized value.

In Python, this formula is computed by **MinMaxScaler** from the module **sklearn.preprocessing:**

**Example**: The vector (1, 2, 3) is rescaled to (0, 0.5, 1)

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[1], [2], [3]])
scaler = MinMaxScaler()
scaler.fit_transform(X)
>>> array([[0. ],
 [0.5],
 [1. ]])
```

**Standardization (Z-score normalization):**

This method rescales the feature to a new range with a zero mean and a standard deviation of 1. The formula of this technique is given by:

X′ = (X − mean(X)) / σ

where mean(X) and σ denote the average value and the standard deviation of X, respectively.

The function **StandardScaler** belongs to the **sklearn.preprocessing** module.

**Example:** The new scale of the vector (1, 2, 3) by this method is (−1.22474487, 0, 1.22474487)

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1], [2], [3]])
scaler = StandardScaler()
scaler.fit_transform(X)
>>> array([[-1.22474487],
 [ 0.        ],
 [ 1.22474487]])
```

**Robust scaler:**

The two methods above are not really suitable when the dataset includes outliers. This drawback can be overcome by the Robust Scaler method, which takes the median and the interquartile range into account. The normalization formula of this technique is given by:

X′ = (X − median(X)) / IQR(X)

where median(X) and IQR(X) denote the median and the interquartile range of X, respectively.
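For a quick sketch of this formula in action, **RobustScaler** from **sklearn.preprocessing** applies it column by column (the data values below are made up for illustration):

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

# A small sample with one obvious outlier (100)
X = np.array([[1], [2], [3], [4], [100]])

scaler = RobustScaler()  # uses the median and the IQR (25th-75th percentiles)
print(scaler.fit_transform(X).ravel())
# median = 3, IQR = 4 - 2 = 2, so the values become:
# 1 -> -1.0, 2 -> -0.5, 3 -> 0.0, 4 -> 0.5, 100 -> 48.5
```

Note how the outlier no longer distorts the scale of the other values, unlike with min-max normalization.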

## 3. NaN rejections

When the dataset includes missing values, they need to be rejected or replaced before fitting the data to the model. Pandas provides some useful functions to deal with this problem:

- **pandas.DataFrame.isna()** determines whether there are any missing values in the DataFrame
- **pandas.DataFrame.fillna(α)** replaces the missing values in the DataFrame with a given value α
- **pandas.DataFrame.dropna()** drops the rows containing missing values from the DataFrame

You can see this reference for more detail.
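A minimal sketch of these three functions on a toy DataFrame (the values here are invented purely for illustration):

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

print(df_toy.isna().sum())  # number of missing values per column: a -> 1, b -> 1
filled = df_toy.fillna(0)   # replace every NaN by the value 0
dropped = df_toy.dropna()   # keep only the rows without any NaN
print(dropped.shape)        # (1, 2): only the first row has no NaN
```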

## 4. Imputation

Sometimes, dropping missing values loses valuable data. A better strategy is to impute these missing values with statistical values such as the mean, the median, the most frequent value, or some constant. This can be done with the **SimpleImputer** class from the **sklearn.impute** module.
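As a small sketch of mean imputation with **SimpleImputer** (the data is made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

imputer = SimpleImputer(strategy='mean')  # replace each NaN by the column mean
X_imputed = imputer.fit_transform(X)
print(X_imputed.ravel())  # the NaN becomes (1 + 3) / 2 = 2
```

Other strategies such as `'median'`, `'most_frequent'`, or `'constant'` can be passed the same way.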

## 5. Variable selection

Selecting the most relevant variables is very important when constructing a machine learning model, for several reasons:

- to simplify the model, make it easier to interpret
- to reduce the training time
- to reduce overfitting

There are some popular techniques for selecting variables, such as the chi-squared test, Pearson correlation selection, Lasso, and Recursive Feature Elimination.
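As one possible sketch, score-based selection can be done with **SelectKBest** from **sklearn.feature_selection** (here with the `f_regression` score on synthetic data, where the target depends almost only on the first feature):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                     # 5 candidate features
y = 3 * X[:, 0] + 0.01 * rng.rand(100)   # target driven by feature 0

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)        # (100, 2): only the 2 highest-scoring features are kept
print(selector.get_support())  # boolean mask of the selected features
```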

## 6. Variable extraction

Sometimes, our dataset consists of unstructured data such as text and images. It is then necessary to extract features from this data into a format supported by machine learning algorithms. **sklearn.feature_extraction** is a useful module for this problem.
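For text data, a common sketch is the bag-of-words representation with **CountVectorizer** from this module (the tiny corpus below is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat', 'the dog sat', 'the cat and the dog']

vectorizer = CountVectorizer()             # bag-of-words: count each word per document
X_text = vectorizer.fit_transform(corpus)  # sparse matrix (n_documents, n_words)
print(sorted(vectorizer.vocabulary_))      # ['and', 'cat', 'dog', 'sat', 'the']
print(X_text.toarray())                    # one row of word counts per document
```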

## 7. Split data into training and testing sets

Separating the dataset into training and testing sets is an important part of preprocessing. The training set is used for training the model, and the test set is used for testing the accuracy of the model. Therefore, the training set should be large enough that the model can “learn” correctly. In practice, most of the data is used for training and a small portion is used for testing.

This task can be helped by the function **train_test_split** from the module **sklearn.model_selection**.
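A minimal sketch of this function on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features
y = np.arange(10)

# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```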

*Now, it’s time to return to our project! :-)*

## 8. Application to our house price prediction project

Since our dataset includes neither categorical data nor NaN values, we only need to reject outliers, select the most relevant variables, and split the data into training and testing sets.

**a. Outlier rejection**

Based on the distribution of the price, only a few values are larger than 2.5 million dollars. Therefore, we can take 𝜏 = 2.5 million as a threshold for filtering outliers: all the houses whose prices are larger than 𝜏 are dropped from the dataset.

```python
t = 2.5 * 10**6
df_new = df[df['price'] <= t]
```

The distribution of the price column in the new dataset:

```python
plt.figure(figsize=(10, 7))
sns.distplot(df_new['price'])
plt.xlabel('price', fontsize=16)
plt.ylabel('Density', fontsize=16)
plt.show()
```

**b. Variable selection**

Some features, such as “date”, “id”, and “zipcode”, are not really correlated with our target (price), hence they can be dropped to simplify the model.

```python
df_new = df_new.drop(['id', 'date', 'zipcode'], axis=1)
df_new.head()
```

Visualizing the correlation of remaining variables:

```python
plt.figure(figsize=(8, 8))
sns.clustermap(df_new.corr())
```

**c. Splitting data into training/testing sets**

```python
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df_new, test_size=0.2, random_state=0)
print('Train size: ', train_set.shape[0], 'Test size: ', test_set.shape[0])
>>> Train size: 17212 Test size: 4304
```

Visualizing the price distribution over the country:

```python
plt.figure()
df_new.plot(kind='scatter', x='long', y='lat', alpha=0.8, c='price',
            cmap=plt.get_cmap('jet'), figsize=(12, 8))
plt.legend()
plt.show()
```

**d. Normalization**

```python
X_train = train_set.drop('price', axis=1)
y_train = train_set['price']
X_test = test_set.drop('price', axis=1)
y_test = test_set['price']

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Checking sizes of training and testing sets:

```python
print(X_train.shape, X_test.shape)
>>> (17212, 17) (4304, 17)
```

Here we have 17,212 observations in the training set and 4,304 observations in the test set.

# IV. Modelization and evaluation

There are various machine learning models that you can choose according to the objective, such as Linear Regression, Support Vector Machine, Decision Tree, Random Forest, K-Nearest Neighbors, Neural Network, K-means, …

Conveniently, the implementations of these algorithms in **sklearn** are all similar; each contains 3 main steps:

- Step 1: Initializing the model
- Step 2: Fitting the model on the training set
- Step 3: Evaluating the model on the test set

**Example:**

- Linear regression model:

```python
from sklearn.linear_model import LinearRegression

# initialize the model
model = LinearRegression()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)
```

- Logistic regression model:

```python
from sklearn.linear_model import LogisticRegression

# initialize the model
model = LogisticRegression()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)
```

- Support Vector Machine model:

```python
from sklearn.svm import SVC

# initialize the model
model = SVC()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)
```

- Random forest

```python
from sklearn.ensemble import RandomForestClassifier

# initialize the model
model = RandomForestClassifier()
# fit the model on the training set
model.fit(X_train, y_train)
# evaluate the model on the test set
y_pred = model.predict(X_test)
```

- …

Choosing a metric for evaluating your machine learning algorithms is also very important. In supervised learning, depending on whether your objective is classification or regression, you can choose different metrics:

- **Classification metrics**: accuracy, loss, ROC curve, confusion matrix, classification report.
- **Regression metrics**: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R² metric.
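As a quick sketch of two of the classification metrics from **sklearn.metrics** (the labels below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels for a binary classification task
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))    # 4 of 5 labels are correct -> 0.8
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
```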

**Application to our project:**

Coming back to our house price prediction project, we are going to try different models: Linear Regression, Decision Tree regressor, and Random Forest regressor. Since our objective is a regression problem, we select the Root Mean Squared Error (RMSE) to evaluate our models:

RMSE = √( (1/n) ∑ᵢ₌₁ⁿ (y⁽ⁱ⁾ − f(x⁽ⁱ⁾))² )

where n is the number of observations in the test set, and y⁽ⁱ⁾ and f(x⁽ⁱ⁾) are the true and the estimated targets of x⁽ⁱ⁾, respectively. The underlying **mean_squared_error** can be imported from the **sklearn.metrics** module.

`from sklearn.metrics import mean_squared_error`

**1. Linear regression**

```python
from sklearn.linear_model import LinearRegression

# Initialize the model
lin_reg = LinearRegression()
# Fit the model on the training set
lin_reg.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred_lin = lin_reg.predict(X_test)
mse_lin = mean_squared_error(y_test, y_pred_lin)
rmse_lin = np.sqrt(mse_lin)
print('RMSE of Linear Regression is: ', round(rmse_lin, 1))
```

RMSE of Linear Regression is: 167347.9

**2. Decision tree**

```python
from sklearn.tree import DecisionTreeRegressor

# Initialize the model
tree_reg = DecisionTreeRegressor()
# Fit the model on the training set
tree_reg.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred_tree = tree_reg.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)
rmse_tree = np.sqrt(mse_tree)
print('RMSE of Decision Tree is: ', rmse_tree)
```

RMSE of Decision Tree is: 156869.0

We can see that the result given by the Decision Tree algorithm is better than that of Linear Regression. Let’s try the Random Forest algorithm to see if it outperforms both.

**3. Random forest**

```python
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
forest_reg = RandomForestRegressor()
# Fit the model on the training set
forest_reg.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred_forest = forest_reg.predict(X_test)
mse_forest = mean_squared_error(y_test, y_pred_forest)
rmse_forest = np.sqrt(mse_forest)
print('RMSE of Random Forest method is: ', round(rmse_forest, 1))
```

RMSE of Random Forest method is: 114766.6

**Conclusion**: Among the three methods above, the Random Forest method gives the best result. In the next section, we are going to improve this algorithm by finding better hyperparameters so that the model achieves a higher performance.

# V. Hyperparameter tuning

The function **GridSearchCV** from the **sklearn.model_selection** module allows us to train the model with different hyperparameter combinations, and it automatically determines the combination that gives the best performance.

Import the **GridSearchCV** function:

`from sklearn.model_selection import GridSearchCV`

Firstly, let’s discover the parameters of the actual random forest model:

```python
forest_reg
>>> RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                          max_depth=None, max_features='auto', max_leaf_nodes=None,
                          max_samples=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=100, n_jobs=None, oob_score=False,
                          random_state=None, verbose=0, warm_start=False)
```

There are many parameters that we can change. Now, let’s try with the parameters bootstrap, max_features, min_samples_split, and n_estimators. The tested values of these parameters are saved in the dictionary **params_grid**:

```python
params_grid = [{'bootstrap': [False, True],
                'min_samples_split': [2, 4, 5],
                'n_estimators': [100, 150, 200],
                'max_features': [8, 10, 12]}]
```

We have 2 × 3 × 3 × 3 = 54 combinations of hyperparameters bootstrap, max_features, min_samples_split, n_estimators.

```python
# Initialize the model
forest_reg = RandomForestRegressor()
# Apply GridSearchCV to our model with all parameters in params_grid
grid_search = GridSearchCV(forest_reg, params_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
# Fit all models on the training set
grid_search.fit(X_train, y_train)
```

Each model is trained 5 times, corresponding to the cross-validation value cv = 5. Hence, we have in total 54 × 5 = 270 training runs.

Once the training is finished, we can determine the parameters which give us the best estimated model.

The best parameters:

`grid_search.best_params_`

```python
{'bootstrap': True,
 'max_features': 10,
 'min_samples_split': 2,
 'n_estimators': 150}
```

So, for the given parameters, the model works best when bootstrap = True, max_features = 10, min_samples_split = 2, and n_estimators = 150. The model corresponding to these parameters is given by:

`final_model = grid_search.best_estimator_`

```python
final_model
>>> RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                          max_depth=None, max_features=10, max_leaf_nodes=None,
                          max_samples=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=150, n_jobs=None, oob_score=False,
                          random_state=None, verbose=0, warm_start=False)
```

# VI. Prediction

Once the best model is determined, it can be used to predict the new samples in the testing set:

```python
y_pred_final = final_model.predict(X_test)
mse_final = mean_squared_error(y_test, y_pred_final)
rmse_final = np.sqrt(mse_final)
print('RMSE of final model is: ', round(rmse_final, 1))
```

RMSE of final model is: 110775.0

The following figure visualizes the true and the predicted values of the first 100 data points in the testing set.
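A sketch of the code behind this kind of figure. Here a small synthetic regression problem stands in for our `final_model`, `X_test`, and `y_test`, so the data and model are illustrative only; in the project you would plot `y_test[:100]` against `y_pred_final[:100]` instead:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house-price data
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.rand(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Plot the true vs. predicted targets of the first 100 test points
n = 100
plt.figure(figsize=(12, 6))
plt.plot(range(n), y_test[:n], label='True values')
plt.plot(range(n), y_pred[:n], label='Predicted values')
plt.xlabel('Test data point')
plt.ylabel('Target')
plt.legend()
plt.savefig('true_vs_pred.png')
```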

It seems that our predictions are close to the true values. However, the model can still be improved by trying more hyperparameters. (*This task is left to you as a practice exercise. :-)*)

**Conclusion:**

In this article, we have discovered the main steps of a machine learning project: gathering, analyzing, and preprocessing data, modelization, evaluating the model, hyperparameter tuning, and finally using the model to predict new data points. Sometimes we have to try different models with different parameters in order to choose the best one, whose accuracy is the highest. All of these models are supported by the sklearn package, which is a powerful tool for machine learning. You can see this reference for more details on this package.

I hope that this article helps you plan your project. If you have any questions, please let me know in the comments. All contributions are welcome. ^.^

Thank you for reading! :-)