# Applied Machine Learning: Part 1

## Prediction Using Linear Regression, LassoCV, ElasticNet, RidgeCV, and xgboost

Want to build a resume-worthy machine learning project with actual real-life significance in just a couple of hours? Are you a beginner in this field, or have no clue where to start? You are at the right place. Read on!

There has been a lot of buzz around Artificial Intelligence and Machine Learning in the last few years. It’s not a surprise, considering that our existence is full of patterns. For me, machine learning is simply pattern recognition on a fundamental level.

If you want to learn the basics, you should first learn statistics, probability, basics of programming, and then the fundamental algorithms of prediction and pattern recognition. **However, this article is not about any of it**. There are several online courses and articles to learn it all.

What I am going to cover here is **applied machine learning**. This series is not concerned with the inner workings of various algorithms; it focuses on where and how to apply these models to problems with real-life significance.

I am assuming that my reader is an absolute beginner to this field (and is using a Windows PC). So, without further delay, let’s get started in building your first project.

#### Step 1: Install the Anaconda Framework

- Download the installation setup here. Select the setup based on your PC specification.
- Install this in your system. It will install **Anaconda Prompt** and **Spyder IDE** as well.
- These can be simply accessed from the search bar on Windows.
- We will be using **Spyder IDE** to do all our programming.

#### Step 2: Choose Your Domain of Application

- The three major areas in applied machine learning would be **Computer Vision** and **Image Analysis**, **Speech Recognition** and **Natural Language Processing**, and **Prediction Analysis**.
- In this tutorial, I’ll cover a classic prediction problem.

#### Step 3: Choosing Your Dataset

- This is one of the **most important** steps of your project.
- Every machine learning project will need a dataset for training and testing purposes. **Kaggle** is an excellent place to pick your dataset.
- Go to the above hyperlink and search for relevant datasets. You can also search for datasets elsewhere.
- You can select your dataset first and then decide what to do with it if you are clueless about what kind of problem you want to solve.
- One of the datasets I spotted is the **‘House Sales in King County, USA’** dataset, which I shall use for predicting the price of a house (discussed in a moment). For prediction purposes, you can choose from a wide variety of sales datasets, product price datasets, or sports datasets to predict who will win, etc.
- The above choice is completely arbitrary and you can choose any dataset that you may find interesting.

#### Step 4: Understanding What Is Inside Your Dataset

- Extract the downloaded zip files and examine the contents inside for the dataset you chose. In my case, it is as follows:
- The House Sales Dataset has various features (19 to be precise) of houses along with their price mentioned. Let’s take all these features as X and the Price as Y, which we want to predict.
- Now open Spyder IDE and create a new project. **Copy the dataset file into the project folder**.

For any project, the very first step would be importing the required libraries. This is done as follows:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt #and so on...

In this way, all the required libraries can be imported. If you don’t happen to have any library pre-installed, you can launch **‘Anaconda Prompt’** from the Start menu search toolbar. With its help, you can install the required library using the simple *pip* command.

pip install pandas #This command would automatically install pandas

In a similar manner, you can install any library given you know its name. The Pandas library is generally used for reading the datasets.

*A small tip before going further*: Whenever you use Spyder to code, use ‘**#%%**’ to divide your code into blocks (cells). You can execute the current block on its own using ‘**Ctrl + Enter**’. This approach keeps the code clean and makes debugging easier.
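As a minimal sketch of the tip above, here is what a script divided into two cells could look like. The variable names are made up for illustration; the only Spyder-specific part is the `#%%` marker at the start of each cell.

```python
#%% Cell 1: prepare some data
values = [3, 1, 2]

#%% Cell 2: work with the data from the previous cell
# Variables created in an earlier cell stay available in the console,
# so this cell can be re-run on its own after Cell 1 has been executed once.
sorted_values = sorted(values)
print(sorted_values)
```

Placing the cursor inside a cell and pressing Ctrl + Enter runs only that cell.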

#### Step 5: Building Your Model: Predicting the Price of a House

- Importing the basic libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

- Reading our dataset-

data = pd.read_csv('kc_house_data.csv')

- Cleaning the data: Let us remove all the unwanted or irrelevant data from our dataset. This can be based on your intuition. Here I’m removing some columns as an example. The following line would remove the ‘*date*’ and ‘*zipcode*’ columns from the dataset.

data = data.drop(['date', 'zipcode'], axis = 1)
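To see what `drop` does without loading the full file, here is a small sketch on a toy DataFrame. The rows are invented for illustration and only a few of the real column names are used.

```python
import pandas as pd

# Toy rows standing in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "date": ["20141013T000000", "20141209T000000"],
    "price": [221900.0, 538000.0],
    "bedrooms": [3, 3],
    "zipcode": [98178, 98125],
})

# axis=1 means "drop columns" (axis=0 would drop rows).
# drop() returns a new DataFrame and leaves the original untouched.
cleaned = toy.drop(["date", "zipcode"], axis=1)

print(list(toy.columns))      # the original still has all four columns
print(list(cleaned.columns))  # only the columns that were not dropped
```

This is why the article assigns the result back to `data`: without the assignment, the dropped columns would still be there.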

- Our problem has a lot of **known features (X)** *(like number of bedrooms, living area space, etc.)* that the **price (Y)** is dependent on. We now need to plot a correlation matrix to understand which of the features strongly influence the price of a house. This is visualized using the seaborn library. This step may be carried out to understand our data better.

corr = data.corr()

mask = np.zeros_like(corr, dtype=bool) #the built-in bool works across NumPy versions (np.bool is unavailable in some releases)

mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(12, 9))

cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
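If the heatmap is hard to read, the same information can be listed as numbers. The sketch below is one possible way to do that, assuming a synthetic dataset in which `price` is built mostly from `sqft_living`; the column `unrelated` and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends strongly on sqft_living, weakly on bedrooms.
rng = np.random.default_rng(0)
n = 200
sqft_living = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n)
unrelated = rng.normal(0, 1, n)  # a column with no link to price
price = 200 * sqft_living + 10000 * bedrooms + rng.normal(0, 50000, n)

toy = pd.DataFrame({
    "price": price,
    "sqft_living": sqft_living,
    "bedrooms": bedrooms,
    "unrelated": unrelated,
})

# Take the 'price' column of the correlation matrix, remove price itself,
# and sort so the most strongly correlated feature comes first.
corr_with_price = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

On the real dataset, the equivalent would be `data.corr()["price"]`, which gives a ranked list to compare against the heatmap.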

- Choosing an algorithm: Whenever we have a situation where **‘Y’** depends on several **‘X’**, we can approach it using **Linear Regression**, where we try to establish a linear relation between Y and all X’s, i.e. **Y = a1X1 + a2X2 + …** and so on. So, given **X1, X2, …**, we can predict the value of Y provided we know the values of **a1, a2, …** and so on. That’s exactly what we are trying to do here: finding the best approximations for **a1, a2, …** Generally, the dataset is split into 70% training data, on which the model is fitted, and 30% testing data, on which the trained model is tested to measure performance.

features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'grade', 'sqft_above', 'sqft_basement', 'condition', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

X = data[features]

y = data.price

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split #must be imported before it is used

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

regressor = LinearRegression()

regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
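To make the idea of "finding a1, a2, …" concrete, here is a small sketch on synthetic data where the true coefficients are known in advance. The relation `y = 3*X1 + 5*X2 + 7` is an assumption chosen for illustration; note that scikit-learn's `LinearRegression` also fits a constant term (the intercept) by default.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known relation: y = 3*X1 + 5*X2 + 7, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3 * X[:, 0] + 5 * X[:, 1] + 7 + rng.normal(0, 0.1, 500)

# Same 70/30 split as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# The fitted coefficients should land close to the true values 3 and 5,
# and the intercept close to 7.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on the held-out 30%:", model.score(X_test, y_test))
```

For the house model, `regressor.coef_` can be inspected in the same way, with one coefficient per entry in `features`.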

- Our model is now trained and tested. We can use some standard metrics to measure how well our model performs.

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

df.head()

#Results

Mean Absolute Error: 125002.07442244422

Mean Squared Error: 43690646677.93878

Root Mean Squared Error: 209023.07690285967

#First 5 Predictions of our model

Actual Predicted

297000.0 3.872015e+05

1578000.0 1.502715e+06

562100.0 5.274534e+05

631500.0 5.779358e+05

780000.0 9.993390e+05
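The three metrics are simple enough to compute by hand. As a sketch, the block below recomputes them with plain NumPy on only the five rows listed above, so the numbers will differ from the full test-set results.

```python
import numpy as np

# The five actual and predicted prices listed above.
actual = np.array([297000.0, 1578000.0, 562100.0, 631500.0, 780000.0])
predicted = np.array([3.872015e+05, 1.502715e+06, 5.274534e+05,
                      5.779358e+05, 9.993390e+05])

errors = predicted - actual

mae = np.mean(np.abs(errors))   # Mean Absolute Error: average size of the miss
mse = np.mean(errors ** 2)      # Mean Squared Error: punishes large misses more
rmse = np.sqrt(mse)             # Root Mean Squared Error: back in price units

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
```

Because squaring gives large misses more weight, RMSE is never smaller than MAE; a big gap between the two hints that a few predictions are far off.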

- Let us now visualize our predictions with actual prices for better understanding. A straight line with a slope of 45 degrees would indicate the perfect model.

plt.scatter(y_test, y_pred)

plt.xlabel("Prices")

plt.ylabel("Predicted prices")
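The 45-degree line can be drawn on the plot so that the comparison is visible. The sketch below is one way to do it, using only the five actual and predicted prices listed earlier as sample points; with the full data you would pass `y_test` and `y_pred` instead.

```python
import matplotlib.pyplot as plt
import numpy as np

# Sample points: the first five actual and predicted prices from the article.
actual = np.array([297000.0, 1578000.0, 562100.0, 631500.0, 780000.0])
predicted = np.array([3.872015e+05, 1.502715e+06, 5.274534e+05,
                      5.779358e+05, 9.993390e+05])

fig, ax = plt.subplots()
ax.scatter(actual, predicted)

# Reference line y = x: a perfect model would put every point on it.
lo = min(actual.min(), predicted.min())
hi = max(actual.max(), predicted.max())
ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray",
        label="perfect prediction")

ax.set_xlabel("Prices")
ax.set_ylabel("Predicted prices")
ax.legend()
```

Points above the dashed line are houses the model overpriced, and points below it are houses the model underpriced.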

- It may seem as if we have got a lot of error. The performance is definitely poor, but to some extent the model was able to give an estimate of the price. As can be seen in the first five predictions, for example, our model predicted the price of a house to be **1,502,715** when the actual price was **1,578,000**. This is fine as a very rough estimate, but there is huge scope for improvement.
- Improving our model: Keep a note of the **Root Mean Squared Error: 209023.07690285967** for our model. Let us see if we can reduce this error. Here are some bonus techniques for you to try.
- I’ll be implementing three advanced techniques, namely **LassoCV**, **ElasticNet**, and **RidgeCV**. These are regularized variants of linear regression that address some of its shortcomings. For now, you can just familiarize yourself with these techniques and see how they are practically implemented. Once you master linear regression, you should definitely check out the background working of these models. Here CV stands for **Cross-Validation**. Observe how we don’t use any X_test or y_test with these models. That is because we use the entire dataset for both training and testing. The data is divided into parts (folds), and the training process is repeated once per fold, each time using one fold as test data and the rest of the data for training. The error reported is the average over all iterations. LassoCV and RidgeCV additionally use cross-validation internally to pick the best value of alpha from the list supplied.

from sklearn.linear_model import LassoCV, RidgeCV, ElasticNet

from sklearn.model_selection import cross_val_score

#Implementation of LassoCV

lasso = LassoCV(alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100])

print("Root Mean Squared Error (Lasso): ", np.sqrt(-cross_val_score(lasso, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

#Implementation of ElasticNet

elastic = ElasticNet(alpha=0.001)

print("Root Mean Squared Error (ElasticNet): ", np.sqrt(-cross_val_score(elastic, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

#Implementation of RidgeCV

ridge = RidgeCV(alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100])

print("Root Mean Squared Error (Ridge): ", np.sqrt(-cross_val_score(ridge, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

#Results

Root Mean Squared Error (Lasso): 203421.22072610114

Root Mean Squared Error (ElasticNet): 203442.40673916895

Root Mean Squared Error (Ridge): 203411.816202574
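To show what `cross_val_score` is doing behind the scenes, here is a rough sketch that performs the same 10-fold loop by hand. It uses synthetic data and a plain `Ridge` model with a fixed `alpha=1.0` instead of `RidgeCV`, purely to keep the example short; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, 200)

model = Ridge(alpha=1.0)

# For a regressor, cv=10 means ten consecutive folds without shuffling,
# which is what KFold(n_splits=10) produces.
folds = KFold(n_splits=10)

fold_rmse = []
for train_idx, test_idx in folds.split(X):
    model.fit(X[train_idx], y[train_idx])       # train on nine folds
    pred = model.predict(X[test_idx])           # test on the remaining fold
    fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], pred)))

manual = np.mean(fold_rmse)

# The one-liner used in the article computes the same average.
shortcut = np.sqrt(
    -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
).mean()

print("manual loop     :", manual)
print("cross_val_score :", shortcut)
```

Every data point is used for testing exactly once, which is why no separate test set is needed for this kind of evaluation.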

- As you can see, the lowest error we could reach is **203,411** using RidgeCV. That is still not very impressive. Now, let me try out another method known as **xgboost** and see how it performs.

#Implementation of xgboost

import xgboost as xgb

regr = xgb.XGBRegressor(colsample_bytree=0.2, gamma=0.0, learning_rate=0.01, max_depth=4, min_child_weight=1.5, n_estimators=7200, reg_alpha=0.9, reg_lambda=0.6, subsample=0.2, random_state=42, verbosity=0) #random_state and verbosity replace the older seed and silent arguments

regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

print("Root Mean Squared Error (Xgboost): ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(y_test, y_pred)

plt.xlabel("Prices")

plt.ylabel("Predicted prices")

plt.title("xgboost")

#Result

Root Mean Squared Error (Xgboost): 122692.65863401273

- And there you go! That’s a significant improvement over our previous models. So far, xgboost has turned out to be the best model for this problem. xgboost is a fast and robust gradient boosting library that often performs well on prediction problems. Also, please note that depending on the size and nature of the dataset you selected, the performance of your models could be much better than what I have got here.
- Let’s check how well our model works on a case-by-case basis by looking at the first five predictions.

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

df.head()

#Result with xgboost

Actual Predicted

297000.0 2.918043e+05

1578000.0 1.695336e+06

562100.0 5.116367e+05

631500.0 5.994548e+05

780000.0 6.963796e+05

This is much better than linear regression. For example, check the fourth prediction. **631,500** was the actual price and our model came as close as **599,454**.

With that, we can conclude our first project on prediction! You can now directly use the trained model to estimate the price of a real house, given you have all the **‘X’** features as taken from the dataset by just using this line of code.

Predicted_price = model_name.predict(X)
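As a sketch of how that line could be used for a single house, the block below trains a small stand-in model on made-up rows with only two features and then predicts the price of one new house. With the real model, the one-row DataFrame would need all the columns in `features`, in the same order.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in training data with two features; the values are invented.
features = ["bedrooms", "sqft_living"]
train = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 5],
    "sqft_living": [800, 1200, 1500, 2000, 3000],
    "price": [180000.0, 250000.0, 300000.0, 410000.0, 600000.0],
})
model = LinearRegression().fit(train[features], train["price"])

# Describe the new house as a one-row DataFrame. Selecting [features]
# guarantees the column order matches what the model was trained on.
new_house = pd.DataFrame([{"bedrooms": 3, "sqft_living": 1400}])[features]

# predict() returns an array with one value per row, so take the first.
predicted_price = model.predict(new_house)[0]
print("Estimated price:", round(predicted_price))
```

The same pattern works for `regressor` or `regr` from this article once `new_house` contains every feature the model was trained on.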

If you have taken your own dataset and problem statement and followed the same steps, you would have reached your project conclusion with similar real-life applicability.

*In case of any doubts or clarifications in applied machine learning or if you get stuck somewhere in implementing your model, feel free to ask down in the responses below.*

*Stay tuned for the next article where we will explore more diverse models and their application.*

*Clap and share if you found this useful and do follow ‘The Research Nest’ for more insightful content.*