Neural networks and machine learning. Now, these are pretty daunting concepts for any beginner in the field of data science. However, today I am going to attempt to allay such apprehensions by using a simple built-in neural network bequeathed to us by the amazing scikit-learn Python library.
People often dive into the deep end straight away, try designing intricate networks, and most of the time get overwhelmed by the complexity of it all. The key to such nuanced algorithms is to first understand their mechanism, and I found that exploring predefined algorithms and trying to decipher how they work was the best way to go about doing this.
This program is the perfect starter project for anyone who wishes to venture into neural networks, mainly because I am going to take you through some pre-built and predefined functions and datasets, and I will explain the functioning of the code and the network used to the best of my ability. So, let’s get started!
First and foremost, we need to do the all-important step of importing every single essential library.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
For this venture, I have considered the Boston Housing Prices dataset. Now, you could go ahead and download the dataset from Kaggle and then import it, but you don’t really need to. If you go the Kaggle route, importing the CSV with the pd.read_csv() function will leave you good to go for the further steps. If, however, you import the dataset through the sklearn library, you will have to code your way through like so.
boston = datasets.load_boston()
#loads boston dataset from sklearn.datasets
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston_df = pd.DataFrame(boston.data, columns = cols)
y = boston.target
boston_df['TARGET'] = y
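As an aside, load_boston has been removed from recent scikit-learn releases (version 1.2 onwards), so if the import above fails for you, the Kaggle CSV route is the way to go. Here is a minimal sketch of that route; the file name 'housing.csv' and the target column name 'MEDV' are assumptions based on the usual Kaggle download, so adjust them to match your file.
boston_df = pd.read_csv('housing.csv') #file name is an assumption; use whatever you downloaded
y = boston_df['MEDV'].values #'MEDV' is the usual name of the target column in the Kaggle CSV
boston_df = boston_df.drop(columns = ['MEDV'])
boston_df['TARGET'] = y #keeps the same structure as the sklearn version above
Either way, you end up with the same data frame, and the rest of the code is identical.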
The columns bit might have left you perplexed, but all I did here was add the column titles to the data frame to reduce obscurity and make the data easier to understand. Here are the columns with a brief description of each.
CRIM : Per capita crime rate by town
ZN : Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS : Proportion of non-retail business acres per town
CHAS : Charles River dummy variable [ 1 - tract bounds the river, 0 - otherwise]
NOX : Nitric oxides concentration (parts per 10 million)
RM : Average number of rooms per dwelling
AGE : Proportion of owner-occupied units built before 1940
DIS : Weighted distances to five Boston employment centers
RAD : Index of accessibility to radial highways
TAX : Full-value property tax rate per $10,000
PTRATIO : Pupil-teacher ratio by town
B : 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT : Percentage of lower-status population
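With the column names in place, it never hurts to peek at the data frame and confirm everything lined up; this quick check is my own addition and its output is omitted here.
print(boston_df.head()) #quick sanity check of the assembled data frame
print(boston_df.shape) #should be (506, 14): 506 rows, 13 features plus 'TARGET'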
After importing the dataset, the first thing I did was check for any missing values. Data sets have this really troublesome tendency to contain missing entries (which pandas represents as ‘NaN’). But if the data set is considerably large, you can just drop the rows with missing values. This bit of preprocessing can be summed up in two lines.
print(boston_df.isnull().sum())
boston_df.dropna(inplace = True)
This gives the following output:
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
TARGET 0
dtype: int64
Luckily, this dataset does not have any missing values, so the data is preserved in its entirety and we can get down to the pivotal part of the entire project.
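If your own dataset is not so lucky and is too small to afford dropping rows, an alternative is to fill the gaps instead. Here is a minimal sketch using scikit-learn’s SimpleImputer; this import and step are my addition and are not part of the original pipeline.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median') #replaces each NaN with that column's median
boston_df = pd.DataFrame(imputer.fit_transform(boston_df), columns = boston_df.columns)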
Datasets often end up being really large, containing loads of columns and features, some of which may have a negligible impact on the target variable. These features just clutter the processing unnecessarily, and working with such data can be tedious. So, to eliminate this inconvenience, I decided to weed out all the redundant and nonessential features and focus only on the variables that actually matter. As complicated as this may sound, the coding part was relatively short and simple.
corr_matrix = boston_df.corr() #creates correlation matrix
val_corr = corr_matrix['TARGET']
print(val_corr) #considers correlation of all features with 'TARGET'
features = []
for col in cols :
    if val_corr[col] >= 0.25 :
        features.append(col)
    if val_corr[col] <= -0.25 :
        features.append(col)
print(features) #weeds out irrelevant features
data_df = pd.DataFrame(columns = features)
for col in features :
    data_df[col] = boston_df[col]
print(data_df.head(10))
#creates a data frame of just the essential features for ease
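For what it’s worth, the same filtering can be written more compactly with pandas; this is a minimal alternative sketch under the same 0.25 threshold, not the code used for the rest of the post.
val_corr = boston_df.corr()['TARGET'].drop('TARGET') #correlation of each feature with the target
features = val_corr[val_corr.abs() >= 0.25].index.tolist() #keep features with |correlation| >= 0.25
data_df = boston_df[features].copy() #data frame of just those features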
Following this, you can even depict the correlation of these features with the target variable using multiple scatter plots. The code and the output look something like this.
l = len(features) #calculates number of relevant features
f = 1
plt.figure(figsize = (23,23))
for col in features :
    plt.subplot(l, 1, f)
    plt.xlabel(col)
    plt.scatter(data_df[col], y, marker = 'o')
    plt.ylabel('TARGET')
    f = f + 1
plt.tight_layout()
plt.show() #plots all features against 'TARGET'
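Since seaborn was imported at the top, another quick way to see the same relationships at a glance is a heatmap of the full correlation matrix; this little extra is my addition and not part of the original plots.
plt.figure(figsize = (10,8))
sns.heatmap(boston_df.corr(), annot = True, cmap = 'coolwarm') #annotated correlation heatmap
plt.show()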
Now, time for the main event. The network.
Scikit-learn, the generous benefactor of every data scientist, also provides us with a Multi-Layer Perceptron (MLP), which is essentially a neural network. The MLP comes in two forms, namely MLPClassifier and MLPRegressor, and as the names suggest, they are used for classification and regression tasks respectively. Since prices are going to be predicted, I have made use of MLPRegressor.
Before I get to the code, I would like to elaborate on how MLPRegressor works. It behaves like any basic feed-forward neural network. It takes the inputs, that is, the values of the features, multiplies them by weights, and passes the results through an activation function (which introduces non-linearity) in the hidden layers. The output layer of MLPRegressor has no activation function, that is to say, it uses the identity function. The error in the predictions is then calculated using the mean squared error function. Any predictive algorithm’s primary objective is to increase the accuracy of its predictions, and this is done by minimizing the error. The MLP does that by finding the gradient of the error function using partial derivatives and, based on that slope, adjusting the weights of the inputs so the predictions better fit the data. This process of course-correction is called back propagation, and one full pass of it over the training data is called an epoch. Training continues until either an appreciable level of accuracy is achieved or the maximum number of epochs is reached.
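To make that concrete, here is a tiny numpy sketch of a single forward pass and one gradient-descent weight update for a toy network with one ReLU hidden neuron and an identity output. The numbers and variable names are made up purely for illustration; this is not what MLPRegressor does internally step for step.
x = np.array([1.0, 2.0]) #one sample with two features
y_true = 3.0 #its target value
w_hidden = np.array([0.5, 0.3]) #weights of the single hidden neuron
w_out = 0.8 #weight of the output neuron
z = np.dot(w_hidden, x) #weighted sum of the inputs
h = max(z, 0.0) #ReLU activation in the hidden layer
y_pred = w_out * h #identity activation at the output
error = (y_pred - y_true) ** 2 #squared error for this sample
grad_out = 2 * (y_pred - y_true) * h #partial derivative of the error w.r.t. w_out
grad_hidden = 2 * (y_pred - y_true) * w_out * (1.0 if z > 0 else 0.0) * x #and w.r.t. w_hidden
lr = 0.01 #learning rate, chosen arbitrarily
w_out = w_out - lr * grad_out #one gradient-descent step
w_hidden = w_hidden - lr * grad_hidden
Repeat that over every sample and over many epochs, and you essentially have what MLPRegressor automates for us. With that intuition in place, here is the actual code.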
X_train, X_test, y_train, y_test = train_test_split(data_df, y, test_size = 0.3, shuffle = True)
net = MLPRegressor(activation = 'relu', hidden_layer_sizes = (10,10,10), max_iter = 2000, solver='lbfgs', alpha = 0.07)
#splits data and creates a basic neural network
net.fit(X_train, y_train)
y_prediction = net.predict(X_test)
results = {'Actual Values' : y_test, 'Predicted Values' : y_prediction}
final_df = pd.DataFrame(results)
print(final_df.head(10))
print('R2 score is: {}'.format(r2_score(y_test, y_prediction)))
Here is the output after the network is done working its magic.
Actual Values Predicted Values
0 13.8 16.700065
1 13.6 15.916129
2 17.8 13.622637
3 18.2 19.700112
4 17.4 16.002389
5 21.4 21.429275
6 17.4 20.753055
7 31.6 32.218705
8 23.8 24.603075
9 19.5 17.852198
R2 score is: 0.8459992804561988
Initially, I was faced with a really low R2 score and found that the predicted values were nowhere near the actual values. So, with the purpose of increasing accuracy, I switched to the ‘relu’ activation function and a higher regularization term (‘alpha’), and voilà.
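Another lever worth pulling is feature scaling. MLPs train far more reliably when the inputs are standardized, which is presumably why StandardScaler was imported at the top even though it never appears in the code above. Here is a minimal sketch of how it could be slotted in; treat it as an optional refinement rather than part of the original run.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) #fit the scaler on the training split only
X_test_scaled = scaler.transform(X_test) #reuse the same scaling for the test split
net.fit(X_train_scaled, y_train)
print('R2 score is: {}'.format(r2_score(y_test, net.predict(X_test_scaled))))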
This should suffice for now.
In the upcoming blogs, I will be formulating some machine learning algorithms in Python from scratch.