Introduction to Neural Networks, from scratch for practical learning (Part 2)

Pranjall Kumar
19 min read · Nov 9, 2021


Welcome to Part 2 of ‘Introduction to Neural Networks, from scratch for practical learning’. Considering that 10 minutes of readable content takes about 10 hours to curate, I am sorry I made you guys wait.

Fine then, as promised, this part is going to be on ANNs, implemented from scratch. If you have not read Part 1 of this article, I would urge you to kindly do so first. It can be found here: Introduction to Neural Networks, from scratch for practical learning (Part 1). It is a long read, but it covers some crucial fundamentals required to understand ANNs properly by first understanding a single neuron (the perceptron).

Kindly note that this example is just for a practical understanding of what actually happens when we train a neural network. If you are building a neural network to solve a real problem, I highly recommend using the ‘Keras’ library. It has lots of optimizations and facilities to help you train a reliable neural network quickly, even on large datasets.

With all that out of the way, we can now start coding to build an ANN from scratch. As usual, I will assume that you are using Google Colab, preferably in dark mode. But it can always be done offline in a Jupyter Notebook if you wish, using plotly offline.

Guiding you through the mundane steps, first, mount your Drive onto Google Colab using this simple code and grant the necessary permissions.

#code to mount drive
from google.colab import drive
drive.mount('/content/drive')

I will use a new library called ‘category_encoders’ for handling the categorical data present in the dataset. Unfortunately, Google Colab doesn’t have it pre-installed, so you will have to install it manually using this simple code.

#installing category encoders
!pip install category_encoders

Wait for it to install, which shouldn’t take much time, then proceed to import the required libraries. The following code imports the necessary libraries and sets the default renderer to Google Colab. Except for ‘plotly.figure_factory’, everything here is needed.

#importing libraries
import time
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objs as go
import category_encoders as ce
import plotly.figure_factory as ff
from sys import stdout
from plotly import subplots
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
#setting the renderer as colab
pio.renderers.default = "colab"

In this part, I will show you how neural networks perform multi-class classification using supervised learning. For this, I found a very good dataset; the only drawback is that it is very small, but I went ahead with it anyhow. It is available on Kaggle and can be found here: Dataset.

If you don’t want to go through these trivial steps of data visualization and preprocessing, I am providing the preprocessed dataset as well, so that you can jump directly to the Implementing ANN section: Preprocessed dataset.

As usual, kindly note that I have uploaded the dataset to a folder called ANN in MyDrive and load it from there. The code to load the dataset is as follows.

#loading data
dataset = pd.read_csv("/content/drive/MyDrive/ANN/drug200.csv")
dataset.head()
drug200.csv, Image by author.

You should see something like this. It has only 6 columns, namely ‘Age’, ‘Sex’, ‘BP’, ‘Cholesterol’, ‘Na_to_K’, and ‘Drug’. Most of them should be self-explanatory.

‘Na_to_K’ is the sodium to potassium ratio in the body. ‘Drug’ is the target variable. We will try to predict what drug should be given to the patient based on the given knowledge of ‘Age’, ‘Sex’, ‘BP’, ‘Cholesterol’, and ‘Na_to_K’ as features.

Understanding Data

First, let's write some rudimentary code to understand the dataset. Here I am checking for null values, if any.

#checking for null values
dataset.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Age          200 non-null    int64
 1   Sex          200 non-null    object
 2   BP           200 non-null    object
 3   Cholesterol  200 non-null    object
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB

No null values were observed. But I don’t like the fact that there are just 200 rows in the dataset. It creates an issue (more on that later).

Now, we should look at how many classes are present in our target variable and in what proportions. Here is a simple plotly snippet to do so.

#visualizing classes
total = len(dataset)
drugs = dataset["Drug"].value_counts()
drugs = drugs.sort_index(ascending = False)
percentage = [
str(round(value/total*100, 2)) + "%" \
for value in drugs.values
]
trace0 = go.Bar(
x = drugs.values,
y = drugs.keys(),
orientation = 'h',
texttemplate = percentage,
textposition = "outside",
textfont_color = "white",
marker = dict(
color = drugs.values,
colorscale = "OrRd"
)
)
data = [trace0]
fig = go.Figure(data)
fig.update_layout(
title = "Class Information",
xaxis_title = "Count",
yaxis_title = "Drugs",
template = "plotly_dark",
)
fig.show()
Class Information, Image by author.

You can see there is some imbalance in the classes: almost half of the samples are ‘DrugY’, while ‘DrugB’ and ‘DrugC’ together constitute only 16%. But the dataset is so small that I didn’t want to disturb it and rebalance the classes, so I just went ahead with what I had.

Let us also look at some categorical data that we have in the dataset. Here is a visualization I came up with.

#visualizing data
trace0 = go.Pie(
labels = dataset["Sex"].unique(),
values = dataset["Sex"].value_counts(),
textinfo = "label+percent",
name = "Sex",
hole = 0.6,
marker = dict(
colors = [
"floralwhite",
"darkred"
]
)
)
trace1 = go.Pie(
labels = dataset["BP"].unique(),
values = dataset["BP"].value_counts(),
textinfo = "label+percent",
name = "BP",
hole = 0.6,
marker = dict(
colors = [
"darkred",
"floralwhite",
"coral"
]
)
)
trace2 = go.Pie(
labels = dataset["Cholesterol"].unique(),
values = dataset["Cholesterol"].value_counts(),
textinfo = "label+percent",
name = "Cholesterol",
hole = 0.6,
marker = dict(
colors = [
"darkred",
"coral"
]
)
)
sex = dataset["Sex"].value_counts()
trace3 = go.Bar(
x = sex.values,
y = sex.keys(),
orientation = 'h',
texttemplate = [
str(value) \
for value in sex.values
],
textposition = "outside",
textfont_color = "white",
name = "Sex",
marker = dict(
color = sex.values,
colorscale = "OrRd"
)
)
bp = dataset["BP"].value_counts()
trace4 = go.Bar(
x = bp.values,
y = bp.keys(),
orientation = 'h',
texttemplate = [
str(value) \
for value in bp.values
],
textposition = "outside",
textfont_color = "white",
name = "BP",
marker = dict(
color = drugs.values,
colorscale = "OrRd"
)
)
chol = dataset["Cholesterol"].value_counts()
trace5 = go.Bar(
x = chol.values,
y = chol.keys(),
orientation = 'h',
texttemplate = [
str(value) \
for value in chol.values
],
textposition = "outside",
textfont_color = "white",
name = "Cholesterol",
marker = dict(
color = drugs.values,
colorscale = "OrRd"
)
)
specs = [
[{"type": "domain"}, {"type": "domain"}, {"type": "domain"}],
[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}]
]
fig = subplots.make_subplots(
rows = 2, cols = 3,
row_heights = [0.8, 0.2],
horizontal_spacing = 0.15,
vertical_spacing = 0.05, specs = specs,
subplot_titles = ["Sex", "BP", "Cholesterol"]
)
fig.add_trace(trace0, row = 1, col = 1)
fig.add_trace(trace1, row = 1, col = 2)
fig.add_trace(trace2, row = 1, col = 3)
fig.add_trace(trace3, row = 2, col = 1)
fig.update_xaxes(title_text = "Count", range = (0, 125), row = 2, col = 1)
fig.update_yaxes(title_text = "Sex", row = 2, col = 1)
fig.add_trace(trace4, row = 2, col = 2)
fig.update_xaxes(title_text = "Count", range = (0, 85), row = 2, col = 2)
fig.update_yaxes(title_text = "BP", row = 2, col = 2)
fig.add_trace(trace5, row = 2, col = 3)
fig.update_xaxes(title_text = "Count", range = (0, 125), row = 2, col = 3)
fig.update_yaxes(title_text = "Cholesterol", row = 2, col = 3)
fig["layout"].update(
title = "Categorigal Data",
template = "plotly_dark",
showlegend = False,
height = 700
)
fig.show()
Categorical Data, Image by author.

You can clearly see that most of this data is well balanced. Let me also show you what the numerical data looks like.
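Something as simple as a pair of box plots does the trick for the two numerical columns, ‘Age’ and ‘Na_to_K’. Below is a minimal sketch of one way to draw them; this is just one option, and the styling choices here are my own.

#visualizing numerical data (a minimal box plot sketch)
trace0 = go.Box(
    y = dataset["Age"],
    name = "Age",
    marker = dict(color = "coral")
)
trace1 = go.Box(
    y = dataset["Na_to_K"],
    name = "Na_to_K",
    marker = dict(color = "darkred")
)
fig = subplots.make_subplots(rows = 1, cols = 2)
fig.add_trace(trace0, row = 1, col = 1)
fig.add_trace(trace1, row = 1, col = 2)
fig.update_layout(
    title = "Numerical Data",
    template = "plotly_dark",
    showlegend = False
)
fig.show()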

Numerical Data, Image by author.

All values seem to be well in range and no predominant outliers exist either.

Finally, let's also make a scatter plot to see how the data is spread around and to find out whether there is some correlation between the drug given and age or the sodium to potassium ratio.

#visualizing data
traces = []
for drug in dataset["Drug"].unique():
    data_subset = dataset.where(
        dataset["Drug"] == drug
    )
    traces.append(
        go.Scatter(
            x = data_subset["Na_to_K"],
            y = data_subset["Age"],
            mode = "markers",
            marker = dict(
                size = 10
            ),
            legendgroup = drug,
            name = drug
        )
    )
data = traces
fig = go.Figure(data)
fig.update_layout(
colorway = [
"floralwhite",
"plum",
"coral",
"orange",
"darkred"
],
title = "Data Distribution",
yaxis_title = "Age",
xaxis_title = "Sodium to Potassium Ratio",
template = "plotly_dark",
height = 700
)
fig.show()
Data Distribution, Image by author.

You can see that ‘DrugY’ is predominantly used in cases with a high sodium to potassium ratio across all ages. Similarly, ‘DrugB’ is mainly used for older people with a normal sodium to potassium ratio. You can make as many visualizations as you want to seek out more information and patterns, but in the interest of keeping the article short, I am moving on to the next steps.

Preprocessing Data

We preprocess the data so that our neural network can understand it, without losing the statistical properties of the data. I will convert the nominal feature (values that don’t have any form of precedence) ‘Sex’ using Binary Encoding. I could have used One Hot Encoding to get the same result, because there are only 2 genders here, but I wanted to introduce you guys to Binary Encoding, as it can encode the same information using fewer extra features. If Binary Encoding still generates too many features for your liking, you can also use BaseN Encoding, where the base is raised from 2 to any value you like.

Binary Encoding first maps each unique category to an integer, starting from 1, and then writes that integer in binary, with each binary digit becoming a new column. Thus ‘F’ and ‘M’ will be converted to ‘01’ (1 in decimal) and ‘10’ (2 in decimal) respectively.
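To make this concrete, here is a tiny standalone illustration (separate from our pipeline, on a made-up toy column) of what BinaryEncoder produces; the exact output column names may differ between category_encoders versions.

#toy illustration of binary encoding (not part of the main pipeline)
import pandas as pd
import category_encoders as ce

toy = pd.DataFrame({"Sex": ["F", "M", "M", "F"]})
print(ce.BinaryEncoder(cols = ["Sex"]).fit_transform(toy))
#expect two 0/1 columns, encoding 'F' as 0 1 (decimal 1)
#and 'M' as 1 0 (decimal 2)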

Ordinal features (values that have a hierarchy) like ‘BP’ and ‘Cholesterol’ are encoded using Ordinal Encoding. Lastly, I have encoded the different classes using One Hot Encoding. The code is as follows; notice how I have given the mapping for the ordinal data.

#handling categorical data
#performing binary encoding
nominal = ["Sex"]
encoder_n = ce.BinaryEncoder(cols = nominal)
dataset = encoder_n.fit_transform(dataset)
#performing ordinal encoding
ordinal = ["BP", "Cholesterol"]
encoder_o = ce.OrdinalEncoder(cols = ordinal)
dataset = encoder_o.ordinal_encoding(
dataset,
mapping = [
{
"col": "BP",
"mapping": {
"LOW": 0,
"NORMAL": 5,
"HIGH": 10
}
},
{
"col": "Cholesterol",
"mapping": {
"NORMAL": 0,
"HIGH": 10
}
}
]
)
#converting dataset back to dataframe
#ordinal encoder returns a tuple for some reason
dataset = pd.DataFrame(dataset[0])
#performing one hot encoding
target = ["Drug"]
classes = dataset[target]
encoder_t = ce.OneHotEncoder(cols = target)
dataset = encoder_t.fit_transform(dataset)
#looking at data
dataset.head()
Converted dataset, Image by author.

Now we have the data ready! Just one thing is left to do so that there is less hassle at the end when gauging the performance metrics of the neural network. We need to prepare a mapping between the new class labels and the actual class labels, so that we can revert to the original labels at the end for better understanding. (I couldn’t figure out how the .inverse_transform() function of category encoders works, so I had to do this ugly workaround 😔)

#separating features from dataset
X = dataset.iloc[:, :-5]
Y = dataset.iloc[:, -5:]

This will separate the target columns and feature columns from the dataset into ‘Y’ and ‘X’ respectively. Then we do the mapping.

#learning mapping for encoding
class_mappings = dict()
new_classes = np.array(Y)
values, indices = np.unique(new_classes, axis = 0, return_index = True)
indices, values = zip(*sorted(zip(indices, values)))
for n, value in enumerate(values):
    class_mappings[classes["Drug"].unique()[n]] = value
print(class_mappings)

You will get an output like this:

{‘drugY’: array([1, 0, 0, 0, 0]), ‘drugC’: array([0, 1, 0, 0, 0]), ‘drugX’: array([0, 0, 1, 0, 0]), ‘drugA’: array([0, 0, 0, 1, 0]), ‘drugB’: array([0, 0, 0, 0, 1])}

Now we need to scale the data, for reasons explained in Part 1. I am again using standardization for scaling.

#standardizing data
scaler = StandardScaler()
X = scaler.fit_transform(X)

PCA

I am using PCA for reducing dimensionality. However, this step is totally optional. I don't know how much it will impact our neural network training since there are only 6 features actually. But still, I am putting it here for your reference. If you want to understand PCA in detail, kindly refer to this article on PCA implemented from scratch with a very cool application: PCA for image reconstruction, from scratch.

#dimensionality reduction using PCA
features = X.shape[1]
pca = PCA(n_components = features)
X_pca = pca.fit_transform(X)
trace0 = go.Bar(
x = np.arange(
start = 1,
stop = features + 1,
step = 1
),
y = pca.explained_variance_ratio_*100,
text = [
str(round(ratio*100, 2)) + "%" \
for ratio in pca.explained_variance_ratio_
],
textposition = "outside",
textfont_color = "white",
marker = dict(
color = pca.explained_variance_ratio_,
colorscale = "OrRd"
)
)
cumulative = np.cumsum(
[ratio*100 for ratio in pca.explained_variance_ratio_]
)
trace1 = go.Bar(
x = np.arange(
start = 1,
stop = features + 1,
step = 1
),
y = cumulative,
text = [
str(round(value, 2)) + "%" \
for value in cumulative
],
textposition = "outside",
textfont_color = "white",
marker = dict(
color = cumulative,
colorscale = "OrRd"
)
)
fig = subplots.make_subplots(
rows = 1, cols = 2,
horizontal_spacing = 0.1,
subplot_titles = [
"Explained varience by principal components",
"Cumulative explained varience"
]
)
fig.add_trace(trace0, row = 1, col = 1)
fig.add_trace(trace1, row = 1, col = 2)
fig.update_yaxes(title_text = "Percentage", row = 1, col = 1)
fig.update_yaxes(title_text = "Percentage", row = 1, col = 2)
fig.update_xaxes(title_text = "Principal Components", row = 1, col = 1)
fig.update_xaxes(title_text = "Principal Components", row = 1, col = 2)
fig.update_layout(
title = "Explained Variance",
xaxis_title = "Principal Components",
yaxis_title = "Percentage",
template = "plotly_dark",
showlegend = False,
height = 700
)
fig.show()
Explained Variance, Image by author.

Now we keep only the components that carry most of the information, reducing the number of features.

#keeping 80% information
X_pca = X_pca[:, :4]
X_pca.shape

Output:

(200, 4)

Now, in case you want to skip all the above steps, I am creating the final dataset for you so that you can jump directly to it. (Though if you skipped, you are probably not reading this part anyway! 😆)

#making the final dataset
columns = [
"Principal Component " + str(col + 1) \
for col in range(X_pca.shape[1])
]
dataset = pd.DataFrame(X_pca, columns = columns)
dataset = pd.concat([dataset, Y], axis = 1)
dataset.head()
Prepared_Data.csv, Image by author.

This dataset is available here so you can skip all the previous hassle if you want: Dataset. Here is the code to export the data in CSV format.

#exporting data
dataset.to_csv("Prepared_Data.csv", index = False)

Implementing ANN

Finally, we can start with the task we came here for: building an ANN from scratch! First, we split the features and target into ‘X’ and ‘Y’ respectively.

#separating features and target
X = dataset.iloc[:, :-5]
Y = dataset.iloc[:, -5:]

Now we convert the datatype from DataFrame to NumPy array, so that our ANN can work with it and take advantage of NumPy vectorization.

#converting to numpy arrays.
X = np.array(X)
Y = np.array(Y)

Now we perform the train-test split.

#test-train split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

Now we come to the code for the ANN implementation. Here, I am defining the activation functions that will be used in the neural network.

#sigmoid
def sigmoid(x):
    return 1/(1 + np.exp(-x))

#sigmoid derivative
def sigmoid_derivative(P):
    return P * (1 - P)

#ReLU
def ReLU(x):
    return np.maximum(x, 0, x)

#ReLU derivative
def ReLU_derivative(P):
    P[P <= 0] = 0
    P[P > 0] = 1
    return P

Why do we need activation functions to change the output of every neuron, as shown in Part 1? The answer is non-linearity. It can be shown that no matter how many layers you stack, if you use no activation function (or, just as bad, a linear one), the final output is still just a linear combination of the inputs. In that case you are simply fitting a linear separating hyperplane to the data; you won’t get a separating surface that twists and bends itself according to the data. As I showed in Part 1, when I fit a linear separating plane to the data, many points were misclassified only because the plane could not bend like, say, a parabola, which would have given a better separating surface and better accuracy in turn. So, here I am using two activation functions: the Rectified Linear Unit (ReLU) for the internal neurons, i.e. the ones in the two hidden layers, and sigmoid in the output layer for classifying the data into classes. The respective derivatives are used during backpropagation.
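To convince yourself of this, here is a small standalone NumPy check (the shapes are arbitrary and unrelated to our dataset): two stacked linear layers produce exactly the same output as a single linear layer whose weight matrix and bias are combined from the two, so the extra layer adds nothing without a non-linear activation.

#two linear layers collapse into one linear layer
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size = (4, 6))            #4 samples, 6 features
W1, b1 = rng.normal(size = (6, 25)), rng.normal(size = (1, 25))
W2, b2 = rng.normal(size = (25, 5)), rng.normal(size = (1, 5))

#two "layers" with no (i.e. linear) activation
two_layers = (x @ W1 + b1) @ W2 + b2

#one equivalent linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))   #True: the extra depth added nothing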

Now comes the main code! The actual neural network itself.

#neural network
class NeuralNetwork:

    def __init__(self, x, y, r):
        self.input = x
        self.y = y
        self.learning_rate = r
        self.nodes_in_first_layer = 25
        self.nodes_in_second_layer = 15
        self.nodes_in_output_layer = 5
        self.output = np.zeros(y.shape)
        self.error_values = []

        #weights and biases start as small random values around zero
        upper_limit = 0.5
        lower_limit = -0.5

        self.weights1 = np.random.uniform(upper_limit, lower_limit, (self.input.shape[1], self.nodes_in_first_layer))
        self.weights2 = np.random.uniform(upper_limit, lower_limit, (self.nodes_in_first_layer, self.nodes_in_second_layer))
        self.weights3 = np.random.uniform(upper_limit, lower_limit, (self.nodes_in_second_layer, self.nodes_in_output_layer))

        self.bias1 = np.random.uniform(upper_limit, lower_limit, (1, 1))
        self.bias2 = np.random.uniform(upper_limit, lower_limit, (1, 1))
        self.bias3 = np.random.uniform(upper_limit, lower_limit, (1, 1))

    #forward propagation: ReLU in the hidden layers, sigmoid at the output
    def forwardprop(self):
        self.layer1 = ReLU(np.dot(self.input, self.weights1) + self.bias1)
        self.layer2 = ReLU(np.dot(self.layer1, self.weights2) + self.bias2)
        self.layer3 = sigmoid(np.dot(self.layer2, self.weights3) + self.bias3)
        return self.layer3

    #cross-entropy error averaged over the samples
    def error(self):
        return -(1 / self.output.shape[0]) * np.sum((self.y * np.log(self.output)) + ((1 - self.y) * np.log(1 - self.output)))

    #back propagation: compute the gradients layer by layer, then update weights and biases
    def backprop(self):
        d_Propogation3 = self.output - self.y
        d_weights3 = (1 / self.output.shape[0]) * (np.dot(d_Propogation3.T, self.layer2))
        d_bias3 = (1 / self.output.shape[0]) * (np.sum(d_Propogation3))

        d_Propogation2 = np.dot(d_Propogation3, self.weights3.T) * ReLU_derivative(self.layer2)
        d_weights2 = (1 / self.output.shape[0]) * (np.dot(d_Propogation2.T, self.layer1))
        d_bias2 = (1 / self.output.shape[0]) * (np.sum(d_Propogation2))

        d_Propogation1 = np.dot(d_Propogation2, self.weights2.T) * ReLU_derivative(self.layer1)
        d_weights1 = (1 / self.output.shape[0]) * (np.dot(d_Propogation1.T, self.input))
        d_bias1 = (1 / self.output.shape[0]) * (np.sum(d_Propogation1))

        self.weights3 = self.weights3 - self.learning_rate * d_weights3.T
        self.bias3 = self.bias3 - self.learning_rate * d_bias3

        self.weights2 = self.weights2 - self.learning_rate * d_weights2.T
        self.bias2 = self.bias2 - self.learning_rate * d_bias2

        self.weights1 = self.weights1 - self.learning_rate * d_weights1.T
        self.bias1 = self.bias1 - self.learning_rate * d_bias1

    #one training step: forward pass, record the error, then a backward pass
    def train(self):
        self.output = self.forwardprop()
        self.error_values.append(self.error())
        self.backprop()
        return self.error_values

    #prediction is just a forward pass with the learned weights
    def predict(self, test):
        self.layer1 = ReLU(np.dot(test, self.weights1) + self.bias1)
        self.layer2 = ReLU(np.dot(self.layer1, self.weights2) + self.bias2)
        self.layer3 = sigmoid(np.dot(self.layer2, self.weights3) + self.bias3)
        return self.layer3

Let me take you through the code.

  • A class called ‘NeuralNetwork’ is defined which is initialized by the constructor (__init__() function) with the input data ‘x’, target values ‘y’, and the learning rate ‘r’.
  • The first layer is set to have 25 nodes (neurons), followed by a second layer with 15 nodes. Lastly, the output layer has 5 nodes which have to be equal to the number of classes.
  • A list is created to keep track of intermediate error values as the network is trained. An output NumPy array is also initialized to store the results after every iteration to help calculate the error.
  • Weights and biases are initialized to random values close to zero (bounded by ‘upper_limit’ and ‘lower_limit’).
  • A function to implement forward propagation is defined. It is similar to the propagation function in Part 1. The only difference is that it is done for every layer separately so that you can customize the network according to your needs. It returns the output from the last layer (output layer).
  • A function to calculate error is defined. Again, similar to Part 1.
  • Then you will see a function is implemented to do back propagation. This function adjusts the weights based on error. Similar to the step tagged as calculating gradient in Part 1. Notice that there is a general pattern to updating weights across all layers.
  • Then comes the function to train the network. It does a forward propagation, saves the error value, then does back propagation once, finally returning the list of saved error values (the list returned after the final iteration contains all of them).
  • Finally comes the function to predict classes. You can see it is just a forward propagation done once, but with the final learned weights. I wrote it separately just to make things clear to understand. I could have easily clubbed the two by giving forwardprop() a test = None parameter.

All the concepts remain the same as those in Part 1. I could have made the code much shorter, but I chose to keep the processing of every layer separate for better understanding. You can easily spot the pattern and some redundant lines, so I would encourage you to shorten the code yourself (a sketch of one direction follows below). There is a lot of complicated multivariate calculus behind the derivation of backpropagation; for that, I would point you to the awesome book I used to get started with neural networks: A Brief Introduction to Neural Networks by David Kriesel. It is available for free from his official website!
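If you do try to shorten it, one possible direction is to keep the weights, biases and activations in lists and loop over them, so adding or removing a layer becomes a one-line change. This is just an illustration of the idea for the forward pass, not a drop-in replacement for the class above; forward_pass and layer_sizes are names I made up, and it reuses ReLU(), sigmoid() and X_train from earlier.

#a sketch: forward pass over an arbitrary list of layers
def forward_pass(x, weights, biases):
    activation = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = np.dot(activation, W) + b
        #ReLU everywhere except the last (output) layer, which uses sigmoid
        activation = sigmoid(z) if i == len(weights) - 1 else ReLU(z)
    return activation

#building the same 25-15-5 architecture as lists
layer_sizes = [X_train.shape[1], 25, 15, 5]
weights = [
    np.random.uniform(-0.5, 0.5, (m, n))
    for m, n in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.random.uniform(-0.5, 0.5, (1, 1)) for _ in layer_sizes[1:]]
output = forward_pass(X_train, weights, biases)

The backward pass can be written as a similar loop running over the layers in reverse, following the same pattern you can see repeated in backprop() above.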

Let’s now train our network and see what happens.

#simulating the Neural Network.
iterations = 5000
learning_rate = 0.005
NN = NeuralNetwork(X_train, Y_train, learning_rate)
print("Training...")
start_time = time.time()
for iteration in range(0, iterations):
    stdout.write("\r Iteration: " + str(iteration + 1) + " / " + str(iterations))
    stdout.flush()
    error_values = NN.train()
execution_time = time.time() - start_time
print()
print("Done! Execution time: " + str(round(execution_time, 2)) + " seconds")

Run this cell and you should see an output as follows:

Training...  
Iteration: 5000 / 5000
Done! Execution time: 12.66 seconds

The values for iterations and learning rate are chosen via trial and error, based on the information from the error graph that we will plot soon.
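If you want to do that trial and error a bit more systematically, something like the rough sketch below works. The candidate learning rates and the 2000 iterations per candidate are arbitrary choices of mine, and it assumes the NeuralNetwork class and the training data from above are already defined.

#rough sketch: comparing a few candidate learning rates by final training error
for lr in [0.0005, 0.005, 0.05]:
    candidate = NeuralNetwork(X_train, Y_train, lr)
    for _ in range(2000):
        errors = candidate.train()
    print("learning rate:", lr, "-> final error:", round(errors[-1], 4))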

I just like to keep the code related to neural networks together. Hence I put the code for prediction here. Ideally, you should first plot the error function and see how the training process went.

#predicting classes
output = NN.predict(X_test)
predictions = np.zeros_like(output)
predictions[np.arange(len(output)), output.argmax(1)] = 1
predictions = predictions.astype(int)

Now I am plotting the error function and a very fancy confusion matrix.

#get actual class label
def get_key(data):
    for key, value in class_mappings.items():
        if np.array_equal(data, value):
            return key

But first! Just a function to get the reverse mappings of the class labels. Again, I apologize for the ugly code but I couldn’t figure out .inverse_transform() of category encoders 😜.

Now, let's plot the results! so excited!

#visualising error and confusion matrix
trace0 = go.Scatter(
x = np.arange(
start = 1,
stop = iterations+1,
step = 1
),
name = "error",
y = error_values,
marker = dict(
color = "darkred"
)
)
Y_true = [get_key(data) for data in Y_test]
Y_pred = [get_key(data) for data in predictions]
labels = sorted(list(class_mappings.keys()))
cm = confusion_matrix(Y_true, Y_pred, labels = labels)
annotations = []
for n, _ in enumerate(cm):
    for m, _ in enumerate(cm):
        if cm[n][m] != 0:
            annotations.append(
                go.layout.Annotation(
                    text = str(cm[n][m]),
                    bgcolor = "floralwhite",
                    borderpad = 5,
                    bordercolor = "black",
                    borderwidth = 2,
                    font = dict(
                        color = "black",
                        size = 30
                    ),
                    x = m,
                    y = n,
                    xref = "x2",
                    yref = "y2",
                    width = 40,
                    showarrow = False
                )
            )
trace1 = go.Heatmap(
z = cm,
x = labels,
y = labels,
name = "Confusion Matrix",
colorscale = "OrRd",
ygap = 2,
xgap = 2,
xaxis = "x2",
yaxis = "y2"
)
fig = subplots.make_subplots(
rows = 1, cols = 2,
horizontal_spacing = 0.1,
subplot_titles = [
"Convergence of Error",
"Confusion Matix"
]
)
fig.add_trace(trace0, row = 1, col = 1)
fig.add_trace(trace1, row = 1, col = 2)
fig.update_xaxes(title_text = "Iterations", range = (-10, iterations + 10), row = 1, col = 1)
fig.update_yaxes(title_text = "Error", row = 1, col = 1)
fig.update_layout(
title = "Training Results",
template = "plotly_dark",
#this is a bug, always have to add annotations[0] and [1] explicitly.
annotations = [annotations[0]] + \
[annotations[1]] + \
annotations,
height = 700
)
fig.show()
Training Results, Image by author.

You should see these beautiful plots. On the left side, we have our error values, and on the right side a heat map representing the confusion matrix; only non-zero values are annotated. Our neural network has done a decent job. Let's look at the accuracy as well.

#calculating accuracy
accuracy = np.sum(np.diagonal(cm))/np.sum(cm)*100
print("Final Accuracy: " + str(round(accuracy, 2)) + "%")

Output:

Final Accuracy: 92.5%

Pretty impressive, right? Roughly 92 out of 100 people would be prescribed the right drug by this model. That is still far too low for actual medical use though. As an afterthought, I realized I should have saved the trained weights as well, to share with you guys, but I forgot! 😵

Your results might vary because of the random initialization of the weights, but the main reason is the size of the dataset: it is too small to give reliable results, and the accuracy depends a lot on how the dataset happened to be split. The result above is the best accuracy I got; the worst was 70%, and on average it was about 85%. One way to check this is to run the code again and again in a loop, including the train-test split, and take the average of all the accuracies obtained (see the sketch below). But still, more data would have been better.
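Here is a rough sketch of that repeated-evaluation idea. It assumes the NeuralNetwork class and the NumPy arrays X and Y from above are already defined; the 10 runs and 5000 iterations per run are arbitrary choices of mine.

#rough sketch: average accuracy over several random train-test splits
accuracies = []
for run in range(10):
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size = 0.2)
    NN = NeuralNetwork(X_tr, Y_tr, 0.005)
    for _ in range(5000):
        NN.train()
    output = NN.predict(X_te)
    predictions = np.zeros_like(output)
    predictions[np.arange(len(output)), output.argmax(1)] = 1
    correct = np.sum(np.all(predictions == Y_te, axis = 1))
    accuracies.append(correct / len(Y_te) * 100)
print("Average accuracy: " + str(round(np.mean(accuracies), 2)) + "%")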

So this is how you write a neural network in Python from scratch. Do remember that this is an exercise for self-improvement and better understanding; for real work it is always better to use a standard library like Keras. I urge you to play around with the code and change the tunable parameters (number of iterations, learning rate, number of neurons, and so on) to see how things behave and get a deeper understanding.
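For reference, a roughly equivalent model in Keras might look like the sketch below. This is my approximation of the same 25-15-5 architecture, not code from the pipeline above, and the epochs and batch size are arbitrary.

#a rough Keras equivalent of the scratch network above
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(25, activation = "relu", input_shape = (X_train.shape[1],)),
    keras.layers.Dense(15, activation = "relu"),
    keras.layers.Dense(5, activation = "sigmoid")
])
model.compile(
    optimizer = "adam",
    loss = "binary_crossentropy",   #mirrors the sigmoid + cross-entropy used above
    metrics = ["accuracy"]
)
model.fit(X_train, Y_train, epochs = 200, batch_size = 32, verbose = 0)
print(model.evaluate(X_test, Y_test, verbose = 0))

For a multi-class problem like this one, a softmax output with categorical_crossentropy would be the more conventional choice; I kept sigmoid with binary_crossentropy here only to mirror the scratch network.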

Alright then, this is it from me for this time. Hope to see you soon with an even more engaging post. I am humbled that you made it all the way down here; it must have been a lengthy read. I shall take your leave now.

I wish you good luck.

Bye!
