Simple Neural Network in Python from Scratch

Pradyumna
15 min read · Feb 22, 2024


Implementing a Neural Network from Scratch without using TensorFlow or PyTorch: A Step-by-Step Guide

Introduction

Welcome to my tutorial on building a simple neural network from scratch in Python! In this guide, I will break down the process of creating a neural network step by step, making it easy for beginners to understand.

Understanding Neural Networks

(Figure: comparison of a biological neuron and an artificial neuron)

A biological neuron in our brain consists of a nucleus, dendrites, and axons. It receives inputs through the dendrites, processes the information, and transmits the output through the axon.

An artificial neuron receives multiple inputs, and each input has a weight associated with it that tells how important that input is; the network learns these weights. Each input is multiplied by its weight, and the weighted inputs are then summed up.

In the above image, the neuron receives 3 inputs x1, x2, and x3 with weights w1, w2, and w3 respectively, so the total input to the neuron is the weighted sum of the inputs plus a bias term b:

Total input Z = w1*x1 + w2*x2 + w3*x3 + b

The term "b" is called the bias term. It is added to allow the model to better fit complex data and learn patterns effectively. Bias terms give neural networks flexibility by allowing them to represent functions that do not necessarily pass through the origin (0,0) in the input space.
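To make the weighted sum concrete, here is a tiny NumPy sketch of a single neuron with three inputs; the numbers are arbitrary example values of my own, not from the article:

import numpy as np

x = np.array([0.5, -1.2, 3.0])    # inputs x1, x2, x3
w = np.array([0.4, 0.1, -0.6])    # weights w1, w2, w3
b = 0.2                           # bias term

# total input to the neuron: weighted sum of the inputs plus the bias
z = np.dot(w, x) + b
print(z)   # 0.4*0.5 + 0.1*(-1.2) + (-0.6)*3.0 + 0.2 = -1.52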

Activation function

Finally, the total input is passed to a function called the "activation function", which returns the final output of the neuron. Activation functions introduce non-linearity into the network, enabling neural networks to learn and approximate complex, non-linear relationships within the data. Without non-linear activation functions, neural networks would only be able to model linear relationships. Some commonly used activation functions are listed below (a small NumPy sketch of each follows the list):

  1. Sigmoid
  2. Tanh
  3. ReLU
  4. Softmax
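For reference, here is a rough NumPy sketch of these four functions. This block is my own addition (the network built below only uses ReLU and softmax), and the softmax shown here operates on a single vector:

import numpy as np

def sigmoid(z):
    # squashes each value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes each value into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # keeps positive values and zeroes out negative ones
    return np.maximum(0, z)

def softmax(z):
    # turns a vector of logits into probabilities that sum to 1
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / np.sum(e)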

How Neural Networks Learn

Neural networks learn by iteratively adjusting their parameters to minimize a defined loss function. This process involves the following key steps (a toy gradient-descent example follows the three steps):

1. Forward Pass

During the forward pass, input data is propagated through the network's layers, from the input layer to the output layer, to produce the network's predictions.

2. Calculating the Loss:

Once the forward pass is completed and the network makes predictions, the next step is to calculate the loss or error. The loss function quantifies how well the network's predictions match the true labels or targets.

3. Backpropagation:

Backpropagation is the process by which the network adjusts its parameters to minimize the loss function. It involves computing the gradient of the loss function with respect to each parameter in the network and using those gradients to update the parameters.
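To make "iteratively adjusting parameters to minimize a loss" concrete, here is a tiny, self-contained gradient-descent example on a one-parameter problem. This toy example is my own addition and is separate from the network built below:

# toy problem: find the value of w that minimizes the loss L(w) = (w - 3)^2
w = 0.0            # initial parameter value
eta = 0.1          # learning rate

for step in range(100):
    grad = 2 * (w - 3)    # dL/dw, the gradient of the loss
    w -= eta * grad       # gradient-descent update
print(w)   # converges very close to 3.0, the minimizer of the loss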

Implementing the Neural Network Class

We'll start by defining a Python class called NN to represent our neural network. The network we are going to implement has a single hidden layer with "h" neurons (we can set the value of h) and an output layer with "n" neurons.

# creating a class for Neural Network
class NN:
    pass

Now we will define the constructor that initializes objects of our NN class. We initialize an object of this class with 3 parameters:

  1. n_features: no of features in our data/ dimensionality of each datapoint
  2. n_classes: no of classes in our classification setup
  3. n_hidden: no of neurons in the hidden layers

def __init__(self,n_features,n_classes,n_hidden):

    # n_features: no of features
    self.d=n_features
    # n_classes: no of classes (no of neurons in the output layer)
    self.n=n_classes
    # n_hidden: no of neurons in the hidden layers
    self.h=n_hidden

Initializing Weights and biases from Input to Hidden layer

Suppose we have 3 features and 4 neurons in the hidden layer. Each of the 3 features has a connection to each of the 4 neurons.

Therefore we have 3*4 = 12 connections in total, and each connection has a weight associated with it, so we have 12 weights in total. We will store these weights in a 2D matrix called the weight matrix. We also have a separate bias term for each neuron in the hidden layer, so we have 4 bias terms, which we will store in a vector.

In the previous code block, we had "d" inputs and "h" neurons in the hidden layer, so the total number of weights between the input and the hidden layer is d*h. We will store the weights in a d x h matrix called W1 and initialize them with small random values.

    # creating the weight matrix W1 with random values (weights from the input layer to the hidden layer) of dimension (d x h); each column is the weight vector for one hidden neuron
    self.W1=0.01*np.random.randn(self.d,self.h)

Note that in addition to the weights, we have a separate bias term for each neuron in the hidden layer. In our case we have h bias terms, so we will store them in a 1 x h vector called b1, initialized with zeros.

    # creating the bias vector b1 (biases for the hidden layer) of dimension (1 x h)
    self.b1 = np.zeros((1,self.h))

We have now defined all of the weights and biases from the input layer to the hidden layer.

Initializing Weights and biases from Hidden to Output layer

Now we need to define all of the weights and biases from the hidden layer to the output layer

At the hidden layer we have h neurons (in the above diagram h = 4) and at the output layer we have n neurons (in the above diagram n = 2). Therefore we have h*n weights in total (in the above diagram 4*2 = 8 weights). We will store these weights in an h x n matrix called W2 and initialize them with small random values.

Similarly, we have n bias terms from the hidden layer to the output layer. We will store them in a 1 x n vector called b2, initialized with zeros.

    # creating the weight matrix W2 (weights from the hidden layer to the output layer) of dimension (h x n); each column is the weight vector for one output neuron
    self.W2=0.01*np.random.randn(self.h,self.n)

    # creating the bias vector b2 (biases for the output layer) of dimension (1 x n)
    self.b2 = np.zeros((1,self.n))

Let's see the whole __init__ method:

def __init__(self,n_features,n_classes,n_hidden):

    self.d=n_features
    self.n=n_classes
    self.h=n_hidden

    # creating the weight matrix W1 (weights from the input layer to the hidden layer) of dimension (d x h); each column is the weight vector for one hidden neuron
    self.W1=0.01*np.random.randn(self.d,self.h)

    # creating the bias vector b1 (biases for the hidden layer) of dimension (1 x h)
    self.b1 = np.zeros((1,self.h))

    # creating the weight matrix W2 (weights from the hidden layer to the output layer) of dimension (h x n); each column is the weight vector for one output neuron
    self.W2=0.01*np.random.randn(self.h,self.n)

    # creating the bias vector b2 (biases for the output layer) of dimension (1 x n)
    self.b2 = np.zeros((1,self.n))
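Assuming the __init__ above sits inside the NN class (and that numpy has been imported as np), a quick sanity check of the parameter shapes might look like this:

net = NN(n_features=3, n_classes=2, n_hidden=4)
print(net.W1.shape)   # (3, 4) -> d x h weights from the input layer to the hidden layer
print(net.b1.shape)   # (1, 4) -> one bias per hidden neuron
print(net.W2.shape)   # (4, 2) -> h x n weights from the hidden layer to the output layer
print(net.b2.shape)   # (1, 2) -> one bias per output neuron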

Forward Propagation

In forward propagation we pass the input data through the network's layers, applying the weights, biases, and activation functions to produce an output. Let us implement how our neural network passes the data through to produce the output.

We define a function called frwd_prop inside the class NN. This function takes X as its argument, where X is a 2D matrix of datapoints. The dimension of X is b x d, where b is the number of datapoints we pass into the network and d is the dimensionality of each datapoint (the number of features), matching the d used in __init__.

def frwd_prop(self,x):

Now we multiply the input matrix X with the weight matrix W1 and add the bias vector b1 to the result (the bias vector is broadcast across the rows of the resulting matrix). We call the final matrix z1. The elements of z1 are called "logits" (the outputs of a layer before the activation function is applied).

def frwd_prop(self,x):

    # multiplying the input with the weights and adding the bias term b1
    z1=np.dot(x,self.W1)+self.b1

Now we need to apply the activation function to z1. After applying the activation function to z1 we get the resulting matrix A1. We will use ReLU as the activation function: it returns the input unchanged if the input is positive and returns 0 if the input is negative.

(Figure: the ReLU activation function)

    # applying the ReLU function to z1
    A1=np.maximum(0,z1)

We have now successfully passed the input matrix X through the hidden layer and obtained A1. This A1 matrix will be the input matrix for the output layer.

Similarly, we multiply the matrix A1 with the weight matrix W2 and add the bias b2 to the result to get z2.

    # multiplying A1 with the weights and adding the bias term b2
    z2=np.dot(A1,self.W2)+self.b2

The elements of z2 are the logits of the output layer. Since this is the output layer, the values in z2 need to be turned into probabilities, but as they stand the logits cannot be interpreted as probabilities. To convert them into probabilities we apply a different activation function called the "softmax" activation.

(Figure: the softmax activation function)

    # applying the softmax to z2
    A2=np.exp(z2)
    A2=A2/np.sum(A2,axis=1,keepdims=True)
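One practical caveat, which is my own note rather than part of the original walkthrough: np.exp can overflow when the logits are large. Subtracting the row-wise maximum from z2 before exponentiating leaves the resulting probabilities unchanged and avoids this:

    # numerically more stable variant of the same softmax (optional)
    A2=np.exp(z2-np.max(z2,axis=1,keepdims=True))
    A2=A2/np.sum(A2,axis=1,keepdims=True)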

Finally, the frwd_prop function returns the matrices A1 and A2. Here is the whole frwd_prop function:

def frwd_prop(self,x):

    # multiplying the input with the weights and adding the bias term b1
    z1=np.dot(x,self.W1)+ self.b1

    # applying the ReLU function to z1
    A1=np.maximum(0,z1)

    # multiplying A1 with the weights and adding the bias term b2
    z2=np.dot(A1,self.W2)+ self.b2

    # applying the softmax to z2
    A2=np.exp(z2)
    A2=A2/np.sum(A2,axis=1,keepdims=True)

    return A1,A2
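As a quick sanity check (my own addition, assuming the class with the __init__ and frwd_prop above is defined and numpy is imported as np), each row of A2 should be a valid probability distribution:

net = NN(n_features=2, n_classes=3, n_hidden=4)
X_batch = np.random.randn(5, 2)      # 5 datapoints with 2 features each
A1, A2 = net.frwd_prop(X_batch)
print(A1.shape, A2.shape)            # (5, 4) (5, 3)
print(np.sum(A2, axis=1))            # each row of A2 sums to 1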

Calculating the Loss

In our Neural Network implementation, we are doing a Multiclass classification so we are going to use “Cross Entropy Loss”. The cross-entropy loss function measures the dissimilarity between the predicted probability distributions and the true probability distributions (or one-hot encoded labels) of the classes

Let's implement a function called ce_loss which computes the cross-entropy loss given the true labels and the predicted probabilities. The ce_loss function takes two arguments and returns the loss (Lce):

i) y_true : True labels of the data points

ii) y_pred_proba : Predicted Probabilities

def ce_loss(self,y_true,y_pred_proba):

    # computing the cross entropy loss
    num_examples=y_true.shape[0]
    # negative log-probability of the correct class for each datapoint
    correct_logprobs=-np.log(y_pred_proba[range(num_examples),y_true])
    loss=np.sum(correct_logprobs)/num_examples
    return loss
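In other words, the function averages the negative log of the probability assigned to the correct class of each datapoint. As a small worked example (my own numbers, assuming integer class labels as used throughout this article):

import numpy as np

# 2 datapoints, 3 classes; each row is a predicted probability distribution
y_pred_proba = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.3, 0.6]])
y_true = np.array([0, 2])    # the correct class of each datapoint

# cross-entropy: average of -log(probability assigned to the correct class)
loss = -np.mean(np.log(y_pred_proba[range(len(y_true)), y_true]))
print(loss)   # (-log 0.7 - log 0.6) / 2 ≈ 0.4338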

Backward Propagation

After we compute the loss through forward propagation, we travel back through the network to compute the gradients of the loss with respect to the weights and biases. This process is known as backpropagation.

Backpropagation enables neural networks to learn from data by iteratively adjusting their weights and biases to minimize the difference between predicted and actual outputs

Let us create the function backward_prop, which takes the values x, y, A1, and A2, where x is the data, y contains the labels, and A1 and A2 are the intermediate matrices returned by the frwd_prop function.

def backward_prop(self,x,y,A1,A2):
    pass

Let us understand how backprop works

First let's see how our data flows through the network

(Figure: data flow through the network)
  1. First, let's compute dZ2 = ∂L/∂z2, the partial derivative of the cross-entropy loss with respect to the logits z2. The derivative of the cross-entropy loss with respect to the logits is pij - yij, where pij is the predicted probability of the jth class for the ith datapoint and yij is 1 for the true class and 0 otherwise. So we take the predicted probabilities and subtract 1 at the position of the true class of each datapoint.
    # the number of datapoints in the batch
    num_examples=y.shape[0]

    # starting from the predicted probabilities (dZ2 refers to the same array as A2, which we no longer need)
    dZ2 =A2

    # subtracting 1 from the predicted probability of the true class of each datapoint
    dZ2[range(num_examples),y] -= 1

    # averaging the gradients over the number of examples
    dZ2 /= num_examples

2. Now we need to compute dW2 = ∂L/∂W2, the partial derivative of the cross-entropy loss with respect to the weight matrix W2.

(Figure: calculation of dW2)

    # computing the derivative of the loss with respect to W2
    dW2=np.dot(A1.T,dZ2)

3. Now we need to compute db2 = ∂L/∂b2, the partial derivative of the cross-entropy loss with respect to the bias vector b2.

(Figure: calculation of db2)

    # computing the derivative of the loss with respect to b2
    """
    the per-example gradient of b2 is equal to the corresponding row of dZ2,
    so we need to sum the gradients across the training examples
    """
    db2=np.sum(dZ2,axis=0,keepdims=True)

4. Now we need to compute dA1 = ∂L/∂A1, the partial derivative of the cross-entropy loss with respect to A1.

(Figure: calculation of dA1)

    # computing the derivative of the loss with respect to A1
    dA1=np.dot(dZ2,self.W2.T)

5. Now we need to compute dZ1 = ∂L/∂Z1, the partial derivative of the cross-entropy loss with respect to Z1. Since A1 = ReLU(Z1), the derivative of ReLU is 1 where Z1 is positive and 0 where it is negative, so we simply zero out the gradient wherever the ReLU output A1 is zero.

    # computing the gradient through the ReLU (the gradient is 0 for the negative points)
    """ the gradient of ReLU is zero wherever the input to the ReLU was negative, i.e. wherever A1 is 0 """
    dA1[A1<=0]=0

    # computing the gradient for z1
    dZ1=dA1

6. Now we need to compute dW1 = ∂L/∂W1, the partial derivative of the cross-entropy loss with respect to the weight matrix W1.

(Figure: calculation of dW1)

    # computing the gradient for W1
    dW1=np.dot(x.T,dZ1)

7. Now we need to compute db1 = ∂L/∂b1, the partial derivative of the cross-entropy loss with respect to the bias vector b1.

    # computing the gradient for b1
    """
    the per-example gradient of b1 is equal to the corresponding row of dZ1,
    so we need to sum the gradients across the training examples
    """
    db1=np.sum(dZ1,axis=0,keepdims=True)

Finally, we return the gradients dW1, db1, dW2, and db2. These gradients will be used by the fit function to update the parameters of our neural network. Here is the full code for backpropagation:

def backward_prop(self,x,y,A1,A2):

    # capturing the no of datapoints
    num_examples=y.shape[0]

    # computing the derivative of the CE loss wrt z2 (the inputs to the softmax layer)
    """ derivative of the CE loss wrt zj: dL/dzj = Pij - Yij """
    dZ2 =A2
    dZ2[range(num_examples),y] -= 1

    # averaging the gradients over the number of examples
    dZ2 /= num_examples

    # computing the derivative of the loss wrt W2
    dW2=np.dot(A1.T,dZ2)

    # computing the derivative of the loss wrt b2
    """
    dZ2 contains one gradient row per example,
    so we need to sum the gradients across the training examples
    """
    db2=np.sum(dZ2,axis=0,keepdims=True)

    # computing the derivative of the loss wrt A1
    dA1=np.dot(dZ2,self.W2.T)

    # computing the gradient through the ReLU (the gradient is 0 for the negative points)
    """ the gradient of ReLU is zero wherever A1 is 0 """
    dA1[A1<=0]=0

    # computing the gradient for z1
    dZ1=dA1

    # computing the gradient for W1
    dW1=np.dot(x.T,dZ1)

    # computing the gradient for b1
    db1=np.sum(dZ1,axis=0,keepdims=True)

    return dW1, db1, dW2, db2
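A common way to verify a hand-written backward pass, and an addition of my own rather than part of the original article, is a numerical gradient check: nudge one parameter, recompute the loss, and compare the finite-difference estimate with the analytic gradient. A minimal sketch, assuming the full NN class assembled later in this article:

import numpy as np

# gradient check for one entry of W2 on a small random batch
net = NN(n_features=2, n_classes=3, n_hidden=4)
x = np.random.randn(5, 2)
y = np.random.randint(0, 3, size=5)

A1, A2 = net.frwd_prop(x)
dW1, db1, dW2, db2 = net.backward_prop(x, y, A1, A2)

# finite-difference estimate of dL/dW2[0, 0]
eps = 1e-5
net.W2[0, 0] += eps
loss_plus = net.ce_loss(y, net.frwd_prop(x)[1])
net.W2[0, 0] -= 2 * eps
loss_minus = net.ce_loss(y, net.frwd_prop(x)[1])
net.W2[0, 0] += eps          # restore the original weight

numeric_grad = (loss_plus - loss_minus) / (2 * eps)
print(numeric_grad, dW2[0, 0])   # the two numbers should agree closely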

Training the Neural Network

In training, the neural network iteratively adjusts its parameters (weights and biases) to minimize a defined loss function (in our case the cross-entropy loss). During training we pass our whole dataset through the network multiple times; each full pass over the data is called an epoch.

Let's define a function that trains the neural network. This function takes the following parameters:

i ) x= datapoints

ii) y= labels

iii) reg= coefficient of L2 regularization

iv) max_iters= no of training iterations (each iteration passes the whole dataset through the network, i.e. one epoch, since we train on the full batch)

v) eta= learning rate

def fit(self,x,y,reg,max_iters,eta):

    # getting the total no of datapoints passed into the network
    num_examples=x.shape[0]

    # doing forward and backward prop in each iteration
    for i in range(max_iters):

        # forward prop
        A1,A2=self.frwd_prop(x)

        # calculating the data loss
        loss=self.ce_loss(y,A2)
        # calculating the regularization loss
        reg_loss = 0.5*reg*np.sum(self.W1*self.W1) + 0.5*reg*np.sum(self.W2*self.W2)
        # computing the total loss
        total_loss=loss+reg_loss

        if i % 1000 == 0:
            print("iteration %d: loss %f" % (i, total_loss))

        # backprop
        dW1, db1, dW2, db2 = self.backward_prop(x,y,A1,A2)

        # during backprop we computed the gradients of the data loss only, not the regularization loss,
        # so we add the regularization gradient contribution here
        dW2 += reg * self.W2
        dW1 += reg * self.W1

        # updating the parameters
        self.W1+= -eta*dW1
        self.W2+= -eta*dW2
        self.b1+= -eta*db1
        self.b2+= -eta*db2

Prediction

After training, we have found values of the weights and biases that minimize our loss. Now, when given a query datapoint at runtime, we must predict its label; in this case we just do a forward pass and return the output of the neural network.

Let's create a function called predict() which takes query points x as input, does a forward pass, and returns the predicted labels.

def predict(self,x):

    # doing forward prop
    _,y_pred=self.frwd_prop(x)

    # converting the class probabilities into class labels
    y_pred=np.argmax(y_pred,axis=1)

    return y_pred

I have shown the code in bits and pieces, so let's see the whole class implementation here. Our class NN has 5 methods, namely:

  1. frwd_prop ( does forward prop )
  2. ce_loss ( computes cross-entropy loss )
  3. backward_prop ( does backpropagation )
  4. fit ( trains the neural network)
  5. predict (predicts the label of the query point at runtime)
# importing numpy, the only library needed for the network itself
import numpy as np

# creating a class for Neural Network
class NN:

    """
    n_features: no of features
    n_classes: no of classes (no of output neurons)
    n_hidden: no of neurons in the hidden layers
    """

    def __init__(self,n_features,n_classes,n_hidden):

        self.d=n_features
        self.n=n_classes
        self.h=n_hidden

        # creating the weight matrix W1 (weights from the input layer to the hidden layer) of dimension (d x h); each column is the weight vector for one hidden neuron
        self.W1=0.01*np.random.randn(self.d,self.h)

        # creating the bias vector b1 (biases for the hidden layer) of dimension (1 x h)
        self.b1 = np.zeros((1,self.h))

        # creating the weight matrix W2 (weights from the hidden layer to the output layer) of dimension (h x n); each column is the weight vector for one output neuron
        self.W2=0.01*np.random.randn(self.h,self.n)

        # creating the bias vector b2 (biases for the output layer) of dimension (1 x n)
        self.b2 = np.zeros((1,self.n))

    def frwd_prop(self,x):

        # multiplying the input with the weights and adding the bias term b1
        z1=np.dot(x,self.W1)+self.b1

        # applying the ReLU function to z1
        A1=np.maximum(0,z1)

        # multiplying A1 with the weights and adding the bias term b2
        z2=np.dot(A1,self.W2)+self.b2

        # applying the softmax to z2
        A2=np.exp(z2)
        A2=A2/np.sum(A2,axis=1,keepdims=True)

        return A1,A2

    def ce_loss(self,y_true,y_pred_proba):

        # computing the cross entropy loss
        num_examples=y_true.shape[0]
        # negative log-probability of the correct class for each datapoint
        correct_logprobs=-np.log(y_pred_proba[range(num_examples),y_true])
        loss=np.sum(correct_logprobs)/num_examples
        return loss

    def backward_prop(self,x,y,A1,A2):

        # capturing the no of datapoints
        num_examples=y.shape[0]

        # computing the derivative of the CE loss wrt z2 (the inputs to the softmax layer)
        """ derivative of the CE loss wrt zj: dL/dzj = Pij - Yij """
        dZ2 =A2
        dZ2[range(num_examples),y] -= 1

        # averaging the gradients over the number of examples
        dZ2 /= num_examples

        # computing the derivative of the loss wrt W2
        dW2=np.dot(A1.T,dZ2)

        # computing the derivative of the loss wrt b2 (summing the per-example gradients)
        db2=np.sum(dZ2,axis=0,keepdims=True)

        # computing the derivative of the loss wrt A1
        dA1=np.dot(dZ2,self.W2.T)

        # computing the gradient through the ReLU (the gradient is 0 for the negative points)
        dA1[A1<=0]=0

        # computing the gradient for z1
        dZ1=dA1

        # computing the gradient for W1
        dW1=np.dot(x.T,dZ1)

        # computing the gradient for b1 (summing the per-example gradients)
        db1=np.sum(dZ1,axis=0,keepdims=True)

        return dW1, db1, dW2, db2

    def fit(self,x,y,reg,max_iters,eta):

        num_examples=x.shape[0]

        # doing forward and backward prop max_iters times
        for i in range(max_iters):

            # forward prop
            A1,A2=self.frwd_prop(x)

            # calculating the data loss
            loss=self.ce_loss(y,A2)
            # calculating the regularization loss
            reg_loss = 0.5*reg*np.sum(self.W1*self.W1) + 0.5*reg*np.sum(self.W2*self.W2)
            # computing the total loss
            total_loss=loss+reg_loss

            if i % 1000 == 0:
                print("iteration %d: loss %f" % (i, total_loss))

            # backprop
            dW1, db1, dW2, db2 = self.backward_prop(x,y,A1,A2)

            # adding the regularization gradient contribution
            dW2 += reg * self.W2
            dW1 += reg * self.W1

            # updating the parameters
            self.W1+= -eta*dW1
            self.W2+= -eta*dW2
            self.b1+= -eta*db1
            self.b2+= -eta*db2

    def predict(self,x):

        # doing forward prop
        _,y_pred=self.frwd_prop(x)

        # converting the class probabilities into class labels
        y_pred=np.argmax(y_pred,axis=1)

        return y_pred

Evaluation

We have successfully implemented a neural network from scratch without the help of libraries like TensorFlow or PyTorch, so let us evaluate our neural network on synthetic data and see how it performs.

I have manually synthesized a dataset in which each datapoint is two-dimensional and there are 3 classes in total, so this is a multiclass classification problem. The datapoints are distributed to form a spiral shape in 2D space and cannot be separated by a linear hyperplane. Let's see if our neural network is able to classify these points.

# importing the libraries used to load and plot the data
import pandas as pd
import matplotlib.pyplot as plt

# importing the data
df=pd.read_csv('drive/MyDrive/spiral.csv')

# plotting the data to see the distribution
plt.scatter(df["x1"], df["x2"], c=df["y"], s=40, cmap=plt.cm.Spectral)
plt.show()
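If you don't have that exact CSV, a similar spiral dataset can be generated directly. The following sketch is my own addition (modeled on the well-known CS231n spiral example) and may not match the author's file exactly:

import numpy as np
import pandas as pd

np.random.seed(0)
N, K = 100, 3                                   # points per class, number of classes
X_gen = np.zeros((N * K, 2))
y_gen = np.zeros(N * K, dtype=int)
for j in range(K):
    ix = range(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)                                         # radius
    t = np.linspace(j * 4, (j + 1) * 4, N) + np.random.randn(N) * 0.2    # angle, with noise
    X_gen[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y_gen[ix] = j
df = pd.DataFrame({"x1": X_gen[:, 0], "x2": X_gen[:, 1], "y": y_gen})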

Training our network

# splitting the X and Y
X = df.iloc[:, :-1].to_numpy()
y = df.iloc[:, -1].to_numpy()

# defining the model
nn=NN(n_features=2,n_classes=3,n_hidden=100)

# fitting the model
nn.fit(X,y,reg=1e-3,max_iters=10000,eta=1)

print('training accuracy: %.2f' % (np.mean(np.array(nn.predict(X)) == y)))
(Figure: training progress of our neural network)

Let’s visualize the class decision boundary created by our neural network

# create a 2D grid
step = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))

# predict for all the points in the grid
y_hat = nn.predict(np.c_[xx.ravel(), yy.ravel()])
y_hat = y_hat.reshape(xx.shape)

# plot
fig = plt.figure()
plt.contourf(xx, yy, y_hat, cmap=plt.cm.Spectral, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

We can see that the neural network we implemented is able to learn this complex shape easily and separates the datapoints accurately.

Thank you all for patiently reading and reaching this point. I hope you've gained new insights today. Your feedback and support are invaluable; please feel free to share your thoughts.
