Building a Neural Network from Scratch in Python: A Step-by-Step Guide

A Hands-On Guide to Building a Neural Network from Scratch with Python

Okan Yenigün
May 16, 2023

This blog post will guide you through the process of coding a neural network from scratch in Python. Not only will we provide step-by-step instructions, but we will also delve into the underlying theory behind neural networks.

Linear and Non-Linear

Linear regression serves as a fundamental starting point in the realm of machine learning. However, it is limited in its ability to effectively capture and explain nonlinearity in data. This limitation arises from the underlying assumptions and structure of linear regression models.

Linear regression assumes a linear relationship between the independent variables and the target variable. It seeks to fit a straight line or hyperplane that best represents the relationship between the variables. However, many real-world phenomena exhibit complex nonlinear patterns and interactions that cannot be accurately modeled by a simple linear relationship.

Linear regression falls short in effectively capturing complex nonlinear relationships, but neural networks excel in this aspect. Neural networks enhance linear regression in three significant ways:

  1. Nonlinear Transformation: Unlike linear regression, neural networks apply nonlinear transformations on top of the linear transformation. This enables them to model and capture intricate nonlinear patterns in the data.
  2. Multiple Layers: Neural networks consist of multiple layers, allowing them to capture interactions and dependencies between features. Each layer contributes to extracting higher-level representations of the input data, enabling more sophisticated modeling.
  3. Multiple Hidden Units: Within each layer of a neural network, there are multiple hidden units. Each hidden unit performs its unique combination of linear and nonlinear transformations, providing flexibility in capturing complex relationships and enhancing the model’s ability to learn intricate patterns.

Now, let’s explore some essential concepts underlying neural networks.

Activation Functions

We use activation functions to perform these nonlinear transformations, introducing nonlinearity into the otherwise linear computation.

We modify the linear function y = wx + b by applying an activation function, resulting in the transformed equation y = activation(wx + b).

import numpy as np
import matplotlib.pyplot as plt

def plot_func(x, y, title):
    # helper function to plot activation functions
    plt.plot(x, y)
    plt.title(title)
    plt.xlabel('x')
    plt.ylabel('activation(x)')
    plt.grid(True)
    plt.show()

x = np.linspace(-10, 10, 100)

Sigmoid

The sigmoid activation function is a mathematical function that maps the input to a value between 0 and 1, providing a smooth S-shaped curve.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Sigmoid function. Image by the author.
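The plot in the figure can be reproduced with the helper defined above; the same one-liner pattern works for each of the activations that follow:

plot_func(x, sigmoid(x), 'Sigmoid')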

Rectified Linear Unit (ReLU)

The ReLU (Rectified Linear Unit) activation function is a mathematical function that returns the input if it is positive, and zero otherwise, providing a piecewise linear output.

def relu(x):
    return np.maximum(0, x)
Relu function. Image by the author.

Leaky ReLU

The Leaky ReLU activation function is a variation of the ReLU function that allows a small, non-zero output for negative input values, preventing complete suppression of information.

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha*x, x)
Leaky relu function. Image by the author.

Tanh

The tanh (hyperbolic tangent) activation function is a mathematical function that maps the input to a value between -1 and 1, providing a smooth S-shaped curve centered around zero.

def tanh(x):
    return np.tanh(x)
Tanh function. Image by the author.

Softmax

The softmax activation function is a mathematical function commonly used in multiclass classification tasks to convert a vector of real values into a probability distribution over multiple classes.

def softmax(x):
    # subtract the max for numerical stability; the result is unchanged
    exp_scores = np.exp(x - np.max(x))
    return exp_scores / np.sum(exp_scores)
Softmax function. Image by the author.

We will go on with the ReLU activation function in our implementation.

Multiple Layers

Of course, simply applying the ReLU activation on its own would still leave a large error, primarily because a single ReLU unit produces the same output (zero) for every input below zero. To address this, we stack multiple layers in our neural network. For instance, a neural network with two layers can be represented by the equation ŷ = w2 * relu(w1 * x + b1) + b2.
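As a quick illustration (with made-up parameter values, not taken from the article), here is that two-layer equation in code; the kink introduced by ReLU bends the otherwise straight line:

w1, b1, w2, b2 = 0.5, 1.0, -2.0, 3.0  # illustrative values
two_layer = lambda x: w2 * np.maximum(0, w1 * x + b1) + b2
plt.plot(x, two_layer(x))  # flat at 3 for x < -2, a sloped line afterwards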

Multiple Hidden Units

A lambda function, prediction, calculates a linear prediction based on the input x, using predefined values for the weights (w1) and bias (b).

prediction = lambda x, w1=.2, b=1.99: x * w1 + b

Then, we apply the ReLU activation function to the linear predictions.

layer1_1 = np.maximum(0, prediction(x))
plt.plot(x, layer1_1)
Image by the author.

What happens if we add another hidden unit?

layer1_2 = np.maximum(0, prediction(x, .3, -2))
plt.plot(x, layer1_1+layer1_2)
Image by the author.

We introduced a nonlinearity in the output. Let’s add one more.

layer1_3 = np.maximum(0, prediction(x, .6, -2))
plt.plot(x, layer1_1+layer1_2+layer1_3)
Image by the author.

By increasing the number of units, we observe the emergence of a more pronounced non-linear relationship. As we adjust the weights, we can observe corresponding changes in the relationship.
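The same idea can be written as a loop: sum several ReLU units, each with its own weight and bias (the values below are arbitrary, added only to extend the experiment):

units = [(.2, 1.99), (.3, -2), (.6, -2), (-.4, 1)]  # (weight, bias) pairs; the last pair is made up
combined = sum(np.maximum(0, prediction(x, w, b)) for w, b in units)
plt.plot(x, combined)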

Let’s draw a diagram for a two-layer structure.

Image by the author.

Calculating Outputs

As evident from the above, the output of one component serves as the input for another. Matrix multiplication is employed to compute the outputs in this process.

Matrix multiplication. Source
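As a quick refresher on how the shapes combine, numpy's @ operator performs exactly this row-by-column multiplication:

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2x3
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])       # 3x2
print(A @ B)                   # 2x2 result: [[58, 64], [139, 154]]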

Suppose we have an input represented by a matrix of shape 2x1.

Forward Pass

The tsensor package can be employed to visualize tensor variables effectively. By enhancing error messages and displaying Python code, TensorSensor provides insights into the shape of tensor variables. It is compatible with popular libraries such as TensorFlow, PyTorch, JAX, Numpy, Keras, and fastai.

from tsensor import explain as exp

x_input = np.array([[10], [20], [-20], [-40], [-3]])

# 1x2 weight matrix
l1_weights = np.array([[.73, .2]])

# 1x2 bias matrix
l1_bias = np.array([[4, 2]])

# output
with exp() as c:
    l1_output = x_input @ l1_weights + l1_bias
tsensor output. Image by the author.
Outputs. Image by the author.

We apply an activation function to the output of the aforementioned process.

l1_activated = relu(l1_output)
l1 activated. Image by the author.

A useful guideline is that the number of rows in the weight matrix should match the number of columns in the input matrix, while the number of columns in the weight matrix should match the number of columns in the output matrix. Consequently, in layer one, our weight matrix is 1x2, signifying a transition from 1 input feature to 2 output features.

tsensor output. Image by the author.
Output. Image by the author.
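The layer 2 computation itself only appears in the images above. For completeness, here is a sketch of that step; the exact parameter values are not given in the text, but a 2x1 weight matrix of [[.4], [.1]] and a 1x1 bias of [[4]] are consistent with the numbers printed in the rest of the walkthrough:

# 2x1 weight matrix: 2 input features -> 1 output feature (assumed values)
l2_weights = np.array([[.4], [.1]])

# 1x1 bias matrix (assumed value)
l2_bias = np.array([[4]])

# final 5x1 output of the network
output = l1_activated @ l2_weights + l2_bias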

Essentially, this is the fundamental process of making predictions in neural networks. It involves utilizing the weight matrix and bias matrix for each layer, performing repeated multiplications with the weight matrix, incorporating the bias, applying a non-linearity, and repeating this process for each layer. To accommodate additional units within a layer, you can simply add more columns to the weight matrix.

Loss

Now, we can use mean squared error (or any other metric) to calculate the error between our output and the actual values.

def calculate_mse(actual, predicted):
    return (actual - predicted) ** 2

actual = np.array([[9], [13], [5], [-2], [-1]])

print(calculate_mse(actual,output))

"""
[[6.4000000e-03]
[9.2160000e-01]
[1.0000000e+00]
[3.6000000e+01]
[3.4386496e+01]]
"""

During gradient descent, we need the gradient of the loss function, i.e. its rate of change: it tells us how the loss changes as we modify the predictions. For the squared error, this derivative is 2 * (predicted - actual); the constant factor of 2 is conventionally dropped, since it can be absorbed into the learning rate.

def gradient_mse(actual, predicted):
    return predicted - actual

print(gradient_mse(actual,output))

"""
[[-0.08 ]
[-0.96 ]
[-1. ]
[ 6. ]
[ 5.864]]
"""

This information guides the necessary adjustment to our prediction in order to minimize the error effectively.
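To make that concrete, here is a small, purely illustrative check: nudging the predictions a fraction of the way against the gradient lowers the squared error:

adjusted = output - 0.1 * gradient_mse(actual, output)
print(calculate_mse(actual, adjusted).mean() < calculate_mse(actual, output).mean())  # True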

Backward Pass

Backpropagation, in essence, reverses the forward pass to distribute the gradient to the different parameters of the network, such as weights and biases. This gradient plays a crucial role in enabling the network to learn through the utilization of gradient descent.

Forward pass:

Forward pass. Image by the author.

We basically reverse this pass to get a backward pass.

Backward pass. Image by the author.

We compute the partial derivative of the loss function with respect to each parameter. The gradient of the output, referred to as the L2 output gradient, is used to update the weights in layer 2: we multiply the input to layer 2 (which is the output of layer 1 in the forward pass) by the L2 output gradient. The bias gradient is obtained by averaging the output gradient.

Next, we propagate the gradient to layer one by multiplying the output gradient by the layer 2 weights. We then pass this gradient through the ReLU function and utilize it to update the weights and bias in layer 1.

output_gradient = gradient_mse(actual, output)

with exp():
    l2_w_gradient = l1_activated.T @ output_gradient
l2_w_gradient

"""
array([[-8.14616],
[ 2.1296 ]])
"""
tsensor output. Image by the author.

We took the transpose of the 5x2 l1_activated matrix (now it is 2x5), which was the output of layer 1 in the forward pass, and multiplied it by our 5x1 output_gradient to obtain a 2x1 matrix. This matrix gives us the gradient of each value in our w2 matrix.

Image by the author.

The diagram provided illustrates how the inputs are multiplied by the weights during the forward pass to generate the output. It is evident that each weight is connected to multiple inputs and multiple outputs.

Image by the author.

We transpose the output of layer 1 and multiply it by the output gradient to obtain the weight gradient.

l2_w_gradient =  l1_activated.T @ output_gradient

To ensure that each input connected to an output through a weight is multiplied by the corresponding output gradient, we transpose the layer one output matrix (which serves as the input to layer 2).

By multiplying the output gradients with the inputs to the layer, we can determine the necessary adjustments to our weights. This relationship arises from the application of the chain rule of partial derivatives.

By applying the chain rule, we compute the partial derivative of the loss with respect to the second weight matrix: the partial derivative of the loss with respect to the layer output, multiplied by the partial derivative of x · w2 with respect to w2 (which is simply the layer input x). Evaluating this product gives us the gradient for the weight matrix.
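As a sanity check (not part of the original walkthrough), we can compare this analytic gradient against a numerical finite-difference estimate. The sketch below reuses the l1_activated, l2_weights, l2_bias, actual and output_gradient variables from above; since output_gradient = output - actual, the matching loss is 0.5 * sum((output - actual)^2):

def loss_for(weights):
    # 0.5 * sum of squared errors, whose gradient w.r.t. the output is exactly (output - actual)
    out = l1_activated @ weights + l2_bias
    return 0.5 * np.sum((out - actual) ** 2)

eps = 1e-6
w_perturbed = l2_weights.copy()
w_perturbed[0, 0] += eps

numeric = (loss_for(w_perturbed) - loss_for(l2_weights)) / eps
analytic = (l1_activated.T @ output_gradient)[0, 0]
print(np.isclose(numeric, analytic, rtol=1e-3))  # should print True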

Next, we calculate the derivative for the bias.

with exp():
    l2_b_gradient = np.mean(output_gradient, axis=0)

l2_b_gradient

"""
array([1.9648])
"""
tsensor output. Image by the author.

To update the weights and biases in layer 2, we subtract the gradient from the current values of w and b, scaled by the learning rate. The learning rate helps to prevent updates that are too large, which could cause us to move away from the optimal solution with the lowest error.

# Set a learning rate
lr = 1e-4

with exp():
    # Update the bias values
    l2_bias = l2_bias - l2_b_gradient * lr
    # Update the weight values
    l2_weights = l2_weights - l2_w_gradient * lr

l2_weights

"""
array([[0.40171069],
[0.09955278]])
"""

We update the l2 bias and l2 weights.

tsensor output. Image by the author.
tsensor output. Image by the author.

Layer 1 Gradients

Next, we compute the gradients for layer 1. In the forward pass, the layer 1 outputs were scaled by the layer 2 weights to produce the layer 2 output. To determine the gradient of the loss with respect to the layer 1 output, we reverse that step and scale the output gradient by the layer 2 weights.

with exp():
    # Calculate the gradient on the output of layer 1
    l1_activated_gradient = output_gradient @ l2_weights.T

l1_activated_gradient

"""
array([[-0.03213686, -0.00796422],
[-0.38564227, -0.09557067],
[-0.40171069, -0.09955278],
[ 2.41026416, 0.5973167 ],
[ 2.35563151, 0.58377753]])
"""
tsensor output. Image by the author.

Moving on, we compute the gradients for the layer 1 weights and biases. First, we need to propagate the gradient through the ReLU non-linearity. To achieve this, we consider the derivative of the ReLU function.

Relu. Image by the author.

The slope of ReLU is 0 for negative inputs and 1 for positive inputs, so its derivative is either 0 or 1.
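numpy's heaviside function gives exactly this 0/1 mask; its second argument is the value to use where the input is exactly zero:

print(np.heaviside(np.array([-2.0, 0.0, 3.0]), 0))  # [0. 0. 1.]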

with exp():
    l1_output_gradient = l1_activated_gradient * np.heaviside(l1_output, 0)

l1_output_gradient

"""
array([[-0.03213686, -0.00796422],
[-0.38564227, -0.09557067],
[-0. , -0. ],
[ 0. , 0. ],
[ 2.35563151, 0.58377753]])
"""
tsensor output. Image by the author.

Now, we calculate the layer 1 gradient.

# back propagation
l1_w_gradient = x_input.T @ l1_output_gradient
l1_b_gradient = np.mean(l1_output_gradient, axis=0)

# gradient descent
l1_weights -= l1_w_gradient * lr
l1_bias -= l1_b_gradient * lr

In the previous step, we computed the gradients for layer 1 and used them to update the weight and bias values. Essentially, we performed backpropagation through two layers and applied gradient descent in each of them.

Training

Below is the algorithm we employed:

  1. Perform the forward pass through the network and obtain the output.
  2. Calculate the gradient for the network outputs using the mse_grad function.
  3. For each layer in the network:
  • Determine the gradient for the pre-nonlinearity output (if the layer includes a nonlinearity).
  • Calculate the gradient for the weights.
  • Compute the gradient for the biases.
  • Determine the gradient for the inputs to the layer.

  4. Update the network parameters using gradient descent.

For simplicity, we consolidated step 4 into step 3. However, it is crucial to understand that backpropagation corresponds to step 3, while gradient descent corresponds to step 4. By separating these steps, it becomes more convenient to employ different variations of gradient descent, such as Adam or RMSProp, to update the weights. Stages 3 and 4 are commonly referred to as the backward pass of a neural network.
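To see why keeping the two steps separate is convenient, here is a rough sketch (not part of the Neural class below) of how a different update rule, such as SGD with momentum, could replace the plain gradient-descent step while the backpropagation code stays unchanged:

def momentum_update(param, grad, velocity, lr=1e-4, beta=0.9):
    # keep an exponentially decayed running sum of past gradients
    velocity = beta * velocity + grad
    # step in the direction of the accumulated gradient
    return param - lr * velocity, velocity

# usage sketch: each velocity starts as zeros with the parameter's shape
# w, w_velocity = momentum_update(w, w_grad, w_velocity)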

Backpropagation and gradient descent represent the more intricate aspects of training neural networks. Conceptually, backpropagation involves reversing the forward pass of the network to determine the most effective means of minimizing error. This entails propagating the gradient of the loss from one layer to another and applying the chain rule along the way.

Code

In the following section, we have consolidated all the aforementioned topics into a class called Neural.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from statistics import mean
from typing import Dict, List, Tuple

np.random.seed(42)

class Neural:

    def __init__(self, layers: List[int], epochs: int,
                 learning_rate: float = 0.001, batch_size: int = 32,
                 validation_split: float = 0.2, verbose: int = 1):
        self._layer_structure: List[int] = layers
        self._batch_size: int = batch_size
        self._epochs: int = epochs
        self._learning_rate: float = learning_rate
        self._validation_split: float = validation_split
        self._verbose: int = verbose
        self._losses: Dict[str, List[float]] = {"train": [], "validation": []}
        self._is_fit: bool = False
        self.__layers = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # validation split
        X, X_val, y, y_val = train_test_split(X, y, test_size=self._validation_split, random_state=42)
        # initialization of layers
        self.__layers = self.__init_layers()
        for epoch in range(self._epochs):
            epoch_losses = []
            # iterate over the training data in mini-batches
            for i in range(0, X.shape[0], self._batch_size):
                x_batch = X[i:(i + self._batch_size)]
                y_batch = y[i:(i + self._batch_size)]
                # forward pass
                pred, hidden = self.__forward(x_batch)
                # calculate loss gradient
                loss = self.__calculate_loss(y_batch, pred)
                epoch_losses.append(np.mean(loss ** 2))
                # backward pass
                self.__backward(hidden, loss)
            valid_preds, _ = self.__forward(X_val)
            train_loss = mean(epoch_losses)
            valid_loss = np.mean(self.__calculate_mse(valid_preds, y_val))
            self._losses["train"].append(train_loss)
            self._losses["validation"].append(valid_loss)
            if self._verbose:
                print(f"Epoch: {epoch} Train MSE: {train_loss} Valid MSE: {valid_loss}")
        self._is_fit = True
        return

    def predict(self, X: np.ndarray) -> np.ndarray:
        if not self._is_fit:
            raise Exception("Model has not been trained yet.")
        pred, hidden = self.__forward(X)
        return pred

    def plot_learning(self) -> None:
        plt.plot(self._losses["train"], label="loss")
        plt.plot(self._losses["validation"], label="validation")
        plt.legend()

    def __init_layers(self) -> List[List[np.ndarray]]:
        layers = []
        for i in range(1, len(self._layer_structure)):
            layers.append([
                np.random.rand(self._layer_structure[i-1], self._layer_structure[i]) / 5 - .1,
                np.ones((1, self._layer_structure[i]))
            ])
        return layers

    def __forward(self, batch: np.ndarray) -> Tuple[np.ndarray, List[np.ndarray]]:
        hidden = [batch.copy()]
        for i in range(len(self.__layers)):
            batch = np.matmul(batch, self.__layers[i][0]) + self.__layers[i][1]
            if i < len(self.__layers) - 1:
                batch = np.maximum(batch, 0)
            # Store the forward pass hidden values for use in backprop
            hidden.append(batch.copy())
        return batch, hidden

    def __calculate_loss(self, actual: np.ndarray, predicted: np.ndarray) -> np.ndarray:
        """Gradient of the MSE loss with respect to the predictions."""
        return predicted - actual

    def __calculate_mse(self, actual: np.ndarray, predicted: np.ndarray) -> np.ndarray:
        return (actual - predicted) ** 2

    def __backward(self, hidden: List[np.ndarray], grad: np.ndarray) -> None:
        for i in range(len(self.__layers)-1, -1, -1):
            if i != len(self.__layers) - 1:
                # propagate through the ReLU: zero out the gradient where the unit was inactive
                grad = np.multiply(grad, np.heaviside(hidden[i+1], 0))

            w_grad = hidden[i].T @ grad
            b_grad = np.mean(grad, axis=0)

            self.__layers[i][0] -= w_grad * self._learning_rate
            self.__layers[i][1] -= b_grad * self._learning_rate

            grad = grad @ self.__layers[i][0].T
        return

Let’s generate some dummy data to test the Neural class.

def generate_data():
    # Define correlation values
    corr_a = 0.8
    corr_b = 0.4
    corr_c = -0.2

    # Generate independent features
    a = np.random.normal(0, 1, size=100000)
    b = np.random.normal(0, 1, size=100000)
    c = np.random.normal(0, 1, size=100000)
    d = np.random.randint(0, 4, size=100000)
    e = np.random.binomial(1, 0.5, size=100000)

    # Generate target feature based on independent features
    target = 50 + corr_a*a + corr_b*b + corr_c*c + d*10 + 20*e + np.random.normal(0, 10, size=100000)

    # Create DataFrame with all features
    df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'target': target})
    return df

And the client code:

df = generate_data()

# Separate the features and target
X = df.drop('target', axis=1)
y = df['target']

scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_train = y_train.to_numpy().reshape(-1,1)
y_test = y_test.to_numpy().reshape(-1,1)

layer_structure = [X_train.shape[1],10,10,1]
nn = Neural(layer_structure, 20, 1e-5, 64, 0.2, 1)

nn.fit(X_train, y_train)

y_pred = nn.predict(X_test)
nn.plot_learning()

print("Test error: ",mean_squared_error(y_test, y_pred))

"""
Epoch: 0 Train MSE: 6066.584227303851 Valid MSE: 5630.972575612107
Epoch: 1 Train MSE: 5870.45827241517 Valid MSE: 5384.024826048717
Epoch: 2 Train MSE: 5584.489636840577 Valid MSE: 4993.952466830458
Epoch: 3 Train MSE: 5127.64238543267 Valid MSE: 4376.563641292963
Epoch: 4 Train MSE: 4408.550555767417 Valid MSE: 3470.967255214888
Epoch: 5 Train MSE: 3370.6165240733935 Valid MSE: 2333.4365011529103
Epoch: 6 Train MSE: 2112.702666917853 Valid MSE: 1245.1547720938968
Epoch: 7 Train MSE: 1001.3618108374816 Valid MSE: 565.5834115291266
Epoch: 8 Train MSE: 396.9514096548994 Valid MSE: 298.31216370120575
Epoch: 9 Train MSE: 198.29006090703072 Valid MSE: 204.83294115572235
Epoch: 10 Train MSE: 139.2931182121901 Valid MSE: 162.0341771457693
Epoch: 11 Train MSE: 113.971621253487 Valid MSE: 138.35491897074462
Epoch: 12 Train MSE: 100.19734344395454 Valid MSE: 124.60170156400542
Epoch: 13 Train MSE: 92.35069581444299 Valid MSE: 116.55999261926036
Epoch: 14 Train MSE: 87.88890529435344 Valid MSE: 111.85169154584908
Epoch: 15 Train MSE: 85.37162170152865 Valid MSE: 109.08001681897412
Epoch: 16 Train MSE: 83.96135084225956 Valid MSE: 107.42929147368837
Epoch: 17 Train MSE: 83.17564386183105 Valid MSE: 106.42800615549532
Epoch: 18 Train MSE: 82.73977092210092 Valid MSE: 105.80581167903857
Epoch: 19 Train MSE: 82.49876360284046 Valid MSE: 105.40815905002043
Test error: 105.40244085384184
"""

And the learning curve:

Learning curves. Image by the author.

We have explored the step-by-step process of building a neural network from scratch using Python. This hands-on guide has provided a lean and simple implementation, allowing us to gain a fundamental understanding of neural network architectures. However, it’s important to note that this is just the beginning of our journey into the vast world of neural networks. There are numerous advanced concepts and techniques yet to be explored.


Sources

https://www.youtube.com/watch?v=MQzG1hfhow4

https://github.com/VikParuchuri/zero_to_gpt/blob/master/explanations/dense.ipynb

https://github.com/parrt/tensor-sensor
