Building a Neural Network from Scratch in Python: A Step-by-Step Guide

A Hands-On Guide to Building a Neural Network from Scratch with Python

Okan Yenigün
May 16, 2023

This blog post will guide you through the process of coding a neural network from scratch in Python. Not only will we provide step-by-step instructions, but we will also delve into the underlying theory behind neural networks.

Linear and Non-Linear

Linear regression serves as a fundamental starting point in the realm of machine learning. However, it is limited in its ability to effectively capture and explain nonlinearity in data. This limitation arises from the underlying assumptions and structure of linear regression models.

Linear regression assumes a linear relationship between the independent variables and the target variable. It seeks to fit a straight line or hyperplane that best represents the relationship between the variables. However, many real-world phenomena exhibit complex nonlinear patterns and interactions that cannot be accurately modeled by a simple linear relationship.

Linear regression falls short in effectively capturing complex nonlinear relationships, but neural networks excel in this aspect. Neural networks enhance linear regression in three significant ways:

  1. Nonlinear Transformation: Unlike linear regression, neural networks apply nonlinear transformations on top of the linear transformation. This enables them to model and capture intricate nonlinear patterns in the data.
  2. Multiple Layers: Neural networks consist of multiple layers, allowing them to capture interactions and dependencies between features. Each layer contributes to extracting higher-level representations of the input data, enabling more sophisticated modeling.
  3. Multiple Hidden Units: Within each layer of a neural network, there are multiple hidden units. Each hidden unit performs its unique combination of linear and nonlinear transformations, providing flexibility in capturing complex relationships and enhancing the model’s ability to learn intricate patterns.

Now, let’s explore some essential concepts underlying neural networks.

Activation Functions

We use activation functions to perform these nonlinear transformations, introducing nonlinearity into the otherwise linear computation.

We modify the linear function y = wx + b by applying an activation function, resulting in the transformed equation y = activation(wx + b).

import numpy as np
import matplotlib.pyplot as plt

def plot_func(x, y, title):
    # helper function to plot activation functions
    plt.plot(x, y)
    plt.title(title)
    plt.xlabel('x')
    plt.ylabel('activation(x)')
    plt.grid(True)
    plt.show()

x = np.linspace(-10, 10, 100)

Sigmoid

The sigmoid activation function is a mathematical function that maps the input to a value between 0 and 1, providing a smooth S-shaped curve.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Sigmoid function. Image by the author.
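The plot in the figure can be reproduced with the helper defined above; the same one-liner pattern works for each of the activations that follow:

plot_func(x, sigmoid(x), 'Sigmoid')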

Rectified Linear Unit (ReLU)

The ReLU (Rectified Linear Unit) activation function is a mathematical function that returns the input if it is positive, and zero otherwise, providing a piecewise linear output.

def relu(x):
    return np.maximum(0, x)
Relu function. Image by the author.

Leaky ReLU

The Leaky ReLU activation function is a variation of the ReLU function that allows a small, non-zero output for negative input values, preventing complete suppression of information.

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha*x, x)
Leaky relu function. Image by the author.

Tanh

The tanh (hyperbolic tangent) activation function is a mathematical function that maps the input to a value between -1 and 1, providing a smooth S-shaped curve centered around zero.

def tanh(x):
    return np.tanh(x)
Tanh function. Image by the author.

Softmax

The softmax activation function is a mathematical function commonly used in multiclass classification tasks to convert a vector of real values into a probability distribution over multiple classes.

def softmax(x):
    # subtract the max for numerical stability; the result is unchanged
    exp_scores = np.exp(x - np.max(x))
    return exp_scores / np.sum(exp_scores)
Softmax function. Image by the author.

We will go on with the ReLU activation function in our implementation.

Multiple Layers

Of course, simply applying the ReLU activation on its own would still leave a large error, primarily because a single ReLU unit produces the same output (zero) for every input below zero. To address this, we stack multiple layers in our neural network. For instance, a neural network with two layers can be represented by the equation ŷ = w2 * relu(w1 * x + b1) + b2.
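As a quick illustration (with made-up parameter values, not taken from the article), here is that two-layer equation in code; the kink introduced by ReLU bends the otherwise straight line:

w1, b1, w2, b2 = 0.5, 1.0, -2.0, 3.0  # illustrative values
two_layer = lambda x: w2 * np.maximum(0, w1 * x + b1) + b2
plt.plot(x, two_layer(x))  # flat at 3 for x < -2, a sloped line afterwards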

Multiple Hidden Units

A lambda function, prediction, calculates a linear prediction based on the input x, using predefined values for the weights (w1) and bias (b).

prediction = lambda x, w1=.2, b=1.99: x * w1 + b

Then, we apply the ReLU activation function to the linear predictions.

layer1_1 = np.maximum(0, prediction(x))
plt.plot(x, layer1_1)
Image by the author.

What happens if we add another hidden unit?

layer1_2 = np.maximum(0, prediction(x, .3, -2))
plt.plot(x, layer1_1+layer1_2)
Image by the author.

We introduced a nonlinearity in the output. Let’s add one more.

layer1_3 = np.maximum(0, prediction(x, .6, -2))
plt.plot(x, layer1_1+layer1_2+layer1_3)
Image by the author.

By increasing the number of units, we observe the emergence of a more pronounced non-linear relationship. As we adjust the weights, we can observe corresponding changes in the relationship.
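The same idea can be written as a loop: sum several ReLU units, each with its own weight and bias (the values below are arbitrary, added only to extend the experiment):

units = [(.2, 1.99), (.3, -2), (.6, -2), (-.4, 1)]  # (weight, bias) pairs; the last pair is made up
combined = sum(np.maximum(0, prediction(x, w, b)) for w, b in units)
plt.plot(x, combined)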

Let’s draw a diagram for a two-layer structure.

Image by the author.

Calculating Outputs

As evident from the above, the output of one component serves as the input for another. Matrix multiplication is employed to compute the outputs in this process.

Matrix multiplication. Source
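As a quick refresher on how the shapes combine, numpy's @ operator performs exactly this row-by-column multiplication:

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2x3
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])       # 3x2
print(A @ B)                   # 2x2 result: [[58, 64], [139, 154]]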

Suppose we have an input represented by a matrix of shape 2x1.

Forward Pass

The tsensor package can be employed to visualize tensor variables effectively. By enhancing error messages and displaying Python code, TensorSensor provides insights into the shape of tensor variables. It is compatible with popular libraries such as TensorFlow, PyTorch, JAX, Numpy, Keras, and fastai.

from tsensor import explain as exp

x_input = np.array([[10], [20], [-20], [-40], [-3]])

# 1x2 weight matrix
l1_weights = np.array([[.73, .2]])

# 1x2 bias matrix
l1_bias = np.array([[4, 2]])

# output
with exp() as c:
    l1_output = x_input @ l1_weights + l1_bias
tsensor output. Image by the author.
Outputs. Image by the author.

We apply an activation function to the output of the aforementioned process.

l1_activated = relu(l1_output)
l1 activated. Image by the author.

A useful guideline is that the number of rows in the weight matrix should match the number of columns in the input matrix, while the number of columns in the weight matrix should match the number of columns in the output matrix. Consequently, in layer one, our weight matrix is 1x2, signifying a transition from 1 input feature to 2 output features.

tsensor output. Image by the author.
Output. Image by the author.
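The layer 2 computation itself only appears in the images above. For completeness, here is a sketch of that step; the exact parameter values are not given in the text, but a 2x1 weight matrix of [[.4], [.1]] and a 1x1 bias of [[4]] are consistent with the numbers printed in the rest of the walkthrough:

# 2x1 weight matrix: 2 input features -> 1 output feature (assumed values)
l2_weights = np.array([[.4], [.1]])

# 1x1 bias matrix (assumed value)
l2_bias = np.array([[4]])

# final 5x1 output of the network
output = l1_activated @ l2_weights + l2_bias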

Essentially, this is the fundamental process of making predictions in neural networks. It involves utilizing the weight matrix and bias matrix for each layer, performing repeated multiplications with the weight matrix, incorporating the bias, applying a non-linearity, and repeating this process for each layer. To accommodate additional units within a layer, you can simply add more columns to the weight matrix.

Loss

Now, we can use mean squared error (or any other metric) to calculate the error between our output and the actual values.

def calculate_mse(actual, predicted):
    return (actual - predicted) ** 2

actual = np.array([[9], [13], [5], [-2], [-1]])

print(calculate_mse(actual,output))

"""
[[6.4000000e-03]
[9.2160000e-01]
[1.0000000e+00]
[3.6000000e+01]
[3.4386496e+01]]
"""

During gradient descent, we need the gradient of the loss function, i.e. its rate of change: it tells us how the loss changes as we modify the predictions. For the squared error, this derivative is 2 * (predicted - actual); the constant factor of 2 is conventionally dropped, since it can be absorbed into the learning rate.

def gradient_mse(actual, predicted):
    return predicted - actual

print(gradient_mse(actual,output))

"""
[[-0.08 ]
[-0.96 ]
[-1. ]
[ 6. ]
[ 5.864]]
"""

This information guides the necessary adjustment to our prediction in order to minimize the error effectively.
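To make that concrete, here is a small, purely illustrative check: nudging the predictions a fraction of the way against the gradient lowers the squared error:

adjusted = output - 0.1 * gradient_mse(actual, output)
print(calculate_mse(actual, adjusted).mean() < calculate_mse(actual, output).mean())  # True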

Backward Pass

Backpropagation, in essence, reverses the forward pass to distribute the gradient to the different parameters of the network, such as weights and biases. This gradient plays a crucial role in enabling the network to learn through the utilization of gradient descent.

Forward pass:

Forward pass. Image by the author.

We basically reverse this pass to get a backward pass.

Backward pass. Image by the author.

We compute the partial derivative of the loss function with respect to each parameter. The gradient of the output, referred to as the L2 output gradient, is used to update the weights in layer 2: we multiply the input to layer 2 (which is the output of layer 1 in the forward pass) by the L2 output gradient. The bias gradient is obtained by averaging the output gradient.

Next, we propagate the gradient to layer one by multiplying the output gradient by the layer 2 weights. We then pass this gradient through the ReLU function and utilize it to update the weights and bias in layer 1.

output_gradient = gradient_mse(actual, output)

with exp():
    l2_w_gradient = l1_activated.T @ output_gradient
l2_w_gradient

"""
array([[-8.14616],
[ 2.1296 ]])
"""
tsensor output. Image by the author.

We took the transpose of the 5x2 l1_activated matrix (now it is 2x5), which was the output of layer 1 in the forward pass, and multiplied it by our 5x1 output_gradient to obtain a 2x1 matrix. This matrix gives us the gradient of each value in our w2 matrix.

Image by the author.

The diagram provided illustrates how the inputs are multiplied by the weights during the forward pass to generate the output. It is evident that each weight is connected to multiple inputs and multiple outputs.

Image by the author.

We transpose the output of layer 1 and multiply it by the output gradient to obtain the weight gradient.

l2_w_gradient =  l1_activated.T @ output_gradient

To ensure that each input connected to an output through a weight is multiplied by the corresponding output gradient, we transpose the layer one output matrix (which serves as the input to layer 2).

By multiplying the output gradients with the inputs to the layer, we can determine the necessary adjustments to our weights. This relationship arises from the application of the chain rule of partial derivatives.

By applying the chain rule, we compute the partial derivative of the loss with respect to the second weight matrix: the partial derivative of the loss with respect to the layer output, multiplied by the partial derivative of x · w2 with respect to w2 (which is simply the layer input x). Evaluating this product gives us the gradient for the weight matrix.
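As a sanity check (not part of the original walkthrough), we can compare this analytic gradient against a numerical finite-difference estimate. The sketch below reuses the l1_activated, l2_weights, l2_bias, actual and output_gradient variables from above; since output_gradient = output - actual, the matching loss is 0.5 * sum((output - actual)^2):

def loss_for(weights):
    # 0.5 * sum of squared errors, whose gradient w.r.t. the output is exactly (output - actual)
    out = l1_activated @ weights + l2_bias
    return 0.5 * np.sum((out - actual) ** 2)

eps = 1e-6
w_perturbed = l2_weights.copy()
w_perturbed[0, 0] += eps

numeric = (loss_for(w_perturbed) - loss_for(l2_weights)) / eps
analytic = (l1_activated.T @ output_gradient)[0, 0]
print(np.isclose(numeric, analytic, rtol=1e-3))  # should print True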

Next, we calculate the derivative for the bias.

with exp():
    l2_b_gradient = np.mean(output_gradient, axis=0)

l2_b_gradient

"""
array([1.9648])
"""
tsensor output. Image by the author.

To update the weights and biases in layer 2, we subtract the gradient from the current values of w and b, scaled by the learning rate. The learning rate helps to prevent updates that are too large, which could cause us to move away from the optimal solution with the lowest error.

# Set a learning rate
lr = 1e-4

with exp():
    # Update the bias values
    l2_bias = l2_bias - l2_b_gradient * lr
    # Update the weight values
    l2_weights = l2_weights - l2_w_gradient * lr

l2_weights

"""
array([[0.40171069],
[0.09955278]])
"""

We update the l2 bias and l2 weights.

tsensor output. Image by the author.
tsensor output. Image by the author.

Layer 1 Gradients

Next, we compute the gradients for layer 1. In the forward pass, the layer 1 outputs were scaled by the layer 2 weights to produce the layer 2 output. To determine the gradient of the loss with respect to the layer 1 output, we reverse that step and scale the output gradient by the layer 2 weights.

with exp():
    # Calculate the gradient on the output of layer 1
    l1_activated_gradient = output_gradient @ l2_weights.T

l1_activated_gradient

"""
array([[-0.03213686, -0.00796422],
[-0.38564227, -0.09557067],
[-0.40171069, -0.09955278],
[ 2.41026416, 0.5973167 ],
[ 2.35563151, 0.58377753]])
"""
tsensor output. Image by the author.

Moving on, we compute the gradients for the layer 1 weights and biases. First, we need to propagate the gradient through the ReLU non-linearity. To achieve this, we consider the derivative of the ReLU function.

Relu. Image by the author.

The slope of ReLU is 0 for negative inputs and 1 for positive inputs, so its derivative is either 0 or 1.
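numpy's heaviside function gives exactly this 0/1 mask; its second argument is the value to use where the input is exactly zero:

print(np.heaviside(np.array([-2.0, 0.0, 3.0]), 0))  # [0. 0. 1.]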

with exp():
    l1_output_gradient = l1_activated_gradient * np.heaviside(l1_output, 0)

l1_output_gradient

"""
array([[-0.03213686, -0.00796422],
[-0.38564227, -0.09557067],
[-0. , -0. ],
[ 0. , 0. ],
[ 2.35563151, 0.58377753]])
"""
tsensor output. Image by the author.

Now, we calculate the layer 1 gradient.

# back propagation
l1_w_gradient = x_input.T @ l1_output_gradient
l1_b_gradient = np.mean(l1_output_gradient, axis=0)

# gradient descent
l1_weights -= l1_w_gradient * lr
l1_bias -= l1_b_gradient * lr

In the previous step, we computed the gradients for layer 1 and used them to update the weight and bias values. Essentially, we performed backpropagation through two layers and applied gradient descent in each of them.

Training

Below is the algorithm we employed:

  1. Perform the forward pass through the network and obtain the output.
  2. Calculate the gradient for the network outputs using the mse_grad function.
  3. For each layer in the network:
  • Determine the gradient for the pre-nonlinearity output (if the layer includes a nonlinearity).
  • Calculate the gradient for the weights.
  • Compute the gradient for the biases.
  • Determine the gradient for the inputs to the layer.

  4. Update the network parameters using gradient descent.

For simplicity, we consolidated step 4 into step 3. However, it is crucial to understand that backpropagation corresponds to step 3, while gradient descent corresponds to step 4. By separating these steps, it becomes more convenient to employ different variations of gradient descent, such as Adam or RMSProp, to update the weights. Stages 3 and 4 are commonly referred to as the backward pass of a neural network.
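To see why keeping the two steps separate is convenient, here is a rough sketch (not part of the Neural class below) of how a different update rule, such as SGD with momentum, could replace the plain gradient-descent step while the backpropagation code stays unchanged:

def momentum_update(param, grad, velocity, lr=1e-4, beta=0.9):
    # keep an exponentially decayed running sum of past gradients
    velocity = beta * velocity + grad
    # step in the direction of the accumulated gradient
    return param - lr * velocity, velocity

# usage sketch: each velocity starts as zeros with the parameter's shape
# w, w_velocity = momentum_update(w, w_grad, w_velocity)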

Backpropagation and gradient descent represent the more intricate aspects of training neural networks. Conceptually, backpropagation involves reversing the forward pass of the network to determine the most effective means of minimizing error. This entails propagating the gradient of the loss from one layer to another and applying the chain rule along the way.

Code

In the following section, we have consolidated all the aforementioned topics into a class called Neural.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from statistics import mean
from typing import Dict, List, Tuple

np.random.seed(42)

class Neural:

    def __init__(self, layers: List[int], epochs: int,
                 learning_rate: float = 0.001, batch_size: int = 32,
                 validation_split: float = 0.2, verbose: int = 1):
        self._layer_structure: List[int] = layers
        self._batch_size: int = batch_size
        self._epochs: int = epochs
        self._learning_rate: float = learning_rate
        self._validation_split: float = validation_split
        self._verbose: int = verbose
        self._losses: Dict[str, List[float]] = {"train": [], "validation": []}
        self._is_fit: bool = False
        self.__layers = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # validation split
        X, X_val, y, y_val = train_test_split(X, y, test_size=self._validation_split, random_state=42)
        # initialization of layers
        self.__layers = self.__init_layers()
        for epoch in range(self._epochs):
            epoch_losses = []
            # iterate over the training data in mini-batches
            for i in range(0, X.shape[0], self._batch_size):
                x_batch = X[i:(i + self._batch_size)]
                y_batch = y[i:(i + self._batch_size)]
                # forward pass
                pred, hidden = self.__forward(x_batch)
                # calculate loss gradient
                loss = self.__calculate_loss(y_batch, pred)
                epoch_losses.append(np.mean(loss ** 2))
                # backward pass
                self.__backward(hidden, loss)
            valid_preds, _ = self.__forward(X_val)
            train_loss = mean(epoch_losses)
            valid_loss = np.mean(self.__calculate_mse(valid_preds, y_val))
            self._losses["train"].append(train_loss)
            self._losses["validation"].append(valid_loss)
            if self._verbose:
                print(f"Epoch: {epoch} Train MSE: {train_loss} Valid MSE: {valid_loss}")
        self._is_fit = True
        return

    def predict(self, X: np.ndarray) -> np.ndarray:
        if not self._is_fit:
            raise Exception("Model has not been trained yet.")
        pred, hidden = self.__forward(X)
        return pred

    def plot_learning(self) -> None:
        plt.plot(self._losses["train"], label="loss")
        plt.plot(self._losses["validation"], label="validation")
        plt.legend()

    def __init_layers(self) -> List[List[np.ndarray]]:
        layers = []
        for i in range(1, len(self._layer_structure)):
            layers.append([
                np.random.rand(self._layer_structure[i-1], self._layer_structure[i]) / 5 - .1,
                np.ones((1, self._layer_structure[i]))
            ])
        return layers

    def __forward(self, batch: np.ndarray) -> Tuple[np.ndarray, List[np.ndarray]]:
        hidden = [batch.copy()]
        for i in range(len(self.__layers)):
            batch = np.matmul(batch, self.__layers[i][0]) + self.__layers[i][1]
            if i < len(self.__layers) - 1:
                batch = np.maximum(batch, 0)
            # Store the forward pass hidden values for use in backprop
            hidden.append(batch.copy())
        return batch, hidden

    def __calculate_loss(self, actual: np.ndarray, predicted: np.ndarray) -> np.ndarray:
        """Gradient of the MSE loss with respect to the predictions."""
        return predicted - actual

    def __calculate_mse(self, actual: np.ndarray, predicted: np.ndarray) -> np.ndarray:
        return (actual - predicted) ** 2

    def __backward(self, hidden: List[np.ndarray], grad: np.ndarray) -> None:
        for i in range(len(self.__layers)-1, -1, -1):
            if i != len(self.__layers) - 1:
                # propagate through the ReLU: zero out the gradient where the unit was inactive
                grad = np.multiply(grad, np.heaviside(hidden[i+1], 0))

            w_grad = hidden[i].T @ grad
            b_grad = np.mean(grad, axis=0)

            self.__layers[i][0] -= w_grad * self._learning_rate
            self.__layers[i][1] -= b_grad * self._learning_rate

            grad = grad @ self.__layers[i][0].T
        return

Let’s generate some dummy data to test the Neural class.

def generate_data():
    # Define correlation values
    corr_a = 0.8
    corr_b = 0.4
    corr_c = -0.2

    # Generate independent features
    a = np.random.normal(0, 1, size=100000)
    b = np.random.normal(0, 1, size=100000)
    c = np.random.normal(0, 1, size=100000)
    d = np.random.randint(0, 4, size=100000)
    e = np.random.binomial(1, 0.5, size=100000)

    # Generate target feature based on independent features
    target = 50 + corr_a*a + corr_b*b + corr_c*c + d*10 + 20*e + np.random.normal(0, 10, size=100000)

    # Create DataFrame with all features
    df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'target': target})
    return df

And the client code:

df = generate_data()

# Separate the features and target
X = df.drop('target', axis=1)
y = df['target']

scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_train = y_train.to_numpy().reshape(-1,1)
y_test = y_test.to_numpy().reshape(-1,1)

layer_structure = [X_train.shape[1],10,10,1]
nn = Neural(layer_structure, 20, 1e-5, 64, 0.2, 1)

nn.fit(X_train, y_train)

y_pred = nn.predict(X_test)
nn.plot_learning()

print("Test error: ",mean_squared_error(y_test, y_pred))

"""
Epoch: 0 Train MSE: 6066.584227303851 Valid MSE: 5630.972575612107
Epoch: 1 Train MSE: 5870.45827241517 Valid MSE: 5384.024826048717
Epoch: 2 Train MSE: 5584.489636840577 Valid MSE: 4993.952466830458
Epoch: 3 Train MSE: 5127.64238543267 Valid MSE: 4376.563641292963
Epoch: 4 Train MSE: 4408.550555767417 Valid MSE: 3470.967255214888
Epoch: 5 Train MSE: 3370.6165240733935 Valid MSE: 2333.4365011529103
Epoch: 6 Train MSE: 2112.702666917853 Valid MSE: 1245.1547720938968
Epoch: 7 Train MSE: 1001.3618108374816 Valid MSE: 565.5834115291266
Epoch: 8 Train MSE: 396.9514096548994 Valid MSE: 298.31216370120575
Epoch: 9 Train MSE: 198.29006090703072 Valid MSE: 204.83294115572235
Epoch: 10 Train MSE: 139.2931182121901 Valid MSE: 162.0341771457693
Epoch: 11 Train MSE: 113.971621253487 Valid MSE: 138.35491897074462
Epoch: 12 Train MSE: 100.19734344395454 Valid MSE: 124.60170156400542
Epoch: 13 Train MSE: 92.35069581444299 Valid MSE: 116.55999261926036
Epoch: 14 Train MSE: 87.88890529435344 Valid MSE: 111.85169154584908
Epoch: 15 Train MSE: 85.37162170152865 Valid MSE: 109.08001681897412
Epoch: 16 Train MSE: 83.96135084225956 Valid MSE: 107.42929147368837
Epoch: 17 Train MSE: 83.17564386183105 Valid MSE: 106.42800615549532
Epoch: 18 Train MSE: 82.73977092210092 Valid MSE: 105.80581167903857
Epoch: 19 Train MSE: 82.49876360284046 Valid MSE: 105.40815905002043
Test error: 105.40244085384184
"""

And the learning curve:

Learning curves. Image by the author.

We have explored the step-by-step process of building a neural network from scratch using Python. This hands-on guide has provided a lean and simple implementation, allowing us to gain a fundamental understanding of neural network architectures. However, it’s important to note that this is just the beginning of our journey into the vast world of neural networks. There are numerous advanced concepts and techniques yet to be explored.


Sources

https://www.youtube.com/watch?v=MQzG1hfhow4

https://github.com/VikParuchuri/zero_to_gpt/blob/master/explanations/dense.ipynb

https://github.com/parrt/tensor-sensor
