Multilayer Perceptron from Scratch Using Python

An Artificial Neural Network

Joel Raymond Day
Jul 20, 2024

Here I create a neural network from scratch that uses age, exercise, and diet to predict a health score from 1–10. The training data is made up (by me), so although the features are arbitrary and the output is meaningless, this network DOES kick butt at answering one question: 'How does a multilayer perceptron work?'

This network has 3 input neurons, 1 hidden layer with 2 neurons, and 1 output neuron. There is 1 input neuron for each independent feature, and every feature requires scaling; here a min-max scaler (a linear transformation) is used.

import numpy as np # for math
import pandas as pd # for data storage
import random # for weight and bias generation
import time # to record training time
import matplotlib.pyplot as plt # visuals

# independent variables (require scaling)
age = np.array([60, 63, 17, 35, 90, 10])
exercise = np.array([8, 2, 7, 4, 1, 9])
diet = np.array([7, 2, 7, 5, 1, 7])

# min-max scaling for the independent features
# (0 is the chosen minimum; 101 and 11 are the chosen upper bounds)
x1 = (age - 0) / (101 - 0)
x2 = (exercise - 0) / (11 - 0)
x3 = (diet - 0) / (11 - 0)

# dependent variable (does not need to be scaled)
y = np.array([7, 3, 9, 6, 1, 10])

Part I — Forward Propagation

The task is to find the values of the 11 network parameters (8 weights and 3 bias terms) that minimize the loss function. This is done using both forward and backward propagation: forward propagation creates predictions that are measured by a loss function, and backward propagation uses the loss function to adjust the weights and biases. The process is repeated until the error reaches a local or global minimum.
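
Before building the full network, the same cycle can be seen on the smallest possible example: one weight, one made-up data point, and the forward-loss-gradient-update loop that the rest of the article scales up. The numbers here are purely illustrative and are not part of the article's dataset.

# toy example: fit y = w * x to a single made-up point with gradient descent
x_toy, y_toy = 2.0, 6.0   # illustrative data point (not the health dataset)
w = 0.5                   # arbitrary starting weight
lr_toy = 0.05             # learning rate for this toy example

for i in range(200):
    y_hat = w * x_toy                    # forward propagation: make a prediction
    loss = (y_hat - y_toy) ** 2          # measure the error (squared error)
    grad = 2 * (y_hat - y_toy) * x_toy   # backward propagation: dLoss/dw
    w -= lr_toy * grad                   # nudge the weight against the gradient

print(round(w, 3))  # converges to 3.0, since 3.0 * 2.0 = 6.0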

Weights and Bias

Each hidden neuron consists of (1) a linear function h that combines the inputs, weights, and bias and (2) an activation function that takes h as its input. The output neuron is similar, but its linear function, Z, combines h1 and h2 instead of the raw inputs. Initial weights are random numbers drawn uniformly between 0.1 and 0.9, and each bias is set to 0.
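
Written out for this 3-2-1 network, those linear combinations are the ones hidden_layer_model will compute in Part III:

# the linear functions of this 3-2-1 network (the same lines appear in hidden_layer_model in Part III)
h1 = w1*x1 + w2*x2 + w3*x3 + b1   # hidden neuron 1
h2 = w4*x1 + w5*x2 + w6*x3 + b2   # hidden neuron 2
Z  = w7*h1 + w8*h2 + b3           # output neuron's linear function (fed to the activation)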

# generate a random number uniformly between lower and upper bound
lower_bound = .1
upper_bound = .9

# initialize random weights
w1 = random.uniform(lower_bound, upper_bound)
w2 = random.uniform(lower_bound, upper_bound)
w3 = random.uniform(lower_bound, upper_bound)
w4 = random.uniform(lower_bound, upper_bound)
w5 = random.uniform(lower_bound, upper_bound)
w6 = random.uniform(lower_bound, upper_bound)
w7 = random.uniform(lower_bound, upper_bound)
w8 = random.uniform(lower_bound, upper_bound)

# initialize the biases (unlike the weights, these can start at zero)
b1 = 0
b2 = 0
b3 = 0

# initialize hidden neuron values (stored for backward propagation)
h1 = 0
h2 = 0

# learning rate
lr = 0.005

# epochs (full-batch training, so each iteration is one pass over all samples)
iterations = 2000

Activation and Loss Functions

The network’s prediction is the output of the activation function and its performance is evaluated using Mean Squared Error. The derivatives will be used in the backpropagation step.

# Activation function: identity function
def activ_func(x):
    return x  # identity

# Derivative of the identity activation function wrt its input (the weighted sum Z)
def d_activ_func(x):
    return 1

# n = number of samples (also equal to the length of the input vectors)
n = len(age)

# Mean Squared Error (MSE)
def MSE(predictions, targets):
    return (1/n) * np.sum((predictions - targets)**2)
    # ALT METHOD: divide by 2n to make the derivative simpler: return (1/(2*n)) * np.sum((predictions - targets)**2)
    # ALT METHOD: the derivative would then be (1/n) * (predictions - targets)

# Derivative of Mean Squared Error wrt the activation output (Y_pred)
def d_MSE(predictions, targets):
    return (1/n) * 2 * (predictions - targets)

Part II — Backward Propagation

The trick behind the success of artificial neural networks is the gradient vector. Whether the input space is two-dimensional, three-dimensional, or 1,000,000-dimensional, the gradient of a function points in the direction in which the function increases the fastest. In other words, the gradient is a compass: step against it, and the weights are nudged in the direction that reduces the error the most.

How are the exact adjustments derived? The gradient vector has both magnitude and direction, and it is built from the error function: the partial derivative of the error with respect to a parameter is that parameter's adjustment. Think of each partial derivative as the parameter's influence on the error, and use it as the adjustment to the parameter's value. The last step is to multiply the adjustment by a learning rate so the network can converge on an optimal solution instead of overshooting it.
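
In code, that adjustment is the standard gradient-descent update the training loop in Part III applies to every weight and bias (gradient here stands for whichever averaged partial derivative is being applied):

# gradient descent update for a single parameter w
# (gradient = the averaged partial derivative of the error wrt w)
w = w - lr * gradient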

Below are the backpropagation equations for each layer. Layer 1 parameters depend on every layer in the network, while Layer 2 parameters depend only on the output neuron, which is why backpropagation for Layer 1 requires more calculations; a chain-rule sketch follows below. Note: I do not use any code to calculate partial derivatives. I work them out by hand and then encode them with if-else logic.
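
As a reference for the if-else logic, this is the chain-rule factorization the two functions implement, written out for one weight from each layer; the factors map onto the variables one, two, three, four, and five in the code:

# chain rule for a Layer 1 weight, e.g. w1 (feeds x1 into h1):
# dE/dw1 = dE/dY_pred * dY_pred/dZ * dZ/dh1 * dh1_activation/dh1 * dh1/dw1
#        =  d_MSE(...) * d_activ_func(Z) * w7 *         1         *   x1
#
# chain rule for a Layer 2 weight, e.g. w7 (feeds h1 into Z):
# dE/dw7 = dE/dY_pred * dY_pred/dZ * dZ/dw7
#        =  d_MSE(...) * d_activ_func(Z) * h1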

Layer 1 — Weights (1–6) and Biases (1–2)

def back_propagation_layer_1(predictions, targets, Z, parameter):
    # calculate the gradient for the parameter by multiplying the partial derivatives (chain rule)

    # (1) derivative of the loss wrt the prediction
    one = d_MSE(predictions, targets)

    # (2) derivative of the output activation wrt Z
    two = d_activ_func(Z)

    # (3) derivative of Z wrt the hidden neuron the parameter feeds into (dZ/dh1 = w7, dZ/dh2 = w8)
    if parameter == 'w1':
        three = w7
    elif parameter == 'w2':
        three = w7
    elif parameter == 'w3':
        three = w7
    elif parameter == 'w4':
        three = w8
    elif parameter == 'w5':
        three = w8
    elif parameter == 'w6':
        three = w8
    elif parameter == 'b1':
        three = w7
    elif parameter == 'b2':
        three = w8
    else:
        print('invalid parameter')

    # (4) derivative of the hidden activation (identity) wrt its input
    four = 1

    # (5) derivative of the hidden neuron's linear function wrt the parameter
    #     (the input the weight multiplies, or 1 for a bias)
    if parameter == 'w1':
        five = x1
    elif parameter == 'w2':
        five = x2
    elif parameter == 'w3':
        five = x3
    elif parameter == 'w4':
        five = x1
    elif parameter == 'w5':
        five = x2
    elif parameter == 'w6':
        five = x3
    elif parameter == 'b1':
        five = 1
    elif parameter == 'b2':
        five = 1
    else:
        print('invalid parameter')

    # calculate the gradient
    # returns a vector of gradients the size of the input vector
    # gradient sizes may help distinguish feature importance (or at least find the feature with the most potential to improve the model)
    gradient = one * two * three * four * five

    # averaging divides by n again (d_MSE already includes 1/n) and returns the mean gradient for the parameter
    avg_gradient = np.mean(gradient)

    return avg_gradient

Layer 2 — Weights (7–8) and Bias (3)

def back_propagation_layer_2(predictions, targets, Z, h1, h2, parameter):
    # calculate the gradient for the parameter by multiplying the partial derivatives (chain rule)

    # (1) derivative of the loss wrt the prediction
    one = d_MSE(predictions, targets)

    # (2) derivative of the output activation wrt Z
    two = d_activ_func(Z)

    # (3) derivative of Z wrt the parameter
    if parameter == 'w7':
        three = h1
    elif parameter == 'w8':
        three = h2
    # the bias factor is 1 (not the bias value) because b3 is not multiplied by an input
    elif parameter == 'b3':
        three = 1
    else:
        print('invalid parameter')

    # calculate the gradient
    # returns a vector of gradients the size of the input vector
    # gradient sizes may help distinguish feature importance (or at least find the feature with the most potential to improve the model)
    gradient = one * two * three

    # averaging divides by n again (d_MSE already includes 1/n) and returns the mean gradient for the parameter
    avg_gradient = np.mean(gradient)

    return avg_gradient

The model improves at each iteration by subtracting each parameter's learning-rate-scaled gradient from its current value. The partial derivative of the error with respect to a parameter measures how much that parameter contributed to the error, so the adjustment is proportional to that contribution.

Part III — Train and Evaluate the Network

def hidden_layer_model(x1, x2, x3, w1, w2, w3, w4, w5, w6, w7, w8, b1, b2, b3):

    h1 = w1*x1 + w2*x2 + w3*x3 + b1
    h2 = w4*x1 + w5*x2 + w6*x3 + b2
    Z = w7*h1 + w8*h2 + b3
    Y_pred = activ_func(Z)
    return Y_pred, Z, h1, h2

start_time = time.time()
data_storage = pd.DataFrame(columns=['Z', 'y_pred', 'RMSE', 'w1_gradient', 'w2_gradient', 'w3_gradient', 'w4_gradient', 'w5_gradient', 'w6_gradient', 'w7_gradient', 'w8_gradient', 'b1_gradient', 'b2_gradient', 'b3_gradient', 'w1', 'w2', 'w3', 'w4', 'w5', 'w6', 'w7', 'w8', 'b1', 'b2', 'b3'])

for i in range(iterations):

    # Forward propagation
    Y_pred, Z, h1, h2 = hidden_layer_model(x1, x2, x3, w1, w2, w3, w4, w5, w6, w7, w8, b1, b2, b3)

    # calculate prediction error
    loss = MSE(Y_pred, y)
    RMSE = np.sqrt(loss)

    # Backward propagation
    # calculate the gradient for each parameter
    w1_gradient = back_propagation_layer_1(Y_pred, y, Z, 'w1')
    w2_gradient = back_propagation_layer_1(Y_pred, y, Z, 'w2')
    w3_gradient = back_propagation_layer_1(Y_pred, y, Z, 'w3')
    w4_gradient = back_propagation_layer_1(Y_pred, y, Z, 'w4')
    w5_gradient = back_propagation_layer_1(Y_pred, y, Z, 'w5')
    w6_gradient = back_propagation_layer_1(Y_pred, y, Z, 'w6')
    w7_gradient = back_propagation_layer_2(Y_pred, y, Z, h1, h2, 'w7')
    w8_gradient = back_propagation_layer_2(Y_pred, y, Z, h1, h2, 'w8')

    b1_gradient = back_propagation_layer_1(Y_pred, y, Z, 'b1')
    b2_gradient = back_propagation_layer_1(Y_pred, y, Z, 'b2')
    b3_gradient = back_propagation_layer_2(Y_pred, y, Z, h1, h2, 'b3')

    # update weights and biases
    # the gradient points in the direction that increases error the fastest,
    # so each parameter steps the other way: subtract lr * gradient
    w1 -= lr * w1_gradient
    w2 -= lr * w2_gradient
    w3 -= lr * w3_gradient
    w4 -= lr * w4_gradient
    w5 -= lr * w5_gradient
    w6 -= lr * w6_gradient
    w7 -= lr * w7_gradient
    w8 -= lr * w8_gradient

    b1 -= lr * b1_gradient
    b2 -= lr * b2_gradient
    b3 -= lr * b3_gradient

    # store values in the dataframe (one row per iteration)
    data_storage.loc[i] = [Z, Y_pred, RMSE, w1_gradient, w2_gradient, w3_gradient, w4_gradient, w5_gradient, w6_gradient, w7_gradient, w8_gradient, b1_gradient, b2_gradient, b3_gradient, w1, w2, w3, w4, w5, w6, w7, w8, b1, b2, b3]

end_time = time.time()
print('Training Time: ', round(end_time - start_time, 4), 'Seconds')

At each iteration, all relevant training variables are stored in the data frame, with one row per iteration. Notice (1) the RMSE slowly decreases and (2) the predictions converge toward the true targets.
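
One quick way to check both trends is to compare the first and last few rows of the stored history; this is just one possible slice of data_storage:

# compare early vs. late iterations: RMSE should shrink and y_pred should approach y
print(data_storage[['RMSE', 'y_pred']].head())
print(data_storage[['RMSE', 'y_pred']].tail())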

Similar to the coefficients in a regression equation, these parameters can be plugged back into the network and used to make predictions on unseen data. The code below pulls the weights and biases directly from the dataframe, but you could enter them manually.

# pull the trained weights and bias values from the last row of the dataframe
fw1 = data_storage.iloc[-1]['w1']
fw2 = data_storage.iloc[-1]['w2']
fw3 = data_storage.iloc[-1]['w3']
fw4 = data_storage.iloc[-1]['w4']
fw5 = data_storage.iloc[-1]['w5']
fw6 = data_storage.iloc[-1]['w6']
fw7 = data_storage.iloc[-1]['w7']
fw8 = data_storage.iloc[-1]['w8']
fb1 = data_storage.iloc[-1]['b1']
fb2 = data_storage.iloc[-1]['b2']
fb3 = data_storage.iloc[-1]['b3']

def final_model_version_2(x1, x2, x3):

    # apply the same min-max scaling used on the training data
    x1 = (x1 - 0) / (101 - 0)
    x2 = (x2 - 0) / (11 - 0)
    x3 = (x3 - 0) / (11 - 0)

    h1 = (fw1*x1) + (fw2*x2) + (fw3*x3) + fb1
    h2 = (fw4*x1) + (fw5*x2) + (fw6*x3) + fb2
    Z = (fw7*h1) + (fw8*h2) + fb3

    Y_pred = activ_func(Z)

    return Y_pred

# predict a health score from these custom inputs
age = 30
exercise = 2
diet = 5
health = final_model_version_2(age, exercise, diet)
print(health)

The model predicts a health score of about 5.59 for the person above (the exact value will vary slightly with the random weight initialization).

Convergence Visuals

# plot template
fig, axs = plt.subplots(3, 4, figsize=(12,10)) # create a 3 x 4 grid of subplots

# plot RMSE
axs[0, 0].plot(range(iterations), data_storage['RMSE'])
axs[0, 0].set_title('RMSE over iterations')
axs[0, 0].set_xlabel('Iterations')
axs[0, 0].set_ylabel('RMSE')

# plot w1
axs[0, 1].plot(range(iterations), data_storage['w1'])
axs[0, 1].set_title('W1 over iterations')
axs[0, 1].set_xlabel('Iterations')
axs[0, 1].set_ylabel('W1')

# plot w2
axs[0, 2].plot(range(iterations), data_storage['w2'])
axs[0, 2].set_title('W2 over iterations')
axs[0, 2].set_xlabel('Iterations')
axs[0, 2].set_ylabel('W2')

# plot w3
axs[0, 3].plot(range(iterations), data_storage['w3'])
axs[0, 3].set_title('W3 over iterations')
axs[0, 3].set_xlabel('Iterations')
axs[0, 3].set_ylabel('W3')

# plot w4
axs[1, 0].plot(range(iterations), data_storage['w4'])
axs[1, 0].set_title('W4 over iterations')
axs[1, 0].set_xlabel('Iterations')
axs[1, 0].set_ylabel('W4')

# plot w5
axs[1, 1].plot(range(iterations), data_storage['w5'])
axs[1, 1].set_title('W5 over iterations')
axs[1, 1].set_xlabel('Iterations')
axs[1, 1].set_ylabel('W5')

# plot w6
axs[1, 2].plot(range(iterations), data_storage['w6'])
axs[1, 2].set_title('W6 over iterations')
axs[1, 2].set_xlabel('Iterations')
axs[1, 2].set_ylabel('W6')

# plot w7
axs[1, 3].plot(range(iterations), data_storage['w7'])
axs[1, 3].set_title('W7 over iterations')
axs[1, 3].set_xlabel('Iterations')
axs[1, 3].set_ylabel('W7')

# plot w8
axs[2, 0].plot(range(iterations), data_storage['w8'])
axs[2, 0].set_title('W8 over iterations')
axs[2, 0].set_xlabel('Iterations')
axs[2, 0].set_ylabel('W8')

# plot b1
axs[2, 1].plot(range(iterations), data_storage['b1'])
axs[2, 1].set_title('B1 over iterations')
axs[2, 1].set_xlabel('Iterations')
axs[2, 1].set_ylabel('B1')

# plot b2
axs[2, 2].plot(range(iterations), data_storage['b2'])
axs[2, 2].set_title('B2 over iterations')
axs[2, 2].set_xlabel('Iterations')
axs[2, 2].set_ylabel('B2')

# plot b3
axs[2, 3].plot(range(iterations), data_storage['b3'])
axs[2, 3].set_title('B3 over iterations')
axs[2, 3].set_xlabel('Iterations')
axs[2, 3].set_ylabel('B3')

# subplots title
fig.suptitle(f"Activation function: Identity function\nLearning rate: {lr}\nIterations: {iterations}")
plt.tight_layout()
plt.show()

Bonus — Alternative Functions

Sigmoid Activation Function

# sigmoid activation function (non-linear)
def sigmoid_activ_func(x):
    return 1 / (1 + np.exp(-x))

# derivative of the sigmoid activation function
def d_sigmoid_activ_func(x):
    s = sigmoid_activ_func(x)
    return s * (1 - s)

ReLU Activation Function

# ReLU activation function (non-linear)
def relu_activ_func(x):
    return np.maximum(0, x)

# derivative of the ReLU activation function
def d_relu_activ_func(x):
    return (x > 0).astype(float)
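
Neither alternative is wired into the training loop above. One minimal way to try one, assuming only the output activation is swapped, is to redefine the two hook functions the loop already calls; note that putting a non-linear activation on the hidden neurons would also require changing the hard-coded four = 1 factor in back_propagation_layer_1.

# sketch: route the existing hooks to ReLU instead of the identity function
# (assumption: only the output activation changes; the hidden neurons stay linear)
def activ_func(x):
    return relu_activ_func(x)

def d_activ_func(x):
    return d_relu_activ_func(x)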

Standardization Scaling Equation

# Calculate mean and standard deviation
mean_x1 = np.mean(x1)
std_x1 = np.std(x1)

mean_x2 = np.mean(x2)
std_x2 = np.std(x2)

# Standardize the vectors (an alternative to min-max scaling; uncomment to use)
#x1 = (x1 - mean_x1) / std_x1
#x2 = (x2 - mean_x2) / std_x2

Conclusion

You can now train a simple artificial neural network with a hidden layer from scratch using Python!
