Artificial Neural Networks (ANNs) In Depth

Fraidoon Omarzai
10 min read · Jul 22, 2024


Everything you need to know about ANNs: practical examples, forward propagation, backward propagation, perceptrons, and the maths behind ANNs.

Contents:

  • Introduction To ANNs
  • Perceptrons
  • Regression With Perceptrons
  • Classification With Perceptrons
  • Neural Networks: Two-Layer NN and Three-Layer NN
  • Multi-Layer NN
  • TensorFlow and PyTorch Implementation

Artificial Neural Networks (ANNs):

  • An Artificial Neural Network (ANN) is a computational model inspired by the way biological neural networks in the human brain process information

Deep Learning: a sub-branch of AI and ML that mimics the way the human brain works in order to process data and make decisions efficiently.

Basic Structure Of ANN:

1. Neurons: The basic units that receive input, process it, and pass it on to the next layer; in other words, neurons are the nodes through which data and computations flow

2. Layers: Consist of an input layer, hidden layers, and an output layer:

  • Input Layer: Receives initial data
  • Hidden Layers: Intermediate layers that perform complex computations
  • Output Layer: Produces the final output

3. Operation:

  • Forward Propagation: Input data passes through the network, generating an output
  • Activation Functions: Apply non-linear transformations to inputs at each node, allowing the network to learn complex patterns
  • Backpropagation: Used to adjust the weights based on the error, improving the network’s accuracy; it applies the chain rule to compute the gradient of the loss function with respect to the weights

Types of Neural Networks:

  • Feedforward Neural Networks (FNNs): Data flows in one direction, from input to output
  • Convolutional Neural Networks (CNNs): Specialized for processing grid-like data such as images
  • Recurrent Neural Networks (RNNs): Designed for sequential data, like time series or text
  • Generative Adversarial Networks (GANs): Their main focus is generating new data from scratch
  • Autoencoders: A type of artificial neural network used for unsupervised learning. They aim to learn a compressed representation of data by training the network to ignore signal noise and capture the essential features of the data
  • Transformers: A newer neural network architecture that has revolutionized natural language processing (NLP)

Perceptron

  • A perceptron is a fundamental building block of neural networks; it is the simplest type of artificial neural network and is typically used for binary classification tasks

Key Components of a Perceptron:

  • Inputs {X1,X2,…,Xn}: The signals the perceptron receives; these can be one or more values representing the data’s features
  • Weights {W1,W2,…,Wn}: Each input has an associated weight, signifying its importance in influencing the perceptron’s output. These weights are adjusted during training to optimize performance
  • Bias: An additional term that provides flexibility in modeling complex data patterns
  • Activation Function: A mathematical function that transforms the weighted sum of inputs and bias into a single output value. Link For More

How a Perceptron Works:

  1. Input & Weighting: Each input is multiplied by its corresponding weight.
  2. Summation: The weighted inputs and bias are summed together.
  3. Activation: The activation function is applied to the summed value, resulting in the final output.

Mathematical Representation:
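As a sketch, for inputs x1, x2, …, xn with weights w1, w2, …, wn and bias b, the perceptron computes:

z = w1*x1 + w2*x2 + … + wn*xn + b
y_hat = f(z)

where f is the activation function (a step function in the classic perceptron).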

Perceptron Learning Algorithm

1. Initialization:

  • Initialize weights and bias to small random numbers or zeros

2. Training:

  • For each training sample, calculate the output using the current weights and bias
  • Calculate the error
  • Update the weights and bias based on the error between the predicted output and the actual output

3. Convergence:

  • Repeat the training process until the algorithm converges, i.e., the weights stabilize and the error is minimized, or a maximum number of iterations is reached
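A minimal NumPy sketch of this learning loop (the function name, learning rate lr, and number of epochs are illustrative choices; a step activation and labels in {0, 1} are assumed):

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    # Initialize weights and bias to zeros
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # Forward pass: weighted sum plus bias, then step activation
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
            # Error between the actual and predicted output
            error = target - y_hat
            # Update weights and bias based on the error
            w += lr * error * xi
            b += lr * error
    return w, b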

Regression With Perceptron

  • A perceptron can be used for regression with a linear activation function, making it essentially a linear regressor
  • This approach is helpful for understanding the fundamentals of neural networks in regression but might not be the most powerful technique for complex regression problems

1. Prediction (Forward Pass):

  • w: weights
  • x: inputs
  • b: bias
  • y_hat: final output
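Putting these together, and assuming two input features for simplicity, the forward pass is just the weighted sum with no non-linear activation:

z = w1*x1 + w2*x2 + b
y_hat = z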

2. Loss function (Error Calculation):

  • Loss Function: The loss function computes the error for a single training example.
  • Cost Function: The cost function is the average of the loss function of the entire training set.
  • For regression problems, we use the squared error function, written out below
  • L: loss function
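Using the common convention of a 1/2 factor (so the derivative is cleaner), the loss for a single example and the cost over m examples are:

L(y, y_hat) = (1/2) * (y - y_hat)^2
J(w, b) = (1/m) * sum of L over the m training examples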

Note: The main goal is to find the values of (w1, w2, and b) that give y_hat with the least error. To find the optimal values for (w1, w2, and b), we use gradient descent.

3. Calculate Gradient Descent (Backward Propagation):

  • Gradient descent: an optimization algorithm used to find the set of model parameters (w, b) that minimizes the cost function. Link For More About Optimization
  • Backpropagation (backward propagation of errors): the algorithm used to calculate the gradients and update the model parameters in a NN. It applies the chain rule to compute the gradient of the loss function with respect to the parameters
  • Below is the process for calculating gradient descent:
  • Alpha: the learning rate, which determines the size of the steps taken when updating the model’s weights during optimization. The learning rate directly influences how quickly or slowly a model learns and converges to a minimum of the loss function.
  • We take the derivative of the loss function with respect to w and b to find the values that minimize the loss function
  • In order to simplify the above equation and find out the derivatives:
  • After getting the derivative, our final equations are:
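With the squared error loss above, the derivatives and the resulting update rules take the following form (a sketch, with alpha as the learning rate):

dL/dw1 = -(y - y_hat) * x1
dL/dw2 = -(y - y_hat) * x2
dL/db = -(y - y_hat)

w1 := w1 - alpha * dL/dw1
w2 := w2 - alpha * dL/dw2
b := b - alpha * dL/db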

Classification With Perceptron

1. Prediction (Forward Pass):

  • z: the linear combination of the inputs, weights, and bias
  • sigma: the sigmoid activation function used in binary classification
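As a sketch, for two input features, the forward pass applies the sigmoid to the linear combination:

z = w1*x1 + w2*x2 + b
y_hat = sigma(z) = 1 / (1 + e^(-z))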

2. Loss function (Calculate Error):

  • In classification problems, we use a different loss function than in regression, because combining the sigmoid output with the squared error would give a non-convex cost function with many local minima
  • For classification problems we use cross-entropy, and for binary problems we use binary cross-entropy, written out below
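For a single example, the binary cross-entropy loss can be written as:

L(y, y_hat) = -[ y * log(y_hat) + (1 - y) * log(1 - y_hat) ]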

Note: To find optimal values for (w1,w2, and b) we use gradient descent

Calculate Gradient Descent (Backward Propagation):

  • Below is the process for calculating gradient descent:
  • Below we compute the derivative of the loss function with respect to w and b:
  • Our final equations are:
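With the sigmoid activation and binary cross-entropy, the derivatives simplify to the following form (a sketch, with alpha as the learning rate):

dL/dw1 = (y_hat - y) * x1
dL/dw2 = (y_hat - y) * x2
dL/db = (y_hat - y)

w1 := w1 - alpha * dL/dw1
w2 := w2 - alpha * dL/dw2
b := b - alpha * dL/db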

Neural Network

Neural network with two layers:

1. Prediction (Forward Pass)

  • Remember: the common activation function in the hidden layer is ReLU; we do not use sigmoid or tanh there, because doing so can lead to the vanishing gradient problem
  • Note on the Vanishing Gradient Problem: this occurs when the gradients of the loss function with respect to the parameters (weights) become very small during backpropagation, effectively preventing the weights from updating properly. The gradients get smaller and smaller and approach zero, which leaves the weights of the earlier layers nearly unchanged; as a result, gradient descent never converges to the optimum.
  • The activation function in the output layer is sigmoid if the problem is binary; for multi-class classification we use softmax
  • Link For More About Activation Function: Link
  • w_ij: the weight connecting the j-th input to the i-th node
  • g: the activation function (in the hidden layer we use ReLU)
  • A sigmoid activation function is used in the output layer
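Assuming a single hidden layer with ReLU and a sigmoid output unit, the forward pass can be sketched as:

Z^[1] = W^[1] X + b^[1]
A^[1] = g(Z^[1])        (g = ReLU)
Z^[2] = W^[2] A^[1] + b^[2]
A^[2] = sigma(Z^[2]) = y_hat

(Whether a transpose of W appears here depends on how the weight matrices are laid out.)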

2. Loss function (Calculate Error):

  • We use binary cross entropy:

3. Compute Gradient Descent (Backward Propagation):
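A sketch of the corresponding gradients for this two-layer network, averaged over m training examples:

dZ^[2] = A^[2] - Y
dW^[2] = (1/m) * dZ^[2] (A^[1])^T
db^[2] = (1/m) * sum(dZ^[2])
dZ^[1] = (W^[2])^T dZ^[2] * g'(Z^[1])
dW^[1] = (1/m) * dZ^[1] X^T
db^[1] = (1/m) * sum(dZ^[1])

The parameters are then updated with W^[l] := W^[l] - alpha * dW^[l] and b^[l] := b^[l] - alpha * db^[l].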

Neural network with three layers:

1. Prediction (Forward Pass):

  • In a multi-layer neural network, it’s common practice to organize all parameter values (weights and biases) in matrices and vectors, grouped by layers. This approach simplifies calculations and makes the operations more efficient
  • W^[1]: is the weight matrix of layer 1
  • b^[1]: is the bias vector of layer 1
  • Z^[1]: is the linear combination (pre-activation value) of layer 1
  • A^[1]: is the activation output of layer 1
  • Remember: we take the transpose of W
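Following the transpose note above, the three-layer forward pass can be sketched as:

Z^[1] = (W^[1])^T X + b^[1],  A^[1] = g(Z^[1])
Z^[2] = (W^[2])^T A^[1] + b^[2],  A^[2] = g(Z^[2])
Z^[3] = (W^[3])^T A^[2] + b^[3],  A^[3] = sigma(Z^[3]) = y_hat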

2. Compute Loss (Calculate Error):

  • We are using binary cross entropy:

3. Compute Gradient Descent (Backward Propagation):

Note: In the last step, we only showed the chain rule for W in layer 3; the same procedure is applied to all the other parameters.
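As a sketch, that chain rule for the weights of layer 3 expands as:

dL/dW^[3] = dL/dA^[3] * dA^[3]/dZ^[3] * dZ^[3]/dW^[3]

and the same pattern is repeated for W^[2], W^[1], and the bias terms.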

Multi-Layer NN:

Step 1: Forward Propagation For Layer l

  • We compute forward propagation in deep learning using the below equations:
  • W^[l]: is the weight matrix of layer l
  • 𝐴^[𝑙−1]: is the activation output from the previous layer (or input data if it’s the first layer)
  • 𝑏^[𝑙]: is the bias vector of layer l
  • Z^[l]: is the linear combination (pre-activation value) for layer l
  • g: is the activation function applied element-wise to Z^[l]
  • A^[l]: is the activation output of layer l

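In equation form, the forward step for layer l is:

Z^[l] = W^[l] A^[l-1] + b^[l]
A^[l] = g(Z^[l]),  with A^[0] = X
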
Step 2: Calculate Loss

  • Loss Function Formula for binary classification: calculates the error for a single training example
  • Cost Function Formula for binary classification: the average of the loss over the entire training set
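Written out, with the per-example loss being the binary cross-entropy shown earlier, the cost over m training examples is:

J = -(1/m) * sum over i of [ y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i) ]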

Step 3: Backward Propagation For Layer l

1. Compute the gradient of the loss with respect to the output layer’s linear combination 𝑍^[𝐿]:

  • g′: typically refers to the derivative of the activation function with respect to its input

2. Gradients for the weights and biases: for each layer l from L down to 1:

3. Update Parameters:
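Putting the three steps above together, a sketch of the backward pass for binary classification with a sigmoid output:

dZ^[L] = A^[L] - Y
For each layer l from L down to 1:
dW^[l] = (1/m) * dZ^[l] (A^[l-1])^T
db^[l] = (1/m) * sum(dZ^[l])
dZ^[l-1] = (W^[l])^T dZ^[l] * g'(Z^[l-1])   (for l > 1)
Update: W^[l] := W^[l] - alpha * dW^[l],  b^[l] := b^[l] - alpha * db^[l]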

Summary

1. Forward Pass:

  • Computes the output of the network given the input data
  • Involves computing the linear combination Z^[l] and applying the activation function A^[l] for each layer
  • Ends with the computation of the loss

2. Backward Pass:

  • Computes the gradients of the loss function with respect to each parameter in the network
  • Involves backpropagating the error through the network to compute dZ^[l], dW^[l], and db^[l] for each layer
  • Uses these gradients to update the weights and biases to minimize the loss

TensorFlow Implementation

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

# Load dataset
data = load_iris()
X = data.data
y = data.target.reshape(-1, 1)

# One-hot encode target variable
encoder = OneHotEncoder(sparse_output=False)  # sparse_output replaces the older sparse argument in recent scikit-learn
y = encoder.fit_transform(y)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model
model = Sequential([
    Dense(10, input_shape=(X.shape[1],), activation='relu'),
    Dense(10, activation='relu'),
    Dense(y.shape[1], activation='softmax')
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=100, batch_size=10, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.2f}")

PyTorch Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Load dataset
data = load_iris()
X = data.data
y = data.target.reshape(-1, 1)

# One-hot encode target variable
encoder = OneHotEncoder(sparse_output=False)  # sparse_output replaces the older sparse argument in recent scikit-learn
y = encoder.fit_transform(y)

# Convert to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

# Create dataset and dataloaders
dataset = TensorDataset(X, y)
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=10, shuffle=False)

# Define the model
class ANN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(ANN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        # No softmax layer here: nn.CrossEntropyLoss applies log-softmax
        # internally, so the model should output raw logits

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

# Model instantiation
input_dim = X.shape[1]
hidden_dim = 10
output_dim = y.shape[1]
model = ANN(input_dim, hidden_dim, output_dim)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 100
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        outputs = model(inputs)
        # CrossEntropyLoss expects class indices, so convert the one-hot targets
        loss = criterion(outputs, torch.argmax(targets, dim=1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for inputs, targets in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += targets.size(0)
        correct += (predicted == torch.argmax(targets, dim=1)).sum().item()

print(f'Test Accuracy: {100 * correct / total:.2f}%')


Fraidoon Omarzai

AI Enthusiast | Pursuing MSc in AI at Aston University, Birmingham