Explaining Neural Network as Simple as Possible 1 —The Perceptron

Vectors, Matrices, the Dot Product and the earlier Neural Network — The Perceptron

Alex Punnen
Better ML
12 min read · Feb 25, 2024


“Let’s start at the very beginning, a very good place to start …” from Do-Re-Mi -The Sound of Music

Deep Learning is about neural networks: their structures and designs, and how those networks are trained. Even the most complex neural network is built from vectors and matrices, uses a cost function and an algorithm such as gradient descent to find a reduced cost, and then propagates that cost back proportionally to all constituents of the network via a method called back-propagation.

Have you ever held an integrated circuit, or chip, in your hand or seen one? It looks overwhelmingly complex. But its basis is the humble transistor and Boolean logic. To understand something complex we need to understand its simpler constituents.

Note — Medium does not support MathJax, so the maths equations are shown as pictures here. If you prefer to read via a nicer LaTeX layout, see here — https://alexcpn.github.io/html/NN/ml/

The earliest neural network, Rosenblatt's Perceptron, was the first to use vectors and the properties of the dot product to split the space of input feature vectors with a hyperplane.

What are Vectors?

A vector is an object that has both a magnitude and a direction. Force and velocity are examples: both have a magnitude as well as a direction.

However, we also need to specify the context in which this vector lives, its vector space. For example, when we think about something like a force vector, the context is usually a 2D or 3D Euclidean world.

(Source: 3Blue1Brown)

The easiest way to understand a vector is in such a geometric context, say 2D or 3D Cartesian coordinates.

The properties can then be extrapolated to other vector spaces which we encounter but cannot visualise.
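As a small sketch of the geometric view, here is how the magnitude and direction of a 2D vector can be computed in NumPy (the values are purely illustrative):

import numpy as np

# A 2D vector, e.g. a force acting in the x-y plane (values are illustrative)
v = np.array([3.0, 4.0])

magnitude = np.linalg.norm(v)                    # Euclidean length: sqrt(3^2 + 4^2) = 5.0
direction = np.degrees(np.arctan2(v[1], v[0]))   # angle from the x-axis, about 53.13 degrees

print(magnitude, direction)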

What are Matrices, and how are they related to Vectors?

Matrices mean many things in many areas of mathematics. But in the case of neural networks they are just a way to represent vectors (and tensors).

Vectors are represented as matrices. A vector is essentially a one-dimensional matrix. A matrix is defined to be a rectangular array of numbers. An example is a Euclidean vector in three-dimensional Euclidean space (R³), with some magnitude and a direction (from the origin (0,0,0) in this case).

A vector is represented either as a column matrix or as a row matrix.

Multi-dimensional matrices can be thought of as one-dimensional vectors stacked on top of each other.

The neural network weights, or weight vectors, are stacked together as matrices. This intuition is especially helpful when we use dot products on neural network weight matrices.
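A rough NumPy illustration of this stacking (the numbers are arbitrary): a vector can be written as a row or a column matrix, and several weight vectors can be stacked into a weight matrix so that one matrix-vector product computes all the dot products at once.

import numpy as np

row_vector = np.array([[1, 2, 3]])          # shape (1, 3), a row matrix
column_vector = np.array([[1], [2], [3]])   # shape (3, 1), a column matrix

# Three weight vectors (say, one per neuron) stacked row-wise into a weight matrix
w1 = np.array([0.1, 0.2, 0.3])
w2 = np.array([0.4, 0.5, 0.6])
w3 = np.array([0.7, 0.8, 0.9])
W = np.vstack([w1, w2, w3])                 # shape (3, 3)

x = np.array([1.0, 2.0, 3.0])               # an input vector
print(W.dot(x))                             # one dot product per stacked weight vector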

What are Tensors?

Since we will soon be dealing with multidimensional matrices, it is as well to state here what tensors are. It is easier to define what they represent.

We have seen that a Vector is a one-dimensional matrix.

Higher-dimensional matrices are called tensors.

A vector is a tensor of rank 1, and technically a scalar is a tensor of rank 0. Matrices are tensors of rank 2. Higher-dimensional arrays are tensors of rank N.
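A quick NumPy sketch of these ranks (NumPy reports the rank of an array as its number of dimensions, `ndim`):

import numpy as np

scalar = np.array(5)                  # rank 0 tensor, ndim = 0
vector = np.array([1, 2, 3])          # rank 1 tensor, ndim = 1
matrix = np.array([[1, 2], [3, 4]])   # rank 2 tensor, ndim = 2
tensor3 = np.zeros((2, 3, 4))         # rank 3 tensor, ndim = 3

print(scalar.ndim, vector.ndim, matrix.ndim, tensor3.ndim)  # 0 1 2 3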

Now we are coming to a very important part.

What is the Vector Dot Product and what is so special about the Dot Product?

We could use the dot product as a way to find out if two vectors are aligned or not.

This way we can use it to cluster a collection of vectors (in a vector space).

Algebraically, the dot product is the sum of the products of the corresponding entries of the two sequences of numbers.

if

a = [a₁, a₂, …, aₙ]

and

b = [b₁, b₂, …, bₙ]

then,

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ

Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them:

a · b = |a| |b| cos θ

Note: these definitions are equivalent when using Cartesian coordinates; a simple proof follows from trigonometry (references 8 and 9).

If two vectors are in the same direction the dot product is positive, and if they are in opposite directions the dot product is negative. This can be visualised geometrically by substituting the value of the cosine of the angle.

So we could use the dot product as a way to find out if two vectors are aligned or not.
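A small NumPy sketch of this idea (the vectors are chosen arbitrarily for illustration):

import numpy as np

a = np.array([1.0, 1.0])
b = np.array([2.0, 3.0])     # roughly the same direction as a
c = np.array([-1.0, -2.0])   # roughly the opposite direction

print(np.dot(a, b))  # positive -> aligned
print(np.dot(a, c))  # negative -> opposed

# Cosine of the angle between a and b, from a.b = |a||b|cos(theta)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)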

Dot Product and Splitting the hyper-plane -the crux of the Perceptron

Let’s take a simple example.

Let's assume that a vector like [x, y] is a feature vector of a leaf. Healthy leaves will have values of this feature vector in some range; unhealthy leaves will have values in some other range.

Let's collect features for 10 leaves and assume that 2 of them are unhealthy.

Imagine we have the problem of classifying whether a leaf is healthy or not based on its features. For each leaf, we have a feature vector.

Now, if we have a weight vector whose dot product with the feature vectors of one class (say healthy leaves) is positive, and with the other class is negative, then that weight vector splits the feature hyperplane into two areas.

In the above diagram, the weight vector w (shown as a dotted line) splits the 2D feature space into the positive and negative samples.

For any new leaf, if we extract the same features into a feature vector, we can take its dot product with the trained weight vector and find out whether it falls into the healthy or the diseased class.

A short Colab for this, generated by ChatGPT.

import numpy as np
import matplotlib.pyplot as plt

# Generating synthetic data
# Features for healthy leaves
healthy_leaves = np.random.randn(8, 2) + np.array([2, 2])
# Features for unhealthy leaves
unhealthy_leaves = np.random.randn(2, 2) + np.array([-2, -2])

print(healthy_leaves[0])
print(unhealthy_leaves[0])
# Defining a weight vector by hand
weight_vector = np.array([1, 1])

print(weight_vector.dot(healthy_leaves.T))
print(weight_vector.dot(unhealthy_leaves.T))

# Plotting the features
plt.scatter(healthy_leaves[:, 0], healthy_leaves[:, 1], c='green', label='Healthy Leaves')
plt.scatter(unhealthy_leaves[:, 0], unhealthy_leaves[:, 1], c='red', label='Unhealthy Leaves')

# Plotting the decision boundary
# This decision boundary is determined by the weight vector
# We want to find a line such that weight_vector.dot(x) = 0
# Let's derive two points that lie on the line to plot it
x_values = np.array(plt.gca().get_xlim())
y_values = - (weight_vector[0] / weight_vector[1]) * x_values

plt.plot(x_values, y_values, label='Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

Output

Note that here we have hand-coded the weight vector. We will see below how this code can be adapted to learn the weight vector.

The Perceptron

The initial neural network, Frank Rosenblatt's perceptron, was designed to split a linearly separable feature set into distinct sets. Here is how Rosenblatt's perceptron is modelled.

Image source

The inputs are x₁ to xₙ. The weights w₁ to wₙ are values that are learned. There is also a bias (b), which in the figure above is θ. The bias can be modelled as a weight w₀ connected to a dummy input x₀ that is set to 1.

If we ignore the bias term, the output y is the sum of all inputs times their weights, thresholded at zero:

y = 1 if Σᵢ wᵢxᵢ > 0, else y = 0 (Eq 1)

That is, the perceptron will fire (be activated) if the weighted sum of its inputs is greater than zero; otherwise, it will not.

The Activation Function of the Neural Network

The big blue circle is the primitive brain of this primitive neural network, the perceptron brain. This is what is called an Activation Function in neural networks. In the perceptron, the activation function is a simple step function; the output is non-continuous (and hence non-differentiable) and is either 1 or 0.
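A minimal sketch of such a step activation in Python (the function name is mine, for illustration only):

def step_activation(weighted_sum):
    # Outputs 1 if the weighted sum of inputs is greater than zero, else 0
    return 1 if weighted_sum > 0 else 0

print(step_activation(0.7))   # 1
print(step_activation(-0.3))  # 0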

If the inputs are arranged as a column matrix and the weights are also arranged likewise, then both the input and the weights can be treated as vectors, and the weighted sum w₁x₁ + w₂x₂ + … + wₙxₙ inside the activation function is the same as the dot product!!

Hence the activation function can also be written as

y = 1 if w · x > 0, else y = 0 (Eq 2)

Note that the dot product of two vectors (represented as column matrices) can be written as the transpose of one multiplied by the other:

y = 1 if wᵀx > 0, else y = 0 (Eq 3)

All three equations (Eq 1, 2 & 3) are the same; different references simply write it in one of these forms.
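As a quick check, here is a minimal NumPy sketch (with arbitrary example numbers) showing that the weighted sum in Eq 1, 2 and 3 is computed the same way:

import numpy as np

w = np.array([0.2, -0.5, 0.1])
x = np.array([1.0, 2.0, 3.0])

print(np.sum(w * x))   # Eq 1 form: sum of element-wise products
print(np.dot(w, x))    # Eq 2 form: dot product
print(w.T @ x)         # Eq 3 form: transpose times x (for 1-D arrays .T is a no-op, same value)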

The equation

w · x > 0

defines all the points on one side of the hyperplane, and

w · x ≤ 0

defines all the points on the other side of the hyperplane and on the hyperplane itself.

This happens to be the very definition of “linear separability”

Thus, the perceptron allows us to separate our feature space into two convex half-spaces (12).

If we can find a weight vector that has this property, then that weight vector splits the input feature vectors into two regions by a hyperplane.

This is the essence of the Perceptron, the initial artificial neuron.

In simple terms, it means that when an unknown feature vector from an input set belonging to, say, Dogs and Cats is dotted with a trained weight vector, the result will fall into either the Dog side of the hyperplane or the Cat side. This is how neural networks do classification.

Concept of Hyperplane

Next, let’s see how Perceptron is trained.

How are the Perceptron weights learned?

You may have heard about gradient descent. Perceptron learning is much simpler.

What is done is to start with a randomly initialized weight vector, compute a resultant classification (0 or 1) by taking the dot product with the input feature vector, and then adjust the weight vector a tiny bit in the right 'direction' so that the output gets closer to the expected value. This is repeated iteratively until the output is close enough.

The question is how to nudge the weight vector in the correct "direction"?

We want to move the weight vector in the direction of the input vector so that the hyperplane is closer to the correct classification.

The error of a perceptron with weight vector w is the number of incorrectly classified points. The learning algorithm must minimize this error function.
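A sketch of this error count, assuming labels of 1 and 0 and the step-function prediction described above (the function name is mine):

import numpy as np

def perceptron_error(w, X, y):
    # Number of samples whose predicted class (step of the dot product)
    # does not match the label
    predictions = (X.dot(w) > 0).astype(int)
    return np.sum(predictions != y)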

Perceptron Training

  1. Take an input from the training data and compute its dot product with the current weight vector; this gives a value that is either greater than 0 or less than 0.
  2. This tells us on which side of the hyperplane the feature vector lies: the positive side (P) or the negative side (N).
  3. If this is as expected, do nothing.
  4. If the classification comes out wrong, that is, if the input feature vector x was in P (x ∈ P) but the dot product w·x < 0, we need to drag/rotate the weight vector towards x:

w(new) = w + x

which is vector addition; w is moved towards x.

5. Alternatively, if x ∈ N but the dot product w·x > 0, then we need to do the reverse:

w(new) = w − x

This is the method of perceptron learning.

This is also called the delta rule. Note that some articles refer to this as a simplified form of gradient descent, but gradient descent depends on the activation function being differentiable. The step function, which is the activation function of the perceptron, is non-continuous and hence non-differentiable.

Here is sample code that illustrates this (ChatGPT generated and modified).

def perceptron_training(X, y, learning_rate=0.1, n_epochs=10):
    """
    Train a perceptron model.

    Parameters:
    - X: Input features, a numpy array of shape (n_samples, n_features)
    - y: Target values, a numpy array of shape (n_samples,)
    - learning_rate: The learning rate for weight updates.
    - n_epochs: Number of passes over the training dataset.

    Returns:
    - weights: The learned weight vector.
    """
    n_samples, n_features = X.shape
    # Initializing weights to zeros (no bias term here; the decision boundary passes through the origin)
    weights = np.zeros(n_features)

    # Training process
    for epoch in range(n_epochs):
        for i in range(n_samples):
            xi = X[i]
            # Predicting using the step function
            prediction = np.dot(xi, weights) >= 0
            # Updating weights if the prediction is wrong
            if prediction != y[i]:  # if not matching prediction
                update = learning_rate * (y[i] - prediction)
                weights += update * xi  # move the weight vector towards (or away from) xi

    return weights

# Generating synthetic data
# Features for healthy leaves
healthy_leaves = np.random.randn(88, 2) + np.array([2, 2])
# Features for unhealthy leaves
unhealthy_leaves = np.random.randn(22, 2) + np.array([-2, -2])

# Generating labels for the synthetic data
# Healthy leaves (1) and unhealthy leaves (0)
y_healthy = np.ones(88) # Labels for healthy leaves
y_unhealthy = np.zeros(22) # Labels for unhealthy leaves

# Combining the datasets
X = np.vstack((healthy_leaves, unhealthy_leaves))
y = np.concatenate((y_healthy, y_unhealthy))

# Training the perceptron
weights = perceptron_training(X, y, learning_rate=0.4, n_epochs=100)

# Weights are learned; now let's test with a new feature vector.
# True implies the leaf is healthy, False that it is not
input_vector = [1, 2]

# Predicting using the step function
prediction = np.dot(input_vector, weights) >= 0
print(prediction)

If we plot the above, we get a visual representation of how the weight vector has split the training data set (88 healthy and 22 unhealthy leaves).

# Plotting the data and the decision boundary learned by the perceptron
plt.scatter(healthy_leaves[:, 0], healthy_leaves[:, 1], c='green', label='Healthy Leaves')
plt.scatter(unhealthy_leaves[:, 0], unhealthy_leaves[:, 1], c='red', label='Unhealthy Leaves')

# Calculating the decision boundary
# The boundary passes through the origin: weights[0]*x + weights[1]*y = 0  =>  y = -(weights[0]/weights[1]) * x
x_values = np.array(plt.gca().get_xlim())
y_values = - (weights[0] / weights[1]) * x_values

plt.plot(x_values, y_values, label='Perceptron Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

Note that when we have more complex feature vectors, say 3D or N-dimensional vectors, the same principles apply; only visualising the N-dimensional weight vector becomes very difficult. Below is a visualisation of a weight vector for a 3D feature vector space.

Note: green points are above the weight vector plane and red points are below it.
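The same dot-product test works in code for any dimension; here is a small sketch with a made-up 3D weight vector and random 3D points (purely illustrative):

import numpy as np

w = np.array([1.0, 1.0, 1.0])     # an illustrative 3D weight vector
points = np.random.randn(5, 3)    # five random 3D feature vectors

# The sign of the dot product tells us which side of the plane w.x = 0 each point lies on
print(points.dot(w) > 0)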

Note: A more rigorous explanation of the proof is in the book Neural Networks by R. Rojas, and a more lucid one in perceptron-learning-algorithm.

The Perceptron Network and the AI winter

The perceptron network, due to its reliance on a step function as the activation function, requires the input feature set to be linearly separable for successful classification. However, not all problems feature linearly separable datasets, which represents a significant constraint of the perceptron model.

Linearly and non-linearly separable datasets

The fact that the perceptron could not be trained for XOR or XNOR, demonstrated in 1969 by Marvin Minsky and Seymour Papert, led to a lot of disillusionment; much of the hype generated by Frank Rosenblatt's discovery evaporated at that time.

The concepts of vectors, feature space, hyperplane, dot product, weights and even the activation function are some of the key takeaways from this history. Most of these remain very relevant in modern neural networks, with a few changes and additions. We will see this in the second part.

