The Perceptron Algorithm: How it Works and Why it Works

Joshua Pickard
Published in Geek Culture
Jan 17, 2022

The Perceptron Algorithm is one of the simplest machine learning algorithms, and it is a fundamental building block of more complex models like Neural Networks and Support Vector Machines. Understanding how it works and why it works forms a foundation for supervised learning problems.

Image by Shelby Hiter on Datamation

Note: NumPy is the only library used to implement this algorithm, and all code in this post can be found and run in the following colab notebook. Additionally, all images, code, and figures were generated by the author.

https://colab.research.google.com/drive/1XpC16z4gE0Q9REGfs-ExwziY_cVlbDdi?usp=sharing

The Perceptron Algorithm: How it Works

The Perceptron Algorithm is a model used for supervised learning. All supervised learning models are trained on a set of feature vectors X and a corresponding set of labels Y; here, each label is ±1. The goal of the Perceptron Algorithm is to find a decision boundary in the feature space such that every feature vector belonging to a given class falls on the same side of the boundary and the boundary separates the two classes.

Mathematically, the feature vectors X live in the feature space, which we can think of as a vector space, so the decision boundary must divide that space. In a 3D space, a 2D plane separates the entire space, and in 2D, a 1D line separates the entire space. In an n-dimensional space, an (n-1)-dimensional subspace is called a hyperplane, and the goal of the Perceptron Algorithm is to find a hyperplane that divides the space. While a decision boundary isn't required to be linear, the simplest version of the Perceptron Algorithm works with linear decision boundaries.

Restricting ourselves to the set of all hyperplanes that pass through the origin of the feature space (an assumption we will later relax), the hyperplane, and therefore the decision boundary, can be defined by the vector θ, which is normal (perpendicular) to the hyperplane. Using the feature vectors X, the labels Y, and θ to represent the decision boundary, the Perceptron Algorithm can be described with a few lines of pseudocode.
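A sketch of that pseudocode, numbered so that the line references later in this post line up:

1. θ = 0
2. while any training point is misclassified:
3.     for i = 1, ..., n:
4.         if yᵢ(θ · xᵢ) ≤ 0:
5.             θ = θ + yᵢxᵢ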

For linearly separable data, this algorithm is guaranteed to find a separating decision boundary. In Python, the pseudocode can be written as:

import numpy as np

def train_perceptron(X, y):
    # Initialize theta to the zero vector in the feature space
    theta = np.zeros(len(X[0]))
    misclassified = True
    # Repeat full passes over the data until no point is misclassified
    while misclassified:
        misclassified = False
        for i in range(len(X)):
            # A point is misclassified when y_i * (theta . x_i) <= 0
            if y[i] * np.dot(theta, X[i]) <= 0:
                theta += y[i] * X[i]
                misclassified = True
    return theta

The first step of the algorithm is to initialize θ to the zero vector in the feature space. The Perceptron uses the dot product of θ and a feature vector to determine the appropriate label, so the focus of training the Perceptron is learning a good value for θ.

There are 2 main loops, beginning on lines 2 and 3 of the pseudocode. The outer loop on line 2 executes until all data points are correctly classified, or until another termination condition is met, as we will see later. The inner loop on line 3 iterates over each feature vector in the training data, updating θ as needed.

Why it Works: The Perceptron Update

The Perceptron update step is the critical part of the algorithm, occurring on lines 4 and 5 of the pseudocode. The if statement, on line 4, determines whether θ correctly classifies the feature vector xᵢ. Since θ is the normal vector of the decision boundary, we can use the dot product to determine which side of the hyperplane a feature vector is on. Recall that parallel unit vectors have a dot product of +1, and antiparallel unit vectors (vectors pointing in exactly opposite directions) have a dot product of -1. Based on this, if θ and xᵢ point in the same general direction, the dot product will be > 0, and if they point in opposite directions, the dot product will be < 0.

If yᵢ is +1 and the dot product is > 0, then yᵢ(θ · xᵢ) > 0. Similarly, if yᵢ is -1 and the dot product is < 0, then yᵢ(θ · xᵢ) > 0. In both cases, θ correctly classifies xᵢ. However, if yᵢ and θ · xᵢ do not have the same sign, then yᵢ(θ · xᵢ) will be less than or equal to 0. This is the condition checked in the if statement on line 4.
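As a minimal sketch of how that condition becomes a prediction rule (this helper is not in the colab notebook; it is only for illustration), a point is labeled by the sign of θ · x:

import numpy as np

def predict(theta, x):
    # Label a point by which side of the hyperplane (with normal vector theta) it falls on
    return 1 if np.dot(theta, x) > 0 else -1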

When the decision boundary correctly classifies a feature vector, nothing is updated. However, when a misclassification occurs, the decision boundary is updated in the "right direction" to make that misclassification less likely to occur again. This update takes the form of adding yᵢxᵢ to θ. The reason this addition moves the boundary in the right direction is that the new θ better classifies xᵢ. To see this, compare the classification quantity before and after the update:

yᵢ((θ + yᵢxᵢ) · xᵢ) = yᵢ(θ · xᵢ) + yᵢ²(xᵢ · xᵢ) = yᵢ(θ · xᵢ) + ‖xᵢ‖²

Since ‖xᵢ‖² > 0, the label times the dot product between a misclassified feature vector and θ is strictly larger after incrementing θ. Since this quantity is positive for a correctly classified feature vector, and the update makes this value larger, it follows that the update moves the decision boundary in the right direction.
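A quick numeric check of this (the point and parameters below are made up purely for illustration):

import numpy as np

theta = np.array([1.0, -2.0])   # current parameters (hypothetical)
x_i = np.array([3.0, 2.0])      # a misclassified point (hypothetical)
y_i = 1                         # its label

print(y_i * np.dot(theta, x_i))     # -1.0, so x_i is misclassified
theta = theta + y_i * x_i           # the Perceptron update
print(y_i * np.dot(theta, x_i))     # 12.0, larger by ||x_i||^2 = 13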

The Perceptron In Action

Once a Perceptron is trained and the decision boundary is learned by finding θ, it can be used to classify data. The colab notebook (linked above) contains code to generate data, train a Perceptron, and plot the decision boundary. This section demonstrates some of the strengths and weaknesses of the algorithm.

Two clusters of 2D feature vectors were generated and are shown in the figure below. The Perceptron Algorithm, exactly as written above, was trained to find a decision boundary, defined by θ and shown as the green line in the plot below.

This data is linearly separable with a decision boundary through the origin.
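For reference, generating and separating data like this might look roughly as follows (the cluster centers, sizes, and random seed here are assumptions, not the values used in the colab notebook):

import numpy as np

np.random.seed(0)

# Two Gaussian clusters on opposite sides of the origin (assumed parameters)
X_pos = np.random.randn(50, 2) + np.array([2.0, 2.0])
X_neg = np.random.randn(50, 2) + np.array([-2.0, -2.0])
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

theta = train_perceptron(X, y)
print(theta)   # the decision boundary is the set of points x with np.dot(theta, x) == 0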

The Perceptron Algorithm does a great job finding a decision boundary that works well for this data set. However, there are 2 important notes to make here:

  1. Not all data can be separated with a straight line (or hyperplane) through the origin.
  2. There is no noise in this dataset. The purple and yellow clusters do not overlap, which makes them very easy to separate.

The Perceptron Algorithm works well while these conditions hold, but it fails miserably when they are not met. To fix this, a few modifications can be made to the algorithm.

Improved Perceptron

Perceptron with Offset

First, relaxing the assumption that the decision boundary must pass through the origin of the feature space, the algorithm can be modified by including an offset term b. Previously, with a decision boundary through the origin, a correctly classified feature vector satisfied the condition yᵢ(θ · xᵢ) > 0. With the offset term b, a correctly classified feature vector satisfies the condition yᵢ(θ · xᵢ + b) > 0. This is referred to as the Perceptron with Offset.

Similar to θ in the original Perceptron Algorithm above, b is only updated when a feature vector is misclassified. When a point is misclassified, θ is updated as θ = θ + yᵢxᵢ, and b is updated as b = b + yᵢ.
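A minimal sketch of the Perceptron with Offset (the function name and structure here are my own; the notebook's implementation may differ):

import numpy as np

def train_perceptron_offset(X, y):
    theta = np.zeros(len(X[0]))
    b = 0.0
    misclassified = True
    while misclassified:
        misclassified = False
        for i in range(len(X)):
            # A point is misclassified when y_i * (theta . x_i + b) <= 0
            if y[i] * (np.dot(theta, X[i]) + b) <= 0:
                theta += y[i] * X[i]
                b += y[i]
                misclassified = True
    return theta, b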

Alternatively, the offset can be incorporated into the original Perceptron Algorithm without an explicit offset term b by extending θ and xᵢ: append b to θ as an extra coordinate, so θ′ = (θ, b), and append a constant 1 to every feature vector, so xᵢ′ = (xᵢ, 1). Then θ′ · xᵢ′ = θ · xᵢ + b.

In this case, the original Perceptron Algorithm applied to the extended vectors is equivalent to the Perceptron with offset b. In the colab notebook, the algorithm explicitly uses an offset b. An example of a decision boundary learned by the Perceptron with an offset is shown below.

A decision boundary that does NOT pass through the origin

Inseparable Data

The loop beginning on line 2 of the pseudocode executes until the Perceptron Algorithm finds a decision boundary that separates the 2 classes of data. However, this isn't always possible! This is problematic because when the Perceptron Algorithm is applied to data that can't be perfectly separated, it will loop forever.

Generally, there are 2 strategies to solve this problem. The simplest strategy is to set a limit on the number of times this outer loop executes. Each execution of the outer loop is an epoch, so we could add a condition setting a maximum number of epochs to execute.

The second strategy is to exit the outer loop when the change in θ between epochs becomes sufficiently small. This method is slightly more complicated to implement, and it is further complicated by the fact that θ can grow without bound, since there is no regularization term in the Perceptron Algorithm.
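Here is a sketch combining both strategies with the offset version above (the epoch cap and tolerance are arbitrary example values, not taken from the notebook):

import numpy as np

def train_perceptron_capped(X, y, max_epochs=10, tol=1e-6):
    theta = np.zeros(len(X[0]))
    b = 0.0
    for epoch in range(max_epochs):          # strategy 1: cap the number of epochs
        theta_old = theta.copy()
        for i in range(len(X)):
            if y[i] * (np.dot(theta, X[i]) + b) <= 0:
                theta += y[i] * X[i]
                b += y[i]
        # strategy 2: stop early if theta barely changed over the epoch
        if np.linalg.norm(theta - theta_old) < tol:
            break
    return theta, b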

Below is an example of the Perceptron Algorithm with Offset trained for 10 epochs on data that is not separable.

This example applies the Perceptron Algorithm with Offset to inseparable data. The code for this plot is available towards the bottom of the colab notebook.

While the decision boundary does roughly separate the 2 clusters, it doesn't do a very good job. This highlights that the Perceptron Algorithm is useful when working with separable data but not very helpful otherwise.

Extensions of the Perceptron

We have seen how and why the Perceptron algorithm works and also its limitations. While the algorithm itself isn’t a popular choice these days as it only really works well on linearly separable data, the ideas behind it form the backbone of many popular machine learning concepts today.

Concepts like the decision boundary, training for multiple epochs, and how feature vectors are treated in the Perceptron are all ideas critical to understanding more complex models like neural networks. Applying

  1. kernels
  2. activation functions
  3. regularization
  4. multiple Perceptrons

and other techniques is the key to extending the Perceptron Algorithm to Support Vector Machines and Neural Networks. These are concepts I will cover in future posts.

Thanks for reading my first story on Medium!

— Joshua Pickard
