Machine Learning From Scratch: Classification, Regression, Clustering and Gradient Descent

Jet New · Published in The Startup · 7 min read · Jun 27, 2020

A quick start "from scratch" on 3 basic machine learning models (linear regression, logistic regression and K-means clustering) and on gradient descent, the optimisation algorithm acting as a driving force behind them.

The purpose of this article is to help coders understand the inner workings of basic machine learning algorithms. To make the best use of it, follow along with the code in your own development environment.

Linear Regression

Linear regression models a linear relationship between an independent variable X and a continuous dependent variable y. Linear regression is an example of supervised learning, the task of learning a function that maps X to y, given examples of both. A continuous variable can take any value within an interval (e.g. 0.01, 1.23), while a discrete variable can only take distinct, separate values (e.g. 0, 1). Discrete variables are addressed in the later section on logistic regression.

Linear regression plot

A linear model can be defined by the following equation, where m and c represent the gradient of the line and the y-intercept respectively:

y = m*X + c

The gradient and the y-intercept are commonly referred to as the weight and the bias respectively. When X is multi-dimensional, the equation becomes that of a multiple linear regression model:

y = m1*X1 + m2*X2 + m3*X3 + c
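When the weights and inputs are represented as vectors, this multi-dimensional prediction is simply a dot product plus the bias. Here is a minimal NumPy sketch; the numerical values below are made up for illustration and are not from the article:

import numpy as np

# Illustrative values only
m = np.array([1.0, 2.0, 3.0])   # weights [m1, m2, m3]
x = np.array([0.5, -1.0, 2.0])  # one input [x1, x2, x3]
c = 0.1                         # bias

y = np.dot(m, x) + c  # equivalent to m1*x1 + m2*x2 + m3*x3 + c
print(y)  # 4.6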

The weights and independent variables can be represented as vectors m = [m1, m2, m3] and x = [x1, x2, x3], as in the sketch above. To consolidate in Python 3.8, the linear regression function for 1-dimensional X can be defined using a list comprehension:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')
import random

def f(m, X, c):
    """Linear regression"""
    return [m * x + c for x in X]

X = [i for i in range(10)]
y = [x + random.random() for x in X]
m, c = 1, 0
y_hat = f(m, X, c)

plt.plot(X, y, '.', c='r')
plt.plot(X, y_hat)
plt.show()
Matplotlib plot of linear regression

The linear regression model y = 1*X + 0, where m = 1 and c = 0, seems to fit the data reasonably well, but a quantifiable measure is required to compare models with different values of m and c.

The error of the model can be quantified in terms of the difference between y, the actual output, and y_hat, the model's prediction. One error measure is the mean squared error, here computed with a conventional 1/2 factor as MSE = (1/2n) * Σ (y_hat[i] - y[i])^2, and implemented as:

def MSE(y, y_hat):
    """Mean squared error"""
    c = 0
    n = len(y)
    for i in range(n):
        c += (y_hat[i] - y[i]) ** 2
    return 1/2 * c/n

print(MSE(y, y_hat))  # 0.17272072055396498
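To see that MSE separates better fits from worse ones, compare the error of the line above against a deliberately poor gradient (m = 2 here is an arbitrary choice for illustration):

# A lower MSE indicates a better-fitting line
print(MSE(y, f(1, X, 0)))  # the reasonable fit above: small error
print(MSE(y, f(2, X, 0)))  # a poor fit: much larger error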

A lower mean squared error indicates parameters that result in a better-fitting linear regression model. The goal of optimising a model is to minimise the error. Therefore, an optimisation algorithm, such as gradient descent, is required.

Gradient Descent

Gradient descent is an optimisation algorithm for finding a local minimum of a differentiable function. The loss function mse = (y_hat - y)**2 of a linear regression model may be minimised to improve model fitness. In the following diagram, the vertical axis represents the MSE while the horizontal axis represents a weight, e.g. m. By changing the value of m, the MSE may increase or decrease. To minimise MSE, m must move in a direction where MSE decreases.

Gradient descent illustrating loss optimisation

In the case of the above loss function, a positive gradient signals a need to decrease m, while a negative gradient signals a need to increase it. The gradient, or partial derivative, of the loss function with respect to m is computed as:

dMSE/dm = (-2/n) * Σ X[i] * (y[i] - y_hat[i])

The key step in gradient descent is therefore the weight update m = m - a * dMSE/dm, where a represents the learning rate, a hyperparameter that controls the size of the change in weight. The bias c is updated in the same way using its own partial derivative. The following is the gradient descent algorithm:

# No. of iterations, weight, bias, learning rate
e, m, c, a = 10, 1, 0, 0.01

for i in range(e):
    y_hat = f(m, X, c)
    n = len(y_hat)

    # Compute partial derivatives of MSE w.r.t. m and c
    dm = (-2/n) * sum([X[i] * (y[i] - y_hat[i]) for i in range(n)])
    dc = (-2/n) * sum([y[i] - y_hat[i] for i in range(n)])

    # Update weight and bias
    m = m - a * dm
    c = c - a * dc

    print(f"Epoch {i}: m={m:.2f}, c={c:.2f}")

y_hat = f(m, X, c)
plt.plot(X, y, '.', c='r')
plt.plot(X, y_hat)
plt.show()
Gradient descent w.r.t. 2 parameters
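As a sanity check that is not part of the original article, the analytic gradients can be compared against finite-difference approximations. The mse helper below is introduced here and omits the 1/2 factor so that it matches the -2/n gradients used above; a constant factor in the loss is simply absorbed into the learning rate:

def mse(m, c, X, y):
    """Mean squared error without the 1/2 factor, matching the -2/n gradients"""
    n = len(y)
    return sum((m * X[i] + c - y[i]) ** 2 for i in range(n)) / n

m0, c0, eps = 0.5, 0.2, 1e-6  # an arbitrary point at which to check the gradients
n = len(X)
y_hat = f(m0, X, c0)
dm = (-2/n) * sum(X[i] * (y[i] - y_hat[i]) for i in range(n))
dc = (-2/n) * sum(y[i] - y_hat[i] for i in range(n))

# Central finite differences should closely match the analytic gradients
dm_num = (mse(m0 + eps, c0, X, y) - mse(m0 - eps, c0, X, y)) / (2 * eps)
dc_num = (mse(m0, c0 + eps, X, y) - mse(m0, c0 - eps, X, y)) / (2 * eps)
print(dm, dm_num)
print(dc, dc_num)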

Logistic Regression

Logistic regression uses a logistic function (or sigmoid function) to model a binary (discrete) dependent variable, usually either 0 or 1. Logistic regression is commonly used for classification tasks, such as predicting whether a passenger survived the sinking of the Titanic.

The sigmoid function is defined as:

sigmoid(z) = 1 / (1 + e^(-z))
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

X = [i for i in range(-10, 10)]
y = [sigmoid(x) for x in X]
plt.plot(X, y)
plt.show()
Sigmoid function mapping X to y values between 0 and 1
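Because the sigmoid squashes any real input into the interval (0, 1), its output can be read as the probability of the positive class. A common convention, not shown in the original code, is to threshold at 0.5 to obtain a class label:

# Threshold the sigmoid output at 0.5 to obtain a binary prediction
predictions = [1 if sigmoid(x) > 0.5 else 0 for x in X]
print(predictions)  # 0 for x <= 0 in this X, 1 for x > 0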

The loss function for logistic regression (“log-loss”) is defined as:

loss = -y*log(y_hat) - (1-y)*log(1-y_hat)

When the actual value of y is 0, the first term -y*log(y_hat) equals 0, and the log-loss reduces to -log(1-y_hat). If y_hat is close to 0, e.g. y_hat=0.1, -log(1-0.1) equals ~0.105, while if y_hat is close to 1, e.g. y_hat=0.9, -log(1-0.9) equals ~2.303, indicating a higher loss when y_hat differs greatly from y.
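These two values are easy to verify directly (a quick check, not in the original article):

print(-math.log(1 - 0.1))  # ~0.105: small loss when y=0 and y_hat=0.1
print(-math.log(1 - 0.9))  # ~2.303: large loss when y=0 and y_hat=0.9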

Averaged over n examples, the loss function can be implemented as:

import math

def loss(y, y_hat):
    n = len(y)
    loss = 0
    for i in range(n):
        loss += -y[i] * math.log(y_hat[i]) - (1-y[i]) * math.log(1-y_hat[i])
    return loss / n

y_hat = [sigmoid(x + random.uniform(-1, 1)) for x in X]
plt.plot(X, y_hat, label='noisy')
plt.plot(X, y, label='sigmoid')
plt.legend(loc='upper left')
plt.show()
print("Loss:", loss(y, y_hat))  # 0.17083318165128286
Sigmoid plot with uniform noise
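The article stops at defining the loss, but the same gradient descent idea from the earlier section applies here as well. The following sketch is illustrative and not from the original: the toy data, the variable names w and b, and the 1000-epoch budget are all made up. It relies on the fact that the gradient of the averaged log-loss with respect to the weight is (1/n) * Σ (y_hat[i] - y[i]) * X[i]:

# Toy binary classification data: the label is 1 when x > 0 (purely illustrative)
X_clf = [i for i in range(-10, 10)]
y_clf = [1 if x > 0 else 0 for x in X_clf]

w, b, a = 0.0, 0.0, 0.1  # weight, bias, learning rate
n = len(X_clf)
for epoch in range(1000):
    y_hat = [sigmoid(w * x + b) for x in X_clf]
    # Partial derivatives of the averaged log-loss w.r.t. w and b
    dw = (1/n) * sum((y_hat[i] - y_clf[i]) * X_clf[i] for i in range(n))
    db = (1/n) * sum(y_hat[i] - y_clf[i] for i in range(n))
    w, b = w - a * dw, b - a * db

print("Loss:", loss(y_clf, [sigmoid(w * x + b) for x in X_clf]))  # should be small after training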

K-Means Clustering

K-means clustering is a method of grouping data points into clusters based on their similarity to one another. It is an unsupervised learning algorithm, requiring only X (the data points) and no y (cluster labels).

Illustration of K-means clustering

The K-means clustering algorithm can be described in a few steps:

  1. Randomly initialise K centroids (points).
  2. Assign each data point to the closest centroid.
  3. Compute the mean of the points assigned to each centroid, and move that centroid to the newly computed mean.
  4. Repeat steps 2–3.

Step 1: Randomly initialise K centroids.

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=20, random_state=0)
random.seed(1)

# Randomly initialise k=3 centroids
k = 3
centroids = np.array([
    [random.uniform(-5, 5), random.uniform(-5, 5)],
    [random.uniform(-5, 5), random.uniform(-5, 5)],
    [random.uniform(-5, 5), random.uniform(-5, 5)]])

plt.scatter(X[:,0], X[:,1])
plt.scatter(centroids[:,0], centroids[:,1], c='r')
plt.show()
Initialisation of 3 centroids (red) on the 2D dataset (blue)

Step 2: Assign each data point to the closest centroid.

cluster_labels = []
for x in X:
    dists = []
    for c in centroids:
        # Euclidean distance
        d = math.sqrt((x[0] - c[0])**2 + (x[1] - c[1])**2)
        dists.append(d)
    cluster_labels.append(dists.index(min(dists)))

print(cluster_labels)
# [0, 0, 2, 2, 1, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 0]
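For reference, the same nearest-centroid assignment can be done in a single vectorised step with NumPy; this is an alternative to the loops above and not part of the original article:

# Pairwise distances between the 20 points and the 3 centroids: shape (20, 3)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
print(dists.argmin(axis=1).tolist() == cluster_labels)  # expected: True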

Step 3: Compute the mean of the points assigned to each centroid, and move the centroids to the newly computed means.

# Compute the mean of all points assigned for each unique centroid
new_centroids = []
for i in range(k):
    data_points = []
    for j in range(len(cluster_labels)):
        if cluster_labels[j] == i:
            data_points.append(X[j])
    if len(data_points) == 0:
        new_centroids.append(centroids[i])
    else:
        new_c = sum(data_points)/len(data_points)
        new_centroids.append(new_c)
new_centroids = np.array(new_centroids)

print("Centroids:", new_centroids)
# Centroids:
# [[-0.94637217  3.75181235]
#  [ 3.62236293 -0.20060169]
#  [ 1.77693338  2.32752122]]

cls0_idx = np.where(np.array(cluster_labels)==0)[0]
cls1_idx = np.where(np.array(cluster_labels)==1)[0]
cls2_idx = np.where(np.array(cluster_labels)==2)[0]
plt.scatter(X[cls0_idx,0], X[cls0_idx,1], c='yellow')
plt.scatter(X[cls1_idx,0], X[cls1_idx,1], c='blue')
plt.scatter(X[cls2_idx,0], X[cls2_idx,1], c='green')
plt.scatter(new_centroids[:,0], new_centroids[:,1], c='r')
plt.show()
Assignment of each data point to the closest centroid by Euclidean distance

Step 4: Repeat steps 2–3 (in a function!).

def kmeans_clustering(centroids, X):
    """K-means clustering"""
    # Assign each data point to the closest centroid
    cluster_labels = []
    for x in X:
        dists = []
        for c in centroids:
            # Euclidean distance
            d = math.sqrt((x[0] - c[0])**2 + (x[1] - c[1])**2)
            dists.append(d)
        cluster_labels.append(dists.index(min(dists)))

    # Compute the mean of all points assigned to each unique centroid
    new_centroids = []
    for i in range(k):
        data_points = []
        for j in range(len(cluster_labels)):
            if cluster_labels[j] == i:
                data_points.append(X[j])
        if len(data_points) == 0:
            new_centroids.append(centroids[i])
        else:
            new_c = sum(data_points)/len(data_points)
            new_centroids.append(new_c)
    new_centroids = np.array(new_centroids)
    return new_centroids, cluster_labels

# Perform 10 iterations of K-means clustering
for i in range(10):
    centroids, cluster_labels = kmeans_clustering(centroids, X)
Results after 10 iterations of K-means clustering
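Instead of a fixed number of iterations, a common stopping rule (sketched here, not part of the original article) is to repeat steps 2-3 until the cluster assignments stop changing, at which point the centroids no longer move:

# Repeat until the assignments no longer change between iterations
prev_labels = None
while cluster_labels != prev_labels:
    prev_labels = cluster_labels
    centroids, cluster_labels = kmeans_clustering(centroids, X)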

K-means clustering has a hyperparameter k that needs to be specified. If the data has 4 clusters by visual inspection, k=3 will not provide a good clustering of the data.
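One rough way to compare different values of k, not covered in the article, is the within-cluster sum of squared distances, which generally drops sharply until k reaches the natural number of clusters and only slowly afterwards. The inertia helper below is a name introduced for this sketch:

# Within-cluster sum of squared distances for a given clustering
def inertia(centroids, cluster_labels, X):
    total = 0
    for j in range(len(X)):
        c = centroids[cluster_labels[j]]
        total += (X[j][0] - c[0])**2 + (X[j][1] - c[1])**2
    return total

# kmeans_clustering reads the global k, so rebind it for each candidate value
for k in range(1, 6):
    cands = np.array([[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(k)])
    for _ in range(10):
        cands, labels = kmeans_clustering(cands, X)
    print(k, inertia(cands, labels, X))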

Conclusion

This article should give you a basic idea of, and some intuition for, how these machine learning algorithms work. Each algorithm has its benefits and limitations, and inspecting algorithms in depth, beyond the surface-level use of libraries, builds a deeper understanding of machine learning.
