Implementing Convolution2D, Linear regressions and K-means clustering from scratch

Andrey Nikishaev
Machine Learning World
Apr 24, 2021

2D Convolution

Simple example of algorithm logic

To run a convolution we need two things: an input matrix with shape (Batch, Channels, Height, Width) and convolution kernels with shape (Out Channels, Input Channels, Height, Width); within each output filter, every input channel gets its own kernel.

Iteratively we take a patch of the input matrix with the size of the kernel and multiply it element-wise with each kernel. The sum of those element-wise products is our output value. The formula for this is A*B = a1*b1 + a2*b2 + …

As we also use strides and input padding, we need a formula to calculate the output matrix shape:
output_w_h = int(((w_h - kernel_w_h + 2 * padding_w_h) / stride_w_h) + 1)
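As a quick sanity check of this formula (the numbers below are just an illustration):

# example: 5x5 input, 3x3 kernel, padding 1, stride 1
w_h, kernel_w_h, padding_w_h, stride_w_h = 5, 3, 1, 1
output_w_h = int(((w_h - kernel_w_h + 2 * padding_w_h) / stride_w_h) + 1)
print(output_w_h)  # 5 -> with kernel 3, padding 1, stride 1 the spatial size is preserved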

For simplicity we will use square kernels in our example

Of course, running this many iterations in real-life calculations is not a good idea, so this is just an example to illustrate the logic behind convolution.
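Here is a minimal NumPy sketch of that logic (the conv2d name is mine, not a reference implementation); it loops over every output position, which is exactly the slow iteration mentioned above:

import numpy as np

def conv2d(x, kernels, stride=1, padding=0):
    """Naive 2D convolution.
    x: (Batch, Channels, Height, Width)
    kernels: (Out Channels, Input Channels, KH, KW)
    """
    b, c, h, w = x.shape
    out_c, in_c, kh, kw = kernels.shape
    assert c == in_c, "input channels must match kernel channels"

    # pad only the spatial dimensions
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))

    # output shape from the formula above
    out_h = int((h - kh + 2 * padding) / stride) + 1
    out_w = int((w - kw + 2 * padding) / stride) + 1
    out = np.zeros((b, out_c, out_h, out_w))

    for n in range(b):
        for oc in range(out_c):
            for i in range(out_h):
                for j in range(out_w):
                    # take a kernel-sized patch and sum the element-wise
                    # product over all input channels
                    patch = x[n, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
                    out[n, oc, i, j] = np.sum(patch * kernels[oc])
    return out

# tiny usage example with random data
x = np.random.randn(1, 3, 5, 5)
k = np.random.randn(2, 3, 3, 3)
print(conv2d(x, k, stride=1, padding=1).shape)  # (1, 2, 5, 5)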

Linear Regression, Ridge Regression and Lasso Regression

The only difference between them is the regularization. Linear regression does not use regularization, Ridge regression uses L2 regularization (lambda*Sum(Weights²)), and Lasso regression uses L1 regularization (lambda*Sum(|Weights|)).

The optimization is usually done either with the Least Squares method or with Gradient Descent. I will show the second one, as it can be used in many other cases.

If you don’t know about Gradient Descent optimization, I recommend reading my previous article first:

So the idea of these methods is simple: find the parameters of the line that passes through the set of points in a way that minimizes our loss function. (For N-dimensional points the line becomes an N-dimensional hyperplane.)

Formula for the line: F(x) = A*X + b
And the MSE loss function: loss(y_true, y_pred) = Sum((y_true - y_pred)²)/n
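In NumPy that loss is a one-liner (values here are just an illustration):

import numpy as np

def mse(y_true, y_pred):
    # mean of squared residuals, as in the formula above
    return np.sum((y_true - y_pred) ** 2) / len(y_true)

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.5, 2.0])))  # 0.4166...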

So our optimization function for linear regression:
E(x) = loss(y_true,y_pred) = loss(y_true,A*X+b) =
Sum((y_true-A*X-b)²)/n

Now we need to calculate the derivatives of E(x) with respect to A and b:

dE/dA = d(Sum((y_true - A*X - b)²)/n)/dA
By the chain rule we set j = y_true - A*X - b, so dE/dA = (dE/dj)*(dj/dA)

dE/dj = d(Sum(j²)/n)/dj = 2*j/n = 2/n * (y_true - A*X - b)
dj/dA = d(y_true - A*X - b)/dA = -X

dE/dA = -2/n * Sum(X * (y_true - A*X - b))

In the same way we get dE/db = -2/n * Sum(y_true - A*X - b)

And now we can use them to update our weights A and b.
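Putting the two gradients together, a minimal gradient descent sketch for 1D data (the function name and hyperparameters are my own illustration) could look like this:

import numpy as np

def fit_linear(X, y, lr=0.01, epochs=2000):
    A, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        residual = y - A * X - b                 # y_true - A*X - b
        dA = -2.0 / n * np.sum(X * residual)     # dE/dA
        db = -2.0 / n * np.sum(residual)         # dE/db
        A -= lr * dA                             # gradient descent step
        b -= lr * db
    return A, b

# toy data roughly on the line y = 2x + 1
X = np.arange(10, dtype=float)
y = 2 * X + 1 + np.random.randn(10) * 0.1
print(fit_linear(X, y))  # should be close to (2, 1)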

For Ridge and Lasso regression we do the same, except we add the L2 and L1 regularization terms, respectively, to our optimization formula.
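Only dE/dA changes for the regularized variants; a small sketch of the extra gradient terms (the reg flag and lam value are my own illustration, not taken from the gist below):

import numpy as np

def grad_A(X, y, A, b, n, reg=None, lam=0.1):
    residual = y - A * X - b
    dA = -2.0 / n * np.sum(X * residual)
    if reg == "ridge":            # L2 penalty lam*Sum(A**2) -> derivative 2*lam*A
        dA += 2 * lam * A
    elif reg == "lasso":          # L1 penalty lam*Sum(|A|) -> subgradient lam*sign(A)
        dA += lam * np.sign(A)
    return dA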

https://gist.github.com/creotiv/9d9911d4707c74d338aa4a4786a48017

K-means

This algorithm is also pretty simple.
We set the number of clusters to N, pick N random centroids (centers of the clusters) and assign the nearest points to them, forming the initial clusters. After that, on each iteration:
1) for each cluster, find a new centroid
2) for each centroid, assign the nearest points and form a new cluster
Iteration stops when the SSE error is no longer decreasing.

SSE = Sum(euclid(centroid, cluster_points)²), so for each cluster we take the sum of squared distances between the centroid and its points.
For dimension ≤ 3 we use Euclidean distance.
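A compact NumPy sketch of this loop (the function and variable names are mine, and empty clusters are not handled here):

import numpy as np

def kmeans(points, n_clusters, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # pick N random points as initial centroids and form initial clusters
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(axis=1)
    prev_sse = np.inf
    for _ in range(max_iter):
        # 1) for each cluster, find the new centroid (mean of its points)
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(n_clusters)])
        # 2) re-assign every point to its nearest centroid (Euclidean distance)
        labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(axis=1)
        # SSE: sum of squared distances between points and their centroids
        sse = np.sum((points - centroids[labels]) ** 2)
        if prev_sse - sse < tol:   # stop when SSE no longer decreases
            break
        prev_sse = sse
    return labels, centroids, sse

# toy usage: two well-separated blobs
pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids, sse = kmeans(pts, 2)
print(centroids)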

You can run this code in Google Colab
