Machine Learning

Building a Neuron

Building a simple logistic regression model from first principles

Juan Vera
Intuition

--

It’s all about the Neurons

Sometimes, I think neurons, artificial or biological, will serve as the foundation for building the future of technology.

From artificial intelligence to brain-computer interfaces and neuromorphic chips, the concept of the neuron seems to be foundational.

I’m currently grasping the foundations of deep learning to set myself up to build technical skillsets at the bleeding edge of AI. I recently revisited the mathematics behind logistic regression, which you can think of as a single neuron of a fully-fledged neural network.

Once you’ve understood the foundational mathematics, if you know how to code in Python, building a similar model yourself should be fairly straightforward.

I’ll be going over the base mathematics, and then the code for building this logistic regression model with Python and NumPy.

I’ll be referring to the logistic regression model as a neuron.

The Building Blocks

We’ll be using a simple example to build an intuition for how this neuron works.

Our input data won’t be at the scale of real-world problems, which tend to get complex in size and dimensionality.

Rather, our dataset will be scaled down to a size that lets us see how the linear algebra really works at a foundational level.

Side note, we’ll be expressing the mathematics in a vectorized, linear algebra notation.

The input matrix for our neuron, X, will have dimensions of (3, 2), with the number of rows, 3, being the total number of features per sample in our (very small) dataset and the number of columns, 2, being the total number of samples in our dataset.

It’s a really small input dataset, but it’s perfect for understanding the foundational mathematics.

Don’t get confused by the subscripts and superscripts (these are not exponents!)

The meaning of each will be written as the caption of a given equation.

Just like this:

X =
[ x₁¹  x₂¹ ]
[ x₁²  x₂² ]
[ x₁³  x₂³ ]

Subscripts are the index of a sample | Superscripts are the index of a feature

This input matrix X can be defined in code as:

import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])

This will be the matrix which will be fed forward into our model.

Once that’s done, we need to define our parameters: the weight matrix, W, and the bias scalar, B.

Given the size (3, 2) of our input X, and our model having a size of only 1 neuron, our weight matrix can be defined with the dimensions of (1, 3), with the number of rows, 1, representing the total number of neurons, and the number of columns, 3, representing the number of total weights that will be used in our model.

There are 3 weight parameters, as our network must have one weight per connection from the previous layer into a neuron.

Given that we have 3 input features per sample, there will be 3 connections to our neuron, therefore we have 3 weight parameters in our model.

It’ll look just like this:

W = [ w¹  w²  w³ ]

Superscripts are the index of a given weight parameter, totaling 3 for 3 input features.

Our bias B will be of dimensions (1, 1), or just (1,), its size equivalent to the number of neurons in our model.

We only need 1 bias term as our linear combination, defined by z = wx + b, is only computed once per neuron (we only have 1 neuron) and only needs 1 bias value.

It’ll look just like this:

B = [ b ]

Pretty simple right?

In code, this can be defined as a function:

def init_params(dims):
    W = np.random.rand(1, dims)
    B = np.random.rand(1, 1)
    return W, B

# We use "dims" in our function to generalize it for other use cases.

# In our case, we input 3 as dims when calling init_params, given the number
# of features per sample.
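As a quick check of the shapes, here’s a small illustrative snippet of my own using the function above:

W, B = init_params(3)
print(W.shape, B.shape)  # (1, 3) (1, 1)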

Now, we can define our activation function, which in this case will be the sigmoid function (σ).

Mathematically, this function looks like this:

sigmoid(z) = 1 / (1 + e⁻ᶻ)

I had to use “sigmoid” as LaTeX converters don’t accept “σ”, lol. Someone build a better one? Maybe I will.

and in code:

def sigmoid(z):
    a = 1 / (1 + np.exp(-z))
    return a
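To see what it does: sigmoid squashes any real number into the range (0, 1), which is what lets us read the neuron’s output as a probability. A quick illustrative check of my own (values approximate):

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # ≈ [0.119 0.5 0.881]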

Now, we define the forward pass / forward propagation step.

Here, we apply a linear combination of the weights and inputs and add the bias parameter, B, per sample to get our weighted sum.

The equation for the linear combination looks as:

zᵢ = w¹xᵢ¹ + w²xᵢ² + w³xᵢ³ + b

where i is the ith sample.

This is a simplified version of the linear combination. It’s an equation defined for every ith sample.
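To make this concrete, here’s a tiny worked example with made-up parameter values (these aren’t trained values, just numbers to trace the arithmetic through):

w = np.array([0.1, 0.2, 0.3])  # one (made-up) weight per feature
x = np.array([1, 3, 5])        # the features of the first sample, i.e. the first column of X
b = 0.5
z = np.dot(w, x) + b           # 0.1*1 + 0.2*3 + 0.3*5 + 0.5
print(z)                       # ≈ 2.7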

Given that we’re vectorizing our data into matrix X and our parameters into matrices W and B, we can express the linear combination as a matrix multiplication!

Here’s what it looks like:

Initial Equation:

Z = WX + B

The Matrix Multiplication:

WX = [ w¹  w²  w³ ] ·
[ x₁¹  x₂¹ ]
[ x₁²  x₂² ]
[ x₁³  x₂³ ]
= [ (w¹x₁¹ + w²x₁² + w³x₁³)   (w¹x₂¹ + w²x₂² + w³x₂³) ]

If you’re having trouble with the matrix multiplication, check out this git repo.

If you need more foundational knowledge, check out this resource!

Now, we can add our bias scalar, B, element-wise:

Z = [ (w¹x₁¹ + w²x₁² + w³x₁³ + b)   (w¹x₂¹ + w²x₂² + w³x₂³ + b) ]

and we ultimately get:

Z = [ z₁   z₂ ]

Yet, this isn’t the final output. We still need to apply our predefined activation function, sigmoid.

We’ll be defining this output as the activation matrix, A, where A = sigmoid(Z).

All of this can be defined in code as:

def forward(X, W, B):
    z = np.dot(W, X) + B
    A = sigmoid(z)
    return A

The matrix A has a dimensionality of (1, 2), the number of rows being equivalent to the total neurons in the output layer (1) and the number of columns being the number of outputs, one per sample.

A is the matrix that holds the final predictions of our neural network, one per sample.
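If the shapes still feel abstract, here’s a short, self-contained snippet of my own that just checks the dimensions end to end with random parameters:

import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])  # (3, 2): 3 features, 2 samples
W = np.random.rand(1, 3)                # (1, 3): 1 neuron, 3 weights
B = np.random.rand(1, 1)                # (1, 1): 1 bias
Z = np.dot(W, X) + B
A = 1 / (1 + np.exp(-Z))
print(Z.shape, A.shape)                 # (1, 2) (1, 2)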

Training our Neuron.

Now that our neuron has the foundation for making predictions based on a set of samples and corresponding features, it’s time to train our neuron to optimize for accuracy.

We can measure the level of inaccuracy of our neuron through the log loss function:

L = −Y · log(A) − (1 − Y) · log(1 − A), computed element-wise (one loss value per sample).

Here, Y is the matrix that holds the true label of each sample, with dimensions (1, 2) to match A. We’re comparing it against A to derive the value of the loss.

But ultimately, we want to average the loss value across all samples in our dataset. To do so, we can take the sum of L and divide by the total number of samples, n:

J = (1/n) Σᵢ Lᵢ

In Python, we can write this as:

def log_loss(Y, A):
    loss = np.mean(-Y * np.log(A) - (1 - Y) * np.log(1 - A))
    return loss
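As a quick intuition check with some made-up numbers of my own (not the article’s dataset): predictions close to the true labels give a loss near 0, while confidently wrong predictions blow the loss up.

Y = np.array([[0, 1]])
A_good = np.array([[0.05, 0.95]])
A_bad = np.array([[0.90, 0.10]])
print(log_loss(Y, A_good))  # ≈ 0.05
print(log_loss(Y, A_bad))   # ≈ 2.30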

Next, we need to calculate the gradient of the loss function with respect to the parameters wʲ and b.

Taking the derivative of the loss with respect to a parameter wʲ is done as:

∂Lᵢ/∂wʲ = (∂Lᵢ/∂aᵢ)(∂aᵢ/∂zᵢ)(∂zᵢ/∂wʲ)

i is the ith sample | j is the jth feature / neuron at a given layer.

Through a derivation, this can be simplified as:

∂Lᵢ/∂wʲ = (aᵢ − yᵢ) xᵢʲ

Though, implementing this in code and expressing the entire equation mathematically can get complex if we do it for every ith sample and jth feature.

Typically, we’d express the equation in linear algebra notation and compute it with matrix operations as:

∂J/∂W = (∂J/∂Z)(∂Z/∂W) = (A − Y) Xᵀ

Something very similar can be done when computing the gradient of the loss J with respect to B.

But the equation is expressed a tad bit differently.

Using linear algebra notation, we’d express it as:

∂J/∂B = (∂J/∂Z)(∂Z/∂B)

Notice how we’ve instantly skipped from (∂J/∂A)(∂A/∂Z) to (∂J/∂Z)!

This is as:

∂J/∂Z = (∂J/∂A)(∂A/∂Z) = A − Y

So we can easily insert (∂J/∂Z) in place of (∂J/∂A)(∂A/∂Z).

This derivative can then be simplified, given that ∂Z/∂B = 1, to:

∂J/∂B = A − Y

Now, before we move on, we need to do one more thing!

Given that the gradients above contain a contribution from every sample, we take the summation across samples and divide by the total number of samples, n, to get the mean gradients.

We can write these averages as:

∂J/∂W = (1/n) (A − Y) Xᵀ

∂J/∂B = (1/n) Σᵢ (aᵢ − yᵢ)

Now that we have the gradients of our loss, J, with respect to parameters W and B, we can update W and B through what’s called the update rule,

θ = θ − ⍺ · (∂J/∂θ)

where θ is a specific parameter and ⍺ is the learning rate.

We’ll set our learning rate, ⍺, to the value of .0001 to begin with.

To update W:

W = W − ⍺ · (∂J/∂W)

Then to update B:

B = B − ⍺ · (∂J/∂B)

We can express all of this, called backpropagation, in Python as:

def back_prop(W, B, Y, X, A, alpha):
    n = X.shape[1]  # number of samples
    dW = np.dot(A - Y, X.T) / n
    dB = np.mean(A - Y)
    W = W - alpha * dW
    B = B - alpha * dB
    return W, B
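If you want to convince yourself the simplified gradient is right, a numerical gradient check is a nice sanity test. This is a side sketch of my own, not part of the original walkthrough: it compares the analytic dW against a finite-difference estimate of ∂J/∂W.

import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])
Y = np.array([[0, 1]])
W = np.random.rand(1, 3)
B = np.random.rand(1, 1)

def J(W, B):
    A = 1 / (1 + np.exp(-(np.dot(W, X) + B)))
    return np.mean(-Y * np.log(A) - (1 - Y) * np.log(1 - A))

A = 1 / (1 + np.exp(-(np.dot(W, X) + B)))
dW = np.dot(A - Y, X.T) / X.shape[1]  # analytic gradient

# finite-difference estimate of each partial derivative
eps = 1e-6
dW_num = np.zeros_like(W)
for j in range(W.shape[1]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    dW_num[0, j] = (J(W_plus, B) - J(W_minus, B)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6))  # should print True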

Finally, we can take all of the above computations and put it into one function which will be called gradient descent.

Gradient descent is the optimization algorithm that allows for a model to learn and be trained over various iterations or epochs.

We’ll be running our gradient descent over 20,000 epochs, meaning our dataset will be passed 20,000 times through our model and its parameters will be updated 20,000 times.

This should be more than enough to get good results given the very small dataset.

Of course, you don’t have to stick to a small dataset, you can most definitely implement this model on larger datasets for binary classification.

Here’s my example on GitHub for predicting the likelihood of heart disease.

We can define the function as:

def gradient_descent(X, Y, epochs, alpha):
    W, B = init_params(3)
    for epoch in range(epochs):
        A = forward(X, W, B)
        loss = log_loss(Y, A)
        W, B = back_prop(W, B, Y, X, A, alpha)

        print(f"Epoch: {epoch}")
        print(f"Loss: {loss}")

    return W, B
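And here’s one way you might call it and turn the activations into class predictions by thresholding at 0.5, a small usage sketch assuming the functions above are defined:

X = np.array([[1, 2], [3, 4], [5, 6]])
Y = np.array([[0, 1]])  # one label per sample
W, B = gradient_descent(X, Y, 20000, 0.0001)
preds = (forward(X, W, B) > 0.5).astype(int)
print(preds)  # predicted class per sample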

Putting everything together, the code for the entire neuron looks like:

import numpy as np

def init_params(dims):
    W = np.random.rand(1, dims)
    B = np.random.rand(1, 1)
    return W, B

def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    return A

def forward(X, W, B):
    Z = np.dot(W, X) + B
    A = sigmoid(Z)
    return A

def log_loss(Y, A):
    loss = np.mean(-Y * np.log(A) - (1 - Y) * np.log(1 - A))
    return loss

def back_prop(X, Y, W, B, A, alpha):
    n = X.shape[1]  # number of samples
    dW = np.dot(A - Y, X.T) / n
    dB = np.mean(A - Y)
    W = W - alpha * dW
    B = B - alpha * dB
    return W, B

def gradient_descent(X, Y, epochs, alpha, dims):
    W, B = init_params(dims)
    for epoch in range(epochs):
        A = forward(X, W, B)
        loss = log_loss(Y, A)
        W, B = back_prop(X, Y, W, B, A, alpha)

        print(f"Epoch: {epoch}")
        print(f"Loss: {loss}")

    return W, B

if __name__ == "__main__":

    X = np.array([[1, 2], [3, 4], [5, 6]])  # (3, 2): 3 features, 2 samples
    Y = np.array([[0, 1]])                  # (1, 2): one label per sample
    W, B = gradient_descent(X, Y, 20000, .0001, 3)

    # You can add code below here to save the params W, B as a pkl file
    # for future use or testing purposes
    # If you want to see a sample how it might be done, visit the github below!
    # https://github.com/vxnuaj/LABS/blob/main/Deep-Learning/Regression/Logistic-Regression/src/heartdisease.py
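For the saving step mentioned in the comment above, one possibility (a sketch of my own, not the linked implementation) is to pickle the parameters right after the gradient_descent call:

import pickle

# save the trained parameters for later use
with open("params.pkl", "wb") as f:
    pickle.dump({"W": W, "B": B}, f)

# ...and load them back later
with open("params.pkl", "rb") as f:
    params = pickle.load(f)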

Try it for yourself!

If you’re curious, check out the actual implementation of this here; it’s built on a larger dataset.

Feel free to reach out on twitter or email.

PS — Check out my substack, if you’re curious to get updates and get an insight into what I’m up to!
