Logistic Regression using Single Layer Perceptron Neural Network (SLPNN)

Angelo (Evangelos Tzimopoulos)
Published in Analytics Vidhya · 15 min read · Oct 4, 2019


10 Weeks Of Machine Learning Fun – Weeks 4–10 Retrospective

Introduction

Welcome to the final Retrospective of the ML challenge, which covers Weeks 4 to 10. As a quick introduction, for those who’d like to follow the full 10-week journey, here are the links to all previous posts:

· Original post about the challenge: #10WeeksOfMachineLearningFun

· Links to previous retrospectives: #Week1 #Week2 #Week3

Weeks 4–10 have now been completed, and so has the challenge!

I’m very pleased to have come this far and so excited to tell you about all the things I’ve learned, but first things first: a quick explanation as to why I’ve ended up summarising the remaining weeks all together, and so late after completing them:

  • The approach I selected for Logistic Regression in #Week3 (approximating the logistic regression function using a Single Layer Perceptron Neural Network, or SLPNN) took longer to unravel, both from a maths and a coding perspective, so it was practically impossible to provide updates on a weekly basis
  • Also, I probably digressed a bit during that period to understand some of the maths, which was good learning overall, e.g. cost functions and their derivatives and, most importantly, when to use one over another and why :) (more on that below)
  • Derivative of the Cost function: given my approach in #Week3, I had to make sure that the back-propagation chain rule maths for working out the partial derivative of the Cost function with respect to the weights tied perfectly with the analytical calculation of the same partial derivative. In order to do so, I had to put pen to paper multiple times, over and over again, until it finally made sense. You can definitely trust the maths once you’ve verified them in two different ways!
  • In fact, I have created a handwritten single-page cheat sheet that shows all of this, which I’m planning to publish separately, so stay tuned.
  • Finally, a fair amount of the time initially planned for the Challenge during weeks 4–10 went to real-life priorities in my professional and personal life. Which is exactly what happens with work, projects, life, etc… You just have to deal with the priorities, get back to what you’re doing and finish the job! So here I am!

Datasets

Before we go back to the Logistic Regression algorithm and where I left it in #Week3, I would like to talk about the datasets selected:

Glass Data Set

There are three main reasons for using this data set:

  1. This dataset has been used for classifying glass samples as being a “Window” type glass or not, which was perfect as my intention was to work on a binary classification problem
  2. As stated in the dataset itself, although being a curated one, it does come from a real-life use case: “the study of classification of types of glass was motivated by criminology investigation. At the scene of the crime, the glass left can be used as evidence…if it is correctly identified!”
  3. Finally, being part of a technical skills workshop presented by Randy Lao, Harpreet Sahota and the amazing DataScienceDreamJob team, it meant it was complemented by reference material I could use to verify my results and not get lost along the way (though I did anyway… lol!).

Data summary and Features

The glass dataset consists of 214 rows and 10 columns: 9 input features and 1 output feature, the glass type:

More detailed information about the dataset can be found here in the complementary Notepad file. As a quick summary, the glass dataset captures the Refractive Index (Column 2), the composition of each glass sample (each row) with regards to its metallic elements (Columns 3–10), and the glass type (Column 11).

Based on the latter, the glass type (attribute 11), there are two classification predictions one can try with this data set:

  • Window (Types 1–4) vs non-Window (Types 5–7) or
  • Float (Types 1 & 3) vs non-Float (Types 2 & 4) vs Not Applicable (Types 5–7)

The first one is a classic binary classification problem. The second one can either be treated as a multi-class classification problem with three classes or, if one wants to predict “Float vs Rest”, the remaining types (non-Float, Not Applicable) can be merged into a single class.

e.g. the code snippet for the first approach, masking the original output feature:

# glass_type 1, 2, 3, 4 are window glass captured as "0"
# glass_type 5, 6, 7 are non-window glass captured as "1"
df['Window'] = df.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1})

The new engineered “Window” output:

The dataframe with all the inputs and the new outputs now looks like the following (including the Float feature):
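For completeness, the Float feature could be engineered with a similar masking approach; the exact encoding below is an assumption for the “Float vs Rest” case, not the original notebook’s code:

# glass_type 1, 3 are float-processed glass captured as "0"
# glass_type 2, 4 (non-Float) and 5, 6, 7 (Not Applicable) captured as "1"
df['Float'] = df.glass_type.map({1:0, 3:0, 2:1, 4:1, 5:1, 6:1, 7:1})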

Going forward, and for the purposes of this article, the focus is going to be on predicting the “Window” output, i.e. which input variables can be used to predict the glass type being Window or not.

Iris Data Set

Initially, I wasn’t planning to use another dataset, but eventually I turned to home-sweet-home Iris to unravel some of the implementation challenges and test my assumptions by coding with a simpler dataset. This, along with some feature selection I did with the glass data set, proved really useful in getting to the bottom of all the issues I was facing, finally being able to tune my model correctly.

Data summary and Features

For the Iris Data set, I’ve borrowed a very handy approach proposed by Martín Pellarolo here to transform the 3 original iris types into 2, thus turning this into a binary classification problem:

from sklearn import datasets

# Keep only the first 2 input features and merge the 3 original
# iris classes into 2: setosa ("0") vs the rest ("1")
iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

Which gives the following scatter plot of the input and output variables:

Single Layer Perceptron Neural Network

A single layer perceptron is the simplest Neural Network, consisting of only one neuron, also called the McCulloch-Pitts (MP) neuron, which passes the weighted sum of its inputs through an activation function to generate a single output. Below is a sample diagram of such a neural network, with X the inputs, θi the weights, z the weighted input and g the output.

Logistic Regression Hypothesis

For the purposes of our experiment, we will use this single-neuron NN to predict the Window type feature we’ve created, based on the inputs being the metallic elements the glass consists of, using Logistic Regression. So, we’re using a classification algorithm to predict a binary output with values being 0 or 1, and the function representing our hypothesis is the Sigmoid function, which is also called the logistic function.

Neural Network Input

The input to the Neural network is the weighted sum of the inputs Xi:
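In the notation of the diagram (θ the weights, x the inputs, z the weighted input):

z = \theta^T x = \sum_{i} \theta_i x_i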

Activation function

The input is transformed using the activation function which generates values as probabilities from 0 to 1:

The mathematical equation that describes it:
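g(z) = \frac{1}{1 + e^{-z}}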

The code snippet that implements it:

def sigmoid(self, x):
    return 1 / (1 + np.exp(-x))

Hypothesis

If we combine all above, we can formulate the hypothesis function for our classification problem:
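h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}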

Feed-forward Loop

As a result, we can calculate the output h by running the forward loop for the neural network with the following function:

def feedforward(self, X):
    # X = self.add_bias(X)
    z = np.dot(X, self.w)
    h = self.sigmoid(z)
    return h

Cost functions

Selecting the correct Cost function is paramount and a deeper understanding of the optimisation problem being solved is required.

Initially I assumed that one of the most common cost functions, Least Squares, would be sufficient for my problem, as I had used it before with more complex Neural Network structures and, to be honest, it made the most sense to take the squared difference of the predicted vs the real output:
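In its usual form (with m the number of training samples; the exact scaling constant is a matter of convention):

J(\theta) = \frac{1}{2m} \sum_{j=1}^{m} \left( h_\theta(x^{(j)}) - y^{(j)} \right)^2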

Unfortunately, this left me stuck and confused, as I could not minimise the error to acceptable levels and, looking at the maths and the coding, they did not seem to match the similar approaches I was researching at the time for help.

To my rescue came the lecture notes (Chapter 6) of Andrew Ng’s online course, covering the cost function for logistic regression. The bottom line was that, for this specific classification problem, I had used a non-linear function for the hypothesis: the sigmoid function. For optimisation purposes, combining the sigmoid with the Least Squares cost produces a non-convex function with multiple local minima, which means gradient descent would not always converge to the global minimum.

The answer to this is to use a convex logistic regression cost function, the Cross-Entropy Loss, which might look long and scary but gives a very neat formula for the Gradient, as we’ll see below:
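With m the number of training samples (matching the .mean() in the code below):

J(\theta) = -\frac{1}{m} \sum_{j=1}^{m} \left[ y^{(j)} \log h_\theta(x^{(j)}) + \left(1 - y^{(j)}\right) \log\left(1 - h_\theta(x^{(j)})\right) \right]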

# Defining the Cost function J(θ) (or else the Error)
# using the Cross Entropy function
def error(self, h, y):
    error = (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
    self.E = np.append(self.E, error)

Gradient Descent — Analytical calculation

Using analytical methods, the next step here would be to calculate the Gradient, which is the step taken at each iteration by which the algorithm converges towards the global minimum, hence the name Gradient Descent.

In mathematical terms this is just the partial derivative of the cost function with respect to the weights, i.e. in every iteration you calculate the adjustment (or delta) for the weights:
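In vectorised form, with m training samples, the gradient and the weight update per iteration are:

\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^T (h - y), \qquad \theta := \theta - \eta \, \frac{\partial J(\theta)}{\partial \theta}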

and repeat for each iteration

Gradient Descent using Backpropagation Chain Rule

Here I will use the backpropagation chain rule to arrive at the same formula for the gradient.

As per the diagram above, in order to calculate the partial derivative of the Cost function with respect to the weights, the chain rule lets us break this down into 3 partial derivative terms, as per the equation:
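\frac{\partial J}{\partial \theta} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial \theta}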

Term (1)

If we differentiate J(θ) with respect to h, we practically take the derivatives of log(h) and log(1-h), the two main parts of J(θ). With a little tidying up in the maths we end up with the following term:
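\frac{\partial J}{\partial h} = -\frac{y}{h} + \frac{1 - y}{1 - h} = \frac{h - y}{h\,(1 - h)}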

Term (2)

The 2nd term is the derivative of the sigmoid function:
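\frac{\partial h}{\partial z} = g(z)\,\bigl(1 - g(z)\bigr) = h\,(1 - h)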

Term (3)

The 3rd term is just the input vector X:
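\frac{\partial z}{\partial \theta} = x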

If we substitute the 3 terms into the calculation of J′, we end up with the swift equation we saw above for the gradient using analytical methods:
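\frac{\partial J}{\partial \theta} = \frac{h - y}{h\,(1 - h)} \cdot h\,(1 - h) \cdot x = (h - y)\, x

Summed over the training samples, this gives the X^T (h - y) expression used in the backprop code below.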

The implementation of this as a function within the Neural Network class is as below:

def backprop(self, X, y, h):
    # Gradient (delta) of the Cost function with respect to the weights
    self.delta_E_w = np.dot(X.T, h - y) / self.outputLayer

    # Store all weights throughout learning
    self.w_list.append(self.w)

    # Adjust weights
    self.w = self.w - eta * self.delta_E_w

Summary — Equations

As a summary, the full set of mathematics involved in the calculation of the gradient descent in our example is below:
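Putting the pieces together (hypothesis, cost, gradient and weight update, in that order):

h = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

J(\theta) = -\frac{1}{m} \sum_{j=1}^{m} \left[ y^{(j)} \log h^{(j)} + \left(1 - y^{(j)}\right) \log\left(1 - h^{(j)}\right) \right]

\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^T (h - y)

\theta := \theta - \eta \, \frac{\partial J(\theta)}{\partial \theta}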

Prediction & Classification

In order to predict the output based on any new input, the following function has been implemented, which utilises the feedforward loop:

def predict(self, X):
    # Forward pass
    pred = self.feedforward(X)
    return pred

As mentioned above, the result is the predicted probability for the Window output. To turn this into a classification we only need to set a threshold (here 0.5) and round the results up or down, whichever is the closest.

def classify(self, X):
    # Round the predicted probability to the nearest class (0 or 1)
    return self.predict(X).round()

Neural Network training and parameters

To train the Neural Network, for each iteration we need to:

  • Pass the input X via the forward loop to calculate output
  • Run the backpropagation to calculate the weights adjustment
  • Apply weights adjustment and continue in the next iteration

The function that implements this is shown below:

def train(self, X, y):
    for epoch in range(epochs):
        # Forward pass
        h = self.feedforward(X)

        # Backpropagation - Calculate Weight adjustments and update weights
        self.backprop(X, y, h)

        # Calculate error based on the Cross Entropy Loss function
        self.error(h, y)

Also, below are the parameters used for the NN, where eta is the learning rate and epochs is the number of iterations.

# Learning Rate
eta = 0.001
# Number of epochs for learning
epochs = 10000
# Input layer
inputLayer = X.shape[1]
# Output Layer
outputLayer = 1
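For reference, since the class constructor itself isn’t shown in the snippets, here is a minimal sketch of how the SLPNN class might be initialised to support the methods above. The attribute names follow the code shown; the initialisation values are assumptions:

import numpy as np

class LogisticRegressionSinglePerceptronModel:
    def __init__(self, eta, inputLayer, outputLayer):
        self.eta = eta                  # learning rate (the snippets above also use the module-level eta)
        self.inputLayer = inputLayer    # number of input features
        self.outputLayer = outputLayer  # single output neuron
        self.w = np.zeros(inputLayer)   # weight vector, one weight per input feature
        self.E = np.array([])           # error history across epochs
        self.w_list = []                # weight history across epochs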

Training and Selecting features

As discussed in the Datasets section, the raw data have 9 raw features. In selecting the correct ones for training, the right approach is to use scatter plots between the variables and the output and, in general, to visualise the data to get a deeper understanding and intuition as to what the starting point can be.
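As a minimal sketch of that kind of exploration (assuming the dataframe df and the lower-case column names used in the snippets below; the full list of element column names is an assumption based on the dataset description):

import matplotlib.pyplot as plt

# Scatter each candidate input feature against the engineered "Window" output
features = ['ri', 'na', 'mg', 'al', 'si', 'k', 'ca', 'ba', 'fe']  # assumed column names
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, col in zip(axes.ravel(), features):
    ax.scatter(df[col], df['Window'], alpha=0.3)
    ax.set_xlabel(col)
    ax.set_ylabel('Window')
plt.tight_layout()
plt.show()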

Input vector with “Al” only

As this was a guided implementation based on Randy Lao’s introduction to Logistic regression using this glass dataset, I initially used the following input vector:

# Selecting Independent Variables
iv = ['al']
# Define Input vector for training
X = df[iv].values

This gives the following scatter plot between the input and output which suggests that there can be an estimated sigmoid function which can be used to classify accordingly:

During testing though, it proved difficult to reduce the error to significantly small values using just one feature, as per the run below:

Input with 5 features

In order to reduce the error, further experimentation led to the selection of a 5-feature configuration for the input vector:

# Selecting Independent Variables
iv = ['ri','mg','al','k','ca']
# Define Input vector for training
X = df[iv].values

Finally, the main part of the code that runs the training for the NN is below:

# Initiate Single Perceptron NN
SPNN = LogisticRegressionSinglePerceptronModel(eta, inputLayer, outputLayer)
# Train SPNN for Logistic Regression Model
%time SPNN.train(X, y)
# Plot Error
SPNN.plot()
# Predict output based on test set
pred = SPNN.predict(X)
# Generate classified output
pred2 = SPNN.classify(X)
# Assess Model accuracy
print("Minimum Error achieved:", min(SPNN.E))
# SPNN weights
SPNN.w

Results

Results and Evaluation

The code ran in ~313ms and resulted in a rapidly converging error curve with a final value of 0.15:

The array at the end contains the final weights, which can be used for prediction on new inputs.

Predicted output

The real vs the predicted output vectors after training show that the prediction has been (mostly) successful:

Validation using Iris Data Set Classification

Given the generalised implementation of the Neural Network class, I was able to re-deploy the code for a second data set, the well-known Iris dataset. As mentioned earlier, this was done for validation purposes, but it was also useful to work with a known and simpler dataset in order to unravel some of the maths and coding issues I was facing at the time.

As described in the Iris Data Set section of this post, with a small manipulation we’ve turned the Iris classification into a binary one.

SLPNN configuration

For the new configuration with the Iris dataset, I have lowered the learning rate and the number of epochs significantly:

# Learning Rate
eta = 0.0005
# Number of epochs for learning
epochs = 3000
# Input layer
inputLayer = X.shape[1]
# Output Layer
outputLayer = 1

Error Curve

As expected, the training time is much smaller than for the Glass Dataset and the algorithm achieves a much smaller error very quickly.

Predicted Output

As with the Glass dataset example, we can also inspect the generated output vs the expected one to verify the results:

Regression line

Based on the predicted values, the plotted regression line looks like below:

Conclusion and Summary

As a summary, during this experiment I have covered the following:

  • Discussed the implementation of a Single Layer Perceptron Neural Network for Binary classification using Logistic Regression
  • Detailed the maths behind the Neural Network inputs and activation functions
  • Analysed the hypothesis and cost function for the logistic regression algorithm
  • Calculated the Gradient using 2 approaches: the backpropagation chain rule and the analytical approach
  • Used 2 datasets to test the algorithm, the main one being the Glass Dataset, and the Iris Dataset which was used for validation
  • Presented results including error graphs, plots and compared outputs to validate the findings

Backlog

As per previous posts, I have been maintaining and curating a backlog of activities that fell outside the weekly scope, so I can go back to them following the completion of the Challenge. This is the full list:

1. #week1 — Implement other types of encoding and at least one type manually, not using libraries

2. #week1 — Refactor the Neural Network Class so that the Output Layer size is configurable

3. #week2 — Solve Linear Regression example with Gradient Descent

4. #week2 — Apply the Linear Regression model prediction and calculations to real data sets (“Advertising” data set or this one from Kaggle)

5. #week3 — Read on Analytical calculation of Maximum Likelihood Estimation (MLE) and re-implement Logistic Regression example using that (no libraries)

6. #week4_10 — Add more validation measures on the logistic algorithm implementation

7. #week4_10 — Implement Glass Set classification with sklearn library to compare performance and accuracy

#PRODUCTIVITY #EFFECTIVENESS #RESILIENCE

Having completed this 10-week challenge, I feel a lot more confident about my approach in solving Data Science problems, my maths & statistics knowledge and my coding standards.

Having said that, the 3 things I still need to improve are:

a) my approach in solving Data Science problems

b) my maths and statistics knowledge and

c) my coding standards

Lol… it never ends, enjoy the journey and learn, learn, learn!

About my journey

  • As noted in the introduction, I started the 10-week challenge a while back but was only able to publish on a weekly basis for the first 3 weeks. And that was a lot to take in every week: crack the maths (my approach was to implement the main ML algorithms without using libraries where possible), implement and test, and write it up every Sunday
  • And that was after all family and professional duties during a period with crazy projects in both camps 😊
  • Nevertheless, I took a step back to focus on understanding the concepts and the maths and to make real progress, even if that meant it was slower and already breaking my rules. So, I stopped publishing and kept working.
  • Then I had a planned family holiday that I was also looking forward to, so I took another long break before diving back in.
  • This is the critical point where you might never come back!
  • But I did and got stuck in the same problems and continued as I really wanted to get this over the line.
  • Waking up at 4:30 am 4 or 5 days a week was critical in carving out 6–8 hours per week
  • And being that early in the morning meant that concentration was 100%, i.e. 6–8 net working hours practically means 1–2 extra working days per week, just for me!

Finally, one last comment to say that I would recommend to anyone who’s starting in Data Science to try something similar and see how far they can go and push themselves.

I’d love to hear from people who have done something similar or are planning to.

Also, to any geeks out there who would like to try my code: give me a shout and I’ll be happy to share it; I’m still tidying up my GitHub account.

Drop me your comments & feedback, and thanks for reading this far.

Angelo

#machinelearning #datascience #python #LogisticRegression
