Logistic Regression With PyTorch — A Beginner Guide

Build On Dataset — Wheat Seed Species Prediction

Hargurjeet
Analytics Vidhya
8 min read · May 14, 2021


If you are someone who wants to get started with PyTorch but is not quite sure which dataset to pick first, you are in the right place. PyTorch implementations span everything from classical machine learning to deep neural networks. Before we dive in, let us answer a couple of fundamental questions.

  1. What is PyTorch?
    PyTorch is an open-source, community-driven deep learning framework developed by Facebook’s artificial intelligence research group. PyTorch is widely used for several deep learning applications such as natural language processing, computer vision applications, image classification, transfer learning, and so on.
  2. What are tensors in PyTorch?
    Tensors are n-dimensional arrays. Tensors are core to the PyTorch library and are used for efficient computation in deep learning. A tensor of order zero is a number. A tensor of order one is an array of numbers, i.e. a vector. A tensor of order two is an array of vectors, i.e. a matrix. Unlike NumPy arrays, tensors can also be used on GPUs, giving us the advantage of faster computation.

About Dataset

The Wheat Seeds dataset is a simple dataset for getting started with PyTorch basics.
It involves predicting the species of a wheat seed from measurements of seeds taken from different varieties of wheat.

Table of contents

  1. Introduction
  2. Data Pre Processing
    2.1 Loading the required libraries
    2.2 Get Data
  3. Feature Analysis
    3.1 Area vs Kernel length
    3.2 Area vs Kernel Width
    3.3 Plotting Groove vs Perimeter
    3.4 Area vs Kernel Groove
  4. Setting up the PyTorch model
    4.1 Preprocessing and creating DataLoaders
    4.2 Model Creation
    4.3 Training the model
    4.4 Plotting losses and accuracies
  5. Summary
  6. Future Work
  7. References

Introduction

The prediction of wheat seed species is a classification problem. The number of observations for each class is balanced. There are 199 observations with 7 input variables and 1 output variable. The variable names are as follows:

  1. Area.
  2. Perimeter.
  3. Compactness.
  4. Length of kernel.
  5. Width of kernel.
  6. Asymmetry coefficient.
  7. Length of kernel groove.
  8. Class (1, 2, 3).

Data Pre Processing

Loading the required Libraries

Let us load the standard list of libraries we will need while working on this dataset. I usually prefer to load them in a single shot 🆒
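A minimal set of imports for this walkthrough might look like the following (the exact list in the original notebook may differ):

```python
# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# PyTorch core, functional API, and data utilities
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, random_split
```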

Get Data

The dataset can be downloaded from GitHub or Kaggle. I have downloaded the dataset from GitHub.
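As a sketch, assuming the CSV has been saved locally and contains the column headers described in the introduction (the file name here is illustrative):

```python
# Load the wheat seeds data into a DataFrame (file name is illustrative)
df = pd.read_csv('seeds.csv')
print(df.shape)   # expect roughly (199, 8): 7 features plus the Type column
df.head()
```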

Feature Analysis

Now that the dataset is downloaded, let us explore it to derive insights into the features.

From the above graph, the relationship between the species and the features is not very clear, so I run the code below to understand the correlations between the features.
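One way to do this is a correlation heatmap; a minimal sketch:

```python
# Compute pairwise correlations between all numeric columns
corr = df.corr()

# Visualize the correlation matrix as an annotated heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```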

Following are my observations:

  • Kernel length and width seem to have a strong relation with Area.
  • Kernel length and width seem to have a strong relation with Perimeter.
  • Area and Perimeter have a strong relation with Groove.
  • The correlation between Groove and Type is very close to zero (0.03), confirming there is no linear relation between them.

I now plot each of the above-mentioned pairs for each Type, starting with Area and kernel length.

Area Vs Kernel length
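A sketch of such a scatter plot, assuming the column names listed in the introduction (Area, Kernel.Length, Type):

```python
# Scatter plot of Area against kernel length, coloured by wheat Type
sns.scatterplot(data=df, x='Area', y='Kernel.Length', hue='Type', palette='deep')
plt.title('Area vs Kernel Length by Type')
plt.show()
```

The same snippet with different column names produces the remaining plots below.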

Area Vs Kernel Width

Plotting Area and Kernel Width for each Type.

I observe that the data for Type 1 and Type 2 overlap, with minimal overlap with Type 3. Hence I conclude that while Type 1 and Type 2 seeds might be similar in width, Type 3 seeds are definitely larger.

Plotting Groove Vs Perimeter

Area Vs Kernel Groove

The kernel groove lengths of Type 1 and Type 2 seem to fall in the same range, while Type 3 seems to have a higher groove length.

Building the model using PyTorch

As this is a classification problem, I am building a logistic regression model here. A few key points to note:

  • A logistic regression model is almost identical to a linear regression model. It contains weights and bias matrices, and the output is obtained using simple matrix operations (pred = x @ w.t() + b).
  • We use nn.Linear to create the model.
  • The output is a vector of size 3, with each element signifying the probability of a particular target label (i.e., 0 to 2). The predicted label for a wheat seed is simply the one with the highest probability.

Preprocessing and creating DataLoaders

First, the data needs to be converted to tensors.
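A sketch of that conversion, assuming the column names listed in the introduction and a Type column holding labels 1–3 (shifted to 0–2 so they can be used with cross-entropy):

```python
# Feature columns (7 inputs) and the target column (Type)
feature_cols = ['Area', 'Perimeter', 'Compactness', 'Kernel.Length',
                'Kernel.Width', 'Asymmetry.Coeff', 'Kernel.Groove']

# Convert the DataFrame to float32 feature tensors and int64 label tensors
inputs = torch.tensor(df[feature_cols].values, dtype=torch.float32)
targets = torch.tensor(df['Type'].values - 1, dtype=torch.long)  # 1..3 -> 0..2
```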

To feed the data to the model in batches, we use a DataLoader, which is available in torch.utils.data. It can be used as follows; the data is first split into training and validation sets.
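A sketch using TensorDataset, random_split, and DataLoader (the split ratio and batch size are illustrative choices, not necessarily the ones used in the original notebook):

```python
# Wrap the tensors in a dataset and split it into training and validation sets
dataset = TensorDataset(inputs, targets)
val_size = int(0.2 * len(dataset))
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

# DataLoaders yield shuffled mini-batches during training
batch_size = 16
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
```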

Before setting up the model, we need to understand the input size and the output size. In the current dataset we have 7 features (Area, Perimeter, Compactness, Kernel.Length, Kernel.Width, Asymmetry.Coeff, Kernel.Groove), so the input size is 7. The output class (i.e., Type) has 3 different categories, so the output size is 3.
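In code, that is simply:

```python
input_size = 7    # number of feature columns
num_classes = 3   # wheat types 1, 2 and 3
```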

Model Creation

Before we create the model, let us understand what logistic regression is.
In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each class would be assigned a probability between 0 and 1, with the probabilities summing to one.

Here is how the model will look. Frankly, you need to be somewhat familiar with object-oriented concepts in Python. You don't have to master them; fundamental knowledge is sufficient to get through building PyTorch models.

A basic feed-forward network will look like this:
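A minimal sketch of such a model (the class name is illustrative):

```python
class WheatModel(nn.Module):
    """Logistic regression: a single linear layer from 7 features to 3 classes."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        return self.linear(xb)

model = WheatModel()
```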

We see above that the model accepts 7 inputs and produces 3 outputs. Based on the maximum probability, we select one of the 3 outputs as the predicted class.

In the next step, I add the training and validation steps, along with the loss function, to the model.
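A sketch of the full model class, replacing the minimal version above (the accuracy helper used in validation_step is defined in the Evaluation Metric section below):

```python
class WheatModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        return self.linear(xb)

    def training_step(self, batch):
        inputs, labels = batch
        out = self(inputs)                    # generate predictions
        return F.cross_entropy(out, labels)   # softmax + negative log-likelihood

    def validation_step(self, batch):
        inputs, labels = batch
        out = self(inputs)
        loss = F.cross_entropy(out, labels)
        acc = accuracy(out, labels)           # defined in the next section
        return {'val_loss': loss, 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        # Average the per-batch losses and accuracies for the epoch
        epoch_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        epoch_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print(f"Epoch [{epoch}], val_loss: {result['val_loss']:.4f}, "
              f"val_acc: {result['val_acc']:.4f}")
```

The optimizer itself is created inside the fit function shown later, so the model class only has to know how to compute the loss and the validation metrics.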

It is also important to understand the evaluation metric.

Evaluation Metric
We need a way to evaluate how well our model is performing. A natural way to do this is to find the percentage of labels that were predicted correctly. We calculate this with the function below.
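A common way to write this helper (a sketch):

```python
def accuracy(outputs, labels):
    # Pick the class with the highest score for each row of model outputs
    _, preds = torch.max(outputs, dim=1)
    # Fraction of predictions that match the true labels
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))
```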

The == operator performs an element-wise comparison of two tensors with the same shape and returns a tensor of the same shape, containing True for equal elements and False for unequal elements. Passing the result to torch.sum returns the number of labels that were predicted correctly. Finally, we divide by the total number of predictions to get the accuracy.

Before we train the model, let’s see how the model performs on the validation set with the initial set of randomly initialized weights & biases.
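A small evaluate helper makes this a one-liner; it simply reuses the validation methods defined on the model (a sketch):

```python
def evaluate(model, val_loader):
    # Run a full validation pass and aggregate the per-batch results
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

model = WheatModel()
result0 = evaluate(model, val_loader)
print(result0)
```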

The initial accuracy is around 31%, which is what one might expect from a randomly initialized model (since it has a 1-in-3 chance of getting a label right by guessing randomly).

Training the model

Now that we have defined the data loaders and the model, we are ready to train it.
Here is the pseudocode for the training and validation phases.
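In outline, each epoch does the following (a sketch; the actual implementation follows in the fit function below):

```python
# for each epoch:
#     Training phase
#     for each batch of training data:
#         compute the loss
#         compute gradients of the loss w.r.t. the weights and biases
#         adjust the weights using the gradients (one optimizer step)
#         reset the gradients to zero
#
#     Validation phase
#     for each batch of validation data:
#         compute the loss and accuracy
#     aggregate the batch results into a single validation result
#     log the validation loss and accuracy for this epoch
```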

Some parts of the training loop are specific to the problem we're solving (e.g., the loss function and metrics), whereas others are generic and can be applied to any deep learning problem.

We'll include the problem-independent parts within a function called fit, which will be used to train the model. The problem-specific parts are implemented as the methods we added to our nn.Module subclass.

The fit function records the validation loss and metric from each epoch. It returns a history of the training, useful for debugging & visualization.
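A sketch of such a fit function, following the standard PyTorch training-loop pattern (the default optimizer here is an assumption):

```python
def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []                                 # validation results per epoch
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()                      # compute gradients
            optimizer.step()                     # update weights
            optimizer.zero_grad()                # reset gradients
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history
```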

We are now ready to train the model. Let’s train for five epochs and look at the results.
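For example (the learning rate here is an illustrative choice):

```python
# Train for five epochs and keep the per-epoch validation results
history = fit(5, 1e-3, model, train_loader, val_loader)
```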

Plotting losses and accuracies

While the accuracy does continue to increase as we train for more epochs, the improvements get smaller with every epoch. Let’s visualize this using a line graph.
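A sketch of that plot, using the history returned by fit:

```python
# Plot the recorded validation accuracy against the epoch number
accuracies = [result['val_acc'] for result in history]
plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. No. of epochs')
plt.show()
```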

It's quite clear from the above picture that the model probably won't cross the accuracy threshold of 80% even after training for a very long time. One possible reason is that the learning rate might be too high: the model's parameters may be "bouncing" around the optimal set of parameters for the lowest loss. We can try reducing the learning rate and training for a few more epochs to see if it helps.

The more likely reason is that the model just isn't powerful enough. There are various techniques to improve the performance of the model, but they are beyond the scope of this notebook.

Summary

Here is a brief summary of the article and the step-by-step process we followed to build the PyTorch logistic regression model.

  1. We briefly learned about the PyTorch framework and tensors.
  2. Downloaded the dataset and performed feature analysis.
  3. Followed these steps to build the model:
    i. Set up a DataLoader and split the data into training and validation sets.
    ii. Built a feed-forward network with training and validation steps, a loss function, and an optimizer.
    iii. Defined an evaluation metric that captures model accuracy.
    iv. Developed a problem-independent fit function that records the validation loss and accuracy for each epoch.
    v. Trained the model with different learning rates.
    vi. Built a visualization of the captured val_acc against the epochs run during training.

Future Work

Here are some ways in which the project can be extended:

  1. Try updating the model parameters to further improve the accuracy.
  2. Build a multi-layer feed-forward network and validate the model's performance.
  3. Try implementing classical machine learning models like SVMs, which are particularly good with clustered data, and evaluate their performance.

References

  1. Access or execute complete notebook - https://jovian.ai/hargurjeet/wheat-seeds-analysis-pytorch-blogs
  2. https://pytorch.org/
  3. https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
  4. https://jovian.ai/learn/deep-learning-with-pytorch-zero-to-gans

I really hope you guys learned something from this post. Feel free to give a 👏 if you like what you learnt. This keeps me motivated.

Thanks for reading this article. Happy Learning 😃
