
PyTorch — Logistic Regression on Iris dataset

Ayisha D

--

The Iris dataset is a multivariate dataset describing the three species of Iris — Iris setosa, Iris virginica and Iris versicolor. It contains the sepal length, sepal width, petal length and petal width of 50 samples of each species.

Logistic regression is a statistical model based on the logistic function that predicts the probability of a binary outcome (i.e., belongs/does not belong, 1/0, etc.) from the given independent input variables. Since the Iris dataset has three classes, we will use its multiclass generalization, which outputs a probability for each class.

To start with, let us import the required libraries.
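A minimal set of imports for this walkthrough might look like the following (a sketch; your setup may differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, random_split
```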

The first step towards working with any dataset is to make sense of its content, that is, the metadata of the dataset. The iris dataset can be loaded either from scikit-learn or by loading iris.csv available on GitHub.
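For instance, loading it from scikit-learn into a DataFrame could look like this (a sketch; the species column is converted to strings to mirror the iris.csv layout):

```python
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# store the species as strings, as iris.csv does
df['species'] = [iris.target_names[t] for t in iris.target]

df.head()       # first few rows
df.describe()   # summary statistics for the four measurements
```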

This is a balanced dataset, as we already know from the description of the dataset. But this information may not always be available externally and will have to be extracted from the dataset.
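One quick way to verify the balance is to count the samples per class:

```python
df['species'].value_counts()
# setosa        50
# versicolor    50
# virginica     50
```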

Visualization

This dataset has 4 input features and 1 output feature. Thus, the visualization plot for this dataset will be a 5-D scatter plot, with varying hue, depth and marker size representing the dimensions beyond the usual two. While this may be hard to interpret, it is not impossible.

For this, we need to convert the string classes in the ‘species’ column into numeric classes.
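A simple mapping does the job (a sketch; the class order is an arbitrary choice, and the keys assume the plain species names used above):

```python
species_to_idx = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
df['species'] = df['species'].map(species_to_idx)
```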

5D scatter plot of Iris dataset where size represents petal length and colour represents species.

A 360° view of the above plot can be found here.

For higher-dimensional data, visualization can be tedious and unintelligible. It is also possible that all the given dimensions do not affect the output variable and may lead to noise.

Keeping this in mind, let us now plot the dataset taking two features at a time and determine the separability of the three classes in each instance.
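Seaborn's pairplot draws exactly this grid, with KDE plots along the diagonal:

```python
sns.pairplot(df, hue='species', diag_kind='kde')
plt.show()
```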

From the plots, it can be inferred that Iris setosa is easily separable, while the other two species overlap for every pair of features. The KDE plots along the diagonal indicate the overlap among the species for each input feature; this overlap is smallest for petal length and petal width. Thus we have identified 2 of the 4 features that contribute the most towards classifying the species.

Now that the data has been viewed and analyzed, the next step is to prepare the data for the model.

Preparing the data

Convert the input and output features into tensors with appropriate datatypes. Two input tensors are created — one with all the input features and the other with only the selected features. The former shall be used later to train the same model and compare results.
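A sketch of the conversion, assuming the scikit-learn column names from above (inputs_all holds all four features, inputs_sel only petal length and petal width):

```python
inputs_all = torch.tensor(df.iloc[:, 0:4].values, dtype=torch.float32)
inputs_sel = torch.tensor(df[['petal length (cm)', 'petal width (cm)']].values,
                          dtype=torch.float32)
# class indices must be 64-bit integers for F.cross_entropy
targets = torch.tensor(df['species'].values, dtype=torch.long)
```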

Create a TensorDataset object using the input and output columns. For now, we focus only on the 2 features we identified as input.
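Wrapping the selected features and the targets together:

```python
dataset = TensorDataset(inputs_sel, targets)
```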

Split into train, validation and test sets. A 70–20–10 split has been used here.
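With 150 samples, a 70–20–10 split works out to 105, 30 and 15:

```python
train_ds, val_ds, test_ds = random_split(dataset, [105, 30, 15])
```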

Create DataLoader instances for each set. These are iterables that can be used to loop through the train, validation and test sets and return batches of data of the required size. If the last batch has fewer elements than the specified batch size (= 16 here), they are not truncated. Instead, the last batch is simply returned with fewer elements.
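A sketch of the loaders (only the training set is shuffled; the default drop_last=False keeps the short final batch):

```python
batch_size = 16
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
test_loader = DataLoader(test_ds, batch_size)
```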

Create the model.

The model class extends nn.Module and has four functions (a complete sketch follows this list) —

  • __init__

This function calls the __init__ function of the super class. This is mandatory. The different layers to be included in the model are defined here. This model has 3 layers — 2 nn.Linear and 1 nn.Dropout.

The input size is not hard-coded because we will be running the model with two different input sizes (2 and 4). The output size corresponds to the number of classes, since logistic regression returns a probability for each class.

  • forward

This function returns the output obtained after the input data has been passed through the layers of the model.

  • training_step

Every input in the given batch is passed through the forward function, which is invoked by calling the model instance itself (self(inputs)). The outputs are then compared with the targets to calculate and return the loss. We use the cross-entropy loss as this is a classification problem.

  • validation_step

Here too, the inputs are passed through the forward function to obtain outputs used to calculate the cross-entropy loss. Accuracy is calculated as the number of correct predictions in the batch divided by the total number of predictions. The .detach() call returns tensors that are cut off from the computation graph (requires_grad = False), so these values are excluded from gradient calculation.
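Putting the four functions together, the class might look like this (a sketch: the name IrisModel, the hidden width of 6 and the dropout probability of 0.2 are illustrative assumptions, not values from the original code):

```python
class IrisModel(nn.Module):
    def __init__(self, input_size, num_classes=3):
        super().__init__()                       # mandatory superclass init
        self.linear1 = nn.Linear(input_size, 6)  # hidden width of 6 is an assumption
        self.dropout = nn.Dropout(p=0.2)         # dropout probability is an assumption
        self.linear2 = nn.Linear(6, num_classes)

    def forward(self, xb):
        # pass the batch through the three layers in order
        return self.linear2(self.dropout(self.linear1(xb)))

    def training_step(self, batch):
        inputs, targets = batch
        outputs = self(inputs)                   # calls forward via the instance
        return F.cross_entropy(outputs, targets)

    def validation_step(self, batch):
        inputs, targets = batch
        outputs = self(inputs)
        loss = F.cross_entropy(outputs, targets)
        preds = torch.argmax(outputs, dim=1)
        acc = (preds == targets).float().mean()  # correct / total for this batch
        return [loss.detach(), acc.detach()]
```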

Next is the function that evaluates the performance of the model.

Evaluating an untrained model

Outputs of the form [loss, accuracy] are appended to a list for each batch. The transpose of its tensor form contains all the losses in the first row and all the accuracies in the second. The mean for each is returned as final loss and accuracy.
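A sketch of that function, followed by a call on a freshly initialized (untrained) model:

```python
def evaluate(model, loader):
    model.eval()  # disable dropout while evaluating
    # one [loss, accuracy] pair per batch
    outputs = [model.validation_step(batch) for batch in loader]
    # transpose: first row holds the losses, second row the accuracies
    metrics = torch.stack([torch.stack(pair) for pair in outputs]).T
    loss, acc = metrics.mean(dim=1)
    return loss.item(), acc.item()

model = IrisModel(input_size=2)
evaluate(model, val_loader)  # roughly chance-level (~0.33) accuracy is expected
```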

Next is the function that fits the data to the model.

A default optimizer is specified in the parameters while a different one can also be passed as an argument while calling the function. For every epoch, losses from each batch of the training set are used to calculate the gradient. The optimizer then updates the model parameters and resets gradients to zero before moving on to the next batch so that the gradients are not accumulated. At the end of the training set loop, the loss and accuracy for the epoch are calculated using the validation set.
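A sketch of the training loop, with torch.optim.SGD as the default optimizer:

```python
def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        model.train()              # re-enable dropout for training
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()        # compute gradients from the batch loss
            optimizer.step()       # update the model parameters
            optimizer.zero_grad()  # reset gradients so they do not accumulate
        # epoch-level loss and accuracy from the validation set
        val_loss, val_acc = evaluate(model, val_loader)
        history.append((val_loss, val_acc))
        print(f'Epoch {epoch + 1}: val_loss = {val_loss:.4f}, val_acc = {val_acc:.4f}')
    return history
```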

Now comes the training part:
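Training on the two selected features (the epoch count and learning rate here are arbitrary choices, not the original hyperparameters):

```python
history = fit(epochs=100, lr=0.05, model=model,
              train_loader=train_loader, val_loader=val_loader)
evaluate(model, test_loader)
```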

The test accuracy is 0.867.

Let us try training the model with all input features.
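The same pipeline, rebuilt around the four-feature tensor:

```python
dataset_all = TensorDataset(inputs_all, targets)
train_ds, val_ds, test_ds = random_split(dataset_all, [105, 30, 15])
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
test_loader = DataLoader(test_ds, batch_size)

model_all = IrisModel(input_size=4)
fit(epochs=100, lr=0.05, model=model_all,
    train_loader=train_loader, val_loader=val_loader)
evaluate(model_all, test_loader)
```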

The test accuracy is 0.800.

Though the accuracy here seems better when the model is trained with only the identified features (a process known as feature selection), the test results in both cases vary with the number of epochs, the batch size and the learning rate.

I would like to thank Jovian for hosting this course and all of you for reading.
