Image classification (American Sign language) using PyTorch

Sachin Som
9 min read · Jul 1, 2020


This is part of a course project conducted by jovian.ml with freeCodeCamp. In this project, I use the Sign Language MNIST dataset to classify sign language images with three different models: logistic regression, a feed-forward neural network, and a convolutional neural network.

Sign Language MNIST Dataset:

The American Sign Language letter database of hand gestures represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion).

The dataset format is patterned to match the classic MNIST closely. Each training and test case represents a label (0–25) as a one-to-one map for each alphabetic letter A–Z (with no cases for 9=J or 25=Z because of their gesture motions). The training data (27,455 cases) and test data (7,172 cases) are approximately half the size of the standard MNIST but otherwise similar, with a header row of label, pixel1, pixel2, … pixel784; each row represents a single 28x28 pixel image with grayscale values between 0 and 255. The original hand gesture image data represented multiple users repeating the gestures against different backgrounds. The Sign Language MNIST data was created by greatly extending a small set (1,704) of color images that were not cropped around the hand region of interest.

Preparing The Data:

First, I will import some libraries that I will use throughout this project:
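Here is a minimal set of imports that covers everything used below (the original notebook may include a few more, e.g. matplotlib for plotting):

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, random_split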

Now we will load our CSV files. For that, we define two dataframes, one for the training data and one for the test data:
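A sketch of this step, assuming the Kaggle CSVs sit in the working directory (the paths are placeholders; point them at your copy of the dataset):

# Paths are placeholders; adjust to your copy of the Kaggle dataset
train_df = pd.read_csv('sign_mnist_train.csv')
test_df = pd.read_csv('sign_mnist_test.csv')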

The next step is to convert the dataframes into NumPy arrays:
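Something along these lines works, splitting the label column from the 784 pixel columns (the variable names here are mine):

# First column is the label; the remaining 784 columns are pixel values
train_labels = train_df['label'].values
train_images = train_df.drop('label', axis=1).values.astype(np.float32)
test_labels = test_df['label'].values
test_images = test_df.drop('label', axis=1).values.astype(np.float32)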

The next step is to convert all of the NumPy arrays into PyTorch tensors:
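A sketch of the conversion: each flat 784-pixel row is reshaped into a 1x28x28 image tensor, and images and labels are bundled into TensorDatasets (the names train_ds_full and test_ds match the variables used later in the post):

# Reshape flat rows into (channels, height, width) image tensors
train_x = torch.from_numpy(train_images).reshape(-1, 1, 28, 28)
train_y = torch.from_numpy(train_labels).long()
test_x = torch.from_numpy(test_images).reshape(-1, 1, 28, 28)
test_y = torch.from_numpy(test_labels).long()

train_ds_full = TensorDataset(train_x, train_y)
test_ds = TensorDataset(test_x, test_y)

print(train_x[0].shape)   # torch.Size([1, 28, 28])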

We can see that each image has been converted into a 3-dimensional tensor of shape (1, 28, 28). The first dimension is the number of channels; the second and third are the height and width of the image, in this case 28px by 28px.

Now we will define the hyperparameters and constants for our model. Note that although only 24 letters actually appear in the data (J and Z are excluded), the labels run from 0 to 25, so we keep num_classes = 26 to preserve the original label indices.

# Hyperparameters
batch_size = 64
learning_rate = 0.001

# Other constants
in_channels = 1
input_size = in_channels * 28 * 28
num_classes = 26

Training and validation datasets

Now we are going to use three datasets:

  1. Training set — used to train the model (compute the loss and adjust the weights of the model using gradient descent).
  2. Validation set — used to evaluate the model while training, adjust hyperparameters (learning rate etc.) and pick the best version of the model.
  3. Test set — used to compare different models, or different types of modeling approaches, and report the final accuracy of the model.
val_size = 7455
train_size = len(train_ds_full) - val_size

train_ds, val_ds = random_split(train_ds_full, [train_size, val_size])
len(train_ds), len(val_ds), len(test_ds)

Out:

(20000, 7455, 7172)

Now we will load the training, validation, and test datasets in batches:

train_dl = DataLoader(train_ds, batch_size, shuffle=True, num_workers=4, pin_memory=True)
val_dl = DataLoader(val_ds, batch_size*2, num_workers=4, pin_memory=True)
test_dl = DataLoader(test_ds, batch_size*2, num_workers=4, pin_memory=True)

for img, label in train_dl:
    print(img.size())
    break

torch.Size([64, 1, 28, 28])

Models for image classification

We are going to create three different models for this project:

  1. Logistic Regression
  2. Deep Neural Network
  3. Convolutional Neural Network

Logistic regression

class ASLModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        xb = xb.reshape(-1, in_channels*28*28)
        out = self.linear(xb)
        return out

    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        return loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        acc = accuracy(out, labels)          # Calculate accuracy
        return {'val_loss': loss.detach(), 'val_acc': acc.detach()}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()  # Combine losses
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()     # Combine accuracies
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['val_loss'], result['val_acc']))

model = ASLModel()

Training The Model

def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history

result0 = evaluate(model, val_dl)
result0

Out:

{'val_loss': 163.75135803222656, 'val_acc': 0.040651481598615646}

The initial accuracy is around 4%, which is roughly what one might expect from a randomly initialized model (with 26 classes, random guessing has about a 1-in-26, i.e. roughly 3.8%, chance of getting a label right). Also note that we are using the .format method in epoch_end to print only the first four digits after the decimal point.

We are now ready to train the model. Let's train in rounds of 10 epochs and look at the results (the run below is the fourth such round, with a much-reduced learning rate; the earlier rounds are omitted for brevity):

history4 = fit(10, 0.000001, model, train_dl, val_dl)

Epoch [0], val_loss: 10.3801, val_acc: 0.9426
Epoch [1], val_loss: 10.3712, val_acc: 0.9425
Epoch [2], val_loss: 10.3667, val_acc: 0.9421
Epoch [3], val_loss: 10.3638, val_acc: 0.9422
Epoch [4], val_loss: 10.3586, val_acc: 0.9425
Epoch [5], val_loss: 10.3527, val_acc: 0.9424
Epoch [6], val_loss: 10.3484, val_acc: 0.9421
Epoch [7], val_loss: 10.3414, val_acc: 0.9424
Epoch [8], val_loss: 10.3342, val_acc: 0.9425
Epoch [9], val_loss: 10.3324, val_acc: 0.9428

So after about 40 epochs in total, we went from 4% accuracy to roughly 94%, which is quite remarkable for a plain linear model.

Deep Neural Network

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN finds the correct mathematical manipulation to turn the input into the output, whether it be a linear relationship or a non-linear relationship.

Defining the model

def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

class ASLModel2(nn.Module):
    """Feedforward neural network with 3 hidden layers"""
    def __init__(self, in_size, out_size):
        super().__init__()
        # hidden layer 1
        self.linear1 = nn.Linear(in_size, 512)
        # hidden layer 2
        self.linear2 = nn.Linear(512, 256)
        # hidden layer 3
        self.linear3 = nn.Linear(256, 128)
        # output layer
        self.linear4 = nn.Linear(128, out_size)

    def forward(self, xb):
        # Flatten the image tensors
        out = xb.view(xb.size(0), -1)
        # Get intermediate outputs using hidden layer 1, then apply activation
        out = F.relu(self.linear1(out))
        # Get intermediate outputs using hidden layer 2, then apply activation
        out = F.relu(self.linear2(out))
        # Get intermediate outputs using hidden layer 3, then apply activation
        out = F.relu(self.linear3(out))
        # Get predictions using output layer
        out = self.linear4(out)
        return out

    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        return loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        acc = accuracy(out, labels)          # Calculate accuracy
        return {'val_loss': loss, 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()  # Combine losses
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()     # Combine accuracies
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['val_loss'], result['val_acc']))

Using a GPU

To train on the GPU, we need a couple of utility functions for moving our data and model to the right device, so let's define them:

torch.cuda.is_available()

Out:

True

def get_default_device():
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

device = get_default_device()
device

Out:

device(type='cuda')

def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl:
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)

train_dl = DeviceDataLoader(train_dl, device)
val_dl = DeviceDataLoader(val_dl, device)
test_dl = DeviceDataLoader(test_dl, device)
print(train_dl.device)
print(test_dl.device)
print(val_dl.device)

cuda
cuda
cuda

Training the Model

def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history

input_size, num_classes

Out:

(784, 26)

model = ASLModel2(input_size, out_size=num_classes)
# Load the model onto the GPU
model = to_device(model, device)
model

Out:

ASLModel2(
  (linear1): Linear(in_features=784, out_features=512, bias=True)
  (linear2): Linear(in_features=512, out_features=256, bias=True)
  (linear3): Linear(in_features=256, out_features=128, bias=True)
  (linear4): Linear(in_features=128, out_features=26, bias=True)
)
history = [evaluate(model, val_dl)]
history

Out:

[{'val_loss': 14.12060546875, 'val_acc': 0.041877392679452896}]

So initially this model has an accuracy of only about 4%, essentially random guessing. To improve it, we will train for a number of epochs:


history += fit(10, .001, model, train_dl, val_dl)

Epoch [0], val_loss: 1.9782, val_acc: 0.3867
Epoch [1], val_loss: 1.3152, val_acc: 0.5732
Epoch [2], val_loss: 1.0640, val_acc: 0.6374
Epoch [3], val_loss: 0.8769, val_acc: 0.6941
Epoch [4], val_loss: 0.6305, val_acc: 0.7931
Epoch [5], val_loss: 0.5267, val_acc: 0.8190
Epoch [6], val_loss: 0.3588, val_acc: 0.8940
Epoch [7], val_loss: 0.1764, val_acc: 0.9652
Epoch [8], val_loss: 0.2343, val_acc: 0.9314
Epoch [9], val_loss: 0.1089, val_acc: 0.9845

Testing on the test dataloader

result = evaluate(model, test_dl)
result

Out:

{'val_loss': 0.8095934987068176, 'val_acc': 0.7504112124443054}

So with the DNN we got about 75% accuracy on the test set, noticeably lower than its validation accuracy, which suggests the model does not generalize perfectly to unseen data.

Convolutional Neural Network

In Deep Learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics. They have applications in image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, and financial time series.

def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

class ASLBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        return loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)                   # Generate predictions
        loss = F.cross_entropy(out, labels)  # Calculate loss
        acc = accuracy(out, labels)          # Calculate accuracy
        return {'val_loss': loss.detach(), 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()  # Combine losses
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()     # Combine accuracies
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['train_loss'], result['val_loss'], result['val_acc']))


class ASLCNNModel(ASLBase):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(in_channels, 28, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(28, 28, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # output size: 28 x 14 x 14

            nn.Conv2d(28, 56, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(56, 56, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # output size: 56 x 7 x 7

            nn.Flatten(),
            nn.Linear(56*7*7, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes))

    def forward(self, xb):
        return self.network(xb)

model = ASLCNNModel(in_channels, num_classes)

Now we again wrap our data loaders and move the new model onto the GPU (this assumes the DataLoaders were re-created beforehand; wrapping an already-wrapped DeviceDataLoader would be redundant):

train_dl = DeviceDataLoader(train_dl, device)
val_dl = DeviceDataLoader(val_dl, device)
test_dl = DeviceDataLoader(test_dl, device)
to_device(model, device);

Train the model:

@torch.no_grad()
def evaluate(model, val_loader):
    model.eval()
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase
        model.train()
        train_losses = []
        for batch in train_loader:
            loss = model.training_step(batch)
            train_losses.append(loss)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        result['train_loss'] = torch.stack(train_losses).mean().item()
        model.epoch_end(epoch, result)
        history.append(result)
    return history

Evaluating our model:

evaluate(model, val_dl)

Out:

{'val_loss': 3.2676336765289307, 'val_acc': 0.04229172319173813}

Now, to improve this accuracy, we train for 10 epochs. The variables num_epochs and opt_func were not defined earlier in the post, so here is one plausible setup (Adam is my assumption; the original optimizer choice isn't shown):

num_epochs = 10
opt_func = torch.optim.Adam  # assumed; the original notebook's choice isn't shown
history = fit(num_epochs, 0.001, model, train_dl, val_dl, opt_func)

Epoch [0], train_loss: 0.8871, val_loss: 0.0532, val_acc: 0.9804
Epoch [1], train_loss: 0.0219, val_loss: 0.0454, val_acc: 0.9852
Epoch [2], train_loss: 0.0154, val_loss: 0.0004, val_acc: 1.0000
Epoch [3], train_loss: 0.0001, val_loss: 0.0001, val_acc: 1.0000
Epoch [4], train_loss: 0.0000, val_loss: 0.0001, val_acc: 1.0000
Epoch [5], train_loss: 0.0000, val_loss: 0.0001, val_acc: 1.0000
Epoch [6], train_loss: 0.0000, val_loss: 0.0001, val_acc: 1.0000
Epoch [7], train_loss: 0.0000, val_loss: 0.0001, val_acc: 1.0000
Epoch [8], train_loss: 0.0000, val_loss: 0.0000, val_acc: 1.0000
Epoch [9], train_loss: 0.0000, val_loss: 0.0000, val_acc: 1.0000

Testing with test data

result = evaluate(model, test_dl)
result

Out:

{'val_loss': 0.36439788341522217, 'val_acc': 0.9439418911933899}

We are now coming to the end of the post. Before wrapping up, we will predict some of our test images and compare the results with their original labels. For this, we define a function:

def predict_image(img, model):
    # Convert to a batch of 1
    xb = to_device(img.unsqueeze(0), device)
    # Get predictions from model
    yb = model(xb)
    # Pick index with highest probability
    _, preds = torch.max(yb, dim=1)
    # Retrieve the class label
    return preds[0].item()
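As a quick sanity check, something like the following (a hypothetical usage; any test index works) compares a single prediction against its ground-truth label:

# Compare one prediction with the true label (index 0 chosen arbitrarily)
img, label = test_ds[0]
print('Label:', label.item(), ', Predicted:', predict_image(img, model))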

Thank you!

Entire notebook link:- https://jovian.ml/sachinsom507/final-project-sign-language-prediction
