The Dog Pooping in My Yard Transformer Detectron PyTorch Tutorial
Part 1
The Problem
The problem is that some owners do not pick up after their dogs. This is not a problem isolated to where I live, in the Bay Area, but one found the world over. To combat it, we are going to develop the Dog Pooping in My Yard Transformer Detectron.
This is a 3-part tutorial wherein:
- Part 1 gets something working using Transformers via the timm library
- Part 2 unpacks what is going on under the hood with a code-first approach
- Part 3 puts this into production so we can catch dogs pooping in my yard in the wild
Here is a link to Google Colab and Github.
This tutorial assumes a working knowledge of PyTorch; the series will focus on learning Transformers.
The Data
Fortunately for us, we can simply `wget` and `unzip` some images I scraped. The zip has two folders, one called `poop` and one called `no-poop`. There are 19 pooping images and 17 not-pooping. We'll save 4 for validation, which leaves us 32 images for training.
!wget https://github.com/matthewchung74/blogs/raw/main/data.zip
!unzip -qq -n data.zip -d data
Now you may be wondering: how can we train on just 32 images? If we need more images, I'm in trouble, since I'm tired of downloading pictures of dogs pooping. Well, lucky for us, the answer is below; just look for pretraining in the TIMM section.
Let’s take a look at the data. We’re going to create a dataset, split, do some transforms, and print out a batch.
# skipping a bunch for brevity; refer to the Colab above for the complete source
from PIL import Image
from torch.utils.data import Dataset

class PoopDataset(Dataset):
    def __init__(self, file_list, labels, transform=None):
        self.file_list = file_list
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        img_path = self.file_list[idx]
        # convert to RGB so grayscale or RGBA images don't break the transforms
        img = Image.open(img_path).convert("RGB")
        # guard against transform=None, which the default allows
        if self.transform is not None:
            img = self.transform(img)
        label = self.labels[idx]
        return img, label
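The split, transforms, and data loaders are part of what was skipped for brevity. To keep the walkthrough self-contained, here is a minimal sketch of what that setup could look like; the glob paths, batch size, and stratified split are my assumptions, so check the Colab for the actual code.

import glob
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from torchvision import transforms

# label 1 = poop, 0 = no poop; folder layout matches the unzipped data above
files = sorted(glob.glob("data/poop/*") + glob.glob("data/no-poop/*"))
labels = [0 if "no-poop" in f else 1 for f in files]

# hold out 4 images for validation, matching the counts above
train_files, valid_files, train_labels, valid_labels = train_test_split(
    files, labels, test_size=4, stratify=labels, random_state=42)

# vit_base_patch16_224 expects 224x224 inputs
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_data = PoopDataset(train_files, train_labels, transform=train_transforms)
valid_data = PoopDataset(valid_files, valid_labels, transform=train_transforms)
train_loader = DataLoader(train_data, batch_size=4, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=4, shuffle=False)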
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
images, labels = next(iter(train_loader))
for idx, ax in enumerate(axes.ravel()):
    img = transforms.ToPILImage()(images[idx])
    title = "poop" if labels[idx] == 1 else "no poop"
    ax.set_title(title)
    ax.imshow(img)
Aren’t Transformers for NLP?
Nope. The paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale changed all that. We will get more into the details in future parts. This tutorial is all about quickly getting a pooping-dog detector working, an approach you can then apply to your own problem, such as cats pooping in your yard!
What is TIMM?
timm is the open-source library we're going to use to get up and running. If you have time, check out Ross Wightman's repo; it is amazing. In a nutshell, it is a library of SOTA architectures with pre-trained weights.
- First, let's `pip` install it.
- Then let's list all the models that have pre-trained weights, using the method `list_models` and matching the wildcard `vit`, which stands for Vision Transformer. We'll pick the one we're interested in: `vit_base_patch16_224`.
- Let's look at the name of the model for a second. `patch16` means the model works on 16x16-pixel patches: to save on resources, the algorithm divides the image into 16x16 squares and processes them separately, which for a 224x224 input works out to (224/16)^2 = 196 patches. We'll dive more into this later. `224` is the size of the image we're training on; if you look at the GitHub notebook, you'll see the images are resized to 224.
- Let's print out a summary.
!pip install timm

import timm
from pprint import pprint

model_names = timm.list_models('*vit*', pretrained=True)
pprint(model_names)

model = timm.create_model('vit_base_patch16_224', pretrained=True)
model = model.to(device)
pprint(model)

output:
(head): Linear(in_features=768, out_features=1000, bias=True)
Now, this summary above is large, and we'll cover what it all means later. What is important now is the last line, `(head)`. This is the final layer, which classifies into ImageNet's 1000 classes. We only have 2: pooping dog or not. So let's change that.
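One simple way to swap the head, sketched below: timm's `create_model` accepts a `num_classes` argument and will build the model with a fresh 2-way `Linear` classifier in place of the 1000-way one (the Colab may wire this up differently).

# rebuild the model with a 2-class head: poop vs. no-poop
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=2)
model = model.to(device)

# the head is now Linear(in_features=768, out_features=2, bias=True)
print(model.head)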
Standard training stuff
Now we're going to set up our `criterion`, `optimizer`, and `scheduler` like any other PyTorch project; we're just swapping one model out for another. Then we run a standard training loop.
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR
from tqdm import tqdm

# lr, gamma, epochs, and device come from the setup code skipped above
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = StepLR(optimizer, step_size=1, gamma=gamma)

for epoch in range(epochs):
    model.train()  # enable dropout for training
    epoch_loss = 0
    epoch_accuracy = 0
    for data, label in tqdm(train_loader):
        data = data.to(device)
        label = label.to(dtype=torch.long, device=device)

        output = model(data)
        loss = criterion(output, label)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        acc = (output.argmax(dim=1) == label).float().mean()
        epoch_accuracy += acc.item() / len(train_loader)
        epoch_loss += loss.item() / len(train_loader)

    model.eval()  # disable dropout for validation
    with torch.no_grad():
        epoch_val_accuracy = 0
        epoch_val_loss = 0
        for data, label in valid_loader:
            data = data.to(device)
            label = label.to(dtype=torch.long, device=device)

            val_output = model(data)
            val_loss = criterion(val_output, label)

            acc = (val_output.argmax(dim=1) == label).float().mean()
            epoch_val_accuracy += acc.item() / len(valid_loader)
            epoch_val_loss += val_loss.item() / len(valid_loader)

    scheduler.step()  # decay the learning rate once per epoch
    print(
        f"Epoch {epoch + 1} - loss: {epoch_loss:.4f} - acc: {epoch_accuracy:.4f} "
        f"- val_loss: {epoch_val_loss:.4f} - val_acc: {epoch_val_accuracy:.4f}"
    )
The results
You'll see that after 3 epochs, we're already getting pretty good accuracy.
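If you want to sanity-check the trained model on a single image, a minimal inference sketch looks like this; the image path is a placeholder, and `train_transforms` is the transform pipeline from the dataset sketch above.

from PIL import Image
import torch

model.eval()
# placeholder path; point this at any image you want to classify
img = Image.open("data/poop/some_image.jpg").convert("RGB")
batch = train_transforms(img).unsqueeze(0).to(device)  # shape: [1, 3, 224, 224]

with torch.no_grad():
    pred = model(batch).argmax(dim=1).item()

print("poop" if pred == 1 else "no poop")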
Check out part 2 to see how this works under the hood with a code-first approach.