How to Train a Custom Keypoint Detection Model with PyTorch

Tutorial on how to fine-tune Keypoint RCNN

Alex P
8 min read · Nov 14, 2021

By default, PyTorch provides a Keypoint RCNN model which is pre-trained to detect 17 keypoints of the human body (nose, eyes, ears, shoulders, elbows, wrists, hips, knees and ankles).

The keypoints in the picture below were predicted by this model:

Photo by Jean van der Meulen from Pexels

I will demonstrate how to fine-tune the above-mentioned model using a custom dataset. For this purpose I’ve created a dataset of images with glue tubes and assigned two keypoints to each glue tube (head and tail).

Images and annotations (custom dataset)

The dataset includes 111 training and 23 test images. Each image has one or two objects (glue tubes).

Annotations for each image include:

  • coordinates of bounding boxes (each object has a bounding box, which is described with top left and bottom right corners in [x1, y1, x2, y2] format);
  • coordinates and visibility of keypoints (each object has 2 keypoints, which are described in [x, y, visibility] format).

All keypoints are visible (i.e. visibility=1) in this dataset. The 1st keypoint is the head, the 2nd one is the tail.

You can download the dataset here.

Take a look at several random images from the dataset and a random image with visualized annotations:

Random images from the dataset
Random image with visualized annotations

Pycocotools library adjustment

During the training process we will evaluate some metrics of our model. This is done with the help of the pycocotools library. Go ahead and install it with the pip install pycocotools command.

To evaluate how precisely the predicted keypoints match the ground truth keypoints, pycocotools uses the COCOeval class, which by default is tuned to evaluate the 17 keypoints of the human body. If we want to evaluate a custom set of keypoints (in our case only 2 keypoints), we need to change the predefined array of coefficients kpt_oks_sigmas in that class.

To do that, open the pycocotools/cocoeval.py file and change the line

self.kpt_oks_sigmas = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72, .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

to

self.kpt_oks_sigmas = np.array([.5, .5]) / 10.0

For example, in Google Colab this file can be found via the following path: /usr/local/lib/python3.7/dist-packages/pycocotools/cocoeval.py

You can read the description of the keypoint evaluation metrics, object keypoint similarity (OKS), and the OKS coefficients here.

Update: as Diogo Santiago suggested, instead of editing the cocoeval.py file inside the pycocotools library, you can change kpt_oks_sigmas in the local coco_eval.py file:

# self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
coco_eval = COCOeval(coco_gt, iouType=iou_type)
coco_eval.params.kpt_oks_sigmas = np.array([.5, .5]) / 10.0
self.coco_eval[iou_type] = coco_eval

1. Imports

Create a new notebook in Jupyter Notebook. First, we need to import the necessary modules:
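A likely set of imports for this tutorial (a sketch; the exact list depends on which of the snippets below you run):

import os, json

import cv2
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch.utils.data import Dataset, DataLoader

import torchvision
from torchvision.models.detection.rpn import AnchorGenerator

import albumentations as A  # augmentation library used later in the tutorial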

Next, download the coco_eval.py, coco_utils.py, engine.py, group_by_aspect_ratio.py, presets.py, train.py, transforms.py and utils.py files from this repository and place them in the folder with your notebook.

Import these modules as well:
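For example, the helpers used later in this tutorial can be imported like this (a sketch; only these two imports are actually needed below):

import utils                                  # provides collate_fn for the data loaders
from engine import train_one_epoch, evaluate  # training and evaluation loops from the reference scripts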

2. Augmentations

Here we will define a function with augmentations for the training process. This function will apply different transforms to the images before each training iteration. Such transforms may include a random change of brightness and contrast, or a rotation of the image by 90 degrees a random number of times.

Thus, we essentially "create new images", which differ in some ways from the original ones but are still perfectly suitable for training our model.

We will use albumentations library for augmentations.
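A minimal sketch of such a function, using the two transforms mentioned above (parameter values are illustrative):

import albumentations as A

def train_transform():
    return A.Compose(
        [
            # Rotate the whole sample by 90 degrees a random number of times
            A.RandomRotate90(p=1),
            # Randomly change brightness and contrast
            A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=1),
        ],
        # Keypoints are passed as plain [x, y] pairs (see the dataset class below)
        keypoint_params=A.KeypointParams(format="xy"),
        # Bounding boxes are in [x1, y1, x2, y2] (pascal_voc) format
        bbox_params=A.BboxParams(format="pascal_voc", label_fields=["bboxes_labels"]),
    )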

3. Dataset class

The dataset should inherit from the standard torch.utils.data.Dataset class, and __getitem__ should return images and targets.

Below is the description of the parameters for the targets:

  • boxes (FloatTensor[N, 4]): the coordinates of the N bounding boxes in [x0, y0, x1, y1] format.
  • labels (Int64Tensor[N]): the label for each bounding box. 0 always represents the background class.
  • image_id (Int64Tensor[1]): an image identifier.
  • area (Tensor[N]): the area of the bounding box.
  • iscrowd (UInt8Tensor[N]): instances with iscrowd=True will be ignored during evaluation.
  • keypoints (FloatTensor[N, K, 3]): for each one of the N objects, it contains the K keypoints in [x, y, visibility] format, defining the object. visibility=0 means that the keypoint is not visible.

Let's define the dataset class:
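Below is a minimal sketch of such a class. It assumes that each image under <root>/images has a JSON annotation under <root>/annotations with "bboxes" and "keypoints" fields; the file layout, field names and helper logic are assumptions, not the exact code from the repository:

import os, json
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class ClassDataset(Dataset):
    def __init__(self, root, transform=None):
        self.root = root
        self.transform = transform
        self.imgs_files = sorted(os.listdir(os.path.join(root, "images")))
        self.annotations_files = sorted(os.listdir(os.path.join(root, "annotations")))

    def __getitem__(self, idx):
        img_path = os.path.join(self.root, "images", self.imgs_files[idx])
        ann_path = os.path.join(self.root, "annotations", self.annotations_files[idx])
        img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
        with open(ann_path) as f:
            data = json.load(f)
        bboxes = data["bboxes"]        # [[x1, y1, x2, y2], ...]
        keypoints = data["keypoints"]  # [[[x, y, visibility], [x, y, visibility]], ...]

        if self.transform:
            # albumentations expects a flat list of [x, y] keypoints
            kps_flat = [kp[:2] for obj in keypoints for kp in obj]
            transformed = self.transform(image=img,
                                         bboxes=bboxes,
                                         bboxes_labels=["Glue tube"] * len(bboxes),
                                         keypoints=kps_flat)
            img = transformed["image"]
            bboxes = [list(b) for b in transformed["bboxes"]]
            # Unflatten back to 2 keypoints per object and restore visibility
            # (all keypoints are visible in this dataset, hence the trailing 1)
            kps = np.array(transformed["keypoints"]).reshape(-1, 2, 2).tolist()
            keypoints = [[list(map(int, kp)) + [1] for kp in obj] for obj in kps]

        boxes = torch.as_tensor(bboxes, dtype=torch.float32)
        target = {
            "boxes": boxes,
            "labels": torch.ones(len(bboxes), dtype=torch.int64),  # all objects are glue tubes
            "image_id": torch.tensor([idx]),
            "area": (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]),
            "iscrowd": torch.zeros(len(bboxes), dtype=torch.int64),
            "keypoints": torch.as_tensor(keypoints, dtype=torch.float32),
        }
        img = torch.as_tensor(img, dtype=torch.float32).permute(2, 0, 1) / 255.0
        return img, target

    def __len__(self):
        return len(self.imgs_files)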

Here are additional explanations for the part of the dataset class where augmentations are applied (right after the if self.transform: line):

The description of Keypoint RCNN states that keypoints should be provided in [x, y, visibility] format.

If we want to apply augmentations to the image and its annotations using the albumentations library, we should use the [x, y] format. Besides that, the list of all keypoints should not be nested.

So, we need to modify keypoints in the initial list from [x, y, visibility] format to [x, y] format and flatten the list, then apply augmentations, and afterwards unflatten the list and modify keypoints back from [x, y] format to [x, y, visibility] format.

For example, if the image contains two objects, and keypoints are described with the list [[[392, 1247, 1], [152, 1055, 0]], [[530, 993, 1], [622, 660, 1]]]:

  • First, we modify the list to [[392, 1247], [152, 1055], [530, 993], [622, 660]], which is ok for albumentations API.
  • Next, after we apply albumentations augmentations, we get a list of transformed keypoints [[672, 392], [864, 152], [926, 530], [1259, 622]].
  • Finally, we modify the list of transformed keypoints back to [[[672, 392, 1], [864, 152, 0]], [[926, 530, 1], [1259, 622, 1]]], which is ok for Keypoint RCNN API.

4. Visualizing a random item from dataset

Here we will look at an example of original and transformed targets:
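A minimal sketch of how such a comparison could be produced, reusing the ClassDataset and train_transform sketches from above (the folder path and sample index are assumptions):

# Build two datasets over the same images: one without augmentations, one with them
dataset_original = ClassDataset("glue_tubes_keypoints_dataset/train", transform=None)
dataset_transformed = ClassDataset("glue_tubes_keypoints_dataset/train", transform=train_transform())

idx = 15  # an arbitrary sample index
print("Original targets:\n", (dataset_original[idx][1],))
print("Transformed targets:\n", (dataset_transformed[idx][1],))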

Output:

Original targets:
({'boxes': tensor([[296., 116., 436., 448.],
[577., 589., 925., 751.]]),
'labels': tensor([1, 1]),
'image_id': tensor([15]),
'area': tensor([46480., 56376.]),
'iscrowd': tensor([0, 0]),
'keypoints': tensor([[[408., 407., 1.],
[332., 138., 1.]],
[[886., 616., 1.],
[600., 708., 1.]]])},
)
Transformed targets:
({'boxes': tensor([[ 116., 1484., 448., 1624.],
[ 589., 995., 751., 1343.]]),
'labels': tensor([1, 1]),
'image_id': tensor([15]),
'area': tensor([46480., 56376.]),
'iscrowd': tensor([0, 0]),
'keypoints': tensor([[[4.0700e+02, 1.5110e+03, 1.0000e+00],
[1.3800e+02, 1.5870e+03, 1.0000e+00]],
[[6.1600e+02, 1.0330e+03, 1.0000e+00],
[7.0800e+02, 1.3190e+03, 1.0000e+00]]])},
)

Here we will look at an example of original and transformed images:
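A minimal sketch of a drawing helper for this step (OpenCV and matplotlib are assumptions about the tooling; colors and sizes are arbitrary):

import cv2
import matplotlib.pyplot as plt

def visualize(image, bboxes, keypoints):
    """Draw bounding boxes and keypoints on an image tensor of shape (C, H, W) scaled to [0, 1]."""
    img = (image.permute(1, 2, 0).numpy() * 255).astype("uint8").copy()
    for bbox in bboxes:
        x1, y1, x2, y2 = map(int, bbox)
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
    for obj_kps in keypoints:
        for kp in obj_kps:
            cv2.circle(img, (int(kp[0]), int(kp[1])), 5, (255, 0, 0), 10)
    plt.figure(figsize=(12, 12))
    plt.imshow(img)
    plt.show()

# Show the transformed sample from the previous step
img_t, target_t = dataset_transformed[idx]
visualize(img_t, target_t["boxes"], target_t["keypoints"])

# And the original (untransformed) one for comparison
img_o, target_o = dataset_original[idx]
visualize(img_o, target_o["boxes"], target_o["keypoints"])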

Output:

5. Training

Here we define a function which returns the Keypoint RCNN model:
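A sketch of such a function with the extended anchor parameters discussed below (the weights_path argument and the exact pretrained flags are assumptions):

import torch
import torchvision
from torchvision.models.detection.rpn import AnchorGenerator

def get_model(num_keypoints, weights_path=None):
    # Extended anchor sizes and aspect ratios
    anchor_generator = AnchorGenerator(
        sizes=(32, 64, 128, 256, 512),
        aspect_ratios=(0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 4.0),
    )
    model = torchvision.models.detection.keypointrcnn_resnet50_fpn(
        pretrained=False,
        pretrained_backbone=True,      # keep the COCO-pretrained backbone
        num_keypoints=num_keypoints,   # 2 keypoints per glue tube
        num_classes=2,                 # background + glue tube
        rpn_anchor_generator=anchor_generator,
    )
    if weights_path:
        model.load_state_dict(torch.load(weights_path))
    return model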

By default, the AnchorGenerator class in PyTorch has 3 different sizes sizes=(128, 256, 512) and 3 different aspect ratios aspect_ratios=(0.5, 1.0, 2.0) (look here). I've extended those parameters to sizes=(32, 64, 128, 256, 512) and aspect_ratios=(0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 4.0).

Training loop:
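A minimal sketch of the loop (folder paths, optimizer settings and the number of epochs are assumptions); it uses train_one_epoch and evaluate from the reference scripts downloaded earlier:

import torch
from torch.utils.data import DataLoader
import utils
from engine import train_one_epoch, evaluate

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

dataset_train = ClassDataset("glue_tubes_keypoints_dataset/train", transform=train_transform())
dataset_test = ClassDataset("glue_tubes_keypoints_dataset/test", transform=None)

data_loader_train = DataLoader(dataset_train, batch_size=3, shuffle=True, collate_fn=utils.collate_fn)
data_loader_test = DataLoader(dataset_test, batch_size=1, shuffle=False, collate_fn=utils.collate_fn)

model = get_model(num_keypoints=2)
model.to(device)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.3)

num_epochs = 5
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, data_loader_train, device, epoch, print_freq=1000)
    lr_scheduler.step()
    evaluate(model, data_loader_test, device)

# Save the trained weights
torch.save(model.state_dict(), "keypointsrcnn_weights.pth")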

In the training loop I’ve used 3 images per batch. In this case about 10 GB of GPU VRAM are used, so the model can be trained using Google Colab.

I’ve got very good metrics already after the 5th epoch:

6. Visualizing model predictions

Now let's look at how the trained model predicts bounding boxes and keypoints for glue tubes on a random image from the test dataset:
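A sketch of the inference step, reusing the model and test data loader from the training loop above:

iterator = iter(data_loader_test)
images, targets = next(iterator)
images = [image.to(device) for image in images]

with torch.no_grad():
    model.eval()
    output = model(images)

print("Predictions:", output)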

Output:

Predictions:[{'boxes': tensor([[ 618.9335,  144.0377, 1111.2960,  529.3129],
[ 741.4827, 420.9630, 1244.8071, 930.4985],
[ 653.7405, 258.7889, 1018.7531, 509.9501],
[ 824.6623, 540.7152, 1170.4821, 886.6503],
[ 711.1497, 0.0000, 1134.0641, 1066.0247],
[ 708.5067, 177.0665, 1102.3306, 385.1994],
[ 657.0708, 398.0692, 987.9990, 498.4578],
[ 887.4133, 453.8322, 1184.2448, 727.9111],
[ 895.7014, 52.4423, 1106.8652, 1080.0000],
[ 545.8564, 318.9463, 1276.8043, 519.7277],
[ 732.6523, 0.0000, 891.0267, 918.9849],
[ 794.4460, 667.6695, 1091.6316, 861.5293],
[ 809.3927, 273.1192, 1037.3994, 915.0168],
[ 603.3748, 293.8343, 1473.1097, 860.4436],
[ 991.6447, 218.8240, 1144.5980, 924.2585],
[ 419.0262, 196.2676, 1204.9933, 679.9295],
[ 880.3656, 274.3975, 1166.3279, 863.6169],
[1006.1213, 478.2608, 1208.6801, 746.1869],
[ 390.1542, 234.1698, 1592.7747, 502.9070],
[ 433.5611, 472.5373, 1346.7277, 1010.1754],
[ 394.9036, 59.5816, 1268.1086, 491.0312]], device='cuda:0'),
'labels': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0'), 'scores': tensor([0.9955, 0.9911, 0.7638, 0.7525, 0.7217, 0.3831, 0.3320, 0.3311, 0.2415, 0.1709, 0.1700, 0.1456, 0.1174, 0.1086, 0.1041, 0.1025, 0.0758, 0.0608, 0.0604, 0.0582, 0.0510], device='cuda:0'), 'keypoints': tensor([[[6.6284e+02, 4.6822e+02, 1.0000e+00],
[1.0645e+03, 2.0082e+02, 1.0000e+00]],

[[1.1794e+03, 4.8645e+02, 1.0000e+00],
[8.3855e+02, 8.4773e+02, 1.0000e+00]],

[[6.6883e+02, 4.6905e+02, 1.0000e+00],
[6.5446e+02, 4.7048e+02, 1.0000e+00]],

[[8.2538e+02, 8.4989e+02, 1.0000e+00],
[8.4260e+02, 8.4557e+02, 1.0000e+00]],

[[1.1333e+03, 2.0672e+02, 1.0000e+00],
[8.3846e+02, 8.5642e+02, 1.0000e+00]],

[[1.0571e+03, 1.7778e+02, 1.0000e+00],
[1.0628e+03, 2.0219e+02, 1.0000e+00]],

[[6.7074e+02, 4.6476e+02, 1.0000e+00],
[6.5779e+02, 4.9774e+02, 1.0000e+00]],

[[1.1721e+03, 4.9329e+02, 1.0000e+00],
[1.1835e+03, 4.9329e+02, 1.0000e+00]],

[[1.1061e+03, 2.1457e+02, 1.0000e+00],
[1.0573e+03, 2.0160e+02, 1.0000e+00]],

[[6.6456e+02, 4.6882e+02, 1.0000e+00],
[6.6312e+02, 4.7025e+02, 1.0000e+00]],

[[8.9031e+02, 9.1682e+02, 1.0000e+00],
[8.4279e+02, 8.5057e+02, 1.0000e+00]],

[[7.9516e+02, 8.6081e+02, 1.0000e+00],
[8.3823e+02, 8.4358e+02, 1.0000e+00]],

[[8.1011e+02, 8.4521e+02, 1.0000e+00],
[8.4166e+02, 8.4809e+02, 1.0000e+00]],

[[6.6745e+02, 4.6612e+02, 1.0000e+00],
[8.3017e+02, 8.5828e+02, 1.0000e+00]],

[[1.1439e+03, 4.9884e+02, 1.0000e+00],
[1.0696e+03, 2.2098e+02, 1.0000e+00]],

[[6.6590e+02, 4.6905e+02, 1.0000e+00],
[1.0632e+03, 1.9699e+02, 1.0000e+00]],

[[1.1656e+03, 4.9553e+02, 1.0000e+00],
[8.8108e+02, 8.6146e+02, 1.0000e+00]],

[[1.1749e+03, 4.9195e+02, 1.0000e+00],
[1.1749e+03, 4.7898e+02, 1.0000e+00]],

[[6.6741e+02, 4.6914e+02, 1.0000e+00],
[1.1859e+03, 5.0219e+02, 1.0000e+00]],

[[1.1804e+03, 4.7470e+02, 1.0000e+00],
[8.3901e+02, 8.4514e+02, 1.0000e+00]],

[[6.6463e+02, 4.9031e+02, 1.0000e+00],
[1.0646e+03, 1.9980e+02, 1.0000e+00]]], device='cuda:0'),
'keypoints_scores': tensor([[36.9580, 26.7403],
[31.9451, 28.6134],
[22.5176, -0.4728],
[ 7.7444, 21.3082],
[ 1.3215, 7.6223],
[ 2.0522, 22.6735],
[26.5938, -2.3956],
[19.8818, 2.7854],
[ 0.5259, 16.2155],
[39.5929, -0.1582],
[ 0.4924, 21.0935],
[ 0.5597, 19.3637],
[ 3.4223, 25.5078],
[17.6618, 0.4896],
[ 5.9306, -1.5709],
[27.4080, 2.4160],
[11.7086, -1.3879],
[26.0192, 3.0886],
[15.6420, -1.7428],
[ 7.1422, 10.9291],
[14.1688, 15.1565]], device='cuda:0')}]

Here we see a lot of predicted objects. We will choose only those with a high confidence score (for example, > 0.7). Then we will apply the Non-Maximum Suppression (NMS) procedure to select the most appropriate bounding boxes among the remaining ones.

Essentially, NMS keeps the boxes with the highest confidence scores (the best candidates) and removes other boxes that partially overlap the best candidates. To define the degree of this overlapping, we will set the threshold for Intersection over Union (IoU) equal to 0.3.

Read more about NMS implementation in PyTorch here.
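A sketch of this post-processing (the 0.7 score threshold and the 0.3 IoU threshold come from the text; variable names are assumptions):

import numpy as np
import torchvision

scores = output[0]["scores"].detach().cpu().numpy()
high_scores_idxs = np.where(scores > 0.7)[0].tolist()  # keep confident detections only

# Non-Maximum Suppression over the remaining boxes with an IoU threshold of 0.3
post_nms_idxs = torchvision.ops.nms(output[0]["boxes"][high_scores_idxs],
                                    output[0]["scores"][high_scores_idxs],
                                    0.3).cpu().numpy()

keypoints, bboxes = [], []
for kps in output[0]["keypoints"][high_scores_idxs][post_nms_idxs].detach().cpu().numpy():
    keypoints.append([list(map(int, kp[:2])) for kp in kps])
for bbox in output[0]["boxes"][high_scores_idxs][post_nms_idxs].detach().cpu().numpy():
    bboxes.append(list(map(int, bbox.tolist())))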

Let’s visualize predictions:
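For example, by reusing the visualize() helper sketched in section 4 on the filtered predictions:

visualize(images[0].cpu(), bboxes, keypoints)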

Output:

Predictions look good: bounding boxes are almost precise, keypoints are in the right places. It means the model is trained quite well.

In the same manner, you can train Keypoint RCNN on another dataset with any number of keypoints.

Here is a GitHub repository and notebook with all the steps described above.

UPDATE: You may also be interested in this article: "How to Annotate Keypoints Using Roboflow"

Other in-depth articles about computer vision:

  1. How to Train an Ensemble of Convolutional Neural Networks for Image Classification: tutorial on how to create an ensemble of DenseNet161, ResNet152 and VGG19 for classification of TinyImageNet.
  2. Tutorials on how to generate synthetic datasets for computer vision with the help of Python, OpenCV, NumPy and Albumentations:
