Enhancing Security Measures through Clothes Detection

Joseph Assaker
Zaka
Published Dec 24, 2020 · 8 min read

A tragic event has just occurred and you have witnesses describing the culprit. You’re panicking, and you ask your team to gather all security footage between 1:00 and 2:00 PM, analyze it, and search for a male wearing a blue t-shirt, black pants and sunglasses. Your team rushes to the task, analyzing a dozen one-hour recordings from various camera angles. After several hours, your team comes back with approximate results, and by then it might already be too late to catch the culprit.

Now, imagine that all of this was already being processed in real time, at all times. What if you already had a rich set of data describing each individual person at any given moment? Answering such questions would then take only milliseconds. What if you owned a fashion store and wanted to analyze the clothes of all the customers leaving without buying anything? Or what if you could filter your own clothes by color or by type and visualize them on yourself, simply by querying the data collected from your own home security cameras?

Here at Zaka, we dropped the “what ifs” and tackled this challenge head on! In this blog post, we will take a closer look at one piece of this complex and fascinating challenge: the clothes detection module!

Outline:

  1. The big picture
  2. Data and Labelling
  3. Building and Training the Model
  4. Results!
  5. What’s Next?

1. The Big Picture

To start off, building such a complex system requires a tight interplay between many moving parts, each contributing to the whole.

Firstly, we need modules that work on the whole image. For example, in order to start extracting all the interesting information, we need a person detector module to detect people in a given image, or video frame. On top of this first module, we also need a robust tracker to follow individuals throughout the video across various timesteps, or frames.

Secondly, we need to build modules that act on the detected people. For example, we will have a gender classification model whose role, given a detected person from the video frame, is to identify that person’s gender. A much more complex example is a clothes detection model that, given a detected person from the video frame, detects and identifies the various clothing types and colors that this person is wearing. Logging all of this data and storing it in a highly structured, easily accessible database is yet another component of the system.
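To make the interplay concrete, here is a minimal sketch of how whole-frame modules (detection, tracking) could hand off to per-person modules (gender, clothes, etc.). Every function and structure name here is illustrative, not Zaka’s actual implementation:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PersonRecord:
    track_id: int
    box: Tuple[int, int, int, int]  # (top, left, bottom, right)
    attributes: Dict[str, str]      # e.g. {"gender": ..., "clothes": ...}


def process_frame(frame, detect_people, track, classifiers):
    """Run whole-frame modules first, then per-person modules on each crop."""
    records: List[PersonRecord] = []
    boxes = detect_people(frame)            # one box per detected person
    for track_id, (top, left, bottom, right) in track(boxes):
        # Crop the person out of the frame (frame as a 2-D grid of pixels).
        crop = [row[left:right] for row in frame[top:bottom]]
        # Each per-person model labels the crop independently.
        attrs = {name: model(crop) for name, model in classifiers.items()}
        records.append(PersonRecord(track_id, (top, left, bottom, right), attrs))
    return records
```

The records produced per frame are exactly what would be logged into the structured database mentioned above.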

Each of these components deserves its own blog post, and in this one, we’ll be focusing on the clothes detection module!

2. Data and Labelling

Fashion-related datasets have always been abundant, from the Fashion MNIST dataset to the huge sea of web images filled with photos of people posing in various fashion items. Nevertheless, the lack of proper detection datasets has been a real challenge for us: we don’t just want to feed a cropped image of an item to a model and have it guess its type, or class. Our goal is a model that takes an image as input and detects the various fashion objects in it, outputting each object’s class and its location in the image.

In order to build such a model, we had to manually build and label our custom dataset. The images we settled on in the end were mostly drawn from the StreetStyle27K dataset. This dataset was impressively diverse and representative: it included a plethora of clothes types and posing scenarios, and a great balance of people’s gender, age and ethnicity. A caveat to the StreetStyle27K dataset was that its labels were not very useful for us, as they included only a single annotation per image, even when the image contained multiple people. On top of that, these single annotations were oddly centered around the people’s heads, with plenty of empty space included in the bounding box.

It is for this reason that we took on the challenge of labelling the images ourselves, and we did that pretty cleverly. After several discussions and iterations, we settled on the following clothes types, or classes:

  • T-Shirt (short sleeve)
  • Sweater (long sleeve)
  • Tank Top (no sleeve)
  • Shirt (formal shirt)
  • Suit (formal jacket/blazer)
  • Outerwear (jacket/coat/etc.)
  • Dress
  • Pants (formal pants/jeans/etc.)
  • Shorts
  • Skirt
  • Hat (beanie/cap/etc.)
  • Hijab
  • Abaya
  • Scarf
  • Glasses *
  • Sunglasses *

* Glasses and Sunglasses have no colors associated with them.

And on the following color classes:

  • Black
  • White
  • Grey
  • Blue
  • Yellow
  • Green
  • Red
  • Violet (Purple)
  • Orange
  • Pink
  • Brown
  • Beige
  • Multicolor*

* Multicolor is anything with an unclear primary color. Even if a shirt is black and white, we label it as multicolor, as one person could say it’s white, and another could argue that it’s black.

The final set of labels was composed of combinations of clothing and color classes, in the following format: {clothing_class}_{color_class}.

This led to a total of 185 classes! The goal was to make the dataset as versatile as possible, both in how we train on it and in how we could repurpose it for other detection tasks. Let’s take a look at how we utilized the data in the next section!
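Because the label format is purely mechanical, the combined label set can be generated from the two lists above with a few lines of code. The snippet below is an illustrative sketch; the machine-readable class names are assumptions, not the project’s actual schema:

```python
from itertools import product

# Clothing classes that carry a color (assumed identifiers).
clothing = ["tshirt", "sweater", "tank_top", "shirt", "suit", "outerwear",
            "dress", "pants", "shorts", "skirt", "hat", "hijab", "abaya",
            "scarf"]
# Glasses and sunglasses have no color associated with them.
colorless = ["glasses", "sunglasses"]
colors = ["black", "white", "grey", "blue", "yellow", "green", "red",
          "violet", "orange", "pink", "brown", "beige", "multicolor"]

# {clothing_class}_{color_class} for every color-bearing clothing type,
# plus the two colorless classes as-is.
labels = [f"{c}_{col}" for c, col in product(clothing, colors)] + colorless
```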

3. Building and Training the Model

Before training our model, we had to decide how to feed in our data. Thanks to our rich annotation scheme, we had plenty of options to choose from at this stage! For example, we could feed in the data as-is and try to detect all 185 distinct classes. While this is definitely on our checklist, it would require a really (like, really) huge dataset in order to produce promising results.

So, while working on building this huge dataset, we decided to test our model on a subset of 1,000 labelled images, containing a total of more than 8,000 objects. For this test, we merged our labels by clothing type: for instance, “pants_black”, “pants_blue”, …, “pants_multicolor” were all treated as the same “pants” object. This produced a fairly balanced dataset, because in the original set of classes, labels such as “pants_blue” or “shirt_white” were much more common than labels such as “pants_multicolor” or “shirt_yellow”. All in all, this merging step left a total of only 14 classes.

Keep in mind that having such a versatile labelling schema allows us, further down the road, to try various mergings and combinations of labels. We could, for instance, merge all the “upperwear” classes into a single class and the “bottomwear” classes into another. Or we might combine the classes by their color types (we actually did this, and got some really interesting results to showcase!).
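Since the color is encoded as a suffix in every combined label, this kind of merging reduces to a string split. A small sketch, with assumed class identifiers:

```python
def merge_label(label, by="type"):
    """Collapse a combined "{type}_{color}" label to just its type or color.

    Colorless classes (glasses, sunglasses) carry no color suffix; they
    are returned unchanged when merging by type, and have no color to
    merge by (None).
    """
    if "_" not in label:                       # glasses / sunglasses
        return label if by == "type" else None
    # rsplit keeps multi-word types like "tank_top" intact.
    clothing_type, color = label.rsplit("_", 1)
    return clothing_type if by == "type" else color
```

Merging by type turns “pants_black” and “pants_blue” into one “pants” class; merging by color turns “pants_blue” and “shirt_blue” into one “blue” class.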

Now that our data is ready, let’s tackle building and training the model!

For this model, we decided to start with transfer learning, using a ResNet34 backbone architecture coupled with Faster R-CNN. This combination turned out to offer the best trade-off between speed and accuracy for the current subset of data. After defining several hyper-parameters and adding a couple of subtle data augmentations (e.g., image rotation and contrast shifting), we trained the model for 72 epochs, which took around 8 hours to complete.

The numerical results were acceptable, settling at around 50–60 AP (Average Precision) for most classes. However, pure numbers often don’t tell the whole story, especially in object detection tasks. So we took a look at some inferences produced on the testing dataset, and the results were… truly impressive!
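For context, AP in object detection hinges on matching predicted boxes to ground-truth boxes by Intersection over Union (IoU): a prediction typically counts as correct only when its IoU with a same-class ground-truth box clears a threshold such as 0.5. A minimal sketch of that core quantity:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes.

    Returns a value in [0, 1]; 1.0 means the boxes coincide exactly,
    0.0 means they do not overlap at all.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```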

4. Results!

So without further ado, here are some of the testing results!

Model detection example - 1: Various people of multiple scales in the street.

This image is particularly impressive! We can see the model detecting clothes objects on both large- and small-scale people! Notice the misclassified shoes in the lower part of the image, confused with a small-scale pants object. Nevertheless, the confidence of this detection is around 63%, so it can easily be disregarded. And while there are a couple of missed instances, remember that this is the alpha version of the model, trained on only 1,000 images! We really can’t wait to see the results we get by training a model on tens of thousands of images!
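Low-confidence mistakes like this one can be dropped with a simple score threshold at inference time. A minimal sketch, where the 0.7 cutoff and the tuple layout are illustrative choices:

```python
def filter_detections(detections, min_score=0.7):
    """Keep only detections whose confidence clears the threshold.

    `detections` is a list of (label, score, box) tuples. A mistake
    predicted at ~0.63 confidence, like the shoes-as-pants confusion
    above, is discarded at the default 0.7 cutoff.
    """
    return [d for d in detections if d[1] >= min_score]
```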

Model detection example - 2: Three guys in front of a store.

In this image, we can see a close-to-perfect detection! The only exceptions are the sunglasses on the rightmost person, which are almost hidden by his hair, and the shorts on the man to his left, which are confused with a skirt due to their odd-looking texture.

Model detection example - 3: A woman wearing sports outerwear at the gym.

Moving on to this inference, it yields pretty interesting results! Since this outerwear object is not visually connected through contiguous pixels, the model effectively segmented it into three distinct parts: first the hat, then the two separate left and right sections of the outerwear. This inference made us re-evaluate what counts as a “good” detection, and it reinforces the point made earlier: pure numbers really don’t tell the whole story!

And here are a couple more inferences produced by our model!

Model detection example - 4: Two people posing in front of a wall.
Model detection example - 5: A family posing in a public outdoor park.

Finally, take a look at some of the results we’ve got when merging the classes according to their color! The model’s new role here would be to detect “any type of clothes that are red”, or “any type of clothes that are white”, etc.

Model detection example - 6: Three people on a table.
Model detection example - 7: A guy posing in front of a wall in cold weather.
Model detection example - 8: A large group of people posing in a group photo, mostly wearing green.

5. What’s Next?

Having built such an impressive proof of concept with merely 1,000 images, our expectations for the future versions of this model are set pretty high!

In the foreseeable future, we expect to gather enough data to feed the full label set directly into the model, which should make detecting all 185 distinct classes possible!

Don’t forget to support with a clap!

Do you have a cool project that you need to implement? Reach out and let us know.

To discover Zaka, visit www.zaka.ai

Subscribe to our newsletter and follow us on our social media accounts to stay up to date with our news and activities:

LinkedIn | Instagram | Facebook | Twitter | Medium
