Exception Spotting | New Balance

Using Computer Vision to spot exceptions in a crowd

New Balance wanted to celebrate individuals expressing themselves in distinctive ways during New York’s Fashion Week to generate excitement for their Fresh Foam Cruz Nubuck shoes.

Together with VMLY&R and Tool Of North America, we created a Computer Vision and Machine Learning pipeline with custom algorithms to identify the most common fashion choices in Soho, New York, looking for people who were dressed differently from the crowd.

Besides developing the tracking software, we also handled the training of several neural networks and on-site support. From real-time person tracking to color detection, body part segmentation and pattern recognition in clothing, the challenges were plentiful …

Curious how we tackled some of these issues? Keep on reading!

Real-time Person Tracking

One of the first issues we had to deal with was finding a stable method to track people in a real-time video feed.

How many frames do we need to track somebody to make a solid prediction of where that person will be in the next frame? It turns out this depends on the type of classification/prediction we want to make. After some trial & error, we settled on a minimum and a maximum value: in order to make solid predictions, we need to track a person for 5 to 30 frames.

For the person detection and tracking we used the familiar Yolo algorithm, combined with an object tracker. Our go-to solution for this would be the standard OpenCV tracker, but we decided to try the Dlib Correlation Tracker instead. Because, why not, right?

Person tracking using Yolo and Dlib on an average of 6 frames — https://flic.kr/p/MYrgzP

When Yolo detects a person, it returns a bounding box, which we then hand to the correlation tracker. Yolo is thus only used once, to retrieve that initial bounding box; from that point on we only use the coordinates returned by the correlation tracker.
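
A rough Python sketch of that hand-off is shown below. The YOLO step is stubbed out as a hypothetical detect_person helper, and the camera index is an assumption; the dlib calls are the actual correlation tracker API.

    import cv2
    import dlib

    def detect_person(frame_rgb):
        """Placeholder for the YOLO detector: should return one person box
        as (x1, y1, x2, y2) in pixels, or None when nobody is found."""
        ...

    cap = cv2.VideoCapture(0)                 # live feed (camera index is an assumption)
    tracker = dlib.correlation_tracker()
    tracking = False

    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)   # dlib expects RGB

        if not tracking:
            box = detect_person(frame_rgb)    # Yolo is only used for this initial box
            if box is not None:
                x1, y1, x2, y2 = box
                tracker.start_track(frame_rgb, dlib.rectangle(x1, y1, x2, y2))
                tracking = True
        else:
            tracker.update(frame_rgb)         # from here on, only the tracker's coordinates
            pos = tracker.get_position()
            cv2.rectangle(frame_bgr,
                          (int(pos.left()), int(pos.top())),
                          (int(pos.right()), int(pos.bottom())),
                          (0, 255, 0), 2)

        cv2.imshow("tracking", frame_bgr)
        if cv2.waitKey(1) & 0xFF == 27:       # Esc to quit
            break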

On top of that, we used openFrameworks and FFmpeg to record frames from the live feed, creating a slow motion video and distributing 3 frame buffer objects (1080x1920) with video & dynamic typography (coming from the computer vision system). These were sent to Resolume using Spout and finally to the LED wall, all happening in near real-time …

Test capture with slow motion and output to multiple screens — https://flic.kr/p/PAotpy

Color Detection

On an intuitive level we all know color is important in fashion. Being an exception is very often a synonym for being dressed differently, so it was pretty obvious from the get-go that color was going to play a crucial role in this.

But computers don’t know color, do they?

Real-world lighting conditions influence color, and so does human bias. For instance, humans can agree on a surface being red even when it is in the shade or in the sun. For a computer this is something entirely different: computers don’t have that context and can only work with raw pixel values. On top of that, even we humans do not always agree on color. At what point does gray become black, or yellow become orange?

We tested 2 different approaches to tackle this problem:

A. Capturing the entire color spectrum
B. Approximating the color spectrum by dividing colors into n layers/steps

Color detection, first approach: converting RGB color space to HSV

With the first approach we converted the RGB color space to HSV, searched for the 4 most dominant colors and predicted where in the spectrum these colors reside. Although this worked, it was very error-prone and too “open for interpretation” … Back to the drawing board.

Our second approach was simpler: using color quantisation and mapping color values to a semantic scheme in a lookup table. Color values were mapped to color names using simple Euclidean distance, and this approach produced much better results.

Color detection, second approach: color quantisation and mapping
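
The production code did the quantisation in openFrameworks (ofxColorQuantizer), but the idea translates to a few lines of Python. The sketch below is illustrative: the colour-name table is a deliberately tiny stand-in for a full lookup such as codebrainz/color-names, and OpenCV’s k-means does the quantisation.

    import cv2
    import numpy as np

    # Tiny illustrative lookup; the real table (e.g. codebrainz/color-names) is much larger.
    COLOR_NAMES = {
        "black":  (0, 0, 0),
        "white":  (255, 255, 255),
        "grey":   (128, 128, 128),
        "red":    (200, 30, 30),
        "blue":   (40, 60, 190),
        "yellow": (230, 210, 40),
    }

    def quantize(image_bgr, k=4):
        """Reduce an image to its k most dominant colours using k-means."""
        pixels = image_bgr.reshape(-1, 3).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
        _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
        counts = np.bincount(labels.flatten(), minlength=k)
        return centers[np.argsort(counts)[::-1]].astype(int)   # most dominant first

    def nearest_name(bgr):
        """Map a BGR value to the closest named colour by Euclidean distance."""
        rgb = np.asarray(bgr[::-1], dtype=float)
        names = list(COLOR_NAMES)
        table = np.array([COLOR_NAMES[n] for n in names], dtype=float)
        return names[int(np.argmin(np.linalg.norm(table - rgb, axis=1)))]

    crop = cv2.imread("upper_body.jpg")   # e.g. an upper-body crop (file name is illustrative)
    for colour in quantize(crop):
        print(nearest_name(colour))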

Body Part Segmentation

A lot of people wear different clothing on their upper and/or lower body. So, if you want to track colors and patterns, you need some sort of process to divide the human body into upper and lower regions.

Some of the most difficult decisions we had to make had nothing to do with technology: do we really need segmentation? And if so, what would be the best way to achieve it? Do we segment body parts, clothing, or both?

Do we actually need body segmentation?

Initially we thought that we might not need segmentation at all, because in theory a model trained on the Open Images data set should be able to provide us with all the information we need: trousers vs. skirt, blouse vs. dress, …

So we ran some tests with default TensorFlow models, in this case: faster_rcnn_inception_resnet_v2_atrous_oid

Segmentation with TensorFlow: faster_rcnn_inception_resnet_v2_atrous_oid
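
For reference, running such a frozen model from the TensorFlow model zoo boiled down to something like the sketch below, written against the TensorFlow 1.x API that was current at the time; the file paths and confidence threshold are assumptions.

    import cv2
    import numpy as np
    import tensorflow as tf    # TensorFlow 1.x

    PATH = "faster_rcnn_inception_resnet_v2_atrous_oid/frozen_inference_graph.pb"

    # Load the frozen graph from the model zoo.
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(PATH, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")

    image_bgr = cv2.imread("street.jpg")                      # illustrative input
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

    with tf.Session(graph=graph) as sess:
        boxes, scores, classes = sess.run(
            ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
            feed_dict={"image_tensor:0": np.expand_dims(image_rgb, 0)})

    # Class ids map to Open Images labels (Person, Skirt, Trousers, ...).
    for box, score, cls in zip(boxes[0], scores[0], classes[0]):
        if score > 0.5:
            print(int(cls), float(score), box)   # box = [ymin, xmin, ymax, xmax], normalised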

Problems with this approach:

  • The frame rate was too low (2–5 fps)
  • The accuracy of the predictions on images in the wild was too low (one frame would spit out trousers or skirt, while the next frame only said person)

The way to deal with both problems would be to take categories from the Open Images data set (skirt, trousers, footwear, dress, …) and retrain the model on just these categories. Unfortunately, our timeline did not allow us to pursue this idea any further.

Other approaches to segmentation we tried were: A) training a neural network that tells the difference between upper, lower and full body clothing, and B) using pose estimation, also powered by a neural network. Because of time constraints and the lack of an accessible data set, we did not pursue masking.

Faster R-CNN object detection trained on data from the DeepFashion dataset — https://flic.kr/p/2bE17A5

For the first approach we tried training (transfer learning) a Faster R-CNN network with the TensorFlow Object Detection API, using the Attribute Prediction DeepFashion dataset.

After a few iterations of creating the TensorFlow Records, cleaning up the data and tuning hyperparameters, we were able to get some decent results. Unfortunately, the results achieved with this approach were not stable enough to be used in a real-time context.
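
For those unfamiliar with the format: each training image is serialised as a tf.train.Example with the bounding boxes normalised to [0, 1]. The sketch below shows the gist (file names, box coordinates and labels are illustrative, and TF 1.x APIs are used).

    import tensorflow as tf    # TensorFlow 1.x

    def _bytes(v):  return tf.train.Feature(bytes_list=tf.train.BytesList(value=v))
    def _floats(v): return tf.train.Feature(float_list=tf.train.FloatList(value=v))
    def _ints(v):   return tf.train.Feature(int64_list=tf.train.Int64List(value=v))

    def make_example(jpeg_bytes, width, height, boxes, labels, names):
        """boxes: (xmin, ymin, xmax, ymax) in pixels; labels: int class ids; names: bytes."""
        feature = {
            "image/encoded":            _bytes([jpeg_bytes]),
            "image/format":             _bytes([b"jpeg"]),
            "image/width":              _ints([width]),
            "image/height":             _ints([height]),
            "image/object/bbox/xmin":   _floats([b[0] / width  for b in boxes]),
            "image/object/bbox/ymin":   _floats([b[1] / height for b in boxes]),
            "image/object/bbox/xmax":   _floats([b[2] / width  for b in boxes]),
            "image/object/bbox/ymax":   _floats([b[3] / height for b in boxes]),
            "image/object/class/label": _ints(labels),
            "image/object/class/text":  _bytes(names),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))

    with tf.python_io.TFRecordWriter("deepfashion_train.record") as writer:
        with open("img_00000001.jpg", "rb") as f:
            example = make_example(f.read(), 300, 300,
                                   boxes=[(72, 79, 232, 273)],
                                   labels=[1], names=[b"upper_body"])
        writer.write(example.SerializeToString())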

Faster R-CNN object detection: misclassified or incomplete predictions

However, with some more time to fine-tune the hyperparameters and clean up the data set, we feel this approach could work very well in terms of both speed and accuracy.

The last approach we tried, and eventually settled on, was working with a pose detection system using the MPII Human Pose Models trained on the MPII Human Pose Dataset.

Creating upper/lower body bounding boxes based on pose detection — https://flic.kr/p/MYrthM
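
The splitting step itself is straightforward once you have keypoints. A minimal sketch, assuming a pose estimator has already returned joint coordinates (the joint names and padding margin below are illustrative):

    import numpy as np

    def split_body(keypoints, margin=0.15):
        """keypoints: dict of joint name -> (x, y) pixel coordinates from a pose
        estimator. Returns (upper_box, lower_box) as (x1, y1, x2, y2) tuples."""
        def box(points):
            pts = np.array(points, dtype=float)
            x1, y1 = pts.min(axis=0)
            x2, y2 = pts.max(axis=0)
            w, h = x2 - x1, y2 - y1
            # Pad a little so the crop covers the clothing, not just the joints.
            return (int(x1 - margin * w), int(y1 - margin * h),
                    int(x2 + margin * w), int(y2 + margin * h))

        upper = box([keypoints[j] for j in
                     ("head_top", "left_shoulder", "right_shoulder", "left_hip", "right_hip")])
        lower = box([keypoints[j] for j in
                     ("left_hip", "right_hip", "left_ankle", "right_ankle")])
        return upper, lower

    # Example with made-up coordinates:
    kp = {"head_top": (110, 40), "left_shoulder": (80, 100), "right_shoulder": (140, 100),
          "left_hip": (90, 220), "right_hip": (130, 220),
          "left_ankle": (95, 400), "right_ankle": (135, 400)}
    print(split_body(kp))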

Pattern Recognition in Clothing

The success of any classification task lies in the availability of a good data set, with enough quality samples (both in imagery and in labeling categories). Finding good data sets in the wild is not that easy. We tested quite a few and eventually settled on this one, provided by Figure Eight.

CNN Pattern prediction

The Figure Eight pattern data set consists of 15,702 images divided into 17 categories. However, this data set is not that easy to deal with (usually none of them are):

  • Bounding boxes are drawn onto the image
  • A fair amount of images are mislabeled
  • Similarities between categories can create bias (e.g. geometry vs. squares)
  • Very unbalanced: the category ‘plain’ has 8000+ images, while ‘houndstooth’ only has 66 images
  • It only has dresses (no skirts, pants, shirts, …)

Using openFrameworks and OpenCV (inpaint) to clean up the data set: removing the red rectangle and saving bounding box coordinates to file — https://flic.kr/p/2aCPhPu
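
We did the clean-up itself in openFrameworks, but the idea translates directly to a few lines of OpenCV in Python: threshold the drawn-on red rectangle, keep its extent as the bounding box, and inpaint the red pixels away. Colour thresholds and file names below are illustrative.

    import cv2
    import numpy as np

    img = cv2.imread("dress_with_red_box.jpg")

    # Isolate the drawn-on red rectangle (red wraps around the hue axis in HSV).
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 150, 100), (10, 255, 255)) | \
           cv2.inRange(hsv, (170, 150, 100), (180, 255, 255))

    # The rectangle outline gives us the bounding box coordinates we want to keep.
    ys, xs = np.where(mask > 0)
    x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()

    # Paint the red pixels away so the classifier never sees them.
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8))
    clean = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)

    cv2.imwrite("dress_clean.jpg", clean)
    with open("boxes.csv", "a") as f:
        f.write("dress_clean.jpg,%d,%d,%d,%d\n" % (x1, y1, x2, y2))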

We ended up with 2 data sets, each with 6 categories: animal, floral, plain, polka dot, squares and stripes. One data set contained the full images, the other contained cropped images. We trained a separate model for each data set.

Full image vs Cropped image

By using pose estimation to successfully divide the upper and lower body, we were able to get a very good success rate with the model trained on the cropped images.

To test & validate this approach we used Google Images.

Pattern recognition in the wild — https://flic.kr/p/2bJmPfg

Due to the sparse data set (an average of 120 images per class), we used a combination of transfer learning and fine-tuning on the VGG19 architecture pre-trained on ImageNet, in Keras — https://keras.io/applications/#vgg19

It took some trial & error, but these are the biggest changes we made that seemed to work out pretty well (see the sketch after the list):

  • Unfreeze layers 12–23 and retrain them (fine-tuning)
  • Drop the fully connected layers and replace them with (transfer learning):
    2048 Dense, Dropout 0.5
    1024 Dense, Dropout 0.5
    1024 Dense
    Softmax layer — 6 predictions
  • Use Global Average Pooling instead of Flatten
  • Use class weights (calculated with help from sklearn)
  • Use simple SGD as optimiser (learning rate 0.0001)
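
Putting those changes together, a minimal Keras sketch of the resulting model could look like the code below. The input size, ReLU activations and data-loading step are assumptions, not the exact production code.

    import numpy as np
    from keras.applications import VGG19
    from keras.layers import Dense, Dropout, GlobalAveragePooling2D
    from keras.models import Model
    from keras.optimizers import SGD
    from sklearn.utils.class_weight import compute_class_weight

    # VGG19 pre-trained on ImageNet, without its fully connected top.
    base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

    # Freeze everything, then unfreeze layers 12-23 for fine-tuning.
    for layer in base.layers:
        layer.trainable = False
    for layer in base.layers[12:24]:
        layer.trainable = True

    # New top replacing the fully connected layers (the transfer learning part).
    x = GlobalAveragePooling2D()(base.output)            # instead of Flatten
    x = Dropout(0.5)(Dense(2048, activation="relu")(x))
    x = Dropout(0.5)(Dense(1024, activation="relu")(x))
    x = Dense(1024, activation="relu")(x)
    out = Dense(6, activation="softmax")(x)              # 6 pattern classes

    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer=SGD(lr=0.0001),
                  loss="categorical_crossentropy", metrics=["accuracy"])

    # Class weights to compensate for the unbalanced data set.
    y_train = np.load("y_train.npy")                     # integer labels 0-5 (illustrative)
    weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
    class_weight = dict(enumerate(weights))
    # model.fit(x_train, keras.utils.to_categorical(y_train), class_weight=class_weight, ...)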

Trying to see what is going on across both computers — https://flic.kr/p/2bE16zN

Tying it all together…

By using various techniques, we were able to capture data on pedestrians and identify — in real-time — whether they were an “exception” to the norm. This “norm” had been established in the weeks leading up to the activation: we captured and ran hours of footage through our computer vision pipeline and drew statistics from 1000+ pedestrians in Soho, New York.

These are some of the stats our system retrieved:

– 50% of people in Soho are wearing black tops
– 63% of people in Soho are wearing black
– 90% of people in Soho are wearing neutral colors
– 71% of people in Soho wearing color are wearing dark colors
– 83% of people in Soho wearing color are wearing solids
– 1 in 3 people in Soho wearing color are wearing blue
– 1 in 4 people in Soho wearing patterns are wearing floral

The real-time aspect of the installation was crucial to its success. It had to be responsive and snappy in order to retain people’s interest. From real-time person detection & tracking to analysing what they were wearing, creating a slow motion video and getting everything up on the LED panels …

Between detecting a person and putting up the final results on screen, we had an average delay of 1.7 seconds.

The final installation used 4 networked computers to get everything up and running. The most challenging part was sending image & video data over the network to different machines with as little latency as possible. For this we used a combination of regular TCP, UDP and OSC.

Communication between Python and openFrameworks was handled with the ZeroMQ framework. We created our own Windows implementation, based on an existing OSX implementation that included an example of how to send image data.
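
On the Python side, sending image data over ZeroMQ can be as simple as publishing JPEG-encoded frames; the openFrameworks side then subscribes and decodes them. The sketch below is illustrative (socket type, port and encoding are assumptions).

    import cv2
    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.PUB)          # publish frames; ofxZMQ subscribes on the other end
    socket.bind("tcp://*:5555")               # port is an assumption

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ok, jpeg = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), 80])
        if ok:
            socket.send(jpeg.tobytes())       # raw JPEG bytes as a single ZeroMQ message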

A big shout-out and thank you to the teams at VML, Tool Of North America and New Balance, who made this awesome project possible. Thank you Ben, Tracy, Adam, Lauren, Craig and so many others for sticking with us.



Behind the scenes footage

Thank you

- newbalance.com, Client — New Balance
- vmlyr.com, Agency — VMLY&R
- toolofna.com, Production — Tool Of North America

Standing on the shoulders of giants

This would not have been possible without the help of the open-source community behind the following projects. Big thank you!

- openFrameworks, Creative Coding Toolkit & Community
- ofxZMQWindows & ZeroMQ
- ofxColorQuantizer
- ofxVideoRecorder
- ofxSpout2
- github.com/codebrainz/color-names
- pjreddie.com/darknet/yolo
- tensorflow.org & model zoo
- figure-eight.com/datasets
- pose.mpi-inf.mpg.de
- spout.zeal.co
- ffmpeg.org
- resolume
- anaconda
- keras.io
- dlib.net