Exploring Custom Vision Services for Automated Fashion Product Attribution: Part 2

A case study in using custom vision services vs. manually developed machine learning models to classify dress patterns

Tom Szumowski
Mar 14, 2019

Study also supported by: Alan Rosenwinkel, Robin Sanders, and Rafi Hayne.
Source code, presentation, results, and links to data can be found on GitHub.
This study was also covered on TWiML & AI talk #247.

In Part 1, we covered what a custom vision service is, the business benefits of fashion product attribution, and overviews of each custom vision service.

In this article, we summarize our experience with these services from several perspectives spanning usability and performance. Since our evaluation team included both experienced Data Scientists and Interns, we also offer perspectives from both user bases. We also presented these findings at the 2018 REWORK Deep Learning Summit in London.


For each of the services/techniques above, we trained and evaluated against multiple public datasets as well as a URBN dresses dataset. Many of these datasets have thousands of training samples. However, one of the benefits of these services is that they are supposed to perform well with fewer training samples, so we created reduced datasets (named “tiny” in the GitHub notebooks) which have 100 samples per class.
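
As a minimal sketch of how such a reduced set could be built, the snippet below draws a fixed number of samples per class. The function name, the (item, label) pair layout, and the fixed seed are illustrative assumptions; the notebooks in the GitHub repository may build the “tiny” sets differently.

```python
import random
from collections import defaultdict


def sample_per_class(samples, per_class=100, seed=0):
    """Return up to `per_class` randomly chosen (item, label) pairs for each label."""
    # Group the samples by their class label.
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    rng = random.Random(seed)  # fixed seed so the reduced set is reproducible
    reduced = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(per_class, len(group))  # classes smaller than per_class are kept whole
        reduced.extend(rng.sample(group, k))
    return reduced
```

Classes with fewer than `per_class` samples are kept whole rather than oversampled, so a reduced set built this way can still be imbalanced.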

We also evaluate based on various dimensions of “usability”, as performance is not the only driving factor for these services.

Experiment Datasets

We were interested in performance across different domains, using both public and custom datasets. So we evaluated with CIFAR-10 [1], MNIST [2], and Fashion MNIST [3] for public benchmarking. Then, for a more business-relevant dataset, we used a smaller dataset derived from URBN dress products (5000 training samples, 500 test samples). The classes were hand-labeled dress patterns. We kept to simple classes for interpretability. We kept one obvious non-dress class (shoes) and included some challenging mixed-pattern dresses. The classes are also slightly imbalanced, with some patterns (solids) being up to 6X more prominent than others (stripes). This is representative of a real dataset we would have from our URBN catalog. Our GitHub repository contains links to download these datasets as we used them.

We evaluated a mix of public and custom datasets, both full set and reduced training set.



Performance: Classification Accuracy

For evaluation, we compared both classification accuracy (shown here) and AUC (shown in the presentation). The top-scoring service (row) is highlighted for each of the datasets (columns). Refer to the annotations for details.

Classification accuracy for all datasets and services. The bottom row includes public benchmarks, when available. Fast.ai ranks highest in five of eight datasets, but often not by much. See the Performance Table footnotes at the end of this article.

Overall, performance was comparable (i.e. within a few percent) for most of the services and most of the datasets. It’s also interesting that the results weren’t consistent across sub-divisions, such as all the small datasets, or even within a single dataset (small or large). Given the evaluation dataset sizes, most differences even fall within the Standard Error. In some cases, a few percent can mean a lot of samples (e.g. public datasets) and be of practical concern (think autonomous driving). However, for our business application of dress attribution, a difference of a percent would not make classification accuracy the differentiating factor. So instead, let’s consider usability.
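
To make the Standard Error remark concrete, here is a rough sketch of one common way to estimate it: the binomial standard error of an accuracy measured on n test samples. This is an assumption about the method, not necessarily the exact formula behind the figures in our footnotes, and the comparison helper treats the two results as independent, which is only an approximation when both services are scored on the same test images.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


def within_noise(acc_a, acc_b, n_test, z=1.96):
    """True if two accuracies differ by less than z standard errors of their difference.

    Assumes both accuracies come from test sets of the same size and treats
    them as independent (an approximation for a shared test set).
    """
    se_a = accuracy_standard_error(acc_a, n_test)
    se_b = accuracy_standard_error(acc_b, n_test)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) < z * se_diff
```

For a hypothetical 90% accuracy on a 500-sample test set (the size of our dress test set), the standard error works out to roughly 1.3 percentage points, so a one-point gap between two services would not be distinguishable from noise under these assumptions.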


The next table summarizes each service (columns) by different usability factors (rows). “In-House” includes both the Keras and Fast.ai approaches.

Summary of “usability” dimensions for each service with subjective coloring. Green indicates an included feature, yellow indicates a partial feature, and red indicates a missing or manual one. See the Usability Table footnotes at the end of this article.

Since some services are only in Beta, this table only represents current capabilities as of September 2018. Some highlights include:

  • Costs: Costs vary due to different pricing structures. For example, some offer free training and evaluation, but charge for online predictions. Others charge for training as well.
  • SDKs: We found language-native SDKs extremely useful as developers. For example, Clarifai offers a clean Python API that wraps files in a convenient handler. This makes life easier, since you don’t have to worry about REST request management, parsing, and retries.
  • Time to Results: Having to wait for training or to iterate on training can be cumbersome, particularly for in-house models. We found the ability to quickly prototype different datasets, dataset slices, or labels to be useful with the services.
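
To give a sense of the plumbing an SDK hides, here is a minimal sketch of the retry logic you would otherwise write around each REST call. It is not Clarifai’s (or any other vendor’s) API: `request_fn` is a hypothetical zero-argument callable that performs one request and returns the parsed response, and the attempt count and delays are arbitrary.

```python
import time


def call_with_retries(request_fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call request_fn(); on failure, wait with exponential backoff and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:
            # A real client would catch only transient errors (timeouts, HTTP 5xx).
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

The `sleep` parameter exists only so the backoff can be swapped out in tests; in normal use the default `time.sleep` is fine.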

With all of these performance numbers and usability dimensions, it can be easy to get lost in the weeds. If there’s just one thing to take away, it’s this: if you plan on using these services, consider more than just performance and price. The appropriate selection will vary by business and use case.

Benefits for Data Scientists and Researchers

Even if your team is well-versed in ML model tuning and data wrangling, we found these managed services can still provide value.

Human Labeling

Human labeling is very convenient, particularly when it is inherently tied into the model building process. We spent plenty of time trying to acquire attribute datasets by joining different internal databases, filtering missing values, de-duplicating, etc. Having a reasonably priced labeling service is helpful.

Example of a built-in human labeling service (Google AutoML Vision)

Rapid Audit & Visualization

We’ve all been there as data scientists, constantly writing and re-writing scripts to visualize performance, showing images with heatmaps, grids, etc. And don’t forget manually identifying errors, re-labeling, and re-training. The services were very useful for streamlining the audit, visualization, and iterative training process. You can quickly navigate false positives and false negatives by class, reassign labels, and retrain the model.

Example of rapid audit, visualization, and re-train (Clarifai)


Sure, we can create the latest and greatest deep learning model. But it’s also important to first run a benchmark to make sure the fancy model is even moving the needle in terms of performance. By default, one may use a scikit-learn model or a simple CNN. But the services offer a quick sanity check on performance before doing heavy internal research, at a reasonable cost. Everything runs within a half hour, often in less than 5 minutes, and provides navigable performance metrics so you don’t have to waste time slicing and dicing performance by class.

Example navigable confusion matrix for dress dataset (Clarifai)
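
For comparison, this is roughly the kind of per-class slicing we would otherwise script by hand. It is a minimal sketch, not what any of the services run internally; the function names and the example class names are illustrative.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_errors(y_true, y_pred):
    """Return {label: {"fp": ..., "fn": ..., "recall": ...}} for every label seen."""
    counts = confusion_counts(y_true, y_pred)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = counts[(label, label)]
        # False negatives: truly `label`, predicted as something else.
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        # False positives: predicted `label`, truly something else.
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        support = tp + fn
        report[label] = {
            "fp": fp,
            "fn": fn,
            "recall": tp / support if support else None,
        }
    return report
```

Calling `per_class_errors(true_labels, predicted_labels)` on a list of dress-pattern labels gives the false positives and false negatives per class that the services surface in their interactive panels.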

What’s Missing for Target Audience

Recall, however, that the target audience for these services is a customer who may not have significant ML experience, such as a developer or analyst. Since our evaluation team was a mix of newer developers and Data Scientists with ML experience, we identified examples where these services still need some work to address the needs of the target audience.

Data Quality Assistance

It can be easy for these models to accidentally overfit on features in an image that aren’t relevant to the identified class. This is common in ML, and even more common if your images have multiple items in them or complex backgrounds. Such images may require cropping or localization before being used in these services. This came up in our dresses dataset since URBN product imagery always involves curated outfits, models, and sometimes scenic backgrounds. In the figure below, the striped dress kept scoring high as a solid pattern. It wasn’t until we fed different crops of the image to the service that we realized it was locking onto the solid black hat.

Example where a managed service “locked” onto the solid hat even though the desired product was the dress.
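
The crop experiment can be made systematic. The sketch below splits an image into a grid of boxes and ranks them by the score a classifier assigns to each crop. It is an illustration under assumptions, not the exact procedure we followed: `score_crop` is a hypothetical callable that crops the image to the given box, sends the crop to the service, and returns the score for the class of interest. Boxes use the (left, upper, right, lower) convention that Pillow’s `Image.crop` accepts.

```python
def grid_boxes(width, height, rows, cols):
    """Split a width x height image into rows x cols boxes (left, upper, right, lower)."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def rank_crops(width, height, score_crop, rows=3, cols=3):
    """Score every grid crop and return (box, score) pairs, highest score first."""
    scored = [(box, score_crop(box)) for box in grid_boxes(width, height, rows, cols)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

If the top-ranked box covers the hat rather than the dress, that is a hint the classifier is keying on the wrong item.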

Here is another example of inadvertent “data poisoning” (more commonly called data leakage). Consider a case where there are multiple images per product. The user then feeds all images for the product into the service, which arbitrarily splits the data into train and validation sets. In this case, the task is classifying dress length. Since the service has seen this product labeled as midi in several training images, it may be recognizing the exact dress (or model, or background) rather than the length. That is what occurs on the right, where the dress length isn’t even shown and the service still classifies the image as “midi” with a high score.

Improper handling or splitting of input data can result in “data poisoning” where the classifier fits on the wrong features.
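
One way to avoid this is to split by product rather than by image, so that every image of a given product lands on the same side of the split. The sketch below assumes records shaped as (product_id, image, label) tuples, which is an illustrative layout rather than the format any service expects. It only helps where you control the split yourself, for example by holding out your own test set before uploading.

```python
import random


def split_by_product(records, val_fraction=0.2, seed=0):
    """Split (product_id, image, label) records so no product spans both sets."""
    # Shuffle the unique product IDs, not the individual images.
    product_ids = sorted({product_id for product_id, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(product_ids)

    n_val = max(1, round(len(product_ids) * val_fraction))
    val_ids = set(product_ids[:n_val])

    train = [r for r in records if r[0] not in val_ids]
    val = [r for r in records if r[0] in val_ids]
    return train, val
```

Because the split is made over products, the validation fraction is only approximate in terms of image count when products have different numbers of images.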

Inform Utility of Additional Training

One can keep iteratively updating labels and retraining the model. However, it’s tough for a non-ML developer to know when they’ve hit diminishing returns in terms of performance. And this can get costly both in time and dollars. How does a developer know if it is worth spending more money for more training hours? Currently there is no best practice.

Example of a one-hour training session performing better than a 24-hour training session. This is likely not due to a poor service or model, but rather because there is nothing left to learn from the data.
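
In the absence of a best practice, a developer could at least apply a simple stopping rule to the scores from successive training runs. The sketch below is one such heuristic, not a feature of any of the services; the `min_gain` and `patience` values are arbitrary assumptions and would ideally be tied to the standard error of the evaluation set.

```python
def has_plateaued(scores, min_gain=0.005, patience=2):
    """True if the last `patience` runs failed to beat the earlier best by min_gain."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    recent_best = max(scores[-patience:])
    return recent_best - best_before < min_gain
```

With hypothetical validation accuracies of 0.80, 0.86, 0.89, 0.891, and 0.890 from five successive runs, the last two runs add about 0.001 over the earlier best, so the rule would suggest that paying for more training hours is unlikely to help.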

Identify What Type of Additional Data Improves Performance

Similar to above, one can also gather more and more labeled data to improve the model. But again there currently isn’t an indicator for when performance has peaked.

Feedback-Driven Interface

An interactive performance panel is very helpful, but it only shows the current state of affairs. It’d be helpful to also provide tips on how performance can be improved, e.g. add more images, diversify images, pre-process, etc. These come naturally to an ML practitioner, but may be new to the target audience. (Note: Clarifai came close with documentation near the evaluation tools.)


Overall, the services provide a very pleasant and intuitive experience that opens the opportunity for non-ML practitioners, such as developers or analysts, to build custom vision models. They provide operationally adequate performance, doing reasonably well on smaller samples for benchmark datasets. There’s also utility for data scientists, with features like human labeling, rapid benchmarking, and convenient visualizations at affordable prices.

But there is still work needed to reach full “AI democratization” in terms of avoiding pitfalls in data preparation, data cleansing, and understanding how to improve performance. Whether a managed service or a homegrown solution is the right approach depends on the use case, business needs, and the team managing the solution.


[1] Krizhevsky, Alex, Vinod Nair, and Geoffrey Hinton. “The CIFAR-10 dataset.” Online: http://www.cs.toronto.edu/kriz/cifar.html (2014).

[2] LeCun, Yann, Corinna Cortes, and C. J. Burges. “MNIST handwritten digit database.” AT&T Labs. Online: http://yann.lecun.com/exdb/mnist (2010).

[3] Xiao, Han, Kashif Rasul, and Roland Vollgraf. “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.” arXiv preprint arXiv:1708.07747 (2017).


Performance Table

1. Rounded to 3rd decimal, Average AUC in backup

2. Min: Minimum requested by services (100 images/class)

3. Transfer Learning: Resnet-50 architecture pre-trained + variants

4. Source: CIFAR-10 CNN

5. Source: MNIST Public Benchmark

6. Source: unconfirmed by Zalando, but issues can be opened on GitHub to discuss

7. Source: Keras mnist-cnn

8. Due to cost constraints, we took a stratified sample of 450 out of 10000 test points. This results in up to +/- 4.3% Standard Error vs. +/- 0.8% Standard Error for the others, which used the full 10K test set

9. Due to cost constraints, we took a stratified sample of 1000 out of 10000 test points. This results in up to +/- 1.72% Standard Error

10. Due to cost constraints, we took a stratified sample of 2500 out of 10000 test points. This results in up to +/- 1.72% Standard Error

11. Never converged on validation accuracy after several trials

Usability Table


  1. All services have a free tier that allows limited training and prediction.
  2. Salesforce requires an enterprise relationship unless you use Heroku, where 10K is a price split point. After that it is $850 for 250K predictions.
  3. Reflects preview discount. Long-term prices are unknown.

URBN Engineering

Powering Urban Outfitters, Inc.

Tom Szumowski

Written by

URBN Data Scientist, Machine Learning Enthusiast, Coffee Snob, Geocacher, & Engineer. Currently out exploring ML deployment best practices & data engineering.
