Exploring Custom Vision Services for Automated Fashion Product Attribution: Part 2
A case study in using custom vision services vs. manually developed machine learning models to classify dress patterns
In Part 1, we covered what a custom vision service is, the business benefits of fashion product attribution, and overviews of each custom vision service.
In this article, we summarize our experience with these services from several perspectives spanning usability and performance. Since our evaluation team included both experienced data scientists and interns, we also offer perspectives from both user bases. We presented these findings at the 2018 REWORK Deep Learning Summit in London as well.
Table of Contents
- Experiment Datasets
- Performance: Classification Accuracy
- Benefits for Data Scientists and Researchers
- What’s Missing for Target Audience
Experiment Datasets
For each of the services/techniques above, we trained and evaluated against multiple public datasets as well as a URBN dresses dataset. Many of these datasets have thousands of training samples; however, one of the purported benefits of these services is that they perform well with fewer training samples. So we created reduced datasets (named “tiny” in the GitHub notebooks) with 100 samples per class.
We also evaluated various dimensions of “usability,” since performance is not the only driving factor for these services.
We were interested in performance across different domains, both public and custom datasets. So we evaluated with CIFAR-10, MNIST, and Fashion MNIST for public benchmarking. Then, for a more business-relevant dataset, we used a smaller dataset derived from URBN dress products (5,000 training samples, 500 test samples). The classes were dress patterns, hand-labeled. We kept to simple classes for interpretability. We left in one obvious non-dress class (shoes) as well as some challenging mixed-pattern dresses. The classes are also slightly imbalanced, with some patterns (solids) up to 6X more prevalent than others (stripes). This is representative of a real dataset we would have from our URBN catalog. Our GitHub repository contains links to download these datasets as we used them.
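As a rough illustration, the “tiny” variants can be produced with a simple per-class subsample. A minimal sketch (the helper name `make_tiny_dataset` is ours for illustration, not code from the repository):

```python
import random
from collections import defaultdict

def make_tiny_dataset(samples, labels, per_class=100, seed=42):
    """Stratified subsample: keep up to `per_class` samples per label,
    mirroring the reduced 'tiny' datasets used in the notebooks."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    tiny_samples, tiny_labels = [], []
    for label, items in sorted(by_class.items()):
        rng.shuffle(items)  # random pick within each class
        for item in items[:per_class]:
            tiny_samples.append(item)
            tiny_labels.append(label)
    return tiny_samples, tiny_labels
```

A class with fewer than 100 samples simply keeps everything it has, which preserves (rather than hides) any class imbalance in the reduced set.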
Performance: Classification Accuracy
For evaluation, we compared both classification accuracy (shown here) and AUC (shown in presentation). The top-scoring service (row) is highlighted for each of the datasets (columns). Refer to the annotations for details.
Overall, performance was comparable (i.e. within a few percent) across most of the services and most of the datasets. Interestingly, the results weren’t consistent across sub-divisions, such as among all the small datasets or within a single dataset (small or large). Given the evaluation dataset sizes, most differences even fall within the Standard Error. In some cases, a few percent can mean a lot of samples (e.g. public datasets) and be of practical concern (think autonomous driving). However, for our business application of dress attribution, a percent difference would not make classification accuracy the differentiating factor. So instead, let’s consider usability.
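For context, the Standard Error figures in the table annotations follow from the usual binomial approximation for an accuracy estimate. A quick sketch:

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of a classification-accuracy estimate
    measured on n_test samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

def accuracy_ci_halfwidth(accuracy, n_test, z=1.96):
    """Half-width of the ~95% confidence interval around the accuracy."""
    return z * accuracy_standard_error(accuracy, n_test)
```

At an accuracy of 0.5, a 450-sample test set gives a ~95% interval of roughly ±4.6 points, versus about ±1 point with 10,000 samples — which is why the subsampled evaluations in the annotations carry noticeably wider error bars.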
The next table summarizes each service (columns) by different usability factors (rows). “In-House” includes both the Keras and Fast.ai approaches.
Since some services are only in Beta, this table only represents current capabilities as of September 2018. Some highlights include:
- Costs: Costs vary due to different pricing structures. For example, some offer free training and evaluation, but charge for online predictions. Others charge for training as well.
- SDKs: We found language-native SDKs to be extremely useful as developers. For example, Clarifai offers a clean Python API that wraps files in a convenient handler. This makes life easier: no worrying about REST request management, parsing, and retries.
- Time to Results: Having to wait for training or to iterate on training can be cumbersome, particularly for in-house models. We found the ability to quickly prototype different datasets, dataset slices, or labels to be useful with the services.
With all of these performance numbers and usability dimensions, it can be easy to get lost in the weeds. So if there’s just one thing to take away: if you plan on using these services, consider more than just performance and price. The appropriate selection will vary by business and use case.
Benefits for Data Scientists and Researchers
Even if your team is well-versed in ML model tuning and data wrangling, we found these managed services can still provide value.
Human Labeling
Human labeling is very convenient, particularly when it is inherently tied into the model-building process. We spent plenty of time trying to acquire attribute datasets by joining different internal databases, filtering missing values, de-duplicating, etc. Having a reasonably priced labeling service is helpful.
Rapid Audit & Visualization
We’ve all been there as data scientists, constantly writing and re-writing scripts to visualize performance, showing images with heatmaps, grids, etc. And don’t forget manually identifying errors, re-labeling, and re-training. The services were very useful for streamlining the audit, visualization, and iterative training process. You can quickly navigate false positives and false negatives by class, reassign labels, and retrain the model.
Rapid Benchmarking
Sure, we can create the latest and greatest deep learning model. But it’s also important to first run a benchmark to make sure the fancy model is even moving the needle in terms of performance. By default, one may use a scikit-learn model or a simple CNN. The services offer a quick sanity check on performance before doing heavy internal research, at a reasonable cost. Everything runs within a half hour, often less than 5 minutes, and provides navigable performance metrics so you don’t have to waste time slicing and dicing performance by class.
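A minimal sketch of such a sanity-check baseline, assuming images arrive as NumPy arrays (the helper name is ours): fit a linear classifier on flattened pixels and see what accuracy falls out before reaching for anything deep.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def baseline_accuracy(X_train, y_train, X_test, y_test):
    """Fit a simple linear classifier on flattened pixel values as a
    cheap baseline before investing in a custom deep model."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train.reshape(len(X_train), -1), y_train)
    preds = clf.predict(X_test.reshape(len(X_test), -1))
    return accuracy_score(y_test, preds)
```

If the fancy model can’t clearly beat this number, the extra research effort probably isn’t moving the needle.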
What’s Missing for Target Audience
Recall however the target audience for these services is a customer who may not have significant ML experience, such as a developer or analyst. Since our evaluation team was a mix of newer developers and Data Scientists with ML experience, we identified examples where these services still need some work to address the needs of the target audience.
Data Quality Assistance
It can be easy for these models to accidentally overfit on features in an image that aren’t relevant to the identified class. This is common in ML, and even more common if your images contain multiple items or complex backgrounds. They may require cropping or localization before being used with these services. This came up in our dresses dataset, since URBN product imagery always involves curated outfits, models, and sometimes scenic backgrounds. In the figure below, the striped dress kept scoring high as a solid pattern. It wasn’t until we fed different crops of the image into the service that we realized it was locking onto the solid black hat.
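One cheap way to diagnose this, as we did with the striped dress, is to feed crops of the image to the service and see which region the label follows. A minimal sketch, assuming images as NumPy arrays (`classify` in the usage note stands in for the service’s predict call and is hypothetical):

```python
import numpy as np

def crop_grid(image, rows=3, cols=3):
    """Split an H x W (x C) image array into a grid of crops so each
    region can be classified separately; a label that only fires on one
    crop (e.g. the hat) suggests the model keyed on that region,
    not the dress."""
    h, w = image.shape[0], image.shape[1]
    tile_h, tile_w = h // rows, w // cols
    return [image[r * tile_h:(r + 1) * tile_h,
                  c * tile_w:(c + 1) * tile_w]
            for r in range(rows) for c in range(cols)]
```

Usage would look like `scores = [classify(crop) for crop in crop_grid(img)]`, then comparing per-crop scores against the full-image score.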
Here is another example with some inadvertent “data poisoning”. Consider a case where there are multiple images per product. The user then feeds all images for the product into the service, and the service arbitrarily splits the data into train and validation. In this case, it is classifying dress length. Since the service has seen this product labeled as midi in several training images, it may be recognizing the exact dress (or model, or background), not the length. That is what occurs on the right, where the dress length isn’t even shown, and it still classifies “midi” with a high score.
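A simple guard against this kind of leakage is to split by product rather than by image, e.g. with scikit-learn’s GroupShuffleSplit, so all images of one product land on the same side of the split (a sketch; the function name is ours):

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_product(image_paths, labels, product_ids,
                     test_size=0.2, seed=42):
    """Train/validation split where every image of a given product ends
    up on the same side, so the model can't 'recognize' a specific dress
    (or model, or background) across the split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, val_idx = next(
        splitter.split(image_paths, labels, groups=product_ids))
    return train_idx, val_idx
```

Services that split arbitrarily don’t expose this control, which is exactly the gap: a target-audience user has no way to know the split should be grouped in the first place.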
Inform Utility of Additional Training
One can keep iteratively updating labels and retraining the model. However, it’s tough for a non-ML developer to know when they’ve hit diminishing returns in terms of performance. And this can get costly both in time and dollars. How does a developer know if it is worth spending more money for more training hours? Currently there is no best practice.
Identify What Type of Additional Data Improves Performance
Similar to above, one can also gather more and more labeled data to improve the model. But again there currently isn’t an indicator for when performance has peaked.
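Until the services surface such an indicator, an in-house proxy is a learning curve: measure validation accuracy at increasing training-set sizes and look for a flat tail, which signals diminishing returns from more of the same data. A sketch with scikit-learn (the helper name is ours; a linear model stands in for whatever classifier is being trained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

def accuracy_by_training_size(X, y, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Cross-validated accuracy at increasing training-set sizes.
    A flat tail suggests more labels of the same kind won't help;
    a still-rising curve suggests more data is worth the cost."""
    sizes, _, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=list(fractions), cv=5, scoring="accuracy",
        shuffle=True, random_state=0)
    return list(zip(sizes, val_scores.mean(axis=1)))
```

If the last few points sit within each other’s error bars, gathering more identically-distributed data has likely peaked, and effort is better spent on different data (new angles, cleaner crops, underrepresented classes).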
Feedback-Driven Interface
An interactive performance panel is very helpful, but it only shows the current state of affairs. It would be helpful to also provide tips on how performance can be improved, e.g. more images, more diverse images, pre-processing, etc. These come naturally to an ML practitioner but may be new to the target audience. (Note: Clarifai came close, with documentation near the evaluation tools.)
Overall, the services provide a very pleasant and intuitive experience that opens the opportunity for non-ML practitioners, such as developers or analysts, to build custom vision models. They provide operationally adequate performance, doing reasonably well on smaller samples of benchmark datasets. There’s also utility for data scientists in features like human labeling, rapid benchmarking, and convenient visualizations at affordable prices.
But there is still work needed to reach full “AI democratization” in terms of avoiding pitfalls in data preparation, data cleansing, and understanding how to improve performance. Whether a managed service or a homegrown solution is the right approach depends on the use case, business needs, and the team managing the solution.
Krizhevsky, Alex, Vinod Nair, and Geoffrey Hinton. “The CIFAR-10 dataset.” Online: http://www.cs.toronto.edu/~kriz/cifar.html (2014).
LeCun, Yann, Corinna Cortes, and C. J. Burges. “MNIST handwritten digit database.” AT&T Labs. Online: http://yann.lecun.com/exdb/mnist (2010).
Xiao, Han, Kashif Rasul, and Roland Vollgraf. “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.” arXiv preprint arXiv:1708.07747 (2017).
1. Rounded to 3rd decimal, Average AUC in backup
2. Min: Minimum requested by services (100 images/class)
3. Transfer Learning: Resnet-50 architecture pre-trained + variants
5. Source: MNIST Public Benchmark
6. Source — unconfirmed by Zalando but Issues can be opened to discuss on GitHub
8. Due to cost constraints, stratify sampled 450 out of 10000 test points. This results in up to +/- 4.3% Standard Error vs. +/- 0.8% Standard Error in others that used 10K test set sample size
9. Due to cost constraints, stratify sampled 1000 out of 10000 test points. This results in up to +/- 1.72% Standard Error
10. Due to cost constraints, stratify sampled 2500 out of 10000 test points. This results in up to +/- 1.72% Standard Error
11. Never converged on validation accuracy after several trials
- All services have a free tier that allows limited training and prediction
- Salesforce requires an enterprise relationship unless you use Heroku, where 10K predictions is a price split point; after that, it is $850 per 250K predictions.
- Reflects preview discount. Unknown what long term prices will be.