Exploring Custom Vision Services for Automated Fashion Product Attribution: Part 1

An overview of what custom vision services are, what they look like, and how they can be used to generate fashion product attributes

Tom Szumowski
Mar 14, 2019 · 11 min read

Study also supported by: Alan Rosenwinkel, Robin Sanders, and Rafi Hayne.
Source code, presentation, results, and links to data on GitHub
This study was also covered on TWiML & AI talk #247.

This is Part 1 of a two-part series on custom vision services for automated product attribution. In Part 1, we describe the business benefits of fashion product attribution. We also cover each custom vision service with some example screenshots. In Part 2, we summarize our findings from both a performance and usability perspective.

URBN data science evaluated five custom vision services and two in-house alternatives for fashion product attribute tagging.

Case Study Overview

In September 2018, our Data Science team presented at the REWORK Deep Learning Summit London. The talk covered our experience assessing several custom vision services. These services allow a user to create a machine learning (ML) model that classifies images into categories the user provides. For example, our Urban Outfitters, Inc. (URBN) data science team was interested in automatically attributing dress images with their print (e.g. striped, floral, or solid) as a proof of concept.

What are custom vision services?

The general motivation of custom vision services is to “democratize AI”, or to make these technologies more accessible to a larger audience. Their target audience includes developers and analysts who may not necessarily have a ML background. Suppose you want to provide a “flower classifier” service to the public. Traditionally this would mean you would need to:

  1. Collect lots of labeled data on different flowers,
  2. Find a ML practitioner or Data Scientist to manually train and tune a classifier,
  3. Find a Data Engineer to wrap that classifier in a RESTful HTTP interface,
  4. Expose that interface to the public,
  5. Validate the deployed classifier is accurate, and
  6. Validate the classifier service can operate at-scale.

Custom vision services intend to manage nearly all these steps for you. They allow a user to upload a reasonably small set of labeled images. Then the service automatically optimizes a classifier to detect those labels. Once complete, they expose the classifier in a managed HTTP API so that you can serve classifications at-scale, in real-time.

How do they differ from other ML-based services?

Custom vision services differ from machine learning (ML) services, such as Google Cloud ML Engine or Amazon Sagemaker. ML services streamline deployment, but still require the user to build their ML models. Custom vision services also differ from vision services. Non-custom vision services, such as Microsoft Azure Computer Vision or Google Cloud Vision, come pre-trained with existing labels. For example, you can classify logos, handwriting, faces, landmarks, etc. In contrast, custom vision services allow you to choose your own labels and train a custom ML model to classify with those labels.

What custom vision services are available?

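There are many of these services on the market, ranging from start-ups to the tech giants. In the REWORK presentation, we covered:

  • Google Cloud AutoML Vision,
  • Clarifai,
  • Salesforce Einstein Vision,
  • Microsoft Azure Custom Vision, and
  • IBM Watson Visual Recognition.

We also compared our experience of those managed services against “homegrown” solutions, including: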

  • a pre-trained Keras ResNet neural network with various permutations, and
  • a model trained with the fast.ai library (this study occurred after the presentation).


We were provided either public or private trial access to some of the services for our evaluation. The level of access did not impact our assessments presented at REWORK, or here.

  • Google Cloud AutoML Vision: alpha partner,
  • Clarifai: private free trial provided for duration of evaluation,
  • Salesforce Einstein Vision: public free trial used,
  • Microsoft Azure Custom Vision: public free trial used,
  • IBM Watson Visual Recognition: virtual demo provided on our dress dataset, but no trial provided.

Urban Outfitters, Inc. (URBN) also had prior (non Data Science related) business partnerships with the following organizations: Google, Salesforce, Microsoft, and IBM.

Product Attribution at Urban Outfitters

URBN is a portfolio of global consumer apparel brands, including Urban Outfitters, Anthropologie, Free People, and BHLDN. Across these brands, thousands of new products spanning a diverse range of product classes are activated each week. Each product is described with attributes that are assigned during the buying process or at catalog insertion. Quality attributes benefit various efforts, including personalization, recommendations, trend tracking, trend forecasting, and assortment planning. However, attributes can change over time as fashion trends change. This makes it challenging to manage a growing list of attributes, or to backfill older products with new attributes to maintain consistency.

Our Data Science team was interested in experimenting with models to assess the accuracy and reliability of fashion attribute classifiers specifically. For example, attributes for a dress may include product category, neckline, sleeve length, color, pattern, dress length, fabric composition, and many more. While we can build in-house solutions, as the number of attributes grows, more of our team’s time would be consumed by managing models for all these attributes. So we were interested in seeing how well a custom vision service could ease that management.

A Tour of Custom Vision Services

The typical workflow for each service includes:

  1. User collects a small set of representative images, typically 100 per class, but as few as 10 per class.
  2. User labels images with the classes, i.e. “categories”.
  3. User uploads images via the API or the user interface (UI).
  4. Service splits the data into train and validation sets (with the exception of Google AutoML, where you can split manually).
  5. Service trains and optimizes the classifier model (typically taking between 2 and 20 minutes).
  6. Service notifies the user that training is complete, and provides a report on performance.
  7. With the model, the user now has a few options:
     • iteratively improve the model by fine-tuning prediction labels through the UI,
     • feed in a test set to evaluate hold-out data, or
     • serve the model in production.
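To make step 4 concrete, here is a minimal sketch of a per-class train/validation split, assuming the labeled data is a list of (image path, label) pairs. The function name, the 20% validation fraction, and the file names are illustrative assumptions; the services do not document how they split data internally.

```python
import random
from collections import defaultdict


def split_by_class(labeled_images, validation_fraction=0.2, seed=0):
    """Split (image_path, label) pairs into train and validation sets.

    The split is done per class so that, whenever a class has at least
    two images, it appears in both sets.
    """
    by_label = defaultdict(list)
    for path, label in labeled_images:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = [], []
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        # Hold out at least one image per class, but never the whole class.
        n_val = max(1, round(len(paths) * validation_fraction))
        n_val = min(n_val, len(paths) - 1)
        validation += [(p, label) for p in paths[:n_val]]
        train += [(p, label) for p in paths[n_val:]]
    return train, validation


# Hypothetical dataset: 10 images for each of three dress patterns.
dataset = [
    (f"{label}_{i:02d}.jpg", label)
    for label in ("floral", "solid", "striped")
    for i in range(10)
]
train_set, validation_set = split_by_class(dataset)
print(len(train_set), len(validation_set))
```

Splitting per class rather than over the whole pool matters most at the low end of the image counts mentioned above, where a purely random split could leave a label out of the validation set entirely.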

Most managed services share several capabilities:

  1. They offer a user interface (UI) for training, evaluation, and visualizing the results.
  2. They offer a RESTful API to programmatically interact with the service.

In our study, we primarily interacted via API to automate the procedure and ensure consistency. The GitHub repository includes automation scripts for each service as well as the in-house Keras and Fast.ai models. We did interact with the UI for each service to experiment with features that a user without programming and ML experience would use.
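As a rough illustration of what automating a prediction call can look like, the sketch below builds (but does not send) an HTTP request to a made-up endpoint. The base URL, path layout, JSON field name, and bearer-token header are all hypothetical; each real service defines its own endpoints and authentication, and the scripts in the GitHub repository are the reference for those.

```python
import base64
import json
import urllib.request


def build_prediction_request(image_bytes, model_id, api_key,
                             base_url="https://vision.example.com/v1"):
    """Build a POST request asking a trained model to classify one image.

    Everything about the endpoint here is a placeholder, not a real API.
    """
    # Images are commonly sent base64-encoded inside a JSON body.
    payload = {"image": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url=f"{base_url}/models/{model_id}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_prediction_request(b"fake-image-bytes", "dress-pattern", "MY_API_KEY")
print(request.get_method(), request.full_url)
```

Against a real service, the request would be sent with `urllib.request.urlopen(request)` and the JSON response parsed for the predicted labels and their scores.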

The next few sections summarize some features of each of the services. Any descriptions or screenshots of the services reflect their state in September 2018. Since many of the services were in beta or active development, they may look different today.

Google Cloud AutoML Vision Beta

Google Cloud AutoML was announced in a January Google blog post and showcased at Google Next 2018. It is currently in Beta. Each dataset has a main page where you can explore the different labels and upload additional images via CSV. You can also click into individual images and re-classify them.

Example main page from Google AutoML Vision once data is uploaded. You can filter and sort on labeled/unlabeled images.

The “Evaluation” tab lets you view metrics and also view classification errors. This becomes very useful to quickly scan your datasets for potential source labeling errors. In our case, that may mean a product was incorrectly attributed in our company database, and the ML model was able to detect the error. Most of the services offer similar overview and evaluation pages.

Part of the evaluation screen for Google AutoML Vision. They provide metrics for each label, as well as global metrics. You can browse through some, but not all, of the labels near the bottom (true-positives, false-positives, etc.).
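For readers less familiar with these metrics, the sketch below shows one common way to compute per-label precision and recall, plus overall accuracy, from true and predicted labels. It is an illustration only: the function name and sample labels are made up, and the services may compute their reported figures differently (for example, using score thresholds).

```python
from collections import Counter


def precision_recall_by_label(true_labels, predicted_labels):
    """Return ({label: (precision, recall)}, overall_accuracy)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # this label was predicted, but wrongly
            fn[truth] += 1      # the true label was missed
    metrics = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        predicted_count = tp[label] + fp[label]
        actual_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / actual_count if actual_count else 0.0
        metrics[label] = (precision, recall)
    accuracy = sum(tp.values()) / len(true_labels) if true_labels else 0.0
    return metrics, accuracy


# Made-up labels for six dresses, for illustration only.
truth = ["floral", "floral", "solid", "solid", "striped", "striped"]
predicted = ["floral", "solid", "solid", "solid", "striped", "floral"]
per_label, accuracy = precision_recall_by_label(truth, predicted)
for label, (precision, recall) in per_label.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Here "solid" has perfect recall but imperfect precision, because one floral dress was mislabeled as solid. That is the kind of pattern the evaluation pages make easy to spot.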

Google AutoML has a unique feature where you can purchase human labeled images. This is similar to Amazon’s Mechanical Turk, but managed through Google as part of the AutoML workflow. This is how we labeled the pattern on the dress data we tried. As a warning, the labels are only as good as (1) the descriptions and samples you provide, and (2) how differentiable your classes are. We had several dresses that were ambiguous in their pattern, e.g. a mix of solid and stripe. This made it difficult to choose a single label.

Example screenshot of the human labeling page from Google AutoML Vision. It is a paid service on top of regular AutoML pricing, and takes a few days to get the results.

At the time of our experiments, the service trained a bit slower than the others, taking between 5 and 15 minutes for their standard models. We hypothesize this may be due to Google AutoML having a more involved training procedure since it uses an architecture-search method (NAS-Net) instead of the faster, but sometimes less accurate, Transfer Learning method.


Clarifai

Clarifai was founded by Matthew Zeiler, an ML researcher known for his high ranking in the 2013 ImageNet challenge, the paper Visualizing and Understanding Convolutional Networks, and the ADADELTA adaptive learning rate method. Though not one of the larger companies, Clarifai has focused on computer vision AI since 2013, and their product feels more refined as a result. Clarifai had the fastest training, with accuracy on par with most other services. The model updates within seconds of each image you correct, allowing rapid iterative fine-tuning. Their user interface was also faster than the other offerings, with many knobs to adjust. They don’t currently have a human-labeling option (like Google’s), and there was no way to store prediction results in the service, a feature that is convenient for management (Azure offers this, as described later).

Example screenshot from Clarifai’s main page. You can sort through Concepts (labels) on the left. Visual search is also natively available at the top.
Example screenshot of Clarifai’s image evaluation page. They offer custom labels (top right) as well as labels for different domains (bottom right), navigable in a carousel (bottom).

Salesforce Einstein Vision

Salesforce Einstein Vision differs a bit from the other services. Instead of being a standalone public service, it is part of their larger Einstein Enterprise platform. It is an additional feature that existing Salesforce customers can purchase to integrate into the platform. For example, from their documentation: “Salesforce Social Studio integrates with this service to expand a marketer’s view beyond just keyword listening. You can “visually listen” to detect attributes about an image…” This provides the advantage of being able to naturally embed their Vision capability into any Salesforce tools the customer already has or builds out. They do offer a Heroku app option, but the tiers are a bit limited. It appears the Heroku app is intended more as a demo of the features.

Because it is more a capability that can be integrated into other Salesforce tools or apps, there isn’t an explicit user interface. The bulk of the experience is through the API. They do provide an open-sourced example sandbox UI that shows how to use the Salesforce Einstein Platform API using an Apex based wrapper. However, not having a managed UI made it difficult to visualize and understand the dataset and performance.

Screenshot from YouTube video explaining an example UI one can create to manage Salesforce Einstein Vision

Microsoft Azure Custom Vision Preview

Microsoft Azure’s Custom Vision is part of their Cognitive Services API, which also provides services for Speech, Language, Knowledge, and Search. Similar to Google and Clarifai, they offer SDKs as well as a UI. It may be due to our inexperience with Azure services, but we found it difficult to navigate the sign-up pages and acquire the appropriate API keys.

Example summary page for Microsoft Azure Vision, similar to the other services

Their UI features are very similar to the other services, offering an analysis page, performance evaluation page, and manual uploads for predictions.

Screenshot of Microsoft Azure Custom Vision performance summary page, providing Precision and Recall globally and for each class.

Microsoft’s solution offers CoreML and TensorFlow model exports. They also offer to store prediction results within their service, which was convenient. Because of the integration into other Azure services, Microsoft Azure Custom Vision can be a good choice for those already using the Azure ecosystem.

IBM Watson Visual Recognition

We did not directly evaluate IBM Watson Visual Recognition because it ended up not being price-competitive relative to the other services. Their training is priced at $0.10 per image, roughly 20X more than the other services overall. They have a free tier of 1000 images/month, but that was not sufficient for testing the various datasets we were interested in evaluating. We were provided a virtual demo of the service using our dresses dataset. However, we were not provided access to interact with our dataset using their service. Overall, the UI had similar features to the other services.

In-House: Keras ResNet

We also compared performance and usability against a few different “homegrown” models using Keras. First we applied a naive convolutional neural network (CNN). Then we applied Transfer Learning, using ResNet50 (paper) as the base architecture. Our goal was not to find the absolute best ML model manually, but rather provide a comparison to a reasonable model that a data scientist may be able to assemble quickly.

Example of Transfer Learning. Only the final fully-connected layer is unlocked for training. Source: Learn OpenCV Keras Tutorial.

We tried several variations on both the dataset and the models. For the dataset:

  • original datasets as well as generic augmentations involving 15% rotation and 10% translation, with similar pixels as the fill method,
  • both cropped and uncropped datasets, where the cropped dataset has backgrounds removed.

For the models:

  • ResNet50 as just a generic feature extractor (no re-training),
  • ResNet50 via transfer learning, re-training the final layer,
  • a summed ensemble of ResNet50 models, and
  • a two-path neural network, where both cropped and uncropped images are processed via ResNet50 feature extractors.
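As a small illustration of the summed-ensemble idea, the sketch below assumes each model outputs a probability per class for an image, and that "summed" means adding those probabilities across models before picking the top class. That reading is an assumption, as are the function name and the made-up numbers; the notebooks in the GitHub repository are the reference for what was actually done.

```python
def summed_ensemble(per_model_probabilities, class_names):
    """Sum each model's class probabilities and return the top class."""
    totals = [0.0] * len(class_names)
    for probabilities in per_model_probabilities:
        for i, p in enumerate(probabilities):
            totals[i] += p
    # Index of the class with the largest summed probability.
    best_index = max(range(len(totals)), key=totals.__getitem__)
    return class_names[best_index], totals


classes = ["floral", "solid", "striped"]
# Made-up softmax outputs from three models for a single image.
outputs = [
    [0.50, 0.30, 0.20],
    [0.20, 0.45, 0.35],
    [0.25, 0.40, 0.35],
]
label, totals = summed_ensemble(outputs, classes)
print(label)
```

In this example the first model alone would have picked "floral", while the summed scores favor "solid". An ensemble can smooth over one model's mistakes in this way.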

The GitHub repository includes notebooks that implement these variations.

In-House: Fast.ai ResNet

Fast.ai is a massive open online course (MOOC) whose goal is to “make neural nets uncool again”. Part of Fast.ai’s mission statement says “The world needs everyone involved with AI, no matter how unlikely your background.” They provide ML and Deep Learning courses, as well as a fast.ai Python library (built on PyTorch), to a general audience. With a top-down learning approach, their goal is to get users building models in the first lesson, and then teach how it all works in much later lessons.

After the conference presentation, the Fast.ai deep learning course V3 started. We were curious how all the datasets would perform using just their Lesson 1: Image Classification, which trains and evaluates using ResNet34 and ResNet50 transfer learning as well as fast.ai-originating training methods. Similar to the Keras approaches, our goal with using fast.ai was to see how well the models perform with little to no hand-engineering.

Fast.ai’s library includes visualization utilities, for example a “top losses” plot that shows the highest loss images from training


Custom vision services aim to make vision-based ML models accessible to the public by automatically training, deploying, and managing a classifier using the user’s own custom labeled images. In this part, we explored how attribute-tagging in general can provide business value. We also toured some different services and in-house options for developing these models.

In Part 2, we’ll cover the findings our data science team presented at REWORK, including performance across both public and custom fashion datasets, as well as a summary of usability across several dimensions from the perspectives of an experienced data scientist and a developer without ML experience.

URBN Engineering

Powering Urban Outfitters, Inc.

Tom Szumowski

Written by

URBN Data Scientist, Machine Learning Enthusiast, Coffee Snob, Geocacher, & Engineer. Currently out exploring ML deployment best practices & data engineering.

URBN Engineering

Powering Urban Outfitters, Inc. through software by pushing the boundaries between e-commerce and brand experiences every day.
