We benchmarked our AI and found it knows style better than fashion pros

Do you know your monk shoes from your Derby shoes? Neither do we. But we have a friendly machine that does.

Kris Graham
Inside EDITED
7 min read · Sep 6, 2018


At EDITED we build tech that the apparel industry uses to make smarter decisions, powered by enormous amounts of data from global retail.

To make that data at all useful, it has to be classified appropriately so that any user, anywhere in the world, can jump straight into our software and start gaining insights.

Not all sneakers are created equal.

The tricky thing is there is no industry standard for naming products. If you ordered a ‘jumper’ in the US, you wouldn’t end up with a nice new sweater. No, you’d have treated yourself to a pinafore dress.

Very quickly we understood that word recognition software alone wasn’t going to make our data usable. Classification became a major focus for our data scientists.

Comp Shopping

Our obsession with this AI makes more sense when you know more about the industry context. Fashion, a $2.4 trillion industry, is reliant on retailers making sure they are carrying the right trends when consumers most want them.

Doing this means that brands and retailers are continually looking at what they have versus what the rest of the market has, and making tweaks and changes. Traditionally, this is done in a rather clunky and laborious way. Pre-internet, retailers would send staff out to stores to count what competitors stocked.

Now the industry’s buyers and merchandisers do a mixture of that and counting on competitors’ websites, filling out endless spreadsheets with their observations. Or they invest in EDITED, where a couple of clicks can tell them all of that information, globally.

All of which means these classifiers do a pretty important job if you’re in the apparel business. We’re now at a stage where our classifiers process over 16 million products (and even more SKUs) every day. But why be content with that? We wanted to see just how good our classifiers were so that we could keep making them even better.

When manually looking through the products our classifiers have processed, we can tell that they’re good at understanding what is what in fashion apparel, footwear and accessories. But we often get asked ‘how good?’

In order to figure that out, we had to pit them against their harshest competitors — industry professionals. And these guys make a career out of knowing their mom jeans from their dad jeans.

So here’s how we benchmarked the role our AI plays

Because our team comes from a huge range of retail backgrounds and works with apparel products daily, we knew we had the harshest critics in-house.

Fifty respondents from around the company had to identify nearly 1,300 products that were randomly selected from our database in July 2018.

The sample was stratified to ensure a representative coverage of each of the markets we track, across all categories of garments, and all subcategories and styles of footwear.
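
If you’re curious what that stratification looks like in practice, here’s a rough pandas sketch. The dataframe and column names (products, market, category) are placeholder assumptions for illustration, not our actual schema.

```python
import pandas as pd

# Illustrative sketch only: the dataframe "products" and the columns
# "market" and "category" are placeholder assumptions, not a real schema.
products = pd.read_csv("products_july_2018.csv")  # hypothetical export

TARGET_SIZE = 1300
frac = TARGET_SIZE / len(products)

# Sample within each market/category stratum in proportion to its share of
# the full dataset, so every market and product type is represented.
sample = (
    products
    .groupby(["market", "category"], group_keys=False)
    .apply(lambda stratum: stratum.sample(frac=frac, random_state=42))
)

print(f"{len(sample)} products selected for the benchmark")
```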

We asked each respondent to complete the seemingly trivial task of identifying 57 products in turn. It’s an amount humans wouldn’t think much of undertaking, but could easily tire of, using a skillset they wouldn’t necessarily dedicate much effort to improving.

We then compared every respondent’s prediction to a ground truth for each of the products, provided by Sam, our data QA. This allowed us to calculate comparable measures of performance for both the AI and our retail professionals.
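
The scoring itself boils down to comparing each prediction column against the ground truth and taking the mean. A minimal sketch, where the file and column names (respondent_label, classifier_label, ground_truth) are hypothetical:

```python
import pandas as pd

# Minimal sketch of the scoring step; file and column names are hypothetical.
answers = pd.read_csv("benchmark_answers.csv")

human_accuracy = (answers["respondent_label"] == answers["ground_truth"]).mean()
ai_accuracy = (answers["classifier_label"] == answers["ground_truth"]).mean()

print(f"Humans:     {human_accuracy:.1%}")
print(f"Classifier: {ai_accuracy:.1%}")
```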

Now, to the good stuff.

What benchmarking revealed about our AI

The results showed that the classifier outperforms human respondents in the ‘simple’ task of identifying garment types by around 2.5 percentage points (97.8% accuracy against 95.4% for the humans).

The footwear subcategory classifier — something we’ve recently fine-tuned — performs almost as well as the garment classifier, with 96.7% accuracy. That’s a whole 9.3 percentage points above the human accuracy for this task.

Beating the retail professionals at every task it was set makes our final score classifiers 3, humans 0. In sport we’d call that a walkover.

That said, shoes are complicated — both the classifiers and human respondents struggled with the task.

The footwear styles classifier correctly classified the style of 69% of the footwear in the sample, compared with just under 63% of products classified by the human respondents. That means the classifier was 6.5 percentage points more accurate than our human respondents.

Human error is erratic

When we look at the classification of footwear subcategories, we can see the differences between human and classifier performance.

The human errors (left) are less consistent than those made by the classifier

The classifier identifies most footwear subcategories well, with the exception of slippers, a small category where only 40% are categorised correctly and the remaining 60% are identified as shoes. There are very few other misclassified products (1% of sandals predicted as boots, and 1% of sandals predicted as shoes, for instance).

However, the same cannot be said for the human effort. Here we found that only trainers and boots are more than 90% accurately classified. Where mis-categorisation does occur, the predicted labels are also not consistent. 55% of slippers are predicted as shoes, 21% as sandals, 7% as trainers, and 3% as boots, for instance.
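
If you wanted to reproduce that comparison, it’s essentially two row-normalised confusion matrices, one per predictor. A rough pandas sketch, again with hypothetical file and column names:

```python
import pandas as pd

# Rows are the true subcategory, columns the predicted one, normalised so
# each row sums to 100%. Column names are assumptions for illustration.
labels = ["boots", "sandals", "shoes", "slippers", "trainers"]

def confusion(df, prediction_col):
    return (
        pd.crosstab(df["ground_truth"], df[prediction_col], normalize="index")
        .reindex(index=labels, columns=labels, fill_value=0)
        .mul(100)
        .round(1)
    )

footwear = pd.read_csv("footwear_answers.csv")   # hypothetical file
print(confusion(footwear, "respondent_label"))   # human errors spread widely
print(confusion(footwear, "classifier_label"))   # classifier errors cluster
```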

This lack of consistency in predicted classes means that if you relied on human classification, not only would your data have more errors, there would also be more types of error.

Sandals as shoes, shoes as sandals, shoes as trainers, slippers as boots, slippers as sandals, slippers as shoes and slippers as trainers are all apparent in the predictions from retail professionals. Meanwhile, the classifier really only mistakes shoes for sandals and slippers for shoes with any real frequency.

This shows that human errors vary wildly, and that humans are simply wrong more often than the AI. That’s tricky to fix, whereas the beauty of AI error is that it’s predictable and conservative in its ‘wrongness’. It’s easier to spot and to correct or account for.

Smarter than average

Overall, the classifiers beat the retail professionals. But because some respondents got better scores in specific sections, we compared each respondent’s accuracy over the products they saw with the accuracy the classifier achieved over the whole sample (nearly 1,300 products).

In comparing this way, we need to be careful in our interpretation. Respondents may have seen an “easy” subset of products, and each of them saw far fewer products than the classifier did.

Therefore an error rate of 1 in 100 may not show up as any mistakes over 57 products, but over 1,300 we would expect more errors! Despite this unfairness, the classifiers still performed better than the average retail professional in each of the categories.
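
To make that concrete, here’s the back-of-the-envelope version of the point (purely illustrative arithmetic):

```python
# A fixed 1% error rate produces very different expectations at the two
# sample sizes used in the benchmark.
error_rate = 0.01

for n in (57, 1300):
    print(f"{n:>5} products -> {error_rate * n:.1f} expected errors")

#    57 products -> 0.6 expected errors (often none at all in practice)
#  1300 products -> 13.0 expected errors
```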

Human batteries wear out

Another insight from the data is that the human predictions declined in quality over time. After 10 minutes, accuracy rates had dropped by around 4 percentage points on average.
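
The analysis behind that number is straightforward: bucket each answer by how far into the exercise the respondent was, then look at accuracy per bucket. A rough sketch, with hypothetical column names:

```python
import pandas as pd

# Sketch of the fatigue analysis. The column names ("seconds_elapsed",
# "respondent_label", "ground_truth") are assumptions for illustration.
answers = pd.read_csv("benchmark_answers.csv")
answers["correct"] = answers["respondent_label"] == answers["ground_truth"]
answers["minutes_in"] = answers["seconds_elapsed"] // 60

accuracy_by_minute = answers.groupby("minutes_in")["correct"].mean()
print(accuracy_by_minute)  # expect a downward drift after roughly 10 minutes
```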

Humans tire — it’s a vital flaw in our hardware.

But AI doesn’t have an attention span — it won’t get tired. That makes it perfect to process enormous amounts of data, classifying around 16 million products a day.

That is a quantity of data humans have no chance of processing accurately or consistently.

And even if they could, it would be slow. Based on the time it took our respondents to classify 57 products, it would have taken them almost two and a half hours to classify the full sample of nearly 1,300 products, whereas the AI did the job in seconds.

That means, to classify all 16 million products, it would take a retail professional working five-day weeks, 7.5 hours a day, a total of 18 years to complete the task. Meanwhile, our machines do it in a day.
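
For the curious, here’s roughly the arithmetic behind that figure. The per-product time is implied by the survey; the number of working days per year is our own assumption, so treat the result as a ballpark rather than an exact figure.

```python
# Per-product time implied by the survey (about 2.5 hours for ~1,300
# products); 230 working days a year is an assumed figure for five-day
# weeks minus holidays.
seconds_per_product = 2.5 * 3600 / 1300        # roughly 6.9 seconds each
total_hours = 16_000_000 * seconds_per_product / 3600
working_years = total_hours / 7.5 / 230

print(f"{total_hours:,.0f} hours of manual work -> ~{working_years:.0f} years")
# ~30,769 hours -> roughly 18 years for one person; the classifiers do it daily.
```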

Some retailers still don’t use EDITED — go figure!

Let the classifier wash your dishes?

All of this proves that AI is more accurate, more consistent, quicker, and more reliable at scale than humans.

This means we should let it take the strain of the big boring tasks, and leave retail professionals freer to do the creative parts of their jobs. Like designing unique stuff customers want, thinking up innovative ways to market products and creating excellent shopping experiences.

Instead of fearing the onset of AI, and the risk it poses to industry jobs, we should be embracing it.

We like to compare it to the dishwasher. That’s a piece of kitchen machinery that’s not going to start replacing the chef. Instead it makes grotty parts of the process a whole lot more efficient.

Kris Graham is an ex-economist turned data scientist. He’s also part of an electronic music DJ group, and has what Kris modestly refers to as a “small-time radio show” but which we think is “hella cool”. Incidentally, it’s a really good soundtrack for crunching data to.

If you’d like to work with smart people like Kris, we’re always growing our team across London, New York and San Francisco. Check out our latest job listings here.
