Empowering Fashion Tech: Aiuta’s Quest to Interpret and Communicate Style
By: Fabio Cuzzolin (Chief Data Scientist), Max Balian (Chief Product Officer) and Denis Kirianov (ML Lead)
Hello! We’re Aiuta — a fashion-tech startup backed by cutting-edge AI. Our product helps people express personality through style, provides feedback for more satisfying fashion choices and empowers them to feel confident about the way they dress.
To us, fashion is just another language, so we’re currently developing FashionGPT — a technology similar to ChatGPT in that it combines deep learning with human feedback. Instead of being text-based, though, it’s mostly visual.
This technology lies at the core of the future Aiuta app. It scores looks and outfits, suggests new combos and shows how you would look wearing them.
In this article, we look at the story of Aiuta, explore similarities between fashion language and human language that stand behind FashionGPT, and take a dive into the recent history of machine learning. Buckle up!
What is Aiuta?
There was a scene in the cult movie Clueless featuring a made-up computer that suggested outfits and displayed what they’d look like on a person. The device was pure fiction back then, yet it spotlighted a problem that was already all too real. We did some research and found that choosing clothes for either a workday or a special occasion causes a lot of stress: 47% of respondents struggle to pick an outfit to wear for work, and it can take eight hours per month to choose what to wear.
Surprisingly, this scene portrays quite well what we’re working to achieve: helping people decide what to wear and showing them what they would look like in a variety of styles.
What we want Aiuta to do:
- Provide a second opinion on what you’ve already put on. Let’s say you need to choose between two outfits in the morning. Or you’re stuck in the fitting room trying to select the right color of sweater. The Aiuta app will provide scores for each option and show the best match. It will also generate tips and tell you what you could do to level up the look.
- Check if a new clothing item is a good fit and come up with solutions if it’s not. Something you’ve found online might match your body type and the other pieces in your wardrobe. Or perhaps not. Aiuta will score each purchase idea and suggest other matching items from different stores.
- Digitize your wardrobe and put it in your pocket. When Aiuta knows the clothes you already have, it can generate stylish outfit combinations. It will also help the algorithm get a better grasp of your taste and eventually exclude items you are unlikely to wear from its suggestions.
In the first version, we will provide recommendations based on several trendy styles, giving users a chance to try them on — digitally. Our scalable approach will allow us to add more styles and trends moving forward.
For all of this to work, we first need to do one basic thing: learn to evaluate how well different items come together in a single outfit. This is a fundamental task and a challenging one. Having solved it, we realized that our project can bring value to users in a number of scenarios. We have learned to calculate what we call the “AiutaScore”, which assesses not only the stylistic compatibility of the items, but also their relevance and trendiness, as well as how they suit a person’s body type and appearance.
We got ourselves a hell of a task. How to determine what is stylish in the fashion world, with its huge number of spoken and unspoken rules?
After months of R&D, we figured out how to tackle this problem using our own FashionGPT technology. We relied on recent technological breakthroughs — the latest generation of AI algorithms.
The hype and the promise of generative AI
Let’s take a dive into all the recent ChatGPT hype (if you know all about machine learning, you can skip this part!). In the past six months, there has been a lot of buzz around generative AI and what it is capable of.
Large language models, such as ChatGPT and the most recent GPT-4, impressed both users and AI experts with their ability to process language. These models can write simple pieces of code with stunning accuracy and produce detailed reports about almost anything, sometimes in a better writing style than most people can manage (even university students!). Despite these amazing achievements, the potential of generative AI is still untapped in many sectors.
What exactly is generative artificial intelligence and how is it different from “traditional” AI? Computer scientists make a distinction between “discriminative” and “generative” models.
A typical example of a discriminative model is a classifier: for instance, presented with a photograph, a classifier is able to tell, with a certain degree of confidence, whether the image contains, say, a person, a dog or a chair. Beforehand, classifiers need to be “trained” with thousands of photographs labeled with the class of object they contain.
Unlike discriminative models designed to differentiate between various “kinds” of data instances (e.g., images containing people versus those containing dogs), generative models, as the name suggests, are trained to generate new data instances. In other words, they learn how to produce data similar to that shown to them as examples.
Taking the example further, a generative model can learn to create images that “look like” those demonstrated during training (e.g., a new picture of a dog) or pieces of text that resemble those that were used for training.
To be able to create new data, rather than just classify it, generative models require a deeper understanding of the nature of the data.
Mathematically, this deeper understanding requires working out the probability of each possible data point (e.g., an image or a word). Generating an image can be compared to spinning a sort of complicated roulette wheel, where the task of generative AI, given the numbers the wheel spits out, is to understand how the roulette works and reproduce its results.
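To make the distinction concrete, here is a toy sketch of ours (not anyone’s production code): the discriminative model separates two classes, while the generative model learns the data’s distribution and samples brand-new points from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: "dog photos" and "cat photos" reduced to a single feature each.
dogs = rng.normal(loc=2.0, scale=0.5, size=1000)
cats = rng.normal(loc=-1.0, scale=0.5, size=1000)

# Discriminative view: learn a decision boundary between the classes.
# The midpoint of the class means plays the role of a trained classifier here.
boundary = (dogs.mean() + cats.mean()) / 2

def classify(x):
    return "dog" if x > boundary else "cat"

# Generative view: learn the distribution of the data itself,
# then sample from it to create brand-new "dog photos".
mu, sigma = dogs.mean(), dogs.std()
new_dogs = rng.normal(mu, sigma, size=5)

print(classify(1.7))  # discriminative: assigns a label to existing data
print(new_dogs)       # generative: produces data that was never observed
```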
Mainstream media seem to have discovered the “generative AI” buzzword in the last six months and described it as if it were a completely new thing. In fact, generative models have been around for quite some time now.
Large language models appeared with the help of the hugely impactful “transformer” technology published by Google in 2017. Before transformers, neural networks struggled to accomplish complex tasks such as generating text that makes sense, for they were unable to capture the overall structure of a long sequence of data, such as a 2,000-word passage.
One cannot correctly predict the next word just by focusing on the last few words in the sentence — what is essential is to understand the context of what is being said. Transformers managed to deal with such tasks by assessing how “similar” each word is to all the others in the paragraph. In this way, they can capture long-range connections between words with similar meanings all across the text. In AI jargon, this mechanism is called “self-attention”.
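For the curious, here is a bare-bones sketch of the self-attention computation (scaled dot-product attention, as in the 2017 transformer paper), stripped of the learned projections a real model would use:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of word vectors.

    X has shape (sequence_length, embedding_dim). In a real transformer,
    queries, keys and values are separate learned projections of X;
    here we use X directly to keep the core idea visible.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise "similarity" of every word to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X  # each word becomes a weighted mix of all the words

# Four "words", each an 8-dimensional embedding.
sequence = np.random.default_rng(1).normal(size=(4, 8))
contextualised = self_attention(sequence)
print(contextualised.shape)  # (4, 8): same sequence, now context-aware
```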
Large language models learn the best way of capturing these connections from a massive amount of human-generated text crawled from the internet, so that they can use them to generate text when asked a query, like ChatGPT does.
What is new in ChatGPT-like models that enabled them to create such hype?
ChatGPT is an example of a large language model, or LLM. Such models are trained to predict the next word in a sentence. For instance, given the fragment of text “Mary went to a coffee shop to get a …”, the model will try to predict what word comes next: “coffee”, “cappuccino” or “drink” to complete the sentence.
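You can see next-word prediction in action with a few lines of code. This sketch assumes the open-source Hugging Face transformers library and the publicly available gpt2 checkpoint (not the model behind ChatGPT itself):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Mary went to a coffee shop to get a"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the next word is read off the last position.
next_token_logits = logits[0, -1]
top = torch.topk(next_token_logits, k=5)
print([tokenizer.decode(i) for i in top.indices.tolist()])  # e.g. " coffee", " drink", ...
```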
As they generate data, most large language models rightly belong to generative AI.
Previous models such as GPT-3 often failed to produce outputs consistent with human expectations: they did not follow instructions precisely, made up false facts — something called “hallucination” — or generated unacceptable texts containing slurs or instructions on how to put together an explosive device.
A basic language model cannot “understand” which mistakes are important (those that change the meaning of a sentence) and which are not.
Assume the model needs to predict the missing word in “The Mongol Empire XXX in the early 13th century…” Picking “ended” instead of “started” matters, because it completely changes the sense of the statement and generates a false output.
To address this problem, ChatGPT brought human feedback into the loop to help refine its outputs. In particular, human beings are asked to vote on a large number of sentences produced by the model — this information is then used to refine the way ChatGPT chooses the next word.
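Under the hood, such votes are typically distilled into a “reward model” that learns to score outputs the way humans would. Here is a minimal sketch of the standard pairwise preference loss from the RLHF literature; the embeddings and their dimension are toy stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a (pre-computed) sentence embedding to a scalar score.
# In a real RLHF pipeline this head sits on top of the language model itself.
reward_model = nn.Linear(768, 1)

def preference_loss(emb_chosen, emb_rejected):
    """Pairwise loss: push the human-preferred sentence above the rejected one."""
    r_chosen = reward_model(emb_chosen)
    r_rejected = reward_model(emb_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Two batches of 16 sentence embeddings: the ones annotators preferred, and the ones they rejected.
chosen = torch.randn(16, 768)
rejected = torch.randn(16, 768)
loss = preference_loss(chosen, rejected)
loss.backward()  # gradients now nudge the reward model towards human taste
```

Once trained, the reward model’s scores steer the language model’s choice of the next word, typically via reinforcement learning.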
The power of the latest generation of language models opens up a vast array of business opportunities, from the automated analysis of customer feedback to the building and deploying of AI chains, smarter search engines offering chatbot-like interfaces, AI-powered communication platforms and knowledge bases, or personalized ChatGPT apps.
From a technological point of view, Big Tech companies, currently elbowing to get a competitive advantage in generative AI, run the risk of losing out to more agile start-ups, which can move faster and achieve domination in market niches.
The fashion industry is the perfect example. It is all about visuals, so a text chatbot can hardly power a serious new business in this market. The sector requires a different approach. At Aiuta, we are on a mission to prove that new advancements in tech can help disrupt fashion in a fascinating new way.
From ChatGPT to FashionGPT
Our fundamental belief is that fashion is also a language. From this perspective, we created a technology that is able to understand whether a certain outfit looks fashionable or not and to suggest better-matching options.
Why is fashion a language? Just as words form sentences to convey meaning, fashion items combine to create outfits that communicate a particular style or message. In this sense, fashion has its own vocabulary (clothing items) and its own grammar: the rules for arranging and combining items within an outfit. Just like in human language, there’s room for mistakes: we can speak a language poorly and we can dress badly. The rules of fashion, though, are much less tangible than the grammar of language. People do not often agree on what is “right” to wear, yet a certain code exists — wearing a meaningless outfit can be just as disappointing as saying a meaningless phrase. Also, there are ways of dressing that are almost universally scorned, such as the notorious “socks and sandals” combination; fashion icons and events continually try to corral the public towards a certain shared ideal look.
Viewing fashion as a language allows us to leverage computational approaches for analyzing and generating fashion.
Once we consider garments as visual entities, using pictures and photos to build a representation, we can learn to understand the “visual language” of outfits. This knowledge can then be used to recognize patterns, style preferences and trends in order to recommend personalized outfits, generate new fashion combinations, or even predict future fashion directions. We decided to call our technology FashionGPT — Fashion Garment Pretrained Transformer.
To put it simply, we see an outfit as a sentence. A “bad” outfit stands for an erroneous or inexact sentence, while a “good” one represents a meaningful and consistent one.
Just as a sentence is composed of different words, an outfit is composed of individual garments. When processing natural language, algorithms use various techniques to split text into words and subwords. Similarly, in the visual domain, outfits and collages can be split into image patches to allow a finer-grained analysis. Different garments characterized by the same shape and color may be considered synonyms.
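As an illustration of the analogy, here is a minimal sketch of splitting an image into patch “tokens”, the way vision transformers typically do; the image and patch sizes are arbitrary choices for the example:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C):
    one flat "visual token" per patch, ready to be fed to a transformer.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two grid axes together
    return patches.reshape(-1, patch_size * patch_size * c)

outfit_collage = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder image
tokens = patchify(outfit_collage)
print(tokens.shape)  # (196, 768): a 14 x 14 grid of patches, each a 768-dim token
```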
Based on these strong similarities, we reasoned that some sort of LLM-like approach could also do the trick here. Our concept is simple yet striking: fashion language modeling can provide us with stable representations for garments and outfits, to be used for both scoring and recommendation.
To do that, we need data.
FashionGPT requires data in the form of outfits correctly split into specific garments, either because they come as collages to start with or because they are obtained by “segmenting” a photograph into its constituent elements. There are several open “fashion compatibility assessment” datasets suitable for this task. The most prominent ones are DeepFashion, FashionVC and Polyvore. Open benchmarks are a huge asset, as they allow us to compare different architectures. However, fashion is a fast-changing domain, so some samples quickly become outdated. The truly relevant data consists of stock feeds and composed outfits from online fashion retailers. We found several partners who were able to provide us with relevant, topical data.
To train our FashionGPT model, we had to show it many examples of “good” outfits — much in the same way OpenAI or Google feed sentences from Wikipedia and other sources when training their respective language models (ChatGPT, Bard, etc.). However, there is an important difference: when teaching natural language models, we can be mostly sure that the sentences they are trained on are grammatically correct — because people are (on average) mostly literate, they write in their own native language, and grammar rules for natural languages are quite strict.
The language of fashion is much more ambiguous and its “rules” are vague. Thus, we needed humans to evaluate the example outfits our algorithm sees while being trained. Not random people, however — we needed experts who know fashion. So we hired them! We now have a team of in-house professional stylists that provide our scientists with feedback.
Because it is impossible for them to annotate the hundreds of thousands of outfits we receive, we put in place a special in-house annotation pipeline, ensuring that our product is scalable.
It works this way:
- We randomly select a sample of hundreds of outfits from a particular dataset (feed / open dataset / generated by heuristics);
- The stylists annotate these samples;
- We measure the overall score for the annotated samples and assess the agreement among the stylists;
- We propagate the score to all the other outfits in the same dataset, keeping in mind that such a propagated score has only a certain probability of being correct;
- For some tasks, we also use an “active learning” kind of approach, where we re-annotate the samples with “borderline” scores and retrain the algorithm (a rough code sketch of the whole loop follows the list).
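For the technically curious, here is a deliberately simplified sketch of what one round of such a pipeline could look like. Every name in it is a hypothetical stand-in; the real system is considerably heavier:

```python
import numpy as np

def annotation_round(dataset, stylists, sample_size=300):
    """One simplified round of the annotation pipeline.

    `dataset` is a list of outfits and `stylists` a list of annotator
    callables returning 1 (good outfit) or 0 (bad outfit); both are
    hypothetical stand-ins for much heavier internal systems.
    """
    rng = np.random.default_rng()

    # 1. Randomly sample a few hundred outfits from the dataset.
    idx = rng.choice(len(dataset), size=min(sample_size, len(dataset)), replace=False)

    # 2. Every stylist annotates every sampled outfit.
    votes = np.array([[stylist(dataset[i]) for i in idx] for stylist in stylists])

    # 3. Overall score and inter-annotator agreement for the sample.
    per_outfit = votes.mean(axis=0)                    # fraction of stylists voting "good"
    agreement = (np.abs(per_outfit - 0.5) * 2).mean()  # 1.0 = unanimous, 0.0 = coin flip

    # 4. Propagate the sample statistics to the whole dataset, remembering
    #    the propagated score is only *probably* correct.
    propagated = {"score": per_outfit.mean(), "confidence": agreement}

    # 5. Active learning: borderline outfits go back for re-annotation.
    borderline = [dataset[i] for i, s in zip(idx, per_outfit) if 0.4 <= s <= 0.6]
    return propagated, borderline
```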
This method enables us to collect data suitable for scoring. This is a fundamental part of our product, as we want to be able to tell people how good their outfits are.
You’re looking great! Or are you?
In recent years, several research teams have investigated the use of transformers and language models for analyzing outfit compatibility through image captioning or Swin Transformer networks, relying on pre-trained models with no human feedback. An “AI fashioner” based on (pre-transformer) deep learning and computer vision techniques has also been proposed, as well as an image-based fashion recommender based on user interests that, however, models the social aspect of the problem rather than the visual language of outfits. Interestingly, some researchers have looked at “personalized” recommendations based on physical attributes, rather than outfit compatibility.
Aiuta’s approach, by contrast, leverages state-of-the-art transformer models coupled with human input and savvy strategies for augmenting the training data and generating negative examples of outfits.
Namely, our FashionGPT consists of several components (a minimal code sketch follows the list):
- Image encoding. We need to convert the image of each garment into a vector.
- Transformer. Each vector represents an element of the outfit. We gather them together and feed them to a transformer layer that models the overall context of the outfit via the “self-attention” mechanism and produces a much stronger representation of it.
- A scoring head. Here, a dedicated “adapter” transforms the resulting vectors into a score that estimates how good the particular look is.
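Here is that minimal sketch, in PyTorch. It is our illustration of the general three-part design described above, not Aiuta’s production code; the tiny encoder, the dimensions and the pooling choices are all assumptions made for brevity:

```python
import torch
import torch.nn as nn

class OutfitScorer(nn.Module):
    """Toy version of an image-encoder -> transformer -> scoring-head pipeline."""

    def __init__(self, embed_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        # 1. Image encoding: a tiny CNN standing in for a real garment encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # 2. Transformer: self-attention over the garments in the outfit.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # 3. Scoring head: an adapter mapping the outfit representation to one score.
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, garments):
        # garments: (batch, n_garments, 3, H, W)
        b, n = garments.shape[:2]
        vecs = self.encoder(garments.flatten(0, 1)).view(b, n, -1)  # one vector per garment
        context = self.transformer(vecs)                            # outfit-aware vectors
        return torch.sigmoid(self.head(context.mean(dim=1)))        # score in [0, 1]

outfit = torch.randn(1, 4, 3, 224, 224)  # one outfit of four garments
print(OutfitScorer()(outfit))            # e.g. tensor([[0.49]])
```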
There are fundamental differences between written or spoken language and the language of fashion. An average outfit (three to four garments) contains fewer elements than an average sentence (typically, a dozen words). Additionally, a word usually occurs in many contexts, while a particular piece of clothing appears in far fewer, even within a single retailer’s catalog; as soon as we extend our “corpus” to different shops and styles, data sparsity may increase dramatically.
To solve this, we made our space denser by generating negative examples: we replace a single item in an outfit with another randomly selected item from the same shop’s stock. This amounts to so-called “negative mining”, which aims to generate negative examples to improve models: we cannot train a model by showing it only good examples; we also need some poor-looking outfits.
The caveat is that a proportion of the randomly generated outfits will happen to be “good”. Our augmentation strategy then required additional checks, through the above-mentioned sample annotation and label propagation pipeline.
We can also augment our dataset via “synonymous” replacement. This means that we can replace a given white T-shirt with a similar one, while retaining all other garments in the outfit, obtaining a new outfit with arguably the same score.
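A simplified sketch of both augmentation strategies might look as follows; the similar_items lookup is a hypothetical stand-in for a visual-similarity index:

```python
import random

def mine_negative(outfit, shop_stock, rng=random):
    """Negative mining: swap one garment for a random item from the same shop.

    Note: a fraction of these "negatives" will accidentally look good, so in
    practice they still pass through the annotation/label-propagation checks.
    """
    negative = list(outfit)
    i = rng.randrange(len(negative))
    negative[i] = rng.choice(shop_stock)
    return negative

def synonym_replace(outfit, similar_items, rng=random):
    """Synonym replacement: swap one garment for a visually similar one
    (same shape and color), keeping the outfit's score arguably unchanged."""
    augmented = list(outfit)
    i = rng.randrange(len(augmented))
    candidates = similar_items(augmented[i])  # hypothetical similarity lookup
    if candidates:
        augmented[i] = rng.choice(candidates)
    return augmented
```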
Thanks to our stylist annotation and augmentation strategies, we’ve achieved a performance above 80%, in (technical) terms of area under the receiver operating characteristic (ROC) curve. Roughly speaking, this means that if you hand the model one random good outfit and one random bad one, it will rank the good outfit higher in more than 80% of cases. This figure will further improve, yet for a concept as complicated and hard to capture as “fashion”, the result is already impressive and provides strong foundations for product building.
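For reference, here is how the metric is computed and why it has this pairwise reading; the labels and model scores below are toy values, purely for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])                   # 1 = good outfit, 0 = bad
scores = np.array([0.9, 0.7, 0.4, 0.6, 0.5, 0.2, 0.8, 0.3])   # model outputs

print(roc_auc_score(labels, scores))  # 1.0 here: every good outfit outranks every bad one

# Equivalent pairwise reading: fraction of (good, bad) pairs ranked correctly.
good, bad = scores[labels == 1], scores[labels == 0]
print((good[:, None] > bad[None, :]).mean())  # same number as the AUC
```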
Bottom line, we’ve solved the fundamental task for our future app — scoring fashion looks. We call this metric the AiutaScore.
However, scoring looks is just one of the possible applications of our technology, and we have a lot more planned — evaluation of shopping ideas, “this or that” scenarios, generating outfits based on your wardrobe, making emotional decisions more rational, virtual try-ons, and much more.
Here’s some homework for you: try to guess which of these outfits is generated by Aiuta — and which is a real photo. We’ll share the answers in our next post along with a detailed story on how we’ve managed to achieve these results. Stay tuned!