Stories by Michael Sugimura on Medium

A Tricksy Look at Outfitting: Kitsune

Michael Sugimura — Fri, 19 Mar 2021 22:03:35 GMT

This blog post is based on a hack week I did near the end of 2020. For background, the ShopRunner Data Science team allows members to spend a week per quarter working on a ShopRunner-related topic of their choice.

Ever since getting involved in e-commerce data science, one of the things I have been thinking about has been automating outfit creation. Since I started working at ShopRunner, my general focus has been helping the team build out more deep learning and, in particular, computer vision expertise. For the first year and a half or so, the focus was on building out those foundational capabilities. I personally had a lot of fun working on multi-task learning pipelines (open sourcing our ShopRunner multi-task library, multi-task multi-dataset, pytorch multi-task learning, and keras multi-task learning) and am proud of how the team has turned those internal pipelines into an open source library called Octopod (github, pypi). However, once that foundation was laid, the focus then started to shift towards things like visual similarity, user-uploaded visual search, and outfitting.

The idea of outfitting is that, for a given seed product, we should be able to return a set of complementary products that go along with it to create an outfit. For me, outfitting is an interesting problem area because it isn’t that well-defined or mapped out, and in these spaces where there are no real right answers or accepted methods, it means I can satisfy my thirst for adventure and shenanigans.

There are a number of reasons that outfitting in general is a fairly hard problem:

It’s hard to find large datasets of curated outfits.
The model needs to be able to generalize to new and unseen items.
Even if you find/acquire a dataset, you run the risk of it not working well for your in-house dataset/catalog.

In terms of datasets, one of the most-frequently used ones has been one from a website called Polyvore (dataset here), which has a good number of products and has been used a lot in academic literature. Another newer dataset which was really cool to see is one from Alibaba from their paper POG: Personalized Outfit Generation for Fashion Recommendation at Alibaba iFashion (paper here & dataset here). This POG dataset contains around 500K unique items and 1M outfits, so it is a great resource for the community. So for this hack week, I decided to focus my efforts on using outfits from the POG dataset for outfitting.

Even though the POG dataset helps with that initial hurdle of finding curated outfits, it still doesn’t necessarily solve the other two problems of 1) generalizing to new and unseen items and 2) being relevant for a given new catalog of items. Some things that can cause issues between catalogs could be differing photography styles, product image compositions, model positions (the person kind, not the math kind), etc. between the two catalogs.

This is where I think the work and perspective I have from working on other projects like visual search and user-uploaded visual search comes in. Both of these work by taking vectors and doing cosine similarity searches on our full ~7M product catalog. The idea is that we extract vectors from images using models we have fine-tuned to internal tasks as the underlying representation of our products. This lets us handle never before seen products since we can always get vectors for them. However, my one remaining concern was that while we could use vector representations of the products in an outfit and train a model to generate vectors of products that go along with the initial item, an external dataset might not map well onto an internal ShopRunner dataset or product catalog.
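As a rough illustration of the vector search idea, here is a minimal sketch of a cosine similarity lookup using NumPy. The function name, the random stand-in catalog, and the sizes are assumptions made for this example; it is not the production search, which runs against the full catalog.

```python
import numpy as np

def cosine_top_k(query, catalog, k=5):
    """Return indices and scores of the k catalog rows most similar to query."""
    # Normalize so that a dot product equals cosine similarity.
    query_unit = query / np.linalg.norm(query)
    catalog_unit = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    scores = catalog_unit @ query_unit
    top = np.argsort(-scores)[:k]
    return top, scores[top]

rng = np.random.default_rng(0)
catalog_vectors = rng.normal(size=(1000, 256))  # stand-in for product vectors

# A "never before seen" product: a slightly perturbed copy of catalog item 42.
new_product = catalog_vectors[42] + 0.01 * rng.normal(size=256)

top_idx, top_scores = cosine_top_k(new_product, catalog_vectors, k=3)
```

Because the new product only needs a vector, nothing about the lookup depends on the item having been in the catalog when the model was trained.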

I tend to enjoy clever and interesting ways to attack problems, and every good project needs a nickname. I nicknamed this hack week project, and in particular the model, Kitsune. Partially because I like 9 tailed foxes from East Asian mythology and their trickster nature, but I also liked the idea of a creature with multiple tails representing my model (you will see how later). Now, off on an adventure as I go through my hack week project, Kitsune.

Kitsune outfitting example

Rethinking Outfitting

For as long as I have been thinking of outfitting, we have been thinking of the problem as:

“the goal of outfitting is for a given seed product we want to be able to recommend other products that go along with it”

There is nothing wrong with this statement since it is factually what we are trying to do, but it also constrained my point of view because it was so well-established. In particular the focus on returning a set of products.

What this focus on returning a product meant is that any model has to be able to return the necessary information for a full product. But what if our model didn’t? The way that we typically think about that from a computer vision point of view at ShopRunner is via a combination of vectors from our in-house taxonomy and attribute models. If you think about models as employees in a department store, then a taxonomy model tries to sort products into their respective sections (jeans, shirts, dresses, shoes, and so on), while an attribute model aims to sort products within each of those sections (into things like color, pattern, dress length, sleeve length, season, etc.).

However, this focus on attribute and taxonomy vectors sort of constrained my ability to think of different ways to approach the problem. So all of my brainstorming basically looked like the below gif from Big Hero 6.

So I needed to shake things up.

For me, a lot of my breakthroughs at work come when I am able to take a step back and do other things and basically let my brain wander. A lot of that comes down to my martial arts (these days it’s a lot of practice on my Wing Chun wooden dummy), or my other hobbies like cooking. This time, oddly enough, it came after a fairly heavy Wing Chun workout when I had a dream about this outfitting problem. The general gist of it was what if we didn’t think about taxonomy and attributes together, but just the attributes?

Can we just represent an outfit by its attribute vectors? Grey shirt, black pants, & black shoes vs grey, black, & black.

This was an interesting idea because attributes like the ones I listed are not bound to a particular item of clothing, but rather apply generally across all types of clothing. So if we made a model that just learned to generate attribute vectors that go with the attribute vector of a seed item, then that model wouldn’t be tied to the structural taxonomy information of the base dataset.

The remaining problem that I created with this pipeline is that attribute vectors by themselves are pretty useless for finding actual products, by design. In a taxonomy type of problem, the model learns how to differentiate products at a very structural level (pants, dresses, shoes, etc.). For attributes, they are trained to ignore structural information in favor of the details across a wide variety of types of products for things like color and pattern.

So my thoughts on how to re-add this structural component was just to make a dictionary of mappings from a taxonomy of a seed item (which we could always get in our catalog) to a set of taxonomies that could reasonably go well with it. For example, a seed taxonomy could be dresses, and the ones in the mapping might be ankle boots, shoulder bag, and jacket, since each of these mappings can reasonably be in an outfit with a dress. The main risk I was taking here was whether or not attribute vectors generated with some taxonomy in mind could be applied to an arbitrary new one, but those risks are what hack weeks are for!

Setup and Preprocessing

In terms of modeling, there are a lot of possible paths, particularly around transformers which have been used in papers like the Alibaba one that I mentioned earlier. The tactic there is to train a transformer to generate the next item in the outfit based on some seed item or current outfit state. However, a lot of my work at ShopRunner has been around building out multi-task learning pipelines, so I wondered if I could throw this pipeline at this outfitting problem.

Dataset

One of the real time savers in doing this is that one of my colleagues had used a subset of the Alibaba dataset for modeling work which focused on length 4 outfits and had stored the taxonomy and attribute model vector representations of the items. All in all this left us with around 50K outfits for our use.

So with the dataset in place, my goal was to build a multi-task model that can take in a seed item and generate 3 attribute vectors that go along with the seed item.

Vector Compression

The raw attribute model vectors are length 256, but we have also previously found that using PCA or another dimensionality reduction technique helps us increase the speed of our subsequent cosine similarity searches and space required to store the vectors with very little performance decrease. I also figured that a compressed representation would mean a more concise vector for my generator model to learn. So, I went ahead and compressed each item vector via PCA so each outfit was represented by 4 length 50 attribute model vectors.
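To make the compression step concrete, here is a minimal sketch of PCA done with a plain NumPy SVD, reducing length 256 vectors to length 50. The synthetic data and the helper names (fit_pca, compress) are assumptions for illustration; a library implementation such as scikit-learn's PCA would serve the same purpose.

```python
import numpy as np

def fit_pca(vectors, n_components):
    """Fit PCA with an SVD; returns the mean and the top principal axes."""
    mean = vectors.mean(axis=0)
    # Rows of vt are principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(vectors, mean, components):
    """Project centered vectors onto the principal axes."""
    return (vectors - mean) @ components.T

rng = np.random.default_rng(0)
item_vectors = rng.normal(size=(2000, 256))  # synthetic 256-d attribute vectors

mean, components = fit_pca(item_vectors, n_components=50)
compressed = compress(item_vectors, mean, components)  # shape (2000, 50)

# An outfit of 4 items becomes 4 vectors of length 50.
outfit = compressed[:4]
```

The same mean and components need to be reused for any new catalog item so that every vector lives in the same compressed space.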

Data Augmentation

Once I had my outfits of length 4 where each element was a length 50 vector, I went ahead and did some data augmentation. For this model pipeline, I made the conscious choice to have a single outfit for each unique seed item vector (around 95K unique items). This means that I have more outfits than my original 50K (now around 95K), but fewer than if I took all available permutations.

The reason I made this choice is based on some intuition I have (that could be wrong, but that is what hack weeks are about) that with this approach, if you take all permutations of every outfit, there are multiple right answers for every given seed item.

This intuition is based on some initial testing I was doing with my pipelines, where I used every permutation of every outfit. This gave me a ton of additional data and I thought it would be helpful. However, with this setup every task head not only generated basically the same vectors (because their datasets are all the same now, just shuffled), but the items they yielded were basically always the same. This could be due to the lack of ability of the model to handle the complexity of the dataset, but I found that changing the architecture didn’t affect these results. My results got cleaned up when I didn’t permute the dataset. By just using the base dataset without any permutations it meant that every task head had a unique dataset associated with it, and every item was appearing more or less a single time. So what I decided to do was a partway measure between these two passes. First, I would permute the dataset and then remove duplicates so that every seed item would only appear one time, but every item would have an outfit associated with it. This gave me more data than not permuting at all but didn’t have the duplicates that I found to be detrimental to this training pipeline.
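The "permute, then deduplicate by seed item" step can be sketched as follows. This is an assumption-laden illustration: item ids stand in for the actual vectors, the ids themselves are made up, and keeping the first permutation seen for each seed is just one possible tie-breaking rule.

```python
from itertools import permutations

def one_outfit_per_seed(outfits):
    """Permute each outfit, then keep one permutation per unique seed item."""
    by_seed = {}
    for outfit in outfits:
        for perm in permutations(outfit):
            seed = perm[0]
            # setdefault keeps only the first permutation seen for a seed.
            by_seed.setdefault(seed, perm)
    return list(by_seed.values())

# Hypothetical item ids; "shirt_1" appears in two different outfits.
outfits = [
    ("shirt_1", "pants_1", "shoes_1", "bag_1"),
    ("shirt_1", "skirt_2", "boots_2", "hat_2"),
]
augmented = one_outfit_per_seed(outfits)
```

Two outfits with seven unique items turn into seven training rows, one per seed item, which mirrors the jump from roughly 50K outfits to roughly 95K described above.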

So while this approach does give us less data to work with, it does make it so each task head should give a more unique specific answer for a given seed item and the task heads are not taught to clash with one another.

Modeling: Modifying Octopod to Train Kitsune

The actual modeling pipeline is a standard Octopod training pipeline where I made custom model and dataset classes. In terms of structuring the training pipeline, Octopod is a multi-task multi-dataset pipeline where generally, we have a different dataset for each task head. In this case, each of the three task heads has a dataset where the x’s are all the seed items in the dataset while the targets are the vectors for the item in that first, second, or third slot in the outfit. The model class for Kitsune is just a simple model made up of a few linear layers stacked on one another with inspiration from the decoder section of an autoencoder, where the model takes in a length 50 vector, increases in size for a few layers, and finishes with 3 task heads each of length 50. The goal of the model is to take in a length 50 seed attribute vector and generate 3 length 50 vectors which can be used to find products.
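A model of this shape could look roughly like the PyTorch sketch below. The class name, hidden layer sizes, ReLU activations, and head names are all assumptions for illustration; the post only specifies a length 50 input, a few widening linear layers, and three length 50 task heads.

```python
import torch
from torch import nn

class KitsuneSketch(nn.Module):
    """Shared linear trunk followed by three length-50 task heads."""

    def __init__(self, vector_size=50, hidden_sizes=(128, 256), n_heads=3):
        super().__init__()
        layers = []
        in_size = vector_size
        for out_size in hidden_sizes:  # widen layer by layer, decoder style
            layers += [nn.Linear(in_size, out_size), nn.ReLU()]
            in_size = out_size
        self.trunk = nn.Sequential(*layers)
        # One head per outfit slot; names are hypothetical.
        self.heads = nn.ModuleDict(
            {f"slot_{i + 1}": nn.Linear(in_size, vector_size) for i in range(n_heads)}
        )

    def forward(self, seed_vectors):
        shared = self.trunk(seed_vectors)
        return {name: head(shared) for name, head in self.heads.items()}

model = KitsuneSketch()
seed_batch = torch.randn(8, 50)  # 8 seed attribute vectors
outputs = model(seed_batch)      # dict of three (8, 50) tensors
```

Each head would be trained as a regression against the vector in its outfit slot, which is why the "multiple tails" image fits the architecture.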

Next I just let the model train while cycling the learning rates for a few hours on a 2080Ti GPU.

Constructing a Pipeline to Bring Kitsune to Life

So I mentioned before that an upside to this approach to outfitting is that focusing on just attributes, instead of both attributes and taxonomy, would be a useful way to learn on an arbitrary dataset and apply the result to an in-house one. However, a downside that I also mentioned is that attribute vectors are pretty worthless on their own for visual similarity tasks because they don’t encode any structural information, which makes it hard to get similar-looking items out of them. So in addition to the model I trained above, I made a mapping of seed item taxonomy categories to different taxonomy categories. For example:

mapping_dict = {
    'Dress': ['Ankle boot', 'shoulder bag', 'leather jacket'],
    'T-shirt': ['jeans', 'running shoes', 'hoodie']
}

So, with the mapping_dict above, if a seed item was a dress, the task head 1 vector should look for Ankle boot items, task head 2 should look for shoulder bag items, and task head 3 should look for leather jacket items. The first round of mappings was pretty basic and arbitrary, but with later versions, I made a few different examples for each of our several hundred taxonomy categories based on looking at WAY TOO MANY outfits online.

The core part of Kitsune is the model I trained above plus the mapping_dict which allows us to get a vector and taxonomy category. These two pieces are what we need to search our product catalog and find relevant items.

Once I had my mapping in place, I constructed a pipeline (see below) where I could grab an item from our product catalog along with its length 50 attribute vector and taxonomy category. I run the seed item vector through my model, which results in three separate length 50 vectors. I then use the seed item’s taxonomy category to look up three target taxonomy categories in the mapping_dict, one for each of the three output vectors. Then I search our Elasticsearch-backed product catalog for similar vectors using cosine similarity, filtered down to a specific taxonomy for each of the three output vector + taxonomy pairs, which yields a sort of outfitting model.
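The lookup half of that pipeline can be sketched in memory with NumPy. Everything here is a stand-in: the random catalog replaces the real product index, the random "generated" vectors replace the model outputs, and the helper names (search_category, build_outfit) are hypothetical. The real system performs the filtered search inside the search backend rather than in Python.

```python
import numpy as np

mapping_dict = {"Dress": ["Ankle boot", "shoulder bag", "leather jacket"]}

def search_category(query, vectors, categories, category, k=2):
    """Cosine search restricted to catalog rows in one taxonomy category."""
    candidate_idx = np.flatnonzero(categories == category)
    candidates = vectors[candidate_idx]
    scores = (candidates @ query) / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    )
    best = np.argsort(-scores)[:k]
    return candidate_idx[best].tolist()

def build_outfit(seed_category, generated_vectors, vectors, categories, k=2):
    """Pair each generated vector with a mapped category and search it."""
    targets = mapping_dict[seed_category]
    return {
        category: search_category(vector, vectors, categories, category, k)
        for vector, category in zip(generated_vectors, targets)
    }

rng = np.random.default_rng(0)
catalog_vectors = rng.normal(size=(300, 50))
catalog_categories = np.array(["Ankle boot", "shoulder bag", "leather jacket"] * 100)
generated = rng.normal(size=(3, 50))  # stand-in for the three model outputs

outfit = build_outfit("Dress", generated, catalog_vectors, catalog_categories)
```

The result maps each target category to the catalog row ids of its closest matches, which is the piece that turns attribute-only vectors back into concrete products.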

This pipeline was not too bad to build, mostly because our DS team and the other engineering teams at ShopRunner have been working quite hard to improve our pipelines, databases, and our ability to leverage elasticsearch for problems like visual similarity.

Results

An interesting part of this pipeline is that I had very minimal ways to visually sense-check the results of the pipeline until I finished my pipeline to query and visualize results from elasticsearch. When I finally reached this point, I was pleasantly surprised because I found that my hypotheses had held up reasonably well!

Checking individual results

So for the example below of the seed shoe on the left, the Kitsune model took in its vector and output 3 vectors for items that Kitsune thought went well with the seed item vector. Then, those 3 vectors get paired with 3 taxonomy categories based on the mapping I built, and those two things are used to do cosine similarity searches of our ~7M item product catalog. The results are pretty reasonable/good! (I would normally say “not completely terrible”, but I realize I need to scale my reactions to things to be more generally applicable at times.)

Something that I was curious about was whether or not these models would have a tendency to yield outfits containing more neutral items, which they do, like in the sample below.

The red dress gets paired with a bunch of black items. However, if you think about it, this isn’t particularly bad behavior, since it is a reasonable way to dress. It also has the benefit of making the outfits more widely applicable to potential shoppers if the items are more generic/universal.

However, Kitsune does yield some more flavorful items, particularly when the base items are more generic in nature. So for the below brown trench coat (I guess it has been classified as a peacoat, but that is fine…) the middle row contains a number of different colored boots. For me this is fairly interesting because I initially thought that the vectors out of Kitsune would behave more like standard vectors out of our attribute models that encode things fairly concisely, but seeing that some of the vectors have a variety of colors and other characteristics was an interesting find.

Mixing and Matching Vectors and Taxonomy

So besides wondering whether or not this method would yield anything interesting, I had a number of other thoughts about it going in. An interesting use case I had been thinking about was whether or not you could generate a set of vectors, get a set of 3 taxonomy categories, and then just mix and match all of them around. So if our original set of vectors and taxonomy categories looks like:

Item 1 + taxonomy 1

Item 2 + taxonomy 2

Item 3 + taxonomy 3

and this is a valid set of items to make up an outfit, then would shifting the taxonomy categories again also yield a valid outfit, such as:

Item 1 + taxonomy 3

Item 2 + taxonomy 1

Item 3 + taxonomy 2

If this turned out to be true, then it means we can generate 3 distinct outfits for a given seed item using a single mapping.
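The shuffling described above amounts to cyclically shifting the taxonomy categories against a fixed order of vectors. Here is a small sketch of that idea; the function name and the placeholder strings are made up for illustration.

```python
def rotate_pairings(vectors, categories):
    """Yield every cyclic shift of categories against a fixed vector order."""
    n = len(categories)
    for shift in range(n):
        # shift=0 keeps the original order; shift=1 moves the last category first.
        shifted = categories[n - shift:] + categories[:n - shift]
        yield list(zip(vectors, shifted))

vectors = ["vector 1", "vector 2", "vector 3"]
categories = ["taxonomy 1", "taxonomy 2", "taxonomy 3"]
pairings = list(rotate_pairings(vectors, categories))
```

With three vectors and three categories this gives three pairings: the original one, the shifted one listed above, and one more rotation.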

The next three images are for the same blue dress on the left and same vectors out of Kitsune, but what I did was shuffle around the taxonomy categories to see how the results looked and varied. I was mostly curious if all the results looked reasonable.

This first one seems reasonable: the top vector encodes things like black, grey, and white by the looks of it, and the shoes follow a similar fairly neutral pattern.

This second one is interesting because the middle vector has some red encoded into it, which we didn’t see show up in the initial set of items. However, these all work reasonably well.

In this third set, like the second above, we see that there are red items showing up in the middle set of results, even in the first image which shows a person (it is technically for the bag they are holding, I just fetched an image with a person modeling the bag).

Closing Thoughts

The approach in Kitsune is nice because it allows us to make outfits for arbitrary items using just the seed item’s taxonomy value and a length 50 vector that encodes attribute characteristics. This means it extends to never-before-seen items and can be used in a real world product catalog if we so wished. It was a path I wanted to go down because if we were able to get reasonable looking results with just generating the attribute vectors, it meant that we could mix and match them sort of at will to create a lot of possible outfits.

One idea that came up when I showed my team this project was that a user could technically input taxonomy categories they were personally interested in rather than just the ones generated by the model. For instance, say we output a dress, high heeled shoes, and a bag, and one user says that they don’t wear high heels, but would like to see wedges as options instead. This Kitsune pipeline supports that type of flexibility and exploration.

From a technical point of view, it is interesting to me that the model used on the backend is surprisingly simple. It is just a few linear layers piped into three task heads. There is a niceness in the simplicity here since I have thought about a lot of other architectures and setups, all of which involve much heavier weight modeling techniques, but sometimes simple solutions get the job done.

Project Pendragon + Tonks: Multi-Task Feature Extraction for Farming Fate Grand Order

Michael Sugimura — Mon, 11 May 2020 02:14:10 GMT

When not blogging about data science, I work as a Senior Data Scientist at an e-commerce company called ShopRunner. Over the past year, our team has been building out large multi-task deep learning ensembles to predict relevant fashion attributes and characteristics of products within our product catalog using both images and text. Recently, our team open-sourced the main training pipelines and framework that we have been building internally to train our multi-task learners in a package called Tonks. Tonks is available to install on pypi, with the source code available on GitHub here.

As we went through the process of discussing open sourcing Tonks I realized a potential early use case could be upgrading a side project of mine where I am building a series of reinforcement learning (RL) agents to play the mobile phone game Fate Grand Order (FGO), which I’ve nicknamed Project Pendragon.

Project Pendragon has two main parts: the feature extraction pipeline, which sends inputs to the RL agents, and the RL agents themselves, which make decisions and send commands back to play the game. While my RL agents that play FGO have been repeatedly upgraded, my feature extraction pipelines are basically where they were a year ago.

This post will cover how I replaced my original feature extraction pipeline, which used three large convolutional neural networks (two ResNet 50s and a ResNet 34), with a single Tonks multi-task ResNet 50 trained with multiple datasets.

An early version of multiple RL agents playing through FGO content with multi-task network extracting information from the FGO mobile phone game.

Tonks

Our ShopRunner team has built out our PyTorch based Tonks library to help us build large multi-task network ensembles using both images and text. In most of those use cases, we care about being able to return relevant attributes about products based on the provided images, descriptions, title, and other information.

When you build multi-task learners, you are usually attacking a problem with the mindset that features learned on one task might be beneficial to another. For us in the e-commerce space, two tasks like this might be dress length and sleeve length, where two single task models would likely learn to look for lines and length while not worrying about color, pattern, and background. When tasks meet these criteria, it means we can combine the tasks into multi-task networks. In Tonks, we do this by having a core model, such as a ResNet for images or a BERT model for text, and connecting the outputs of these core models to our individual task heads.

When your tasks are all within the same domain, multi-task learning is useful because it lets you build and maintain a single model instead of using multiple single task learners. In my current pipeline, I train and use 3 large CNNs, and I previously built tailored datasets for each task. Tonks is built to handle situations where you train using tailored datasets for individual tasks, so here it helps me cut down on the number of models I need to maintain and lets me train with my previously used datasets.

Something to think about when building multi-task models is that when tasks are not within similar domains, multi-task models can suffer from destructive interference, where conflicting signals from different tasks pull the model in different directions. A discussion of how to handle this destructive interference could be a good follow-up post or talk, but is beyond the scope of this post. For my FGO use case, my gut feeling was that the problem was fairly doable since the tasks all use FGO screenshots, similar-ish text, color palettes, etc., so I would likely not have issues with destructive interference.

For more details about Tonks see our launch post.

First CNN detects the attack button, Python sends a click to it. On the next screen, another CNN detects the 5 command card types and an RL bot makes a decision based on that input.

Project Pendragon

In the Fall of 2018, I started making some basic bots to play FGO, but before I go into the details of the bots, I’ll give a quick rundown of what FGO even is. Fate Grand Order is a turn-based mobile phone game where you pick between 3 and 6 characters with various stats and abilities. You then use that team to fight through waves of enemies, until all enemies or the team is defeated.

My main motivation for building these bots is that as part of regular events in FGO, players are often required to farm (repeatedly play) levels dozens, if not hundreds, of times. For a recent event, I did 100 farming runs over a week-long period where each level takes between 3–5 minutes to play (so 5–8 hours of gameplay in total). So with that in mind, I thought building a bot to be able to do this repetitive task for me would be a great side project! This once-simple “side project” has since turned into an interesting on-again-off-again year-long rabbit hole with a lot of interesting additions like multiple reinforcement learning agents and various custom game environments.

Despite making many upgrades to the codebase around the bots and how the decisions get made, I really haven’t touched my feature extraction pipelines to get information to my RL game bots.

Feature Extractor One: Whose turn is it now?

FGO is a turn-based game, and in order for the bots to play the game, I needed to be able to detect when it was actually their turn. The way that I decided to detect when the bots’ turn was starting was to look for the “attack” button that appears on the main combat screen before command cards are picked.

Below is a sample main combat screen where the attack button is highlighted in the bottom right corner.

The attack button at the bottom right corner, highlighted.

I structured this network to have 2 classes, “attack” and “not attack,” basically meaning the network is trained to detect whether or not the attack button is present in that portion of the game screen. If it is, then it means that it is the bots’ turn and you can do useful things like take actions/use skills on the present screen or go ahead and bring up the command card screen where you can pick which of the 5 cards to play.

Picking command cards is the main combat mechanic of FGO. So once I was able to detect whose turn it was, it was time to build a classifier that would help me identify what command cards had been dealt that turn. My first bots used this information to play algorithmically while later bots were trained via reinforcement learning to choose which cards to play, but all of them needed to be able to identify which cards had been dealt as a key part of the feature extraction process.

Feature Extractor Two: What command cards were dealt?

The main combat mechanic in FGO is picking “command cards” on your turn. There are 3 types of cards: “Arts”, “Buster”, and “Quick,” and each card type does slightly different things. In each turn, 5 cards are dealt and the player has to pick 3 of them to play that turn. Below is a sample of the 5 cards presented and 3 being picked.

Sample command cards and cards being picked. The feature extraction model predicts the type of each of the 5 cards. Those predictions are then sent to a “card picker” model which decides which 3 to play.

While I could have built a detector to find the card location behind the scenes, I found an easier solution. These screens are relatively consistent and the cards are placed in the same spot, so what I opted to do is hard code the locations of the five command cards and crop them out of screenshots of the command card screen (see below for an example). Then, I passed each of the five cards through a PyTorch-trained CNN to determine the card “type”.
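Cropping fixed regions out of a screenshot is just array slicing. The sketch below uses a blank NumPy array as the screenshot; the pixel coordinates and the 1280x720 resolution are made-up placeholders, not the values used in the actual project.

```python
import numpy as np

# Hypothetical (left, top, right, bottom) pixel boxes for a 1280x720 screenshot.
CARD_BOXES = [(40 + i * 250, 400, 240 + i * 250, 650) for i in range(5)]

def crop_cards(screenshot):
    """Cut the five fixed card regions out of an H x W x 3 screenshot array."""
    return [
        screenshot[top:bottom, left:right]
        for left, top, right, bottom in CARD_BOXES
    ]

screenshot = np.zeros((720, 1280, 3), dtype=np.uint8)  # blank stand-in image
cards = crop_cards(screenshot)  # five arrays of shape (250, 200, 3)
```

Each crop can then be resized and normalized like any other image before being passed to the card type classifier.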

Colored sections are the 5 command cards dealt.

These card type classifications can then be fed into various RL agents and used to make decisions about which cards to play on that given turn. For a long time, these two networks (attack button and card type detector) were the core feature extractors for my first FGO reinforcement bot, nicknamed Pendragon Alter. They let me make a bot that could do the main combat in FGO and also play through a full game in an automated way. From this point onward, I really just had to think about what other information I would need to play the game and how to extract that from the game.

Feature Extractor Three: What wave of enemies are we on?

The final network I added into my framework is actually a wave counter that I use as an input into several different versions of my bots. The reason I added this is that FGO levels are almost always structured as having between 1 and 3 rounds of enemies that you have to fight through, and the actions an agent may want to take may depend on the round. For instance, the first wave of enemies may be relatively weak, but the third wave can be quite strong, so saving abilities for the third wave is often a good tactic.

I highlighted the round counter in red near the top of the below screenshot.

Round counter shows current round and how many total rounds

While I could have used some sort of optical character recognition to get the number, all I really cared about is telling whether it is round 1, 2, or 3, so I trained a CNN to have three classes mapping to those numbers and every time the attack button is detected, I check to see what round number it currently is.

These three networks achieve the basic feature extraction that I need, but needing to run three large CNNs simultaneously adds quite a bit of GPU or CPU computational overhead and additional storage (2 ResNet 50s are ~220MB and 1 ResNet 34 is ~84MB).

Tonks Training Pipeline

Our Tonks pipeline follows the general framework set up by Fastai, where the pipeline is organized into data loaders, models, and learners. We end up managing the multiple tasks and multiple datasets using a variety of dictionaries which help with bookkeeping.

The following sections will show code snippets that appear in the linked training notebook below and discuss what is happening in them.

The notebook I used for training is located here

Tonks Dataset and Dataloaders

Our custom Tonks dataset (line 26), FGOImageDataset, follows fairly standard PyTorch dataset layouts where we need to provide a way to index appropriate values as part of the data generator (lines 58–59), apply transformations (line 61), and return the image data and label (line 65).

Lines 11–24 get applied depending on whether we are looking at the training or validation set. I like keeping the transforms in this format, but it’s just a personal preference.

Cell 14 of the fgo_tonks notebook

Once we have created our custom Tonks dataset class, we can start creating our training and validation datasets for our three tasks. This part of the process is also fairly similar to a standard PyTorch training pipeline, where you have to place your training and validation splits into datasets and then eventually into dataloaders. The only difference here is that we have three datasets instead of the normal one.

Lines 1–26 show how we prepare train and validation splits for each of the three tasks. Lines 28–37 show how we make a dictionary of dataloaders, which we use in the next steps to make our Tonks multi-task multi-dataset dataloaders.

In this above section of code, we are creating training and validation datasets for each task using the custom dataset class I showed previously. This involves specifying the x and y inputs (file paths to images and their labels) as well as what transforms we would like to apply. In this pipeline, I just have randomized crops in the training set with ImageNet normalization and only normalization in the validation sets. Once the datasets are created, we make a dictionary of base PyTorch dataloaders where the keys are the names of the tasks and the values are the dataloaders associated with those tasks. The idea here is that we can keep track of which dataset we should generate batches for as part of our multi-dataset training pipeline.
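For readers without the notebook open, the general pattern (one dataset per task, collected into a dictionary of plain PyTorch dataloaders keyed by task name) looks roughly like the sketch below. It is not the notebook's code: the dataset class here indexes in-memory tensors rather than image file paths, the task names are hypothetical, and the data is random.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TensorImageDataset(Dataset):
    """Index into inputs and labels, applying an optional transform."""

    def __init__(self, x, y, transform=None):
        self.x, self.y, self.transform = x, y, transform

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        image = self.x[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.y[index]

# Hypothetical task names with the class counts described in this post.
task_classes = {"attack_button": 2, "card_type": 3, "round_number": 3}

train_dataloaders = {}
for task, n_classes in task_classes.items():
    images = torch.rand(16, 3, 32, 32)           # random stand-in images
    labels = torch.randint(0, n_classes, (16,))  # random stand-in labels
    dataset = TensorImageDataset(images, labels)
    train_dataloaders[task] = DataLoader(dataset, batch_size=4, shuffle=True)

batch_images, batch_labels = next(iter(train_dataloaders["card_type"]))
```

A validation dictionary would be built the same way with its own splits and transforms.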

The below snippet shows how we place our two dictionaries of PyTorch dataloaders into two Tonks MultiDatasetLoaders. These Tonks dataloaders are what we use to integrate with our multi-task multi-dataset training pipeline.

Tonks MultiDatasetLoaders

Example Tonks Network Architecture

The next major piece of the pipeline is the model. While we provide some sample image and text model architectures, it might make sense to customize the architectures to your needs. For me, I just made a simple ResNet50-based architecture and connected that to the individual task layers. In Tonks, we handle this part with two PyTorch ModuleDicts called pretrained_classifiers and new_classifiers. The idea we had here is that the first time you train a network, we send tasks to the new_classifiers dictionary, and when we save the trained network, these get moved to the pretrained_classifiers dictionary for saving. Then on subsequent runs, we can load pretrained task heads into the pretrained_classifiers dictionary. This helps us keep track of where to apply learning rates, since you might want different learning rates depending on whether a task head has previously been fine-tuned.
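The bookkeeping described above can be sketched with two ordinary dictionaries. This is only an illustration of the idea, not the Tonks implementation: the TaskHeadRegistry class, its method names, and the task names are hypothetical, and integers (number of categories) stand in for the actual classifier layers.

```python
from typing import Dict, Optional


class TaskHeadRegistry:
    """Hypothetical sketch of tracking new vs. previously trained task heads."""

    def __init__(self, new_task_dict: Optional[Dict[str, int]] = None,
                 pretrained_task_dict: Optional[Dict[str, int]] = None):
        # Values are the number of categories per task (stand-ins for layers).
        self.new_classifiers = dict(new_task_dict or {})
        self.pretrained_classifiers = dict(pretrained_task_dict or {})

    def status(self, task: str) -> str:
        # Tells the caller which group (and so which learning rate) a head is in.
        if task in self.new_classifiers:
            return "new"
        if task in self.pretrained_classifiers:
            return "pretrained"
        raise KeyError(task)

    def save(self) -> Dict[str, int]:
        # On save, freshly trained heads move into the pretrained group.
        self.pretrained_classifiers.update(self.new_classifiers)
        self.new_classifiers = {}
        return dict(self.pretrained_classifiers)


# First run: every task is new.
first_run = TaskHeadRegistry(new_task_dict={"task_a": 3, "task_b": 5})
saved_heads = first_run.save()

# Later run: reload the saved heads and add one more new task.
second_run = TaskHeadRegistry(new_task_dict={"task_c": 4},
                              pretrained_task_dict=saved_heads)
```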

Loading Tonks Model

The main input when we load an instance of a Tonks model is the task dictionary. There are two potential ones: the first is a new_task_dict, which is the one you should use the first time you are training a model, while the second is a pretrained_task_dict, which is where you would place tasks that already have Tonks pre-trained weights. For us at ShopRunner, this is useful because we can now potentially add new tasks to existing models with ease.

For this FGO example, I have three tasks that are all new. I build a dictionary called new_task_dict that maps each task name to the number of categories in that task and feed it into the model class when I initialize it. This is what tells the Tonks model to create three task heads with a certain number of nodes in each.

Training

Once we have our model initialized, all we need to do before we kick off our training run is define a loss function for the various tasks, specify an optimizer, assign learning rates, create our learner, and call the fit function.

For this pipeline, each of the tasks is a multi-class problem, so we use a cross entropy loss. When we assign learning rates, we may assign different learning rates to different sections of the model. For the main ResNet section, we assign a low 1e-4 learning rate, but for the new sections, we assign a more aggressive 1e-2 learning rate. The main idea behind this is that we don’t really want to drastically change the ImageNet weights in the ResNet50 core model, but since the new classifier layers are randomly initialized, we can let them be adjusted more aggressively. Then we define a scheduler to decrease the learning rate every 2 epochs.
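To make the learning rate setup concrete, here is a small dependency-free sketch of two parameter groups with a step schedule. The 1e-4 and 1e-2 starting rates and the 2-epoch step come from the text above; the decay factor of 0.1 and the step_lr helper are assumptions for illustration, since the post does not say how much the rate is decreased.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 2, gamma: float = 0.1) -> float:
    # Multiply the base rate by `gamma` once every `step_size` epochs.
    # gamma=0.1 is an assumed value, not taken from the notebook.
    return base_lr * gamma ** (epoch // step_size)


# One starting learning rate per section of the model.
base_rates = {"resnet": 1e-4, "new_classifiers": 1e-2}

# Learning rate of each section for the first four epochs.
schedule = {
    name: [step_lr(lr, epoch) for epoch in range(4)]
    for name, lr in base_rates.items()
}
```

With these assumed settings, both groups keep their starting rate for epochs 0 and 1 and then drop by a factor of ten for epochs 2 and 3, so the new classifier layers always train about 100 times faster than the ResNet core.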

Once that is done, we can define our learner using the Tonks MultiTaskLearner class. This class contains all the functionality we need to train our models and takes in the model architecture we loaded previously, the training and validation Tonks dataloaders, and the task dictionary, which contains a mapping of all our tasks which is used to retrieve batches from our dataloaders.

Finally, we can call fit() on our learner. For details on the different arguments, you can check our Read the Docs page.

Results

Once I finished training the three-task Tonks network, I just had to replace the three networks in my Project Pendragon repos with the new multi-task network. Since Tonks is built on PyTorch, in order to use a model, all you need to keep track of is the model architecture and the weights file, so you don’t necessarily need to install Tonks and all of its dependencies to use the trained models in new projects.

The Tonks model performs strongly across all tasks and so far I have not had any issues in deployment in my FGO pendragon game interface. I have been using the Tonks model for my other recent developments as I continue to improve my RL agents. The most recent installment is getting the agents to play using coordinated strategies.

Agents coordinating to clear difficult content. Feature extraction for the agents and bots is based on the multi-task Tonks model trained here.

So while this example is a cute one based on a mobile phone game, the reasons to use a framework like Tonks are the same here as they are when we look at industrial scale problems. It removed my need to maintain multiple single task learner networks and I was able to train quickly and easily because I was able to use three existing datasets I had previously built.

Tonks is a library that our ShopRunner Data Science team has been using to build industrial scale multi-task deep learning ensembles using both images and text trained with multiple datasets. It has made it relatively straightforward for our team to create multi-task models to meet new and varying needs. Since building multi-task learners is a very real-world need, but one that is not currently supported, we open-sourced our work to help give back to the data science community.

Project Pendragon + Tonks: Multi-Task Feature Extraction for Farming Fate Grand Order was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Tonks: Building One (Multi-Task) Model to Rule Them All!

Michael Sugimura — Tue, 28 Apr 2020 16:01:38 GMT

Co-written by Nicole Carlson and Michael Sugimura

NOTE: Our team previously had a tradition of naming projects with terms or characters from the Harry Potter series, but we are disappointed by J.K. Rowling’s persistent transphobic comments. In response, we renamed the Tonks Library as Octopod. More details on that process here.

Intro

Nicole Carlson and Michael Sugimura are the lead developers on Tonks, a multi-task deep learning library (pypi, github). This post is the story of how we built this library together. We will discuss technical details of the library as well as interpersonal challenges we faced along the way. This project ended up being a rich experience for us, in ways we never could have guessed.

What is Tonks?

Tonks is a library that streamlines the training of multi-task PyTorch networks. It supports training with multiple task-specific datasets, multiple inputs, and ensembles of multi-task networks.

We started building Tonks to meet our need to build multi-task networks at scale. At ShopRunner, we have millions of products aggregated from 100+ retailers. In order to facilitate better browsing and search, we need to label attributes such as color and pattern for each product.

We first considered building individual task networks for each task, e.g. a color neural network, a pattern neural network, a season neural network, etc. However, we quickly realized that maintaining that many models would be difficult.

We decided to build one multi-task model that could predict all of our attributes using both images and text. In our fashion domain, leveraging both images and text of products boosts the performance of our models, so we had to be able to ensemble image and text models together. To meet all of these criteria, we made a library, Tonks, to use as a training framework for the multi-task, multi-input models we use in production.

One issue that often arises with multi-task networks is that other libraries require you to have one dataset with every attribute labeled. Our library allows you to train each task with a different dataset in the same neural network. For example, you can train one network to predict both pants length and dress length from two separate labeled datasets of pants and dresses.

In our initial release of Tonks, we are open sourcing our pipelines, data loaders, and some sample model classes. We’ve also created tutorials for training image models, text models, and ensembles of image and text models. Now we’ll tell you how we actually built Tonks.

Initial R&D

Michael: The core functionality of Tonks is building multi-task models using the PyTorch deep learning framework, but one of the major problems we had to solve was how to train a multi-task network with multiple datasets simultaneously. Unlike multi-task training, multi-dataset training is talked about less since it is a less common research use case, but it does make sense for industry applications.

Developing this multi-dataset multi-task pipeline took a good bit of R&D. During that time I took inspiration from Stanford Dawn and their blog about training multi-task NLP models, and I relistened to Andrew Ng discussing it in his 2017 deep learning course more than a few times while I was stuck in research mode. After a lot of trial and error, I was able to get a methodology for multi-dataset multi-task training working:

1. Prepare all of your datasets and place them into data loaders (PyTorch data generators).
2. Randomly shuffle the batches of the various datasets.
3. Sample a batch randomly without replacement and feed it through our deep learning model.
4. Calculate the loss for that specific batch based on the outputs from that batch’s matching task head. For example, if a batch from a pattern dataset is selected, we only calculate the loss based on the pattern task’s output and use that for backpropagation. At every batch the model is predicting all outputs, but is only rewarded/punished for its decisions on the relevant task.

An epoch consists of repeating steps 3 and 4 until all batches have been sampled.
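The steps above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the Tonks training loop: run_epoch, the constant predict function, and the error_rate loss are hypothetical, and the backpropagation step is only marked with a comment.

```python
import random
from typing import Callable, Dict, List, Tuple

Batch = List[Tuple[str, int]]  # list of (input, label) pairs


def run_epoch(loaders: Dict[str, List[Batch]],
              predict: Callable[[List[str]], Dict[str, List[int]]],
              loss_fns: Dict[str, Callable[[List[int], List[int]], float]],
              seed: int = 0) -> List[Tuple[str, float]]:
    rng = random.Random(seed)
    # Steps 1-2: gather every (task, batch) pair and shuffle them together.
    schedule = [(task, batch) for task, task_batches in loaders.items()
                for batch in task_batches]
    rng.shuffle(schedule)
    history = []
    # Steps 3-4: each batch is drawn exactly once (sampling without replacement).
    for task, batch in schedule:
        inputs = [x for x, _ in batch]
        labels = [y for _, y in batch]
        outputs = predict(inputs)  # the model predicts for every task head
        # Only the head matching the batch's task contributes to the loss.
        loss = loss_fns[task](outputs[task], labels)
        history.append((task, loss))
        # A real pipeline would backpropagate `loss` and update weights here.
    return history


def predict(inputs: List[str]) -> Dict[str, List[int]]:
    # Toy "model": always predicts class 1 for pattern and class 2 for color.
    return {"pattern": [1 for _ in inputs], "color": [2 for _ in inputs]}


def error_rate(preds: List[int], labels: List[int]) -> float:
    # Toy loss: fraction of wrong predictions in the batch.
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


loaders = {
    "pattern": [[("p0", 1), ("p1", 0)], [("p2", 1)]],
    "color": [[("c0", 2), ("c1", 2)]],
}
history = run_epoch(loaders, predict, {"pattern": error_rate, "color": error_rate})
```

The order of entries in history depends on the shuffle, but every batch appears once per epoch and each loss is computed from its own task's output only.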

See below for a graphic walkthrough of this process:

Graphic showing multi-dataset multi-task training pipeline.

After working out this part of the process, the code was really just in rough notebook form. Around this time Nicole ended up finishing up on another workstream and was able to assist me in turning the notebooks into production ready code!

Refactoring the Code

Nicole: Michael’s first version of our attribute multi-task model used only images. We always knew we wanted to add in the text, but he was busy with other work. Since I was interested in NLP, I offered to work on adding text to our attribute model.

As Michael mentioned above, his work was mostly in Jupyter notebooks. As I started digging through his code, the software engineer in me couldn’t help but want to get rid of repeated code that was copied into multiple notebooks. I also knew that if I broke the code out into functions and classes that it would help me understand what was happening.

I ended up completely refactoring the code from Michael’s notebooks into a python library. I moved all of the training/evaluation code into a learner class. I also created custom dataloaders that did the necessary preprocessing for our models. Once I had the library refactored, it was pretty straightforward to add in a text component to the original attribute model architecture.

I initially refactored the code for my own understanding, but I decided to share it with Michael to see if he thought it would be useful to add in to Tonks. Unfortunately, this refactoring was completely outside the scope of my ticket. I was supposed to just add in the text component, but I had rewritten the entire library. And I had not communicated that I was doing this to Michael. I definitely regret the way that I handled this; I should have communicated with Michael much earlier in the project.

Github Pull Request for the PR where I (Nicole) refactored everything. As you can see by the initial name, I was only supposed to train a new model, not change all 33 files in the library!

Michael: With our very different styles, it took Nicole and me a bit to hit our groove for working as a team. So when Nicole started the refactoring process, my first knee-jerk reaction was to be slightly defensive, since it was outside of the current scope of work. After sitting back and thinking about it for a bit, I realized that Nicole’s refactoring work was better for Tonks as a project overall, since Nicole is a very talented engineer and was helping to get Tonks to a productionalized state. While I knew this was good for Tonks, I felt like I was dead weight for this part of the process because I wasn’t actively contributing or helping Nicole with refactoring. This was a breakdown in communication on my part: I could have helped get the code to an initially stronger state, and I could have tried to engage more along the way to help Nicole out with the whole process.

Michael and Nicole shaking fists towards each other from two panels of a zoom call.

When I finally got to see and use Nicole’s refactored pipelines, my personal feeling was that it was like seeing fire for the first time. It was elegant, simple but complex, and very powerful. From my point of view, my hacky R&D gave our Tonks project an overall shape, but Nicole’s refactoring is what gave it a heart and soul.

Nicole: The ironic thing about the refactor was that I was much more impressed with Michael’s work than my own. When I looked at Michael’s code, I was completely blown away by the work he had done. Randomly sampling through the different tasks during training was so elegant, and I knew I could never have come up with that by myself. I was also relatively new to PyTorch so I was amazed at how easily Michael had built a model architecture that could use both images and text. I felt like I wasn’t really contributing much to the project since I had only refactored some code, not done any of the R&D work.

We actually each separately went to our boss with our fears that we weren’t contributing to this project. She encouraged us to talk directly and express how much we appreciated the other’s work. This was another lesson for us about being more communicative with one another and valuing the fact that we each brought different strengths and weaknesses to the project.

Adding a New Attribute

Example of an original attribute model architecture with two attributes: Color and Season.

Nicole: Our original attribute model had four attributes. After refactoring the model, we were confident that it would be relatively straightforward for us to add in a fifth attribute. Unfortunately, that was not the case. As soon as I added in the fifth attribute, the performance of one of the other attributes would degrade. I could not get high performance across all five attributes. I trained single-task models for each attribute to get baselines for each task, but the multi-task model could not get close. So I went to Michael for help.

Michael: This whole thing was both very interesting and also terrifying, since most multi-task literature just discusses how networks improve with additional tasks that fall within the same domain. Much like detective work, we really needed a clue to help get us to a breakthrough. Ours came from the same Stanford blog I had initially used as inspiration for our Tonks pipeline. They mentioned a problem called “destructive interference” between tasks and how they dealt with it for NLP competition leaderboard purposes. Looking into “destructive interference”, I found that it is a problem in multi-task networks where unrelated or weakly related tasks can pull a network in opposing directions when trying to optimize the weights. For that bit of research, this paper section 3.1 was helpful.

So the symptoms that we were seeing in our multi-task models matched the literature around destructive interference. Now that we knew our foe, all Nicole and I had to do was figure out a way to best it.
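As a toy illustration of tasks pulling shared weights in opposing directions, consider a single shared weight w and two made-up tasks whose losses are minimized at w = 1 and w = -1. This example is an assumption for illustration only; it is not taken from the Stanford blog, the paper, or our attribute models.

```python
from typing import Callable, List


def loss_a(w: float) -> float:
    return (w - 1.0) ** 2  # task A wants w = 1


def loss_b(w: float) -> float:
    return (w + 1.0) ** 2  # task B wants w = -1


def grad_a(w: float) -> float:
    return 2.0 * (w - 1.0)  # derivative of loss_a


def grad_b(w: float) -> float:
    return 2.0 * (w + 1.0)  # derivative of loss_b


def train_shared(grads: List[Callable[[float], float]], w: float = 0.5,
                 lr: float = 0.1, steps: int = 50) -> float:
    # Plain gradient descent on the sum of the given task gradients.
    for _ in range(steps):
        w -= lr * sum(g(w) for g in grads)
    return w


w_joint = train_shared([grad_a, grad_b])  # both tasks share the weight
w_only_a = train_shared([grad_a])         # task A trained on its own
```

Trained alone, task A drives its loss toward zero. Trained jointly, the two gradients cancel near w = 0 and both tasks are left with a loss close to 1, which mirrors the kind of gap between single-task baselines and the multi-task model described above.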

Multiple ResNets

Example of a multiple ResNet ensemble model where tasks are separated if they cause destructive interference.

Nicole: After Michael discovered the destructive interference, we realized that our best solution was to have multiple ResNets in our final ensemble model. I modified the ensemble model class to accommodate this new architecture, and we finally had a model that retained high performance with new tasks.

This problem also drove home the lesson that Michael and I were much stronger as a team. Michael did the research to name and solve our problem, and I modified our library to incorporate the solution.

Conclusion

Michael and Nicole pointing finger guns at each other from two Zoom panels.

We hope that deep learners everywhere will enjoy using our library. This was a great learning experience for us, and it really proved that having people with opposite strengths work together was more powerful than either of us working alone. Although we had some communication issues along the way, we’ve come out of this with a much stronger working relationship. Even as we wrote this blog post, we realized we were repeating the same pattern: Michael wrote the initial draft and Nicole edited the text. This time around, we communicated with one another before making changes to the other person’s work!

Breaking the Game: Pendragon Four Rise of Merlin

Michael Sugimura — Tue, 25 Feb 2020 14:44:19 GMT

Multi-agent reinforcement learning for the mobile phone game Fate Grand Order adding in powerful supports to clear late-game content

Continue reading on TDS Archive »

Pendragon Four: Training Pipeline Deeper Dive for Multi Agent Reinforcement Learning

Michael Sugimura — Mon, 17 Feb 2020 14:50:03 GMT

Deeper dive into training multiple RL agents simultaneously to play the mobile phone game Fate Grand Order

Continue reading on TDS Archive »

Pendragon Four: Multi-Agent Reinforcement Learning with Fate Grand Order

Michael Sugimura — Tue, 17 Dec 2019 15:32:07 GMT

Multi-agent reinforcement learning in a custom game environment to train 4 agents and have them play the mobile phone game Fate Grand Order

Continue reading on TDS Archive »

This Dress Doesn’t Exist

Michael Sugimura — Wed, 09 Oct 2019 17:23:02 GMT

Fine tuning GPT-2 and StyleGAN for a fashion use case to generate synthetic product images and descriptions

Continue reading on TDS Archive »

Building and Labeling Image Datasets for Data Science Projects

Michael Sugimura — Mon, 09 Sep 2019 11:30:43 GMT

Some tips and tricks for building image datasets

Continue reading on TDS Archive »

BERT Classifier: Just Another Pytorch Model

Michael Sugimura — Mon, 10 Jun 2019 13:53:47 GMT

Building a custom Pytorch pipeline for a BERT classifier to figure out how BERT works piece by piece

Continue reading on TDS Archive »

Two Tasks, Two Datasets, One Network: Multi-task Learning with DnD

Michael Sugimura — Mon, 03 Jun 2019 12:45:37 GMT

Multi-task learning with multiple datasets to learn multiple tasks.

Continue reading on TDS Archive »