Hexagonal icon that says Tonks and has five different hairstyles pictured all in neon colors.

Tonks: Building One (Multi-Task) Model to Rule Them All!

Co-written by Nicole Carlson and Michael Sugimura

9 min read · Apr 28, 2020


NOTE: Our team previously had a tradition of naming projects with terms or characters from the Harry Potter series, but we are disappointed by J.K. Rowling’s persistent transphobic comments. In response, we renamed the Tonks Library as Octopod. More details on that process here.


Nicole Carlson and Michael Sugimura are the lead developers on Tonks, a multi-task deep learning library (pypi, github). This post is the story of how we built this library together. We will discuss technical details of the library as well as interpersonal challenges we faced along the way. This project ended up being a rich experience for us, in ways we never could have guessed.

What is Tonks?

Tonks is a library that streamlines the training of multi-task PyTorch networks. It supports training with multiple task-specific datasets, multiple inputs, and ensembles of multi-task networks.

We started building Tonks to meet our need to build multi-task networks at scale. At ShopRunner, we have millions of products aggregated from 100+ retailers. In order to facilitate better browsing and search, we need to label attributes such as color and pattern for each product.

We first considered building individual task networks for each task, e.g. a color neural network, a pattern neural network, a season neural network, etc. However, we quickly realized that maintaining that many models would be difficult.

We decided to build one multi-task model that could predict all of our attributes using both images and text. In our fashion domain, leveraging both images and text of products boosts the performance of our models, so we had to be able to ensemble image and text models together. To meet all of these criteria, we made a library, Tonks, to use as a training framework for the multi-task, multi-input models we use in production.

One issue that often arises with multi-task networks is that most libraries require a single dataset in which every example is labeled with every attribute. Our library allows you to train each task on a different dataset within the same neural network. For example, you can train one network to predict both pants length and dress length from two separate labeled datasets of pants and dresses.

In our initial release of Tonks, we are open sourcing our pipelines, data loaders, and some sample model classes. We’ve also created tutorials for training image models, text models, and ensembles of image and text models. Now we’ll tell you how we actually built Tonks.

Initial R&D

Michael: The core functionality of Tonks is building multi-task models using the PyTorch deep learning framework, but one of the major problems we had to solve was how to train a multi-task network with multiple datasets simultaneously. Unlike multi-task training, multi-dataset training is discussed far less often, since it is a less common research use case, but it makes a lot of sense for industry applications.

Developing this multi-dataset, multi-task pipeline took a good bit of R&D. During that time I drew inspiration from Stanford DAWN and their blog post about training multi-task NLP models, and relistened to Andrew Ng discussing multi-task learning in his 2017 deep learning course more than a few times while I was stuck in research mode. After a lot of trial and error, I arrived at a working methodology for multi-dataset, multi-task training:

  1. Prepare all of your datasets and place them into data loaders (PyTorch data generators).
  2. Randomly shuffle the batches of the various datasets.
  3. Sample a batch randomly without replacement and feed it through the deep learning model.
  4. Calculate the loss for that specific batch based on the outputs from that specific batch’s matching task head. For example, if a batch from a pattern dataset is selected, we only calculate the loss based on the pattern task’s output and use that for backpropagation. At every batch the model is predicting all outputs, but is only rewarded/punished for its decisions on the relevant task.

An epoch consists of repeating steps 3 and 4 until all batches have been sampled.
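The four steps above can be sketched as a training loop like the one below. This is a minimal illustration, not the actual Tonks API: `train_one_epoch` and its arguments are hypothetical names, and the model is assumed to return a dict mapping each task name to its head's output.

```python
import random

import torch


def train_one_epoch(model, dataloaders, loss_fns, optimizer, device="cpu"):
    """Steps 2-4: interleave batches from every task's dataset, and for
    each sampled batch, backpropagate only through that task's head."""
    # Step 2: collect every batch from every dataset into one schedule,
    # tagged with its task, then shuffle. Iterating the shuffled schedule
    # is equivalent to sampling batches randomly without replacement.
    schedule = [(task, batch)
                for task, loader in dataloaders.items()
                for batch in loader]
    random.shuffle(schedule)

    model.train()
    for task, (inputs, labels) in schedule:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        # Step 3: the model predicts outputs for *every* task head...
        outputs = model(inputs)  # assumed: {task_name: logits}
        # Step 4: ...but the loss (and hence the gradient) comes only
        # from the head matching this batch's task, so the model is only
        # rewarded/punished for the relevant task.
        loss = loss_fns[task](outputs[task], labels)
        loss.backward()
        optimizer.step()
```

Note that this sketch materializes every batch up front for simplicity; for large datasets you would shuffle a schedule of task names instead and pull batches lazily from per-task iterators.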

See below for a graphic walkthrough of this process:

Graphic showing multi-dataset multi-task training pipeline.

After working out this part of the process, the code was still in rough notebook form. Around this time, Nicole finished another workstream and was able to help me turn the notebooks into production-ready code!

Refactoring the Code

Nicole: Michael’s first version of our attribute multi-task model used only images. We always knew we wanted to add in the text, but he was busy with other work. Since I was interested in NLP, I offered to work on adding text to our attribute model.

As Michael mentioned above, his work was mostly in Jupyter notebooks. As I started digging through his code, the software engineer in me couldn’t help but want to get rid of repeated code that was copied into multiple notebooks. I also knew that if I broke the code out into functions and classes that it would help me understand what was happening.

I ended up completely refactoring the code from Michael's notebooks into a Python library. I moved all of the training/evaluation code into a learner class. I also created custom dataloaders that did the necessary preprocessing for our models. Once I had refactored the library, it was pretty straightforward to add a text component to the original attribute model architecture.

I initially refactored the code for my own understanding, but I decided to share it with Michael to see if he thought it would be useful to add into Tonks. Unfortunately, this refactoring was completely outside the scope of my ticket. I was supposed to just add the text component, but I had rewritten the entire library, and I had not communicated to Michael that I was doing this. I definitely regret the way I handled this; I should have communicated with Michael much earlier in the project.

Screenshot of Github Pull Request (PR) with the title “DS 1420 train dress model”. The PR has 74 commits and 33 changed files
Github Pull Request for the PR where I (Nicole) refactored everything. As you can see by the initial name, I was only supposed to train a new model, not change all 33 files in the library!

Michael: With our very different styles, it took Nicole and me a while to hit our groove as a team. When Nicole started the refactoring process, my knee-jerk reaction was to get slightly defensive, since it was outside the current scope of work. After sitting back and thinking about it, I realized that Nicole's refactoring was better for Tonks as a project: she is a very talented engineer and was getting Tonks to a productionized state. Even so, I felt like dead weight during this part of the process because I wasn't actively contributing to or helping with the refactor. That was a breakdown in communication on my part; I could have helped get the code to a stronger initial state and engaged more along the way to support Nicole through the process.

Michael and Nicole shaking fists at each other in mock anger from two panels of a Zoom call.

When I finally got to see and use Nicole’s refactored pipelines, my personal feeling was that it was like seeing fire for the first time. It was elegant, simple but complex, and very powerful. From my point of view, my hacky R&D gave our Tonks project an overall shape, but Nicole’s refactoring is what gave it a heart and soul.

Nicole: The ironic thing about the refactor was that I was much more impressed with Michael’s work than my own. When I looked at Michael’s code, I was completely blown away by the work he had done. Randomly sampling through the different tasks during training was so elegant, and I knew I could never have come up with that by myself. I was also relatively new to PyTorch so I was amazed at how easily Michael had built a model architecture that could use both images and text. I felt like I wasn’t really contributing much to the project since I had only refactored some code, not done any of the R&D work.

We actually each separately went to our boss with our fears that we weren’t contributing to this project. She encouraged us to talk directly and express how much we appreciated the other’s work. This was another lesson for us about being more communicative with one another and valuing the fact that we each brought different strengths and weaknesses to the project.

Adding a New Attribute

Network diagram showing an architecture where all tasks were built from single image and text models and ensembled
Example of an original attribute model architecture with two attributes: Color and Season.

Nicole: Our original attribute model had four attributes. After refactoring the model, we were confident that it would be relatively straightforward to add a fifth. Unfortunately, that was not the case. As soon as I added the fifth attribute, the performance of one of the other attributes would degrade, and I could not get high performance across all five attributes. I trained single-task models to get a baseline for each task, but the multi-task model could not come close to those baselines. So I went to Michael for help.

Michael: This whole thing was both very interesting and terrifying, since most multi-task literature only discusses how networks improve with additional tasks from the same domain. Much like detective work, we needed a clue to reach a breakthrough. Ours came from the same Stanford blog that had initially inspired our Tonks pipeline: it mentioned a problem called "destructive interference" between tasks and how the authors dealt with it for NLP leaderboard purposes. Digging in, I found that destructive interference arises in multi-task networks when unrelated or weakly related tasks pull the shared weights in opposing directions during optimization. Section 3.1 of this paper was helpful for that bit of research.

The symptoms we were seeing in our multi-task models matched the literature on destructive interference. Now that we knew our foe, all Nicole and I had to do was figure out a way to best it.
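One way to see destructive interference concretely is to compare the gradients that two tasks produce on the shared weights: if the gradients point in opposing directions (negative cosine similarity), each task's update partially undoes the other's. Below is a small diagnostic sketch along those lines; `gradient_conflict` is a hypothetical helper, not part of Tonks.

```python
import torch


def gradient_conflict(shared_params, loss_a, loss_b):
    """Cosine similarity between two tasks' gradients on the shared
    parameters. Values near -1 mean the tasks pull the shared weights
    in opposing directions (destructive interference); values near +1
    mean the tasks reinforce each other."""
    flat = []
    for loss in (loss_a, loss_b):
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        flat.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.nn.functional.cosine_similarity(flat[0], flat[1], dim=0)
```

In a real training loop you would pass the shared backbone's parameters and each task head's loss; persistently negative similarity between a pair of tasks is a hint that they belong on separate backbones.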

Multiple ResNets

Network diagram where separate image networks’ features are ensembled with a BERT model for final predictions.
Example of a multiple ResNet ensemble model where tasks are separated if they cause destructive interference.

Nicole: After Michael discovered the destructive interference, we realized that our best solution was to have multiple ResNets in our final ensemble model. I modified the ensemble model class to accommodate this new architecture, and we finally had a model that retained high performance with new tasks.
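The shape of that architecture can be sketched as follows. This is a simplified stand-in for the actual Tonks ensemble class: the constructor arguments and `task_to_group` mapping are placeholder names, and in practice each image backbone would be a torchvision ResNet with its classifier removed and the text encoder a BERT model.

```python
import torch
import torch.nn as nn


class MultiBackboneEnsemble(nn.Module):
    """Sketch: one image backbone per task *group*, a shared text
    encoder, and one classifier head per task. Tasks that destructively
    interfere are assigned to different groups, so they no longer share
    image weights, while the text encoder stays shared."""

    def __init__(self, image_backbones, text_encoder, task_to_group, heads):
        super().__init__()
        self.image_backbones = nn.ModuleDict(image_backbones)  # group -> CNN
        self.text_encoder = text_encoder                       # shared
        self.task_to_group = task_to_group                     # task -> group
        self.heads = nn.ModuleDict(heads)                      # task -> classifier

    def forward(self, image, text):
        text_feats = self.text_encoder(text)
        # Run each image backbone once per batch, then route its
        # features only to the tasks in its group.
        group_feats = {g: net(image) for g, net in self.image_backbones.items()}
        # Each head sees its group's image features concatenated with
        # the shared text features.
        return {task: self.heads[task](
                    torch.cat([group_feats[group], text_feats], dim=1))
                for task, group in self.task_to_group.items()}
```

Moving a misbehaving task into its own group costs an extra backbone's worth of parameters and compute, but keeps the rest of the tasks sharing weights, which is the trade-off that restored performance for us.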

This problem also drove home the lesson that Michael and I were much stronger as a team. Michael did the research to name and solve our problem, and I modified our library to incorporate the solution.


Michael and Nicole pointing finger guns at each other and smiling from two panels of a Zoom call.

We hope that deep learners everywhere will enjoy using our library. This was a great learning experience for us, and it really proved that having people with opposite strengths work together was more powerful than either of us working alone. Although we had some communication issues along the way, we’ve come out of this with a much stronger working relationship. Even as we wrote this blog post, we realized we were repeating the same pattern: Michael wrote the initial draft and Nicole edited the text. This time around, we communicated with one another before making changes to the other person’s work!