Catalyst dev blog - 20.07 release

Published in PyTorch, Jul 16, 2020 (5 min read)

Hi, I am Sergey, the author of Catalyst, a PyTorch library for deep learning research and development. In our previous blog posts, we covered an introduction to Catalyst and our advanced NLP pipeline for BERT distillation. In this post, I would like to share our development progress over the last month. Let’s see what features we have added to the framework in such a short time.

tl;dr

Training Flow improvements: BatchOverfitCallback, PeriodicLoaderCallback, ControlFlowCallback
Metric Learning features: InBatchTripletsSampler, AllTripletsSampler, HardTripletsSampler, upcoming tutorial
Fixes and acknowledgments
New integrations: MONAI & Catalyst
Ecosystem update — Alchemy

You can find all examples from this blog post in this Google Colab.

Training Flow improvements

BatchOverfitCallback

A good deep learning framework needs more than cool engineering features such as distributed support, half-precision training, and metrics (we already have them). It also has to address the common difficulties that come up during experimentation.

Imagine a typical research situation: you wrote your fancy pipeline, got the dataset, and tried to fit your model to this data. But something goes wrong and you can’t get the desired results.

One potential cause is a problem with pipeline convergence. You could subsample your data and check that the model easily overfits this subset. But doing it again and again across all your projects? Looks like we need a general solution for this problem. And here comes our BatchOverfitCallback (contributed by Scitator). The idea behind it is straightforward: take only a requested number of batches from your dataset and use only them for training.
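To make the idea concrete, here is a minimal plain-Python sketch of batch overfitting. It is not the Catalyst implementation: `limit_batches`, `overfit_loaders`, and the toy list-based loaders are hypothetical names used only for illustration, and any iterable of batches is assumed to stand in for a real data loader.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def limit_batches(loader: Iterable, num_batches: int) -> Iterator:
    """Yield only the first `num_batches` batches of `loader`."""
    return islice(loader, num_batches)


def overfit_loaders(
    loaders: Dict[str, Iterable],
    limits: Dict[str, int],
    default: int = 1,
) -> Dict[str, List]:
    """Keep a small, fixed subset of batches for every loader.

    Loaders without an entry in `limits` keep `default` batches
    (a choice made for this sketch only).
    """
    return {
        name: list(limit_batches(loader, limits.get(name, default)))
        for name, loader in loaders.items()
    }


# Toy "loaders": each batch is just a list of two sample ids.
train_loader = [[i, i + 1] for i in range(0, 100, 2)]  # 50 batches
valid_loader = [[i, i + 1] for i in range(0, 40, 2)]   # 20 batches

small = overfit_loaders(
    {"train": train_loader, "valid": valid_loader},
    limits={"train": 10, "valid": 5},
)
```

Materialising the selected batches into a list keeps them identical across epochs even when the underlying loader shuffles, which is what you want when checking that a model can memorise a tiny subset.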

So, let’s look at a typical deep learning pipeline:

[Code listing: Catalyst pipeline setup]

You can run it with:

[Code listing: Catalyst experiment run]

Thanks to the update, you can check your pipeline convergence with only one extra line:

[Code listing: run with the `overfit` flag]

This way you can easily debug your experiment without extra code refactoring. You can also set the required number of batches per loader:

[Code listing: run with `BatchOverfitCallback`]

Even better, we have integrated this feature into our Config API. You can use it with:

catalyst-dl run --config=/path/to/config --overfit

PeriodicLoaderCallback

During your research, you may find yourself in a situation where you have only a few train samples and a huge test set to check your model performance. Alternatively, you may have a computationally heavy validation (for example, the NMS stage in anchor-box object detection) that takes up too much of your training time. You can increase the train set for each epoch with BalanceClassSampler, but what if you want to keep your training data unchanged? Try our new PeriodicLoaderCallback (contributed by Ditwoo).

For the example above, you can run validation every 2 epochs:

[Code listing: run with `PeriodicLoaderCallback`]

Thanks to the Catalyst design, this extends to any number of your data sources:

[Code listing: run with `PeriodicLoaderCallback` and multiple loaders]
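The scheduling logic behind periodic loaders can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Catalyst implementation: `active_loaders` is a hypothetical helper, epochs are assumed to be 1-based, periods are assumed to be positive integers, and loaders without a period are assumed to run every epoch.

```python
from typing import Dict, List


def active_loaders(
    epoch: int,
    loader_names: List[str],
    periods: Dict[str, int],
) -> List[str]:
    """Return the loaders that should run on the given 1-based epoch.

    A loader with period `k` runs on epochs k, 2k, 3k, ...;
    loaders missing from `periods` run every epoch.
    """
    selected = []
    for name in loader_names:
        period = periods.get(name, 1)
        if period < 1:
            raise ValueError(f"period for {name!r} must be >= 1")
        if epoch % period == 0:
            selected.append(name)
    return selected


# Validate every 2 epochs, run the heavy test loader every 4 epochs.
schedule = {
    epoch: active_loaders(
        epoch, ["train", "valid", "test"], {"valid": 2, "test": 4}
    )
    for epoch in range(1, 5)
}
```

With these settings, the train loader runs on every epoch, validation joins on epochs 2 and 4, and the test loader runs only on epoch 4, so the training data itself stays unchanged.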

ControlFlowCallback

After PeriodicLoaderCallback we asked ourselves: “If you can enable/disable data sources, why can’t you do the same with metrics and entire Callbacks?” For example, you may have a metric you don’t want to compute during the training or validation stage. With ControlFlowCallback (contributed by Ditwoo) this is easy to do:

[Code listing: Catalyst pipeline with ControlFlowCallback example]

Now you can define the loaders and epochs for which a Callback should be used or ignored.
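One way to picture this kind of control flow is a thin wrapper that decides, per loader and per epoch, whether to call the wrapped callback. The sketch below is plain Python and does not use the Catalyst API: the class `ControlFlowSketch`, its parameters `loaders`, `ignore_loaders`, and `epochs`, and the `(loader, epoch)` callback signature are all assumptions made for illustration.

```python
from typing import Any, Callable, Optional, Sequence


class ControlFlowSketch:
    """Run `callback` only for the selected loaders and epochs."""

    def __init__(
        self,
        callback: Callable[[str, int], Any],
        loaders: Optional[Sequence[str]] = None,         # run only on these
        ignore_loaders: Optional[Sequence[str]] = None,  # never run on these
        epochs: Optional[Sequence[int]] = None,          # run only on these
    ) -> None:
        self.callback = callback
        self.loaders = set(loaders) if loaders is not None else None
        self.ignore_loaders = set(ignore_loaders or ())
        self.epochs = set(epochs) if epochs is not None else None

    def is_enabled(self, loader: str, epoch: int) -> bool:
        """Check the loader filters first, then the epoch filter."""
        if loader in self.ignore_loaders:
            return False
        if self.loaders is not None and loader not in self.loaders:
            return False
        return self.epochs is None or epoch in self.epochs

    def __call__(self, loader: str, epoch: int) -> Any:
        # Skip silently when the wrapped callback is disabled for this step.
        if self.is_enabled(loader, epoch):
            return self.callback(loader, epoch)
        return None


# A stand-in "metric" that just records when it was computed.
calls = []
wrapped = ControlFlowSketch(
    lambda loader, epoch: calls.append((loader, epoch)),
    loaders=["valid"],
    epochs=[2, 4],
)
for epoch in range(1, 5):
    for loader in ("train", "valid"):
        wrapped(loader, epoch)
```

Here the wrapped metric is computed only on the valid loader and only on epochs 2 and 4; the wrapped callback itself needs no changes.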

Metric Learning features

I also want to preview some extra updates in this release. Over the last month we have been working hard on a foundation for Metric Learning research. We have prepared several InBatchTripletsSamplers (contributed by AlekseySh), helper modules for online triplet mining during training:

AllTripletsSampler to select all possible triplets for the anchors
HardTripletsSampler to select the hardest triplets based on distances between samples
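The two mining strategies listed above can be sketched in plain Python on a toy batch. This is an illustration only, not the Catalyst samplers: `all_triplets` and `hard_triplets` are hypothetical helpers, and "hardest" is assumed here to mean the farthest positive and the closest negative for each anchor, which is one common definition.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Triplet = Tuple[int, int, int]  # (anchor, positive, negative) indices


def all_triplets(labels: Sequence[int]) -> List[Triplet]:
    """Every triplet with a same-label positive and a different-label negative."""
    ids = range(len(labels))
    return [
        (a, p, n)
        for a, p in permutations(ids, 2)
        if labels[a] == labels[p]
        for n in ids
        if labels[n] != labels[a]
    ]


def hard_triplets(
    labels: Sequence[int],
    dist: Sequence[Sequence[float]],
) -> List[Triplet]:
    """For each anchor, pick the farthest positive and the closest negative."""
    ids = range(len(labels))
    triplets = []
    for a in ids:
        positives = [i for i in ids if i != a and labels[i] == labels[a]]
        negatives = [i for i in ids if labels[i] != labels[a]]
        if not positives or not negatives:
            continue  # this anchor cannot form a triplet
        p = max(positives, key=lambda i: dist[a][i])
        n = min(negatives, key=lambda i: dist[a][i])
        triplets.append((a, p, n))
    return triplets


# Toy batch: 1-D "embeddings" with class labels.
features = [0.0, 1.0, 3.0, 4.0, 10.0]
labels = [0, 0, 0, 1, 1]
dist = [[abs(x - y) for y in features] for x in features]

every = all_triplets(labels)        # 18 triplets for this batch
hard = hard_triplets(labels, dist)  # one triplet per anchor
```

On this batch the exhaustive strategy yields 18 triplets while the hard strategy yields 5, one per anchor, which shows the trade-off between coverage and the number of triplets the loss has to process.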

We hope these abstractions will help in your research. We are now working on a minimal Metric Learning example to create a starting benchmark for this case. Stay tuned for the upcoming tutorial.

Fixes

Last but not least, as with every release, this one comes with a few fixes:

thanks to Oleksii Sliusarenko, we fixed our “first epoch” issue with EarlyStoppingCallback
with Lokesh Nandanwar’s support, we made our OneCycleLRWithWarmup great again
a number of GitHub and catalyst-codestyle improvements by Yauheni Kachan

Integrations — MONAI segmentation example

In collaboration with the MONAI team, we have prepared an introductory tutorial on 3D image segmentation with the MONAI and Catalyst frameworks.

Plans

We still have a lot of plans:

TPU support: with current CPU, GPU, and Slurm support, we want to push the frontier and bring Catalyst to the fancy TPU
kornia integration: we already have a native integration with the famous albumentations library, but why not make a fair comparison between the alternatives and take the best for our users? Stay tuned for an upcoming benchmark of image augmentation libraries by Catalyst-Team
model auto-pruning: since Catalyst is a framework for deep learning research and development, and we already support model auto-tracing, we want to introduce framework support for model auto-pruning

Ecosystem release — Alchemy

Along with this Catalyst release, we have more great news: we are moving our ecosystem-powered monitoring tools to the global MVP release. Feel free to use them and share your feedback with us.

Alchemy: alchemy.host

We help researchers accelerate pipelines with Catalyst and find insights with Alchemy throughout the whole R&D process: these ecosystem tools are available for you to train, share, and collaborate more effectively.

Afterword

Our goal is to build a foundation for fundamental breakthroughs in deep learning and reinforcement learning. Nevertheless, it is really hard to build an Open Source Ecosystem with only a few motivated people. If you are a company that is deeply committed to using open source technologies in deep learning and want to support our initiative, feel free to write to us at catalyst.team.core@gmail.com. For details about the Ecosystem, check our vision and manifesto.

Written by Sergey Kolesnikov