TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Deep Learning Model Training Loop

10 min read · Dec 9, 2018


Several months ago I started exploring PyTorch — a fantastic, easy-to-use Deep Learning framework. In the previous post, I described how to implement a simple recommendation system using the MovieLens dataset. This time I would like to focus on a topic essential to any Machine Learning pipeline — the training loop.

The PyTorch framework provides you with all the fundamental tools to build a machine learning model. It gives you CUDA-driven tensor computations, optimizers, neural network layers, and so on. However, to train a model, you need to assemble all these things into a data processing pipeline.

Recently the developers released version 1.0 of PyTorch, and there are already a lot of great solutions that help you train a model without digging into basic operations with tensors and layers (briefly discussed in the next section). Nevertheless, I believe that every once in a while most software engineers feel a strong desire to implement things "from scratch" to better understand the underlying processes and to gain skills that don't depend on a particular implementation or high-level library.

In the next sections, I am going to show how one can implement a simple but useful training loop using the torch and torchvision Python packages.

TL;DR: Please follow this link to get right into the repository where you can find the source code discussed in this post. Also, here is a link to the notebook that contains the whole implementation in a single place, as well as additional information not included in the post to make it concise.

Out-of-the-Box Solutions

As noted above, there are some high-level wrappers built on top of the framework that simplify the model training process a lot. In order of increasing complexity, from minimalistic to very involved:

  1. Ignite — an official high-level interface for PyTorch
  2. Torchsample — a Keras-like wrapper with callbacks, augmentation, and handy utils
  3. Skorch — a scikit-learn compatible neural network library
  4. fastai — a powerful end-to-end solution to train Deep Learning models of various complexity with high accuracy and computation speed

The main benefit of high-level libraries is that instead of writing custom utils and wrappers to read and prepare the data, you can focus on the data exploration process itself. There is no need to hunt for bugs in your own code, because hard-working maintainers improve the library and are ready to help if you have issues. There is no need to implement custom data augmentation tools or training parameter scheduling; everything is already there.

Using a well-maintained library is a no-doubt choice if you're developing production-ready code, or participating in a data science competition where you need to search for the best model rather than sit with a debugger trying to figure out where that memory error comes from. The same is true if you're learning new topics and would like to get a working solution quickly instead of spending days (or weeks) coding ResNet layers and writing your own SGD optimizer.

However, if you're like me, then one day you'll want to test your knowledge and build something with fewer layers of abstraction. If so, let's proceed to the next section and start reinventing the wheel!

The Core Implementation

The very basic implementation of the training loop is not that difficult. The torch package already includes convenience classes that allow instantiating dataset accessors and iterators. So in essence, we need to do something like the snippet below shows.
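A minimal sketch of such a loop could look like the following; note that the model, data, and hyperparameters here are toy stand-ins, not the ones from the repository:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a real dataset and model; replace them with your own.
X, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(3):
    for batch, targets in loader:
        optimizer.zero_grad()               # reset gradients from the last step
        outputs = model(batch)              # forward pass
        loss = criterion(outputs, targets)  # compute the objective
        loss.backward()                     # backpropagate
        optimizer.step()                    # update the weights
```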

We could stop our discussion at this section and save some time. However, we usually need more than simple loss computation and weight updates. First, we would like to track progress using various performance metrics. Second, the initially set optimizer parameters should be tuned during training to improve convergence.

A straightforward approach would be to modify the loop's code to include all these additional features. The only problem is that as time goes on, we could lose the clarity of our implementation by adding more and more tricks, introduce regression bugs, and end up with spaghetti code. How can we find a tradeoff between the simplicity and maintainability of the code and the efficiency of the training process?

Bells and Whistles

The answer is to use software design patterns. The observer is a well-known design pattern in object-oriented languages. It allows decoupling a sophisticated system into more maintainable fragments. Instead of trying to encapsulate all possible features in a single class or function, we delegate calls to subordinate modules. Each module is responsible for reacting properly to the notifications it receives, and can ignore a notification if the message is intended for someone else.

The pattern is known under different names that reflect various features of an implementation: observer, event/signal dispatcher, callback. In our case, we go with callbacks, the approach taken by the Keras and (especially) fastai libraries. The solution chosen by the authors of the ignite package is a bit different, but in essence it boils down to the same idea. Take a look at the picture below. It shows a schematic organization of our improved training loop.

Each colored section is a sequence of method calls delegated to the group of callbacks. Each callback has methods like epoch_started, batch_started, and so on, and usually implements only a few of them. For example, consider a loss metric computation callback. It doesn't care about the methods running before backward propagation, but as soon as the batch_ended notification is received, it computes a batch loss.

The next snippet shows a Python implementation of that idea.
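A minimal sketch of the callback-dispatch machinery follows; the class and hook names mirror the ones described above, but the exact interface in the repository may differ:

```python
class Callback:
    """Base class: subclasses override only the hooks they need."""
    def epoch_started(self, **kwargs): pass
    def batch_started(self, **kwargs): pass
    def batch_ended(self, **kwargs): pass
    def epoch_ended(self, **kwargs): pass

class CallbacksGroup(Callback):
    """Observer dispatch: forwards every notification to each callback."""
    def __init__(self, callbacks):
        self.callbacks = callbacks
    def invoke(self, method, **kwargs):
        for cb in self.callbacks:
            getattr(cb, method)(**kwargs)

class EventRecorder(Callback):
    """A toy callback that only listens to epoch-level notifications."""
    def __init__(self):
        self.events = []
    def epoch_started(self, epoch, **kwargs):
        self.events.append(f"epoch {epoch} started")
    def epoch_ended(self, epoch, **kwargs):
        self.events.append(f"epoch {epoch} ended")

recorder = EventRecorder()
group = CallbacksGroup([recorder])

for epoch in range(2):
    group.invoke("epoch_started", epoch=epoch)
    for batch in range(3):
        group.invoke("batch_started", batch=batch)
        # the forward/backward pass and optimizer step would go here
        group.invoke("batch_ended", batch=batch)
    group.invoke("epoch_ended", epoch=epoch)
```

The training loop itself stays short; all the extra behavior lives in the callbacks it notifies.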

That's all; it isn't much more sophisticated than the original version, right? It is still clean and concise, yet much more functional. Now the complexity of the training algorithm is entirely determined by the delegated calls.

Callbacks Examples

There are a lot of useful callbacks (see keras.io and docs.fast.ai for inspiration) we could implement. To keep the post concise, we're going to describe only a couple of them and move the rest into a Jupyter notebook.

Loss

The very first thing that comes to mind when talking about Machine Learning model training is the loss function. We use it to guide the optimization process and would like to see how it changes during training. So let's implement a callback that tracks this metric for us.

At the end of every batch, we compute a running loss. The computation could seem a bit involved, but its primary purpose is to smooth the loss curve, which would be bumpy otherwise. The formula a*x + (1 - a)*y is a linear interpolation between the old and new values.

Geometric interpretation of linear interpolation between vectors A and B

The denominator helps us account for the bias we have at the beginning of the computation. Check this post, which describes the smoothed loss computation formula in detail.
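A minimal version of such a callback might look like the following sketch (the class and attribute names are illustrative, not necessarily those from the notebook):

```python
class RollingLoss:
    """Exponentially smoothed loss with bias correction, fastai-style."""
    def __init__(self, smooth=0.98):
        self.smooth = smooth
        self.avg = 0.0
        self.count = 0

    def batch_ended(self, loss):
        self.count += 1
        # linear interpolation between the old average and the new loss
        self.avg = self.smooth * self.avg + (1 - self.smooth) * loss
        # dividing by (1 - smooth**count) undoes the zero-initialization bias
        return self.avg / (1 - self.smooth ** self.count)
```

Because of the bias correction, a constant loss stream is reported unchanged from the very first batch instead of slowly ramping up from zero.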

Accuracy

The accuracy metric is probably one of the best-known metrics in machine learning. Though in many cases it can't give you a good estimate of your model's quality, it is very intuitive and simple to understand and implement.

Note that the callback receives notifications at the end of each batch and at the end of each training epoch. It computes the accuracy metric iteratively, because otherwise we would need to keep all outputs and targets in memory for the whole training epoch.

Due to the iterative nature of our computations, we need to account for the number of samples in each batch. We use these values to adjust our computations at the end of the epoch. Effectively, we compute the weighted mean

accuracy = (b(1)*a(1) + b(2)*a(2) + ... + b(n)*a(n)) / N

where b(i) is the batch size on iteration i, a(i) is the accuracy computed on that batch, and N is the total number of samples. As the formula shows, our code computes a sample mean of accuracy. Check these useful references to read more about iterative metric computations:

  1. Metrics as callbacks from fastai
  2. Accuracy metric from the ignite package
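As a sketch of the idea (hook names mirror the callback methods described earlier, and predictions are plain class labels for simplicity):

```python
class Accuracy:
    """Accumulates correct predictions batch by batch, so whole-epoch
    outputs and targets never need to stay in memory."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def batch_ended(self, predictions, targets):
        # correct = b(i) * a(i) for this batch, accumulated incrementally
        self.correct += sum(p == t for p, t in zip(predictions, targets))
        self.total += len(targets)

    def epoch_ended(self):
        # sum of b(i)*a(i) over all batches, divided by N
        return self.correct / self.total
```

Since the number of correct predictions in a batch equals b(i)*a(i), summing them and dividing by N reproduces the weighted mean from the formula above.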

Parameter Scheduler

Now the most interesting stuff comes. Modern neural network training algorithms don't use fixed learning rates. Recent papers (one, two, and three) show an educated approach to tuning the training parameters of Deep Learning models. The idea is to use cyclic schedulers that adjust the magnitudes of the optimizer's parameters during a single epoch or across several training epochs. Moreover, these schedulers not only decrease the learning rate as the number of processed batches grows, but also increase it again for some number of steps or periodically.

For example, consider the following function which is a scaled and shifted cosine:

Half-period of shifted and scaled cosine function

If we repeat this function several times, doubling its period each time, we'll get a cosine annealing scheduler, as the next picture shows.

Cosine annealing with restarts scheduler

Multiplying the optimizer's learning rate by the values of this function, we effectively get stochastic gradient descent with warm restarts, which allows us to escape from local minima. The following snippet shows how one can implement a cosine annealing learning rate schedule.
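A sketch of such a schedule follows; it yields a learning-rate multiplier rather than an absolute rate, and the parameter names are assumptions:

```python
import math

class CosineAnnealingSchedule:
    """Cosine decay of a learning-rate multiplier from 1.0 down to eta_min
    over t_max steps; each restart doubles the cycle length (SGDR-style)."""
    def __init__(self, eta_min=0.0, t_max=100, cycle_mult=2):
        self.eta_min = eta_min
        self.t_max = t_max
        self.cycle_mult = cycle_mult
        self.t = 0

    def __call__(self):
        # multiplier in [eta_min, 1.0]; multiply your base lr by it
        mult = self.eta_min + (1 - self.eta_min) * (
            1 + math.cos(math.pi * self.t / self.t_max)) / 2
        self.t += 1
        if self.t >= self.t_max:      # warm restart with a longer period
            self.t = 0
            self.t_max *= self.cycle_mult
        return mult
```

Recent PyTorch versions also ship a built-in CosineAnnealingWarmRestarts scheduler in torch.optim.lr_scheduler implementing the same idea.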

There is an even more exciting scheduler, though, called the One-Cycle Policy. The idea of this schedule is to use a single cycle of learning rate increase and decrease over the whole training process, as the following picture shows.

One-cycle policy scheduler

At the very beginning of the training process, the model's weights are far from optimal, so we can allow ourselves to use larger update steps (i.e., higher learning rates) without the risk of skipping over optimal values. After a few training epochs, the weights become better and better tailored to our dataset, so we slow down the learning pace and explore the loss surface more carefully.

The One-Cycle Policy has a quite straightforward implementation if we use the previously shown class. We only need to add a linear segment that goes before the cosine decay, as lines 27-30 of the snippet show.
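A standalone sketch of that combination (linear warm-up followed by cosine decay; the parameter names and default fractions are assumptions, not the notebook's exact values):

```python
import math

class OneCycleSchedule:
    """One-cycle policy: linear warm-up for the first warm_frac of steps,
    then cosine decay down to `end` for the remainder."""
    def __init__(self, total, warm_frac=0.3, start=0.1, end=0.0):
        self.warm = int(total * warm_frac)
        self.total = total
        self.start = start   # multiplier at step 0
        self.end = end       # multiplier at the last step

    def __call__(self, t):
        if t < self.warm:
            # linear segment rising from `start` up to 1.0
            return self.start + (1 - self.start) * t / self.warm
        # cosine decay from 1.0 down to `end`
        progress = (t - self.warm) / (self.total - self.warm)
        return self.end + (1 - self.end) * (1 + math.cos(math.pi * progress)) / 2
```

PyTorch later added a built-in OneCycleLR scheduler to torch.optim.lr_scheduler, though its shape details differ slightly from this sketch.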

The final step is to wrap the schedulers with a callback interface. An example implementation is not shown here to keep this post concise and easy to read. However, you can find fully functional code in the aforementioned Jupyter notebook.

Stream Logger

The last thing we would like to add is some logging to see how well our model performs during the training process. The most simplistic approach is to print stats to the standard output stream. However, you could save them to a CSV file or even send them as notifications to your mobile phone instead.
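A bare-bones version of such a callback could look like this (the metrics dictionary layout is an assumption; swapping the stream for a file handle gives file logging instead):

```python
import sys

class StreamLogger:
    """Writes epoch metrics to a stream; stdout by default."""
    def __init__(self, stream=None):
        self.stream = stream or sys.stdout

    def epoch_ended(self, epoch, metrics):
        # metrics is assumed to be a dict of name -> float
        stats = ", ".join(f"{k}={v:.4f}" for k, v in metrics.items())
        self.stream.write(f"Epoch: {epoch:4d} | {stats}\n")
```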

OK, finally, we’re ready to start using our training loop!

Your Favorite Dataset

Now that the callbacks are ready, it is time to show how our training loop works. For this purpose, let's pick the ubiquitous MNIST dataset. You can easily train a model on it, even on a CPU, within a few minutes.

The dataset is very simple for modern Deep Learning architectures and algorithms. Therefore, we can use a relatively shallow architecture with a few convolutional and linear layers.
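A sketch of such a network (the exact layer sizes are an assumption; the notebook's architecture may differ):

```python
import torch
from torch import nn

# A shallow net for 28x28 grayscale MNIST digits: two conv blocks
# followed by two linear layers producing 10 class logits.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

logits = model(torch.randn(8, 1, 28, 28))  # a batch of 8 MNIST-sized images
```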

We don't use transfer learning here, but you definitely should when working on your daily tasks. It makes your network converge much faster compared to training from scratch.

Next, we use the torchvision package to simplify dataset loading and iteration. We also apply a couple of augmentation methods to improve the quality of the model. Then we build a callbacks group that adds a bunch of features to our basic training loop. Finally, we make a couple of small preparations and call the training function to optimize the model.

You should get an output similar to the output shown below.

Epoch:    1 | train_loss=0.8907, train_accuracy=0.6387, valid_loss=0.1027, valid_accuracy=0.9695
Epoch:    2 | train_loss=0.4990, train_accuracy=0.8822, valid_loss=0.0828, valid_accuracy=0.9794
Epoch:    3 | train_loss=0.3639, train_accuracy=0.9086, valid_loss=0.0723, valid_accuracy=0.9823

Note that the code shown above includes a make_phases() function that is not shown here. Please refer to the notebook to see its implementation. In essence, it wraps data loaders with thin structures that help track performance metrics during the model's training.

Conclusion

The ultimate goal of a Deep Learning engineer is to build a robust and accurate solution for a specific dataset and task. The best way to achieve this goal is to use proven tools and well-maintained frameworks and libraries, tested in many use cases by users throughout the world.

However, if you would like to be well-versed in Data Science and eventually build your own custom solutions, you probably "should understand backprop". Knowing your tools well gives you the ability to tailor them to your specific needs, add new functionality, and learn new instruments faster.

I believe that keeping a balance between using proven APIs and understanding "low-level" details makes you a better engineer, one who can easily transfer the knowledge obtained to new platforms, languages, and interfaces.

Interested in Python language? Can’t live without Machine Learning? Have read everything else on the Internet?

Then you would probably be interested in my blog, where I talk about various programming topics and provide links to textbooks and guides I've found interesting.


Written by Ilia Zaitsev

Software Developer & AI Enthusiast. Working with Machine Learning, Data Science, and Data Analytics. Writing posts every once in a while.