Fully Utilizing Your Deep Learning GPUs

There are a lot of articles about deep learning and deep learning frameworks, and there is a lot of buzz about GPUs and how to use them for deep learning. There are even plenty of articles about building your own deep learning machine from the ground up. I wrote one on the topic myself, which you should consider checking out if you are interested in building your own machine. What there is not a lot of is insight into taking full advantage of your GPUs. To set your expectations appropriately, this article is more about the larger picture than it is about tactical technique.


Develop An Appreciation Of Your Hardware

It is useful to get an intuitive sense of what your computing hardware can do. This gives you a better feel for what it is capable of and, hopefully, more enthusiasm for utilizing it effectively.

As a human being, I can add and multiply. Giving a very optimistic estimate, I can add or multiply simple numbers at a rate of one addition or multiplication per second, and I can do so for 12 hours a day for about 60 productive years. In this period, I would be able to perform approximately one billion computations. The machine that I built in the article I mentioned earlier has a theoretical total throughput (CPU+GPU) of approximately 23 trillion single-precision floating point operations per second (about 23 TFLOPS). Said another way, that computer can do more computation in about 50 millionths of a second than I can in my entire lifetime, not to mention the fact that it does so with greater precision and less error. The hardware and tools we have available to us are phenomenal, but we need to be thoughtful about how best to take advantage of them.
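The lifetime estimate above is easy to check with a few lines of arithmetic. This is only a sketch: the function names are illustrative, and the defaults simply restate the assumptions in the paragraph (one operation per second, 12 hours a day, 60 years of 365 days).

```python
SECONDS_PER_HOUR = 3600


def lifetime_operations(ops_per_second=1, hours_per_day=12, years=60):
    """Operations one person could perform by hand over a working lifetime."""
    seconds_per_day = hours_per_day * SECONDS_PER_HOUR
    return ops_per_second * seconds_per_day * 365 * years


def seconds_to_match(throughput_flops, **assumptions):
    """Time a machine with the given throughput needs to match that total."""
    return lifetime_operations(**assumptions) / throughput_flops


if __name__ == "__main__":
    # 946,080,000 with the default assumptions, i.e. roughly one billion.
    print(f"{lifetime_operations():,} operations in a lifetime")
```

Passing your own machine's theoretical throughput to seconds_to_match() gives the time it needs to match a lifetime of hand arithmetic.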

Considerations such as these are an excellent entry to philosophical and existential thinking about topics such as why the human brain works so well, how we might improve the state of the art using unsupervised or reinforcement learning techniques instead of the more status quo supervised learning, the enchantment of wondering about human dreams and their relation to unlearning cycles in Hopfield nets, and many others. Please do think about these things, because they are hugely important, but understand that this article is about high level ideas regarding taking advantage of your GPUs.


Understand What Affects Your Ability To Execute

This article is about fully utilizing your GPU. That is actually a pretty broad topic that extends pretty far from your GPU. It is about the depth to which you understand your specific problem. It is your understanding of strategy as it relates to choosing the tools that help you solve your problem most efficiently. It is your understanding of the tactics that allow you to iterate on potential solutions to your problem. It is your coding skill and knowledge of good software craftsmanship. It is your knowledge of how to administer your machine and create good system tools that help you move fast. It is your general problem solving ability. You actually have to know what you are doing and consistently make good decisions to fully utilize your GPU and to solve the problems you are working on at high velocity. Additionally, you need to have vision for the larger concept of how to solve problems when the solution involves training, and know how to prioritize appropriately to achieve maximum effect.


A Tale of Two Machines

I have two deep learning machines. One of them is a junker Dell Inspiron with a 4th generation 4-core i5 with 12GB of RAM that I upgraded with a better power supply and a GTX 1060 6GB (1280 cores). The other one (you can read more about it here) is a 7th generation 10-core i9 with 32GB of RAM and two GTX 1080 Tis (7168 total cores). These are very different machines. The more substantial one has a higher capacity, but this is only of value if it is taken advantage of. For a lot of smaller networks, it really doesn’t matter which machine they are run on. This is because, despite the disparity in capacity, both kinds of GPU share the same architecture (Pascal) and run at about the same clock speed, so a network too small to occupy the extra cores gains little from them.

The goal of the rest of this article is to share some insights regarding how to maximize the benefits of your GPU computing hardware. This has utility whether you are using something like a massive AWS p3.16xlarge instance or just a single GTX 1060 on your local machine. If you aren’t being smart about it, you are wasting your hardware’s capacity (not to mention potentially a lot of money). Hopefully some of the ideas presented in this article help you identify ways to improve your utilization and solve problems at higher velocity.


Code Matters — A Lot

It is not just your code specifically that matters; it is the entire code path between your concept and the CUDA cores that execute it on the GPU. This is one reason why you should spend some time researching what the right coding environment is. This depends on the languages you use, the style of code you write, the environment you are developing in and deploying to, and the larger goals you are trying to achieve.

One thing you most likely should not be doing unless you are a researcher or a contributor to a framework is writing low-level code. More than likely you do not know as much as you think you do about performance optimization and concurrency. As well, you probably didn’t do your homework and didn’t notice the variety of good products already in the space. Don’t get me wrong, it is essential to your education that you implement things like backprop at least once in your life, but if you are trying to seriously solve a problem this is the least desirable time to be doing this.

Most of the deep learning libraries you are familiar with use a computation graph that translates into CUDA code. The specific mapping depends on the framework, but generally cuBLAS and cuDNN are the final underpinnings. Sometimes you see custom CUDA code for unusual features as well. It is worth your while looking through the source of a few of the big frameworks to see how this works. Unless you have a significant background in tuning the performance of computational code, you are best served not attempting to recreate these excellent vendor-provided libraries. If you want to get your feet wet, NVIDIA has a nice online programming guide.

The specific way that the computation graph is created, optimized and translated into CUDA code is another aspect that requires your attention, partly because the way this is handled is important to the performance of the code, and partly because the computation graph API dictates a lot about how you write your code. There are two types of computation graph systems: those built around a static graph model and those built around a dynamic graph model. The static graph model systems generally have more optimization features having to do with the computation graph. The dynamic graph model systems generally are much more flexible and are simpler for experimentation. The reality is that someone coded the construction of the graph, and if that part was done with sufficient care there isn’t as much need for optimization as one might think.

In the past I used TensorFlow and Keras a lot. I really appreciate the convenience of Keras and the ability to import the backend to easily extend functionality for things like custom cost functions. One thing that has put me off in recent times is the feverish pace of development that Keras seems to be experiencing, and the lack of checking that things still work. It is frustrating to import a model and find that the activation functions aren’t in that version of Keras anymore (maybe they are in tf.keras bundled with TensorFlow but not in standalone Keras, who knows). It is also frustrating to load a model and find that it isn’t in the default TensorFlow graph. Why isn’t the .fit() signature for validation data the same as for training data? Despite these complaints, Keras is still a great tool for a lot of prototyping needs. Unfortunately, it also involves a tradeoff between concise abstraction and performance. It has been a long time since I implemented anything from scratch in TensorFlow. As I mentioned earlier, if you aren’t a researcher, a library maintainer or coding for the pure joy of it, you are missing the point of high velocity experimentation if you are coding anything but truly unique things from scratch. I cringe when I see projects that teams produce under the guise of high velocity with tf.this, tf.that and tf.the_others littered throughout. TensorFlow is a great tool, but it needs to be used more wisely.

These days I prefer to write my code using PyTorch. One reason for this is the abstraction for the computation graph. It is immensely useful to be able to construct models in an organized, parametrizable way; to me PyTorch offers the simplest interface for this. The entire code base is concise, leading to simple computation graphs that are performant. One of the biggest advantages is the ability to parameterize experiments so you can iterate on them easily. The abstractions in PyTorch help you develop complex ideas faster than many of the alternatives. This isn’t only my opinion; there are plenty of people who have positive things to say about PyTorch.

In PyTorch, you can very easily create custom network components by subclassing torch.nn.Module. Indeed, all of the provided layers are subclasses of torch.nn.Module as well, though they often contain an intermediary class that establishes common functionality for the given type of layer. Generally speaking, all you need to implement is the constructor and the forward() method. The constructor is generally where you create the layers and parameters that make up your custom component, and the forward() method evaluates it. If you need to, you can implement (or override) methods like reset_parameters(), which is where weight initialization is performed. The abstractions are really powerful, as they give you excellent control over the computation graph in the form of a class hierarchy. They also make propagating control simple, like switching between training and evaluation by calling the train() or eval() methods respectively, because these calls propagate to the children automatically. It is the balance between the freedoms and the conventions, and the ease of making flexible, performant computation graphs, that is why I find PyTorch a pleasure to use.
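To make the pattern concrete, here is a minimal sketch of a custom component. ScaledLinear and its sizes are hypothetical names chosen for illustration, not something from PyTorch itself; the calls it relies on (nn.Module, nn.Linear, nn.Parameter, nn.init.ones_, train() and eval()) are standard PyTorch.

```python
import torch
from torch import nn


class ScaledLinear(nn.Module):
    """Hypothetical component: a linear layer with dropout and a learned scale."""

    def __init__(self, in_features, out_features, dropout=0.1):
        super().__init__()
        # The constructor declares the pieces that make up the component.
        self.linear = nn.Linear(in_features, out_features)
        self.dropout = nn.Dropout(dropout)
        self.scale = nn.Parameter(torch.empty(1))
        self.reset_parameters()

    def reset_parameters(self):
        # Weight initialization lives here, mirroring the built-in layers.
        self.linear.reset_parameters()
        nn.init.ones_(self.scale)

    def forward(self, x):
        # forward() evaluates the component on a batch of inputs.
        return self.scale * self.dropout(torch.relu(self.linear(x)))


model = nn.Sequential(ScaledLinear(8, 16), ScaledLinear(16, 4))
model.eval()  # propagates to every child, so dropout is disabled
output = model(torch.randn(2, 8))
```

Because the layer sizes and dropout rate are constructor arguments, the same class can be instantiated with different settings for each experiment.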

Like many, my own path to these observations is tempered by experience. Experience has guided me to where I am now, and I am simply sharing what has worked for me. I have written backprop from scratch many times, including in OCaml as a purely functional implementation on lists (see here if interested). For years I have used TensorFlow and Keras, which result in a marked improvement in project velocity compared to writing it all from scratch. There are some really good features regarding deployment of TensorFlow that are not being highlighted here (e.g. interoperability of SavedModelBuilder() and Serving), and TensorFlow does have more aggressive default GPU utilization, but this article is more about code design that facilitates thoughtful resource utilization and overall project velocity than it is about deployment or default utilization.


Environment Matters — A Lot

You might not think so, but the choices that you make with respect to your environment play a critical part in creating an ecosystem that facilitates good GPU utilization and the ability to iterate quickly on experiments. The more thought you put into this aspect of the problem the better, as this forms the basis of how you execute your experiments. This actually matters so much that it is one of the biggest differentiators between someone idly messing around in the space and someone having the ability to iterate quickly and make things happen.

A vision that you should develop is making your environment as simple to use as possible. If getting your code running on your GPU is not extremely simple, it will slow down your entire experimentation and production training pipeline. A really simple thing to set up that will give you a quick win is creating some shell scripts that make execution on your GPU simpler. The one I use is a small wrapper that sets CUDA_VISIBLE_DEVICES and then runs python3 with whatever arguments it is given.

I call this gpu1 and put it in /usr/local/sbin. It is called gpu1 because it targets the first GPU installed. The number that CUDA_VISIBLE_DEVICES refers to is the logical number of the video card. Since I have a machine with two compute GPUs, I also have a gpu2 script that indicates the second logical compute GPU. Now you have a very easy way of running jobs on whichever GPU you want, without having to type python3 every time. The point is that it is very easy to create handy shell tools that make it really simple to use your GPUs more effectively.
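The same idea can be sketched in Python for readers who prefer to keep their tooling in one language. This is an analogue of the wrapper described above, not the original script: the function names and the command-line layout are assumptions, and the GPU index is taken as an argument rather than fixed per script.

```python
import os
import sys


def gpu_environment(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(gpu_index, script_args):
    """Replace this process with python3 running script_args on that GPU."""
    command = ["python3", *script_args]
    os.execvpe(command[0], command, gpu_environment(gpu_index))


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Hypothetical usage: ./gpu.py 0 train.py --epochs 10
    # Everything after the index is passed through to python3 unchanged.
    run_on_gpu(int(sys.argv[1]), sys.argv[2:])
```

Setting the variable in the child's environment, rather than inside the training script, keeps the script itself free of any assumption about which GPU it runs on.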

The next tool you need to consider is monitoring your GPUs. To make this easy, you can simply alias nvidia-smi, along with the arguments you prefer, to a short name such as gps.

You can put that alias in your .bashrc or a similar shell customization file. This makes it really easy to see what your GPUs are doing with reasonable temporal granularity using a command with a sensible name (gps meaning GPU ps).

You have almost certainly noticed some shortcomings in GPU process monitoring with nvidia-smi. Even if you use the -lms 500 parameter to get reasonably granular temporal information, the CUDA utilization is notably not present. This utilization is available from the NVML library, which is exposed by tools like py3nvml in the python3 world. As a bonus, py3nvml comes with a replacement for nvidia-smi called py3smi that offers this additional information. Of course you can alias py3smi as gps with your favorite arguments if you like the shorter name.
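If you want device-level numbers in a program rather than on screen, one option that needs no extra packages is the machine-readable query mode of nvidia-smi. The sketch below is an alternative to py3nvml, not a description of it; the function names are illustrative, and it reports per-device figures only, so it does not address the per-process detail discussed above.

```python
import subprocess

QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_report(report):
    """Parse CSV rows of index, utilization %, memory used and total (MiB)."""
    gpus = []
    for line in report.strip().splitlines():
        # Assumes every field is numeric; some devices report "[N/A]" instead.
        index, utilization, used, total = (int(f) for f in line.split(","))
        gpus.append({"index": index, "utilization": utilization,
                     "memory_used": used, "memory_total": total})
    return gpus


def query_gpus():
    """Ask nvidia-smi for one machine-readable row per GPU."""
    command = ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
               "--format=csv,noheader,nounits"]
    return parse_gpu_report(subprocess.check_output(command, text=True))
```

Calling query_gpus() on a machine with the NVIDIA driver installed returns one dictionary per GPU, which is enough to decide whether a device has memory to spare.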

Get used to thinking about and setting CUDA_VISIBLE_DEVICES when you are working with your GPUs. This is an NVIDIA-ism that is used across a wide swath of deep learning frameworks. It is well worth your while to familiarize yourself with how your GPUs are specified, what environment variables are commonly used to convey this information, know some easy ways to change this context, and understand how to get and set information about the GPU context manually and programmatically (I will focus on the part about programmatically getting and setting this context in my next article).


Experiments — Scripting Them Out

There are a number of ways that you can go about maximizing your GPU utilization now that you have a means to easily delegate work to the various GPUs in your machine. Each of these options has pros and cons, and which is best depends on the specific problem, your coding ability and other factors.

The first option, and one that I tend to use most, is to simply add argument parsing to your script. That way you can very easily vary parameters and run things on whichever GPU you please. Using an instance of ArgumentParser() is a really easy lift. This allows you to easily start process-based experiments from the command line. It is easy to script out experiments, and you have the ability to manage them using normal system tools since they are running as separate processes. You can use gps to get a sense of what is running on your GPUs and how much capacity remains. If you are a casual user or are making human decisions about hyper-parameter tuning, this can be a nice way of performing experiments and getting moderate GPU utilization.
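A minimal sketch of what that argument parsing might look like follows. The particular flags and defaults are assumptions chosen for illustration; a real script would build and train a model where this one only echoes its settings.

```python
import argparse


def build_parser():
    """Command-line interface for one training experiment (illustrative flags)."""
    parser = argparse.ArgumentParser(description="Run one training experiment.")
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--hidden-units", type=int, default=128)
    parser.add_argument("--tag", default="baseline", help="label for saved results")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # A real script would construct the model and train here.
    print(f"{args.tag}: lr={args.learning_rate} batch={args.batch_size} "
          f"epochs={args.epochs} hidden={args.hidden_units}")
    return args


if __name__ == "__main__":
    main()
```

Saved as a hypothetical train.py, this could be launched as `gpu1 train.py --learning-rate 0.01 --batch-size 64` on one GPU while a second variation runs under gpu2.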

The second option is to sequentially vary hyper-parameters from within one non-threaded process. On the surface this would seem to be a pleasing programmatic alternative to scripting out commands that launch individual processes. However, you lose much of the ability to manage the processes independently. Moreover, this sequential solution has a static affinity to GPUs, which means that getting good GPU utilization is hard. It is hard because you are either distributing large workloads across multiple GPUs for each sequential experiment, or you are manually controlling different instances of the program to delegate to independent GPUs. The first case is not ideal for experimentation, whereas the second case can be quite annoying to code. It is for these reasons that I personally much prefer the previously mentioned option.

The third and final way of running experiments, and the one with the best utilization potential, is to use a threading model and vary the hyper-parameters across experiments that all run as threads. In order to do this effectively, you need to be able to monitor what is happening on the GPU and estimate whether you have capacity for a new thread. This is a heavier lift, but can be quite rewarding. Moreover, while the reason for introducing this approach is GPU utilization, it is a step toward automating hyper-parameter selection and model architecture selection. Said another way, if you want to build your own AutoML system, this is a good place to start. I am not going to dwell on the tactics of implementing your own AutoML system in this article, but I will very soon write a detailed article on this topic.
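The skeleton of such a threaded runner can be sketched with the standard library alone. Everything here is an assumption made for illustration: train stands in for your training function, and has_capacity is a hypothetical callback that you would back with whatever GPU monitoring you trust.

```python
import threading
import time
from itertools import product


def grid(**choices):
    """Expand keyword lists into one dict per hyper-parameter combination."""
    names = list(choices)
    return [dict(zip(names, values)) for values in product(*choices.values())]


def run_experiments(configs, train, has_capacity, poll_seconds=0.1):
    """Start one thread per config, waiting whenever the GPU looks full."""
    results, lock, threads = {}, threading.Lock(), []

    def worker(config):
        outcome = train(config)
        with lock:
            results[tuple(sorted(config.items()))] = outcome

    for config in configs:
        # Hold back while other experiments are running and there is no room;
        # if nothing is running, start anyway so the loop cannot stall.
        while any(t.is_alive() for t in threads) and not has_capacity():
            time.sleep(poll_seconds)
        thread = threading.Thread(target=worker, args=(config,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results
```

The capacity check is deliberately pluggable, since estimating how much room a new experiment needs is the hard part and depends on your models and hardware.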

Something that is critically important for experimentation is to get organized. It is very easy, particularly when there is a manual aspect to the experimentation, to forget to make detailed observations. While there is artful intuition that can be developed through human experience in optimizing performance, a lot of the process leading to improvement should be conducted scientifically. You will likely generate a lot of collateral during your experiments (history of validation loss by epoch, best validation loss, hyper-parameters and network architectures, the saved state of models and optimizers, smoke test collateral that is useful to human observers, etc.). Keep all of this and keep it organized; this is the data that you are using to maximize performance. It is also a good indication of the general conceptual merits of developing an AutoML system, as these systems not only keep it organized, but optimize it.
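One lightweight way to stay organized is to write a small record for every experiment as it finishes. The sketch below assumes one directory per experiment with a single JSON file inside it; the layout and the function names are illustrative choices, not a standard.

```python
import json
import time
from pathlib import Path


def save_record(root, name, hyper_parameters, validation_losses):
    """Write one experiment's settings and loss history to its own JSON file."""
    record = {
        "name": name,
        "finished_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hyper_parameters": hyper_parameters,
        "validation_losses": validation_losses,
        "best_validation_loss": min(validation_losses),
    }
    directory = Path(root) / name
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "record.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def best_experiment(root):
    """Return the record with the lowest best validation loss under root."""
    records = [json.loads(p.read_text()) for p in Path(root).glob("*/record.json")]
    return min(records, key=lambda r: r["best_validation_loss"])
```

Saved model and optimizer state can live alongside record.json in the same directory, so everything about one experiment stays in one place.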


Production Training

The prior section focused on experimentation, where the emphasis for utilization is distribution of several small jobs across your GPUs. This section focuses on production training, where the emphasis for utilization is distribution of one large job across your GPUs.

Generally speaking, the difference between experimentation and production training is that experimentation is about hyper-parameter and model architecture selection and production training is about minimizing the loss. Nearly always you use a smaller training set for experimentation and are less interested in the absolute loss than relative performance across hyper-parameters and architectures. Given the smaller training sets, quite often you need to develop intuition regarding the behavior of the loss as your training progresses. It is very common to see overfitting behavior on a small training set, so you need to understand what indicates improvement, what indicates regression, and what is likely to work the way you want when you move to your full training set. This isn’t entirely prescriptive, and takes time and experience to become better at.

There are several benefits to using multiple GPUs for production training. Typically, the goal is to exploit higher amounts of parallelism so that you can use larger batches in your training. The larger the batches, the more stably the gradient tends to descend in minimizing the loss. Many computation graphs are easy to parallelize by duplicating large portions of them, so the larger your batches are and the more duplicated the graph is, the better the utilization of your GPUs is and the faster your training. This is particularly useful as you move from small training sets to larger training sets.
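The link between batch size and gradient stability can be illustrated with a toy simulation. It assumes per-example gradients are independent Gaussian draws around a true gradient, which real training data will only approximate; under that assumption the spread of the batch estimate shrinks roughly with the square root of the batch size.

```python
import random
import statistics


def gradient_estimate(batch_size, rng, true_gradient=1.0, noise=2.0):
    """Average of noisy per-example gradients for one simulated mini-batch."""
    samples = [rng.gauss(true_gradient, noise) for _ in range(batch_size)]
    return sum(samples) / batch_size


def estimate_spread(batch_size, trials=2000, seed=0):
    """Standard deviation of the mini-batch estimate across many batches."""
    rng = random.Random(seed)
    estimates = [gradient_estimate(batch_size, rng) for _ in range(trials)]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Each fourfold increase in batch size should roughly halve the spread.
    for batch_size in (4, 16, 64):
        print(batch_size, round(estimate_spread(batch_size), 3))
```

This is only the statistical side of the argument; it says nothing about how large a batch fits in GPU memory or how the learning rate should change with batch size.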

Many of the popular deep learning frameworks offer ways to distribute training across multiple GPUs. PyTorch makes this especially easy with the use of the DataParallel class. This class essentially wraps any subclass of torch.nn.Module and provides batch-level parallelism. What this means is you can use larger batch sizes and automatically distribute the work to any GPUs enumerated by CUDA_VISIBLE_DEVICES. Larger batch sizes generally result in better loss estimates involving a more even distribution of your training and validation data, and faster convergence by generating gradient estimates more consistent with the actual gradient. If you want to see more details about how DataParallel works, you can check out data_parallel.py and parallel_apply.py from the PyTorch repo.
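A minimal sketch of wrapping a model in DataParallel for one training step is shown below. The model, batch size and learning rate are placeholders chosen for illustration; the code falls back to the CPU when no GPU is visible so that it can be tried anywhere.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DataParallel splits each batch along dimension 0 across the visible GPUs
# and gathers the outputs; with zero or one GPU it simply calls the module.
parallel_model = nn.DataParallel(model).to(device)

optimizer = torch.optim.SGD(parallel_model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

inputs = torch.randn(256, 32, device=device)   # one large batch
targets = torch.randn(256, 1, device=device)

optimizer.zero_grad()
loss = loss_fn(parallel_model(inputs), targets)
loss.backward()   # gradients accumulate on the original module's parameters
optimizer.step()
```

The training loop itself is unchanged by the wrapper, which is what makes it convenient. Note that current PyTorch documentation recommends DistributedDataParallel over DataParallel for multi-GPU training, so treat this as the simplest starting point rather than the fastest option.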

At this point my hope is that you can see that there is a marked difference between some of the objectives of experimentation and production training. Experimentation is more about quick experiments that serve to guide you to the best hyper-parameters and the best architecture. These typically are low utilization episodes of watching the magnitude and rate of change of the loss, where being able to coordinate as many simultaneous experiments as possible is the most important factor. Production training is more about getting the best possible results from the information learned during experimentation. This typically means an increase in batch size and larger numbers of batches per epoch (both for training and validation), all to drive the loss as low as possible. In order to do this effectively, it can be helpful to subdivide the batches across multiple GPUs. Both experimentation and production training are part of effective deep learning innovation, but the tactics used in each phase are quite different.


Where To Go Now — Reflecting On The Road Traveled

At the beginning of this article we started out with the hope that we could use our fancy GPU computing hardware a little more smartly, recognizing that enormous numbers of computing cycles are lost every day. We learned that this is intimately related to both how we choose to code and how we set up and use our environment. Along the way we discovered some ideas for making our lives simpler as well as automating our tasks. We looked at some ways to distribute experiments across one or more GPUs, and we looked at the general differences between experimental training and production training. Hopefully as we went down this road something clicked or was confirmed.

One goal of this article was to provide a high level view of how to take an initial idea and iterate on it at high velocity, then be able to get it to production. In the deep learning space this tends to come down to effective GPU utilization and the choices made around coding and environment. I mentioned a future article about creating your own AutoML system. Please keep an eye out for that, as it is a really good exercise both in terms of learning how to utilize your hardware and understanding how to optimize model hyper-parameters and architectures.