ImageNet — part 2: the road goes ever on and on

Julien Simon
Sep 24, 2017 · 5 min read

In a previous post, we looked at what it took to download and prepare the ImageNet dataset. Now it’s time to train!

Can you see it, Mister Frodo? Our first ImageNet model! Oh wait, we have to cross Mordor first…

The MXNet repository has a nice training script, so let’s use it right away.

python train_imagenet.py --network resnet --num-layers 50 \
--gpus 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
--data-train /data/im2rec/imagenet_training.rec \
--data-val /data/im2rec/imagenet_validation.rec

Easy enough. How fast is this running?

About 400 images per second, which means about 53 minutes per epoch. Over three and a half days for 100 epochs. Come on, we don’t want to wait this long. Think, think… Didn’t we read somewhere that a larger batch size will speed up training and help the model generalize better? Let’s figure this out :)
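For the record, here’s the back-of-the-envelope math behind those numbers, as a small Python sketch (assuming the usual ~1.28 million images in the ImageNet-1k training set):

# Rough training time estimate, assuming ~1.28M training images (ImageNet-1k).
TRAIN_IMAGES = 1_281_167

def epoch_minutes(images_per_sec):
    return TRAIN_IMAGES / images_per_sec / 60

minutes = epoch_minutes(400)
print(f"{minutes:.0f} minutes per epoch")                   # ~53 minutes
print(f"{minutes * 100 / 60 / 24:.1f} days for 100 epochs") # ~3.7 days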

Picking the largest batch size

Using the nvidia-smi command, we can see that the current training job only uses about 1,500 MB of GPU memory. Since we didn’t pass a batch size parameter to our script, it’s using the default value of 128. That’s not efficient at all.
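If you’d rather not stare at nvidia-smi by hand, here’s a small sketch that grabs per-GPU memory usage through its CSV output — handy while experimenting with --batch-size (you could just as well run watch -n 5 nvidia-smi in another terminal):

# One-shot check of per-GPU memory usage while tuning --batch-size.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.total",
     "--format=csv,noheader,nounits"]).decode()

for line in out.strip().splitlines():
    gpu, used, total = line.split(", ")
    print(f"GPU {gpu}: {used} MiB / {total} MiB")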

By trial and error, we can quickly figure out that the largest possible batch size is 1408. Let’s give it a try.

python train_imagenet.py --network resnet --num-layers 50 \
--gpus 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
--data-train /data/im2rec/imagenet_training.rec \
--data-val /data/im2rec/imagenet_validation.rec \
--batch-size 1408

That’s more like it: the GPU RAM is maxed out. Training speed should be much higher… right?

Nope. Something is definitely not right. Let’s pop the hood.

Detecting stalled GPUs

That’s not good: one second our GPUs are running at 100%, and the next they’re idle.
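A simple way to spot this behaviour without watching nvidia-smi all day is to sample GPU utilization every second or so — here’s one possible sketch:

# Sample GPU utilization once per second; a healthy run stays close to 100%,
# stalled GPUs show up as sudden drops towards 0%.
import subprocess, time

while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"]).decode()
    utils = [int(u) for u in out.split()]
    stalled = sum(1 for u in utils if u < 10)
    print(f"{time.strftime('%H:%M:%S')} util={utils} stalled={stalled}/{len(utils)}")
    time.sleep(1)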

It looks like they’re stalling over and over, which probably means that we can’t maintain a fast enough stream of data to keep them busy all the time. Let’s take a look at our Python process…

Scaling the Python process

Idle time is extremely high (id=80.5%), but there are no I/O waits (wa=0%). It looks like this system is simply not working hard enough. The p2.16xlarge has 64 vCPUs, so let’s add more decoding threads.

python train_imagenet.py --network resnet --num-layers 50 \
--gpus 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
--data-train /data/im2rec/imagenet_training.rec \
--data-val /data/im2rec/imagenet_validation.rec \
--batch-size 1408 --data-nthreads=32

Firing on all sixteen

What about our Python process?

It’s working harder as well, with all 32 threads running in parallel. Still no I/O wait in sight, though (thank you, EBS). In fact, we seem to have a nice safety margin when it comes to I/O: we could certainly add more threads to support a larger batch size or faster GPUs if we had them.

What about training speed? It’s nicely cruising at a stable 700+ images per second. That’s a 75% increase from where we started, so the tweaking was definitely worth it.

An epoch now completes in about 30 minutes, which gives us just a little over two days for 100 epochs. Not too bad.
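The same back-of-the-envelope math as before checks out (still assuming ~1.28 million training images):

# Sanity check at 700 images per second.
TRAIN_IMAGES = 1_281_167
secs_per_epoch = TRAIN_IMAGES / 700
print(secs_per_epoch / 60)               # ~30 minutes per epoch
print(secs_per_epoch * 100 / 3600 / 24)  # ~2.1 days for 100 epochs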

Optimizing cost

Let’s look at spot prices for the p2.16xlarge. They vary a lot from region to region, but here’s what I found in us-west-2 (hint: the describe-spot-price-history API should help you find good deals really quickly).
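If you’d rather script the search than click around the console, here’s a minimal boto3 sketch (assuming your AWS credentials are already set up):

# List recent p2.16xlarge spot prices in us-west-2.
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_spot_price_history(
    InstanceTypes=["p2.16xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
)
for price in resp["SpotPriceHistory"]:
    print(price["AvailabilityZone"], price["SpotPrice"])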

Yes, ladies and gentlemen. That’s an 89% discount right there: the full training run would now cost something like $80.

Conclusion

In the next post, I think we’ll look at training ImageNet with Keras, but I’m not quite sure yet :D

As always, thank you for reading.


Congratulations if you caught the Bilbo reference in the title. You’re a proper Tolkien nerd ;)
