Simple Ways to Speed Up Your PyTorch Model Training
If all machine learning engineers want one thing, it’s faster model training — maybe after good test metrics
Does this topic even need an introduction?
Speeding up model training is something every machine learning engineer wants. Faster training means faster experiments, which means faster iterations for your product. It also means a single training run requires fewer resources. So, straight to the point.
Containerization
Yes, this will not speed up your training on its own. But it targets another important aspect: reproducibility. Sometimes a virtualenv with pinned library versions is enough, but I encourage you to go one step further and build an all-in-one Docker container for your model training.
This ensures that the environment stays fully consistent during debugging, profiling, and the final training run. The last thing you want is to optimize a part of the code that is no longer a bottleneck because of, say, a Python 3.12 speedup, or to chase a bug that doesn't reproduce across CUDA versions.
As a starting point, you can use pre-built images from NVIDIA. They already have CUDA, PyTorch, and other popular libs installed:
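For example, a minimal Dockerfile on top of an NGC image might look like the sketch below. The image tag and requirements.txt are illustrative assumptions; pick the release that matches your driver and your own dependency list:

```dockerfile
# Base image from NVIDIA NGC with CUDA, cuDNN, and PyTorch preinstalled
# (the tag is an example; check NGC for the release that matches your driver)
FROM nvcr.io/nvidia/pytorch:24.01-py3

# Pin your own extra dependencies on top of the bundled stack
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training code into the image
COPY . /workspace/project
WORKDIR /workspace/project
```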
💡 A Docker container is the ultimate solution for problems like
“Hey, it works on my machine. I have no idea why it doesn’t on yours.”
Get comfortable with PyTorch profiler
Before optimizing anything, you have to understand how long different parts of your code take to run. The PyTorch profiler is almost an all-in-one tool for profiling training; a minimal usage sketch follows the list below. It can record:
- CPU operation timings
- CUDA kernel timings
- Memory consumption history
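Here is a minimal sketch of wrapping a few training steps with torch.profiler. The toy linear model and tensor shapes are placeholders, not anything from a real training loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch, just to have something to profile
model = torch.nn.Linear(1024, 1024).cuda()
batch = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,   # track allocations to inspect memory history
    record_shapes=True,    # record input shapes for each op
) as prof:
    for _ in range(10):
        loss = model(batch).sum()
        loss.backward()

# Show the ops that spent the most time on the GPU
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export a trace you can open in chrome://tracing or Perfetto
prof.export_chrome_trace("trace.json")
```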