A practitioner's guide to PyTorch

Radek Osmulski
Towards Data Science
3 min read · Nov 12, 2017


I started using PyTorch a couple of days ago. Below I outline the key PyTorch concepts, along with a couple of observations that I found particularly useful as I was getting my feet wet with the framework (and which can lead to a lot of frustration if you are not aware of them!).

Glossary

Tensor — (like) a numpy.ndarray, but it can live on the GPU.

Variable — wraps a tensor so that it can take part in a computation. If created with requires_grad=True, it will have its gradients calculated during the backward pass.
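For instance, a minimal sketch of the two (note that from PyTorch 0.4 onwards the Variable wrapper has been merged into Tensor, so you can pass requires_grad=True to tensor creation functions directly):

```python
import torch
from torch.autograd import Variable

t = torch.ones(3)                    # a Tensor -- behaves much like a numpy array
v = Variable(t, requires_grad=True)  # a Variable wrapping t, so gradients will be tracked for it
```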

How PyTorch works

You perform calculations by writing them out, like this:
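For example, a minimal sketch of such a computation (the concrete values are made up, and a CUDA-capable GPU is assumed):

```python
import torch
from torch.autograd import Variable

# Create the data directly on the GPU (assumes a CUDA-capable GPU is available).
x = Variable(torch.cuda.FloatTensor([1., 2., 3.]))
w = Variable(torch.cuda.FloatTensor([0.5, 0.5, 0.5]), requires_grad=True)

# Dot product via the Python 3.5 @ operator.
result = x @ w
```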

Taking the dot product of x with w using the new Python 3.5 @ operator. torch.cuda creates the tensors on the GPU.

Once you are done, all you need to do is call .backward() on the result. This will calculate the gradients, and you will be able to access them for the Variables that were created with requires_grad=True.
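Continuing the sketch from above:

```python
result.backward()  # compute gradients for every Variable created with requires_grad=True

print(w.grad)      # the gradient of result with respect to w
print(x.grad)      # None -- x was not created with requires_grad=True
```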

Accessing the gradient of w.

Things to keep in mind (or else risk going crazy)

  1. Datatypes matter!
    What will happen if you divide a ByteTensor by 50? What if you try to store exp(12) in a HalfTensor? In reality, this is more straightforward than it might seem at first, but it is something that does require a bit of thought.
  2. If it can overflow or underflow, it will.
    Numerical stability is a big one — can a division ever result in a zero when stored in the resulting tensor? What if you then try to take the log of it?
    Again — this is nothing that elaborate, but it is certainly something to be aware of. And if you want to go deeper on this, let me refer you to the ultimate, yet very approachable, source on this matter.
  3. Gradients accumulate by default!
    I left the best one for last — the above two are friendly chihuahuas in comparison to the wild beast that we will look at now!
    By default, gradients accumulate. You run a computation once, you run it backwards — everything is fine. But on the second run, the gradients get added to the gradients from the first run! This is so important and so easy to forget; please take a look at a follow-up to our earlier example:
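A minimal sketch, reusing the x and w from above:

```python
result = x @ w
result.backward()
print(w.grad)  # the gradients after this backward pass

# Run the same computation and backward pass a second time...
result = x @ w
result.backward()
print(w.grad)  # ...and the new gradients have been added on top of the old ones!
```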

The solution is to zero the gradients manually between runs.
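With the Variable-based sketch above, that can be done like this (inside a training loop you would more commonly call optimizer.zero_grad() on your optimizer instead):

```python
# Reset the accumulated gradients before the next backward pass.
w.grad.data.zero_()
```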

Last but not least, I would like to recommend the official tutorials — regardless of your level of experience, they are a great place to visit.

I hope this will be helpful to you and will save you some of the struggle I experienced when setting out to learn PyTorch. Best of luck to the both of us as we continue to master this amazing framework!

If you found this article interesting and would like to stay in touch, you can find me on Twitter here.
