Code on Tensors Representing Images

James Cavanagh
Published in unpack · 6 min read · Jun 15, 2021

To learn about tensors, we will first go through a definition of what a tensor is, then the benefits of tensors for parallel processing, and finally some points on the rules of linear algebra that relate to what is called broadcasting. We will work through this with an example: creating an abstract, ideal digit by calculating the mean of all the digits in the MNIST dataset. This takes two steps: adding all of the digits together, and then dividing the sum by the total by means of broadcasting.

First, a tensor is very similar to constructs found in other disciplines of computer science and mathematics. Yet a PyTorch tensor is distinct in several key respects that optimize it for use in parallel computing.

In mathematics, a matrix is a convenient way of representing numbers arranged in rows and columns; more generally, such an arrangement in one or more dimensions is known as an nd-array in computer science. Below is a 3x3 example of such an arrangement of numbers.
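For instance, a 3x3 arrangement might look like this as a PyTorch tensor (the values here are arbitrary):

```python
import torch

# A rank-2 tensor (a matrix): 3 rows by 3 columns
m = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

print(m.shape)  # torch.Size([3, 3])
print(m.ndim)   # 2 -- the rank of the tensor
```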

This is an example of a 2d array, matrix, or tensor, but the concept extends into many dimensions. Though difficult to show here, one could model a Rubik's Cube in three dimensions. For tensors, the number of dimensions is known as the rank of the tensor, so a Rubik's Cube would be represented by a rank 3 tensor. In this case, the normal base-zero thinking doesn't apply, because rank zero is taken by the content inside each element: a single number, known as a scalar.

The main advantage of a matrix, array, or tensor is that every element is identified by a key or index.

These keys are so important because they allow the computer to easily find and keep track of every part of the array, compared to a glob of unstructured data. It's like having a book with a table of contents versus a mass of papers spread out over a desk.
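As a small illustration (again with arbitrary values), every element of a tensor is addressed directly by its indices:

```python
import torch

# A 3x3 tensor holding the values 0 through 8
t = torch.arange(9).reshape(3, 3)

# Each element is located directly by its (row, column) index --
# no searching through unstructured data is needed.
print(t[1, 2])  # tensor(5) -- row 1, column 2
print(t[0])     # the whole first row: tensor([0, 1, 2])
```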

Although a PyTorch tensor is extremely similar to an nd-array, it differs in a couple of respects that warrant the new term. The first is that all the elements in a PyTorch tensor must be of the same data type. Although this is a rigid requirement, it has a huge benefit in terms of processing time, since the program does not have to slow down to infer what is in each element of the tensor. The second difference is that tensors are optimized so that they can live in the GPU's RAM, which speeds things up considerably because the data is not going back and forth across different parts of the computer.
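A quick sketch of both properties; the transfer to the GPU is guarded so the code also runs on a machine without one:

```python
import torch

# Every element shares one dtype, so no per-element type inference is needed
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(t.dtype)  # torch.float32 -- one type for the whole tensor

# Even a mixed list is resolved to a single dtype at creation time
mixed = torch.tensor([1, 2.5])
print(mixed.dtype)  # torch.float32 -- the int is promoted

# Tensors can be moved onto the GPU's memory when one is available
if torch.cuda.is_available():
    t = t.to("cuda")
print(t.device)
```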

Now, let's consider the example of using computer vision to identify handwritten digits. One way to solve the problem is to create an abstract, ideal image by averaging many examples of handwritten digits. A new handwritten digit can then be compared to this average in order to guess which digit it is. Having an ideal image for each digit from 0–9 will allow us to compare each handwritten digit against all ten, and make a prediction based on which ideal digit it is most similar to.

We as people can read messy handwriting because we have seen many examples of different people's writing, as well as the context surrounding a digit. We can show the computer how to do this by averaging thousands of examples of handwritten digits into one blend that incorporates the information it needs.

There are hundreds of thousands of tiny calculations that need to be done in order to solve this problem. Fortunately, since image data can be converted into a PyTorch tensor, we can take advantage of parallel processing.

From the basic order of operations, we know that the order in which we perform pure addition doesn't matter (we are saving the division for later), so the computer can do the task in parallel across hundreds of GPU cores instead of serially on a single CPU.

Continuing this line of thought, we might as well treat each 2d image file as a 2d array, since they are mathematically equivalent. This moves everything from thousands of file paths on the hard drive into RAM, which is orders of magnitude faster. And since each of these 2d arrays is the same size, we can take this one step further and treat all of them as one 3d stack of images. Then, as a PyTorch tensor, the stack can be moved from the computer's RAM onto the GPU's RAM so it is all in one place.
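As a sketch, with random values standing in for the decoded MNIST files (real code would open each image file and convert it to a tensor):

```python
import torch

# Stand-in for ~6000 decoded 28x28 grayscale images (values 0-255)
images = [torch.randint(0, 256, (28, 28)) for _ in range(6000)]

# Stack the 2d arrays into one rank-3 tensor: 6000 x 28 x 28
stacked = torch.stack(images)
print(stacked.shape)  # torch.Size([6000, 28, 28])

# One call moves the whole stack onto the GPU's RAM, if present
if torch.cuda.is_available():
    stacked = stacked.to("cuda")
```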

Once everything is lined up in a rank 3 tensor, the calculation of adding up each element can proceed incredibly fast compared to using a list of lists in Python. In pure Python, the computer has to slow down not only to infer the type of each element, but also to locate each element, which is stored somewhere in RAM. In most cases this doesn't matter compared to developer productivity, but low-level C code can run much faster because the data types are explicitly stated.

However, this approach is so agnostic that the problem can be looked at from different angles, which can get very confusing. Since what we want is to average all the pixels across the images into one single image, which in this case is 28x28 pixels, we have to tell the program exactly what we want.

Since we now have a stack of ~6000 28x28 images, we can ask the GPU to perform this simple addition along different axes: reducing vertically through the stack produces a small 28x28 array of large numbers, while reducing along either of the other axes produces a ~6000x28 array instead. Going back to the benefits of indexing discussed earlier, these calculations are all the same to the GPU, since it can locate each element so easily. As a result, we have to specify that we want to reduce along the zero axis in order to get the 28x28 2d array that we need.
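A sketch of how the choice of axis changes the result (random values standing in for the stacked pixel data):

```python
import torch

stacked = torch.rand(6000, 28, 28)  # stand-in for the image stack

# Reducing along different axes gives very different shapes:
print(stacked.mean(dim=0).shape)  # torch.Size([28, 28])   -- pixel-wise, across images
print(stacked.mean(dim=1).shape)  # torch.Size([6000, 28]) -- collapses rows within each image
print(stacked.mean(dim=2).shape)  # torch.Size([6000, 28]) -- collapses columns within each image

# The one we want: the reduction over the zero axis
ideal = stacked.mean(dim=0)
```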

Getting a 28x28 2d array is important not only for our goal, which is comparing new images to this ideal digit, but also because of the rules of linear algebra the computer is using.

By now, our ideal image is just a very large sum of the 0–255 grayscale pixel values stacked on top of each other from the thousands of images. To complete the calculation of the mean, we need to divide each pixel by the height of our tensor, i.e. the number of images, N.

In normal arithmetic this is straightforward, but not so in higher dimensions. Matching shapes are essential to linear algebra, and by extension to our GPU calculations; if the shapes don't match, the operation fails, like a round peg that won't fit into a square hole. Strictly speaking, it is simply not possible to add, subtract, multiply, or divide tensors of different sizes elementwise.
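For example, genuinely mismatched shapes raise an error (the shapes here are chosen purely for illustration):

```python
import torch

a = torch.ones(28, 28)
b = torch.ones(28, 27)

# These shapes are incompatible: 28 vs 27 in the last dimension,
# and neither is 1, so not even broadcasting can reconcile them.
failed = False
try:
    c = a + b
except RuntimeError:
    failed = True

print("operation failed:", failed)
```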

However, there is a solution to this, called broadcasting, that allows us to perform the operation we need on our 28x28 2d array. By hand, instead of dividing by the scalar height of the tensor (N), we would have to build a 28x28 array filled with the value N and divide elementwise.

PyTorch's broadcasting ensures two key considerations are met: first, that the rules of linear algebra are followed; second, that this happens without the extra computational overhead and the error-prone manual step of creating that array of N ourselves.

Once we divide the sum of each element of the 2d array by N, we are then left with the mean image.
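A minimal sketch of this final step, with random values standing in for the pixel data:

```python
import torch

n = 6000
stacked = torch.rand(n, 28, 28)  # stand-in for the image stack
summed = stacked.sum(dim=0)      # 28x28 sum over all images

# Broadcasting: the scalar n is treated as if it were a 28x28
# array of n, without that array ever being materialized
mean_image = summed / n
print(mean_image.shape)  # torch.Size([28, 28])

# Same result as asking for the mean directly
assert torch.allclose(mean_image, stacked.mean(dim=0))
```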

From all of this, we get a 28x28 tensor which, when represented graphically, appears as a fuzzy digit. This fuzzy digit is useful because it gives us a standard of comparison. All of these computations take place very quickly compared to the process of loading the images into the program and converting them into tensors.
