Up-sampling and down-sampling with convolutions and transpose convolutions: a simple picture

Henry Bigelow
Aug 31, 2018

In this note, I show that the convolutions computed by PyTorch and TensorFlow can be replicated by multiplying the input by a sparse square matrix and then filtering the output elements with a mask. Transpose convolutions, in turn, are replicated by pre-processing the input with the mask and then multiplying by the transpose of that sparse square matrix. In both cases, the pattern of up-sampling or down-sampling is completely determined by the mask.

The Matrix

The matrix is a sparse square matrix, with one copy of the filter elements on each row. A “key” element of the filter is chosen, and each row of the matrix positions this key element over a different position in the input. The matrix thus has the key element on its diagonal. For a filter like this:

1D convolution filter

the resulting matrix is:

Square matrix with filter weights scanned across each row. Gray represents zero value.

Rows of the matrix correspond to filter positions, and columns correspond to input elements. The matrix is square because the key element is placed over every input position, giving one row per input element.

Multiplying this matrix by a vector of input elements is equivalent to positioning the key element of the filter over each input element in turn, performing the convolution, and collecting the outputs. After all, a single convolution is just a dot product, and a matrix multiplication is a dot product of each row with the input.
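
For concreteness, here is one way the make_matrix helper used below might look. This is a minimal sketch under my own assumptions (the key element taken as the central element, or left-of-center for even filter lengths); the author's actual implementation is not shown in the article.

import numpy as np

def make_matrix(matrix_sz, filter):
    '''Sketch: a matrix_sz x matrix_sz matrix with one copy of the filter
    on each row, shifted so that the key element sits on the diagonal.
    Filter elements that fall outside the input are dropped.'''
    filter = np.asarray(filter, dtype=float)
    key = (len(filter) - 1) // 2       # central element, or left-of-center
    mat = np.zeros((matrix_sz, matrix_sz))
    for row in range(matrix_sz):
        for j, w in enumerate(filter):
            col = row + j - key        # input position this element covers
            if 0 <= col < matrix_sz:
                mat[row, col] = w
    return mat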

The choice of stride and padding strategy determines which of these full-resolution outputs are discarded, i.e. ‘down-sampled’. Filtering the output elements with a mask does this:

matrix = make_matrix(matrix_sz=len(input), filter=[1,2,3,2,1])
conv_full = np.matmul(matrix, input)
conv_downsampled = do_mask(conv_full, mask)

In contrast to a regular convolution, the transposed convolution first ‘up-samples’ the input with the mask and then does matrix multiplication with the transpose of the filter matrix:

input_upsampled = un_mask(input, mask)
conv = np.matmul(np.transpose(matrix, (1, 0)), input_upsampled)

Upsampling, downsampling and the mask

The operations above are defined as:

def do_mask(ary, mask):
    '''remove elements where False in mask'''
    return ary[mask]

def un_mask(ary, mask):
    '''insert zeros where False in mask'''
    it = iter(ary)
    return np.array([next(it) if m else 0 for m in mask])

Here is an illustration:

The same mask is shown downsampling the input, then upsampling back to the same size

When do_mask and then un_mask are chained together, the original shape is restored, although the masked-out values are replaced with zeros.
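
For example, using the do_mask and un_mask defined above with a made-up mask over a length-5 input:

import numpy as np

ary = np.array([10, 20, 30, 40, 50])
mask = np.array([False, True, False, True, False])   # hypothetical mask

down = do_mask(ary, mask)    # -> [20, 40]
back = un_mask(down, mask)   # -> [ 0, 20,  0, 40,  0]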

Mask construction from Stride, Filter length and Padding

Each False element in the mask means the corresponding convolution result is either invalid, because the filter overhangs the padded input, or skipped because of the stride.

TensorFlow allows ‘VALID’ and ‘SAME’ padding strategies. ‘VALID’ means that only filter positions completely covered by the input are considered. In this case no padding is applied, so the positions near either end of the input, where the filter would overhang, are filtered out. In the matrix/mask approach, those positions are False in the mask.

For SAME padding, TensorFlow adds just enough padding elements that every on-stride position in the input can be used. When this number is odd, the right side of the input gets one more padding element than the left (see the TensorFlow source). In the matrix/mask approach, this corresponds to a mask in which no position is False due to edge effects (only the stride removes positions), with the filter’s key element chosen left-of-center in the case of even-length filters. PyTorch, instead, adds a user-provided number of padding elements to both the left and the right.

Here are a few examples:

Mask with filter length 5, VALID padding, stride 2, for input length 15
Mask with filter length 5, SAME padding, stride 2, for input length 15
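
For concreteness, here is one way the make_mask helper used in the full example below might be written. This is a minimal sketch based on the description above; the exact alignment rules are my assumption, and the author's implementation may differ.

import numpy as np

def make_mask(input_sz, filter_sz, stride, padding):
    '''Sketch: True marks kept filter positions. padding is 'VALID' or
    'SAME'; the key element is assumed central (left-of-center for even
    filter lengths).'''
    key = (filter_sz - 1) // 2
    left, right = key, filter_sz - 1 - key
    mask = np.zeros(input_sz, dtype=bool)
    if padding == 'VALID':
        # keep only positions where the filter lies fully inside the input,
        # stepping by the stride from the first such position
        mask[left:input_sz - right:stride] = True
    else:  # 'SAME': no edge effects, only the stride removes positions
        mask[::stride] = True
    return mask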

Dilation only affects the matrix

The default choice of dilation = 1 leaves the filter untouched. Dilation > 1 means the filter is augmented with dilation - 1 zero elements between each pair of filter elements. This affects only the matrix. The mask is affected indirectly under the VALID padding strategy, where more positions at the left and right ends become False because the longer, dilated filter overhangs the input. Here’s what the matrix for a dilation = 2 filter looks like.

Matrix with Dilation = 2, same filter as above
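
As an aside, the dilate_array helper used in the full example below could be written like this (a minimal sketch):

import numpy as np

def dilate_array(filter, dilation):
    '''Sketch: insert dilation - 1 zeros between consecutive filter
    elements, e.g. [1, 2, 3] with dilation=2 becomes [1, 0, 2, 0, 3].'''
    filter = np.asarray(filter, dtype=float)
    if dilation == 1:
        return filter
    out = np.zeros((len(filter) - 1) * dilation + 1)
    out[::dilation] = filter
    return out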

(Almost) Complete example

So, we can take the input, filter, padding strategy, and stride, and construct a matrix and a mask. Then we use the matrix and mask to perform both a convolution and a transpose convolution. Finally, we show that this matrix/mask approach produces results identical to PyTorch’s and TensorFlow’s:

# Mask and Matrix construction
mask = make_mask(len(input), len(filter), stride, padding)
matrix = make_matrix(len(input), filter)
# Matrix/Mask approach for convolution and transpose convolution
mm_conv = do_mask(np.matmul(matrix, input), mask)
mm_convt = np.matmul(np.transpose(matrix, (1, 0)),
                     un_mask(mm_conv, mask))
# Compare matrix/mask results with PyTorch
th_conv = F.conv1d(input, filter, None, stride, padding,
                   dilation, 1)
th_convt = F.conv_transpose1d(th_conv, filter, None, stride, padding,
                              output_padding, groups=1, dilation=dilation)
assert all(mm_conv == th_conv)
assert all(mm_convt == th_convt)
# Compare matrix/mask results with TensorFlow
dilated_filter = dilate_array(filter, dilation)
tf_conv = tf.nn.conv1d(input, dilated_filter, stride, padding)
output_shape = tf.constant([1, len(input), 1])
tf_convt = tf.contrib.nn.conv1d_transpose(tf_conv, dilated_filter,
                                          output_shape, stride, padding)
assert all(mm_conv == tf_conv)
assert all(mm_convt == tf_convt)

Note that I’ve omitted the details of dealing with batch and channel dimensions. As far as I’ve tested, this experiment works for all combinations of input size, filter size, stride, padding, and dilation. One small wrinkle is that TensorFlow’s conv1d_transpose doesn’t support dilations > 1 when stride > 1; we can simulate that by pre-constructing a dilated filter. Another wrinkle is that TensorFlow supports the padding types ‘VALID’ and ‘SAME’, while PyTorch accepts a single integer padding, so it is sometimes impossible to produce the same convolution result from PyTorch and TensorFlow. The matrix/mask approach is more general: it can replicate both PyTorch and TensorFlow results.

Generalizing to higher spatial dimensions

I’ve illustrated that PyTorch and TensorFlow convolutions and transpose convolutions in one spatial dimension are mathematically equivalent to the matrix/mask approach. It turns out that this is true in two or more spatial dimensions by first “unwrapping” both the filter and input into one dimension. I’ve done this experiment in my ml-tests repository. This uses TensorFlow’s functions:

tf.nn.conv1d and tf.contrib.nn.conv1d_transpose
tf.nn.conv2d and tf.nn.conv2d_transpose
tf.nn.conv3d and tf.nn.conv3d_transpose

and PyTorch’s functions (F = torch.nn.functional):

F.conv1d and F.conv_transpose1d
F.conv2d and F.conv_transpose2d
F.conv3d and F.conv_transpose3d

The procedure for dimensions > 1 is to first unfold the input into one dimension, using row major ordering. For any shape, this unfolding follows numpy’s:

input.reshape(-1)

Filter elements occupy the same unfolded coordinate space, so constructing the matrix for such a convolution works just as in one dimension. The key element of the filter is chosen as the central element (or the element just left of center, for even sizes) in each dimension. The one-to-one correspondence between input and output elements is defined by which input element the key element is positioned over, and the set of output elements retained is defined by the mask, as usual.
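
As an illustration, here is a minimal sketch of this matrix construction in two dimensions. It is my own example, not code from the repository, and it assumes odd filter sizes with a central key element.

import numpy as np

def make_matrix_2d(in_h, in_w, filter_2d):
    '''Sketch: build the (in_h*in_w) x (in_h*in_w) matrix for a 2D filter
    acting on a row-major-unfolded input.'''
    filter_2d = np.asarray(filter_2d, dtype=float)
    fh, fw = filter_2d.shape
    ky, kx = (fh - 1) // 2, (fw - 1) // 2      # key element: center
    n = in_h * in_w
    mat = np.zeros((n, n))
    for r in range(in_h):
        for c in range(in_w):
            row = r * in_w + c                 # unfolded output position
            for dy in range(fh):
                for dx in range(fw):
                    y, x = r + dy - ky, c + dx - kx
                    if 0 <= y < in_h and 0 <= x < in_w:
                        mat[row, y * in_w + x] = filter_2d[dy, dx]
    return mat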

In TensorFlow, stride can be applied independently in each spatial dimension, while padding strategy must be either ‘SAME’ or ‘VALID’ for all dimensions. It turns out that the multi-dimensional mask is just the Cartesian product of single-dimensional masks, using logical AND to combine them.
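
For instance, two hypothetical 1D masks could be combined like this (a sketch):

import numpy as np

# hypothetical per-dimension masks for a 4 x 3 input
mask_h = np.array([True, False, True, False])
mask_w = np.array([True, True, False])

# combine with logical AND over the Cartesian product of positions,
# then unfold in row-major order to match input.reshape(-1)
mask_2d = np.logical_and.outer(mask_h, mask_w)
mask_flat = mask_2d.reshape(-1)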

The full details are defined in the ml-tests repository.

Input and Output channels

In this discussion I said that a single convolution is a dot product between filter elements and input elements. But a dot product is just a sum of pairwise products. Each pair could be a scalar filter element and a scalar input element, or it could be a matrix filter element and a vector input element: then the multiplication is a matrix-vector product and the final addition is vector addition. Either way, nothing about the higher-level organization changes.
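
As a sketch (with made-up channel counts), the “dot product” of a single multi-channel convolution then looks like this:

import numpy as np

filter_len, out_ch, in_ch = 5, 4, 3

# each filter 'element' is an (out_ch, in_ch) matrix,
# each input 'element' under the filter is an in_ch vector
w = np.random.randn(filter_len, out_ch, in_ch)
x = np.random.randn(filter_len, in_ch)

# matrix-vector products summed over filter positions,
# yielding a single out_ch output vector
y = sum(w[i] @ x[i] for i in range(filter_len))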

Conclusion and perspective

One nice feature of this view of convolution and transposed convolution is that it defines a correspondence between input and output elements. The choices of stride and padding both affect the number of output elements produced, but this correspondence makes clear exactly which elements are omitted or up-sampled due to stride or edge effects.

This also made me realize that the notion of stride is restrictive in that it only allows up-sampling or down-sampling at integer multiples. Thinking in terms of the mask, we can see that arbitrary sampling patterns are in theory possible.

Thanks

Naoki Shibuya’s excellent article on transposed convolutions first helped me understand this subject, as did this tutorial.
