# Spherical Convolution — A Theoretical Walk-Through

Convolution is an extremely effective technique for capturing useful features from data distributions. In particular, convolution-based deep neural networks have performed exceedingly well on 2D representation learning tasks, e.g. image analysis. Given this success, it is natural to investigate how to use this concept to capture features in other settings.

However, most state-of-the-art deep neural networks operate only on Euclidean geometries, and extending this concept to other manifolds, such as spheres and balls, is an open research problem. Representing data on spheres or in balls can be quite natural and effective in cases such as analyzing 3D data. However, achieving this is not straightforward. The main difficulty in adapting convolution to such manifolds is that, in contrast to planar data, the spacing between adjacent points is not uniform. This is illustrated in Figure 1.

As shown in Figure 1, one can try to model the points of the spherical domain as a uniformly spaced Cartesian grid and perform 3D convolution on it. However, this is neither accurate nor elegant, as the spacing between points varies between the equator and the poles.

So it is clear that this task cannot be achieved via such naive methods. Recently, a few papers have tackled this problem and achieved impressive results (Cohen et al. https://arxiv.org/abs/1801.10130, Esteves et al. https://arxiv.org/abs/1711.06721). Spherical convolution is one approach to solving this problem.

Spherical convolution is performed over the surface of the unit sphere. **The unit sphere is the sphere of unit radius centered at the origin.**

First of all, what does representing data on the surface of the unit sphere mean?

Let’s say you have a set of (*x, y, z*) points. You can represent each of these points in a different format using (*r, theta, phi*). These parameters are illustrated in Figure 2.

However, representing data on the surface of the unit sphere is a little different. A function on the surface of the unit sphere can be parameterized using just (*theta, phi*), because *r* is always *1*. Let’s say we have a point (*x, y, z*). We calculate *theta* and *phi* from these coordinates and make *r* the value of the function at (*theta, phi*). More formally, if our surface function on the unit sphere is *f*, then we model our data as *f(theta, phi) = r*. This allows us to model our points as a heat map on the surface of the sphere. This is graphically illustrated in Figure 3.
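As a concrete sketch of this representation, here is a minimal numpy snippet that converts Cartesian points to (*r, theta, phi*), so that *f(theta, phi) = r*. The function name and angle conventions (polar angle from the +z axis, azimuth in the x-y plane) are my own choices for illustration.

```python
import numpy as np

def cart_to_spherical(points):
    """Convert an (N, 3) array of (x, y, z) points to (r, theta, phi).

    theta: polar angle measured from the +z axis, in [0, pi]
    phi:   azimuthal angle in the x-y plane, in (-pi, pi]
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arccos(np.clip(z / r, -1.0, 1.0))  # polar angle
    phi = np.arctan2(y, x)                        # azimuth
    return r, theta, phi

# Modelling f(theta, phi) = r: each direction stores the radius of the
# point lying in that direction.
points = np.array([[1.0, 0.0, 0.0],   # on the +x axis: theta = pi/2, phi = 0
                   [0.0, 0.0, 2.0]])  # on the +z axis: theta = 0
r, theta, phi = cart_to_spherical(points)
```

Each (*theta, phi*) direction then carries a single scalar value, which is exactly the heat-map picture above.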

However, there is a little drawback here. I would urge you to stop reading and try to figure it out yourself before going further.

………………………………………..

Did you catch it?

What if we have multiple points that lie in the same (*theta, phi*) direction? We can model only one of them in this setting. This has a smoothing effect on the overall shape of the 3D function. This issue can, however, be addressed by modeling the function in the interior of the unit sphere (volumetric convolution), which I will explain in a future post.

Okay, now that we have a basic idea of the data representation, let’s move forward!

First of all, let’s get an insight into what spherical convolution actually means.

It’s rather simple to show graphically, as I have done in Figures 4 and 5. In planar convolution, the kernel moves across the image plane, taking the inner product between the kernel and the image function. Similarly, in spherical convolution, the kernel moves over the surface of the sphere, taking the inner product between the kernel and the surface function.

So, in the same way the kernel picks up patterns on the plane in the 2D case, a spherical kernel can pick up patterns distributed on the surface of the sphere. Since the kernel shares weights across the plane, we have translational equivariance in 2D convolution. Similarly, since the kernel moves (rotates) on the surface of the unit sphere, we have rotational equivariance in spherical convolution. **Simply put, we get equivariant responses even if the input 3D object is rotated. This is a key advantage of spherical convolution.**

Right. That’s the overall picture of spherical convolution. If you hate mathematics and just wanted a high-level overview, you can stop here :). However, if you are eager to see how this is actually done, read on!

It is easy to get an idea of spherical convolution from Figures 4 and 5. But this is easier said than done: performing spherical convolution involves a fair bit of mathematics.

As you might have already guessed, performing spherical convolution directly in the spatial domain is extremely difficult. So we transform the spherical functions into a spectral representation based on **spherical harmonics**. I hope you are familiar with the Fourier transform. **Recall that convolution in the spatial/time domain is equivalent to multiplication in the Fourier domain. Therefore, a widely used technique for reducing computational complexity is to first convert the functions to the Fourier domain and then multiply them to obtain the convolution. The technique used in spherical convolution is exactly analogous.**
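To make the convolution theorem concrete, here is a small numpy sketch of the 1D analogue: a circular convolution computed directly, and again as a pointwise product in the Fourier domain. The signal lengths and random inputs are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=64)
g = rng.normal(size=64)
N = len(f)

# Direct circular convolution: (f * g)[k] = sum_n f[n] g[(k - n) mod N]
direct = np.array([sum(f[n] * g[(k - n) % N] for n in range(N))
                   for k in range(N)])

# Convolution theorem: transform, multiply pointwise, transform back
spectral = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real
```

The two results agree to floating-point precision; spherical convolution exploits the same idea, with spherical harmonics playing the role of the Fourier basis.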

It can be shown that any function on the surface of the sphere can be approximated using spherical harmonic moments. Spherical harmonics are a complete and orthogonal set of basis functions on the surface of the unit sphere. Some of you may not be familiar with orthogonal basis functions, so **I will write a separate post on the properties of these functions and link it here later**. For now, take the following equations for granted; as long as you are good with basic calculus, you are good to go :).

Spherical harmonics have two special properties: completeness and orthogonality. Let’s see what these mean.

**Completeness:** This means that we can approximate any function on the surface of the sphere using spherical harmonics (Equation 2).

**Orthogonality:** This means that spherical harmonics satisfy the relationship in Equation 1.
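The orthogonality relationship (Equation 1) can be checked numerically. Below is a numpy-only sketch: a few low-order spherical harmonics written out explicitly, and their inner products computed on a midpoint quadrature grid over the sphere. The grid size and the small table of harmonics are my own choices for the demo.

```python
import numpy as np

# A few low-order spherical harmonics Y_{l,m}(theta, phi), written out
# explicitly (theta = polar angle, phi = azimuth).
def Y(l, m, theta, phi):
    if (l, m) == (0, 0):
        return 0.5 * np.sqrt(1 / np.pi) * np.ones_like(theta) * np.exp(0j)
    if (l, m) == (1, 0):
        return 0.5 * np.sqrt(3 / np.pi) * np.cos(theta) * np.exp(0j)
    if (l, m) == (1, 1):
        return -0.5 * np.sqrt(3 / (2 * np.pi)) * np.sin(theta) * np.exp(1j * phi)
    if (l, m) == (2, 0):
        return 0.25 * np.sqrt(5 / np.pi) * (3 * np.cos(theta) ** 2 - 1) * np.exp(0j)
    raise NotImplementedError

# Midpoint quadrature grid over the sphere
n = 300
theta = (np.arange(n) + 0.5) * np.pi / n       # polar angle in (0, pi)
phi = (np.arange(2 * n) + 0.5) * np.pi / n     # azimuth in (0, 2*pi)
T, P = np.meshgrid(theta, phi, indexing="ij")
dA = (np.pi / n) ** 2                          # dtheta * dphi

def inner(l1, m1, l2, m2):
    """<Y_{l1,m1}, Y_{l2,m2}> over the sphere (area element sin(theta) dtheta dphi)."""
    return np.sum(Y(l1, m1, T, P) * np.conj(Y(l2, m2, T, P)) * np.sin(T)) * dA
```

Same-index pairs integrate to 1 and distinct pairs to 0 (up to quadrature error), which is exactly Equation 1.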

Let *f(theta, phi)* be our surface function. Then it can be rewritten using spherical harmonic moments as,

where *Y_{l,m}* is the (*l, m*)-th order spherical harmonic function (https://en.wikipedia.org/wiki/Spherical_harmonics) and *\hat{f}* can be obtained using,
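This moment formula can be sanity-checked numerically. In the numpy sketch below I pick a surface function that is a pure harmonic, *f = cos(theta) = sqrt(4π/3) Y_{1,0}*, and recover its (1, 0) moment by quadrature; the grid size is an arbitrary choice for the demo.

```python
import numpy as np

# Midpoint quadrature grid over the sphere
n = 400
theta = (np.arange(n) + 0.5) * np.pi / n       # polar angle
phi = (np.arange(2 * n) + 0.5) * np.pi / n     # azimuth
T, P = np.meshgrid(theta, phi, indexing="ij")
dA = (np.pi / n) ** 2

# Y_{1,0}(theta, phi) = 0.5 * sqrt(3/pi) * cos(theta), written out explicitly
Y10 = 0.5 * np.sqrt(3 / np.pi) * np.cos(T)

# Surface function f = cos(theta), which is exactly sqrt(4*pi/3) * Y_{1,0},
# so its (1, 0) moment should come out to sqrt(4*pi/3)
f_vals = np.cos(T)
hat_f_10 = np.sum(f_vals * np.conj(Y10) * np.sin(T)) * dA
```

In practice one computes such moments for all (*l, m*) up to a chosen band limit, giving the coefficient vectors used later in the post.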

Let our kernel be denoted *g(theta, phi)*. Then, to perform spherical convolution, as we explained graphically above, **1) we apply a 3D rotation to *g*, and 2) take the inner product between *f* and *g*.**

*The inner product is just the integration, over the surface of the sphere, of the product of the two functions. Let’s see how this can be done.*

Any 3D rotation can be decomposed using Euler angles as,
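This z-y-z Euler decomposition can be verified numerically with scipy's rotation utilities (assuming scipy is available); the particular rotation below is an arbitrary choice for the demo.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# An arbitrary 3D rotation
R = Rotation.from_rotvec([0.3, -0.5, 0.8])

# Decompose into intrinsic z-y-z Euler angles (alpha, beta, gamma) ...
alpha, beta, gamma = R.as_euler("ZYZ")

# ... and rebuild the rotation from them: the two agree exactly
R_rebuilt = Rotation.from_euler("ZYZ", [alpha, beta, gamma])
```

Any rotation of the kernel *g* on the sphere can therefore be described by these three angles.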

Applying a 3D rotation to *g* and taking the inner product gives us,

Using the completeness property (Equation 2) we write Equation 5 as,

Rearranging Equation 6 gives,

However, spherical harmonics have the following interesting property when a 3D rotation is applied,

where *D* is the Wigner-D matrix (https://en.wikipedia.org/wiki/Wigner_D-matrix). Substituting this result into Equation 7 gives us,

Rearranging the equation further gives us,

But we know that spherical harmonics have the orthogonality property (Equation 1). This simplifies Equation 10 to (I will drop the constants here),

Now this is a very interesting result. If we consider the Fourier decomposition of a function *h* defined on 3D rotations, it holds the following property.

If we compare Equation 12 with Equation 11, we have an important result: we can obtain the Fourier decomposition of the spherical convolution by just taking the outer product of *\hat{f}(l, m)* and *\hat{g}(l, m')* at each *l*. Let’s say, in practice, we take the maximum *l* to be 3. Then this whole process can be achieved in a deep network using the procedure in the figure below.

First we compute *\hat{f}* and *\hat{g}* (which are vectors) for each channel and take the outer product at each *l*, which gives us the components of the Fourier transform of the convolution. In the upper layers, convolution is achieved by taking the outer product between the matrices (as opposed to the vectors in the first layer) obtained at the previous layer. I will not derive that result here, since it is very similar to the process I described.
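The per-*l* outer-product step can be sketched in a few lines of numpy. The coefficient layout below (one length-(2*l*+1) vector per degree *l*) and the random inputs are my own choices for illustration, and conventions on which factor is conjugated vary between papers.

```python
import numpy as np

L_max = 3  # band limit, matching the maximum l = 3 used above
rng = np.random.default_rng(0)

# Hypothetical layout: hat_f[l] holds the (2l+1) moments hat_f(l, m), m = -l..l
hat_f = [rng.normal(size=2 * l + 1) + 1j * rng.normal(size=2 * l + 1)
         for l in range(L_max + 1)]
hat_g = [rng.normal(size=2 * l + 1) + 1j * rng.normal(size=2 * l + 1)
         for l in range(L_max + 1)]

# Spectral spherical convolution: at each degree l, the output coefficient
# block is the outer product of the two moment vectors, a (2l+1) x (2l+1)
# matrix (the Fourier coefficient of a function on the rotation group)
hat_h = [np.outer(f_l, np.conj(g_l)) for f_l, g_l in zip(hat_f, hat_g)]
```

The resulting blocks have shapes 1×1, 3×3, 5×5, 7×7, matching the sizes of the Wigner-D matrices at each degree; in deeper layers the same per-*l* products are taken between such matrices instead of vectors.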

Right! That’s all there is to it. If you have any suggestions or corrections, I would be glad to hear them :). Thanks a lot for going through the post!