# Geometric Deep Learning: Group Equivariant Convolutional Networks

## Towards smart and efficient image recognition algorithms

In this blog, I explain group equivariant convolutional networks, or G-CNNs. In 2016, T. S. Cohen and M. Welling published an article that introduced the concept and evaluated it, achieving what were then state-of-the-art results on some well-known data sets, including CIFAR10. They designed layers that are equivariant to relevant isometries of the sampling lattice beyond the translations that conventional CNNs exploit [1]. My hope is to give the reader a more detailed explanation of their findings, supported by an improved visual interpretation.

The advent of deep convolutional networks drastically increased the performance of many computer vision models. The ability of software to learn the patterns that make up objects of interest in an image or video, such as animals, cars, faces, fingerprints and license plates, opens up many opportunities. One interesting application is the detection of severe diseases (e.g. cancer, malaria and tuberculosis) purely on the basis of images taken from the patient, where the network learns to discriminate between clean and infected cells or tissue. These methods are generally faster than, and in some cases outperform, conventional methods [2], [3].

One should know that convolutional networks, *convnets*, are in essence special cases of neural networks, *NNs*, and are particularly well suited to learning features in images. By design, convolution layers preserve translational structure: if the object is shifted along the sampling lattice, the output feature map shifts along with it, at least up to edge effects. In other words, convolutional layers are equivariant under translation: convolving a translated image is the same as translating the convolved image. Compared to plain NNs, this directly enables efficient detection of objects of the same shape and orientation, and it boosted the use of convnets in many computer vision applications.

Even though this conventional convolution is by far the most used operator in many models, networks still have to be huge (either extremely deep or extremely wide) to achieve good performance. Besides changing the network's architecture, we can also develop a smarter operator within it. This is exactly what T. S. Cohen and M. Welling did by introducing group equivariant convolutional networks. Besides translations, G-CNNs exploit other realistic symmetries of the sampling lattice as well. The workhorses are the group convolutions, or G-convolutions, which further increase the degree of weight sharing. Depending on the specific implementation used, feature maps in G-CNNs are equivariant under the newly imposed symmetry transformations. Specifically, Cohen and Welling designed G-convolutions whose feature maps are equivariant under 90-degree rotations and mirror reflections in addition to shifts. This means the network can recognize differently oriented versions of an object as the same in a smart manner. Designing operators that exploit such efficient weight sharing can help build small, intelligent and well-performing networks.

In this blog, I lay emphasis on the *implementation* of G-convolutions and hope to equip the reader with the knowledge to build these group convolutional layers him- or herself. I will do so by first giving a detailed visual interpretation in which I *show* how G-convolutions work and how they should be implemented. The G-convolutions will be built up from scratch to slow down the pace and make sure you understand them. The second part is for those with a more profound mathematical background. In this part, I *prove* that G-convolutions are indeed equivariant under rotations and reflections.

G-convolutions rely on the formation of *groups*. In our case, the groups consist of symmetry transformations of the sampling lattice: shifts (translations), *90-degree* rotations (pure rotations) and/or *mirror* reflections (pure reflections). These transformations change the orientation of the sampling lattice, and the possible orientations are the elements of the group. A G-convolution is a specific function on that group. This, in a nutshell, is what needs to be addressed in detail to fully understand G-convolutions. Let's start.

Since we are talking about transformations that are perceptible in the real world, these G-convolutions should be easier to understand with the help of visuals. The idea is to generate a *graph* or *abstract pattern* that shows how *all* elements of the group are connected through their *group actions*. Rather than racing ahead with mere words, let us start with a simple object and create the relevant graph(s) from scratch. Consider the following cactus as our object.

We will denote a (pure) rotation and a (pure) reflection as *r*() and *m*(). For brevity we drop the brackets and perform each separately on the object:

Extending this further, we allow every possible combination of *r* and *m* applied sequentially. For example, we can apply a reflection first, followed by two rotations:

Not surprisingly, some combinations lead to equivalent *poses* the object can adopt (e.g. *rm* = *mr³* and *r⁴* = *e*). In a Cartesian sampling grid, the possible orientations are limited to eight. The ingenious idea is to generate *graphs* in which the *nodes* denote the possible poses and the *connections* between them the transformations. These graphs are depicted below, in which a red arrow symbolizes a rotation and a blue line a reflection. We define *p4* as the group that includes all rotations (and translations) and *p4m* as the group that additionally includes mirror reflections.
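These pose relations are easy to check numerically. In the sketch below (my own illustration, not code from [1]), `np.rot90` plays the role of *r* and `np.fliplr` the role of *m*, acting on a small array that stands in for the cactus:

```python
import numpy as np

x = np.arange(9).reshape(3, 3)  # any asymmetric array stands in for the cactus

def r(a):
    """One 90-degree rotation of the sampling grid."""
    return np.rot90(a)

def m(a):
    """Mirror reflection of the sampling grid."""
    return np.fliplr(a)

# r^4 = e: four rotations give back the original pose
assert np.array_equal(r(r(r(r(x)))), x)

# rm = mr^3 (the right-most transformation is applied first)
assert np.array_equal(r(m(x)), m(r(r(r(x)))))

# the eight poses of p4m: four rotations of x plus four of its mirror image
poses = [np.rot90(x, k) for k in range(4)] + [np.rot90(m(x), k) for k in range(4)]
print(len({p.tobytes() for p in poses}))  # 8 distinct poses
```

Any sufficiently asymmetric array works here; a symmetric one would make some poses coincide.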

Mathematicians call these graphs (Cayley) diagrams of the *groups* if they obey some specific rules that are more subtle than I will explain here. The idea is that the graph is perfectly symmetrical in a certain sense: from your perspective, the graph looks identical no matter which node you are at, and sequentially taking a red, a blue, a red and a blue path will always lead you back to your initial position. Understanding how graphs can illustrate groups will help you understand the way G-convolutions work and thus how they are implemented. Moreover, it will help you understand the proof later on.

I think we are ready for the first step. Start with a simple square image as input for the first G-convolution and consider the group p4; I will stick to the cactus. First, transform the input to every possible pose (four in this case), after which you convolve every one of them with the *same* filter, creating four feature maps. The main idea is to place every feature map in the graph proposed above according to the transformation it had to undergo. In practice, to reduce computational overhead, the filter (which is generally much smaller) is transformed instead of the input, giving a similar outcome. Let me clarify the first G-convolution with an illustration.

Observe that the first G-convolution, *P4ConvZ2*, adds an extra dimension that represents the group elements. Visually, it means that the feature maps are placed at different locations in the graph. The resulting feature map is said to be *structured*.
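A minimal NumPy sketch of this first G-convolution may help. The function names (`corr2d`, `p4_conv_z2`) are my own, and the loops are kept naive for clarity:

```python
import numpy as np

def corr2d(f, psi):
    """Plain 'valid' cross-correlation of image f with filter psi."""
    H, W = f.shape
    h, w = psi.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + h, j:j + w] * psi)
    return out

def p4_conv_z2(f, psi):
    """First G-convolution: correlate f with the four rotations of a
    single filter; the output gains a group axis of length 4."""
    return np.stack([corr2d(f, np.rot90(psi, k)) for k in range(4)])

f = np.random.rand(6, 6)    # input image
psi = np.random.rand(3, 3)  # a single learnable filter
out = p4_conv_z2(f, psi)
print(out.shape)            # (4, 4, 4): group axis x spatial dims
```

The first axis of `out` is the new group dimension: plane *k* holds the response to the filter rotated *k* times.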

Now, what happens if the input for the next group equivariant convolution is such a structured feature map? Or, in other words, how do G-convolutions act on group elements? The idea is similar: we again create an identically structured feature map. Two vital tools are needed to fully understand what happens: the *transformation* of a structured object and the *dot product* between two structured objects.

The transformation of a structured object consists of two actions in parallel: individually transforming the data on every node, and following the arrows that indicate that transformation. In the case of a rotation, all data should be (1) individually rotated and (2) moved to a different node by following one red arrow. People often call this final step a *permutation*: the individual nodes permute locations.
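In code, assuming the p4-structured feature map is stored as an array of shape `(4, H, W)` with the group axis first, the two parallel actions become a per-plane `np.rot90` and an `np.roll` along the group axis (a sketch with a hypothetical name):

```python
import numpy as np

def rotate_p4(F):
    """Rotate a p4-structured feature map F of shape (4, H, W):
    (1) rotate the data on every node, then (2) permute the nodes one
    step along the red (rotation) arrows of the graph."""
    return np.roll(np.stack([np.rot90(plane) for plane in F]), 1, axis=0)

F = np.random.rand(4, 5, 5)
# applying the rotation four times returns the original structured map
G = F
for _ in range(4):
    G = rotate_p4(G)
assert np.allclose(G, F)
```

Note that the fourth application recovers the original map, mirroring *r⁴* = *e* on the graph.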

The dot product of two (identically) structured objects is the pointwise summation of the results of the conventional convolutions that are obtained when we overlay their graphs:

We are ready to define the full G-convolution. First, we create a structured filter in which the data at every node can be independent (the same one used in the previous two illustrations). We transform the structured filter by every group element, thus creating four differently transformed structured filters. We separately take the dot product between each of these filters and the structured (input) feature map, creating four (unstructured) output objects. Finally, we place them in a structured feature map according to the transformation the structured filter had to undergo. That's it! Please see what happens in the illustration below. I have left out the cacti and instead colored the data on the nodes differently.

Let's evaluate what happens to the feature maps in a shallow two-layer G-CNN. I feed both an upward-standing input (*e*) and a rotated input (*r*) forward through the network and *show* that all feature maps are equivariant under the imposed rotation.

In this example, the feature maps are indeed equivariant under rotation: the structured outputs after the first and second layer on the left (upward-standing pose) are equivalent to those on the right (90° rotation) up to a 90° rotation.
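For those who want to reproduce this check numerically, the following self-contained NumPy sketch builds such a shallow two-layer p4 network and verifies the equivariance. All function names are my own, and the loops are naive on purpose:

```python
import numpy as np

def corr2d(f, psi):
    """Plain 'valid' cross-correlation."""
    h, w = psi.shape
    return np.array([[np.sum(f[i:i + h, j:j + w] * psi)
                      for j in range(f.shape[1] - w + 1)]
                     for i in range(f.shape[0] - h + 1)])

def rotate_p4(F):
    """Rotate a structured map: rotate every plane, then permute planes."""
    return np.roll(np.stack([np.rot90(p) for p in F]), 1, axis=0)

def p4_conv_z2(f, psi):
    """First layer: lift an image to a p4-structured feature map."""
    return np.stack([corr2d(f, np.rot90(psi, k)) for k in range(4)])

def p4_conv_p4(F, Psi):
    """Full G-convolution: transform the structured filter by every group
    element and take the structured dot product with F."""
    out, Psi_s = [], Psi
    for _ in range(4):
        out.append(sum(corr2d(F[t], Psi_s[t]) for t in range(4)))
        Psi_s = rotate_p4(Psi_s)
    return np.stack(out)

rng = np.random.default_rng(0)
f = rng.random((9, 9))        # upward-standing input (e)
psi1 = rng.random((3, 3))     # layer-1 filter
Psi2 = rng.random((4, 3, 3))  # layer-2 structured filter

net = lambda img: p4_conv_p4(p4_conv_z2(img, psi1), Psi2)

# feeding the rotated input (r) yields the rotated structured output
assert np.allclose(net(np.rot90(f)), rotate_p4(net(f)))
```

The final assertion is exactly the statement illustrated above: rotating the input and feeding it forward equals feeding the input forward and rotating the structured output.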

I hope the visual interpretation was clear. Next, I will show the proof. Although this part is almost pure math and merely paraphrases the arguments given in the article, I think it should be shown to fully acknowledge their work and for you to fundamentally understand it. In most cases, the notation is identical. Let's start with some mathematical equipment to make sure the proof is crystal clear. We begin with the notion of equivariance in its broader sense. Next, symmetry groups are explained and a convenient parameterization of the groups p4 and p4m is shown. Last but not least, I will talk about functions on groups, as they are essential in the proof.

*Equivariance* is a notion for functions and is stated as follows. A function *f* is said to be equivariant under the transformation group *G* on the domain *X* if, for all *g* in *G* and all *x* in *X*,

*f*(*Tg x*) = *Tg f*(*x*),

in which *Tg* is the operator form of the group transformation *g*. In words, it means that first transforming *x* and then mapping it is equivalent to first mapping *x* and then transforming it. So the idea is to find a group, whatever a group may be, and a function on it, such as the G-convolution, that is equivariant under all group actions. Moreover, we would like the group to represent interesting, thus realistic, transformations, such as rotations. Let me first introduce the notion behind a transformation.

A symmetry transformation leaves the structure of the object invariant: e.g. if we rotate a square by 90 degrees, its structure is preserved and merely its orientation changes. For squares there are eight such orientations (remember the poses of the cactus?). By applying one or more symmetry transformations, every orientation that preserves structure can be obtained. It is time to define a symmetry group: a *symmetry group*, *G*, is a set of symmetry transformations, *{a, b, …}*, together with a binary operation *(g, h)* → *gh* that obeys the following rules:

- *closure*: for all *g*, *h* in *G*, the product *gh* is again in *G*;
- *associativity*: *(gh)k* = *g(hk)*;
- *identity*: there is an element *e* such that *eg* = *ge* = *g* for all *g*;
- *inverse*: every *g* has an inverse *g⁻¹* with *gg⁻¹* = *g⁻¹g* = *e*.

We are now at the point where we can define specific groups that include rotations and reflections. A convenient parameterization of the groups p4 and p4m, in homogeneous coordinates, is

g(r, u, v) = [[cos(rπ/2), −sin(rπ/2), u], [sin(rπ/2), cos(rπ/2), v], [0, 0, 1]],

g(m, r, u, v) = [[(−1)^m cos(rπ/2), −(−1)^m sin(rπ/2), u], [sin(rπ/2), cos(rπ/2), v], [0, 0, 1]],

with r ∈ {0, 1, 2, 3} and m ∈ {0, 1},

in which *r* is the parameter that expresses rotations, *m* reflections and *(u, v)* translations. Observe that the group p4m is an extension of the group p4. The group operation is defined as matrix multiplication, a group element acting on a homogeneous lattice point x = (u′, v′, 1)ᵀ as the matrix-vector product gx,

in which *(u’, v’)* is the original location in the sampling grid. Please prove for yourself that they indeed obey the four rules. We will use these groups to define the G-convolutions.
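These properties can be checked numerically as well. In the sketch below (my own, with `g` a hypothetical helper that builds the p4 matrix), we verify closure, the inverse rule and r⁴ = e:

```python
import numpy as np

def g(r, u, v):
    """Homogeneous-coordinate matrix of the p4 element with rotation r
    (in quarter turns) and translation (u, v)."""
    c = int(round(np.cos(r * np.pi / 2)))
    s = int(round(np.sin(r * np.pi / 2)))
    return np.array([[c, -s, u],
                     [s,  c, v],
                     [0,  0, 1]])

e = g(0, 0, 0)  # identity element

# closure: the product of two p4 elements is again a p4 element
assert np.array_equal(g(1, 0, 0) @ g(3, 5, 2), g(0, -2, 5))

# inverse: a^{-1} a = e
a = g(1, 2, -1)
a_inv = np.linalg.inv(a).round().astype(int)
assert np.array_equal(a_inv @ a, e)

# a pure rotation has order four: r^4 = e
assert np.array_equal(np.linalg.matrix_power(g(1, 0, 0), 4), e)

# acting on a lattice point (u', v') in homogeneous coordinates
p = np.array([3, 4, 1])
print(g(1, 0, 0) @ p)  # rotates (3, 4) to (-4, 3)
```

Associativity comes for free, since matrix multiplication is associative.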

First of all, what happens in a conventional CNN? There, a feature map is modeled as a function, *f*, on the sampling grid (Z²) that produces a *K(l)*-dimensional vector at every pixel location, in which *K(l)* is the number of channels in the *l*-th layer. For simplicity of the mathematical analysis we extend the sampling lattice to infinity and define the convolution as

[f ⋆ ψⁱ](x) = Σ_{y ∈ Z²} Σ_k f_k(y) ψⁱ_k(y − x),

in which *i* denotes the filter index. For G-convolutions, we will see that we are also concerned with transformations of feature maps. For this purpose, we define a notation for a transformation, the operator *Lg*, which is a concrete instantiation of *Tg*:

[Lg f](x) = f(g⁻¹x)

It states that the transformed feature map at location *x* can be computed by looking at the original feature map at the point *g⁻¹x*. We can easily verify that the transformation is a homomorphism, that is,

[Lg [Lh f]](x) = [Lh f](g⁻¹x) = f(h⁻¹g⁻¹x) = f((gh)⁻¹x) = [L(gh) f](x),

since the group operation is a matrix multiplication (the step f(h⁻¹g⁻¹x) = f((gh)⁻¹x)). Using this framework, the convolution can be expressed in terms of a shift operator, *Lt*, defined by [Lt ψ](y) = ψ(y − t), such that

[f ⋆ ψⁱ](x) = Σ_{y ∈ Z²} Σ_k f_k(y) [Lx ψⁱ_k](y).

As we will see in the definition of the full G-convolution, we must deal with group actions that act on group elements (remember that the sampling grid is a subset of the group). This is handled by replacing *x*, an element of the sampling grid, with *h*, an element of *G*. To understand what happens when a transformation acts on a group element, visual interpretation becomes helpful, if not necessary. Earlier, without the plain math, I already tried to convey what really happens within these layers; let us now make the math and the visualizations intertwine. Generally speaking (read: for both our groups), *h* has three coordinates: one transformation coordinate (pose/orientation) and two translation coordinates (pixel position). This new dimension was clarified with the introduction of a graph in which every node represents a different pose or orientation, indicating the transformation coordinate. If we apply a transformation to a function on one of the groups, all three coordinates can change. In fact, I already provided an example with the rotation of a structured object:

In this case the four images reallocate themselves by following one red arrow while simultaneously undergoing a 90-degree rotation. We are ready to define the G-convolution. The *G-convolution* for the first layer is found by replacing the translation by a more general transformation, *g*, that belongs to *G*:

[f ⋆ ψⁱ](g) = Σ_{y ∈ Z²} Σ_k f_k(y) ψⁱ_k(g⁻¹y)

The new feature map is a function on the discrete group *G*. For the full G-convolution we must, as I did earlier, replace *y* with *h* (an element of the group *G*). The *full G-convolution* is defined as

[f ⋆ ψⁱ](g) = Σ_{h ∈ G} Σ_k f_k(h) ψⁱ_k(g⁻¹h)

For both layer types we can prove that they are equivariant under all group actions. The main step in the proof is the substitution of *h* by *uh*, such that

[[Lu f] ⋆ ψ](g) = Σ_{h ∈ G} Σ_k f_k(u⁻¹h) ψ_k(g⁻¹h) = Σ_{h ∈ G} Σ_k f_k(h) ψ_k((u⁻¹g)⁻¹h) = [f ⋆ ψ](u⁻¹g) = [Lu [f ⋆ ψ]](g)

Similarly, we can show that individual feature maps of a conventional convolution are not equivariant under rotation, since

[[Lr f] ⋆ ψ](x) = [Lr [f ⋆ [Lr⁻¹ ψ]]](x),

which states that the convolution of a rotated feature map is the same as rotating the convolution of the original feature map with an inverse-rotated filter. It means that the stack of feature maps can be equivariant if the CNN learns rotated copies of all filters, but the individual feature maps will not be. This finalizes the mathematical proof that G-convolutions are equivariant under the newly added transformations as well.

In the beginning, I explained that the ability of deep models to learn rotational and reflective symmetry, besides merely the translations of conventional CNNs, can enhance efficiency. One such implementation is the so-called G-convolution which, through clever weight sharing, results in rotation- and reflection-preserving feature maps. Graphically, I showed how G-convolutions work and are implemented, and in the final part I proved that G-CNNs are equivariant to their inherent transformations. Even though the implementation is rather complex, the computational overhead is minimal compared to CNNs. T. S. Cohen and M. Welling showed in 2016 that adapting some well-known networks (e.g. residual networks and All-CNN) to preserve more structure led to a significant increase in performance.

I will write another blog, in which I will guide you through my own *PyTorch* implementation that reproduces their results.

References:

- [1] T. S. Cohen and M. Welling. *Group equivariant convolutional networks*. Proceedings of the International Conference on Machine Learning (ICML), 2016.
- [2] Araújo, T., Aresta, G., Castro, E., Rouco, J., Aguiar, P., Eloy, C., et al. (2017). Classification of breast cancer histology images using Convolutional Neural Networks. *PLoS ONE*, 12(6): e0177544. https://doi.org/10.1371/journal.pone.0177544
- [3] Lakhani, P., & Sundaram, B. (2017). Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. *Radiology*, 284(2), 574–582.