Capsule networks: A simplified mathematical understanding, and why they fell out of favor

Debasrita Chakraborty
8 min read · Feb 6, 2023


Capsule networks are a neural network architecture introduced in the 2017 paper "Dynamic Routing Between Capsules" by Sabour, Frosst, and Hinton. The main idea is to replace the traditional scalar-output neurons of a neural network with "capsules": groups of neurons whose output is a vector of values rather than a single number. Capsule networks were designed to improve upon convolutional neural networks (CNNs) in their ability to understand and recognize objects in images. They do this by using a novel type of layer, the capsule layer, which explicitly models the parts of an object, their orientations, and the relationships between them. This enables capsule networks to better capture the spatial relationships between the parts of an object and to be more robust to changes in viewpoint and to deformation.

While traditional CNNs view an image as a collection of pixels, capsule networks view an image as a collection of objects and their parts, each represented by a capsule. The capsules are connected to form a hierarchical structure that represents the relationships between the objects and their parts. This structure allows the network to better understand the spatial relationships between parts of an object and to be more robust to changes in viewpoint and deformation.

What is a capsule?

Mathematically, a capsule is represented by a vector, where each element of the vector represents a property of the object or part, such as the presence of an edge, the orientation of an edge, or the presence of a color. The length of the vector represents the probability that the object or part exists in the image, while the orientation of the vector represents the pose or orientation of the object or part.

During the forward pass, the capsule layer takes the output of the previous layer, typically a set of feature maps, and applies a transformation to obtain a set of vectors, one for each object or part in the image. The transformation is defined by a matrix of weights that is learned during training.

During the backward pass, the capsule layer uses the loss from the task being performed, such as image classification or segmentation, to update the weights in the transformation matrix and refine the representation of the objects and parts in the image.

In this way, the capsule layer captures the relationships between the parts of an object and their orientations and allows the network to better understand the spatial relationships between parts of an object, making it more robust to changes in viewpoint and deformation.

A capsule i in a capsule network can be represented as a vector u_i, and the output of the capsule can be represented as a non-linear function of a learned linear transformation of that vector:

v_i = f(W_i u_i)

where W_i is the weight matrix for capsule i and f(·) is a non-linear function.
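As a concrete sketch of this forward computation, here is a NumPy toy with made-up dimensions; the non-linearity f is taken to be the "squashing" function described later in this article:

```python
import numpy as np

def squash(s, eps=1e-8):
    # Vector non-linearity used in the CapsNet paper: maps any vector to one
    # with length in (0, 1) while preserving its orientation.
    norm = np.linalg.norm(s)
    return (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + eps)

rng = np.random.default_rng(0)
u_i = rng.standard_normal(8)         # input capsule vector (8-D pose)
W_i = rng.standard_normal((16, 8))   # learned weight matrix for capsule i

v_i = squash(W_i @ u_i)              # capsule output: f(W_i u_i)
print(v_i.shape)                     # (16,)
print(np.linalg.norm(v_i) < 1.0)     # True: the length can act as a probability
```

The dimensions (8 in, 16 out) are purely illustrative; real architectures pick them per layer.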

The mathematical equations that define a capsule network vary depending on the specific architecture of the network, but here are some common components:

  1. Scalar activation: The scalar activation of a capsule is the length of its vector, which represents the probability that the corresponding object or part exists in the image. It is computed as the square root of the sum of the squares of the elements of the capsule vector:

||u|| = sqrt(u_1^2 + u_2^2 + … + u_n^2)

where n is the number of elements in the capsule vector.
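For example, with a made-up 3-element capsule vector:

```python
import numpy as np

u = np.array([0.3, -0.4, 0.0])          # a toy 3-element capsule vector
activation = np.sqrt(np.sum(u ** 2))    # square root of the sum of squares
print(activation)                       # 0.5
```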

2. Routing by agreement: In the routing-by-agreement mechanism, the output of each capsule is passed to the next layer of capsules, where it is transformed and combined with the outputs of other capsules. The coupling coefficients c_ij are a softmax over routing logits b_ij, which are refined iteratively during the routing procedure rather than learned as ordinary weights:

c_ij = exp(b_ij) / Σ_k exp(b_ik)

s_j = Σ_i c_ij û_(j|i)

where c_ij is the weight that capsule i assigns to its prediction for capsule j, b_ij is the routing logit between capsules i and j, û_(j|i) is capsule i's prediction vector for capsule j (û_(j|i) = W_ij u_i), and s_j is the total input to capsule j.
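A minimal NumPy sketch of a single routing pass, with illustrative dimensions. Before any routing iterations the logits are zero, so each input capsule couples uniformly to all output capsules:

```python
import numpy as np

num_in, num_out, dim = 4, 3, 16
rng = np.random.default_rng(1)
u_hat = rng.standard_normal((num_in, num_out, dim))  # predictions û_(j|i)
b = np.zeros((num_in, num_out))                      # routing logits start at 0

# c_ij = exp(b_ij) / Σ_k exp(b_ik): each input capsule's couplings sum to 1.
# With b = 0 this is uniform: every c_ij = 1/num_out.
c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)

# s_j = Σ_i c_ij û_(j|i): weighted sum of the predictions made for capsule j
s = np.einsum('ij,ijd->jd', c, u_hat)
print(s.shape)  # (3, 16): one pre-activation vector per output capsule
```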

3. Squashing: The squashing function normalizes each capsule vector so that its length lies between 0 and 1 while preserving its orientation:

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)

where v_j is the output vector of capsule j and s_j is its total input from the routing step.
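This formula translates directly into NumPy; the small epsilon below guards against division by zero and is an implementation detail, not part of the paper's formula:

```python
import numpy as np

def squash(s_j, eps=1e-8):
    # v_j = (||s_j||^2 / (1 + ||s_j||^2)) * (s_j / ||s_j||)
    norm = np.linalg.norm(s_j)
    return (norm ** 2 / (1.0 + norm ** 2)) * s_j / (norm + eps)

long_vec = squash(np.array([10.0, 0.0]))
short_vec = squash(np.array([0.1, 0.0]))
print(np.linalg.norm(long_vec))   # ~0.99: long input -> length near 1 (present)
print(np.linalg.norm(short_vec))  # ~0.0099: short input -> length near 0 (absent)
```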

These equations provide a high-level overview of the mathematical components of a capsule network. The specific details of the architecture and implementation of the network may vary, but these equations provide a foundation for understanding the basic concept and operation of capsule networks.
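Putting the three components together, the dynamic routing procedure from the paper iterates: compute coupling coefficients, form s_j, squash it into v_j, then increase each logit b_ij by the agreement û_(j|i) · v_j. A compact NumPy sketch with illustrative dimensions:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Vector non-linearity: length -> (0, 1), orientation preserved.
    norm = np.linalg.norm(s, axis=axis, keepdims=True)
    return (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: (num_in, num_out, dim) array of prediction vectors û_(j|i)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits, initialized to zero
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = np.einsum('ij,ijd->jd', c, u_hat)      # s_j = sum_i c_ij û_(j|i)
        v = squash(s)                              # output capsules v_j
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # agreement update
    return v

rng = np.random.default_rng(2)
v = dynamic_routing(rng.standard_normal((6, 10, 16)))
print(v.shape)  # (10, 16): one output vector per higher-level capsule
```

Three routing iterations, as used here, is the number reported in the original paper; the capsule counts and dimensions are made up.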

[Figure: a capsule neuron compared with a traditional scalar neuron]

How are these different from a CNN?

Convolutional Neural Networks (CNNs) and Capsule Networks are both types of deep learning models that are used for image recognition and computer vision tasks. However, there are some key differences between these two types of networks:

  1. Architecture: CNNs have a traditional feedforward architecture with multiple layers of convolutional and pooling operations, while capsule networks have a hierarchical architecture built from multiple capsule layers.
  2. Representation of objects: In a CNN, the feature maps generated by the convolutional and pooling layers are used to represent objects in the image. These feature maps are treated as a whole and do not retain the relationships between parts of an object. In a capsule network, each capsule represents a part of an object, and the relationships between parts are captured by the orientation of the vectors that represent the capsules.
  3. Robustness to deformations and rotations: CNNs are susceptible to changes in viewpoint and deformations of objects, which can lead to decreased performance in recognition tasks. Capsule networks, on the other hand, are designed to be more robust to these types of changes and maintain the relationships between parts of an object even when the object is deformed or rotated.
  4. Routing mechanism: In a CNN, the information is passed from one layer to the next through matrix multiplications and activation functions. In a capsule network, the routing by agreement mechanism is used to pass information between capsule layers, allowing the network to learn the relationships between parts of an object.

These differences highlight the strengths and weaknesses of both CNNs and capsule networks, and the choice of which network to use for a specific task will depend on the specific requirements of the task and the nature of the data being analyzed.

CapsNet

The architecture of a CapsNet typically consists of the following components:

  1. Primary Capsules: The primary capsules layer takes in the input image and processes it using multiple filters to generate multiple vectors that represent different parts of the image.
  2. Digit Capsules: The digit capsules layer receives the output from the primary capsules layer and uses a routing by agreement mechanism to determine the relationships between parts of an object in the image. This layer outputs a set of vectors that represent the whole object.
  3. Reconstruction Layer: The reconstruction layer takes the output from the digit capsules layer and uses it to reconstruct the input image. This layer is used to ensure that the network retains information about the objects in the image even when the objects are deformed or rotated.
  4. Loss Function: The loss function used in a CapsNet is typically a combination of the reconstruction loss and the classification loss. The reconstruction loss measures the difference between the reconstructed image and the original image, while the classification loss measures the accuracy of the network’s predictions.
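The classification term used in the original paper is a per-class margin loss on the digit-capsule lengths, combined with a heavily down-weighted reconstruction MSE; the constants below (m_pos = 0.9, m_neg = 0.1, lam = 0.5, reconstruction weight 0.0005) are the values reported in the paper:

```python
import numpy as np

def margin_loss(v_norms, target_onehot, m_pos=0.9, m_neg=0.1, lam=0.5):
    # Push the correct class capsule's length above m_pos and every
    # other capsule's length below m_neg.
    present = target_onehot * np.maximum(0.0, m_pos - v_norms) ** 2
    absent = lam * (1.0 - target_onehot) * np.maximum(0.0, v_norms - m_neg) ** 2
    return float(np.sum(present + absent))

def capsnet_loss(v_norms, target_onehot, recon, image, recon_weight=0.0005):
    # Total loss = margin loss + down-weighted reconstruction error, so the
    # reconstruction term acts as a regularizer rather than dominating training.
    recon_loss = float(np.sum((recon - image) ** 2))
    return margin_loss(v_norms, target_onehot) + recon_weight * recon_loss

# A confident, correct prediction incurs zero margin loss:
print(margin_loss(np.array([0.95, 0.05]), np.array([1.0, 0.0])))  # 0.0
```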
[Figure: a representative diagram of a Capsule Network-based digit recognizer]

These components are combined to form a hierarchical architecture that is designed to capture the relationships between parts of an object in an image and maintain this information even when the object is deformed or rotated. The routing by agreement mechanism is a key component of this architecture and allows the network to learn these relationships and improve its accuracy in recognizing objects in images.

If they are so good, why are they not used more?

Capsule networks have not been widely adopted compared to other deep learning architectures such as Convolutional Neural Networks (CNNs) for several reasons:

  1. Complexity: Capsule networks are a relatively new and complex architecture, and many practitioners are still learning about their capabilities and limitations. In addition, implementing and training a CapsNet can be more challenging compared to a traditional CNN.
  2. Performance: While CapsNets have shown promising results in certain computer vision tasks, they have not consistently outperformed traditional CNNs in all tasks. In many cases, the added complexity of the CapsNet architecture does not result in a significant improvement in performance compared to a simpler CNN architecture.
  3. Lack of standardized evaluation protocols: There are currently no standardized evaluation protocols for comparing the performance of CapsNets to other deep learning architectures, making it difficult to compare the performance of different models and to determine which architecture is best suited for a specific task.
  4. Limited availability of pre-trained models: Unlike CNNs, there are currently limited pre-trained models available for CapsNets, which makes it more challenging for practitioners to quickly implement and fine-tune these models for their specific tasks.

Overall, while CapsNets show promise as a deep learning architecture, their complexity and the absence of consistent performance gains over CNNs have kept them out of widespread practical use. However, as more research is conducted and best practices are established, CapsNets may yet see broader adoption in real-world applications.
