Deep learning: the capsule network revolution

July 19, 2018.

Geoffrey Hinton, a major figure in the history of deep learning, has introduced a completely new type of neural network, arousing the curiosity and enthusiasm of the artificial intelligence research community. Capsule networks promise to transform the capabilities of machine learning in many areas. But what exactly do they bring?

Geoffrey Hinton is a world-leading British-Canadian researcher specializing in artificial neural networks. A professor in the Department of Computer Science at the University of Toronto, he was one of the first researchers to demonstrate the use of the backpropagation algorithm for training multilayer neural networks, a technique that has since become widespread in artificial intelligence. We also owe him many models and algorithms that are in common use today.

Last fall, Hinton and his team (Sara Sabour and Nicholas Frosst) published an open-access paper, “Dynamic Routing Between Capsules”, which presents the architecture of a new type of neural network: capsule networks, or CapsNets (the concept of capsules had already been presented in a 2011 paper). Above all, the architecture comes with an algorithm for training these new networks. Because fundamental innovations are rare, the paper has piqued the interest of specialists, who see in CapsNets a major advance over convolutional neural networks (ConvNets), which are widely used for image and video recognition, recommendation systems and natural language processing.

ConvNets perform many tasks quickly and efficiently, but they have their limitations and drawbacks. Take the classic example of face recognition: detecting an oval shape, a pair of eyes, a nose and a mouth indicates a very high probability that the image contains a face. But the spatial arrangement of these elements and the relationships between them are not really taken into account by ConvNets.

The limits of facial recognition with a convolutional neural network.

The main components of a ConvNet are convolutional layers that detect notable characteristics in the input data. The first layers, those closest to the raw input data, learn to detect simple characteristics, while the upper layers combine them into more complex ones. The final layers then combine these high-level characteristics to produce classification predictions. Between the convolutional layers, max-pooling layers reduce the size of the representation by subsampling while accentuating strong signals, which substantially reduces the amount of computation required.
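
To make these two building blocks concrete, here is a minimal sketch in plain Python, written for readability rather than speed. The function names `conv2d` and `max_pool`, the tiny image and the hand-written edge kernel are all assumptions made for this illustration; in a real ConvNet the kernels are learned during training.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (what deep learning calls convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A 5x5 image with a dark-to-bright vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1, 1],
               [-1, 1]]

features = conv2d(image, edge_kernel)   # 4x4 map of edge responses
pooled = max_pool(features, size=2)     # 2x2 map: four times fewer values
```

The pooled map still signals that an edge was found, but with a quarter of the values, which is where the computational savings come from.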

Through this whole process, and especially in the max-pooling layers, the position, orientation, scale and relationships of the characteristics detected by the first layers are lost. In effect, the internal representation of a convolutional neural network does not take into account the important spatial hierarchies between simple and complex objects. For example, we no longer know whether the nose (simple object) is in the middle of the face (complex object).
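
This loss of spatial information can be illustrated with a toy example. The sketch below assumes one hypothetical map per facial part and applies a deliberately extreme form of pooling (global max pooling) to each. The names `global_max_pool`, `part_map`, `face` and `scrambled` are invented for this illustration.

```python
def global_max_pool(feature_map):
    """Collapse a 2D map to its single strongest activation."""
    return max(max(row) for row in feature_map)


def part_map(row, col, size=4):
    """Toy map: 1.0 where a part detector fired, 0.0 elsewhere."""
    return [[1.0 if (r, c) == (row, col) else 0.0 for c in range(size)]
            for r in range(size)]


# Plausible face: eyes on top, nose in the middle, mouth at the bottom.
face = {
    "left_eye": part_map(0, 0),
    "right_eye": part_map(0, 3),
    "nose": part_map(1, 1),
    "mouth": part_map(3, 1),
}

# Same parts in an implausible layout: mouth on top, eyes at the bottom.
scrambled = {
    "left_eye": part_map(3, 3),
    "right_eye": part_map(2, 0),
    "nose": part_map(0, 3),
    "mouth": part_map(0, 1),
}

face_summary = {name: global_max_pool(m) for name, m in face.items()}
scrambled_summary = {name: global_max_pool(m) for name, m in scrambled.items()}

# Both layouts produce the same pooled description, so a classifier that
# reads only these values cannot tell the two images apart.
print(face_summary == scrambled_summary)
```

Real ConvNets pool over small windows, layer after layer, so the loss is more gradual than in this sketch, but the principle is the same: each pooling step keeps "what" and discards part of "where".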

Put as simply as possible, a CapsNet is composed of capsules. A capsule is a group of artificial neurons that learns to detect a particular object in a given region of the image. It produces a vector whose length represents the estimated probability that the object is present and whose orientation encodes the object’s pose (its “instantiation parameters”: position, size, rotation, etc.). If the object is slightly modified (for example translated, rotated or resized), the capsule produces a vector of the same length, but oriented slightly differently. Capsules are therefore equivariant, unlike ConvNets, which aim for invariance: a small change in the input produces no change in the output.
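
In “Dynamic Routing Between Capsules”, the length of a capsule's output is kept between 0 and 1 by a "squashing" non-linearity, v = (‖s‖² / (1 + ‖s‖²)) · s / ‖s‖, which shrinks the vector without changing its direction. The sketch below implements that function in plain Python. The 2D vectors and the `rotate` helper, which stands in for a small change of pose, are assumptions made for the illustration.

```python
import math


def norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def squash(s, eps=1e-9):
    """Keep the direction of s and map its length into [0, 1)."""
    length = norm(s)
    scale = length ** 2 / (1.0 + length ** 2)
    # eps avoids a division by zero for an all-zero input.
    return [scale * x / (length + eps) for x in s]


def rotate(vector, angle):
    """Rotate a 2D vector by `angle` radians (stand-in for a pose change)."""
    x, y = vector
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]


raw = [3.0, 4.0]                       # hypothetical raw capsule output, length 5
out = squash(raw)                      # length close to 25/26, same direction
out_moved = squash(rotate(raw, 0.2))   # same length, different direction

print(norm(out), norm(out_moved))
```

The two printed lengths match (the presence probability is unchanged), while the vectors themselves point in different directions: the change of pose is carried by the orientation, which is the equivariance described above.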

As with a ConvNet, a CapsNet is organized in layers. The lowest layer is composed of primary capsules, each of which receives a small portion of the input image and attempts to detect the presence and pose of a pattern, such as a circle. The capsules in the higher layers, called routing capsules, detect larger and more complex objects.

Capsules communicate through an iterative “routing-by-agreement” mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule. “Lower level capsule will send its input to the higher level capsule that ‘agrees’ with its input. This is the essence of the dynamic routing algorithm.”
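
The routing loop can be sketched in a few lines of plain Python. It follows the procedure of “Dynamic Routing Between Capsules”: routing logits start at zero, a softmax turns them into coupling coefficients, each higher-level capsule squashes the weighted sum of the predictions it receives, and each logit is then increased by the scalar product between prediction and output. The prediction vectors are made-up numbers; in a real CapsNet they come from learned transformation matrices applied to the lower-level capsule outputs.

```python
import math


def squash(s, eps=1e-9):
    """Keep the direction of s and map its length into [0, 1)."""
    length = math.sqrt(sum(x * x for x in s))
    scale = length ** 2 / (1.0 + length ** 2)
    return [scale * x / (length + eps) for x in s]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route(predictions, iterations=3):
    """Routing-by-agreement.

    predictions[i][j] is the vector that lower capsule i predicts for
    higher capsule j. Returns (higher-capsule outputs, coupling coefficients).
    """
    n_lower, n_higher = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * n_higher for _ in range(n_lower)]

    for _ in range(iterations):
        # Each lower capsule splits its output across the higher capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_higher):
            total = [
                sum(coupling[i][j] * predictions[i][j][d] for i in range(n_lower))
                for d in range(dim)
            ]
            outputs.append(squash(total))
        # Agreement: scalar product between each prediction and the output.
        for i in range(n_lower):
            for j in range(n_higher):
                logits[i][j] += dot(predictions[i][j], outputs[j])
    return outputs, coupling


# Three lower capsules, two higher capsules (illustrative values only).
# The predictions for higher capsule 0 agree; those for capsule 1 conflict.
predictions = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.9, 0.1], [-1.0, 0.0]],
    [[1.0, -0.1], [0.0, 1.0]],
]
outputs, coupling = route(predictions)
```

After three iterations, every lower capsule sends most of its output to higher capsule 0, whose output vector is also the longer of the two: agreement among the predictions translates into a higher estimated probability of presence.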

We are, of course, only scratching the surface of the complexity and richness of CapsNets here, but the key point is that they represent a great step forward in remedying the traditional shortcomings of ConvNets. The technology is still in its infancy, but since the publication of “Dynamic Routing Between Capsules”, many researchers have been working to refine the algorithms and implementations, and advances have been published at a rapid pace.

The main advantages of CapsNets:

  • Unlike ConvNets, which require a large amount of reference data for the training phase, CapsNets can generalize using much less data.
  • CapsNets do not discard spatial information between layers, as ConvNets do through pooling.
  • CapsNets expose the hierarchy of the characteristics found, for example: this nose belongs to this face. Obtaining the same information with a ConvNet requires additional components.

The drawbacks, in their current state (though things are evolving quickly):

  • CapsNets are very demanding in computing resources.
  • They do not work as well as ConvNets with large images.
  • They cannot detect two objects of the same type when they are too close together (this is called the “crowding problem”).

Has the subject caught your attention, and would you like to explore it further? Here is a list of useful references: