Time-Aware Inference in CNNs

Nikos Fragoulis
Feb 26, 2019


An important step towards human-brain-inspired computational rationality in deep learning

In modern CNNs, an inference decision becomes available only after a fixed amount of time: the time that a system with given computational capabilities and resources needs to carry out the associated calculations. As a consequence, when the size or complexity of the underlying CNN model does not match the processing platform, the inference time can be large.

The usual way around this trade-off is to reduce the complexity of the CNN model by applying a mixed scheme of quantisation and pruning. Although these schemes can often give impressive results, they actually end up at the same trade-off, only at a different point in the design space. In other words, there is always a need to squeeze CNN models into smaller and smaller platforms.

In the human brain, inference decisions are taken in a smarter or, if you like, a more rational way. In particular, humans and other biological organisms take decisions whose accuracy or quality depends on the amount of time available, using resources rationally and continuously trading off time against accuracy for every given task. Operating under bounded inference resources, humans invest increasing computing effort to improve a decision over time, while still being able to take fast decisions at reduced quality or accuracy.

Our team at Irida Labs, inspired by this biological feature, developed an approach (patent pending) that gives rationality to a modern CNN-based AI system by building one of its most fundamental components: a CNN model capable of time-aware inference, i.e. inference that yields results of different quality for different levels of computing effort (or, equivalently, computation time).

The idea relies on a mechanism that sequentially activates a meaningful number of computing resources in real time, depending on the required computing effort and the available time. This is achieved through a special artificial neural module called the Learning Kernel-Activation Module (LKAM), previously introduced in our proprietary dynamic pruning technique named parsimonious inference. This module has negligible computational complexity and, when inserted at certain points of a CNN, is able to control the number of convolutional kernels and activations used in the various layers.

Moreover, it is able to learn, during gradient-descent training, which kernels to activate depending on the input from the previous layer (i.e. depending on the processing task), while aiming to minimise kernel activation according to a tuning parameter that we call lambda: the lower this parameter, the fewer kernels are activated. By adjusting lambda it is possible to affect the number of activated kernels, and thus directly affect the computational capacity of the whole network.
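To make the gating idea more concrete, here is a minimal sketch of input-dependent kernel activation at inference time. It is not the LKAM itself, whose internals are not described in this post: the linear-plus-sigmoid gate, the names `lkam_gates` and `gated_layer`, the `threshold` parameter and all weights are assumptions made for illustration, and a dot product stands in for a real convolution. How lambda shapes the learned gates during training is not modelled.

```python
import math


def lkam_gates(features, gate_weights, gate_biases, threshold=0.5):
    """Return one on/off switch per kernel, computed from the layer input.

    Assumption: a linear map followed by a sigmoid stands in for the
    gating module; the real LKAM design is not given in the text.
    """
    gates = []
    for weights, bias in zip(gate_weights, gate_biases):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        probability = 1.0 / (1.0 + math.exp(-score))
        gates.append(probability >= threshold)
    return gates


def gated_layer(features, kernels, gates):
    """Evaluate only the kernels whose gate is open; skipped kernels yield 0."""
    outputs = []
    evaluated = 0
    for kernel, is_open in zip(kernels, gates):
        if is_open:
            # A dot product stands in for a full convolution.
            outputs.append(sum(k * x for k, x in zip(kernel, features)))
            evaluated += 1
        else:
            outputs.append(0.0)
    return outputs, evaluated


# Hypothetical weights, chosen by hand rather than learned.
kernels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
gate_biases = [0.0, 0.0, -1.0]

for features in ([1.0, -1.0], [1.0, 1.0]):
    gates = lkam_gates(features, gate_weights, gate_biases)
    outputs, evaluated = gated_layer(features, kernels, gates)
    print(features, gates, outputs, evaluated)
```

With these hand-picked weights the first input opens one gate and the second opens two, so the number of kernels actually evaluated (a rough proxy for computational cost) changes with the input.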

The principle of operation is demonstrated in the following figure: a typical CNN (a vanilla type is shown here, but it can be of any other form) is augmented by a number of LKAMs placed at appropriate positions. Normally every LKAM features a specific lambda. Here, however, the idea is to use multiple LKAMs featuring various lambdas and thus various kernel-activation profiles.

Using LKAMs with different lambdas is like using networks with a different number of convolutional layers, each featuring a different number of convolutional kernels and thus a different computational complexity. An arbitrator implementing a utility function that trades off decision quality against decision time can then decide in real time which modules to engage and how. The important additional feature here is that these decisions are taken in real time and depend on the input datum. In this way we end up with a network that dynamically alters its architecture in order to cope with the complexity of the task and the available processing time.
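One way to picture such an arbitrator is as a selection over a handful of activation profiles, each with an estimated quality and an estimated run time. The sketch below is only an illustration under assumptions: the `Profile` type, the linear utility (quality minus a weighted time penalty), the `time_weight` parameter and all numbers are hypothetical and not taken from our system. For simplicity it looks only at the time budget, not at the input datum.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Profile:
    """One kernel-activation profile (hypothetical figures, for illustration)."""
    name: str
    expected_quality: float   # e.g. validation accuracy in [0, 1]
    expected_time_ms: float   # estimated inference time on the target platform


def utility(profile: Profile, time_weight: float) -> float:
    """Reward decision quality and penalise decision time (assumed linear form)."""
    return profile.expected_quality - time_weight * profile.expected_time_ms


def arbitrate(profiles: List[Profile], budget_ms: float,
              time_weight: float = 0.001) -> Optional[Profile]:
    """Pick the highest-utility profile that fits within the time budget."""
    feasible = [p for p in profiles if p.expected_time_ms <= budget_ms]
    if not feasible:
        return None  # nothing fits; the caller decides how to degrade
    return max(feasible, key=lambda p: utility(p, time_weight))


# Made-up profiles: sparser activation is faster but less accurate.
profiles = [
    Profile("sparse", expected_quality=0.80, expected_time_ms=5.0),
    Profile("medium", expected_quality=0.90, expected_time_ms=20.0),
    Profile("dense", expected_quality=0.95, expected_time_ms=60.0),
]

for budget_ms in (10.0, 30.0, 100.0):
    chosen = arbitrate(profiles, budget_ms)
    print(budget_ms, chosen.name if chosen else None)
```

With these figures a tight budget selects the sparse profile and a generous one selects the dense profile. Raising `time_weight` makes the arbitrator prefer cheaper profiles even when there is time to spare.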

As shown in the figure above, the result is a system able to rapidly take an inference decision about a specific task, but also to improve the quality of this decision as time progresses.
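The "fast answer first, better answer later" behaviour can be sketched as a simple anytime loop. This is a toy under stated assumptions, not our implementation: the stages, their costs and the `anytime_inference` function are hypothetical, and time is simulated by adding up estimated costs rather than measured on a clock.

```python
from typing import Callable, List, Optional, Tuple

# A stage is (estimated cost in ms, function that returns a decision).
Stage = Tuple[float, Callable[[], str]]


def anytime_inference(stages: List[Stage],
                      budget_ms: float) -> Tuple[Optional[str], float]:
    """Run stages from cheapest to costliest while the budget allows.

    Returns the most recent decision and the simulated time spent.
    """
    decision = None
    spent_ms = 0.0
    for cost_ms, run_stage in stages:
        if spent_ms + cost_ms > budget_ms:
            break  # out of time: keep the best decision so far
        decision = run_stage()
        spent_ms += cost_ms
    return decision, spent_ms


# Hypothetical stages of increasing effort and quality.
stages = [
    (5.0, lambda: "coarse"),
    (20.0, lambda: "refined"),
    (60.0, lambda: "best"),
]

for budget_ms in (2.0, 30.0, 100.0):
    print(budget_ms, anytime_inference(stages, budget_ms))
```

A caller that can wait longer simply passes a larger budget and receives a later, higher-effort decision. If even the cheapest stage does not fit, no decision is returned.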