3 Essential Deep Learning Architecture Abilities (i.e. “Ilities”)

Credit: https://unsplash.com/search/photos/geodesic-dome?photo=c9z9RlCh0Zo

People in software development are familiar with the phrase “-ilities”. It is actually not a word, but you can google it:

Informally these are sometimes called the “ilities”, from attributes like stability and portability. Qualities — that is non-functional requirements — can be divided into two main categories: Execution qualities, such as security and usability, which are observable at run time.
Non-functional requirement — Wikipedia

You may think that because it is not a real word, it is some informal convention, some kind of loose jargon. That is not the case. Software quality attributes have in fact been formalized in ISO 9126:

Quality attributes are realized non-functional requirements used to evaluate the performance of a system. These are informally called “ilities” after the suffix that many of the words share. In software architecture, there is a notion of “ilities”: qualities that are important in evaluating our solutions. What is lacking in the DL literature is an adequate understanding of how to evaluate the quality of a Deep Learning architecture. What then are the “ilities” that are specific to evaluating Deep Learning systems?

Despite the newness of the field, there are 3 main “ilities” that a practitioner should know of:

Expressibility — This quality describes how well a machine can approximate functions. One of the first questions that many research papers have tried to answer is “Why does a Deep Learning system need to be Deep?” Put another way, what is the importance of having multiple layers or a hierarchy of layers? There is some consensus in the literature that deeper networks require fewer parameters than shallow, wider networks to express the same function. You can find more detail on the various explanations here: http://www.deeplearningpatterns.com/doku.php/hierarchical_abstraction. The measure here appears to be how few parameters (i.e. weights) we need to create an effective function approximator. A related research area is weight quantization: how few bits does one need without losing precision?

Trainability — The other kind of research that gets published is on how well a machine can learn. You will find hundreds of papers that all try to outdo each other by showing how trainable their system is compared to the ‘state-of-the-art’. The open theoretical question here is why these systems learn at all. The reason this is not obvious is that the workhorse of Deep Learning, the stochastic gradient descent (SGD) algorithm, appears far too simplistic to possibly work! There is a conceptual missing link here that researchers have yet to identify.

Generalizability — This is a quality that describes how well a trained machine can perform predictions on data that it has not seen before. I’ve written about this in more detail in “Rethinking Generalization”, where I describe 5 ways to measure generalization. Everyone seems to talk about generalization; unfortunately, few have a good handle on how to measure it.
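
To make the weight quantization question raised under Expressibility a little more concrete, here is a minimal Python sketch. It assumes plain uniform quantization over the weights' own range, which is only one of several possible schemes. The names `quantize` and `max_error` and the random Gaussian "weights" are illustrative assumptions, not taken from any particular paper or library.

```python
import random

def quantize(weights, bits):
    """Snap each weight to one of 2**bits evenly spaced levels (assumed scheme)."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1            # number of intervals between levels
    if hi == lo or levels == 0:
        return [lo for _ in weights]  # degenerate case: a single level
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]

def max_error(weights, bits):
    """Largest absolute difference between a weight and its quantized value."""
    return max(abs(w - q) for w, q in zip(weights, quantize(weights, bits)))

# Hypothetical stand-in for a layer's weights: 1000 Gaussian samples.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Fewer bits means coarser levels and therefore a larger worst-case error.
errors = {bits: max_error(weights, bits) for bits in (2, 4, 8)}
for bits, err in errors.items():
    print(f"{bits} bits: max error {err:.4f}")
```

The worst-case error is bounded by half the spacing between levels, so each extra bit roughly halves it. Whether a given error is tolerable depends on the network and the task, which this sketch does not measure.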

In computer science, we do understand expressibility. In its most general form, this is the notion of “Turing Completeness” or “Universal Computation” (see: “Simplicity of Universal Machines”). Feed-forward networks and Convolutional Networks, for example, are not Turing complete simply because they don’t have memory. What Deep Learning brings to the table that is a radical departure from conventional computer science is the latter two capabilities.

Trainability, the ability to train a computer rather than program it, is a major capability. This is “automating automation”. In other words, you don’t need to provide specific detailed instructions; you just need to provide the machine with examples of what it needs to do. We’ve actually seen something like this before in the difference between imperative and declarative programming. The difference in Deep Learning (or Machine Learning), however, is that we don’t need to define the rules. The machine is able to discover the rules for itself.
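
As a rough illustration of learning from examples rather than from rules, the sketch below runs plain stochastic gradient descent on a one-weight, one-bias linear model. Everything here is an assumption made for illustration: the function name `sgd_fit`, the learning rate, the epoch count, and the hidden rule y = 2x + 1 used only to generate the examples. It is a toy, not a claim about how any real Deep Learning system is trained.

```python
import random

def sgd_fit(examples, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by SGD on squared error, one example at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)               # the "stochastic" part: random order
        for x, y in data:
            err = (w * x + b) - y       # prediction error on this example
            w -= lr * err * x           # gradient of 0.5 * err**2 w.r.t. w
            b -= lr * err               # gradient of 0.5 * err**2 w.r.t. b
    return w, b

# The rule y = 2x + 1 appears only here, to produce training examples.
examples = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

w, b = sgd_fit(examples)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Note that `sgd_fit` never sees the rule itself, only input and output pairs, yet the learned `w` and `b` end up close to the values that generated the data.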

Even better, generalization implies that if the machine, once trained, encounters situations for which it has not been shown an example, it is able to figure out how to make the correct prediction. Generalization implies that, having discovered the rules during training, the machine is now able to create new rules on its own for unexpected situations. The machine has become more adaptable.
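
One simple way to see the difference between fitting the training data and generalizing is to compare error on the training examples with error on inputs the model has never seen. The sketch below does that for two toy "models": a lookup table that memorizes its examples and a least-squares line. This is plain held-out evaluation, not necessarily one of the measures from “Rethinking Generalization”, and the names `memorize`, `fit_line`, `mse` and the rule inside `target` are hypothetical, chosen only for illustration.

```python
def memorize(examples):
    """A 'model' that only recalls what it was shown."""
    table = dict(examples)
    return lambda x: table.get(x, 0.0)   # arbitrary guess of 0.0 for unseen inputs

def fit_line(examples):
    """Closed-form least-squares fit of y ~ w*x + b."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return lambda x: w * x + b

def mse(model, examples):
    """Mean squared error of a model over (input, target) pairs."""
    return sum((model(x) - y) ** 2 for x, y in examples) / len(examples)

def target(x):
    return 3.0 * x - 1.0                 # hidden rule, used only to make data

train = [(float(k), target(float(k))) for k in range(10)]
held_out = [(k + 0.5, target(k + 0.5)) for k in range(10)]  # unseen inputs

results = {}
for name, model in (("memorizer", memorize(train)), ("line", fit_line(train))):
    results[name] = (mse(model, train), mse(model, held_out))
    print(name, "train error:", results[name][0], "held-out error:", results[name][1])
```

Both models fit the training examples essentially perfectly, so training error alone cannot tell them apart. Only the held-out error shows that the memorizer has not learned anything it can apply to new inputs.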

These ilities tie in with the “5 Capability Levels of Deep Learning”. At each level we can explore the nature of the expressibility, trainability and generalizability we require to achieve that level. As an example, we can look at machines at the Classification with Memory level. What does the additional memory component add to expressibility, trainability and generalizability? In the case of expressibility, we can see that memory permits a machine to perform translation instead of just classification. In terms of trainability, we had to come up with additional mechanisms to learn how to update memory. Finally, for generalizability, we need to use other kinds of benchmarks (e.g. BLEU, bAbI) to evaluate this kind of system. At every capability level, we need to re-explore how we achieve each of these 3 ilities.
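
To give a feel for why translation needs a different kind of benchmark than classification, here is a sketch of clipped n-gram precision, the building block that BLEU is based on. It is deliberately incomplete: full BLEU also combines several n-gram orders and applies a brevity penalty, and it usually supports multiple references, none of which is shown. The function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram credited at most as many times as it occurs in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0                       # candidate too short for this n
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the cat is on the mat"
print(clipped_precision("the cat sat on the mat", reference))
print(clipped_precision("the the the the the the the", reference))
```

The clipping is what stops a degenerate output that just repeats a common word from scoring well: "the" occurs twice in the reference, so only two of the seven repetitions are credited.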

Ideally we would like to see a framework where one understands how to compose various building blocks, driven by an understanding of how each block contributes to trainability, expressibility or generalizability. Deep Learning is still very young in that we have few tools to evaluate the effectiveness of our solutions. Additionally, other ilities such as interpretability, transferability, latency, adversarial stability and security are worth exploring.

For more on this, read “The Deep Learning AI Playbook”.