3 Essential Deep Learning Architecture Abilities (i.e., “Ilities”)
People in software development are familiar with the phrase “-ilities.” It is actually not a word, but you can google it:
Informally these are sometimes called the “ilities”, from attributes like stability and portability. Qualities — that is non-functional requirements — can be divided into two main categories: Execution qualities, such as security and usability, which are observable at run time.
You may think that, because it is not a real word, it is merely an informal convention or some kind of loose jargon. This is not the case; software quality attributes have been formalized in ISO 9126:
Quality attributes are realized non-functional requirements used to evaluate the performance of a system. These are informally called “ilities” after the suffix that many of the words share. In software architecture, “ilities” are the qualities that matter when we evaluate our solutions. What the DL literature lacks is a sufficient understanding of how to evaluate the quality of a Deep Learning architecture. What then are the “ilities” that are specific to evaluating Deep Learning systems?
Despite the newness of the field, there are three main “ilities” that a practitioner should know of:
Expressibility: This quality describes how well a machine can approximate functions. One of the first questions that many research papers have tried to answer is “Why does a Deep Learning system need to be Deep?” Put another way, what is the importance of having multiple layers, or a hierarchy of layers? There is some consensus in the literature that deeper networks require fewer parameters than shallow, wider networks to express the same function. You can find more detail on the various explanations here: http://www.deeplearningpatterns.com/doku.php/hierarchical_abstraction. The measure here appears to be how few parameters (i.e., weights) we need to create an effective function approximator. Related research here is weight quantization: how few bits does each weight need before precision is lost?
Trainability: The other kind of research that gets published is on how well a machine can learn. You will find hundreds of papers that all try to outdo each other by showing how trainable their system is compared to the “state-of-the-art.” The open theoretical question here is why these systems learn at all. The reason this is not obvious is that the workhorse of Deep Learning, the stochastic gradient descent (SGD) algorithm, appears far too simplistic to possibly work! There is a conceptual missing link here that researchers have yet to identify.
Generalizability: This quality describes how well a trained machine can make predictions on data that it has not seen before. I’ve written about this in more detail in “Rethinking Generalization,” where I describe five ways to measure generalization. Everyone seems to talk about generalization. Unfortunately, few have a good handle on how to measure it.
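To make the last two ilities a little more concrete, here is a minimal sketch of how one might observe them on a toy problem. Everything in it is an assumption chosen for illustration: the task (noisy samples of y = 2x + 1), the two-parameter linear model, the learning rate, and the helper names `make_data`, `mse`, and `sgd`. Trainability shows up as the drop in training loss under plain SGD; generalizability shows up as the loss on held-out samples the model never trained on.

```python
import random


def make_data(n, rng):
    """Draw n noisy samples from y = 2x + 1 (a hypothetical toy task)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)
        data.append((x, y))
    return data


def mse(params, data):
    """Mean squared error of the linear model y = w*x + b on a dataset."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def sgd(data, rng, lr=0.1, epochs=50):
    """Plain stochastic gradient descent: one parameter update per sample."""
    w, b = 0.0, 0.0
    data = list(data)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = w * x + b - y       # prediction error on this sample
            w -= lr * 2.0 * err * x   # gradient of err**2 with respect to w
            b -= lr * 2.0 * err       # gradient of err**2 with respect to b
    return w, b


rng = random.Random(0)
train = make_data(200, rng)
held_out = make_data(200, rng)  # never used for parameter updates

initial_loss = mse((0.0, 0.0), train)
params = sgd(train, rng)
train_loss = mse(params, train)        # trainability: did the loss go down?
held_out_loss = mse(params, held_out)  # generalizability: loss on unseen data
generalization_gap = held_out_loss - train_loss

print(f"train loss {initial_loss:.3f} -> {train_loss:.4f}")
print(f"held-out loss {held_out_loss:.4f} (gap {generalization_gap:+.4f})")
```

A linear model on a linear task is the easy case, so the gap here should be tiny. The interesting questions in the literature are about what happens to these same two numbers when the model is a deep network with far more parameters than training examples.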
In computer science, we do understand expressibility. In its most general form, this is the notion of “Turing Completeness” or “Universal Computation” (see: “Simplicity of Universal Machines”). Feed-forward networks and Convolutional Networks, for example, are not Turing complete simply because they don’t have memory. What Deep Learning brings to the table that is radically different from conventional computer science is the latter two capabilities.
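The point about memory can be illustrated with a small sketch. It is not a neural network and it proves nothing about Turing completeness; it only contrasts a stateless mapping, whose output depends on the current input alone, with a component that carries state between steps. The running-parity task and the names `stateless_parity` and `StatefulParity` are hypothetical, chosen for this example.

```python
def stateless_parity(bit):
    """Feed-forward style: the output depends only on the current input."""
    return bit & 1


class StatefulParity:
    """Recurrent style: one bit of state is carried from step to step."""

    def __init__(self):
        self.state = 0

    def step(self, bit):
        # XOR the new bit into the remembered parity of everything seen so far.
        self.state ^= bit & 1
        return self.state


stream = [1, 1, 0, 1]

stateless_outputs = [stateless_parity(b) for b in stream]

cell = StatefulParity()
stateful_outputs = [cell.step(b) for b in stream]

print(stateless_outputs)  # [1, 1, 0, 1]
print(stateful_outputs)   # [1, 0, 0, 1]
```

The stateless function gives the same answer for the same input every time, so it cannot track the parity of an arbitrarily long stream. The stateful version can, because its answer depends on history as well as on the current input.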
Trainability, the ability to train a computer rather than program it, is a major capability. This is “automating automation.” In other words, you don’t need to provide specific detailed instructions; instead, you just need to provide the machine with examples of what it needs to do. We’ve seen something similar before in the difference between imperative and declarative programming. The difference in Deep Learning (or Machine Learning), however, is that we don’t need to define the rules. The machine can discover the rules for itself.
Even better, generalization implies that a trained machine that encounters a situation for which it has never been shown an example can still figure out how to make the correct prediction. In other words, having discovered the rules during training, it is now able to create new rules on its own for unexpected situations. The machine has become more adaptable.
These ilities tie in with the “5 Capability Levels of Deep Learning”. At each level we can explore the nature of the expressibility, trainability and generalizability we require to achieve that level. As an example, we can look at machines at the Classification with Memory level. What does the additional memory component add to expressibility, trainability, and generalizability? In the case of expressibility, we can see that memory permits a machine to perform translation instead of just classification. In terms of trainability, we had to come up with an additional mechanism to learn how to update memory. Finally, for generalizability, we need to use a different kind of benchmark (e.g., BLEU, bAbI) to evaluate this kind of system. At every capability level, we need to re-explore how we achieve each of these three ilities.
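To give a feel for why translation needs a different yardstick than classification accuracy, here is a simplified, sentence-level sketch in the spirit of BLEU. It is an assumption-laden illustration, not the reference implementation: it uses a single reference sentence, whitespace tokens, n-grams only up to length 2, and no smoothing. The function names `ngrams`, `modified_precision`, and `simple_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipping.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference, so repeating a correct word earns no extra credit.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        # Penalize candidates that are shorter than the reference.
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat".split()

print(simple_bleu(reference, reference))
print(simple_bleu("the cat sat".split(), reference))
print(simple_bleu("the the the the the the".split(), reference))
```

The third candidate is the classic degenerate case: every word it emits appears in the reference, yet clipping and the bigram term drive its score to zero. A plain per-token accuracy would not catch that, which is part of why sequence outputs get their own metrics.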
Ideally, we would like to see a framework in which one understands how to compose various building blocks, driven by an understanding of how each block contributes to trainability, expressibility or generalizability. Deep Learning is still so young that we have few tools to evaluate the effectiveness of our solutions. Additionally, other ilities such as interpretability, transferability, latency, adversarial stability and security are worth exploring.