Deep Learning Modularity and Language Models

Carlos E. Perez · Published in Intuition Machine · Feb 16, 2023

DALL-E to SD via ControlNet

Modularity is essential for any disruptive technology. For years, deep learning lacked the remix capability needed to quickly customize solutions: everything had to be either trained from scratch or fine-tuned. The latest innovations are tearing down these restrictions.

Modularity allows a developer to combine an existing module with others to generate a bespoke solution. Years ago, this was difficult to do with deep learning. (see: The Emergence of Modular Deep Learning)

Transformer and diffusion models have radically changed this. Transformers are an essential building block for specifying constraints, while diffusion models serve as a mechanism for parallel constraint satisfaction. Together they have led to the powerful generative AI of today.

An alternative way of framing machine learning is through the lens of a constraint-based approach. This approach has been overlooked by too many practitioners who misframe everything in terms of classification. (see: https://www.amazon.com/Machine-Learning-Constraint-Based-Marco-Ph-D/dp/0081006594/ref=sr_1_4)

Perhaps this is due to our substance-philosophy bias, which places the act of naming on a pedestal: “And Adam called names for all cattle, and all birds of Heaven, and all animals of the Earth.” It took Darwin to dispel this mythology of static categories.

Humans categorize things to reason about them more easily. It’s a cognitive chunking behavior in which stereotyping reduces the complexity of our thought process. Of course, this often leads to gross errors. We can think outside this box by working through analogies and metaphors.

All biological cognition is based on constraint satisfaction. All actions involve balancing multiple competing forces to arrive at a good, but not perfect, solution. All biological life is in a constant struggle to maintain homeostasis.

However, in biology, the specification of control is implicit, a consequence of billions of years of evolution. For the machines that humans need to control, we depend on names (as Adam did) to reason about how to control those systems.

Symbols are necessary for control because humans need their complexity-hiding nature. We need artificial constructs to reason. We don’t have the cognition of a deep learning system that can “reason” absent any symbols.

The immense utility of diffusion models might not have been discovered so easily had they not been tied to language models (see: CLIP). In the past, GANs (see: StyleGAN) were driven by an implicit latent code. Natural language surfaced relationships in words that were missing from that code.

Deep learning modularity depends on our human ability to specify constraints, and that is made possible by language models. Few-shot prompting of LLMs makes this possible without any retraining of the network.
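As a rough sketch of what “no retraining” means in practice, the constraint can be supplied entirely in-context as a handful of worked examples. The model name and the toy task below are only illustrative; any causal language model would do, and output quality depends heavily on the model.

```python
# Few-shot prompting: the "module" is the prompt itself; no weights are updated.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative stand-in model

# The constraint is specified entirely in-context, as worked examples.
prompt = (
    "Translate color names to hex codes.\n"
    "red -> #FF0000\n"
    "green -> #00FF00\n"
    "blue ->"
)

print(generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"])
```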

Natural language isn’t the only way to define constraints. For imagery, we are often at a loss for words when describing our images. That is why a development like ControlNet (and similar systems) leads to immense opportunities. (see: lllyasviel/ControlNet: Let us control diffusion models)

There exists a very subtle but powerful methodology in deep learning. The simple idea is to use one neural network to extract the “specification” from an example, then use another neural network to generate new examples from that extracted specification. This is a universal idea.

An example of this approach is “Hard Prompts Made Easy (PEZ)”. The idea is to extract words from sample images and use them as a prompt for generating new, similar images. huggingface.co/spaces/tomg-gr…

Pez Dispenser — a Hugging Face Space by tomg-group-umd
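A minimal sketch of this extract-then-generate pattern follows. PEZ optimizes a hard prompt directly against CLIP; here an off-the-shelf captioning model stands in for that extraction step, and the checkpoint and file names are only illustrative.

```python
# Extract a textual "specification" from an image, then regenerate from it.
# (BLIP captioning stands in for PEZ's hard-prompt optimization.)
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1: network A extracts the specification (a prompt) from an example image.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

image = Image.open("example.jpg").convert("RGB")  # illustrative input file
inputs = processor(image, return_tensors="pt").to(device)
prompt = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)

# Step 2: network B generates new examples from the extracted specification.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
new_image = pipe(prompt).images[0]
new_image.save("regenerated.png")
```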

Many projects, unfortunately, treat modularity as an afterthought. What I mean is that the models they’ve trained are monolithic and don’t leverage the existing ecology of networks.

A striking example of this is Instruct-Pix2Pix. The amazing thing about it is that it can perform difficult transformations with diffusion models. Its downside, however, is that it loses existing functionality. (see: GitHub — timothybrooks/instruct-pix2pix)

What I mean is that there are surprising gaps in what can be instructed using Instruct-Pix2Pix. These gaps likely result from gaps in the training set. In contrast, img2img diffusion methods have no gaps between image and concept: one can morph any image into another.
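For reference, here is roughly how instruction-driven editing looks with the diffusers InstructPix2Pix pipeline. The edit only succeeds when the instruction falls within what the training data covered; file names are illustrative.

```python
# Instruction-driven editing: one monolithic model maps (image, instruction) -> image.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("cat.jpg").convert("RGB")  # illustrative input file

# The instruction itself is the only constraint; there is no separate structural input.
edited = pipe(
    "turn the cat into a corgi",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited.save("edited.png")
```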

ControlNet is more powerful than Instruct-Pix2Pix because of its modular architecture and its more expressive form of constraint specification. Hence you can turn a cat into gold, white porcelain, or a corgi!
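A sketch of the same edit done the ControlNet way with diffusers: the cat photo is reduced to a Canny edge map that pins down the structure, and the prompt alone decides whether the result is gold, porcelain, or a corgi. Checkpoints and file names are illustrative.

```python
# ControlNet: structure comes from an edge map, appearance comes from the prompt.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Edge extraction: the image is reduced to a structural constraint.
edges = cv2.Canny(np.array(Image.open("cat.jpg").convert("RGB")), 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Same edges, different prompts: "a corgi", "a cat made of gold", "a white porcelain cat".
result = pipe("a corgi", image=control, num_inference_steps=20).images[0]
result.save("corgi.png")
```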

ControlNet can be made still more modular by using a method analogous to LoRA. The benefit of this method is that you can stack many fine-tuned adapters together to generate more complex constraints (see: Using LoRA for Efficient Stable Diffusion Fine-Tuning). Hugging Face recently released PEFT (see: https://github.com/huggingface/peft) as a general framework for adapting pre-trained networks.

https://github.com/microsoft/LoRA
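A minimal sketch of attaching a LoRA adapter with PEFT follows. The base model and hyperparameters are illustrative; the point is that only the small adapter is trained while the base network stays frozen, so adapters can later be saved, swapped, or stacked.

```python
# LoRA via PEFT: train a small adapter instead of the full network.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative frozen base network

config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a tiny fraction of the weights are trainable
# Train as usual; the resulting adapter is a small, swappable module.
```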

This allows many kinds of guidance methods to steer the diffusion process simultaneously. However, each guidance method takes a different kind of specification. Rather than just a single text or image input, multiple kinds of input will be required.
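A sketch of stacking multiple guidance signals, assuming the diffusers multi-ControlNet interface: each ControlNet takes its own kind of conditioning input (here, edges plus a pose skeleton). The checkpoints and precomputed conditioning images are illustrative.

```python
# Multiple guidance signals at once: each ControlNet takes its own conditioning input.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

canny_net = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pose_net = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[canny_net, pose_net],  # stacked constraints
    torch_dtype=torch.float16,
).to("cuda")

edges = Image.open("edges.png")  # precomputed Canny map (illustrative)
pose = Image.open("pose.png")    # precomputed OpenPose skeleton (illustrative)

result = pipe(
    "a person in a gold jacket",
    image=[edges, pose],                         # one conditioning image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.8],    # per-constraint strength
).images[0]
result.save("combined.png")
```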

If this is done for images, it should also be done for text. Why should we confine ourselves to text alone in our commands to machines? Thus we see papers that combine text and images to generate controlled responses. (see: Multimodal Chain-of-Thought Reasoning in Language Models)

The future of interaction with AI will not be plain natural language text but rather graphs that explicitly express the relationships among multimedia elements. Deep learning modularity begins with a standardized expression language (i.e., a DSL) that captures complex relationships.
