Transformers Rule Everything Around Me: Text to Geometry as Computational Design enabler for the metaverse and product design

Mike Kuniavsky
Published in Labs Notebook
Apr 14, 2022 · 8 min read

NB: as I wrote this blog post, OpenAI’s DALL-E 2 dropped, and it produces strikingly finished, yet very human, 2D imagery from text. It’s easy to pontificate about what its performance implies, start seeing AGIs everywhere, and declare creative work by humans cancelled. I’m not ready to do that yet, because — like seeing patterns in noise — I’m not sure whether what it produces is genuinely meaningful content, whether it is mostly hacking our intuition about what meaningfulness is, or whether there’s any difference between the two. Certainly there are people who speak or write in (essentially) Markov chains, generating new combinations of the same old elements without ever adding any underlying novelty, and there are many people who eagerly hear or read them despite that lack of novelty… so does it matter when a model does the same thing, and is that what DALL-E and GPT-3 do? DALL-E raises a lot of big questions and, uncharacteristically, provides no answers, so I’m going to pause my evaluation of it for a while.

A 3D model generated from text (Source: Jain et al)

Our team’s research has used language models and transformers (GPT-3, BERT, etc.) for several years to explore the impact such models can have on design practice.

  • One project uses the text of project briefs to generate strategic questions for product teams early in the design process (think really specific Oblique Strategies cards, or a 5-year-old with a surprising amount of domain-specific knowledge). Our goal is to get product teams (designers, PMs) to consider aspects they may not have expected before they begin detailed design.
  • Another generates rich, diverse user personas from basic demographic profiles to help product teams avoid the common pattern where personas end up shockingly resembling either the design team themselves or idealized model consumers.

(HT to Charles Foster for leading both projects when he worked with us)

These projects have proven to be insightful tools for augmenting current design processes, but the more we work on them, the clearer it becomes that language models, and language transformation, can have a much greater, and much more direct, impact on design, and we’re now starting to see the inklings of it. In this blog post I want to talk about several recent papers that may point the way toward the kind of computational design that will dominate how products are developed in the future (see my earlier blog posts for more thoughts/justifications).

3D models generated from text. The grey donkey on the left provides the structural model onto which the jeans/donkey surface is applied. (Source: Michel et al)

Let’s start with an abstract thought: many things, perhaps everything, can be thought of as a kind of language. We already talk about visual design as a kind of language when we:

  • Discuss how one brand’s design language differs from another’s.
  • Use computer languages to explicitly link text representations to visual elements, as in CSS, PostScript, SVG, or any number of other similar representations.
  • Use generative grammars to describe families of shapes. Since the beginning of computation there have been quasi-linguistic systems for describing shapes in the world (Stiny’s shape grammars, Conway polyhedra, Lindenmayer’s L-systems, etc.). Christopher Alexander’s A Pattern Language for architecture was explicitly labelled as such (RIP Alexander — also see Steenson’s Architectural Intelligence for a discussion of Alexander and AI). A minimal L-system sketch follows this list.
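
To make the grammar-to-shape link concrete, here is a minimal sketch of a Lindenmayer system: a handful of rewrite rules over symbols whose output strings can be read as turtle-graphics drawing instructions, so manipulating symbols literally generates geometry. The rules below are the classic Koch-curve production and are purely illustrative.

```python
# A minimal L-system: rewrite rules over symbols whose strings double as drawing
# instructions (F = move forward while drawing, + / - = turn left / right).
RULES = {"F": "F+F-F-F+F"}  # the classic Koch-curve production

def rewrite(axiom: str, rules: dict, iterations: int) -> str:
    """Apply the production rules to every symbol in the string, `iterations` times."""
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

if __name__ == "__main__":
    # Each pass turns a short 'sentence' into a longer one describing a more
    # detailed shape; feed the result to any turtle-graphics interpreter to draw it.
    print(rewrite("F", RULES, 2))
```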

On an even more fundamental level, information theory (in my limited understanding) posits that all phenomena carry information, that information can be represented as symbols, and that those symbols can be systematically manipulated to create new knowledge and new phenomena. Combining symbols and rules is one way to define a language (HT to John Lawler, my college computational linguistics professor).

In other words, we can turn everything — ideas, events, shapes, time, etc — into a kind of language, and then relate all of those pieces to each other as an act of translation. Without spiraling down into whoa-dude everything-is-everything noodling, I think we already experience these symbolic translation relationships every day when we communicate visual or interaction design ideas using words and symbols, or when words inspire insights into the design of experiences. We already make these associations, so there’s no reason that language models can’t do the same. Computational models may well do it much better than we can, since they are not limited by the cognitive and sensorial hardware we’re born with.

From Ingenious Mechanisms for Designers and Inventors, Vol 1, Jones, 1930.

Some example languages I’m thinking of include:

  • the language of toothed clockwork mechanisms
  • the language of human limb prosthetics
  • the language of footwear

A mechanical wristwatch, a wind-up duck, and an automatic transmission all exist as statements in the language that can be defined by toothed clockwork mechanisms. Using a spatial metaphor, it’s toothed clockwork space, the way that the rules of haiku define the haiku subspace of written language. Moreover, because language representation spaces have continuous gradations, there are intermediate stops between two human-defined concepts that are otherwise discrete in our current definitions. Somewhere between the cluster of points in toothed clockwork space that describes clocks and the one with wind-up ducks is a set of intermediate states representing cuckoo clocks, which are more clock-y when closer to the clocks and increasingly duck-y when approaching the ducks.
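
To make the spatial metaphor a bit more concrete, here is a minimal sketch of walking between two concept embeddings, assuming the open-source CLIP text encoder (any encoder that maps phrases to vectors would do). The prompts, and the expectation that “cuckoo clock” scores highest somewhere near the middle of the walk, are illustrative only, not results from any of the papers discussed below.

```python
# A sketch of "intermediate stops" between two concepts in an embedding space.
# Assumes the open-source CLIP package (github.com/openai/CLIP); illustrative only.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    tokens = clip.tokenize([
        "a mechanical clock",
        "a wind-up duck",
        "a cuckoo clock",
    ]).to(device)
    clock, duck, cuckoo = model.encode_text(tokens).float()
    clock, duck, cuckoo = (v / v.norm() for v in (clock, duck, cuckoo))

# Walk the line from clock-space toward duck-space and see where "cuckoo clock" sits.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    blend = (1 - t) * clock + t * duck
    blend = blend / blend.norm()
    print(f"t={t:.2f}  similarity to 'a cuckoo clock': {float(blend @ cuckoo):.3f}")
```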

When every shape in the world is the product of a language, the practice of design becomes the process of describing, constraining and navigating the possibility space defined by the translation of one language into another. It is no longer the manual manipulation of geometrical representations. Design can be what you do when you connect one symbolic domain, say words or personal behavior data sets, to another, say the shape and arrangement of different functional components. Or it can lie in constraining a resulting translation (“Make all products this model produces in a 1960s Ferrari dialect, as defined by the following examples of period car models, vintage color palettes, and advertisements.”).

When a model learns to associate the features of one representation, say the cultural use context (“worn on the wrist by children to keep and represent time using various temporal scales”), with the shape and arrangement representation of another (toothed clockworks), the transformer produces a possibility space of solutions (“children’s mechanical wristwatches”).

I realize this sounds impractical and fantastical, but it’s increasingly real. Just in the last couple of weeks, several papers have dropped that describe models which do exactly this kind of text-to-geometry transformation.

From Michel et al (note that the model has understood “Luxo lamp” to be one with minimal ornamentation, as compared to “lamp”, which probably represents a kind of centroid point in lamp language space)
  1. Michel et al’s “Text2Mesh: Text-Driven Neural Stylization for Meshes” describes a model that abstracts the surface treatment of physical objects from those objects’ underlying structure. Much like texture and bump mapping in computer graphics, it extracts surface patterns and, at some level (which I will admit I don’t fully understand), learns a sensical arrangement of the geometric elements on a base model. This is, I suspect, analogous to how language models simultaneously learn vocabulary and syntax, so that when they map one language to another, they generate grammatically correct sentences. Moreover, since the design space is continuous, they can do the exact kind of cuckoo clock partial transformation I describe above (see Figure 10 in their paper) and generate designs that fall in between concepts, such as something that’s 20% of the way between a wooden chair and a crochet chair. Undeniably, the results can produce monsters, like the early Google Inception work, but the work points to some very exciting possibilities.
  2. Khalid et al’s “Text to Mesh Without 3D Supervision Using Limit Subdivision” describes a different system with a similar name but a somewhat different goal. It aims to generate novel 3D models solely from text prompts (so no base model to work from) and to do so in a completely unsupervised way, or — more specifically — to use the models we already have for identifying and annotating 2D images (i.e. those computer vision models that distinguish pictures of chihuahuas from blueberry muffins). Their results are significantly more surreal than what Michel et al’s work produces, but the point is that they’re starting from, quite literally, nothing but words and then generating semi-viable 3D geometry. (A simplified sketch of the kind of render-and-score loop these systems rely on follows this list.)
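
To give a flavor of that shared loop, here is a heavily simplified sketch, assuming the open-source CLIP package: render the current geometry, embed the render and the text prompt, and nudge the geometry parameters to increase their similarity. The render_views function is a toy stand-in for a differentiable renderer, and the per-vertex displacement/color parameterization is my own simplification, not either paper’s actual architecture.

```python
# A heavily simplified sketch of CLIP-guided geometry optimization, in the spirit
# of the papers above. render_views is a toy stand-in for a real differentiable
# renderer, and the parameters are a crude per-vertex displacement and color.
import torch
import clip

device = "cpu"  # CPU keeps CLIP in fp32, which keeps this toy sketch simple
clip_model, _ = clip.load("ViT-B/32", device=device)
for p in clip_model.parameters():
    p.requires_grad_(False)

prompt = clip.tokenize(["a chair made of crochet"]).to(device)
with torch.no_grad():
    target = clip_model.encode_text(prompt).float()
    target = target / target.norm(dim=-1, keepdim=True)

def render_views(displacements: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for a differentiable renderer. A real system would rasterize
    the displaced, colored base mesh from several camera angles; here we just tile
    a parameter-dependent color into an image so the loop runs end to end."""
    shade = torch.sigmoid(colors.mean(dim=0) + displacements.mean())
    return shade.reshape(1, 3, 1, 1).expand(1, 3, 224, 224)

num_vertices = 5_000
displacements = torch.zeros(num_vertices, 3, requires_grad=True)
colors = torch.full((num_vertices, 3), 0.5, requires_grad=True)
optimizer = torch.optim.Adam([displacements, colors], lr=5e-3)

for step in range(200):
    images = render_views(displacements, colors)          # "generate"
    image_emb = clip_model.encode_image(images).float()    # "evaluate" against text
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    loss = -(image_emb @ target.T).mean()                  # maximize CLIP similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the actual systems the renderer produces multiple views of a real mesh at each step and there are additional constraints and augmentations, but the basic structure (generate, render, score against the text, update) is the same.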

“Mold beauty out of clay/Write words for me to say” — from Squid’s Narrator

Meta’s Builder Bot demo from last month is part of the same universe, though it appears to just be selecting pre-made models from an existing database, then placing them in a scene, rather than generating the geometry from scratch.

There are many applications. I’m most interested in product design, but probably the most obvious near-term application of this technology is the design of assets for metaverse projects. Metaverse/VR design is not (for the most part) going to be done by starchitects or master designers, or built out by modelers. It’s going to be done by consumers asking the builder system to “make me a plaza surrounded by buildings that look like how Beyonce’s ‘Single Ladies’ sounds on a rainy day. Awesome. Now make me taxis that look like Corgis at play.”

Still from Squid’s ‘Narrator’ video, one of my favorite recent videos that provides an idea of how all future metaverse content could be generated. Also a great song about the dangers of creating your own reality. (source)

Broadly speaking, I think this is another example of how computational design will be as distinct from design for mass manufacturing as that kind of design is distinct from pre-Industrial Revolution individual craftsmanship. The tools, concepts and workflows (and likely the social structures, business models, and notions of quality and value) will all be distinct as the field matures. One of the main questions now is how to turn these impressive and amusing, but impractical, examples into valuable design tools that do not just create baroque chimeras to shock us, but produce designs we can actually use.

If you are interested in doing research in this space, I am hiring researchers for my team. We’re looking for folks with PhDs or a decade of deep technical and research experience exploring topics related to computer vision, deep learning, computational geometry, and human-computer interaction. You do not need all the qualifications listed in the job description to apply, but you do need to be articulate and passionate about multidisciplinary research.

Bonus links

Speaking of which, here are some entertaining chimera chair designs (will the “avocado chair” become the Utah teapot of this discipline?):

Various techniques are being deployed to generate geometry directly from text. Here’s a particularly surreal one from Jain et al. (source)

If you want a deep dive into how this all works, check out this great blog post by Lj Miranda, which talks about, broadly, how these systems connect words on one side to geometrical representations on the other using a feedback loop between a system that generates geometry and one that evaluates that geometry in relation to the input words. I don’t think it’s exactly how DALL-E works, but it’s a great place to start.
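
As a small, concrete illustration of the “evaluator” half of that loop, here is a minimal sketch of scoring a single image against a text prompt with CLIP; the image filename and the prompt are placeholders.

```python
# A minimal sketch of the "evaluator" half of the feedback loop: how well does an
# image match a text prompt? The image path and prompt are placeholders; in the
# loop sketched earlier, the image would be a render of the candidate geometry.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("render_of_candidate_geometry.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a taxi that looks like a Corgi at play"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    score = (image_features @ text_features.T).item()

print(f"CLIP similarity between render and prompt: {score:.3f}")
```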
