Dall-E and Wittgenstein’s Picture Theory of Meaning

Carlos E. Perez
Published in Intuition Machine
Jan 6, 2021
Picture created in collaboration with several intuition machines.

Ludwig Wittgenstein’s picture theory of meaning (also known as the picture theory of language) is a theory of linguistic meaning articulated in his published work, the Tractatus Logico-Philosophicus. Wittgenstein suggested that a meaningful proposition pictures a state of affairs. He claimed that there is an unbridgeable gap between what can be expressed in language and what can only be expressed in non-verbal ways. The theory states that verbal statements are meaningful only if they can be pictured in the real world.

OpenAI has recently released Dall-E, which amounts to a demonstration of Wittgenstein’s theory. Automation appears to bridge the gap between linguistic expression and pictures.

What does Dall-E tell us about the meaning of understanding?

The demo output shows that Dall-E can generate an image with the right texture and letter ordering as described by the text. How does it implicitly know what a neon texture looks like, or that the letter order needs to be preserved?

Said differently, where is the grounding for that information? It is, of course, in the training data. The model has seen enough neon signs. But what about the letter arrangement? Where does that come from?

From a human perspective, the order is right there in the text itself. But a GPT-3 language model doesn’t actually see text; it sees a sequence of ‘word piece’ tokens.

But how does it know how a word piece is rendered? It never sees the text as we see it. Each token is just an opaque number from its perspective.
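To make that concrete, here is a minimal sketch of what the model actually receives in place of readable text, using the GPT-2 tokenizer from the Hugging Face transformers library (GPT-3 uses the same byte-pair-encoding scheme). The prompt string is only an illustrative example:

```python
# A minimal sketch (not OpenAI's code) of byte-pair-encoding tokenization,
# using the GPT-2 tokenizer from Hugging Face transformers; GPT-3 uses the
# same BPE scheme. The point: the model never receives letters or glyphs,
# only a sequence of opaque integer IDs.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prompt = "a neon sign that reads backprop"
token_ids = tokenizer.encode(prompt)
pieces = tokenizer.convert_ids_to_tokens(token_ids)

print(token_ids)  # a list of opaque integers, with no letter shapes anywhere
print(pieces)     # subword pieces; the 'G' with a dot prefix marks a leading space
```

Nothing in those integer IDs encodes what the letters look like; that visual knowledge has to come from somewhere else.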

The grounding of the visual representation of text is also in the training set. When the training data includes a photo of a stop sign, the word ‘stop’ is sometimes rendered on the sign itself.

How does it infer that all these snippets of data are useful for its task, and how does it compose them to satisfy the input description?

Understanding of symbols (i.e. language) requires symbol grounding. GPT-3 already had a partial form of symbol grounding. Dall-E just demonstrates that this symbol grounding is expressible in a different medium (i.e. images).
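A rough sketch of how that cross-medium grounding can work mechanically, following OpenAI’s description of Dall-E as a single autoregressive transformer trained on one joint stream of text tokens and discretized image tokens. This is a toy illustration with made-up sizes and names, not OpenAI’s implementation; in the real system the sampled image tokens would be decoded back to pixels by a separately trained discrete VAE:

```python
# A toy illustration (not OpenAI's code) of the recipe OpenAI describes for
# Dall-E: text tokens and discretized image tokens are concatenated into one
# sequence, and a single autoregressive transformer learns to predict the
# next token, so image tokens become conditioned on (grounded in) the text.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 1000, 512   # toy sizes; the real vocabularies are far larger
VOCAB = TEXT_VOCAB + IMAGE_VOCAB      # image token ids are offset past the text ids
TEXT_LEN, IMAGE_LEN = 16, 64          # toy lengths; the real token stream is far longer
D_MODEL = 128

class TinyJointTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # additive causal mask: each position may only attend to earlier positions
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=causal)
        return self.head(x)           # next-token logits over the joint vocabulary

@torch.no_grad()
def generate_image_tokens(model, text_tokens):
    """Autoregressively sample IMAGE_LEN image tokens conditioned on the text."""
    seq = text_tokens
    for _ in range(IMAGE_LEN):
        logits = model(seq)[:, -1, :]
        logits[:, :TEXT_VOCAB] = float("-inf")   # only image tokens may be sampled
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, TEXT_LEN:]          # a trained dVAE would decode these to pixels

model = TinyJointTransformer()
caption_ids = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))   # stand-in for BPE ids
image_tokens = generate_image_tokens(model, caption_ids)
print(image_tokens.shape)             # torch.Size([1, 64])
```

Because every image token is predicted with the text tokens in its context, the statistics that tie ‘neon’ to glowing tubes, or a letter sequence to its rendered order, are learned the same way GPT-3 learns to ground one word in its neighbors.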

Intelligence can be framed as the interpretation of intent and the subsequent generation of a procedure that satisfies the expressed intention.

How does an agent express understanding? Here, Dall-E expresses it by rendering an image. Do we have to revise our understanding of ‘understanding’ in light of this new development?

A human example of understanding is demonstrated by hilarious Photoshop edits in which a request is fulfilled in an absurdly literal way. It takes a level of understanding to create these edits, even though they deliberately do not capture the original intent. GPT-3 can, in fact, generate humor when it produces responses that rest on a similarly literal interpretation.

The route towards AGI can be framed using the metaphor of a programming compiler: translating a high-level expression of intent into executable action.

OpenAI has demonstrated automation that translates intention into action, in a manner Wittgenstein described almost a century ago.

gum.co/empathy
