Artifacts of Utterance: Interpretation and Translation in the age of AI

Environments Studio IV | Dan Lockton | Spring 2018

Ongoing collaboration with Scott Leinweiber

Project Brief: Regarding the question of human intelligences and artificial intelligences (of various kinds) together, in environments — what dimensions could there be to these interactions, and what issues do they highlight, now and in the future? What is the role of designers in these situations?


Facilitated by a whirlwind of emerging tech and consequently shifting social paradigms, the integration of AI into creative tools has revolutionized the way creators synthesize ideas through digital media.

  • Early explorations of AI positioned as autonomous collaborators in the creative process
  • Emergent businesses and trades: AI systems help monitor, quantify and visualize complex patterns of user behavior (e.g., autonomous vehicles, cybersecurity).
  • Open source end-user tools for 2D and 3D synthesis of sounds, visuals and form
Harold Cohen’s AARON AI Collaborator; Chris Schmandt’s “Put That There” (1979), MIT Media Lab Speech Interface group; computer vision by Uber and Tesla
Google Magenta’s NSynth for music synthesis, Lobe Visual Interface for deep learning, “Project Dreamcatcher” by Autodesk for form interpolation

As our modes and vehicles of personal expression in this digital age become increasingly complex, how do we attribute meaning to things in our lives, consciously and innately? What are the ways in which we let algorithms attribute meaning for us? How will recognizing patterns in our meaning-making help us better understand ourselves/others?

Voice Intelligence

Our voice is one of our first creative and collaborative tools. When this innate device serves as a touchpoint for human-computer interaction, people of many ages and physical abilities can access complex information systems and technologies. In these contexts, artificial intelligence plays a critical role in the way meaning is derived from human speech.


This project positions an AI system as a tool that uses properties of natural speech (acoustic qualities + socio-cultural and personal context + emotional intent) as a generative framework for interpretation and translation.

I’m interested in investigating the affordances of AI in the context of real-time interpretation and translation, as well as generating new models of “meaning-finding” in linguistic-to-visual translation. I am also curious as to how truly “intelligent” these systems can become regarding the phenomenon of context-switching across conversations and in different cultures. With their ability to process complex linguistic structures in tandem with the socio-cultural digital footprint of their user, AI-assisted voice interfaces can revolutionize not only the way we speak to computers but also how we speak to each other. By representing two speakers’ unique utterances in the form of personalized visual artifacts, an AI can construct a visual scenario that bridges an understanding between distinct oral cultures.

Breaking Down Natural Speech

Illustration by Bee Johnson

Modern, widespread frameworks for speech analysis look at three things:

  1. What you mean (semantic analysis → acoustic, prosodic, linguistic)
  2. Who you are (digital footprint → social and visual media, digital habits)
  3. Where you are (regional, geographic)

What You Mean

Prosody is the study of the tune and rhythm of speech and how these features contribute to meaning. The study usually applies to a level above that of the individual phoneme and very often to sequences of words (in prosodic phrases).

At the phonetic level, prosody is characterized by:

  • vocal pitch (fundamental frequency) (other: intonation)
  • loudness (acoustic intensity) (other: gain, total power, amplitude)
  • rhythm (phoneme and syllable duration)
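
The first two features in the list above can be estimated directly from a short frame of audio samples. What follows is a minimal sketch, not the method used in this project: it assumes a mono frame held as a plain Python list, uses a simple autocorrelation peak as the pitch estimate, and uses RMS amplitude as a stand-in for loudness. The function names, the 75 to 400 Hz search range, and the synthetic test tone are all assumptions made for illustration. Rhythm is left out because it needs phoneme or syllable boundaries first.

```python
import math


def rms_loudness(frame):
    """Root-mean-square amplitude, a simple proxy for acoustic intensity."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) from the strongest autocorrelation lag.

    fmin and fmax bound the search to a plausible speaking range (assumed values).
    Returns 0.0 when no positive correlation is found (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Hypothetical test signal: a 200 Hz tone, 50 ms long, sampled at 8 kHz.
sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]

pitch_hz = estimate_pitch(frame, sample_rate)
loudness = rms_loudness(frame)
print(pitch_hz, loudness)
```

On the synthetic tone the pitch estimate should land close to 200 Hz. Real speech is noisier, so a production system would typically smooth the estimate across frames and skip unvoiced segments.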

Speech contains various levels of information that can be described as:

  • Linguistic — direct expression of meaning
  • Paralinguistic — may indicate attitude or membership of a speech community
  • Non-linguistic — may indicate something about a speaker’s vocal physiology, state of health or emotional state
“Good for YOU” vs. “Good FOR you” vs. “GOOD for you”

Michael Halliday describes 5 simple and 2 compound primary tones for English. They are:

  • Tone 1 — falling
  • Tone 2 — high rising
  • Tone 3 — low rising
  • Tone 4 — falling-rising
  • Tone 5 — rising-falling
  • Tone 13 — falling plus low rising
  • Tone 53 — rising-falling plus low rising

Phonetic Profiling

“Fig. 6. Schematic illustrations of the phonetic profiles of positive and negative intensification that emerged from the key words with (a) short vowels (V S ) and (b) long vowels (V L ) in the accented target syllables. The shapes of the polygons in the lower panels represent the acoustic energy (E) courses. The upper panels sketch the characteristic F0 peak contours. Broken lines point to the possibility of voiceless-onset consonants. The different shades of the segment polygons refer to the differences in voice quality (i.e. lighter = breathier). All illustrations are based on the means of table 1. F0 ranges are oriented towards actually found values.” — Oliver Niebuhr, On the Phonetics of Intensifying Emphasis in German

It is also possible to extract meaning from individual phonemes:

“In linguistics, sound symbolism, phonesthesia or phonosemantics is the idea that vocal sounds or phonemes carry meaning in and of themselves.”

Vowel Transcription Systems

A vowel map reflects a regression model capable of recognizing input features and mapping them to specific outputs; in this case, phonemes are mapped to a cartographic 2D space.

Figure 6. Australian English diphthong schematic trajectories superimposed onto the traditional vowel map with IPA cardinal vowels indicated (International Phonetic Association, 1999).

Emotive Modeling

The Circumplex model of Affect

“Factor-analytic evidence has led most psychologists to describe affect as a set of dimensions, such as displeasure, distress, depression, excitement, and so on, with each dimension varying independently of the others. However, there is other evidence that rather than being independent, these affective dimensions are interrelated in a highly systematic fashion. The evidence suggests that these interrelationships can be represented by a spatial model in which affective concepts fall in a circle…” — Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.

Used by many developers and designers of voice interaction and conversational interfaces; see MIT’s EmotiveModeler CAD Tool.
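
To make the circular layout concrete, here is a small sketch that places a point on the circumplex and names the closest affect concept. It assumes the two axes are valence (horizontal) and arousal (vertical), each scaled to the range -1 to 1, and it uses the eight anchor concepts and 45-degree spacing reported in Russell's paper. The function names are hypothetical, and nothing here comes from the EmotiveModeler tool.

```python
import math

# Anchor concepts at 0, 45, 90, ... 315 degrees, following Russell (1980).
AFFECT_LABELS = [
    "pleasure", "excitement", "arousal", "distress",
    "displeasure", "depression", "sleepiness", "relaxation",
]


def affect_angle(valence, arousal):
    """Angle in degrees (0 to 360) of a valence/arousal point on the circle."""
    return math.degrees(math.atan2(arousal, valence)) % 360.0


def nearest_affect(valence, arousal):
    """Name of the anchor concept closest in angle to the given point."""
    index = int(round(affect_angle(valence, arousal) / 45.0)) % 8
    return AFFECT_LABELS[index]


# Hypothetical readings from an emotion classifier.
print(nearest_affect(0.8, 0.7))    # pleasant and activated
print(nearest_affect(-0.6, -0.5))  # unpleasant and deactivated
```

Only the angle is used here. The distance from the center could additionally be read as intensity, which is a design choice rather than part of the quoted model.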

Where You Are

Google Advanced Image Search

Design for Meaning-Finding & Visualizing

Technical Framework

MFCC Extractor in MAX/MSP

Patcher by cososc:

For every 10 ms, with a frame size of 25 ms:

  • Do Fast Fourier Transform — FFT (to convert to the frequency domain)
  • Apply Mel scaling (map frequencies onto the roughly logarithmic mel scale to approximate human perception of pitch)
  • Do Discrete Cosine Transform (to get a single real value for each frequency bin)
  • Create a feature vector, which consists of:
    • 12 MFCC features (how much energy is in each frequency bin right now); MFCCs represent the short-term power spectrum of a sound
    • The “total energy” (how loud the sound is right now)
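
The steps above can be sketched outside of MAX/MSP in a few dozen lines of Python. This is an illustration under stated assumptions, not a port of the patcher: it follows the conventional MFCC recipe (Hamming window, power spectrum, triangular mel filterbank, log of the band energies, DCT-II), uses a naive DFT so it has no dependencies, and assumes 26 mel filters. All function names and the synthetic 440 Hz input are hypothetical.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-windowed power spectrum via a naive DFT (an FFT would be faster)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    nyquist = sample_rate / 2
    top = hz_to_mel(nyquist)
    hz_points = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int(round(hz / nyquist * (num_bins - 1))) for hz in hz_points]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * num_bins
        for b in range(left, center):
            row[b] = (b - left) / (center - left)
        for b in range(center, right + 1):
            row[b] = (right - b) / (right - center) if right > center else 1.0
        bank.append(row)
    return bank


def mfcc_frame(frame, sample_rate, num_filters=26, num_coeffs=12):
    """Return 12 MFCCs followed by the log total energy of one frame."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spectrum), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10)
                    for row in bank]
    # DCT-II; coefficient 0 is skipped because energy is appended separately.
    coeffs = [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                  for j, e in enumerate(log_energies))
              for c in range(1, num_coeffs + 1)]
    total_energy = math.log(sum(x * x for x in frame) + 1e-10)
    return coeffs + [total_energy]


def mfcc(signal, sample_rate, frame_ms=25, hop_ms=10):
    """One 13-value feature vector every hop_ms, from frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [mfcc_frame(signal[s:s + frame_len], sample_rate)
            for s in range(0, len(signal) - frame_len + 1, hop_len)]


# Hypothetical input: 0.1 s of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
features = mfcc(signal, sample_rate)
print(len(features), len(features[0]))
```

Each row of `features` is one 13-value vector (12 coefficients plus energy), which is the shape a downstream learner would receive for every 10 ms step.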

Bark Extractor in MAX/MSP

Form Manipulation with Voice

Technical Pipeline

Bark Extractor → OSC → Wekinator → OSC → Unity
UnityOSC connection (github), Noise Shader (Char Stiles), (Animator + Scaling, Scott Leinweiber)
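
The first OSC hop in this pipeline can be illustrated with a short sketch that packs a feature vector into an OSC message and sends it over UDP. It assumes Wekinator is running with what are, to my knowledge, its default input settings (port 6448, address `/wek/inputs`); check the values shown in your own Wekinator window. The message is encoded by hand following the OSC 1.0 layout so the example needs no third-party library. The function names and the three feature values are hypothetical.

```python
import socket
import struct


def osc_string(text):
    """Encode an OSC string: ASCII, null-terminated, padded to a multiple of 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_message(address, values):
    """Build an OSC message whose arguments are all 32-bit big-endian floats."""
    type_tags = "," + "f" * len(values)
    payload = b"".join(struct.pack(">f", v) for v in values)
    return osc_string(address) + osc_string(type_tags) + payload


def send_features(values, host="127.0.0.1", port=6448, address="/wek/inputs"):
    """Send one feature vector over UDP. Host, port and address are assumed defaults."""
    packet = osc_message(address, values)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet


# Hypothetical three-value feature vector (e.g. three Bark band energies).
packet = osc_message("/wek/inputs", [0.12, 0.5, 0.33])
print(len(packet))
# send_features([0.12, 0.5, 0.33]) would transmit the same bytes to Wekinator.
```

The second hop works the same way in reverse: Wekinator emits its trained outputs as another OSC message, which the Unity side listens for and maps onto shader and animation parameters.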

Thank you :)

Research + References

On Creativity + Intelligences

Work by Harold Cohen

“Driving the Creative Machine”, Harold Cohen, Orcas Center, Crossroads Lecture Series September 2010

“One thing we know about creativity is that it typically occurs when people who have mastered two or more quite different fields use the framework in one to think afresh about the other. Intuitively, you know this is true. Leonardo da Vinci was a great artist, scientist and inventor, and each specialty nourished the other. He was a great lateral thinker. But if you spend your whole life in one silo, you will never have either the knowledge or mental agility to do the synthesis, connect the dots, which is usually where the next great breakthrough is found.” — Marc Tucker, president of the National Center on Education and the Economy

“TOWARDS A DIAPER-FREE AUTONOMY”, Harold Cohen, Museum of Contemporary Art, San Diego, August 4th 2007

“The Art of Self-Assembly: the Self-Assembly of Art”, Harold Cohen, Dagstuhl Seminar on Computational Creativity, July 2009

Work by Chris Schmandt

Early project by Chris Schmandt (1979); MIT Media Lab Speech Interface group video collection


AI Learning Techniques

Documentation + Dev Tools + Educational