Whether for virtual assistants, music discovery, or surveillance, academic and industrial researchers are collecting vast libraries of ordinary sounds for computational analysis and classification. Many of them are sharing these strange, intimate yet anonymous “audio datasets” with anyone who cares to download.
Nu Inteleg invites you to submit your voice for processing — not for the scrutiny of the global research community — but to be analyzed and then reconstituted from the nearest matching sound fragments culled from those same audio datasets. Speak into the mic, and hear a cacophony of found sounds stitched together in a rough likeness of your own utterance. When statistical models synthesize speech from nonverbal sound, is the result intelligible?
“Nu înțeleg” means “I don’t understand” in Romanian.
— wall text for Nu Inteleg at Gray Area 2018
I decided to use my time in the Gray Area incubator to push myself into that domain of software that feeds on large databases of media and then produces alien, inexplicable, sometimes delightful output — a domain otherwise known as “AI”. The result was a three-channel noise installation, trained on twelve hours of found sounds, that compelled visitors to hoot and grunt and sing and make all kinds of utterances into a microphone to hear what sonic apparitions would be summoned.
To anyone working in generative art in the last few years, the latest resurgence of AI has presented a whole other tantalizingly unpredictable medium. Whereas “generative” art typically implies rule-based systems with some random numeric input, AI takes as its input a “corpus” of digital media: a body of texts, images, sounds, etc., typically within some kind of meaningful category or genre, to be analyzed and decomposed into ephemera, from which new works may be reconstituted. So while generative art is often concerned with mathematical and geometric form (and my generative sequencer Patter certainly is), AI art instead focuses on the distortion and manipulation of existing media artifacts.
Indeed, since the first psychedelic #DeepDream images revealed what kinds of monstrosities may emerge from a corpus of millions of animal faces, more recent AI work has tended in a more cultural direction — often centering on the tropes of genre. Botnik makes absurdist satire of anything from screenplays to dating profiles. Foreshadowing current AI applications, scholar Franco Moretti used computational statistics to study Victorian literature. Artist Mario Klingemann produced otherworldly images from historic European portraiture, and Robbie Barrat is currently working with images of runway fashion. These ghostly, macabre figures were an inspiration as I began exploring what hauntological forms might emerge within the audio domain.
All this to say that a critical early step for the AI artist-researcher is foraging for corpuses — a sometimes shady enterprise. I soon found myself clicking through exhaustive lists of audio datasets, published by and for researchers working in areas such as speech synthesis or “event detection.” (That often meant crime detection; one dataset advertised “6000 events” of “glass breaking, gun shots, and screams.”) The sourcing of the samples can be uncomfortable as well. I ended up using a folder of recordings titled “children_playing”, collected from Freesound (and therefore uploaded under a Creative Commons license), but — whose children exactly? Google has published a huge dataset called AudioSet containing metadata about (but not including) the audio from millions of YouTube clips. I clicked through to one and saw a teenager speaking to the camera—presumably unaware of their status as a research subject—and I decided against browsing further. (For the 2018 installation, I used UrbanSound, Philharmonia Orchestra, and ESC-50.)
At the same time, I began exploring software techniques to produce the kinds of sonic chimeras that I had envisioned from these datasets. I naturally thought of the visual systems above, and found examples in artist Kyle McDonald’s survey of neural net music as well as some nightmarish club music. Google Magenta has harnessed neural nets to create fascinating blends of sounds, even making a neural synthesizer available to producers. But based on the experiences they shared (and my ten-week timeline), I decided to begin with a more rudimentary paradigm: nearest-neighbor search. In essence, given a vast collection of items, each described by a list of numbers, this technique locates the items “most similar” to any query item, that is, those whose numbers lie closest to the query’s. I intended to use it on sound fragments, but it’s also a basic principle behind the recommendation engines of Netflix, Amazon, Spotify, and countless others. As a visual aid and inspiration, I thought of the clouds of related sounds in Infinite Drum Machine, a cacophonous experiment by Kyle McDonald, Manny Tan, and Yotam Mann (later VR-ified by Cabbibo as Audio Forager). If tiny sound fragments can conceivably be mapped in a virtual space, then surely they could be looked up and stitched together to reconstitute a human voice!
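To make the idea of nearest-neighbor search concrete, here is a minimal sketch in plain Python. It is a brute-force, exact search over a handful of made-up three-number “fingerprints”, not the approximate indexing that a dedicated library would use on a large corpus. The function names and the data are illustrative assumptions, not code from the installation.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, items, k=1):
    """Return the indices of the k items closest to the query, nearest first."""
    ranked = sorted(range(len(items)), key=lambda i: euclidean(query, items[i]))
    return ranked[:k]

# Hypothetical fingerprints for four sound fragments (made-up numbers).
corpus = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
    [0.9, 1.2, 1.1],
]
query = [1.0, 1.0, 1.2]

result = nearest_neighbors(query, corpus, k=2)
print(result)  # indices of the two most similar fragments
```

Comparing the query against every item works for a toy corpus but gets slow with hundreds of thousands of fragments, which is where an indexing library earns its keep.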
Of course, it was only after building and demoing a working prototype that I learned that similar research had been underway for some time at the French audio research center IRCAM. The specific technique had been named “audio mosaicing” and even implemented in Max, my real-time environment of choice. More broadly, the technique of stitching shorter sounds together to produce a longer one is called concatenative synthesis (see also IRCAM’s exhaustive historical survey), and has typically been applied to speech synthesis (even as another Google-produced neural-net system called WaveNet is giving it a run for its money). Additionally, there have been some other delightful pieces that cleverly sequence audio samples based on user input: Zach Lieberman’s Play the World, which uses live radio broadcasts from around the globe as the sole audio input, and the early-web experiment Let them sing it for you.
The system that I ended up building rested upon two excellent open-source libraries: Essentia, an audio analysis library from the Music Technology Group at the Universitat Pompeu Fabra (also maintainers of Freesound), and Annoy, a nearest-neighbor indexing library created at Spotify. Since neither had explicit Max implementations that I could use, I built and open-sourced one for each: [essentia~] and [annoy]. Enjoy! Together, these let me do my offline pre-processing in Python and real-time playback in Max. I then incorporated the Max externals into Max For Live devices, with tunable knobs for parameters like search tolerance and audio envelope. The primary audio analysis data that I extracted was the MFCCs (mel-frequency cepstral coefficients), a compact numerical summary of a fragment’s spectral shape that may, for all intents and purposes, be thought of as the “fingerprint” of a sound fragment.
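The overall flow (fingerprint every corpus fragment ahead of time, then fingerprint each incoming fragment and play back its closest match) can be sketched in a few lines of plain Python. This is a toy under loud assumptions. The fingerprint here is a stand-in pair of RMS energy and zero-crossing rate rather than MFCCs. The search is brute force rather than an Annoy index. The four-sample fragment size and all names are hypothetical.

```python
import math

FRAME = 4  # samples per fragment; a toy value, real fragments would be far longer

def frames(signal, size=FRAME):
    """Split a signal into consecutive non-overlapping fragments, dropping any remainder."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def fingerprint(fragment):
    """Stand-in feature vector (RMS energy, zero-crossing rate). These are NOT MFCCs."""
    rms = math.sqrt(sum(s * s for s in fragment) / len(fragment))
    crossings = sum(1 for a, b in zip(fragment, fragment[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (len(fragment) - 1))

def distance(a, b):
    """Euclidean distance between two fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mosaic(target, corpus):
    """Rebuild the target from whichever corpus fragments have the closest fingerprints."""
    corpus_frags = frames(corpus)
    index = [fingerprint(f) for f in corpus_frags]  # offline analysis step
    out = []
    for frag in frames(target):                     # per-fragment lookup step
        q = fingerprint(frag)
        best = min(range(len(index)), key=lambda i: distance(q, index[i]))
        out.extend(corpus_frags[best])              # stitch the match onto the output
    return out

# Toy corpus: a quiet fragment, a loud oscillating one, and a loud steady one.
corpus = [0.1, 0.1, 0.1, 0.1,  1.0, -1.0, 1.0, -1.0,  0.9, 0.9, 0.9, 0.9]
# Toy "voice": a loud oscillating fragment followed by a quiet one.
voice = [0.8, -0.9, 0.8, -0.9,  0.05, 0.1, 0.05, 0.1]

output = mosaic(voice, corpus)
print(output)
```

The output contains only corpus samples, arranged to follow the rough contour of the input. A real-time version would also need to crossfade or envelope each fragment to avoid clicks at the seams.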
My most fantastical version of this project involves dozens of speakers, all playing back fragments from different datasets, creating ephemeral constellations of sound shapes, surrounding the visitor… For obvious practical reasons, I began here with just three speakers — projecting onto each a flickering visualization of the sonic fingerprint of the sound fragment being played back. The abstract shapes served a secondary purpose as a cryptic reflection of the visitor’s voice, encouraging them to experiment with all kinds of vocal utterances. Some of the most enjoyable moments occurred after a few minutes of practice—once you learn how to elicit certain responses with precise squawks and pops. My other favorite moments occurred organically by letting the system feed back on itself — softly sputtering sounds based on ambient noise, and then based on its own output.
Today, AI has become practically synonymous with neural nets, and my early foray into computational statistics hardly approaches the sophistication of those systems. But ultimately they too are tools for statistical reasoning: assembled from building blocks of linear algebra, fed on digital media, and put to statistical purposes like classification and clustering. I think that our continued use of Cold War-era sci-fi terms like “artificial intelligence” and “machine learning”, with their uncanny suggestion of machines springing to life, takes away slightly from the deeply human pursuit of searching for meaning by deconstructing and recombining cultural artifacts, whether with scissors and glue, or a rack of liquid-cooled GPUs. Not to mention that these vast media datasets, corpuses, and archives (not quite describable as “big data”, which often refers to numeric data and metadata) are themselves huge achievements; their collection and labeling are the massive invisible iceberg of archival labor beneath the small but highly visible portion of brilliant researchers. (The Anatomy of an AI System, published as I write this, plumbs these depths.) AI may be an imperfect term, but it recognizably stands for a methodology: distilling meaning from deep wells of humanistic artifacts with immense and unruly computer processing power.