Beatboxing with your past self

Robin Jungers
Published in Qosmo Lab
May 28, 2021

Since 2017, the Neural Beatbox project has been an iterative attempt at building an AI-powered musical experimentation tool. First exhibited as part of the AI: More than Human event at the Barbican Centre, London, it was later developed as a web application in order to become a true virtual space reachable by more people. Originally built as a solo experience, it was then turned into a collaborative experiment as the pandemic hit us in 2020.

This year, for NVIDIA’s GPU Technology Conference, Qosmo was invited to showcase its process as a featured artist collective. For the occasion, we decided to develop a new version of Neural Beatbox.

Original concept

The first version of Neural Beatbox, exhibited at the Barbican Centre, London

Across its multiple versions, the core operation of Neural Beatbox has remained constant: users record themselves in front of their webcam for a few seconds, making sounds with their mouth or any object in their vicinity. The audio is analyzed and split into individual hits, which are then assigned to actual drum types using machine learning. For instance, your sneeze might be recognized as a hi-hat sound, your hand hitting the table as a low tom, and so on. In addition, a second AI model generates a beat: together, the two can be played back to the user in rhythm, with the original video synchronized to the audio (check out the project’s page for more detailed explanations).

A screenshot of an original Neural Beatbox session with the Qosmo team
The previous multi-user version of Neural Beatbox

The goal of this new version was not just to generate automatic rhythms, but to curate a guided experience that would project a person into a dialog with their past selves — a place where they could experiment iteratively, with each new step being added to the previous one. This version would be simpler, both in terms of ergonomics and technology, but allow complexity to emerge in the process.

One important part of this was the visual identity of the app. As it was meant to be simpler, both in appearance and in usage, we wanted to focus on building a customized canvas at the center — a single point of interest.

Entangling timelines

In previous versions of Neural Beatbox, the camera playback had a straightforward role: to provide visual feedback, and even to catch possible accidents in a comedic manner. This version had to go a bit further: while the previous goals are still relevant, the fact that everything happens within a single central view implies a form of layering of several images from different timelines. While previous versions made use of the spatial dimensions of the screen, this one, with a single view, focuses on the time dimension.

For this reason, I started working on glitch effects, in particular the effect commonly called datamoshing. Compressed video streams are usually optimized so that frames are encoded as a combination of actual color data and pixel motion, in order to reduce their overall size — datamoshing happens when some of this data gets dropped, and an area of pixels that is meant to move never receives its new color. These glitches essentially happen because each new frame depends on the previous one; when that chain breaks, frames start accumulating residues of each other — to our eye, the consequence is that the video ends up layering itself over time.

There is a nice parallel here with the goals discussed above. By deliberately producing this layering of pixels, we allow a single canvas to entangle different timelines: in our case, the present self (a real-time mirror of the webcam) and the past selves (the musical playback of past recordings).

An animated capture of real datamoshing glitches on the new app
An example of actual glitches

It so happens that as I was recording some example footage of the website, a real glitch occurred in my webcam stream. Having spent some time trying to reproduce this precise effect, I can appreciate how paradoxically organic it is. The pixels flow like a cloud of particles, which might lack the quick reactivity that we’re looking for here, but it sets a great visual reference.

Artificial glitches in the browser

While true datamoshing can be applied and rendered offline, streaming a webcam in real time in the browser hardly allows such a direct approach. Moreover, while true datamoshing looks “authentic”, relying on a pure glitch effect to give our app the right look and pace is not very reliable, and it lacks the controls needed to fine-tune the results — specifically, since the playback of the drums may be quick at times, being able to tone the effect down is important in order to preserve some level of readability. For these reasons, we worked on an artificial effect that mimics that visual style, but relies on well-defined parameters.

The rendering is done with WebGL (using Three.js, for convenience) in order to implement hardware-accelerated effects within fragment shaders. The webcam image lives in a single texture, onto which either the current webcam image or, when a drum is played, the corresponding recorded frame is rendered. There are actually two of those textures: commonly named ping-pong buffers, this technique involves switching back and forth between the first and the second, in order to keep a view of the present and one of the past at all times, without having to copy any data around.
The visual effects are then composed of two passes: the first one is where the movement is computed and turned into glitches, while the second one implements post-processing adjustments.
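To make the buffer swap more concrete, here is a minimal TypeScript sketch of how such a setup could be wired with Three.js. The names (pastFrame, currentFrame, glitchVertexShader, glitchFragmentShader) and the default uniform values are illustrative assumptions, not the actual Neural Beatbox code:

import * as THREE from "three";

// Shaders are assumed to be defined elsewhere; the fragment shader is the
// first GLSL pass shown below.
declare const glitchVertexShader: string;   // simple pass-through vertex shader
declare const glitchFragmentShader: string; // the first GLSL pass, shown below

const width = 1280, height = 720;
let pastFrame = new THREE.WebGLRenderTarget(width, height);
let currentFrame = new THREE.WebGLRenderTarget(width, height);

// Full-screen quads: one copies the incoming image into a buffer,
// the other applies the glitch shader that reads both buffers.
const quadCamera = new THREE.OrthographicCamera(-1, 1, 1, -1, 0, 1);
const copyMaterial = new THREE.MeshBasicMaterial();
const copyScene = new THREE.Scene().add(
  new THREE.Mesh(new THREE.PlaneGeometry(2, 2), copyMaterial)
);
const glitchMaterial = new THREE.ShaderMaterial({
  uniforms: {
    uTex0: { value: null },                    // past webcam frame
    uTex1: { value: null },                    // current webcam frame
    uSize: { value: new THREE.Vector2(width, height) },
    uBlockiness: { value: 0.05 },              // hypothetical default
    uPersistence: { value: 0.8 },              // hypothetical default
  },
  vertexShader: glitchVertexShader,
  fragmentShader: glitchFragmentShader,
});
const glitchScene = new THREE.Scene().add(
  new THREE.Mesh(new THREE.PlaneGeometry(2, 2), glitchMaterial)
);

function renderFrame(renderer: THREE.WebGLRenderer, sourceTexture: THREE.Texture) {
  // 1. Write the incoming image (live webcam, or a recorded frame when a
  //    drum is played) into the "current" buffer.
  if (copyMaterial.map !== sourceTexture) {
    copyMaterial.map = sourceTexture;
    copyMaterial.needsUpdate = true; // the map changed (possibly from null)
  }
  renderer.setRenderTarget(currentFrame);
  renderer.render(copyScene, quadCamera);

  // 2. Glitch pass: read both buffers and draw to the canvas
  //    (the second, post-processing pass is omitted here).
  glitchMaterial.uniforms.uTex0.value = pastFrame.texture;
  glitchMaterial.uniforms.uTex1.value = currentFrame.texture;
  renderer.setRenderTarget(null);
  renderer.render(glitchScene, quadCamera);

  // 3. Swap the two buffers: the current frame becomes next frame's past.
  [pastFrame, currentFrame] = [currentFrame, pastFrame];
}

Because only the references are swapped, no pixel data is ever copied between the two targets.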

Making two frames mash together

Getting into the more technical details, the first GLSL pass essentially goes like this:

uniform sampler2D uTex0; // Past webcam texture
uniform sampler2D uTex1; // Current webcam texture
uniform vec2 uSize; // The frame's dimensions
uniform float uBlockiness; // How big the glitched areas look
uniform float uPersistence; // How long the glitched areas remain
varying vec2 uv; // UV coordinates

vec2 opticalFlow( vec2 uv, sampler2D tex0, sampler2D tex1 )
{ ... }

float random( vec2 uv )
{ ... }

void main()
{
    float flowLevel = length( opticalFlow( uv, uTex0, uTex1 ) );

    // UV coordinates are downsampled
    vec2 decimFactor = uBlockiness * uSize;
    vec2 decimUv = floor( uv * decimFactor ) / decimFactor;

    // Assign a random value to the current downsampled UV
    float randValue = random( decimUv );
    float randThreshold = mix( 1.0, flowLevel, uPersistence );

    // The more movement, the more likely a pixel is discarded
    if ( randValue > randThreshold )
    {
        discard;
    }

    // Otherwise an actual color is rendered
    gl_FragColor = texture2D( uTex0, uv );
}

It is worth noting that we don’t clear the canvas between frames, in order to preserve the previous frame in place of each discarded fragment. Also, while not shown here, the optical flow and random functions are well-documented algorithms in GLSL, with various implementations that can easily be found online.
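In Three.js terms, keeping previous frames around roughly comes down to telling the renderer not to clear anything on its own. The snippet below is a hedged sketch under that assumption, not the exact project configuration:

import * as THREE from "three";

// If the accumulation happens directly on the visible canvas, the drawing
// buffer also needs to survive browser compositing, hence preserveDrawingBuffer.
// (When accumulating into a render target instead, that flag isn't needed.)
const renderer = new THREE.WebGLRenderer({
  preserveDrawingBuffer: true, // keep the canvas content across frames
});
renderer.autoClear = false;    // don't clear color/depth/stencil before each render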

A typical result of the implementation

The core idea is this: we compute the quantity of movement between the current and the previous frame, and discard areas of pixels randomly, but with a likelihood proportional to the measured movement.
The snippet above exposes two parameters, called blockiness and persistence: with these two alone, the glitch effect can generate very different looks, from broken and geometrical to smooth and granular.
The first parameter defines a size for the blocks of pixels: rather than processing single pixels, this lets us work on bigger chunks of the image.
The second one changes the likelihood for a pixel to be drawn: when that likelihood is low, many fragments get discarded and previous frames tend to stack up on each other over time; when it is high, each new frame clears the previous one.

Two artificially glitched frames, with two different values of “blockiness”

Adding some style

The second pass of the rendering adds some simple corrections in order to achieve a cleaner, more stylized look. The raw output of a webcam tends to be quite boring: it’s neither terrible nor very sharp, and the color temperature is often bland by design. Adding some post-processing conveys an intention.

Here, three adjustments are made (a code sketch follows the list):
- A distortion of the pixels, to mimic a wide-angle camera, similar to a mild fisheye lens.
- A vignetting of the brightness, to emphasize the center of the image. This is often visible with vintage camera lenses.
- A slight shift of the blues towards the outside of the circle, to produce some natural-looking color imperfections. Again, this is usually an undesired effect visible with older lenses.
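To make these adjustments a bit more tangible, here is a hedged sketch of what such a second pass could look like, as a fragment shader wrapped in a Three.js ShaderMaterial. The uniform names (uDistortion, uVignette, uBlueShift) and their values are hypothetical approximations of the three effects described above, not the actual Neural Beatbox shader:

import * as THREE from "three";

declare const postVertexShader: string; // same kind of pass-through vertex shader

// Illustrative second pass: barrel distortion, vignetting, and a radial
// shift of the blue channel. Not the actual project shader.
const postFragmentShader = /* glsl */ `
  uniform sampler2D uTex;    // output of the first (glitch) pass
  uniform float uDistortion; // strength of the wide-angle-like warp
  uniform float uVignette;   // strength of the darkening towards the edges
  uniform float uBlueShift;  // how far the blue channel drifts outwards
  varying vec2 vUv;          // UV coordinates from the vertex shader

  void main()
  {
    // Barrel distortion: sample further from the center as we get
    // closer to the edges, mimicking a mild fisheye lens.
    vec2 centered = vUv - 0.5;
    float r2 = dot( centered, centered );
    vec2 uv = 0.5 + centered * ( 1.0 + uDistortion * r2 );

    // Blue shift: sample the blue channel slightly closer to the center,
    // so that on screen the blues appear pushed towards the outside.
    vec2 blueUv = 0.5 + ( uv - 0.5 ) * ( 1.0 - uBlueShift );
    vec3 color = vec3( texture2D( uTex, uv ).rg, texture2D( uTex, blueUv ).b );

    // Vignette: darken progressively towards the corners.
    color *= 1.0 - uVignette * r2;

    gl_FragColor = vec4( color, 1.0 );
  }
`;

const postMaterial = new THREE.ShaderMaterial({
  uniforms: {
    uTex: { value: null },        // set to the first pass's output texture
    uDistortion: { value: 0.15 }, // hypothetical values
    uVignette: { value: 0.5 },
    uBlueShift: { value: 0.02 },
  },
  vertexShader: postVertexShader,
  fragmentShader: postFragmentShader,
});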

So that’s it. The bottom line is probably this: when building an experimental experience, design choices are there to elevate the story that we’re trying to tell. Finding the technical solutions to fulfill those choices, within constraints, is where the fun happens. Whether these choices were the right ones in the case of Neural Beatbox is up to you to decide.

Try the app yourself: https://solo.neuralbeatbox.net/

Credits
- Front-end: Robin Jungers
- Back-end: Bogdan Teleaga
- Machine learning: Christopher Mitcheltree

We are Qosmo!

Thank you for reading till the end. We are Qosmo, Inc., a collective of Artists, Designers, Engineers and Researchers. Read other Medium articles from Qosmo Lab, and if you are intrigued to find out more, get in touch with us from here. We are actively searching for new members, collaborators and clients who are passionate about pushing the boundaries of AI and creativity. Ciao!
