Visualizing sound with AI
The original idea came to me when I got bored with live shows. I do love music, but playing visuals live stopped feeling meaningful to me; it felt more like going through the motions than entertaining people. So I quit visual live shows for a while. I have always been more interested in pure algorithmic output than in standing there turning knobs and tweaking parameters. You will often find me in the crowd during my own live set, just enjoying the flow of the machine.
So it's no surprise that AI-generated visuals caught my attention. In recent years a lot of interesting work and research has happened in that field, but I hadn't seen many adversarial networks that cross between fields. I decided to do a small experiment of my own: generate images from an audio input, without human intervention in the process.
Pix2Pix, why?
For my experiment I chose Pix2Pix. Why? It is an interesting way to teach a machine to find the relationship between two sets of images and convert one into the other: style transfer, day-to-night conversion, drawing to photo. It also trains quite fast and gives decent results.
You can find more info and examples of Pix2Pix in this guide.
I think some of you have already seen this video, which is a good example:
So, the decision was made.
For this project I chose this PyTorch-based repository (after a few tests I found it faster and more stable than the TensorFlow version). You can follow the instructions in the repo to train your own model or use an already pretrained one.
But how? I want to transform an audio signal into an image, while to train and run a Pix2Pix model I need an image as both input and output. Let's see how we can trick it.
Training the model
We live in the information age, where most information is represented in digital form. And as long as it is digital, its type no longer matters: video, audio, text, everything becomes just a sequence of numbers.
When I started my media/digital experiments this was one of the biggest insights, and it still helps me with projects, from the conceptual side to optimization.
It gets even more interesting when you apply it to digitally generated images.
So, there we go. With TouchDesigner it is quite a simple task: convert the sound into a visual waveform representation, or run some audio analysis to extract information and plot it.
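The project does this step in TouchDesigner, but for readers who don't use TD, here is a minimal Python sketch of the same idea, assuming a mono audio buffer in the range [-1, 1]; the function name and sizes are just illustrative (numpy + Pillow):

```python
# Turn a chunk of audio samples into a simple waveform image that Pix2Pix can
# take as input: a white line on a black square.
import numpy as np
from PIL import Image

def waveform_image(samples, size=512):
    """Plot one audio chunk as a white waveform on a black square image."""
    canvas = np.zeros((size, size), dtype=np.uint8)
    # resample the chunk to one amplitude value per pixel column
    idx = np.linspace(0, len(samples) - 1, size).astype(int)
    chunk = samples[idx]
    # map amplitude [-1, 1] to a vertical pixel position (top = +1)
    ys = ((1.0 - (chunk + 1.0) / 2.0) * (size - 1)).astype(int)
    canvas[ys, np.arange(size)] = 255
    return Image.fromarray(canvas, mode="L")

# example: 1/60 s of a 440 Hz sine at 44.1 kHz, one image per video frame
t = np.arange(44100 // 60) / 44100.0
waveform_image(np.sin(2 * np.pi * 440 * t)).save("frame_0001.png")
```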
The next step was to prepare a dataset for training. YouTube was a good source: I downloaded around one hour of trailers and clips from sci-fi movies and games, together with their soundtracks, so I had the audio and the matching video frames. A quick TouchDesigner network converted them into the required format, which is described in the documentation: a combined image with A and B side by side in one file, at the same resolution.
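For reference, this is roughly what producing that combined format looks like if you do it outside TouchDesigner; a small sketch assuming Pillow, with placeholder file paths, that pastes the audio image (A) and the video frame (B) side by side:

```python
# Build one "combined" training image: input A on the left, target B on the
# right, both at the same resolution, saved as a single file.
from PIL import Image

def combine_pair(audio_img_path, frame_img_path, out_path, size=512):
    a = Image.open(audio_img_path).convert("RGB").resize((size, size))
    b = Image.open(frame_img_path).convert("RGB").resize((size, size))
    pair = Image.new("RGB", (size * 2, size))   # A | B in one canvas
    pair.paste(a, (0, 0))
    pair.paste(b, (size, 0))
    pair.save(out_path)

combine_pair("audio/frame_0001.png", "video/frame_0001.png",
             "datasets/scifi/train/0001.png")
```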
After a bit of time I had around 1k images for training a new sci-fi model.
Experiments
The most interesting part was experimenting to find the right audio representation and seeing how it affects the output of the trained model.
My first attempt used the current waveform with a background made from the same waveform (white for higher frequencies, black for lower; straightforward, right?). My intuition was that two noisy lines alone might not be enough for the model to generalize, since it is quite pixel-based, so I tried to pack as much information as possible into every pixel of the input image.
It turned out the background didn't help. I also saw a problem with the inconsistency of the output: each frame was generated independently of the previous one, which doesn't look very enjoyable. I wanted smoother transitions between scenes/frames.
For the second attempt I changed the audio visualization and created a new dataset. I shrank each 44100-sample block of the audio signal down to a few hundred values and used analysis plus a timeline to keep a 10-second visualization window. Now each frame depends not only on the current sound but also on the previous 10 seconds.
It worked. Now the image transforms much more smoothly from one keyframe to another. It is not just a sequence of independent images anymore; they have flow.
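A rough sketch of that rolling-window idea, assuming one second of audio arrives at a time; the buffer sizes are illustrative, not the exact values used in the project:

```python
# Each incoming second of audio (44100 samples) is shrunk to a few hundred
# values and pushed into a buffer that always holds the last ~10 seconds, so
# every generated frame depends on recent history, not just the current instant.
import numpy as np

SAMPLES_PER_SEC = 44100
VALUES_PER_SEC = 200          # how much each second is shrunk to (illustrative)
WINDOW_SECONDS = 10

window = np.zeros(VALUES_PER_SEC * WINDOW_SECONDS, dtype=np.float32)

def push_second(samples, window):
    """Shrink one second of audio and append it to the rolling window."""
    hop = SAMPLES_PER_SEC // VALUES_PER_SEC                   # samples per bin
    trimmed = np.abs(samples[:hop * VALUES_PER_SEC])
    bins = trimmed.reshape(VALUES_PER_SEC, hop).mean(axis=1)  # mean amplitude per bin
    return np.concatenate([window[VALUES_PER_SEC:], bins])    # drop the oldest second

# the updated window is then plotted to an image (same trick as the waveform
# above) and used as the model input for the current frame
```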
Experiment 3 was a test with less raw audio data and more analyzed features (low/mid/high frequency bands, kick detection and so on). Since most of this information is already calculated over a time window, we get smoothing by default, but we also get a stronger reaction to events (like a kick).
And the last one: the analyzed info plotted over time.
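For illustration, here is a sketch of the kind of analysis experiments 3 and 4 rely on: band energies from an FFT plus a very naive kick detector. The band edges and threshold are arbitrary placeholder values, not the ones from the project:

```python
import numpy as np

def analyze_block(samples, rate=44100):
    """Low / mid / high band energy of one audio block."""
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    low = spectrum[freqs < 200].mean()
    mid = spectrum[(freqs >= 200) & (freqs < 2000)].mean()
    high = spectrum[freqs >= 2000].mean()
    return low, mid, high

def kick_detected(low, prev_low, threshold=1.5):
    """Very naive kick detector: a sudden jump in low-band energy."""
    return low > threshold * (prev_low + 1e-6)

# experiment 3 draws these values directly into the input image; experiment 4
# plots their history over time, like the rolling waveform window above
```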
You can see that the four approaches give different results. I like all of them and can now pick one depending on my mood. We also figured out that sound-to-video generation works quite well even with image-to-image models; the 'digital world theory' proved itself :)
Each of these models was trained for one night on Colab, which allowed me to run a quick experiment every day. There were a few more experiments along the way. A dataset of around 1k images also seems to be enough for this more abstract type of imagery. I was surprised how well the model figured out a way to extract information from the input image and turn it into a totally different-looking result. The image resolution was 512x512 px.
Performance
The next stage of this small experiment was a live performance where all four models played together.
You can watch the full version of the stream here:
The setup itself was also quite interesting: Eugenio was playing the sound from Italy while I was in Shanghai, and we streamed it online. Even a not-so-great internet connection was enough. For the sound we used Audiomovers: Eugenio sent the audio through an Ableton plugin, I received it in the browser on my side and used a wired loop on my sound card to route the output back into an input of the TouchDesigner network.
All four models ran simultaneously on my computer with a 1070 Ti quite smoothly. That also surprised me, and it is one of the reasons I prefer PyTorch over TensorFlow: I often have problems running even a single TF inference model, because if something already occupies the graphics card it can't start CUDA. With PyTorch there was no conflict between the different instances.
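As a sketch, running several generators in one process in PyTorch boils down to something like this; the checkpoint names and the load_generator helper are placeholders, not the actual files from the project:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_generator(path):
    """Placeholder: load one exported pix2pix generator (e.g. a TorchScript file)."""
    net = torch.jit.load(path, map_location=device)
    net.eval()
    return net

models = [load_generator(p) for p in
          ["scifi_wave.pt", "scifi_window.pt", "scifi_bands.pt", "scifi_bands_time.pt"]]

@torch.no_grad()
def run_all(frame):
    """frame: 1x3x512x512 tensor in [-1, 1]; returns one output per model."""
    frame = frame.to(device)
    return [net(frame).cpu() for net in models]
```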
It is also quite interesting that I can feed a model a different input than the one it was trained on, or even mix inputs, and still get an interesting and consistent result. That allows for nicer effects and transitions during the live show.
Tutorial
And here is the last part: a short tutorial on how you can set up your own Pix2Pix model with TouchDesigner through Spout.
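The video and the asset below contain the actual working setup; as a rough outline only, the Python side of such a bridge is a loop like this, where the Spout receive/send calls and the generator object are placeholders:

```python
# Receive a frame from TouchDesigner, turn it into a normalized tensor, run the
# generator, convert back to 8-bit pixels and send the result back out.
import numpy as np
import torch

def to_tensor(frame_rgb):                 # HxWx3 uint8 -> 1x3xHxW in [-1, 1]
    x = torch.from_numpy(frame_rgb).float().permute(2, 0, 1).unsqueeze(0)
    return x / 127.5 - 1.0

def to_frame(tensor):                     # 1x3xHxW in [-1, 1] -> HxWx3 uint8
    x = (tensor.squeeze(0).permute(1, 2, 0) + 1.0) * 127.5
    return x.clamp(0, 255).byte().cpu().numpy()

while True:
    frame = receive_spout_frame()         # placeholder: grab the TouchDesigner output
    with torch.no_grad():
        out = generator(to_tensor(frame).to(device))
    send_spout_frame(to_frame(out))       # placeholder: send back to TouchDesigner
```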
Asset with the code and sci-fi model
Additional
And a few more experiments with Pix2Pix.
Real-time face to Expressionism/Fauvism transformation. Again, not a perfect result; getting a better-looking one would require more training time and, I think, some adjustments to the original network.
Thank you for your time. Clap if this was interesting and useful for you :)
Links:
- More tutorials about TD and different ML models at derivative.ca
- processing course channel
- TD tutorials
Share it if it was useful for you :)