Over the last year, I’ve dedicated my time to exploring AI’s applied role in music software by prototyping a few applications. I’m writing this post for two reasons. First, I simply want to document what I’ve built and share what I’ve learned. Second, it brings some closure to this chapter, because I have proven to myself beyond any doubt that AI is ready to be integrated into practical audio tools for creatives, and I want to be part of that. However, I’ve also concluded that I can’t do it alone. Fully realizing the potential of AI audio tools will require a team with a shared vision, and that is exactly what comes next for me.
Who is this for?
The target reader is someone who is interested in or is actively experimenting with creating generative music systems. I will not dive into details about music theory or algorithmic implementations. This is a broad outline of a system I’ve created and some conclusions I’ve arrived at from doing so.
What have I built?
Broadly speaking, the application is an attempt to imagine an audio sample library with infinite, procedurally generated content. The scope covered both a server backend (how the samples are generated) and a UI (how the samples are presented).
I am certainly a big fan of products like Splice & Loopcloud. But I believe their original core business model, the sample library that made them successful, could only exist during a transitional period in which quality samples were a finite commodity with monetary value. This is in contrast to new AI systems, where the system itself has monetary value and the output (samples) is merely a byproduct. We’ve already seen this play out with image diffusion services, e.g. https://ai.mid-journey.org/ and https://openai.com/product/dall-e-2 I am not claiming to have produced samples of the quality of these services, but I am exploring the idea.
Why would anyone want this?
This is a good question! I think the answer has more to do with possible future use cases than with the current music landscape. The current model is all based on the idea that someone or some entity arranges a linear composition and ‘owns’ it. Under this model, the possibilities for creative consumption of audio samples are severely limited. In fact, the use cases are predefined in EULAs (end-user license agreements). For example, many sample EULAs limit the use of a sample to one single song or video. That may not accommodate generative systems.
Let’s take the concept of non-linear music. Think VR/video games, or maybe a hypothetical system that reacts to a consumer feedback loop. As of today, these systems are limited to predetermined decision trees. That doesn’t accommodate a more dynamic possibility where sounds and samples are queried and consumed at runtime. At current unit costs, it’s not practical to have an unrestrained system pulling dynamic content. AI/procedural systems could very well be an answer to this use case.
AI music has many applications. Which did you explore?
The obvious applications of AI in music can be broadly categorized as analysis/classification, component extraction (e.g. stem splitting), generation/composition, and timbre/rhythm transfer. For me, as a producer, the most immediately interesting application is audio style transfer: the dream of transferring attributes of one audio signal, such as timbre or rhythm, onto another. For example, taking a country guitar sample and imposing the timbre of an orchestra onto it. Voice transfer is an example of this technique, and it is currently trending due to controversial compositions using the vocal signature of the well-known rapper Drake: https://www.rollingstone.com/music/music-news/viral-drake-and-the-weeknd-collaboration-is-completely-ai-generated-1234716154/
I believe audio style transfer is one of the aspects of AI music that is very much compatible with artistic expression and creativity as it currently exists. To understand music is to understand that it is often a form of cultural and personal expression; it’s an outlet for human creativity and is constantly evolving. Having a creator choose a source, and then something to transfer onto it, is very much a creative process. So that’s what I’ve spent most of my time thinking about. It’s not that I’m uninterested in other aspects of AI music, such as generative composition; I just believe that removing humans from an inherently human experience is a non-goal.
Whatever, just explain the damn system!
Ok, ok, here is a high-level overview of how the loops on https://signalsandsorcery.ai/ were created. I want to make it clear that this was in no way an attempt to create an all-inclusive music system. If you have ever put thought into rule-based systems, you know that they are limited by default. In my case, the system is clearly limited to Western harmonic ‘rules’, and in fact a very basic subset of those rules. The awesome thing, however, is that I haven’t even come close to hitting the ceiling of what is possible even within those limits. Limitations are not necessarily a bad thing, though. Anyone with experience in software startups or prototyping apps knows that keeping the scope restrained and realistic is key to producing results!
Here is an infographic I created to highlight some key concepts in the system.
The process to create the loops is as follows:
MIDI Template Generation
The first part of my process is to generate MIDI templates. You might ask: why even create harmonic templates if you can apply AI style transfer to existing audio clips? My answer: yes, that is a fine approach, but in my case I want to deterministically arrange the loops during presentation, which requires accurate labels and metadata. Since I created the templates, I know their harmonic composition with 100% accuracy. I have a strong working knowledge of tonal harmony from classical training, so creating basic functional rule sets that follow diatonic chord progressions is pretty easy (excluding the concept of melody). Around 2015, I created a dataset of chord-to-melody relationships which I used to train recurrent neural networks: https://github.com/shiehn/chord-melody-dataset This is an obvious area of improvement for the system. I would strongly prefer a data-driven template generator, but I accepted a crude rule-based system in order to focus on other aspects of the project.
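The rule set itself can be surprisingly small. Here is a minimal, hypothetical sketch of what a rule-based diatonic progression generator might look like; the transition table and function names are my own illustration, not the project’s actual code:

```python
import random

# A tiny, hypothetical rule set: each diatonic seventh chord (here in
# A natural minor / C major) maps to the chords it is allowed to move to.
TRANSITIONS = {
    "Am7":   ["Dm7", "Em7", "Fmaj7", "Am7"],
    "Dm7":   ["Em7", "G7", "Am7"],
    "Em7":   ["Am7", "Fmaj7"],
    "Fmaj7": ["G7", "Dm7"],
    "G7":    ["Cmaj7", "Am7"],
    "Cmaj7": ["Fmaj7", "Am7", "Dm7"],
}

def generate_progression(start="Am7", bars=4, seed=None):
    """Random-walk the transition table to produce one bar of chord per step."""
    rng = random.Random(seed)
    progression = [start]
    for _ in range(bars - 1):
        progression.append(rng.choice(TRANSITIONS[progression[-1]]))
    return progression
```

A real template generator would also emit the MIDI notes for each chord; this sketch only produces the harmonic labels, which are exactly what becomes the loop’s metadata later in the pipeline.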
Original Timbre Generation
This part was a big challenge. Current DAWs (digital audio workstations) rarely expose the ability to bounce/render audio programmatically. I’m not sure if this is intentional because the companies don’t want people using their software in a truly autonomous fashion, or if it’s simply not a commonly requested feature. (DAW creators: if you’re reading, please support this use case!) Luckily, one major DAW, Reaper, more or less supports this: it exposes much of its functionality through an API, so I was able to create scripts to import, annotate, and bounce audio. Once the MIDI is imported, I create patches for liberally licensed plugins such as Vital Synth. This part of the system is the reason I will continue to use this project; I very much enjoy the sound design and synth patch creation.
Export And Upload
The export is pretty self-explanatory. I created scripts that bounce both the audio and MIDI for each track and create a JSON metadata file containing info about harmony, key, BPM, etc. The bundle is then compressed and uploaded to the server, where it is extracted so the system can operate on the files.
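As a rough illustration, the bundling step might look something like this; the metadata schema and helper name here are hypothetical, not the project’s actual format:

```python
import json
import zipfile
from pathlib import Path

def build_bundle(loop_dir, out_path, *, key, bpm, chords):
    """Zip the bounced audio/MIDI files in loop_dir together with a
    metadata.json describing the loop (hypothetical schema)."""
    files = sorted(p.name for p in Path(loop_dir).iterdir())
    metadata = {"key": key, "bpm": bpm, "chords": chords, "files": files}
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in Path(loop_dir).iterdir():
            zf.write(p, arcname=p.name)  # flatten paths inside the archive
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
    return metadata
```

The important design point is that the harmonic labels travel with the audio, so the server never has to infer key, BPM, or chords after upload.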
Once the bundle of audio and MIDI loops has been uploaded and some basic file conversions have happened, the interesting stuff can start. In my mind, what I’ve created is ultimately a music production amplifier: when a single loop is uploaded, tens or hundreds of variations are created and cached, depending on its harmony and the other loops in the system at the time. Each loop goes through four distinct permutation processes: harmonic chopping/splicing, pitch shifting, time stretching, and finally AI style transfer, which creates even more variations. Let’s break down each one:
Harmonic Chopping/Splicing
Let’s take a 4-bar loop with the chords “Am7-Am7-Dm7-Cmaj7”. From a harmonic perspective, we can be reasonably confident that this loop could be chopped and reassembled in many ways. A few examples: “Am7-Am7-Am7-Am7”, “Am7-Am7-Dm7-Dm7”, “Cmaj7-Cmaj7-Cmaj7-Cmaj7”. It’s worth noting that factors beyond vertical harmony, such as rhythm and effects, also play into chopping and splicing. Effects such as delay cause ‘bleed’ into subsequent chord sections, resulting in an undesired harmonic makeup. This is where a data pruning mechanism comes into play: an admin of the system can exclude data that sounds bad!
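In code, the chopping/splicing idea reduces to indexing bars by their chord label and reassembling them against a target progression. A naive sketch (names and data shapes are hypothetical):

```python
def resplice(bars, target_chords):
    """Reassemble a loop's bars to match a target chord progression.

    `bars` is a list of (chord_label, audio_segment) pairs in original
    order; any bar whose chord matches can stand in for a target slot."""
    by_chord = {}
    for chord, segment in bars:
        by_chord.setdefault(chord, []).append(segment)
    out = []
    for chord in target_chords:
        if chord not in by_chord:
            raise ValueError(f"no source bar available for {chord}")
        out.append(by_chord[chord][0])  # naive: always reuse the first match
    return out
```

A real implementation would also have to consider rhythm and effect bleed, which is exactly why the pruning step described above exists.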
Pitch Shifting
This automated process is similar to harmonic splicing, but instead of rearranging the bars, individual bars are pitch shifted by a few semitones with only minor degradation of quality. For example, the chord progression “Am7-Am7-Dm7-Cmaj7” can easily become “Am7-Dm7-Em7-Cmaj7” by shifting the second bar up five semitones and the third bar up two.
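Because every bar’s chord label is known, the new labels after a per-bar shift can be computed symbolically alongside the audio processing. A simplified sketch (sharps-only spelling, hypothetical names):

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose_chord(chord, semitones):
    """Transpose a chord symbol like 'Am7' or 'Cmaj7' by n semitones.
    Uses sharps-only spelling, a simplification for illustration."""
    root = chord[:2] if len(chord) > 1 and chord[1] == "#" else chord[:1]
    quality = chord[len(root):]
    idx = (NOTES.index(root) + semitones) % 12
    return NOTES[idx] + quality
```

Applying per-bar shifts of 0, +5, +2, and 0 semitones to “Am7-Am7-Dm7-Cmaj7” reproduces the “Am7-Dm7-Em7-Cmaj7” example from above.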
Time Stretching
This is the most self-explanatory: a loop at, say, 100 BPM can be time-stretched by ±5 BPM with only minor degradation of quality.
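Enumerating the stretch targets is trivial; the only real decision is the ±5 BPM bound, beyond which stretching artifacts tend to become audible. A sketch (the helper name is hypothetical):

```python
def bpm_variants(bpm, spread=5):
    """Return (target_bpm, stretch_ratio) pairs for each BPM within
    +/- spread of the original. Larger spreads degrade audio quality."""
    return [(target, target / bpm)
            for target in range(bpm - spread, bpm + spread + 1)
            if target != bpm]
```

Each ratio would then be handed to whatever time-stretching tool the pipeline uses to render the actual audio variant.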
AI generation & style transfer
This is not a tutorial post, so I’m not going to go into the specifics of each framework. But I have trained models which I use either to generate loops targeting a fixed harmonic structure/BPM, or to apply style transfer onto existing generated loops, creating variations. I have not found any model that generates high-quality output 100% of the time. So in my system, all generated loops are unapproved by default and are then approved via a custom interface that lets me review the output. GPUs (needed to perform inference) are very expensive; if that weren’t the case, I would have liked to expose the results of the style transfer directly to users and let them decide what is good or not. Here is a list of frameworks that I have explored and recommend you explore as well:
Audio Language Model by Haohe Liu
- This is my favorite! It does style transfer very well and maintains pitch. I was using ChatGPT to generate thousands of audio descriptions, such as “Dark children’s choir”, and running them against loops in my system.
HarmonyAI — DanceDiffusion
- This one has the best community (by far). There are basically step-by-step instructions on how to get started. However, I did struggle to output results with the exact pitches and lengths I needed.
Python Audio Diffusion (by Tetico)
- This is a nice library. You essentially take your dataset and generate Mel spectrogram images; the model then learns from the image representations (rather than from a raw audio signal).
Python Audio Diffusion (by Falvio)
- This is the exact same idea as the other audio diffusion framework, though it’s more configurable. To be completely honest, I only started exploring this one after noticing it’s widely used in some interesting projects, so I can’t personally vouch for it, but I wanted to list it because of its reputation.
For completeness, the Signals & Sorcery system also has a special admin evaluation UI that enables an admin to flag any loops that sound ‘bad’. I also set it up so that a percentage of the ‘unapproved’ generated loops can be previewed in context and approved or rejected via the interface.
I hope this gives some transparency into what I was able to accomplish with the Signals & Sorcery project. Moving forward, I will be redirecting my effort to a new, related opportunity, but I will keep this app live. It will serve as a free sample resource for creators and a sound design outlet for myself. As mentioned earlier, there is really no upper bound on how sophisticated a pipeline like this could become. I would encourage anyone wanting to create a similar system to go for it. I achieved decent results alone; given a team effort and more attention and care at each step, I believe amazing things can be accomplished!