AI in Music Production (Part 3)

Hans-Martin ("HM") Will
8 min read · Jun 21, 2023


This is the third part of the mini-series on the state of the art, applications, and impact of AI in music production. In this installment, we will look at how AI has already found, and continues to find, its way into the musician’s creative process and production workflow. Given the pace of advancement we are seeing in generative AI, we will also look at some of the things we can expect, or would like to see, in the not-too-distant future.

No AI or robots in this picture. We are interested in human ingenuity at the center of the creative process.

In case you have not seen them, the previous posts in this series were:

  1. Part I: Origins of AI, generative AI, and examples of using generative AI in music. You can find Part 1 here.
  2. Part II: Key approaches of the most recent AI technology wave, and specific additional techniques relevant to musical applications. You can find Part 2 here.

As always, I am looking forward to any comments, discussion points or additional pointers you may have. Have fun!

AI in Music Production Today

AI has clearly found its way into the contemporary music production setup. In Part 1 of this series, I mentioned the recent online survey by Ditto, a music distribution service, of more than 1,200 of their users. Nearly 60% of the participants, who are independent musicians, indicated that they are already using AI within their music projects. 77% of the respondents said they would use AI tools to create their album artwork, 66% for mixing and mastering their music, and 62% for music production. Mixing and mastering, in particular, is an area where AI and ML have already found considerable adoption.

Mixing and Mastering

A prominent example of production tools for mixing and mastering that incorporate AI is the iZotope Production Suite (with iZotope Ozone for mastering, iZotope Neutron for mixing, etc.). These tools have been trained on broad collections of audio material to match specific input signals against proposed processing recommendations, which are then applied through what is essentially a traditional signal processing chain. The value proposition of these tools is to provide cost-effective access to high-quality (“professional”) mixing and mastering to musicians who may not have the required training, both in listening skills and in command of the tools. For professionals, the AI support may streamline the initial setup and configuration. iZotope recently became part of Native Instruments, which only underlines the broad adoption and relevance of these tools.
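To make the general idea concrete, here is a minimal sketch, not how Ozone or Neutron work internally, of “matching an input signal against a reference”: compute the track’s long-term average spectrum, compare it against a target curve, and derive rough per-band gain suggestions for an EQ. The band layout and the reference curve are illustrative assumptions.

```python
import numpy as np
import librosa

def suggest_eq_gains(path, target_db_per_band, n_bands=8):
    """Compare a track's long-term average spectrum to a reference curve
    and return rough per-band gain adjustments in dB (illustration only)."""
    y, sr = librosa.load(path, sr=44100, mono=True)
    # Magnitude spectrogram, averaged over time -> long-term average spectrum
    spectrum = np.abs(librosa.stft(y, n_fft=4096)).mean(axis=1)
    # Collapse the spectrum into a few coarse, log-spaced bands
    edges = np.geomspace(1, len(spectrum) - 1, n_bands + 1).astype(int)
    band_db = np.array([
        20 * np.log10(spectrum[a:b].mean() + 1e-9)
        for a, b in zip(edges[:-1], edges[1:])
    ])
    band_db -= band_db.max()  # normalize: loudest band sits at 0 dB
    # Suggested EQ move: difference between the reference curve and this track
    return target_db_per_band - band_db

# Hypothetical reference curve, e.g. averaged from commercially mastered tracks
target = np.array([0.0, -3, -6, -9, -12, -15, -18, -21])
print(suggest_eq_gains("my_mix.wav", target))
```

Real products add far more sophistication (genre detection, dynamics, stereo image, learned rather than hand-made targets), but the basic pattern of “analyze, compare to learned references, propose settings for a conventional processing chain” is the same.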

There is also a growing portfolio of online services tapping into the same space. LANDR, for example, is building out an online music creation and collaboration platform that started out as a mastering service. Dolby, similarly, provides a mastering service that third parties can integrate into their own applications. For example, SoundCloud, an online music and audio platform, integrates the Dolby mastering service to get “your tracks release-ready” “for a fraction of the cost”.

We are also seeing an increasing number of products for audio effects processing, such as reverb, that integrate AI. Again, the idea is to help the user determine the best parameter settings based on the profile of the incoming signal. Examples are Accentize Chameleon, Rivium AI, and iZotope Neoverb. We will likely see smart selection of processing parameters finding its way into other audio processing tools as well.

Sound Separation for Sampling and Remixing

While AI for mixing and mastering is mostly about streamlining and ease of use, sound separation is an area that is truly enabled by AI. It is essentially the reverse of mixing and mastering: extracting individual audio signals that have been mixed together with others (say, extracting a vocal line from a finished song) or removing certain processing that has been applied, such as echo or reverberation. Once isolated, those parts can be used to create a remix or to incorporate into another production. Examples of such tools are Audionamix XTRAX STEMS, Hit’n’Mix RipX, lalal.ai, fadr.com, and Splitter.ai, the latter focusing on karaoke. Moises specifically addresses scenarios for practicing an instrument.
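For readers who want to experiment, the same capability is available in open-source form. The sketch below uses Deezer’s Spleeter library (as an illustration, not one of the commercial tools listed above) to split a track into vocal, drum, bass, and “other” stems; the file names and paths are placeholders.

```python
# pip install spleeter
from spleeter.separator import Separator

# Pretrained 4-stem model: vocals, drums, bass, and "other"
separator = Separator('spleeter:4stems')

# Writes vocals.wav, drums.wav, bass.wav and other.wav into output/song/
separator.separate_to_file('song.mp3', 'output/')
```

More recent open models such as Demucs push the quality further, and the commercial tools mentioned above wrap similar separation models in polished editing workflows.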

Of course, because original creators will usually have access to the original audio material, those tools will be most useful for others who do not. In particular, when it comes to incorporating others’ material, there has been a long, ongoing debate about whether sampling is art, theft, or laziness. With ever-improving AI tools for sound separation, this debate will only continue. There are exceptions, such as recovering John Lennon’s voice to posthumously finish a Beatles song, which opens other ethical questions about creative ownership and control past one’s own demise.

Voice Synthesis, Virtual Singers and Digital Replica

Synthesis of spoken language, as used, for example, by digital assistants such as Alexa, Siri, or the Google Assistant, has been transformed by AI techniques to become ever more realistic and human-like. The same underlying techniques have since evolved to the point where the human singing voice can be simulated with a high degree of realism and expressiveness. The progress can be appreciated by comparing, for example, a Vocaloid reel from three years ago with an emotional cover song recreated using contemporary approaches and AI models.

This latter example was created using software called Synthesizer V by Dreamtonics. Synthesizer V received a lot of publicity in 2022 when the SOLARIA voice was released, which is based on a real artist, Emma Rowley. The creation of the SOLARIA voice was initially crowdfunded via Indiegogo. More recently, we have seen quite a debate around the distinction between the creative performance of a human artist and the utilization of a model. In this case, even though the voice is based on a human’s recorded voice, music created using the AI model should credit SOLARIA rather than refer to Ms. Rowley. At the same time, the artist was actively involved in the creation of the model.

This is rather different from building a model replicating an artist’s voice without their permission. For example, a track called “Heart on My Sleeve” that used synthetic replicas of the voices of Drake and The Weeknd recently went viral on music streaming platforms before being taken down. Even when there is consent, there are still questions about the appropriate licensing and revenue model when AI is trained on data derived from and traceable back to individuals. Grimes’ experiment of inviting others to build AI imitations of her voice in exchange for a 50% revenue share is just one example of possible approaches.

At a panel discussion on AI and music at the FYI campus in Hollywood, organized as part of the 2023 LA Tech Week, will.i.am, both event host via FYI and panel member, predicted that in the future artists would want to invest in building their own models, encompassing all of their creative input and potential. He has been a proponent of this vision for a long time, even depicting it in a video as far back as 2010, prior to the big breakthroughs of deep learning. Through FYI, will.i.am is personally engaged in bringing AI to the creative process.

Generative AI for the Creative Process

Generative AI, which creates text, images, and audio from textual descriptions, so-called prompts, has taken the world by storm and is very much at the core of the current AI hype. I covered examples and some of the core technology concepts in Part 1 and Part 2 of this series. In a nutshell, one can say that AI technology provides us with three important ingredients:

  1. Learning the rules, including exceptions, that make up a structure. This is what language models do for spoken or written language, and it is how Music Language Models can capture specific aspects of cultures, musical genres, or composers and artists.
  2. Reducing complex spaces to a set of essential dimensions and parameters that capture the essence and allow us to traverse and recreate the original universe (see the sketch after this list).
  3. Driving the quality of generated output to levels that are indistinguishable from “true” examples.
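As an illustration of the second ingredient, the sketch below shows a minimal autoencoder in PyTorch that compresses feature frames into a handful of latent dimensions and reconstructs them. The frame size and latent size are arbitrary choices for illustration, not those of any particular music model.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Compress 512-dimensional feature frames into 8 latent dimensions
    and reconstruct them (sizes chosen arbitrarily for illustration)."""
    def __init__(self, n_features=512, n_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, n_latent),           # the "essential dimensions"
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(),
            nn.Linear(128, n_features),         # recreate the original space
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = TinyAutoencoder()
frames = torch.randn(32, 512)                   # stand-in for audio feature frames
reconstruction, latent = model(frames)
loss = nn.functional.mse_loss(reconstruction, frames)
print(latent.shape, loss.item())                # torch.Size([32, 8]) ...
```

Once such a model is trained, moving smoothly through the small latent space corresponds to traversing the space of material it has learned, which is exactly what makes generation and interpolation possible.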

In addition, the current wave of large language models allows us to interact intuitively with AI systems by simply saying what we mean, and to iterate and clarify within a natural dialogue. Not only that, but with multi-modal models we can even provide non-verbal information, such as audio snippets or relevant visual cues. With that in mind, here are a few tools that we can expect to see in the near future:

  1. Finding and creating new sounds based on prompts and examples: Rather than having to sift through large collections of recorded sounds (“samples”) or endless lists of parameter settings for synthesizers (“patches”), we can envision an iterative approach to sound design (see the sketch after this list). We start by describing the kind of sound we are looking for, and we iteratively refine it by simply saying what we mean. We could even start out by providing an acoustic example: “Give me a drum sound like the one here, but…”. Or we could ask the system to find suitable sounds that complement what we already have, given knowledge of certain genres and styles, for example.
  2. Interactively building rhythms, harmonies and melodies: Rather than relying on rule-based systems or collections (like currently available virtual drummers or virtual keyboardists), we can again envision an interactive and iterative process to create song elements. Because a Music Language Model understands the “internal language” and natural structure of a possibly very large body of works, we can look forward to fluid exploration of the creative space. Again, we will be able to adjust elements of the generated material through additional verbal descriptions, or we can provide melodic and other tonal snippets that will be seamlessly integrated into the composition process.
  3. Building song structures and arrangements, including transitions and the overall emotional journey: These are the same concepts as above, just applied at the highest level of abstraction. In addition, we can envision providing references in more abstract form. For example, the overall structure and emotional journey may be derived from a video clip. Or lyrics and music can influence each other and converge towards a common outcome. We may provide a few examples and ask the AI to create outcomes in a certain abstract “style” that is embodied in the provided examples.
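A first taste of the prompt-driven workflow sketched in the first item is already possible with open models. The example below uses Meta’s open-source MusicGen model via the audiocraft library to generate a short clip from a text description; the model variant, prompt, and duration are illustrative choices, and this is a research model rather than a production sound-design tool.

```python
# pip install audiocraft
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a small pretrained text-to-music model (illustrative choice)
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # seconds

# "Saying what we mean": describe the sound, then refine the prompt and re-run
prompts = ['a punchy, dry 808-style drum groove at 90 bpm, no melody']
wav = model.generate(prompts)  # tensor of shape [batch, channels, samples]

for idx, clip in enumerate(wav):
    audio_write(f'idea_{idx}', clip.cpu(), model.sample_rate, strategy="loudness")
```

The iterative loop described above is essentially this pattern plus a conversation: listen, adjust the description (or supply an audio reference), and generate again.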

Overall, the preceding examples are not too different from how artists with different skill sets would collaborate with each other. As such, generative AI may find a place as a natural, additional partner in the overall creative process among a group of individuals.

Summary

The continued evolution of generative AI and the emergence of multi-modal systems combining picture, language, and sound will surely bring us more exciting capabilities in the near future. Beyond the technological evolution, it will be even more interesting to see how the artist community is going to adopt and integrate those possibilities into their creative process. As we have seen with the impact of sampling or digital audio processing, we may see the emergence of completely new musical genres. At the same time, there is going to be continued discussion about the nature of artistic creation and expression, and how to attribute and credit contributions by previous generations of artists that have influenced or been incorporated into newly created works.

This post concludes the mini-series providing an overview of AI in music production. I will continue with further articles focusing on specific topics within this broader space, touching on creative aspects, ethical considerations, new products and technologies, and the impact on affected industries in general. Stay tuned!


Hans-Martin ("HM") Will

Technologist & Product Builder - AI, Data & Spatial Computing