
A seamless visual progression from vintage analog music technology to futuristic AI-generated soundscapes (image generated with Flux)

The State of Generative Music

Understanding the Impacts of New Generative Music Technologies on Music, Artists, and Rights Holders

Christopher Landschoot
Published in Whitebalance
13 min read · Nov 27, 2024

This year has been a whirlwind for generative music and audio AI. The idea of generative music has been around for decades, but only within the past year have we seen it take off from basic MIDI and sample creation to generating full-length, coherent songs with lyrics. Little more than a year ago, I was generating unique, garbled samples with Dance Diffusion (from Harmonai) to use in a song that I was working on. Admittedly, it was very exciting at the time to be able to generate new sounds from scratch. I was so interested in this technology that I put together an open-source project, tiny-audio-diffusion, geared towards hardware-constrained individuals (like myself) who wanted to dip their toes into training generative audio models. While this felt cutting-edge at the time (my models could only generate 1.5-second one-shot drum samples), here we are just over a year later and commercial generative music models can produce high-quality, full-length songs with lyrics from just a text description.

Any transformative new technology, however, comes with growing pains. These new capabilities have raised alarm bells for many musicians, producers, rights holders, and creators, who fear that generative music models could be an existential threat to their art and livelihoods. This has generated (pun intended) a rift between AI supporters and detractors, leaving many musicians and rights holders reluctant to allow any AI model, generative or not, to use their data during training. This is an unfortunate binarization of an issue that actually has many shades of gray, as AI use cases can range from predatory to neutral to significantly beneficial for creators and rights holders.

To fully comprehend the contentious state of generative AI and explore a potential path forward, it is important to understand the arguments from both sides along with the context in which these disagreements emerged. A shared foundation is essential for fostering more nuanced and productive conversations about AI and its role in music and art.

The Difference Between “Generative” and “Non-Generative” Models and Why It Matters

As I briefly alluded to earlier, the recent boom in generative music technology has complicated public perception of the use of AI in audio. Additionally, many companies have begun using labels like Artificial Intelligence (AI), Machine Learning (ML), and Generative AI (gen AI) interchangeably in an attempt to capitalize on the clout of these buzzwords. Unfortunately, this only leads to more confusion, even though each of these terms has a specific definition. For example, Artificial Intelligence (AI) is an umbrella category encompassing technologies designed to simulate human intelligence, while Machine Learning (ML) is a subset of AI that enables systems to learn and improve from data patterns without explicit programming. Most relevantly, much of the tension surrounding AI in music arises from misunderstandings about the differences between generative and non-generative AI.

Generative AI can be defined as a machine learning system that has the ability to create something new, i.e. to generate something. Some examples of generative AI in audio are below (a brief code sketch follows the list):

  • Song or Audio Creation — e.g. Generating a song from a text prompt
  • Symbolic Melody Creation — e.g. Creating a melody in MIDI format, standard notation, or tablature
  • Voice Creation or Cloning — e.g. Synthesizing a new voice or mimicking a person’s voice
  • Song Extension and Inpainting — e.g. Completing a partially written song or filling gaps in an audio recording
  • Music-to-Music Generation and Style Transfer — e.g. Transforming a rock song into a jazz arrangement
  • Audio Upsampling — e.g. Enhancing the quality of low-resolution or compressed audio files
  • Text-to-Speech — e.g. Converting written text into spoken audio with natural-sounding voices
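
To make the first category concrete, here is a minimal sketch of text-to-audio generation using the open-source AudioLDM diffusion model via Hugging Face's diffusers library. This illustrates the general workflow only, not the method behind any commercial system; the model checkpoint and 16 kHz output rate are assumptions based on the public release.

```python
# Minimal text-to-audio sketch with the open AudioLDM diffusion model.
# Assumes the public "cvssp/audioldm-s-full-v2" checkpoint and a GPU;
# commercial music generators are far larger and work differently.
import torch
from diffusers import AudioLDMPipeline
from scipy.io import wavfile

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

# Refine random noise into a waveform, guided by a text prompt.
audio = pipe(
    "a mellow lo-fi hip hop beat with vinyl crackle",
    num_inference_steps=50,
    audio_length_in_s=5.0,
).audios[0]

wavfile.write("generated.wav", rate=16000, data=audio)  # AudioLDM outputs 16 kHz
```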

Non-generative AI can be defined as a machine learning system that can analyze, categorize, or enhance audio without creating something new. Some examples of non-generative audio AI are below (again, a short sketch follows the list):

  • Music Recommendation — e.g. Spotify recommending songs based on listening history
  • Audio Source Separation — e.g. Isolating vocals or instruments from a track
  • Music Transcription — e.g. Converting music into MIDI or sheet music
  • Audio Classification — e.g. Classifying musical genres or different types of sounds
  • Automatic Mixing and Mastering — e.g. Adjusting audio levels and effects to optimize sound quality for production
  • Speech-to-Text — e.g. Converting spoken audio into text
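
As a contrast, here is a toy version of the audio classification bullet above: extract summary features from each clip with librosa and fit an off-the-shelf classifier. The file names and labels are placeholders; a real system would use far more data and a stronger model, but nothing new is generated, only labels.

```python
# Toy audio classifier: MFCC summary features + a random forest.
# File paths and labels below are placeholders for illustration.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def clip_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    # Summarize over time so every clip yields a fixed-length vector.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

train_paths = ["rock_01.wav", "rock_02.wav", "jazz_01.wav", "jazz_02.wav"]
train_labels = ["rock", "rock", "jazz", "jazz"]

X = np.stack([clip_features(p) for p in train_paths])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, train_labels)

print(clf.predict([clip_features("unknown_clip.wav")]))  # e.g. ['jazz']
```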

Because there is a clear distinction between these two categories, they should be addressed as separate entities when discussing music and AI in both general and legal contexts. Many non-generative (and a few generative) machine learning models, such as music recommendation systems, have been around for many years, and the legality and morality of their uses were historically not a primary focus. Most artists felt that those use cases generally benefited them and did not pose an existential threat to their work. It is only the recent generative music developments that have initiated a global rethinking of data rights across the entire gamut of AI in audio.

While it is important to comprehend this gen vs. non-gen distinction, it alone does not offer enough depth to make a case for or against generative audio models, which is where the controversy is currently swirling. It still leaves open the question of whether these models are genuinely creating something new and original. To make informed judgments, it is imperative to understand the fundamental principles of how generative music models actually work.

How Do Generative Audio Models Work?

Many of the state-of-the-art music generation models leverage a process known as diffusion, made famous by the image generation model Stable Diffusion in the summer of 2022. While not all generative models work in this exact manner, the core concepts are transferable. This piece does not cover the technical details of diffusion models, but if you are interested in the nuts and bolts, see this article.

Image generated with Stable Diffusion (original public release)

In the context of image and audio generation, diffusion is the process of refining noise into something recognizable. This can be thought of as taking static from an old TV and slowly rearranging the pixels into an image.

Diffusion Process for Image Generation (image generated with Stable Diffusion)

The same concept can be applied to audio, but instead of resolving into an image, the model refines the noise into a waveform (in this case, music). To teach a model this capability, it is shown millions of hours of music, repeating the noise-to-music process for each sample. Over time, the model becomes adept at transforming noise into music.

Diffusion Process for Audio Generation
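
For readers who want a slightly more concrete picture, below is a heavily simplified sketch of the DDPM-style training objective that underlies many diffusion models: corrupt a clean waveform with noise at a random level, then train a network to predict that noise. The `model` here is a placeholder; production systems add text conditioning, latent representations, and much larger networks.

```python
# Heavily simplified DDPM-style training step for raw 1-D audio.
# `model` is a placeholder denoising network (e.g. a 1-D U-Net).
import torch
import torch.nn.functional as F

T = 1000                                  # number of noise levels
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, waveforms):      # waveforms: (batch, samples)
    t = torch.randint(0, T, (waveforms.shape[0],))  # random level per clip
    noise = torch.randn_like(waveforms)
    a_bar = alpha_bars[t].unsqueeze(1)
    # Forward process: blend the clean waveform with Gaussian noise.
    noisy = a_bar.sqrt() * waveforms + (1.0 - a_bar).sqrt() * noise
    # The model's entire job is to predict the noise that was added;
    # at inference time it undoes noise step by step to "refine" music.
    return F.mse_loss(model(noisy, t), noise)
```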

The key takeaway from this process is that a generative music model doesn't learn to precisely reconstruct the exact music it has been trained on. The music training data falls within a confined distribution on the scale of all possible sounds, and the model learns to generate within the general range of this musical distribution, not to reproduce any single point of training data (i.e., a particular song). Therefore, these models can generate “new” music, but it will only fall within the distribution on which they have been trained.

Think of it like a child who learns how to build with Legos. If the child is shown the instructions for many Lego starship sets, she will be able to build new creations that are similar to the starship sets she learned from. However, if she was never shown instructions for the Eiffel Tower set, she wouldn't know how to build anything resembling the Eiffel Tower. So while she is able to build “new” creations, they will all be similar to starships. In the same vein, a model trained only on music would not know how to generate the sound of a dog barking.

A young girl building a starship out of Legos (image generated with Flux)

The idea follows that if the scale of the data is large and diverse enough, the model will learn to generalize across all types of music and be able to generate “new” pieces of music within this sufficiently wide distribution. While these explanations are oversimplifications, they illustrate the range of creative capabilities these models possess, which has significant implications for how their outputs should be treated. However, the question of what qualifies as an “original” work is still hotly debated.

The Battle of AI Training Data

This open question leads to the crux of the matter: what considerations should be taken into account regarding rights holders’ data and model training? Having recently attended the ISMIR (International Society for Music Information Retrieval) conference in San Francisco, I was able to observe first-hand the sentiment within the technical community surrounding this issue. Coupled with my experience as a musician and producer, I have seen diverse opinions from all sides.

Two groups arguing: one holds a sign reading “Music For Everyone!!” and the other “Respect Artist Rights!” (image generated with Flux)

The parties involved are largely broken down into two camps:

The first camp’s stance is “Generative models are creating something original so they should be allowed to train on copyrighted material.” This group largely consists of practitioners working at AI generation companies. Their arguments are:

  • Humans listen to and learn from a wide array of music over the course of their lives which influences their musical creations. Models are no different, as they are influenced by music that they’ve trained on, but don’t replicate it.
  • Music creation has always included practices that blur the lines of originality, such as remixing, sampling, and reinterpretation, which are widely accepted in artistic culture. Generative models are seen as a natural extension of these practices.
  • Relying on the U.S. Fair Use doctrine, which is generally more permissive than copyright laws in many other countries, they argue that the outputs of generative models are original and transformative, and therefore do not damage the market for the works on which the models were trained.
  • Generative AI expands creative possibilities by enabling anyone, regardless of skill level, to create music, thus democratizing music production and lowering barriers to entry.
  • These models allow for new forms of artistic collaboration, where humans and AI can co-create works that would not have been possible otherwise.
  • Training on a broad dataset of music ensures that AI models can capture the diversity of musical traditions and genres, fostering innovation rather than limiting it to narrow, predefined styles.

The second camp’s stance is “Generative model outputs will compete with the works they are trained on, therefore consent is required to use any artist’s work in training.” This group consists of the remainder of the technical community as well as most musical artists and producers. Their arguments are:

  • Using copyrighted material without consent is exploitative, as it leverages artists’ work to create potentially competing outputs without compensation or acknowledgment.
  • While the training process may partially mimic how humans are inspired by a wide array of music, a major difference is scale: generative AI systems can produce thousands of songs a minute, whereas human musicians typically release only a handful of songs or albums a year.
  • Generative models can replicate styles, melodies, or even voices, blurring the line between inspiration and outright copying, which could harm the value of the original works.
  • The outputs from generative models may saturate the market with similar-sounding content, reducing the diversity of musical expression and making it harder for original works to stand out.
  • Artists and producers argue for more transparency, requiring AI companies to disclose the specific datasets used for training, and to seek permission before including copyrighted works.
  • The legal frameworks for intellectual property protection are not yet fully equipped to handle AI-generated content, creating uncertainty and potential long-term harm to creators’ livelihoods.

Fair Use

I have found that many of these arguments on the legal side center on the U.S. Fair Use doctrine, which can be understood through four main factors.

  1. The Purpose and Character of the Use: Considers whether the use is for commercial purposes vs. nonprofit or educational purposes. This also takes into account whether the use is transformative, meaning it adds new meaning, purpose, or value to the original work.
  2. The Nature of the Copyrighted Work: Considers the type of work being used. Creative works (e.g., music, film) have more copyright protections than factual works (e.g., data compilations).
  3. The Amount and Substantiality of the Portion Used: Considers both the quantity and quality of the content used. For music, this could be the difference between training on only small chunks of a song vs. the full song.
  4. The Effect of the Use on the Potential Market for or Value of the Work: Considers whether the new use serves as a substitute for the original work, reducing its market value or potential licensing opportunities. In some cases, showing that the use could benefit the original work (e.g., increasing its exposure or audience) might strengthen the fair use argument.

Both sides have legitimate points, but it is important to note that the contention derives from artists feeling exploited by the way generative AI companies have approached this issue without looping them into the conversation. This creator backlash yields unintended downstream effects for the broader music AI community, as the newly adopted default stance for artists is to resist any AI model (generative or not) training on their data without consent and compensation.

This marks a paradigm shift from the previous status quo, in which earlier non-generative machine learning models could freely train on any publicly accessible data. Much like the widely accepted practice of Google crawling web data to consolidate search results, music was once commonly used to train ML recommendation, classification, and other analysis systems without much scrutiny. While Google allows websites to opt out of indexing, most choose to allow it due to its perceived net benefit. Rights holders should always be entitled to decide who uses their content, but it is the reversal of the default position, from allowing open access to restricting it, that is important to highlight. Imagine if every website now automatically opted out of indexing unless it explicitly opted in.

None of this is to argue that artists should not have the right to consent to training on their data, be fairly compensated, and remain informed. But it demonstrates that the current path is pitting AI companies and rights holders against each other rather than positioning them as collaborators. This leaves the responsibility on the technical music AI community to earn back the trust of artists and creators.

This may be why the phrase “We want to empower artists, not replace them” has become somewhat overused and cliché. As the belief that AI companies are working in bad faith has become more common, the phrase has been used to signal to artists that a company wants to work as a collaborator rather than an adversary. I was heartened to see that the majority of the ISMIR community rallied around this sentiment and is actively researching and developing ways to solve some of these outstanding challenges. Additionally, when I have talked with artists and producers one-on-one, they have expressed openness to working with AI in music provided their concerns are taken into consideration. They simply want uses of their work to remain above board and the ability to consent to their data being used to train models.

So while there are no immediate solutions to healing this rift, here is a good place for companies to start:

  • Abide by the 4 Factors of Fair Use — The community and courts still need to determine what constitutes “originality” regarding generative AI. While this is admittedly a very complex subject to quantify, a qualitative answer can derive from a simple moral gut check: “if it feels wrong, it probably is.”
  • Communication — Offer transparency about the data that models are being trained on. Present clarity on the type of model (e.g. generative, non-generative) and its intent (e.g. generation, classification). Foster opportunities for collaboration between rights holders and AI companies.
  • Consent Management — Allow rights holders to opt out of having their data used to train models on a case-by-case basis. Whether the default should be opt-in or opt-out requires further discussion and may depend on the model type and intent (a hypothetical sketch of such a check follows this list).
  • Licensing & Attribution — Work with rights holders to determine an appropriate attribution model for each use case. This could range from one-time licensing for training to revenue-sharing to fractional attribution.
  • Education — Inform artists and rights holders about AI technology so they know what they are signing up for or opting out of.
  • Copyright Tracking — Encourage the development of technical tools that allow artists to track where their work has been used across AI and beyond.
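
To illustrate the consent-management point above, here is a purely hypothetical sketch of a machine-readable opt-in/opt-out check, loosely modeled on the robots.txt convention mentioned earlier. No such standard exists today; the registry format, field names, and policy values are all invented for illustration.

```python
# Hypothetical: a robots.txt-style consent check an AI company could
# run before adding a track to a training set. The registry format,
# field names, and policy values are invented for illustration.
import json

def may_train_on(registry_path: str, model_type: str) -> bool:
    with open(registry_path) as f:
        policy = json.load(f).get("ai_training_policy", {})
    # The default matters: this sketch denies use unless explicitly allowed.
    return policy.get(model_type, policy.get("default", "deny")) == "allow"

# Example entry a rights holder might publish alongside a track:
# {"ai_training_policy": {"non_generative": "allow", "generative": "deny"}}
```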

Of course, each of these points can lead to a bevy of questions and debates. But having a shared knowledge of how these models work, the parties involved, and the state of the industry is a prerequisite to discussing these complex topics in a considerate and nuanced manner. These conversations are essential to solving such pressing issues. No one wants a future devoid of genuine music, art, or creativity, so reducing the emotional responses and finding a common starting place is the best (dare I say only) path forward.

All images, unless otherwise noted, are by the author.

I am an audio machine learning engineer and researcher at Whitebalance as well as a lifelong musician.

Find me on LinkedIn & GitHub and keep up to date with my current work and research here: chrislandschoot.com

Find my music on Spotify, Apple Music, YouTube, SoundCloud, and other streaming platforms as After August.
