What do we (really) want from AI music generation?

Alex Han
8 min read · Feb 5, 2023


Music is far from exempt from the rapidly and continuously evolving field of generative AI research. The MusicLM model from Google Research, revealed last week, represents an exceptional leap forward. Their text-prompt-based generative audio model outperforms even the most successful, cutting-edge models in terms of audio fidelity and adherence to the prompt content. However, as is the case with so many advances in generative AI research, I did not feel particularly inspired, hopeful, or excited upon reading the paper outlining the design and capabilities of the model. At the same time, I also did not feel the existential terror, or sometimes disgust, that I commonly see from artists, musicians, and even researchers when confronted with the latest advancements in generative art. What I did feel was resoundingly lukewarm. The examples of the model’s output are impressive, but more as a novelty — I found myself constantly questioning the purpose, scope, and potential impact of this technology.

The paper, like many in the ML sphere, focuses on the model architecture, the improved performance relative to last year’s models, and areas for improvement. Conspicuously brief, if not absent, were discussions of ethics, use cases, and conceptual motivations. The model clearly outperforms other audio-generative models, and it introduces some significant and novel functionality. But why continue to optimize such a model? What is the end goal? The most pessimistic reading is that, just as we use AI to automate tasks for us and provide solutions or answers, this technology will eventually create music that is indistinguishable from human-made music, and human musicians may be replaced by AI altogether. I doubt that the authors of this paper believe this to be the case, nor would they identify creating this kind of future as their objective. But then what is the point? I certainly don’t know the answer, and I was disappointed that I could not find a single discussion in the paper about what good this technology would do.

The most charitable reading of this technology’s impact is that it could be used either to generate music that serves more of a commercial or purely “functional” (as opposed to aesthetic) purpose, or that it might not be used as music-in-itself but be part of a human-centered creative process where a musician/producer/composer/songwriter might use it to brainstorm ideas, to generate textures to sample and repurpose, or otherwise use the output as a catalyst to further human creation.

The latter seems more valuable to me, although my prediction would be that a capitalistic society would encourage the former. I don’t even think that is necessarily a bad thing — anecdotally speaking, musicians who are making music for commercial/royalty-free use in YouTube intros/outros/transitions, advertisements, and other interstitial or “incidental” purposes, are not doing so with a passionate sense of artistic integrity. I would love to meet a musician who proudly professes their love for and expertise in creating Muzak for telephone “hold” music. Rather, they more likely create this kind of music for the income it brings — which is nothing to sniff at, and would be strongly affected by future AI automation.

Perhaps a world where the musical equivalent of stock photos can be handled by machines would allow musical artists to devote their energy elsewhere. Perhaps we as a society would begin to appreciate the fundamentally human values that go into the creation of music. For my own peace of mind, this is the kind of optimism I choose to adopt.

I am also reminded of generative image models like DALL-E 2 or Stable Diffusion, which present many of the same issues of creative property and potential use cases. Without delving into an in-depth discussion of image models, I will highlight what I think is an excellent use of them: memes and satirical/ironic content. To me, the use of AI as parody or as the butt of a joke is not only funny but also an interesting case. Asking DALL-E to draw a baby panicking about the stock market, the Michelin Man helping a lost Russian child, or even just a hand with five fingers, yields hilarious results whose humor, I would argue, relies on the very fact that it is AI-generated. There is something funny about seeing where AI fails and gets things wrong — it can instantly provide images that would/could/should probably never be drawn or created, but sometimes fails horribly at the simplest of tasks (I said FIVE fingers!). Sometimes the output can be both comedic and philosophical, in a meta- kind of way, like this attempt at a color that does not exist. Maybe there is a future for AI-generated memes in the musical domain. I believe that when we take things less seriously, the usage of this technology sometimes becomes much more interesting and exciting.

So, maybe there is room for some genuinely interesting answers to questions of “What is the point? Why do this at all?” Still, this is far from the end of the discussion of the risks associated with AI in art.

There are a couple of brief moments where the authors allude to ethical considerations. In the “Broader Impact” section that concludes the paper, one sentence is devoted to potential bias and cultural appropriation:

The generated samples will reflect the biases present in the training data, raising the question about appropriateness for music generation for cultures underrepresented in the training data, while at the same time also raising concerns about cultural appropriation.

This is woefully vague and far from comprehensive. Sometimes it feels like ML papers stick in such statements almost as a perfunctory measure, to ensure that they acknowledge the potential for harm without stopping to really examine it and question the impact of their results. Packed into this sentence are a number of valid ideas that I wish were clarified or explored further: what kind of biases could be present in the training data? In what way could these kinds of biases impact different communities? What constitutes “cultural appropriation” in the context of AI-generated art? Who bears responsibility for perpetuating or promoting biases? The authors? The people who train or label the data? The people who implement the model for themselves? The “AI agent” itself?

There is a difference, for instance, between underrepresenting certain kinds of music or certain artists in the training data, and constructing labels and text descriptions of the audio samples that reflect the reductive and problematic language found all over musical discourse. Compare, for instance, the nuanced critical discourse around the many subgenres of rock by institutions like Rolling Stone magazine or the Grammys with the often monolithic treatment of traditionally Black music like jazz — or the even more egregiously euphemistic “urban music”.

The authors continue the “Broader Impact” section with the following statement:

We acknowledge the risk of potential misappropriation of creative content associated to the use-case. In accordance with responsible model development practices, we conducted a thorough study of memorization…

While I am glad that the authors clearly put some effort into quantitatively assessing the potential for this model to directly “steal” material from the training data, this method of evaluation and its implications were also under-examined. The authors, borrowing from some precedent in LLMs, evaluate the degree to which the model’s output matches source material exactly. They also extend this to “approximate matches” (with less strict matching criteria) and conclude that the potential for misuse of original creative content is minimal — exact matches were barely present according to their analysis, and about 1% of examples were “approximate matches”. While 1% sounds small, I would question whether it is small enough to dismiss when projected onto a world where this kind of technology is ubiquitous. My deeper qualm, however, was not with the number itself: I was skeptical that this metric accomplishes what the authors say it accomplishes. The memorization analysis focuses on the semantic modeling stage only — is this sufficient? Is misuse of creative property something that can be measured by analyzing audio waveforms or semantic tokens in a language model? I do not know. But I do think that the concept of originality and ownership of creative content is slippery and cannot be operationalized along individual dimensions, whether that be on a purely audio waveform level, a lyrical content level, a harmonic progression level, a melodic level, an instrumentation level, a sample-use level…
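To make concrete what this kind of exact/approximate matching could look like, here is a minimal sketch of a memorization check over discrete token sequences. This is purely illustrative — the function names, window size, and threshold are my own assumptions, not the authors’ actual methodology or code, which operates on MusicLM’s semantic tokens with its own matching criteria.

```python
def match_fraction(gen, train):
    """Fraction of positions where two equal-length token sequences agree."""
    assert len(gen) == len(train)
    return sum(g == t for g, t in zip(gen, train)) / len(gen)

def classify_match(generated, corpus, window=8, approx_threshold=0.8):
    """Classify a generated token sequence against a training corpus.

    Slides a fixed-size window over each training sequence and records
    the best agreement with the first `window` tokens of the generation.
    Thresholds are illustrative, not the paper's.
    """
    gen_window = generated[:window]
    best = 0.0
    for seq in corpus:
        for i in range(len(seq) - window + 1):
            best = max(best, match_fraction(gen_window, seq[i:i + window]))
    if best == 1.0:
        return "exact"
    if best >= approx_threshold:
        return "approximate"
    return "none"
```

Even this toy version hints at the conceptual problem: the verdict depends entirely on the token granularity, the window size, and the similarity threshold chosen — none of which obviously map onto what a musician would call plagiarism.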

This problem is not unique to AI-generated music. Every year we hear about legal battles around copyright, where songwriters and artists claim theft of their creative and intellectual property, to varying degrees of success and merit. While this could be another essay entirely, suffice it to say that the U.S. legal system doesn’t have firm quantifiable answers about what constitutes creative property, and the heuristics it does have are often dubious to professional musicians and songwriters. Some musical features are shared verbatim across hundreds of songs — original works of music might have identical chord progressions, drum grooves, or melodies and still clearly retain artistic individuality and integrity. I’ve seen legal debates citing things like key signature, tempo, or harmonic progression as evidence of creative theft, something that is (in my opinion) irrelevant in most cases.

In contrast, I suspect that genuine plagiarism of musical material could take place on a more holistic, conceptual level, such that individual elements (chords, melody, lyrics) might not be lifted verbatim but nevertheless simply “rephrased”, perhaps with varying degrees of similarity across multiple compositional dimensions. Again, this problem is hardly unique to AI-generated music, and the line between this sort of abstract interpretation of plagiarism and more acceptable forms of creative borrowing (Remixes? Covers? Transparent interpretation/inspiration?) is blurry. However, there is an important distinction with AI-generated music: there is barely any precedent, if any, or legal infrastructure to govern the use or misuse of creative property where questions of ML training data sources or the originality of generated output are concerned. I applaud the authors for expressing caution and advocating restraint as they conclude their paper:

We strongly emphasize the need for more future work in tackling these risks associated to music generation — we have no plans to release models at this point.

Keeping up with the ever-accelerating trajectory of generative AI music always stirs up a range of emotions for me. I desperately want to feel inspired and excited for the future of music. The sheer power and potential of generative AI is too much to ignore, and it would be naïve to proclaim that we should collectively shove it back into Pandora’s box and avoid wrestling with the new set of ethical and aesthetic questions it poses for the arts. But I cannot help but feel that progress in the field is rapidly outpacing the discussions of intent, purpose, and impact. I am almost certain that generative AI will move increasingly toward the center of artistic discourse in the coming years, and our legal, cultural, and artistic attitudes towards it will need to be re-examined and discussed openly and often.

In the face of all this uncertainty and brimming possibility (both good and bad), I feel a strange sense of calm. As someone who is passionate about both creating music and about cutting-edge technology, this discussion means a lot to me. I have adopted a sort of radical optimism about the future of AI in music — I strongly believe that AI will never replace human musicians or make human-created music obsolete. The reasons we love music are inseparable from the process, intent, and depth of expression that come from human hearts and minds. If we accept this truth, and the fact that this technology will continue to evolve and exist, we may end up with a new set of tools and sounds that will engender novel music with truly human artistic integrity. It may even end up making us appreciate music for the humanity that is at its essence.
