A.I. in music: from advertising soundtracks to innovations in style

Emmanuel Deruty
Aug 19, 2022 · 11 min read


The 2012 Deep Learning revolution opened new perspectives for A.I. music. Still, in 2022, we have yet to hear truly innovative music involving A.I., and A.I. music remains far from the mainstream market: it is still a niche. Or is it? A 2019–2022 collaboration between Hyper Music and the Sony CSL music team shows that A.I. music technologies can be helpful in domains as different as (1) music for mainstream advertising and (2) innovations in composition and production style. Are A.I. technologies finally ready to join synthesizers and DAWs as truly productive ways to make music?

“The Most Wanted” Azzaro commercial — perhaps the first commercial ever involving A.I.-generated sounds.

CSL x Hyper Music

Hyper Music (Yann Macé and Luc Leroy) is a sound and music production company that produces music for ads, TV series, feature films and artists. The music research group of Sony Computer Science Laboratories (CSL) Paris develops A.I.-based music technology to support music artists in their creative workflows. CSL and Hyper Music have been working together since 2019. Their collaboration comprised two parts.

Part 1. Over the years, Hyper Music has produced a catalogue of music tracks. When potential clients require music, they advertise their needs, and Hyper Music submits a track from the catalogue. If the track is selected, Hyper Music customises it for the client. The first part of the CSL x Hyper Music collaboration focused on determining whether CSL A.I. technology could be useful for producing music that would fit into the Hyper Music catalogue. To this day, two Hyper Music tracks involving CSL A.I. technology have been commercially released: the music for the worldwide “The Most Wanted” Azzaro commercial and for the Peugeot 9X8 WEC Hypercar Reveal. This suggests that A.I. music technology is now ready for mainstream music production.

Peugeot 9X8 WEC Hypercar Reveal - most likely the first commercial ever with a soundtrack involving A.I. in the compositional process.

Part 2. What are the effects of music A.I. technology on musical style [Pascall, 2001]? The goal of the second part of the collaboration was to be more experimental and centre the songs around the outputs of the A.I. tools, without requiring the resulting songs to fit into Hyper Music’s portfolio. The process resulted in five tracks. In one of them, “Melatonin”, the use of A.I. technology resulted in what seems to be an original style [Deruty and Grachten, 2022]. This suggests that not only is A.I. music technology compatible with mainstream music, it can also be useful for creating truly innovative music.

CSL x Hyper Music in advertising music

A.I. technology can be useful in music production in many ways: audio synthesis, conditional audio track generation, MIDI generation from context, MIDI generation in a given style, automated audio processing… Hyper Music started with drum audio synthesis. In the soundtrack for “The Most Wanted” Azzaro commercial, Hyper Music used the GAN-based Impact Drums and DrumGAN to generate custom kick drum, snare drum and hi-hat samples. Impact Drums was developed at CSL between 2017 and 2019 by Stéphane Rivaud, Cyran Aouameur and Matthias Demoucron. DrumGAN was developed at CSL between 2019 and 2021 by Javier Nistal, Stefan Lattner and Matthias Demoucron.

“The Most Wanted” Azzaro commercial (April 2021).

One year later, the DrumGAN audio synthesis technology was licensed to Steinberg, resulting in the release of Backbone 1.5 (June 2022). A.I.-based audio synthesis is now available to everyone.

Figure 1. Workflow for “Rebirth”, which ended up as soundtrack for the Peugeot 9X8 Hypercar Reveal.

Hyper Music then went further and used six CSL prototypes to produce “Rebirth”, for which the workflow can be seen in Figure 1. A.I.-based audio synthesis was used once more (Impact Drums), along with automated audio processing (Profile EQ) [Deruty, 2019], conditional drum part generation (DrumNet) [Lattner and Grachten, 2019], and conditional generation of lead melodies (LeadNet, derived from BassNet) [Grachten et al., 2020]. Of particular interest is the use of LeadNet, not only to generate the piece’s main melody but to inspire the main body of the piece. “Rebirth” was included in Hyper Music’s portfolio and later selected for the Peugeot 9X8 Hypercar Reveal.

Peugeot 9X8 Hypercar Reveal (July 2021).

As usual, Hyper Music customised the track for the client. The following elements remained from the original version:

- The melody between 1’06 and 1’26 is from LeadNet.
- The synth ostinato from 1’27 to 1’40 is from LeadNet.
- The processing chains for the keyboard playing the melody and for the keyboard playing the chords throughout the piece both include Profile EQ.

CSL x Hyper Music: innovative style with A.I.

Can A.I. technology be the source of an original musical style? Style can originate from the means of production. [Parry, 1911] explains, for instance, how the violin may be linked to a specific style through its capacity for wide-ranging melody and high tessitura, and how, on the organ, alternating feet on the bass pedals produce patterns that are distinctive of late Baroque German organ music. In contemporary Popular Music too, the tools used for generating and processing audio may play a prominent role in defining style. Examples include the analog synthesizer, the distortion pedal and, more recently, the Antares Auto-Tune plugin.

In all these examples, not only is it possible to recognise the sound of the technology (violin, organ, distortion, Auto-Tune…), it is also possible to associate musical idioms with it [Pascall, 2001]. For instance, the distortion pedal can be associated with the use of power chords, and analog synthesizers with slowly evolving textures (see, for instance, Alessandro Cortini’s “Scappa” and most of his repertoire). In both cases, the result wouldn’t have been reached without the corresponding technology, both in terms of “sound” and of “music”. A question therefore arises: can we associate A.I. technology with musical idioms that wouldn’t exist otherwise?

Hyper Music x CSL: “Melatonin”. See the corresponding paper’s supplementary material for more audio.

One interesting example in this regard comes from the track “Melatonin”, produced during the second part of the Hyper Music x CSL collaboration (see [Deruty and Grachten, 2022] and the corresponding supplementary material for more details).

Figure 2. Workflow for Melatonin, section 2.

“Melatonin” can be divided into two sections. Figure 2 shows the workflow for section 2. Two musical idioms that appear in this section are (1) heterophony (see also [Cooke, 2001]) between bass lines and (2) homorhythmic homophony (see also [Hyer, 2001]) within bass lines. Figures 3 and 4 illustrate the two idioms with transcriptions in Western notation.

Figure 3. Heterophony between bass lines.
Figure 4. Simultaneous pitches (homorhythmic homophony) within a bass line (heard from one single harmonic complex).

Heterophony is not so common in Western classical music. However, it is basic to some non-European music, for example, the gamelan music of south-east Asia, much accompanied vocal music of the Middle East and East Asia, and group singing within orally transmitted monophonic traditions [Cooke, 2001].

Homophony deriving from a single harmonic complex (see [Yost, 2009] for the definition of a harmonic complex) is reminiscent of overtone singing, also a non-European tradition (see [Pegg, 2001]). However, in overtone singing, only natural harmonics can be heard simultaneously with the fundamental, whereas in the case of “Melatonin”, as shown in Figure 4, notes other than the natural harmonics can be heard from one single harmonic complex. See [Deruty and Grachten, 2022] for more details, and see this Medium article for an example of two simultaneous pitches from one harmonic complex in contemporary Popular Music.

About computational creativity in music

There are theoretical papers about computational creativity, the most famous probably being [Boden, 1998], further formalised by [Wiggins, 2006] amongst others. [Boden, 1998] describes three forms of creativity. Largely quoting Boden, the three forms are:

(1) Combinational creativity. The first type involves novel (improbable) combinations of familiar ideas.
(2) Exploratory creativity. The second type involves the generation of novel ideas by exploring structured conceptual spaces. This often results in structures (“ideas”) that are not only novel but unexpected. One can immediately see, however, that they satisfy the canons of the thinking style concerned.
(3) Transformational creativity. The third type involves the transformation of some (one or more) dimensions of the space so that new structures can be generated which could not have arisen before.

Forms (2) and (3) shade into one another, since exploration of the space can include minimal “tweaking” of fairly superficial constraints. The distinction between a tweak and a transform is to some extent a matter of judgement, but the more well-defined the space, the clearer this distinction can be.

Reaching transformational creativity is an interesting goal: if new structures can be generated which could not have arisen before the use of the technology, then the technology has the potential to produce truly creative music.

Figure 5 illustrates the architecture of Magenta’s Music VAE [Roberts et al., 2018]. Such a model can generate novel ideas by exploring the VAE’s latent space, with an output style that emulates that of the training dataset.

Figure 5. The architecture of Magenta’s Music VAE, 2018.

Figure 6 illustrates the approach followed by OpenAI’s Jukebox [Dhariwal et al., 2020]. Similarly, novel ideas can be generated by exploring the VQ-VAE’s latent space, with an output style that emulates that of the training dataset. The creativity demonstrated by both Music VAE and Jukebox falls into the second category, exploratory creativity.

Figure 6. The approach followed by OpenAI’s Jukebox, 2020.

Figure 7 represents an image generated by DALL·E 2 [Ramesh et al., 2022]. The result is “in the style of Claude Monet”. As Boden puts it, it “satisf[ies] the canons of the thinking-style concerned”. As such, it remains exploratory creativity.

Figure 7. “A painting of a fox sitting in a field at sunrise in the
style of Claude Monet”, generated by DALL·E 2, 2022.

Figure 8 shows a self-portrait by Vincent Van Gogh. As in most Van Gogh paintings, the brush strokes follow a characteristic set of rules [Putri et al., 2017].

Figure 8. Vincent Van Gogh, self-portrait, 1889.

Similarly, the images shown in Figure 9, created by Deep Dream, feature an original style: a recognisable and original set of patterns. The creativity shown by Deep Dream goes beyond exploratory creativity. It doesn’t satisfy a canon. “New structures are generated, which could not have arisen before” the use of the Google team’s technique of iterative Inceptionism.

Figure 9. Neural Net “dreams” generated purely from random noise by Deep Dream, 2015.

From this perspective, we suggest that in the context of “Melatonin”, BassNet also goes beyond exploratory creativity.

In conditional audio generation such as BassNet, the model learns the relations between an input and an output. After training, once the relations have been learnt over the training dataset, an output is generated from an input, and the two can be played together. Parts generated using the same input conditioning but slightly different positions in the latent space have global similarities and local differences. Once the input conditioning is removed, the result is an ensemble of compatible tracks that are more or less similar to each other, leading to an original style of counterpoint that involves heterophony (see the example in Figure 10). The result doesn’t satisfy a canon. “New structures are generated, which could not have arisen before” the use of conditional audio generation.

Figure 10. A six-bass part involving heterophony.
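As a toy illustration of this mechanism (a sketch, not BassNet’s actual code or architecture), the snippet below mimics what happens when the same conditioning is decoded from several nearby positions in a latent space: the lines agree globally but diverge locally, which is precisely heterophony. The `decode` function is a hypothetical stand-in for a trained conditional decoder.

```python
import random

# Conditioning material: a fixed bass melody, as MIDI pitches.
BASE_LINE = [36, 38, 40, 36, 43, 41, 40, 38]

def decode(z, base=BASE_LINE):
    """Hypothetical stand-in for a trained decoder: globally it follows
    the conditioning melody, while the latent vector z controls small
    local deviations (here, at most a minor third per step)."""
    line = []
    for step, pitch in enumerate(base):
        deviation = round(z[step % len(z)] * 2)
        line.append(pitch + deviation)
    return line

rng = random.Random(0)
center = [rng.uniform(-1, 1) for _ in range(4)]  # one latent position

# Decode from several nearby latent positions: same conditioning,
# slightly shifted latents.
lines = []
for _ in range(6):
    z = [c + rng.uniform(-0.3, 0.3) for c in center]
    lines.append(decode(z))

for line in lines:
    print(line)
```

Played together, such lines form an ensemble of compatible parts that share a contour yet differ in detail, which is the heterophonic texture described above.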

BassNet doesn’t only learn relations in terms of pitch; it also learns relations in terms of spectral envelope. BassNet’s learning process involves the energy in each harmonic (see the spectrogram of a BassNet output in Figure 11). It turns out that under particular conditions (again, see [Deruty and Grachten, 2022] for more details), the resulting harmonic complex is perceived as two simultaneous pitch values. Again, the result doesn’t satisfy a canon. “New structures are generated, which could not have arisen before” the learning of those particular values for the amplitude of each harmonic.

Figure 11. The spectrogram corresponding to the transcription shown in Figure 4.
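To make the psychoacoustic idea concrete, here is a toy additive-synthesis sketch (an illustration of the general mechanism, not BassNet’s output): a single harmonic complex is built on a 55 Hz fundamental, and every third harmonic is strongly boosted. The boosted harmonics form a harmonic series of their own on 165 Hz, so a listener may report a second pitch on top of the fundamental. All parameter values are assumptions chosen for the example.

```python
import math

SAMPLE_RATE = 44100
F0 = 55.0        # fundamental frequency in Hz (A1)
DURATION = 0.25  # seconds
N_HARMONICS = 16

def harmonic_complex(boosted_multiple=3, boost=8.0):
    """One harmonic complex on F0. Harmonics whose index is a multiple
    of `boosted_multiple` are amplified, so their common periodicity
    (F0 * 3 = 165 Hz here) can emerge as a second perceived pitch."""
    n_samples = int(SAMPLE_RATE * DURATION)
    signal = [0.0] * n_samples
    for k in range(1, N_HARMONICS + 1):
        amp = 1.0 / k                # gentle spectral roll-off
        if k % boosted_multiple == 0:
            amp *= boost             # emphasise every 3rd harmonic
        for i in range(n_samples):
            signal[i] += amp * math.sin(2 * math.pi * k * F0 * i / SAMPLE_RATE)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalise to [-1, 1]

audio = harmonic_complex()
```

The key point is that all energy still sits at integer multiples of a single fundamental: it is one harmonic complex, yet the amplitude pattern alone can make more than one pitch audible.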

Transformational creativity involves the transformation of some (one or more) dimensions of the space. If the space, in this case, is the ensemble of rules that underlie the construction of the musical discourse, then perhaps homophony within a single line is transformational creativity. A previous (and fundamental) rule was that one single harmonic complex leads to one perceived pitch value; a new rule is that one single harmonic complex can lead to two perceived pitch values.

Conclusion

The same machine learning model [Grachten et al., 2020] that provided inspiration and the main lead part for the Peugeot 9X8 Hypercar Reveal soundtrack was also able to suggest original musical idioms. In a world where popular music is “considered to be of lower value and complexity than art music, and to be readily accessible to large numbers of musically uneducated listeners rather than to an élite” [Middleton and Manuel, 2001], an A.I. that can simultaneously accommodate mainstream production music (to use [Crain, 2018]’s vocabulary) and sonic experiments involving psychoacoustics and counterpoint (possibly demonstrating transformational creativity in the process) is perhaps more perceptive than many humans. In any case, music technology has always been fun; A.I. is certainly no exception.

References

[Roberts et al., 2018] Roberts, Adam, et al. “A hierarchical latent vector model for learning long-term structure in music.” International Conference on Machine Learning. PMLR, 2018.

[Boden, 1998] Boden, Margaret A. “Creativity and artificial intelligence.” Artificial intelligence 103.1–2 (1998): 347–356.

[Cooke, 2001] Cooke, Peter. (2001). Heterophony. In Grove Music Online. Oxford University Press.

[Crain, 2018] Crain, Timothy M. (2018). Production music. In Grove Music Online. Oxford University Press.

[Dhariwal et al., 2020] Dhariwal, Prafulla, et al. “Jukebox: A generative model for music.” arXiv preprint arXiv:2005.00341 (2020).

[Deruty, 2019] Method and electronic device, world patent WO2019063736A1, Sony Europe Limited, 2019.

[Deruty and Grachten, 2022] Deruty, Emmanuel and Maarten Grachten, “Melatonin”: A Case Study on AI-induced Musical Style. 3rd Conference on AI Music Creativity, September 13–15 2022. Link to supplementary material.

[Grachten et al., 2020] Grachten, Maarten, Stefan Lattner, and Emmanuel Deruty. “Bassnet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control.” Applied Sciences 10.18 (2020): 6627.

[Hyer, 2001] Hyer, Brian (2001). Homophony. In Grove Music Online. Oxford University Press.

[Lattner and Grachten, 2019] Lattner, Stefan, and Maarten Grachten. “High-level control of drum track generation using learned patterns of rhythmic interaction.” 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019.

[Middleton and Manuel, 2001] Middleton, Richard, and Peter Manuel (2001). Popular Music. In Grove Music Online. Oxford University Press.

[Parry, 1911] Parry, Charles Hubert Hastings. Style in musical art. Macmillan and Company, limited, 1911.

[Pascall, 2001] Pascall, Robert. (2001). Style. In Grove Music Online. Oxford University Press.

[Pegg, 2001] Pegg, Carole. (2001). Overtone-singing. In Grove Music Online. Oxford University Press.

[Putri et al., 2017] Putri, Tieta, Ramakrishnan Mukundan, and Kourosh Neshatian. “Artistic Style Characterization of Vincent Van Gogh’s Paintings using Extracted Features from Visible Brush Strokes.” International Conference on Pattern Recognition Applications and Methods. Vol. 2. SCITEPRESS, 2017.

[Ramesh et al., 2022] Ramesh, Aditya, et al. “Hierarchical text-conditional image generation with clip latents.” arXiv preprint arXiv:2204.06125 (2022).

[Wiggins, 2006] Wiggins, Geraint A. “A preliminary framework for description, analysis and comparison of creative systems.” Knowledge-Based Systems 19.7 (2006): 449–458.

[Yost, 2009] Yost, William A. “Pitch perception.” Attention, Perception, & Psychophysics 71.8 (2009): 1701–1715.

Our team’s page: Sony CSL Music — expanding creativity with A.I.


Emmanuel Deruty

Researcher for the music team at Sony CSL Paris. We are a team working on the future of AI-assisted music production, located in Paris and Tokyo.