Vocal setups in Contemporary Popular Music

Emmanuel Deruty
31 min read · Sep 5, 2022


One key goal of our series of posts is to build better artificial intelligence models for music by better understanding the music itself. One defining aspect of Contemporary Popular Music is the presence of vocals: a look at the “Greatest of all time” Billboard 200 albums shows that 195 albums out of 200 feature songs with vocals. Scholars such as [Moore and Martin, 2018] confirm the observation: “[t]he fact that most popular music is vocal music means that we must take count of the voice rather closely”. In this post, we examine one key difference between music made with A.I. and actual commercial music: vocal setups, i.e., how individual voices are organised together.

“Hustlers”, from “Andretti 11/30” by US rapper Curren$y, is one example of particularly elaborate vocals.

Content of this post:
Introduction (1): the A.I. song contest 2020.
Introduction (2): terminology.
Vocal setups in Western Classical Music.
-> Western Classical Music: single vocal part.
-> Western Classical Music: multiple vocal parts.
Vocal setups in Contemporary Popular Music.
-> Notation of vocal setups.
-> Prevalence.
-> Analyses in the context of songs.
A better representation for better machine learning?
Conclusion.

This post is derived from a presentation given at Sony CSL Paris in 2021.

Introduction (1): the A.I. song contest 2020

The A.I. song contest is an international music competition for songs that have been composed with the help of A.I. The first edition occurred in 2020, and gathered 13 teams. Figure 1 shows an analysis of the production process for the 13 songs according to [Huang et al., 2020].

Figure 1. “Modular musical building blocks” according to [Huang et al., 2020].

The 12 teams who chose to include a vocal track first generated a melody, then used either vocal synthesis or a human singer to sonify it. Such a process stems from the hypothesis that a song typically features one singer or, from the point of view of production, one single vocal track.

Figure 2 shows transcriptions made from Contemporary Popular Music (see [Deruty et al., 2022] for a definition of Contemporary Popular Music). The transcriptions will be detailed later in the post. In these examples, the vocal line is by no means unique: several simultaneous voices are involved.

Figure 2. Transcription of vocal tracks from Contemporary Popular Music, as found later in this article.

Now suppose that the “melody generation” + “vocal synthesis” process used in the A.I. song contest 2020 works perfectly. Even then, the generated content remains far from what we can hear in actual music production. The question addressed by this article is the following: how can we bridge the gap between the two practices, i.e. music production involving A.I. and actual music production?

Introduction (2): terminology

Let’s better define the specific aspect of vocals that so greatly differs between the practices witnessed (1) in the A.I. song contest 2020 and (2) in actual music production.

According to [Grove Music Online, 2001a], a texture is “[t]he way in which individual parts or voices are put together”. The definition corresponds to the aspect of vocals we address in this post. One obvious question is, are parts and voices synonymous? According to [Grove Music Online, 2001b], they are not. A voice corresponds to a single track or line, a track being defined as “one of two or more paths on magnetic recording tape receiving information from a single input channel; hence by extension, the single voice or line recorded, whether on tape or by digital means”.

From this perspective, as illustrated in Figure 3, we may group lines or voices into parts. There may be one or several parts, each made from individual lines or voices.

Figure 3. “Parts” made from one or several “voices” or “lines”.

Such grouping may designate a vocal texture according to [Grove Music Online, 2001a]. Still, we prefer the word setup to the word texture, as texture evokes microscopic properties rather than general organisation schemes such as the one shown in Figure 3.

Let’s clarify that Figure 3 doesn’t show an arrangement of the vocals. According to [Boyd, 2001], an arrangement is the recomposition and transcription of original content. The analyses below of the vocal content in Contemporary Popular Music pertain to the original music.

The term vocal layering is sometimes used in studio practices. Figure 4 shows a screen capture from a YouTube video that explains how to layer vocals in hip-hop. The arrangement of the audio regions suggests that layering refers to the organisation of single voices in a part.

Figure 4. “Hip Hop Vocal Layering Like A Pro”, from YouTuber Sean Divine.

Vocal setups in Western Classical Music

[Hitchcock and Deaville, 2013] define three disciplines in musicology: historical musicology, popular music studies, and ethnomusicology. Historical musicology would be concerned with “Western art music” (a term used by the Oxford Music Faculty). Popular music studies would naturally correspond to the study of popular music, “considered to be of lower value and complexity than art music, and to be readily accessible to large numbers of musically uneducated listeners rather than to an élite” [Middleton and Manuel, 2001]. Finally, ethnomusicology (see [Bohlman, 2013] for a precise definition) would concern everything that’s not Western (an attitude that’s perhaps not without consequences).

It seems to us that the distinction between the three aforementioned disciplines of musicology should not be perpetuated. Studies should not be relegated to the secondary disciplines of popular music studies and ethnomusicology simply because the music under study is not deemed to belong to the proper genre. In this post, we see one consequence of the phenomenon: participants in the A.I. song contest 2020 appear to have a conception of Contemporary Popular Music that’s largely derived from Western Classical Music, in particular as far as the vocals are concerned.

Still, as we don’t wish to be relegated to the exclusive domain of popular music studies, we make the effort to consider historical musicology’s approach to vocal setups. In doing so, we find out what we can retain from this approach regarding Contemporary Popular Music. Eventually, we’ll come to the conclusion that the terminology proposed by historical musicology is confusing and that the model itself poses problems. Even so, perhaps progress has to go through explicitly questioning what already exists.

Following Figure 3 above, we keep on making the distinction between parts and voices. A part is made from one or several distinct voices, lines or tracks. If two singers sing the same musical content together (modulo the octave interval), then their contribution (two voices) is grouped into a single part. From this perspective, the score on the left in Figure 5 below features a single vocal part (whether it is eventually sung by one or several people). The score on the right features four vocal parts, independently of whether each one of them is sung by one or several persons.

Figure 5. Two scores from Franz Schubert. Left, one vocal part. Right, four vocal parts.
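To make the rule concrete, here is a minimal sketch in Python (our own toy formalisation, not from the cited sources), assuming each voice is reduced to a list of (onset, MIDI pitch) events: two voices group into one part when their onsets coincide and their pitch differences are multiples of an octave.

```python
from typing import List, Tuple

# A voice reduced to (onset in beats, MIDI pitch) events (a toy format).
Voice = List[Tuple[float, int]]

def same_part_wcm(a: Voice, b: Voice) -> bool:
    """Western Classical Music grouping rule, after [Baron, 1968] and
    [Batista Doni, 1635]: same part if homorhythmic (identical onsets)
    and in a unison/octave relation (pitch differences multiple of 12)."""
    if len(a) != len(b):
        return False
    homorhythmic = all(ea[0] == eb[0] for ea, eb in zip(a, b))
    unison_octave = all((ea[1] - eb[1]) % 12 == 0 for ea, eb in zip(a, b))
    return homorhythmic and unison_octave

# A melody doubled one octave below counts as a single part.
soprano = [(0.0, 72), (1.0, 74), (2.0, 76)]
bass = [(0.0, 60), (1.0, 62), (2.0, 64)]
assert same_part_wcm(soprano, bass)
```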

Western Classical Music: single vocal part

According to the literature, a single vocal part can be characterised by two main aspects. First aspect: whether the part is a monodie, in which case it includes one single voice, or a chorodie, in which case it includes several voices in unison/octave. Second aspect: whether the part follows the poetic rhetoric (the text). See Figure 6 for a schematic representation of the two aspects. Notice the confusion in the vocabulary, as the word monodie/monodia can designate any of the involved setups.

Figure 6. Vocabulary used for the setup of a single vocal part in historical musicology. The numbers between parentheses lead to the following references: (1) [Baron, 1968]; (2) [Batista Doni, 1635], p. 68; (3) ibid., p. 103; (4) [Kircher, 1650]; (5) [Walther, 1732].

Aspect 1, monodie and chorodie.
Video 1 illustrates a monodie followed by a chorodie. “Libera me, Domine” is sung by one person only, then “de morte aeterna” is sung by several persons, at least one singing the upper octave.

Video 1. “Libera me” [Schola of the Hofburgkapelle Vienna, 2001] — Example of both monodies and chorodies.

Aspect 2: subordination to poetic rhetoric.
Video 2 includes a recitative from Le nozze di Figaro [Mozart, 1786]. In this case, the melody is subordinated to the lyrics.

Video 2. “Che imbarazzo è mai questo” (recitativo) [Mozart, 1786]. Example of poetic subordination.

Video 3 includes an aria from Die Zauberflöte [Mozart, 1791]. In this case, the melody is not subordinated to the lyrics. The melody is sometimes largely independent, with several melodic patterns sung on one syllable.

Video 3. “Der Hölle Rache” [Mozart, 1791] — Example of poetic insubordination

Western Classical Music: multiple vocal parts

As illustrated in Figure 7, [Baron, 1968] and [Batista Doni, 1635], p. 68 clearly define the content of a part: all voices in each part are homorhythmic, and have a relation of unison/octave. We’ll see that relations between parts are less well characterised.

Figure 7. Relations between parts may be difficult to characterise.

According to [Hyer, 2001], one relation between parts is homophony. It occurs when several parts are homorhythmic while not playing the same notes. An example of homophony is provided in Video 4.

Video 4. Homophony: homorhythmic parts from [Tallis, 1565].

According to [Frobenius, 2001], polyphony occurs when the different parts are not homorhythmic. Video 5 illustrates polyphony from the same music as Video 4 [Tallis, 1565].

Video 5. Polyphonic (non-homorhythmic) content from [Tallis, 1565].

Still, the term polyphony is usually reserved for music from the late Middle Ages and Renaissance, in which the different parts have equal importance [Frobenius, 2001]. Video 6 contains an example in which different parts are not homorhythmic, yet using the term polyphony for such music would be considered improper.

Video 6. Ständchen (“Zögernd, leise”) [Schubert, 1827]. Two parts, not homorhythmic, yet not “polyphony”.

Confusion appears when, according to [Hyer, 2001], (1) in homophonic setups, “one part — often but not always the highest — usually dominates the entire texture” and (2) “there is a clear differentiation between melody and accompaniment”. [Hyer, 2001] even provides the following example for homophony:

Figure 8. “Homophony homophonia [sic]: ‘sounding alike’: Ex.2 Chopin: Nocturne in E major, op.62 no.2 (1846).” Borrowed from [Hyer, 2001].

Also, it is not always clear which part dominates. Video 7 provides an example in which the top two parts appear to dominate in turns.

Video 7. “Hallelujah” [Händel, 1741]. Which part dominates? Difficult to answer.

The above shows that the word homophony is poorly defined. In Figure 9, we propose a recapitulation involving the words homophony, homorhythmy and polyphony, in terms of a gradation that ranges from homorhythmy (the parts are synchronous with each other) to polyphony (the parts are rhythmically independent).

Figure 9. Relations between parts, from the most interdependent to the most independent.
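One possible way to place two parts on this gradation is to compare their onset times, as in the Python sketch below. This is our own toy metric, not an established musicological measure: a value of 1.0 corresponds to strict homorhythmy, values near 0.0 to rhythmic independence.

```python
def onset_overlap(part_a, part_b):
    """Jaccard similarity between the onset sets (in beats) of two parts:
    1.0 = homorhythmic, towards 0.0 = rhythmically independent."""
    a, b = set(part_a), set(part_b)
    return len(a & b) / len(a | b)

print(onset_overlap([0, 1, 2, 3], [0, 1, 2, 3]))          # 1.0, homorhythmy
print(onset_overlap([0, 1, 2, 3], [0.5, 1.5, 2.5, 3.5]))  # 0.0, polyphony-like
```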

In Figure 10, we propose an overall recapitulation of the respective roles of parts and voices in Western Classical Music. A part contains a single voice or homorhythmic voices in a unison/octave relation. The relations between parts are less clear, but two criteria can be considered: (1) whether one part dominates the others or the parts are of equal importance and (2) whether the parts are homorhythmic or not.

Figure 10. Summary of the relations between parts and between voices in Western Classical Music.

Vocal setups in Contemporary Popular Music

One form of Popular Music involves a solo singer simultaneously playing a polyphonic instrument, e.g. the guitar. According to [Baron, 1968], the term monodie doesn’t apply to this configuration, as a monodie is “presumably […] unaccompanied”. According to [Hyer, 2001], the configuration would qualify as homophony, as the vocals dominate (see Figure 8). As the Western Classical Music terminology appears to be more confusing than enlightening in this case, we’ll set it aside for the time being.

Figure 11 illustrates the configuration mentioned above, showing Bob Dylan live ca. 1965. The same figure shows a screen capture of the vocal tracks in a 2010 Pro Tools session for an RnB song. One can count 49 single voices or vocal lines. The two examples show two significantly different practices.

Figure 11. 1965–2010, a spectacular evolution in the production of vocals.

Let’s provide examples illustrating the gradation from the “solo singer + guitar” configuration to contemporary complex vocal production techniques. Figure 12 illustrates the common practice of backing vocals (reminiscent of Video 6 above [Schubert, 1827]). One main singer is supported by secondary vocalists.

Figure 12. Adele and Elton John, along with backing vocals. The videos corresponding to the examples shown can be found here and here.

Backing vocals can assume a variety of functions. In Video 8, the backing vocals, performed by bassist Chris Wolstenholme and support musician Morgan Nicholls, provide “answers”, harmonisation and doubling to Matt Bellamy’s lead vocals.

Video 8. “Madness” from [Muse, 2013]. “Answers”, harmonisation and vocal doubling.

In the additional resources to Mixing secrets for the small studio, [Senior, 2011] indeed provides backing vocal tracks in the multi-tracks to be mixed. As illustrated in Figure 13, he also provides “DT” tracks (track 22).

Figure 13. “DT” and backing vocals in multi-tracks provided by [Senior, 2011].

“DT” means “double-tracking”, which consists of recording the same vocal line or voice twice, presumably to obtain a thicker sound than a single voice would provide. Video 9 shows producer Butch Vig explaining the use of double-tracking in “Smells Like Teen Spirit” [Nirvana, 1991].

Video 9. Double-tracking in the studio: Butch Vig about “Smells Like Teen Spirit” [Nirvana, 1991].

Double-tracking seems important enough to be implemented live. In Video 10, 0'08, Eminem lowers his microphone for one syllable. A pre-recorded, synchronous Eminem can be clearly heard voicing the same syllable, indicating double-tracking.

Video 10. Double-tracking performed live: “Mockingbird” [Eminem, 2004].

Notation of vocal setups

To describe the voices inside parts in the context of Contemporary Popular Music, we’ll use the notation shown in Figure 14. The white square represents a part. Each circle represents a voice. Inside the square, the X-axis denotes the voice’s panning (left/right). The Y-axis loosely denotes pitch. We’ll see that in practice, there is no need for a clear pitch to represent one voice higher than another one: an approximate register may be sufficient.

Figure 14. Notation of voices inside parts.
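The notation lends itself to a simple data model. The Python sketch below is a hypothetical encoding of ours (names and value ranges are assumptions): pan covers the X-axis from -1 (left) to +1 (right), and register stands in for the loose Y-axis, with no exact pitch required.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Voice:
    pan: float        # X-axis: -1.0 (left) to +1.0 (right)
    register: float   # Y-axis: loose, unitless register, not an exact pitch
    label: str = ""

@dataclass
class Part:
    voices: List[Voice] = field(default_factory=list)

# One plausible reading of setup (1), double-tracking: two takes of the
# same line at the same register, panned to either side (values invented).
double_tracked_lead = Part(voices=[
    Voice(pan=-0.8, register=0.5, label="lead, take 1"),
    Voice(pan=+0.8, register=0.5, label="lead, take 2"),
])
```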

Video 11 shows a partial transcription of “See Emily Play” [Pink Floyd, 1967], using the notation illustrated in Figure 14. The early date of this monophonic recording (1967) suggests that double-tracking was used very early on.

Video 11. Double tracking in “See Emily Play” [Pink Floyd, 1967].

Figure 15 shows examples of basic vocal setups, followed by a list of music examples involving each setup. The list suggests that the basic vocal setups are ubiquitous and found in various music genres. The YouTube links in the references lead directly to the right timing.

Figure 15. Examples of basic vocal setups.

Examples for setup (1) (double-tracking) include the vocals in:
- “Dear Prudence” [The Beatles, 1968], from 0'16.
- “Cleanin’ out my Closet” [Eminem, 2002], from 0'33.
- “Extreme Ways” [Moby, 2002], the chorus (for instance, from 1'16).
Double-tracking is extremely common and has been for a long time.

Examples for setup (2) include the vocals in:
- “Stayin’ Alive” [Bee Gees, 1977], the verse from 3'03 to 3'22.
- “Single Ladies” [Beyoncé, 2008], the chorus from 0'30.
- “Beautiful the World” [Uncanny Valley, 2020], the first verse from 0'17 and several sections throughout the song. Although it won the A.I. song contest 2020, the song was still produced by human musicians.

Examples for setup (3) include the vocals in:
- “In the Flesh Thou Didst Fall Asleep” [Divna Ljubojevic, 2008], from 0'40 until the end. Note that although this setup may simply seem to correspond to two persons singing together, the two voices can’t occupy the same position in physical space; the voice placement was performed during production.
- “Breathe” [Pink Floyd, 1973], from 2'45 to the end. A similar example to the previous one, though in an entirely different genre.
- “This Mortal Coil” [Carcass, 1993], from 2'29 to 2'35. The example shows that even though the setup suggests two different pitch values, there is no need to have an actual pitch to have one voice higher than the other. See this Medium post and this other Medium post for remarks about how the traditional notion of pitch may be unsuited to describe contemporary popular music.

Examples for setup (4) include the vocals in:
- “I’m your Boogie Man” [KC and the Sunshine Band, 1976], 2'32 to 2'44.
- “American Life”[Madonna, 2003], the vocal section at the beginning.
- “Tarnished” [Dälek, 2007], the chorus, for instance, from 2'30. As in [Carcass, 1993], the setup does not need a clear pitch.

Examples for setup (5) include the vocals in:
- “Mrs Robinson” [Simon & Garfunkel, 1968], the chorus from 0'32. Even though the song sounds natural and traditional, the vocals of each singer were recorded twice and panned left/right.
- “We Fly High” [Jim Jones, 2006], the chorus from 0'09. Very different music genre, same technique.
- “Alejandro” [Lady Gaga, 2010], the chorus from 3'21. Third example for this setup, third chorus.

Examples for setup (6) include the vocals in:
- “Chandelier” [SIA, 2014], from 1'28 to 1'50.
- “On the Floor” [Jennifer Lopez, 2011], the chorus from 1'10.

Examples for setup (7) include the vocals in:
- “Dontojno Jest” [Kitka, 2003], from 1'43 to 2'38. This example illustrates how studio vocal setups can be compatible with older music genres. In this organum-type polyphonic extract (see [Smith et al., 2001] for more on organum), the sustained-note tenor is sung twice, panned left/right, while the more mobile upper part is recorded only once and panned to the centre.

Prevalence

The examples above suggest that vocal parts may feature multiple voices or vocal lines. It would be interesting to know for what proportion of music this is the case. Figure 16 shows the proportion of different types of vocals for the years 1960, 1985 and 2010. Except for the single-voice part, the icons are used for illustrative purposes; they don’t correspond to the precise vocal setups. The description was derived from listening to the music provided by the list of #1 singles in 1960, the list of #1 singles in 1985 and the list of #1 singles in 2010.

Figure 16. Proportion of types of vocals. Left, 1960. Middle, 1985. Right, 2010.

The single-voice part (one single singer) is seldom used. In 1960, the majority of music involves backing vocals. The complexity of vocal production appears to increase with time. In particular, composite (several voices) lead parts make up more than 75% of cases in 2010, and almost half of the music involves composite lead + composite backing vocals.

To obtain a year-by-year diachronic representation, let’s write (0) the absence of vocals, (1) a one-voice lead, (2) a one-voice lead with backing vocals, (3) a composite lead, and (4) a composite lead with backing vocals. Figure 17 shows the evolution of these features in the US#1 Billboard singles.

Figure 17. A listen to the US Billboard #1 tracks suggests a progressive complexification in the production of vocals.
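Given such per-song labels, the proportions behind Figures 16 and 17 are straightforward to compute. A sketch follows, with invented annotations that do not reproduce the actual listening notes:

```python
from collections import Counter

# 0 = no vocals, 1 = one-voice lead, 2 = one-voice lead + backing vocals,
# 3 = composite lead, 4 = composite lead + backing vocals.
def proportions(labels):
    counts = Counter(labels)
    return {category: counts[category] / len(labels) for category in range(5)}

# Toy annotations for two years (invented, for illustration only).
print(proportions([1, 2, 2, 2, 3, 1, 2]))  # a 1960-like distribution
print(proportions([3, 4, 4, 3, 4, 2, 4]))  # a 2010-like distribution
```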

Analyses in the context of songs

Vocal setups are not static: they evolve during the songs. We transcribe and analyse excerpts from ten songs produced from 1979 to 2021. The transcriptions include the perceived vocal setups. Note that (1) in the absence of a score, the setups are written as heard; the production, as seen on the right of Figure 11, may have involved more vocal tracks and more voices. (2) While [Baron, 1968] and [Batista Doni, 1635], p. 68, chose to group voices into parts when the voices have a relation of unison/octave, we choose to group voices into parts when the voices are homorhythmic. Indeed, voices involved in basic setups (3), (5), (6) and (7) above, for example, should belong to the same part (a more detailed explanation is provided below). Looking back at Video 6 [Schubert, 1827], this process would have resulted in two parts.

The ten songs are shown in Figure 18. They are divided into three groups. The main group is the top line. The leftmost song ([Pink Floyd, 1979]) includes only one part. From left to right, the songs feature an increasingly present second part. The second group is on the bottom left. It shows examples of songs in which vocal setups can be identified while pitch is unclear. The third group is shown on the bottom right. The two songs in this group show particularities: in [XXXTentacion, 2017], the main part becomes subordinate to the previous secondary part, and in [Curren$y, 2016], difficulties occur with the grouping of voices into parts.

Figure 18. Ten songs to be analysed, three groups.
1. “Another Brick in the Wall, part 1” [Pink Floyd, 1979]: three different functions for voices inside one part.

Video 12 shows a partial transcription of 29 bars from [Pink Floyd, 1979] in Western Classical notation. The transcription, as well as the other ones below, was made by ear. The top stave describes the vocal part; the corresponding vocal setups are inserted above the top stave. In this one-part example, the setups can be linked to three distinct functions: (1) in bars 6 and 7, they serve a figurative purpose, highlighting the word “memory”. (2) In bar 19, they reinforce Roger Waters’s vocals in their weaker higher range. (3) From bar 22, they correspond to a two-voice harmonisation.

Video 12. Partial transcription of “Another Brick in the Wall, part 1” [Pink Floyd, 1979].

Figure 19 interprets the example according to the representation shown in Figures 3, 7 and 10, which we previously used in the case of Western Classical Music. There is one single part, with one to three voices that are subordinated to the lyrics. As stated above, the part’s voices are homorhythmic, not in a relation of unison/octave. Describing the homorhythmic voices as different parts would have resulted in a more complex description; see [Bimbot et al., 2016] for a description of how music analysis can take advantage of Minimum Description Length principles to better describe the analysed content (see [Grünwald et al., 2005] for more on the MDL principle, and refer to example 3 below for more details on the matter).

Figure 19. Parts and voices in [Pink Floyd, 1979].

2. “If you seek Amy” [Britney Spears, 2008]: a higher variety of voice setups inside one part.

Video 13 shows a partial transcription of 32 bars from [Britney Spears, 2008] in Western Classical notation. Thirty years after the previous example, there is much more variety in the voice setups inside a single part. Note that the setups are part of the form in the sense of [Caplin, 2001], and more specifically [Bimbot et al., 2016]: for instance, setups are part of the contrasts in bars 4 and 8.

Video 13. Partial transcription of “If you seek Amy” [Britney Spears, 2008].

Figure 20 interprets the example according to the now usual representation. There is one single part, with one to several voices that are subordinated to the lyrics. Considerable attention is paid to the setup of voices inside parts.

Figure 20. Parts and voices in [Britney Spears, 2008].

3. “Royals” [Lorde, 2013]: one part, yet its unicity is not always clear.

According to [Bent and Drabkin, 1987], music analysis is “the interpretation of structures in music together with their resolution into relatively simpler constituent elements, and the investigation of the relevant functions of those elements”. The description of content using fewer elements is related to the topic of Minimum Description Length [Grünwald et al., 2005], of which a key principle is “[t]he more we are able to compress the data, the more we have learned about the data”. According to this principle, given two concurrent descriptions of music, we may choose the shorter (in other words, the more economical) one. In [Lorde, 2013], transcribed in Video 14, we are faced with such concurrent descriptions. The principle applies in the following situations:

(1) The top two staves between bars 9 and 15 are described as (a) belonging to one single part instead of (b) belonging to two separate parts. Description (a) is shorter than description (b), as description (b) would involve two parts and two vocal setups, the two parts sharing the same rhythm and the two vocal setups being the same.

(2) The voices in bar 8, third stave, may be described as belonging to the same part as the voices in bar 9, two top staves. The same applies to bars 12 and 13. The two concurrent descriptions, (a) same part and (b) different part, imply: (a) the note + rhythm pattern and the new setup; (b) the note + rhythm pattern, the new setup and the declaration of a new part. Description (a) is shorter.

(3) The same reasoning would lead to describing the content in bars 17 and 21 (“Royals”, staves 1 and 2) as belonging to the same part as the surrounding content in the third stave. Yet, cultural conventions found elsewhere (a phenomenon referred to as inter-opus, see [Conklin, 2003] for a distinction between intra-opus and inter-opus) would lead to isolating bar 17 as background vocals and declaring it as a separate part.

(4) Describing the content in bar 23 as a single part leads to specifying that a part may contain non-homorhythmic content, which is outside the initial conventions. Still, the decision complies with the principle above: the alternative would require the overall description to state that the specifications of a part itself can change, an extra field that would remain empty during the whole extract except in bar 23.

Video 14. Partial transcription of “Royals” [Lorde, 2013].
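As a toy illustration of these comparisons, the sketch below encodes the two concurrent descriptions of situation (1) as lists of declarations and prefers the shorter one. The token vocabulary is invented, and counting declarations is only a crude stand-in for an actual description length:

```python
def description_length(declarations):
    # Crude MDL proxy: one unit of cost per declaration.
    return len(declarations)

# Bars 9 to 15 of "Royals": one part vs. two parts (toy tokens).
one_part = ["declare part 1", "setup A", "rhythm X",
            "upper notes", "lower notes"]
two_parts = ["declare part 1", "setup A", "rhythm X", "upper notes",
             "declare part 2", "setup A", "rhythm X", "lower notes"]

# The one-part description avoids repeating the shared setup and rhythm,
# so it is preferred: "the more we compress, the more we have learned".
assert description_length(one_part) < description_length(two_parts)
```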

Figure 21 interprets the example according to the usual representation. There is one part with one to several voices that are subordinated to the lyrics, but the part is sometimes difficult to declare as one single part. Considerable attention is paid to the setup of voices inside parts.

Figure 21. Parts and voices in [Lorde, 2013].

4. “Human” [Brandy, 2008]: unicity of the part is not always clear (voices are not always homorhythmic).

In this example, transcribed in Video 15, the ambiguity between a description as one part and as two parts increases: the content presented is more or less homorhythmic. In bar 3, for instance, both voices are homorhythmic; a bit less so in bar 4. Still following the principle of selecting the shorter of two concurrent descriptions, we can describe bar 3 as a single part. In that case, it may prove more economical to describe bar 4 as a single part as well, as doing so spares the declaration of a new part for bar 4 alone, at the price of describing the non-homorhythmic elements in bar 4, stave 2.

Confronted with the same problem in bar 5, we may this time choose to declare a new part for the last two beats. The reason is that the size of the elements we would have to declare as departures from homorhythmy increases; still, an ambiguity remains. The same principle applies to bar 7.

From bar 8, the difference between the two voices is such that declaring two parts appears to be clearly shorter.

Video 15. Partial transcription of “Human” [Brandy, 2008].

Figure 22 interprets the example according to the usual representation. The voices are less subordinated to the lyrics than in the previous cases, and the ambiguity between one part and two parts is frequent. Yet again, considerable attention is paid to the setup of voices inside parts.

Figure 22. Parts and voices in [Brandy, 2008].

5. “What goes around… comes around” [Justin Timberlake, 2006]: unicity of the part is not clear: is it one part or two?

This example, transcribed in Video 16, also presents ambiguities. From bar 13 until the end, we may describe the voices as belonging to one part, as all content is homorhythmic. The ambiguity resides in bars 1 to 12. The reason why we may choose to declare two parts is the following: in the top stave, the content between bars 1 and 4 is stable, and so is the content in the bottom stave. Describing bars 1 to 4 as one part would involve several declarations of setup change — a process spared by the initial declaration of two parts. The same reasoning applies to bars 5 to 12, except for an extra declaration in bar 12, which may not be sufficient to make the description using two parts longer.

Video 16. Partial transcription of “What goes around… comes around” [Justin Timberlake, 2006].

Figure 23 interprets the example according to the usual representation. Between bars 1 and 12, an ambiguity exists as to whether we are dealing with one or two parts, but it seems more economical to describe the corresponding content as two parts. The voices are generally subordinated to the lyrics, with few exceptions. Again, considerable attention is paid to the setup of voices inside parts.

Figure 23. Parts and voices in [Justin Timberlake, 2006].

6. “Bad Girl” [Danity Kane, 2008]: two parts. One may be qualified as backing vocals.

In this example, transcribed in Video 17, the ambiguities are less prominent, and the content may more confidently be described as two parts. The bottom section involves a rapid sequence of vocal setups. The corresponding content is more economical to describe as one part, as (1) the setups are never simultaneous, and (2) it avoids the constant declaration of two parts. The top section can be grouped as one part that may be qualified as backing vocals. An ambiguity remains near the “D” sign.

Video 17. Partial transcription of “Bad Girl” [Danity Kane, 2008].

Figure 24 interprets the example according to the usual representation. In this case, we can declare two parts. The voices are generally subordinated to the lyrics, with few exceptions. Particular attention is paid to the setup of voices inside parts.

Figure 24. Parts and voices in [Danity Kane, 2008].

This was the last example from group 1, in which the songs feature an increasingly present second part.

7. “NDA” [Billie Eilish, 2021]: one part, three voices, only one voice with a constant clear pitch.

In this example, the grouping of the three voices in one single part is clear. The example is shown as an instance of the impossibility of applying [Baron, 1968] and [Batista Doni, 1635]’s definition of a part to Contemporary Popular Music: voices in parts can’t have a relation of unison/octave if they have no pitch! In [Billie Eilish, 2021], as shown in Video 18, only one voice can constantly be described using pitch. In this case, we may still group voices into parts using homorhythmy, and represent each voice on the Y-axis of the setup using the approximate register or perhaps an f0 estimation.

Video 18. Partial transcription of “NDA” [Billie Eilish, 2021].
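For voices without a stable perceived pitch, an f0 estimator can still supply the approximate register mentioned above. Below, a minimal sketch using librosa’s pYIN implementation; the file path is hypothetical, and taking the median voiced f0 is just one plausible register summary:

```python
import numpy as np
import librosa

# Load a vocal stem (hypothetical path) and estimate f0 with pYIN.
y, sr = librosa.load("vocal_stem.wav", sr=None, mono=True)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# A rough "register" for the Y-axis of the setup notation: the median f0
# over voiced frames (pyin returns NaN for unvoiced frames).
register_hz = np.nanmedian(f0)
```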

Figure 25 interprets the example according to the usual representation. The grouping of voices into one part is clear, even though no pitch can be identified for all voices. The voices are generally subordinated to the lyrics.

Figure 25. Parts and voices in [Billie Eilish, 2021].

8. “Press” [Cardi B, 2019]: one part, one to three voices, no clear pitch.

In this example, transcribed in Video 19, the voices have no clear pitch — they are reminiscent of extract 3 from this Medium post. In bars 1 through 14 (verse), two vocal setups alternate. In bars 15 to 18 (chorus), only one vocal setup is present.

Video 19. Partial transcription of “Press” [Cardi B, 2019].

Figure 26 interprets the example according to the usual representation. The grouping of voices into one part is clear, even though no pitch can be identified. The voices are subordinated to the lyrics.

Figure 26. Parts and voices in [Cardi B, 2019].

9. “Revenge” [XXXTentacion, 2017]: inversion of subordination between two parts — or perhaps they are voices, not parts.

This example, transcribed in Video 20, presents two ambiguities. (1) The voices in staves 1 and 2 are homorhythmic, so they should belong to the same part. Yet, it is also possible to declare them as two parts. (2) The same two parts do not have a fixed relation of subordination; the part that is dominant at first later becomes subordinated to the other.

(1) The part we’ll call (a) starts at bar 9, stave 2. The part we’ll call (b) starts at bar 17, stave 1. From bar 17, part (b) is homorhythmic to part (a). Yet, it may be possible to describe the voice in part (b) as an independent part. First reason: after bar 25, it stops being homorhythmic to part (a). Second reason: the voice in part (b) remains similar throughout the entire song. Declaring a new part at the moment it stops being homorhythmic would be less economical than declaring only one part throughout.

(2) Until bar 25, part (b) is subordinated to part (a). In terms of information, it would be less economical to express (a) from (b) than (b) from (a), as (a) is more constant than (b): expressing (a) from (b) would involve information that would in any case have been declared when describing (a) on its own. From bar 25, part (a) becomes subordinated to part (b). In terms of information, it now becomes less economical to express (b) from (a) than (a) from (b), as (a) is not so well-formed anymore. It takes simpler means to ‘blur’ data than to ‘de-blur’ it.

Video 20. Partial transcription of “Revenge” [XXXTentacion, 2017].

Figure 27 interprets the example according to the usual representation. The grouping of voices into one part is unclear, and the subordination between parts changes. The voices are subordinated to the lyrics.

Figure 27. Parts and voices in [XXXTentacion, 2017].

Notice how the representation we previously used in the case of Western Classical Music starts to be unsuited to describe the examples as they get more complex.

10. “Hustlers” [Curren$y, 2016]: identification of parts is subject to ambiguity.

Video 21 shows a partial transcription of [Curren$y, 2016]. The top stave represents a simplified version of a part made from (mumbled) voices, and the second stave shows a more traditional sung part.

Video 21. “Hustlers” [Curren$y, 2016]: partial transcription, simplified vocals.

Still, the mumbled voices may not be so mumbled after all. As shown in Figure 28, they exhibit a clear form at two different scales. At the largest scale, it is close to a period form (“ABAC”, see [Caplin, 2001] for more details), and at the smallest scale, the patterns follow an “ABCC”-type form.

Figure 28. Form in the “mumbled” part.

Upon closer examination, the “mumbled” part can be divided into three sub-groups (tracks 1 to 3), to which we can add elements from the following section (track 4), as transcribed in Video 22. Each of these tracks appears to be made from homorhythmic voices. There are, therefore, two hierarchical levels of decomposition in the vocals: they can first be split between the “mumbled” and the “sung” groups, and then the “mumbled” group can be split into three sub-groups.

Video 22. “Hustlers” [Curren$y, 2016], decomposition of the “mumbled” part.

The representation we used for Western Classical Music is insufficient to represent the vocals from this example. In Figure 29, we modify the representation to make it fit the example. Notice that there is no hierarchy between the two top-level parts, which is reminiscent of polyphony, even though that repertoire is very far from the current example.

Figure 29. Modified representation of the parts and voices in [Curren$y, 2016].

The complexity of this example is such that the representation we previously used is now unsuitable.

A better representation for better machine learning?

Let’s recall how the participants in the 2020 A.I. song contest view vocal tracks in Contemporary Popular music by inserting Figure 1 once more.

Repetition of Figure 1. “Modular musical building blocks” according to [Huang et al., 2020].

Now that we are better aware of actual practices in Contemporary Popular Music vocal production, we can see that the two visions, that of the A.I. music community and that of actual music production, are very far apart. The initial question of the post remains: how can we bridge the gap between the two practices?

Figure 30 shows that the vocals in 8 of the 13 tracks from the A.I. song contest were manually edited to make them closer to actual production practices (observations made from the tracks as found on the 2020 A.I. song contest website).

Figure 30. Analysis of the type of vocals found in the songs from the A.I. song contest 2020.

What would it mean to provide the means to train A.I. models so that they can generate realistic vocal content, without manual edits?

One way is to go “brute force” and train the model on audio, as OpenAI’s Jukebox does. In principle, if the model is well-trained, the output audio will emulate the training dataset’s properties. Indeed, parts with multiple voices can be heard in some of Jukebox’s renditions. Still, if we want to create technology that is useful to musicians in a production context, with, in particular, prototypes that can adapt to the musicians’ workflows (see [Deruty et al., 2022] for more on this topic), perhaps we need to take into account the way vocals are produced in Contemporary Popular Music, as explained in this post.

One problem we’ve met while considering increasingly complex examples is the growing inadequacy of the part/voice representation that was inspired by Western Classical Music. The very notion of grouping into parts becomes a problem, as it is the source of frequent ambiguities. An alternative representation would drop the notion of grouping and consider only the distance between every pair of single voices. From this perspective, the Western Classical Music-inspired representation would be replaced by the type of representation shown in Figure 31.

Figure 31. Two-dimensional projection of a small dataset of voices.

In this representation, the distance between the icons is a two-dimensional projection of a distance in perception. It may include pitch, timbre, stereo space… In the same representation, grouping voices into parts may be interpreted as clustering, which may be unnecessary for generation: we may only need to generate voices that are close to each other or far from each other, and clustering may be irrelevant.
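A minimal sketch of this alternative representation, assuming each voice is already summarised by an embedding vector (random placeholders below): we compute pairwise distances and a two-dimensional projection, and leave clustering into parts as an optional afterthought.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 128))  # 6 voices, hypothetical 128-d embeddings

# Pairwise distance matrix between voices (Euclidean, for illustration).
distances = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)

# Two-dimensional projection for a Figure 31-style display. Grouping into
# parts would be a clustering of these points, and may simply be skipped.
projection = PCA(n_components=2).fit_transform(embeddings)
print(distances.shape, projection.shape)  # (6, 6) (6, 2)
```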

One way to evaluate such a distance would be to use contrastive learning (see [Chen, 2020]) based on multi-track simultaneity. Figure 32 illustrates and sums up the corresponding process.

Figure 32. Multi-track datasets provide simultaneity for the evaluation of distance using contrastive learning.
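A sketch of what such training could look like, in the SimCLR spirit of [Chen, 2020]: embeddings of stems that play simultaneously in the same multi-track form positive pairs, while the other stems in the batch act as negatives. The encoder is omitted (random tensors stand in for its output), and the InfoNCE loss below is a standard formulation, not an implementation used at Sony CSL:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a and row i of z_b embed two simultaneous
    stems from the same multi-track (positive pair); all other rows in
    the batch serve as negatives."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z_a.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy batch: 8 positive pairs of 128-d embeddings from a hypothetical encoder.
z_a = torch.randn(8, 128, requires_grad=True)
z_b = torch.randn(8, 128)
loss = info_nce(z_a, z_b)
loss.backward()  # in a real setup, gradients would flow into the encoder
```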

Figure 33 shows relations between voices and the two-dimensional projection of a distance that the contrastive learning process may provide. The distances shown are between the centre voice and the other voices.

Figure 33. Two-dimensional projection of hypothetical distances learnt from contrastive learning.

Note that in musicological terms, such a distance may be interpreted as consonance/dissonance. Indeed, if one interpretation of consonance relates to the compatibility of two simultaneous notes [Helmholtz, 1877][Plomp and Levelt, 1965], another interpretation relates to the frequency with which two elements are used together: “Consonances can be defined as consonant because they are used predominantly” [Cogan, 1976][Tenney, 1988].
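Under this usage-based reading, a distance could even be approximated by counting how often two elements appear together across a corpus. A toy sketch, with invented labels rather than actual corpus statistics:

```python
from collections import Counter
from itertools import combinations

# Each (invented) song is the set of voice types heard together in it.
songs = [
    {"lead", "double"}, {"lead", "double", "backing"},
    {"lead", "backing"}, {"lead", "double"},
]

# Frequent pairs are the "consonant" ones in the sense of
# [Cogan, 1976][Tenney, 1988]: "used predominantly".
pair_counts = Counter()
for voices in songs:
    pair_counts.update(combinations(sorted(voices), 2))
print(pair_counts.most_common(3))
```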

Figure 34 illustrates speculative representations of the voices in eight of the examples mentioned above, involving hypothetical distances learnt from contrastive learning. From top to bottom, left to right: [Eminem, 2004], [Pink Floyd, 1979], [Muse, 2013], [Britney Spears, 2008], [Brandy, 2008], [Danity Kane, 2008], [XXXTentacion, 2017] and [Curren$y, 2016].

Figure 34. Speculative representations of voices for eight examples.

Notice, in Figure 34, the mentions “can be expressed from” and “can’t be expressed from”. If a voice can be expressed from another voice, then the latter can be used as conditioning for the former. As illustrated in Figure 35, a voice can be used as conditioning to create a double-tracked version. On the contrary, in the case of [Curren$y, 2016], there is no point in using the sung part as conditioning to generate the mumbled set of parts. If conditioning is required, it is better to learn the vocal tracks from the instrumental tracks, similarly to how BassNet [Grachten et al., 2020] generates bass tracks using conditioning from other tracks.

Figure 35. Vocals as conditioning to generate other vocals may be used in the case of double-tracking, but not when the vocals are too different from each other.
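The double-tracking case is simple enough that even a hand-made transform illustrates the idea of deriving one voice from another. Below, a naive signal-level sketch of ours, closer to vintage artificial double-tracking than to a learned model: a ‘second take’ is faked from an existing voice through micro-timing and level jitter, so the pair can be panned left/right.

```python
import numpy as np

def fake_double_track(x, sr, seed=0):
    """Naive stand-in for a conditional model: derive a 'second take'
    from an existing voice via a small random delay and level change,
    in the spirit of artificial double-tracking (not a learned model)."""
    rng = np.random.default_rng(seed)
    delay = int(sr * rng.uniform(5.0, 25.0) / 1000.0)  # 5-25 ms
    double = np.concatenate([np.zeros(delay), x])[: len(x)]
    return double * rng.uniform(0.8, 1.0)  # slight level difference

sr = 44100
voice = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s placeholder tone
left, right = voice, fake_double_track(voice, sr)  # pan these L/R
```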

Conclusion

Vocals in Contemporary Popular Music generally consist of more than one voice. They are elaborate sets of voices, whose combination is part of the musical form. On the other hand, vocals in the 2020 A.I. song contest derive from a single sonified melody.

One solution to bridge the gap between the two practices resides in understanding the vocal setups observed in Contemporary Popular Music. Analysis of such setups can be performed using concepts derived from Western Classical Music, yet these concepts become inadequate as the vocals get more complex.

In particular, grouping vocal tracks together brings ambiguities, whereas grouping may prove useless in the context of A.I.-based generation. These analysis practices might be advantageously replaced by the notion of distance between individual vocal tracks. Such a distance may be derived from contrastive learning based on simultaneity in multi-tracks.

Be wary of implicit cultural bias. The conception of music that derives from the 2020 A.I. song contest (bassline, harmony and melody) reflects Baroque music (with continuo [Williams and Ledbetter, 2001] and thoroughbass [Palisca, 2001]), not Contemporary Popular Music. Don’t confuse Billie Eilish or XXXTentacion with Telemann or Vivaldi!

Listen to the music! The final product, the multi-tracks, the datasets…

References

[Baron, 1968] Baron, John H. “Monody: A Study in Terminology.” The Musical Quarterly 54.4 (1968): 462–474.

[Batista Doni, 1635] Batista Doni, Giovanni, Compendio del Trattato de’ Generi e de’ Modi della Musica (Rome, 1635).

[Bent and Drabkin, 1987] Bent, Ian, and William Drabkin. The Norton/Grove Handbooks in Music: Analysis. W. W. Norton & Company, 1987.

[Bimbot et al., 2016] Bimbot, Frédéric, et al. “System & contrast: a polymorphous model of the inner organization of structural segments within music pieces.” Music Perception: An Interdisciplinary Journal 33.5 (2016): 631–661.

[Bohlman, 2013] Bohlman, Philip V. “Ethnomusicology”, Grove Music Online, Oxford Music Online, 2013.

[Boyd, 2001] Boyd, Malcolm, “Arrangement”, Grove Music Online, Oxford Music Online, 2001.

[Caplin, 2001] Caplin, William E. Classical form: A theory of formal functions for the instrumental music of Haydn, Mozart, and Beethoven. Oxford University Press, 2001.

[Chen, 2020] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” International conference on machine learning. PMLR, 2020.

[Cogan, 1976] Cogan, Robert, and Pozzi Escot. Sonic design: The nature of sound and music. Prentice Hall, 1976.

[Conklin, 2003] Conklin, Darrell. “Music generation from statistical models.” Proceedings of the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences. 2003.

[Deruty et al., 2022] Deruty, Emmanuel, Maarten Grachten, Stefan Lattner, Javier Nistal and Cyran Aouameur. “On the Development and Practice of AI Technology for Contemporary Popular Music Production.” Transactions of the International Society for Music Information Retrieval 5.1 (2022).

[Frobenius, 2001] Frobenius, Wolf, “Western polyphony”, Grove Music Online, Oxford Music Online, 2001.

[Grove Music Online, 2001a] “Texture”, Grove Music Online, Oxford Music Online, 2001.

[Grove Music Online, 2001b] “Track”, Grove Music Online, Oxford Music Online, 2001.

[Grünwald et al., 2005] Peter D. Grünwald, In Jae Myung and Mark A. Pitt, Advances in Minimum Description Length, M.I.T. Press, 2005.

[Helmholtz, 1877] Helmholtz, Hermann. 1954 [1877]. On the Sensations of Tone, translated by Alexander J. Ellis from the fourth German edition (1877). New York: Dover Publications.

[Hitchcock and Deaville, 2013] Hitchcock, H. Wiley, revised by James Deaville. “Musicology in the United States”. Grove Music Online, Oxford Music Online, 2013.

[Huang et al., 2020] Huang, Cheng-Zhi Anna, et al. “AI Song Contest: Human-AI Co-Creation in Songwriting.” arXiv preprint arXiv:2010.05388 (2020).

[Hyer, 2001] Hyer, Brian, “Homophony”, Grove Music Online, Oxford Music Online, 2001.

[Kircher, 1650] Kircher, Athanasius, Musurgia Universalis, 1650, pp. 315–16.

[Grachten et al., 2020] Grachten, Maarten, Stefan Lattner, and Emmanuel Deruty. “Bassnet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control.” Applied Sciences 10.18 (2020): 6627.

[Middleton and Manuel, 2001] Middleton, Richard, and Peter Manuel (2001). “Popular Music”. In Grove Music Online. Oxford University Press.

[Moore and Martin, 2018] Moore, Allan F., and Remy Martin. Rock: The Primary Text (Ashgate Popular and Folk Music Series), p. 43. Taylor and Francis, 2018. Kindle Edition.

[Palisca, 2001] Claude V. Palisca, “Baroque”, Grove Music Online, Oxford Music Online, 2001.

[Plomp and Levelt, 1965] Plomp, Reinier, and Willem Johannes Maria Levelt. “Tonal consonance and critical bandwidth.” The journal of the Acoustical Society of America 38.4 (1965): 548–560.

[Senior, 2011] Senior, Mike. Mixing secrets for the small studio. Taylor & Francis, 2011. Additional resources.

[Smith et al., 2001] Smith, N., Flotzinger, R., Reckow, F., & Roesner, E. “Organum”. Grove Music Online, Oxford Music Online, 2001.

[Tenney, 1988] Tenney, James. A History of ‘Consonance’ and ‘Dissonance’. Excelsior Music Publishing Company, 1988.

[Walther, 1732] Walther, J.G., Musikalisches Lexikon (Leipzig, 1732), pp. 138, 419.

[Williams and Ledbetter, 2001] Peter Williams and David Ledbetter, “Continuo”, Grove Music Online, Oxford Music Online, 2001.

References for music examples

[Bee Gees, 1977] Bee Gees, “Stayin’ Alive”, Saturday Night Fever, Polydor, 1977.

[Beyoncé, 2008] Beyoncé, “Single Ladies”, I am… Sasha Fierce, Columbia, 2008.

[Billie Eilish, 2021] Billie Eilish, “NDA”, Happier than ever, Interscope, 2021.

[Brandy, 2008] Brandy, “Human”, Human, Epic, 2008.

[Britney Spears, 2008] Britney Spears, “If you seek Amy”, Circus, Jive, 2008.

[Carcass, 1993] Carcass, “This Mortal Coil”, Heartwork, Earache, 1993.

[Cardi B, 2019] Cardi B, “Press” (single), Atlantic, 2019.

[Curren$y, 2016] Curren$y, “Hustlers”, Andretti 11/30, Jet Life Recordings, 2016.

[Dälek, 2007] Dälek, “Tarnished”, Abandoned Language, Ipecac, 2007.

[Danity Kane, 2008] Danity Kane, “Bad Girl ft. Missy Elliott”, Welcome to the Dollhouse, Atlantic, 2008.

[Divna Ljubojevic, 2008] Divna Ljubojevic, “In the Flesh Thou Didst Fall Asleep”, Orthodox Traditional Music, Phonofile Balkan, 2008 (adapted from The Exapostilarion of Pascha, ca. 400).

[Eminem, 2002] Eminem, “Cleanin’ out my Closet”, The Eminem Show, Interscope, 2002.

[Eminem, 2004] Eminem, “Mockingbird”, Encore, Interscope, 2004. Live in Madison Square Garden, New York City, August 2005.

[Händel, 1741] John Eliot Gardiner, Georg Friederich Händel, Messiah, 1741, Archiv Produktion, 1983.

[Jennifer Lopez, 2011] Jennifer Lopez, “On the Floor”, Love?, Island, 2011.

[Jim Jones, 2006] Jim Jones, “We Fly High”, Hustler’s P.O.M.E. (Product of My Environment), Diplomat, 2006.

[Justin Timberlake, 2006] Justin Timberlake, “What Goes Around… Comes Around”, FutureSex/LoveSounds, Jive, 2006.

[KC and the Sunshine Band, 1976] KC and the Sunshine Band, “I’m your Boogie Man”, Part 3, T.K., 1976.

[Kitka, 2003] Kitka, “Dontojno Jest”, Wintersongs, self-produced, 2003. Traditional Bulgarian liturgy (ca. 1000?).

[Lady Gaga, 2010] Lady Gaga, “Alejandro”, The Fame Monster, Interscope, 2010.

[Lorde, 2013] Lorde, “Royals”, Pure Heroine, Universal, 2013.

[Madonna, 2003] Madonna, “American Life”, American Life, Warner, 2003.

[Moby, 2002] Moby, “Extreme Ways”, 18, Mute, 2002.

[Mozart, 1786] Concerto Köln / Collegium Vocale Gent, W.A. Mozart, “Il Conte…”, Le Nozze di Figaro, 1786, Harmonia Mundi, 2004.

[Mozart, 1791] Diana Damrau, The Royal Opera, W.A. Mozart, “Der Hölle Rache”, Die Zauberflöte, 1791, Royal Opera House Fundraiser, 2017.

[Muse, 2013] Muse, “Madness”, Live at Rome Olympic Stadium, Warner, 2013.

[Nirvana, 1991] Nirvana, “Smells like Teen Spirit”, Nevermind, DGC, 1991.

[Pink Floyd, 1967] Pink Floyd, “See Emily Play”, single, Columbia, 1967.

[Pink Floyd, 1973] Pink Floyd, “Breathe”, The Dark Side of the Moon, Harvest, 1973.

[Pink Floyd, 1979] Pink Floyd, “Another Brick in the Wall pt. 1”, The Wall, Harvest, 1979.

[Schola of the Hofburgkapelle Vienna, 2001] Schola of the Hofburgkapelle Vienna, Gregorian Chants for the Church Year, Philips, 2001. Gregorian Chant notation from the Liber Usualis (1961), p. 1767.

[Schubert, 1827] Franz Schubert, Ständchen (“Zögernd, leise”), for alto, chorus & piano (“Notturno”), D. 920 (Op. posth. 135), 1827. Robert Shaw, “Schubert: Songs for Male Chorus”, Telarc Distributions, 1994.

[SIA, 2014] SIA, “Chandelier”, 1000 Forms of Fear, RCA, 2014.

[Simon & Garfunkel, 1968] Simon & Garfunkel, “Mrs Robinson”, Bookends, Columbia, 1968.

[Tallis, 1565] Thomas Tallis, “If ye love me”, 1565. The Cambridge Singers and John Rutter, Faire is the Heaven, Music of the English Church, Collegium Records, 1988.

[The Beatles, 1968] The Beatles, “Dear Prudence”, White Album, EMI, 1968.

[Uncanny Valley, 2020] Uncanny Valley, “Beautiful the World”. Winner, A.I. song contest 2020.

[XXXTentacion, 2017] XXXTentacion, “Revenge”, 17, Bad Vibes Forever, 2017.

Our team’s page: Sony CSL Music — expanding creativity with A.I.


Emmanuel Deruty

Researcher for the music team at Sony CSL Paris. We are a team working on the future of AI-assisted music production, located in Paris and Tokyo.