“Pitch” doesn’t necessarily mean “notes”. The example of Kanye West’s “On Sight”.

Emmanuel Deruty
Aug 4, 2022


It seems evident to almost everybody, music listeners, musicians and scholars alike, that the sensation of pitch in music is best represented by musical tones (notes). Yet musical tones are defined according to precise specifications, and they are not the only type of signal from which a sensation of pitch can derive (see, for instance, [Yost, 2009]). Are musical tones (notes) the only way, or at least the usual way, to produce pitch in music? We show that in Kanye West’s track “On Sight”, most of the elements from which an impression of pitch derives do not conform to the definition of the “musical tone” (note). If pitch is indeed routinely carried by signals other than musical tones (notes), then there may be implications for domains such as automatic music transcription, music generation and music analysis.

Preliminary definitions

In the present article, we use the following definitions.

Musical tone (note). As illustrated in Figure 1, an everyday example of a musical tone (a note) consists of a harmonic complex ([Yost, 2009], otherwise called Harmonic Complex Tone, see [Micheyl et al., 2006]), i.e. a fundamental and several harmonics, whose frequency values are stable for a minimum amount of time [Fyk, 1987]. The distance between harmonics is the same as the fundamental frequency, and the perceived pitch corresponds both to the fundamental frequency (f0) and the distance between harmonics.

Figure 1. Illustration of the definition of the musical tone (note).
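As a minimal illustration of this definition, the short Python sketch below synthesises a harmonic complex tone: a fundamental plus harmonics at exact integer multiples, all with stable frequencies over the note’s duration. The sample rate, fundamental, number of harmonics and 1/k amplitude decay are arbitrary choices for the example, not values taken from the track.

```python
import numpy as np

sr = 44100          # sample rate (Hz)
duration = 1.0      # seconds, well above the minimum duration needed to perceive pitch
f0 = 220.0          # fundamental frequency (Hz)
n_harmonics = 8     # arbitrary number of harmonics for the illustration

t = np.arange(int(sr * duration)) / sr
# Harmonics at exact integer multiples of f0, with a simple 1/k amplitude decay.
tone = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, n_harmonics + 1))
tone /= np.max(np.abs(tone))
# The spacing between adjacent harmonics equals f0, and the perceived pitch
# corresponds both to f0 and to that spacing.
```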

Pitch. An observed frequency (in the audible range) of a periodic phenomenon in the audio signal, which may give rise to a sensation of pitch. Under this definition, a pitch may not be clearly identifiable by listening alone, for instance if the observation concerns an interval of time that is too short (see [Fyk, 1987] for the minimum duration necessary to properly perceive pitch). Observations over short intervals of time typically occur in STFT analyses. In this interpretation, pitch may exist independently of any context. In that sense, the interpretation does not comply with the one given, for instance, by [Haynes and Cooke, 2001], for whom “it is only when [frequencies] are connected to pitch standards that they take on a musical dimension”.

Partial. A partial has been defined elsewhere as any of the sine waves (or “simple tones”, as the translator Ellis calls them when translating [Helmholtz, 1885]) of which a complex tone (or harmonic complex, or Harmonic Complex Tone) is composed. Note that in our interpretation, a partial can have a width, measurable on the frequency axis. The width can originate from the signal itself: the partial may be a filtered noise, in which case its width derives from the filter’s bandwidth. It can also originate from the observation: a Fourier transform performed on a short window associates a width with a sine wave’s representation in the spectral domain. In both cases (width from the signal’s nature, width from the observation), the partial’s central frequency may be measured in different ways. Throughout the present document, we simply use the local maximum as the partial’s frequency.
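To make the “local maximum” convention concrete, here is a small sketch, assuming NumPy, of how a partial’s central frequency can be read off a single Fourier-transform frame. The function name, window choice and frequency band are illustrative choices for the example, not the analysis pipeline behind the figures.

```python
import numpy as np

def partial_peak_frequency(frame, sr, f_lo, f_hi, n_fft=8192):
    """Central frequency of a partial, taken as the local maximum of the
    magnitude spectrum within the band [f_lo, f_hi] (in Hz)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[band][np.argmax(spectrum[band])]

# Example: a 40 ms frame of a 222 Hz sine; the short window gives the partial
# a width, but the local maximum still lands close to 222 Hz.
sr = 44100
t = np.arange(int(0.04 * sr)) / sr
print(partial_peak_frequency(np.sin(2 * np.pi * 222.0 * t), sr, 100.0, 400.0))
```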

Harmonic partial. A partial in a harmonic complex, for which the observed frequency value is a multiple of the fundamental.

Inharmonic partial. A partial for which the observed frequency value is not a multiple of a fundamental.

Inharmonic complex. A structure of partials that may be approximated as a harmonic complex, except that the partials are not integer multiples of a fundamental. An inharmonic complex may still result in the perception of a single pitch. Specifically, according to [Yost, 2009], “if the harmonics are within 8% of an integer multiple of the fundamental, the harmonics are fused as part of the complex sound whose spectral structure may be used to account for the sound’s pitch”.
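The 8% criterion quoted from [Yost, 2009] can be turned into a simple check. The sketch below is a rough interpretation of that criterion rather than a reproduction of Yost’s procedure: it flags which measured partials lie close enough to an integer multiple of a candidate fundamental to fuse into a single pitch percept.

```python
import numpy as np

def fuses_with_f0(partial_freqs, f0, tolerance=0.08):
    """True for each partial lying within `tolerance` (8%, after [Yost, 2009])
    of an integer multiple of f0, i.e. likely to fuse into a single pitch."""
    partial_freqs = np.asarray(partial_freqs, dtype=float)
    nearest = np.maximum(np.round(partial_freqs / f0), 1.0) * f0
    return np.abs(partial_freqs - nearest) / nearest <= tolerance

# Hypothetical peaks: the first two are close to multiples of 220 Hz, the last is not.
print(fuses_with_f0([222.0, 441.0, 720.0], f0=220.0))  # [ True  True False]
```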

“On Sight”, analysis

Released in 2013, Yeezus is Kanye West’s sixth studio album. Beyond Kanye West himself, writers and producers for Yeezus’ opening track, “On Sight”, include Guy-Manuel de Homem-Christo and Thomas Bangalter, who form the duo Daft Punk. “On Sight” is, therefore, no obscure experimental music: it is widely marketed music from world-famous musicians.

Kanye West’s “Yeezus” cover.

In this article, we use spectral analysis of four extracts to show that most of the elements in “On Sight” that prompt a sensation of pitch do not conform to the definition of the musical tone. These elements depart from that definition in four ways:

(P1) Frequency values are unstable.
(P2) The sounds are generally inharmonic.
(P3) In some cases, higher formants are loud enough to mask the fundamental.
(P4) In other cases, a single (in)harmonic complex can carry more than one single pitch.

Extract 1. Synthesizer pattern, 122 to 123s, present throughout the song.

Video 1 shows the STFT for a melodic pattern that can be heard throughout the song. The pattern appears to be built from three elements: element 1, from frame 4 to frame 21; element 2, from frame 25 to frame 34; element 3, from frame 38 to frame 45. Each of the three elements can be described as a superimposition of partials that follow a parallel evolution over time.

In Video 1, the lowest blue line shows the peak frequency for the partial with the most energy (the fundamental). The blue lines above it show the multiples of the fundamental’s frequency. The irregular evolution of the peak frequencies around frames 13, 29 and 42 originates from the splitting of the fundamental into two formants near those frames.

Video 1. Extract 1, synthesizer pattern, 122 to 123s, present throughout the song, STFT.
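For readers who want to reproduce this kind of display, here is a minimal sketch of frame-by-frame peak tracking over an STFT, using librosa. The file name, FFT size, hop size and search band are assumptions for the example; the exact analysis parameters behind the videos are not specified here.

```python
import numpy as np
import librosa

# Hypothetical file name; any mono excerpt of the pattern will do.
y, sr = librosa.load("on_sight_extract1.wav", sr=None, mono=True)

n_fft, hop = 4096, 512
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

# Track the peak of the lowest strong partial (treated here as the fundamental)
# in a fixed search band; the band limits are guesses, not values from the article.
band = (freqs >= 80) & (freqs <= 400)
f0_track = freqs[band][np.argmax(S[band, :], axis=0)]
# f0_track holds one peak frequency per frame, analogous to the lowest blue
# line in Video 1; its frame-to-frame variation illustrates (P1).
```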

The black line in Figure 3 represents the Fourier transform for frame 7. The leftmost vertical blue line is set at the peak frequency of the partial with the most energy (the fundamental). The other vertical blue lines show the multiples of the fundamental. The progressive offset between the multiples of the lowest partial’s frequency and the signal’s actual partials shows that the signal is inharmonic (P2).

Figure 3. Fourier transform for extract 1’s frame 7.
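The “progressive offset” can be quantified: for each measured partial, compute its deviation, in cents, from the nearest integer multiple of the lowest partial’s frequency. The sketch below assumes NumPy; the input peak values are hypothetical, not measurements from Figure 3.

```python
import numpy as np

def offsets_in_cents(partial_freqs, f_lowest):
    """Offset, in cents, of each measured partial from the nearest integer
    multiple of the lowest partial's frequency."""
    partial_freqs = np.asarray(partial_freqs, dtype=float)
    k = np.maximum(np.round(partial_freqs / f_lowest), 1.0)
    return 1200.0 * np.log2(partial_freqs / (k * f_lowest))

# Hypothetical peak values: the drift away from 0 cents grows with partial
# number, the signature of an inharmonic signal (P2).
print(offsets_in_cents([110.0, 222.0, 336.0, 452.0], f_lowest=110.0))
```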

Figure 4 focuses on the splitting of the fundamental into two formants that occurs around frames 13, 29 and 42. The black line represents the Fourier transform for frame 13. The leftmost vertical blue line is set at the peak frequency of the first significant partial, and the other vertical blue lines show the multiples of this frequency. The leftmost vertical red line is set at the peak frequency of the second significant partial, and the other vertical red lines show the multiples of this frequency. The remaining significant partials are not multiples of either frequency (P2).

Figure 4. Fourier transform for extract 1’s frame 13.

Video 2 shows the output of three different pitch analysis methods for the pattern shown in Video 1. The blue line shows the frequency for the fundamental. The red line shows the distance between the partials. Both values vary over time (P1).

The black line shows CREPE’s analysis output [Kim et al., 2018]. The line’s width and luminosity follow the output’s confidence. A wide black line indicates high confidence, a thin grey line indicates low confidence. Pitch as predicted by this deep convolutional network-based pitch-tracking algorithm is variable (P1).

Video 2. Extract 1, synthesizer pattern, 122 to 123s, output of pitch detection.
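CREPE is available as a Python package, so the black line’s data can be reproduced along the following lines. The file name is a placeholder for the extract; the default 10 ms step size and the Viterbi smoothing flag are reasonable settings, not necessarily those used for the video.

```python
import crepe
from scipy.io import wavfile

# "extract1.wav" is a placeholder for a mono excerpt of the pattern.
sr, audio = wavfile.read("extract1.wav")

# One pitch and confidence value every 10 ms (CREPE's default step size);
# viterbi=True smooths the trajectory over time.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# `frequency` is the estimated pitch in Hz; `confidence` is the voicing
# confidence that the black line's width and luminosity follow in Video 2.
```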

This pattern conveys a strong sensation of pitch. Yet, is it practical to describe it as a sequence of musical notes?

No. The elements in this pattern exhibit the following properties.
(P1) Frequency values are unstable.
(P2) The sounds are generally inharmonic.

Because of (P2), pitch identification in a single frame is difficult. While it is possible to associate a pitch with an inharmonic complex [Yost, 2009, p. 1706], the pitch of inharmonic sounds can be ambiguous [Schneider, 2000].

Because of (P1), identification of a musical tone (a note) for each of the three elements is difficult. The ambitus of the pitch evolution over time is ca. one tone for elements 1 and 2, and ca. one semitone for element 3. Such an ambitus may be too large to be qualified as expressivity in the sense of [Kirke and Miranda, 2009].

Extract 2. Synthesizer pattern, 24 to 25s.

Video 3, top, shows the STFT for a sequence of four elements. As in Video 1, the lowest blue line shows the peak frequency for the partial with the most energy (the fundamental), and the blue lines above it show the multiples of the fundamental’s frequency.

Video 3, bottom, shows the output of three different pitch analysis methods for the four-element sequence. The blue line displays the evolution of the peak of the fundamental, and the black line displays the evolution of the distance between the partials. Observation of these two lines suggests, as was the case with the first extract, that frequency values vary considerably (P1) and that the sound is inharmonic (P2).

A third difference between the elements in extract 2 and musical tones (notes) can be observed between frames 15 and 25. In the top part of Video 3, one partial is visible below the one that was identified as the fundamental, along with partials between the multiples of this fundamental. These partials are displayed less clearly than those corresponding to the identified fundamental, which indicates lower energy values. Careful listening reveals an ambiguity in pitch: the perceived octave is uncertain, and two octaves can be heard at the same time (P4).

Video 3. Extract 2, synthesizer pattern, 24 to 25s, STFT and output of pitch detection.

The ambiguity is achieved as shown in Figure 5: odd harmonics (including the fundamental f0) are louder than even harmonics. The harmonic complex can therefore be interpreted and heard as two pitch values, one being one octave above the other. The phenomenon is also visible in Video 3, top, and referred to in Video 3, bottom, with the dotted line showing the alternative perceived pitch. See [Deruty and Grachten, 2022] for more on (in)harmonic compounds that carry more than one pitch.

Figure 5. Mechanism of pitch ambiguity, extract 2, frames 15 to 25.
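A minimal synthesis sketch of this mechanism, assuming the description above: build a complex in which the odd harmonics (including f0) are much louder than the even ones, so that the prominent partials are spaced 2·f0 apart while f0 itself remains present. Amplitude values are illustrative, not measured from the track.

```python
import numpy as np

sr, duration, f0 = 44100, 1.0, 110.0
t = np.arange(int(sr * duration)) / sr

signal = np.zeros_like(t)
for k in range(1, 11):
    amp = 1.0 if k % 2 == 1 else 0.05   # odd harmonics (incl. f0) loud, even harmonics quiet
    signal += amp * np.sin(2 * np.pi * k * f0 * t) / k
signal /= np.max(np.abs(signal))

# The prominent partials sit at f0, 3*f0, 5*f0, ..., so a spacing-based pitch
# estimate reads 2*f0 (one octave up) while f0 itself is still present:
# two octaves can be heard at the same time (P4).
```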

This pattern conveys a strong sensation of pitch. Yet, is it practical to describe it as a sequence of musical notes?

No. The elements in this pattern exhibit the following properties.
(P1) Frequency values are unstable.
(P2) The sounds are generally inharmonic.
(P4) A single (in)harmonic complex can carry more than one single pitch.

Because of (P2), pitch identification in a single frame is difficult.

Because of (P4), pitch identification can be ambiguous.

Because of (P1), identification of a musical tone (a note) for each one of the four elements is difficult. The ambitus of pitch evolution over time for the four elements depends on the pitch detection method, yet it is always one semitone or more.

Extract 3. Lead vocals, 40 to 42s.

Video 4 shows the f0 (blue) and the distance between the harmonics (red) for the lead vocals. The vocals convey an impression of pitch, yet frequency values are highly unstable (P1): between frames 59 and 70, for instance, the pitch varies by more than 3 semitones in a fraction of a second. A comparison between the f0 and the distance between the harmonics also shows that the vocals are inharmonic (P2).

Video 4. Extract 3, lead vocals, 40 to 42s, output of pitch detection.

The lead vocals convey a strong sensation of pitch. Yet, is it practical to describe them as a sequence of musical notes?

No. The elements in this pattern exhibit the following properties.
(P1) Frequency values are unstable.
(P2) The sounds are generally inharmonic.

Because of (P2), pitch identification in a single frame is difficult. Because of (P1), transcription of the vocals to a sequence of notes is almost impossible.

Generally, rap vocals convey an impression of pitch while not following a discrete scale. In that sense, rap vocals are reminiscent of Sprechgesang. Measuring pitch (if it exists) in rap vocals poses the same problems as pitch detection in spoken voice, which has been a well-studied problem for several decades (see for instance [Rabiner, 1977]). Is it so difficult, then, to accept that rap vocals may have structure even though they cannot be reduced to sequences of musical tones (notes)? Fortunately, recent publications such as [Komaniecki, 2020] have started to propose formal analyses of rap vocals, even though transcription to musical tones (notes) is difficult at best, and perhaps inadequate.

Extract 4. Synthesizer part, 5.7 to 11.4s.

Video 5, top, shows the STFT for extract 4. The sound is built around an inharmonic complex with a stable fundamental at 54 Hz. Video 5, bottom, shows the inharmonicity (P2): the distance between the harmonics (black line) sits a bit less than a semitone above the fundamental (horizontal blue line).

Extract 4 may initially sound like noise, yet it does induce a sensation of pitch. There are two differences between extract 4 and the first three extracts: (1) the ambitus of the variations in the frequencies that correspond to the sensation of pitch is greater, and (2) the source of the sensation of pitch is different. The sensation of pitch stemming from this extract corresponds to frequencies much higher than 54 Hz. It is carried by very loud partials that vary over time (P1), represented in the top part of the video by the top blue line. These partials are likely amplified by a sliding resonant filter. The amplified partials mask the fundamental frequency (P3).

Video 5. Extract 4, STFT and inharmonicity.
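As an illustration of how a sliding resonance can carry the pitch while masking the fundamental, here is a rough sketch using SciPy: an inharmonic complex on a 54 Hz fundamental is passed through a narrow peaking filter whose centre frequency sweeps upward, block by block. The partial spacing, sweep range, Q and gain are guesses for the example, not measurements from the track.

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

sr, duration, f0 = 44100, 2.0, 54.0
t = np.arange(int(sr * duration)) / sr

# Inharmonic complex: stable 54 Hz fundamental, upper partials spaced slightly
# wider than f0 (hypothetical spacing), so they are not multiples of f0 (P2).
spacing = f0 * 1.05
x = sum(np.sin(2 * np.pi * (f0 + k * spacing) * t) / (k + 1) for k in range(40))

# Sweep a narrow resonance upward, block by block (crude, no filter state
# carried between blocks). The boosted partials become much louder than the
# fundamental (P3), and the frequencies they emphasise change over time (P1).
block = 2048
centres = np.linspace(400.0, 2000.0, num=len(x) // block + 1)
y = np.zeros_like(x)
for i in range(0, len(x), block):
    b, a = iirpeak(centres[i // block], Q=30.0, fs=sr)
    blk = x[i:i + block]
    y[i:i + block] = blk + 12.0 * lfilter(b, a, blk)
y /= np.max(np.abs(y))
```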

The extract does convey a sensation of pitch. Yet, is it practical to describe it as a sequence of musical notes?

No.
(P1) Frequency values are unstable.
(P3) Higher formants are loud enough to mask the fundamental.

Note that since the complex on which the sound is built is inharmonic (P2), the pitch stemming from the amplified partials does not follow an identifiable scale. Had the complex been harmonic, the amplified partials would have followed a scale based on the fundamental’s natural harmonics.

Is extract 4 fundamentally different from the first three extracts?

No, it is still built upon an inharmonic complex, which does not contain more noise than the first three examples. The sensation of pitch comes from partials other than the fundamental; extract 2 also features an element for which the sensation of pitch derives from partials that may not be the fundamental (frames 15 to 25, octave ambiguity). A key difference is that the ambitus of the variations in the frequencies that correspond to the sensation of pitch is greater.

Conclusion: implications

The elements in the four extracts above from Kanye West’s “On Sight” do convey a sensation of pitch. Yet in none of them, vocal or instrumental, do the elements on which pitch is based conform to the definition of musical tones (notes). Pitch is never stable (P1), sounds are generally inharmonic (P2), the impression of pitch may come from high partials in a(n) (in)harmonic complex (P3), and a single (in)harmonic complex may carry more than one pitch value (P4). For this track, describing the “pitch” dimension of the signal using musical tones (notes) is inadequate.

Admittedly, the above analyses were performed on a single track. It is therefore not possible to generalise these observations: analyses of many more tracks are needed to understand whether, for instance, the impression of pitch in contemporary Popular Music (in the sense of [Deruty et al., 2022]) is generally derived from musical tones (notes) or from other types of signal. Yet the subject deserves consideration, if only because “On Sight” is mainstream music, by no means obscure experimental music. If indeed, in at least some types of music, the musical tone (note) is not the primary object from which the impression of pitch derives, then several fields may be affected.

(1) In the field of Music Information Retrieval, the goal of Automatic Music Transcription is to produce musical tones (notes) from audio (see for instance [Benetos et al., 2013]). If the note is not an adequate representation of the sensation of pitch, or if it is not the only one, then the problem of Automatic Music Transcription should perhaps be reconsidered: before attempting to write notes from audio, we should first make sure that doing so is possible.

(2) An important focus of music generation, using for instance Deep Learning, is the generation of symbolic content, scores in particular (see for instance [Briot et al., 2017] and [Yang, 2017]). If notes are not an adequate representation of pitch, or if they are not the only one, then the problem of music generation involving pitch should perhaps be reconsidered: do we really want to generate note sequences in the general case?

(3) Most music analyses are performed on musical scores. The practice may be well suited to Western classical music, in which scores are the primary medium of transmission. But if the musical tone (note) is not an adequate representation of pitch in other music genres, or if it is not the only one, then analysis performed on a score transcribed from music for which there is originally no score (most contemporary Popular Music, for instance) may misunderstand the music being studied.

Conclusion: supposed universality of musical tones (notes)

In Western classical music, the score is the main medium of cultural transmission. Accordingly, the use of musical tones (notes) to describe pitch in Western classical music may be acceptable. What may be less acceptable is the common assumption that musical tones (notes) are universal beyond Western classical music. For instance, [Benetos et al., 2013] consider that Automatic Music Transcription (to notes) can be used in the case of “non-Western music”, as performed by [Nesbit et al., 2004] in the case of Australian Aboriginal music. Yet we can see that even in a mainstream contemporary Popular Music track, notes may not be the right representation to describe pitch.

In the field of music generation, this assumption of universality for musical tones (notes) underlies the following sentence by [Briot et al., 2017]: “[w]e believe that the essence of music (as opposed to sound) is in the compositional process, which is exposed via symbolic representations (like musical scores or lead sheets) and is subject to analysis (e.g. harmonic analysis)”. Yet if notes are not always the right representation to describe pitch, then equating “music” with “notes” as included in musical scores may be an over-generalisation from Western classical music.

According to [Yost, 2009], “[p]itch may be the most important perceptual feature of sound. Music without pitch would be drumbeats, speech without pitch processing would be whispers, and identifying sound sources without using pitch would be severely limited”. This sounds like a reasonable assumption. There may be universality in pitch, but perhaps we should be careful not to confuse the universality of the sensation of pitch with a supposed universality of notes.

References

[Schneider, 2000] Schneider, Albrecht. “Inharmonic Sounds: Implications as to «Pitch», «Timbre» and «Consonance».” Journal of New Music Research 29.4 (2000): 275–301.

[Benetos et al., 2013] Benetos, Emmanouil, et al. “Automatic music transcription: challenges and future directions.” Journal of Intelligent Information Systems 41.3 (2013): 407–434.

[Briot et al., 2017] Briot, Jean-Pierre, Gaëtan Hadjeres, and François-David Pachet. “Deep learning techniques for music generation — a survey.” arXiv preprint arXiv:1709.01620 (2017).

[Deruty et al., 2022] Deruty, Emmanuel, Maarten Grachten, Stefan Lattner, Javier Nistal and Cyran Aouameur. “On the Development and Practice of AI Technology for Contemporary Popular Music Production.” Transactions of the International Society for Music Information Retrieval 5.1 (2022).

[Deruty and Grachten, 2022] Deruty, Emmanuel and Maarten Grachten, “Melatonin”: A Case Study on AI-induced Musical Style. 3rd Conference on AI Music Creativity, September 13–15 2022.

[Fyk, 1987] Fyk, Janina. “Duration of tones required for satisfactory precision of pitch matching.” Bulletin of the Council for Research in Music Education (1987): 38–44.

[Haynes and Cooke, 2001] Haynes, Bruce, and Peter Cooke. “Pitch.” Grove Music Online. Oxford University Press. Date of publication, 20 Jan. 2001. Date of access 3 Aug. 2022.

[Helmholtz, 1885] Helmholtz, Hermann von. On the Sensations of Tone as a Physiological Basis for the Theory of Music. Translated by Alexander John Ellis. 2nd ed. Longmans, Green, 1885.

[Kim et al., 2018] Kim, Jong Wook, et al. “Crepe: A convolutional representation for pitch estimation.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

[Kirke and Miranda, 2009] Kirke, Alexis, and Eduardo Reck Miranda. “A survey of computer systems for expressive music performance.” ACM Computing Surveys (CSUR) 42.1 (2009): 1–41.

[Komaniecki, 2020] Komaniecki, Robert. “Vocal Pitch in Rap Flow.” Integral 34 (2020): 25–45.

[Micheyl et al., 2006] Micheyl, Christophe, Joshua GW Bernstein, and Andrew J. Oxenham. “Detection and F0 discrimination of harmonic complex tones in the presence of competing tones or noise.” The Journal of the Acoustical Society of America 120.3 (2006): 1493–1505.

[Nesbit et al., 2004] Nesbit, Andrew, Lloyd CL Hollenberg, and Anthony Senyard. “Towards Automatic Transcription of Australian Aboriginal Music.” ISMIR. 2004.

[Rabiner, 1977] Rabiner, Lawrence. “On the use of autocorrelation analysis for pitch detection.” IEEE Transactions on Acoustics, Speech, and Signal Processing 25.1 (1977): 24–33.

[Yang, 2017] Yang, Li-Chia, Szu-Yu Chou, and Yi-Hsuan Yang. “MidiNet: A convolutional generative adversarial network for symbolic-domain music generation.” arXiv preprint arXiv:1703.10847 (2017).

[Yost, 2009] Yost, William A. “Pitch perception.” Attention, Perception, & Psychophysics 71.8 (2009): 1701–1715.

Our team’s page: Sony CSL Music — expanding creativity with A.I.


Emmanuel Deruty

Researcher for the music team at Sony CSL Paris. We are a team working on the future of AI-assisted music production, located in Paris and Tokyo.