Music AI: Loop-in-the-Human

Jay Hardesty
Feb 21, 2018 · 14 min read


Guy walks into a computer lab and asks “Do you know your server is down?” Data scientist replies “No, but if you hum a few million bars I’ll try to fake it.”

By now, various companies offer automated music composition. For instance, Jukedeck does so in part by using machine learning to generate original pieces based on large sets of musical examples. Google’s Magenta project emphasizes data-driven “learning by example” over rule-based approaches to music creation. These and numerous other projects have demonstrated impressive, human-sounding musical results.

Rather than create entire compositions, my own project focuses on individual musical parts (say, lead or bass lines within an existing composition). I want to create human-sounding music at a local scale, based on small numbers of examples. That’s because, for me, the urge to engage with familiar music is usually stronger than the desire to hear something completely new. This means injecting variation into music that’s already in your head.

To this end, I want to hook into a real-time, subjective aspect of music, but at a generic, low level. In short, I want to put real-time handles on looping rhythmic/melodic patterns within existing music.

Coherence as prediction

I buy into the view that music is largely about nothing except what it is like to hear (or play) it.

So what is a piece of music like? For one thing, if the music has a beat, you’re caught in cycles of anticipation versus outcome, as the beat keeps coming around at different levels. Making sense in this context means forming subconscious predictions in response to the outcomes of other predictions.

These nested expectations evoke hierarchical structure, giving meaning to the surface notes and vice-versa. (The psychology of musical prediction was first explored in Leonard Meyer’s 1956 book Emotion and Meaning in Music.)

Harnessing innate musical sense

Elizabeth Margulis points out in her book On Repeat: How Music Plays the Mind that you typically recall any familiar melody only by running it through your head. This implies that music isn’t something you factually remember so much as something you regenerate in order to remember. Maybe it’s a matter of re-triggering hierarchies of nested predictions.

To some degree everyone is already a musical performer; the sense of surface-versus-structure that is familiar to musicians also exists within listeners, more than they probably realize. The difference is that musicians learn to manipulate that structure in order to create new surfaces (just as most people do with speech). That is what the building blocks described below aim to partially streamline.

Deepening repetition

If you like a piece of music you’ll typically listen to it many times, revisiting an increasingly familiar inner landscape. Much music has passages composed of recurring elements such as melodies, riffs, bass lines, and beats, and this is a level at which material could vary and morph without losing all identity and context.

Coord music-morphing app, morphing between chosen rhythms/melodies in real time.

But while almost everyone understands music, only practiced musicians typically manipulate it at the note level. Here I’ll describe an attempt to narrow the gap between making sense of music and improvising music, doing so via algorithms that play a coprocessor role, augmenting human musicality rather than replacing it.

Those algorithms depend on building blocks of rhythmic coherence. The building blocks are not hand-crafted constructs; they are rhythmic patterns that result from generative operations. Number theory provides a crystallization of those operations, making immediate comparison and manipulation of actual rhythms (and by extension, melodies) intuitive and computationally efficient.

(Pitches are handled as rhythmic strata within this approach, though I’ll leave that out here for the sake of brevity, such as it is.)

Coprocessor, not automaton

I want to emphasize that this approach doesn’t “compose music” wholesale, like the projects cited at the top. Rather it takes on, in real time, recursive rhythmic calculations that musicians (likely) perform subconsciously only after years of practice. Organizing notes hierarchically in time removes a significant hurdle to note-level music improvisation.

This algorithmic intervention takes place in strictly local fashion, leaving the rest of the composition and production in human hands, or under the control of other processes. In the latter case, the building blocks might offer meaningfully factored note data to other algorithms that are in play.

Music with moving parts

A dream scenario would be some music distribution format where recording artists unfreeze certain parts (say, bass or synth lines), enabling variation during playback. But for now, the algorithms discussed here are embodied in custom macOS/iOS apps that control Ableton Live sets. You steer variations on selected tracks while the rest of the mix continues to loop undisturbed.

Coord linked to Ableton Live via OSC and Max-for-Live

As a listener this means I can listen to a track I like for an extended time without tedium setting in; I can pivot into melodies and beats that are recognizably related in non-obvious ways. As a (decidedly amateur) composer I can explore ideas on the fly (perhaps overcome writer’s block) and create pieces that could evolve later, even morphing with other pieces.

Swift-based macOS app that implements the building blocks and algorithms

Generative analysis and composition

The rest of this article will first give an informal overview of the intersection between music theory and number theory at work in this approach. Then I’ll describe some practical applications.

Most technical details will be glossed over to some degree. Full details can be found in the paper “A self-similar map of rhythmic components” (currently free to access via this page). Links to related conference papers, with detailed algorithms, are here.

Coming to terms

Familiarity with the particular usage here of a few musical terms is helpful:

  • Meter refers to nested pulses formed by recursive subdivision of the time line. That is, there are two half notes per whole note, two quarter notes per half note, and so on, with the pulses at each metrical level alternating between weak and strong beats. (Only binary subdivision is in scope here, not expressive timing, triplets, or triple meters, etc.)
  • An anticipation is a note that occurs on a weak beat at some metrical level. It “anticipates” the stronger beat that will immediately follow at that level.
  • Syncopation occurs whenever anticipations are not followed by the anticipated notes (on the subsequent beats).
  • Loops here mean repeated patterns that are one, two, or four bars of 4/4 long. (Loops are ubiquitous in EDM but are also fairly common in many other genres.)
  • Variation as (already) used in this discussion refers not only to formal theme and variations, but more broadly to any melody or beat that can recognizably take the place of another (even in a sort of musical opposition).
Looping anticipations.
The above anticipations plotted as beat strength versus time step.

On the math side, binomial coefficients, Pascal’s triangle, and the Sierpinski gasket will appear in connection with the formation of rhythmic patterns.

Co-processing versus pre-computation

Attempting to manage complexity in music creation is not new. Take the piano keyboard; a great deal of pre-computation is embedded into an arrangement that works out log/linear and ratio/proximity pitch relationships at two levels (diatonic and pentatonic). Almost none of that is available to, say, a violin player, so it’s no surprise that a piano player can more easily play polyphonic and harmonic textures.

Rhythm is also built on log/linear and ratio/proximity relationships, given the underlying meter’s recursive structure. But these can’t be made ready-to-hand by a static layout because the time axis itself is in play. Computers seem an obvious tool for getting handles on musical time, assuming that the points in time can be grouped in as meaningful a fashion as pitches are on the piano. That is what the building blocks under discussion aim to do.

Algorithmic music analysis and generation

Here I’ll briefly run through how and why the rhythmic building blocks are generated. First I’ll describe what they encapsulate.

Anticipation and repetition

David Huron observed in his 2006 book Sweet Anticipation: Music and the Psychology of Expectation that when a note occurs on a weak beat (at some metrical level) you’ll tend to expect a note on the subsequent strong beat. This is simply the gravitational pull of musical meter.

That anticipated note may or may not occur. But in any case, if the pattern is repeated, you’ll expect whatever happened to keep happening, even if that (somewhat paradoxically) means “expecting” surprise. (Fred Lerdahl and Ray Jackendoff noted the link between parallelism and coherence in their 1983 book A Generative Theory of Tonal Music.)

Nested levels of potential anticipation, with the downbeat in the final position.

And so, two psychological constraints will be the organizing principles for all that follows.

  1. A note that falls on a weak beat raises the expectation of a following note on the strong beat.
  2. The outcome of the expectation defined in (1) raises expectation that such outcomes will recur.

In other words, anticipation and repetition generate predictions about, and because of, each other. This is where rhythmic coherence is bootstrapped.

Syncopation and elaboration

As sketched earlier, the constraints operate at multiple levels simultaneously, forming hierarchies that characterize rhythms. Those hierarchies are our building blocks; we can get the whole set by formulating the above psychological constraints as generative operations, where each building block is derived from a simpler one.

Take as the first building block a looping rhythm that consists of a single note attack on the first beat. Each of the other building blocks is derived by recursively choosing exactly one of the following options at each metrical level:

  1. Syncopate by shifting all attacks one beat earlier in time.
  2. Elaborate by combining the above syncopation with the original attacks.
  3. Do nothing.

The result is a tree of building blocks that together account for every evolutionary outcome of syncopation or elaboration operations.

Left: looping syncopation at the quarter note level. Center: looping elaboration at the same level. Right: Elaboration at the quarter note level combined with syncopation at the 8th note level.
All possible building blocks for two metrical levels. (That is, all combinations of elaboration/syncopation/neither at the quarter note and 8th note levels.)
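To make the two operations concrete, here is a minimal Swift sketch, assuming a loop of 2ⁿ steps with level 0 as the finest metrical level (so a beat at level i lasts 2ⁱ steps). The names and representation are mine, not the papers’; the ternary coding used as dictionary keys is explained in the next two sections.

```swift
// An attack pattern over one loop, stored as a set of step indices.
typealias Pattern = Set<Int>

let levels = 3            // metrical levels in scope
let steps = 1 << levels   // 8 steps per loop

// Syncopate: shift every attack one beat earlier at the given level.
func syncopate(_ p: Pattern, level: Int) -> Pattern {
    let beat = 1 << level
    return Pattern(p.map { (($0 - beat) % steps + steps) % steps })
}

// Elaborate: keep the original attacks and add the syncopated copy.
func elaborate(_ p: Pattern, level: Int) -> Pattern {
    p.union(syncopate(p, level: level))
}

// Enumerate all 3^n building blocks, keyed by a ternary code whose
// digit at each level is 0 (do nothing), 1 (elaborate), or 2 (syncopate).
func allBuildingBlocks() -> [Int: Pattern] {
    var count = 1
    for _ in 0..<levels { count *= 3 }
    var blocks: [Int: Pattern] = [:]
    for code in 0..<count {
        var p: Pattern = [0]   // seed: a single attack on the downbeat
        var digits = code
        for level in 0..<levels {
            switch digits % 3 {
            case 1: p = elaborate(p, level: level)
            case 2: p = syncopate(p, level: level)
            default: break
            }
            digits /= 3
        }
        blocks[code] = p
    }
    return blocks
}
```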

Elaboration mapped to Pascal’s triangle

Something surprising (well, to me) happens when you set about encoding combinations of the above operations. Say you have three metrical levels, and you generate a building block by applying the elaboration operation at the first and third levels, encoded as the vector 101 (that is, a binary number with three bits, one for each metrical level, containing a 1 at each level where elaboration took place).

The rhythm evolves like this:

Elaboration at half note, then 8th note levels has the same result as vice-versa.

Now consider Pascal’s triangle, an arrangement of binomial coefficients, in particular the odd coefficients (in bold). The rhythm encoded by 101 (binary for 5) is found on the fifth row (counting from zero).

Pascal’s triangle with odd entries in bold, tilted to line up with elaboration-based building blocks.

Coincidence? No, it turns out that any encoding of elaborations into a binary number indicates the row of Pascal’s triangle where the odd entries correspond exactly to the resulting rhythm. (This is because, by Lucas’s theorem, a binomial coefficient C(n, k) is odd exactly when the binary digits of k form a subset of those of n, so the positions of the odd entries in row n match the combinations of elaborations encoded by n.)
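As a quick check of that correspondence, Lucas’s theorem reduces “which entries of row n are odd” to a bit test, with no factorials involved:

```swift
// C(n, k) is odd exactly when the set bits of k form a subset of the
// set bits of n (Lucas's theorem).
func oddEntries(row n: Int) -> [Int] {
    (0...n).filter { k in k & ~n == 0 }
}

// Row 5 (binary 101) has odd entries at k = 0, 1, 4, 5 -- the rhythm
// produced by elaborating at levels 0 and 2, read in reversed time
// (the reversal convention is noted in a figure caption below).
print(oddEntries(row: 5))   // [0, 1, 4, 5]
```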

Perhaps-syncopated elaboration mapped to the Sierpinski gasket

Getting the entire set of building blocks, including those incorporating syncopation operations, requires going a step further: shifting each elaboration-generated rhythm one beat at each combination of metrical levels where elaboration did not occur.

Since each combination of elaborations and syncopations is encoded by a pair of binary numbers that share no 1s in the same binary place (because at most one operation can occur at each metrical level), each such pair of binary numbers can be combined into a single ternary number, distinguishing the 1s in the syncopation encoding by converting them to 2s.

The generator is the encoded elaborations, the offset is the encoded syncopations.
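In code, that merge is a digit-wise sum over the two masks (a sketch, assuming one bit per metrical level in each mask):

```swift
// Combine disjoint elaboration and syncopation bitmasks into a single
// ternary code: digit 0 = no operation, 1 = elaboration, 2 = syncopation.
func ternaryCode(elaborations: Int, syncopations: Int, levels: Int) -> Int {
    precondition(elaborations & syncopations == 0,
                 "at most one operation per metrical level")
    var code = 0, place = 1
    for level in 0..<levels {
        code += ((elaborations >> level & 1) + 2 * (syncopations >> level & 1)) * place
        place *= 3
    }
    return code
}

// e.g. elaborate at level 0 and syncopate at level 2 -> digits (1, 0, 2)
let code = ternaryCode(elaborations: 0b001, syncopations: 0b100, levels: 3)
```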

Using those ternary numbers as addresses into the fractal known as the Sierpinski gasket (as shown below), we now have a map of all potential building blocks, laid out visually in terms of elaborations and syncopations. (Mathematically, this corresponds to patterns formed by binary carries in sums of binomial coefficients, as established by Kummer’s theorem, which is related to Lucas’s theorem.)

Sierpinski gasket addresses mapped to building block rhythms.

Several characteristics and comparisons can be computed from these integer encodings alone, without examining, let alone generating, the rhythms themselves. The ternary digits tell you how closely related two building blocks are in terms of how they evolved (that is, how many times they shared the same operation at the same metrical level).
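One plausible reading of that comparison in code (a sketch, not necessarily the papers’ exact measure) simply counts the levels at which two codes agree:

```swift
// Count the metrical levels at which two building blocks underwent the
// same operation -- a rough kinship measure over their ternary codes.
func sharedOperations(_ a: Int, _ b: Int, levels: Int) -> Int {
    var a = a, b = b, shared = 0
    for _ in 0..<levels {
        if a % 3 == b % 3 { shared += 1 }
        a /= 3
        b /= 3
    }
    return shared
}
```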

Zooming into deeper metrical levels. Each location on the Sierpinski gaskets to the left corresponds to a (reversed) building block rhythm on the grids to the right.

Building block applications

Now that we have these encapsulations of rhythmic expectation at hand, how do we use them? It’s important to note that the building blocks are not exemplars of “good” rhythms; in fact the most compelling rhythms are often those that are the least parsimonious (for instance bossa nova, which, as detailed here, requires a separate building block for each note).

In practical terms, the first step is to parse the rhythm of a given melody or beat into these building blocks. This is computationally inexpensive because the integer mapping spares the need to perform the actual derivations; it’s just a matter of scanning out the known patterns.
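As a stand-in for the actual parsing algorithms (which are detailed in the linked papers), here is a naive greedy cover over the building-block table from the earlier sketch:

```swift
// Naive parse: greedily cover the attacks of an input rhythm with
// building-block patterns, preferring larger blocks (parsimony).
func parse(_ rhythm: Pattern, blocks: [Int: Pattern]) -> [Int] {
    var remaining = rhythm
    var used: [Int] = []
    for (code, p) in blocks.sorted(by: { $0.value.count > $1.value.count }) {
        if p.isSubset(of: rhythm), !p.isDisjoint(with: remaining) {
            used.append(code)
            remaining.subtract(p)
        }
        if remaining.isEmpty { break }
    }
    return used
}
```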

The current set of apps is called Coord. Some details and demos are at coord.fm.

Varying rhythms

In the simplest case, one or more digits of the ternary number that encodes a particular building block can be altered, thereby switching the operation at the corresponding metrical levels. The building block remains unchanged with respect to the other metrical levels.
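Continuing the sketch, switching the operation at one metrical level is a single ternary-digit replacement:

```swift
// Replace the ternary digit at the given level (0 = none, 1 = elaborate,
// 2 = syncopate), leaving the operations at all other levels untouched.
func setOperation(_ code: Int, level: Int, digit: Int) -> Int {
    precondition((0...2).contains(digit))
    var place = 1
    for _ in 0..<level { place *= 3 }
    let old = (code / place) % 3
    return code + (digit - old) * place
}
```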

Geometry as musical instrument

A GUI can leverage the self-similarity of the overall set of building blocks by collectively shifting digits in the ternary encodings. This allows natural-sounding variations that nevertheless might be very different on the surface, a parallel, hierarchical sort of editing that would be difficult to imagine otherwise.

Rhythms being manipulated hierarchically, rather than note-by-note.

A key aspect of this approach becomes visible in such interactions: each attack has its own potential to become a rest, and vice-versa. The attack potential equals the combined hierarchical weight of the building blocks at the given time point (again, details are in the papers linked above). The notes act somewhat like rocks in a stream, simultaneously shaping the flow and being submerged, or not, by that flow.

Potential note attacks above and below the expectancy threshold for actually being heard.

https://youtu.be/Ypb5kUMxb8g

Syncopation via genetic algorithm

This suggests a straightforward means of boosting syncopation without destroying recognizability: a genetic algorithm whose fitness function rewards rhythms that parse less parsimoniously into the building blocks yet retain a high degree of similarity to the original.
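A sketch of such a fitness function, reusing the naive parse above (the weighting constants are illustrative guesses; a real GA would also need mutation and crossover over attack patterns):

```swift
// Reward candidates that need more building blocks to parse (less
// parsimonious, hence more syncopated) yet stay close to the original.
func fitness(candidate: Pattern, original: Pattern,
             blocks: [Int: Pattern]) -> Double {
    let parsimonyCost = Double(parse(candidate, blocks: blocks).count)
    let overlap = Double(candidate.intersection(original).count)
    let similarity = overlap / Double(max(original.count, candidate.count, 1))
    return parsimonyCost + 4.0 * similarity   // 4.0 is an arbitrary balance
}
```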

Morphing between rhythms

Selected, weighted, input melodies being morphed into a new melody.

The building blocks afford a powerful capability: morphing intuitively between rhythms. In short, the attack potentials from two or three weighted rhythms are combined, producing a rhythm that interpolates in nonlinear fashion between those inputs.

Attack potentials are calculated for three input rhythms
The attack potentials are summed to determine the new rhythm
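In sketch form, with the count of parsed building blocks containing a step standing in for the papers’ hierarchical weights:

```swift
// Per-step attack potential: how many parsed building blocks of the
// rhythm contain each time step (a stand-in for hierarchical weight).
func attackPotentials(_ rhythm: Pattern, blocks: [Int: Pattern]) -> [Double] {
    var potential = [Double](repeating: 0, count: steps)
    for code in parse(rhythm, blocks: blocks) {
        for step in blocks[code]! { potential[step] += 1 }
    }
    return potential
}

// Morph: sum the weighted potentials of the inputs, then keep the steps
// whose combined potential clears a threshold.
func morph(_ inputs: [(rhythm: Pattern, weight: Double)],
           blocks: [Int: Pattern], threshold: Double) -> Pattern {
    var combined = [Double](repeating: 0, count: steps)
    for (rhythm, weight) in inputs {
        let p = attackPotentials(rhythm, blocks: blocks)
        for i in 0..<steps { combined[i] += weight * p[i] }
    }
    return Pattern((0..<steps).filter { combined[$0] >= threshold })
}
```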

Landscape of rhythmic variations

By positioning each source rhythm on a plane in a GUI, you can move the pointer between those locations in order to specify how much weight should be given to each source rhythm in the above scheme. As you slowly drag the pointer from one input to another you hear the rhythm morph in musically coherent fashion between the two.

Navigating the melodic morphing landscape.
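The app’s exact weighting scheme isn’t spelled out here; inverse-distance weighting is one plausible way to turn the pointer position into the per-source weights consumed by the morph above:

```swift
// Weight each source rhythm by the inverse of the pointer's distance
// to its location on the plane, normalized to sum to 1.
struct Point { var x, y: Double }

func weights(pointer: Point, sources: [Point]) -> [Double] {
    let raw = sources.map { s -> Double in
        let dx = s.x - pointer.x, dy = s.y - pointer.y
        return 1.0 / max((dx * dx + dy * dy).squareRoot(), 1e-9)
    }
    let total = raw.reduce(0, +)
    return raw.map { $0 / total }
}
```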

Future directions

One offline possibility is to use the building blocks to help determine which rhythms are most like each other, where there is kinship that sounds natural but might not be apparent on the surface.

Self-organizing map

Supplying the same measure used in the morphing scheme above to the evaluation function of a self-organizing map allows a meaningful proximity to be established among a set of rhythms, in terms of the similarities between the evolutionary pathways taken by those rhythms.

A self-organizing map (unsupervised neural net) for determining rhythmic similarity.
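For instance, the best-matching-unit step of the map could take its distance from the kinship measure sketched earlier (the update rule over discrete ternary codes is the delicate part, omitted here):

```swift
// Find the grid node whose ternary code shares the most operations with
// the input -- fewest differing metrical levels means smallest distance.
func bestMatchingUnit(for code: Int, grid: [[Int]], levels: Int) -> (row: Int, col: Int) {
    var best = (row: 0, col: 0)
    var bestDistance = Int.max
    for (r, rowCodes) in grid.enumerated() {
        for (c, node) in rowCodes.enumerated() {
            let distance = levels - sharedOperations(code, node, levels: levels)
            if distance < bestDistance {
                bestDistance = distance
                best = (r, c)
            }
        }
    }
    return best
}
```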

https://youtu.be/mLyglwG3SY8

Neural nets

Since the building blocks themselves can be expressed simply as integers, it might make sense to use them as the raw input to a neural net. This would let the network spend its capacity discovering relationships that build on what is already known. I lack deep-learning expertise, but have nevertheless been thinking through the following.

Perhaps I could present attacks to a recurrent neural net (likely an LSTM), each encoded by a vector that indicates which building blocks include that attack (a binary vector of length 3ⁿ for n metrical levels). The aim is to have the net learn the attack patterns against the backdrop of the collective expectations associated with each pattern.

Alternatively I could simply encode each attack by its binary representation. This would be a more compact encoding (with vector length 2ⁿ instead of 3ⁿ) but it would rely on the net to learn the superposed patterns already factored into the building blocks.
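Both candidate encodings are easy to state in code (building on the earlier definitions; the vector layout is my guess):

```swift
// (a) Multi-hot membership vector, length 3^n: one slot per building
// block, set where that block contains the attack's time step.
func membershipVector(step: Int, blocks: [Int: Pattern]) -> [Float] {
    let codes = blocks.keys.sorted()
    return codes.map { blocks[$0]!.contains(step) ? 1 : 0 }
}

// (b) One-hot position vector, length 2^n: set only at the attack's step.
func positionVector(step: Int) -> [Float] {
    var v = [Float](repeating: 0, count: steps)
    v[step] = 1
    return v
}
```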

(More realistically, the hope is to work on something like the above with more knowledgeable collaborators.)

Is source code available?

Not currently; at this point this is a music project that involves programming rather than vice-versa. My initial target is concrete collaboration focused on the theory, algorithms and their integration into some platform where the musical engagement is fully explored and tested. (My unfortunate experience has been that distributing code or binaries produces little engagement or feedback that leads in such a direction.)

Open source could be a real possibility if the project reaches a stable point, but that will first require something like a small team effort. Meanwhile the relevant math and algorithms are available via the papers linked above, and I’m happy to engage with anyone coding those.

Conclusion

I’ve described a style of music analysis and generation that could form a substrate of music creation in various settings. Although the original motivation here was real-time interaction, the low-level generative analysis discussed here might complement various approaches such as machine learning or higher-level grammars.

Ecosystems of rhythms and melodies

Augmenting the potential for variation could enable music that has a life of its own after it leaves the composer’s hands, where the line between listener and composer becomes more blurry, where ecosystems of musical elements might vary, hybridize, and evolve, and where adaptive music can be more intuitively woven into interactions ranging from performance to location-based music.

I welcome feedback, suggestions, questions, and, in particular, contact from those interested in comparing notes.
