Riding the Wave: How Music is Being Integrated in ICWave

Xbar Labs
9 min read · Feb 16, 2022

--

From the very beginning, we have touted that each ICWave NFT will have its own unique, randomly generated musical accompaniment. The question, then, is: what exactly do we mean by that? In this article we intend to answer that question and take an extended look at some of the complexities that come with randomly generated music as part of NFTs.

Let’s start with a bit of background.

I (xbarjoe) have around 10 years of experience with music production & audio engineering, in both independent and professional environments. My primary genre of expertise is electronic music, but I have worked on various projects utilizing elements of jazz, rock, hip-hop/rap, orchestral, experimental, and everything in between. And as a matter of fact, this experience is what inspired the original idea for ICWave in the first place. The thought process that kicked off the whole idea in my head was: “If the Internet Computer can store files cheaply on-chain, then that opens up a lot more opportunities for on-chain NFT variety beyond just randomly generated images.” Don’t get me wrong, I love a good randomly generated image series as much as the next person, but my point is that the Internet Computer opens the door for projects to have more creative freedom in what can be stored on-chain as an NFT.

Now that we’ve got that out of the way, let’s get back to the question of how to integrate audio into NFTs. Well… we have a few options.

Option 1: Write unique tracks for each NFT by hand.

This is all well and good, and it’s by far the best option for creating the highest-quality and most unique audio accompaniments. However, this option is plagued by some fairly unavoidable downsides (aside from the obvious one: it has no random generation component at all). Most NFT projects have a supply in the thousands (10,000 seems to be a sweet spot and is the supply we’re going with for ICWave), so we would have to create 10,000 total tracks in order to fill out the series. In my entire time in the business of writing music, I’ve only completed maybe 300 or so tracks — nowhere near the quantity we would need for something like ICWave. Even when the tracks are less than 30 seconds long (as they are in the case of ICWave), hand-making 10,000 unique songs would be a herculean undertaking that would likely drag out the development of ICWave to more than a couple of years. That’s more time than most artists would want to spend working on the same project, and more time than most NFT enthusiasts would want to spend waiting for a project to drop.

Option 2: Algorithmic Generation

This is your standard “randomly generated music” option, and it’s likely what you’re imagining if you’re just given the phrase “computer generated music”. The basic premise behind this method is using various algorithms to generate musical patterns. This is an incredibly complex task with a lot of moving parts, so I’m not going to fully explain the process in detail; however, there are two main “avenues” we can take with this option. The first of these is doing everything code-side. This means that everything related to audio generation (tone selection, pattern generation, arranging, etc.) is handled by the code in one fell swoop. While this does solve the previous option’s problem of not having any sort of randomness component (since this option is entirely automated, it would be fairly simple to introduce randomness for things like note length, note pitch, etc.), it introduces its own slew of issues.
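To make the “everything code-side” idea concrete, here’s a minimal sketch (not our actual pipeline) of what that might look like: randomly pick pitches and note lengths from a scale, render them as plain sine tones, and write the result to a WAV file. The scale, tempo, and file name here are purely illustrative assumptions.

```python
import random
import wave
import numpy as np

SAMPLE_RATE = 44100
C_MINOR_PENT = [60, 63, 65, 67, 70, 72]  # MIDI note numbers; an assumed scale for this sketch

def midi_to_hz(note):
    # Standard equal-temperament conversion (A4 = 440 Hz)
    return 440.0 * 2 ** ((note - 69) / 12)

def render_note(freq, seconds):
    t = np.linspace(0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    tone = 0.3 * np.sin(2 * np.pi * freq * t)
    tone[-200:] *= np.linspace(1, 0, 200)  # quick fade-out so notes don't click
    return tone

def generate_melody(num_notes=32, bpm=120):
    beat = 60 / bpm
    notes = []
    for _ in range(num_notes):
        pitch = random.choice(C_MINOR_PENT)               # random note pitch
        length = random.choice([0.25, 0.5, 1.0]) * beat   # random note length
        notes.append(render_note(midi_to_hz(pitch), length))
    return np.concatenate(notes)

audio = generate_melody()
pcm = (audio * 32767).astype(np.int16)
with wave.open("melody.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SAMPLE_RATE)
    f.writeframes(pcm.tobytes())
```

Even this toy version hints at the core problem: everything it produces is a bare, chiptune-style melody.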

This version of Option 2 is inherently limited, because it’s incredibly difficult to create anything other than simple chiptune-esque melodies with this method (yes, there are libraries for extending what we’ve described so far, but they’re outside the scope of this article). Another reason this is difficult is that all of the note timing details, key restrictions, and pattern placement have to be written in manually, which means the programmer would have to do a LOT of calculations, like musical-time to millisecond conversions.

We can make things slightly easier for this option by having the program generate MIDI files rather than audio files. For those who don’t know, MIDI is a file format that contains instrument and note data directly, rather than containing any actual audio data (this is an extreme oversimplification of the capabilities of MIDI files, but for the purposes of this article, those are the two most important features). We can modify the first “avenue” we mentioned, and instead have it write the randomly generated data to a unique MIDI file for each NFT, rather than fuss about with some internal sound generator. Once we have said MIDI files, we could either individually map them to various sounds in some music creation software (probably not the best idea, because this would be almost as much work as Option 1 above), or we can automatically generate the resulting audio file from the MIDI using a SoundFont (which is, in very simple terms, a bank of instruments that MIDI files can “talk” to in order to generate audio). The downside to using SoundFonts is that the instruments they use are often very primitive and can sound quite dated (which might actually be a plus depending on your desired outcome).
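Here’s a rough sketch of that MIDI-based avenue using the mido library, followed by rendering the result with the FluidSynth command-line tool and a SoundFont. The scale, tempo, and the soundfont.sf2 path are assumptions for illustration, not files from our project.

```python
import random
import mido

BPM = 120
SCALE = [60, 62, 64, 67, 69, 72]  # assumed C major pentatonic, as MIDI note numbers

mid = mido.MidiFile(ticks_per_beat=480)
track = mido.MidiTrack()
mid.tracks.append(track)
track.append(mido.MetaMessage('set_tempo', tempo=mido.bpm2tempo(BPM)))

for _ in range(32):
    pitch = random.choice(SCALE)
    length = random.choice([240, 480, 960])  # eighth, quarter, or half note, in ticks
    track.append(mido.Message('note_on', note=pitch, velocity=80, time=0))
    track.append(mido.Message('note_off', note=pitch, velocity=0, time=length))

mid.save('melody.mid')

# Rendering the MIDI through a SoundFont with the FluidSynth CLI:
#   fluidsynth -ni soundfont.sf2 melody.mid -F melody.wav -r 44100
```

The nice part is that the same MIDI file can be rendered through any SoundFont, so the “instrument” choice is decoupled from the note generation.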

Check out Prismcorp Virtual Enterprise’s Home™ for an example of some excellent music composed using SoundFonts. That album is a significant source of inspiration for the Xbar Labs team when designing the assets for ICWave.

We can adapt the two methods in Option 2 by limiting the scope of what we randomly generate data for. So, rather than generating entire melodies and sequences, we can limit the algorithm to only generate data for, say, drum patterns. This makes the generation step significantly easier (although I still wouldn’t call it easy), but we sacrifice variety and musicality in the process. In this instance, the final NFTs would only have varying drum patterns as their musical component, which wouldn’t be very interesting. All in all, none of these ideas are desirable for our use case. The whole process of using MIDI files and SoundFonts makes certain aspects of generation easier, but it still doesn’t get us quite where we need to be in order to deliver quality musical components at the scale we plan to (there would still be substantial work required to nail down the automatic generation of proper note timing and melodic cohesion).
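To give a concrete sense of why the drums-only version is so much easier (and so much less interesting), here’s a toy sketch that rolls a random 16-step kick/snare/hat pattern; the step count and hit probabilities are just made-up numbers.

```python
import random

STEPS = 16
# Probability that each instrument hits on a given 16th-note step (made-up values)
HIT_CHANCE = {"kick": 0.4, "snare": 0.25, "hat": 0.7}

pattern = {
    name: [random.random() < p for _ in range(STEPS)]
    for name, p in HIT_CHANCE.items()
}

# Print a simple step grid: 'x' = hit, '.' = rest
for name, hits in pattern.items():
    print(f"{name:>5}: " + "".join("x" if h else "." for h in hits))
```

A pattern like this could be written out as MIDI drum notes (kick = 36, snare = 38, closed hat = 42 in the General MIDI drum map) and rendered through a SoundFont exactly as above, but the result is still just a drum loop.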

Option 3: Use Machine Learning

This option is a little more abstract, and you can get wildly varying results depending on the desired outcome and training data used. I’m not going to dive too deep into the implementation of machine learning for an application like this, but the gist of it is: We can automate some of the work described in Option 2 by using machine learning. The two ways of doing this that seem immediately obvious to me are:

A). Train a neural network on MIDI files and have it generate new MIDI files as a result.

B). Train a neural network on audio directly and have it generate new audio as a result.

Method A seems like it might have some promise, as there are huge repositories of MIDI files out there (the GeoCities repository is one that immediately comes to mind), so this could be something we potentially explore in the future (my academic background is in machine learning, so a project utilizing machine learning, music, and NFTs would be right up my alley)!
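For the curious, Method A usually boils down to “turn MIDI into token sequences, then train a sequence model to predict the next token”. A bare-bones, heavily simplified sketch of that idea (pitch-only, timing ignored, PyTorch, with a hypothetical midi_corpus/ directory and untuned hyperparameters) might look like this:

```python
import glob
import pretty_midi
import torch
import torch.nn as nn

# 1. Turn each MIDI file into a sequence of pitches (ignoring timing for brevity)
sequences = []
for path in glob.glob("midi_corpus/*.mid"):  # hypothetical corpus directory
    midi = pretty_midi.PrettyMIDI(path)
    for inst in midi.instruments:
        if not inst.is_drum:
            pitches = [n.pitch for n in sorted(inst.notes, key=lambda n: n.start)]
            if len(pitches) > 1:
                sequences.append(torch.tensor(pitches))

# 2. A tiny next-pitch prediction model
class PitchLSTM(nn.Module):
    def __init__(self, vocab=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.head(out)

model = PitchLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# 3. Train the model to predict each pitch from the pitches before it
for epoch in range(5):
    for seq in sequences:
        x, y = seq[:-1].unsqueeze(0), seq[1:].unsqueeze(0)
        logits = model(x)
        loss = loss_fn(logits.view(-1, 128), y.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Sampling from the model one token at a time would then produce new pitch sequences to write back out as MIDI; making the output actually sound musical is, of course, the hard part.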

Option 4: The NFT Way

The final option (and the one we decided to go with for ICWave) is generating audio using the same approach we use to generate images. Like most NFT series, ICWave uses the standard “Randomized Stacked Assets” approach for generating images. Basically, we take a base layer (which, in our case, is the statue layer), pick a random selection of the different traits (background, face accessory, torso accessory, etc.) from the asset library, and stack them all together to form the final image. Similarly for audio, we hand-write a library of musical stems and choose a random selection of those stems to be the final audio component for a given NFT. This is effectively a “best of both worlds” combination of all of the previously described methods: we get the high-quality music that comes from everything being hand-made, with the efficiency of random generation, without having to actually go through and write 10,000 unique tracks. From what I can see, we’re the first NFT series to generate the audio component this way (at the very least, the first on the Internet Computer).
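For readers who haven’t seen it before, the image side of “Randomized Stacked Assets” boils down to something like the sketch below. The directory layout and trait names are placeholders for illustration, not our real asset library, and all layers are assumed to be transparent PNGs on the same canvas size.

```python
import random
from pathlib import Path
from PIL import Image

# Hypothetical asset library layout: assets/<trait>/<variant>.png
TRAIT_ORDER = ["background", "statue", "torso_accessory", "face_accessory"]

def generate_image(out_path, assets_dir="assets"):
    layers = []
    traits = {}
    for trait in TRAIT_ORDER:
        variants = list(Path(assets_dir, trait).glob("*.png"))
        choice = random.choice(variants)  # pick one variant per trait
        traits[trait] = choice.stem
        layers.append(Image.open(choice).convert("RGBA"))

    # Stack the layers bottom-to-top with alpha compositing
    final = layers[0]
    for layer in layers[1:]:
        final = Image.alpha_composite(final, layer)

    final.save(out_path)
    return traits  # the chosen trait names go into the NFT's metadata

print(generate_image("icwave_0001.png"))
```

The audio version swaps image layers for audio stems, which is exactly what the rest of this section walks through.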

We do this by splitting the audio generation into traits. In our case, we have:

  • A main drum layer, for things like kick drums, snare drums, and the like.
  • A drum top layer, for hi-hats, cymbals, risers, white noise, etc.
  • A lead layer, for the main instruments in the track.
  • Finally, a bass layer, for synth bass sounds in the lower frequencies.

While it may not be immediately apparent, the method as I’ve described it so far actually introduces some problems. Namely, it limits us to writing all of the music in only one key and at only one BPM. Granted, it’s not like we’re forced to abide by this limitation, but musical stems with varying BPMs and keys all mashed together would sound like musical nonsense, which isn’t quite what we’re going for. We get around this limitation by introducing musical key and BPM as traits themselves. By predetermining a set number of different keys and BPMs, and pre-writing and categorizing all of the audio stems into those categories, we can ensure that the stems that get combined for each NFT all meld together in a cohesive and musical manner (another small bonus of this method is that it gives rare-NFT hunters additional traits to keep an eye on). The generalized process is outlined in the image below, and sketched in code after the step-by-step summary:

To summarize what’s happening in this flowchart:

  • Step 0: Start with generated ICWave image.
  • Step 1: Select BPM for audio component (This gets added to the NFT’s trait metadata).
  • Step 2: Select Drums (both main and top parts) that match the BPM selected in Step 1.
  • Step 3: Select Musical Key (This gets added to the NFT’s trait metadata).
  • Step 4: Select lead and bass samples that match the BPM and Key selected in Steps 1 and 3, respectively.
  • Step 5: Combine all four samples together into one audio file.
  • Step 6: Pair audio file to ICWave image in order to generate final ICWave.
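A rough sketch of how those six steps might look in code, using pydub to overlay the stems. The folder layout, BPM and key lists, and the file-naming convention are assumptions for illustration, not our actual build scripts.

```python
import random
from pathlib import Path
from pydub import AudioSegment

# Hypothetical stem library: stems/<layer>/<bpm>_<key>_<name>.wav for pitched layers,
# and stems/<layer>/<bpm>_<name>.wav for the drum layers.
BPMS = [100, 120, 140]
KEYS = ["Cmin", "Fmaj", "Amin"]

def pick_stem(layer, bpm, key=None):
    # Filter stems by the BPM (and key) tags in their filenames, then pick one at random
    tag = f"{bpm}_" + (f"{key}_" if key else "")
    candidates = [p for p in Path("stems", layer).glob("*.wav") if p.name.startswith(tag)]
    return AudioSegment.from_file(random.choice(candidates))

def generate_audio(out_path):
    bpm = random.choice(BPMS)                  # Step 1: BPM becomes a trait
    drums_main = pick_stem("drums_main", bpm)  # Step 2: drums only need to match the BPM
    drums_top = pick_stem("drums_top", bpm)
    key = random.choice(KEYS)                  # Step 3: key becomes a trait
    lead = pick_stem("lead", bpm, key)         # Step 4: pitched stems match BPM and key
    bass = pick_stem("bass", bpm, key)

    # Step 5: overlay all four stems into a single audio file
    mix = drums_main.overlay(drums_top).overlay(lead).overlay(bass)
    mix.export(out_path, format="wav")
    return {"bpm": bpm, "key": key}            # Step 6: paired with the image + metadata

print(generate_audio("icwave_0001.wav"))
```

Because the drum layers only carry a BPM tag while the lead and bass layers carry both BPM and key tags, any combination the generator picks is guaranteed to line up rhythmically and harmonically.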

As previously mentioned, this method really is a “best of both worlds” scenario, because it gives us the quality of hand-written musical components without the time required to actually go through and hand-write 10,000 individual tracks. On top of that, this method also makes it easy to ensure that every final audio component is unique. From a musical perspective, this is quite an interesting way of doing things, because writing music for something like this is an entirely different process from, say, writing a song. Here, a balance needs to be struck between making the stems general enough that they mesh well with whatever other stems they might be paired with, while keeping them detailed enough to stay musically interesting (and, of course, to meet the quality standards we hold ourselves to). All in all, I think successfully striking this balance will lead to ICWave being unlike anything previously seen in the Internet Computer NFT space, and will set a quality standard for future NFT projects of a similar nature.
