Measuring the subjective — improving the quality of AI-generated music as a PM
As a Product Manager, I used this process to measure and improve the quality of music at Jukedeck (a London-based startup building software that makes AI-generated music). There’s a little bit on how to use Google Scripts for this too.
For context, I used to be a Product Manager at Jukedeck, a London-based AI music startup. I probably knew the least about music there.
I previously worked as the Product Manager at Jukedeck, a London-based startup building an API to generate music using artificial intelligence.
In the past year we won Hottest Media/Entertainment Startup at the Europas, Startup of the Year at the BIMA Awards, featured in Amazon’s VR Platform Launch and warmed up for Boiler Room at the Slush Tech Conference.
At Jukedeck our biggest barrier to commercial success was always music quality. We were told ‘Hey, this is great, but it still sounds like it was made by a computer. Come back to us when the quality is better’. My role, spanning commercial and product strategy, was to improve the product and meet this demand.
At Jukedeck our biggest barrier to commercial success was always music quality.
I am by no means a music expert. I still couldn’t tell you what ‘gain staging’ or ‘bus routing’ are, but everyone I worked with really knew their stuff. My role as the Product Manager was to draw on this expertise and:
- Define what music quality means
- Identify initiatives to improve it
- Prioritise (time to implement + expected impact)
- Implement changes
- Measure the impact
This piece is a quick overview of how we thought about and approached this at Jukedeck.
Music quality is subjective, so we broke it down into three groups of levers we could pull: composition, arrangement and production.
The first step was to understand what our prospective customers meant by ‘quality’. Our response to a piece of music is hard to quantify, largely emotional, and can differ enormously from person to person. Take Drake for example; some of us love his minimalist, existentialist R&B, others find his metaphors bland and annoying.
“I’m a staple in the game, all my paper’s together” — Drake
As a team we brainstormed a ton of ideas. We made sure we had a range of roles (from machine learning researcher to bizdev) and musical abilities (layman like me to diploma-level musicians) involved. We also drew on a backlog of feedback from 50+ sales meetings. Together we worked through three rough areas — composition, arrangement and production — and identified levers of quality in each, for example:
- Composition quality could include ‘hooky’, repetitive melodies
- Arrangement quality could include genre-specific structures (e.g. most pop songs follow the intro > verse > chorus > bridge > chorus > outro structure) or inserts like horn stabs and drum fills
- Production quality could include instrument selection, professional mix quality (balanced frequency distribution), dynamics (like the variation in ‘loudness’ between notes and sections), spatialisation (instruments sounding like they’re all played in the same space)
Now with a clear list of levers, we used further brainstorms to identify solutions that we could implement in our system. We ranked them by expected impact on perceived ‘quality’ and time to implement. Where possible, we created audio demos to help demonstrate the impact.
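As a rough illustration of that ranking step, here is a minimal sketch. It assumes each idea gets an expected-impact score and an effort estimate in weeks, and it ranks by impact per week. The ratio is just one reasonable way to combine the two and is not necessarily the weighting we used. The initiative names and numbers are placeholders.

```javascript
// Hypothetical initiatives: names, impact scores and effort estimates are
// placeholders for illustration, not Jukedeck's actual roadmap data.
const initiatives = [
  { name: "Initiative A", expectedImpact: 4, weeksToImplement: 8 },
  { name: "Initiative B", expectedImpact: 3, weeksToImplement: 1 },
  { name: "Initiative C", expectedImpact: 5, weeksToImplement: 12 },
];

// Rank by expected impact per week of effort, highest first.
// (Assumption: a simple ratio; any other weighting could be swapped in.)
function rankInitiatives(items) {
  return items
    .map((item) => ({
      ...item,
      score: item.expectedImpact / item.weeksToImplement,
    }))
    .sort((a, b) => b.score - a.score);
}

const ranked = rankInitiatives(initiatives);
console.log(ranked.map((item) => item.name));
```

A ratio like this favours quick wins, which is why it can be worth reviewing the ranked list by hand so that slow but high-impact rewrites are not dropped automatically.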
This exercise was the foundation of a six-month roadmap that we finished implementing in December 2017. We focused on a few longer-term system rewrites, coupled with many smaller tweaks.
We brainstormed tests to measure incremental improvements in music quality, and considered a range of variables.
The aim of this effort was to improve our music quality and, in turn, our commercial viability. The ultimate mark of success would be striking more commercial deals, but we also needed a way to measure incremental improvements in music quality, independent of our sales efforts.
Here are some of the considerations we had around running tests to measure improvement:
- Granularity — should we focus on assessing the track as a whole, or ask about specific elements (arrangement, timbre, dynamics)?
- Phrasing — should we ask about ‘quality’ specifically, use a proxy like ‘would you pay for this track?’ or should we just ask people to simply ‘rate’ a track (a catch-all)?
- Testers — should we ensure we poll a broad demographic of testers? Do we ask a low number of highly skilled musicians to give us qualitative data, or a higher volume of typical users to give us quantitative data?
- Benchmarking — should we get testers to compare old and new Jukedeck tracks, Jukedeck tracks and the training data, or Jukedeck tracks and stock audio? What do we care about most?
- Normalisation — should we normalise our results with non-Jukedeck tracks, e.g. how does the same tester rate two non-Jukedeck tracks?
- Context — be aware that the mechanism for collecting this information may itself introduce bias (e.g. having an opt-in test will mean we poll people that are actively interested in AI music, likely with existing bias, or if we force users to rate tracks on the website before they can download them, they might not even listen to the tracks before rating them)
To avoid analysis paralysis we picked the simplest test: present two tracks (one before a change, one after) and ask the user to rate them both on an arbitrary scale.
In the spirit of being lean, we didn’t want to mull these considerations over too long. I came to the following conclusions:
- Incremental improvement would be best demonstrated by asking users to compare two Jukedeck tracks, one generated before we’ve made a change, and one after
- As a team, we were very capable of measuring improvements in more technical, granular detail (like dynamics or mix quality) ourselves; what we really needed to understand was the visceral reaction of our users to the track as a whole
- ‘Quality’ means so many things, and so we should simply ask people to ‘rate the track’ (combined with qualitative feedback this could actually tell us what aspect of the music people focused on when they made their rating)
Using Google Scripts we could quickly run a test at scale. We’d serve up two random tracks (one old, one new) for a user to compare and rate.
Google Forms was an obvious choice for the test mechanism — we could serve up two links from a spreadsheet (one old track, one new track) and ask people to rate them. I used some useful links like this one to help me decide what rating scale and wording to use, and got some great feedback from the Jukedeck team (they suggested we include the optional section for qualitative feedback, which has proved valuable). This is the form we used.
Google Scripts is a really powerful tool for Product Managers; I’ve used it to automate loads of stuff, and even built Jukedeck a self-serve employee database using it. In this case, I wrote a short script that:
- Presents links to two songs (one old track, one new track) in the Google Form
- Pulls these tracks randomly from two lists of 30 tracks each in this Google Spreadsheet
- Randomises the track order (i.e. the old track wouldn’t always be first in the form) to avoid unintended bias
- Allows users to rate each track between 1 and 7 and add qualitative comments, and stores the results in this spreadsheet
- Refreshes the tracks upon form submission, so there are two new tracks to compare when the form is next opened
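The selection logic in that list can be sketched in plain JavaScript, which is the language Google Scripts uses. This is a minimal illustration and not the original script. The function names and example URLs are hypothetical, and the wiring that reads the lists from the spreadsheet and updates the form on submission is left out.

```javascript
// Pick one random element from a non-empty array.
// `random` is injectable so the behaviour can be checked deterministically.
function pickRandom(items, random = Math.random) {
  return items[Math.floor(random() * items.length)];
}

// Build the pair of tracks shown in the form: one old, one new.
// Each entry records its version so ratings can be mapped back afterwards.
function buildComparison(oldTracks, newTracks, random = Math.random) {
  const pair = [
    { version: "old", url: pickRandom(oldTracks, random) },
    { version: "new", url: pickRandom(newTracks, random) },
  ];
  // Swap with probability 0.5 so the old track is not always first.
  if (random() < 0.5) {
    pair.reverse();
  }
  return pair;
}

// Placeholder URLs, not real track links.
const oldTracks = ["https://example.com/old-1", "https://example.com/old-2"];
const newTracks = ["https://example.com/new-1", "https://example.com/new-2"];
console.log(buildComparison(oldTracks, newTracks));
```

Keeping the version label next to each URL matters because the order is shuffled: without it, a stored rating for "track 1" could not be attributed to the old or the new list.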
We tracked two measures: 1) the % of users that preferred the new tracks to the old, and 2) by how much the quality had improved.
The spreadsheet has an ‘analysis’ tab which shows two key metrics:
- What the % improvement (in average rating) is between new and old tracks
- What % of responses preferred the new to the old tracks
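For anyone wanting to reproduce those two numbers outside a spreadsheet, here is a minimal sketch of the arithmetic. The ratings are made-up examples, not our results. The sketch assumes that "% improvement" means the relative change in average rating, and that a tie does not count as preferring the new track. Both are assumptions about definitions the spreadsheet may have handled differently.

```javascript
// Each response holds the 1-7 ratings given to the old and the new track.
// These values are invented for illustration only.
const responses = [
  { oldRating: 3, newRating: 5 },
  { oldRating: 4, newRating: 4 },
  { oldRating: 5, newRating: 6 },
  { oldRating: 4, newRating: 3 },
];

// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function analyse(rows) {
  const oldMean = mean(rows.map((r) => r.oldRating));
  const newMean = mean(rows.map((r) => r.newRating));
  const preferredNew = rows.filter((r) => r.newRating > r.oldRating).length;
  return {
    // Relative change in average rating, as a percentage of the old average.
    percentImprovement: ((newMean - oldMean) / oldMean) * 100,
    // Share of responses that rated the new track strictly higher (ties excluded).
    percentPreferredNew: (preferredNew / rows.length) * 100,
  };
}

console.log(analyse(responses));
```

The two metrics answer different questions: the first shows the size of the change, the second shows how consistently people noticed it, so a large average gain driven by a few enthusiastic testers would show up as a high first number and a modest second one.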
By sharing the form with our personal networks and tracking these metrics we were able to quantitatively demonstrate that our efforts to improve music quality were effective. We also gathered plenty of qualitative feedback that helped us focus our next improvement initiatives.
If you’d like to hear the improvement yourself, you can check out our latest music, which featured in The Times, or compare the two playlists here:
I’d love to hear your feedback on the music, the approach we took and how you’d have thought about running the tests. Finally, if you want any tips on using Google Scripts in this way let me know.
P.S. I’m trying to practise my writing and this is one of my first articles, so any feedback is welcome! You can get in touch at hi[at]rich.fyi