Identifying music genres from Spotify album covers using Clarifai’s deep learning API and Google Prediction

Alexandre Passant
Oct 8, 2015


With the recent news of image recognition start-up Clarifai raising $10M, I decided to experiment with their Web API. While Deep Learning (the core of their approach) has been used for music streaming, such as recommendations on Spotify, what about the image recognition side of it?

Here, I’ll describe an experiment which combines the Spotify Web API, Clarifai’s Image Recognition API and Google Prediction in order to identify an artist’s music genre based on their album covers.

Clarifai’s deep learning API: Machine-Learning-as-a-Service

Clarifai is one of those new services in the Machine-Learning-as-a-Service (MLaaS) area (Azure ML, Wise, etc.). If you think about how Web development has evolved over the past few years, it completely makes sense, as an additional step towards the No-stack start-up.

Many start-ups need Machine-Learning approaches to scale their product: classifying customers, recommending music, identifying offensive user-generated content, etc. Yet, they are better off outsourcing this to platforms with core ML expertise, rather than spending time and resources on this utility, unless it becomes their core business. It is the same reason you don’t buy a fleet of servers and hire dev-ops but use Google App Engine, or don’t set up a telephony system but rely on Twilio.

Clarifai’s API, which uses deep learning for image recognition, is very easy to use, with a simple API Key / token system and bindings for multiple languages such as Python. For instance, the following tags Green Day’s Dookie album cover.

from clarifai.client import ClarifaiApi

clarifai_api = ClarifaiApi()
clarifai_api.tag_images('http://assets.rollingstone.com/assets/images/list/0e099b2214b1673fc76c6c60257b88aefe571def.jpg')
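The call returns a JSON-like dictionary; reading the per-image tags and their probabilities could look like the lines below. Note that the response layout is an assumption based on the v1 API of the time, so adjust it to what your client version actually returns.

# Assumed v1 layout: results -> result -> tag -> classes / probs.
response = clarifai_api.tag_images('http://assets.rollingstone.com/assets/images/list/0e099b2214b1673fc76c6c60257b88aefe571def.jpg')
tag_data = response['results'][0]['result']['tag']
for tag, prob in zip(tag_data['classes'], tag_data['probs']):
    print tag, prob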

The API also lets you provide feedback on the extraction results, which is then used to feed their algorithms with additional training data.
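Purely as an illustration, a feedback call could look like the sketch below; the endpoint, parameter names and token handling are all assumptions based on my recollection of the v1 documentation of the time, not verified calls.

import requests

requests.post(
    'https://api.clarifai.com/v1/feedback',  # hypothetical v1 endpoint
    data={
        'docids': 'doc-id-from-a-previous-tag-call',  # hypothetical parameter
        'add_tags': 'album,punk-rock',  # tags the image should have had
        'remove_tags': 'party',         # tags that were wrong
    },
    headers={'Authorization': 'Bearer <ACCESS_TOKEN>'},
)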

Check my #MusicTech e-book for more experiments with music and data!

SpotiTag: Tagging artists through album covers

The first step of my experiment was to tag artists using their album covers. To do so, I wrote a simple Python class which queries the Spotify Web API to get an artist’s top-10 albums, and passes those to the Clarifai API, filtering out broad tags like “graphic”, “art”, etc.

This way, we can find the most relevant tags for an artist, as below.

>>> from spotitags import SpotiTags
>>> SOCIAL_DISTORTION = '16nn7kCHPWIB6uK09GQCNI'
>>> sp = SpotiTags()
>>> print sp.tag(SOCIAL_DISTORTION)[:10]
[(u'text', 5.78628945350647), (u'silhouette', 4.906140387058258), (u'people', 4.833337247371674), (u'background', 4.743582487106323), (u'vintage', 3.9088551998138428), (u'banner', 3.8920465111732483), (u'men', 3.76175057888031), (u'card', 3.67703515291214), (u'party', 3.6343321204185486), (u'old', 2.952597975730896)]

The SpotiTags class is available on GitHub.
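For a rough idea of what happens under the hood, here is a minimal sketch of such a class. The Spotify albums endpoint and the Clarifai tagging call are real, but the aggregation logic and the stop-tag list are my assumptions; check the GitHub repository for the actual implementation. (Also note that the albums endpoint now requires an OAuth token, which it did not in 2015.)

import requests
from collections import defaultdict
from clarifai.client import ClarifaiApi

# Broad tags that carry no genre signal (assumed list).
STOP_TAGS = set(['graphic', 'art', 'design', 'illustration'])

class SpotiTags(object):
    def __init__(self):
        self.clarifai = ClarifaiApi()

    def tag(self, artist_id):
        # Fetch the artist's first 10 albums from the Spotify Web API.
        albums = requests.get(
            'https://api.spotify.com/v1/artists/%s/albums' % artist_id,
            params={'limit': 10}).json()['items']
        scores = defaultdict(float)
        for album in albums:
            cover_url = album['images'][0]['url']  # largest cover first
            result = self.clarifai.tag_images(cover_url)
            # Assumed v1 response layout, as in the earlier snippet.
            tag_data = result['results'][0]['result']['tag']
            for tag, prob in zip(tag_data['classes'], tag_data['probs']):
                if tag not in STOP_TAGS:
                    scores[tag] += prob  # sum probabilities across covers
        # Best-scored tags first, as (tag, score) pairs.
        return sorted(scores.items(), key=lambda kv: -kv[1])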

From artists to genres

To bridge the gap between artist-tags and genre-tags, I used “top x songs” playlists from Spotify, starting with two very unrelated genres: Doom-metal and K-pop! I gathered a small dataset of 140 Doom-metal artists and 102 K-pop ones and passed them through the previous tagging process; here are the top tags for both genres.
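For reference, this aggregation step can be sketched with a simple counter over the per-artist tags; the helper below is mine, not from the original code, and assumes the SpotiTags sketch above.

from collections import Counter

def genre_top_tags(artist_ids, tagger, top_n=20):
    # Sum each tag's scores across all artists of the genre.
    counts = Counter()
    for artist_id in artist_ids:
        for tag, score in tagger.tag(artist_id):
            counts[tag] += score
    return counts.most_common(top_n)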

This genre-tagging approach is, in a way, similar to what Spotify published about words associated with the different genres of music. Except that I’m not analyzing song titles, but album covers!

As you can see with the tags in italic, the overlap is quite large between genres — and I’ll come back to that later. But for now, let’s look at how I used this data to build an artist classifier.

Cloud-based classification with Google Prediction

As its name suggests, Google Prediction is Google’s (cloud-based) Machine-Learning API, predicting an output value for a set of features. It works whether this value is a class (in our case, a music genre) or a continuous value (e.g. the expected number of streams per month for an artist), i.e. classification or regression in ML terms.

To predict whether a set of tags belongs to the Doom-metal or K-pop category, I’ve built a simple training set as follows (note that Google Prediction splits the string into its own internal model):

genre,"list of tags"

Splitting the previous dataset into training and test sets led to 180 examples like the following ones.

doom,"nobody vintage christmas light people nature architecture girl woman paper west history celebration islam traditional astrology event round party background reflection antique dark old shadow conceptual death man postcard back festival east years music fireworks banner couple night female abstract women model adult travel greeting religion card gold street church fine art silhouette style map sepia venice rain broken government country lantern castle color love one nude arms sexy shiny texture pin velvet wall pride" kpop,"people adult one men north america vintage two vehicle motor vehicle transportation street cap group three police classic car humor hat cartoon outerwear xmas rapper fedora banner christmas jacket musician invitation man celebration background clothing decoration splash necktie text vest facial expression fast wedding automobile audio tape cassette stereo sound mono analogue obsolete record nobody tree radio broadcast compact reel fish eighties tuner unique nostalgia invertebrate conifer noise moon fine art black and white outfit sculpture winter singer season outdoors law nature monochrome greeting fashion menswear women serious dark change boy stroke merry travel garden female face war still life forest music teenager looking leader youth"

In addition, to try different models, I limited the list of tags (in the feature-set) to either all the tags for an artist, or their top-n. The results are as follows for the different approaches.

The fun part comes next, with a small script that combines the three APIs to return the expected class (i.e. music genre) for any Spotify artist, as sketched after this list:

  1. Using SpotiTags to query the Spotify API, get the artist’s top-10 albums and pass them to Clarifai, building the artist’s tag-set;
  2. Passing this tag-set to Google Prediction to predict the class, using the models above.
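A minimal sketch of that script, assuming a trained model called 'genre-model' in a project called 'my-project' (both placeholders) and an authorized Prediction API v1.6 client whose construction is omitted:

from spotitags import SpotiTags

def predict_genre(service, artist_id):
    # Step 1: build the artist's tag-set from their album covers.
    tags = [t for t, score in SpotiTags().tag(artist_id)]
    # Step 2: ask Google Prediction for the most likely genre.
    result = service.trainedmodels().predict(
        project='my-project',  # placeholder project id
        id='genre-model',      # placeholder model id
        body={'input': {'csvInstance': [' '.join(tags)]}}
    ).execute()
    return result['outputLabel']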

And here we are!

(env)marvin-7:clarifai alex$ python predict.py -a 7zDtfSB0AOZWhpuAHZIOw5
Candlemass: doom

More genres, more models

That’s all good and fun, but an ideal system would be able to classify between more genres, e.g. 10 genres of popular music. I haven’t gone that far, but I added Punk-rock to the equation in order to try additional models.

Adding 75 Punk-rock artists, let’s have a look at the top-tags:

As earlier, there is a lot of overlap here. And, as new genres are added, that overlap is expected to grow. Thus, using the former top-n tags approach, the results are, unsurprisingly, worse than previously.

Finding the most distinctive tags

So instead of the top tags, let’s focus on the ones that are specific to a genre, i.e. tags (again, extracted from album covers via Clarifai’s deep learning API) which appear in the top-100 of one genre, but not in the top-100 of the others (still limited to those three genres).
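Computing those distinctive tags boils down to a set difference; a sketch, reusing the hypothetical genre_top_tags helper from earlier:

def distinct_tags(genre, genre_artists, tagger, top_k=100):
    # Tags in this genre's top-k that no other genre's top-k contains.
    own = set(t for t, s in genre_top_tags(genre_artists[genre], tagger, top_k))
    for other, artists in genre_artists.items():
        if other != genre:
            own -= set(t for t, s in genre_top_tags(artists, tagger, top_k))
    return own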

This definitely gives a better feeling of what each genre is about: sexy and beautiful for K-pop, and scary and horrific for Doom-metal!

Using this distinct-tags approach, I’ve updated the previous models (and accordingly, the training and test sets) to take into account not the top-n tags, but the top-n distinct ones.

Here are the results of the new classifiers, deciding if an artist is playing Doom-metal, K-pop or Punk-rock based on their album covers’ tags.

Much better than the previous approach. Yet, as the number of genres grows, there will probably be a need to fine-tune those models to accurately identify an artist’s genre. This means using more examples in the training sets, but also probably additional data beyond images alone, such as album titles, and maybe some MIR techniques.

The MLaaS opportunity for developers, and for Clarifai

While this is a limited experiment, this post showcases how different elements of a cloud-based machine-learning stack can help identify what an artist is playing, based solely on what their album covers look like!

More generally, using such APIs definitely simplifies the life of developers and entrepreneurs. While you still need to grasp the underlying concepts, there is no need for an army of Ph.D.s and a fleet of GPU-based servers running scikit-learn, Caffe or the like to understand and make sense of your data!

As for Clarifai itself, I believe that classification and clustering could be two additional features that their current audience would enjoy:

  • For classification, instead of running a pipeline through another API (as I’ve done here with Google Prediction), managing everything through Clarifai would reduce integration time and costs. Just think of an API that will, after you’ve sent it a few examples, automatically decide if any picture is offensive or not;
  • For clustering, since the API already accepts images in bulk, returning clusters (with an additional parameter for the number of clusters) would also be helpful. As their premium offering already provides a “similar images” feature, that would follow a similar trend.

And I’m not even thinking of features outside the image domain (music?), as they’ve already started with video analysis. In any case, with $10M now in the bank, and a growing team of experts in the area, I have no doubt we will see new features in the API sooner rather than later, showcasing what deep learning can bring to Web engineering!

Check my #MusicTech e-book for more experiments with music and data!


Originally published at apassant.net on May 14, 2015.
