What Makes K-Pop, K-Pop? I Built a K-Pop Neural Network

@PatrickYoon · Published in The Startup · 10 min read · May 3, 2020


As a Korean American, I’ve grown up listening to Korean music my whole life. However, these past couple of years have seen an incredible upswing in the popularity of Korean music. K-Pop groups such as BTS and Twice have been gaining international fame and breaking various musical records since their debuts. Nowadays, even my non-Korean friends will come up to me wanting to talk about the latest hit song or a new group that’s about to debut. Where did this hype around K-Pop come from, and what makes K-Pop groups stand out so much? More importantly, what do top K-Pop artists such as BTS, Twice, and Blackpink do differently from other K-Pop artists?

Some might say that Korean music is catchier or that the visuals are more appealing, but those factors are completely subjective. Instead, I went to the Spotify API and gathered thousands of tracks along with the various audio features provided by the API, such as duration, acousticness, valence, and more. You can read more about what each audio feature represents here.

I gathered every single track from the top 50 K-Pop artists from 2019, according to Tumblr, and compiled them into one data file. This included the name of the track, the group that performed that track, the gender of the group, all the audio features, and a popularity value (between 0–100) of the track that was provided by Spotify.
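Roughly, that gathering step could look something like the sketch below, assuming the spotipy library and a flat table with one row per track (the exact library, helper function, and column names here are illustrative assumptions, not my literal code):

```python
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

FEATURE_KEYS = ["duration_ms", "key", "mode", "time_signature", "acousticness",
                "danceability", "energy", "instrumentalness", "liveness",
                "loudness", "speechiness", "valence", "tempo"]

def tracks_for_artist(name, gender):
    """Collect an artist's tracks along with their audio features and popularity."""
    artist = sp.search(q=f"artist:{name}", type="artist", limit=1)["artists"]["items"][0]
    rows = []
    for album in sp.artist_albums(artist["id"], album_type="album,single")["items"]:
        for track in sp.album_tracks(album["id"])["items"]:
            features = sp.audio_features([track["id"]])[0]
            popularity = sp.track(track["id"])["popularity"]
            row = {"name": track["name"], "artist": name, "gender": gender,
                   "popularity": popularity}
            row.update({k: features[k] for k in FEATURE_KEYS})
            rows.append(row)
    return rows

# e.g. build the data file from a list of (artist, gender) pairs
df = pd.DataFrame(tracks_for_artist("BTS", "Male") + tracks_for_artist("TWICE", "Female"))
df.to_csv("kpop_tracks.csv", index=False)
```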

A short snippet of what my data file looked like

After watching a YouTube video about someone creating a face editing tool using machine learning and principal component analysis (PCA), I wanted to use PCA so that I could see which features or components affect a track the most. However, after learning more about what PCA is and how it can be applied, I realized that PCA is a dimensionality reduction tool, and compared to images with thousands of dimensions, we only have thirteen (duration, key, mode, time signature, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence, and tempo). There isn’t any reason to reduce the number of dimensions, and unlike with images, we are given each feature that directly impacts the track.

I originally wanted to create a neural network that, given the thirteen features, could classify which artist a track came from. However, after creating a multiclass perceptron, I quickly realized that there just wasn’t enough data per artist. Groups like BTS, EXO, and Twice have hundreds of tracks, but smaller artists like Weki Meki and Baby V.O.X. have fewer than a hundred. My neural network was barely getting 2% accuracy, which means it was pretty much classifying every single track in my data set as the same artist.

Instead, I opted for two separate neural networks: one that classifies whether a song is from a male or female group and one that predicts the popularity value.

I built a multi-layer binary classifier for the neural network that classifies the gender and a multi-layer regressor for the neural network that estimates a popularity value. I’m still a relatively new student in the machine learning world, so I didn’t put much thought into the structure of the networks; I mainly just experimented until I was satisfied.

In my classifier, I had 13 neurons in the input layer, each one representing one of the audio features, and one hidden layer consisting of 60 neurons. Since this is a binary classification, my output layer had one neuron. The hidden layer used the Rectified Linear Unit (ReLU) as its activation function and my output layer used a sigmoid. I used binary cross-entropy for my loss function and the Adaptive Moment Estimation (Adam) algorithm for my optimizer.
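In code, that classifier looks roughly like this (a sketch assuming Keras; the layer sizes, activations, loss, and optimizer follow the description above):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# 13 inputs -> 60-neuron ReLU hidden layer -> 1 sigmoid output
classifier = Sequential([
    Dense(60, activation="relu", input_shape=(13,)),
    Dense(1, activation="sigmoid"),
])
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```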

In my regressor, I again had 13 neurons in the input layer and one hidden layer. However, the hidden layer now has 13 neurons instead of 60. My output layer has one neuron, representing the final popularity value it estimates. I again used ReLU as the activation function in the hidden layer, with no activation function in the output layer. I used Adam for my optimizer as well, but with mean squared error (MSE) as my loss function.
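And the matching regressor sketch, again assuming Keras:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# 13 inputs -> 13-neuron ReLU hidden layer -> 1 linear output (popularity estimate)
regressor = Sequential([
    Dense(13, activation="relu", input_shape=(13,)),
    Dense(1),
])
regressor.compile(optimizer="adam", loss="mse")
```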

When I initially took a look at my binary classifier, I got an accuracy of ~70%, which is honestly pretty good.

However, when the neural network tried to predict using the same training data, I found out that it was just predicting “Male” for every single track. When I took a look at the percentage of male tracks in the data set, I got an astounding 0.70378… So basically, the neural network gets 70% of the original training data set correct just by classifying every track as male. To address this problem, I added over twenty more girl groups/artists to the data set, which evens out the ratio and brings the percentage of male tracks down to about 0.54. To clean up the data a bit more, I standardized the data set and got rid of any songs longer than 300,000 ms (5 minutes). With these changes, my neural network achieves a solid 71.62% accuracy, and this time it’s actually trying to classify based on the given features.
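For reference, that cleanup step could look something like this with pandas and scikit-learn (a sketch; the file name and column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("kpop_tracks.csv")                   # assumed file name
df = df[df["duration_ms"] <= 300_000]                 # drop songs longer than 5 minutes

feature_cols = ["duration_ms", "key", "mode", "time_signature", "acousticness",
                "danceability", "energy", "instrumentalness", "liveness",
                "loudness", "speechiness", "valence", "tempo"]
X = StandardScaler().fit_transform(df[feature_cols])  # standardize: zero mean, unit variance
y = (df["gender"] == "Male").astype(int)              # 1 = male group, 0 = female group
```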

Before getting into the actual neural network, I wanted to see if I could differentiate between male and female songs without using any machine learning. I plotted every audio feature as a histogram, with the color indicating whether the track came from a male or female group.
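A sketch of that plotting step, assuming matplotlib/seaborn and the same assumed column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("kpop_tracks.csv")
feature_cols = ["duration_ms", "key", "mode", "time_signature", "acousticness",
                "danceability", "energy", "instrumentalness", "liveness",
                "loudness", "speechiness", "valence", "tempo"]

# one histogram per audio feature, colored by the gender of the group
fig, axes = plt.subplots(4, 4, figsize=(16, 12))
for ax, col in zip(axes.flat, feature_cols):
    sns.histplot(data=df, x=col, hue="gender", bins=30, ax=ax)
for ax in axes.flat[len(feature_cols):]:
    ax.set_visible(False)                              # 13 features, 16 axes: hide the extras
plt.tight_layout()
plt.show()
```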

At first glance, we can see that all the audio features have roughly the same distribution for both genders. There isn’t one specific audio feature that lets us discern between a song from a male group and one from a female group; it would be impossible to determine the gender of a track by eyeballing the audio features. Before checking the neural network, I was also curious whether there was any correlation between any of the audio features and the popularity score.

As expected, there isn’t really any correlation between popularity and the audio features. The main reason we aren’t able to recognize any patterns just by looking at these graphs and histograms is that my data set has over four thousand tracks across all sorts of music genres. The audio features all depend on each other, and it’s the combination of these features that defines a track and its gender/popularity. Isolating a single audio feature won’t do us any good when it’s dependent on the others.
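A quick numeric version of the same check is to compute the Pearson correlation between each audio feature and popularity (again a sketch with assumed column names):

```python
import pandas as pd

df = pd.read_csv("kpop_tracks.csv")
feature_cols = ["duration_ms", "key", "mode", "time_signature", "acousticness",
                "danceability", "energy", "instrumentalness", "liveness",
                "loudness", "speechiness", "valence", "tempo"]

# Pearson correlation of each feature with popularity; weak values mean
# no single feature linearly explains popularity on its own
print(df[feature_cols].corrwith(df["popularity"]).sort_values())
```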

Using PyGame, I created a GUI that lets me change the features directly with sliders. The GUI presents thirteen sliders, the value of each audio feature, the prediction from the neural network, and a button that switches between the gender and popularity predictions. I also display the track that the current audio features most closely resemble, found by normalizing the data and using a least squares approach.
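The “closest track” lookup boils down to normalizing the features and picking the track that minimizes the sum of squared differences from the current slider values. A minimal sketch (the function name and arguments are hypothetical):

```python
import numpy as np

def closest_track(slider_values, df, feature_cols):
    """Return the row of df whose normalized features are nearest to the slider values."""
    X = df[feature_cols].to_numpy(dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    X_norm = (X - mean) / std                          # normalize every track's features
    target = (np.asarray(slider_values, dtype=float) - mean) / std
    distances = ((X_norm - target) ** 2).sum(axis=1)   # least-squares distance per track
    return df.iloc[int(distances.argmin())]
```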

The first thing I noticed when playing around with these sliders was that none of them affected the gender on their own except energy (low female, high male), instrumentalness (low female, high male), and liveness (low male, high female). I tried to create a mellower, ballad-type song by leaving the key at 9 and the mode at 0 (which is A minor, the most common minor key), decreasing the danceability, energy, liveness, valence, tempo, and speechiness, and increasing the acousticness and loudness (which makes it softer). Doing this, my neural network predicts the song to be from a female artist with a popularity value of ~35. The closest song these features resemble is Flower by IU (a female solo artist), which has a popularity value of 37. You wouldn’t believe how happy I was when my neural network correctly predicted that!

I used these same settings, changing only the duration to its max. The neural network predicts a female gender and a popularity of ~42. And guess what! The song these features most closely resemble is Farewell by TAEYEON (another female solo artist), which has a popularity of 44.

For my next test, I flipped every value that was at one extreme of its slider to the opposite extreme. Theoretically, this should create a track that is more upbeat and electronic. My neural network predicts these features to be from a female artist with a popularity of ~7. This song most closely resembles 이상한 일 (So Strange) by Brown Eyed Girls (a female group), which has a popularity value of 9. Another success!

By playing around with the sliders a bit more, I noticed that the neural network tends to predict female when the sliders are at the extremes, but male when a majority of the sliders are towards the middle. When all the sliders are exactly in the middle (besides mode, since mode can only be 0 or 1), the neural network predicts the song to be from a male artist with a popularity value of ~9. The closest-matching song is Streets by A.C.E (a male group), which has a popularity of 0. So our neural network isn’t perfect, but honestly, I feel worse for A.C.E for having a track with a popularity value of 0.

To get the most popular song using the sliders, I had to increase the duration, acousticness, loudness, speechiness, and tempo while decreasing the danceability, mode, energy, instrumentalness, liveness, and valence. The key and time signature didn’t really affect the popularity. The neural network classifies this as female and estimates a popularity value of ~73. The song these features most closely resemble is The Visitor by IU, which has a popularity value of 56. Remember, neither these settings nor The Visitor represent the actual most popular song in my data set (that title goes to ON by BTS); it’s just what I got by playing around, and it’s really hard to hit specific values with the sliders.

On the other hand, the least popular song I could get came from pretty much flipping all of those extremes on the sliders. This predicted a female artist with a popularity value of around -9. My neural network was completely wrong here: these features most closely resembled Tropical by INFINITE, which is a male group and has a popularity value 26 points higher than what my neural network predicted.

It’s really hard to concretely conclude anything solely from the neural networks. However, I can point out a couple of things I noticed:

  1. Songs from female groups/solo artists tend to sit at the extremes, while songs from male groups/solo artists tend to sit in the middle. I believe this means that when a song from a female artist is a ballad, it’s likely going to be very acoustic, melancholic, and in a minor key. On the other hand, when a song from a female artist has a higher energy level, it’s likely also going to be upbeat, cheerful, more electronic, and in a major key. Male artists’ songs tend to be less one-dimensional: a song can have high energy, high danceability, and a more electronic sound while still being on the gloomy end and in a minor key (BTS songs tend to have these features).
  2. In general, songs that were performed at a live event are not that popular. This makes sense; personally, I find crowd noise in Spotify tracks quite annoying.
  3. The duration, acousticness, and instrumentalness affect the popularity value the most. This might mean that people tend to prefer longer songs, songs that are more acoustic (less electronic), and songs that aren’t instrumentals.

I plan on working on this project even more as I acquire the tools and knowledge. I want to compare these values to other genres in a statistically meaningful way (not by mere correlation) and analyze various K-Pop music videos as well. You can check out my code on my GitHub.
