On the Vital Importance of Proper Classification While Analyzing Destiny’s Child

Recently I was given an assignment to analyze a dataset of the top singles of the year 2000 as I saw fit. For context, the year 2000 was peak Destiny’s Child; they had essentially doused the U.S. with gasoline and set it on fire when they produced singles like “Independent Women”, “Say My Name”, and “Jumpin’ Jumpin’”. So as I was running some analyses on this nostalgia-fest, I came across some interesting data. Here are the genres in the dataset, along with how many songs each one had:

Rock           137
Country         74
Rap             58
R&B             23
Pop              9
Latin            9
Electronica      4
Gospel           1
Jazz             1
Reggae           1
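
To put those counts in proportion, here is a minimal sketch that turns them into shares of the whole list. The numbers are copied from the table above; the variable names are my own, and this is just one quick way to do the arithmetic rather than the analysis I actually ran.

```python
# Genre counts copied from the table above.
genre_counts = {
    "Rock": 137, "Country": 74, "Rap": 58, "R&B": 23, "Pop": 9,
    "Latin": 9, "Electronica": 4, "Gospel": 1, "Jazz": 1, "Reggae": 1,
}

total_songs = sum(genre_counts.values())

# Share of the dataset held by each genre, as a fraction between 0 and 1.
genre_share = {genre: count / total_songs for genre, count in genre_counts.items()}

# Print one row per genre: name, raw count, and percentage of all songs.
for genre, share in genre_share.items():
    print(f"{genre:<12}{genre_counts[genre]:>4}  {share:6.1%}")
```

The counts add up to 317 songs, so “Rock” alone accounts for roughly 43% of the list.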

If you’re surprised to see “Rock” with 137 of the 317 songs, or about 43% of the list, you’re onto something (you might also have enjoyed the sweet sounds of Destiny’s Child as I did). So, how did this happen? Every Destiny’s Child song was in the “Rock” genre. Okay, weird, but it’s only three songs, right? A few songs down the list was Christina Aguilera’s “Come On Over Baby”, also part of the “Rock” genre. Oh boy, this is starting to get ugly. ’N Sync? Rock. Remember Sisqo? He’s the one who made the infamous “Thong Song”. Well, apparently I missed his furious guitar solos, because he was also counted under “Rock”. This goes on and on.

These types of classification mistakes, while funny, can have a major impact on our ability to gain insight from the data, which is the whole job of a data scientist. Let’s put aside artistic merit for a moment and say that you just want to make the most popular song possible. Okay, the data says to go with a “Rock” song. But what is a “Rock” song now? Is it like something from Savage Garden? Or is it something by Destiny’s Child? How much do the fans of those two artists overlap? Intuition says not much, and grouping together artists whose fans do overlap is the reason we have genres in the first place.
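
As a rough illustration of how much relabeling changes the picture, here is a small sketch using only the songs named above. The field names (artist, title, genre) are hypothetical, not the dataset’s real columns, and the corrected genres are my own judgment calls rather than anything the dataset provides.

```python
from collections import Counter

# Songs mentioned in this post, all labeled "Rock" in the dataset.
# The field names ("artist", "title", "genre") are hypothetical.
songs = [
    {"artist": "Destiny's Child", "title": "Independent Women", "genre": "Rock"},
    {"artist": "Destiny's Child", "title": "Say My Name", "genre": "Rock"},
    {"artist": "Destiny's Child", "title": "Jumpin' Jumpin'", "genre": "Rock"},
    {"artist": "Christina Aguilera", "title": "Come On Over Baby", "genre": "Rock"},
    {"artist": "Sisqo", "title": "Thong Song", "genre": "Rock"},
]

# Hand-written corrections keyed by artist.
# These target genres are assumptions, not values from the dataset.
artist_genre_overrides = {
    "Destiny's Child": "R&B",
    "Christina Aguilera": "Pop",
    "Sisqo": "R&B",
}

def corrected_genre(song):
    """Return the override for the song's artist, or the original label if none exists."""
    return artist_genre_overrides.get(song["artist"], song["genre"])

# Tally genres before and after applying the overrides.
before = Counter(song["genre"] for song in songs)
after = Counter(corrected_genre(song) for song in songs)

print("before:", dict(before))
print("after: ", dict(after))
```

Even on these five songs, “Rock” goes from five entries to zero once the labels are corrected, so any “make a Rock song” conclusion drawn from the original counts would be resting on the wrong group of artists.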