The Music Industry’s Dirty Data Secret

Henry Marsden
Published in Creatrclub
9 min read · Apr 1, 2021

Photo by Adrian Korte on Unsplash

A few weeks ago I took part in a Palm Bay Music takeover of The Featured Artist Coalition Instagram page. In an offhand comment I mentioned how critical it was for songwriters to get their metadata sorted… cue a flurry of questions about what metadata is, why it’s important and how to keep on top of it.

Stories abound about the effect poor metadata has on the flow-through of creatives' income, felt universally across the spectrum from hobbyists to superstars. One tale I heard directly from a global megastar's management team illustrates the data issues at hand. The management company's in-house publishing consultant took time to manually work through foreign databases, ensuring metadata had been correctly registered across this mega-artist's catalogue. He found a small error in one registration (on the order of a spelling mistake or mis-typed identifier), and was rebuffed after raising it with the artist's publisher. It took quite some back and forth for the publisher to agree to look for the error, admit it was there and subsequently correct it. The result? A six-figure payout of backlogged royalties landed in the artist's pocket (and, of course, the publisher's pocket too).

Metadata sounds like a boring and niche part of one of the world's supposedly sexiest industries- yet it is at the very core of how music makes money. Arguably it is in fact the sole mechanism by which music ever makes money. It is the very lifeblood of the music industry- the arteries and veins along which royalties flow, the framework and scaffolding that facilitates transactions involving music. Metadata is a necessary requirement for royalties to be paid; it's the reference point for calculating where royalties are distributed. If creatives want to be paid for their creativity they need to ensure their metadata is collated, accurate and up-to-date across the value chain- but this starts with understanding what it actually is and why it's so critical.

So what is metadata?

Metadata is the information that describes a song or recording (i.e. a piece of IP), but is not itself part of the song/recording. Songs are intangible concepts- they need information to describe, isolate and identify them. The information that does this is metadata.

How do you describe to people where your house is? You’d need to include your Country, State/County/Region and Town/City- but these aren’t enough to give new visitors who don’t know you or the area the specific location of your property. They’ll also need the street name and your building number. Governments have gone as far as to develop systems of alphanumeric identifiers to narrow down, sub-divide and specifically identify geographical areas: the postal/zipcodes. This information is your house’s metadata. Addresses are property metadata. They describe where properties are, how to find them or post/deliver items to them. Without a universally accepted system of ‘addresses’ it would be incredibly difficult to differentiate one property from another, particularly in an intangible frame of reference (such as in a database). It would be impossible to send post and assign utilities. In a similar manner- how do we describe, isolate and identify intangible musical works?

Let’s take songs in isolation for now- ignoring recordings for the moment to keep a specific focus. Metadata is the information that identifies a song beyond the actual lyrics, melody and harmony that define the song itself. That is the song’s data, the information that defines what it is (or isn’t). The metadata is the information about that data, the information that describes, isolates and identifies that combination of melody/lyrics/harmony as that specific song¹:

  • Title
  • Alternative title(s)
  • Composers/writers
  • Their roles (composer, lyricist, arranger, translator, etc.)
  • Their % ownership control of the copyright (evenly split or otherwise)
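
To make the list concrete, those fields can be sketched as a simple record. This is only an illustration- the class and field names (Writer, SongRegistration, share_percent) are my own invention for the sketch, not any official industry schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Writer:
    # Illustrative fields only, not an industry-standard layout
    full_name: str
    role: str              # e.g. "composer", "lyricist", "arranger", "translator"
    share_percent: float   # % ownership control of the copyright

@dataclass
class SongRegistration:
    title: str
    alternative_titles: List[str] = field(default_factory=list)
    writers: List[Writer] = field(default_factory=list)

    def total_share(self) -> float:
        # In a well-formed registration the writer shares sum to 100%
        return sum(w.share_percent for w in self.writers)

# Hypothetical example song and writers:
song = SongRegistration(
    title="I Love You",
    writers=[
        Writer("John Paul Smith", "composer", 50.0),
        Writer("Jane Doe", "lyricist", 50.0),
    ],
)
print(song.total_share())  # 100.0
```

The point of the sketch is simply that a song registration is structured data- a fixed set of fields that every party down the chain needs intact and consistent.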

This data is immutable- it doesn't change as time passes. It describes, isolates and identifies that song as that song, and this song as this song. Once finalised, the title of a song doesn't change, the writers who originally wrote it don't suddenly change, and neither do their roles.²

There is a wider set of data that can change over time- the song’s publishers and their administration framework(s). If a songwriter is signed to a publisher (and in a specific country), how does a foreign society know to pass revenue back to that specific local publisher? The writer may be signed to an independent publisher in their local territory, but that catalogue is likely to be represented by another publisher (administrator or sub-publisher) in other territories. Song metadata needs to include this information, and the sum-total needs to accurately pass down the administration chain so any usage globally can be correctly reconciled with the correct song- and revenue therefore correctly attributed and passed back to the original publishers and songwriters.

Why does it matter?

It sounds simple enough- to collate a specific set of data and ensure it's passed accurately between partners. Beyond the standard 'typo' issues (which in themselves contribute significantly to the problem), these details are too often lost in a generally outdated system flooded by an exponential explosion in the quantity of data. The multifaceted data issues are only exacerbated by the number of tracks released each day (40,000+) and by the ever more granular usage detail from digital services (like Spotify) that subsequently needs sifting through.

Firstly, if the metadata is not 100% present and 100% 'conflict free'/verified then, all too often, nothing is paid out at all. If a writer is missing, or worse the track is over-claimed (e.g. as often happens when 3 publishers each claim 33.34% of a song), then the work is not paid out on. It's in the interests of each party involved, whether writer or publisher, to gather all the information for registrations. Everyone's identifier numbers. Everyone's full names. All the publisher details. This is the very basis for ensuring accurate data propagates through the pipework- laziness at the first step creates higher and higher hurdles to overcome as the data travels further from its original source.
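
The over-claim scenario is easy to demonstrate. A minimal sketch (my own, hypothetical publisher names- societies run far richer conflict checks than this) of a claims total check:

```python
def check_claims(claims: dict, tolerance: float = 0.01) -> str:
    """Flag a song whose combined publisher claims don't sum to 100%."""
    total = sum(claims.values())
    if total > 100 + tolerance:
        return f"over-claimed: {total:.2f}%"
    if total < 100 - tolerance:
        return f"under-claimed: {total:.2f}%"
    return "ok"

# Three publishers each rounding a third up to 33.34% jointly over-claim
# the song, so the registration conflicts and payment stalls:
print(check_claims({"Publisher A": 33.34, "Publisher B": 33.34, "Publisher C": 33.34}))
# over-claimed: 100.02%

print(check_claims({"Publisher A": 50.0, "Publisher B": 50.0}))
# ok
```

Until the parties agree whose share comes down, the conflicting claims hold the whole work in limbo- exactly the "not paid out" case described above.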

Secondly, there is little in the way of global standards for how data is to be entered. Are individuals noted as J.Smith, John Smith, Smith/John, John Paul Smith, J.P. Smith or just Smith? Databases differ in their requirements, and struggle to align data across borders due to a lack of globally agreed standards. This was less of an issue in days gone by, with fewer creatives and fewer creative works to connect them to. However, the explosion of songs/recordings has necessitated the reduction of creatives and their work to their representative metadata. Song registrations need to be consistent and accurate to ensure connection and collation- automation is king in the digital age (… or at least should be).
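
A toy normaliser shows both why automated matching helps and why it isn't enough. This is a deliberately crude sketch of my own, not how any society actually matches names:

```python
import re

def normalise_name(name: str) -> str:
    """Toy canonicaliser: split on separators, lowercase, sort tokens.
    Real matching systems are far more sophisticated than this."""
    tokens = [t for t in re.split(r"[\s,./]+", name) if t]
    return " ".join(sorted(t.lower() for t in tokens))

# Word-order and separator variants now compare equal...
assert normalise_name("Smith/John") == normalise_name("John Smith")
assert normalise_name("John Paul Smith") == normalise_name("Smith, John Paul")

# ...but initials still defeat the matcher, which is exactly the problem:
assert normalise_name("J.Smith") != normalise_name("John Smith")
```

Mechanical cleanup can reconcile formatting variants, but "J.Smith" and "John Smith" remain distinct records without a shared identifier- which is why consistent registrations and identifier numbers matter so much.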

Thirdly, the separation of, and distinctions between, registration databases on a territory-by-territory basis creates other issues. There is a fragmentation of rights at both the songwriter level (as collaborations increase) and at the publishing administration level (due to sub-publishing networks built on local rights-specific societies and their databases). The result is that the exact same set of metadata generated by one song is registered separately by 10+ different entities at the start of the value chain, let alone the further registrations multiplied out by foreign sub-publishers and administrators. Any errors in the initial data (e.g. typos or missing data) propagate down the value chain before being collated by a foreign society. This makes it difficult for societies to chase the 'source of truth' for the data, and foreign registrations can be left invalid because the conflict is difficult to resolve. The more entities in each chain (writer -> original publisher -> sub-publisher -> administrator -> society), the higher the likelihood of poor pass-through of data and resulting conflicts.

Finally, correcting data issues after the fact can often create further problems instead of fixing the initial fault. Duplicate registrations lead to further data conflicts, which means more revenue into suspense accounts. This happens on the PRS system- any data edits create a duplicate entry containing the new edit, with the 2 registrations intended to be merged automatically at a later stage. Corrections also need to be shared and correctly passed down to sub-publishers, and across to collaborators and their administration networks. In the current industry framework the best strategy is to have data fully collated and de-conflicted before it’s registered or shared anywhere.

Of course, all these issues are made significantly worse by poor data ingestion from other sources (aggregators/DIY uploaders being historic culprits) and by the sheer quantity of tracks hitting the likes of Spotify and Apple Music without the corroborating data also reaching societies.

What corroborating data? That of recordings. Songwriters aren’t paid for Spotify plays if their song registrations aren’t linked to the recordings that embody those songs. Songwriters, typically via their publishers, need to register recording ISRCs³ with the society song entries. Songs need to be linked with all the recordings of those songs out in the wild, so when societies receive recording usage data they can map it to song ownership in their databases. Platforms like Spotify typically don’t demand song information on upload- another task where manual data collation is required to ensure a smooth flow through of royalties. Other recording metadata is also helpful to make these connections more likely to occur¹:

  • ISRC
  • Record Label
  • Release title
  • Release date
  • Artist (and featured artists)
  • Length
  • Catalogue number
  • UPC/Barcode
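
The song-to-recording link can be sketched as a simple lookup. All identifiers and names below are hypothetical, and real society matching involves far more than a dictionary- but the principle is the same: no ISRC-to-work link, no songwriter payment:

```python
# Society-side view (simplified): songs keyed by an internal work id,
# recordings linked to works by ISRC.
songs = {
    "W-001": {"title": "I Love You", "writers": ["John Paul Smith", "Jane Doe"]},
}
isrc_to_work = {
    "GBABC2100001": "W-001",  # a registered recording of the song
}

def writers_for_play(isrc: str):
    """Map a reported recording play back to the songwriters to be paid."""
    work_id = isrc_to_work.get(isrc)
    if work_id is None:
        return None  # unlinked recording: the royalty sits unmatched
    return songs[work_id]["writers"]

print(writers_for_play("GBABC2100001"))  # ['John Paul Smith', 'Jane Doe']
print(writers_for_play("GBXYZ2100999"))  # None- an unlinked ISRC can't pay through
```

Every recording of a song that exists in the wild needs an entry in that mapping, which is precisely the manual collation work described above.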

The issues, and the industry discussions about them, abound- but it's the direct effect they have on creatives that is the greatest tragedy. By example: an imaginary hot new track I Love You is picked up and played on radio, where a royalty is subsequently due. The radio station checks who to pay (a simplification), but of course there are many MANY recordings titled I Love You. Luckily the radio station knows the artist name and has an ISRC identifier from when the recording was delivered, so it is able to narrow down the entity to pay recording royalties to. Unfortunately there are tens of thousands of songs titled I Love You, and the station was given no information that might help it track down the correct I Love You song it has used- so it cannot pay the songwriters. Despite knowing who owns the recording, the station doesn't know who wrote the underlying song (whether that's also the artist or not), because it has no idea which I Love You song was recorded in the version it played.

In reality all the data collation occurs at the society under blanket licensing (rather than the station itself having to work out who to pay), and the principle extends to services such as Spotify and Apple Music too. Aggregators don't demand song metadata as mandatory for uploading a recording to their service- tens of thousands of recordings are uploaded daily and made available without anyone being told who wrote the songs those recordings actually embody. You can see how this might be a problem for the songwriters involved.

In the interconnectivity of the digital age it seems beyond archaic that data management is still such a significant issue plaguing the industry. The ‘big little problem’ of data has been exacerbated by the digital revolution rather than helped. The volume of data has overwhelmed an ageing infrastructure, one not built for such scale and not created with a wider interconnected ecosystem in mind. Current systems struggle to talk to one another at scale, and understandably any available investment has been spent simply trying to ‘keep up’ rather than instigate any broader level of ground-up reset or integration.

The net result is that creatives are losing out. By at least helping to collate data at its very inception, they can set themselves up for a higher likelihood of unblocked and smooth-flowing royalties when they are due. If creatives are made aware of the issues then they know the right questions to ask of the right people to ensure data transparency and accuracy. Data is everyone's responsibility- but if creatives don't take hold of the issue directly they'll struggle to claim what's rightfully theirs through the industry's opaque plumbing.

[1] Neither of these metadata lists is exhaustive. Put simply- the more data, the better. The concept of a ‘minimum viable data set’ has been bandied around the industry for the last few years, but misses the mark on encouraging comprehensive data capture. The more data that is collated and ‘cleaned’, the higher likelihood matches can be made between recordings/songs and the entities that control those separate rights.

[2] Of course this isn't necessarily true in practice- additional writers can be added to songs, or the splits changed. It's worth noting there are international PRO rules prohibiting the same set of writers, with the same set of roles, from registering another song with the same title as one already in existence. This helps ensure that each song can still be uniquely identified by its title, writers and their roles.

[3] ISRC- unique identifiers for master recordings (International Standard Recording Code)
