Building Decentralized Media Registries

The architecture behind the dotBC music metadata management system

Published in Cardstack, Jun 17, 2019

Chris Tse, Founding Director of the Cardstack Project, presents the music metadata management system built on the Cardstack Framework for dotBC and Warner Music Group.

Songs from Warner Music Group and Warner/Chappell Music, the vision of our partner Dot Blockchain Media (dotBC), and the Cardstack Framework — this combination has produced a powerful data registry for media rights. It is a decentralized data management system for the music industry, which takes advantage of the blockchain’s trustlessness, while solving problems around multi-party interactions, scalability, and the ability for people to own their own data (both private and public).

Registries are about lookups on the one hand, and about sharing information on the other hand — whether that is identity-related data, content, rights, or metadata.

Similar to the Identity Overlay Network (ION) announced by the Decentralized Identity Foundation and Microsoft — where a Sidetree base allows you to anchor certain information on a blockchain and then use a content-addressable storage system like IPFS to make it all work — we have developed the concept of Gitchain. Instead of using this Sidetree structure, we use a packfile for synchronization purposes. This packfile has a reference on the global ledger (meaning: only tiny bits of on-chain data), which allows someone else to unpack it, thereby making a copy of any type of content, data, and code the packfile contains. This infrastructure, which connects the ION implementation and Gitchain, shows that we are converging towards the idea of a Layer 2 architecture.
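The anchoring idea can be sketched in a few lines. This is only an illustration under stated assumptions, not Gitchain's actual implementation: a tar archive stands in for the Git packfile, a plain Python list stands in for the global ledger, and the function names (`pack`, `anchor`, `unpack_verified`) and the placeholder location string are hypothetical.

```python
import hashlib
import io
import json
import tarfile


def pack(files: dict) -> bytes:
    """Bundle files into one archive (a stand-in for a Git packfile)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def anchor(ledger: list, packfile: bytes, location: str) -> dict:
    """Record only a tiny reference on the ledger: a hash and a location."""
    entry = {"sha256": hashlib.sha256(packfile).hexdigest(), "location": location}
    ledger.append(entry)
    return entry


def unpack_verified(entry: dict, packfile: bytes) -> dict:
    """Unpack a fetched archive only if it matches the anchored hash."""
    if hashlib.sha256(packfile).hexdigest() != entry["sha256"]:
        raise ValueError("packfile does not match the on-chain reference")
    with tarfile.open(fileobj=io.BytesIO(packfile), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


ledger = []  # stand-in for the global ledger
song = {"metadata.json": json.dumps({"title": "Example Song"}).encode()}
blob = pack(song)
entry = anchor(ledger, blob, "placeholder-location")  # hypothetical location
restored = unpack_verified(entry, blob)
```

The ledger entry stays the same small size no matter how large the archive is, which is the point of keeping only a reference on-chain.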

We have used the Cardstack Hub and infrastructure to build and deploy a real working application, not only to produce a proof-of-concept but also to validate the ideas about how Gitchain works.

Dot Blockchain Media’s vision was an open-source, globally accessible, decentralized and interoperable registry of music rights and rules, around a format for music called dotBC. So, we worked together to figure out the architecture around this nascent technology.

.bc was initially conceived of as a blockchain-anchored replacement for .mp3. With .mp3 files, anyone can easily change the metadata about a particular piece of music (like the name, the songwriter, the composer, the year it was copyrighted, etc.). There is no global source of truth. You could claim that you wrote and own a Paul McCartney song just by updating the song details, and nobody could overrule you, because the information you have entered is true on your iPod or iPhone. So, the concept of metadata being inside a file has always been around; that’s how you edit your music collection on your music player. In contrast, having a blockchain-anchored music format means that any change (say the ownership information of a particular song changes, or the cover art gets updated) can be cascaded to all players. The blockchain is the source of truth, while the file still exists as a physical copy on your computer.

Step 1: What is needed to build a decentralized music registry?

A lot of requirements had to be met, to be able to create the envisioned data management system:

No central authority. There shouldn’t be one website you go to — whether it’s dotBC or Spotify or Google — that determines the data and the correct master version of a song. It has to be a decentralized system.
Core data can be published and accessed publicly by anybody, anytime. This core data may not include details about royalty payments; but the basic information about who owns a song, who played in it, and who was a guest artist should be open to everybody. It’s part of the truth of the Web, almost like a Wikipedia page.
Sensitive data can be kept private and shared based on relationships. For instance, negotiations regarding percentages and disputes between songwriters constitute sensitive information that the public really doesn’t need to know.
Can reify metadata, media, and related items as a downloadable bundle. Any song can be a file, like an mp3, so you can download or stream it on your computer. All the information regarding that song gets downloaded together in one bundle. To have this duality — to go back and forth between the music industry’s online version of the registry and the musician’s or consumer’s view of the downloaded bundle — is important.
Scale to the entire commercial music catalog and all the fields. It’s difficult to have the entire music catalog of more than 50 million songs, which grows rapidly every day, including all the 100+ fields, on the blockchain. That would add up to too many transactions.
Record the history of amendments across the global network. The registry should hold not only the current version of a song, and not just a history of spends and cryptocurrency exchanges, but the entire history of changes made by multiple parties. If the rights for a song are transferred to someone else, that change has to show up in the registry, to ensure that the right person gets paid the royalties.
Changes can be proposed, then approved or denied. If someone claims to own a Beyoncé song, that person is probably a nobody trying to steal someone else’s rights. In a decentralized system, the community (or part of it) has to decide that this person is not a legitimate label or musician. It’s necessary to make sure they can deny such a fraudulent proposal.
Approved changes will become official; denied ones are discarded. Only changes that have been agreed upon by the song’s rights holders should get registered as definitive.
Calculate scores based on the quality of data to aid decision-making. Who is submitting the proposal? Is it coming from a random label, or one that has 15 years of history in the business? All this can be scored; we can look at the blockchain and provide some sort of trust score.
Filter and search by criteria across public and private catalogs. You can filter and search both your own songs, which you have added to the registry, and public information. Say you’re producing a cover song. You’ll want to make sure that you’re tagging or getting associated with the right original recording, so that the royalties go to the right artist.
Import and export catalogs or revisions in bulk with translations. Most things are not on the blockchain, but in enterprise systems. That’s why the registry needs to be able to import and export existing catalogs.
In-app workflow orchestration with third-party services. Streaming services, submitting information to supply chains, doing certain types of watermarking, fraud detection, or data cleanup — there are a lot of workflows the registry should communicate with; not only centrally, but on the way to the registry and on the way out of it too.
User-friendly interface for data entry and workflows. An open-source registry in particular should have a built-in, user-friendly interface that can be customized as a white-label solution.
Participants can choose to run their own node in their cloud or data center. If you want to join the music industry registry, you can run your own node to be part of it, thus representing 1/15 of the network, for example. It’s like joining the Bitcoin network as a miner or the Ethereum network as a full node.

These requirements make a lot of sense, but they demand a lot of things!

No central authority means you probably need a blockchain.
To make core data accessible, you don’t want to send people to Etherscan, but to have a very fast-loading Web page, like a Web CDN.
To keep sensitive data private, you need a social network, so you can determine who a person is. Is the person part of this song? Is it the label? Is it the rep for the artist? That requires some understanding of context, identity, and relationships.
For downloadable bundles, you have to deal with zip file and folder structures, unwrapping, unzipping, etc.
To scale to the entire music catalog, you probably need to use proven database technologies, because the blockchain doesn’t scale to the large number of transactions and amendments.
To record the history of amendments, you need an append-only ledger, but it may not necessarily have to be a blockchain.
Because changes can be approved or denied, you need a workflow system.
To ensure that denied changes are discarded, because some changes may be fraudulent or proposed by mistake, you need a version control system.
Calculating scores to aid decision-making requires an analytics system.
To filter and search across catalogs, a search engine is required.
In order to import and export catalogs, you need an ETL tool to extract, transform, and load bulk data.
For in-app workflow orchestration, an API layer is needed.
A user-friendly interface requires a Web app with a secure login, almost like a SaaS app.
If participants can choose to run their own node, you need a cloud platform.

Daunting or not — these are the prerequisites for building a decent media registry for recorded music, music videos, and other media assets.

Step 2: What should be avoided when building such a registry?

There are a lot of anti-patterns to consider as well:

Not a database. If there is no central authority, there shouldn’t be a database.
Not a SaaS app. If all the data is stored per user on a platform like Salesforce.com, how can the core data be accessible by the public? Besides, SaaS products are not designed for random people to click on a link and get the data. Therefore, it can’t be a commercial application; it has to be a public Web infrastructure.
Not a public site. A public site, however, cannot handle sensitive data.
Not just 1 source. The master file comes from this service, the publishing data comes from the publishing house, other information comes from the label, and statistics come from iTunes. There’s a lot of information coming from many sources. Downloadable bundles have to suck in all the information from the different music companies and music ecosystems.
Not a blockchain. If it has to scale to the entire music catalog, it can’t be encoded solely on a blockchain, given the limitations of the current generation of implementations.
Not a flat table. To record the history of amendments, the system has to be based on operations and deltas — not just on a list.
Not just 1 version. Since changes can be approved or denied, there are several versions of a song.
Not a rollback. You can’t just use simple rollbacks to ensure that denied changes are discarded.
Not at query time. Scores that aid decision-making cannot be calculated at query time, because that would be pretty slow. The score has to be accumulated over time, as amendments arrive, using a technique akin to “dynamic programming”: each new score builds on results that have already been computed.
Not predefined. A predefined query is not enough to filter and search across catalogs. Lots of people want to look at lots of songs — including the songs they have submitted, the money they have made, at this or that time, with this or that unlicensed cover song.
Not just REST APIs. A bulk import and export of catalogs with 600,000 songs requires a bulk API.
Not off-the-shelf. Off-the-shelf software is not sufficient for in-app workflow orchestration with third parties. It needs to be an extensible UI framework.
Not just back-end. A UI is needed to provide a user-friendly interface.
Not just a site. Having a site like Spotify.com or Disco.com is not important. If participants can choose to run their own node, the infrastructure has to be shared across all of these websites.
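Two of these anti-patterns, “not a flat table” and “not at query time”, can be illustrated together. The sketch below keeps an append-only log of deltas and updates a running score whenever an amendment is recorded, so reading a score never rescans the history. The class names and the scoring weights are invented placeholders, not the registry's actual model.

```python
from dataclasses import dataclass, field


@dataclass
class Amendment:
    author: str
    changes: dict   # field name -> new value (a delta, not a full row)
    approved: bool


@dataclass
class SongHistory:
    log: list = field(default_factory=list)     # append-only list of amendments
    scores: dict = field(default_factory=dict)  # author -> running score

    def record(self, amendment: Amendment) -> None:
        self.log.append(amendment)
        # Update the author's score now, so no query has to rescan the log.
        delta = 1 if amendment.approved else -2  # placeholder weights
        self.scores[amendment.author] = self.scores.get(amendment.author, 0) + delta

    def current(self) -> dict:
        """Fold the approved deltas, oldest first, into the current record."""
        state = {}
        for amendment in self.log:
            if amendment.approved:
                state.update(amendment.changes)
        return state


history = SongHistory()
history.record(Amendment("label-a", {"title": "Example Song"}, approved=True))
history.record(Amendment("unknown", {"owner": "unknown"}, approved=False))
history.record(Amendment("label-a", {"year": 2019}, approved=True))
```

The denied amendment stays in the log for auditing but never reaches the current record, and it lowers its author's score the moment it is recorded.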

These requirements and anti-patterns are tricky — especially because some of them are blatantly contradictory: It has to be a blockchain, but it can’t be a blockchain. It has to be a database, but it can’t be a database. That’s what happens when you start building useful applications with a blockchain infrastructure in the back-end.

Step 3: How can these requirements be fulfilled?

What kind of architecture can satisfy all (or at least a good part) of these requirements for building a decentralized media registry?

We found a way to satisfy the requirements and avoid the anti-patterns, using the following methods:

We built dotBC as…

…a blockchain (to have no central authority), using Hyperledger Sawtooth. We chose a permissioned ledger that works and use it judiciously, in a Gitchain kind of way.
…a Web CDN (to make core data accessible), using dotBC.info. We make the latest version of the data easily accessible — including watermarks and audio files. That’s all done through a regular CDN-backed Web page, but generated and published in a flat way, so that people can see it quickly.
…a social network (to keep sensitive data private), using “Request Access” actions. This is a way to understand the identity and authentication of a person.
…a zip file/folder, using downloadable bundles.
…a database (so it can scale to the entire music catalog), using Postgres.
…a ledger (to record the history of amendments), using Gitchain.
…a workflow system (so changes can be approved or denied), using branching and merging — in the first version.
…a version control system (so denied changes are discarded), using Git/GitHub — in the next version. Multi-Git enables an oracle system. The workflow/version control system is based on Git, so that new versions can be created and discarded. GitHub is our way of demonstrating and debugging a lot of these things.
…an analytics engine (for calculating scores to aid decision-making), using AI/ML (Artificial Intelligence/Machine Learning) and computed properties in the Hub. This makes a formula view possible, to get a score.
…a search engine (to filter and search across catalogs), using PG Search in the Cardstack Hub. We use the Postgres database as basic JSON storage to query this information.
…an ETL tool (to import and export catalogs), using a custom JSON formatter that checks into Git.
…an API layer (for in-app workflow orchestration), using JSON:API via the Cardstack Hub. The Cardstack Hub is a schema-driven API server that generates JSON:APIs.
…a Web app (a user-friendly interface), using a dotBC-card-based UI. This app is a small part of the system, which is demoed below.
…a cloud platform (so participants can choose to run their own node), using Amazon Web Services. Being currently hosted on AWS, this is part of the proof-of-concept (PoC).
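To make the “schema-driven API server” idea more concrete, here is a rough sketch of how a schema could drive the shape of a JSON:API resource document. The envelope (`data`, `type`, `id`, `attributes`, `relationships`) follows the JSON:API specification, but the schema format, the field names, and the `to_jsonapi` helper are assumptions made for illustration; this is not the Cardstack Hub's actual interface.

```python
import json

# Hypothetical schema: which fields are attributes and which are relationships.
SONG_SCHEMA = {
    "type": "songs",
    "attributes": ["title", "year"],
    "relationships": {"label": "labels"},
}


def to_jsonapi(schema: dict, record: dict) -> dict:
    """Shape a flat record as a JSON:API resource document."""
    resource = {
        "type": schema["type"],
        "id": str(record["id"]),  # JSON:API ids are strings
        "attributes": {k: record[k] for k in schema["attributes"] if k in record},
    }
    relationships = {
        name: {"data": {"type": target, "id": str(record[name])}}
        for name, target in schema["relationships"].items()
        if name in record
    }
    if relationships:
        resource["relationships"] = relationships
    return {"data": resource}


doc = to_jsonapi(
    SONG_SCHEMA,
    {"id": 42, "title": "Example Song", "year": 2019, "label": 7},
)
print(json.dumps(doc, indent=2))
```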

This music registry for dotBC, including the infrastructure to support it, was built and shipped by Q1 2019.

Architecture & concepts of the media registry

The architecture of the dotBC media registry is built on two axes: Horizontally, the data comes through the ETL, goes to the folder, then to the database, before it gets exported as a zip file. Vertically, the platform is at the bottom and the API is at the top. The API layer captures the data and updates the database, leading to the version control system. It updates the ledger, which updates the blockchain, before the data gets anchored and runs on the cloud platform.

Here’s another way of looking at it:

The database sits in the middle. The ETL provides the feed of all the music information (instead of music data, this could be book data, video data, episodic series data, or something else in the future). The data then goes to a folder that describes this particular piece of content, data, or this person.

The search engine then pulls that information into a bundle and denormalizes it; so that the zip file, which can be downloaded, is not the official version, but a copy of the data. Meanwhile, the database has to figure out what’s the truth, and that’s version control.

The ledger is anchored on the blockchain.

The workflow system and the social network are not the main emphasis in this case; they exist to support the API layer. There’s an authenticated Web app and a Web CDN. Regular users who go to the CDN just see a Web page, where they can find out quickly who wrote a particular song. The Web app is where the framing can be modified and changed, before it gets published back to the CDN.

The Cardstack Hub is the database. It is a virtual database with a database engine underneath. The hub is multifaceted: It does scoring and searching. It deals with JSON models and goes back and forth between Git plugins. It can export CSV files, it has a schema, and it can hook into machine learning. Essentially, the Cardstack Hub is the scalable database that allows us to capture and manage this catalog.

Because it gets its data through Gitchain and the underlying blockchain, anybody can run a hub. This means: You can use the blockchain and the information from Gitchain to (re)create your own database, your own Cardstack Hub. This allows you to serve your own clients, using the API that you provide and an app that’s special to you. Alternatively, you can customize the standard app — using dotBC as a starting point to build your own music catalog app, or even consumer-facing music streaming services.
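The idea that anybody can recreate their own database from the shared history can be sketched as a replay loop. This is a simplified illustration: an in-memory SQLite database stands in for Postgres, a Python list stands in for the data obtained through Gitchain, and the entry format (`song_id`, `changes`) is hypothetical.

```python
import json
import sqlite3


def rebuild(ledger: list) -> sqlite3.Connection:
    """Replay ledger entries, in order, into a fresh local database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE songs (id TEXT PRIMARY KEY, doc TEXT NOT NULL)")
    for entry in ledger:
        row = db.execute(
            "SELECT doc FROM songs WHERE id = ?", (entry["song_id"],)
        ).fetchone()
        doc = json.loads(row[0]) if row else {}
        doc.update(entry["changes"])  # apply this entry's delta
        db.execute(
            "INSERT OR REPLACE INTO songs (id, doc) VALUES (?, ?)",
            (entry["song_id"], json.dumps(doc)),
        )
    return db


def load(db: sqlite3.Connection, song_id: str) -> dict:
    row = db.execute("SELECT doc FROM songs WHERE id = ?", (song_id,)).fetchone()
    return json.loads(row[0]) if row else {}


shared_ledger = [
    {"song_id": "s1", "changes": {"title": "Exmaple Song"}},
    {"song_id": "s2", "changes": {"title": "Another Song"}},
    {"song_id": "s1", "changes": {"title": "Example Song", "year": 2019}},
]
node_a = rebuild(shared_ledger)
node_b = rebuild(shared_ledger)
```

Because both nodes replay the same entries in the same order, they arrive at the same state without either one being the central authority.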

Music registry demo

The dotBC music registry we have built captures the sample catalog of music that was provided by Warner Music Group and Warner/Chappell Music. Let’s take a look at the registry demo that shows you how our Web app works:

Dmitry Naidionov, Cardstack’s Product Manager, gives a tour through the music metadata management system built on the Cardstack Framework for Dot Blockchain Media and Warner Music Group.

Once you are logged into the system, you can see and browse the music catalog, searching by group name/performer, by song name, or by internal codes. When the catalog returns your results, those songs will contain different kinds of data — master, release, and/or works data — which are simply different bundles of data managed by different organizations.

As a regular user, you can always see the publicly available data for the individual songs. To see the information that is private — the artwork or the ownership rights, for example — you need additional authorization, which you can request. All you need to do is click the “Request Access” button. Your request is delivered to the organization managing that particular part of the song. If the master and release data is managed by the label, only the label can approve or deny your request regarding that information. If the works data is managed by the publisher, only the publisher can approve or deny your request to see that private data. But as soon as they give you access to that additional information, you can see the respective private fields — and look at the artwork, play the audio files, or see the ownership splits, for example.
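The access rules described above can be modeled roughly as follows. The `Section` class, the function names, and the sample fields are hypothetical; the sketch only captures the two rules from the demo: each part of a song has one managing organization, and only that organization can grant access to its private fields.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    manager: str   # organization that manages this bundle of data
    public: dict
    private: dict
    granted: set = field(default_factory=set)  # users approved by the manager


def request_access(section: Section, user: str, decided_by: str, approve: bool) -> bool:
    """Only the section's manager may decide on an access request."""
    if decided_by != section.manager:
        raise PermissionError(f"{decided_by} does not manage this section")
    if approve:
        section.granted.add(user)
    return approve


def visible_fields(section: Section, user: str) -> dict:
    """Public fields for everyone; private fields only after approval."""
    fields = dict(section.public)
    if user in section.granted:
        fields.update(section.private)
    return fields


works = Section(
    manager="publisher",
    public={"title": "Example Song"},
    private={"splits": {"writer-a": 50, "writer-b": 50}},
)
before = visible_fields(works, "alice")
request_access(works, "alice", decided_by="publisher", approve=True)
after = visible_fields(works, "alice")
```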

You can request editing rights too. Say you notice that a particular song title is incorrect. All you need to do is click a button to create a local branch. This means that you won’t be editing the master data, but a local copy. Once you save it, it is published to your local branch (not yet to the public catalog). But with one click, you send a merge request. The maintainer of this particular part of the metadata (such as the label) gets notified and can approve your request to merge the branches. Once the original branch and the branch with your amendments are merged, the song data — containing your edits — will be published for everyone to see.
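A toy model of this branch-and-merge workflow is shown below. It is deliberately simplified: real Git tracks commits and merges them, whereas this sketch copies the whole record into a branch and, on approval, replaces the master copy with it. The class and method names are hypothetical.

```python
import copy


class SongRepo:
    """Toy branch-and-merge model for one song's metadata."""

    def __init__(self, maintainer: str, data: dict):
        self.maintainer = maintainer
        self.branches = {"master": data}
        self.merge_requests = []

    def create_branch(self, name: str) -> dict:
        # Edits happen on a copy, never on the master data directly.
        self.branches[name] = copy.deepcopy(self.branches["master"])
        return self.branches[name]

    def request_merge(self, name: str) -> int:
        self.merge_requests.append({"branch": name, "status": "open"})
        return len(self.merge_requests) - 1

    def decide(self, request_id: int, user: str, approve: bool) -> None:
        if user != self.maintainer:
            raise PermissionError("only the maintainer can decide")
        request = self.merge_requests[request_id]
        if request["status"] != "open":
            raise ValueError("request was already decided")
        branch = self.branches.pop(request["branch"])  # removed either way
        if approve:
            self.branches["master"] = branch  # simplification: no real merge
        request["status"] = "merged" if approve else "denied"


repo = SongRepo("label", {"title": "Exmaple Song"})
branch = repo.create_branch("fix-title")
branch["title"] = "Example Song"
title_before_merge = repo.branches["master"]["title"]
request_id = repo.request_merge("fix-title")
repo.decide(request_id, "label", approve=True)
```

A denied request simply drops the branch, so the master data never has to be rolled back.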

You can even request matching rights, to edit or add matches between songs and artists that you think should be associated with each other.

Furthermore, you can export the bundle for a particular song. Of course, the actual data you are able to export depends on your permission rights. If you are a regular user without access to private data, you can only create a bundle that contains publicly available data. But if you are a privileged user, you can export a bundle containing all the available data, both public and private.
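Exporting a permission-dependent bundle could look roughly like this. The file names inside the archive and the `export_bundle` helper are assumptions for illustration; the actual dotBC bundle layout is not described here.

```python
import io
import json
import zipfile


def export_bundle(song: dict, privileged: bool) -> bytes:
    """Write the fields the user may see into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("public.json", json.dumps(song["public"], indent=2))
        if privileged:
            bundle.writestr("private.json", json.dumps(song["private"], indent=2))
    return buf.getvalue()


def list_files(bundle_bytes: bytes) -> list:
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
        return sorted(bundle.namelist())


song = {
    "public": {"title": "Example Song"},
    "private": {"splits": {"writer-a": 50, "writer-b": 50}},
}
regular_files = list_files(export_bundle(song, privileged=False))
privileged_files = list_files(export_bundle(song, privileged=True))
```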

As for the data source, we use Git repositories, so all the data is saved in GitHub. But not every single commit gets synchronized to the blockchain, since that would be very expensive. Instead, only merge requests from users that have been approved are turned into bundles that get reported to the blockchain. This means we only synchronize milestones.

What’s next?

1. We are working on improving the UX of multi-party workflows — both design- and development-wise — using the Card SDK as the basic building blocks. This dovetails with the design for Card Flow as well. Ultimately, it will be a standard package that is part of Cardstack’s open-source codebase.

2. We are implementing a pluggable strategy for AI/ML-based scoring and response automation. The idea of artificial intelligence and machine learning is this: If you have a feed of things coming in (“Do you want to approve this license? Do you want to sell this to me? Do you want to trade this with me?”), you can either look at the workflow and read the email message, then say “yes” or “no” to make the decision, or you can train a bot to respond on your behalf. That is how machine learning can plug into blockchain, though not necessarily at the smart contract level (having a smart contract decide whether something is approved or rejected seems a little scary). But someone running a node, running a client, or running a UI can say, “I’m not going to log in every day to say ‘yes’ to a million things. Most of the time, please make the decisions for me, whether to approve this metadata request in the music industry, or whether to approve the revisions in the content management system.” Only if there is real controversy or uncertainty about whether something should be approved does the machine notify the respective person to review it themselves.

For dotBC, this concept provides an industry-specific solution. It allows music companies to use the latest generation of machine learning algorithms to determine whether a song is accurate or whether requests and amendments from particular parties are trustworthy. This way, the poor person from the label doesn’t have to say “yes” every single time someone merely changes a misspelling or adds a bassist to a song. Those things can be automated, while risky things — like taking rights from someone, or transferring rights that change the royalty payments — require more authentication.
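The split between automatable and risky changes can be expressed as a small routing rule. Everything specific here is a hypothetical placeholder: the set of rights-related fields, the threshold, and the assumption that a separately trained model supplies a trust score between 0 and 1.

```python
# Hypothetical field group and threshold; a real deployment would tune these.
RIGHTS_FIELDS = {"owner", "splits", "publisher"}
AUTO_APPROVE_THRESHOLD = 0.9


def route(changed_fields: set, trust_score: float) -> str:
    """Return 'auto-approve' or 'human-review' for a proposed amendment.

    trust_score is assumed to come from a separately trained model, in [0, 1].
    """
    if changed_fields & RIGHTS_FIELDS:
        return "human-review"   # rights changes always go to a person
    if trust_score >= AUTO_APPROVE_THRESHOLD:
        return "auto-approve"   # low-risk change from a trusted party
    return "human-review"       # uncertain cases are escalated


spelling_fix = route({"title"}, trust_score=0.95)
rights_transfer = route({"owner"}, trust_score=0.99)
doubtful_edit = route({"title"}, trust_score=0.5)
```

Keeping the rule outside the smart contract matches the concern raised above: the automation runs at the node or client, and a person remains the fallback.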

We plan to turn this into a general plugin strategy we can provide to other industries as well — to plug machine learning algorithms that operate at the edge of the network into decision-making. And this will likely tap into the Python ecosystem, as something that is not just written in JavaScript; because most machine learning is done in Python. So, we want to enable developers to use tools from the Python ecosystem to drive the decision-making engine.

3. We are considering the trade-offs of anchoring to a private blockchain (a permissioned ledger like Hyperledger Sawtooth) versus a public blockchain (like Ethereum). We’re working on IPFS and Ethereum for Gitchain, as a way to store anything that can be stored in Git. Some of that can be reimplemented in other types of registries. In contrast to blockchain solutions that require you to use their blockchain and their infrastructure before you can do anything, we want to support different approaches. But we do believe in a public chain as an anchor. Ultimately, you can choose which blockchain you find more trustworthy. And as long as you only put one transaction per 500,000 off-chain transactions on the Bitcoin or Ethereum blockchain, the cost remains reasonable. So, these trade-offs matter less once we have this Layer 2 solution. And it’s great to have this innovation in lockstep with different people in the industry who are working towards the same thing, like Microsoft and the Decentralized Identity Foundation with their ION project.
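One common way to let a single on-chain transaction stand for a large batch of off-chain changes is to anchor a Merkle root. The article does not say that Gitchain uses a Merkle tree, so treat the sketch below as an assumption that illustrates the cost argument, not as the actual mechanism. The function name and the sample batch are hypothetical.

```python
import hashlib


def merkle_root(leaves: list) -> bytes:
    """Fold a non-empty list of byte strings into a single 32-byte root."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# A batch of off-chain changes; only the root would be written on-chain.
batch = [f"commit-{i}".encode() for i in range(1000)]
root = merkle_root(batch)

# Changing any single entry changes the root, so the anchor covers them all.
tampered = list(batch)
tampered[123] = b"forged-commit"
tampered_root = merkle_root(tampered)
```

The arithmetic behind the cost claim is simple: with one on-chain transaction per 500,000 off-chain transactions, a hypothetical fee of $5 per on-chain transaction works out to $5 / 500,000 = $0.00001 per off-chain change.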

We are closing the gap between Web 3.0 (where we see the Web going) on the one hand and Web 2.0 (with its cloud infrastructure, technology, and scalability) on the other hand. They are now converging, so that we can actually create something that is ready for end users, allowing them to do what they want to do on the network of their choice, while still having something as user-friendly as anything they’ve ever done on the centralized version of the Web.

Learn More

Join our Telegram group and announcement channel, and star Cardstack on GitHub to follow updates on the Cardstack framework, Gitchain, and a suite of standard cards.