The Speed of Sound: Ingesting Musical Content At Scale Using DDEX
This is the first in a series of posts describing our DDEX-compliant music ingestion service. In this and future posts, we’ll talk about the end-to-end architecture, the code, and the deployment behind this service.
What is DDEX?
First, a little background to get you up to speed:
Aaptiv provides on-demand audio fitness that combines the guidance of a personal trainer in sync with the music you already know and love.
That requires us to maintain a huge library of music for our Audio Engineers and Trainers to find the perfect song for the perfect motivation.
We receive the music used in our app from multiple major record labels. The music industry uses a standard called Digital Data Exchange (or DDEX) to distribute data about works, tracks, and products (including ownership and sales information) digitally. You can learn more about the DDEX standard here.
The juice that made it worth the squeeze
Creating an automated service to ingest this data from our partners offered us many advantages:
- Automated services can perform faster than their manual counterparts. This meant we could get new music in front of our Audio Engineers faster.
- By eliminating manual tasks, we eliminate manual mistakes. Using the DDEX standard as our measuring stick, we can spot errors, omissions, and problems more quickly and more accurately, and we no longer burn manual labor on tasks with incomplete or incorrect data.
- The DDEX standard allows us to work with multiple partners using the same code base, reducing complexity and engineering overhead for redundant tasks. (Note: there is a gray area here, since this assumes each partner adheres to the DDEX standard in the same way. You’ll see later how we anticipated some flexibility in the standard without having to duplicate or fork code.)
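To make that last point concrete, one lightweight way to absorb per-partner quirks without forking code is a configuration layer. This is a hypothetical sketch (the provider names, keys, and quirks are invented for illustration), not our actual implementation:

```python
# Defaults that hold for any partner following the standard strictly.
# Keys here are invented for illustration.
DEFAULTS = {
    "ern_version": "3.8.1",
    "message_id_path": "MessageHeader/MessageId",
}

# Each partner's deviations live in data, not in a forked code path.
PROVIDER_OVERRIDES = {
    "label_a": {"message_id_path": "Header/MsgId"},  # invented quirk
}

def config_for(provider):
    """Merge provider-specific overrides on top of the defaults."""
    return {**DEFAULTS, **PROVIDER_OVERRIDES.get(provider, {})}
```

The ingestion code then reads everything it needs from `config_for(provider)`, so a new partner with a new quirk usually means a new dictionary entry rather than new code.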
The quest for the Holy Grail: Design Considerations
Spoiler alert: the ingestion service I mentioned in the first paragraph is built as a microservice.
FWIW, I also post GOT spoilers occasionally (but not in this post, it’s safe to keep reading!). We used AWS Lambda as our serverless platform. The workflow for this application is pretty simple and lends itself well to Lambda: a provider drops an XML file (known as an ERN, or Electronic Release Notification message) into an S3 bucket we provision, unique to that provider. We process the file and identify the assets available to us and the actions to take on them. We persist the ERN message to a PostgreSQL database, and we use a message queue to store the actionable content defined in the ERN.
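The flow above can be sketched as a Lambda handler plus a pure parsing function. This is a heavily simplified illustration, not our production code: real ERN messages use XML namespaces and carry far more structure, and the S3 read, Postgres write, and queue publish are elided as comments.

```python
import xml.etree.ElementTree as ET

def parse_ern(xml_text):
    """Pull the basic facts we persist out of an ERN message.

    Simplified, hypothetical structure; real ERN XML is namespaced
    and much richer (deals, resource lists, parties, etc.).
    """
    root = ET.fromstring(xml_text)
    return {
        "message_id": root.findtext("MessageHeader/MessageId"),
        "releases": [
            {
                "title": rel.findtext("Title"),
                "action": rel.findtext("Action", default="Insert"),
            }
            for rel in root.findall("ReleaseList/Release")
        ],
    }

def handler(event, context):
    """Lambda entry point: locate the ERN file named in the S3 event.

    The boto3 download and the Postgres/queue writes are sketched as
    comments to keep this self-contained.
    """
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # parsed = parse_ern(body)
    # persist(parsed)              # PostgreSQL
    # enqueue(parsed["releases"])  # message queue
    return {"bucket": bucket, "key": key}
```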
AWS Lambda offered multiple advantages. (If you aren’t familiar with Lambda, you can read about it here.)
- Onboarding a new provider can produce bursts of thousands of files and terabytes of downloads. We needed something that could scale quickly and effortlessly.
- Beyond the initial data load, ERN messages are intermittent. New messages only arrive when new content is available for download or existing content is updated. We didn’t want servers sitting around doing nothing (except costing us money).
- It had to be fault tolerant. Lambda allowed us to spread our functions across multiple Availability Zones in AWS.
- Providers drop their ERN messages into S3 buckets. Lambda has an easy-to-use integration, allowing us to trigger the Lambda function when a new file is added to the S3 bucket.
- As you’ll see, parsing the XML files and triaging the resulting errors can be a tedious task. Amplify those errors at scale, and the problem gets worse. AWS Lambda writes logs to CloudWatch Logs by default, and CloudWatch offers robust search functionality, allowing us to find errors quickly, easily, and without additional overhead.
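One pattern that makes those CloudWatch searches pay off is logging each event as a single JSON line, so CloudWatch Logs filter patterns can match on individual fields. A minimal sketch (the field names are illustrative, not from the real service):

```python
import json
import logging

logger = logging.getLogger("ern-ingest")
logger.setLevel(logging.INFO)

def log_json(level, message, **fields):
    """Emit one JSON line per event.

    CloudWatch Logs filter patterns such as
    { $.error = "MissingReleaseId" } can then match on fields
    instead of grepping free-form text.
    """
    line = json.dumps({"level": level, "message": message, **fields})
    logger.log(getattr(logging, level), line)
    return line
```

For example, `log_json("INFO", "parsed ERN", provider="label_a", releases=3)` produces one searchable record per parsed message.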
If you write bad code, you’ll find the “I” in “team”
There were two factors that weighed heavily on my mind early on in this project:
First, we have a goal at Aaptiv of rotating team members to different teams every few months. That meant that in just a few months, this code would be in someone else’s hands. I didn’t want my name to become synonymous with curse words in their vocabulary.
Second, I’m sure you’ve seen those “lifer” projects. You know: it was written by someone who’s no longer around and everyone is scared to touch it. Or the flip side of that coin: the person who wrote it is still around and doesn’t get to work on any other projects because they are stuck maintaining this one.
I don’t want to be either person.
Fortunately, there is a way to solve both of those:
Good, solid tests.
I know, I know…
Writing tests can suck.
They can be hard.
They can be confusing.
Sometimes they’re both hard and confusing, turning the suck factor up to 11.
Here’s my thought process on tests:
Tests are for the future engineer who needs to make a change to your code after you.
By leaving them a good set of tests, you enable that person to confidently update the code without fear of breaking something.
And guess what?
Because life loves irony, sometimes you are the engineer who has to update your own code, long after you remember what it was supposed to do.
All of that translates to my fervent desire to write as many tests as I felt necessary to ensure that someone else could confidently update this code base, allowing me the freedom to work on different projects. Or, if I had to make an update, it was my insurance policy to ensure Past Will didn’t hoark over Future Will.
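To show the flavor of tests I mean, here is a pytest-style sketch. The helper being tested is a toy stand-in (invented for this post) for the kind of small, pure function the real code base favors, precisely because pure functions are the easiest to test:

```python
# Illustrative pytest-style tests for a hypothetical helper that
# decides what action to take on a release parsed out of an ERN.

def classify_action(release):
    """Toy stand-in for the real classifier: map a release dict to
    the action the pipeline takes on it."""
    if release.get("takedown"):
        return "remove"
    return "update" if release.get("known") else "insert"

def test_new_release_is_inserted():
    assert classify_action({"title": "New Single"}) == "insert"

def test_known_release_is_updated():
    assert classify_action({"title": "Old Single", "known": True}) == "update"

def test_takedown_wins_over_update():
    release = {"title": "Gone", "known": True, "takedown": True}
    assert classify_action(release) == "remove"
```

Each test names the behavior it protects, so when one fails, the next engineer knows exactly which promise the code broke.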
Paint By Numbers
It was important to track what was happening inside the Lambda functions.
When we receive a new ERN message, was it a happy little cloud function or a happy little accident?
What resources were made available?
New albums, hot new tracks, the next #1 hit?
Or was it a notice that content was no longer available?
These are all important metrics in measuring the value delivered.
While Lambda has some built-in metrics, most relate to the operation and performance of the function itself. AWS offers CloudWatch, but we already use Datadog in other parts of our environment. For that reason, I chose to integrate with Datadog to provide in-depth, application-specific instrumentation and expose it via Datadog dashboards. In future posts, you’ll see exactly how easy it was to implement and the amazing insight it produces.
In the next post, we’ll dissect the code base, identify major workflows and the tests that support them. See you then!