System Design: Top k songs played on music streaming applications

Ishwarya Hidkimath
9 min read · May 11, 2024


This is a continuation of my previous blog on System Design: Music Streaming Services. I would like to treat this specific requirement as a standalone system design question. Let's dive deep into it.

1. Requirements

Functional Requirements:

  1. Return the top k (e.g., k = 1000) most-played songs over a given period (say 1 hour, 1 day, 1 month, or all time), with real-time updates.
  2. Record and count the songs as they’re being played by end users.

Non-functional requirements:

  1. Scalable, high-performance, and highly available.
  2. Results should be returned within tens of milliseconds, which means minimal latency and a need for pre-computing.

2. Back of the envelope estimations

DAUs = 500M, and each user listens to 10 songs a day = 5B plays/day

  1. Plays per second (QPS) = 5B plays/day / (~100K seconds/day) = 50K QPS
  2. New songs added = 50K/day; stored for 10 years ≈ 200M songs

Storage estimate: 200M songs * (8 bytes/ID + 8 bytes/count) = 3.2 GB ≈ 4 GB (small enough to keep entirely in memory)
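As a sanity check, here is the same arithmetic as a tiny runnable Java snippet (the inputs are just the assumptions above, not measurements):

```java
public class BackOfEnvelope {
  public static void main(String[] args) {
    long playsPerDay = 500_000_000L * 10;          // 500M DAU x 10 songs = 5B plays/day
    long qps = playsPerDay / 100_000;              // ~100K seconds/day -> 50K QPS
    long totalSongs = 50_000L * 365 * 10;          // 50K new songs/day for 10 years ~ 182.5M (~200M)
    double storageGb = totalSongs * (8 + 8) / 1e9; // 8-byte ID + 8-byte count per song
    System.out.printf("QPS=%d, songs=%d, storage=%.1f GB%n", qps, totalSongs, storageGb);
  }
}
```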

3. System APIs

  1. We simply need an API to retrieve the top K songs.

GET /plays/top?window={WINDOW}&k={K}
Response: { "songs": [ { "songId": // … "plays": // … } ] }

4. DB Schema

We track songs, plays, and the time window:

song_list {
  song_id: 123
  user_id: 5678
  timestamp: 127800988978
}

5. Design Deep Dive

Approach 1:

  • Users play a song: When a user plays a song, the event is captured by the client application (e.g., a web browser or mobile app). The event data includes details such as the song ID, the time of play, and user information.
  • Load Balancer and API Gateway: This component acts as the entry point for all incoming requests from user clients. It distributes incoming traffic and requests across multiple servers or services to balance load and enhance performance and reliability.
  • Analytics collection service: Handles the ingestion and initial processing of analytics data from song play events.
  • Songs Plays Data Store (OLTP): Stores detailed records of each song play event. Manages a high volume of write operations as songs are played across the platform. Ensures data is consistently and reliably stored, albeit with eventual consistency, meaning it might take some time before all views of the data are consistent.
  • Ranking MapReduce + Top K Logic: Periodically runs MapReduce jobs on the data stored in the OLTP system to aggregate play counts by song, then applies top K logic to determine the most popular songs (see the sketch after this list).
  • OLAP DB Result Cache: Stores the results from the Ranking MapReduce process, specifically the top K song rankings. Serves pre-computed top K rankings quickly to the Song Listing Service to fulfill user requests.
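The "Top K logic" mentioned above can be as simple as a bounded min-heap scan over the aggregated play counts. Here is a minimal, illustrative Java sketch (names like topK and playCounts are mine, not from any specific framework):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKSelector {
  /** Selects the k songs with the highest play counts in O(n log k). */
  public static List<Map.Entry<String, Long>> topK(Map<String, Long> playCounts, int k) {
    // Min-heap ordered by count: the root is always the weakest of the current top k.
    PriorityQueue<Map.Entry<String, Long>> heap =
        new PriorityQueue<>(Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue()));
    for (Map.Entry<String, Long> e : playCounts.entrySet()) {
      heap.offer(e);
      if (heap.size() > k) heap.poll(); // evict the current minimum
    }
    List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
    result.sort(Map.Entry.<String, Long>comparingByValue().reversed()); // descending by count
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = Map.of("songA", 42L, "songB", 17L, "songC", 99L, "songD", 5L);
    System.out.println(topK(counts, 2)); // [songC=99, songA=42]
  }
}
```

This keeps only k candidates in memory at a time, which matters when the full count map is large.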

Requesting Top K Songs:

  • A user requests the top K songs through their client application.
  • The request is routed via the API Gateway to the Song Listing Service.
  • The Song Listing Service fetches the latest top K rankings from the OLAP DB Result Cache and returns them to the user.
  • Song Listing Service: Retrieves top K song rankings from the OLAP DB Result Cache. Sends the top K song data back to the user’s client application via the API Gateway.
[Diagram: How CDC works (taken from ByteByteGo)]

The major disadvantage of this approach is that MapReduce reads data from disk and writes results back to disk, which adds latency; it is built for batch processing of large datasets and can only give us near-real-time results, not real-time ones. Hence we need a better solution.

Approach 2:

I have replaced the dependency on MapReduce with Apache Flink / Amazon Kinesis Data Firehose, which enables real-time processing of the data.

Why Apache Flink? It is built for large-scale, real-time stream processing and analytics (used by Netflix, Apple, Uber, etc.).

Flink has configurable flush intervals, so we can aggregate on a minute boundary but flush the results every couple of seconds. This way we get the best of both worlds: the latest minute's data is simply incomplete until the minute boundary is reached, which is the expected behavior.
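As a rough sketch of what that looks like with Flink's DataStream API (the socket source is just a stand-in for Kafka/Kinesis, and the window/trigger sizes are illustrative):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger;

public class EarlyFiringCounts {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.socketTextStream("localhost", 9999) // one song ID per line; stand-in source
        .map(songId -> Tuple2.of(songId, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        // Aggregate on a one-minute boundary...
        .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
        // ...but emit the partial (running) count every two seconds.
        .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(2)))
        .sum(1)
        .print();
    env.execute("early-firing-counts");
  }
}
```

The continuous trigger fires without purging window state, so each early emission is a running (cumulative) count for the current minute.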

Performance: Flink provides a powerful, scalable, distributed runtime engine, which means you can process millions or even billions of events in real time and do both stateless and stateful processing at scale.

Resilience: Flink is a very reliable system with failure-recovery mechanisms. Checkpointing, for example, ensures processing can resume even when something goes wrong.

APIs: Flink offers APIs in multiple languages, so developers have freedom of choice.

Unified stream and batch: Flink can relate real-time data to historical data, which is exactly where it shines. In an application like top k songs, we don't work only on real-time data; we also draw insights from old data.

How does Flink work?

For example, if we have a streaming source that continuously produces song play events, Flink would ingest these events, group them into minute, hourly, and daily windows, calculate the play counts for each song within each window, identify the top k songs in each window, and output the results to an external sink. This process repeats continuously as new events arrive, providing real-time insight into the top songs being played over time.
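Below is a minimal sketch of such a job in Java, using one-minute windows (hourly and daily jobs would look the same with a different window size; the socket source again stands in for Kafka/Kinesis, and names like TopKSongsJob are illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class TopKSongsJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Stand-in source (one song ID per line); in production this would be
    // a Kafka or Kinesis consumer.
    DataStream<Tuple2<String, Long>> counts = env
        .socketTextStream("localhost", 9999)
        .map(songId -> Tuple2.of(songId, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // per-song counts per minute
        .sum(1);

    counts
        .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1))) // gather all songs' counts
        .process(new TopK(10))                                        // keep only the top 10
        .print();                                                     // external sink in production

    env.execute("top-k-songs");
  }

  /** Bounded min-heap selection, as in the earlier sketch, applied per window. */
  static class TopK extends ProcessAllWindowFunction<Tuple2<String, Long>, List<Tuple2<String, Long>>, TimeWindow> {
    private final int k;
    TopK(int k) { this.k = k; }

    @Override
    public void process(Context ctx, Iterable<Tuple2<String, Long>> counts,
                        Collector<List<Tuple2<String, Long>>> out) {
      PriorityQueue<Tuple2<String, Long>> heap =
          new PriorityQueue<>(Comparator.comparingLong((Tuple2<String, Long> t) -> t.f1));
      for (Tuple2<String, Long> c : counts) {
        heap.offer(c);
        if (heap.size() > k) heap.poll(); // evict the current minimum
      }
      List<Tuple2<String, Long>> top = new ArrayList<>(heap);
      top.sort((a, b) -> Long.compare(b.f1, a.f1)); // descending by play count
      out.collect(top);
    }
  }
}
```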

Steps:

1. Analytics collection service

  • Function: Kafka captures and stores the incoming song play data, ensuring durability and fault tolerance. The stream is partitioned by song ID, with extra partitions for popular songs to spread the load. Kinesis can be used for similar stream-processing capabilities, particularly partitioning and retention (e.g., a 7-day retention period). A minimal producer sketch follows below.
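Here is what the producer side could look like, assuming a hypothetical song-plays topic and JSON payloads shaped like the schema above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PlayEventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Keying by songId routes every play of a song to the same partition,
      // so a downstream aggregator sees all of that song's events in order.
      String songId = "song-123";
      String playEventJson = "{\"song_id\":\"song-123\",\"user_id\":\"5678\",\"timestamp\":1278009889}";
      producer.send(new ProducerRecord<>("song-plays", songId, playEventJson));
    }
  }
}
```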

2. Apache Flink with Amazon Firehose

  • Purpose: To process streaming data at different time scales.
  • 1-Minute, 1-Hour, and 1-Day Windows: Flink jobs are set up to aggregate song play data over these windows. Each job calculates play counts and possibly other metrics like unique listeners.
  • Amazon Firehose: Integrates with Flink to reliably load the streaming results into subsequent storage or databases for further analysis or immediate querying.

3. Spark MapReduce

  • Purpose: To perform heavy-duty batch processing, particularly for longer-term data aggregations (e.g., monthly).
  • Function: Runs nightly batch jobs to compute top K listings for longer periods, such as the past month or all time. This is more resource-intensive and done less frequently (see the sketch below).
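A sketch of such a nightly job using Spark's Java API (the S3 paths, field names, and the 30-day cutoff are illustrative assumptions, not part of the design above):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.desc;

public class NightlyTopK {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("nightly-top-k").getOrCreate();

    // Hypothetical layout: raw play events landed in S3 as JSON lines,
    // with song_id and an epoch-millis timestamp (see the schema above).
    Dataset<Row> plays = spark.read().json("s3://play-events/raw/");

    long monthAgo = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
    Dataset<Row> topK = plays
        .filter(col("timestamp").gt(monthAgo)) // past month; drop the filter for all-time
        .groupBy("song_id")
        .count()
        .orderBy(desc("count"))
        .limit(1000); // k = 1000

    topK.write().mode("overwrite").parquet("s3://play-events/top-k/monthly/");
    spark.stop();
  }
}
```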

4. Reconciliation Worker

  • Purpose: To check and correct any discrepancies or errors in the data processed by Flink or Spark.
  • Function: This component runs as a cron job, verifying the integrity and accuracy of the data in the Result Cache and fixing incorrect records, ensuring the data’s reliability.

5. Result Cache (OLAP Database)

  • Purpose: To store the results of the top K song computations in a way that’s optimized for quick retrieval.
  • Function: This database caches the top K results, storing data such as song ID, timestamps, play counts, and possibly other metrics. It supports fast, efficient queries from the Song Listing Service.

6. Song Listing Service

  • Purpose: To handle incoming queries for top K songs and serve the results to users.
  • Function: Retrieves the latest top K results from the OLAP DB Result Cache and delivers them to users in response to their queries.

1. How can we scale to support 50K plays per second?

Let's walk through each bottleneck the system could face, from the moment a play is captured onwards, and how we can overcome it:

  1. Analytics Collection Service: We can easily scale this service horizontally by adding more instances. Most modern cloud providers (AWS, Azure, GCP) offer managed services that automatically scale based on CPU or memory usage. We'll need a load balancer in front of the service to distribute the load across instances.
  2. Stream: Both Kafka and Kinesis are distributed and can handle a large number of events per second, but they need to be properly configured. Kinesis, for example, has a limit of 1 MB/s or 1,000 records/s per shard, so we'll need sharding. Sharding by songID is a natural choice; this way, the stream processor can read from multiple shards in parallel, since they are independent of each other (all events for a given songID land in the same shard).
  3. Stream Processor: The stream processor, like Flink, can also be scaled horizontally by adding more tasks or jobs. We'll have a separate Flink job reading from each shard, doing the aggregation for the song IDs in that shard.
  4. OLAP Database: The OLAP database can be scaled horizontally by adding more nodes. While we could shard by songID, we may also consider sharding by artistID instead. In doing so, all the data for a given artist lives on the same node, making queries that span an artist's songs faster, in anticipation of views that list all of an artist's songs together. Of course, it's important to monitor database and query performance to ensure it meets the SLAs, and to adapt the sharding strategy as needed.

One remaining issue: hot shards.

Consider the case where Taylor Swift just released a new song. This song is getting a lot of plays, and all of them go to the same shard. This shard is now overwhelmed, which increases latency and, in the worst case, could even cause data loss.

To solve the hot shard problem, we need a way of further partitioning the data. One popular approach is to update the partition key by appending a random number to the songID, doing this only for popular songs as predicted by the artist's historical play volume. This way, the partition key becomes SongId:0-N, where N is the number of additional partitions for that SongId.
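A tiny sketch of this salting scheme (the isPopular signal and N are assumptions; note that a downstream aggregation step must merge the N partial counts back together):

```java
import java.util.concurrent.ThreadLocalRandom;

public class PartitionKeys {
  /**
   * Returns the stream partition key for a play event. Popular songs are
   * "salted" across n sub-partitions (SongId:0 .. SongId:n-1) to spread load.
   */
  static String partitionKey(String songId, boolean isPopular, int n) {
    if (!isPopular) return songId;
    return songId + ":" + ThreadLocalRandom.current().nextInt(n);
  }

  public static void main(String[] args) {
    System.out.println(partitionKey("song-123", false, 8));        // song-123
    System.out.println(partitionKey("song-taylor-new", true, 8));  // e.g. song-taylor-new:5
  }
}
```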

2. How can we ensure that we don't lose any play data?

By default, these streams are distributed, fault-tolerant, and highly available. They replicate data across multiple nodes and data centers, so even if a node goes down, the data is not lost. Importantly for our system, they also allow us to enable persistent storage, so even if the data is consumed by the stream processor, it is still stored in the stream for a certain period of time.

We can configure a retention period of 7 days, for example, so that if, for some reason, our stream processor goes down, it will come back up and can read the data that it lost from the stream again.

Stream processors like Flink also have a feature called checkpointing, where the processor periodically writes its state to persistent storage like S3. If it goes down, it reads the last checkpoint and resumes processing from where it left off. This is particularly useful when the aggregation windows are large, like a day or a week: we may have a week's worth of data in memory being aggregated, and if the processor goes down, we don't want to lose all that work.
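Enabling this in Flink is a couple of lines (the bucket name is hypothetical; setCheckpointStorage is available in recent Flink versions):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000); // snapshot all operator state every 60s
    // Durable checkpoint location so a restarted job can resume from it.
    env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");
    // ... define sources, windowed aggregations, and sinks as in the jobs above ...
    // env.execute("checkpointed-top-k");
  }
}
```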

What about transient processing errors in Flink, bad code pushes, or out-of-order events in the stream?

We can lose or miscount data unless we add periodic reconciliation.

At the end of the stream, alongside the stream processors, we can also dump the raw play events to a data lake like S3. Flink supports this through its FileSystem connectors, allowing both batch and real-time outputs to be written directly to S3 buckets. Then we can run a batch job that reads all the raw play events from the data lake and re-aggregates them. This way, we can compare the results of the batch job to the results of the stream processor and ensure they match. If they don't, we can investigate the discrepancies and fix the root cause, updating the data in the OLAP DB with the correct values.

This essentially combines our two solutions, real-time stream processing and periodic batch processing, to ensure that our data is not only fast but also accurate.

3. How can we prevent abuse from users clicking play multiple times?

While modern systems have advanced fraud-detection systems, which we have considered out of scope, we still want a way to enforce song play idempotency, i.e., if a user plays a song multiple times in quick succession, we only count it as one play.
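One simple way to approximate this is to deduplicate on a (user, song, time-bucket) key before counting. A minimal in-memory sketch (in production the seen-set would live in Redis or in Flink keyed state with a TTL, not in one process's memory):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PlayDeduper {
  private final Set<String> seen = ConcurrentHashMap.newKeySet();

  /** Returns true only the first time a (user, song, minute-bucket) triple is seen. */
  public boolean shouldCount(String userId, String songId, long epochMillis) {
    long minuteBucket = epochMillis / 60_000; // repeated plays within a minute count once
    return seen.add(userId + ":" + songId + ":" + minuteBucket);
  }

  public static void main(String[] args) {
    PlayDeduper d = new PlayDeduper();
    System.out.println(d.shouldCount("u1", "s1", 1_000_000)); // true  (counted)
    System.out.println(d.shouldCount("u1", "s1", 1_010_000)); // false (same minute, ignored)
  }
}
```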

Reference materials:

  1. LinkedIn post on CDC by Alex Xu.
  2. System Design: Top K Songs on Spotify (3+ Approaches).
  3. What is Apache Flink®?
