System Design of Youtube

Santosh P.
10 min read · Jun 30, 2024


YouTube is a video-sharing service where users can watch, like, share, comment on, and upload their own videos.

Functional Requirements

1) Upload videos.

2) Watch videos.

3) Search videos.

4) Recommend videos.

5) Like and comment on videos.

6) Track user preferences.

Non-functional Requirements

1) Scalability

2) Availability

3) Reliability

4) Low latency

5) Cost efficiency

6) Eventual consistency

7) Failover

Key Points
These are some of the key points that impact my YouTube system design. I will discuss them in more detail later.

-- Videos can be accessed from different devices.
-- Videos are served in multiple encoded formats.
-- Videos are also stored in compressed form.
-- Videos must not get corrupted.
-- Smooth streaming experience.
-- Region-specific content.

Back of the Envelope estimation


Total number of users: 2 billion; assume 5 million daily active users (DAU).

Minimum number of videos watched per user per day: 5

Total video views per day: 5M * 5 = 25M views per day.
Assume 10% of DAU upload one video per day.
Average video size = 300 MB.
Total daily storage space needed = 5M * 10% * 300 MB = 150 TB.
Per-year storage = 150 TB * 365 ≈ 55 PB

Assume YouTube receives 1,000 requests per second. Total number of requests per day =
1,000 * 60 * 60 * 24 = 86.4M, roughly 90M requests.

Per year: 90M * 365 ≈ 33 billion requests.

Per the SLA, the average response time should be 200 ms.
By Little's law, concurrency = 1,000 RPS * 0.2 s = 200, so about 200 threads are needed.
To calculate the number of servers needed to support those 200 threads:

No. of threads =
no. of available cores * target CPU utilization *
(1 + wait time / service time)

Wait time: time spent waiting for I/O-bound work to complete.

Service time: time spent busy on the CPU.

No of servers needed: ????
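As a worked sketch of the formula above (the core count, target utilization, and wait/service times below are assumptions for illustration, not figures from the estimates):

```python
import math

# Assumed per-server characteristics (illustrative values only).
cores = 8
target_cpu_utilization = 0.75
wait_time_ms = 180    # time a request spends blocked on I/O
service_time_ms = 20  # time a request spends on-CPU

# Threads one server can sustain, per the formula above.
threads_per_server = cores * target_cpu_utilization * (1 + wait_time_ms / service_time_ms)

# Concurrency from Little's law: 1,000 RPS * 0.2 s latency = 200 requests in flight.
required_threads = int(1000 * 0.2)

servers = math.ceil(required_threads / threads_per_server)
print(threads_per_server, servers)  # 60.0 4
```

With these assumed numbers, each server sustains 60 threads, so 4 servers cover the 200-thread requirement; plugging in your real measurements changes the answer.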

For network capacity planning we need the ratio of downloads to uploads.
Video size = 300 MB, network bandwidth: 100 Mbps.
Each video takes (300 MB * 8) / 100 Mbps = 24 s to upload.

With 400 simultaneous uploads, the bandwidth requirement = 400 * 100 Mbps = 40,000 Mbps.
That corresponds to 40,000 Mbps / 8 = 5 GB of data buffered per second.

A server generally has about 2 GB of RAM available for buffering, so roughly 3 servers are needed to handle 400 concurrent uploads (a peak-load assumption consistent with the ~500K daily uploads estimated above).
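The estimates above can be reproduced in a few lines (a sketch using the article's own assumed numbers):

```python
import math

dau = 5_000_000                               # daily active users
avg_video_mb = 300                            # average video size
uploads_per_day = int(dau * 0.10)             # 10% of DAU upload one video
daily_storage_tb = uploads_per_day * avg_video_mb / 1_000_000
yearly_storage_pb = daily_storage_tb * 365 / 1_000

upload_seconds = avg_video_mb * 8 / 100       # 300 MB over a 100 Mbps link
concurrent_uploads = 400
ingest_gbps = concurrent_uploads * 100 / 1_000  # total upload bandwidth in Gbps
buffer_gb_per_s = ingest_gbps / 8               # bytes buffered per second
upload_servers = math.ceil(buffer_gb_per_s / 2) # ~2 GB buffer RAM per server

print(daily_storage_tb, yearly_storage_pb, upload_seconds, upload_servers)
# 150.0 54.75 24.0 3
```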

API Endpoint

GetVideo - GET /videos: 

Retrieves video information for a given list of video IDs.

Example: GET `https://www.googleapis.com/youtube/v3/videos`
?part=snippet,contentDetails,statistics
&id=Ks-_Mh1QhMc
&key=YOUR_API_KEY

SearchVideo - GET /search:

Searches for videos matching a query. Example:
GET `https://www.googleapis.com/youtube/v3/search`
?part=snippet
&q=OpenAI
&type=video
&key=YOUR_API_KEY

GetChannel - GET /channels:

Retrieves channel information for a given list of channel IDs.

Example: GET `https://www.googleapis.com/youtube/v3/channels`


FetchPlaylist - GET /playlists:

Retrieves playlists for a given list of playlist IDs.

Example: GET `https://www.googleapis.com/youtube/v3/playlists`


SearchVideos - GET /search:

Searches for videos, channels, or playlists based on query parameters.

Example: GET `https://www.googleapis.com/youtube/v3/search`


GetSubscriptions - GET /subscriptions:

Retrieves subscriptions for a given channel.

Example: GET `https://www.googleapis.com/youtube/v3/subscriptions`
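As a sketch, request URLs for these endpoints can be assembled like this (the API key is a placeholder; parameter values are taken from the examples above):

```python
from urllib.parse import urlencode

BASE = "https://www.googleapis.com/youtube/v3"

def build_request(resource: str, **params) -> str:
    # Assemble a GET URL for a YouTube Data API v3 resource.
    return f"{BASE}/{resource}?{urlencode(params)}"

url = build_request(
    "videos",
    part="snippet,contentDetails,statistics",
    id="Ks-_Mh1QhMc",
    key="YOUR_API_KEY",  # placeholder, not a real key
)
print(url)
```

Sending the request (e.g. with any HTTP client) returns JSON describing the listed videos; the same builder works for `search`, `channels`, `playlists`, and `subscriptions`.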

Data Model:

Database Schema

Similarly, the design should include subscription, channel, like, and comment objects.

User data is stored in MySQL (a relational database),
whereas profile pictures are stored in an object store.

Video data, i.e. the media files in different encoded formats, is stored
in object storage such as Azure Blob Storage or Amazon S3.
A separate metadata store, a NoSQL database like MongoDB, holds
all video metadata: video URL, thumbnails, user information, likes,
comments, etc.

Subscription info can be stored in a MySQL database, whereas for search
operations the video data is indexed in Elasticsearch, and recommendation
data, which requires large-scale storage, can be kept in NoSQL systems
like Cassandra or a Hadoop cluster.

A CDN is used extensively to store encoded videos for fast streaming.

A distributed cache like Redis or Memcached can be used for caching user
metadata as well as video data and metadata.

A separate metadata cache holds frequently accessed video metadata.
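As a sketch, a video-metadata document in a MongoDB-style store might look like this (all field names and values are illustrative assumptions, not YouTube's actual schema):

```python
# Illustrative video-metadata document for a NoSQL store such as MongoDB.
video_metadata = {
    "videoId": "Ks-_Mh1QhMc",
    "uploaderId": "user_123",
    "title": "Sample video",
    "thumbnailUrl": "https://cdn.example.com/thumbs/Ks-_Mh1QhMc.jpg",
    "renditions": {  # pointers to encoded formats in object storage
        "360p": "s3://videos/Ks-_Mh1QhMc/360p.mp4",
        "720p": "s3://videos/Ks-_Mh1QhMc/720p.mp4",
    },
    "likes": 42,
    "comments": [
        {"userId": "user_456", "text": "Nice!"},
    ],
}

print(video_metadata["renditions"]["720p"])
```

Keeping the heavy media files in object storage and only these small, frequently read documents in the metadata store (and its cache) is what makes the split pay off.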

High Level System design

For YouTube, we can build a loosely coupled system to enable high parallelism. Loosely coupled here means a service-oriented (SOA) or event-driven microservice architecture with a distributed message queue for communication, so that each service performs its defined responsibility in parallel.

Most videos are served from a nearby content delivery network location. YouTube deals with two kinds of streaming requests:

1. Live streaming, which favors low-latency delivery (UDP-based transport, or WebSockets for real-time signaling; note that WebSockets themselves run over TCP).

2. On-demand video streaming over TCP, delivered as HTTP calls.

Since YouTube is accessed from many different devices, each video must be stored and served in many different encoded forms, because different devices need different variants depending on network connectivity and device configuration. More on this later.

Error handling: for a highly fault-tolerant system, handle errors gracefully and recover quickly. YouTube should have a separate notification service to notify users of video availability or of any connection failures, plus separate health monitoring to ensure a resilient architecture that can handle any kind of service failure.

Low Level system design

The critical factor in a YouTube design is transcoding. Transcoding is a crucial process in YouTube's video processing workflow: it converts uploaded video files into different formats and resolutions to ensure compatibility across devices and network conditions. This allows YouTube to deliver an optimal viewing experience regardless of device capabilities or internet speed. Transcoding can be done in two ways.

Lossless transcoding converts video formats without any loss of data or quality. It is mainly needed for archival purposes, professional video editing, and other scenarios where maintaining the original quality is crucial, but it results in very large file sizes and high storage and bandwidth requirements.

Lossy Transcoding converts video formats with some loss of data, which reduces file size. This is preferred for streaming services (like YouTube), where bandwidth and storage efficiency are essential.

The ingestion servers forward raw stream data to the transcoding servers. A transcoding server handles video transcoding and encoding: each video is encoded into multiple formats (144p, 240p, 4K, etc.) to provide the best streaming experience on different devices. Each video can be divided into data (the video content) and metadata (video URL, watermark, thumbnails, etc.).

  1. Metadata usually comprises user and video information stored in two separate tables. Metadata should be replicated in a master-slave architecture: writes go to the master, while queries hit the slave replicas, so reads run in parallel and the load on the master is reduced.
  2. Metadata can also be sharded for faster lookups, but sharding increases system complexity, so we need an abstraction layer to handle the scalability and manageability challenges.
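A minimal sketch of hash-based shard routing for metadata (the shard count and the choice of video ID as shard key are assumptions for illustration):

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def shard_for(video_id: str) -> int:
    # Hash the shard key (video ID) and map it onto a shard index.
    # Hash routing spreads keys evenly but makes re-sharding harder,
    # which is one reason an abstraction layer (or consistent hashing)
    # is worth putting in front of the shards.
    digest = hashlib.md5(video_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("Ks-_Mh1QhMc"))  # deterministic index in [0, NUM_SHARDS)
```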

When a video is uploaded, it is first stored in the original blob storage. Transcoding servers fetch the video and convert it into multiple formats and resolutions. To increase throughput we can parallelise the transcoding process across several machines. Some videos undergo another level of compression to preserve quality at a smaller size. Overall, the video is processed by a batch job that runs several automated steps: generating thumbnails, metadata, transcripts, encoding, and so on. Once transcoding is complete, the following two steps execute in parallel.

The transcoded video is sent to transcoded storage, where it is stored in sharded form. A video can be split into smaller chunks along Group of Pictures (GOP) boundaries and then distributed across multiple storage devices. A GOP is a chunk of frames in a specific order, a playable unit a few seconds in length. The preprocessor splits the video into these GOP-aligned segments and caches them, and the processor then generates a DAG (directed acyclic graph) based on configuration files that client programmers write.

The DAG scheduler splits the DAG into stages of tasks and puts them into the task queue in the resource manager. The resource manager allocates resources using three queues and a task scheduler.

The task queue is a priority queue containing the tasks to execute. The worker queue is a priority queue containing worker-utilization info. The running queue contains info about running tasks and the workers running them. The task scheduler pulls the highest-priority task from the task queue and picks the optimal worker for it from the worker queue; the worker runs the task, and the task/worker info is put into the running queue, then removed once the job completes. Task workers run the tasks defined in the DAG: generating watermarks, encoding, generating thumbnails, or merging. The encoded video file is first placed in temporary storage and then moved to permanent storage. Metadata, which is frequently accessed and small in size, is stored in cache, whereas video and audio files are stored in object storage, after which the temporary storage is freed.
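The interplay of the three queues can be sketched with Python's `heapq` (task names, priorities, and utilization numbers are made-up for illustration):

```python
import heapq

# Lower number = higher priority / lower utilization, so it pops first.
task_queue = []    # (priority, task) - tasks waiting to execute
worker_queue = []  # (utilization, worker) - least-loaded worker first
running = {}       # task -> worker, the "running queue"

for prio, task in [(2, "thumbnail"), (0, "encode"), (1, "watermark")]:
    heapq.heappush(task_queue, (prio, task))
for util, worker in [(0.7, "w1"), (0.2, "w2")]:
    heapq.heappush(worker_queue, (util, worker))

# One scheduler step: highest-priority task -> least-utilized worker.
_, task = heapq.heappop(task_queue)
util, worker = heapq.heappop(worker_queue)
running[task] = worker
print(task, "->", worker)  # encode -> w2
```

On completion the scheduler would delete the entry from `running` and push the worker back with its updated utilization.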

Once transcoding is complete, a transcoding-completion event is queued to the completion queue and the transcoded video is distributed to all CDNs. The completion handler pulls events from the completion queue, updates the metadata database, and caches the transcoded videos. The API server then informs the client that the video was uploaded successfully.

Some of the major components used for encoding features are:

Container: a basket that holds the video file, audio file, and metadata; common file extensions are .mov and .mp4.

Codecs: compression and decompression algorithms that reduce size while preserving video quality. Common video codecs are H.264, VP9, and HEVC. Transcoding video is computationally expensive and time-consuming, so we need an automated video processing pipeline with high parallelism. This can be achieved using a directed acyclic graph (DAG) model.
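The DAG model can be sketched with Python's standard `graphlib` (the task names and dependencies below are assumptions for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Illustrative DAG of pipeline tasks: each task maps to the set of tasks
# it depends on. Inspection gates transcoding; thumbnailing and
# watermarking run after transcoding; merging waits on both.
dag = {
    "transcode": {"inspect"},
    "thumbnail": {"transcode"},
    "watermark": {"transcode"},
    "merge": {"thumbnail", "watermark"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)
```

Any topological order is a valid serial execution; tasks at the same depth (here thumbnailing and watermarking) are the ones a scheduler can run in parallel.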

Inspection ensures the video is of good quality. Video transcoding converts the video to different resolutions, codecs, and bitrates. Once transcoding completes, thumbnails and a watermark (an image that identifies the video) are generated.

  1. Vitess is a database clustering system that runs on top of MySQL. It has built-in sharding for scaling, automatic rewriting of bad queries to improve database performance, built-in caching with query deduplication to prevent duplicate queries, and automatic failover handling with backup functionality.
  2. Use different caching strategies: a distributed cache for storing metadata, adopting a least-recently-used (LRU) eviction algorithm to perform caching efficiently.
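A minimal LRU cache sketch for metadata entries (illustrative only, built on `collections.OrderedDict`):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: evicts the least-recently-used entry at capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("v1", {"title": "A"})
cache.put("v2", {"title": "B"})
cache.get("v1")                   # touch v1, so v2 becomes the LRU entry
cache.put("v3", {"title": "C"})   # evicts v2
print(list(cache.data))           # ['v1', 'v3']
```

Production caches (Redis with `maxmemory-policy allkeys-lru`, Memcached) implement approximations of the same policy.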

Video streaming flow:

Some key points in the video streaming flow are:
1. To start streaming immediately, the client does not wait for the whole video to be downloaded; the device continuously receives video streams from the remote source. A standard, MPEG-DASH (Dynamic Adaptive Streaming over HTTP, from the Moving Picture Experts Group), defines how the data flow of video streaming is controlled. For a streaming service we need to pick the right protocol to support our use case.

2. Video streams from the CDN to the nearest edge server with very little latency. The delivered video is encoded at a compatible bitrate: a higher bitrate demands more processing power and a faster internet connection but gives better quality, so bitrate selection is based on device type and connection speed. Raw video also consumes a lot of storage: an hour-long high-definition video recorded at 60 frames per second can take up to a few hundred GB of space.
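The bitrate-selection step can be sketched as follows (the bitrate ladder is an illustrative assumption, not YouTube's actual one):

```python
# Adaptive-bitrate (ABR) rung selection: pick the highest rendition whose
# required bandwidth fits in the measured throughput.
LADDER = [  # (resolution, required bandwidth in kbps), ascending
    ("144p", 300),
    ("360p", 1000),
    ("720p", 3000),
    ("1080p", 6000),
    ("4k", 20000),
]

def pick_rendition(bandwidth_kbps: float) -> str:
    choice = LADDER[0][0]  # fall back to the lowest rung
    for resolution, required in LADDER:
        if bandwidth_kbps >= required:
            choice = resolution
    return choice

print(pick_rendition(4500))  # 720p
```

A real player re-measures throughput per segment and also accounts for buffer occupancy before switching rungs.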

For live streaming, YouTube opens a non-blocking WebSocket connection between the client and server endpoints and ensures the socket stays open until explicitly closed by either party. Viewers access the live stream via the YouTube website or mobile app, and the YouTube player requests the appropriate stream quality based on the viewer's device and network conditions, using adaptive bitrate streaming (ABR).

Finally, YouTube has designed a robust recommendation engine that personalizes video recommendations for billions of users. It uses sophisticated machine learning models and big-data analytics to analyze user behavior and preferences. More information about the recommendation engine can be found separately.

Bottleneck

Some of the major bottlenecks in a YouTube design are:

  1. Ensuring smooth playback without buffering, even for high-resolution videos. Adaptive Bitrate Streaming (ABR) dynamically adjusts the video quality based on the user’s network conditions.
  2. Content protection: a digital rights management (DRM) system, AES encryption, and digital watermarking are common measures for protecting a video's exclusive rights and quality. To protect user data and prevent unauthorized access, employ strong encryption, regular security audits, and compliance with data protection regulations.
  3. Providing relevant search results from a massive library of videos: YouTube uses advanced indexing and search algorithms, leveraging machine learning and AI.

Optimisation

  1. To improve upload speed, create multiple upload centres across the globe and sync them in the background.
  2. Use pre-signed upload URLs: the client makes an HTTP request to the API server to fetch a pre-signed URL before it can upload a video.
  3. With live re-sharding we can change the shard key for a collection on demand as the application evolves.
  4. Similarly, use a CDN for video content caching, constantly fetching media content directly from AWS S3; CloudFront can serve as the content cache and ElastiCache as the metadata cache.
  5. Disaster management strategies avoid data loss or service unavailability caused by various failures: back up data to data centres located across geographies, so a user request can be served from the nearest location, or even from another data centre if the data is not available in the nearest one.
  6. Cost-saving optimization: a CDN is an expensive service when the data size is large, so popular videos can be served from the CDN while other videos come from high-capacity video storage servers. Not all videos need to be pre-encoded: short videos can be encoded on demand, and some videos are localized to their regions. All optimizations are based on content popularity, user access patterns, video size, etc., so it is important to analyze historical viewing patterns before optimizing.
  7. To further optimize error handling, categorize errors into two types. 1) Recoverable errors: retry the operation a couple of times and, if it still fails, return a proper error code to the client. 2) Non-recoverable errors: stop the tasks associated with the video and return a proper error code to the client.
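The two-category error policy in point 7 can be sketched like this (error classes, codes, and backoff values are illustrative assumptions):

```python
import time

class RecoverableError(Exception): ...
class NonRecoverableError(Exception): ...

def run_with_retries(task, max_retries=3, backoff_s=0.01):
    """Retry recoverable errors a few times; fail fast on the rest."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except RecoverableError:
            if attempt == max_retries:
                return "ERROR_RETRIES_EXHAUSTED"  # report error code to client
            time.sleep(backoff_s * attempt)       # simple linear backoff
        except NonRecoverableError:
            return "ERROR_NON_RECOVERABLE"        # stop associated tasks

# Simulate a task that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoverableError()
    return "ok"

result = run_with_retries(flaky)
print(result)  # ok
```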

In this blog I have mostly covered the transcoding methods adopted by YouTube, Netflix, and similar video-sharing platforms. I haven't covered all the nitty-gritty of web application system design, as almost all systems have a similar design. If you like the content or have any comments, feel free to reach out to me. A small gesture such as leaving a comment would encourage me to further improve my blogging. I look forward to hearing from you.


Santosh P.

A Developer | Aspiring Distributed Computing Expert | Leveraging Algorithms & Data Structures for Optimal Performance | Passionate Techie