Blog Digest: How Netflix Works (Part 1)
This is a ChatGPT-generated digest of a ByteByteGo post.
Netflix’s seemingly straightforward user interface conceals a highly intricate and sophisticated infrastructure. At first glance, one might assume that Netflix, a video streaming platform, relies solely on AWS (Amazon Web Services) to deliver its content. However, the reality is far more complex and fascinating.
In 2022, Netflix had roughly 221.64 million subscribers across more than 190 countries and quarterly revenue of nearly $7.87 billion, and it has become a dominant force in global internet traffic. At that scale, reliability is critical, especially because Netflix operates on a subscription-based model.
Netflix’s operations are divided into three fundamental components: the client (the user interface), the backend (which primarily runs on AWS), and the content delivery network (CDN), known as Open Connect. This vertical integration ensures Netflix has complete control over the entire viewing experience on a global scale.
Initially, Netflix operated its own data centers, but due to reliability issues, it gradually transitioned entirely to AWS over eight years. Operating across three AWS regions, each with multiple availability zones, Netflix has fine-tuned a global services model that allows seamless user experience even during regional failures.
Contrary to the assumption that using AWS would inflate costs, Netflix found that the cloud’s elasticity made it cost-effective. Netflix runs many parts of its operation on AWS, including scalable computing, storage, databases (DynamoDB and Cassandra), big data analytics, recommendations, and transcoding.
Transcoding, a massive operation at Netflix, involves converting videos to suit a myriad of devices, creating thousands of files tailored to different device specifications, network speeds, languages, and user preferences. This personalized approach even extends to header images, ensuring a unique and appealing experience for each user.
The platform’s experimentation with different CDN strategies sheds light on its evolution. Initially, Netflix built a small in-house CDN. Subsequently, it shifted to using third-party CDNs (such as Akamai and Limelight). Finally, it developed Open Connect, its custom CDN. Open Connect’s advantages in terms of quality, scalability, and cost-efficiency were evident due to Netflix’s deep understanding of user preferences and content delivery requirements.
This intricate infrastructure underpins the seemingly simple “press play” experience that users worldwide enjoy. It underscores Netflix’s unwavering commitment to delivering a seamless, personalized, and top-tier viewing journey for its subscribers.
The process of transitioning to AWS was pivotal for Netflix. Initially established in 1998 as a DVD rental service through the US Postal Service, Netflix began envisioning the future in on-demand streaming video. By 2007, it introduced its streaming video-on-demand service, allowing subscribers to access television series and films through various platforms. However, making this transition wasn’t without its challenges.
In 2007, when Netflix’s streaming service began, AWS’s EC2 (Elastic Compute Cloud) was still in its early stages, making it impractical for Netflix to launch on this cloud service. Consequently, Netflix built two data centers of its own, adjacent to each other. This approach ran into the classic problems of long lead times for equipment, capacity shortfalls, and difficulty maintaining reliability during rapid growth.
Operating a data center involves extensive work, from equipment procurement and installation to scaling up to meet increasing demand. Netflix struggled to build a reliable monolithic system that could accommodate its rapid growth. A critical turning point arrived in August 2008, when Netflix experienced a three-day service outage caused by database corruption. This event exposed the gap between their weakness in managing data centers and their core competency in delivering video content.
The outage prompted a crucial realization within Netflix. While they excelled at delivering video content, their expertise didn’t lie in building and maintaining data centers. They recognized that their focus should be on enhancing their strengths rather than diverting resources to areas that weren’t their core competency. Consequently, in a bold move, Netflix decided to shift to AWS.
The decision to move to AWS was driven by several key factors. One primary motivation was the pursuit of a more reliable infrastructure. Netflix aimed to eliminate single points of failure from its system and sought the highly reliable databases, storage, and redundant data centers offered by AWS. Additionally, AWS provided cloud computing capabilities that allowed Netflix to scale globally without the need to build and maintain their own data centers in different regions.
AWS’s proposition aligned with Netflix’s vision of a global service without the burden of managing its own data centers. The concept of “undifferentiated heavy lifting” was critical in Netflix’s decision-making process. Because AWS could handle tasks that were essential but didn’t directly contribute to Netflix’s core business of delivering a quality video-watching experience, Netflix could focus on providing business value.
The migration process from Netflix’s data centers to AWS was not immediate but extended over eight years. During this period, Netflix’s streaming customer base grew exponentially. The platform currently operates on several hundred thousand EC2 instances within AWS, a testament to its scale and reliance on the cloud service provider.
Transitioning to AWS significantly improved Netflix’s service reliability. Although occasional downtime has not been eliminated entirely, the service is notably more reliable than it was before. Netflix has also taken extensive measures to enhance the reliability of its service within AWS.
Netflix runs in three AWS regions (Northern Virginia; Portland, Oregon; and Dublin, Ireland), with multiple availability zones in each. While many companies operate from a single region, Netflix’s presence in three regions minimizes the impact of regional failures. If a region fails, Netflix follows an evacuation strategy, redirecting users to the other operational regions within minutes.
The monthly evacuation tests conducted by Netflix simulate region failures to ensure their system’s readiness to handle such events. This global services model enables any customer to be served from any region, ensuring uninterrupted service even during regional outages.
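The evacuation idea can be sketched in a few lines. This is a minimal illustration, not Netflix’s actual traffic-steering logic: the region codes are the standard AWS identifiers for the three locations named above, while `route_user`, `evacuate`, and the health map are hypothetical names invented for the example.

```python
# AWS region codes for Northern Virginia, Oregon, and Ireland.
# Everything else here (function names, health map) is hypothetical.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


def route_user(home_region, healthy):
    """Return the region that should serve a user.

    Prefer the user's home region; if it is down, fall back to the
    first healthy region in REGIONS order.
    """
    if healthy.get(home_region, False):
        return home_region
    for region in REGIONS:
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


def evacuate(assignments, failed_region, healthy):
    """Reassign every user currently served by failed_region."""
    # Copy the health map so the caller's data is not modified.
    healthy = {**healthy, failed_region: False}
    return {
        user: route_user(region, healthy) if region == failed_region else region
        for user, region in assignments.items()
    }
```

Because any customer can be served from any region, the fallback only needs to find a healthy region; a real system would also weigh latency and spare capacity.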
Beyond improved reliability, migrating to AWS also yielded cost benefits for Netflix. Contrary to common assumptions, the cloud proved to be cost-effective for the platform. The cloud’s elasticity allowed Netflix to add or remove servers based on demand, optimizing costs by paying only for the resources they utilized.
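A small worked example shows why elasticity can lower the bill. The demand figures and the price below are made up for illustration; the only point is the difference between paying for peak capacity around the clock and paying for what each hour actually needs.

```python
def fixed_cost(hourly_demand, price_per_server_hour):
    """Cost when capacity is provisioned for peak demand in every hour."""
    return max(hourly_demand) * len(hourly_demand) * price_per_server_hour


def elastic_cost(hourly_demand, price_per_server_hour):
    """Cost when servers are added and removed to match demand each hour."""
    return sum(hourly_demand) * price_per_server_hour


# Hypothetical number of servers needed in six consecutive hours.
demand = [20, 10, 10, 40, 100, 60]
price = 0.5  # hypothetical price per server-hour

peak_provisioned = fixed_cost(demand, price)   # 100 servers * 6 hours * 0.5
pay_per_use = elastic_cost(demand, price)      # 240 server-hours * 0.5
```

With these numbers, peak provisioning costs 300.0 and elastic provisioning costs 120.0. The gap grows as demand becomes more uneven across the day.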
Before pressing play on Netflix, numerous operations occur within AWS. The platform utilizes various AWS services for different functions, including scalable computing (EC2), scalable storage (S3), distributed databases (DynamoDB and Cassandra), big data processing, analytics, recommendations, and more.
Scalable computing (EC2) and storage (S3) are fundamental AWS services utilized by Netflix. The client devices, whether an iPhone, TV, Xbox, or Android phone, communicate with Netflix services running on EC2. Every action on the Netflix app, from browsing potential videos to seeking details about a specific video, involves interaction with an EC2-based system.
The platform relies on distributed databases such as DynamoDB and Cassandra. They store user profiles, billing information, viewing history, and other crucial data, replicating it across multiple servers so that the failure of a single server does not lose data.
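To make “across multiple servers” concrete, here is a simplified sketch of replica placement. It is not how Cassandra or DynamoDB actually assign data (both use more elaborate partitioning schemes); the function name, node names, and replication factor of three are assumptions chosen for the example.

```python
import hashlib


def replicas_for(key, nodes, replication_factor=3):
    """Pick replication_factor distinct nodes to hold copies of a key.

    The key is hashed to a starting position, and copies are placed on
    consecutive nodes, wrapping around the end of the list.
    """
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


# Hypothetical cluster of five nodes.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas_for("user:12345:viewing_history", nodes)
```

Each key lands on three different nodes, so any single node can fail without the data becoming unavailable.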
Netflix’s extensive collection of user data undergoes big data processing and analytics. The platform collects vast amounts of data regarding user viewing habits, preferences, and behaviors. Processing this data involves standardizing it into a uniform format, while analytics involves analyzing this data to derive insights and make informed decisions.
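Standardizing data into a uniform format might look like the sketch below. The field names (`uid`, `video`, `watched_ms`, and so on) are invented for illustration and say nothing about Netflix’s real event schema; the example assumes timestamps arrive as Unix epoch seconds.

```python
from datetime import datetime, timezone


def normalize_event(raw):
    """Map a raw viewing event from any client into one uniform record."""
    # Different clients are assumed to use different field names.
    user_id = raw["user_id"] if "user_id" in raw else raw["uid"]
    title_id = raw["title_id"] if "title_id" in raw else raw["video"]
    # Some clients report milliseconds, others seconds.
    if "watched_ms" in raw:
        seconds = raw["watched_ms"] / 1000
    else:
        seconds = raw["watched_seconds"]
    ts = datetime.fromtimestamp(raw["timestamp"], tz=timezone.utc)
    return {
        "user_id": str(user_id),
        "title_id": str(title_id),
        "watched_seconds": float(seconds),
        "timestamp": ts.isoformat(),
    }
```

Once every event has the same shape, downstream analytics can aggregate across devices without special cases.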
One remarkable instance of utilizing data analytics is in Netflix’s selection of header images. These header images, displayed for each video, aim to entice users to select a video. Netflix employs a data-driven approach, tailoring these images based on users’ preferences and viewing habits. These subtle customizations contribute to enhancing the user experience and engagement.
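One simple way to think about data-driven image selection is to show the variant with the best observed click-through rate. This is only an illustration of the idea, not Netflix’s actual method, and the image ids and counts below are made up.

```python
def best_image(stats):
    """Return the image id with the highest click-through rate.

    stats maps an image id to a tuple of (impressions, clicks).
    Images with no impressions are treated as having a rate of zero.
    """
    def click_through_rate(item):
        impressions, clicks = item[1]
        return clicks / impressions if impressions else 0.0

    return max(stats.items(), key=click_through_rate)[0]


# Hypothetical results for three header images of the same title.
stats = {
    "hero_closeup": (1000, 50),   # 5% click-through
    "ensemble_cast": (800, 64),   # 8% click-through
    "title_card": (0, 0),         # not shown yet
}
winner = best_image(stats)
```

Here `winner` is `"ensemble_cast"`. A personalized version would keep separate statistics per audience segment instead of one global table.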
Transcoding is a massive operation at Netflix, crucial for delivering videos across various devices and network speeds. It involves converting videos into different formats and bitrates to accommodate various devices and internet connection speeds. For instance, a video might be transcoded into lower resolutions or bitrates for users with slower internet connections.
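The reason for producing several bitrates becomes clearer when you look at how a player might choose among them. The ladder values and the 80% headroom below are assumptions for the example, not Netflix’s published encoding settings.

```python
# Hypothetical ladder: (height in pixels, bitrate in kbit/s), lowest first.
LADDER = [(240, 235), (360, 560), (480, 1050), (720, 3000), (1080, 5800)]


def pick_rendition(bandwidth_kbps, ladder=LADDER, headroom=0.8):
    """Return the highest rendition that fits the available bandwidth.

    Only a fraction (headroom) of the measured bandwidth is budgeted, to
    leave room for fluctuations. If nothing fits, the lowest rendition
    is returned so playback can still start.
    """
    budget = bandwidth_kbps * headroom
    chosen = ladder[0]
    for rendition in ladder:
        if rendition[1] <= budget:
            chosen = rendition
    return chosen
```

With this ladder, a 4,000 kbit/s connection gets the 720p rendition, while a 100 kbit/s connection falls back to 240p.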
The complexity of transcoding is further amplified by the diversity of devices and platforms used by Netflix subscribers. Different devices and apps require videos in different formats and resolutions. Netflix uses its proprietary transcoding system to process and create multiple renditions of a video file to match various device specifications.
Moreover, transcoding extends beyond just adapting videos to different resolutions or bitrates. It includes multiple audio tracks and subtitles in various languages. This ensures that users across different regions and language preferences have a seamless viewing experience.
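A quick enumeration shows how the file count multiplies. The profile names, bitrates, languages, and file-naming scheme are all hypothetical; the sketch only assumes that video, audio, and subtitles are produced as separate outputs.

```python
from itertools import product


def plan_outputs(title, video_profiles, bitrates_kbps, audio_languages, subtitle_languages):
    """List the output files one title would need.

    Every video profile is encoded at every bitrate; each audio language
    and each subtitle language adds one more file.
    """
    video = [f"{title}_{p}_{b}k.video" for p, b in product(video_profiles, bitrates_kbps)]
    audio = [f"{title}_{lang}.audio" for lang in audio_languages]
    subs = [f"{title}_{lang}.subs" for lang in subtitle_languages]
    return video + audio + subs


outputs = plan_outputs(
    "example_title",
    video_profiles=["mobile", "tv"],
    bitrates_kbps=[560, 3000, 5800],
    audio_languages=["en", "es"],
    subtitle_languages=["en", "es", "fr"],
)
```

Even this tiny configuration yields 11 files (2 × 3 video, 2 audio, 3 subtitle). Realistic numbers of profiles, bitrates, and languages quickly push the total far higher.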
Netflix’s recommendation system is a core feature that significantly contributes to user engagement and satisfaction. AWS plays a pivotal role in this aspect by powering the algorithms that suggest personalized content to users. Netflix’s recommendation engine is known for its accuracy in predicting users’ preferences based on their viewing history, ratings, and interactions with the platform.
Netflix’s experimentation with different CDN strategies offers insights into its evolutionary journey. Initially, the platform developed a small in-house CDN to deliver its content. However, as the user base expanded, they faced challenges in maintaining the quality of service.
Subsequently, Netflix shifted to utilizing third-party CDNs, including Akamai and Limelight. While these third-party CDNs offered global coverage, they didn’t fully align with Netflix’s objectives. Netflix sought to provide a consistently high-quality streaming experience across the globe, necessitating a custom CDN solution.
The introduction of Open Connect, Netflix’s custom CDN, marked a significant shift. This CDN was designed to meet Netflix’s specific requirements, providing better control over content delivery, scalability, and cost efficiency. Open Connect strategically places caching servers within internet service provider networks, allowing for faster content delivery directly to users.
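The benefit of placing caches inside ISP networks can be sketched as a selection rule: serve from a cache in the viewer’s own ISP when one holds the title, and only otherwise go elsewhere. The data layout and the `pick_cache` function are hypothetical, and real steering would consider many more signals.

```python
def pick_cache(user_isp, title, caches):
    """Choose a cache server for a playback request.

    caches is a list of dicts with the keys 'name', 'isp', 'titles'
    (a set of title ids), and 'healthy'. Returns the chosen cache name,
    or None if no healthy cache holds the title.
    """
    candidates = [c for c in caches if c["healthy"] and title in c["titles"]]
    if not candidates:
        return None
    # Caches inside the user's own ISP sort first (False sorts before True).
    candidates.sort(key=lambda c: c["isp"] != user_isp)
    return candidates[0]["name"]


# Hypothetical caches.
caches = [
    {"name": "ix-cache-1", "isp": "exchange", "titles": {"t1", "t2"}, "healthy": True},
    {"name": "isp-a-cache", "isp": "isp-a", "titles": {"t1"}, "healthy": True},
    {"name": "isp-b-cache", "isp": "isp-b", "titles": {"t2"}, "healthy": False},
]
```

A viewer on `isp-a` requesting `t1` is served by `isp-a-cache`, keeping the traffic inside the ISP. The same viewer requesting `t2` falls back to `ix-cache-1`.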
The custom CDN not only improved the quality of streaming but also reduced Netflix’s reliance on third-party providers, leading to cost savings. By utilizing Open Connect, Netflix could better manage its massive volume of traffic, resulting in enhanced streaming quality for its subscribers.
In essence, Netflix’s transition to AWS and its evolution in CDN strategies highlight its relentless pursuit of innovation, reliability, and user-centric services. The platform’s intricate infrastructure, spanning AWS services and its proprietary CDN, underscores its commitment to delivering a seamless, personalized, and top-tier viewing experience to its global subscriber base.
This continuous innovation and dedication to improving the streaming experience have solidified Netflix’s position as a leader in the entertainment industry. The platform’s ability to leverage technology, data analytics, and a deep understanding of user behavior has set new benchmarks for delivering high-quality streaming content worldwide.