Netflix Billing Migration to AWS
On January 4, 2016, right before Netflix expanded into 130 new countries, Netflix Billing infrastructure became 100% AWS cloud-native. The migration of Billing infrastructure from the Netflix Data Center (DC) to the AWS Cloud was part of a broader initiative. This prior blog post is a great read that summarizes our strategic goals and direction towards AWS migration:
The journey to the cloud at Netflix began in 2008, and after seven years of diligent effort, we have finally completed… (media.netflix.com)
For a company, its billing solution is its financial lifeline, while at the same time, it is a visible representation of the company’s attitude towards its customers. A great customer experience is one of Netflix’s core values. Considering the sensitive nature of Billing, given its direct impact on our monetary relationship with our members as well as on financial reporting, this migration needed to be handled as delicately as possible. Our primary goal was to define a secure, resilient and granular path for migration to the Cloud, without impacting the member experience.
This blog entry discusses our approach to the migration of a complex Billing ecosystem from the Netflix Data Center (DC) into the AWS Cloud.
Components of our Billing architecture
Billing infrastructure is responsible for managing the billing state of Netflix members. This includes keeping track of open/paid billing periods, the amount of credit on the member’s account, the payment status of the member and the date the member has paid through, as well as initiating charge requests. In addition, billing data feeds into financial systems for revenue and tax reporting for Netflix accounting. To accomplish the above, billing engineering encompasses:
- Batch jobs to create recurring renewal orders for a global subscriber base, aggregated data feeds into our General Ledger (GL) for daily revenue from all payment methods including gift cards, and a Tax Service that reads from and posts into the Tax engine. This also covers the generation of messaging events and streaming/DVD hold events based on the billing state of customers.
- Billing APIs that provide billing and gift card details to the customer service platform and website. Billing APIs are also part of workflows initiated to process user actions like member signup, change plan, cancellation, update address, chargebacks, and refund requests.
- Integrations with different services like member account service, payment processing, customer service, customer messaging, DVD website and shipping
Billing systems had integrations in DC as well as in the cloud with the cloud-native systems. At a high level, our pre-migration architecture could be abstracted out as below:
Considering how much code and data was interacting with Oracle, one of our objectives was to break up our giant Oracle-based solution into a services-based architecture. Some of our APIs needed to be multi-region and highly available, so we decided to split our data into multiple data stores. Subscriber data was migrated to a Cassandra data store. Our payment processing integration needed ACID transactions, so all relevant data was migrated to MySQL. Following is a representation of our post-migration architecture.
As we approached the mammoth task of migration, we were keenly aware of the many challenges in front of us…
- Our migration should ideally not take any downtime for user facing flows.
- Our new architecture in AWS would need to scale to the rapidly growing member base.
- We had billions of rows of data, constantly changing and composed of all the historical data since Netflix’s inception in 1997. It was growing every single minute in our large shared database on Oracle. To move all this data over to AWS, we first needed to transport and synchronize the data, in real time, into a double-digit-terabyte RDBMS in the cloud.
- Being a SOX system added another layer of complexity, since all the migration and tooling needed to adhere to our SOX processes.
- Netflix was launching in many new countries and marching towards being global soon.
- Billing migration needed to happen without adversely impacting other teams that were busy with their own migration and global launch milestones.
Our approach to migration was guided by simple principles that helped us in defining the way forward. We will cover the most important ones below:
Challenge complexity and simplify: It is much easier to simply accept complexity inherent in legacy systems than challenge it, though when you are floating in a lot of data and code, simplification becomes the key. It seemed very intimidating until we spent a few days opening up everything and asking ourselves repeatedly about how else we could simplify.
- Cleaning up Code: We started breaking existing code into smaller, more efficient modules and first moved some critical dependencies to run from the Cloud. We moved our tax solution to the Cloud first. Next, we retired serving member billing history from giant tables that were part of many different code paths. We built a new application to capture billing events, migrated only necessary data into our new Cassandra data store and started serving billing history, globally, from the Cloud. We spent a good amount of time writing a data migration tool that would transform member billing attributes spread across many tables in Oracle into a much simpler Cassandra data structure. We worked with our DVD engineering counterparts to further simplify our integration and got rid of obsolete code.
- Purging Data: We took a hard look at every single table to ensure that we were migrating only what we needed and leaving everything else behind. Historical billing data is valuable to legal and customer service teams. Our goal was to migrate only necessary data into the Cloud. So, we worked with impacted teams to find out what parts of historical data they really needed. We identified alternative data stores that could serve old data for these teams. After that, we started purging data that was obsolete and was not needed for any function.
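To make the idea of collapsing attributes from many relational tables into one simpler record more concrete, here is a minimal sketch. It is not our migration tool: the table shapes, field names and the `MemberBillingRecord` type are hypothetical and chosen only for illustration, and the rows are plain dictionaries rather than real Oracle or Cassandra reads and writes.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class MemberBillingRecord:
    """Hypothetical denormalized record: one per member, ready for a wide-row store."""
    member_id: int
    country: str
    paid_through: Optional[date]
    credit_cents: int
    events: List[dict] = field(default_factory=list)


def flatten_member(member: dict, period_rows: List[dict],
                   credit_rows: List[dict]) -> MemberBillingRecord:
    """Collapse rows from three hypothetical relational tables into one record."""
    member_id = member["member_id"]
    # Keep only the rows that belong to this member.
    own_periods = [p for p in period_rows if p["member_id"] == member_id]
    own_credits = [c for c in credit_rows if c["member_id"] == member_id]
    # The paid-through date is the latest end date among paid periods, if any.
    paid_ends = [p["period_end"] for p in own_periods if p["status"] == "PAID"]
    # Billing history becomes a date-ordered list of simple events.
    events = [
        {"type": "PERIOD_" + p["status"], "on": p["period_end"]}
        for p in sorted(own_periods, key=lambda p: p["period_end"])
    ]
    return MemberBillingRecord(
        member_id=member_id,
        country=member["country"],
        paid_through=max(paid_ends, default=None),
        credit_cents=sum(c["amount_cents"] for c in own_credits),
        events=events,
    )
```

The point of the sketch is the shape of the transformation: joins and aggregation happen once, at migration time, so that readers of billing history only ever fetch a single record per member.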
Build tooling to be resilient and compliant: Our goal was to migrate applications incrementally with zero downtime. To achieve this, we built proxies and redirectors to pipe data back into DC. This helped us keep our applications in DC unimpacted by the change until we were ready to migrate them.
- We had to build tooling in order to support our Billing Cloud infrastructure which needed to be SOX compliant. For SOX compliance we needed to ensure mitigation of unexpected developer actions and auditability of actions.
- Our Cloud deployment tool Spinnaker was enhanced to capture details of deployment and pipe events to Chronos and our Big Data Platform for auditability. We needed to enhance Cassandra client for authentication and auditable actions. We wrote new alerts using Atlas that would help us in monitoring our applications and data in the Cloud.
- With the help of our Data analytics team, we built a comparator to reconcile subscriber data in the Cassandra datastore against data in Oracle by country and report mismatches. To achieve the above, we relied heavily on the Netflix Big Data Platform to capture deployment events and used Sqoop to transport data from our Oracle database and Cassandra clusters to Hive. We wrote Hive queries and MapReduce jobs for the needed reports and dashboards.
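The by-country reconciliation described above ran as Hive queries and MapReduce jobs. The sketch below swaps that for a small in-memory comparison in Python, purely to show the logic of such a comparator. The function name, the record layout and the compared field names (`plan`, `paid_through`, `status`) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, object]


def reconcile_by_country(
    oracle: Dict[int, Record],
    cassandra: Dict[int, Record],
    fields: Tuple[str, ...] = ("plan", "paid_through", "status"),
) -> Dict[str, List[str]]:
    """Compare two snapshots keyed by member id and group mismatches by country."""
    mismatches: Dict[str, List[str]] = defaultdict(list)
    # Walk the union of ids so that records missing on either side are reported.
    for member_id in sorted(oracle.keys() | cassandra.keys()):
        src = oracle.get(member_id)
        dst = cassandra.get(member_id)
        country = str((src or dst)["country"])
        if src is None:
            mismatches[country].append(f"{member_id}: missing in Oracle")
        elif dst is None:
            mismatches[country].append(f"{member_id}: missing in Cassandra")
        else:
            for name in fields:
                if src.get(name) != dst.get(name):
                    mismatches[country].append(
                        f"{member_id}: {name} differs "
                        f"({src.get(name)!r} vs {dst.get(name)!r})"
                    )
    return dict(mismatches)
```

Grouping the report by country matters because the migration itself proceeded country by country, so a clean report for one country can gate that country's cutover independently of the rest.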
Test with a clean and limited dataset first. How global expansion helped us: As Netflix was launching in new countries, it created a lot of challenges for us, though it also provided an opportunity to test our Cloud infrastructure with new, clean data, not weighed down by legacy. So, we created a new skinny billing infrastructure in the Cloud for all the user facing functionality, and a skinny version of our renewal batch process, with integration into DC applications, to complete the billing workflow. Once the data for new countries could be successfully processed in the Cloud, it gave us the confidence to extend the Cloud footprint to existing large legacy countries, especially the US, where we support not only streaming but DVD billing as well.
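Serving new countries from the Cloud while legacy countries stayed in DC implies some form of per-country routing. The following is a minimal sketch of that idea, not a description of our actual proxies and redirectors: `CountryRouter`, its methods and the request shape are hypothetical names introduced for illustration.

```python
from typing import Callable, Set

Handler = Callable[[dict], str]


class CountryRouter:
    """Send a request to the Cloud or DC handler based on the member's country."""

    def __init__(self, cloud: Handler, dc: Handler) -> None:
        self._cloud = cloud
        self._dc = dc
        self._cloud_countries: Set[str] = set()

    def migrate(self, country: str) -> None:
        # Mark a country as served from the Cloud.
        self._cloud_countries.add(country.upper())

    def rollback(self, country: str) -> None:
        # Send a country back to DC if problems show up after cutover.
        self._cloud_countries.discard(country.upper())

    def handle(self, request: dict) -> str:
        in_cloud = request["country"].upper() in self._cloud_countries
        return (self._cloud if in_cloud else self._dc)(request)
```

A router like this keeps the cutover decision in one place, which makes it cheap to start with a small, clean country and to reverse the decision for a single country without touching the others.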
Decouple user facing flows to shield customer experience from downtimes or other migration impacts: As we were getting ready to migrate existing members’ data into Cassandra, we needed downtime to halt processing while we migrated subscription data from Oracle to Cassandra for our APIs and batch renewal in the Cloud. All our tooling was built around the ability to migrate a country at a time and tunnel traffic as needed.
- We worked with ecommerce and membership services to change integration in user workflows to an asynchronous model. We built retry capabilities to rerun failed processing and repeat as needed. We added optimistic customer state management to ensure our members were not penalized while our processing was halted.
- By doing all the above, we transformed and moved millions of rows from Oracle in DC to Cassandra in AWS without any obvious user impact.
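The retry capability and optimistic customer state management mentioned above can be sketched roughly as follows. This is an illustrative assumption about the pattern, not our production code: `ProcessingHalted`, `run_with_retry`, `member_state` and the state names are all hypothetical.

```python
import time
from typing import Callable, Optional


class ProcessingHalted(Exception):
    """Hypothetical signal that billing processing is paused for migration."""


def run_with_retry(task: Callable[[], str], attempts: int = 3,
                   delay_seconds: float = 0.0) -> Optional[str]:
    """Run a billing task, retrying while processing is halted.

    Returns the task result, or None if every attempt hit a halt.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except ProcessingHalted:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    return None


def member_state(billing_result: Optional[str]) -> str:
    """Optimistic rule: an unknown billing result never penalizes the member."""
    return billing_result if billing_result is not None else "ACTIVE"
```

For example, `member_state(run_with_retry(charge_member))`, with a hypothetical `charge_member` task, would keep a member active while processing is halted, and the failed work can simply be rerun once processing resumes.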
Moving a database needs its own strategic planning: Database movement needs to be planned out while keeping the end goal in sight, or else it can go very wrong. There are many decisions to be made: predicting storage so that it absorbs at least a year’s worth of data growth, which translates into the number of instances needed; licensing costs for both production and test environments; using RDS services vs. managing larger EC2 instances; ensuring that the database architecture can address scalability, availability and reliability of data; creating a disaster recovery plan; planning for the minimal migration downtime possible; and the list goes on. As part of this migration, we decided to migrate from licensed Oracle to an open source MySQL database running on Netflix-managed EC2 instances.
- While our subscription processing was using data in our Cassandra datastore, our payment processor needed the ACID capabilities of an RDBMS to process charge transactions. We still had a multi-terabyte database that would not fit within the terabyte limits of AWS RDS. With the help of Netflix platform core and database engineering, we defined a multi-region, scalable architecture for our MySQL master with a DRBD copy and multiple read replicas available in different regions. We also moved all our ETL processing to replicas to avoid resource contention on the master. Database Cloud Engineering built tooling and alerts for MySQL instances to ensure monitoring and recovery as needed.
- Our other big challenge was migrating constantly changing data to MySQL in AWS without taking any downtime. After exploring many options, we proceeded with Oracle GoldenGate, which could replicate our tables across heterogeneous databases, along with ongoing incremental changes. Of course, this was a very large movement of data that ran in parallel with our production operations and other migrations for a couple of months. We conducted iterative testing and issue fixing cycles to run our applications against MySQL. Eventually, many weeks before flipping the switch, we started running our test database on MySQL and would fix and test all issues on the MySQL code branch before doing a final validation on Oracle and releasing in production. Running our test environment against MySQL continuously created a great feedback loop for us.
- Finally, on January 4, with a flip of a switch, we were able to move our processing and data ETLs over to MySQL.
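The storage prediction step mentioned above, absorbing a year of growth and translating it into instances, comes down to simple arithmetic. The sketch below assumes steady compound monthly growth and a fixed usable capacity per instance. Both function names and every number used with them are hypothetical and are not figures from our migration.

```python
import math


def projected_size_tb(current_tb: float, monthly_growth_rate: float,
                      months: int = 12) -> float:
    """Project database size assuming steady compound monthly growth."""
    return current_tb * (1 + monthly_growth_rate) ** months


def instances_needed(size_tb: float, capacity_per_instance_tb: float,
                     headroom: float = 0.2) -> int:
    """Instances required to hold the data plus a safety margin."""
    return math.ceil(size_tb * (1 + headroom) / capacity_per_instance_tb)
```

With these assumptions, a database growing 5% a month ends the year at roughly 1.8 times its starting size, which is the kind of margin that is easy to underestimate when sizing instances for the day of cutover only.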
While our migration to the Cloud was relatively smooth, looking back, there are always a few things we could have done better. We underestimated our test automation needs, and we did not have a good way to test end-to-end flows. Spending enough effort on these aspects upfront would have given us better developer velocity.
Migrating something as critical as billing with scale and legacy that needed to be addressed was plenty of work, though the benefits from the migration and simplification are also numerous. Post migration, we are more efficient and lighter in our software footprint than before. We are able to fully utilize Cloud capabilities in tooling, alerting and monitoring provided by the Netflix platform services. Our applications are able to scale horizontally as needed, which has helped us in keeping up our processing with subscriber growth.
In conclusion, billing migration was a major cross functional engineering effort. Many teams supported us through this, including core platform, security, database engineering, tooling, big data platform, business teams and other engineering teams. We plan to cover focused topics on database migration and engineering perspectives as a continuing series of blog posts in the future.
Once in the Cloud, we now see numerous opportunities to further enhance our services by using innovations of AWS and the Netflix platform. Netflix being global is bringing many more interesting challenges to our path. We have started our next big effort to re-architect our billing platform to become even more efficient and distributed for a global subscriber scale. If you are interested in helping us solve these problems, we are hiring!
Originally published at techblog.netflix.com on June 21, 2016.