PayPal’s Teradata Data Warehouse Migration to Google BigQuery for Product Analytics
The first in a multi-step Google Cloud Platform journey
“Take the first step in faith. You don’t have to see the whole staircase. Just take the first step.” — Dr. Martin Luther King Jr.
PayPal has experienced record growth since the beginning of the global pandemic. To keep up with the demand from growth, we decided to migrate PayPal Analytics platforms to the public cloud. The first big migration of a warehouse workload to BigQuery in Google Cloud took less than a year. Along the way, the PayPal team built a platform which would support many other use cases as well.
This writeup captures a milestone migration experience. We migrated half of our data and processing from Teradata systems to Google Cloud Platform’s BigQuery.
As organizations and consumers ventured into new ways of doing business during the pandemic, PayPal experienced record-high transaction volumes. This put a lot of pressure on the offline analytics systems used for compliance, risk processing, product and financial analytics, marketing, customer success, and fraud protection. These analytics systems were all in on-premises data centers. The systems were powered by Teradata and Hadoop at the core, with additional software and workflows in place to manage resources across these systems.
Demands for processing data were far outstripping the existing capacity on-premises. Adding capacity quickly during the pandemic had its own share of challenges. Data platform teams managed the crisis with manual intervention to prioritize various workloads that demanded additional processing time. Given the business outlook of continuing growth, PayPal realized that the analytics ecosystem required a change.
Additionally, we recognized the opportunity to modernize our data strategy in accordance with the ideas of better agility, discoverability, shareability and ecosystem integration. BigQuery allows us to centralize our data platform without losing capabilities such as SQL access, Spark integration, and advanced ML training. Also, BigQuery has some advanced features such as ML and real-time analytics that can be leveraged without moving data out to another system.
There were multiple factors that PayPal took into consideration in choosing cloud over on-premises expansion. PayPal’s data team started a blueprint for migration to public cloud, to keep up with data demands for the next five years based on Google Cloud Platform’s capabilities.
PayPal’s analytics infrastructure is based on a constellation of technologies for various use cases. Data analysts and a portion of data scientists predominantly depended on a data warehouse for their data work. The data in the warehouse was semi-structured, making it easier for teams to do analysis and reporting.
A simplified view of the data flow is provided in the following figure. Data from site databases first goes into the data warehouse. A copy of some of the data from the warehouse is made into a data lake which is powered by open source technologies. Data is then embellished with other data sources like tracking, experimentation, and data from PayPal’s adjacencies to get transformed and loaded back into the analytics warehouse for consumption.
PayPal managed two vendor-based data warehouse clusters on-premises, with a total storage of 20+ petabytes serving over 3,000 users. The capacity requirement was constantly growing as data became critical for business decisions. The analytics warehouse was limited by storage and CPU, and the main warehouse was limited by IO and storage.
Warehouse use cases could broadly be classified into interactive and batch workloads. Interactive workloads include ad-hoc queries from users using Jupyter Notebooks, and reports and dashboards using BI tools like Tableau and Qlikview. Batch workloads are scheduled using Airflow and UC4. Workloads are mostly written in SQL and executed using shell or Python scripts.
Due to challenges that arose as traffic grew, many of the transformation jobs and batch loads were running behind schedule. PayPal analysts and data scientists were seeing data way behind their service-level agreements (SLAs), with a degraded experience, and all of that delayed decision-making. Of the two major warehouses, PayPal decided to first migrate the analytics warehouse to BigQuery to experience using the service as a replacement for Teradata, and in the process build a platform around Google Cloud Platform services for PayPal’s data users. The main warehouse would be migrated later based on the learnings and experience from the analytics warehouse.
Improving data users’ experience at PayPal involves addressing the following technological challenges:
- Security: As PayPal handles PII and PCI data, any data infrastructure needs full patching, hardened system configuration, encryption and decryption keys configured for sensitive data, access to data over TLS, and good role-based access control (RBAC) for data access.
- Elastic Scale on Demand: Timely access to capacity is very important to satisfy workload constraints. Hence, the analytics data infrastructure needs to scale up and down as demand grows and shrinks.
- Highly Performant SQL Access: Analyst and data scientist productivity would be improved if data types and access patterns were provided with a highly performant ANSI SQL interface.
- Access from BI tools: Since business intelligence is crucial to communicate insights, the analytics infrastructure should plug and play well with the present tools like Jupyter Notebooks, Tableau and Qlikview, and modern BI tools like Looker and ThoughtSpot. RBACs applied on the analytics infrastructure need to be honored by BI tools uniformly for easy and standardized data access management.
- Showback: Data users don’t have a clear view of their resource consumption. Any new infrastructure should have the capabilities for our teams to provide this valuable information back to users.
Changes to infrastructure need to overcome the following adoption challenges:
- Standardized: Data users had been subjected in the past to non-standard infrastructure which had either slowed them down or restricted the usage patterns. Users prefer something which is standardized so that they could use their existing talent pool, and their favorite tools.
- Path to Migration: Data users prefer a technology to which it is easy to migrate their existing artifacts in notebooks, dashboards, batch and scheduled jobs. Rewriting their workloads to a new target was seen as a massive investment and could be a failure from the start.
- Ease of Training: Users prefer a technology which is easy to learn by themselves online, as opposed to having special adaptations which may require specialized training and learning time.
- Access Flexibility: Tools for exploring data and modeling have been undergoing various changes. Users prefer an infrastructure that evolves as technologies evolve. Case in point: even though a majority of PayPal’s consumers use SQL, there are many users who use Python, Spark, PySpark and R for their analyses and machine learning use cases. In addition, users would like to see the infrastructure be constantly updated to leverage new features or to process data differently based on the industry trends.
- Disaster Recovery: Any infrastructure should have clear disaster recovery options which can be triggered within 30 minutes for users to get started with their work.
Given the list of challenges PayPal had to address, it was very clear that creating new on-premises solutions would be a problem. The building blocks for robust solutions were focused around the cloud with less support for on-premises infrastructure. Additionally, scaling requires buying hardware and the long lead times were a hindrance to business enablement. PayPal was already moving significant workloads to Google Cloud Platform which made the choice of moving the analytics platform to Google Cloud Platform easier. We evaluated various vendors that offered their services on Google Cloud Platform to see if they could solve some of the technological challenges mentioned earlier, and we narrowed down the choice to BigQuery. We ran a 12-week evaluation of BigQuery to cover different types of use cases. It performed well against the success criteria we targeted. A summary of the evaluation results is provided below.
We will be writing about the evaluation process, success criteria and the results in a separate post.
As part of our blueprint, we decided to tackle the “Analytics Warehouse” shown in Figure 1.
Once we chose which cloud and warehouse to explore, we identified the following tracks and started getting to the next phase.
We reached out to users of the warehouse based on their usage stats over the past 12 months, as well as the data providers in that cluster. We set up time, walked them over the decision and sought their buy-in for this migration. This stakeholder buy-in was important for us to have a successful migration. We explained the rationale and how we planned to approach the problem. Some users were excited and wanted to be closely involved with the migration effort. We identified one team within one business unit as an early adopter and focused our migration effort on their use cases and data requirements.
Secure Infrastructure Buildout
We built a secure infrastructure to move data to the cloud. We kept data within BigQuery as US, multi-region to be accessed from other regions in the US. We implemented a secure private interconnect between our data center and the region in Google Cloud Platform nearest the analytics warehouse. Since we expected to operate in a hybrid mode (other connected systems remain on-premises for the foreseeable future), a private interconnect with no egress costs was a better option.
We decided to secure our data using PayPal-provided private keys in BigQuery, within a service perimeter offered by Google Cloud Platform. This ensured that data was secure and within a perimeter which cannot be accessed from outside. We deployed automation to prevent accidental creations of data sets that lack encryption keys. This way, we had encryption enabled by default for all data stored in Google Cloud Platform in a way that was compliant with our internal policies and external regulations.
We have used this infrastructure to copy more than 15 petabytes for BigQuery and 80+ petabytes into Google Cloud Services for various use cases. We use the same network infrastructure for users to access BigQuery through Jupyter Notebooks, Tableau, or from their scheduled jobs.
Regulatory Compliance and Pen-Testing
As a fintech organization that deals with PCI and PII data elements in our datasets, we worked with various regulators to file our intent to move data to the cloud. PayPal’s InfoSec, regional business units and legal teams worked overtime with our teams to prepare paperwork for regulators. We followed up with rounds of pen-tests to ensure there were no gaps. This helped us validate what we had designed into the infrastructure to secure the data and access to the data.
DDL (Data Definition Language) and SQL Conversion
Given that we were taking our data users to the cloud with a new technology, we wanted to ease the pain of transitioning from Teradata to BigQuery. To achieve this, we evaluated various options and selected a tool from CompilerWorks. Its transpiler allowed us to create DDLs in BigQuery, and use that schema to convert DMLs and user SQLs from Teradata flavor to BigQuery. PayPal worked to harden the transpiler configurations to generate performant, clean BigQuery-compatible SQLs.
This automated code conversion was a very critical step for us to get right as we wanted to ease the migration for our users. In addition to the code conversion, we also extracted valuable lineage data from CompilerWorks’s tool. We created an automation framework as well as a portal for interactive use and self-service code conversion. The automation kept polling for changes in on-premises infrastructure and create the equivalent in BigQuery as new artifacts got created. We requested users to use the portal to convert their existing or known SQLs to BigQuery-compatible SQLs for their testing and validation. We leveraged the same framework to convert users’ jobs, Tableau dashboards, and Notebooks for testing and validation. This automation helped us convert more than 10K SQLs.
Workloads, Schemas and Tables Identification
To scope the workloads, the team went through all Notebooks in our repository, Tableau dashboards and UC4 logs. Based on the tables we identified, we created a lineage graph to come up with a list of tables and schemas used, active scheduled jobs, notebooks, and dashboards. We validated the scope of work with users and confirmed that it was a true representation of the workloads on the cluster. This helped the team greatly reduce the number of workloads that we needed to migrate. Here is a breakdown of what was deprecated from the overall inventory.
Spending time on automation helped us separate out the used ones from unused ones and gain validation from users at the last step. Having users manually find this out would be tedious and error prone.
Data Movement, Loading and Validation
As we worked through this project, it became clear that data movement is very contextual to our setup, and off-the-shelf tools had limitations in being able to seamlessly copy data to Google Cloud Platform. This was the hardest part of the whole project. It was not the volume but accidental complexities that made it harder. Here is a quick list of problems we encountered:
- Resource Availability and Usage: Since we were extracting data from an on-premises warehouse, our extraction speeds were bounded by the capacity available on the source. Our warehouse usage was seasonal, and running data extractions became extremely hard during those peak times. This became a manual process at various times of the day or weekends when we sped up data transfer by allocating more capacity to the extraction process.
- Data Manipulations at the Source: Since we were running on-premises systems while we were extracting data, we had to copy all incremental changes continuously to the target in BigQuery. For small tables, we could simply copy the whole table repeatedly. For larger tables that had rows added every day and did not have updates or deletes, we could track incremental changes and replicate it to target. For tables that updated rows at source or where rows were deleted and recreated, replicating actions became a bit harder.
- DDL Changes in Source: Alterations to source tables to support business use cases were inevitable. Since DDL changes were already limited to batch processes, we instrumented the batch processing platform to detect and synchronize data copy operations with changes.
- Data Types: While the mapping between Teradata and compatible BigQuery datatypes was straightforward, decisions had to be made about handling implicit behaviors. For example, we had timestamps with an implicit timezone in our source data that applications relied on and had to be converted to Datetime instead of Timestamp for loading into BigQuery. Similarly, string values in the source system had to be trimmed before copying into BigQuery so that queries that used the equal operator would return the same results as Teradata.
- Data Loading: Doing a one-time load into BigQuery is fairly straightforward. The process starts getting complicated when replicating changes at source on a periodic basis to BigQuery. This involves keeping track of the changes from the source, and replaying them in BigQuery. Handling a large backlog of automated data loads for these extreme cases was challenging. Since we were cutting over users progressively, we had to be aware that the tables in BigQuery needed to be in production quality.
- Data Validations: Multiple types of data validations were performed on the data before it was released to data users. This included row counts, partition counts, column aggregations and sampling checks.
- BigQuery Nuances: Limits on the number of partitions a single query can touch meant splitting up data load statements based on partitions and tuning the splits when we ran against the limit.
Dry Runs and Wet Runs
Dry runs, an execution with no data, ensured that there were no syntax errors with the queries that were transformed. If the dry run was successful, we loaded data into the tables and requested users to conduct a wet run. Wet runs were one-off executions to test whether the result sets were all correct. We created test data sets for users to do their wet run, before they qualified their workload for production. All of this were enabled for users using our application lifecycle management portal which our users were accustomed to for deploying applications. We laid lots of emphasis on fitting our testing into an ecosystem that our users were used to.
Visibility into Progress
A number of the above activities were happening in parallel. This required coordination, which is harder for humans or coordinated spreadsheets. We kept track of all the data in BigQuery which was automatically updated as executions happened. We created dashboards to track sequences of activities and to report consistently to our execs and stakeholders on progress. These dashboards tracked the data copy progress for multiple milestones, workload rationalization and progress on readiness of notebooks, scheduled jobs and BI dashboards for dry and wet runs. A sample report looked like this. Users were able to search by DB name and table names for checking status.
What Went Well
Team Work Makes the Dream Work
This was very true in our case since many teams across PayPal came together to get to this landmark. We believe the following makes our story unique, and that helps us succeed:
- Know Your Customer: This was very important throughout our journey. Our product team did an amazing job understanding how customers used and dealt with data. Product teams also had a very good understanding of the workflows that data users used to work on their data. This helped the engineering team frame the problem that needed to be solved.
- Engage Often and Early: We engaged our users right from day one of our journey, sharing with them what we have seen and how we planned to make progress. We shared our plans, created working groups, and brainstormed ideas with our users. Making our users part of the journey was critical for our success.
- Automation Rigor: This is important and is overlooked in large projects. Even if the end product is one-time use, automation helps improve performance when we have to redo things from scratch. Automation helped with repeatability and recoverability in a big way.
- Program Management: We had a very good program team split across the globe. Program team ensures that every track reported and tracked progress against common milestones. All progress was tracked in a common dashboard that everyone could view and validate.
- Stakeholder Reviews: We invited our users and stakeholders (including the Google Cloud Platform account team) for periodic reviews, to track risks, seek help, and be transparent on what we were doing. We had executive support since we managed and tracked the project in a transparent manner.
- Perfection is the Enemy of Good: Given the large-scale nature of the changes, we understood that we couldn’t be perfect. We laid down ground rules to adhere to. As we worked through the project, we found multiple places where a redesign or re-architecture was required. Instead of adding tracks, we focused on our main goal, and worked around these design challenges in the short-term.
- Relationship with Google Cloud Platform: This was critical. We built a great relationship with our Google Cloud Platform professional services, customer engineering, account and executive teams. We invited these teams into our design discussions, reviewed work items, reviewed backlogs, sought help and worked around when we ran into issues. This helped Google Cloud Platform enable features early, and respond to our bugs quickly for our use cases. We treated the GCP account and PSO teams as our partners and this paid off.
Conclusion and Next Steps
Our user community at PayPal has transitioned and taken to BigQuery really well. Users needed initial help with project conventions (a new concept for them compared to Teradata) and with some help, they got productive very quickly. Users absolutely love the query performance, faster data load times and full visibility through BigQuery logs.
Taking PayPal through this migration helped us evaluate and observe what BigQuery and Google Cloud Platform can bring to PayPal. Data users now use SQL, as well as Spark through Notebooks and Google’s Dataproc over BigQuery. This helps us maintain a single copy of data along with visibility that Google Data Catalog provides for our data.
Plans are afoot to consolidate multiple datasets from finance, HR, marketing and third party systems like Salesforce along with site activity into BigQuery to enable faster business modeling and decision making. Teams are looking at streaming capabilities to inject site data sets directly to BigQuery for near real time usage to our analysts. In addition to BigQuery, some of our teams are also leveraging Google DataProc and Google Cloud Storage to consolidate many pieces of our open-source based data lake shown in Figure 1.
There are many employees at PayPal who worked directly and indirectly on this effort. Many employees in our India offices spent time on this effort while dealing with the raging pandemic. Our thanks to all of them!