Scaling Database Access for 100s of Billions of Queries per Day @ PayPal : Introducing HERA
“What is the difference between an obstacle and an opportunity? Our attitude toward it. Every opportunity has a difficulty, and every difficulty has an opportunity.”
— J. Sidlow Baxter
HERA, High Efficiency Reliable Access (to data stores), is a data access gateway that PayPal uses to scale database access for hundreds of billions of SQL queries per day. Additionally, HERA improves database availability through sophisticated protection mechanisms and provides application resiliency through transparent traffic failover.
Through its many incarnations, HERA has been part of PayPal’s site architecture since the late 1990s. Today, tens of thousands of PayPal applications access thousands of database instances through hundreds of HERA pools. After decades of battle testing HERA at PayPal, a global and highly reliable/scalable/available payments company, we are proud to open source HERA on Github so others may benefit from our endeavors.
Setting the Stage
Like many successful enterprises with more than two decades of history, PayPal has relied on RDBMS as the data persistent storage of choice for OLTP workloads. Relational databases at PayPal were originally run on a single node — the only way to scale the database was to ‘scale it up’ (a.k.a vertical scaling). As our customer base, services offerings and geo-locations grew, so did our data storage and access needs. From an application architecture perspective, we started out with a single, monolithic application. Over time, we migrated to a microservices-based application architecture, one running thousands of distinct microservices (a.k.a. applications). The increased horizontal scale at the application tier resulted in a greater load on our OLTP database tier. Clearly vertical scaling of the DB would not be able to keep up with horizontal scaling at the application tier. We needed to also scale DB access horizontally — after many iterations, we settled on building a horizontally-scalable db gateway/multiplexer between the application and db tiers. With no commercially-available or open-source solutions to help make this a reality, we created HERA.
Challenges are an integral part of life — by overcoming challenges, we build confidence in the face of new obstacles. This belief has helped our team surmount multiple obstacles and build HERA to be one of PayPal’s most critical infrastructure pieces today. This blog chronicles HERA’s evolution and growing pains.
HERA Resolves Scalability Challenges
Challenge 1 — Latency associated with opening database connection at the time of request
DB connection establishment introduced application latency and customer-perceived response times.
Given PayPal’s high level of software sophistication today, it is hard to imagine a time when connection pooling did not exist. However, that was indeed the case in the 1990s when there were no connection pooling solutions readily available on the market. At this point, the team built its first version of HERA as a library providing DB connection pooling to applications accessing PayPal databases. Application response times immediately improved thanks to faster DB access times, which decreased from triple-digit to single-digit milliseconds latency. In this initial version, HERA established & cached DB connections. When an application thread needed to query a DB, the application would lease a connection from the HERA library.
Challenge 2 — Limitation of number of concurrent connections that database can support
Circa 2011, PayPal was growing rapidly in terms of both active customers and payment rate (e.g. 100s of payment transactions per second). To support continued growth, PayPal needed a modern, horizontally scalable application architecture backed by a reliable, available, and scalable data architecture. As mentioned earlier, we shifted to microservices architecture in the application tier. This migration resulted in a proliferation of application servers and an associated increase in pooled DB connections. A downside of microservices architecture is the increase in the number of persistent connections in-bound at the DB. Since RDBMS databases can only support a limited number of these inbound connections, we needed to scale out our database tier in concert with scaling out our application tier. However, since most of these pooled connections are idle, scaling out the DB layer in this fashion would result in lower DB utilization. To address this problem, a number of suboptimal solutions were proposed, discussed and tried:
- Lowering connection pool size in each microservice node
- Rewriting applications to use connections efficiently
- Scaling up database hardware
None of these solutions looked like a reliable long-term approach for our scaling problems.
The preferred approach and the one taken at PayPal in 2012 to address this problem was to instead build a connection multiplexer that could multiplex a high count of inbound (mostly idle) db connections from the application tier to a small set of active DB connections in the DB tier. This became the second incarnation of HERA.
As a further improvement, HERA provided the ability to multiplex multiple active inbound (from application tier) connections to a single active outbound (to DB tier) connection. It could only do this for non-transaction DB requests, so HERA needed a way to differentiate between transactional and non-transactional requests. It was at this time that HERA’s initial SQL parsing capabilities were implemented.
Challenge 3 — Performance limitations of Oracle RAC
Being one of the most popular online payments systems in the world, PayPal needs to continuously scale its data platforms. To scale Oracle databases horizontally, PayPal started using Oracle’s Real Application Cluster (RAC) platform. Massive PayPal DB traffic uncovered a chattiness problem in Oracle’s RAC implementation — if all the server nodes in a RAC cluster process DB write SQLs, the nodes need to share lock information with each other. This chattiness increases the latency of writes, thereby increasing customer-perceived response times.
The solution to this problem was implemented in 2013 in the form of the CQRS (Command Query Responsibility Segregation) data access pattern a.k.a. a Read/Write split.
In order to implement read/write splitting, we leveraged HERA’s ability to parse SQL and differentiate between transactional and non-transactional requests. Additionally, we developed a Routing component to direct read and write traffic to the appropriate dedicated read or write instances in the cluster. This solution reduced internal communication between RAC nodes and improved query response times. We enhanced the read/write query handling with isolation so that, for example, read spikes don’t affect write performance or vice-versa.
Challenge 4 — Scale limitation of RDBMS databases
As previously mentioned, our initial RDBMS architecture consisted of a single node — horizontal scaling was not initially supported. With the advent of our Oracle RAC solution, we introduced a shared everything architecture that supported read/write splitting. This architecture had a limited number of compute nodes over shared storage but was still not an extensible, horizontally-scalable solution. Circa 2015, we implemented sharding to address this problem. In a nutshell, several RAC clusters are employed to mimic a single logical database. Logical shards are distributed over these RACs and shard routing is handled by HERA. Applications see a single logical table.
This sharding approach required significant enhancements to HERA’s features:
- the Parsing component was updated to detect and extract the shard key from SQL Where clauses
- the Routing component was enhanced to route queries to a database based on the shard_id extracted by the Parsing component
In addition to WHERE clause-based routing, HERA also supports hint parsing and routing for explicit shard selection.
HERA’s Protective Shield
As HERA is deployed between applications and databases, it can protect one from the other thereby improving databases availability and applications resiliency.
Database Protection Features: Here are some features we support to prevent or minimize database resources saturation:
- An adaptive surge queue to backlog incoming requests during database resource saturation
- Surge queue management is implemented based on the CoDel algorithm
- A surge queue has configurable capacity — once full, new requests are rejected by bouncer.
- A Bouncer process rejects queries during database resource saturation/starvation. This occurs if
- there are no connections available and
- the surge queue, if configured, is full
- Eviction of any long running queries to prevent database resource saturation
- Queries that are candidates for eviction are identified based on predefined policies
- Eviction is throttled to minimize impact on applications and payments
- Throttling connections to databases after routine maintenance is complete to avoid connections storms
Application Resiliency Features: Here are a few features we introduced to minimize application impact in case of database availability issues:
- Supporting database maintenance by detecting maintenance mode and transparently rerouting requests to an available database
- HERA Transparent Applications Failover — detect and transparently reroute read-traffic due to Read Replica availability issues
A Very High-level Overview of HERA’s Architecture
HERA’s sophisticated features and battle-tested implementation provides PayPal with the solutions described in the previous sections. Without HERA’s Read/Write Split and Sharding features, every PayPal application would need to re-invent this wheel when accessing databases. Multiple technology stacks used in PayPal aggravate the problem by limiting code reusability between applications.
By providing centralized solutions and keeping application clients thin, HERA makes it easier for PayPal developers to concentrate on implementing business logic instead of tracking the ever-evolving complexities of PayPal data platforms. This significantly reduces time to market as well as resiliency of PayPal products.
HERA consists of three main components:
1. A language-specific HERA client library that resides on each application host and allows applications to communicate with HERA efficiently.
2. A centralized HERA proxy service that is fully aware of database configurations and application requirements (e.g. expected distribution of traffic between reads and writes). The HERA service receives application requests and routes them to a chosen database instance. The Router, Parser, & Query Rewriting modules are all part of the HERA proxy.
3. A set of workers that are responsible for maintaining connections to a specific database and facilitating communication between the proxy and this database.
Footnote: Till 2018 HERA was implemented in C++. It is now written in GoLang.
From Great Past To Bright Future via Open Source
After maturing for more than a decade in PayPal’s production environment, enhanced through multiple revisions and frequent rewrites, HERA is now available outside of PayPal as an Apache 2-licensed project. Refer to our instructions on how to contribute and please help us make it even better.
This post is the first in a series of articles covering HERA. This post details our motivations in building HERA, the business value it brings to PayPal, and its major features. Stay tuned for future articles that will dive into HERA’s design and implementation, migration from C++ to GoLang, etc…
A special thanks goes out to all of the folks who helped developed HERA over the years, namely the folks below:
Kamlakar Singh, Yaping Shi, Mukundan Narayanan, Qing Hu, Kenneth Kang, Petrica Voicu, Varun Sankar, Shuping Tien, Stephanie Vu, Shashank Singh, Gurkirat Singh.