Scaling Large Business-Critical Systems

Srilatha
The PayPal Technology Blog
Aug 3, 2018 · 6 min read
[Image: Mount Everest as seen from Drukair. Photo: shrimpo1967, CC BY-SA 2.0]

Context

We read a lot about scaling applications and databases in systems design. Our team went through a similar journey in scaling a high-traffic database at PayPal.

The primary purpose of this article is to share what we learned while building a system that scales with minimal downtime and manageable infrastructure expense.

To handle the growth in traffic, we initially scaled vertically by adding more compute and memory resources. The volume of requests kept growing over the years, which made it harder to scale the infrastructure vertically without significant downtime. The cost of the infrastructure required to keep up with traffic growth through vertical scaling was also very high.

Later we introduced Read-Only (RO) replicas as another attempt at scaling. This solved some of the immediate concerns but introduced other complexities. Many of our use cases need to read data immediately after it is written to the Source of Record (SOR), aka "read your own writes". Because reads were served by the ROs and there was latency in replicating the data from the SOR to the ROs, we introduced caching to handle the "read your own writes" use case. Caching made the code extremely complex and hard to maintain.

We then took on a journey to horizontally scale the database, which involved work at different layers of the technical stack. We will walk through some of the key constraints, design, execution and learnings.

Key Constraints

Any design to solve the data scaling problem had to satisfy some basic constraints:

  • All existing use cases should be supported.
  • There should be minimal changes to the application layer. The design should shield the application layer from the scaling logic.
  • There should be minimal rebalancing of data, the ability to fail over quickly, and a foundation for geo-distribution.
  • The design should not impact current SLAs, including latency and other performance metrics.
  • The changes should be limited to the server side, since there are many clients built on various technology stacks and programming languages.

The constraints can be summarized into some key principles the design had to adhere to: Availability, Scalability, Performance, Data Integrity, Simplicity, Flexibility, Manageability and Maintainability.

Design

To address traffic growth with only linear growth in infrastructure, we had to scale horizontally. Horizontal scaling is about adding more machines to the pool of resources. In the database world, horizontal scaling is often based on partitioning the data, i.e. each node contains only part of the data.

Overview

The approach we took was to scale horizontally by sharding the data across multiple physical databases. The sharding logic is implemented as a multi-level architecture. The shard key is hashed and mapped into one of K buckets. A bucket is mapped to one of M logical shards using a database table. A logical shard is mapped to one of N physical shards. The indirection in this multi-level architecture enables future additions to the physical infrastructure with minimal disruption to ongoing transactions.
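
To make the indirection concrete, here is a minimal sketch of the lookup chain, assuming an in-memory view of the two mapping levels. The class and method names, the value of K and the use of String.hashCode() are illustrative assumptions rather than the actual implementation; in practice the bucket-to-logical-shard mapping lives in a database table, as described above.

    import java.util.Map;

    // Illustrative sketch only: the names, K = 128 and String.hashCode() are assumptions.
    public class ShardRouter {
        private static final int K = 128;                           // number of buckets (assumed)
        private final Map<Integer, Integer> bucketToLogicalShard;   // a database table in practice
        private final Map<Integer, String> logicalToPhysicalShard;  // logical shard -> physical database

        public ShardRouter(Map<Integer, Integer> bucketToLogicalShard,
                           Map<Integer, String> logicalToPhysicalShard) {
            this.bucketToLogicalShard = bucketToLogicalShard;
            this.logicalToPhysicalShard = logicalToPhysicalShard;
        }

        // Level 1: hash the shard key into one of K buckets.
        int bucketFor(String shardKey) {
            return Math.floorMod(shardKey.hashCode(), K);
        }

        // Levels 2 and 3: bucket -> logical shard -> physical shard.
        public String physicalShardFor(String shardKey) {
            int logicalShard = bucketToLogicalShard.get(bucketFor(shardKey));
            return logicalToPhysicalShard.get(logicalShard);
        }
    }

Because adding physical capacity only changes the two mapping levels, requests for the same shard key keep resolving correctly while data is moved, which is what keeps rebalancing minimal.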

Sharding Key

This is one of the first steps in the design. Choosing the shard key involves understanding the current usage patterns and the future usage of the system.

Global Secondary Index(GSI)

This handles use cases where clients do not have the shard key. The GSI does a reverse lookup: given some other identifier, it returns the shard key.
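
As a rough sketch, the GSI can be pictured as a keyed lookup from an alternate identifier to the shard key. The in-memory map below is purely illustrative; a real GSI is itself a durable, partitioned store.

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a real GSI is a durable, partitioned store, not an in-memory map.
    public class GlobalSecondaryIndex {
        private final Map<String, String> idToShardKey = new ConcurrentHashMap<>();

        public void put(String alternateId, String shardKey) {
            idToShardKey.put(alternateId, shardKey);
        }

        // Reverse lookup: given an id the client does have, return the shard key.
        public Optional<String> findShardKey(String alternateId) {
            return Optional.ofNullable(idToShardKey.get(alternateId));
        }
    }

Once the shard key has been recovered, the request is routed exactly as if the client had supplied it.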

Logical and Physical Sharding

The data is divided into K buckets. This gives us the capability to have K logical shards. The shard id is calculated as a hash of the shard key: shard id = (hash of shard key) % K.

Global ID(GID)

In a non-sharded environment, the sequence numbers were obtained from the SOR databases. In a sharded environment, obtaining the sequence number for an ID from each database's own sequence can result in ID collisions across shards. We used a Global ID service to guarantee a unique identifier.
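
One common way such a service can hand out collision-free 64-bit identifiers is a Snowflake-style layout that combines a timestamp, a per-instance worker id and a per-millisecond sequence. The sketch below only illustrates that general idea; the epoch, bit widths and names are assumptions, not the design of the Global ID service itself.

    // Sketch of a Snowflake-style 64-bit ID layout. This is NOT the actual Global ID service;
    // it only shows why such IDs do not collide across shards or generator instances.
    public class GlobalIdGenerator {
        private static final long EPOCH = 1514764800000L; // assumed custom epoch (2018-01-01 UTC)
        private final long workerId;                      // unique per generator instance (10 bits)
        private long lastTimestamp = -1L;
        private long sequence = 0L;                       // per-millisecond counter (12 bits)

        public GlobalIdGenerator(long workerId) {
            this.workerId = workerId & 0x3FF;
        }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now == lastTimestamp) {
                sequence = (sequence + 1) & 0xFFF;
                if (sequence == 0) {
                    // Sequence exhausted for this millisecond; spin until the next one.
                    while (now <= lastTimestamp) {
                        now = System.currentTimeMillis();
                    }
                }
            } else {
                sequence = 0; // clock-skew handling omitted for brevity
            }
            lastTimestamp = now;
            // 41 bits of timestamp | 10 bits of worker id | 12 bits of sequence
            return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
        }
    }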

Design Learnings

  • We should keep the application logic simple and insulated from the sharding logic. We should always keep the end state in mind.
  • We need to account for mixed mode. Mixed mode means some machines run the current version of the code (N) while others run the newer version (N+1). This happens while new code is being rolled out to the site.
  • We should not expose internal IDs to clients. In our case, the IDs were being stored in multiple systems, both upstream and downstream. Changing to a 64-bit Global ID made it very hard to identify all the places where these IDs were being persisted.
  • We should not tightly couple the domain data model with downstream enterprise data systems.
  • We should explore a long-term approach rather than short-term workarounds.
  • We should have a detailed and optimized wire-on strategy (a minimal sketch of a wire-on switch follows this list).
  • We should have a way to test infrastructure changes in production before wiring on features. It is important to understand the current topology and to test access to the new pool ahead of time, so that DNS typos or network connectivity issues are caught early.
  • We should make sure that we test load and performance for both the wire-on and wire-off paths.
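
As referenced above, here is a minimal sketch of what a wire-on switch with a slow ramp might look like. The field names, the configuration mechanism and the hash-based ramp are assumptions for illustration, not the tooling actually used for the rollout.

    import java.util.function.Supplier;

    // Hypothetical wire-on switch: names and the hash-based ramp are assumptions.
    public class WireOnSwitch {
        private volatile boolean wiredOn = false; // wire-off by default
        private volatile int rampPercent = 0;     // share of traffic sent down the new path

        public void configure(boolean wiredOn, int rampPercent) {
            this.wiredOn = wiredOn;
            this.rampPercent = rampPercent;
        }

        // Route a request to the sharded (new) path or the legacy path.
        public <T> T execute(String shardKey, Supplier<T> shardedPath, Supplier<T> legacyPath) {
            if (wiredOn && Math.floorMod(shardKey.hashCode(), 100) < rampPercent) {
                return shardedPath.get();
            }
            return legacyPath.get(); // wire-off, or outside the current ramp percentage
        }
    }

Keeping the legacy path callable until the ramp reaches 100% is what makes it practical to test both the wire-on and wire-off paths and to run in mixed mode during rollout.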

Execution

The foundation for execution is understanding the ecosystem, including customers, complexities, dependencies, use cases and the potential impact of failures. Many teams worked together to enable this effort.

  • We had to make the data partition-ready. The work on the application included denormalizing tables, breaking up transactions and joins, preparing the SQL for sharding, creating the global secondary indexes and writing data migration scripts.
  • The work on logical sharding included auto-discovering the shard key, SQL rewriting including shard key injection (a simplified sketch follows this list), routing based on the shard key and providing visibility into violations. The application only sees a single target; the shards are transparent to it.
  • The work on the database included adding the shard key to all tables, data migration, data integrity checks and ensuring all the databases and infrastructure were in place. A significant portion of the work was automating data movement across shards.
  • The work also included ensuring data flows between upstream and downstream systems. This meant coordinating with downstream partners to ensure all the schema changes were completed with no impact to downstream processing.
  • There were incremental releases throughout the two years and a lot of partnership with various teams, including Release Engineering and Site Operations.
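
The sketch below is a simplified illustration of the shard key injection mentioned above. The column name, the plain string handling and the class name are assumptions purely for illustration; a real rewriter parses the SQL and binds the shard key as a parameter.

    // Simplified illustration of shard key injection. A real rewriter parses the SQL instead of
    // using string checks, and the column name account_id is an assumption.
    public class ShardAwareSqlRewriter {
        private static final String SHARD_KEY_COLUMN = "account_id";

        // Ensure the statement carries the shard key predicate so it can be routed to one shard.
        public String injectShardKey(String sql) {
            String lower = sql.toLowerCase();
            if (lower.contains(SHARD_KEY_COLUMN)) {
                return sql; // shard key already present, nothing to inject
            }
            String glue = lower.contains(" where ") ? " AND " : " WHERE ";
            return sql + glue + SHARD_KEY_COLUMN + " = ?"; // caller binds the shard key value
        }
    }

    // Example: injectShardKey("SELECT * FROM payments WHERE payment_id = ?")
    //   -> "SELECT * FROM payments WHERE payment_id = ? AND account_id = ?"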

Execution Learnings

This was an intensive project with a lot of complex moving parts and processes. This section covers the key execution learnings from delivering this critical project successfully.

Sponsorship

Projects of this scale cannot happen without the leadership vision, sponsorship and support.

Core Team

It is important to have a small core team. Architecture huddles help bring closure on design decisions. Decision-making authority should be clear to avoid any ambiguity.

Partners and Collaboration

Strong collaboration with all dependent teams is critical for the success of a project of this scale.

Communication

Consistent communication is key to success. Having a communication rhythm is important, as is clarity on the end state, projecting timelines and constantly course-correcting.

Build on Wins

Momentum and early wins are important. For a large schema consisting of multiple entities, we started by sharding a single entity. It helped us build the confidence to take sharding forward for the rest of the entities.

Failures

Embrace failures and learn from them.

Celebrate Progress

Have a motivated team that believes in and is passionate about the initiative. It is important to celebrate every win, big or small.

Diligence

Invest in Quality and Monitoring so you can release with confidence. Having a risk mitigation plan helps identify potential points of failure.

Customer First

We need to make sure that the customer is not impacted. Single-server testing and a slow ramp ensured that customers were not affected.

Scaling a large-scale legacy system has been a very exciting and collaborative journey with a lot of learnings. This article deconstructs the design of the solution and the steps and support required for the successful execution of one such undertaking at PayPal.

