Avoiding Double Payments in a Distributed Payments System

How we built a generic idempotency framework to achieve eventual consistency and correctness across our payments micro-service architecture.

Jon Chew
Jon Chew
Apr 16, 2019 · 14 min read

Authors: Jon Chew and Ninad Khisti

One of the conference rooms in our San Francisco office

Background

To ensure consistency among all services, protocols such as two-phase commit might be used. Without such a protocol, distributed transactions present challenges to maintaining data integrity, allowing graceful degradation, and achieving consistency. Requests also inevitably fail within distributed systems — connections will drop and timeout at some point, especially for transactions that consist of multiple network requests.

There are three different common techniques used in distributed systems to achieve eventual consistency: read repair, write repair, and asynchronous repair. There are benefits and trade-offs to each approach. Our payments system uses all three in various functions.

Asynchronous repair involves the server being responsible for running data consistency checks, such as table scans, lambda functions, and cron jobs. Additionally, asynchronous notifications from the server to the client are widely used in the payments industry to force consistency on the client side. Asynchronous repair, along with notifications, can be used in-conjunction with read and write repair techniques, offering a second line of defense with trade-offs in solution complexity.

Our solution in this particular post utilizes write repair, where every write call from the client to the server attempts to repair an inconsistent, broken state. Write repair requires clients to be smarter (we’ll expand on this later), and allows them to repeatedly fire the same request and never have to maintain state (aside from retries). Clients can thus request eventual consistency on-demand, giving them control over the user experience. Idempotency is an extremely important property when implementing write repair.

What is idempotency?

This technique is commonly used in billing and payment systems involving money movement — it is crucial a payments request gets processed completely exactly once (also known as “exactly-once delivery”). Importantly, if a single operation to move money is called multiple times, the underlying system should move money at most once. This is critical for Airbnb Payments APIs in order to avoid multiple payouts to the host, and even worse, multiple charges to the guest.

By design, idempotency safely allows multiple identical calls from clients using an auto-retry mechanism for an API to achieve eventual consistency. This technique is common among client-server relationships with idempotency, and something that we use in our distributed systems today.

At a high level, the below diagram illustrates some simple, example scenarios with duplicate requests and ideal idempotent behavior. No matter how many charge requests are made, the guest will always be charged at most once.

An idempotent request is a request that is made with identical parameters and the outcome will always be the same, consistently (the guest is charged at most once).

The Problem Statement

  • Instead of implementing a single, custom solution specific for a given use case, we needed a generic yet configurable idempotency solution to be used across Airbnb’s various Payments SOA services.
  • While SOA-based payment products were being iterated on, we couldn’t compromise on data consistency since this would impact our community directly.
  • We needed ultra low latency, so building a separate, stand-alone idempotency service wouldn’t be sufficient. Most importantly, the service would suffer from the same problems it was originally intended to solve.
  • As Airbnb is scaling its engineering organization using SOA, it would be highly inefficient to have every developer specialize on data integrity and eventual consistency challenges. We wanted to shield product developers from these nuisances to allow them to focus on product development and iterate faster.

Additionally, considerable trade-offs with code readability, testability and ability to troubleshoot were all considered non-starters.

Solution Explained

We implemented and utilized “Orpheus”, a general-purpose idempotency library, across multiple payments services. Orpheus is the legendary Greek mythological hero who was able to orchestrate and charm all living things.

We chose a library as a solution because it offers low latency while still providing clean separation between high-velocity product code and low-velocity system management code. At a high level, it consists of the following simple concepts:

  • An idempotency key is passed into the framework, representing a single idempotent request
  • Tables of idempotency information, always read and written from a sharded master database (for consistency)
  • Database transactions are combined in different parts of the codebase to ensure atomicity, using Java lambdas
  • Error responses are classified as “retryable” or “non-retryable

We’ll detail how a complex, distributed system with idempotency guarantees can become self-healing and eventually consistent. We’ll also walk through some of the trade-offs and additional complexities from our solution that one should be mindful of.

Keep database commits to a minimum

Orpheus is centered around the assumption that almost every standard API request can be separated into three distinct phases: Pre-RPC, RPC, and Post-RPC.

An “RPC”, or Remote Procedural Call(s), is when a client makes a request to a remote server and waits for that server to finish the requested procedure(s) before resuming its process. In the context of payments APIs, we refer to an RPC as a request to a downstream service over a network, which can include external payments processors and acquiring banks. In brief, here is what happens in each phase:

  1. Pre-RPC: Details of the payment request are recorded in the database.
  2. RPC: The request is made live to the external service over network and the response is received. This is a place to do one or more idempotent computations or RPCs (for example, query service for the status of a transaction first if it’s a retry-attempt).
  3. Post-RPC: Details of the response from the external service are recorded in the database, including its successfulness and whether a bad request is retryable or not.

To maintain data integrity, we adhere to two simple ground rules:

  1. No service interaction over networks in Pre and Post-RPC phases
  2. No database interactions in the RPC phases

We essentially want to avoid mixing network communication with database work. We’ve learned the hard way that network calls (RPCs) during the Pre and Post-RPC phases are vulnerable and can result in bad things like rapid connection pool exhaustion and performance degradation. Simply put, network calls are inherently unreliable. Because of this, we wrapped Pre and Post-RPC phases in enclosing database transactions initiated by the library itself.

We also want to call out that a single API request may consist of multiple RPCs. Orpheus does support multi-RPC requests, but in this post we wanted to illustrate our thought process with only the simple single-RPC case.

As shown in the example diagram below, every database commit in each of the Pre-RPC and Post-RPC phases is combined into a single database transaction. This ensures atomicity — entire units of work (here the Pre-RPC and Post-RPC phases) can fail or succeed as a unit consistently. The motive is that the system should fail in a manner it could recover from. For example, if several API requests failed in the middle of a long sequence of database commits, it would be extremely difficult to systematically keep track of where each failure occurred. Note that all network communication, the RPC, are explicitly separated from all database transactions.

Network communication is strictly kept separate from database transactions

A database commit here includes an idempotency library commit and application layer database commits, all combined in the same code block. Without being careful, this could actually begin to look really messy in real code (spaghetti, anyone?). We also felt that it shouldn’t be the responsibility of the product developer to call certain idempotency routines.

Java Lambdas to the Rescue

Here is an example, simplified usage of Orpheus, with Java lambdas in action:

At a deeper level, here is a simplified excerpt from the source code:

We did not implement nested database transactions, but instead combined database instructions from Orpheus and the application into a single database commit, strictly passing Java functional interfaces (lambdas).

The separation of these concerns does offer some trade-offs. Developers must use forethought to ensure code readability and maintainability as other new ones constantly contribute. They also need to consistently evaluate that proper dependencies and data get passed along. API calls are now required to be refactored into three smaller chunks, which could arguably be restrictive on the way developers write code. It might actually be really difficult for some complex API calls to effectively be broken down into a three-step approach. One of our services has implemented a Finite State Machine with every transition as an idempotent step using StatefulJ, where you could safely multiplex idempotent calls in an API call.

Handling Exceptions — To Retry or Not to Retry?

For example, suppose a downstream service was offline temporarily, but the exception raised was mistakenly labeled as “non-retryable” when it really should have been “retryable”. The request would be “failed” indefinitely, and subsequent retry requests would perpetually return the incorrect non-retryable error. Conversely, double payments could occur if an exception was labeled “retryable” when it actually should have been “non-retryable” and requires manual intervention.

In general, we believe unexpected runtime exceptions due to network and infrastructure issues (5XX HTTP statuses) are retryable. We expect these errors to be transient, and we expect that a later retry of the same request may eventually be successful.

We categorize validation errors, such as invalid input and states (for example, you can’t refund a refund), as non-retryable (4XX HTTP statuses) — we expect all subsequent retries of the same request to fail in the same manner. We created a custom, generic exception class that handled these cases, defaulting to “non-retryable”, and for certain other cases, categorized as “retryable”.

It is essential that request payloads for each request remain the same and are never mutated, otherwise it would break the definition of an idempotent request.

Categorizing “retryable” and “non-retryable” exceptions

There are of course more vague edge cases that need to be handled with care, such as handling a NullPointerException appropriately in different contexts. For example, a null value returned from the database due to a connectivity blip is different from an erroneous null field in a request from a client or from a third party response.

Clients Play a Vital Role

  • Pass in a unique idempotency key for every new request; reuse the same idempotency key for retries.
  • Persist these idempotency keys to the database before calling the service (to later use for retries).
  • Properly consume successful responses and subsequently unassign (or nullify) idempotency keys.
  • Ensure mutation of the request payload between retry attempts is not allowed.
  • Carefully devise and configure auto-retry strategies based on business needs (using exponential backoff or randomized wait times (“jitter”) to avoid the thundering herd problem).

How to Choose an Idempotency Key?

For request-level idempotency, a random and unique key should be chosen from the client in order to ensure idempotency for the entire entity collection level. For example, if we wanted to allow multiple, different payments for a reservation booking (such as Pay Less Upfront), we just need to make sure the idempotency keys are different. UUID is a good example format to use for this.

Entity-level idempotency is far more stringent and restrictive than request-level idempotency. Say we want to ensure that a given $10 payment with ID 1234 would only be refunded $5 once, since we can technically make $5 refund requests twice. We would then want to use a deterministic idempotency key based on the entity model to ensure entity-level idempotency. An example format would be “payment-1234-refund”. Every refund request for a unique payment would consequently be idempotent at the entity-level (Payment 1234).

Each API Request Has an Expiring Lease

A lease comes with an expiration to cover the scenario where there are timeouts on the server side. If there is no response, then an API request can be retryable only after the current lease has expired. The application can configure the lease expiration and RPC timeouts according to their needs. A good rule of thumb is to have a higher lease expiration than the RPC timeout.

Orpheus additionally offers a maximum retryable window for an idempotency key to provide a safety net in order to avoid rogue retries from unexpected system behavior.

Recording the Response

Persisting responses does have a performance trade-off — clients are able to receive quick responses on subsequent retries, but this table will have growth proportional to the growth of the application’s throughput. This table can quickly become bloated the table if we’re not careful. One potential solution is to periodically remove rows older than a certain timeframe, but removing an idempotent response too early has negative implications, too. Developers should also be wary not to make backwards-incompatible changes to the response entities and structure.

Avoid Replica Databases — Stick to Master

For example, suppose a payments service stored its idempotency information on a replica database. A client submits a payment request to the service, which ends up being successful downstream, but the client doesn’t receive a response due to a network issue. The response, currently stored on the service’s master database, will eventually be written to the replica. However, in the event of replica lag, the client could correctly fire an idempotent retry to the service and the response would not be recorded to the replica yet. Because the response “does not exist” (on the replica), the service could mistakenly execute the payment again, resulting in duplicate payments. The below example illustrates how just a few seconds of replica lag could cause significant financial impact to Airbnb’s community.

A duplicate payment created as a result from replica lag

Response 1 is not found on the replica in the retry attempt (Request 2)

Duplicate payment avoided by storing idempotency information only on master

Response 1 found immediately on master and returned in the retry attempt (Request 2)

When using a single master database for idempotency, it was pretty clear that scaling would undoubtedly and quickly becomes an issue. We alleviated this by sharding the database by idempotency key. The idempotency keys we use have high cardinality and even distribution, making them effective shard keys.

Final thoughts

However, achieving eventual consistency does not come without introducing some complexity. Clients need to store and handle idempotency keys and implement automated retry-mechanisms. Developers require additional context and must be surgically precise when implementing and troubleshooting Java lambdas. They must be deliberate when handling exceptions. Additionally, as the current version of Orpheus is battle-tested, we are continuously finding things to improve on: request-payload matching for retries, improved support for schema changes and nested migrations, actively restricting database access during RPC phases, and so on.

While these are considerations at top of mind, where has Orpheus gotten Airbnb Payments so far? Since the launch of the framework, we have achieved five nines in consistency for our payments, while our annual payment volume has simultaneously doubled (read this if you’d like to learn more on how we measure data integrity at scale).

If you’re interested in working on the intricacies of a distributed payments platform and helping travelers around the world belong anywhere, the Airbnb Payments team is hiring!

Shoutout to Michel Weksler and Derek Wang for their thought leadership and architectural philosophy on this project!

The Airbnb Tech Blog

Creative engineers and data scientists building a world…

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store