Measuring Transactional Integrity in Airbnb’s Distributed Payment Ecosystem
In a distributed payment ecosystem, it is critical to accurately measure and track a transaction’s end-to-end state and contents to ensure consistency throughout the payment cycle.
By Ninad Khisti and William Betz
In a distributed payment ecosystem, it is critical to accurately measure and track a transaction’s end-to-end state and contents to ensure consistency throughout the payment cycle. Without robust tracking, data leakage and errors can occur, resulting in either lost revenue or increased costs for all parties in the payment cycle, including consumers, merchants, gateways, and acquirers.
Transactional integrity is hard to measure precisely in a distributed payment system. With multiple systems and entities involved in any given transaction, tracking the state of a payment can be tedious, and that state can be hard to obtain in a timely fashion.
With big data tools, Airbnb is tracing the contents of every transaction through various payment states to ensure every piece of the payments cycle lands in a consistent state. The reconciliation process not only produces data and insights that enable the team to track and mitigate unexpected transaction behavior, but also enables the system to “self-heal” certain aberrations when detected.
Background
With rapidly emerging payments technologies, merchants face a continually evolving landscape when it comes to processing payment transactions. The advent of new value-add entities, such as payment gateways, has offered increasingly large benefits to merchants, providing simplification by offering vaulting services and single API integration for payments processing. By integrating with payment gateways, Airbnb has been able to rapidly scale worldwide through various processing entities with minimal changes to our online transaction processing.
That said, integration with any new value-add entity comes at a cost. Each new entity in the payments process adds a layer where a transaction’s integrity can be affected. A breakdown in a transaction’s integrity can create headaches for community members, an increased workload for customer support, and operational overhead.
What Is Transactional Integrity?
Generally, a transaction represents a unit of work performed in an RDBMS (Relational Database Management System) with atomic, consistent, isolated, and durable properties. When it comes to payments, transactional integrity represents a no-surprise, accurate money movement. The accuracy of a financial transaction can be verified by various attributes, such as amount, currency, status, payment method, and even timeliness.
At Airbnb, transactional integrity encompasses both internal and external money movements. This means that we not only expect full consistency of payment attributes between our internal systems, but also across all external partners that the payment touches.
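To make the idea of attribute-level consistency concrete, here is a minimal sketch of comparing one transaction as recorded by two systems. The `TransactionRecord` fields and the `diff_attributes` helper are hypothetical names chosen for illustration, and the sample values are made up. This is not Airbnb's actual data model.

```python
from dataclasses import dataclass, fields
from decimal import Decimal


@dataclass(frozen=True)
class TransactionRecord:
    """Hypothetical normalized view of one payment as seen by one system."""
    transaction_id: str
    amount: Decimal
    currency: str
    status: str
    payment_method: str


def diff_attributes(internal: TransactionRecord, external: TransactionRecord) -> dict:
    """Return {attribute: (internal_value, external_value)} for every mismatch."""
    mismatches = {}
    for field in fields(TransactionRecord):
        ours = getattr(internal, field.name)
        theirs = getattr(external, field.name)
        if ours != theirs:
            mismatches[field.name] = (ours, theirs)
    return mismatches


# Made-up example: the platform believes the payment settled, the processor does not.
platform = TransactionRecord("txn_1", Decimal("120.00"), "USD", "settled", "card")
processor = TransactionRecord("txn_1", Decimal("120.00"), "USD", "pending", "card")
print(diff_attributes(platform, processor))
```

An empty result means the two systems agree on every attribute. Any non-empty result marks the transaction as a candidate for investigation.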
Problem Statement
Airbnb is an accommodation platform connecting guests and hosts to enable travel worldwide. Payments at Airbnb is a key factor in enabling community trust, both on- and off-platform. To maintain this trust, it is crucial that we handle payments between our guest and host communities with the utmost accuracy.
Airbnb is a global brand operating in over 190 countries and 40 currencies around the world, including markets that are hard to reach and not commonly supported in the payments ecosystem. A singular processor relationship to move money globally simply does not exist in today’s world. As a result, Airbnb integrates with a handful of gateways and dozens of processors to achieve global coverage and payments redundancy. This system includes processors with varying degrees of maturity in their systems, different modes of integration (API call, batch and URL redirect), and widely differing transaction settlement periods. For example, Airbnb supports the processing of completely synchronous payment methods, such as major credit/debit card networks, as well as payment methods that can take up to several days to settle, such as Brazilian Boletos.
For a traditional online merchant, money flows into the system as a result of online purchases. More recently however, with the sharing economy on the rise, we have started to witness an increase in two-sided marketplaces. Airbnb is one such example — connecting travelers and hosts worldwide. As a result, Payments at Airbnb handles bi-directional money flow, not only handling payments into our platform, but also all payment outflows to hosts.
A large portion of the money flowing out to our hosts moves through direct integrations with banks via batch processing. Batch transaction processing involves two steps. First, we send a collection of transaction requests to a bank in a compliant format. We then process the response file(s) that the bank sends back, containing responses to the transaction requests. While batch processes are suitable for processing larger transaction volumes, batching by nature leaves our systems out-of-sync until the batch response files are processed. Because this process can take up to several hours to complete, batch processing can prove to be a difficult barrier to maintaining transactional integrity.
Even in the simplest example of an Airbnb transaction, a reservation between guest and host, there are at least two financial transactions associated with it. The first financial transaction occurs when the guest pays Airbnb for the reservation, and the second occurs when Airbnb pays the host within hours of rendering the service. However, travel plans change more often than we think. Additional payment features such as alterations, deposits/installments, group travel, tax withholding, VAT, etc. all dramatically increase the number and complexity of financial transactions associated with a reservation. Additionally, the fact that many reservations on the Airbnb platform are cross-border, involving multiple currencies, further increases the complexity of our money movements.
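The two batch steps, and the window in which systems stay out-of-sync, can be sketched as follows. The CSV layout, the `payout_id` and `status` columns, and the `sent`/`paid` states are assumptions made for this example. Real bank file formats differ and are not described in this post.

```python
import csv
import io


def build_request_file(payouts: list[dict]) -> str:
    """Step 1: serialize payout requests into a CSV batch (format is assumed)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["payout_id", "amount", "currency"])
    writer.writeheader()
    writer.writerows(payouts)
    return buffer.getvalue()


def apply_response_file(ledger: dict[str, str], response_csv: str) -> None:
    """Step 2: update local payout states from the bank's response file."""
    for row in csv.DictReader(io.StringIO(response_csv)):
        ledger[row["payout_id"]] = row["status"]


payouts = [
    {"payout_id": "p1", "amount": "80.00", "currency": "EUR"},
    {"payout_id": "p2", "amount": "45.50", "currency": "EUR"},
]
request_file = build_request_file(payouts)

# Until a response arrives, the true state of every payout in the batch is unknown.
ledger = {p["payout_id"]: "sent" for p in payouts}

# The made-up response file only covers p1, so p2 remains out of sync.
apply_response_file(ledger, "payout_id,status\np1,paid\n")
out_of_sync = [pid for pid, status in ledger.items() if status == "sent"]
print(out_of_sync)
```

Any payout still marked `sent` after the expected response window is exactly the kind of record an integrity check needs to surface.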
Airbnb’s “New” Payment Gateway
Airbnb has seen explosive growth in its marketplace in recent years, with payments being a critical underpinning of the expansion. Until recent years, most of Airbnb’s business and financial transaction logic was performed in a monolithic Rails application.
To improve scalability, Airbnb is making a significant investment in Service Oriented Architecture. As part of this strategy, we set out to build an internal payment gateway to encapsulate all network communication to/from various processors and handle the “burden” of executing money movements for the application. Our new payment gateway is a Java service with a dedicated datastore. This datastore hosts various payment methods and serves as a system of record for financial transactions.
This new service presents two distinct challenges for transactional integrity. First, an additional internal gateway increases the number of hops made during payment execution, and if it behaves inconsistently or fails to process a gateway or processor response, it will create an “out-of-sync” transaction. Second, while we ramp up traffic on the new payment gateway, there will be two transaction stores within Airbnb’s internal systems: a legacy transaction store and a new payment gateway transaction store. We cannot afford any drift in consistency between our two data stores for any significant amount of time. These two challenges warranted additional consideration for transactional integrity.
What Is An “out-of-sync” Transaction?
If any system in the payments processing chain fails to respond, or the next system in the chain fails to properly consume the response to a money movement, the result is an “out-of-sync” transaction. Additionally, incorrect treatment of API responses by any entity in the chain can lead to “out-of-sync” transactions.
Introduction to Solution
Measuring transactional integrity across this maze of distributed systems in a timely fashion, and subsequently detecting and responding to any anomalies is a challenge. Using many systems of varying maturity makes it nearly impossible to instantaneously track a transaction through various states in different systems — platform, payment gateway, payment processor, etc. — at all times.
One potential solution that comes to mind is to use payment gateway APIs for transaction record comparison. The downside to this approach is that it’s harder to scale the comparison between internal transaction stores and the external transaction stores using APIs. In fact, some external entities do not even offer this information via API. Furthermore, a dedicated comparison system may be needed to execute API calls to the entity for transactional analysis to avoid any potential impact on live traffic, a scenario which most web-services cannot tolerate.
Airbnb’s Processor Transaction Reporting System
Almost all processors and gateways offer transaction reports and details to the merchant via secure file transfer protocol (SFTP). These transaction reports are offered as part of settlement to the merchant within an agreed-upon SLA. Typically, merchants reconcile all the money movement on the platform via 3-way reconciliation between platform transaction records, processor settlement files, and bank statements. Airbnb has a dedicated service that imports, extracts, and exports every processor file. In detail, the service:
- Imports daily transaction reports from our payment partners to S3,
- Extracts reports into report-specific staging tables,
- Exports these reports in a consistent format.
Additionally, the service has logic to detect duplicate files and avoid repeated processing of the same file. Many processors/gateways offer these reports in CSV format, and every entity uses its own vocabulary in the reports. These transaction details are then stored in a datastore and, as part of the export, are sent for reconciliation.
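As a rough illustration of duplicate detection and vocabulary normalization, the sketch below fingerprints each file with SHA-256 and maps processor-specific column names onto shared ones. The processor names, column names, and the in-memory `seen_digests` set are all hypothetical. The post does not describe how the real service detects duplicates or stores its state.

```python
import csv
import hashlib
import io

# Hypothetical per-processor column mappings onto a shared vocabulary.
COLUMN_MAPS = {
    "processor_a": {"TxnRef": "transaction_id", "Amt": "amount", "Ccy": "currency"},
    "processor_b": {"reference": "transaction_id", "gross": "amount", "curr": "currency"},
}

# Fingerprints of files already processed (a real service would persist these).
seen_digests: set[str] = set()


def import_report(processor: str, raw_bytes: bytes) -> list[dict]:
    """Normalize a CSV report, or return [] if this exact file was already processed."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    if digest in seen_digests:
        return []
    seen_digests.add(digest)

    mapping = COLUMN_MAPS[processor]
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
    return [{mapping[col]: value for col, value in row.items()} for row in reader]


report = b"TxnRef,Amt,Ccy\nabc123,120.00,USD\n"
first = import_report("processor_a", report)
second = import_report("processor_a", report)  # same bytes again, so it is skipped
print(first, second)
```

Hashing the file contents rather than the file name means a re-uploaded report is skipped even if the partner renames it.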
Airbnb’s system uses a set of scheduled jobs to execute these activities on a dedicated server. While this system does not have access to Airbnb’s platform transactions, we are able to use additional tooling to combine these data sources to trace transactions end-to-end.
Snapshots of every database are available in HDFS at regular intervals. This makes it possible to trace a transaction throughout the ecosystem without leaving our network boundary. It also means that this is an I/O-bound problem. Big data technologies are well suited to large-scale data comparison problems via MapReduce.
Comprehensive Solution
As a result, we decided to approach this problem in a segmented format, dividing all payment activity into four broad categories: payments, refunds, payouts, and returns. Each category has a dedicated Hive pipeline to generate transaction-level reports for all transactions, from all payment entities, both internal and external, within the category. This gives us the ability to understand trends within and across the categories. By computing a moving average of these combined categories, we’re able to produce a single number representing transactional integrity within Airbnb’s payment ecosystem. This number allows us to measure improvements, detect issues, and set an overall goal.
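One way such a single number could be computed is sketched below. The post does not give the exact formula, so this sketch assumes that integrity is the share of consistent transactions across the four categories and that the average is a simple trailing window. The counts and window size are made up.

```python
from collections import deque

CATEGORIES = ("payments", "refunds", "payouts", "returns")


def daily_integrity(counts: dict[str, tuple[int, int]]) -> float:
    """counts maps category -> (consistent, total); returns the combined share."""
    consistent = sum(counts[c][0] for c in CATEGORIES)
    total = sum(counts[c][1] for c in CATEGORIES)
    return consistent / total


def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early entries average over the days available."""
    recent = deque(maxlen=window)
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages


# Made-up counts for one day: (consistent transactions, all transactions).
day = {"payments": (45, 50), "refunds": (15, 20), "payouts": (20, 20), "returns": (10, 10)}
print(daily_integrity(day))

# Made-up daily integrity values smoothed over a two-day window.
trend = moving_average([0.9, 0.8, 1.0], window=2)
print(trend)
```

Smoothing over a window keeps a single noisy day from dominating the headline figure while still letting sustained regressions show up.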
With this approach, we are able to trace every transaction at every stage since day zero (the day when Airbnb’s Payment Gateway started receiving significant traffic) at regular intervals. Once a normalized view of all the transactions within a category is created, transactions with anomalies can be further grouped based on the type of aberration found. The anomaly attribution can then be directly tied to a business use case such as delayed payouts or mismatched transaction attributes. And because transactions with similar anomalies often require similar remediation of data and/or code fixes, these error groupings help us prioritize our efforts for fixes.
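Grouping anomalies to prioritize fixes can be as simple as counting by anomaly type. In this sketch the anomaly labels and the `rank_anomalies` helper are hypothetical, and the comparison results are made up.

```python
from collections import Counter

# Made-up comparison results: (transaction_id, anomaly type, or None if consistent).
results = [
    ("t1", None),
    ("t2", "delayed_payout"),
    ("t3", "amount_mismatch"),
    ("t4", "delayed_payout"),
    ("t5", "status_mismatch"),
    ("t6", "delayed_payout"),
]


def rank_anomalies(results):
    """Count anomalous transactions per type, most frequent first."""
    counts = Counter(kind for _, kind in results if kind is not None)
    return counts.most_common()


for kind, count in rank_anomalies(results):
    print(f"{kind}: {count}")
```

The most frequent group is a natural first target, since one code or data fix may resolve every transaction in it.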
Payment Reconciliation Using Traditional Tools Is Not Sufficient
Transactional integrity initiatives differ from the payment reconciliation process in many aspects. Transactional integrity measurements require comparing each and every transaction from day zero (the hypothetically defined start date for a system), while reconciliation often leaves out already matched transactions. Integrity measurement techniques can also be expanded to any number of systems holding transaction records — this can include much more than one internal/external system.
Often payment reconciliation is done by matching platform and processor transactions alone. This overlooks crucial links where transactional integrity could break down, since it does not offer comparison results for each stage of the process, but only provides an overall picture. Payment reconciliation is also done by matching a set of identifying attributes, as opposed to executing a deep comparison between two models. Many third-party reconciliation technologies we analyzed focused on accounting aspects and provided process management toolkits, but these features don’t contribute to transactional integrity directly.
In addition, payment reconciliation is only done when money exchanges hands. However, there are a few transaction types — such as voiding an authorization — where money never moves, but that still may have an impact on our customers and/or system performance. To maintain a healthy payment system, it’s important to compare all types of transactions as opposed to a subset.
Big Data Toolkit — Hive, Hadoop, HDFS, Airflow And S3
Hive has the ability to compare different transaction models represented by different schemas directly with minimal interpretation and without needing any additional integration. This gives us the ability to quickly iterate on the solution.
Hive offers a SQL-like interface to query the data, but its biggest strength is executing MapReduce jobs effectively at scale. With this, it’s possible to compare each and every transaction throughout the ecosystem on various transaction attributes of interest, such as amount, currency, instrument used, transaction code, and status.
Hadoop also offers scalability, with a MapReduce mechanism that is well suited to comparisons between large and growing datasets. HDFS gives us the ability to snapshot transactional integrity at regular intervals to produce a meaningful trend over time. Amazon S3 offers a cost-effective datastore for archival purposes and works effectively with our big data toolkit.
Airflow, an Airbnb-developed service, offers a scheduling tool to orchestrate data operations, allowing us to execute various steps of intermediate computation and produce a transaction-level report.
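The comparison itself amounts to a full outer join between transaction stores, keyed on a shared transaction identifier. The sketch below shows that logic in plain Python on a few in-memory rows. It is an illustration only, not the Hive queries used in production, and the store contents, the `compare_stores` helper, and the finding labels are made up.

```python
def compare_stores(internal: dict[str, dict], external: dict[str, dict], attributes):
    """Full outer join on transaction ID, with one finding per inconsistent row."""
    findings = {}
    for txn_id in sorted(internal.keys() | external.keys()):
        ours, theirs = internal.get(txn_id), external.get(txn_id)
        if ours is None:
            findings[txn_id] = "missing_internally"
        elif theirs is None:
            findings[txn_id] = "missing_externally"
        else:
            differing = [a for a in attributes if ours.get(a) != theirs.get(a)]
            if differing:
                findings[txn_id] = "mismatch:" + ",".join(differing)
    return findings


internal_store = {
    "t1": {"amount": "10.00", "currency": "USD", "status": "settled"},
    "t2": {"amount": "25.00", "currency": "EUR", "status": "settled"},
    "t3": {"amount": "40.00", "currency": "BRL", "status": "settled"},
}
external_store = {
    "t1": {"amount": "10.00", "currency": "USD", "status": "settled"},
    "t2": {"amount": "25.00", "currency": "EUR", "status": "failed"},
    "t4": {"amount": "15.00", "currency": "USD", "status": "settled"},
}
print(compare_stores(internal_store, external_store, ["amount", "currency", "status"]))
```

A full outer join matters here because a record missing from either side is itself an integrity failure, which an inner join would silently drop.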
Monitoring Transactional Integrity (Druid, Superset, And Automated Reporting)
Dashboards and automated reporting are critical to improve organizational awareness of system issues and to provide tools to measure performance over time. Without clear reporting, it is often difficult and time consuming to identify customer and business impacting events in a timely fashion.
At Airbnb, we are able to take advantage of an OLAP system built on top of Druid, a low-latency, distributed data store, to ingest our transaction pipelines and interactively explore big data in a scalable fashion. Utilizing Superset as a dashboard tool, we are able to display constantly up-to-date measures of health for the entire payment system. With robust payment categorizations built into our reporting tools, we are able to see the progress of various system issues over time, and in certain cases, these trends help us manage a long tail of historical issues by dealing with them in a scalable way.
Anomaly detection is another critical outcome of our transactional integrity projects. Our OLAP system allows us to easily configure anomaly detection algorithms across various dimensional cuts, enabling automatic email/Slack notification of any anomalies shortly after our data pipelines land.
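The post does not say which detection algorithms are configured. As one simple possibility, the sketch below flags a day whose out-of-sync count deviates from recent history by more than a chosen number of standard deviations. The z-score rule, the threshold, and the sample counts are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history makes any change worth flagging.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Made-up daily out-of-sync counts for one dimensional cut (e.g. one processor).
history = [12, 9, 11, 10, 13, 10, 12]
print(is_anomalous(history, 30))  # far above the recent range
print(is_anomalous(history, 12))  # within the recent range
```

Running the same check per processor, category, or currency is what turns a single metric into the dimensional cuts described above.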
Key Wins Of Transactional Integrity Analysis
Transactional integrity analysis has helped us identify and size issues ranging from simple integration bugs to more nuanced edge-cases around eventual consistency. It has also helped us fine tune various system parameters such as socket timeouts, error handling, and retry mechanisms.
With the help of transactional integrity data and auxiliary tools, it is easier to proactively synchronize out-of-sync transactions before they impact our community. This has not only dramatically cut down high-volume, low-complexity support tickets, but it has led to large improvements in payment reconciliation, leading to more accurate financial reporting and streamlined operations.
Deep data analysis has also helped us easily monitor and understand processor issues and misbehavior. And building robust alerting on top of our analytical frameworks has enabled our system to proactively notify us about processor outages in online transaction processing or missing/out-of-SLA transaction reports.
Future Looking
While we have made much progress monitoring and improving our transactional integrity figures to date, there is still much work to be completed in the future to achieve unblemished transactional integrity.
Today at Airbnb, our system is designed to achieve “eventual consistency” through automatic retries of failed transactions. Retries address cases where our system does not hear back from downstream processing entities within the allowed timeframe due to timeouts, transient system issues, loss of network connectivity, outages, etc. This gives our system the ability to get back “in-sync” with our processing partners.
However, safely retrying a payment request requires a strong idempotency guarantee to execute one and only one money movement, which is hard to achieve for every possible use case. With transactional integrity analysis, we have gained insight into how to deal with various edge-case scenarios that break our idempotency guarantee, and we are in the middle of redesigning our framework to achieve near-perfect idempotency.
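The interaction between retries and idempotency can be sketched with a stand-in processor that deduplicates on an idempotency key. `FakeProcessor`, `charge_with_retries`, and the key format are hypothetical and exist only for this example. They do not represent Airbnb's framework or any real processor API.

```python
class FakeProcessor:
    """Stand-in for a downstream processor that deduplicates on an idempotency key."""

    def __init__(self, fail_first_n: int):
        self.fail_first_n = fail_first_n
        self.calls = 0
        self.movements = {}  # idempotency_key -> amount actually moved

    def charge(self, idempotency_key: str, amount: str) -> str:
        self.calls += 1
        # Record the movement at most once per key, even if the reply is lost.
        self.movements.setdefault(idempotency_key, amount)
        if self.calls <= self.fail_first_n:
            raise TimeoutError("response lost")
        return "succeeded"


def charge_with_retries(processor, idempotency_key, amount, max_attempts=3):
    """Retry on timeout, reusing the same key so retries cannot double-charge."""
    for _ in range(max_attempts):
        try:
            return processor.charge(idempotency_key, amount)
        except TimeoutError:
            continue
    return "unknown"  # still out of sync; left for later reconciliation


processor = FakeProcessor(fail_first_n=2)
status = charge_with_retries(processor, "reservation-42-charge", "120.00")
print(status, processor.calls, len(processor.movements))
```

The key point is that the same key is sent on every attempt: the processor saw three requests but recorded one money movement. If the guarantee breaks anywhere in the chain, the same retries produce duplicate charges instead.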
Furthermore, to trace transactions effectively throughout our distributed system, a normalized transaction taxonomy across processors is required. Computation of transactional integrity requires all data to be available in Hive. An event-driven approach can help us address both these problems elegantly, by designing a normalized settlement event schema that each processor record system can use to share its activity. These settlement events can optionally be decorated with gateway information, and can be captured in our data warehouse to allow transactional integrity measurement and analysis. They can also be consumed by our online transaction system, or be streamed into tools that create tickets, offer real-time analysis, and automatically repair underlying transactions to achieve consistency.
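A normalized settlement event might look something like the sketch below. The field names, the `SettlementEvent` class, and the `decorate_with_gateway` helper are assumptions for illustration. The schema described in this post was still a design direction, not a published format.

```python
from dataclasses import asdict, dataclass, replace
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class SettlementEvent:
    """Hypothetical processor-agnostic record of one settled money movement."""
    processor: str
    transaction_id: str
    category: str  # payments, refunds, payouts, or returns
    amount: Decimal
    currency: str
    status: str
    gateway: Optional[str] = None  # optional decoration added by the gateway


def decorate_with_gateway(event: SettlementEvent, gateway: str) -> SettlementEvent:
    """Return a copy of the event carrying gateway information."""
    return replace(event, gateway=gateway)


# Made-up event as a processor record system might emit it.
event = SettlementEvent(
    processor="processor_a",
    transaction_id="txn_1",
    category="payouts",
    amount=Decimal("80.00"),
    currency="EUR",
    status="settled",
)
decorated = decorate_with_gateway(event, "internal_gateway")
print(asdict(decorated))
```

Because every processor emits the same shape, downstream consumers such as a warehouse loader or a ticketing tool need no processor-specific parsing.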
If you’re interested in working on the intricacies of a distributed payment system, or adding an additional “9” to our transactional integrity numbers, which can even require rebuilding certain parts of our payments system, Airbnb is hiring!
Big shout out to Lou Kosak for his thought leadership and prototype. Many thanks to Sam Wyman, Alice Liang, Khaled Hussein, Brian Wey, and Cynthia Adams for their generous contributions.