The AI behind Uplift

Adyen · Jan 17, 2025

Adyen Uplift — visualization of one rich data set.

Following the release of Adyen Uplift to General Availability after a successful pilot, we thought it would be interesting to take an in-depth look at the engineering and science behind Uplift. This post follows the blog post “How Adyen does AI”, in which we explained how Adyen understands AI and the overall progress and direction we are taking, and expands on the role that AI plays in optimizing every transaction flowing through the platform.

AI is the only way forward

Just as Adyen provides end-to-end capabilities under one platform, a single transaction is itself end-to-end: all the steps in the payments flow are interconnected. Each step in the payment process is conditional on the previous one, creating a complex web of dependencies. For instance, the payment experience offered to a customer depends on the predicted risk, which in turn is influenced by the outcome of authentication and by the choice of payment rails, which in turn affects costs; all of this then determines the retry strategy if the payment fails.

To navigate this intricate process, we invested in developing a system that makes globally optimal decisions across the entire payment journey.

At the same time, humans like control and understanding, even if those can lead to a lot of mistakes. To cover this need, systems such as (legacy) RevenueProtect exposed an endless menu of conditions that could be combined into rules (“if this then that”). While the feeling of apparent control is welcome, what really matters is the data: the quality of any system is determined not by how it is designed but by how it performs. The only way to bring these two premises together in harmony, and to scale in both complexity and volume while providing maximum global (not local) performance, is to delegate the decision to the machine. Given the complexity of the patterns, the only way forward is to let the machine learn them from the data (note: it does not see all the data, a teaser for below).

As in any AI application, the role of humans is not to compete with the machine for its decisions or content but to supervise its behavior.

This post details some of the decisions, engineering and science behind Uplift’s AI.

Part 1: Concept

In the past we deployed ML models to locally optimize every step of a transaction in isolation. For example, a fraud model would predict the probability of a chargeback and block a transaction if the predicted likelihood was above a threshold, while a completely unaware authentication model would simultaneously decide which route was best for that user (e.g. an SCA exemption or 3DS1 authentication).

The AI behind Uplift changes that paradigm. It’s designed around the context of a decision-making process and consists of a collection of machine learning models of various natures that share awareness and knowledge. These models are optimized globally through Reinforcement Learning and share the same objective: to balance fraud, cost, and conversion in harmony.
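
To make that shared objective concrete, here is a minimal sketch of what a single global reward balancing fraud, cost, and conversion could look like. The weights, field names, and formulation are hypothetical and only illustrate the idea of scoring every model against one number rather than its own local metric.

```python
from dataclasses import dataclass

@dataclass
class TransactionOutcome:
    authorised: bool        # did the transaction convert?
    charged_back: bool      # did it later turn out to be fraud?
    processing_cost: float  # fees paid on the chosen rails, in EUR
    amount: float           # transaction amount, in EUR

def global_reward(o: TransactionOutcome,
                  fraud_weight: float = 2.0,
                  cost_weight: float = 1.0) -> float:
    """Hypothetical scalar reward shared by all models in the pipeline.

    Conversion earns the transaction amount, fraud is penalised harder than
    it earns, and processing cost is subtracted.
    """
    revenue = o.amount if o.authorised else 0.0
    fraud_penalty = fraud_weight * o.amount if o.charged_back else 0.0
    return revenue - fraud_penalty - cost_weight * o.processing_cost
```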

Part 2: Engineering

Let’s revisit some of the Adyen Platform numbers from 2023:

  • 1 trillion USD processed volume through the platform
  • Growth of 26% year-on-year
  • SLA for transaction processing: 1 second, including acquirer communications, which consume around 600 ms.

Now, let’s consider Adyen’s platform numbers during Black Friday/Cyber Monday 2024 (4 days):

  • Number of transactions: 670 million
  • Peak transaction rate: 163K per minute (2.7K transactions per second)
  • Peak API requests: 25K per second
  • API uptime: 99.9999%

While the numbers are impressive, it becomes even more mind-blowing when we consider that all of these transactions were driven by AI on their journey through the Adyen platform. Every single transaction touched between 2 and 5 different AI endpoints where a machine-learning model made a decision within an allocated latency of 20 ms (median).

This is possible because of a number of finely designed and engineered components.

Note: all critical flows at Adyen run on-premise. This means that all the components mentioned here, including the compute hardware, are designed, engineered, deployed, tested, and operated by Adyen. We rely heavily on open source, as can be seen from the tech stack mentioned. Once the infrastructure is there, end-to-end ownership and control of the full vertical allows us to deploy changes fast, without depending on third parties for reliability.

Feature platform

The AI is connected to a Feature Platform that provides low-latency, high-cardinality, high-volume, multi-geography input vectors to both the training and inference services. For slow, complex features we use distributed compute through Spark, and for fast features we use Apache Flink, storing and serving them from a Cassandra backend deployed in datacenters across the globe to meet locality and latency requirements.

The Feature Platform allows us to serve features with a cardinality in the tens of billions and a serving latency in the single-digit milliseconds. We cover the design and decisions in this talk at Codemotion 2023 and this blog post.

Inference service

After a model is trained, its artifact is registered, stored, and served. The inference service (that we call “Alfred”) allows scientists to create an experiment, select a model, define a baseline, and deploy the model in a traffic split defined by the scientist herself.

Alfred creates an internal API endpoint that connects to the payment flow (recall this is a highly critical flow) and takes responsibility for serving every request with a p50 of 20 ms and a p99 of 100 ms. The artifact size and the serving architecture are fundamental for such an outcome. As a side note, we have become quite versed in stripping any indexing, metadata, or ancillary content from numpy or pandas pickles. Straight to the bones!
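
To make the artifact point concrete, here is a toy sketch of keeping a model artifact as a plain .npy file (a tiny header plus raw bytes) and memory-mapping it at serving time. The path and sizes are invented and this is not Alfred’s actual loading code; it only illustrates the general idea of lean, pickle-free artifacts.

```python
import numpy as np

# Hypothetical artifact: written once when the model is registered...
weights = np.random.rand(1_000_000).astype(np.float32)
np.save("/tmp/model_weights.npy", weights)   # plain .npy: a small header plus raw bytes

# ...and memory-mapped at serving time, so loading copies and unpickles nothing.
served = np.load("/tmp/model_weights.npy", mmap_mode="r")
score = float(served[:128] @ np.ones(128, dtype=np.float32))  # toy use of a slice of weights
```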

The inference service also takes responsibility for model management. We follow a principal-challenger model for every deployment (a deployment is an experiment, see below). Alfred will label one and only one model as “principal” which is the one we deploy in the vast majority of traffic and that provides the strongest performance across all versions. New models rolled-out under experiments are deployed under different stages: (1) “ghost”, where we only log telemetries and statistics but we do not influence outcomes; (2) “challenger”, where we do influence outcomes for a given traffic split — and log telemetries. When a challenger proves stronger performance than the principal model with statistical confidence, we sunset the principal (to “retired”) and we promote the challenger to be the new “principal”.
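
A minimal sketch of that lifecycle is shown below. The stage names mirror the ones above, while the promotion check is a hypothetical placeholder for the experimentation service’s statistical test, not Alfred’s actual interface.

```python
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    GHOST = "ghost"            # logs telemetry only, never influences the outcome
    CHALLENGER = "challenger"  # influences outcomes for its traffic split
    PRINCIPAL = "principal"    # serves the vast majority of traffic
    RETIRED = "retired"        # sunset after losing to a challenger

@dataclass
class Deployment:
    model_id: str
    stage: Stage

def maybe_promote(principal: Deployment, challenger: Deployment, significantly_better) -> None:
    """Promote a challenger once it beats the principal with statistical confidence.

    `significantly_better` stands in for the experimentation service's test.
    """
    if challenger.stage is Stage.CHALLENGER and significantly_better(challenger, principal):
        principal.stage = Stage.RETIRED
        challenger.stage = Stage.PRINCIPAL
```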

Experimentation service

All model deployments run on top of an experimentation service that is used to quantify the performance of each model (principal, ghost, or challenger) with scientific accuracy, framed as an experiment (i.e. a hypothesis to test). We do that through A/B/n testing, comparing the statistics of each model against a unified control group and rolling experiments out gradually to keep the experience under control.

The unified control group is a split of traffic that is consistent across all elements of a transaction and on which models do not act. It is used to trace transactions and thus serves as the comparison baseline for all models. Note that we still act on the control group, because we define the baseline as a “market baseline”, i.e. the payment experience quality that any proficient payment provider would offer. For us, this baseline is blocking cards that we know are stolen, applying SCA (Strong Customer Authentication) exemptions, forcing a network token, and retrying with a challenge if authentication fails. This results in a lower, but more honest, uplift calculation for our system.
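
One common way to keep a control group consistent across all models is to derive the bucket deterministically from a stable transaction attribute. The sketch below is an assumption about how such an assignment could work, not Adyen’s implementation; the control share is an invented number.

```python
import hashlib

CONTROL_SHARE = 0.05  # hypothetical share of traffic reserved for the unified control group

def assign_bucket(shopper_id: str, variants: list[str]) -> str:
    """Deterministically map a shopper to 'control' or one of the experiment variants.

    Because the hash depends only on the shopper, every model in the pipeline
    sees the same shopper in the same bucket, keeping the control group unified.
    """
    digest = hashlib.sha256(shopper_id.encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if u < CONTROL_SHARE:
        return "control"
    idx = int((u - CONTROL_SHARE) / (1 - CONTROL_SHARE) * len(variants))
    return variants[min(idx, len(variants) - 1)]
```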

Part 3: Scale

Entity resolution

One of the biggest datastores (or databases) in Adyen is a graph that links together transaction attributes in order to recognize entities. This graph represents a powerful source of information that we can model to extract features as well as to train on.

Under the strong directives of PCI and GDPR, and guided by our high ethical standards, we can use this information for risk-based and due-diligence decisions. We estimate that we have seen more than 1 billion individuals on earth transacting on the Adyen platform. As the graph contains all transaction attributes, it consists today of more than 100B nodes and 300B edges.

Historically we have stored this data in Postgres databases, which provide a great solution for serving time but fall short on the complexity of the compute (linking logic) as well as on data volume. For this reason, we are migrating the system to a lambda architecture with a hybrid approach: an online flow based on Cassandra that does the linking more efficiently and adds another level of depth to its complexity, and an offline flow that performs the complex calculations and corrects the online datastore every hour, either adding new links or undoing wrong ones. We call this system CELL (Customer Event Linking Logic).
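
At its core, linking transaction attributes into entities is a connected-components problem. A minimal union-find sketch of that idea (far simpler than CELL’s actual linking logic, and with toy attribute values) looks like this:

```python
class EntityLinker:
    """Toy union-find over transaction attributes (card, email, device, ...)."""

    def __init__(self):
        self.parent: dict[str, str] = {}

    def find(self, node: str) -> str:
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def link(self, *attributes: str) -> None:
        """Merge all attributes seen on one transaction into a single entity."""
        roots = [self.find(a) for a in attributes]
        for r in roots[1:]:
            self.parent[r] = roots[0]

linker = EntityLinker()
linker.link("card:A", "email:a@example.com", "device:1")
linker.link("card:A", "device:2")  # the shared card joins both devices into one entity
```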

Compute and storage size

Deploying AI at our scale necessarily means harnessing the data behind it. For this we have invested in the infrastructure and framework to handle such a process. Here are some top-level numbers of Adyen’s data platform today (January 2025):

  • 1500+ nodes
  • 600TB RAM
  • 60,000 CPU cores
  • 70PB storage
  • 96x NVidia A100 Tensor Core GPUs with NVLink Bridge
  • 1000+ DAGs running daily or ad-hoc on the platform (we are heavily reliant on Airflow and Spark).

We continuously invest in modernising the platform and ensuring capacity for the future. Continuously means we have the processes, talent, and culture to run our own cloud and keep it always available and up-to-date.

Regulation, Authentication, and token vaults

Scale is not only about data load, latency, or uptime. It’s also about the responsibilities taken on as more and more customers trust Adyen to process their payments and financial services. This requires us to provide technical solutions that inherently abide by regulations while also answering the scaling needs of our business.

To this end we have been pushing Strong Customer Authentication under the PSDx directives and pushing the boundary of how to be compliant with regulations while bringing products to market that help our merchants boost their conversion rates. Uplift decides which authentication “rail” best balances fraud, conversion, and cost, choosing among several available actions such as an exemption, a version of 3DS, or a passkey.
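
As an illustration, the authentication action space could be represented as below. The enum values mirror the options named above, while the scoring function is a hypothetical stand-in for Uplift’s models, not the actual decision logic.

```python
from enum import Enum, auto

class AuthAction(Enum):
    SCA_EXEMPTION = auto()            # request an exemption under PSD2 SCA
    THREE_DS_FRICTIONLESS = auto()    # 3DS without a shopper challenge
    THREE_DS_CHALLENGE = auto()       # full 3DS challenge flow
    PASSKEY = auto()

def choose_auth_action(transaction, score) -> AuthAction:
    """Pick the action with the best predicted trade-off of fraud, conversion, and cost.

    `score(transaction, action)` stands in for the model's estimate of the
    global reward; it is a hypothetical callable.
    """
    return max(AuthAction, key=lambda action: score(transaction, action))
```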

Similarly we have scaled our token vault and we are hosting north of 2B tokens that safeguard highly sensitive data under PCI-compliant regulation. For instance, the AI in Uplift can choose to tokenize, use a token, or swap a token for a PAN based on global optimization targets.

An AI will only be as effective as the action space it can choose from. Within Uplift we have created all of those actions, bound by regulation, which at scale unleash the performance of the AI.

Part 4: Science

Connected holistic decisioning

One of the most head-scratching problems is connecting a collection of mutually unaware machine learning models to a common goal. The current version of Uplift uses message passing, which provides a simple but efficient way to create awareness that models can condition their estimates on, and thus converge closer to a global optimum.
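
A minimal sketch of that message passing: each downstream model receives the upstream outputs as extra features, so its estimate is conditioned on the rest of the pipeline. The model names and interfaces here are illustrative, not Uplift’s actual components.

```python
def score_transaction(tx: dict, risk_model, auth_model, routing_model) -> dict:
    """Pass each model's output downstream so later decisions condition on earlier ones.

    The three models are hypothetical stand-ins; the point is that none of them
    scores the transaction in isolation.
    """
    risk = risk_model.predict(tx)                               # P(chargeback)
    auth = auth_model.predict({**tx, "risk_score": risk})       # best authentication action
    route = routing_model.predict({**tx, "risk_score": risk,    # cheapest viable rails given
                                   "auth_action": auth})        # the choices made upstream
    return {"risk_score": risk, "auth_action": auth, "route": route}
```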

We have experimented with bigger artifacts and complex deep learning models that can combine multiple decisions, and we found that they would often compromise the engineering requirements (latency, uptime) of online deployments in a critical flow. We do, however, keep investigating this line of thinking, as well as pushing what is possible in engineering terms, and we expect to move the whole pipeline to deep learning architectures in the near future (see below).

We are doing active research in this area: we have funded a full PhD position with UvA’s AMLAB to help us work on this problem from a Reinforcement Learning perspective. We have already shared some work with the community in this direction in this conference talk.

Off-policy Evaluation

Running A/B tests is expensive in terms of time and money. If you field a bad variant it costs money, and reaching significance can take a long time and a lot of traffic. You also have an upper bound on the number of experiments you can run in a year, potentially delaying the discovery of a winning variant and forfeiting revenue. It also carries operational and cognitive load, as well as simply “sacrificing” the traffic split used to test a hypothesis. And in mature product orgs, A/B tests often come back flat or insignificant, resulting in wasted time.

For that purpose, we have done research on Off-Policy Evaluation, which allows us to essentially run A/B tests offline. New variants can be tested instantly with high correlation (+80%) to on-policy estimates (actual A/B tests). This has saved us an estimated 20 weeks per year otherwise wasted on flat A/B tests, and an incremental 9–54 million transactions over a six-month period.

This research has been submitted to RecSys ’25, and a preprint is available on arXiv.
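
For readers unfamiliar with Off-Policy Evaluation, the simplest estimator in this family is inverse propensity scoring (IPS): reweight logged rewards by how much more (or less) likely the new policy would have been to take the logged action. The sketch below shows vanilla IPS, not the estimator from our paper.

```python
import numpy as np

def ips_estimate(rewards: np.ndarray,
                 logging_propensities: np.ndarray,
                 target_propensities: np.ndarray) -> float:
    """Inverse propensity scoring estimate of the target policy's average reward.

    rewards[i]               observed reward of the logged action on transaction i
    logging_propensities[i]  probability the logging policy gave that action
    target_propensities[i]   probability the candidate policy would give the same action
    """
    weights = target_propensities / logging_propensities
    return float(np.mean(weights * rewards))
```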

Counter-factuals and Causal Inference

At our scale, understanding the causes that drive the results of an experiment is more of a need than an interest, as it would allow us to detect systemic sources of entropy that we could avoid in the future.

On the same wavelength: once we apply a decision to a transaction (e.g. block it for fraud likelihood), we no longer have access to the outcome we would have observed had we not acted. In statistical terms, this is called a counterfactual. To learn the full distribution of the traffic you would need to not act on it at all, defeating the point of the system.

Control groups, randomization, and exploration traffic help with that, but diving into disciplines like Causal Inference allows us to better understand the underlying reasons for experiments’ outcomes, as well as to reason about the counterfactuals we never observe.

We’re also investing in research in the field by funding another PhD position with UvA’s AMLAB to work on causal inference for datasets like ours, at our scale. Researchers working on their PhD focus on pushing research boundaries and publishing papers, while also being able to team up with Adyen’s engineering team to implement and test hypotheses. As an example, we are working on a transaction simulator based on Generative AI that would generalize over distributions, counterfactuals, and PII, and would allow us to benchmark algorithms and techniques before running experiments on real traffic splits.

Weak Supervision

Labels are hard for two reasons: quantity and quality. Given our data size, we have enough labels that we can apply balancing techniques such as downsampling and still retain enough data to train our models.

However, we are leaving some predictive power behind, as labels normally arrive late, never arrive, or arrive incomplete. To address this we have been doing research on Weak Supervision. The premise of Weak Supervision is that the modeling process benefits more from a larger quantity of data, even if it is noisy, than from a smaller amount of high-quality data. More precisely, for a fixed quantity of high-quality data, adding noisy data on top can be beneficial. Weak Supervision feeds into “Adyen’s Data Flywheel”, which combines efforts on increasing label quality and quantity with Active Learning (future work, see below). Using Weak Supervision in production we have increased recall by +22%, reduced auth rate loss by -46%, and achieved a +13% issuer refusal rate gain by improving fraud detection efficiency.
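
In practice, weak supervision is often implemented with labeling functions: cheap, noisy heuristics whose votes are combined into a probabilistic label for the unlabeled bulk of the data. The functions and field names below are invented for illustration, not Adyen’s actual rules, and the equal-weight vote is a simplification of how real systems weight functions by estimated accuracy.

```python
import numpy as np

ABSTAIN, LEGIT, FRAUD = -1, 0, 1

def lf_reported_stolen(tx):        # strong but rare signal
    return FRAUD if tx.get("card_reported_stolen") else ABSTAIN

def lf_fast_repeat_attempts(tx):   # weak, noisy heuristic
    return FRAUD if tx.get("attempts_last_minute", 0) > 5 else ABSTAIN

def lf_trusted_returning_shopper(tx):
    return LEGIT if tx.get("successful_payments", 0) > 20 else ABSTAIN

def weak_label(tx, lfs=(lf_reported_stolen, lf_fast_repeat_attempts, lf_trusted_returning_shopper)):
    """Combine the labeling functions' votes into one noisy label (or abstain)."""
    votes = [v for lf in lfs if (v := lf(tx)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return FRAUD if np.mean(votes) >= 0.5 else LEGIT
```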

Non-uniform Random Exploration for Contextual Bandits

When running an experiment in real time with real traffic over an RL system, one faces the dilemma between executing the action that provides the best reward based on the knowledge gathered so far (exploitation) and ensuring that this knowledge is still valid, so that you are not blindly acting on stale truth (exploration).

The simplest technique one can adopt is called epsilon-greedy, where a random traffic split of epsilon percent is allocated to exploration (typically by choosing a random, often suboptimal, action from the available action space). There are several research avenues on choosing the next exploratory action in a way that keeps exploring while maintaining a suboptimal but competitive baseline. We have been researching and deploying platform-wide experiments with several techniques, and we have found significant results with techniques based on Regression Oracles.
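
For reference, the epsilon-greedy baseline mentioned above fits in a few lines; the regression-oracle approaches we deploy replace the uniform random branch with a smarter choice of exploratory action. The reward model here is a hypothetical callable.

```python
import random

def epsilon_greedy(context, actions, estimated_reward, epsilon=0.05):
    """With probability epsilon explore a random action, otherwise exploit the best one.

    `estimated_reward(context, action)` stands in for the bandit's current value model.
    """
    if random.random() < epsilon:
        return random.choice(actions)                                 # exploration traffic
    return max(actions, key=lambda a: estimated_reward(context, a))   # exploitation
```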

This research has been submitted to WWW ’25, and a preprint is available on arXiv: [2412.00569] Contextual Bandits in Payment Processing: Non-uniform Exploration and Supervised Learning at Adyen.

Deep Learning & Ensembling

It is one of our cultural tenets to strive for simplicity and to create solutions to problems, not problems for the solutions we want to use. As such, classical ML algorithms such as boosted trees still provide a strong baseline for the majority of classification and regression problems on structured data, and they have constituted, and sometimes still constitute, our principal baselines. We have run, and are running, experiments (see the experimentation service above) where we deploy complex Heterogeneous Ensembles of Neural Networks for online scoring in the payment flow, and we have not only achieved performance parity with the boosting baselines but surpassed them, with improvements that justify the delta in operational load and complexity.

Note: strictly speaking, Generative AI (i.e. an LLM) is not a technology that would help solve this problem. However, we have taken inspiration from companies like Hyperplane (great product, great team), who have created offline models using larger networks based on the Transformer architecture, removing the need for explicit feature design (next point).

Transformer architectures

We are experimenting with leveraging Unsupervised Pretraining and Transformers to unlock the full scale of our data and incorporate the correct inductive biases in our modeling process.

On top of our efforts to utilize more data, such as weak supervision and active learning, we are also exploring approaches that change the supervised paradigm completely. Inspired by recent breakthroughs in Self-Supervision applied to language modeling, we are applying the same patterns of Unsupervised Pretraining to payments data. By inferring labels from the structure of the data itself, we can achieve human-free supervision and unlock the full potential of our datasets.

Just like sentences are sequences of words, shoppers are sequences of transactions; this is the core data structure of Adyen: shopper transaction sequences. Traditional approaches often ignore this structure and model transactions independently, or approximate it with workarounds like point-in-time shopper aggregations. The Transformer architecture offers an alternative approach to extract the predictive power of these structures.
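
A minimal PyTorch sketch of that idea, purely illustrative and not our production architecture: embed each transaction in a shopper’s history, run a Transformer encoder over the sequence, and attach a head for a self-supervised or downstream objective (here, hypothetically, predicting something about the next transaction). The feature and layer sizes are invented.

```python
import torch
import torch.nn as nn

class ShopperSequenceModel(nn.Module):
    """Toy transformer over a shopper's transaction history (batch, seq_len, n_features)."""

    def __init__(self, n_features: int = 32, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)      # per-transaction embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                # e.g. a score for the next transaction

    def forward(self, transactions: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(transactions))       # contextualised transaction embeddings
        return self.head(h[:, -1])                       # predict from the most recent position

model = ShopperSequenceModel()
history = torch.randn(8, 20, 32)                         # 8 shoppers, 20 transactions each
logits = model(history)                                  # shape (8, 1)
```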

Therefore, Transformers and Self-supervision are enabling us to build a foundational payments model trained on billions of transactions, allowing us to bootstrap any modeling process with unparalleled scale. Through downstream fine-tuning and entity embeddings, we are enhancing shopper insights, improving fraud detection, and opening up new research opportunities like synthetic data generation.

Observability

Once the models are deployed and quantified through an experiment, we constantly run diagnostics to ensure the performance holds. We run classic drift detection (traditionally under the umbrella of MLOps) as well as more complex algorithms to detect business performance drifts and biases, such as combinations of MIST (Multiple Irregular Seasonalities and Trend decomposition) and DTW (Dynamic Time Warping). We covered these aspects at PyData 2024.
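
As one example of the simpler end of such tooling, a Population Stability Index check between a reference window and a recent window of a feature or score is a common drift detector; the MIST and DTW pipeline mentioned above is considerably more involved. The thresholds in the comment are the usual rule of thumb, and the sketch assumes a continuous feature.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference window and a recent window of the same feature.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 investigate.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                   # catch values outside the reference range
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)   # avoid division by, or log of, zero
    return float(np.sum((a - e) * np.log(a / e)))
```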

Fairness and explainability

Delegating decisions to the machine inherently brings challenges around ensuring that these decisions do not come with biases that could segment demographics or be considered unfair from a human perspective.

For this we have established an internal working group of technical and legal experts to stay on top of regulation, including GDPR and the AI Act, and we have made procedural changes to ensure that Adyen stays close to its cultural ethos of being a highly ethical company. All products and models need to be evaluated for biases and screened for ways in which they could violate our handbooks and regulations. Once the working group has approved, we monitor through observability tooling that certain sensitive features do not overfit.

All decisions taken by the AI are meant to be explainable. For this, every inference call is recorded and scored with algorithms that reason about the underlying drivers of the decision (e.g. SHAP values). The UI of Adyen Uplift offers an explanation for every decision, per transaction.
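
For tree-based models, SHAP values are a standard way to attribute a single decision to its input features. A minimal sketch with the open-source shap package on a synthetic toy classifier (not Adyen’s internal tooling or data) looks like this:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy fraud-like classifier on synthetic data, purely for illustration.
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])   # per-feature contribution to this one decision
print(dict(enumerate(shap_values[0])))       # feature index -> contribution to the score
```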

Future lines of work

While this gives a snapshot of the internal workings, challenges, and lessons learnt while building and deploying Adyen Uplift, it’s clear that we have only achieved a part of where we want to go.

We are actively researching in the fields mentioned above plus other avenues that show promising early results such as Agentic Flows, Alignment, Weak Supervision + Active Learning Flywheel, and Identity Representation Learning with Differential Privacy.

In this piece we provide a transparent, factual snapshot of the technology behind Adyen Uplift. We are always looking for exceptional talent to join our exceptional team. If the contents of this post resonate with you, please take a look at our careers page!
