AI-Powered Decision Making: Automating ML Recommendations in a Critical System

Oryan Fuchs
Riskified Tech

--

Automation and machine learning have become buzzwords in the tech industry, but they offer more than just hype. These technologies have the potential to revolutionize the way we work, making tasks faster, more efficient, and more accurate.

Automation can take over repetitive tasks, leaving time for more complex and creative work. Machine learning algorithms can analyze vast amounts of data and make predictions or decisions that would be impossible for humans to make on their own.

As companies grow and take on more clients, they cannot scale without automating these processes.

So automation is great! But what about its challenges? When everything happens quickly and without human interaction, critical mistakes can occur, and in critical systems they can cause great damage in a split second.

In this article, I will discuss our journey to apply ML recommendations automatically in a critical system. I will focus on the challenges we faced and how we approached them.

Our journey to a self-optimizing system

It all started two years ago. Back then, optimization configuration was a manual process: our operations users applied those configurations through our back office application. To understand what those configurations are and how they influence our revenue, you first need to understand how Riskified’s chargeback guarantee solution works.

With Riskified, merchants control which orders are submitted for review. Every submitted order is quickly analyzed by our system, using machine learning models, elastic linking, behavioral analysis, and other fraud detection methods, resulting in an ‘approve’ or ‘decline’ decision. We back our approvals with a guarantee. In other words, in case you incur a fraud-related chargeback for an order we approved, you will be reimbursed — automatically. (*taken from Riskified blog)

Even though Riskified’s decision engine is based on machine learning, there is still a lot of configuration to do in order to optimize the models’ decisions. Since those configurations directly affect chargeback rates, and since Riskified offers a chargeback guarantee solution, they also directly affect our revenue. In simple words, changes in those configurations might lead to more chargebacks, and more chargebacks mean more expenses.

Since those configuration changes are very risky and directly affect company revenue, each change had to be made very carefully. We used a draft mechanism, which included a simulation that tried to predict the influence of a change on the system, and changes could be applied only after they were reviewed.

The problem was that this solution was not scalable.

As Riskified was growing, more configuration changes were required, and we needed to hire and train more people to manage those configurations. It was obvious that in order to scale, we needed to automate those processes.

At that point, we decided to start our journey to automation. We knew it was not going to be easy since we were dealing with risky changes, and every mistake could have a great impact on the company’s revenue, so we decided to divide this journey into phases.

Phase one: half automation

The goal of this phase was to create a half-automated system: the system recommends configuration changes, and Riskified’s operations users review those changes and decide whether or not to apply them.

To calculate those recommendations, we collaborated with the data science (DS) team, who developed an algorithm to produce them. We used Apache Airflow as a workflow orchestrator to trigger their process.

Airflow is an open-source workflow management platform for data engineering pipelines, and it uses directed acyclic graphs (DAGs) to manage workflow orchestration. Airflow has many built-in operators, such as HTTP operators, so we could easily trigger the DS job, and they could trigger us back on success or failure. With Airflow, we could also add retries if a task failed and set a timeout for each task.
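
To make this concrete, here is a minimal sketch of what such a DAG could look like, assuming Airflow 2.x with the HTTP provider installed. The DAG id, connection id, and endpoint are hypothetical; they only illustrate the pattern of triggering the DS job with retries and a timeout.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.http.operators.http import SimpleHttpOperator

    with DAG(
        dag_id="optimization_recommendations",    # hypothetical DAG name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",               # run the process periodically
        catchup=False,
    ) as dag:
        trigger_ds_job = SimpleHttpOperator(
            task_id="trigger_ds_recommendations",
            http_conn_id="ds_service",            # hypothetical Airflow connection
            endpoint="recommendations/run",       # hypothetical DS endpoint
            method="POST",
            retries=3,                            # retry the task if it fails
            retry_delay=timedelta(minutes=5),
            execution_timeout=timedelta(hours=1), # fail the task if it hangs
        )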

This worked well for a while, but as Riskified continued to grow, this solution was not scalable enough, so we decided to take the next step toward a fully automated system.

Phase two: full automation

The challenging requirements of a self-optimizing system

When it comes to system design, one can divide the requirements into two groups:

  • Functional requirements: requirements that the end user specifically demanded
  • Non-functional requirements: requirements that describe how the system works (for example, speed, availability, reliability, etc.)

Usually, the non-functional requirements are the most challenging ones. In our case, the functional requirements were:

  1. The system had to be able to run periodically based on some configuration.
  2. The system had to calculate configuration optimizations, and apply those recommendations.
  3. If something went wrong, we needed to be able to revert quickly and easily.

The non-functional requirements were:

  1. We needed to design a system that would not be used only for this configuration optimization; it would need to lay the foundations for any future automation process.
  2. We needed to create a scalable system that would handle a large number of optimizations.
  3. The system had to be resilient and fault-tolerant:
    - Fault tolerance is the ability of a system to continue operating properly when some of its components fail.
    - Resilience is the ability of the system to recover from such failures and keep functioning.
  4. Monitoring and alerts — we needed to be able to detect wrong optimization recommendations. The challenge here was how to detect that a specific recommendation was wrong, since it might take time until the impact of a wrong configuration shows up in our revenue. We also wanted to be able to roll back in case of an alert.

Designing a full-automation system

We decided to use an event-driven architecture, using Apache Kafka, between all of our services to achieve the following:

  • Persistence — with events, if one service is down, the broker persists the events until the service is back online to receive them. This avoids a single point of failure and improves durability.
  • Scalability — we could run the optimization processes of different customers in parallel, using partitions (see the sketch after this list).
  • Decoupling — services no longer needed to know other services existed in order to work together. If we wanted to switch a service, the other services wouldn’t be affected.
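
As an illustration of the scalability point, here is a minimal sketch using the kafka-python client. The topic name and event fields are hypothetical; the idea is that keying each event by customer id sends all of a customer's events to the same partition, so different customers can be processed in parallel by different consumer instances.

    import json

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_optimization_requested(customer_id: str, run_id: str) -> None:
        # Events with the same key always land on the same partition, which
        # preserves per-customer ordering while allowing parallelism across customers.
        producer.send(
            "optimization-events",   # hypothetical topic name
            key=customer_id,
            value={"type": "OptimizationRequested", "run_id": run_id},
        )
        producer.flush()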

We created two new services — the optimization-orchestrator and the optimization-consumer:

  • The optimization-orchestrator would be responsible for triggering the process by scheduling jobs based on configuration. In addition, it would be responsible for monitoring the process, checking for failed or stuck runs, and retrying the process in case of failure (a sketch of this watchdog logic follows the list).
  • The optimization-consumer would be responsible for the optimization business logic. It would be in charge of receiving the optimization recommendations from DS and applying them in the system.
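
The following sketch shows the kind of watchdog logic the orchestrator could run; the run records, field names, and thresholds are assumptions for illustration. A run that has produced no new events for too long is treated as stuck and retried, up to a retry limit.

    from datetime import datetime, timedelta

    STUCK_THRESHOLD = timedelta(minutes=30)   # assumed threshold
    MAX_RETRIES = 3                           # assumed retry limit

    def find_stuck_runs(runs: list[dict]) -> list[dict]:
        """Return in-progress runs whose last event is older than the threshold."""
        now = datetime.utcnow()
        return [
            run for run in runs
            if run["status"] == "in_progress"
            and now - run["last_event_at"] > STUCK_THRESHOLD
        ]

    def monitor(runs: list[dict], retrigger) -> None:
        for run in find_stuck_runs(runs):
            if run["retries"] < MAX_RETRIES:
                run["retries"] += 1
                retrigger(run["run_id"])      # e.g. republish the trigger event
            else:
                run["status"] = "failed"      # give up and let an alert fire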

We used a monorepo for those services. This way, we could reuse the shared code but still have a separate deployment for each service, which also leaves room to scale by adding more consumers later.

We decided to save the history of each run and all the changes in the system, for two purposes:

  1. In case of a failure, we wanted to be able to understand at which step we failed and what the data was at that step. In addition, the optimization-orchestrator would use this data to recover from a failed or stuck process.
  2. In case of an alert, the optimization service would use this data to automatically revert the system to the state it was in before the automation process ran.

This led us to use event sourcing.

What is event sourcing? It is an alternative way to persist data. In contrast to state-oriented persistence, which keeps only the latest version of the entity state, event sourcing stores each state mutation as a separate record called an “event.”

To do this, we publish an event to Kafka at each step, with all the relevant data. The optimization-orchestrator consumes all the events and stores them in its DB.
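
Here is a minimal sketch of the event-sourcing idea; the event types and payload fields are hypothetical. Each step of a run is appended as an immutable event, and replaying a run's events reconstructs its state, including the previous configuration values needed for a revert.

    from dataclasses import dataclass, field

    @dataclass
    class Event:
        run_id: str
        type: str       # e.g. "RunStarted", "ConfigApplied", "RunCompleted"
        payload: dict

    @dataclass
    class EventStore:
        events: list[Event] = field(default_factory=list)

        def append(self, event: Event) -> None:
            # Events are only ever appended, never updated or deleted.
            self.events.append(event)

        def replay(self, run_id: str) -> dict:
            """Rebuild the current state of a run from its event history."""
            state = {"status": "unknown", "applied": [], "previous": {}}
            for e in (e for e in self.events if e.run_id == run_id):
                if e.type == "RunStarted":
                    state["status"] = "in_progress"
                elif e.type == "ConfigApplied":
                    state["applied"].append(e.payload["key"])
                    # Keep the old value so the change can be reverted later.
                    state["previous"][e.payload["key"]] = e.payload["old_value"]
                elif e.type == "RunCompleted":
                    state["status"] = "done"
            return state

In the design above, the events published to Kafka at each step play this role, and the orchestrator's DB acts as the event store.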

Wrapping up

When you think about the automation of critical systems, you must do it carefully and not just think about the happy flow. Think about the consequences of failures and how to design a fault-tolerant system. Monitoring, reverting, and observability are critical in those kinds of systems.

Rome wasn’t built in a day …
Don’t try to jump directly to the end solution. It is easier to make complex changes gradually, step by step.
