Causal Inference at Onfido

Anna Borodina
Onfido Product and Tech
10 min read · Feb 9, 2023

At Onfido, we use AI to automate digital identity verification for 800+ businesses worldwide. Our proprietary AI has been built over ten years by a dedicated team of hundreds of researchers and engineers to make our analysis fair, fast and accurate. In most cases, verification is performed by analysing a photo of a government-issued document and matching it against the applicant’s facial biometrics. Processing a diverse set of identity documents poses a significant challenge for building a reliable AI engine that accurately catches fraud. Onfido strives to build a better risk engine without compromising on automation rates.

Onfido’s unique AI engine combines machine learning models to automate as many verifications as possible and enable our clients to onboard as many customers as possible without letting fraudulent ones through. Understanding the incremental value of different product initiatives is key to improving our AI.

Automation Structural Causal Model

We needed to create a model that would provide a high-level overview of our document verification product and would allow us to understand what effect on automation we would get by influencing different parts of our business.

We turned to Causal Inference because it aims to answer causal questions and estimate causal effects. Structural Causal Models have two key elements: a graph (a DAG) representing the causal structure and a set of equations (a Structural Equation Model).

We started with creating a graph of the document verification product. From the point a document is submitted for verification, Onfido’s AI needs to first identify the document, extract appropriate fields and features for fraud detection and generate a result. In some cases, the automation may not be able to create a conclusive result, and a human agent may need to finalise the verification process.

The graph has two main flows: automated (Onfido’s AI) in green and manual (human agents) in amber. The product consists of different components, grouped logically by the way they impact the auto-completion rate.

Formally, a causal graph specifies a factorisation of the joint probability distribution of the data:

P(X₁, …, Xₙ) = ∏ᵢ P(Xᵢ | pa(Xᵢ)),

where pa(Xᵢ) denotes the parents of Xᵢ in the graph.

The next step was to formalise the relationships between the variables. Let N be the number of checks submitted for document verification, c the share of checks that pass automatic classification, e the share enabled for automation and a the share that passes the engine. Enter a Structural Equation Model (SEM):

Classified = c · N
EnabledAuto = e · Classified
AutoComplete = a · EnabledAuto = a · e · c · N

We used simple regression models without additive error terms for the modelling. Structural equations provide an alternative representation of the factorisation of probability distributions mentioned earlier. The SEM can be written in terms of expectations, e.g.:

E[AutoComplete] = a · e · c · N
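
The chain of structural equations can be sketched directly in code. A minimal sketch, where the rates passed in are illustrative placeholders rather than Onfido’s real figures:

```python
def expected_counts(N, c, e, a):
    """Expected number of checks surviving each stage of the model."""
    classified = c * N                # checks passing automatic classification
    enabled_auto = e * classified     # checks enabled for automation
    auto_complete = a * enabled_auto  # checks passing extraction and fraud assessment
    return classified, enabled_auto, auto_complete

# With illustrative rates of 0.5 at each stage, half the checks survive each step:
print(expected_counts(100, 0.5, 0.5, 0.5))  # (50.0, 25.0, 12.5)
```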

For this article, we’ll generate a synthetic dataset and use the automation structural causal model to evaluate the impact of a product initiative. We’ll use the pycountry library that provides an ISO dataset of countries. For simplicity, we’ll only use three document types in the example:

import itertools
import pycountry

list_country_names = [country.name for country in pycountry.countries]
document_types = ['ID', 'Passport', 'Driving Licence']
documents = list(itertools.product(list_country_names, document_types))
documents = [' '.join(doc) for doc in documents]

We’ll generate all the variables randomly and obtain a dataset in which one line corresponds to one document (document type and issuing country), with its classification rate, extraction rate, non-fraud rate and whether it’s enabled for automation. We’ll also generate a random number of checks for every document.

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(data=documents, columns=['Document'])
df['Classification rate'] = np.random.beta(3, 1, df.shape[0])
df['Enabled automation'] = np.random.choice(2, df.shape[0], p=[0.1, 0.9])
df['Extraction rate'] = np.random.uniform(0.6, 1, df.shape[0])
df['Non-Fraud rate'] = np.random.uniform(0.7, 1, df.shape[0])
df['Number of checks'] = (np.random.beta(1, 3, df.shape[0]) * np.random.normal(15000, 2000, df.shape[0])).astype(int)
df['Share of checks'] = df['Number of checks'] / df['Number of checks'].sum()

Classification

After an applicant submits a picture of their document, Onfido’s AI will try to classify it: determine the issuing country, type (such as ID, passport, visa, etc.) and other characteristics. Our AI classifies documents in milliseconds and supports over 2,500 documents in 195 countries.

If the classification score is higher than a threshold, the document continues in the automated flow; otherwise, it is classified by an agent. We’ll take a score of 0.2 as the threshold and N equal to 100 in the example. We find that 99% of checks pass the classification step:

N = 100
df_c = df[df['Classification rate']>=0.2]
c = df_c['Share of checks'].sum()
classified = c * N

Enabled automation

To decrease the number of fraudulent applicants who pass our verification process, Onfido doesn’t allow some document types to autocomplete. Non-identity documents, documents without a photo or older documents with fewer security features are potentially high risk, so they are always reviewed by an analyst.

e * c * N checks continue in the automated flow. In the code, we’ll take the checks that passed classification and keep only the ones enabled for automation. 90% of checks pass:

df_e = df_c[df_c['Enabled automation']==1]
e = df_e['Share of checks'].sum()
enabled_auto = e * c * N

Engine

The engine is simplified into two components that are executed simultaneously. Extraction refers to an ensemble of Onfido’s AI algorithms that extract the document’s visual information, data and metadata.

Fraud assessment refers to determining whether the document shows signs of fraud. Our unique micro-model architecture combines over 10,000 machine-learning models trained to detect specific fraud markers. Our models analyse pixel-level variations in document colour, shape, and texture to assess authenticity accurately.

Let a be the share of checks that pass both components.

Auto-complete

Document verifications that passed both auto extraction and auto fraud assessment are auto-complete. For simplicity, we’ll assume that documents that have an extraction rate higher than 0.7 and a non-fraud rate higher than 0.8 pass both components:

df_a = df_e[(df_e['Extraction rate']>=0.7)&(df_e['Non-Fraud rate']>=0.8)]
a = df_a['Share of checks'].sum()
auto_complete = a * e * c * N

39% of all checks are auto-complete.

A Simple Intervention

One of the benefits of causal inference is that it allows us to simulate experiments and evaluate the causal effect knowing the causal connections. It allows us to intervene in the process we’re modelling and control the outcome.

The do-operator is used in causal inference to denote an intervention. Given random variables A and B

  • P(A | B) is the probability of A given B (the distribution under a natural state of B),
  • P(A | do(B = b)) is the probability of A given an intervention that sets B to b.

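To make the distinction concrete, here is a toy sketch on a hypothetical three-document DataFrame (illustrative numbers, not the article’s dataset): intervening means overwriting a variable with a constant and recomputing what depends on it.

```python
import pandas as pd

# Hypothetical toy data: three documents with extraction rates and volumes.
toy = pd.DataFrame({
    'Document': ['A', 'B', 'C'],
    'Extraction rate': [0.65, 0.75, 0.90],
    'Share of checks': [0.25, 0.25, 0.50],
})

# Observational: share of checks passing extraction in the natural state.
p_natural = toy.loc[toy['Extraction rate'] >= 0.7, 'Share of checks'].sum()

# Intervention do(Extraction rate = 0.99): replace the structural
# assignment with a constant and keep everything else unchanged.
toy_do = toy.copy()
toy_do['Extraction rate'] = 0.99
p_do = toy_do.loc[toy_do['Extraction rate'] >= 0.7, 'Share of checks'].sum()

print(p_natural, p_do)  # 0.75 1.0
```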
The do-operator simulates interventions by deleting certain functions from the model, replacing them with a constant and keeping the rest of the model unchanged. Recall the pre-intervention distribution:

P(X₁, …, Xₙ) = ∏ᵢ P(Xᵢ | pa(Xᵢ))

We can simulate an intervention in Extraction by removing the incoming arrow from E to X and manually setting X to a constant value. The post-intervention distribution resulting from the action do(X = x₀) is given by the truncated factorisation: the factor P(X | pa(X)) is deleted and X is fixed to x₀ in the remaining factors,

P(X₁, …, Xₙ | do(X = x₀)) = ∏_{Xᵢ ≠ X} P(Xᵢ | pa(Xᵢ)), evaluated at X = x₀.

The new auto-completion count is

auto_complete₀ = c · e · (a + a₀) · N,

where a₀ is the share of checks that newly pass both Extraction and Fraud Assessment after the intervention. The uplift from this intervention can therefore be estimated as uplift = c · e · a₀ · N.

Example

One of the product initiatives is improving the extraction model’s performance to 99% on the top three documents from the Seven Stans (countries whose names end in -stan). We’ll filter out the documents with a high fraud rate, since improving extraction on them won’t result in increased automation. The top three documents are:

stans = ['Turkmenistan', 'Uzbekistan', 'Tajikistan', 'Kyrgyzstan',
         'Kazakhstan', 'Afghanistan', 'Pakistan']
df_xn = df_e[df_e['Document'].str.contains('|'.join(stans))].copy()
df_xn = df_xn.sort_values(by=['Number of checks'], ascending=False)
# Keep documents that currently fail extraction but have a low fraud rate
df_a0 = df_xn[(df_xn['Extraction rate']<0.7)&(df_xn['Non-Fraud rate']>=0.8)].head(3)

We’ll set the Extraction rate to x₀ = 0.99 and calculate the uplift:

df_a0.loc[:,'Extraction rate'] = 0.99
a0 = df_a0['Share of checks'].sum()
uplift = c * e * a0 * N

The uplift from optimising extraction on the top three documents is 0.6 percentage points.

Extraction Optimisation

Going even further, instead of taking a set of top three documents, we can estimate the uplift of improving extraction to 99% on top one, top two, up to top n documents, where n is the total number of documents from the Stans.

The first document in the subset has the highest volume in checks and the n-th has the lowest. According to the law of diminishing returns, there would be a point where the effort associated with improving extraction on a new document type is no longer worth the corresponding auto-completion rate increase. This point is called a knee point (or an elbow point in the case of convex optimisation).

Kneedle

The Kneedle algorithm was developed to detect knee points in discrete datasets. We’ve used it to find the optimal trade-off point for this example and for other simulations.

This algorithm uses the mathematical definition of curvature for a continuous function as the basis for the knee definition. For any continuous function f, there exists a standard closed form that defines the curvature of f at any point as a function of its first and second derivatives:

K_f(x) = f″(x) / (1 + f′(x)²)^(3/2)
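
As an illustrative numerical check (not from the original article), the closed form can be verified on a curve whose curvature is known, such as a circle of radius r, which has constant curvature 1/r:

```python
import numpy as np

def curvature(f, x, h=1e-5):
    """Curvature of y = f(x): |f''(x)| / (1 + f'(x)**2) ** 1.5,
    with derivatives estimated by central differences."""
    f1 = (f(x + h) - f(x - h)) / (2 * h)          # first derivative
    f2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # second derivative
    return abs(f2) / (1 + f1**2) ** 1.5

# Upper half of a circle of radius 2: curvature is 1/2 at every point.
half_circle = lambda x: np.sqrt(4 - x**2)
print(curvature(half_circle, 0.5))
```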

Kneedle Steps

The algorithm follows these steps:

a) We use a smoothing spline to preserve the shape of the original data set as much as possible. The points of the smooth curve are normalised to the unit square. We calculate the Euclidean distances between the curve and the diagonal line (from first to last data point)

b) The distances are rotated 45 degrees clockwise. The goal is to determine when the difference curve changes from horizontal to sharply decreasing since this indicates the presence of a knee in the original data set.

c) We find the local maxima of the difference curve. For each local maximum, we define a unique threshold value based on the average difference between consecutive x values and a sensitivity parameter, S. The sensitivity parameter allows us to adjust how aggressive we want Kneedle to be when detecting knees: smaller values for S detect knees quicker, while larger values are more conservative. Put simply, S measures how many flat points we expect to see in the unmodified data curve before declaring a knee.

If any difference value drops below the threshold before the next local maximum in the difference curve is reached, Kneedle declares a knee at the x-value of the corresponding local maximum. If the difference values reach a local minimum and start to increase before reaching the threshold, we reset the threshold value to 0 and wait for another local maximum to be reached.
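
The core of steps a) to c) can be sketched for a concave, increasing curve as follows. This is a deliberately stripped-down sketch: it omits the spline smoothing and the sensitivity/threshold machinery that the full algorithm (and the kneed package) implement.

```python
import numpy as np

def find_knee(x, y):
    """Simplified Kneedle: normalise the curve to the unit square,
    build the difference curve (height above the diagonal), and
    return the x where that difference peaks."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_n = (x - x.min()) / (x.max() - x.min())  # normalise x to [0, 1]
    y_n = (y - y.min()) / (y.max() - y.min())  # normalise y to [0, 1]
    diff = y_n - x_n  # distance above the diagonal (the "rotated" curve)
    return x[np.argmax(diff)]

# A curve that rises steeply then flattens: the knee sits at the corner.
x = np.arange(11)
y = np.minimum(x, 3)
print(find_knee(x, y))  # 3.0
```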

Optimising Extraction rate with Kneedle

We’ll use the kneed Python package’s implementation of the algorithm and the automation model’s formula to calculate the uplift from optimising extraction on the top i documents:

upliftᵢ = c · e · aᵢ · N,

where aᵢ is the share of checks belonging to the top i documents.

Plotting the uplift for every document type i, we obtain this graph with the number of document types to optimise as X and automation uplift in percentage points as Y:

import kneed

uplift = []
# Keep documents that currently fail extraction but have a low fraud rate
df_xn = df_xn[(df_xn['Extraction rate']<0.7)&(df_xn['Non-Fraud rate']>=0.8)]
for i in range(1, len(df_xn)+1):
    df_i = df_xn.head(i)
    a_i = df_i['Share of checks'].sum()
    uplift_i = c * e * a_i * N
    uplift.append(uplift_i)

x = np.linspace(1, len(uplift), len(uplift))
y = uplift

kneedle = kneed.KneeLocator(x, y, curve="concave", direction="increasing")
print(kneedle.knee)

The knee point is at top six document types, corresponding to an uplift of about 1.2 percentage points. Improving extraction on all document types from the Stans would result in a 1.4 percentage point increase in automation.

Conclusion

Using causal inference, we’ve created a high-level overview of Onfido’s document verification product. It allowed us to discover the relationships between different parts of the product and to understand their dependencies and limitations.

The model we created allowed us to simulate the potential impact of various product initiatives on automation without running many separate A/B tests on the components. It was used to estimate the automation uplift of implementing additional changes to our models and to prioritise accordingly in 2022.

Resources

[1] Cambridge Advanced Tutorial Lecture Series on Machine Learning

[2] Getting Started with Causal Inference

[3] The Do-Calculus Revisited

[4] Causal Inference

[5] Elements of Causal Inference: Foundations and Learning Algorithms

[6] Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior
