Unveiling the Mechanics of Zero-Shot Predictions: Harnessing Machine Learning Models to Understand LLM’s Decision-making Process

8 min readMay 30, 2023

Introduction

In the contemporary landscape of machine learning, Large Language Models (LLMs) such as ChatGPT are increasingly gaining momentum. They present immense potential for various applications including knowledge bases, intelligent search mechanisms, chatbots, and content creation.

In the machine learning space, a common challenge is the availability of high-quality, representative data for validating hypotheses and assumptions. Collecting such data can be arduous due to several roadblocks such as high costs, limited availability, issues with data cleanliness, potential biases, and concerns about security or privacy.

In my discussions with numerous clients, I have noticed the fear around the utilization of ChatGPT and other LLMs. Most of these fears stem from security and privacy considerations, leading to many organizations choosing to abstain from using these models.

I’ve talked with many of my customers about what is their company’s take on ChatGPT and LLM, most of the companies banned ChatGPT due to security and privacy concerns

Since it’s difficult to connect data with LLM, why don’t we leverage the content creation capabilities to generate synthetic data and use the data as input for machine learning projects?

Before we delve deeper into this proposal, it’s crucial to understand the implications of using LLMs for machine learning projects, particularly in supervised learning scenarios. The predictions made by LLMs usually operate on a ‘zero-shot’ or ‘few-shot’ basis. In simpler terms, the model leverages the knowledge it has been trained on, making assumptions with little or no context.

So in this blog post, you’ll see:

Crafting the right prompt to query a binary decision (prediction) from ChatGPT (GPT 3.5 turbo).
Creating synthetic data and implementing the prompt strategy to garner responses on a large number of observations.
Utilizing machine learning explanation insights (in this guide, we’ll use DataRobot) to understand the rationale behind ChatGPT’s ‘zero-shot’ decisions.

This blog post is the first in a series of multiple experiments using the similar business case for demonstration purposes. Future directions will include how to provide guidance and examples to ChatGPT to correct misclassifications, and align it more closely with business logic and existing processes. Stay tuned as we embark on this exciting journey to harness the potential of LLMs.

Use Case

Before joining DataRobot, I served as the Manager/VP of Global Risk Analytics at HSBC. My main responsibility was to leverage machine learning for optimizing transaction monitoring systems and bolstering our anti-money laundering (AML) framework. To shed light on my work, I will use an example from AML.

Money Laundering?

Money laundering is the illegal process of concealing the origins of money obtained illegally by passing it through a complex sequence of banking transfers or commercial transactions. The overall scheme of this process returns the “clean” money to the launderer in an obscure and indirect way

Why is it a problem?

The United Nation research states that the estimated amount of money laundered globally in one year is 2–5% of global GDP
With the cost of transaction monitoring capabilities increasing and increased regulatory pressure to increase personal liability on failed compliance program failure, many organizations are now looking to disruptive technologies to solve these challenges

Understanding Transaction Monitoring Framework and its Challenges:

In basic terms, the transaction monitoring process begins with data, which includes transaction details and client profiles. Banks then apply business rules in their transaction monitoring system (TMS). The system monitors transaction behavior and generates a batch of potentially suspicious alerts.
These alerts are then manually investigated by a team of experts to determine whether they are genuinely suspicious or merely false alarms. This process can be challenging and time-consuming, and this is where innovative technologies can play a pivotal role in enhancing efficiency and effectiveness.

How can machine learning help?

Machine learning can greatly assist in this process. It can utilize historical transaction data, client information, and past investigation results (whether they were true alerts or false alarms) to predict the risk level of each new alert. This information can then be used to prioritize alerts before they go on to manual investigation. This not only streamlines the process but also ensures that the most potentially damaging issues are addressed promptly.

How can LLM assist?

When data is challenging to collect, or when the Anti-Money Laundering (AML) team is working on creating a new monitoring rule, LLMs can prove to be invaluable tools. They can be used to a) generate synthetic data, and b) predict if an alert is suspicious. This contributes to more effective monitoring rules and assists in efficient handling of alerts, aiding in overall optimization of the AML framework.

Can we trust the assistance of LLM?

We can place trust in LLMs if we utilize machine learning to emulate the decisions made by these models. By examining the insights and understanding the reasoning behind LLM predictions, we can validate and enhance the reliability of their contributions to our transaction monitoring systems.

Experimentation

I. Crafting the right prompt to query a binary decision (prediction) from ChatGPT (GPT 3.5 turbo).

First, I developed a straightforward prompt asking for a 30-day transaction summary for a customer. ChatGPT provided some advice on crafting those features but did not give me any specific numerical data.

I then tweaked the question to be more precise and detailed. The answer I received was excellent and well-structured. This type of narratives is very common in the alert investigation process, usually the investigation experts will leverage this information to make their conclusions.

‘Surprisingly’, ChatGPT was even capable of creating a script for generating a pandas dataframe, incorporating the synthetic information it had produced. Amazing!

When I posed another question, the response was accurate but overly detailed and lacked directness.

To address this, I had to refine my question to be more specific about my requirements.

So far, we’ve successfully employed a sound prompt strategy to generate synthetic data and predictions from LLMs, even without any historical context. This shows the potential of these models in generating useful, actionable information from prompts.

II. Creating synthetic data and implementing the prompt strategy to garner responses on a large number of observations.

In this part, I simplified the problem by considering only 8 dimension — (ACH, Wire) x (Transaction Amount, Count) x (Actual Activity , Expected Activity)

I’ll call this new rule ‘Wire and ACH Deviation from Expected Behavior’

With some text manipulation, I was able to generate prompts like this:

‘Just say yes or no in the answer, nothing else. Here is the question: I’m investigating an case for potential money laundry, and the customer’s 30 day transaction behavior looks like this: Wire amount: $7148.0, Wire count: 5.0 transactions; ACH amount: $15318.0, Wire count: 16.0 transactions; The expected behavior of this customer is : Wire amount: $7486.0, Wire count: 8.0 transactions; ACH amount: $4767.0, Wire count: 4.0 transactions; Based on the actual and expected behavior, is this case suspicious or not?’

Here are two examples of ‘Yes’ and ‘No’ responses

Based on the experiment above, I built a dataset comprising 300 customers. From each row, I generated a prompt and obtained a ‘Yes’ or ‘No’ response from ChatGPT.

Additionally, I calculated the ratio between actual and expected activities, as it is a significant factor I highlighted in the question.

Here is what the dataset looks like:

III. Utilizing machine learning explanation insights (in this guide, we’ll use DataRobot) to understand the rationale behind ChatGPT’s ‘zero-shot’ decisions.

Out of the 300 records analyzed, ChatGPT classified 180 of them as “Yes” (suspicious) and 120 as “No” (false alarm).

Let’s quickly examine the distribution of wire transaction amounts.

There is a clear correlation between the transaction amount and the count for each transaction type, which is expected and logical.

DataRobot developed numerous models and ranked them based on the preferred metric, which in this case is the AUC (Area Under the Curve).

The ROC curve from the top model indicates its ability to effectively differentiate between the “Yes” and “No” responses from ChatGPT.

By utilizing SHAP-based feature impact analysis, it was discovered that while the prompt mainly focuses on comparing actual and expected values (the ‘Ratio’ variables), the actual transaction amount and count (e.g., Txn Cnt ACH) also significantly contribute to the prediction.

Let’s delve into the effects of the top features:

When the count of ACH transactions exceeds 9, higher values significantly increase the probability of a “Yes” prediction.

For the ratio between actual and expected wire transaction amounts, a deviance above 1 indicates higher risk, and there is a substantial increase in risk when this ratio exceeds 2.

In contrast to the previous scenario, when ChatGPT evaluates the deviance between the actual and expected ACH transaction counts, values lower than expected (<0.8) also raise suspicion.

A sudden spike in risk level occurs when the actual wire transaction amount exceeds $8000.

To provide further insight, here are a few SHAP-based prediction explanations that shed light on the factors influencing the predictions made by ChatGPT.

By leveraging machine learning and exploring the inner workings of LLM, we unlock a world of opportunities. We can refine the models, improve their accuracy, and harness their power to drive intelligent decision-making across various industries and domains.

Conclusion

I have additional ideas to explore, but I will save them for a future blog post. In conclusion, the key takeaways from this analysis are:

LLM can make predictions in a zero-shot manner with effective prompting strategies.
Machine learning and prediction insights are crucial for understanding how LLM makes predictions and inferences.
Leveraging LLM to generate synthetic data can greatly expedite the development and evaluation of new ideas, eliminating data availability restrictions and security/privacy concerns.

These insights highlight the potential of LLM in various applications and emphasize the importance of leveraging machine learning techniques for predictive analysis.

Unveiling the Mechanics of Zero-Shot Predictions: Harnessing Machine Learning Models to Understand LLM’s Decision-making Process

Introduction

Use Case

Experimentation

Conclusion

Written by Ray Mi