A Cloud-Based Solution for Real-Time Analytics

pierluigidibari
Data Reply IT | DataTech
Apr 17, 2023

Exploiting streaming data to accelerate Business Intelligence

Real-Time Analytics & Fraud Detection

Organizations are collecting more data, faster than ever before, with no sign of slowing down, and many of them are struggling to turn these piles of data into something they can use to grow their business. Real-Time Analytics (RTA) refers to the process of collecting, analyzing, and interpreting data as it is generated. It enables companies to make more informed decisions by providing them with up-to-the-minute insights into their operations. Just think of tasks such as monitoring and control systems, fraud detection or network security, where the timeliness of any intervention is crucial. According to “The Speed of Business Value” industry report, published by KX and the Centre for Economics and Business Research (CEBR) in 2022, 80% of the companies surveyed saw their revenues increase after adopting RTA, improving efficiency, productivity and customer satisfaction. A wide range of industries can reap the benefits of RTA: as a rule, whenever it is necessary to adapt to frequent changes or to analyze continuous data streams, RTA is the answer.

Fraud detection is precisely one of the areas where RTA is at its best. It takes many forms, but in general fraud detection refers to the process of identifying and preventing fraudulent activities or unauthorized use of resources, in order to protect individuals and organizations. The most affected industries are, of course, insurance and finance. In this kind of activity, timely intervention is crucial, as it can make the difference in defending against financial losses or any other negative consequences of fraudulent activity.

In this article, we propose a cloud-based solution to detect fraudulent credit card transactions on the fly. In particular, we adopt an approach based on user behavior: in other words, each incoming transaction is not classified as fraudulent or legitimate in isolation, but is compared with the most recent transactions of the same user. In this way, we can define the user’s behavior patterns, and the transaction is labeled as fraudulent if it deviates significantly from them. If you want to know more, take a look at our article Fraud Detection Modeling User Behavior.
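To make the intuition concrete, here is a deliberately simplified Python sketch of scoring a new amount against a customer’s recent history. It is only an illustration of the behavioral idea, not the model we actually train later, and all names and numbers are hypothetical.

```python
from statistics import mean, stdev

def behavior_deviation(new_amount: float, recent_amounts: list[float]) -> float:
    """Toy illustration: how far a new amount is from the customer's
    recent spending pattern, expressed as a z-score-like value."""
    if len(recent_amounts) < 2:
        return 0.0  # not enough history to judge
    mu, sigma = mean(recent_amounts), stdev(recent_amounts)
    return abs(new_amount - mu) / sigma if sigma > 0 else 0.0

# A 950 EUR purchase from a customer who usually spends 20-40 EUR stands out:
print(behavior_deviation(950.0, [22.5, 31.0, 18.9, 40.2, 27.4]))
```

The real system builds many such behavioral features (spending, distances, categories, shopping habits) and lets a trained model weigh them, as described below.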

Use Case & Architecture

In the imagined scenario, we have a set of customers and merchants. Each customer has a bank account uniquely identified by a credit card number and is described by personal information such as name, surname, date of birth, residence and so on. For each merchant, on the other hand, we know a unique identifier, a category (e.g. technology, food, home, entertainment…) and a location. Of course, we also know transactions: a single transaction is described by the customer and the merchant involved, the date and time of purchase and the amount spent. A transaction is labeled as fraudulent when it is charged to a customer other than the person who actually made the purchase.

The system we want to build should, above all, be able to detect fraudulent transactions in real-time. Detected frauds should be reported as soon as possible, and all the available information should be easily accessible. The idea is to automatically approve or block a suspicious payment before it is completed. To do this, we will rely on Google Cloud Platform (GCP) and follow Google’s best practices to build a simple and scalable solution.

Our proposed GCP architecture for fraud detection in real-time

The architecture we designed can be summarized as follows. Raw data about customers, merchants and transactions is initially stored on Cloud Storage. We move the data to BigQuery so that it is more easily accessible and processable. In BigQuery, we perform feature engineering and split the available data into training and test sets: the test set consists of the last 30 days’ transactions, while the training set consists of all the others. We exploit BigQuery ML to train a model able to recognize fraudulent transactions. The trained model is uploaded to Vertex AI and exposed through a dedicated endpoint. The test set is initially uploaded to Firestore, which will contain the most recent purchase data, used to enrich transactions at inference time. Pub/Sub is the entry point for the incoming transactions to be classified. Each transaction is then processed through a Dataflow pipeline and given as input to the model. If the transaction is recognized as fraudulent, it is published to a dedicated Pub/Sub topic; otherwise, it is written to Firestore. Regardless of the prediction outcome, the enriched transaction and the related label are written to BigQuery. Finally, all the information produced is readable in a Looker Studio dashboard.
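To make the BigQuery part of this flow concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table and column names are hypothetical, and the enrichment is heavily simplified: the real view also covers 1- and 7-day windows, distances, favorite categories and shopping habits.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Simplified enrichment: per-customer aggregates over the previous 30 days,
# excluding the current transaction to avoid leaking its own value.
client.query("""
CREATE OR REPLACE VIEW `my-project.fraud.transactions_enriched` AS
SELECT
  t.*,
  COUNT(*) OVER (
    PARTITION BY t.cc_num
    ORDER BY UNIX_SECONDS(t.purchase_ts)
    RANGE BETWEEN 2592000 PRECEDING AND 1 PRECEDING  -- previous 30 days
  ) AS tx_count_30d,
  t.amount - AVG(t.amount) OVER (
    PARTITION BY t.cc_num
    ORDER BY UNIX_SECONDS(t.purchase_ts)
    RANGE BETWEEN 2592000 PRECEDING AND 1 PRECEDING
  ) AS amount_deviation_30d
FROM `my-project.fraud.transactions` AS t
""").result()

# Test set = last 30 days of transactions, training set = everything older.
for name, op in [("train", "<"), ("test", ">=")]:
    client.query(f"""
    CREATE OR REPLACE TABLE `my-project.fraud.{name}_set` AS
    SELECT * FROM `my-project.fraud.transactions_enriched`
    WHERE purchase_ts {op} TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    """).result()
```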

Let’s examine in more detail the services mentioned and how they are used in our system:

  • Cloud Storage. As the name suggests, Cloud Storage is the GCP fully managed storage service for unstructured data. It guarantees scalability, consistency and durability while keeping its content safe. We adopted Cloud Storage as a container for our raw batch data about customers, merchants and transactions, but also to store all the various system artifacts.
  • BigQuery. BigQuery is the GCP data warehouse solution, specifically designed for Big Data. It is a serverless service that stores data in columnar format, allowing easy interaction through SQL. Once we moved our data to BigQuery, we defined a view for enriched transactions using SQL queries. In this way, each transaction is no longer described only by the customer and merchant involved, the time of purchase and the amount spent, but contains a lot of additional information useful for modeling customer behavior. For instance, from the last transactions of each customer, we obtain the number of operations made, the average spend, the distance from the merchants, the favorite purchase category, the most frequent shopping weekday and time and so on. This information is extracted from the transactions of the last 1, 7 and 30 days respectively. In addition, to emphasize anomalies, we compute how much the current transaction deviates from these historical values. The resulting view, sketched above, is the basis for our training and test sets. Once the whole system is up and running, BigQuery will also be used to store all the incoming transaction predictions.
  • BigQuery ML. BigQuery ML is a feature of BigQuery that allows you to build and run ML models directly in BigQuery. You can easily start training a model through a simple SQL-based syntax, exploiting the data already stored in BigQuery. In our case, we opted for an XGBoost model, as it is a scalable, fast and flexible solution for classification problems like ours. Moreover, tree models such as XGBoost offer the advantage of providing feature importance, useful for a better understanding of the training data, and explainable predictions. We trained our model with hyperparameter tuning to obtain the best combination of hyperparameters. Since our data is highly skewed towards non-fraudulent transactions, once training was over we inspected the ROC curve to select the best decision threshold. We then evaluated the trained model on the test set as usual and exported it to Cloud Storage (a training sketch appears right after the Pub/Sub item below).
  • Vertex AI. Vertex AI is the GCP unified platform for ML. It provides all the services and tools required for an ML project, from data preparation to model deployment, in a single place. Since our model had already been trained and evaluated in BigQuery ML, we exploited Vertex AI mainly to expose the model to all the other services, making it more easily accessible. In particular, we first imported the model into the Vertex AI Model Registry, the central repository for ML models. Next, we created an endpoint to host the model and serve low-latency online predictions (a deployment sketch also appears after the Pub/Sub item below). As we will see later, Vertex AI can also be used for model versioning and monitoring, useful for automatic skew and drift detection.
  • Firestore. Firestore is a serverless document database provided by GCP, specifically designed to store and synchronize large amounts of data in real-time. In our use case, Firestore stores all known transactions from the last 30 days, which are needed to enrich incoming transactions before classification. The reason why we preferred Firestore to BigQuery lies in its NoSQL nature: as a NoSQL database, Firestore guarantees better scalability and performance in read operations than a traditional relational database, and the way data is organized ensures very low-latency access compared to BigQuery. Firestore arranges data in documents and collections. Each document is identified by a unique ID and describes a particular entity through a set of key-value pairs. A collection is a set of related, conceptually similar documents that do not necessarily share the same structure. In our case, we have a collection of customers: each customer is defined by a document, identified by the credit card number and containing all the other customer details, plus a subcollection of transactions. When an unknown transaction comes in to be classified, given the customer’s credit card number, this setup enables very simple and quick access to all the necessary data. For testing purposes, we initially populated Firestore with the transactions coming from the test set. Another useful Firestore feature is TTL (time-to-live) policies, which automatically delete documents after a certain amount of time has elapsed, freeing up storage space when they are no longer useful. Since incoming transactions are enriched only with data from the last 30 days, we do not need to keep older transactions in Firestore: for each transaction document, we therefore set an expiration date equal to the purchase date plus 30 days (a short sketch of this layout appears right after the Pub/Sub item below). Finally, we compared Firestore with Bigtable, the GCP wide-column database, but given the access patterns required, Firestore proved better suited to our use case and performed much better.
  • Pub/Sub. Pub/Sub is a fully managed messaging service available on GCP, useful for asynchronous communication between different services or applications, and one of the main GCP services for handling streaming data. Pub/Sub lets publishers send messages to a specific topic and then distributes them to the subscribers of that topic. Our system uses Pub/Sub at both the beginning and the end of the data processing pipeline. In particular, the unknown transactions to be classified arrive on a dedicated Pub/Sub topic as messages in JSON format, to which the Dataflow pipeline described below subscribes. After processing, when a transaction is classified as fraudulent with a score greater than a certain threshold, it is published to a second topic dedicated to fraudulent operations. The idea is to allow any subscriber of this topic to take action as soon as fraud is detected (e.g. an e-mail could be sent to the customer concerned, through a Cloud Function triggered by new messages published to the topic).
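As anticipated in the BigQuery ML item above, here is a hedged sketch of what the training and export statements can look like, again issued through the Python client. Model, table and bucket names are hypothetical, the hyperparameter ranges are only examples, and in practice identifier columns would be excluded from the feature set.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Boosted-tree classifier with hyperparameter tuning and class re-weighting
# to compensate for the highly imbalanced fraud label.
client.query("""
CREATE OR REPLACE MODEL `my-project.fraud.xgb_fraud_model`
OPTIONS (
  model_type            = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols      = ['is_fraud'],
  auto_class_weights    = TRUE,
  enable_global_explain = TRUE,
  num_trials            = 20,
  max_tree_depth        = HPARAM_RANGE(4, 10),
  learn_rate            = HPARAM_RANGE(0.01, 0.3)
) AS
SELECT * FROM `my-project.fraud.train_set`
""").result()

# Export the trained model to Cloud Storage so it can be imported in Vertex AI.
client.query("""
EXPORT MODEL `my-project.fraud.xgb_fraud_model`
OPTIONS (URI = 'gs://my-fraud-bucket/models/xgb_fraud_model/')
""").result()
```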
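For the Vertex AI item, a minimal sketch of registering and deploying the exported model with the google-cloud-aiplatform SDK could look as follows. The container image, region and resource names are indicative, and we assume the exported artifact is compatible with a prebuilt XGBoost serving container.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west1")  # hypothetical

# Register the exported model in the Vertex AI Model Registry...
model = aiplatform.Model.upload(
    display_name="fraud-xgb",
    artifact_uri="gs://my-fraud-bucket/models/xgb_fraud_model/",
    serving_container_image_uri=(
        "europe-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-6:latest"
    ),
)

# ...and expose it through an endpoint for low-latency online predictions.
endpoint = model.deploy(
    deployed_model_display_name="fraud-xgb-v1",
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=2,
)

# Later, from the Dataflow pipeline:
# prediction = endpoint.predict(instances=[feature_vector])
```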
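And for the Firestore item, a short sketch of the customer/transaction layout and of the expiration field used by the TTL policy. Collection and field names are hypothetical, purchase_ts is assumed to be a datetime, and the TTL policy itself is enabled separately on the expire_at field (from the console or gcloud).

```python
from datetime import timedelta
from google.cloud import firestore

db = firestore.Client(project="my-project")  # hypothetical project

def store_transaction(customer: dict, txn: dict) -> None:
    """Write a customer document and one transaction into its subcollection."""
    cust_ref = db.collection("customers").document(customer["cc_num"])
    cust_ref.set(customer, merge=True)  # create or update the customer profile

    txn_doc = dict(txn)
    # Transactions older than 30 days are no longer needed for enrichment:
    # the TTL policy on `expire_at` deletes them automatically.
    txn_doc["expire_at"] = txn["purchase_ts"] + timedelta(days=30)
    cust_ref.collection("transactions").document(txn["txn_id"]).set(txn_doc)
```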
Dataflow pipeline
  • Dataflow. Dataflow is the fully managed GCP solution, based on Apache Beam, for building and executing data processing pipelines at scale. It represents the core of our architecture, as it defines the flow of operations to which incoming transactions are subjected, from reading to classification.
    Let’s look in more detail at the individual steps of the pipeline (a simplified Beam sketch follows the numbered steps):
    1. Incoming transactions from Pub/Sub are read and loaded as JSON content, to be accessed correctly.
    2. Given the credit card number of the customer who performed the transaction, additional personal information and the transaction data from the last 30 days are retrieved from Firestore. The retrieved data is then aggregated to enrich the original transaction record. In other words, we replicate the same operations initially performed in BigQuery to enrich the raw data. If the customer is unknown or has no past transactions, all the additional values are set to null.
    3. The enriched transaction record is transformed so that it can be correctly given as input to the model (e.g. categorical values are subjected to label encoding, whereby they are replaced with the respective numerical value).
    4. The transformed record is finally given as input to the model. The output will be the transaction plus the prediction score.
    5. If the transaction is classified as fraudulent with a score above a certain threshold (so we are quite sure that the transaction is fraudulent), it is published to the related topic on Pub/Sub.
    6. If the transaction is classified as not fraudulent with a score above a certain threshold (so we are quite sure that the transaction is not fraudulent), it is added to the customer’s transactions on Firestore. In this way, it helps define customer behavior and can be useful for future predictions. If the customer is unknown, their personal details are also written to Firestore.
    7. Regardless of the classification outcome, the enriched transaction (the output of step 2) plus the prediction score are finally written to BigQuery. This data will be accessible through the Looker Studio dashboard, but could also be useful for future retraining of the model.
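Putting the steps together, a heavily simplified Apache Beam (Python SDK) skeleton of this pipeline is sketched below. Topic, table, field and threshold values are hypothetical, the enrichment is reduced to a single aggregate, the BigQuery table is assumed to already exist, and we assume the deployed model returns one fraud probability per instance.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

FRAUD, LEGIT = "fraud", "legit"

class EnrichFromFirestore(beam.DoFn):
    """Steps 2-3: fetch the customer's recent history from Firestore and
    turn it into the behavioral features expected by the model."""
    def setup(self):
        from google.cloud import firestore
        self.db = firestore.Client()

    def process(self, txn):
        doc = self.db.collection("customers").document(txn["cc_num"])
        # Thanks to the TTL policy, the subcollection only holds ~30 days of data.
        recent = [t.to_dict() for t in doc.collection("transactions").stream()]
        amounts = [t["amount"] for t in recent] or [txn["amount"]]
        txn["avg_amount_30d"] = sum(amounts) / len(amounts)
        txn["amount_deviation_30d"] = txn["amount"] - txn["avg_amount_30d"]
        yield txn

class Classify(beam.DoFn):
    """Steps 4-6: call the Vertex AI endpoint and tag the result."""
    def setup(self):
        from google.cloud import aiplatform
        self.endpoint = aiplatform.Endpoint(
            "projects/my-project/locations/europe-west1/endpoints/1234567890"  # hypothetical
        )

    def process(self, txn):
        features = [[txn["amount"], txn["avg_amount_30d"], txn["amount_deviation_30d"]]]
        score = float(self.endpoint.predict(instances=features).predictions[0])
        txn["fraud_score"] = score
        yield beam.pvalue.TaggedOutput(FRAUD if score >= 0.9 else LEGIT, txn)

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    results = (
        p
        | "ReadTransactions" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/transactions")
        | "ParseJson" >> beam.Map(json.loads)                              # step 1
        | "Enrich" >> beam.ParDo(EnrichFromFirestore())                    # steps 2-3
        | "Classify" >> beam.ParDo(Classify()).with_outputs(FRAUD, LEGIT)  # steps 4-6
    )

    # Step 5: publish likely frauds to the dedicated topic.
    (results[FRAUD]
     | "EncodeFrauds" >> beam.Map(lambda t: json.dumps(t).encode("utf-8"))
     | "PublishFrauds" >> beam.io.WriteToPubSub(
         topic="projects/my-project/topics/fraudulent-transactions"))

    # Step 7: store every scored transaction in BigQuery.
    ((results[FRAUD], results[LEGIT])
     | "MergeAll" >> beam.Flatten()
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
         "my-project:fraud.predictions",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

    # Step 6 (writing legitimate transactions back to Firestore) would be
    # another ParDo on results[LEGIT], omitted here for brevity.
```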
Example of a Looker Studio dashboard
  • Looker Studio. Looker Studio, previously known as Data Studio, is the Google data visualization solution that makes it easy to create rich reports and interactive dashboards directly from data on GCP. We exploited Looker Studio to create a dashboard for exploring the prediction data on BigQuery. The dashboard is designed for business stakeholders and provides helpful insights such as the fraud rate, its trend over time, and the number of fraudulent operations by country and merchant. The dashboard is refreshed every minute to show the latest data within a reasonable time frame.

Possible developments

The core architecture just defined consists of the essential services to perform Real-Time Analytics on GCP. Besides this, we identified several options to further enrich our fraud detection solution:

  • Model monitoring. When we create a new endpoint on Vertex AI, it is possible to activate automatic monitoring of hosted models. Model monitoring makes it possible to detect deviation of input data from training data and prevent potential performance degradation. Two types of deviation can be reported: training-serving skew occurs when the input data distribution diverges from the training data distribution; prediction drift occurs when input data distribution changes considerably over time. In both cases, thresholds can be set to trigger alerts to notify those responsible and take appropriate action.
  • Model explainability. Just as with model monitoring, when creating a new endpoint on Vertex AI, the model explainability option can also be activated (provided the model supports it). In our case, given an input record, besides the prediction result, XGBoost is able to provide a description of the features that most affected the classification. This information can be useful in explaining why a transaction has been classified as fraudulent and could be published to the related Pub/Sub topic, together with the transaction itself.
  • MLOps. MLOps refers to a set of techniques whose goal is to deploy and maintain ML systems in production in a reliable and effective way, implementing and automating continuous integration (CI), continuous delivery (CD) and continuous training (CT). In GCP, it is possible to reach these goals by combining Kubeflow and Cloud Build. Kubeflow is an open-source toolkit for building ML workflows of any complexity, integrated with many GCP services: with Kubeflow you can define, orchestrate and run ML pipelines, covering all steps from data pre-processing to model deployment and monitoring. Cloud Build, on the other hand, is the service that executes builds on GCP: it can import source code from different repositories, build it and produce artifacts, and it is the main service responsible for CI/CD. As a result, it is possible to define an ML pipeline with Kubeflow and push the related code to a repository; when changes in the repository are detected, Cloud Build rebuilds the code and re-runs the ML pipeline to produce an updated model (a minimal pipeline sketch appears after this list). The pipeline could even be triggered when new training data is available: for instance, the input transactions collected could be used to continuously train the model. However, in our use case a human-in-the-loop approach is advisable, as fraudulent and non-fraudulent labels should be validated before training and deploying a new version of the model.
  • Preprocessing pipeline. As we have seen, training and input data are processed in the same way to aggregate information from past transactions: training data is prepared in BigQuery with SQL queries, while input data is enriched with information from Firestore. A possible improvement is to define a single preprocessing pipeline used for both, in order to avoid potential misalignments between training and serving. The same pipeline could then be run both within the Kubeflow ML workflow and within the main Dataflow pipeline for incoming transactions.
  • Data anonymization. Given the nature of the use case, our data contains sensitive information, mainly credit card numbers. Data Loss Prevention (DLP) is a GCP service for finding and de-identifying confidential information: it can automatically recognize a wide range of sensitive data types and supports several kinds of transformation, such as masking or replacement. In our case, the most suitable transformation for hiding credit card numbers is crypto-based tokenization. In particular, deterministic encryption encodes the original data using a cryptographic key; the data can later be recovered by decoding the generated tokens with the same key. Therefore, after generating a suitable key, DLP could be used in our architecture first to locate and de-identify credit card numbers in the raw data, so that the encoded values are propagated consistently through the entire system. Similarly, credit card numbers in incoming transactions can be encoded, so that they can still be used to retrieve the needed information as usual. Of course, the original credit card numbers can be recovered whenever necessary (a short tokenization sketch appears after this list).
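For the MLOps option, here is a minimal Kubeflow Pipelines (KFP v2) sketch of a retraining workflow, with purely illustrative component bodies and names; Cloud Build would then recompile and submit the pipeline whenever the repository changes.

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10", packages_to_install=["google-cloud-bigquery"])
def retrain_model(project: str, train_table: str) -> str:
    """Re-run the BigQuery ML training query on fresh, validated data."""
    from google.cloud import bigquery
    client = bigquery.Client(project=project)
    client.query(
        f"CREATE OR REPLACE MODEL `{project}.fraud.xgb_fraud_model` "
        f"OPTIONS (model_type='BOOSTED_TREE_CLASSIFIER', input_label_cols=['is_fraud']) "
        f"AS SELECT * FROM `{train_table}`"
    ).result()
    return f"{project}.fraud.xgb_fraud_model"

@dsl.pipeline(name="fraud-model-retraining")
def retraining_pipeline(project: str = "my-project",
                        train_table: str = "my-project.fraud.train_set"):
    retrain_model(project=project, train_table=train_table)
    # Further components (export, evaluation gate, Vertex AI deployment)
    # would be chained here.

# Compile to a pipeline spec that Cloud Build can submit to Vertex AI Pipelines.
compiler.Compiler().compile(retraining_pipeline, "fraud_retraining_pipeline.json")
```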
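For the data anonymization option, this is a hedged sketch of how DLP deterministic encryption can tokenize credit card numbers. Project, key and surrogate names are placeholders; in practice the key would be wrapped with Cloud KMS rather than passed raw, and tokens can be reversed later with the matching re-identification call.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # hypothetical project

# Placeholder 32-byte AES key; use a KMS-wrapped key in a real setup.
raw_key = b"0123456789abcdef0123456789abcdef"

def tokenize_card_numbers(text: str) -> str:
    """Replace credit card numbers in `text` with deterministic, reversible tokens."""
    response = client.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": {"info_types": [{"name": "CREDIT_CARD_NUMBER"}]},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [{
                        "primitive_transformation": {
                            "crypto_deterministic_config": {
                                "crypto_key": {"unwrapped": {"key": raw_key}},
                                "surrogate_info_type": {"name": "CC_TOKEN"},
                            }
                        }
                    }]
                }
            },
            "item": {"value": text},
        }
    )
    return response.item.value

print(tokenize_card_numbers("Card 4242-4242-4242-4242, amount 950 EUR"))
```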

Conclusions

Real-Time Analytics is increasingly becoming a must for all those companies that deal with continuous streams of data and want to gain valuable insights from them. In this article, we have seen a possible cloud-based solution for Real-Time Analytics, leveraging GCP. Although we explored the specific use case of fraud detection, the proposed architecture consists of the main services and resources for streaming analytics, so it can easily be adapted to many other case studies (e.g. monitoring market trends, inventory management for large retailers, product tracking in logistics, live maintenance of machinery, and so on). Combining the benefits of Real-Time Analytics with those of the cloud is now a smart investment for any organization that wishes to embrace a more data-driven approach and exploit its full potential.
