A Day in the Life of a Modern Financial Surveillance Data Science Team

John Thuma · Published in DataSeries · Mar 22, 2019 · 8 min read

Banking surveillance is often fragmented: distinct surveillance teams monitor customer interactions, electronic communications, market activity, voice recordings, building sensors, video feeds, and social media within silos. This limits the ability to spot sophisticated risk activities that cross multiple locations, lines of business, and functional areas.

Our earlier posts describe how cross-functional trade surveillance will help you move from a fragmented surveillance program to a holistic one that enables analysts to visually correlate data across all surveillance platforms and spot risk patterns that would not have been identified in isolation. This concept is represented in the diagram below.

A holistic surveillance program will optimize your response to regulatory inquiries while demonstrating that you have a robust and defensible process. Real-time streaming data is a critical piece of that, and this post describes how it would be applied, along with some of the technologies organizations can use to analyze and perform forensics on data that resides across many systems.

The Use Case: Circular Trading

Circular trading is a fraudulent trading scheme in which a closed group of entities trade among themselves to falsely create the appearance of liquidity or volume while there is no genuine change in ownership of the security.

The adjacent correlation flow diagram illustrates a closed group of eight individual parties (each shown as a vertical block). Although the security is bought and sold among five of those parties, the pattern is circular because common ownership can be traced back to one related party.
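
To make the pattern concrete, here is a minimal sketch of how cycle detection over a directed "who sold to whom" graph can surface this kind of ring. The entity names, the trades, and the use of the networkx library are illustrative assumptions, not part of the original example.

```python
# Minimal sketch: detect circular trading patterns by looking for cycles in a
# directed graph of who sold to whom. Entity names and trades are illustrative.
import networkx as nx

trades = [
    ("Entity_A", "Entity_B", {"symbol": "XYZ", "qty": 5000}),
    ("Entity_B", "Entity_C", {"symbol": "XYZ", "qty": 5100}),
    ("Entity_C", "Entity_D", {"symbol": "XYZ", "qty": 4900}),
    ("Entity_D", "Entity_E", {"symbol": "XYZ", "qty": 5000}),
    ("Entity_E", "Entity_A", {"symbol": "XYZ", "qty": 5050}),  # closes the loop
]

g = nx.DiGraph()
g.add_edges_from(trades)

# Every simple cycle is a candidate circular-trading ring worth investigating.
for cycle in nx.simple_cycles(g):
    if len(cycle) >= 3:
        print("Possible circular trading ring:", " -> ".join(cycle))
```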

The diagram below provides examples of events that could tip off a circular trading scheme. The scenario is by no means sophisticated; we kept it simple on purpose so that each event can be aligned with the respective surveillance platform.

Orchestration of Trade Surveillance in a Big Data Streaming World

As illustrated above, trade surveillance is not just about trade data. Many types of data are represented: structured, unstructured, historical, and real-time. The challenge is connecting these disparate types so that a story emerges that helps describe the intent behind the events. The following describes how this would be orchestrated.

The first thing you want to do is move all of this data into a data lake and perform fingerprint analysis and anomaly detection. Fingerprint analysis is a bulk-data technique for finding patterns: a simple binary classification that leverages supervised machine learning or a trained predictive model. We can use this fingerprint to classify a series of questionable transactions. In Event #1 above, a high volume of trades was identified that are similar in size and are for the same security. On its own, that doesn't sound illegal at all. This is where further forensics come into play (patterns that tip off an investigation).
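
As a rough illustration of the fingerprint idea, the sketch below trains a supervised binary classifier on hypothetical, investigator-labeled trade bursts and then scores an Event #1 style burst. The features, the training data, and the choice of scikit-learn's RandomForestClassifier are all assumptions made for illustration.

```python
# Minimal sketch of "fingerprint" classification: a supervised binary model
# trained on historical trade bursts previously labeled by investigators.
# Feature names and training data are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Features per trade burst: [trades_per_hour, size_stddev, pct_same_security]
X_train = np.array([
    [1200, 0.02, 0.98],   # labeled suspicious (1)
    [1500, 0.01, 0.99],   # labeled suspicious (1)
    [40,   0.85, 0.10],   # labeled normal (0)
    [55,   0.60, 0.15],   # labeled normal (0)
])
y_train = np.array([1, 1, 0, 0])

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Event #1: high volume of similar-sized trades in the same security.
event_1 = np.array([[1350, 0.03, 0.97]])
print("Flag for investigation:", bool(model.predict(event_1)[0]))
```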

In the example above, all we know is that there is liquidity, which is a good thing, not a bad one. However, tying those trades to entities/clients that are related (Event #2) adds crucial information. It is the two signals together, not the individual facts, that raise the flag. These are surveillance leads that humans could not produce on their own given the large volumes of data, hence the need for the advanced analytics described above.
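
A minimal sketch of that linkage step, assuming a hypothetical related-parties reference table: join the flagged trades to entity ownership data and keep only the trades where buyer and seller trace back to the same beneficial owner.

```python
# Minimal sketch of Event #2: join flagged trades to a related-parties reference
# table to see whether counterparties share a common beneficial owner.
# Table contents and column names are hypothetical.
import pandas as pd

flagged_trades = pd.DataFrame({
    "trade_id": ["t1", "t2", "t3"],
    "buyer":    ["Entity_A", "Entity_B", "Entity_C"],
    "seller":   ["Entity_B", "Entity_C", "Entity_A"],
    "symbol":   ["XYZ", "XYZ", "XYZ"],
})

related_parties = pd.DataFrame({
    "entity":           ["Entity_A", "Entity_B", "Entity_C"],
    "beneficial_owner": ["Owner_1", "Owner_1", "Owner_1"],
})

linked = (flagged_trades
          .merge(related_parties, left_on="buyer", right_on="entity")
          .merge(related_parties, left_on="seller", right_on="entity",
                 suffixes=("_buyer", "_seller")))

# Trades where buyer and seller trace back to the same owner are the real flag.
suspicious = linked[linked["beneficial_owner_buyer"] == linked["beneficial_owner_seller"]]
print(suspicious[["trade_id", "buyer", "seller", "beneficial_owner_buyer"]])
```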

Once we identify the transactions, we then need to understand the intention behind those trades. Are the parties intending to be fraudulent, or is it just coincidence? We can look at emails, texts, chat logs, call transcripts, social media, and other unstructured data sources to help infer intent (Events #3, 4, 5). There are a variety of very powerful text analytics and enterprise search capabilities that not only identify the inherent meaning of what has been written or said but also establish the peer-to-peer network, so that we can understand who is communicating with whom and what they are communicating about.
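
The sketch below illustrates the peer-to-peer network idea in miniature: build a who-communicates-with-whom graph from message metadata and screen the text for mentions of the flagged security. The messages, names, and keyword list are hypothetical, and a real deployment would use full text analytics and enterprise search rather than simple keyword matching.

```python
# Minimal sketch: build a who-communicates-with-whom graph from message metadata
# and screen message text for mentions of the security under investigation.
# Messages, names, and keywords are hypothetical illustrations.
import networkx as nx

messages = [
    {"sender": "trader_1", "recipient": "exec_1",
     "text": "Keep rotating XYZ between the accounts, volume needs to look real."},
    {"sender": "trader_2", "recipient": "trader_1",
     "text": "Lunch on Friday?"},
]

keywords = {"xyz", "volume", "accounts"}

comms = nx.DiGraph()
for m in messages:
    tokens = set(m["text"].lower().replace(",", " ").replace(".", " ").split())
    comms.add_edge(m["sender"], m["recipient"], hits=sorted(keywords & tokens))

# Edges with keyword hits show who is talking to whom about the flagged security.
for sender, recipient, data in comms.edges(data=True):
    if data["hits"]:
        print(f"{sender} -> {recipient}: mentions {data['hits']}")
```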

Regulators Implicitly Require Streaming Analytics

Much of the attention paid to real-time surveillance in financial services is about catching creative individuals who exploit high-frequency trading and the complexities of a global trading network for illicit purposes. The use case described above doesn't involve high-frequency trading, but streaming analytics is still necessary.

Regulators are now demanding that firms take a proactive and holistic approach to preventing financial crime, not just uncovering it after the fact, so you want to be able to initiate an investigation as soon as possible, not months later. The events we described above don't happen on a scheduled basis. They can happen at any time, so streaming capabilities are required to capture and store an event whenever it happens.

Relevant Technologies

The use case above requires a solution that enables analysts to correlate data across five different surveillance platforms and spot patterns that can’t be flagged in isolation. Below we describe some of the technologies that would support a holistic solution.

Data Lake

Many organizations move the type of data and analytics described above into a data lake, where the data can be examined in bulk very rapidly. Prior to the data lake, it was extremely difficult to capture data that did not fit nicely into the rows and columns of a relational database. As a result, surveillance focused on transactional data, and as we see from the use case above, transaction data is just one piece of a larger story. Organizations were not able to capitalize on human behavioral data (i.e., intent) that was locked in images, emails, audio, and video. Roughly ten years ago, the concept of a data lake enabled organizations to affordably store data and use it later. With Apache Hadoop and cloud storage technologies from AWS and Azure, the data lake can help surveillance teams exploit both transactions and interactions.
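
As a small illustration of the "store now, analyze later" pattern, the sketch below lands raw surveillance artifacts in S3-based object storage exactly as they arrive. The bucket name, prefixes, and file paths are hypothetical.

```python
# Minimal sketch: land raw surveillance artifacts (emails, call recordings,
# chat exports) in an S3-based data lake as-is, to be analyzed later.
# Bucket name, prefixes, and file paths are hypothetical.
import boto3

s3 = boto3.client("s3")
landing_bucket = "surveillance-data-lake"

artifacts = [
    ("exports/emails_2019-03-22.mbox", "raw/email/2019/03/22/emails.mbox"),
    ("exports/calls_2019-03-22.wav",   "raw/voice/2019/03/22/calls.wav"),
    ("exports/chat_2019-03-22.json",   "raw/chat/2019/03/22/chat.json"),
]

for local_path, key in artifacts:
    # Store the data unmodified; schema and structure are applied at read time.
    s3.upload_file(local_path, landing_bucket, key)
    print(f"landed s3://{landing_bucket}/{key}")
```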

Modern big data environments such as data lakes process more varieties and higher volumes of data at a significantly lower cost than traditional data platforms, but other capabilities are needed to allow organizations to exploit data in real time. They also need to wrangle petabytes of documents, contracts, images, and various types of unstructured data for indexing and high-speed search. In the example above, multiple channels are being used to communicate, each of which creates huge amounts of living data (it's always growing).

Apache Solr

Apache Solr is an enterprise search platform that is optimized for interactive results. Anomaly detection and social network analysis are common functions built on it and relate directly to our use case above. In Event #4, the trader made an international phone call to an executive. For that trader, speaking with executives is normal; the anomaly was that the call was international.

In Event #5, a relationship was spotted between the trader and the executive, surfacing as a photo on social media. That information would have been uncovered by deep learning algorithms that can process images from social media to see where people are, when they were there, and who they are with. The output from these algorithms is large and complex, so the ability to search it and link relationships is key.
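
Here is a minimal sketch of the kind of query an analyst might run against an indexed communications collection, using the pysolr client. The collection name, field names, and host are made up for illustration: find international voice calls from traders to executives within a date range, per Event #4.

```python
# Minimal sketch: query a Solr collection of communication records for the
# Event #4 anomaly -- an international call from a trader to an executive.
# The collection name, field names, and host are hypothetical.
import pysolr

solr = pysolr.Solr("http://solr.example.com:8983/solr/comms", timeout=10)

results = solr.search(
    "channel:voice AND caller_role:trader AND callee_role:executive",
    fq=["international:true", "call_date:[2019-03-01T00:00:00Z TO NOW]"],
    sort="call_date desc",
    rows=20,
)

for doc in results:
    print(doc.get("caller"), "->", doc.get("callee"), doc.get("call_date"))
```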

Apache Kafka/Real-Time Streaming

Apache Kafka is an open-source stream-processing platform that is capable of handling trillions of events a day and can also store that data. As noted earlier, the events we described don't happen on a scheduled basis; they can happen at any time, so streaming capabilities are required to capture and store each event whenever it happens. The value of real time is that we do not have to wait; we can triage behaviors as they happen. Most organizations wait for batches of transactions or behaviors that could be hours or days old. With real time, we can evaluate and alert on potential activities almost immediately.
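
A minimal consumer sketch using the kafka-python client, assuming a hypothetical topic, broker list, and triage rule: read events as they arrive and route anything touching the flagged security to an investigation queue instead of waiting for a batch.

```python
# Minimal sketch: consume trade and communication events from a Kafka topic as
# they arrive and triage them immediately instead of waiting for a batch.
# Topic name, brokers, and the triage rule are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "surveillance-events",
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="latest",
    group_id="trade-surveillance",
)

for record in consumer:
    event = record.value
    # Triage as events happen: route anything touching the flagged security
    # to the investigation queue rather than waiting for an overnight batch.
    if event.get("type") == "trade" and event.get("symbol") == "XYZ":
        print("Alert: potential circular-trading activity", event)
```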

Apache Kudu

Kudu is an open-source storage engine for structured data that supports low-latency random access together with efficient analytical performance. It is well suited for time series use cases such as market data streaming, fraud detection/prevention, and risk monitoring.

In the use case above, data is constantly changing and being updated. As an investigation proceeds, you want to be aware of any late-arriving data. Kudu continuously appends the latest data points to the historical set. Because of this, you can see trends emerge in the data and run analytics as they are happening. In conjunction with Apache Solr and Kafka, you get something even more powerful than real-time visualizations: the ability to visualize and understand the historical relevance of current events as they are happening, all in one application.
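
The sketch below shows what that pattern might look like with the Apache Kudu Python client, assuming a hypothetical trade_events table and master address: upsert the latest (or late-arriving) data point, then scan only the most recent time window for analysis.

```python
# Minimal sketch using the Apache Kudu Python client: continuously upsert the
# latest trade events and scan a recent time window for analysis. The table
# name, schema, and master address are hypothetical.
from datetime import datetime, timedelta
import kudu

client = kudu.connect(host="kudu-master.example.com", port=7051)
table = client.table("trade_events")          # assumed to already exist
session = client.new_session()

# Late-arriving or corrected data is handled with an upsert on the same key.
op = table.new_upsert({
    "trade_id": "t-1027",
    "event_time": datetime.utcnow(),
    "symbol": "XYZ",
    "quantity": 5000,
})
session.apply(op)
session.flush()

# Scan only the most recent window so analytics run against fresh data.
scanner = table.scanner()
scanner.add_predicate(table["event_time"] >= datetime.utcnow() - timedelta(hours=1))
scanner.open()
for row in scanner.read_all_tuples():
    print(row)
```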


Real-time Streaming Visual Analytics

Real-time streaming visualizations with granular, time-based filters are essential to putting together a surveillance story that would prove out the use case above. You can pull the most recent data from any time window (hours, minutes, even seconds) from streaming sources like Apache Kafka topics and refresh the visuals without manual polling. You can also control the visualization with play and pause to analyze a specific point in time and, by leveraging Kudu, understand its historical relevance. The play-and-pause capability is critical to the storytelling aspect of surveillance because it can be used to visually synchronize multiple events as part of the investigatory process to help infer intent.
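
The play-and-pause experience itself is product functionality, but the underlying time-window pull can be sketched with the kafka-python client: seek each partition to the offset closest to the window start and read forward from there. Topic and broker names are hypothetical.

```python
# Minimal sketch: pull only the last 15 minutes of events from a Kafka topic by
# seeking each partition to the offset closest to the window start.
# Topic and broker names are hypothetical.
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=["broker1:9092"])
topic = "surveillance-events"

partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
consumer.assign(partitions)

window_start_ms = int((time.time() - 15 * 60) * 1000)
offsets = consumer.offsets_for_times({tp: window_start_ms for tp in partitions})

for tp, offset_and_ts in offsets.items():
    if offset_and_ts is not None:
        consumer.seek(tp, offset_and_ts.offset)   # rewind to the window start

# Everything read from here forward falls inside the 15-minute window,
# ready to feed a streaming visualization.
records = consumer.poll(timeout_ms=1000)
for tp, batch in records.items():
    for record in batch:
        print(tp.partition, record.offset, record.value)
```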

The Big Picture with Native Visual Analytics

Technologies such as Apache Kafka for streaming, Apache Solr for enterprise search, and Apache Kudu for low-latency storage are now mainstream and provide the capabilities that bridge the technology gaps in today's surveillance requirements. However, there will be no resolution to our use case if each technology and underlying best-of-breed surveillance platform is viewed in isolation.

Native visual analytics refers to a BI platform whose visual analytics engines are installed directly on each node within the data lake. In conjunction with an intuitive, self-service web interface, analysts and their teams are able to collaborate on the same big picture. Because the data has not been translated (i.e., summarized/aggregated and then moved to a separate visualization server), the team can have confidence in the integrity of the analysis. They also have direct access to the output of the technologies described above in granular detail, not abstract summarizations. Any other approach reduces the agility and speed required to seek insights quickly and effectively in a highly fluid environment.

Be Proactive, Not Reactive

Organizations can no longer afford to be reactive to these matters. Instead, organizations that proactively build holistic programs around these solutions will demonstrate to regulators that they have a robust and defensible process in place.

They can also demonstrate to their employees that forensic intelligence through a holistic surveillance program is in force. It is a bit like having a police car parked in the neighborhood: criminals will be less likely to come around.
