Connecting the dots: how to see the shape of fraud
At Feedzai, we’re in a constant and ever-evolving fight against financial crime. To be efficient fraud fighters, we leverage large amounts of data.
Data scientists use our platform to build machine learning models from historical data, which are then deployed to stop worldwide fraudsters in real time. Fraud analysts investigate the most complex cases and take action. Investigators look into what’s trending among fraudsters, creating rules to complement the model and stop future attacks.
All these personas share a common need: using data to successfully reach their goals. A data scientist handles terabytes of transactions while an analyst investigates many customers’ histories. However, despite the differences in the size and scope of the data they handle, both need to make sense of data.
And if you ask me, I’ll tell you that visualization is one of the best tools for that — and research continuously proves it.
Fraud-fighting charts
A data visualization engineer’s goal is to make complex data interpretable through visual representations. We take data, abstract information, and encode its properties through visual channels to create a visual representation.
Our human perception is really fast at decoding shapes, color, position, and movement as they make use of our preattentive visual processing. As Colin Ware put it, “the features that pop out are hardwired in the brain, not learned.” So we can think of our vision as a sort of high-bandwidth channel to the brain.
It follows that a data visualization is commonly expected to be understood quickly and to reveal what’s hidden in the data. This fast component of data visualization is usually the most explored one.
This paradigm works really well in Feedzai’s reporting tool called Insights. In Insights, data is made visible in multiple dashboards, from real-time fraud metrics and most triggered rules to an overview of day-to-day analyst operations. In these dashboards, you’re designing charts that can be read at a glance, usually alongside gauges and metrics.
In this world of fast data visualization, there’s a tendency toward simplicity and sticking to the most conventional charts, but there should still be some room for creativity. As Amanda Cox said,
“There’s a strand of the data viz world that argues that everything could be a bar chart. That’s possibly true but also possibly a world without joy.”
However, not all visualizations fall under this category. Our data scientists often create and use elaborate charts that can be used to explore different perspectives on the data, such as how the shape of a distribution of a categorical variable changes between legitimate and fraudulent transactions, and how that relates to each in-category fraud rate. These are not the sort of plots you can just glance over, no matter how good your preattentive visual processing is. These charts are intended to be studied as they encode many aspects of the data. It’s slow data visualization: you’re investing more time and getting a richer and deeper analysis.
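To make the numbers behind such a chart concrete, here is a minimal sketch of how they could be computed. The field names (`channel`, `is_fraud`) and the sample data are made up for illustration; they are not Feedzai's schema.

```python
from collections import Counter

def category_profile(transactions, field):
    """For each category: its share of legit traffic, its share of fraud,
    and the fraud rate inside the category."""
    legit, fraud = Counter(), Counter()
    for tx in transactions:
        (fraud if tx["is_fraud"] else legit)[tx[field]] += 1
    total_legit = sum(legit.values())
    total_fraud = sum(fraud.values())
    profile = {}
    for category in sorted(set(legit) | set(fraud)):
        n_legit, n_fraud = legit[category], fraud[category]
        profile[category] = {
            "legit_share": n_legit / total_legit if total_legit else 0.0,
            "fraud_share": n_fraud / total_fraud if total_fraud else 0.0,
            "fraud_rate": n_fraud / (n_legit + n_fraud),
        }
    return profile

# Hypothetical sample: three web and three app transactions.
sample = [
    {"channel": "web", "is_fraud": False},
    {"channel": "web", "is_fraud": False},
    {"channel": "web", "is_fraud": True},
    {"channel": "app", "is_fraud": False},
    {"channel": "app", "is_fraud": True},
    {"channel": "app", "is_fraud": True},
]
profile = category_profile(sample, "channel")
```

Plotting `legit_share` and `fraud_share` side by side shows how the shape of the distribution shifts, while `fraud_rate` adds the per-category risk on top.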
If you want to explore more about this separation between fast and slow data visualization, Elijah Meeks’s article Data Visualization, Fast and Slow is the way to go.
So, as data visualization engineers at Feedzai, it’s important to always understand where the visualization we’re creating fits in this fast-slow spectrum, what the reader’s data literacy is and how much time they will spend looking at the chart. Not everything is fast-paced and “show me the data,” but not every reader is open to an elaborate interactive chart. Like most things, there’s a time and a place for each.
It’s all connected
In this blog post, we’ll go back to a specific time and place and tell the story of how one type of visualization was key in uncovering complex fraud patterns — and then, how it kick-started Feedzai’s product, Genome.
First, let’s start by giving you a glimpse of the payment world and how fraud happens. We’re usually told to be very aware of our passwords and credit card details. We’re told to protect them from malicious lonely hackers hiding behind computer screens who are trying to steal our personal data and money. For some reason, images depict these fraudsters as wearing a hood at all times.
While there are hackers who operate solo (we’re still unsure about the hood part, though), fraudsters rarely work alone. They are often part of a fraudulent organization and take part in multiple activities, including setting up card-cloning devices in ATMs and deploying phishing attacks online.
When fraudsters try to convert stolen cards to re-sellable goods and attempt to shop online, they tend to connect through private networks, switch devices and spoof their location around the world. However, being part of these large networks tends to leave a trace. All it takes is a sloppy fraudster (someone who stayed up late last night, binging through too many episodes on Netflix and who hasn’t had their morning coffee yet) to forget a step in the switching procedure, and you can get your hint to uncover the organization and stop them. It does sound a bit cliché, but everything is, in fact, connected.
As we kept on piecing together these connected fraud stories — a shared device between two users, someone using over ten different cards, each of which was shared with other distinct users — we found that tables are a really bad medium to find these connections.
We were starting to look a bit like this (see image below), so we decided to give the idea of node-link diagrams and graph-based features a go.
In February 2018, Feedzai hosted its first Zaickathon. For two days, there was pizza, lots of coffee, t-shirts, teams and a whole bunch of ideas. It was also the perfect opportunity to test out a new visualization method.
We picked a dataset of digital wallet transactions with known fraud schemes and visualized it with a node-link diagram. Each entity in the dataset (each distinct customer, credit card, email or device) was represented as its own node. Then, two nodes were linked (connected by an edge) if their entities participated in the same transaction. Here’s an example:
- Imagine we are an online sneakers shop, and we get an order from Zoe. From the transaction data, we know that Zoe used her smartphone, a Samsung S9, to buy herself a brand new pair of sneakers. We also know that she paid using her debit card. From this event, we can build the following graph.
- A few weeks pass, and we get a couple more orders from Zoe. She’s still paying with the same card but now using a new Huawei device. We update our graph and connect the new device to Zoe and her debit card. The edge thickness also changes because the thickness is proportional to the number of transactions in which both entities (nodes) participate.
- Now let’s say we have relevant historical data: previous transactions that the new device also participated in and that were confirmed to be fraud. We want to visually encode that information in the diagram as well, so let’s update the graph once again.
The red borders and edges give us a way to clearly see where there’s a history of fraudulent activity.
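The Zoe example can be sketched in a few lines of code. This is only an illustration under assumptions, not Genome's implementation: the transaction format, the `kind:value` node ids, and the rule that every entity in a confirmed-fraud transaction gets flagged are all choices made for this sketch, and the "unknown-7" customer and "stolen-9" card are invented to stand in for the historical fraud.

```python
from collections import defaultdict
from itertools import combinations

def build_entity_graph(transactions):
    """Turn transactions into nodes (entities) and weighted edges."""
    nodes = {}  # node id -> {"type": str, "fraud": bool}
    edges = defaultdict(lambda: {"weight": 0, "fraud": False})
    for tx in transactions:
        ids = []
        for kind, value in tx["entities"].items():
            node_id = f"{kind}:{value}"
            node = nodes.setdefault(node_id, {"type": kind, "fraud": False})
            # Assumption: any entity seen in a fraud transaction is flagged.
            node["fraud"] = node["fraud"] or tx["is_fraud"]
            ids.append(node_id)
        # Link every pair of entities that share this transaction.
        # Sorting keeps the (a, b) key order stable across transactions.
        for a, b in combinations(sorted(ids), 2):
            edge = edges[(a, b)]
            edge["weight"] += 1  # drives the edge thickness
            edge["fraud"] = edge["fraud"] or tx["is_fraud"]
    return nodes, dict(edges)

zoe_s9 = {"customer": "zoe", "card": "debit-1", "device": "samsung-s9"}
zoe_huawei = {"customer": "zoe", "card": "debit-1", "device": "huawei"}
old_fraud = {"customer": "unknown-7", "card": "stolen-9", "device": "huawei"}

transactions = [
    {"entities": zoe_s9, "is_fraud": False},
    {"entities": zoe_huawei, "is_fraud": False},
    {"entities": zoe_huawei, "is_fraud": False},
    {"entities": old_fraud, "is_fraud": True},  # historical, confirmed fraud
]
nodes, edges = build_entity_graph(transactions)
```

In this sketch, the edge between Zoe and her debit card ends up with a weight of 3, while the Huawei device carries the fraud flag from its history and Zoe herself does not.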
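To recap, here are the visual encodings we’re using:
- Node: a distinct entity (customer, credit card, email or device)
- Edge: the two entities participated in the same transaction
- Edge thickness: proportional to the number of transactions in which both entities participate
- Red border or red edge: a history of confirmed fraudulent activity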
The result? We could really visually find fraud! In the middle of a sea of small connected components, these really large and intertwined subgraphs would surface.
For the first time, we were discovering the shape of fraud. We were mapping the genome sequence of different fraud patterns.
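What the eye picks up here can also be approximated in code. The sketch below finds connected components with a plain breadth-first search and sorts them by size, so the large, intertwined subgraphs come first. The entity ids are invented, and this is a generic graph technique rather than a description of how Genome works internally.

```python
from collections import defaultdict, deque

def connected_components(edge_list):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical data: two ordinary customer-card pairs and one tangled cluster.
edge_list = [
    ("u1", "c1"),
    ("u2", "c2"),
    ("u3", "d9"), ("u4", "d9"), ("u4", "c7"), ("u5", "c7"), ("u5", "d9"),
]
components = connected_components(edge_list)
sizes = [len(component) for component in components]
```

Size alone is a crude signal, of course. It only says where to look first, not that a component is fraudulent.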
The hackathon project was a success, and soon there was a cross-functional team working on kick-starting a product out of it. Genome would become a dynamic visualization engine that leverages Feedzai’s powerful AI technology to provide an intuitive way for investigators and data analysts to quickly identify emerging financial crime patterns.
Larger and larger networks
The team was assembled and the two-day prototype gave way to a scalable graph visualization built from scratch. The team was still in the early days of building a product. There were post-its with ideas scattered around walls, daily brainstorms, and quick experimentations.
We wanted complete freedom to render the graph any way we wanted — custom interactions, complex node and edge design, and different layouts (from the standard d3-force to alternative layouts with WebCoLa). This meant one thing: we had to build our own graph renderer.
We played around with several front-end technologies. We even started with SVG but quickly realized it didn’t scale for the large graphs we were getting from real data. We moved from SVG to Canvas and dwelled a bit on WebGL. In the end, we made it work on Canvas. Victor Fernandes was the magician who was able to push its performance to our needs by using creative and very ingenious strategies (that’s a topic for a whole other blog post). After plotting larger and larger graphs, we concluded that the bottleneck was no longer in the browser performance — it was in the graph legibility for the user. You could potentially plot 30 000 nodes and 100 000 edges, but why on earth would you want to do it?
As most data visualization practitioners are aware, node-link diagrams are a tricky domain. While seemingly perfect for understanding relationships, they can quickly get out of control if we’re dealing with very large, highly-connected graphs — something commonly referred to as “the hairball problem.” Doesn’t it sound nice?
It was never realistic to think we’d see the full network in the browser. The fraud analyst usually sees only a subgraph generated from current alerted event data and relevant historical context, and then they can expand nodes to find additional connections. An investigator starts with a more generic query (e.g., “all missed fraud from last week”), sees the graph generated from all those events, and tries to identify interesting new patterns. In both cases, we’re using a “search and expand-on-demand” approach instead of the classical “overview first, zoom and filter, then details-on-demand” data visualization paradigm.
This means we can usually avoid those very large graphs, but sometimes we do get them. Despite the fact that they are not really interpretable, they can be quite good looking, so we’re calling them “Genome Data Art.” They make good desktop wallpapers and t-shirts, and maybe they could even be a part of a data art exhibit one day.
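As a rough sketch of the expand-on-demand idea: start from the entities of an alerted event and reveal a node's direct neighbors only when asked. The adjacency map and the `expand_node` helper are hypothetical names for this illustration. In a real system the neighbors would presumably come from a backend query rather than an in-memory dictionary.

```python
def expand_node(visible, adjacency, node_id):
    """Return the visible node set after revealing node_id's direct neighbors."""
    if node_id not in visible:
        raise ValueError(f"{node_id} is not on screen, so it cannot be expanded")
    return visible | adjacency.get(node_id, set())

# Hypothetical full network; only part of it is ever shown at once.
adjacency = {
    "customer:zoe": {"card:debit-1", "device:huawei"},
    "card:debit-1": {"customer:zoe", "device:huawei"},
    "device:huawei": {"customer:zoe", "card:debit-1", "customer:unknown-7"},
    "customer:unknown-7": {"device:huawei", "card:stolen-9"},
    "card:stolen-9": {"customer:unknown-7"},
}

# The analyst starts from the entities of the alerted event...
visible = {"customer:zoe", "card:debit-1", "device:huawei"}
# ...and expands the device to look for additional connections.
after_expand = expand_node(visible, adjacency, "device:huawei")
```

Each expansion goes one hop only, so the stolen card stays hidden until the analyst also expands the newly revealed customer.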
Storytelling fraud
As we learned more about connected fraud, it became clear that there was a very important dimension to some of these patterns that we were not encoding visually: time. The temporal aspect of a fraud attack is of the utmost importance: the frequency and periodicity of the events are big indicators of fraudulent activity.
Because of this, we went on to develop a time histogram for Genome, which shows the distribution over time of the events generating the graph. This seemed simple enough to do but came with a few challenges, as developing data visualizations for a product usually does. Luís Cardoso, who did the phenomenal work of open sourcing the Brushable Histogram, has written a blog post on it (which you should really check out!).
The Brushable Histogram supports adjustable binning, pan, zoom and even has an overview strip plot. All this solved many of our problems. However, we still wanted to relate the node-link diagram to the histogram, and for this, we used another visual channel: animation. By pressing play next to the time histogram, we can see the story of the graph unfold as new nodes and connections pop on the screen and time passes by.
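The two ideas in this section, binning events over time and replaying the graph, can be sketched as follows. This is not the Brushable Histogram's API. The function names, the integer timestamps in seconds, and the event tuples are assumptions made for the example.

```python
from collections import Counter

def bin_events(timestamps, bin_width):
    """Count events per time bin; each key is the start time of its bin."""
    counts = Counter((t // bin_width) * bin_width for t in timestamps)
    return dict(sorted(counts.items()))

def playback_frames(events, step):
    """Yield (time, visible_edges) pairs, moving forward by `step` each frame.

    `events` holds (timestamp, node_a, node_b) tuples and `step` must be positive.
    """
    ordered = sorted(events)
    if not ordered:
        return
    visible = set()
    index = 0
    cutoff = ordered[0][0]
    last = ordered[-1][0]
    while True:
        # Reveal every edge whose event happened up to the current time.
        while index < len(ordered) and ordered[index][0] <= cutoff:
            _, a, b = ordered[index]
            visible.add((a, b))
            index += 1
        yield cutoff, set(visible)
        if cutoff >= last:
            break
        cutoff += step

# Hypothetical events, with timestamps in seconds.
events = [
    (0, "zoe", "samsung-s9"),
    (50, "zoe", "huawei"),
    (70, "zoe", "huawei"),
    (130, "huawei", "unknown-7"),
]
histogram = bin_events([t for t, _, _ in events], bin_width=60)
frames = list(playback_frames(events, step=60))
```

Changing `bin_width` is the adjustable binning part, and each frame is what the node-link diagram would show at that point in the animation.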
What the future holds
Genome has already provided analysts with insight into how fraud looks. The question now is how do we make it even better? How do we make the graph even easier to read and interpret? To take Genome up another notch, our next goal is to create the perfect conditioner, so we can untangle the messy “hairballs,” the large, highly-connected graphs. We’re now mixing the ingredients (alternative layouts, edge bundling, node grouping or graph summarization) to be able to achieve a scalable overview mode to complement the investigation view.
We’re also working on figuring out how to add a geospatial dimension to the graph. Similar to time, this is also a very important tell of some fraud modi operandi. We’re looking forward to exploring that side of geoviz as well.
Finally, AI is always at the core of Feedzai, and our data scientists are working very hard on making Genome even smarter to assist the fraud analysts in their investigations. The team has already made great strides on this front, including scoring subgraphs to recommend interesting areas of the graph, and clustering subgraphs that are all instances of the same fraud pattern (known as Genometries). It’s all about machine learning that is interpretable and actionable to improve the analyst’s experience with Genome. However, there’s still much to do, as data scientists at Feedzai are brimming with innovative ideas they want to explore and state-of-the-art experiments they want to run. It’s an exciting road ahead!
What started as a two-day prototype in a hackathon has grown and evolved. We’ve found our footing and built a solid foundation to keep building on this next-generation graph visualization platform. Now the fun continues, as a world of possibilities for network visualization is at our fingertips.