Customer Risk Profiling: Demystifying User Risk Patterns

Yusuf Azis Henny Tri Yudhantoro
Published in Tokopedia Data
8 min read, Sep 6, 2018

Financial fraud is an ancient problem that has been part of the finance world for a very long time (read this article about the origin of fraud). Every business eventually has to talk about potential income and loss, and one of the biggest sources of loss is fraud. That’s why it is a mandatory concern.

Since technology development has enabled wider business options, Tokopedia has emerged as one of the leading players in the e-commerce landscape. As with any business, it is inevitable that we have to deal with fraudsters. Technological advancement itself also makes fraudulent activity evolve: more varied approaches, higher volume, and attacks on a wide range of e-commerce products. The Risk Management Tribe was therefore established to handle these issues at Tokopedia.

The Risk Management team’s major concern is fraudulent conduct and ways to anticipate it. Together with Software Engineers and Analysts from other teams, we manage to save hundreds of millions of rupiah across myriad transactions on a daily basis. But we can’t settle for playing defense by blocking fraudulent attempts, since fraudsters actively try to exploit our system every day with new approaches. If we focus only on that, we will end up patching holes forever, and there is a chance we will be outsmarted.

There is another view that makes fraudulent conduct more preventable. Instead of focusing on the transaction’s point of view, we can take the user’s point of view. If we understand which users are likely to commit fraud, we can stop them from making repeated attempts and anticipate them in the future.

The Usual Suspects: an intriguing movie about working out who the real suspect behind a crime is. It’s my favourite, you should watch it :) (image from film-grab.com)

Before we dive into the details, let’s understand the bigger picture of the e-commerce fraud landscape. We will use a football game as an analogy. In order not to lose a football game, a team needs a great goalkeeper to read the ball and block shots. In this analogy, the shots are fraudulent transactions aiming at monetary gain in e-commerce. The goalkeeper is our Analysts and Software Engineers, who check transaction behavior. As long as we have a wall-like goalkeeper, we will not lose. However, since there are numerous spots to guard and too many shots being made, it is impractical to rely on the goalkeeper alone.

Even having “the wall” doesn’t guarantee you get through the group stage :) (gif from giphy.com)

Like football, where new attack patterns keep evolving, fraudsters tend to be innovative in developing schemes to exploit the system. But like most football attack patterns, these schemes share a common approach: they shoot from around the six-yard box. That concept aligns with fraudsters’ activity patterns. No matter how unpredictable the scheme is, it is always played by our own users with a certain pattern of interactions. We need to read what kind of interactions those are and who is making these moves. Realizing this, we decided to initiate the Customer Profiling Project.

The Customer Profiling Project is an initiative to get a better picture of our users’ profiles, and it is a common approach used by many e-commerce companies. In this article, we can see how Alibaba increases its revenue through a better understanding of its users’ data. Alibaba also did complex customer profiling on its risk management side, as described in this journal. Amazon does customer profiling across a wide range of aspects, as described in this article. It is inevitable for an e-commerce company to do customer profiling. A better understanding at the user level brings many advantages for both the users and the company.

The Risk Management team at Tokopedia also uses customer profiles to strengthen our defense against fraudsters. We start this approach with Risk Scoring. In short, this scoring tries to quantify our users’ behavior into an understandable measurement by identifying users with bad activities. The rest of this article focuses on demystifying Risk Scoring.

1. What is Risk Scoring?

Body Mass Index (BMI): one of the most common and cheapest tests for indicating obesity or general health (image from bjisg.com)

To understand Risk Scoring, we will use BMI as an analogy. Using body height and weight, we can estimate our degree of obesity and its potential implications. A higher index range implies more severe health complications than a lower one. Among all ranges, there is one that indicates a normal level. With a similar concept, we use many parameters from several sources to assess each of our users and then determine the right degree of risk.

The desired output of user risk score. x is the score, and m is the score range

The assessment yields a score for each user according to the diagram above. This means new users with no past behavioral data are classified in the normal risk range. As time goes by, they may perform activities on Tokopedia. Once their activities indicate fraudulent conduct, their risk score increases. Conversely, if their activities indicate no fraudulent conduct, their risk score decreases.

2. Bucketize the Users

In order to bucketize users with this metric, we need to define a y variable that reflects a user’s degree of fraudulence. We decided to use the user’s fraud probability, which depends on the percentage of fraud transactions, as the y variable.

One tip for calculating this probability: “Avoid misguided results with a continuous function”.

Suppose we calculate the y variable by dividing fraud transactions by total transactions. Without applying a continuous function, the series [(1/10),(10/100),(100/1000)] all yield the same output of 0.1. In terms of risk probability, we don’t consider (1/10) equal to (10/100), since a larger number of transactions carries more weight than a single fraudulent act. A greater number of fraud transactions caught within a greater number of total transactions needs to be weighted more.

Therefore, a more reasonable scheme would look like this: [(1/10)<(10/100)<(100/1000)]. We chose the log function to transform the numerator and denominator, with a little tweak that changes 0 values to 1 and 1 to 1.5, so that it won’t yield -∞. And since the number of transactions is a non-negative integer ranging from 0 to ∞, this little tweak does the trick.
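To make the tweak concrete, here is a minimal sketch of one way to read it: take the log of the (adjusted) fraud count and divide it by the log of the (adjusted) total count. The function names `adjust_count` and `fraud_degree` are hypothetical, and the handling of users with zero total transactions is our own assumption for this sketch, not something described above.

```python
import math


def adjust_count(n: int) -> float:
    """Apply the tweak from the text: 0 becomes 1 and 1 becomes 1.5."""
    if n < 0:
        raise ValueError("transaction counts cannot be negative")
    if n == 0:
        return 1.0
    if n == 1:
        return 1.5
    return float(n)


def fraud_degree(fraud_txns: int, total_txns: int) -> float:
    """Ratio of the log-transformed fraud count to the log-transformed total."""
    if total_txns == 0:
        # Assumption (not from the article): no transactions, no evidence of fraud.
        return 0.0
    return math.log(adjust_count(fraud_txns)) / math.log(adjust_count(total_txns))


for fraud, total in [(1, 10), (10, 100), (100, 1000)]:
    print(fraud, total, round(fraud_degree(fraud, total), 2))
```

With this reading, the three ratios that all collapsed to 0.1 now come out at roughly 0.18, 0.50 and 0.67, which gives the ordering we wanted. The base of the logarithm does not matter here because it cancels out in the ratio.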

Whether this methodology accurately describes a user’s degree of fraudulence is still arguable. However, since we validated the results with the Fraud Analysts and Software Engineers, it should be sufficient for a first run, version 1.0.

3. Scoring Approach

We selectively chose several parameters (x variables) to feed into the Risk Score. We use the Pearson correlation to measure the relationship between each x variable and the y variable. It yields values ranging from -1 to +1.

Pearson correlation
  • +1 means there is a perfect positive linear correlation between x and y. The higher x is, the higher y becomes.
  • -1 means there is a perfect negative linear correlation between x and y. The higher x is, the lower y becomes.
  • 0 means there is no linear correlation between x and y.

With this information we can infer how strongly a parameter (x) is associated with user fraud probability (y), and do parameter selection. The next step is determining what to do with these selected features.
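For readers who want to see the calculation itself, the Pearson correlation can be sketched in a few lines of plain Python. This is the textbook formula, not our production code, and the function name `pearson` is just for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A parameter whose values rise in lockstep with y gets a coefficient close to +1, one that falls as y rises gets a value close to -1, and a parameter with no linear relationship to y lands near 0.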

When we first got this project, we thought about logistic regression. This approach was common, simple, and intuitive enough (since we don’t treat it as a black box) to be implemented. Later on, we were quite happy with our choice, since we needed to rewrite all of our Python code in SQL so we could score millions of Tokopedia users daily in production.

Correlation information plays a big role here. With it, we can measure the weight of every parameter we have and apply a proper transformation to each parameter value. We treated parameter values under the assumption of a normal distribution, so we applied standard score normalization. And since this value acts like a dimmer, it has to lie between 0 and 1, so we applied min-max feature scaling.
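The two transformations mentioned here can be sketched as follows. This is a simplified illustration using the population standard deviation. The function names are hypothetical, and how constant columns are handled is an assumption of this sketch.

```python
import statistics
from typing import List, Sequence


def standard_scores(values: Sequence[float]) -> List[float]:
    """Standard score (z-score): distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Assumption: a constant parameter carries no signal, so map it to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_parameter(values: Sequence[float]) -> List[float]:
    """Standard score first, then squeeze the result into [0, 1]."""
    return min_max_scale(standard_scores(values))
```

One thing worth knowing: both steps are linear, so chaining them exactly as written gives the same numbers as min-max scaling the raw values. The standard score step only changes the outcome if something happens in between, such as clipping outliers at a fixed number of standard deviations.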

With all of this information collected, we can estimate y, the risk score itself.
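As a rough illustration of how correlation-based weights and scaled parameters might come together, here is a deliberately simple sketch. It is a plain correlation-weighted average, not the logistic regression mentioned earlier and not our production formula. The function name `risk_score` and the parameter names in the example are hypothetical.

```python
from typing import Dict


def risk_score(scaled_params: Dict[str, float],
               correlations: Dict[str, float]) -> float:
    """Correlation-weighted average of parameters already scaled to [0, 1]."""
    total_weight = sum(abs(r) for r in correlations.values())
    if total_weight == 0:
        raise ValueError("at least one parameter needs a non-zero correlation")
    score = 0.0
    for name, r in correlations.items():
        value = scaled_params[name]
        # Assumption: a negatively correlated parameter lowers risk as it
        # grows, so we flip it before weighting.
        contribution = value if r >= 0 else 1.0 - value
        score += abs(r) * contribution
    return score / total_weight


# Hypothetical parameters, purely for illustration.
example = risk_score(
    scaled_params={"failed_payments": 0.9, "account_age": 0.1},
    correlations={"failed_payments": 0.6, "account_age": -0.2},
)
print(round(example, 2))
```

Because every input lies between 0 and 1 and the weights are normalized, the resulting score also stays between 0 and 1, which keeps it easy to bucketize.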

4. The Result

We collected several months of data and ran simulations on several users. The graph below shows a sample of users and how their risk scores change over a period of time.

Three-month user risk score simulation

As the graph above shows, some users’ risk scores fluctuate over time. This simulation suggests that our scoring is aligned with our initial goal: when users engage in fraudulent activity, their risk score rises, and vice versa.

Averaging the user risk score over the past x days to obtain the trend

Another point of view is the trend in a user’s risk score, depicted in the graph above. It is obtained by averaging the user’s risk score over several previous days. It gives insight into which risk degree a user mostly sits in.

These graphs improve our understanding of a particular user’s risk profile. This strengthens our defense against fraudsters, since we are not only measuring transactions but also measuring the users behind them.

5. Retrospective

So far, with this scoring mechanism, we have added one layer of defenders in front of the keeper. By doing so, we now have a stronger defense against fraudsters, which makes losing less probable.

However, if we take football as the whole analogy, we cannot settle for defense alone. It is still only half of the game. We need to move forward and strike back to eventually win the game. We need to counter fraudsters even before they start the attack; we need to anticipate them.

And yes, that is a long process, and we are moving toward it.

So that is the big picture of what we are doing with this Risk Score. There are many technical details on how we run this gigantic calculation daily, the pros and cons, and the lessons learned from this project, but we can cover those in a different post :).

Thank you very much to Fandy Soejanto, Natanael Taufik, Nico Winata, Abdullah Malik, Jufery Chen, Maria Tjahjadi, Caroline Lianto, Theodorus Widjaja, Julius Leo, and Kent Stanley for the insights and collaboration on this project, and to Kevin Filmawan for the iterative editing process.

Hopefully this is useful for you. If you like it, don’t forget to hit the clap button, start a discussion in the comment section, and share it. And yes, we are hiring! See the details at tokopedia.com/careers
