Fraud Detection
With the rise of the internet and smartphones, collecting user data has become easier than ever for companies. Many companies sell this data, but sometimes it gets stolen, and there are multiple ways in which that can happen.
One popular method is to deploy crawlers: they visit a platform, register themselves as regular users, and then start extracting data.
For this blog, we are using data from an internet company that stores information related to the mobile number and location of each user in its database. This company allows one user to view the information of another user only if they have the number or some other identifying information about that person.
We are provided with four sets of data, as follows:
Signup Data
In this table, the timestamp of each signup and the country from which the user signed up are captured.
Call Data
This table contains the timestamps of the calls made or received by a user.
Message Data
This table contains the timestamps of the messages sent or received by a user.
Search Data
This is the most important table, as it contains the timestamps of the searches performed by a user.
Now that we have the basic data and the definition of each table, we need a rough working definition of fraud in this case before going forward.
Before doing that, let us explore the dataset provided to us and run some quality checks on it.
These are the checks I think we must do before analyzing the data:
- Convert the timestamps into datetime format
- Check that all of the signup ids are unique
- Check that the signup data is the master set of all other ids
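In pandas, these three checks can be sketched roughly as follows. The table and column names here are assumptions for illustration, since the original schema is not shown:

```python
import pandas as pd

# Toy stand-ins for the signup and call tables (names are assumptions,
# not taken from the original dataset).
signup = pd.DataFrame({'user_id': [1, 2, 3],
                       'signup_time': ['2021-01-01 10:00:00',
                                       '2021-01-02 11:30:00',
                                       '2021-01-03 09:15:00'],
                       'country': ['IT', 'ES', 'IT']})
calls = pd.DataFrame({'user_id': [1, 2],
                      'call_time': ['2021-01-01 12:00:00',
                                    '2021-01-02 12:00:00']})

# 1. Convert the timestamp column into a proper datetime dtype.
signup['signup_time'] = pd.to_datetime(signup['signup_time'])

# 2. Check that every signup id is unique.
assert signup['user_id'].is_unique

# 3. Check that signup is the master set: every id that appears in an
#    activity table must also appear in the signup table.
assert set(calls['user_id']).issubset(set(signup['user_id']))
```

The same subset check would be repeated for the message and search tables.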
Now let’s check the country-wise distribution of the people signing up on the platform.
We can see that the maximum number of signups comes from the country code IT, i.e. Italy.
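A quick way to get this distribution is a simple `value_counts` on the signup table (the `country` column name is an assumption):

```python
import pandas as pd

# Hypothetical signup table; 'country' holds ISO-style codes like 'IT'.
signup = pd.DataFrame({'user_id': range(6),
                       'country': ['IT', 'IT', 'IT', 'ES', 'FR', 'ES']})

# Count signups per country, most frequent first.
country_counts = signup['country'].value_counts()
print(country_counts)
```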
OK, now we have three different activity data frames; let's combine them and create an aggregate data frame that has the data points across all of them.
This will result in the following data frame:
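The aggregation can be sketched as a series of left joins of per-user event counts onto the signup table. The table and column names below are illustrative assumptions:

```python
import pandas as pd

# Toy per-event tables (names and columns are assumptions).
signup = pd.DataFrame({'user_id': [1, 2, 3]})
calls = pd.DataFrame({'user_id': [1, 1, 2]})
messages = pd.DataFrame({'user_id': [1, 2, 2, 2]})
searches = pd.DataFrame({'user_id': [3, 3]})

def event_count(df, name):
    # Number of events per user, as a named Series indexed by user_id.
    return df.groupby('user_id').size().rename(name)

# Left-join every per-user count onto the signup master table;
# users with no activity of a given kind get a count of 0.
master = signup.set_index('user_id')
for df, name in [(calls, 'call_count'),
                 (messages, 'message_count'),
                 (searches, 'search_count')]:
    master = master.join(event_count(df, name))
master = master.fillna(0).astype(int).reset_index()
print(master)
```

Joining onto the signup table (rather than an inner merge) keeps inactive users visible, which matters later when we look at users with zero calls and messages.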
Now that we have aggregated all of the data, let us start with the analysis.
Here I have plotted a candlestick chart for the counts of messages, calls, and searches. I have already pruned the outliers while plotting, but outliers were present in the data, so we will handle them as well.
One important thing to notice here is that there are users who have zero calls and zero messages but do have some record of searches.
In the context of the current problem, I feel any such user whose searches aggregate to more than 10 can be called a fraudulent user, as they appear to be bots deployed to scrape user data from the platform.
Master2 = Master2[~((Master2.message_count==0)&(Master2.call_count==0))]
Now we have removed all of the users who have zero calls and zero messages, and for further analysis we will be using this data frame only.
Now we will engineer a feature that will help us segment the users into different buckets.
First I will normalize the message, call, and search counts; then I will divide the normalized search count by the sum of the normalized message and call counts.
# Assuming min-max scaling of the raw counts (column order: message, call, search)
from sklearn.preprocessing import MinMaxScaler
normalized_data = MinMaxScaler().fit_transform(Master2[['message_count', 'call_count', 'search_count']])
Master2['message_normalized'] = normalized_data[:,0:1]
Master2['call_normalized'] = normalized_data[:,1:2]
Master2['search_normalized'] = normalized_data[:,2:3]
Now let’s create the ratio
Master2['searc2callmessRatio'] = Master2['search_normalized']/(Master2['message_normalized']+Master2['call_normalized'])
Using this ratio, after some experimentation I have chosen the 0.98 quantile as the cut-off value; any users above it are suspects, as they have a very high number of searches compared to messages and calls.
suspectes2 = Master2[Master2.searc2callmessRatio>=Master2.searc2callmessRatio.quantile(q=.98)]
Now we have another set of users who come under direct suspicion,
so we will do more analysis on the same to narrow down the findings.
Now let us examine the data concerning the countries of origin
I will be plotting country-specific candlesticks to have a better understanding of the distribution of calls, messages, and searches.
From this, we can see that the search count is high compared to the call and message counts.
Also, if we look at the numbers, we can see that Spain has the highest percentage of fraudulent users.
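The per-country fraud percentage can be computed with a simple groupby, sketched here on toy data (the frame and the `is_suspect` flag column are assumptions):

```python
import pandas as pd

# Hypothetical merged frame: one row per user, with the signup country
# and a boolean flag marking users already identified as suspects.
users = pd.DataFrame({'country': ['ES', 'ES', 'ES', 'IT', 'IT', 'FR'],
                      'is_suspect': [True, False, True, False, True, False]})

# Mean of a boolean column is the fraction of True values; scale to percent.
fraud_pct = users.groupby('country')['is_suspect'].mean() * 100
print(fraud_pct.sort_values(ascending=False))
```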
OK, now we are done with the country-wise analysis. The next important KPI can be the age of the user on the platform; more importantly, we should look at the time a user takes to perform their first search on the platform.
If our assumption is correct, most of the suspects will start searching the database immediately after they sign up. Some crawlers can also time this to make themselves seem like normal users; I won't go into that much depth in this article, as detecting human-mimicking bots would be quite a lengthy process, so let us focus on the already-suspected users.
Using the above function, we will find the delays in the data.
Now we will create a data frame that stores, for each user, the time taken to do the first search, receive the first message, and receive the first call.
Once we have the delay data frame, we will merge it with our main data frame and then do the rest of the analysis.
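Computing the delay to the first search might look roughly like this; the column names are assumed for illustration, and the same pattern applies to the first call and first message:

```python
import pandas as pd

# Hypothetical event tables with datetime columns (assumed names).
signup = pd.DataFrame({'user_id': [1, 2],
                       'signup_time': pd.to_datetime(['2021-01-01 10:00',
                                                      '2021-01-01 11:00'])})
searches = pd.DataFrame({'user_id': [1, 1, 2],
                         'search_time': pd.to_datetime(['2021-01-01 10:01',
                                                        '2021-01-01 18:00',
                                                        '2021-01-02 11:00'])})

# Earliest search per user, then the delay from signup to that search.
first_search = (searches.groupby('user_id')['search_time'].min()
                .rename('first_search').reset_index())
delay_frame = signup.merge(first_search, on='user_id')
delay_frame['search_delay'] = delay_frame['first_search'] - delay_frame['signup_time']
print(delay_frame[['user_id', 'search_delay']])
```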
newSuspect = suspectes2.merge(delay_frame)
It is quite important to look at the users who have not immediately started searching, because they can be legitimate users and we do not want them counted as suspects. We will also ignore all the users who have a search count of less than 20; this will help us narrow our search.
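This final narrowing step could be sketched as follows, with an assumed one-hour threshold for what counts as "immediately" searching (the frame and column names are illustrative):

```python
import pandas as pd

# Hypothetical suspect frame after merging in the delay data.
newSuspect = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'search_count': [50, 15, 120, 40],
    'search_delay': pd.to_timedelta([60, 120, 30, 172800], unit='s')})

# Keep only users who searched soon after signup (assumed: within an
# hour) and who have at least 20 searches in total.
final = newSuspect[(newSuspect['search_delay'] <= pd.Timedelta(hours=1)) &
                   (newSuspect['search_count'] >= 20)]
print(final['user_id'].tolist())
```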
After all this, we have the final count of suspicious users, and these are the conclusions we can draw from this business case.
CONCLUSION
- There are a total of 1228 users who were found to be fraudulent in their behavior; using this analysis, we can also create red flags for certain users.
- 500 users have a high search history but very few calls and SMS messages.
- 18.31% of all users from Spain exhibit fraudulent behavior in this database.
FINAL FLOW FOR FRAUD DETECTION
RESOURCES
The Jupyter notebook of the above analysis can be found here