Telco Fraud Detection on Snowflake

End-to-End demo: Dynamic Tables, Streamlit, Data Sharing

According to a survey conducted by CFCA (Communications Control Fraud Associations) in 2021, 2% of the total global telecom industry revenue is lost due to fraud activities. To put this into perspective, that is close to almost $40 billion vanishing every year.

What’s even more concerning is that this is not a stagnant figure. It’s on the rise, with a 28% increase compared to previous years. That’s quite a considerable jump and a threat that cannot be ignored.

In the last few years, we’ve seen an explosion in the number of fraud scenarios, each with its unique modus operandi and impact.

In this article, we will explore Snowflake’s features to construct a comprehensive data pipeline, spanning from the initial data ingestion stage all the way through to data visualization. You’ll find relevant materials at the bottom to help you run your hands-on demo on this subject.

Do you want to know more and see this demo live? The webinar “Fraud Detection 101” is available on-demand

Fraud Detection 101 — On-Demand Webinar

Fraud Scenarios

Let’s review a few of these well-known threats.

  1. Wangiri: known as the “one ring and drop” scam. This scheme involves fraudsters deliberately placing a missed call, often from premium or international phone numbers (MSISDN), with the expectation that the recipient will return the call.
    Regrettably, some individuals do indeed call back, only to discover exorbitant charges for international and premium calls on their subsequent phone bills. In response, customers may file complaints against their telecom providers for failing to safeguard their interests. They might also refuse to pay the inflated bills and decide to switch to a different telecom service provider.
  2. SIM Swap Scam: SIM swapping is a routine procedure that people often rely on when they lose their mobile phones. In such cases, individuals typically visit a dealer to request a new SIM card with the same number.
    However, fraudsters have identified an exploitable vulnerability in this process. They trick the carrier into activating a SIM card that they themselves control, but with a phone number they do not legitimately possess.
    Once these fraudsters gain control of the victim’s mobile number, they can intercept the subscriber’s calls and messages. This includes sensitive communications such as one-time passwords from banks, which are vital for accessing bank accounts and authorizing transactions like bank transfers.
  3. Spamming and Phishing: Phishing attacks occur almost every single day in the lives of nearly everyone. Fraudsters send a large number of SMS messages with attached phishing links, all meticulously designed to deceive subscribers into disclosing sensitive information, such as passwords, credit card numbers, or bank account details.
  4. Mobile Money: Fraudsters exploit Mobile Money services for money laundering, a situation that telecom companies must monitor closely to ensure compliance with regulatory requirements.
    Simultaneously, scammers may attempt to target subscribers through PIN phishing attacks, with the intention of gaining unauthorized access to their mobile wallets and initiating transfers or withdrawals of funds.

Getting Started

Transform your data using Dynamic Tables

Frequently, telecom data is distributed across various systems.

There are charging and billing systems processing CDRs (Call Detail Records), Customer Relationship Management (CRM) and network systems handling subscriber data, records related to dealer activations, SIM swaps, service plan changes, and many more. In a telco infrastructure, one can easily count hundreds of data sources.

Creating a comprehensive view of all this data can be quite challenging.

In this brief demonstration, batch jobs load data into a CDR_STAGING table (dataset generated using FrostyGen).
Once the CDRs are situated in the staging table, dynamic tables can be used to normalize the data and enhance it with additional contextual information, including details about subscriber activation and a lookup table for country codes.

Dynamic Tables — CDRs ingestion, normalization & enrichment

Below is the Data Definition Language (DDL) for the CDRS_ENRICHED dynamic table, which combines data from the CDRS_NORMALISED, SIM_ACTIVATION, and COUNTRY_CODE tables.
Dynamic tables have configurable parameters to set:

  • the TARGET_LAG, which determines the expected refresh time of the table, with a minimum allowable value of 1 minute.
  • the WAREHOUSE used to automatically refresh this table by background jobs.
CREATE OR REPLACE DYNAMIC TABLE CDRS_ENRICHED
TARGET_LAG = '1 day'
WAREHOUSE = FRAUD_ANALYSIS
AS
SELECT cn.*, COUNTRY_NAME, RISK_SCORE as COUNTRY_RISK_SCORE, ACTIVATION_DATE as A_NUM_ACTIVATION_DATE
FROM CDRS_NORMALISED cn
left join SIM_ACTIVATION sa on cn.anum = sa.MSISDN
left join COUNTRY_CODE cc on cn.b_country_code = cc.country_code ;
FRAUD_DATA.ANALYSIS.TESTFRAUD_DATA.ANALYSIS.TEST
ALTER DYNAMIC TABLE CDRS_ENRICHED REFRESH;

In the GitHub repository shared at the bottom of this article, the scriptsetup.sql and files mentioned below are available to download.

Here are the steps to execute:

  1. Upload the cdr_dataset.csv.gz to an S3 bucket.
  2. Update S3 bucket parameters in the setup.sql script.
  3. Run the setup.sql (HINT: When populating the CDRS_STAGING table, you can either load the dataset into an internal Snowflake stage instead of the S3 bucket or generate a fresh, larger dataset using FrostyGen and push it directly to the table).
  4. During the execution of setup.sql, in order to showcase Snowsight features, one of the steps required is to upload the country_code.csv file via UI to the COUNTRY_CODE table.
-- Storage integration with S3 bucket where we placed our files (adjust parameters below accordingly)
CREATE OR REPLACE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<ADD_ROLE_HERE>'
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('s3://<ADD_STORAGE_HERE>');

-- Create External Staging in Snowflake linked to the reference s3 bucket.
CREATE OR REPLACE STAGE CDRS_STAGING
STORAGE_INTEGRATION = s3_int
URL = 's3://<ADD_STORAGE_HERE>'
FILE_FORMAT = cdr_csv_format;

[...]

-- This step must be done using the UI - DO NOT SKIP IT Showcasing data loading feature from UI)
-- Country Code (Static Table): From Snowsight you can upload CSV directly into a table.
-- We'll use this feature to load data in the HOTLIST_NUMBER table.
-- 1) Use the file country_code.csv and load it via Snowsight UI.
-- 2) Create a new table from the UI based on the csv mentioned above. Table should be named "COUNTRY_CODE" otherwise the next command will fail.
-- More details: https://docs.snowflake.com/en/user-guide/data-load-web-ui

Once these steps are completed, you can query records using Snowsight.

Relying purely on SQL for analysis may not be the most convenient option for fraud analysts. This is where “Streamlit in Snowflake” becomes valuable.

NOTE: this data pipeline is not intended to be a ready-to-use artifact but provides a foundation on available features that can be used to streamline your data transformation jobs.

Visualize your data using Streamlit

Streamlit Dashboard — Fraud Detection Analytics

Developing a dashboard in Streamlit to visualise data is easy. You can deploy this dashboard within your Snowflake account and customize it directly in your environment by following these steps:

  1. Download the fraud_detection_sis.py and logo.png files from the GitHub repository.
  2. Create a new Streamlit app within your Snowflake account.
  3. Paste the code from the fraud_detection_sis.py file into your new Streamlit app.
  4. Upload the logo.png image to the Streamlit application stage in Snowflake.

By following these steps, you can run the Streamlit dashboard directly in Snowflake and tailor the code and graphs to your specific needs.

NOTE: this dashboard is not intended to be a ready-to-use artifact but provides a foundation on available features that can be used.

HINT: Data Sharing for a joint effort to tackle fraudsters

A fraud analyst, across different activities, might start filling a list of suspicious numbers due to their involvement in fraudulent activities on the network. Thanks to Snowflake’s data sharing features, this list can be leveraged in at least two ways:

  1. Share the list with other relevant teams within the company to automate network actions (barring / unbarring / deactivation)
  2. Privately or publicly share the list with external organizations (or within the same global telco group) to jointly tackle fraudsters.

While executing the setup script, we created the table HOTLIST_NUMBER containing numbers calling our subscribers from risky countries.

-- Use Case: Based on the country codes with the highest risk score, create a list of top-risk hotlist numbers.  
INSERT INTO HOTLIST_NUMBER
SELECT DISTINCT BNUM, COUNTRY_CODE FROM CDRS_STAGING cs join COUNTRY_CODE cc ON cs.CCODE = cc.COUNTRY_CODE
WHERE RISK_SCORE > 4
ORDER BY RISK_SCORE DESC LIMIT 100;

You can use this table to explore private sharing capabilities in Snowflake.

Provider Studio — Share Privately Hotlist numbers with partners and other teams.

Conclusion

Fraud detection is an intricate scenario that requires substantial system integration efforts and in-depth business expertise. While this demo significantly simplifies the complexity of the situation, Snowflake can be an helpful data platform to fulfill three key technical objectives:

  1. Data normalization and enrichment within a unified platform.
  2. Data exploration and visualization through Streamlit in Snowflake.
  3. Data sharing to facilitate coordinated actions and collaborative efforts in addressing fraud scenarios.

If you have any questions or suggestions, please feel free to reach out to me on LinkedIn.

Resources:

--

--

Matteo Consoli
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Sales Engineer @ Snowflake ❄️ | Data Engineering | Data Analytics | Data Science | Python | SQL | Book | Music Composer 🎼