100 Scripts in 30 Days challenge: Scripts 29, 30, 31, 32 & 33 — Using PySpark for Data & SWIFT BIC Masking

In my previous roles I came across many scenarios where we had to mask a production data dump before loading it into a QA database. This is a very common scenario in any enterprise setup. The common approaches we used were either SQL queries, stored procedures, or shell scripts. For financial institutions, BIC masking is one of the standard masking activities wherever SWIFT data is involved. In this experiment I try to do data masking using PySpark, which showcases its ETL capabilities and how it can be used to mask data in various environments.

What is SWIFT BIC?

The SWIFT code is a standard format for Business Identifier Codes (BIC), used to uniquely identify banks and financial institutions globally — it says who and where they are. A BIC has 8 or 11 characters: a 4-letter bank code, a 2-letter country code, a 2-character location code, and an optional 3-character branch code (for example, DELBDE33XXX). These codes are used when transferring money between banks, in particular for international wire transfers or SEPA payments. Banks also use these codes to exchange messages.

For the purpose of this experiment, I did web scraping to extract BICs from the site below:

Web Scraper for BIC Extraction from the Website

This is the web scraper code for extracting BICs from the website.

The parse_spages() function parses the main page and finds the alphabet pages for the countries.

The parse_country() function extracts the country page URLs and passes each one to parse_countr_pgs(), where the country page is parsed and the BICs are extracted.

The parse_countr_pgs() function extracts the HTML that contains the BIC data and saves it into .pickle files to be parsed later.
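The scraper itself is shared as a separate script, but a minimal sketch of its shape, assuming requests and BeautifulSoup with hypothetical CSS selectors and a placeholder base URL (the real site and selectors are not reproduced here), could look like this:

import pickle
import requests
from bs4 import BeautifulSoup

# Placeholder: the article links to a specific BIC directory site,
# which is not reproduced here.
BASE_URL = "https://example.com"

def parse_countr_pgs(country_url):
    # Save the HTML that contains the BIC data into a .pickle file.
    html = requests.get(country_url).text
    fname = country_url.rstrip("/").split("/")[-1] + ".pickle"
    with open(fname, "wb") as f:
        pickle.dump(html, f)

def parse_country(alphabet_url):
    # Extract the country page URLs and pass each to parse_countr_pgs().
    soup = BeautifulSoup(requests.get(alphabet_url).text, "html.parser")
    for a in soup.select("a.country"):      # hypothetical selector
        parse_countr_pgs(a["href"])

def parse_spages():
    # Parse the main page and find the alphabet pages for the countries.
    soup = BeautifulSoup(requests.get(BASE_URL).text, "html.parser")
    for a in soup.select("a.alphabet"):     # hypothetical selector
        parse_country(a["href"])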

Clean the BIC Data and Save It to a CSV File

The following script reads the BIC data from the pickle files, filters it, and saves the refined data into a CSV file. It uses html2text to read the HTML stored in the pickle files, converts it to Markdown text, and then searches for the pipe-separated lines that contain the BIC text.

The pattern below uses a regular expression to find the lines that contain a BIC.

pattern = "^.*([a-zA-Z]){4}([a-zA-Z]){2}([0-9a-zA-Z]){2}([0-9a-zA-Z]{3})?.*$"
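Put together, a minimal sketch of this cleaning step, assuming one .pickle file per country page and hypothetical file names, might look like this:

import csv
import pickle
import re

import html2text

pattern = re.compile(r"^.*([a-zA-Z]){4}([a-zA-Z]){2}([0-9a-zA-Z]){2}([0-9a-zA-Z]{3})?.*$")

# Hypothetical file name: one pickle per scraped country page.
with open("country_page.pickle", "rb") as f:
    html = pickle.load(f)

# Convert the stored HTML to Markdown; the BIC tables become
# pipe-separated text lines.
markdown = html2text.html2text(html)

with open("bics.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for line in markdown.splitlines():
        if "|" in line and pattern.match(line):
            # Keep the pipe-separated fields of the matching row.
            writer.writerow([c.strip() for c in line.split("|") if c.strip()])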

Generating Fake Data for the Actual Experiment

The code below uses the Faker library, which has functions to generate random values for addresses, people, telephone numbers, and various other types of data. I combined Faker output with randomly chosen BICs from the scraped data to build the test data (a minimal sketch follows the sample data below).

The test data will have the following columns.

ACCOUNT_ID,STREET_ADDRESS,SECONDARY_ADDRESS,POSTAL_CODE,CITY,COUNTRY,COUNTRY_CODE,ZIP_CODE,SWIFT_ADDR,TEL_NUM,EMAIL_ADDR,CNTCT_PRSN,CMPNY_NAME,FAX_NUM

Here is some sample data:

AO0000027338,39184 Haden Place Suite 992,Apt. 558,39580,North Ova,Ghana,GH,41324,DELBDE33XXX,1-091-342-5557x52643,honora56@walter.org,LeeSchultz,Senger-Berge,(201)864-3352
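The generator script itself is shared separately; a minimal sketch, assuming the scraped BICs sit in a hypothetical bics.csv, might look like this:

import csv
import random

from faker import Faker

fake = Faker()

# Load the scraped BICs (hypothetical file name) to pick SWIFT addresses from.
with open("bics.csv") as f:
    bics = [row[0] for row in csv.reader(f) if row]

columns = ["ACCOUNT_ID", "STREET_ADDRESS", "SECONDARY_ADDRESS", "POSTAL_CODE",
           "CITY", "COUNTRY", "COUNTRY_CODE", "ZIP_CODE", "SWIFT_ADDR",
           "TEL_NUM", "EMAIL_ADDR", "CNTCT_PRSN", "CMPNY_NAME", "FAX_NUM"]

with open("test_data.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(columns)
    for _ in range(100):
        writer.writerow([
            "AO%010d" % random.randint(0, 9999999999),  # random account id
            fake.street_address(),
            fake.secondary_address(),
            fake.postcode(),
            fake.city(),
            fake.country(),
            fake.country_code(),
            fake.zipcode(),
            random.choice(bics),   # a real scraped BIC
            fake.phone_number(),
            fake.email(),
            fake.name(),
            fake.company(),
            fake.phone_number(),
        ])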

Loading the CSV to a Postgres Database using PySpark

The script below loads the data into a database using PySpark. I used the following command to run it, and it created a new table with appropriate data types in Postgres. This is a feature of PySpark DataFrames I particularly liked.

spark-submit --driver-class-path postgresql-42.1.1.jre6.jar script_32_load_data.py
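The load script itself is shared separately; a minimal sketch of script_32_load_data.py, with a placeholder CSV path, JDBC URL, and credentials, could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load_test_bics").getOrCreate()

# inferSchema lets Spark derive the column types, which is what produces
# the appropriate Postgres column types on write.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("test_data.csv"))

# Placeholder connection details.
(df.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://localhost:5432/testdb")
 .option("dbtable", "test_bics1")
 .option("user", "postgres")
 .option("password", "postgres")
 .option("driver", "org.postgresql.Driver")
 .save())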

Finally, the data is loaded into the test_bics1 table.

Masking Data using PySpark

For the purpose of masking the data, I created the script below. I worked on only 100 records because of the limitations of my system: with only 1 GB of driver memory allocated, there was not enough heap space to process the data across multiple DataFrames. One major issue I faced is that you not only need a lot of memory, you also have to tune the Spark cluster or environment for it to run optimally, which was out of the scope of this experiment. In the end the script worked and gave the desired output.

I used UDFs to apply the masking and Spark SQL to apply those UDFs in a single statement, then wrote the result to another table. Another issue I found is that the "overwrite" mode for writing a DataFrame over a JDBC connection fails to overwrite an existing table.
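A minimal sketch of that approach, with illustrative masking rules and placeholder connection details, might look like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("mask_test_bics").getOrCreate()

def mask_text(value):
    # Replace every character with X, keeping the length.
    return "X" * len(value) if value else value

def mask_bic(bic):
    # Keep the bank/country/location prefix; mask the branch portion.
    return bic[:8] + "XXX" if bic else bic

spark.udf.register("mask_text", mask_text, StringType())
spark.udf.register("mask_bic", mask_bic, StringType())

# Placeholder connection details.
jdbc_opts = {
    "url": "jdbc:postgresql://localhost:5432/testdb",
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver",
}

df = spark.read.format("jdbc").options(dbtable="test_bics1", **jdbc_opts).load()
df.createOrReplaceTempView("test_bics1")

# Apply all the UDFs in a single Spark SQL statement; country and
# country code are left unmasked.
masked = spark.sql("""
    SELECT ACCOUNT_ID,
           mask_text(STREET_ADDRESS) AS STREET_ADDRESS,
           mask_text(SECONDARY_ADDRESS) AS SECONDARY_ADDRESS,
           COUNTRY, COUNTRY_CODE,
           mask_bic(SWIFT_ADDR) AS SWIFT_ADDR,
           mask_text(TEL_NUM) AS TEL_NUM,
           mask_text(EMAIL_ADDR) AS EMAIL_ADDR
    FROM test_bics1
""")

# Since "overwrite" mode proved unreliable against an existing JDBC table,
# write the result to a fresh table instead.
masked.write.format("jdbc").options(dbtable="test_bics1_masked", **jdbc_opts).save()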

It was a great experience writing this code. I was finally able to get the masked data in the format I wanted: based on the account ID, most fields were masked, except country and country code, and in the SWIFT address I replaced the last three characters with X's. It took me a while, but it was a perfect end to my month-long exploration of Python and a great learning opportunity across different aspects of the language.