Apache Beam Deep Dive Series: Episode 2

Prag Tyagi
towardsdataanalytics
6 min read · May 13, 2021

Welcome to Episode 2 of the Apache Beam series. In this episode we will dive deeper into Apache Beam's applications in fraud detection, and see how it can be used in streaming solutions.

If you want to go through Episode 1, please refer to this link:

OK, so let's get started. In fraud detection we will cover three scenarios:

i) Credit card default

ii) Personal loan default

iii) Medical loan default

Credit Card Defaulter

i) Assign 1 point to a customer for a short payment, where a short payment means the customer fails to clear at least 70% of their monthly spend.

ii) Assign 1 point to a customer who has spent 100% of their max_limit but did not clear the full amount.

iii) If, for any month, a customer meets both the above conditions, assign 1 additional point.

iv) Sum up all the points for each customer and write the output to a file.

So let's get our hands dirty.

Data Set

Now, what if you want to represent this data as key-value pairs? In that case, wrap the output in a tuple and convert the data into (K, V) pairs. Let's write a function for this, as sketched below.
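Since the original code screenshots are not reproduced here, below is a minimal sketch of what the key-value conversion and the scoring pipeline could look like. The file name (cards.txt), column layout (customer_id, month, spends, payment, max_limit) and helper names are my own assumptions for illustration, not the exact code from the original post.

```python
import apache_beam as beam

def parse_row(line):
    # Assumed CSV layout: customer_id,month,spends,payment,max_limit
    cust_id, month, spends, payment, max_limit = line.split(',')
    return cust_id, float(spends), float(payment), float(max_limit)

def to_kv(record):
    # Wrap the parsed row into a (key, value) tuple keyed by customer id
    cust_id, spends, payment, max_limit = record
    return (cust_id, (spends, payment, max_limit))

def points(kv):
    # Apply the three defaulter rules described above to one month's record
    cust_id, (spends, payment, max_limit) = kv
    pts = 0
    short_payment = payment < 0.7 * spends                        # rule i
    spent_full_limit = spends >= max_limit and payment < spends   # rule ii
    if short_payment:
        pts += 1
    if spent_full_limit:
        pts += 1
    if short_payment and spent_full_limit:                        # rule iii
        pts += 1
    return (cust_id, pts)

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('cards.txt')
     | 'Parse' >> beam.Map(parse_row)
     | 'ToKV' >> beam.Map(to_kv)
     | 'Points' >> beam.Map(points)
     | 'SumPerCustomer' >> beam.CombinePerKey(sum)                # rule iv
     | 'Write' >> beam.io.WriteToText('card_defaulters'))
```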

Loan Defaulter

i) For the personal loan category, the bank does not accept short or late payments. If a person has not paid a monthly installment, then that month's entry won't be present in the file.

ii) If a customer has missed a total of 4 or more installments, OR has missed 2 consecutive installments, then that person will be classified as a loan defaulter.

Sample loan data
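The loan data itself was shown as an image in the original post. As a stand-in, here is a rough sketch of the defaulter rule, under the assumption that each line carries a customer id and the month number of a paid installment (so missed months are simply absent), and that the repayment history spans 12 months. File and field names are placeholders.

```python
import apache_beam as beam

TOTAL_MONTHS = 12  # assumed length of the repayment history

def parse_loan(line):
    # Assumed layout: customer_id,month_paid
    cust_id, month = line.split(',')
    return (cust_id, int(month))

def classify(kv):
    cust_id, months_paid = kv
    paid = set(months_paid)
    missed = [m for m in range(1, TOTAL_MONTHS + 1) if m not in paid]
    # Two consecutive missed installments?
    consecutive = any(m + 1 in missed for m in missed)
    is_defaulter = len(missed) >= 4 or consecutive
    return (cust_id, 'defaulter' if is_defaulter else 'ok')

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('loan.txt')
     | 'Parse' >> beam.Map(parse_loan)
     | 'GroupByCustomer' >> beam.GroupByKey()
     | 'Classify' >> beam.Map(classify)
     | 'Write' >> beam.io.WriteToText('loan_defaulters'))
```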

Apache Beam Integration with Google Pub/Sub

Google Pub/Sub: It is a messaging system designed for data streaming. In simple words, we can think of it as Kafka on top of Google Cloud. Now, why have I not demonstrated Kafka integration with Beam, and why use Google Pub/Sub with Beam instead? The reason is that Beam is a Google project donated to Apache, and Beam is most commonly used with Google Pub/Sub. So that's the reason behind this integration. OK, enough talk. Time to get our hands dirty.

So first we need to set up an environment in which we can create a topic, a publisher and a subscriber in Google Pub/Sub. Follow the steps below:

1.) Register yourself for Google Cloud. Google gives you $300 of credit the first time you sign up.

https://cloud.google.com/free/

2.) Install Anaconda with Python 2.7 from the link below (Beam versions below 2.14.0, which is what I am using, support Python 2 only). It is simple: download it and click through a few 'Next' buttons.

https://www.anaconda.com/distribution/

3.) Install Visual C++ 9.0 from the link below.

http://aka.ms/vcpython27

Download it and click through a few 'Next' buttons.

4.) Open Anaconda Navigator

5.) Follow below steps to create an environment (Screenshot attached below)

  • Go to Environments
  • Click on Create
  • Name → beam
  • Python → 2.7
  • Click Create (green button)

Note: It may take some time to create the environment.

6.) Open Anaconda Prompt as administrator.

7.) Run the below commands (Screenshot attached below)

Command : activate beam (activate <name of environment>)

Command : pip install google-cloud-pubsub

Command : pip install apache-beam

After this, you need to follow these steps in Google Cloud:

i) Go to cloud.google.com and create a new project first.

ii) Click on the three-line menu icon in the top left, go to Service Accounts under the IAM section, and create a service account.

iii) Now go to the service account and generate a key file. Make sure you generate a JSON file and not a P12 file.

Now, just like we keep the .ppk file very secure in AWS, keep this key file in a secure location, as it will be required to establish a connection with your GCP Pub/Sub account.
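As an example of how the key file is usually wired up, the Google client libraries pick up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable. The path below is just a placeholder for wherever you stored your JSON key.

```python
import os

# Point the Google client libraries at the downloaded service-account key.
# The path is a placeholder; use the secure location where you kept the file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/secure/location/my-key.json'
```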

Now create a topic in the Pub/Sub environment, which will hold and accept your messages (data).

You also need to create a subscription on that topic.

Here you can configure the retention period, message ordering and other properties. That's it; we are now done with the Pub/Sub configuration.
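If you prefer doing this from code rather than the console, something along these lines works with the google-cloud-pubsub client. The project, topic and subscription names here are placeholders, and note that newer library versions switched to a request-style API.

```python
from google.cloud import pubsub_v1

project_id = 'my-project'            # placeholder
topic_name = 'beam-demo-topic'       # placeholder
subscription_name = 'beam-demo-sub'  # placeholder

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, topic_name)
sub_path = subscriber.subscription_path(project_id, subscription_name)

# Older (pre-2.0) client versions accept positional paths like this;
# newer versions expect request={'name': ...} style arguments.
publisher.create_topic(topic_path)
subscriber.create_subscription(sub_path, topic_path)
```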

Now that we are done with the installation, let's start with the code for pub and sub.

So you need 3 services here:

i) Publisher: Just like a Kafka producer, this is a service which can publish data into an intermediate storage, a DB, or a topic.

ii) Processor: This is an optional intermediate service. It takes data from an input topic, processes it, and writes it back into an output topic.

iii) Subscriber: Just like a Kafka consumer, this is a service which can subscribe to any topic and write the data into any sink.

Now, I am not sending any jazzy messages to the topic here, just some numbers. I will show how the data travels in real time.

Go to your Anaconda Prompt, then navigate to the folder where you have kept your Beam scripts. Now start your publisher.
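The publisher script from the original screenshots is not reproduced here; a minimal stand-in that publishes a few numbers to the input topic could look like this (project and topic names are the placeholders from above).

```python
import time
from google.cloud import pubsub_v1

project_id = 'my-project'        # placeholder
input_topic = 'beam-demo-topic'  # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, input_topic)

# Publish plain numbers, one per second, so we can watch them flow downstream.
for i in range(100):
    publisher.publish(topic_path, data=str(i).encode('utf-8'))
    print('Published %d' % i)
    time.sleep(1)
```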

Now start your processor.
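The processor in the original post was a Beam streaming pipeline. A hedged sketch that reads from the input topic, applies a trivial transformation, and writes to an output topic might look like this; the full topic paths and the transformation itself are assumptions, not the original code.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Full topic paths are placeholders.
INPUT_TOPIC = 'projects/my-project/topics/beam-demo-topic'
OUTPUT_TOPIC = 'projects/my-project/topics/beam-demo-output'

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'Square' >> beam.Map(lambda x: str(int(x) ** 2))
     | 'Encode' >> beam.Map(lambda s: s.encode('utf-8'))
     | 'WriteToPubSub' >> beam.io.WriteToPubSub(OUTPUT_TOPIC))
```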

Now start your subscriber.
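And a minimal subscriber, again with placeholder names, that just prints whatever arrives on a subscription attached to the output topic (the exact subscribe call varies slightly across client library versions):

```python
import time
from google.cloud import pubsub_v1

project_id = 'my-project'            # placeholder
output_sub = 'beam-demo-output-sub'  # placeholder subscription on the output topic

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, output_sub)

def callback(message):
    print('Received: %s' % message.data)
    message.ack()

subscriber.subscribe(sub_path, callback=callback)

# Keep the main thread alive while messages stream in.
while True:
    time.sleep(10)
```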

So data has started flowing in real time. By default the data flows as JSON, although if you have schema-evolution scenarios, Avro would be the format to use.

With this, let's wrap up this episode. More details to follow in the next episode.

Post credit: In Episode 3 we will go through how to implement watermarking, windows and triggers, and how to handle late-arriving records. So stay tuned.
