Apache Beam Deep Dive Series: Episode 2

Prag Tyagi
towardsdataanalytics
6 min read · May 13, 2021

Welcome to Episode 2 of the Apache Beam series. In this episode we will dive deeper into Apache Beam's applications in fraud detection, and see how it can be used in streaming solutions.

If you want to go through Episode 1, please refer to this link:

OK, so let's get started. In fraud detection we will cover three scenarios:

i) Credit card default

ii) Personal loan default

iii) Medical loan default

Credit Card Defaulter

i) Assign 1 point to a customer for a short payment, where a short payment means the customer fails to clear at least 70% of their monthly spend.

ii) Assign 1 point to a customer who has spent 100% of their max_limit but did not clear the full amount.

iii) If, for any month, a customer meets both the above conditions, assign 1 additional point.

iv) Sum up all the points for each customer and write the output to a file.

So let's get our hands dirty.

Data Set

Now, what if you want to represent this data as key-value pairs? In that case, wrap the output in a tuple and convert the data into (K, V) pairs. Let's write a function for this, as sketched below.
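Since the original code screenshots are not reproduced here, below is a minimal sketch of what the key-value conversion and the scoring pipeline could look like. The file name (cards.txt), column layout (customer_id, month, spends, payment, max_limit) and helper names are my own assumptions for illustration, not the exact code from the original post.

```python
import apache_beam as beam

def parse_row(line):
    # Assumed CSV layout: customer_id,month,spends,payment,max_limit
    cust_id, month, spends, payment, max_limit = line.split(',')
    return cust_id, float(spends), float(payment), float(max_limit)

def to_kv(record):
    # Wrap the parsed row into a (key, value) tuple keyed by customer id
    cust_id, spends, payment, max_limit = record
    return (cust_id, (spends, payment, max_limit))

def points(kv):
    # Apply the three defaulter rules described above to one month's record
    cust_id, (spends, payment, max_limit) = kv
    pts = 0
    short_payment = payment < 0.7 * spends                        # rule i
    spent_full_limit = spends >= max_limit and payment < spends   # rule ii
    if short_payment:
        pts += 1
    if spent_full_limit:
        pts += 1
    if short_payment and spent_full_limit:                        # rule iii
        pts += 1
    return (cust_id, pts)

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('cards.txt')
     | 'Parse' >> beam.Map(parse_row)
     | 'ToKV' >> beam.Map(to_kv)
     | 'Points' >> beam.Map(points)
     | 'SumPerCustomer' >> beam.CombinePerKey(sum)                # rule iv
     | 'Write' >> beam.io.WriteToText('card_defaulters'))
```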

Loan Defaulter

i) For the personal loan category, the bank does not accept short or late payments. If a person has not paid a monthly installment, then that month's entry won't be present in the file.

ii) If a customer has missed a total of 4 or more installments, OR has missed 2 consecutive installments, then that person will be classified as a loan defaulter.

Sample loan data
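The loan data itself was shown as an image in the original post. As a stand-in, here is a rough sketch of the defaulter rule, under the assumption that each line carries a customer id and the month number of a paid installment (so missed months are simply absent), and that the repayment history spans 12 months. File and field names are placeholders.

```python
import apache_beam as beam

TOTAL_MONTHS = 12  # assumed length of the repayment history

def parse_loan(line):
    # Assumed layout: customer_id,month_paid
    cust_id, month = line.split(',')
    return (cust_id, int(month))

def classify(kv):
    cust_id, months_paid = kv
    paid = set(months_paid)
    missed = [m for m in range(1, TOTAL_MONTHS + 1) if m not in paid]
    # Two consecutive missed installments?
    consecutive = any(m + 1 in missed for m in missed)
    is_defaulter = len(missed) >= 4 or consecutive
    return (cust_id, 'defaulter' if is_defaulter else 'ok')

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('loan.txt')
     | 'Parse' >> beam.Map(parse_loan)
     | 'GroupByCustomer' >> beam.GroupByKey()
     | 'Classify' >> beam.Map(classify)
     | 'Write' >> beam.io.WriteToText('loan_defaulters'))
```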

Apache Beam Integration with Google Pub/Sub

Google Pub/Sub: It is a messaging system designed for data streaming. In simple words, we can think of it as Kafka on top of Google Cloud. Now, why have I not demonstrated Kafka integration with Beam, and why use Google Pub/Sub with Beam instead? The reason is that Beam is a Google project donated to Apache, and Beam is most commonly used with Google Pub/Sub. So that's the reason behind this integration. OK, enough talk. Time to get our hands dirty.

So first we need to set up an environment in which we can create a topic, a publisher and a subscriber in Google Pub/Sub. Follow the steps below:

1.) Register yourself for Google Cloud. Google gives you $300 of credit the first time you sign up.

https://cloud.google.com/free/

2.) Install Anaconda with Python 2.7 from the link below (Beam versions below 2.14.0, which is what I am using, support Python 2 only). It is simple: download it and click through a few 'Next' buttons.

https://www.anaconda.com/distribution/

3.) Install Visual C++ 9.0 from the link below.

http://aka.ms/vcpython27

Download it and click through a few 'Next' buttons.

4.) Open Anaconda Navigator

5.) Follow below steps to create an environment (Screenshot attached below)

  • Go to Environments
  • Click on Create
  • Name → beam
  • Python → 2.7
  • Click Create (green button)

Note: It may take some time to create the environment.

6.) Open Anaconda Prompt as administrator.

7.) Run the below commands (Screenshot attached below)

Command : activate beam (activate <name of environment>)

Command : pip install google-cloud-pubsub

Command : pip install apache-beam

After this, you need to follow these steps in Google Cloud:

i) Go to cloud.google.com and create a new project first.

ii) Click on the three-line menu icon in the top left, go to Service Accounts under the IAM section, and create a service account.

iii) Now go to the service account and generate a key file. Make sure you generate a JSON file and not a P12 file.

Now, just like we keep the .ppk file very secure in AWS, keep this key file in a secure location, as it will be required to establish a connection with your GCP Pub/Sub account.
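As an example of how the key file is usually wired up, the Google client libraries pick up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable. The path below is just a placeholder for wherever you stored your JSON key.

```python
import os

# Point the Google client libraries at the downloaded service-account key.
# The path is a placeholder; use the secure location where you kept the file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/secure/location/my-key.json'
```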

Now create a topic in the Pub/Sub environment, which will hold and accept your messages (data).

You also need to create a subscription on that topic.

Here you can configure the retention period, message ordering and other properties. That's it; we are now done with the Pub/Sub configuration.
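If you prefer doing this from code rather than the console, something along these lines works with the google-cloud-pubsub client. The project, topic and subscription names here are placeholders, and note that newer library versions switched to a request-style API.

```python
from google.cloud import pubsub_v1

project_id = 'my-project'            # placeholder
topic_name = 'beam-demo-topic'       # placeholder
subscription_name = 'beam-demo-sub'  # placeholder

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, topic_name)
sub_path = subscriber.subscription_path(project_id, subscription_name)

# Older (pre-2.0) client versions accept positional paths like this;
# newer versions expect request={'name': ...} style arguments.
publisher.create_topic(topic_path)
subscriber.create_subscription(sub_path, topic_path)
```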

Now that we are done with the installation, let's start with the code for pub and sub.

So you need 3 services here:

i) Publisher: Just like a Kafka producer, this is a service which can publish data into an intermediate storage, a DB, or a topic.

ii) Processor: This is an optional intermediate service. It takes data from an input topic, processes it, and writes it back into an output topic.

iii) Subscriber: Just like a Kafka consumer, this is a service which can subscribe to any topic and write the data into any sink.

Now, I am not sending any jazzy messages to the topic here, just some numbers. I will show how the data travels in real time.

Go to your Anaconda Prompt, then navigate to the folder where you have kept your Beam scripts. Now start your publisher.
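The publisher script from the original screenshots is not reproduced here; a minimal stand-in that publishes a few numbers to the input topic could look like this (project and topic names are the placeholders from above).

```python
import time
from google.cloud import pubsub_v1

project_id = 'my-project'        # placeholder
input_topic = 'beam-demo-topic'  # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, input_topic)

# Publish plain numbers, one per second, so we can watch them flow downstream.
for i in range(100):
    publisher.publish(topic_path, data=str(i).encode('utf-8'))
    print('Published %d' % i)
    time.sleep(1)
```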

Now start your processor.
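The processor in the original post was a Beam streaming pipeline. A hedged sketch that reads from the input topic, applies a trivial transformation, and writes to an output topic might look like this; the full topic paths and the transformation itself are assumptions, not the original code.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Full topic paths are placeholders.
INPUT_TOPIC = 'projects/my-project/topics/beam-demo-topic'
OUTPUT_TOPIC = 'projects/my-project/topics/beam-demo-output'

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'Square' >> beam.Map(lambda x: str(int(x) ** 2))
     | 'Encode' >> beam.Map(lambda s: s.encode('utf-8'))
     | 'WriteToPubSub' >> beam.io.WriteToPubSub(OUTPUT_TOPIC))
```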

Now start your subscriber.
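And a minimal subscriber, again with placeholder names, that just prints whatever arrives on a subscription attached to the output topic (the exact subscribe call varies slightly across client library versions):

```python
import time
from google.cloud import pubsub_v1

project_id = 'my-project'            # placeholder
output_sub = 'beam-demo-output-sub'  # placeholder subscription on the output topic

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, output_sub)

def callback(message):
    print('Received: %s' % message.data)
    message.ack()

subscriber.subscribe(sub_path, callback=callback)

# Keep the main thread alive while messages stream in.
while True:
    time.sleep(10)
```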

So data has started flowing in real time. By default the data flows as JSON, although if you have schema-evolution scenarios, Avro would be the format to use.

With this, let's wrap up this episode. More details to follow in the next episode.

Post credit: In Episode 3 we will go through how to implement watermarking, windows and triggers, and how to handle late-arriving records. So stay tuned.
