Use Functions for Analytics workflows

Mrudula Madiraju · Published in Analytics Vidhya · Apr 26, 2020 · 5 min read

TL;DR

This story is based on a customer use case: combining serverless functions with managed services to construct analytics workflows. We'll look at why such a setup is useful, and then walk through the steps to build it.

Why serverless when there is already a managed service? Ben Kehoe's excellent article Serverless is a State of Mind puts it well:

“You should use functions as the glue, containing your business logic, between managed services that are providing the heavy lifting that forms the majority of your application.”

Overview

  1. Data arrives in batches and is stored on IBM Cloud Object Storage. The drop happens once a day, but the exact time varies with the load, number of transactions, and other factors in the upstream system.
  2. Whenever data is dropped, an action is triggered using IBM Cloud Functions.
  3. The action invokes a job on the compute engine, in this case IBM Analytics Engine, by submitting a Spark application through the Livy interface.
  4. Finally, the Spark application runs against the data that was just dropped.

IBM Cloud Functions

Based on Apache OpenWhisk, IBM Cloud™ Functions is a polyglot functions-as-a-service (FaaS) platform for developing lightweight code that executes and scales on demand. This event-driven compute platform, often referred to as serverless computing, runs code in response to events or direct invocations.

  • Action: An action is a piece of code that performs one specific task. An action can be written in the language of your choice, such as small snippets of JavaScript or Swift code or custom binary code embedded in a Docker container. You provide your action to Cloud Functions either as source code or a Docker image.
  • Sequence: A set of actions can be chained together into a sequence without having to write any code. A sequence is a chain of actions, invoked in order, where the output of one action is passed as input to the next action.
  • Event: Examples of events include changes to database records, IoT sensor readings that exceed a certain temperature, new code commits to a GitHub repository, or simple HTTP requests from web or mobile apps.
  • Trigger: Triggers are a named channel for a class of events. A trigger is a declaration that you want to react to a certain type of event, whether from a user or by an event source.
  • Rule: A rule associates a trigger with an action. Every time the trigger fires, the rule uses the trigger event as input and invokes the associated action.

Read more on the concepts and terminology in the IBM Cloud Functions documentation.
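To make the Action concept concrete, here is a minimal sketch of a Python action. A Python action exposes a main function that receives the invocation parameters as a dict and returns a JSON-serializable dict; the "name" parameter below is purely illustrative.

import json

# Minimal sketch of a Cloud Functions (OpenWhisk) Python action.
# The platform calls main() with the invocation parameters as a dict
# and expects a dict (serialized to JSON) as the result.
def main(params):
    # "name" is an illustrative parameter, not one the platform defines.
    name = params.get("name", "world")
    return {"message": "hello " + name}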

Step 1: Create Trigger

Without going into too much detail, this step creates a trigger on a specific bucket, pmobucket, that fires when objects with the name prefix dailyreport* are written to the bucket.

The trigger then appears in my list of triggers in the Cloud Functions console.

Step 2: Create Action

In this step, we create an action and associate it with the trigger created in the previous step. You can choose any runtime, such as Node.js or Go; in this case I have chosen Python 3.

For the sake of the demo, the action is a simple Livy invocation using the Python requests library, submitting a predefined Spark application called dailyreportanalysis.py:

import requests

def main(params):
    # Livy batches endpoint on the IBM Analytics Engine cluster
    url = 'https://chs-mmm-007-mn001.us-south.ae.appdomain.cloud:8443/gateway/default/livy/v1/batches'
    # Spark application to submit, already present on the cluster
    data = '{ "file": "/user/clsadmin/dailyreportanalysis.py" }'
    # Submit the batch job; Livy requires the X-Requested-By header
    resp = requests.post(url, data=data,
                         auth=('clsadmin', 'password'),
                         headers={'Content-Type': 'application/json',
                                  'X-Requested-By': 'livy'})
    print(resp)
    return {'message': resp.status_code}

The action I created then shows up in my list of actions.
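In a real workflow you would usually tell the Spark job which object to process. As a hedged sketch (the event parameter names below are assumptions; check what your trigger actually passes to the action), the action could read the bucket and object key from the trigger event and forward them to the Spark application through Livy's args field:

import json
import requests

def main(params):
    # Assumption: the object storage trigger passes the bucket name and
    # object key of the new upload; verify the actual parameter names.
    bucket = params.get('bucket', 'pmobucket')
    key = params.get('key', 'dailyreport-latest.csv')

    # Same illustrative Livy endpoint and credentials as above
    url = 'https://chs-mmm-007-mn001.us-south.ae.appdomain.cloud:8443/gateway/default/livy/v1/batches'
    payload = {
        'file': '/user/clsadmin/dailyreportanalysis.py',
        # Passed through to the Spark application as command-line arguments
        'args': [bucket, key]
    }
    resp = requests.post(url, data=json.dumps(payload),
                         auth=('clsadmin', 'password'),
                         headers={'Content-Type': 'application/json',
                                  'X-Requested-By': 'livy'})
    return {'message': resp.status_code, 'bucket': bucket, 'key': key}

The Spark application would then pick up the bucket and key from its command-line arguments and read just that object.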

Step 3: Upload Data to Trigger the Action

Again, for the purpose of the demo, I use the console interface of Cloud Object Storage to upload the data that needs to be analyzed.
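Outside of a demo, the upload would typically come from the upstream system itself. Here is a minimal sketch using the ibm-cos-sdk Python client; the endpoint, credentials, and file name are placeholders rather than values from the original setup.

import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials and endpoint; substitute your own service values.
cos = ibm_boto3.client('s3',
                       ibm_api_key_id='<API_KEY>',
                       ibm_service_instance_id='<SERVICE_INSTANCE_CRN>',
                       config=Config(signature_version='oauth'),
                       endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud')

# The dailyreport prefix matches the trigger created in Step 1.
cos.upload_file('dailyreport-2020-04-26.csv', 'pmobucket', 'dailyreport-2020-04-26.csv')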

This results in the Spark application getting invoked automatically, as you can see from the Ambari UI of the Analytics Engine cluster.
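If you prefer not to open Ambari, you can also poll Livy for the state of the submitted batch. A small sketch, assuming the batch id returned by the POST in Step 2 (the cluster URL and credentials are the same illustrative ones used above):

import requests

# Assumption: batch id 0 is the job submitted above; take the real id
# from the JSON body returned by the POST to /batches.
url = 'https://chs-mmm-007-mn001.us-south.ae.appdomain.cloud:8443/gateway/default/livy/v1/batches/0'
resp = requests.get(url, auth=('clsadmin', 'password'),
                    headers={'X-Requested-By': 'livy'})
print(resp.json().get('state'))  # e.g. 'starting', 'running', 'success'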

Conclusion

IBM Cloud Functions is a powerful tool for stitching together the flow of processing across several services. We have only scratched the surface for the purpose of this demo.

If not for Functions? The same thing could have been achieved with external or internal workflow tools (Oozie, for example). However, it would have required a lot more setup, code, and configuration. It took me under an hour to create the steps for this particular demo.

Happy exploring around IBM Cloud!

This story has been co-authored with Daniel Lopez Sainz, services consultant for IBM Cloud and Cognitive Software.
