
This post gives you a quick walkthrough of AWS Lambda functions and of running Apache Spark on an EMR cluster through a Lambda function. It also explains how to trigger the function using other Amazon services such as S3.

What is AWS Lambda?

AWS Lambda is one of the ingredients in Amazon’s overall serverless computing paradigm, and it allows you to run code without provisioning or managing servers. Serverless computing is a hot trend in the software architecture world. It enables developers to build applications faster by eliminating the need to manage infrastructure. …
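As a sketch of the pattern the post describes, the handler below reacts to an S3 put event and submits a Spark step to a running EMR cluster via boto3. The cluster ID and the script path derived from the event are illustrative placeholders, not values from the post.

```python
import json

# Hypothetical EMR cluster ID -- replace with the ID of a running cluster.
CLUSTER_ID = "j-XXXXXXXXXXXXX"


def build_spark_step(bucket, key):
    """Build an EMR step definition that spark-submits the uploaded S3 object."""
    return {
        "Name": f"spark-job-{key}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's wrapper for running commands as steps
            "Args": ["spark-submit", f"s3://{bucket}/{key}"],
        },
    }


def lambda_handler(event, context):
    """Triggered by an S3 event; submits one Spark step per uploaded script."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    import boto3  # imported lazily; needs AWS credentials at runtime

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=CLUSTER_ID,
        Steps=[build_spark_step(bucket, key)],
    )
    return {"statusCode": 200, "body": json.dumps(response["StepIds"])}
```

Separating `build_spark_step` from the handler keeps the step definition testable without touching AWS.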


Apache Beam

Apache Beam (Batch + Stream) is a unified programming model that defines and executes both batch and streaming data processing jobs. It provides SDKs for building data pipelines and runners to execute them.

Apache Beam can provide value in use cases that involve data movement from different storage layers, data transformations, and real-time data processing jobs.

There are three fundamental concepts in Apache Beam, namely:

  • Pipeline — encapsulates the entire data processing task and represents a directed acyclic graph (DAG) of PCollections and PTransforms. It is analogous to a Spark context.
  • PCollection — represents a data set which can be a fixed batch or a stream of data. …


Ankita Kundra

Big Data Specialist, Software Developer.
