Running Serverless Spark Applications with AWS Lambda

Ramon Marrero · Published in Geek Culture · Jun 28, 2021

Leverage Amazon SageMaker Processing to run serverless Spark applications from AWS Lambda.

Data illustrations by Storyset

A big data processing framework as widely known as Apache Spark needs no introduction. If you are reading this post, you most likely know what you are getting into, and, just like me, you are curious whether it is possible to run serverless Spark jobs from an AWS Lambda function.

It also means you are familiar with AWS and with serverless services such as AWS Lambda.

That being said, we all know that a little bit of context “never hurt nobody”. So let’s start with Spark!

As an analytics engine for large-scale data processing, Spark relies on underlying infrastructure and other software dependencies. Lucky for us, cloud providers such as AWS remove the heavy lifting of installing, upgrading, and maintaining Apache Spark and its dependencies. Better yet, managed services such as Amazon EMR and Amazon SageMaker let us avoid configuring and maintaining the underlying infrastructure and operating systems altogether.

The idea for what you are about to learn came to me while working with Amazon SageMaker Studio. During the development of a recent ML project, I noticed that the SageMaker Processing feature for running Spark applications, simply put, was a…
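Before going further, here is the shape of the pattern this post builds on: an AWS Lambda function calls the SageMaker CreateProcessingJob API, pointing it at a SageMaker Spark container image and a PySpark script stored in S3. The sketch below is a minimal illustration, not the exact code from this project; ROLE_ARN, SPARK_IMAGE_URI, and CODE_S3_URI are hypothetical environment variables, and you would substitute your own IAM role and the region-specific SageMaker Spark image URI.

```python
import os
import time

import boto3

sagemaker = boto3.client("sagemaker")


def lambda_handler(event, context):
    # SageMaker requires processing job names to be unique.
    job_name = f"spark-processing-{int(time.time())}"

    sagemaker.create_processing_job(
        ProcessingJobName=job_name,
        RoleArn=os.environ["ROLE_ARN"],  # hypothetical: an IAM role SageMaker can assume
        AppSpecification={
            "ImageUri": os.environ["SPARK_IMAGE_URI"],  # hypothetical: SageMaker Spark image for your region
            # The SageMaker Spark images expose an smspark-submit entrypoint
            # that wraps spark-submit.
            "ContainerEntrypoint": [
                "smspark-submit",
                "/opt/ml/processing/input/code/app.py",
            ],
        },
        # Ship the PySpark script from S3 into the container.
        ProcessingInputs=[
            {
                "InputName": "code",
                "S3Input": {
                    "S3Uri": os.environ["CODE_S3_URI"],  # hypothetical: S3 prefix holding app.py
                    "LocalPath": "/opt/ml/processing/input/code",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                },
            }
        ],
        # The "cluster" Spark runs on: provisioned per job, billed per second,
        # and torn down when the job finishes.
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 2,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )
    return {"ProcessingJobName": job_name}
```

Because SageMaker provisions the Spark cluster per job and tears it down when the job finishes, the Lambda function itself stays small and short-lived: it only submits the job, which is what makes this setup feel "serverless" end to end.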
