Serverless Data Processing on AWS
In this post, we will explain different ways of building serverless data processing workflows on AWS using AWS Lambda, AWS Step Functions, and AWS Glue. Serverless computing has revolutionized the way developers build and deploy modern applications: instead of managing infrastructure, developers can focus on writing code.
Reference Architecture
Most serverless data pipelines consist of several processing stages, from ingestion to visualization.
Data ingestion
Data is ingested into an Amazon S3 bucket. This could be data from a variety of sources, such as IoT devices, web applications, or data warehouses.
Trigger
When data is added to the S3 bucket, an event is triggered that starts a Step Functions state machine (a minimal wiring sketch follows the stage list below).
Distributed map
The state machine uses a distributed map to process the data in parallel. Each item in the data set is processed independently, with the results aggregated at the end.
Data transformation
Each item in the data set is processed by a Lambda function that transforms the data into the desired format.
Data storage
The transformed data is stored in an Amazon S3 bucket, Amazon DynamoDB table, or other data store, depending on the requirements of the application.
Analytics
The transformed data can be analyzed using services such as Amazon Athena, Amazon Redshift, or Amazon EMR, depending on the complexity and volume of the data.
Visualization
The results of the analysis can be visualized using services such as Amazon QuickSight, allowing users to explore the data and gain insights.
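To make the trigger stage more concrete, the following is a minimal sketch of wiring an S3 bucket to a Step Functions state machine via Amazon EventBridge using boto3. The bucket name, rule name, state machine ARN, and IAM role ARN are placeholders you would replace with your own resources; in practice this wiring would usually be defined with infrastructure as code.
import json
import boto3

s3 = boto3.client('s3')
events = boto3.client('events')

# Placeholder resource names and ARNs
bucket_name = 'my-ingestion-bucket'
state_machine_arn = 'arn:aws:states:us-east-1:123456789012:stateMachine:ProcessData'
events_role_arn = 'arn:aws:iam::123456789012:role/EventBridgeStartExecutionRole'

# Send S3 object-level events to EventBridge
s3.put_bucket_notification_configuration(
    Bucket=bucket_name,
    NotificationConfiguration={'EventBridgeConfiguration': {}}
)

# Rule that matches new objects in the ingestion bucket
events.put_rule(
    Name='start-data-pipeline',
    EventPattern=json.dumps({
        'source': ['aws.s3'],
        'detail-type': ['Object Created'],
        'detail': {'bucket': {'name': [bucket_name]}}
    })
)

# Start the state machine whenever the rule matches
events.put_targets(
    Rule='start-data-pipeline',
    Targets=[{'Id': 'pipeline', 'Arn': state_machine_arn, 'RoleArn': events_role_arn}]
)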
Serverless AWS data processing services
AWS Lambda
AWS Lambda is a serverless computing service that lets you run code without provisioning or managing servers. With Lambda, you can execute code in response to events such as changes to data in an Amazon S3 bucket or messages in an Amazon SNS topic. You pay only for the compute time you consume, with no minimum fees or upfront commitments.
To build a Lambda function, you need to write the code that executes your data processing logic. The following code snippet is an example of a Lambda function that reads data from an Amazon S3 bucket and processes it using the pandas library:
import boto3
import pandas as pd

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Extract the bucket and object key from the S3 event notification
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_key = event['Records'][0]['s3']['object']['key']

    # Load the CSV file from S3 into a pandas DataFrame
    file = s3.get_object(Bucket=bucket_name, Key=object_key)
    data = pd.read_csv(file['Body'])

    # Example transformation: double every value
    processed_data = data.apply(lambda x: x * 2)

    # Return the transformed data as a CSV string
    return processed_data.to_csv(index=False)
In this example, the Lambda function reads a CSV file from an Amazon S3 bucket, processes it using the pandas library, and returns the processed data as a CSV file.
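To try the function locally, you can invoke the handler with a minimal mock of the S3 event payload. The bucket and key below are hypothetical and only illustrate the structure Lambda passes in for S3 notifications:
# Hypothetical test event mimicking an S3 "ObjectCreated" notification
test_event = {
    'Records': [
        {
            's3': {
                'bucket': {'name': 'my-ingestion-bucket'},
                'object': {'key': 'incoming/data.csv'}
            }
        }
    ]
}

# Invoke the handler directly (the context argument is not used here)
print(lambda_handler(test_event, None))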
AWS Step Functions
AWS Step Functions is a serverless workflow service that lets you coordinate distributed applications and microservices using visual workflows. With Step Functions, you can build and run highly parallel workflows that scale to very high event rates.
To use Step Functions, you define a state machine that describes the workflow you want to execute. The following code snippet is an example of a state machine that processes data using the Lambda function we defined earlier:
{
    "StartAt": "ProcessData",
    "States": {
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessDataFunction",
            "End": true
        }
    }
}
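As a rough sketch of how such a definition can be deployed, the snippet below creates the state machine with boto3. The state machine name, IAM role ARN, and the local definition file name are assumptions for illustration; in practice you would typically deploy this via CloudFormation, CDK, or Terraform.
import boto3

sfn = boto3.client('stepfunctions')

# Load the state machine definition shown above (hypothetical local file)
with open('process_data_state_machine.json') as f:
    definition = f.read()

# The role must allow Step Functions to invoke the Lambda function
response = sfn.create_state_machine(
    name='ProcessDataStateMachine',
    definition=definition,
    roleArn='arn:aws:iam::123456789012:role/StepFunctionsExecutionRole'
)

print(response['stateMachineArn'])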
AWS Glue
AWS Glue is a fully managed ETL (extract, transform, load) service that makes it easy to move data between data stores. With Glue, you can create and run ETL jobs that can process petabytes of data per hour. Glue supports a variety of data sources including Amazon S3, Amazon RDS, and Amazon Redshift.
To build an ETL job using Glue, you need to define a script that describes the data processing logic. The following code snippet is an example of a Glue script that reads data from an Amazon S3 bucket, processes it using the pandas library, and writes the processed data to an Amazon S3 bucket:
import sys

import pandas as pd
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Create the Glue and Spark contexts
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Get the job parameters (passed to the job as --input_bucket and --output_bucket)
args = getResolvedOptions(sys.argv,
                          ['JOB_NAME', 'input_bucket', 'output_bucket'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from the input S3 bucket
input_bucket = args['input_bucket']
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")  # infer numeric types so the arithmetic below works
      .load("s3://{}/svens_input.csv".format(input_bucket)))

# Perform some data processing using the pandas library
pdf = df.toPandas()
pdf = pdf.drop(columns=['unnecessary_column'])
pdf['new_column'] = pdf['old_column'] + 1

# Convert the processed data back to a Spark DataFrame
df = spark.createDataFrame(pdf)

# Write the processed data to the output S3 bucket
output_bucket = args['output_bucket']
(df.write.format('csv')
   .option('header', 'true')
   .mode('overwrite')
   .save('s3://{}/processed_data.csv'.format(output_bucket)))

job.commit()
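Once the script is deployed as a Glue job, a run can be started with boto3 and the bucket names passed in as job arguments. The job name and bucket names below are placeholders; the job itself must already exist (created via the Glue console or infrastructure as code):
import boto3

glue = boto3.client('glue')

# Start a run of the (hypothetical) Glue job defined above,
# passing the bucket names as job arguments
response = glue.start_job_run(
    JobName='process-csv-job',
    Arguments={
        '--input_bucket': 'my-input-bucket',
        '--output_bucket': 'my-output-bucket'
    }
)

print(response['JobRunId'])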
AWS Step Functions distributed maps
AWS Step Functions provides a powerful way to coordinate the execution of serverless workflows. One of the key features of Step Functions is the ability to create parallel workflows using the distributed map, introduced in December 2022 [1]. The distributed map allows you to parallelize the execution of a task across a large dataset, making it an ideal choice for processing large amounts of data in a serverless environment.
Comparison Old vs New Maps
To give a better overview, here are the main differences between the original map state flow and the new distributed map, as described in the AWS blog post [2].
Original map state flow
- Sub-workflow for each item in an array with data from the previous task
- Each iteration produces an entry in the state machine's execution history
- Parallel execution with an effective maximum concurrency of around 40
- Input is limited to JSON arrays
- Payload is limited to 256 KB
- Event history is limited to 25,000 events
Distributed map
- Sub-workflows run as standard or express workflows
- Runs a fully separate sub-workflow, with its own execution history, for each item in an array or Amazon S3 dataset
- Multiple child executions with a concurrency of up to 10,000 executions at a time
- Input can be an Amazon S3 object list, JSON arrays or files, CSV files, or an Amazon S3 inventory
- No payload limitation, because items can be passed as S3 file references (file processing capability is limited by Lambda storage and memory)
- Event history for each iteration (child execution), with up to 25,000 events for standard workflows
- No event history limitation for express workflows
Distributed map example
To use a distributed map, you define a Map state in your Step Functions state machine. The Map state takes an array of input values and applies a specific task to each input value in parallel. Once all the tasks are complete, the output is returned as an array. The following is an example of how you can use the distributed map feature in Step Functions to process a large dataset of images stored in Amazon S3:
{
    "Comment": "Process images stored in S3 using a distributed map",
    "StartAt": "ProcessImages",
    "States": {
        "ProcessImages": {
            "Type": "Map",
            "ItemsPath": "$.images",
            "MaxConcurrency": 10,
            "ItemProcessor": {
                "ProcessorConfig": {
                    "Mode": "DISTRIBUTED",
                    "ExecutionType": "STANDARD"
                },
                "StartAt": "ProcessImage",
                "States": {
                    "ProcessImage": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-west-2:123456789012:function:process-image",
                        "End": true
                    }
                }
            },
            "End": true
        }
    }
}
In this example, we define a Step Functions state machine that takes an array of image objects as input, where each object contains the S3 bucket name and key for an image. We use the Map state in distributed mode (ProcessorConfig with Mode set to DISTRIBUTED) to process each image in parallel by calling a Lambda function named "process-image", with each item handled in its own child workflow execution. The MaxConcurrency parameter specifies the maximum number of concurrent child executions. Once all the images are processed, the output of each invocation is returned as an array.
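The following sketch shows what starting an execution of this state machine could look like with boto3, using a hypothetical input payload matching the shape described above (the state machine ARN, bucket name, and keys are placeholders):
import json
import boto3

sfn = boto3.client('stepfunctions')

# Hypothetical input: each item carries the bucket and key of one image
payload = {
    "images": [
        {"bucket": "my-image-bucket", "key": "uploads/photo-001.jpg"},
        {"bucket": "my-image-bucket", "key": "uploads/photo-002.jpg"}
    ]
}

# Start the distributed map execution (placeholder state machine ARN)
response = sfn.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:123456789012:stateMachine:ProcessImages',
    input=json.dumps(payload)
)

print(response['executionArn'])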
The Lambda function used in this example takes an image object as input and performs some image processing operation, such as resizing or cropping the image. The following is an example of a Lambda function that processes an image using the Python Pillow library:
import io

import boto3
from PIL import Image

s3 = boto3.client('s3')

def process_image(event, context):
    bucket_name = event['bucket']
    object_key = event['key']
    # Read the source image from S3
    image_file = s3.get_object(Bucket=bucket_name, Key=object_key)['Body']
    image = Image.open(image_file)
    # Resize the image to 256x256 pixels
    resized_image = image.resize((256, 256))
    # Serialize the resized image into an in-memory buffer before uploading
    buffer = io.BytesIO()
    resized_image.save(buffer, format=image.format or 'PNG')
    buffer.seek(0)
    # Write the processed image to a new S3 object
    output_key = f"processed/{object_key}"
    s3.put_object(Bucket=bucket_name, Key=output_key, Body=buffer)
    return {
        "bucket": bucket_name,
        "key": output_key
    }
The Lambda function takes an image object as input, reads the image from an S3 bucket, resizes it to 256x256 pixels, and writes the processed image to a new S3 object. The output of the Lambda function is an object that contains the S3 bucket name and key for the processed image.
Conclusion
AWS Lambda, AWS Step Functions with its distributed map, and AWS Glue are powerful tools that enable developers to build highly parallel serverless data processing workflows on AWS. By combining these services and parallelizing the execution of tasks across a large dataset, you can build complex data processing pipelines that scale to process petabytes of data per hour.
The code snippets provided in this article can help you get started building your own highly parallel serverless data processing workflows on AWS.
About the Author:
My name is Sven Leiss and I am a 5x certified AWS enthusiast and AWS Migration Blackbelt. I have been working in the AWS space for the past 7 years and have extensive knowledge of the AWS platform and its various services. I am passionate about helping customers get the most out of the cloud and have a great track record of successful implementations.
I have extensive experience in designing and implementing cloud architectures using AWS services such as EC2, S3, Lambda and more. I am also well versed in DevOps and AWS cloud migration journeys.
If you are looking for an experienced AWS expert, I would be more than happy to help. Feel free to contact me to discuss your cloud needs and see how I can help you get the most out of the cloud.