Big Data Going Serverless

Publicis Sapient
Engineered @ Publicis Sapient
10 min read · Jan 28, 2020

Puneet Babbar, Senior Manager, Technology

The first thing that comes to mind when we talk about a Big Data application is a Hadoop cluster. What is the size of the cluster, and how many nodes does it have? What is the configuration of the nodes? Which distribution provider are we using: Cloudera, Hortonworks*, MapR or vanilla Hadoop (a brave option to go with)? Which version of Hadoop, and which versions of the various components within the distribution?

If your cluster is managed by a support team, or you use cloud infrastructure, you don't have to worry about most of this. Still, as a Data engineer, you need to think about the volume of data your application needs to process, how much processing power you need, the rate at which that data will grow, and any seasonal spikes.

What if I told you that you could forget about all of this and let a serverless platform handle it for you? That is exactly the point: a serverless platform lets the developer focus on the application, not the infrastructure.

In this blog, I will talk about why Serverless is so well suited for Big Data processing.

3 key takeaways from this blog:

- Serverless offerings by various cloud providers to set up a Big Data application

- Design patterns for building real-time and batch pipelines with cloud-native applications

- A log aggregator application built on the AWS serverless toolset, with code and learnings

What is Serverless?

Serverless is a computational model in which the cloud provider is responsible for executing the code (a function) by dynamically allocating resources. This model simplifies the building, deployment and management of cloud-scale applications. Instead of worrying about data infrastructure such as server procurement, configuration and management, a Data engineer can focus on the tasks it takes to deliver an end-to-end, well-functioning data pipeline. It is worth mentioning that even though the cloud provider abstracts the infrastructure away, servers are still involved in executing the code.

The Cloud provider runs the code in its own containers using Functions as a Service (FaaS), and these functions in turn communicate with Backend as a Service (BaaS) offerings for data storage. A function can be triggered from various sources: HTTP requests, database events, queuing services, monitoring alerts, file uploads, scheduled events (cron jobs), and so on.
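To make the model concrete, here is a minimal sketch of what such a function can look like in Python, assuming an AWS Lambda-style handler reacting to a file-upload (S3 object created) event; the bucket and the behaviour are purely illustrative:

```python
import json

def lambda_handler(event, context):
    """Minimal FaaS handler: the provider calls this with the triggering event.

    Here we assume an S3 object-created notification; HTTP requests, queue
    messages and cron schedules all arrive through the same (event, context) shape.
    """
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object uploaded: s3://{bucket}/{key}")

    return {"statusCode": 200, "body": json.dumps("processed")}
```

The provider decides when and where this handler runs; the developer supplies only the function body and the trigger configuration.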

There is also a good blog that covers this in further detail.

Offerings by Cloud Service Providers

All the major Cloud providers, i.e. AWS, GCP and Azure, have offerings in this space. AWS Lambda, Google Cloud Functions and Azure Functions are their respective FaaS offerings. In terms of data pipelines, functions are the core of the pipeline, as they provide data processing capabilities such as transformation, cleansing and validation; generally, some combination of these is carried out in this layer.

To build a data pipeline, in addition to FaaS we also need services for data collection, data streams and data storage. Each Cloud provider has offerings in all of these areas. We are going to deep-dive into these services and offerings.

* Cloudera and Hortonworks have recently merged.

Data Collection

Data collection is the most crucial step in integrating data from an organization's ecosystem into Cloud applications. Data migration can be split into two steps: the first is the one-time historical load, and the second is the ongoing data collection. Here, we focus on the ongoing collection, which could come from logs, event-driven sources (APIs), Change Data Capture (CDC) or IoT devices.

The data collection stage is where the source data for every pipeline originates. Offerings here include AWS Data Pipeline, AWS Glue and AWS IoT from AWS, Google Cloud Dataflow from GCP, and Azure Data Factory and Azure IoT Hub from Azure.

Data Streams

Data streams are the real-time ingestion services that enable us to process and analyze streaming data as it arrives. They also provide a real-time persistence layer for the data ingested via the data collection layer. For a serverless application where the number of data producers is not fixed, and the rate at which they produce data can vary, we need real-time storage that can scale up during a massive increase in incoming data and scale back down when the incoming rate slows.

AWS offers Amazon Kinesis and, in addition, Amazon Managed Streaming for Apache Kafka (Amazon MSK). MSK is a fairly new offering and, at the time of writing, supports only Kafka 1.1.1 and 2.1.0. Pub/Sub by GCP and Event Hubs by Azure are the offerings in this domain.

Along with streams for real-time data ingestion, there is also a class of services for real-time data processing. Here we have Amazon Kinesis Data Analytics, Google Cloud Dataflow and Azure Stream Analytics.
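As a rough sketch of how a producer hands events to such a stream, the snippet below uses boto3 to put a record onto a hypothetical Kinesis data stream called web-logs; the stream name and the partition-key choice are assumptions for illustration:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_event(event: dict) -> None:
    """Put one event onto a Kinesis data stream (stream name is hypothetical)."""
    kinesis.put_record(
        StreamName="web-logs",                       # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event.get("host", "default"),   # spreads records across shards
    )

publish_event({"host": "10.0.0.1", "path": "/index.html", "status": 200})
```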

Data Storage

The Serverless concept has become popular in the storage layer too, and various offerings with different use cases are now available. Whether relational or NoSQL, these BaaS offerings are fully managed databases: they automatically scale tables up and down to adjust capacity and maintain performance. Availability and fault tolerance are built in, eliminating the need to architect your applications for these capabilities. Serverless data storage also helps decouple the compute and storage layers.

AWS DynamoDB, AWS Aurora, AWS Athena, Azure Cosmos DB, Azure SQL Database Serverless, Google BigQuery, Google BigTable and Google Datastore are offerings in this segment from various Cloud providers.
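As an example of the "no capacity planning" idea, the hypothetical snippet below creates a DynamoDB table in on-demand (PAY_PER_REQUEST) mode so read/write capacity follows traffic; the table name and key schema are placeholders:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create a table in on-demand (PAY_PER_REQUEST) mode so capacity scales with traffic;
# the table name and key schema are illustrative only.
dynamodb.create_table(
    TableName="log-summary",
    AttributeDefinitions=[{"AttributeName": "host", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "host", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```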

Design Pattern Across Providers

Let's now focus on how all of these services work in tandem to build up a serverless data pipeline. As we know by now, serverless computing is all about focusing on individual pieces of logic that are repeatable and stateless. Each service discussed above helps build these pieces, and the arrangement of the pieces depends on the use case.

The use case we are going to discuss is event-driven and covers data pipelines for ingestion, transformation and aggregation. It can support both batch and real-time processing, so let us cover the batch and real-time architectures in turn.

Batch Processing

In the case of batch processing, the pattern below shows how we can leverage the serverless framework:

In batch processing, the sources are files uploaded to the data lake (such as S3, Google Cloud Storage or Azure Storage) at regular intervals or nightly. All Cloud providers offer event triggers on object creation or modification in their data lakes, so once the source data arrives in the data lake, via SFTP or a data dump from an OLTP database, the function is triggered. These functions perform the ETL: they pull data from the data lake, run transformations and move the data into the data warehouse.
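A minimal sketch of this trigger, assuming an AWS setup: a Lambda function is subscribed to S3 object-created events, reads the new file, applies a trivial transformation and writes the result to a warehouse staging bucket. The bucket names and the filtering logic are placeholders, not the only form this pattern takes:

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "warehouse-staging"   # placeholder destination bucket

def lambda_handler(event, context):
    """Triggered by object creation in the data lake bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Placeholder transformation: keep only rows whose last column is a 2xx status.
        rows = [r for r in csv.reader(io.StringIO(body)) if r and r[-1].startswith("2")]

        out = io.StringIO()
        csv.writer(out).writerows(rows)
        s3.put_object(Bucket=STAGING_BUCKET, Key=f"clean/{key}", Body=out.getvalue())
```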

The toughest part of batch processing is estimating the load, i.e. how much data needs to be processed in the next increment. Serverless in this scenario provides a cost-efficient way to scale automatically according to that load. With a Serverless architecture, we no longer need to revisit ETL performance and tune cluster (Hadoop/Spark) settings such as executors, driver memory and so on.

Real-time processing

A real-time processing engine is required for sources that produce data continuously, such as clickstreams, social media, IoT devices or microservices. The architecture below provides an overview of how real-time analytical platforms can be built on Serverless:

The Serverless architecture suits real-time data processing well: data sources rarely produce data at a constant velocity, and Serverless provides a processing platform that can handle any amount of data with consistent throughput and write the results to the data serving layer.

Log Aggregator Application — Serverless

In this section, I will give you an overview of a Big Data application I implemented on serverless rails on AWS. It is a log aggregator application built on the Lambda architecture, so both real-time and batch processing run simultaneously on the same data source. The code base and implementation are available in this GitHub repository.

The Lambda architecture enables both real-time analytics, such as fraud detection or site/node failure detection, and insights from the historical load, such as log analysis.

For ease of implementation, I created a Lambda function (in Python 3) to mock Apache logs and produce a stream of log lines; this function is triggered every minute. The web logs generator function delivers its output directly to a Kinesis Data Firehose delivery stream, which in turn collects the logs and stores them in the configured S3 buckets.
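The sketch below conveys the flavour of such a generator rather than the exact code in the repository: it fabricates Apache common-log style lines and delivers them to a Firehose delivery stream via boto3; the stream name is a placeholder:

```python
import datetime
import random
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "apache-logs-delivery"   # placeholder Firehose delivery stream name

def lambda_handler(event, context):
    """Scheduled every minute; emits a small batch of mock Apache access-log lines."""
    now = datetime.datetime.utcnow().strftime("%d/%b/%Y:%H:%M:%S +0000")
    records = []
    for _ in range(100):
        ip = f"10.0.{random.randint(0, 255)}.{random.randint(1, 254)}"
        path = random.choice(["/index.html", "/cart", "/checkout", "/api/items"])
        status = random.choice([200, 200, 200, 301, 404, 500])
        size = random.randint(200, 5000)
        line = f'{ip} - - [{now}] "GET {path} HTTP/1.1" {status} {size}\n'
        records.append({"Data": line.encode("utf-8")})

    # PutRecordBatch accepts up to 500 records per call.
    firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
```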

Batch Processing

In the S3 bucket, the logs are stored as CSV files in YEAR/MONTH/DAY/HOUR folders based on the timestamp at which they were generated. AWS Glue, a fully managed ETL service, is then used to create a Data Catalog, extract metadata from these CSV logs, cleanse them and store them in a more optimized format such as Parquet in another S3 bucket. Once these logs are in S3, Athena, a serverless interactive query service, can be used to analyze the data using standard SQL.
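Once the Glue job has catalogued and converted the logs, a query such as the one below can be issued from the Athena console or programmatically; the database, table and result-bucket names here are assumptions:

```python
import boto3

athena = boto3.client("athena")

# Count requests per status code over the cleansed Parquet logs;
# database, table and result location are illustrative names.
athena.start_query_execution(
    QueryString="""
        SELECT status, COUNT(*) AS hits
        FROM apache_logs_parquet
        GROUP BY status
        ORDER BY hits DESC
    """,
    QueryExecutionContext={"Database": "weblogs_db"},
    ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
)
```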

Real-time Processing

For the real-time processing pipeline, I created a Kinesis Data Analytics application on top of the Kinesis Data Firehose stream carrying the Apache logs. A pre-processor Lambda function is triggered every time a batch of records is delivered from the Firehose. This pre-processor creates JSON output in the schema expected by the Kinesis Data Analytics application, referred to as the Kinesis Data Firehose request data model; details can be found at this AWS link. In the Kinesis Data Analytics application, I then created in-application streams using the SQL editor: one aggregates the web logs, and another detects anomalies in the stream using the RANDOM_CUT_FOREST function provided out of the box by AWS. The anomaly stream can be fed into a Lambda, which in turn can trigger Amazon Simple Notification Service (SNS). In this implementation, I persisted the aggregated stream via a Firehose delivery stream to a Redshift table.
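The essence of the pre-processor is reshaping each incoming record into that request data model: every output record must echo the recordId, report a result of Ok, Dropped or ProcessingFailed, and carry base64-encoded data. The sketch below shows that shape with deliberately simple parsing; the JSON field names are illustrative and not the exact schema used by the repository:

```python
import base64
import json

def lambda_handler(event, context):
    """Reshape raw Apache log lines into JSON for the analytics application.

    Each incoming record carries base64-encoded data; every output record must
    echo the recordId, report Ok/Dropped/ProcessingFailed, and re-encode the data.
    """
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        parts = line.split()
        try:
            # Field names here are illustrative, not the exact schema used in the repo.
            payload = {"host": parts[0], "request": parts[6], "status": int(parts[8])}
            data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8")).decode("utf-8")
            result = "Ok"
        except (IndexError, ValueError):
            data, result = record["data"], "ProcessingFailed"
        output.append({"recordId": record["recordId"], "result": result, "data": data})
    return {"records": output}
```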

Key learning

While implementing this application, I picked up some important learnings that are worth sharing, and worth reading if you plan to go this route. To simplify this section, I have categorized them.

Scalability

Serverless shines in this area, but there are a few considerations to keep in mind. Firstly, consider going completely cloud-native with a Serverless tool set; for example, for ETL processing, go with Glue instead of an EMR cluster. Secondly, decouple your architecture into a more asynchronous model wherever possible. Lastly, be aware of the concurrency limits on Lambda functions and the per-region function scaling levers in AWS; check this document for more details.
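For example, concurrency for a single function can be reserved (and thereby capped) so that a spike in one pipeline cannot exhaust the shared per-region pool; the function name and limit below are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve (and thereby cap) concurrency for one function so a traffic spike in this
# pipeline cannot consume the whole per-region account concurrency pool.
# The function name and limit are placeholders.
lambda_client.put_function_concurrency(
    FunctionName="log-preprocessor",
    ReservedConcurrentExecutions=50,
)
```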

Language to use

AWS Lambda supports multiple programming languages. While I was writing this blog, it supported seven: Node.js, Python, Ruby, Java, Go, C# and PowerShell. For the current set of supported languages, check this link.

The above example is in Python, an interpreted scripting language that is well suited to data processing applications. While choosing the language, it is critical to understand how you will resolve dependencies; in AWS Lambda, this is done via Layers. These blogs around Lambda Layers are worth a read: this one for Python and this one for Java. The GitHub repo linked above contains the layer zip required to run this Python application.
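As a sketch, a dependency layer built from a zip already uploaded to S3 can be published like this, after which the function is configured to reference the returned LayerVersionArn; the layer name, bucket, key and runtime list are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish a layer whose zip (with dependencies packaged under python/) is already in S3;
# layer name, bucket, key and runtime list are placeholders.
response = lambda_client.publish_layer_version(
    LayerName="weblog-deps",
    Content={"S3Bucket": "my-artifacts-bucket", "S3Key": "layers/weblog-deps.zip"},
    CompatibleRuntimes=["python3.8"],
)
print(response["LayerVersionArn"])
```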

Fault Tolerance

In the above pipeline, we made the application fault-tolerant by managing how the pre-processor Lambda function is invoked and how errors and re-processing are handled. When Lambda is invoked via Kinesis, it reads records in batches from the shards. If an error is returned, Lambda retries the batch until processing succeeds or the data expires. Lambda now provides customizable controls: you can decide how function errors and retries affect stream processing invocations using properties such as MaximumRetryAttempts, MaximumRecordAgeInSeconds, BisectBatchOnFunctionError and an on-failure destination. Make sure you set the destination for on-failure records; it helps with auditing and tracking down data issues, and this particular property was especially useful for us when debugging a data mapping issue.
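A rough sketch of wiring these controls up for a generic Kinesis-to-Lambda event source mapping is shown below; the ARNs, function name and values are placeholders, and this is not necessarily how the repository configures its own trigger:

```python
import boto3

lambda_client = boto3.client("lambda")

# Attach a Kinesis stream to a function with explicit error-handling controls;
# the stream ARN, queue ARN, function name and numbers are placeholders.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/apache-logs",
    FunctionName="log-preprocessor",
    StartingPosition="LATEST",
    BatchSize=400,
    MaximumRetryAttempts=2,             # stop retrying a poison batch after two attempts
    MaximumRecordAgeInSeconds=3600,     # skip records older than one hour
    BisectBatchOnFunctionError=True,    # split the batch to isolate the failing record
    DestinationConfig={                 # ship failed records somewhere auditable
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:failed-log-records"}
    },
)
```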

Volume Metrics

With this framework, we were able to process a peak volume of 5,000 records/sec, with an estimated record size of around 1 KB and an average rate of around 100 records/sec. AWS Lambda invocations were calculated to vary between roughly 600K and 1M per month. At the peak rate, the real-time pipeline ingested data into Redshift within 300 to 500 milliseconds. For the above calculation, I configured the Kinesis Firehose buffer size at 100 MB, the Kinesis Firehose buffer interval at 5 minutes, and the memory of the pre-processor Lambda at 1,792 MB.

Deployment

AWS CloudFormation and CodePipeline are the best combination for CI/CD on AWS. I suggest you spend extra time getting the CloudFormation scripts right; use the existing templates on the AWS documentation site and modify them. Also, make sure to use the parameter override functionality in CodePipeline to override values in a CloudFormation template configuration file.

Drawbacks of a Serverless approach

I have already mentioned the benefits of Serverless technology: simplicity of management, no upfront commitment and reduced operational cost, to name a few. Read this detailed blog to find out more about both the advantages and the disadvantages.

Conclusion

Serverless architecture is a new approach to writing and deploying applications that allows developers to focus on code. It can reduce time to market, operational costs and system complexity, which matters especially for Big Data applications, where we face the problems of the four Vs: volume, variety, velocity and veracity. To handle these, a Serverless architecture is an excellent fit.
