Lambda Architecture in AWS
Lambda architecture is a design pattern worth keeping in mind while designing big data platforms. To understand what lambda architecture provides, it is important to understand what is expected of a big data application platform. You can find a multitude of articles online that will give you different lists of needs based on different use cases.
To get started, let's first outline a few high-level needs:
- Support for quick ad hoc data retrieval of both batch and real time data
- Support for large scale updates and data processing
- Ability to recover from mistakes, popularly known as fault tolerance
Keep in mind that much of this overview can be found in other, more in-depth articles about the architecture. Nathan Marz's book Big Data could safely be considered a comprehensive guide to best practices for building data applications. The purpose of this article is to understand how we can build one of these architectures in the cloud, using AWS.
Now let's walk through the different layers that make up a lambda architecture. I'd love to also talk about all the other open source tools that can be used to build this out, but we still need to cover the AWS version, so Google those for now.
Batch Layer:
- The gold standard. Here resides the true source of data, the master behemoth, i.e. every bit of information ever generated.
- This layer can only be appended to with new incoming data. Ain’t nobody going to be messing with the gold standard.
- This layer depends on high-throughput services capable of processing large amounts of data. Additionally, the fault tolerance of the system is maintained in this layer. No record left behind.
- Predefined queries are also run here. Aggregations, computations, and derived metrics are all performed here for downstream use.
Serving Layer:
- This layer right here is the tricky one, in my opinion. This is where the batch layer's output gets cleverly indexed so that low-latency querying against it is possible.
- This layer is also supposed to provide the ability to query the data in the speed layer alongside the data in the batch layer. This is the hardest bit to achieve in a lambda architecture. We'll cover how to achieve it in AWS.
Speed Layer [My favorite]:
- My favorite, and by far the sexiest of layers, is the speed layer. This is where only tiny, recent snapshots of data are available. The beauty of this layer is that it supports processing and analysis on real-time streams of data. This means you know what's up, exactly when it's up.
- Without getting into implementation details, think of this layer as achieving close to the same thing as the processing of derived sets in the batch layer. The only difference: here we do it on a small cut of the latest data.
The point of the lambda architecture should be evident now. The batch layer maintains a historical store of processed data; the serving layer provides key-value access to derived metrics and data ready for analysis, machine learning, and visualization; and finally, the speed layer lets you do everything the previous layers do, only on the latest data. At last, you can get a view of the now and the everything-else, all at the same time.
Now let’s jump into AWS:
Source: Feed all your data into Kinesis Streams. Producers write to the stream through the Kinesis API (or the Kinesis Producer Library), while the KCL [Kinesis Client Library] lets your consumers read it back out downstream. The multi-sharded stream can handle incoming data in massive amounts, and records are retained for 24 hours. That means you have 24 hours to move your data into the batch layer.
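To make the source side concrete, here is a minimal producer sketch. The stream name `ingest-stream` and the `user_id` partition-key field are assumptions, not anything prescribed by AWS:

```python
import json

def make_records(events, key_field="user_id"):
    """Build the Records list for a single kinesis put_records call."""
    return [
        {
            "Data": json.dumps(e).encode("utf-8"),
            # The partition key decides which shard receives the record,
            # so a high-cardinality field spreads load across shards.
            "PartitionKey": str(e[key_field]),
        }
        for e in events
    ]

def send(events, stream_name="ingest-stream"):
    import boto3  # imported here so make_records stays testable offline
    kinesis = boto3.client("kinesis")
    # put_records accepts at most 500 records per call
    return kinesis.put_records(StreamName=stream_name,
                               Records=make_records(events))
```

Batching with `put_records` rather than one `put_record` per event is what lets a multi-sharded stream absorb data "in massive amounts."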
Batch Layer: Straight out of the Kinesis stream, you can use a large-scale consumer to dump data into S3 buckets. Do not do anything to this data, other than maybe a little bit of cleanup and formatting. This cleanup can be done very fast using an EMR cluster that fires off Hive jobs to pull the data from the stream and dump it into S3. You can also dump the raw data directly to S3, with no cleanup, using AWS Lambda functions.
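The "raw dump" Lambda function could look roughly like the sketch below. The bucket name and the date-partitioned key layout are my assumptions; the base64 decoding, however, is how Kinesis really hands records to Lambda:

```python
import base64
import json
from datetime import datetime, timezone

BUCKET = "my-master-dataset"  # assumed bucket name

def decode_records(event):
    """Kinesis delivers base64-encoded payloads to Lambda; decode to dicts."""
    return [json.loads(base64.b64decode(r["kinesis"]["data"]))
            for r in event["Records"]]

def object_key(now):
    """Date-partitioned keys let later EMR jobs prune whole days of input."""
    return now.strftime("raw/%Y/%m/%d/%H%M%S%f.json")

def handler(event, context):
    import boto3  # local import so the pure helpers run without AWS installed
    body = "\n".join(json.dumps(r) for r in decode_records(event))
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=object_key(datetime.now(timezone.utc)),
        Body=body.encode("utf-8"),
    )
```

Note the append-only discipline from the batch layer description: each invocation writes a new object, and nothing ever overwrites the gold standard.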
The processing of views can then be achieved by hitting the S3 buckets with heavy-duty jobs running on EMR clusters. Tools like Spark, Cascading, and Pig can all run on AWS EMR clusters, and the processed data can be loaded, in a key-value format, straight into an AWS Redshift cluster. This entire process can be automated using AWS Lambda [different from the Lambda in lambda architecture] and AWS Data Pipeline. Remember, the batch layer needs to provide historic aggregates. Use the EMR clusters to perform batch recomputes that keep your historic Redshift data as up to date as possible, and do iterative appends for data that doesn't need historic recomputation. Additionally, keep in mind that a lot of aggregates can be done extremely fast in Redshift itself; know which computation to do where. Redshift queries, copies, and EMR batch recomputes can all be triggered in the same pipeline. All this can take time, and your data will be hours out of date while you wait for the recomputes. Use your speed layer to fill in the gap.
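The load step at the end of a batch recompute is usually a Redshift COPY from S3. Here is a sketch; the table name, S3 prefix, dummy IAM role ARN, and the staging-table swap pattern are all assumptions on my part, not the only way to do it:

```python
def copy_statement(table, s3_prefix, iam_role):
    """Build the COPY statement that loads a recomputed view from S3."""
    return (
        f"COPY {table} "
        f"FROM 's3://{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto';"
    )

def swap_in(view_table):
    """Load into a staging table, then swap, so queries never see a half-load."""
    staging = f"{view_table}_staging"
    return [
        f"TRUNCATE {staging};",
        copy_statement(staging, f"views/{view_table}/",
                       "arn:aws:iam::123456789012:role/redshift-load"),  # dummy ARN
        f"ALTER TABLE {view_table} RENAME TO {view_table}_old;",
        f"ALTER TABLE {staging} RENAME TO {view_table};",
        f"ALTER TABLE {view_table}_old RENAME TO {staging};",
    ]
```

A Data Pipeline or Lambda trigger would run these statements in order once the EMR job signals completion, which is the "all triggered in the same pipeline" idea above.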
Serving Layer: Redshift is a petabyte-scale, column-oriented data warehouse queried with PostgreSQL-compatible SQL. In simpler words, Redshift becomes your serving layer. Low latency should be Redshift's middle name: your cluster will let you barrel through massive amounts of data, all retrieved over a JDBC connection.
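A serving-layer query then looks like any PostgreSQL query. The sketch below takes a DB-API connection (in Python, `psycopg2.connect(...)` pointed at the cluster endpoint is the JDBC equivalent); the `daily_metrics` table and its columns are assumptions:

```python
QUERY = """
SELECT metric_date, SUM(value) AS total
FROM daily_metrics
WHERE metric_date >= %s
GROUP BY metric_date
ORDER BY metric_date;
"""

def fetch_totals(conn, since):
    """Run the aggregate against any DB-API connection to the cluster."""
    with conn.cursor() as cur:
        cur.execute(QUERY, (since,))
        return cur.fetchall()
```

Because the batch layer already precomputed the view, this query is a cheap scan-and-sum rather than a recompute, which is exactly what makes the serving layer low latency.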
Speed Layer: The fun part. This is where Apache Storm comes in. We're going to be using some Storm lingo here, so read up on Storm before proceeding. An Apache Storm cluster can be deployed on AWS EC2 very easily. In addition, Storm now comes equipped with an AWS Kinesis spout. This allows us to write bolts that receive data directly from the same Kinesis stream that feeds our batch layer.
The scalability of Storm allows us to have large amounts of data available in real time in our speed layer. The data coming out of the bolts can be put back into the same Redshift cluster [different table, please!] and then queried beautifully right alongside the batch data. This will introduce a little lag in your data to accommodate the loads. Once again, this can all be automated using Data Pipeline and Lambda.
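With the speed layer writing to its own table in the same cluster, "querying alongside the batch data" can be a simple UNION ALL that takes historical rows from the batch view plus only the speed rows the last recompute hasn't covered yet. Both table names and the `computed_at` cutoff column below are assumptions:

```python
def merged_query(batch_table="events_batch", speed_table="events_speed"):
    """Historical rows plus only the speed rows newer than the batch cutoff."""
    return (
        f"SELECT user_id, event_count, computed_at FROM {batch_table} "
        "UNION ALL "
        f"SELECT user_id, event_count, computed_at FROM {speed_table} "
        f"WHERE computed_at > (SELECT MAX(computed_at) FROM {batch_table});"
    )
```

The WHERE clause on the speed table is what prevents double counting once a batch recompute catches up to events the speed layer already served.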
An additional perk: ElastiCache can be used with Storm to build a sliding-window application for visualizing real-time data. This extra piece lets you look at data without the Redshift load lag; the downside is that it cannot be queried alongside the batch data without additional engineering.
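The sliding-window pattern on ElastiCache (Redis) is typically a sorted set: ZADD each event with its timestamp as the score, then ZREMRANGEBYSCORE to drop events older than the window before counting. The sketch below mirrors that logic against a plain dict so it runs without a Redis server; the Redis command mapping in the comments is the assumption to verify against your ElastiCache setup:

```python
import time

class SlidingWindow:
    """In-memory stand-in for a Redis sorted-set sliding window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = {}  # member -> timestamp, like a sorted set's score

    def add(self, member, ts=None):
        # Redis equivalent: ZADD key <timestamp> <member>
        self.events[member] = ts if ts is not None else time.time()

    def count(self, now=None):
        now = now if now is not None else time.time()
        cutoff = now - self.window
        # Redis equivalent: ZREMRANGEBYSCORE key -inf <cutoff>, then ZCARD key
        self.events = {m: t for m, t in self.events.items() if t > cutoff}
        return len(self.events)
```

A dashboard polling `count()` every second gets a rolling "events in the last N seconds" view straight off the stream, with none of the Redshift load lag.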
I understand that this article doesn't really tell you how to do anything; it just outlines the different technologies you can use. Each component of this architecture in AWS has its own set of use cases. I've tried my best to link out to articles with explanations and tutorials. Try implementing bits and pieces at a time, and hit me up if you have any questions; I've implemented each of these components at some point or another. Lastly, this isn't a quick-deploy tutorial: this entire system will take time, and maybe a team, to both build and maintain. It is, however, worth the effort.