Hands-on AWS E-Commerce Project: More Than Just Theory

Learn and put your Data Engineering knowledge into practice

Andreas Kretz
Plumbers Of Data Science
5 min read · Apr 27, 2022


I’m sure you know it. You take courses and work your way through a lot of theoretical material, but you lack the practice to really feel secure in your job as a data engineer. For all of you in that situation, I have put together several hands-on example projects in my Data Engineering Academy. There, you learn how to use the most important Data Engineering tools, such as MongoDB and Apache Kafka, how to create your own data pipelines, how to work on cloud platforms like AWS or Microsoft Azure, and how to process data efficiently.

One of these projects is “Data Engineering on AWS”. This AWS project not only gives you an understanding of how data engineering is done in the real world, it is also perfect for everyone who wants to get started with cloud platforms. Currently, AWS is the most used platform for data processing. It is really great to use, especially for people who are new in their Data Engineering job or looking for one.

Working through the project, you learn how to set up a complete end-to-end e-commerce project with a streaming and an analytics pipeline. Along the way, you learn how to model data and which AWS tools are important, such as Lambda, API Gateway, Glue, Redshift, Kinesis, and DynamoDB.

The Dataset

For the AWS project, you can use data from a transactional e-commerce dataset. It contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

You can use it for both the streaming pipeline and the batch pipeline. The only difference is that in the streaming part you send in single invoices and their items, while in the batch part you use the complete CSV file.

Building the Streaming Pipeline

While building your streaming pipeline, there are five major parts to consider: Ingest, Buffer, Process, Store, and Visualize. In the following I show you how you could work through each step of your project:

For the ingestion part, you can use a Lambda function as well as an API to enable a client to send in data. This data could be transactions with invoice numbers and items, for example.
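
To make that concrete, here is a minimal sketch of such an ingestion Lambda, assuming an API Gateway proxy integration and a Kinesis stream called ecommerce-invoices (both names are just placeholders for this example). It accepts the POSTed invoice and pushes it straight into the buffer:

```
import json
import boto3

kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    # API Gateway proxy integration delivers the POST body as a JSON string
    invoice = json.loads(event["body"])

    # push the invoice into the Kinesis buffer (stream name is a placeholder)
    kinesis.put_record(
        StreamName="ecommerce-invoices",
        Data=json.dumps(invoice).encode("utf-8"),
        PartitionKey=str(invoice["InvoiceNo"]),  # column from the dataset
    )
    return {"statusCode": 200, "body": json.dumps({"status": "accepted"})}
```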

For buffering, I recommend using AWS Kinesis, as it is a message queue that is very easy to use and has well-manageable features, which makes it perfect for beginners.

For the processing step, you have two different Lambda functions. One pulls data from Kinesis and writes it into your NoSQL database. The other one stores the data as files in AWS S3. Those two functions are easy to create and can be easily integrated with Kinesis, which I definitely recommend using here instead of an open source platform like Apache Kafka.
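
As an illustration, this is roughly what the S3-writing Lambda could look like when it is triggered by Kinesis; the bucket name and key layout are assumptions for this sketch:

```
import base64
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "ecommerce-raw-invoices"  # placeholder bucket name

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded
        invoice = json.loads(base64.b64decode(record["kinesis"]["data"]))
        key = f"invoices/{invoice['InvoiceNo']}.json"
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=json.dumps(invoice).encode("utf-8"))
```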

DynamoDB is a great document store NoSQL database and perfect for your usage in the storage part. Here you can swiftly save invoices and their items. DynamoDB is affordable and scales out very well horizontally. Thus, it is not only ideal for beginners but also great to use when you have multiple servers side by side.
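
One possible table layout, sketched here with boto3, uses the invoice number as partition key and the stock code as sort key, so every line item of an invoice becomes its own item. Table and attribute names are my assumptions, not fixed by the project:

```
import boto3

table = boto3.resource("dynamodb").Table("Invoices")  # placeholder table name

def store_invoice(invoice):
    # write every line item of the invoice as its own DynamoDB item
    with table.batch_writer() as batch:
        for item in invoice["Items"]:
            batch.put_item(Item={
                "InvoiceNo": str(invoice["InvoiceNo"]),   # partition key
                "StockCode": str(item["StockCode"]),      # sort key
                "Description": item["Description"],
                "Quantity": item["Quantity"],
                "UnitPrice": str(item["UnitPrice"]),      # keep decimals as strings
                "CustomerID": str(invoice["CustomerID"]),
            })
```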

For the visualization you could build a dashboard, but this is not mandatory for this project. As the data is sent in via an API, it is also retrieved via an API after it has been processed and stored.

The advantage here is that anyone can build a client that accesses the API, and the developer of the pipeline does not need to program or specify the visualization. Using an API for visualization is a nice and useful method to give access to data, which also adds the security layer of the API.
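
A hedged sketch of that read side could look like this: a Lambda behind API Gateway that queries DynamoDB by invoice number and returns all its items. The path parameter and table name follow the assumptions from the storage sketch above:

```
import json
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Invoices")  # placeholder table name

def lambda_handler(event, context):
    # e.g. GET /invoices/{invoice_no} through API Gateway
    invoice_no = event["pathParameters"]["invoice_no"]
    response = table.query(KeyConditionExpression=Key("InvoiceNo").eq(invoice_no))
    return {"statusCode": 200,
            "body": json.dumps(response["Items"], default=str)}
```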

Getting Into the Batch Pipeline Part

Just as with the streaming pipeline, you work through the steps Ingest, Process, Store, and Visualize for the batch pipeline; only the Buffer part is missing, as it is not relevant here. This is what I recommend for working through each part and which tools to use:

As you are not using an API for ingestion, the data lands as a file in S3, since clients usually send in their data as files. This is very typical for an ETL job, where the data is extracted from S3 and then transformed and stored.
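
The ingest step itself can then be as simple as dropping the full CSV into a landing bucket, for example with boto3 (bucket and key are placeholders):

```
import boto3

s3 = boto3.client("s3")
s3.upload_file("online_retail.csv",
               "ecommerce-batch-landing",   # placeholder bucket
               "input/online_retail.csv")   # placeholder key
```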

A typical tool to use for processing is AWS Glue. With this tool, you can write Apache Spark jobs that pull the data from S3, transform it, and store it in the destination, for which I recommend Redshift.
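
A rough outline of such a Glue job could look like the following: it reads the catalogued CSV, maps the dataset’s columns to cleaner names, and writes the result to Redshift through a catalogued JDBC connection. Database, table, and connection names are assumptions for this sketch:

```
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# source: the CSV files in S3, registered in the Glue Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="ecommerce", table_name="online_retail_csv"
)

# map the dataset's columns to the staging table's columns
mapped = ApplyMapping.apply(frame=source, mappings=[
    ("invoiceno", "string", "invoice_no", "string"),
    ("stockcode", "string", "stock_code", "string"),
    ("quantity", "long", "quantity", "long"),
    ("invoicedate", "string", "invoice_date", "string"),
    ("unitprice", "double", "unit_price", "double"),
    ("customerid", "string", "customer_id", "string"),
])

# target: Redshift, via a connection defined in the Data Catalog
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "staging_invoices", "database": "ecommerce"},
    redshift_tmp_dir="s3://ecommerce-glue-temp/",
)
job.commit()
```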

Besides the Glue jobs, another part you should make use of is the Data Catalog. It catalogs the CSV files within S3 as well as the data in Redshift. This makes it very easy to automatically configure and generate a Spark job in AWS Glue that sends the data to Redshift.
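
One way to fill the catalog is a Glue crawler pointed at the landing bucket, sketched here with boto3; the role ARN, database, and names are placeholders:

```
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="ecommerce-csv-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="ecommerce",
    Targets={"S3Targets": [{"Path": "s3://ecommerce-batch-landing/input/"}]},
)
glue.start_crawler(Name="ecommerce-csv-crawler")
```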

As mentioned before, the recommended storage for the analytical part is Redshift, from which data analysts can evaluate and visualize the data. A single staging table in which the data is stored works best here.
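
For illustration, such a staging table could be created through the Redshift Data API; the cluster, database, user, and column types are assumptions matching the Glue sketch above:

```
import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="ecommerce-cluster",  # placeholder cluster
    Database="ecommerce",
    DbUser="awsuser",
    Sql="""
        CREATE TABLE IF NOT EXISTS staging_invoices (
            invoice_no   VARCHAR(20),
            stock_code   VARCHAR(20),
            quantity     BIGINT,
            invoice_date VARCHAR(30),
            unit_price   DOUBLE PRECISION,
            customer_id  VARCHAR(20)
        );
    """,
)
```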

The typical method for visualization would be to use Redshift as your analytics database, to which you ideally connect a BI tool such as Power BI.

So much from me now. I wish you a lot of fun with your very own AWS e-commerce project!

In case you need any help with it, check out our Data Engineering Academy. There you find the complete AWS project explained step by step, with all the necessary source code, further links, and learning material available.

Are you interested, but still have a few burning questions on your mind? Feel free to contact me via hello@learndataengineering.com or mention me on Twitter.

For more information and content on Data Engineering, also check out my other blog posts, videos and more on Medium, YouTube and LinkedIn!
