Fundamental Guide to Data Engineering on AWS

DP6 Team
Published in DP6 US
Jun 6, 2024 · 5 min read

Data Engineering has become a vital discipline in today’s Digital Marketing landscape, and Amazon Web Services (AWS) stands out as one of the main platforms for working with data in the cloud. Even for professionals whose main stack resides in another cloud, AWS’s popularity and breadth make learning its tools and services highly worthwhile. The provider offers a wide range of services, from storage solutions to advanced analytical tools, that let you process and analyze large volumes of data efficiently. Understanding AWS can therefore open doors to new career opportunities and significantly improve your ability to deliver robust, scalable data solutions.

This article takes a look at some of the main tools and gives an overview of what data engineering looks like on AWS.

Storage

Amazon S3 (Simple Storage Service) is the main object storage service offered by Amazon Web Services (AWS). With it, users can store any type of file, such as a Parquet file, a CSV, or even an image. From a data engineering point of view, the main use case for S3 is building data lakes, where different types of structured and unstructured data are stored with the scalability, cost-effectiveness, and flexibility required. We can also use tools such as Athena, as we’ll see below, to query the data in place.
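As a minimal sketch of what this looks like in practice, the snippet below uses boto3 (AWS’s Python SDK) to drop a file into a data lake bucket and list what landed there; the bucket name, prefix, and file name are made up for illustration:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix for a data lake "raw" zone.
bucket = "my-datalake-bucket"

# Upload a local Parquet file as an object in the raw zone.
s3.upload_file("sales.parquet", bucket, "raw/sales/sales.parquet")

# List the objects under that prefix to confirm the upload.
response = s3.list_objects_v2(Bucket=bucket, Prefix="raw/sales/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```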

Functions

AWS Lambda is a serverless computing service, i.e. functions as a service. It allows you to run code in response to events, such as API calls, file uploads, database changes, or other events you define. The main advantage of Lambda is that you don’t have to worry about managing servers or clusters: you simply write code focused on the logic and leave the infrastructure to Lambda. You can use it to process logs, transform data, perform ETL (Extract, Transform, and Load), and much more.
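For illustration, a minimal Python Lambda handler might look like the sketch below, assuming the function is triggered by S3 file uploads; the event parsing shown is specific to that trigger type:

```python
import json

def lambda_handler(event, context):
    # For an S3 trigger, 'event' carries one record per uploaded object.
    keys = [r["s3"]["object"]["key"] for r in event.get("Records", []) if "s3" in r]

    # Transformation / ETL logic for each file would go here.

    return {"statusCode": 200, "body": json.dumps({"processed": keys})}
```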

Workflow Orchestration

AWS Step Functions is a workflow service: it runs automated sequences of tasks and can be used to orchestrate distributed applications, automate processes, and build data and machine learning pipelines. In data engineering it is common for a pipeline to involve more than one process. A typical use of this tool, for example, is to run two Lambdas in parallel and feed their output into a third function. Step Functions offers functionality similar to Airflow’s, but in a simpler form.

source: https://aws.amazon.com/pt/step-functions/use-cases/
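The parallel pattern just described can be expressed in Amazon States Language, the JSON dialect Step Functions uses. Below is a sketch that registers such a state machine with boto3; the Lambda ARNs, role ARN, and names are placeholders, not real resources:

```python
import json
import boto3

# Two extract Lambdas run in parallel; their combined output feeds a third.
definition = {
    "StartAt": "ParallelExtract",
    "States": {
        "ParallelExtract": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "ExtractA", "States": {"ExtractA": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract-a",
                    "End": True}}},
                {"StartAt": "ExtractB", "States": {"ExtractB": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract-b",
                    "End": True}}},
            ],
            "Next": "Combine",
        },
        "Combine": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:combine",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-role",
)
```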

Cloud Data Warehouse

AWS’s data warehouse service, the structure where data is integrated and centralized, is called Amazon Redshift. It is designed and optimized to process and analyze large volumes of data efficiently. Internally, Redshift uses a columnar architecture optimized for analytical queries. From an infrastructure point of view, the Redshift cluster can also be resized up or down according to the company’s data storage and processing needs, without having to rebuild it from scratch.
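As one example of interacting with it, the Redshift Data API lets you run SQL against a cluster from Python without managing database connections; the cluster, database, user, and table names below are hypothetical:

```python
import boto3

client = boto3.client("redshift-data")

# Submit a query asynchronously; the returned Id is used to poll for results.
response = client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT channel, SUM(revenue) FROM sales GROUP BY channel;",
)
print(response["Id"])
```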

Queries

When data is stored in S3, in data warehouses, in ERPs, in RDS (AWS’s relational database service), or elsewhere, Athena is a service that can query it directly using SQL. In addition to its serverless SQL query engine, Athena also offers an Apache Spark engine if you need more speed when processing large volumes of data. All of this is flexible, and costs vary according to use, making it a very useful tool for querying data directly from one or more sources without necessarily moving it into a database system.
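A minimal sketch of submitting an Athena query with boto3 is shown below; the database, table, and results bucket are assumptions for illustration, and Athena writes query output to the S3 location you specify:

```python
import boto3

athena = boto3.client("athena")

# Run SQL over data cataloged in a hypothetical "marketing" database.
query = athena.start_query_execution(
    QueryString="SELECT campaign, COUNT(*) AS clicks FROM clickstream GROUP BY campaign",
    QueryExecutionContext={"Database": "marketing"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# The execution id can be polled with get_query_execution / get_query_results.
print(query["QueryExecutionId"])
```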

Hadoop/Spark

source: https://spark.apache.org/

Elastic MapReduce, or Amazon EMR, is the service that lets you run Spark on AWS without having to manage the cluster infrastructure yourself. Even with specialized platforms such as Databricks available, it remains an excellent solution for big data workloads.
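The kind of job you would submit to an EMR cluster is ordinary PySpark code; the sketch below reads Parquet from a hypothetical S3 path, aggregates it, and writes the result back:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

# Hypothetical input and output locations in a data lake.
events = spark.read.parquet("s3://my-datalake-bucket/raw/events/")

daily = events.groupBy("event_date").agg(F.count("*").alias("events"))

daily.write.mode("overwrite").parquet("s3://my-datalake-bucket/curated/daily_events/")
```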

Streaming

Kinesis is AWS’s fully managed data streaming tool for situations that require real-time data ingestion, processing, and analysis; it is an alternative to Apache Kafka, for example.
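Producing into a stream is a single API call. In the sketch below, the stream name and record contents are hypothetical, and the partition key controls which shard receives each record:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "42", "action": "page_view"}

kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # records with the same key land on the same shard
)
```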

ETL Service (and More)

AWS Glue is a fully managed, serverless, and scalable ETL service that can be used for data integration.

The service lets you extract data from various sources; discover and reliably catalog metadata; and cleanse, enrich, move, and organize data between different types of storage (such as data lakes, data warehouses, and databases). It combines the speed and power of Apache Spark with several other tools that make the job easier.

One of its important components is the Glue Data Catalog, which stores metadata such as information about databases, tables and crawlers.

The tables in the Data Catalog hold metadata discovered by crawlers, which scan files in an S3 bucket and identify columns and their types. With this same service you can run a crawler over your data and use the resulting information in ETL processes, organizing data, facilitating search, improving governance, enabling quick discovery, and optimizing ETL jobs.

source: https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html
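A Glue ETL script is essentially Spark code with Glue’s own abstractions on top. The skeleton below assumes a crawler has already cataloged a hypothetical raw_sales table in a marketing database; all names and paths are placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read through the Data Catalog instead of hard-coding paths and schemas.
sales = glue_context.create_dynamic_frame.from_catalog(
    database="marketing", table_name="raw_sales"
)

# Write the data back to a curated zone in S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/curated/sales/"},
    format="parquet",
)
```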

IAM and CloudWatch

In addition to the services mentioned above, it is also worth noting IAM (Identity and Access Management), which is needed to control access to the data and to the resources used in pipelines. Another service you will probably use a lot when developing solutions on AWS is CloudWatch, where you can monitor and query pipeline logs to improve performance and debug errors.
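For instance, Lambda functions write their logs to CloudWatch log groups named /aws/lambda/<function-name>; the sketch below pulls recent error lines from a hypothetical function’s log group:

```python
import boto3

logs = boto3.client("logs")

# Search a hypothetical Lambda's log group for lines containing "ERROR".
response = logs.filter_log_events(
    logGroupName="/aws/lambda/extract-a",
    filterPattern="ERROR",
)

for event in response["events"]:
    print(event["timestamp"], event["message"])
```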

Finally, I’d like to stress that it’s important to get to know AWS services and their applications, and to put projects on the platform into practice to consolidate your learning. The benefits of having at least basic knowledge are significant, and much of the knowledge acquired in other clouds, such as Google Cloud and Microsoft Azure, can be reused. Various study materials produced by AWS itself are available at https://docs.aws.amazon.com/ and https://aws.amazon.com/training/, making them good sources for reference, study, and certification preparation.

Profile of the Author: Emanuel Betcel | Mechatronics Technician, Systems Analyst and BA in Information Technology from UFRN. Data Engineer for 3 years at DP6, working with data collection and structuring, cloud development and data integration.

Originally published at https://www.dp6.com.br.
