AWS Data Engineering: AWS Basics

Talha Şahin
Oct 2, 2023


In this article, we will delve into the foundational aspects of AWS that every data engineer should grasp. We’ll explore the essentials of Identity and Access Management (IAM), the tools used to reach AWS, such as CloudShell, the Command Line Interface (CLI), and the Python Software Development Kit (Python SDK), and the versatility of Elastic Compute Cloud (EC2) instances.

With these building blocks in place, it will be easier to understand other AWS services. Let’s begin!

AWS Identity and Access Management (IAM)

IAM Users and Groups

IAM, or Identity and Access Management, is a core service provided by Amazon Web Services (AWS) that enables you to securely control access to AWS resources. It allows you to manage users, groups, and permissions, ensuring that only authorized individuals or systems can interact with your AWS infrastructure and services.

IAM users represent the people or applications that interact with AWS services. Users can be organized into IAM groups, each holding specific permissions, to streamline access management. A user can belong to one or more groups, or belong to none and have permissions attached directly, as in the example below.

In the figure, the blue, red, and green clusters are groups and the avatars are users. As you can see, a user can be in multiple groups or in none.
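The relationship between users and groups can be sketched with Boto3, the Python SDK covered later in this article. This is a minimal sketch, not the author's code: the user and group names are hypothetical, the groups are assumed to exist already, and the IAM client is passed in as a parameter so the function can be tried without real credentials.

```python
def add_user_to_groups(iam_client, user_name, group_names):
    """Create an IAM user and add it to each of the given groups.

    iam_client is expected to be boto3.client("iam"). The names passed in
    are hypothetical examples, not values from this article.
    """
    iam_client.create_user(UserName=user_name)
    for group_name in group_names:
        # Assumes each group already exists (for example via create_group).
        iam_client.add_user_to_group(GroupName=group_name, UserName=user_name)
    return list(group_names)


# Usage (requires AWS credentials with IAM permissions):
# import boto3
# add_user_to_groups(boto3.client("iam"), "alice", ["data-engineers", "analysts"])
```

Passing an empty list of groups creates a user that belongs to no group, which matches the case described above.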

IAM Policies

You manage access in AWS by creating policies and attaching them to IAM identities (users, groups of users, or roles). AWS evaluates these policies when an IAM principal (user or role) makes a request. In other words, a policy is a short document that defines what an identity can and cannot do. Policies are stored in AWS as JSON documents.

A JSON policy document is made up of the following elements, some of which are optional:

  • Version — Use the latest policy language version, 2012-10-17.
  • Statement — Use this main policy element as a container for the following elements. You can include more than one statement in a policy.
  • Sid (Optional) — Include an optional statement ID to differentiate between your statements.
  • Effect — Use Allow or Deny to indicate whether the policy allows or denies access.
  • Principal (Required in only some circumstances) — If you create a resource-based policy, you must indicate the account, user, role, or federated user to which you would like to allow or deny access. If you are creating an IAM permissions policy to attach to a user or role, you cannot include this element. The principal is implied as that user or role.
  • Action — Include a list of actions that the policy allows or denies.
  • Resource (Required in only some circumstances) — If you create an IAM permissions policy, you must specify a list of resources to which the actions apply. If you create a resource-based policy, this element is optional. If you do not include this element, then the resource to which the action applies is the resource to which the policy is attached.
  • Condition (Optional) — Specify the circumstances under which the policy grants permission.
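To make the elements above concrete, here is a minimal sketch of an identity-based policy built as a Python dictionary and serialized to JSON. The bucket name and statement ID are hypothetical. Because this policy is meant to be attached to a user or role, it has no Principal element.

```python
import json

# Hypothetical bucket name, used only for illustration.
BUCKET = "example-data-lake-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadExampleBucket",  # optional statement ID
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",    # the bucket itself (ListBucket)
                f"arn:aws:s3:::{BUCKET}/*",  # objects in the bucket (GetObject)
            ],
            # Optional: only allow requests made over TLS.
            "Condition": {"Bool": {"aws:SecureTransport": "true"}},
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The resulting JSON string is the form in which a policy document is stored and submitted to IAM.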

IAM Roles

IAM roles in AWS differ from IAM users in that they are not tied to specific individuals. Instead, roles are intended to be assumed by anyone who needs the permissions and access associated with that role. Unlike IAM users, roles do not have permanent credentials like passwords or access keys. When a role is assumed, it provides temporary security credentials for the duration of the role session. These credentials are used to access AWS resources securely. Additionally, AWS service roles are a specific type of IAM role used by AWS services to perform actions on your behalf, and IAM administrators have the ability to create, modify, and delete these roles within the IAM service.
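Assuming a role and receiving temporary credentials might look like the following sketch. It relies on the STS `assume_role` call from Boto3 (the Python SDK discussed later in this article); the role ARN and session name in the usage comment are hypothetical, and the client is passed in so the function can be tried without real credentials.

```python
def assume_role_credentials(sts_client, role_arn, session_name):
    """Assume a role and return its temporary credentials.

    sts_client is expected to be boto3.client("sts"). The returned keys
    match the keyword arguments accepted by boto3.client(...).
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )
    creds = response["Credentials"]  # valid only for the role session
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Usage (requires credentials that are allowed to assume the role):
# import boto3
# creds = assume_role_credentials(
#     boto3.client("sts"),
#     "arn:aws:iam::123456789012:role/example-role",  # hypothetical ARN
#     "example-session",
# )
# s3 = boto3.client("s3", **creds)
```

The session token is what distinguishes these temporary credentials from the long-lived access keys of an IAM user.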

You can find IAM examples in my GitHub repo here!

AWS CloudShell, Command Line Interface (CLI) & Python Software Development Kit (Python SDK)

AWS CloudShell

AWS CloudShell is a web-based, interactive shell environment provided by Amazon Web Services (AWS). It allows AWS users to access a fully managed Linux terminal directly from the AWS Management Console. It’s a convenient and ephemeral environment for managing AWS resources, scripting, and automation tasks, ensuring that you always have a consistent and up-to-date environment whenever you need it.

AWS CLI

The AWS Command Line Interface (AWS CLI) is a versatile and unified tool that enables you to manage your AWS services by interacting with their APIs (Application Programming Interfaces) through text-based commands. To configure and use the AWS CLI, you typically require access keys, which consist of an Access Key ID and a Secret Access Key, providing the necessary authentication for your AWS account. The AWS CLI offers several advantages, including speed and automation capabilities that surpass the AWS web console. It allows you to perform AWS operations more quickly and efficiently, making it an excellent choice for developers and administrators who need to automate tasks or manage AWS resources in a programmatic and scriptable manner.

To learn how to install and configure the AWS CLI on your local computer, you can check out my repo here!

AWS SDK for Python (Boto3)

The AWS SDK for Python, known as Boto3, is a library that allows Python developers to interact with Amazon Web Services (AWS) programmatically. It provides a convenient way to access and manage AWS resources, services, and APIs using Python code, and it handles details such as authentication and request signing for you. It simplifies the process of integrating AWS services into Python applications, enabling developers to build, deploy, and manage AWS-powered solutions using Python scripts or applications.
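As a small taste of what an SDK call looks like, the sketch below asks AWS which identity the current credentials belong to, using the STS `get_caller_identity` call. It assumes Boto3 is installed and credentials are configured; the client is passed in as a parameter so the function itself stays easy to test.

```python
def caller_arn(sts_client):
    """Return the ARN of the identity behind the current credentials.

    sts_client is expected to be boto3.client("sts").
    """
    identity = sts_client.get_caller_identity()
    return identity["Arn"]


# Usage (requires configured AWS credentials):
# import boto3
# print(caller_arn(boto3.client("sts")))
```

This is a handy first check that your credentials are picked up correctly before you call other services.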

Before giving instructions on how to use the Python SDK, we need to look at EC2.

Amazon Elastic Compute Cloud (EC2)

Amazon Elastic Compute Cloud (Amazon EC2) is a foundational service offered by Amazon Web Services (AWS). It is one of the most widely used AWS services and serves as a fundamental building block for various cloud-based applications and workloads.

At its core, EC2 provides users with the ability to rent virtual machines, known as instances, on the AWS cloud. These instances offer secure and resizable compute capacity, which means users can tailor the computing resources to match their specific needs, whether it’s running a small web application or powering a large-scale data processing task.

One of the compelling aspects of Amazon EC2, especially for newcomers to AWS, is the AWS Free Tier. At the time of writing, it allows users to run eligible EC2 instance types (such as t2.micro) for up to 750 hours per month at no cost during their first 12 months on the platform, making it an accessible and cost-effective option for experimenting with cloud computing.

In terms of service categorization, EC2 falls under the Infrastructure as a Service (IaaS) model. This means that instead of owning physical servers, users can essentially “rent” compute capacity on a pay-as-you-go basis, eliminating the need for upfront hardware investments and allowing for flexible scalability as demands change.

EC2 Options

When creating an Amazon EC2 instance, you have several key configuration options. First, you select the operating system (e.g., Red Hat, Amazon Linux) to run on the instance. Next, you specify the machine size, which determines the compute power in terms of CPU cores and RAM. You can also create a key pair for secure terminal access, attach the instance’s network interface to the desired virtual network, define firewall rules using security groups for network access control, and configure disk options for storage needs. Finally, you can include a bootstrap script (user data) that runs commands during the instance’s initialization, allowing for customized setup and configuration. These options provide flexibility and customization when provisioning EC2 instances to meet specific application requirements.
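The options above map fairly directly onto the parameters of the EC2 `run_instances` call in Boto3. The sketch below only builds the argument dictionary; every ID is a placeholder you would supply, and the instance type, root device name (`/dev/xvda`), disk size, and user data script are illustrative assumptions rather than recommendations.

```python
def build_run_instances_args(ami_id, key_name, security_group_id, subnet_id,
                             instance_type="t2.micro"):
    """Build keyword arguments for ec2_client.run_instances(...)."""
    return {
        "ImageId": ami_id,                        # operating system image (AMI)
        "InstanceType": instance_type,            # machine size: CPU and RAM
        "KeyName": key_name,                      # key pair for terminal access
        "SubnetId": subnet_id,                    # placement in a virtual network
        "SecurityGroupIds": [security_group_id],  # firewall rules
        "BlockDeviceMappings": [                  # disk options
            {
                "DeviceName": "/dev/xvda",  # assumed root device name
                "Ebs": {
                    "VolumeSize": 8,  # GiB
                    "VolumeType": "gp3",
                    "DeleteOnTermination": True,
                },
            }
        ],
        # Bootstrap script run at first boot; Boto3 base64-encodes it for you.
        "UserData": "#!/bin/bash\nyum update -y\n",
        "MinCount": 1,
        "MaxCount": 1,
    }


# Usage (requires AWS credentials; all IDs below are hypothetical):
# import boto3
# args = build_run_instances_args("ami-EXAMPLE", "my-key", "sg-EXAMPLE", "subnet-EXAMPLE")
# boto3.client("ec2").run_instances(**args)
```

Keeping the arguments in a plain dictionary makes it easy to review the configuration before launching anything that costs money.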

EC2 Security Groups

Amazon EC2 Security Groups serve as a critical component for managing the network traffic to and from your EC2 instances within the AWS cloud. They function as virtual firewalls, enabling you to control and secure inbound and outbound network traffic.

One of the key advantages of security groups is their flexibility. You have the freedom to modify the rules of a security group at any time, allowing you to adapt to changing security requirements and application needs. By default, security groups are designed to be cautious: they permit all outbound traffic but block all incoming traffic, providing an initial layer of protection for your instances.
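Because inbound traffic is blocked by default, reaching an instance over SSH requires an explicit inbound rule. The following sketch adds one with the EC2 `authorize_security_group_ingress` call in Boto3; the group ID and the address in the usage comment are hypothetical (the address comes from a range reserved for documentation).

```python
def ssh_ingress_rule(cidr):
    """Build an inbound rule that allows SSH (TCP port 22) from one CIDR range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": cidr, "Description": "SSH from a trusted address"}],
    }


def allow_ssh(ec2_client, group_id, cidr):
    """Attach the SSH rule to a security group.

    ec2_client is expected to be boto3.client("ec2").
    """
    rule = ssh_ingress_rule(cidr)
    ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[rule],
    )
    return rule


# Usage (requires AWS credentials; the ID and address are hypothetical):
# import boto3
# allow_ssh(boto3.client("ec2"), "sg-EXAMPLE", "203.0.113.10/32")
```

Restricting the rule to a single address (`/32`) rather than `0.0.0.0/0` keeps the instance closed to everyone else.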

Amazon EC2 Instance Roles, often referred to as IAM (Identity and Access Management) roles for EC2 instances, are a vital component of AWS security and access management. They are designed to allow your applications running on EC2 instances to securely interact with other AWS services and resources, such as S3 buckets or databases, without the need to manage access keys or security credentials within your code or configuration files.
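What makes a role usable by EC2 is its trust policy, which names the EC2 service as the principal allowed to assume it. Here is a minimal sketch of such a trust policy; the role name in the usage comment is hypothetical, and the permissions the role grants would be attached separately.

```python
import json

# Trust policy: lets the EC2 service assume the role on behalf of an instance.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

trust_policy_json = json.dumps(trust_policy)

# Usage (requires AWS credentials; the role name is hypothetical):
# import boto3
# boto3.client("iam").create_role(
#     RoleName="example-ec2-role",
#     AssumeRolePolicyDocument=trust_policy_json,
# )
```

Unlike the identity-based policies described earlier, a trust policy is resource-based, so it does include a Principal element.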

Amazon Elastic Block Store (EBS)

Amazon Elastic Block Store (EBS) volumes are essential storage components for Amazon EC2 instances, and they are billed independently of the instance. By default, when you terminate an EC2 instance, the root volume (where the operating system resides) is deleted automatically, while other attached volumes persist. Any volume that persists after termination continues to incur charges until you delete it, so it’s important to manage your volumes carefully to avoid unnecessary costs. Additionally, an EBS volume can only be attached to an instance in the same Availability Zone (AZ), so create additional volumes in the instance’s AZ.
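One way to respect the same-AZ requirement is to look up the instance’s Availability Zone first and create the volume there. The sketch below does this with the Boto3 EC2 calls `describe_instances` and `create_volume`; the volume type and the IDs in the usage comment are illustrative assumptions.

```python
def create_volume_in_instance_az(ec2_client, instance_id, size_gib):
    """Create an EBS volume in the same Availability Zone as an instance.

    ec2_client is expected to be boto3.client("ec2").
    Returns (volume_id, availability_zone).
    """
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    az = instance["Placement"]["AvailabilityZone"]

    volume = ec2_client.create_volume(
        AvailabilityZone=az,  # must match the instance's AZ to be attachable
        Size=size_gib,
        VolumeType="gp3",     # assumed volume type for illustration
    )
    return volume["VolumeId"], az


# Usage (requires AWS credentials; the instance ID is hypothetical):
# import boto3
# ec2 = boto3.client("ec2")
# volume_id, az = create_volume_in_instance_az(ec2, "i-EXAMPLE", 20)
# Once the volume is available, attach it:
# ec2.attach_volume(Device="/dev/sdf", InstanceId="i-EXAMPLE", VolumeId=volume_id)
```

Remember that a volume created this way is billed until you delete it, even if the instance is later terminated.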

EC2 Purchasing Options

  • On-Demand Instances — Pay, by the second, for the instances that you launch.
  • Savings Plans — Reduce your Amazon EC2 costs by making a commitment to a consistent amount of usage, in USD per hour, for a term of 1 or 3 years.
  • Reserved Instances — Reduce your Amazon EC2 costs by making a commitment to a consistent instance configuration, including instance type and Region, for a term of 1 or 3 years.
  • Spot Instances — Request unused EC2 instances, which can reduce your Amazon EC2 costs significantly.
  • Dedicated Hosts — Pay for a physical host that is fully dedicated to running your instances, and bring your existing per-socket, per-core, or per-VM software licenses to reduce costs.
  • Dedicated Instances — Pay, by the hour, for instances that run on single-tenant hardware.
  • Capacity Reservations — Reserve capacity for your EC2 instances in a specific Availability Zone for any duration.

You can find out how to set up EC2 in my GitHub repository by clicking here!
