The Data Scientists’ Guide to AWS EC2: Key Concepts and Best Practices

Understand what AWS EC2 is, key concepts to know as a data scientist, and best practices on how to use AWS EC2

Karun Thankachan
CodeX
16 min read · May 7, 2023


Photo by Ian Battaglia on Unsplash

Are you tired of waiting for your local machine to crunch those big data sets? Do you find yourself dreaming of a magical computing beast that can scale to your every need? Look no further, because AWS EC2 is here to save the day!

AWS EC2 is a powerful service that provides scalable computing capacity in the cloud. It allows you to create virtual machines, or instances, that can run a wide variety of applications and workloads. As a data scientist, this means you can spin up instances with just the right amount of processing power and memory to handle your data exploration/machine learning needs. And if your needs change over time, no problem! Simply spin up more instances, or scale down if you don’t need as much power.

The beauty of EC2 is that it’s entirely flexible and customizable. You can choose from a wide variety of instance types, ranging from small instances with just a few gigabytes of memory to massive instances with hundreds of gigabytes of memory and multiple CPUs. And if you need to run specialized workloads, such as machine learning or high-performance computing, there are instances optimized for those use cases as well. So what are you waiting for? Let’s dive in and learn how to use EC2 with Python Boto3!

Creating Instances

To create an instance using Python Boto3, you’ll need to follow a few key steps:

  1. Create an Amazon Machine Image (AMI): An AMI is a pre-configured virtual machine that you can use to launch instances. Typically you would use the public AMIs available in the AWS Marketplace. To create an AMI using Python Boto3, you can use the `create_image()` method of the EC2 client.
  2. Choose an instance type: AWS EC2 offers a wide variety of instance types, each optimized for different workloads. Instance types vary in terms of CPU, memory, storage, and networking capacity. To choose an instance type, you’ll need to consider the requirements of your workload and choose the instance type that best matches those requirements (more on how to choose in the last section). To specify an instance type using Python Boto3, you can use the `InstanceType` parameter when creating an instance.
  3. Specify security groups: Security groups act as a virtual firewall, controlling inbound and outbound traffic to your instance. You can create one or more security groups and specify them when launching your instance. To create a security group using Python Boto3, you can use the `create_security_group()` method of the EC2 client, and then use the `authorize_security_group_ingress()` method to specify inbound rules.
  4. Launch an instance: Once you’ve created an AMI, chosen an instance type, and specified security groups, you’re ready to launch your instance! To launch an instance using Python Boto3, you can use the `run_instances()` method of the EC2 client. This method takes a number of parameters, including the ID of the AMI, the instance type, the number of instances to launch, and the IDs of the security groups to assign to the instances.

Here’s an example Python code snippet that shows how to create an instance using Boto3:

import boto3

# create an EC2 client
ec2 = boto3.client('ec2')

# create an AMI
response = ec2.create_image(InstanceId='i-0123456789abcdef0', Name='my-ami')

# choose an instance type
instance_type = 't2.micro'

# create a security group
security_group = ec2.create_security_group(GroupName='my-security-group',
                                           Description='My security group')
ec2.authorize_security_group_ingress(GroupId=security_group['GroupId'],
                                     IpProtocol='tcp', FromPort=80,
                                     ToPort=80, CidrIp='0.0.0.0/0')

# launch an instance
response = ec2.run_instances(ImageId=response['ImageId'],
                             InstanceType=instance_type,
                             MinCount=1,
                             MaxCount=1,
                             SecurityGroupIds=[security_group['GroupId']])

This code creates an AMI from an existing instance, chooses a `t2.micro` instance type, creates a security group with an inbound rule allowing HTTP traffic, and launches a single instance with the specified parameters.

Connecting to an Instance

To connect to an instance using Boto3, you’ll need to follow a few key steps:

  1. Create an SSH key pair: An SSH key pair consists of a public key and a private key. When you launch an instance with a key pair name (the `KeyName` parameter of `run_instances()`), AWS places the public key on the instance so that you can authenticate with the matching private key when you connect. To create an SSH key pair using Python Boto3, you can use the `create_key_pair()` method of the EC2 client.
  2. Get the public IP address of your instance: To connect to your instance, you’ll need to know its public IP address. You can retrieve the public IP address using the `describe_instances()` method of the EC2 client.
  3. Connect to your instance: Once you have your SSH key pair and the public IP address of your instance, you're ready to connect! Boto3 itself doesn't open SSH sessions, so you'll use an SSH library such as `paramiko` with your private key and the instance's public IP address.

Here’s an example Python code snippet that shows how to connect to an instance using Boto3:

import boto3

# create an EC2 client
ec2 = boto3.client('ec2')

# create an SSH key pair and save the private key to disk
key_pair_name = 'my-key-pair'
response = ec2.create_key_pair(KeyName=key_pair_name)
private_key = response['KeyMaterial']
with open(key_pair_name + '.pem', 'w') as f:
    f.write(private_key)

# get the public IP address of the instance
instance_id = 'i-0123456789abcdef0'
response = ec2.describe_instances(InstanceIds=[instance_id])
public_ip_address = response['Reservations'][0]['Instances'][0]['PublicIpAddress']

# connect to the instance over SSH
from paramiko import SSHClient, AutoAddPolicy
ssh = SSHClient()
ssh.set_missing_host_key_policy(AutoAddPolicy())
ssh.connect(hostname=public_ip_address, username='ec2-user',
            key_filename=key_pair_name + '.pem')

This code creates an SSH key pair with the name `my-key-pair`, retrieves the public IP address of an instance with the ID `i-0123456789abcdef0`, and connects to the instance using the SSH protocol with the `paramiko` library.

Note that in this example, we assume that the instance is running a Linux-based operating system and that the default username for connecting via SSH is `ec2-user`. If you’re connecting to a Windows instance, or if you’ve customized the SSH settings on your instance, you’ll need to adjust your connection parameters accordingly.
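The default username is set by the AMI, not by EC2. The values below are widely documented defaults for common AMI families, but you should confirm against the documentation for your specific AMI; a small lookup table keeps connection code tidy:

```python
# Common default SSH usernames by AMI family. These are widely documented
# defaults, but confirm against your specific AMI's documentation.
DEFAULT_SSH_USER = {
    'amazon-linux': 'ec2-user',
    'ubuntu': 'ubuntu',
    'debian': 'admin',
    'rhel': 'ec2-user',
    'suse': 'ec2-user',
}

def ssh_user_for(ami_family):
    """Return the default SSH username for an AMI family, falling back to 'ec2-user'."""
    return DEFAULT_SSH_USER.get(ami_family, 'ec2-user')
```

You could then pass `ssh_user_for('ubuntu')` as the `username` argument to `ssh.connect()` instead of hard-coding `ec2-user`.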

Managing Instances

To manage instances using Boto3, you’ll need to follow a few key steps:

  1. Start an instance: To start an instance, you can use the `start_instances()` method of the EC2 client.
  2. Stop an instance: To stop an instance, you can use the `stop_instances()` method of the EC2 client.
  3. Terminate an instance: To terminate an instance, you can use the `terminate_instances()` method of the EC2 client.
  4. Monitor the status of instances: To monitor the status of instances, you can use the `describe_instances()` method of the EC2 client.

Here’s an example Python code snippet that shows how to manage instances using Boto3:

import boto3

# create an EC2 client
ec2 = boto3.client('ec2')

# start an instance
instance_id = 'i-0123456789abcdef0'
response = ec2.start_instances(InstanceIds=[instance_id])
print(response)

# stop an instance
instance_id = 'i-0123456789abcdef0'
response = ec2.stop_instances(InstanceIds=[instance_id])
print(response)

# terminate an instance
instance_id = 'i-0123456789abcdef0'
response = ec2.terminate_instances(InstanceIds=[instance_id])
print(response)

# monitor the status of instances
instance_id = 'i-0123456789abcdef0'
response = ec2.describe_instances(InstanceIds=[instance_id])
print(response)

In this code, we start an instance with the ID `i-0123456789abcdef0`, stop it, terminate it, and then monitor its status using the `describe_instances()` method.

Note that stopping an instance shuts it down without deleting it: the contents of memory are lost, but data on attached EBS volumes persists, and you can start the instance again later. Terminating an instance permanently deletes it. Be careful when terminating instances, as this action cannot be undone!
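Also note that `start_instances()`, `stop_instances()`, and `terminate_instances()` return while the instance is still transitioning. Boto3's built-in waiters let you block until a state is actually reached; here is a minimal sketch (the instance ID and region are placeholders):

```python
# Map a desired state to the corresponding boto3 EC2 waiter name.
WAITER_FOR_STATE = {
    'running': 'instance_running',
    'stopped': 'instance_stopped',
    'terminated': 'instance_terminated',
}

def wait_for_state(instance_id, state, region_name='us-west-2'):
    """Block until the instance reaches the given state, polling via a waiter."""
    import boto3  # imported here so the helpers above load without boto3
    ec2 = boto3.client('ec2', region_name=region_name)
    ec2.get_waiter(WAITER_FOR_STATE[state]).wait(InstanceIds=[instance_id])

# usage (placeholder ID):
# wait_for_state('i-0123456789abcdef0', 'stopped')
```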

Using EBS with EC2 for Storage

Using Elastic Block Store (EBS) with EC2 instances can be done through the following steps:

  1. Create an EBS volume: To create an EBS volume, you can use the `create_volume()` method of the EC2 client. You’ll need to specify the size of the volume in gigabytes and the Availability Zone in which you want the volume to be located.
  2. Attach the EBS volume to an instance: To attach the EBS volume to an instance, you can use the `attach_volume()` method of the EC2 client. You’ll need to specify the instance ID and the device name (e.g., `/dev/xvdf`) to which the volume should be attached.
  3. Detach the EBS volume from an instance: To detach the EBS volume from an instance, you can use the `detach_volume()` method of the EC2 client. You’ll need to specify the volume ID and optionally the instance ID from which the volume should be detached.

Here’s an example Python code snippet that shows how to use EBS with EC2 instances using Boto3:

import boto3

# create an EC2 client
ec2 = boto3.client('ec2')

# create an EBS volume
volume_size = 10
availability_zone = 'us-west-2a'
response = ec2.create_volume(Size=volume_size,
                             AvailabilityZone=availability_zone)
volume_id = response['VolumeId']
print(response)

# attach the EBS volume to an instance
instance_id = 'i-0123456789abcdef0'
device_name = '/dev/xvdf'
response = ec2.attach_volume(VolumeId=volume_id,
                             InstanceId=instance_id, Device=device_name)
print(response)

# detach the EBS volume from an instance
response = ec2.detach_volume(VolumeId=volume_id, InstanceId=instance_id)
print(response)

In this code, we create an EBS volume with a size of 10 gigabytes and located in the `us-west-2a` Availability Zone. We then attach the volume to an instance with the ID `i-0123456789abcdef0` and the device name `/dev/xvdf`, and finally detach the volume from the instance.

Note that a freshly attached EBS volume is raw block storage: from inside the instance you'll need to create a filesystem on it (for example with `mkfs`) and mount it before use. Data on a volume persists when it is detached, but volumes whose `DeleteOnTermination` attribute is set are deleted along with the instance, so detach or snapshot anything you need to keep before terminating.
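One way to guard against accidental data loss is to snapshot a volume before detaching it or terminating its instance. A sketch, with a placeholder volume ID:

```python
def snapshot_params(volume_id, description):
    """Build the keyword arguments for EC2 create_snapshot()."""
    return {'VolumeId': volume_id, 'Description': description}

def backup_volume(volume_id, region_name='us-west-2'):
    """Create a point-in-time snapshot of an EBS volume and return its ID."""
    import boto3  # imported here so the pure helper above loads without boto3
    ec2 = boto3.client('ec2', region_name=region_name)
    response = ec2.create_snapshot(
        **snapshot_params(volume_id, 'pre-termination backup'))
    return response['SnapshotId']
```

Snapshots are incremental and stored in S3 behind the scenes, so regular snapshots are also a cheap backup strategy.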

Securing AWS EC2

Security is an important consideration when using AWS EC2 instances. Here’s an overview of how to manage security in AWS EC2 using Python Boto3:

  1. Creating and managing security groups: A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. To create a security group, you can use the `create_security_group()` method of the EC2 client. You’ll need to specify the group name, a description, and the VPC ID for the security group. Once the security group is created, you can use the `authorize_security_group_ingress()` method to add rules to allow incoming traffic.
  2. Using Identity and Access Management (IAM): IAM is a service that helps you control access to AWS resources. To use IAM with EC2 instances, you can create IAM roles and policies that allow specific actions on EC2 instances. You can use the `create_role()`, `create_policy()`, and `attach_role_policy()` methods of the IAM client to create roles and policies and attach them to EC2 instances.
    Note: By default, an EC2 instance has no IAM role attached, which means code running on the instance has no AWS credentials and cannot call AWS APIs. Attaching a role grants the instance exactly the permissions defined in the role's policies.

Here’s an example Python code snippet that shows how to create a security group and use IAM to control access to an EC2 instance:

import boto3

# create an EC2 client
ec2 = boto3.client('ec2')

# create a security group
group_name = 'my-security-group'
description = 'my security group'
vpc_id = 'vpc-0123456789abcdef0'
response = ec2.create_security_group(GroupName=group_name, Description=description, VpcId=vpc_id)
security_group_id = response['GroupId']
print(response)

# add rules to the security group to allow incoming traffic
response = ec2.authorize_security_group_ingress(
    GroupId=security_group_id,
    IpPermissions=[
        {
            'IpProtocol': 'tcp',
            'FromPort': 22,
            'ToPort': 22,
            'IpRanges': [{'CidrIp': '0.0.0.0/0'}]
        }
    ]
)

# create an IAM role and policy to control access to EC2 instances
import json
iam = boto3.client('iam')
role_name = 'my-ec2-role'
policy_name = 'my-ec2-policy'

# trust policy: lets the EC2 service assume the role
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'ec2.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}

# permissions policy: what the role is allowed to do
policy_document = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': ['ec2:*'],
        'Resource': '*'
    }]
}

response = iam.create_role(RoleName=role_name,
                           AssumeRolePolicyDocument=json.dumps(trust_policy))
role_arn = response['Role']['Arn']
response = iam.create_policy(PolicyName=policy_name,
                             PolicyDocument=json.dumps(policy_document))
policy_arn = response['Policy']['Arn']
response = iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

In this code, we create a security group with the name `my-security-group`, description `my security group`, and associate it with a VPC with ID `vpc-0123456789abcdef0`. We then add a rule allowing incoming SSH traffic on port 22. Specifying the source range `0.0.0.0/0` allows SSH connections from any IP address, which is convenient for testing but should be narrowed to known addresses in production.

Next, we create an IAM role and policy that allows all EC2 actions and attach the policy to the role. This role can then be used to control/perform EC2 related actions.

In other words, the security group controls what kind of traffic can reach the instance (here, only SSH), while the IAM role controls which AWS API actions code running on the instance can perform (here, EC2-related actions).
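One step worth knowing: creating a role and attaching a policy to it is not enough for an instance to use it. The role must be wrapped in an instance profile, which is then associated with the instance. A minimal sketch, with placeholder names:

```python
def instance_profile_association(profile_name, instance_id):
    """Build the keyword arguments for associate_iam_instance_profile()."""
    return {
        'IamInstanceProfile': {'Name': profile_name},
        'InstanceId': instance_id,
    }

def attach_role_to_instance(role_name, instance_id, region_name='us-west-2'):
    """Wrap an existing role in an instance profile and attach it to an instance."""
    import boto3  # imported here so the pure helper above loads without boto3
    iam = boto3.client('iam')
    ec2 = boto3.client('ec2', region_name=region_name)
    # an instance profile is the container that carries a role onto an instance
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(InstanceProfileName=role_name,
                                     RoleName=role_name)
    ec2.associate_iam_instance_profile(
        **instance_profile_association(role_name, instance_id))
```

Using the role name as the profile name (as above) is a common convention, not a requirement.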

EC2 Load Balancing

Elastic Load Balancing (ELB) is a service provided by AWS that automatically distributes incoming traffic across multiple instances. Here’s an overview of how to use ELB with EC2 instances using Python Boto3:

  1. Creating and managing a load balancer: To create an Application Load Balancer, you can use the `create_load_balancer()` method of the `elbv2` client. You'll need to specify a name, a list of subnets, a list of security groups, a scheme, and a type. Once the load balancer is created, you can use the `describe_load_balancers()` method to get information about it.
  2. Registering and deregistering instances: With the `elbv2` API, instances are grouped into target groups. You create a target group with `create_target_group()`, register instances with `register_targets()`, and remove them with `deregister_targets()`, passing the target group ARN and a list of targets.

Here’s an example Python code snippet that shows how to create an ELB load balancer and register instances with it:

import boto3

# create a client for Elastic Load Balancing
elb_client = boto3.client('elbv2')

# create a new ELB
elb_name = 'my-elb'
subnet_ids = ['subnet-0123456789abcdef0', 'subnet-0123456789abcdef1']
response = elb_client.create_load_balancer(
    Name=elb_name,
    Subnets=subnet_ids,
    SecurityGroups=['sg-0123456789abcdef'],
    Scheme='internet-facing',
    Type='application'
)
elb_arn = response['LoadBalancers'][0]['LoadBalancerArn']

# create a target group for the ELB
target_group_name = 'my-target-group'
response = elb_client.create_target_group(
    Name=target_group_name,
    Protocol='HTTP',
    Port=80,
    TargetType='instance',
    VpcId='vpc-0123456789abcdef0'
)
target_group_arn = response['TargetGroups'][0]['TargetGroupArn']

# register instances with the target group
instance_ids = ['i-0123456789abcdef0', 'i-0123456789abcdef1']
response = elb_client.register_targets(
    TargetGroupArn=target_group_arn,
    Targets=[{'Id': instance_id} for instance_id in instance_ids]
)

# create a listener for the ELB
response = elb_client.create_listener(
    LoadBalancerArn=elb_arn,
    Protocol='HTTP',
    Port=80,
    DefaultActions=[{
        'Type': 'forward',
        'TargetGroupArn': target_group_arn
    }]
)

This code creates a new Application Load Balancer named my-elb, associated with subnets subnet-0123456789abcdef0 and subnet-0123456789abcdef1 and security group sg-0123456789abcdef. It also creates a target group named my-target-group in VPC vpc-0123456789abcdef0 and registers instances with it. Finally, it creates a listener that forwards incoming HTTP traffic on port 80 to the target group.

EC2 Autoscaling

Auto Scaling is a feature of AWS EC2 that automatically adjusts the number of EC2 instances in a group based on the demand of the applications. It helps to ensure that the application is running at optimal capacity without under or over-provisioning. Here’s how to use Auto Scaling in AWS EC2 with Python Boto3:

  1. Creating an Auto Scaling Group: To create an Auto Scaling Group, you need to use the `create_auto_scaling_group()` method of the Auto Scaling client. You’ll need to specify the desired number of instances, minimum and maximum number of instances, the launch configuration, the name of the Auto Scaling Group, and other parameters.
  2. Setting up Scaling Policies: To set up scaling policies, you can use the `put_scaling_policy()` method of the Auto Scaling client. You’ll need to specify the name of the policy, the Auto Scaling Group name, the scaling adjustment, and other parameters.
  3. Monitoring the Scaling Activities: You can use the `describe_auto_scaling_groups()` method to monitor the scaling activities. This method provides information on the current size of the Auto Scaling Group and the instances that are currently running.

Here’s an example Python code snippet that shows how to create an Auto Scaling Group, set up scaling policies, and monitor the scaling activities:

import boto3

# create an Auto Scaling client
autoscaling = boto3.client('autoscaling')

# create an Auto Scaling Group
auto_scaling_group_name = 'my-auto-scaling-group'
launch_configuration_name = 'my-launch-configuration'
min_size = 2
max_size = 5
desired_capacity = 2
response = autoscaling.create_auto_scaling_group(
    AutoScalingGroupName=auto_scaling_group_name,
    LaunchConfigurationName=launch_configuration_name,
    MinSize=min_size,
    MaxSize=max_size,
    DesiredCapacity=desired_capacity,
    AvailabilityZones=['us-west-2a', 'us-west-2b']
)

# set up scaling policies
scale_up_policy_name = 'scale-up-policy'
scale_down_policy_name = 'scale-down-policy'
response = autoscaling.put_scaling_policy(
    AutoScalingGroupName=auto_scaling_group_name,
    PolicyName=scale_up_policy_name,
    ScalingAdjustment=1,
    AdjustmentType='ChangeInCapacity'
)
response = autoscaling.put_scaling_policy(
    AutoScalingGroupName=auto_scaling_group_name,
    PolicyName=scale_down_policy_name,
    ScalingAdjustment=-1,
    AdjustmentType='ChangeInCapacity'
)

# monitor scaling activities
response = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[auto_scaling_group_name]
)

In this code, we create an Auto Scaling Group with the name `my-auto-scaling-group`, a launch configuration named `my-launch-configuration`, minimum size of 2, maximum size of 5, and desired capacity of 2. We also set up two scaling policies, one to scale up the Auto Scaling Group by 1 instance when triggered and another to scale down by 1 instance when triggered. Finally, we monitor the scaling activities by describing the Auto Scaling Group.

When created, this scaling group will immediately launch two instances, since the desired capacity (and minimum size) is 2. Traffic to an auto scaling group is typically directed through an Elastic Load Balancer, as shown below.

import boto3

# create a client for the Classic Load Balancer API
# (the previous section used the newer 'elbv2' API)
elb = boto3.client('elb')

# create a Classic Load Balancer
elb_name = 'my-elb'
response = elb.create_load_balancer(
    LoadBalancerName=elb_name,
    Listeners=[
        {
            'Protocol': 'HTTP',
            'LoadBalancerPort': 80,
            'InstanceProtocol': 'HTTP',
            'InstancePort': 80
        }
    ],
    AvailabilityZones=['us-west-2a', 'us-west-2b']
)

# register instances with the ELB
# (these are the two instances launched by the auto scaling group)
response = elb.register_instances_with_load_balancer(
    LoadBalancerName=elb_name,
    Instances=[
        {'InstanceId': 'instance-id-1'},
        {'InstanceId': 'instance-id-2'}
    ]
)

# attach the ELB to the auto scaling group
# (autoscaling is the Auto Scaling client created earlier)
response = autoscaling.attach_load_balancers(
    AutoScalingGroupName=auto_scaling_group_name,
    LoadBalancerNames=[elb_name]
)

The ELB will distribute traffic among the instances in the group, and as traffic increases or decreases, the Auto Scaling Group will automatically adjust the number of instances to meet demand. In this way, the ELB acts as a single point of contact for incoming traffic, and the Auto Scaling Group handles the scaling of instances to meet demand.
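In practice, the scaling policies themselves are triggered by CloudWatch alarms: `put_scaling_policy()` returns a `PolicyARN`, and an alarm whose `AlarmActions` list contains that ARN fires the policy when its threshold is breached. A sketch of wiring the two together (the names and thresholds are illustrative choices, not fixed values):

```python
def scaling_alarm_params(policy_arn, asg_name):
    """Alarm that fires a scale-up policy when average group CPU exceeds 70%."""
    return {
        'AlarmName': 'cpu-high-scale-up',
        'Namespace': 'AWS/EC2',
        'MetricName': 'CPUUtilization',
        'Statistic': 'Average',
        'Period': 300,               # evaluate 5-minute averages
        'EvaluationPeriods': 2,      # two consecutive breaches before firing
        'Threshold': 70.0,
        'ComparisonOperator': 'GreaterThanThreshold',
        'Dimensions': [{'Name': 'AutoScalingGroupName', 'Value': asg_name}],
        'AlarmActions': [policy_arn],
    }

def create_scaling_alarm(policy_arn, asg_name, region_name='us-west-2'):
    import boto3  # imported here so the pure helper above loads without boto3
    cloudwatch = boto3.client('cloudwatch', region_name=region_name)
    cloudwatch.put_metric_alarm(**scaling_alarm_params(policy_arn, asg_name))
```

A mirror-image alarm (LessThanThreshold at a low CPU value, pointing at the scale-down policy's ARN) completes the loop.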

Monitoring EC2 Usage

AWS CloudWatch is a monitoring and logging service that allows you to monitor and collect metrics, collect and monitor log files, and set alarms on metrics and logs. CloudWatch provides a unified view of your resources, applications, and services that run on AWS. Here’s how to use AWS CloudWatch to monitor and log EC2 instances using Python Boto3:

  1. Collecting Metrics: To collect metrics for an EC2 instance, you can use the `put_metric_data()` method of the CloudWatch client. You can collect metrics such as CPU utilization, memory usage, and network I/O, among others.
  2. Creating Alarms: You can use CloudWatch alarms to monitor metrics and take automated actions based on defined thresholds. You can create an alarm to notify you when an instance is running low on memory, for example. To create an alarm, you can use the `put_metric_alarm()` method of the CloudWatch client.

Here’s an example Python code snippet that shows how to collect metrics and set up CloudWatch alarms for an EC2 instance:

import boto3

# create a CloudWatch client
cloudwatch = boto3.client('cloudwatch')

# publish a metric data point for an EC2 instance
# (custom metrics must use a non-reserved namespace; namespaces beginning
# with 'AWS/' are reserved for metrics published by AWS itself)
instance_id = 'i-0123456789abcdef0'
response = cloudwatch.put_metric_data(
    Namespace='Custom/EC2',
    MetricData=[
        {
            'MetricName': 'CPUUtilization',
            'Dimensions': [
                {'Name': 'InstanceId', 'Value': instance_id}
            ],
            'Unit': 'Percent',
            'Value': 60.0
        },
    ]
)

# set up a CloudWatch alarm
# (EC2 does not publish memory metrics by default; MemoryUtilization here
# assumes the CloudWatch agent is installed, which publishes to 'CWAgent')
alarm_name = 'Low Memory Alarm'
response = cloudwatch.put_metric_alarm(
    AlarmName=alarm_name,
    ComparisonOperator='LessThanThreshold',
    EvaluationPeriods=1,
    MetricName='MemoryUtilization',
    Namespace='CWAgent',
    Period=60,
    Statistic='Average',
    Threshold=20.0,
    ActionsEnabled=False,
    AlarmDescription='Alarm when server memory is running low',
    Dimensions=[
        {'Name': 'InstanceId', 'Value': instance_id},
    ],
)

In this code, we use the `put_metric_data()` method to put metric data for an EC2 instance. We specify the instance ID and the `CPUUtilization` metric with a value of 60%.

We then set up a CloudWatch alarm using the `put_metric_alarm()` method to trigger when the `MemoryUtilization` metric falls below 20% for the specified instance. We set `ActionsEnabled` to `False` in this example to avoid triggering any actions, but you can set it to `True` to trigger an action, such as sending an email or SMS notification.

Optimizing EC2 Usage

Finally, let's look at some tips and best practices for using AWS EC2 with Python Boto3:

Optimize instance performance
Choose the right instance type for your workload, and regularly monitor your instances to identify any performance bottlenecks. You can use CloudWatch to monitor your instances and track metrics such as CPU utilization and network traffic.

  • Understand your workload: Before choosing an instance type, you need to understand the requirements of your workload. Consider factors such as CPU, memory, storage, network bandwidth, and I/O performance.
  • Use AWS tools to help you choose: AWS provides tools such as AWS Compute Optimizer and the EC2 instance type documentation that can help you choose the right instance type for your workload based on your performance and cost requirements.
  • Test your workload on different instance types: It's a good idea to benchmark your workload on different instance types to see which one performs best. Tools such as the Distributed Load Testing on AWS solution can simulate different workloads and measure performance across instance types.
  • Consider the availability requirements: If your workload requires high availability, spread your instances across multiple Availability Zones. You can also use AWS Auto Scaling to automatically add or remove instances based on demand.
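To ground the monitoring advice above, here is a sketch of pulling recent CPU utilization for an instance from CloudWatch; the instance ID and region are placeholders:

```python
from datetime import datetime, timedelta

def cpu_stats_params(instance_id, hours=1):
    """Keyword arguments for get_metric_statistics() over the last N hours."""
    end = datetime.utcnow()
    return {
        'Namespace': 'AWS/EC2',
        'MetricName': 'CPUUtilization',
        'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
        'StartTime': end - timedelta(hours=hours),
        'EndTime': end,
        'Period': 300,                        # 5-minute buckets
        'Statistics': ['Average', 'Maximum'],
    }

def recent_cpu(instance_id, region_name='us-west-2'):
    """Return the instance's recent CPU datapoints in chronological order."""
    import boto3  # imported here so the pure helper above loads without boto3
    cloudwatch = boto3.client('cloudwatch', region_name=region_name)
    resp = cloudwatch.get_metric_statistics(**cpu_stats_params(instance_id))
    return sorted(resp['Datapoints'], key=lambda d: d['Timestamp'])
```

A consistently low average suggests the instance is over-provisioned and a smaller type would do; sustained maxima near 100% suggest the opposite.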

Manage costs
EC2 can be cost-effective if used wisely. To keep costs under control, consider using spot instances for non-critical workloads, use reserved instances for steady-state workloads, and turn off instances when they’re not in use. You can also use cost optimization tools like AWS Cost Explorer to analyze and optimize your costs.
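For example, a Spot instance can be requested directly through `run_instances()` by adding `InstanceMarketOptions`. A sketch, where the AMI ID, instance type, and price cap are placeholders:

```python
def spot_run_params(ami_id, instance_type='t3.micro', max_price='0.01'):
    """run_instances() keyword arguments requesting a one-time Spot instance."""
    return {
        'ImageId': ami_id,
        'InstanceType': instance_type,
        'MinCount': 1,
        'MaxCount': 1,
        'InstanceMarketOptions': {
            'MarketType': 'spot',
            'SpotOptions': {
                'MaxPrice': max_price,        # omit to cap at the On-Demand price
                'SpotInstanceType': 'one-time',
            },
        },
    }

def launch_spot(ami_id, region_name='us-west-2'):
    import boto3  # imported here so the pure helper above loads without boto3
    ec2 = boto3.client('ec2', region_name=region_name)
    return ec2.run_instances(**spot_run_params(ami_id))
```

Remember that Spot capacity can be reclaimed with short notice, so this suits interruptible workloads such as batch model training with checkpointing, not stateful services.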

Ensure security and compliance
Follow security best practices such as using strong passwords, enabling multi-factor authentication (MFA), and regularly updating your software. Use AWS Identity and Access Management (IAM) to control access to your EC2 resources, and use security groups to control inbound and outbound traffic to your instances.

Automate tasks
Use automation tools like AWS CloudFormation and AWS OpsWorks to automate the creation and management of your EC2 resources. You can also use AWS Lambda to automate tasks such as backups and resource cleanup.

Use version control
Use a version control system like Git to manage your infrastructure code, including your Boto3 scripts. This will help you track changes over time and collaborate with your team.

Regularly backup data
Always back up your data to prevent data loss. You can use EBS snapshots or Amazon S3 to backup your data, and use AWS Backup to automate the backup process.
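As a small illustration of the S3 route, date-stamped object keys make restores much easier to navigate; the bucket and prefix names below are placeholders:

```python
from datetime import date

def backup_key(prefix, filename):
    """Build an S3 object key with a date segment, e.g. backups/2023-05-07/data.csv."""
    return f"{prefix}/{date.today().isoformat()}/{filename}"

def backup_to_s3(local_path, bucket, prefix='backups', region_name='us-west-2'):
    """Upload a local file to S3 under a date-stamped key and return that key."""
    import boto3  # imported here so the pure helper above loads without boto3
    import os
    s3 = boto3.client('s3', region_name=region_name)
    key = backup_key(prefix, os.path.basename(local_path))
    s3.upload_file(local_path, bucket, key)
    return key
```

Pairing this with an S3 lifecycle rule that expires old prefixes keeps storage costs from growing unbounded.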

Keep up-to-date with EC2 updates
AWS EC2 is constantly evolving with new features and updates. Keep up-to-date with these changes and take advantage of new features that can improve your workload performance, security, and reliability.

By following these tips and best practices, you can optimize your use of AWS EC2 and Python Boto3, reduce costs, improve security and compliance, and automate tasks to increase efficiency.

Conclusion

AWS EC2 is a powerful tool that provides data scientists with scalable computing capacity in the cloud, making it easier to run applications and workloads on virtual machines.

In this blog post, we covered the basics of instance creation, connecting to instances, managing instances, using Elastic Block Store (EBS), security in AWS EC2, Elastic Load Balancing (ELB), auto scaling, and monitoring instances using CloudWatch. We also provided some tips and best practices for using AWS EC2 with Python Boto3, such as selecting the right instance types, optimizing instance performance, managing costs, and ensuring security and compliance.

With this knowledge, data scientists can make the most of AWS EC2 to run their workloads efficiently and cost-effectively.

Credits

This post was written with help from ChatGPT. Some of the prompts used are:

Discuss the basics of instance creation using Python Boto3. This should include the creation of an Amazon Machine Image (AMI), choosing an instance type, specifying security groups, and creating an instance.

Explain how to connect to an instance using Python Boto3. This should cover SSH key pairs, public IP addresses, and using the EC2 client to connect to the instance.

Discuss how to manage instances using Python Boto3. This should include starting, stopping, and terminating instances, and also monitoring the status of instances using the EC2 client.


Karun Thankachan
CodeX

Simplifying data science concepts and domains. Get free 1-on-1 coaching @ https://topmate.io/karun