Automating Data Management on AWS: A Step-by-Step Tutorial

precious ogundipe
4 min read · Nov 30, 2023


By incorporating automation, data handling becomes more efficient, more precise, and more adaptable to the dynamic needs of contemporary businesses. Automation uses technology to repeat a set of steps reliably, eliminating the need to perform them manually over and over again. By removing mundane activities, organizations can optimize their resources, minimize errors, and concentrate on the vital aspects of data utilization. This article shows how the AWS Cloud offers various services, such as Amazon CloudWatch and AWS CloudFormation, to automate data management tasks. The emphasis will be on leveraging Amazon S3 and Amazon Elastic Block Store (EBS) for streamlined and efficient data management.

Case Study:

A company regularly uses Amazon EC2 instances to support its dynamic and scalable infrastructure. Managing data is critical for the company’s operations, and they have decided to implement an automated solution to create and maintain snapshots for EC2 instances, copy important files to an S3 bucket, and leverage S3 versioning for enhanced data resilience.

Architecture

Create an S3 bucket:

Pick a globally unique name for your S3 bucket and create the bucket.
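S3 bucket names must be globally unique, 3–63 characters long, and limited to lowercase letters, digits, hyphens, and dots, starting and ending with a letter or digit. A quick sketch of checking a candidate name against these basic rules before creating the bucket (the helper function is hypothetical, and real S3 enforces a few additional rules, such as no consecutive dots):

```python
import re

# Basic S3 bucket naming rules: 3-63 chars, lowercase letters, digits,
# hyphens, and dots; must start and end with a letter or digit.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the basic S3 naming rules."""
    return bool(BUCKET_NAME_RE.match(name))

print(is_valid_bucket_name("datamanagementsbucket"))  # True
print(is_valid_bucket_name("Data_Bucket"))            # False: uppercase and underscore
```

Validating the name locally avoids a round trip to S3 that would fail with an `InvalidBucketName` error.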

Attach an instance profile to the Processor (EC2 instance):

Attach an IAM instance profile that grants the necessary EC2 and S3 permissions to the Processor instance (for example, through the console or with `aws ec2 associate-iam-instance-profile`), so the instance can run the AWS CLI commands below.

Taking snapshots of your instance

A snapshot of the instance is taken, and the instance is then restarted.
# To display the EBS volume-id
aws ec2 describe-instances --filters 'Name=tag:Name,Values=Processor' --query 'Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.{VolumeId:VolumeId}'
# To display the instance ID
aws ec2 describe-instances --filters 'Name=tag:Name,Values=Processor' --query 'Reservations[0].Instances[0].InstanceId'
# To shut down the "Processor" instance
aws ec2 stop-instances --instance-ids INSTANCE-ID
# To verify that the "Processor" instance stopped
aws ec2 wait instance-stopped --instance-id INSTANCE-ID
# To create your first snapshot of the volume of your "Processor" instance
aws ec2 create-snapshot --volume-id VOLUME-ID
# To check the status of your snapshot
aws ec2 wait snapshot-completed --snapshot-id SNAPSHOT-ID
# To restart the "Processor" instance
aws ec2 start-instances --instance-ids INSTANCE-ID

Automating the creation of subsequent snapshots:

Using the Linux scheduling system (cron), you can set up a recurring snapshot process so that new snapshots of your data are taken automatically.

Cron job scheduled to take snapshots every minute
# To create and schedule a cron entry that runs a job every minute
echo "* * * * * aws ec2 create-snapshot --volume-id VOLUME-ID >> /tmp/cronlog 2>&1" > cronjob
crontab cronjob
# To verify that subsequent snapshots are being created
aws ec2 describe-snapshots --filters "Name=volume-id,Values=VOLUME-ID"
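A crontab entry always has five schedule fields (minute, hour, day of month, month, day of week) followed by the command. This short sketch (plain Python, no AWS access needed) assembles an equivalent entry, making the fields and the log-redirection order explicit:

```python
# Assemble a crontab line: five schedule fields, the command, then redirection.
schedule = "* * * * *"  # minute, hour, day of month, month, day of week: every minute
command = "aws ec2 create-snapshot --volume-id VOLUME-ID"
redirection = ">> /tmp/cronlog 2>&1"  # append stdout to the log, then send stderr there too

cron_entry = f"{schedule} {command} {redirection}"
print(cron_entry)
```

Note that `>> /tmp/cronlog 2>&1` (file first, then `2>&1`) captures both stdout and stderr in the log; reversing the order would leave stderr pointed at cron's original stdout.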

Retaining the last two snapshots:

Write a Python script that maintains only the last two snapshots for any given EBS volume.

For each volume, the script creates a new snapshot, then deletes all but the last two snapshots.
#!/usr/bin/env python

import boto3

MAX_SNAPSHOTS = 2  # Number of snapshots to keep

# Create the EC2 resource
ec2 = boto3.resource('ec2')

# Get a list of all volumes
volume_iterator = ec2.volumes.all()

# Create a snapshot of each volume
for v in volume_iterator:
    v.create_snapshot()

    # Too many snapshots?
    snapshots = list(v.snapshots.all())
    if len(snapshots) > MAX_SNAPSHOTS:
        # Delete the oldest snapshots, but keep MAX_SNAPSHOTS available
        snap_sorted = sorted([(s.id, s.start_time, s) for s in snapshots], key=lambda k: k[1])
        for s in snap_sorted[:-MAX_SNAPSHOTS]:
            print("Deleting snapshot", s[0])
            s[2].delete()
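To see the retention logic in isolation, here is the same keep-the-newest-two idea run against plain (snapshot-id, start-time) tuples instead of live EBS snapshots; the ids and timestamps are made up for illustration, and no AWS access is required:

```python
from datetime import datetime, timedelta

MAX_SNAPSHOTS = 2  # Number of snapshots to keep

# Fake snapshot records: (snapshot id, start time); ids are hypothetical.
base = datetime(2023, 11, 30)
snapshots = [
    ("snap-0003", base + timedelta(minutes=2)),
    ("snap-0001", base),
    ("snap-0004", base + timedelta(minutes=3)),
    ("snap-0002", base + timedelta(minutes=1)),
]

# Sort oldest-first, exactly as the script sorts on start_time
snap_sorted = sorted(snapshots, key=lambda s: s[1])

# Everything except the newest MAX_SNAPSHOTS would be deleted
to_delete = [s[0] for s in snap_sorted[:-MAX_SNAPSHOTS]]
to_keep = [s[0] for s in snap_sorted[-MAX_SNAPSHOTS:]]

print("Deleting:", to_delete)  # ['snap-0001', 'snap-0002']
print("Keeping:", to_keep)     # ['snap-0003', 'snap-0004']
```

The slice `[:-MAX_SNAPSHOTS]` is what guarantees the two most recent snapshots always survive, regardless of how many have accumulated.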

Synchronize files with Amazon S3:

Activate versioning for the bucket, then sync your local files to your S3 bucket using the AWS CLI.

Congratulations, you have successfully uploaded your files to S3.
# To activate versioning for the bucket
aws s3api put-bucket-versioning --bucket datamanagementsbucket --versioning-configuration Status=Enabled
# To sync the contents of a local folder with your Amazon S3 bucket
aws s3 sync files s3://datamanagementsbucket/files/
# To retrieve a specific version of a file from the bucket
aws s3api get-object --bucket datamanagementsbucket --key files/file1.txt --version-id VERSION-ID files/file1.txt
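S3 versioning keeps every version of an object instead of overwriting it, which is what makes retrieval by VERSION-ID possible. A toy in-memory model of that behavior, purely illustrative and not the real S3 API:

```python
# Toy model of a versioned bucket: each key maps to a list of versions.
class VersionedBucket:
    def __init__(self):
        self.objects = {}  # key -> list of (version_id, body)

    def put_object(self, key, body):
        """Append a new version rather than overwriting the old one."""
        versions = self.objects.setdefault(key, [])
        version_id = f"v{len(versions) + 1}"  # real S3 uses opaque version ids
        versions.append((version_id, body))
        return version_id

    def get_object(self, key, version_id=None):
        """Return the latest version by default, or a specific one by id."""
        versions = self.objects[key]
        if version_id is None:
            return versions[-1][1]
        return next(body for vid, body in versions if vid == version_id)

bucket = VersionedBucket()
v1 = bucket.put_object("files/file1.txt", "first draft")
v2 = bucket.put_object("files/file1.txt", "second draft")
print(bucket.get_object("files/file1.txt"))                 # "second draft"
print(bucket.get_object("files/file1.txt", version_id=v1))  # "first draft"
```

This is why versioning improves resilience: an accidental overwrite or deletion never destroys the earlier versions, which remain retrievable by their version ids.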

