AWS Basics Training for Data Analysts and Data Scientists

Data analysts/scientists at Ro need to have a general grasp of AWS infrastructure. They also need to know how to do things safely and what questions to ask an administrator.

Sami Yabroudi
Dec 2 · 4 min read

Estimated time for completion: 1.5 days

Before starting, make an account on linuxacademy.com.

AWS Concepts

Start by running through AWS basic concepts:

You can skim the “Conclusion” section.

Understand AWS Permissioning

The worst but most important part of AWS is permissions (called IAM, short for Identity and Access Management). Do the “Identity and Access Management” section here, including the lab:

The goal of this section is not to make you a master AWS permissions admin, but to have you remember the general ideas about how policies and roles work. Pay special attention to the difference between an IAM user and an IAM role, and to the idea that roles can be assumed.
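As a rough illustration of how policies and roles fit together, the sketch below builds the two JSON documents that typically define a role: a trust policy (who is allowed to assume the role) and a permissions policy (what the role can do once assumed). The account ID, user name, and bucket name are made up for illustration and are not Ro's actual resources.

```python
import json


def build_trust_policy(principal_arn: str) -> dict:
    """Trust policy: allows the given IAM principal to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_read_only_s3_policy(bucket_name: str) -> dict:
    """Permissions policy: read-only access to a single S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself (for listing)
                    f"arn:aws:s3:::{bucket_name}/*",  # the objects inside it (for reading)
                ],
            }
        ],
    }


# Hypothetical identifiers, used only for this example.
EXAMPLE_USER_ARN = "arn:aws:iam::123456789012:user/example-analyst"
EXAMPLE_BUCKET = "example-analytics-bucket"

trust_policy = build_trust_policy(EXAMPLE_USER_ARN)
permissions_policy = build_read_only_s3_policy(EXAMPLE_BUCKET)

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```

The IAM user appears only in the trust policy; the permissions belong to the role. That separation is what lets an administrator grant access by allowing people to assume a role rather than attaching permissions to each user.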

EC2

An EC2 instance is basically a big computer/server in the cloud. EC2 is the most fundamental piece of infrastructure that Amazon offers, and applications running on variations of EC2 power much of Ro and much of Ro Data’s custom infrastructure. Go through the EC2 section here, including the Linux version of the lab:

S3

S3 is AWS’s main file (object) storage service. When S3 famously went down in 2017, the internet “broke”. S3 is integral to almost any usage of AWS. Go through the S3 section here, though only do the first lab:

RDS (And DynamoDB)

RDS is Amazon’s main managed relational (transactional) database service, and it powers Ro’s main applications (the Data team refers to it as prod db). DynamoDB is AWS’s competitor to MongoDB: a NoSQL key-value and document database. It has no fixed columns or joins; instead it offers a fast way to store and look up items, such as large blobs of text, by key. Do “Database Services”->“Summary of AWS Database Services” and “Database Services”->“RDS and DynamoDB Basics” here:
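To make the relational vs. document contrast concrete, here is a small sketch that runs locally. It uses Python's built-in sqlite3 as a stand-in for a relational engine hosted on RDS, and a plain dictionary as a stand-in for DynamoDB items looked up by key. The table, keys, and fields are invented for illustration; this is not how either service is actually called.

```python
import sqlite3

# Relational style (what RDS hosts): fixed columns, queried with SQL.
# sqlite3 is only a local stand-in for an engine such as PostgreSQL or MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, customer, total) VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 12.25)],
)
# Aggregating across rows is natural in SQL.
alice_total = conn.execute(
    "SELECT SUM(total) FROM orders WHERE customer = ?", ("alice",)
).fetchone()[0]
conn.close()

# Document / key-value style (DynamoDB-like): each item is fetched by its key,
# and items do not have to share the same set of fields.
items_by_key = {
    "order#1": {"customer": "alice", "total": 20.0, "lines": [{"sku": "A1", "qty": 2}]},
    "order#2": {"customer": "bob", "total": 35.5, "gift_note": "Happy birthday"},
}
order_two = items_by_key["order#2"]

print(alice_total)
print(order_two)
```

The relational side makes cross-row questions easy to ask, while the key-based side makes fetching one known item cheap; that trade-off is the main thing to remember from this section.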

Monitoring and Alerting

Ro’s favored tool for monitoring is Datadog. The purpose of monitoring is to make sure that all systems and infrastructure are running smoothly, and to alert when something goes wrong. Monitoring can be done at the level of infrastructure (e.g., EC2 memory usage, RDS transaction rate) but also within applications.
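The core idea behind an alert can be sketched in a few lines: collect samples of a metric and fire when recent values cross a threshold. The function name, the sample values, and the 85% threshold below are assumptions made for illustration; this is not Datadog's API or Ro's actual alert configuration.

```python
from statistics import mean
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float, min_samples: int = 3) -> bool:
    """Return True when the average of the most recent samples exceeds the threshold."""
    if len(samples) < min_samples:
        return False  # not enough data points to make a judgment
    recent = samples[-min_samples:]
    return mean(recent) > threshold


# Hypothetical EC2 memory usage readings, in percent, oldest first.
memory_usage_pct = [61.0, 64.5, 70.2, 88.0, 91.5, 93.0]
alert = should_alert(memory_usage_pct, threshold=85.0)
print("alert!" if alert else "all clear")
```

Averaging the last few samples rather than reacting to a single reading is one simple way to avoid alerting on a momentary spike.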

Please make sure to get a read-level Datadog account from IT. You may be able to access it through Okta.

Watch this quick overview video of Datadog: https://www.datadoghq.com/product/ (video “Get full visibility into modern applications”)

Within Datadog:

Other information:

Boto 3

Boto 3 is AWS’s SDK for Python (use it in Python simply via “import boto3”). You can use Boto 3 to do much of what you can do via AWS’s graphical user interface (e.g., access files in S3, manipulate EC2 instances, assume IAM roles).
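A minimal sketch of two of those tasks combined, assuming a role and then reading a file from S3, might look like the following. The role ARN, bucket, key, and session name are hypothetical, and the real call only works with valid AWS credentials and a role you are permitted to assume, so the usage is left as a comment.

```python
def session_kwargs_from_credentials(credentials: dict) -> dict:
    """Map the Credentials block returned by STS AssumeRole to boto3.Session arguments."""
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


def read_s3_object_with_role(role_arn: str, bucket: str, key: str) -> bytes:
    """Assume an IAM role, then read one S3 object using the temporary credentials."""
    import boto3  # imported here so the helper above can be used without boto3 installed

    # Ask STS for temporary credentials tied to the role.
    sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="analyst-example-session",  # arbitrary label for this session
    )

    # Build a session that acts as the role, not as the original user.
    session = boto3.Session(**session_kwargs_from_credentials(response["Credentials"]))
    s3 = session.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read()


# Example usage (hypothetical names; requires real AWS access to run):
# data = read_s3_object_with_role(
#     role_arn="arn:aws:iam::123456789012:role/example-read-only-role",
#     bucket="example-analytics-bucket",
#     key="reports/example.csv",
# )
# print(len(data), "bytes read")
```

Before running anything like this against real infrastructure, ask an administrator which role you should be assuming and what it is allowed to touch.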

Lambda

Let’s say you have a Python function that takes inputs from an HTTP request and does something with those inputs (e.g., performs a calculation and returns the result, or performs a calculation and saves the result to RDS). One way to host this function is to put it within an app and then put the app on an EC2 instance, but that’s a large amount of work to host a rather trivial function. A far simpler approach is to turn this function into a Lambda function. A Lambda function is “serverless”, meaning you just set up the function using AWS’s UI and then the function can receive calls and execute without your ever having to provision an application, an EC2 instance, etc. Lambda functions can handle a wide range of tasks, not just HTTP requests.
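For a sense of what such a function looks like, here is a minimal sketch of a Python Lambda handler. It assumes the function sits behind an HTTP front end (such as an API Gateway proxy integration) that passes the request body as a JSON string in event["body"]; the "numbers" field and the summing logic are invented for illustration.

```python
import json


def lambda_handler(event, context):
    """Entry point that Lambda calls; sums the numbers sent in the request body."""
    try:
        payload = json.loads(event.get("body") or "{}")
        numbers = [float(n) for n in payload["numbers"]]
    except (KeyError, TypeError, ValueError):
        # Missing field, wrong type, or a body that is not valid JSON.
        return {
            "statusCode": 400,
            "body": json.dumps({"error": 'expected a JSON body with a "numbers" list'}),
        }
    return {"statusCode": 200, "body": json.dumps({"sum": sum(numbers)})}


# Local call with a hand-built event, no AWS needed.
example_event = {"body": json.dumps({"numbers": [1, 2, 3.5]})}
example_response = lambda_handler(example_event, None)
print(example_response)
```

Because the handler is just a function of an event dictionary, you can test it locally like this before deploying it.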

Do the “Serverless Compute” section here:

Ro Data Team Blog

Ro Data Team Blog: data analytics, data engineering, data science
