AWS: The Data Scientist’s Best Friend

One of the best and most popular cloud-based platforms for data scientists

Ritti Bhogal
NYU Data Science Review
5 min read · Mar 22, 2022

Image taken from Unsplash

Cloud computing is heavily intertwined with the work of data scientists. Advanced data analytics has become possible thanks to storing and processing data in the cloud. While several cloud computing platforms (such as Microsoft Azure and Google Cloud Platform) have narrowed the gap between data science and cloud computing, none does it better than the industry leader: Amazon Web Services.

Amazon Web Services (AWS) is an evolving cloud computing platform developed by Amazon that offers reliable, intuitive, and cost-effective cloud-based products. The services available on the platform span three models: infrastructure as a service (IaaS), software as a service (SaaS), and platform as a service (PaaS). IaaS provides highly scalable and automated computing resources in the cloud. SaaS delivers applications that a third-party vendor hosts and serves over the internet. PaaS provides a platform on which software can be built and hosted in the cloud. These are three of the most popular cloud service offerings, and they have largely replaced on-premises infrastructure, that is, hardware installed in the client’s own building. AWS uses a combination of all three models to give its users the most hands-off experience when it comes to application development. AWS emphasizes scalability and operates under a pay-as-you-go (PAYG) cloud model, charging its customers only for the time they spend using each of the offered services¹. Some of the fundamental services provided by AWS are:

Elastic Compute Cloud (EC2)

  • AWS’s virtual machine service, with control over the environment down to the operating-system level
  • Run cloud-native and enterprise applications on a single EC2 instance that can be resized on demand²

Amazon Simple Storage Service (Amazon S3)

  • A secure, low-cost and scalable storage service
  • Stores anything from PDF documents to mobile applications³

Amazon Virtual Private Cloud (Amazon VPC)

  • Provides a customizable virtual network in which the user can launch AWS resources
  • Enhance web application security through enforcing network outbound and inbound traffic rules⁴

And that’s just to name a few of the more than 200 services currently available. Thanks to its flexibility, AWS attracts clients ranging from solo software developers itching to get a taste of its developer tools to large-scale enterprises, including Netflix, Autodesk, and Coursera⁵.
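To make the pay-as-you-go model mentioned above a little more concrete, here is a minimal Python sketch of how a usage-based bill adds up. The service labels, hourly rates, and usage figures are made-up placeholders, not real AWS prices, and real billing has more dimensions (storage, data transfer, and so on) than this toy calculation.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    service: str          # label only, e.g. "EC2 instance"
    hours_used: float     # time the resource was actually running
    rate_per_hour: float  # HYPOTHETICAL rate in USD, not a real AWS price


def pay_as_you_go_cost(usages):
    """Sum the cost of each resource over the hours it actually ran."""
    return sum(u.hours_used * u.rate_per_hour for u in usages)


# Hypothetical month of usage: nothing is charged for idle time.
month = [
    Usage("EC2 instance", hours_used=40, rate_per_hour=0.10),
    Usage("Notebook instance", hours_used=12.5, rate_per_hour=0.20),
]

total = pay_as_you_go_cost(month)
print(f"Estimated bill: ${total:.2f}")
```

The point of the sketch is simply that the bill scales with hours of use rather than with a fixed up-front purchase.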

How does AWS gear itself towards data scientists, you may ask? Its Cloud Computing Concepts Hub features articles on several broad topics, one of which is “What is Data Science?” AWS has also created a data science learning path that serves as a beginner’s guide for data scientists using AWS for the first time. It includes a short description of Amazon SageMaker (one of the more data-science-oriented services, which I’ll get into later) and a tutorial on how to use it to build, train, and deploy machine learning models⁶.

So it appears that AWS can be a paradise for data scientists. But exactly which of these services are meant to ease the lives of our favorite analytical experts? I’ve sifted through AWS’s analytics-based services and listed what I believe are the top five below.

1. Amazon SageMaker

As I mentioned earlier, SageMaker is a tool for building machine learning models for tasks that range from image classification to natural language processing. Data sources can be accessed easily from a Jupyter Notebook instance, which eliminates the headache of managing servers (AWS takes care of that for users). A model’s training data and output can be stored in an S3 bucket, identified by its URL. One popular application of SageMaker is an ML model that detects fraud in online monetary transactions⁷.
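As a small illustration of pointing a training job at an S3 location, the sketch below splits an S3 address into its bucket and key prefix and builds the output setting that a SageMaker training job request takes. It assumes the address is written in the `s3://bucket/prefix` form; the bucket name and both helper functions are hypothetical, and nothing here contacts AWS.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix/key' into (bucket, key_or_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def training_output_config(output_uri):
    """Build the OutputDataConfig block of a training job request.

    Only validates and wraps the URI; sending the request would need
    an AWS SDK and credentials, which this sketch leaves out.
    """
    split_s3_uri(output_uri)  # raises ValueError on a malformed address
    return {"S3OutputPath": output_uri}


# Hypothetical bucket and prefix, for illustration only.
output_uri = "s3://my-example-bucket/fraud-model/output"
bucket, prefix = split_s3_uri(output_uri)
config = training_output_config(output_uri)
print(bucket, prefix, config)
```

Validating the address before submitting a job is a cheap way to catch typos early, since a training run that cannot write its output is an expensive mistake.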

2. Amazon Elastic MapReduce (EMR)

This service gives data scientists the ability to process and analyze large amounts of data using open-source frameworks such as Apache Spark, Hive, and TensorFlow. Combined with Amazon SageMaker, EMR can conduct large-scale ML model training. From an enterprise perspective, Amazon EMR can run “what-if” algorithms that uncover patterns in customer preferences and market trends⁸.
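The “MapReduce” in the service’s name refers to a programming model: map each record to a key/value pair, group the pairs by key, then reduce each group to a result. The toy sketch below runs that pattern on a single machine with made-up purchase records. It is not EMR code; on EMR, a framework such as Spark would distribute the same steps across a cluster.

```python
from collections import defaultdict

# Made-up records: (customer, category, amount spent)
purchases = [
    ("alice", "books", 12.0),
    ("bob", "games", 30.0),
    ("alice", "games", 18.0),
    ("carol", "books", 7.5),
]


def map_phase(records):
    """Emit one (category, amount) pair per purchase."""
    for _customer, category, amount in records:
        yield category, amount


def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of amounts into a single total."""
    return {key: sum(values) for key, values in groups.items()}


spend_by_category = reduce_phase(shuffle(map_phase(purchases)))
print(spend_by_category)
```

Because each map call and each reduce call is independent of the others, the work can be split across many machines, which is what makes the pattern suitable for large datasets.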

3. AWS Glue

AWS Glue is an ETL (extract, transform, and load) service that provides all the capabilities required for data integration. Its capabilities range from extracting data to organizing it into data lakes and warehouses. Users can create and run serverless ETL jobs through the AWS Glue console, avoiding the hassle of coding them by hand. AWS Glue can be integrated with other data storage services offered by AWS such as Amazon S3, Amazon Redshift, AWS Lake Formation, Amazon Aurora, and Amazon Relational Database Service⁹.
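To show what the three ETL steps mean in practice, here is a hand-written, single-machine sketch using only Python’s standard library. It is not the Glue API: the CSV data is made up, and an in-memory SQLite database stands in for a real data warehouse. A Glue job would run the equivalent steps as a managed, serverless job.

```python
import csv
import io
import sqlite3

# Made-up raw data with typical problems: stray spaces, mixed case, a missing value.
RAW_CSV = """order_id,amount,country
1, 19.99 ,us
2,5.00,US
3,,de
"""


def extract(text):
    """Extract: read the raw CSV into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, fix types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        country = row["country"].strip().upper()
        cleaned.append((int(row["order_id"]), float(amount), country))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```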

4. Amazon Kinesis Video Streams

AWS makes it possible to work with visual data through the Amazon Kinesis Video Streams service. The console allows for live stream monitoring, and streams can be configured to store or encrypt data in the cloud for a preset period of time. The service is not limited to visual data; it can also process audio and thermal data. Kinesis Video Streams is generally used to process video feeds from video chat and peer-to-peer media streaming, but it can also ingest data from radar, lidar, drones, satellites, dash cams, and depth sensors¹⁰.
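The retention and encryption settings described above can also be set from code rather than the console. The sketch below builds the arguments for the service’s CreateStream request, assuming the boto3 SDK’s `kinesisvideo` client is used to send it. The stream name and both helper functions are hypothetical, and the function that actually contacts AWS is defined but never called, since it needs boto3 and valid credentials.

```python
def video_stream_request(stream_name, retention_hours, kms_key_id=None):
    """Build keyword arguments for a CreateStream request."""
    if retention_hours < 0:
        raise ValueError("retention_hours must be non-negative")
    params = {
        "StreamName": stream_name,
        # How long the service keeps the stream's data, in hours.
        "DataRetentionInHours": retention_hours,
    }
    if kms_key_id is not None:
        # Optional key used to encrypt the stored data.
        params["KmsKeyId"] = kms_key_id
    return params


def create_stream(params):
    """Send the request to AWS. Not called in this sketch."""
    import boto3  # assumed to be installed and configured with credentials

    client = boto3.client("kinesisvideo")
    return client.create_stream(**params)


# Hypothetical stream that keeps one day of dash cam footage.
request = video_stream_request("dashcam-demo", retention_hours=24)
print(request)
```

Keeping the request-building step separate from the network call makes the settings easy to inspect and check before anything is created in an AWS account.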

5. Amazon Relational Database Service (RDS)

Amazon RDS brings relational database creation to a whole new level, offering scalability and cost-efficiency while automating administrative tasks such as database setup, patching, and backups. Users can manage their database with a familiar engine such as MySQL or MariaDB, or with Amazon’s own engine, Amazon Aurora, which Amazon RDS can fully manage¹¹.

While the services mentioned above are geared towards data processing, all of the AWS services should be on the radar of data scientists. However, it is also important to note that AWS isn’t the only platform out there that can benefit data analysis experts. Other platforms include Microsoft Azure, the second most popular cloud computing platform after AWS. Resonate Ignite Platform, an AI-driven consumer analytics and data platform, is also an acceptable option. Below are links to certifications and educational resources that the companies behind these platforms offer specifically for data scientists.

Microsoft Data Scientist Path — The data science learning path offered by Microsoft utilizing Microsoft Azure

Learning Journey for Machine Learning on Azure — Interactive learning document for machine learning on Azure to prep for Azure Data Scientist Associate certification

Data Science Virtual Machines (DSVMs) — Download for Microsoft’s Azure-based virtual machine service

Resonate Ignite Platform Demo — demonstration request form for Resonate Ignite Platform

Do you know of other platforms comparable to AWS that data scientists need to try out? Leave them in the comments below. Wishing all data scientists the best of luck on their analytical journeys!

References

[1] Alexander S. Gillis, Amazon Web Services (AWS), TechTarget.

[2] Amazon Web Services, Amazon EC2, Amazon.

[3] Amazon Web Services, Amazon S3, Amazon.

[4] Amazon Web Services, Amazon VPC, Amazon.

[5] John Cave, Who’s Using Amazon Web Services? (January 2020), Contino.

[6] Amazon Web Services, Data Scientist — Learning Path, Amazon.

[7] Amazon Web Services, Amazon SageMaker, Amazon.

[8] Amazon Web Services, Amazon EMR — Big Data Platform, Amazon.

[9] Amazon Web Services, AWS Glue, Amazon.

[10] Gilad David Maayan, 5 AWS Services Every Data Scientist Should Use (July 2021), Medium.

[11] Amazon Web Services, Amazon RDS, Amazon.

Ritti Bhogal
NYU Data Science Review

Computer Science at NYU Tandon | NYU Data Science Club | NYU RoboMaster team UltraViolet | water is wet