Data Extraction on AWS using Python: boto3, AWS SDK for Pandas (awswrangler), redshift_connector and PyAthena — Part 1
Everything you need to know to access data from AWS data services, including S3, Redshift, and Athena
The Importance of Data Extraction
When data scientists start a new project, they first need to identify where the data is and how they can access it programmatically. As more and more organizations have either migrated their data infrastructure from an on-premises environment to the public cloud or built their data infrastructure on the cloud from scratch, data scientists also need to upgrade their skills and learn how to efficiently extract and ingest data from the data-related services of the public cloud.
The top 3 cloud providers are AWS, Azure and GCP, and they all offer quite a few data-specific services. The focus of today’s blog is AWS. I will publish two similar blogs for Azure and GCP respectively. Please feel free to follow me if you’d like to be notified when the blogs for Azure and GCP are published.
The data services that data scientists and Machine Learning (ML) engineers need to heavily interact with in order to extract data from AWS are as follows:
- AWS S3 — AWS S3 is an object storage service. Cloud tenants can store basically any volume of data of any format in S3. This is why many organizations build their data lake solutions on top of AWS S3, due to its cheap storage and flexibility to store data of any format, like semi-structured and unstructured data.
- Amazon Redshift — Redshift is AWS’s cloud data warehouse service. End users can use SQL to query data in Redshift, which is generally used to store structured tabular data and is built for analytical workloads. It is quite common that data scientists need to extract data from Redshift if their organization's data infrastructure is built on AWS.
- Amazon Athena — Athena is an interactive query engine service that allows data users like data scientists and data analysts to query the data stored in S3 interactively, to undertake data exploratory analysis using standard SQL. When the data set (stored in S3) is very large, data scientists generally need to explore the data before they decide which part will be extracted into their data science environment for further machine learning work. This is when Athena will play a role in the data scientists’ toolbox.
In this blog, I will cover the key data extraction patterns and share implementations of each pattern in Python, leveraging the following four open-source Python libraries: boto3, AWS SDK for Pandas (awswrangler), redshift_connector and PyAthena. (At the conclusion of this series, we will also compare these four libraries and explain when to use which.)
Data Extraction on AWS using boto3 — Programming Model
We will start with boto3, as it is the most generic approach to interacting with any AWS service. boto3 is the AWS SDK for Python. Users can use it to create, configure, and manage any AWS service, so it has a much broader use than just the data services on AWS, and it provides the most flexibility of the four libraries. Let’s see how boto3 works.
Ingesting data from AWS S3 using boto3. Below I summarize the generic programming model that you can follow when working with boto3, regardless of which AWS service you interact with (a minimal sketch of steps 1 and 2 follows the list):
- Step 1 — Make sure the credentials used to connect to AWS are available, including aws_access_key_id, aws_secret_access_key, and aws_session_token. These credentials are used to authenticate and authorize the API requests made to the underlying AWS resources.
- Step 2 — Create a service client by name, such as S3, Redshift, or Athena.
- Step 3 — Understand the corresponding resources and sub-resources that you can work with for each AWS service. For S3, the fundamental resources and sub-resources are buckets and objects. For Redshift, they are clusters, schemas, databases, tables and queries. For Athena, they are data catalogs and queries.
- Step 4 — Explore the methods and actions that can be taken on the abovementioned resources and sub-resources. For S3 buckets, you can perform actions such as copy, create and delete buckets, and put objects into an S3 bucket; for S3 objects, you can perform actions such as copy, get, delete, upload and download objects. For Redshift clusters, you can perform actions such as create, tag, describe and delete clusters; for Redshift data, you can perform actions such as list databases, schemas and tables, and execute query statements. For Athena, you can perform actions such as create, get and list data catalogs, execute queries and get query results.
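Here is a minimal sketch of steps 1 and 2: creating a session with explicit credentials and then creating service clients by name. The credential values and region are placeholders; in practice, credentials usually come from environment variables, the shared credentials file, or an IAM role.

```python
# Minimal sketch of steps 1 and 2 of the boto3 programming model.
# The credential values and region below are placeholders.
import boto3

session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
    aws_session_token="YOUR_SESSION_TOKEN",          # placeholder (only needed for temporary credentials)
    region_name="us-east-1",                         # example region
)

# Step 2: create service clients by name
s3_client = session.client("s3")
redshift_data_client = session.client("redshift-data")  # RedshiftDataAPIService
athena_client = session.client("athena")
```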
Data Extraction on S3 — boto3 Implementation Guidance
Interacting with data in S3 using boto3 — The code snippet shown below covers the most commonly used methods that data scientists will use to interact with data in S3, including listing buckets, listing objects, getting objects (reading CSV and Parquet files), and putting objects (saving files back to S3).
One very important thing to note when calling the boto3 API is that the returned result is always in a JSON-style format, because boto3 calls the underlying S3 APIs. Data scientists have to leverage other tools to parse the JSON results into a data frame, which is the structure they are familiar with. In the example shown below, you will notice that in order to read in a CSV file, we have to leverage pandas to convert the [‘Body’] part of the returned JSON response into a data frame.
This is also the main limitation of boto3 for most data scientists, as they always need to think about how to parse the JSON results. This is why boto3 is not always the preferred approach for data scientists when it comes to data extraction. Instead, AWS SDK for Pandas, also called awswrangler, is often a better option specifically for data extraction. When we dive into AWS SDK for Pandas in the next article, I will demonstrate how it makes life easier for data scientists. For now, let’s see how you can extract data from S3 using boto3. I have given examples of reading data in CSV and Parquet formats.
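Here is a minimal sketch of those operations. The bucket name and object keys are placeholders, and reading a Parquet file from a buffer assumes pyarrow (or fastparquet) is installed.

```python
# Sketch: common S3 data-access patterns with boto3 and pandas.
# The bucket name and object keys below are placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# 1. List buckets
bucket_names = [bucket["Name"] for bucket in s3.list_buckets()["Buckets"]]

# 2. List objects under a prefix
response = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="raw/")
object_keys = [obj["Key"] for obj in response.get("Contents", [])]

# 3. Get an object and parse the CSV 'Body' into a data frame with pandas
csv_obj = s3.get_object(Bucket="my-data-bucket", Key="raw/sales.csv")
csv_df = pd.read_csv(io.BytesIO(csv_obj["Body"].read()))

# 4. Get a Parquet object and parse it into a data frame
parquet_obj = s3.get_object(Bucket="my-data-bucket", Key="raw/sales.parquet")
parquet_df = pd.read_parquet(io.BytesIO(parquet_obj["Body"].read()))

# 5. Put a data frame back to S3 as a CSV object
csv_buffer = io.StringIO()
csv_df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="my-data-bucket", Key="processed/sales.csv", Body=csv_buffer.getvalue())
```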
In the next part, I will walk through the implementation guidance for extracting data from Redshift using boto3.
Data Extraction on Redshift — boto3 Implementation Guidance
Interacting with data in Redshift using boto3 — boto3 has three sets of APIs for interacting with Redshift. The first is Redshift, which is used for managing Redshift clusters; the second is RedshiftDataAPIService, which is used for running SQL commands against Amazon Redshift tables; and the third is RedshiftServerless, which is used to manage resources with Amazon Redshift Serverless. Our focus will be RedshiftDataAPIService, as getting access to data in Redshift is the purpose of today’s blog.
RedshiftDataAPIService has the batch_execute_statement method, which allows users to run one or more SQL statements that can be data manipulation language (DML) or data definition language (DDL). As with S3, when you use boto3 to query data from Redshift databases and tables, the returned result is also in JSON format, and again data scientists have to parse the result. In the code snippet below (action 3), you can see that I have to define a whole function just to convert the JSON response into a Pandas DataFrame.
This is why boto3 is not particularly user-friendly for data scientists when using it to extract data. When we do the deep dive into AWS SDK for Pandas in the next part of this series, we will be able to demonstrate how it makes things easier for data scientists. But for now, let’s see how you can extract data from Redshift using boto3.
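Below is a minimal sketch of that flow against a provisioned cluster. It uses execute_statement (the single-statement counterpart of batch_execute_statement); the cluster identifier, database, user and query are placeholders, and the result parsing at the end is one possible way to turn the JSON-style response into a Pandas DataFrame.

```python
# Sketch: query Redshift via the Redshift Data API and convert the
# response into a Pandas DataFrame. Identifiers below are placeholders.
import time

import boto3
import pandas as pd

redshift_data = boto3.client("redshift-data")

# Submit the SQL statement
submit = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    Database="dev",                           # placeholder
    DbUser="awsuser",                         # placeholder
    Sql="SELECT * FROM public.sales LIMIT 100;",
)
statement_id = submit["Id"]

# Poll until the statement reaches a terminal state
while True:
    status = redshift_data.describe_statement(Id=statement_id)["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

# Fetch the result and convert the JSON-style records into a DataFrame
result = redshift_data.get_statement_result(Id=statement_id)
columns = [col["name"] for col in result["ColumnMetadata"]]
rows = [
    [None if field.get("isNull") else next(iter(field.values())) for field in record]
    for record in result["Records"]
]
df = pd.DataFrame(rows, columns=columns)
```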
In the next part, I will walk through the implementation guidance for extracting data from Athena using boto3.
Data Extraction on Athena — boto3 Implementation Guidance
Interacting with data in Athena using boto3 — Amazon Athena is an interactive query service that lets you use standard SQL to analyze data directly in Amazon S3. You can point Athena at your data in Amazon S3, run ad-hoc queries, and get results in seconds. Athena is also serverless, so there is no infrastructure to set up or manage.
The API call to access data through Athena is start_query_execution, which runs SQL query statements. When calling start_query_execution, you need to define the query itself as well as an S3 location to store the query results. Then you need to run another API call, get_query_results, with the query execution ID to access the query results.
As with S3 and Redshift, the result is returned in JSON format. Data scientists need to convert the JSON results into a data frame in order to continue their analysis of the data. As mentioned earlier, in the part on AWS SDK for Pandas, also called awswrangler, I will demonstrate how it makes this easier.
For now, let’s see how data scientists can use boto3 to execute Athena queries and get access to the query results.
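Here is a minimal sketch of that flow. The database, table and query result location are placeholders; also note that get_query_results returns at most 1,000 rows per call, so larger result sets need pagination.

```python
# Sketch: run an Athena query with boto3 and load the result into a DataFrame.
# The database, table and output S3 location below are placeholders.
import time

import boto3
import pandas as pd

athena = boto3.client("athena")

# Start the query, pointing Athena at an S3 location for the query results
start = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 100;",
    QueryExecutionContext={"Database": "my_database"},                  # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)
query_execution_id = start["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    state = athena.get_query_execution(QueryExecutionId=query_execution_id)[
        "QueryExecution"
    ]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Fetch the results; the first row of the result set holds the column headers
results = athena.get_query_results(QueryExecutionId=query_execution_id)
result_rows = results["ResultSet"]["Rows"]
columns = [col["VarCharValue"] for col in result_rows[0]["Data"]]
data = [[field.get("VarCharValue") for field in row["Data"]] for row in result_rows[1:]]
df = pd.DataFrame(data, columns=columns)
```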
Summary on Boto3
So far we have covered how to use boto3 to extract data from AWS services including S3, Redshift and Athena. It is quite clear that boto3 can do the data extraction work. However, because it calls the underlying AWS APIs, the returned result is in JSON format, which is not friendly for users like data scientists, data engineers, and data analysts.
In the next part of this series of blogs on AWS data extraction, I will dive into AWS SDK for Pandas, which extends the power of the Pandas library to AWS by connecting Pandas DataFrames with AWS data services, and we will see how it makes data extraction easier. Stay tuned!
I hope you have enjoyed reading this blog. Please feel free to follow me on Medium if you want to be notified when there are new blogs published.
If you want to see more guides, deep dives, and insights around modern and efficient data+AI stack, please subscribe to my free newsletter — Efficient Data+AI Stack, thanks!
Note: Just in case you haven’t become a Medium member yet and want to get unlimited access to Medium, you can sign up using my referral link! I’ll get a small commission at no cost to you. Thanks so much for your support!