Setup Python Project for AWS Analytics

Published in Data Engineering on Cloud · 9 min read · Mar 27, 2022

Let us understand how to set up a Python project to develop Data Engineering pipelines using services under AWS Analytics. This includes setting up a Python virtual environment and installing the required dependencies.

This article is part of the Data Engineering on Cloud Medium publication co-managed by ITVersity Inc (Training and Staffing) and Analytiqs Inc (Data Services Boutique).

We will take care of installing an IDE (Pycharm), setting up the project, installing the required dependencies, and validating the integration with AWS services. Here is what we will cover by the end of this article.

Understand the pre-requisites to set up a Python project to build solutions using AWS Analytics services.
Get an overview of IDEs and set up Pycharm (a popular Python IDE).
Set up the project using Pycharm and understand Python virtual environments.
Identify and install the required dependencies, taking one hypothetical project as an example.
Validate the project by developing a couple of programs which integrate with AWS.
Optionally set up Jupyterlab as part of the same project to streamline the learning process using Jupyter Notebooks.

Pre-requisites

Let us go through the pre-requisites for setting up Python Project for AWS Analytics.

A Mac, Windows or Linux based laptop or desktop. If you are using Windows, you can consider Ubuntu 20.04 set up using WSL.
Python 3.6 or later. This article was written using Python 3.8. Note that the pandas version pinned later in this article (1.3.4) needs Python 3.7.1 or later.
A valid AWS Account
A well configured AWS CLI is highly desired. If you are new to AWS and using CLI, then follow instructions to setup and configure AWS CLI.
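The pre-requisites above can be checked with a few lines of Python. This is only a minimal sketch: the helper names (python_version_ok, aws_cli_path) are made up for illustration, and finding the aws executable on the PATH does not prove that the CLI is configured correctly.

```python
import shutil
import sys

# Minimum version named in the pre-requisites; raise it if a pinned
# dependency in your project needs a newer interpreter.
MIN_PYTHON = (3, 6)


def python_version_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


def aws_cli_path():
    """Return the path of the aws executable found on PATH, or None."""
    return shutil.which('aws')


print(f'Python {sys.version_info.major}.{sys.version_info.minor} is new enough: {python_version_ok()}')
print(f'AWS CLI executable: {aws_cli_path()}')
```

If the second line prints None, the AWS CLI is either not installed or not on the PATH of the shell you are using.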

Install IDEs such as Pycharm

Let us go ahead and install an IDE such as Pycharm to take care of application development using Python as the programming language.

IDE stands for Integrated Development Environment. IDEs are typically used to boost productivity during application development.
There are quite a few IDEs for Python based application development, such as Pycharm, Visual Studio Code, Spyder, etc. Pycharm is one of the most popular choices.
Pycharm comes with a community edition as well as a premium edition.
The community edition will suffice for now. However, the premium edition has quite a few good features.

Here are the instructions to install Pycharm.

Go to the Downloads page for Pycharm. Make sure to choose the right platform. As I am using Mac, it has chosen dmg.

In case the above link does not work, search for install jetbrains pycharm and follow the most appropriate link (typically from jetbrains).

Click on Download. Typically the software will be downloaded to the Downloads folder.
Double click the downloaded file and follow the instructions to install Pycharm.

Before going ahead let us understand some of the advantages of using IDEs.

Easy management of the project.
Ability to navigate between folders and files easily.
Auto fill of classes, variables or objects, functions or methods, modules, etc.
Ability to navigate to the documentation of the Classes or APIs.
Integration with code versioning tools.
Easy to refactor the code when we decide to change the names of classes, variables or objects, functions or methods, modules, etc.

Setup Python Project using IDEs such as Pycharm

Now that Pycharm is installed, let us understand how to set up a Python project. I will be using this for Data Engineering using AWS Analytics and hence I will name the project aws-analytics.

Launch Pycharm
Create a new project named aws-analytics
Make sure to choose the location for the virtual environment and the right interpreter.

Here is what the Python project looks like once it is created using Pycharm. Make sure not to add any source code to this folder.

Overview of Python Virtual Environments

Let us understand the relevance of Python Virtual Environments with respect to development of Python based applications.

Most of the engineers might have to deal with multiple projects at a time.
Each project might use a different Python version and its own set of dependencies.
For example, a web based application might depend on libraries such as Django, SQLAlchemy, etc., while a data engineering application might depend on libraries such as Pandas, SQLAlchemy, PySpark, etc. Also, the Python version might have to be different for each of these applications.
Python virtual environments let us manage the dependencies of different projects in an isolated fashion.

As long as we use IDEs such as Pycharm, it is relatively straightforward to set up virtual environments for the projects.
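To see the isolation in practice, we can ask the interpreter whether it is running inside a virtual environment and where it installs packages. The following is a small sketch using only the standard library; the function names are made up for illustration.

```python
import sys
import sysconfig


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def site_packages_dir():
    """Return the directory where pip installs pure-Python packages."""
    return sysconfig.get_paths()['purelib']


print(f'Interpreter        : {sys.executable}')
print(f'Virtual environment: {in_virtual_environment()}')
print(f'site-packages      : {site_packages_dir()}')
```

When this is run with the interpreter configured for the project, the interpreter and site-packages paths should point inside the project's virtual environment rather than the system-wide Python installation.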

Identify Required Dependencies

Let us understand the required dependencies for this project. We will be splitting json files into smaller ones, compressing them and then copying the smaller, compressed files to AWS s3.

We will read the data from json files into a Pandas Dataframe in chunks and then create smaller and compressed (or zipped) JSON files. Hence, we need to install pandas.
As we will be using Python based code to upload files to s3, we need boto3.
Also, we might copy the files into s3 in parallel to speed up the copy process. We can leverage multiprocessing for the same. It runs the work in multiple processes, and its multiprocessing.pool.ThreadPool class offers threads behind the same interface. It is available as part of core Python and hence we don’t need to install any additional dependencies.
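To make the role of pandas concrete, here is a minimal sketch of the splitting step. It assumes the source file is in JSON Lines format (one JSON record per line), because pandas can only read JSON in chunks in that layout. The function name, the part file naming and the default chunk size are assumptions made for illustration, not part of the actual project.

```python
import os

import pandas as pd


def split_json_lines(source_path, target_dir, chunk_size=10000):
    """Split a JSON Lines file into gzip-compressed parts of at most chunk_size records."""
    os.makedirs(target_dir, exist_ok=True)
    part_paths = []
    # With lines=True and chunksize set, read_json returns an iterator of DataFrames
    reader = pd.read_json(source_path, lines=True, chunksize=chunk_size)
    for idx, chunk in enumerate(reader):
        part_path = os.path.join(target_dir, f'part-{idx:05d}.json.gz')
        # Write each chunk back as JSON Lines, compressed with gzip
        chunk.to_json(part_path, orient='records', lines=True, compression='gzip')
        part_paths.append(part_path)
    return part_paths
```

The function returns the paths of the compressed parts, which is the list we would later hand over to the code that uploads the files to s3.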

Install Required Dependencies

Now let us go ahead and install required dependencies. Once Python Virtual Environment is setup as part of Pycharm, there are multiple ways to install dependencies.

Use the Interpreter settings under Pycharm to search for and install the required dependencies. It uses pip under the hood to install them.
Launch Terminal or Powershell, activate the Python virtual environment associated with the project and then install using pip.
Add a requirements file (requirements.txt). Pycharm will automatically prompt you to install dependencies whenever there are changes to the requirements file.

We will take care of installing the dependencies using the requirements file. Here is the content of the requirements.txt file we are going to use for one of the projects related to building Data Engineering pipelines using AWS Analytics services.

You can use this to copy and paste later into requirements.txt file.

pandas==1.3.4
boto3==1.21.19
awscli==1.22.74
fsspec==2022.2.0
s3fs==2022.2.0

Right click on the project name, then choose New, then File.

Give the name as requirements.txt

Paste the list of dependencies related to the project. Click on Install requirements. It will take care of installing all the requirements specified in the requirements.txt file.

The requirements will be installed in the lib/python3.8/site-packages folder created under the virtual environment associated with the project.

Some of the dependencies might need additional libraries and those will be automatically installed even though we do not explicitly specify those as part of requirements.txt.
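Once the installation finishes, it can be reassuring to confirm that the active environment really contains the pinned versions. Here is a minimal sketch using importlib.metadata (available from Python 3.8). The helper names are made up for illustration, and the parser only understands plain name==version lines, not the full requirements file syntax.

```python
from importlib import metadata  # part of the standard library from Python 3.8

REQUIREMENTS = """
pandas==1.3.4
boto3==1.21.19
awscli==1.22.74
fsspec==2022.2.0
s3fs==2022.2.0
"""


def parse_pins(requirements_text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if '==' in line:
            name, version = line.split('==', 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Map each package to (wanted version, installed version or None, matches)."""
    results = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        results[name] = (wanted, installed, installed == wanted)
    return results


for name, (wanted, installed, ok) in check_pins(parse_pins(REQUIREMENTS)).items():
    print(f'{name}: wanted {wanted}, installed {installed}, match {ok}')
```

Run it with the project interpreter. A package reported with installed None is missing from the virtual environment.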

Getting Started with AWS Services

As we have set up the project successfully, let us go ahead and validate it by interacting with AWS services using boto3. We will try to list the s3 buckets and ec2 instances.

Getting s3 bucket names

Here is the code snippet to get s3 bucket names. Authentication and authorization are taken care of using the Access Key and Secret Key of the default AWS profile created as part of the AWS CLI configuration.

import boto3


def main():
    s3_client = boto3.client('s3')

    # Returns dict type object using JSON formatted results
    buckets = s3_client.list_buckets()

    # Extracting bucket names from the dict that is returned using list comprehensions
    bucket_names = [bucket['Name'] for bucket in buckets['Buckets']]
    
    for bucket in bucket_names:
        print(bucket)
    print(f'Total Number of buckets under my account is {len(bucket_names)}')


if __name__ == '__main__':
    main()
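If the snippet above fails with a credentials error, a quick first check is whether a default profile exists at all. The sketch below reads the credentials file that aws configure writes (by default ~/.aws/credentials, an INI style file with one section per profile) and lists only the profile names. The function name is made up for illustration, and credentials can also come from other sources such as environment variables, so an empty list is a hint rather than a verdict.

```python
import configparser
import os


def list_aws_profiles(credentials_path=None):
    """Return the profile names found in an AWS CLI credentials file."""
    if credentials_path is None:
        # Default location used by the AWS CLI
        credentials_path = os.path.expanduser('~/.aws/credentials')
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is silently skipped
    return parser.sections()


profiles = list_aws_profiles()
print(f'Profiles found: {profiles}')
print(f'Default profile present: {"default" in profiles}')
```

Only the section names are printed, so the access key and secret key never show up in the output.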

List of all ec2 Instances with state

Here is the code which prints each ec2 instance id along with its state such as running, stopped, etc.

import boto3


def main():
    ec2_client = boto3.client('ec2')

    # Returns dict type object using JSON formatted results
    ec2_instances = ec2_client.describe_instances()

    instances = []
    for reservation in ec2_instances['Reservations']:
        for instance in reservation['Instances']:
            instance_state = {
                'instance_id': instance['InstanceId'],
                'state': instance['State']['Name']
            }
            instances.append(instance_state)

    for instance in instances:
        print(instance)


if __name__ == '__main__':
    main()
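A single describe_instances call may not return every instance when an account has many of them, because the API returns results in pages. A possible variation, sketched below, uses the boto3 paginator for describe_instances. The helper names are made up for illustration, and boto3 is imported inside the function only so that the flattening logic can be tried without AWS access.

```python
def summarize_instances(pages):
    """Flatten describe_instances response pages into id/state dicts."""
    return [
        {'instance_id': instance['InstanceId'], 'state': instance['State']['Name']}
        for page in pages
        for reservation in page['Reservations']
        for instance in reservation['Instances']
    ]


def list_all_instances():
    """Fetch every page of describe_instances and return id/state dicts."""
    import boto3  # imported here so summarize_instances works without boto3

    ec2_client = boto3.client('ec2')
    paginator = ec2_client.get_paginator('describe_instances')
    # paginate() yields one response dict per page of results
    return summarize_instances(paginator.paginate())
```

Calling list_all_instances() returns the same kind of dicts as the program above, covering all pages instead of only the first response.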

Overview of Jupyterlab

Let us get an overview of Jupyterlab. Jupyterlab is a wrapper on top of Jupyter Notebook Server, which is primarily used to learn Python programming in an interactive environment.

When we set up Python, we get something called the Python CLI (Python Command Line Interface). We can use the Python CLI to learn Python, but it is not learner friendly.
The Python community introduced IPython, which is a better Python CLI. Building on IPython, a web based notebook environment was eventually developed. It is called Jupyter Notebook Server and it is very effective for practicing and learning Python.
Jupyterlab is a wrapper around Jupyter Notebook Server which empowers Python learners by providing additional features.

Here are some of the features related to Jupyterlab (a wrapper on top of Jupyter Notebook Server).

An interactive web based environment to practice and learn Python Hands-On.
It contains cells where we can either write code or document using Markdown.
We can run OS commands via this web based environment and also launch Terminal to interact with the OS directly.
Jupyterlab also provides a sidebar which simplifies the navigation of the notebooks.
Also we can review active kernels and manage them using sidebar.

All the articles related to interacting with AWS as part of our publication will be demonstrated using Jupyter Notebooks. Hence, even though it is optional, we highly recommend setting up Jupyterlab.

Setup and Validate Jupyterlab

Here are the instructions to set up and validate Jupyterlab as part of the aws-analytics project. As Jupyterlab will not be included as part of the project, we will not be updating requirements.txt.

Setup Jupyterlab

Go to the aws-analytics project and launch Terminal. Run the following command to install jupyterlab.

pip install jupyterlab

There will be quite a few additional dependencies installed because of the above command. Here are some of them with jupyter in their names.

jupyter-client       7.2.0
jupyter-core         4.9.2
jupyter-server       1.16.0
jupyterlab           3.3.2
jupyterlab-pygments  0.1.2
jupyterlab-server    2.12.0

Run Jupyterlab

Once installed, you can launch Jupyterlab by running the following command:

jupyter lab

By default the Jupyterlab Web Server will be launched using port number 8888.

Many times the Jupyter based environment will be launched automatically in the default browser. If not, you can click on the links printed in the terminal output or copy and paste them into your favorite browser.
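If the browser does not open, or the link shows a different port, it may be that something else is already listening on port 8888. Here is a small sketch to check that from Python using only the standard library; the function name is made up for illustration.

```python
import socket


def port_in_use(port, host='127.0.0.1'):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex returns 0 when the connection succeeds
        return sock.connect_ex((host, port)) == 0


print(f'Port 8888 in use: {port_in_use(8888)}')
```

When the default port is taken, Jupyter typically picks the next free one and prints it as part of the link, and you can also choose a port explicitly with jupyter lab --port 8889.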

Conclusion

As part of this article we have taken care of the following in the process of setting up Python Project to develop applications using AWS Analytics Services.

Understand relevance of IDEs
Set up Pycharm and create the project along with a virtual environment
Identify and install required dependencies such as Pandas, Boto3, etc.
Validate the integration with AWS by listing buckets in s3 as well as listing instance ids along with state.

As we have successfully set up and validated the project for AWS Analytics, we can leverage it to add any functionality that involves Pandas and Boto3. If you need additional dependencies for your project, you can update requirements.txt and start using them.