How to implement an end-to-end Selenium project on AWS?
Learn to host your Selenium code on EC2 and use S3 to store the output
In this article, we do the following things:
- Write sample Selenium code that collects the Machine Learning introduction topics and their links from the homepage of the GeeksforGeeks website and displays them using pandas.
- Set up the environment on an AWS EC2 instance and use it to host your code
- After processing, use an S3 bucket to store your output (a pandas DataFrame)
Let’s begin by writing sample code that extracts the introduction topics and links of the Machine Learning course from the GeeksforGeeks homepage.
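A minimal sketch of such a script follows. The article's original code is not reproduced in this text, so everything here is an assumption: the homepage URL, the idea of filtering anchor elements by the keyword "machine learning", and the helper names `filter_topics`, `scrape_links`, and `main` are all hypothetical, and the real page may need more specific selectors.

```python
def filter_topics(pairs, keyword="machine learning"):
    """Keep (text, href) pairs whose text mentions the keyword, dropping duplicates."""
    seen = set()
    rows = []
    for text, href in pairs:
        text = (text or "").strip()
        if not text or not href:
            continue  # skip anchors without visible text or without a link
        if keyword.lower() in text.lower() and href not in seen:
            seen.add(href)
            rows.append({"Topic": text, "Link": href})
    return rows


def scrape_links(url):
    """Open the page in Chrome and return (link text, href) for every anchor."""
    # Imported here so the pure helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return [(a.text, a.get_attribute("href")) for a in anchors]
    finally:
        driver.quit()  # always close the browser, even on errors


def main():
    import pandas as pd

    pairs = scrape_links("https://www.geeksforgeeks.org/")  # assumed homepage URL
    df = pd.DataFrame(filter_topics(pairs), columns=["Topic", "Link"])
    print(df)
```

Calling `main()` at the bottom of the file runs the scrape and prints the DataFrame. Keeping the filtering separate from the browser code makes it easy to adjust the keyword without touching Selenium.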
The output should be a pandas DataFrame listing each topic alongside its link.
Check that the code works on your local machine, and proceed only once it does.
Now let’s move on to setting up the environment on AWS EC2. First, make sure your IAM user has the required permissions: the AmazonS3FullAccess policy (or at least the s3:GetObject and s3:PutObject permissions) and the AmazonEC2FullAccess policy should be attached to it.
On your AWS console, you must now:
- Configure an Ubuntu instance and have it running on EC2
- Have your AWS credentials configured on your instance
- Install Google Chrome and ChromeDriver on your instance
- Install Python (it should already be installed on your instance)
- Install pip
- Install the required libraries using pip
- Create an S3 bucket
- Finally, run your code
To create an Ubuntu instance:
- Go to your AWS console, search for EC2 under Services, and choose it
- On your left navigation pane, choose Instances
- Choose Launch Instance
- Under Name and Tag, enter your instance name. I have named mine Ubuntu-Instance
- Under Application and OS Images (Amazon Machine Image), choose Ubuntu Server.
- Keep the default instance type, t2.micro. This instance type has 1 virtual CPU and 1 GiB of memory.
- Now under Key Pair (login), choose Create New Key Pair.
Note: A key pair is a secure way of logging in to your EC2 instance. If you are using EC2 for the first time, you must generate a key pair. Give the key pair a name (e.g., SeleniumEC2-Credentials) and save it. The .pem file is downloaded automatically to your local machine, and you should then move it to a new folder. I have saved mine in a folder named ‘Medium-Selenium’. If you already have a key pair .pem file, you can use it to log in to this instance too.
- Keep other settings default and choose Launch Instance. You should see your instance launched.
- Choose View all instances. The instance appears in a Pending state, which means that it is being launched. It then changes to Running, which indicates that the instance has started booting. There will be a short time before you can access the instance.
- Wait for your instance to display the following:
  - Instance State: Running
  - Status Checks: 2/2 checks passed
Note: Refresh if needed.
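The same two conditions can also be checked from Python. The sketch below assumes a boto3 EC2 client is passed in and that its `describe_instance_status` response has the usual `InstanceStatuses` shape; the helper names `instance_is_ready` and `check_instance` are hypothetical.

```python
def instance_is_ready(status_response):
    """Return True when the instance is running and both status checks are 'ok'."""
    statuses = status_response.get("InstanceStatuses", [])
    if not statuses:
        # By default only running instances are reported, so an empty
        # list means the instance is not up yet.
        return False
    status = statuses[0]
    return (
        status["InstanceState"]["Name"] == "running"
        and status["SystemStatus"]["Status"] == "ok"    # check 1 of 2
        and status["InstanceStatus"]["Status"] == "ok"  # check 2 of 2
    )


def check_instance(ec2_client, instance_id):
    """Ask EC2 for the instance's status and evaluate it."""
    response = ec2_client.describe_instance_status(InstanceIds=[instance_id])
    return instance_is_ready(response)
```

With boto3 installed and credentials configured, `check_instance(boto3.client("ec2"), "<your-instance-id>")` should return `True` once the console shows Running and 2/2 checks passed.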
Congrats, you now have a virtual machine up and running. Now let’s connect to it:
- Select the checkbox next to your instance name and choose the Connect option
Now you should see the available ways to connect to your instance. Choose the SSH client tab and follow the steps provided:
- In your terminal, go to the folder where you saved the .pem file and run the commands listed on the SSH client tab. You should then be connected to your instance.
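As a rough sketch of what those commands do, the Python helper below restricts the key file's permissions and then builds the `ssh` command. The helper names `build_ssh_command` and `connect` are hypothetical and the host name is a placeholder; `ubuntu` is the default user on Ubuntu AMIs.

```python
import os
import stat
import subprocess


def build_ssh_command(key_path, host, user="ubuntu"):
    """Return the ssh command (as an argument list) for logging in with a key pair."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]


def connect(key_path, host, user="ubuntu"):
    """Make the key readable by its owner only, then open an SSH session."""
    # Equivalent to `chmod 400 <key>.pem`; ssh rejects keys that others can read.
    os.chmod(key_path, stat.S_IRUSR)
    return subprocess.call(build_ssh_command(key_path, host, user))
```

For example, `connect("SeleniumEC2-Credentials.pem", "<your-instance-public-dns>")` would run the equivalent of `ssh -i SeleniumEC2-Credentials.pem ubuntu@<your-instance-public-dns>`.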
The green prompt showing the instance's IP address indicates that you have successfully logged in to your instance.
Now configure your AWS credentials on the instance. The AWS CLI needs your access key ID and secret access key, which are given to you when you create a new user in AWS.
In your terminal, first install awscli and then configure your details:
sudo apt install awscli #hit enter when asked
aws configure
Enter your AWS access key ID and secret access key. For the default region name, use your region code, which also appears in the URL of your AWS console.
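To confirm that `aws configure` stored the keys, a small check like the following can help. It assumes the CLI's default credentials file location, `~/.aws/credentials`, and the `default` profile; the helper name `has_credentials` is hypothetical.

```python
import configparser
import os


def has_credentials(path=None, profile="default"):
    """Return True if the credentials file holds both keys for the given profile."""
    path = path or os.path.expanduser("~/.aws/credentials")
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file simply yields no sections
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )
```

Running `has_credentials()` on the instance should return `True` after a successful `aws configure`. The check only looks for the presence of the keys; it does not verify that AWS accepts them.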
Now let’s install Google Chrome and ChromeDriver on your Linux machine:
To install Google Chrome, run the following in your terminal:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
# run this command to install Chrome using the offline installer:
sudo apt install ./google-chrome-stable_current_amd64.deb
sudo apt -f install
Now check your Google Chrome version using:
google-chrome --version
Now let’s download the ChromeDriver build that matches your Google Chrome version:
Go to https://chromedriver.storage.googleapis.com/ and find the ChromeDriver with the same version as your Google Chrome, built for Linux:
Mine is: 103.0.5060.53/chromedriver_linux64.zip
So for my Google Chrome, I need to download this zip: https://chromedriver.storage.googleapis.com/103.0.5060.53/chromedriver_linux64.zip
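Matching the two versions by hand is easy to get wrong, so the lookup can be sketched in Python. The helper names are hypothetical, and the URL pattern is simply the one from the link above; it assumes a driver build with exactly the same version number exists in that bucket, which is not guaranteed for every Chrome release.

```python
import re

BASE_URL = "https://chromedriver.storage.googleapis.com"


def parse_chrome_version(version_output):
    """Extract '103.0.5060.53' from output such as 'Google Chrome 103.0.5060.53'."""
    match = re.search(r"(\d+\.\d+\.\d+\.\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return match.group(1)


def chromedriver_url(version_output, platform="linux64"):
    """Build the download URL for the driver matching the installed Chrome."""
    version = parse_chrome_version(version_output)
    return f"{BASE_URL}/{version}/chromedriver_{platform}.zip"
```

Passing the text printed by `google-chrome --version` for the version used in this article yields the same zip URL as above.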
In your terminal, run:
wget https://chromedriver.storage.googleapis.com/103.0.5060.53/chromedriver_linux64.zip
# Now install the unzip command
sudo apt install unzip
# unzip the chromedriver now
unzip chromedriver_linux64.zip
# move the chromedriver to /usr/bin/chromedriver
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
Before going further, let's create an S3 bucket, which we will use in the later stages:
- At the upper left of the AWS Management Console, on the Services menu, choose S3.
- Choose Create bucket
- In the General configuration section, enter the Bucket name. I have named it bucket-selenium.
- Keep all other settings default
- Choose Create bucket
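Bucket names are shared across all AWS accounts, so the exact name used here may already be taken and you may need to pick your own. The sketch below checks a candidate name against a subset of S3's naming rules (length, allowed characters, no adjacent dots, not shaped like an IP address); the helper name `is_valid_bucket_name` is hypothetical, and passing the check does not guarantee that the name is available.

```python
import re

# 3 to 63 characters: lowercase letters, digits, dots and hyphens,
# starting and ending with a letter or digit.
_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_LIKE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_valid_bucket_name(name):
    """Return True if the name passes this subset of the S3 naming rules."""
    if not _NAME.fullmatch(name):
        return False
    if ".." in name:
        return False  # adjacent periods are not allowed
    if _IP_LIKE.fullmatch(name):
        return False  # names formatted like IP addresses are not allowed
    return True
```

For instance, `is_valid_bucket_name("bucket-selenium")` passes, while a name containing uppercase letters does not.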
Now let’s install all the packages and run the code on EC2.
- At the upper left of the AWS Management Console, on the Services menu, choose EC2.
- Connect to your instance as before (using the SSH client)
On your Linux machine, python3 should already be installed. You can check by running ‘python3' in your terminal
python3
We can see that python3 is already installed. Exit the Python prompt with the quit() command.
Install pip using:
sudo apt-get -y install python3-pip
# Now after installing pip, make sure you hit
source ~/.profile
# check version using:
pip3 --version
Now install the pandas, selenium and boto3 packages:
boto3 is the name of the Python SDK for AWS. It allows you to directly create, update, and delete AWS resources from your Python scripts.
pip install pandas
pip install selenium
pip install boto3
Now let's create a Python file and implement our code:
Create a directory named selenium-code:
mkdir selenium-code
cd selenium-code
vim topics-ML.py
Finally, paste your code into the Python script and you are done:
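The article's script is not reproduced in this text, so the following is only a sketch of the EC2-specific parts under stated assumptions: Chrome runs headless because the instance has no display, and the DataFrame is written to CSV in memory and uploaded with boto3's `put_object`. The helper names `make_driver` and `upload_csv` and the object key in the usage note are hypothetical; the bucket name is the one created earlier.

```python
import io

# Flags commonly used to run Chrome on a server without a display.
CHROME_ARGS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def make_driver():
    """Start Chrome with the headless flags (needs selenium and chromedriver)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in CHROME_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


def upload_csv(s3_client, df, bucket, key):
    """Serialize a DataFrame to CSV in memory and store it as an S3 object."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return f"s3://{bucket}/{key}"
```

Assuming `df` holds the scraped topics, `upload_csv(boto3.client("s3"), df, "bucket-selenium", "topics-ML.csv")` would create the CSV in the bucket. Passing the client in as a parameter keeps the upload step easy to exercise without touching AWS.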
Save the file and exit vim by typing ‘:wq’. Finally, run your Python script:
python3 topics-ML.py
Tadaa, the code worked. Now, under Services, search for S3 and, on the left navigation pane, choose Buckets. Click on the bucket ‘bucket-selenium’. You should see the CSV file that was created. You can also download the file to view the output.
I hope this helps anyone looking to implement Selenium on AWS.
Cheers!