Data Engineering: From Web Scraping, APIs and Local Databases to Automated Data Pipelines in the Cloud

Abigail Flynn
10 min read · Feb 17, 2023


Natural Gas Pipeline — Source: bridgemastersinc.com, 2018

This blog post walks through the process of Data Engineering, explaining how data can be collected, transformed and stored appropriately.

The Project: A Data Engineer at Gans

At the Data Science Bootcamp I'm attending, one of the projects is focused on developing Data Engineering skills. For this project I am a Data Engineer at a company called ‘Gans’, a startup e-scooter-sharing company aspiring to operate worldwide.

Through its marketing, which focuses on an eco-friendly narrative, Gans has managed to acquire new users and gain good press. However, the company has found that its operational success ultimately depends on having scooters parked in the right places.

e-Scooters for Hire — Source: newcivilengineer.com, 2022

As a Data Engineer, I have been asked to collect data from external sources to help predict e-scooter movement. Customer surveys found that weather and flight arrivals are key influences. Firstly, e-scooter usage decreases whenever it starts raining. Secondly, usage increases when flights arrive, as growing numbers of tourists travel with just a backpack, so having e-scooters near airports and connecting train stations at the right times is important.

Data Collection: Web Scraping

To collect the data on cities, the company suggested using Wikipedia. Luckily, there is a programmatic way of doing this. Using a browser like Chrome, you can inspect the HTML code behind all the information displayed on a page and find where the numbers you need sit. With the library Beautiful Soup I was able to scrape that HTML and pull the data into Python, from which data frames can then be created.

Example: Showing HTML Code behind Displayed Information

The first step of web scraping is to identify a website and download the HTML code from it. This can be done simply by using the below code:

Example: Downloading HTML Code
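As a minimal sketch (with Berlin's Wikipedia page used purely as an example URL), this step could look roughly like the following:

    import requests
    from bs4 import BeautifulSoup

    # Download the HTML behind one city's Wikipedia page (Berlin as an example)
    url = "https://en.wikipedia.org/wiki/Berlin"
    response = requests.get(url)

    # Parse the raw HTML into a Beautiful Soup object that can be searched
    soup = BeautifulSoup(response.content, "html.parser")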

The prettify() method can be used to turn the Beautiful Soup parse tree into a nicely formatted Unicode string, with each tag and string on its own line.

You can then access the data by navigating the HTML via tags, names and attributes, combined with methods like find_all() and find_next().
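For example, continuing from the soup object above, the population figure could be pulled out of the page's infobox along these lines (the "infobox" class and the "Population" header text are assumptions about how the page happens to be laid out):

    # Find the infobox table and grab the value next to the "Population" header
    infobox = soup.find("table", class_="infobox")
    population = infobox.find(string="Population").find_next("td").get_text()
    print(population)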

After scraping the HTML, I was able to take advantage of the fact that Wikipedia URLs are identical apart from the city name and write a function that runs through multiple city pages. The function adds the data to a dictionary, which can easily be converted to a data frame. There are a few other hurdles with Wikipedia due to its inconsistent HTML: for example, population does not sit at the same point on each page, so an if clause was used to ensure the code didn't stop at this hurdle.

Example: Function Web Scraping Wikipedia
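A sketch of such a function might look like this (the parsing lines are simplified assumptions rather than the exact selectors used in the project):

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    def cities_dataframe(cities):
        """Scrape each city's Wikipedia page and collect the results in a dictionary."""
        city_data = {"city": [], "population": []}

        for city in cities:
            url = f"https://en.wikipedia.org/wiki/{city}"
            soup = BeautifulSoup(requests.get(url).content, "html.parser")

            city_data["city"].append(city)

            # Population is not in the same place on every page, so only
            # read it if the header is actually found
            population_header = soup.find(string="Population")
            if population_header:
                city_data["population"].append(population_header.find_next("td").get_text())
            else:
                city_data["population"].append(None)

        # A dictionary of lists converts straight into a data frame
        return pd.DataFrame(city_data)

    cities_df = cities_dataframe(["Berlin", "Hamburg", "Munich"])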

Data Collection: APIs

Gans also asked me to collect weather and flight arrivals data. Luckily, there is an even easier way of doing this than web scraping: using APIs.

An API is an Application Programming Interface. APIs are an accessible way to extract and share data. They are created specifically to enable communication between different programs. The data that comes from APIs typically arrives in a structured format, such as JSON. JSON stands for JavaScript Object Notation; it is commonly used for transmitting data and is structured very much like Python dictionaries.
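As a tiny illustration (with made-up values), a JSON string parses straight into a Python dictionary:

    import json

    payload = '{"city": "Berlin", "temperature": 11.5, "raining": false}'
    data = json.loads(payload)            # now a regular Python dict
    print(data["city"], data["raining"])  # JSON's false becomes Python's False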

To access data from an API you can use the requests library, which makes HTTP requests in Python. Through the requests.get() method you can retrieve data from a specified resource, in this case the API. Note that to access these APIs you will need an API key, which you receive when you open an account. Once you make a request you get back a response. The response contains the server's status code for the HTTP request, stating whether you received the information you wanted or not. If all went smoothly you will see status code 200, meaning success!

Example: Request, Response, Status Code.
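In a sketch, the request/response pattern looks roughly like this (the URL and parameters are placeholders, and API_KEY stands for the key you receive when you sign up):

    import requests

    API_KEY = "your_api_key_here"           # issued when you open an account
    url = "https://api.example.com/data"    # placeholder endpoint
    response = requests.get(url, params={"q": "Berlin", "appid": API_KEY})

    # The status code tells you whether the request succeeded
    print(response.status_code)             # 200 means success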

After you have a successful response, the .json() method can be used to parse the data into what looks like nested Python dictionaries, which you can then access the same way you access any Python dictionary.

Weather Forecast — Source: GooglePlay, 2022 & Plane on Runway — Source: transportenvironment.org, 2022

API: Weather Forecast

OpenWeather is an API that provides free weather forecasts, and this is what I used to access the weather data I needed. OpenWeather has a few different APIs, but for this project I used the 5 Day Forecast. After requesting the data and parsing it into JSON, I was able to use Python dictionary methods, such as .keys(), to explore the data. From reviewing it, it was clear that the dictionary for each day and time slot followed the same format, so I was able to create a function that accesses the data needed for different cities and builds a data frame.

Example: Weather Forecast Function
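A sketch of the kind of function I used is below. The 5 Day Forecast endpoint and the "list", "main" and "weather" keys come from OpenWeather's JSON; the exact fields kept in the data frame are my own simplified selection.

    import pandas as pd
    import requests

    def weather_dataframe(cities, api_key):
        """Request the OpenWeather 5 Day Forecast for each city and build one data frame."""
        records = []
        for city in cities:
            url = "https://api.openweathermap.org/data/2.5/forecast"
            response = requests.get(url, params={"q": city, "appid": api_key, "units": "metric"})
            forecast = response.json()

            # Every 3-hour slot in the forecast follows the same dictionary structure
            for slot in forecast["list"]:
                records.append({
                    "city": city,
                    "forecast_time": slot["dt_txt"],
                    "temperature": slot["main"]["temp"],
                    "outlook": slot["weather"][0]["main"],
                    # "rain" only appears in the JSON when rain is forecast
                    "rain_mm": slot.get("rain", {}).get("3h", 0),
                })
        return pd.DataFrame(records)

    weather_df = weather_dataframe(["Berlin", "Hamburg", "Munich"], API_KEY)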

API: Flight Arrival Data

So far I’ve used web scraping and direct API requests to collect my data. However, some websites make using APIs even easier. RapidAPI is a marketplace for APIs; some cost money, but a lot have “Freemium” pricing plans that allow you to access certain information for free. RapidAPI also takes care of writing the request code for you. I used the AeroDataBox API to access the flight arrivals data I needed. The example below demonstrates how I could copy the code required: once you have filled out the few parameters needed and selected the correct language, the code is ready to use.

Example: Rapid APIs, AeroDataBox

The process for accessing the data in Python is then the same as above, using Python dictionary methods. I again created a function to collect the data into a data frame. This function uses today and tomorrow variables to automatically request the data for the following day, because Gans wants the flight arrival times for the next day so they can plan ahead.

Example: Today and Tomorrow Variables
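A sketch of the flights function is below. The endpoint path and headers follow the snippet RapidAPI generates for AeroDataBox, and the field names picked out of the JSON are simplified assumptions; the important part is the today/tomorrow logic built with datetime.

    from datetime import date, timedelta
    import pandas as pd
    import requests

    def arrivals_for_tomorrow(icao_codes, rapidapi_key):
        """Request tomorrow's arrivals for each airport (ICAO code) from AeroDataBox."""
        today = date.today()
        tomorrow = today + timedelta(days=1)

        headers = {
            "X-RapidAPI-Key": rapidapi_key,
            "X-RapidAPI-Host": "aerodatabox.p.rapidapi.com",
        }

        flight_records = []
        for icao in icao_codes:
            # Query tomorrow in two 12-hour windows (morning and afternoon)
            for start, end in [("00:00", "11:59"), ("12:00", "23:59")]:
                url = (f"https://aerodatabox.p.rapidapi.com/flights/airports/icao/"
                       f"{icao}/{tomorrow}T{start}/{tomorrow}T{end}")
                response = requests.get(url, headers=headers, params={"direction": "Arrival"})

                for arrival in response.json().get("arrivals", []):
                    flight_records.append({
                        "airport_icao": icao,
                        "flight_number": arrival.get("number"),
                        "scheduled_arrival": arrival.get("arrival", {}).get("scheduledTimeLocal"),
                    })
        return pd.DataFrame(flight_records)

    flights_df = arrivals_for_tomorrow(["EDDB", "EDDH", "EDDM"], RAPIDAPI_KEY)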

Data Storage: Locally with MySQL

After collecting the data and creating three data frames for cities, weather and flights, it was now time to store the data in a database where it can be easily analysed. I used MySQL Workbench to do this locally on my device.

Firstly, when creating a database you need to think about how your tables (data frames) are going to look and the relationships between them, for example one-to-many or one-to-one relationships. See my database schema below. Please note I added another table mapping airport ICAO codes to cities so I could connect my flights table to my cities table.

Example: EER Diagram of my Database Schema

In MySQL Workbench you can then run a SQL script to create a database and the tables needed. Within each table you define the columns and their data types. Example below:

Example: Creating a Database and Table in MySQL Workbench

After creating my database and tables in MySQL, I then needed to connect them to my Python code. To store the data directly from my code in the database I used the Python library SQLAlchemy, one of the simplest ways to connect Python and SQL.

To use SQLAlchemy you need to establish a connection to your local database from Python. This can be done with the code below. Breaking it down: “schema” refers to the database you have already created in MySQL; “host” is the location of the database server, in this example my local host; “user” and “password” are the details you use to log in to the database server; and “port” is the (optional) database port. In the connection string “con”, ‘mysql+pymysql’ refers to the type of database (MySQL) and the DBAPI being used (pymysql). DBAPI stands for DataBase Application Programming Interface, a library that lets Python talk to the database server.

Once the connection is established, the to_sql() function can be used to write data stored in a DataFrame to a SQL database.

Example: Creating a Local Connection and Using to_sql() Function
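A sketch of the connection and the to_sql() call is below; the schema name, credentials and table name are placeholders for my own setup.

    import pandas as pd
    import sqlalchemy

    schema = "gans"             # the database already created in MySQL Workbench
    host = "127.0.0.1"          # local host
    user = "root"
    password = "your_password"
    port = 3306

    con = f"mysql+pymysql://{user}:{password}@{host}:{port}/{schema}"
    engine = sqlalchemy.create_engine(con)

    # Append the cities data frame to the 'cities' table in the database
    cities_df.to_sql("cities", con=engine, if_exists="append", index=False)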

Finished?! Not Yet.

I have now collected all the data I need about cities, weather and flights and put this into functions which can be run to provide up-to-date data. I then created a database schema in MySQL and connected all the tables. Next, I could automate the process locally, but that would require my computer to be running all the time. Luckily, there is a smarter way to automate this: moving the whole process to the cloud. Two of the leading cloud computing services are Amazon Web Services (AWS) and Microsoft Azure; for this project I used AWS.

AWS: Pipelines on the Cloud

To use AWS you need to create an account, which comes with some free services for the first 12 months. AWS offers over 200 fully featured services; for this project I used AWS RDS, AWS Lambda, AWS IAM, AWS EventBridge and AWS CloudWatch.

AWS RDS

Amazon Relational Database Service (RDS) makes setting up relational databases, such as MySQL, in the cloud really easy. AWS provides lots of helpful documentation to get you started; by following the instructions there I was able to create an Amazon RDS instance.

After your Amazon RDS instance is set up, you need to connect it to your MySQL Workbench. This can be done using the database endpoint. Once the connection is set up between MySQL and Amazon RDS, you can create a database with all of the tables to store the information for your project. In this case I was able to use the same code as for my local connection in MySQL and in Python; I just needed to update the “host” to the endpoint of the Amazon RDS instance, shown below.

Example: AWS RDS Endpoint
Example: Updating the “Host”
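Only the host changes; everything else in the connection string stays the same (the endpoint below is an invented example):

    # Swap the local host for the endpoint of the Amazon RDS instance
    host = "my-gans-db.xxxxxxxxxxxx.eu-central-1.rds.amazonaws.com"
    con = f"mysql+pymysql://{user}:{password}@{host}:{port}/{schema}"
    engine = sqlalchemy.create_engine(con)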

AWS Lambda / AWS IAM

AWS Lambda is an event-driven, serverless computing platform that runs code in response to events and automatically manages the computing resources for you, so you don't have to worry about whether your computer has the power to run the code. AWS Lambda is built around functions and supports many languages, including Python! This means I was able to copy my code across.

To give AWS Lambda access to the RDS service, I first needed to create a new role in AWS IAM, which stands for Identity and Access Management.

Now I can copy my Python functions across to AWS Lambda. I then create a lambda_handler function that calls all of my Python functions, so everything sits inside one piece of code, as only the code inside lambda_handler() gets triggered. Within the lambda_handler function you also need to make sure you include the code that connects to the RDS instance.

Example: lambda_handler Function
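A simplified sketch of the handler is below; weather_dataframe and arrivals_for_tomorrow stand in for the collection functions defined earlier, and the connection details and table names are placeholders.

    import sqlalchemy

    def lambda_handler(event, context):
        # Only the code inside this function runs when the Lambda is triggered
        con = f"mysql+pymysql://{user}:{password}@{host}:{port}/{schema}"
        engine = sqlalchemy.create_engine(con)

        # Collect fresh data and push it straight to the RDS database
        weather_df = weather_dataframe(["Berlin", "Hamburg", "Munich"], API_KEY)
        weather_df.to_sql("weathers", con=engine, if_exists="append", index=False)

        flights_df = arrivals_for_tomorrow(["EDDB", "EDDH", "EDDM"], RAPIDAPI_KEY)
        flights_df.to_sql("flights", con=engine, if_exists="append", index=False)

        return {"statusCode": 200, "body": "Data pushed to the database"}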

In some cases you need to add layers to your AWS Lambda function so it can access certain libraries used within your Python code. I used the AWSDataWrangler-Python39 layer and a custom layer to import SQLAlchemy.

Example: AWS Lambda Layers

You can then test your function. As the Lambda function doesn't specify any other output, the data just gets sent to the database, which can then be queried in MySQL Workbench to see the results.

The process so far mirrors what was done locally. The next step is to automate it using AWS EventBridge and AWS CloudWatch.

AWS: Pipeline Automation

AWS EventBridge

Amazon EventBridge is an AWS service that can be used to schedule events to trigger automatically at certain times, meaning you don't have to push a button or create any other kind of trigger to run your Lambda function. The AWS documentation provides useful guidance on setting up an event.

I was able to schedule an event to run at 8 am every day, so that Gans would have the information about the upcoming weather forecast and flight arrivals ready to access each morning.
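For reference, the schedule expression for a daily 08:00 run looks something like the following (EventBridge cron expressions are evaluated in UTC, so the hour may need adjusting for your timezone):

    cron(0 8 * * ? *)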

AWS CloudWatch

AWS CloudWatch can be used to verify that the Lambda function was invoked by the EventBridge rule. If you see the Lambda event in the CloudWatch logs, you've successfully scheduled the event.

Remember, since the Lambda function does not specify any other output, the data gets sent straight to the database, where you can query it in MySQL Workbench to see the results.

Project Summary

At the end of this project I now have my code running in the cloud every day at 08:00, collecting data from the internet and saving it to a database in the cloud. Through implementing this, I have gained a strong understanding of the Data Engineering role, from web scraping and APIs to local databases and automated data pipelines in the cloud. Below is a summary of what was covered in this project.

  • Data Collection: Collecting the required data from the internet using web scraping with the library Beautiful Soup, and via APIs.
  • Data Storage: Creating a Database in MySQL Workbench to store the data collected.
  • AWS: Pipelines on the Cloud: Using Amazon Web Services (AWS) to move the pipeline to the cloud.
  • AWS: Pipeline Automation: Automating the whole data collection and storage process to run at a set time daily.
