Automated data engineering with Python, SQL, and AWS

Hristo Goumnerov
3 min read · May 7, 2023


Imagine you would like to know the weather and flight information for one or more airports. Now imagine wanting that information every day. Collecting it by hand would be a tedious task. This is where computer science, and in particular an emerging field called data engineering, can help.

Data Engineering is the practice of data processing, data cleaning and preparing ready-to-use data for analytics, data science, and artificial intelligence implementation. [1]

In this article we will show one possible approach to automatically obtaining, processing, and storing weather and flight data for further analysis.

MySQL

The first step is to set up a database. Since we want a simple, straightforward way to create, maintain, and update the database, a relational database management system is a natural fit. The popular MySQL implementation has an easy-to-use GUI (MySQL Workbench) and all the functionality required for the task. We set up tables with the respective columns and primary/foreign keys. The image below shows an example with tables for the names of the airports and cities that we are interested in, as well as weather and flight data.
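As a rough sketch of what such a schema might look like, the snippet below creates a comparable set of tables. The table and column names here are illustrative assumptions, not the exact schema from the image, and SQLite is used in place of MySQL so the example runs without a database server; the DDL for MySQL is very similar.

```python
import sqlite3

# Illustrative schema sketch: cities, airports referencing cities, and
# weather/flight tables referencing them via foreign keys. Names and
# columns are assumptions for demonstration purposes.
DDL = """
CREATE TABLE cities (
    city_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE airports (
    icao    TEXT PRIMARY KEY,
    city_id INTEGER NOT NULL REFERENCES cities (city_id)
);
CREATE TABLE weather (
    weather_id    INTEGER PRIMARY KEY,
    city_id       INTEGER NOT NULL REFERENCES cities (city_id),
    forecast_time INTEGER,
    outlook       TEXT,
    temperature   REAL
);
CREATE TABLE flights (
    flight_id INTEGER PRIMARY KEY,
    icao      TEXT NOT NULL REFERENCES airports (icao),
    dep_time  TEXT,
    arr_time  TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['airports', 'cities', 'flights', 'weather']
```

Keeping the keys explicit up front pays off later, when the automated inserts must reference the right city or airport row.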

Data collection through web scraping and API usage

Once the database has been set up and works, we proceed with the data collection part. Python is a great choice for this: it offers simple syntax and, through its comprehensive set of libraries and open-source packages, all the features necessary for the job. It is recommended to write the logic for the different data aspects (city/airport, weather, and flight) in separate files or modules to keep the code modular and maintainable. The code below shows an example of obtaining and processing data from OpenWeatherMap via its API.

import requests
import pandas as pd


def get_response(city, API_key, header):
    ''' Obtain html / json response '''

    url = 'https://api.openweathermap.org/data/2.5/weather'
    params = {'q': city, 'appid': API_key, 'units': 'metric'}
    response = requests.get(url, params=params, headers=header)

    return response.text, response.json()


def get_weather_details(data):
    ''' Extract data from the obtained json response '''

    output = dict()

    output['forecast_time'] = data['dt']
    output['outlook'] = data['weather'][0]['description']
    output['temperature'] = data['main']['temp']
    output['feels_like'] = data['main']['feels_like']
    output['wind_speed'] = data['wind']['speed']

    return output


def get_weather_data(city, API_key, header, city_sql):
    ''' Main function to obtain, extract, and return data '''

    # get response
    _, response_json = get_response(city, API_key, header)

    # get city data
    city_dict = get_weather_details(response_json)
    city_dict['city'] = city
    city_dict['city_id'] = city_sql[city]

    return pd.json_normalize(city_dict)
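The extraction step can be checked offline, without an API key, by feeding it a stubbed payload that mirrors the shape of an OpenWeatherMap current-weather response. The values below are invented for illustration; the extraction logic matches `get_weather_details` above.

```python
# Stubbed payload mimicking the JSON shape returned by the
# api.openweathermap.org/data/2.5/weather endpoint (values invented).
stub = {
    "dt": 1683450000,
    "weather": [{"description": "clear sky"}],
    "main": {"temp": 21.5, "feels_like": 20.9},
    "wind": {"speed": 3.4},
}

def get_weather_details(data):
    ''' Extract the fields of interest from the json response '''
    return {
        "forecast_time": data["dt"],
        "outlook": data["weather"][0]["description"],
        "temperature": data["main"]["temp"],
        "feels_like": data["main"]["feels_like"],
        "wind_speed": data["wind"]["speed"],
    }

row = get_weather_details(stub)
print(row["outlook"], row["temperature"])  # clear sky 21.5
```

A small check like this is handy before deploying, since the API-facing code will later run unattended on a schedule.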

Automation with AWS

The last part of the data engineering task is automating the data collection and storage. This can be done locally or through a cloud service, depending on the resources available and the project requirements. AWS offers the ability to host a database (RDS) and run code (Lambda) on a given schedule through EventBridge. The image below shows an overview of the setup for a function called wbs_aws which is to be executed every day for ten days at 2 pm.

In conclusion, we have seen that it is possible to obtain, process, and store data using publicly available tools. Gaining the experience and skills to perform data engineering work is a strong addition to one's portfolio, and such skills will be of growing interest to many branches of industry in the future.

Hristo Goumnerov

Bulgarian mechanical engineer with data science and programming experience