Data Engineering: Building a local pipeline and migrating it to the cloud

John Igbokwe
Data 100
Aug 17, 2024

In today’s world, data is often seen as gold. However, like unrefined gold, raw data must be processed and refined before it can be used. That is the purpose of this project: building an automated data pipeline that follows the ETL (Extract, Transform, Load) approach to refine and store data from multiple sources.

The Local Pipeline

The first step in the project was to build a functioning data pipeline on my local machine. This was where I got my hands dirty with web scraping, API calls, and data processing using Python. The goal was to collect data on flights, weather, and city information from different sources and store it in a structured format for analysis.

Web Scraping City Data

Before getting the flight and weather data, I needed a list of cities and related information, such as city names, ICAO codes for nearby airports, and geographical coordinates. Using Python libraries such as requests and BeautifulSoup, I wrote a script to scrape this data from a reliable source on the web and save it in JSON format (a sketch of this step follows the list below).

The cities targeted for this project were Berlin, Hamburg, and Munich. The scraped information included:

City Name: the name of the city.

ICAO Code: the ICAO code of the airport serving each city.

Latitude and Longitude: the geographical coordinates of the city.
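
Below is a minimal sketch of what this scraping step can look like. The Wikipedia URLs, the coordinate parsing, and the hard-coded ICAO codes are assumptions made for illustration, not necessarily the exact source or logic used in the project.

```python
import json

import requests
from bs4 import BeautifulSoup

# Assumption: the cities' Wikipedia pages serve as the "reliable source";
# the ICAO codes are hard-coded here purely for illustration.
CITIES = {"Berlin": "EDDB", "Hamburg": "EDDH", "Munich": "EDDM"}

def scrape_city(city_name: str, icao: str) -> dict:
    """Scrape the coordinates of one city from its Wikipedia page."""
    url = f"https://en.wikipedia.org/wiki/{city_name}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Wikipedia infoboxes expose the coordinates in these span classes
    latitude = soup.find("span", class_="latitude").get_text()
    longitude = soup.find("span", class_="longitude").get_text()
    return {"city": city_name, "icao": icao,
            "latitude": latitude, "longitude": longitude}

cities_data = [scrape_city(name, icao) for name, icao in CITIES.items()]

# Save the scraped records in JSON format, as mentioned above
with open("cities.json", "w") as f:
    json.dump(cities_data, f, indent=2)
```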

Collecting Flight Data

With the city data available, I proceeded to collect flight data using the AeroDataBox API. I wrote another Python script to make API requests for flight information from the airports in Berlin, Hamburg, and Munich over specific time intervals. The data includes details such as departure and arrival times, as well as the ICAO codes of the airports.
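
As a rough illustration, a request to the AeroDataBox airport schedules (FIDS) endpoint on RapidAPI can look like the sketch below. The API key, the example time window, and the flattened column names are assumptions; the exact response structure depends on the API version.

```python
import requests
import pandas as pd

# Assumption: AeroDataBox is accessed through RapidAPI; the key would come from
# the configuration file described later in this post.
RAPIDAPI_KEY = "YOUR_RAPIDAPI_KEY"
ICAO_CODES = ["EDDB", "EDDH", "EDDM"]

def get_arrivals(icao: str, start: str, end: str) -> pd.DataFrame:
    """Fetch arrivals for one airport over a local-time window (max 12 hours)."""
    url = (f"https://aerodatabox.p.rapidapi.com/flights/airports/icao/"
           f"{icao}/{start}/{end}")
    headers = {"x-rapidapi-key": RAPIDAPI_KEY,
               "x-rapidapi-host": "aerodatabox.p.rapidapi.com"}
    params = {"direction": "Arrival", "withCancelled": "false"}
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    # Flatten the nested JSON; the exact column names (scheduled times,
    # departure airport, flight number) depend on the API version.
    arrivals = pd.json_normalize(response.json().get("arrivals", []))
    arrivals["arrival_icao"] = icao
    return arrivals

# Example: the morning window of one day for all three airports
flights_df = pd.concat(
    [get_arrivals(icao, "2024-08-17T00:00", "2024-08-17T11:59")
     for icao in ICAO_CODES],
    ignore_index=True,
)
```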

Collecting Weather Data

I used the OpenWeather API to fetch weather data for the same cities. This involved making API requests to retrieve forecasts covering temperature, wind speed, and rain. The data was collected, structured, and stored in a Pandas DataFrame.
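
A minimal sketch of this step, assuming the free 5-day / 3-hour forecast endpoint and a placeholder API key:

```python
import requests
import pandas as pd

OWM_KEY = "YOUR_OPENWEATHER_KEY"  # placeholder; kept in a config file in practice
CITIES = ["Berlin", "Hamburg", "Munich"]

def get_forecast(city: str) -> pd.DataFrame:
    """Fetch the 3-hourly forecast for one city and flatten it into rows."""
    url = "https://api.openweathermap.org/data/2.5/forecast"
    params = {"q": f"{city},DE", "appid": OWM_KEY, "units": "metric"}
    response = requests.get(url, params=params)
    response.raise_for_status()
    rows = []
    for entry in response.json()["list"]:
        rows.append({
            "city": city,
            "forecast_time": entry["dt_txt"],
            "temperature": entry["main"]["temp"],
            "wind_speed": entry["wind"]["speed"],
            # rain is only present in the response when precipitation is forecast
            "rain_mm": entry.get("rain", {}).get("3h", 0.0),
        })
    return pd.DataFrame(rows)

weather_df = pd.concat([get_forecast(c) for c in CITIES], ignore_index=True)
```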

Storing Data Locally

After the data collection, the next step was to store it locally. I used MySQL, a popular relational database management system, to store the flight, weather, and city data. The data was first cleaned and transformed using Pandas, and then inserted into the MySQL database.

However, this process was challenging. Initially, I struggled with formatting the data correctly for MySQL, especially with handling date and time fields. After some trial and error, I managed to convert the time fields into a format that MySQL could accept.
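
The sketch below shows the general idea, reusing the weather DataFrame from the earlier sketch. The database name, table name, and credentials are placeholders rather than the project's actual settings.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumption: a local MySQL schema called "gans_local" with a "weathers" table
engine = create_engine(
    "mysql+pymysql://root:YOUR_PASSWORD@localhost:3306/gans_local"
)

# MySQL DATETIME columns expect "YYYY-MM-DD HH:MM:SS", so the string
# timestamps coming from the API are parsed into proper datetimes first.
weather_df["forecast_time"] = pd.to_datetime(weather_df["forecast_time"])

# Append the cleaned rows to the existing table
weather_df.to_sql("weathers", con=engine, if_exists="append", index=False)
```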

The Cloud

With the local pipeline up and running, it was time to scale up. Migrating the data pipeline to the cloud would not only make it more scalable and robust but also allow for automation to keep the data updated. The first step was to create tables in the cloud similar to the local ones before inserting the data.

Setting Up a Google Cloud SQL Instance

The next step was setting up a Google Cloud SQL instance to host the MySQL database. This involved creating the instance, configuring it, and making sure that the database could be accessed securely from the cloud. After setting up the instance, I noticed I had named the instance ID “gans_local”, the same name as the local database. I should have given it a name like “gans_gcp” or “gans_cloud” to clearly indicate that this database is in the cloud and not local.

However, it didn’t affect my operations, as I was careful to check the login credentials each time.

Migrating the Data to the Cloud

Moving the data from my local machine to the cloud was another challenge. I encountered issues connecting to the Google Cloud SQL instance from my local machine; the problem was solved by configuring the correct IP address and enabling the necessary API access in Google Cloud.
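
Once the client IP is authorised, the connection itself can be as simple as pointing SQLAlchemy at the instance's public IP. The host, credentials, and schema name below are placeholders for illustration only.

```python
from sqlalchemy import create_engine, text

# Assumption: the instance is reached over its public IP after my machine's
# address was added to the authorised networks.
cloud_engine = create_engine(
    "mysql+pymysql://root:YOUR_PASSWORD@34.123.45.67:3306/gans_local"
)

# Quick connectivity check before migrating any data
with cloud_engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```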

I then exported the static data, such as the city and airport information, and imported it into Google Cloud SQL. For dynamic data like flight and weather information, I implemented an automated, continuous data collection process that updates the data in the cloud.

Automating Data Collection with Cloud Functions

With the static data now in the cloud, the next step was to automate and update flights and weather data. For this, I utilised Google Cloud Functions, a computing service that allows you to run your code in the cloud without managing servers.

I created two separate functions, one for updating the flights data and another for the weather data. These functions are triggered by Google Cloud Scheduler to make sure the data stays updated daily.
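
A stripped-down sketch of what such a function can look like with the Python Functions Framework; the two helpers are hypothetical stand-ins for the collection and insertion logic sketched earlier, not the project's actual code.

```python
import functions_framework

@functions_framework.http
def update_weather(request):
    """Entry point that Cloud Scheduler calls once a day."""
    weather_df = collect_weather()        # hypothetical: calls the OpenWeather API
    insert_into_cloud_sql(weather_df)     # hypothetical: writes/updates Cloud SQL
    return "Weather data updated successfully"
```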

The cloud is not easy to manage. Identifying errors in the Cloud Function logs was challenging and tricky. Another challenge I faced was handling the unique constraints of cloud environments, particularly security: ensuring that sensitive data like API keys and database passwords were kept secure. I overcame this by storing the sensitive information in a separate configuration file that was not publicly accessible.
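
For illustration, such a configuration file can be as simple as a small Python module (here hypothetically called keys.py) that is kept out of version control and imported wherever a secret is needed.

```python
# keys.py -- hypothetical configuration module, excluded from version control
# (for example via .gitignore) so that secrets are never exposed publicly.
OPENWEATHER_KEY = "YOUR_OPENWEATHER_KEY"
RAPIDAPI_KEY = "YOUR_RAPIDAPI_KEY"
DB_PASSWORD = "YOUR_DB_PASSWORD"
```

The scripts then simply do `from keys import OPENWEATHER_KEY` instead of hard-coding the value.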

Here are my functions; the wbs function is a sample test to see how a function works in the cloud.

Data Integrity

A key requirement of the project was to ensure that the data was not duplicated or overwritten incorrectly. To address this, I implemented an upsert strategy in the cloud functions: if a record already existed in the database, it would be updated with the new data instead of being inserted again.
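
A sketch of that upsert idea using MySQL's ON DUPLICATE KEY UPDATE, assuming the weathers table has a unique key on (city, forecast_time) and reusing the engine and DataFrame from the earlier sketches:

```python
from sqlalchemy import text

upsert_sql = text("""
    INSERT INTO weathers (city, forecast_time, temperature, wind_speed, rain_mm)
    VALUES (:city, :forecast_time, :temperature, :wind_speed, :rain_mm)
    ON DUPLICATE KEY UPDATE
        temperature = VALUES(temperature),
        wind_speed  = VALUES(wind_speed),
        rain_mm     = VALUES(rain_mm)
""")

with cloud_engine.begin() as conn:
    # Passing a list of dicts makes SQLAlchemy run the statement for every row
    conn.execute(upsert_sql, weather_df.to_dict(orient="records"))
```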

Initially, I had frustrating moments where my code in the cloud function wasn’t inserting data. I had to go through and read the error logs, though the relevant errors were hard to identify. Fortunately, I was able to get it working correctly.

Challenges and Solutions

My journey from a local data pipeline to an automated, cloud-based system wasn’t easy. Here are some of the challenges I faced and how I overcame them.

1. Data Formatting Issues: Handling date and time fields wasn’t easy. I had to make sure the data was correctly formatted before inserting it into MySQL, for example by converting UTC times to local times (a small sketch follows this list).

2. Cloud Connection: Connecting to the Google Cloud SQL instance from my local machine initially failed due to incorrect configuration. This was fixed by authorising the correct IP address.

3. Security Concerns: Storing API keys and database passwords securely was a major concern. I handled it by creating a configuration file that was stored securely and not exposed to public access.

4. SQL Syntax Errors: Studying the MySQL documentation and other online resources to make sure I was using the right syntax was my best option.
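
As a small illustration of point 1, a UTC-to-local conversion in Pandas can look like this, assuming a column of UTC timestamp strings hypothetically named "scheduled_arrival" and Germany's timezone as the target:

```python
import pandas as pd

flights_df["scheduled_arrival_local"] = (
    pd.to_datetime(flights_df["scheduled_arrival"], utc=True)
      .dt.tz_convert("Europe/Berlin")   # shift from UTC to German local time
      .dt.tz_localize(None)             # drop tz info so MySQL DATETIME accepts it
)
```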

Here is the updated data as of when the function was last triggered.

Conclusion

This project was hands-on practice in the world of Data Engineering, taking me from building a simple local pipeline to creating a cloud-based system capable of scaling to meet future demands. The experience taught me the importance of proper data handling, extracting data from multiple sources, the challenges of working with cloud technologies, and the importance of data security.

Share your thoughts! 💭

Feel free to share your feedback, questions and comments if you have any!
