Make It Work, Make It Right: Improving My Data Mentorship Program Project

Oladayo
Published in CodeX · 5 min read · Jun 29, 2023


Hi everyone,

I have been part of the Data Mentorship Program for the past two months. During this time, I was paired with a data engineering mentor, and it has been a great experience learning from someone who works in the field.

For my final activity in the program, I worked on a project. Rather than starting a new project from scratch, my mentor and I decided that I should focus on improving a previous project of mine.

“Make it work, Make it right, Make it fast” — Kent Beck

The project I chose to improve is this. It aims to display, in near real time, the position and altitude of a selected active satellite in low Earth orbit, and it also showcases trends in the purposes and launch years of these satellites. The architecture of the project is shown below:

architecture of the project

The issues I identified looking at the architecture of the project are:

Lack of proper separation of concerns in the main.py module: The main.py module was intended solely for building the web application (dashboard), but it also contained transformation logic.

To address this, I moved the transformation logic from the main.py module to the etl.py module.
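As a rough sketch of the split (the function and column names here are illustrative, not the project's actual code), the transformation logic now lives in etl.py and main.py only consumes its output:

```python
# etl.py -- illustrative sketch: all transformation logic lives here
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw satellite data (hypothetical columns)."""
    out = raw.dropna(subset=["purpose"]).copy()   # drop rows without a purpose
    out["launch_year"] = out["launch_year"].astype(int)
    return out

# main.py -- the dashboard module now only presents data:
#   from etl import transform
#   dashboard_df = transform(load_raw())
```

With this split, the dashboard code stays readable and the transformations can be tested on their own.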

Hardcoding: In the etl.py module, I had hardcoded lists of web-hosted CSVs grouped by their purpose. This approach lacks flexibility and maintainability: any update to these lists would require editing and redeploying the module.

To mitigate this issue, I moved the lists of web-hosted CSVs to a configuration file (config.yaml) and implemented logic to read the configuration file within the module.
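For illustration, the configuration file and the reading logic might look like this (the keys and URLs are hypothetical, not the project's actual sources):

```python
# Hypothetical config.yaml contents:
#
#   csv_sources:
#     communications:
#       - https://example.com/comms_satellites.csv
#     earth_observation:
#       - https://example.com/eo_satellites.csv

import yaml  # PyYAML

def load_sources(path: str = "config.yaml") -> dict:
    """Return the per-purpose lists of web-hosted CSV URLs."""
    with open(path) as f:
        return yaml.safe_load(f)["csv_sources"]
```

Updating a list now only means editing config.yaml; no code change or redeployment is needed.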

Suboptimal scheduling approach for the etl.py module: The etl.py module was scheduled to run at midnight. Ideally, the module would run whenever the source file (a web-hosted CSV) is updated. Since that would require the source to provide a webhook, which it does not, I had two options for keeping the data timely:

a. Run the etl.py module every hour.

b. Implement a trigger.py module: This module handles the storage and update checks for the source file. The initial run of the trigger.py module stores the source file in a Google Cloud Storage bucket.

Subsequent runs of the module compare the web-hosted CSV source file with the existing file in the bucket to check for updates. If an update is detected, the module overwrites the existing source file in the bucket with the updated version.

Overwriting the source file in the bucket, in turn, triggers the execution of the etl.py module.

The trigger.py module is scheduled to run every hour.
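The update check at the heart of option (b) can be sketched like this (a minimal, hypothetical version; the real module reads and overwrites the stored copy via the Google Cloud Storage client, as noted in the docstring):

```python
import hashlib

def has_update(new_csv: bytes, stored_csv: bytes) -> bool:
    """Return True when the freshly downloaded CSV differs from the stored copy.

    In the real trigger.py, stored_csv would come from the bucket (e.g. via
    the google-cloud-storage Blob.download_as_bytes() call) and, on a change,
    the blob would be overwritten -- which then fires the ETL function.
    """
    return hashlib.md5(new_csv).hexdigest() != hashlib.md5(stored_csv).hexdigest()
```

Comparing hashes instead of full file contents keeps the check cheap even as the CSV grows.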

I went with the second option because of cost: Cloud Functions pricing depends on execution time and memory utilization, and the lightweight trigger.py is much cheaper to run every hour than the full etl.py would be.

updated architecture of the project
metrics of the trigger.py module
metrics of the etl-function.py module

As shown in the images above, the median execution time per call of the trigger.py module is 3.0894 s, compared to 1.8136 min for etl-function.py. Likewise, the memory utilization per call of trigger.py is 256.58 MiB, compared to 289.78 MiB for etl-function.py.

Other improvements to the project were made in the form of additions:

a. I implemented data validation checks in the etl.py module using the data_validation.py module. The checks are built with the pandera library.

b. I also implemented alerting in the etl.py module using the alert.py module, built with the sendgrid library. This ensures that I receive an email notification whenever the etl.py module runs successfully or encounters a failure.

sample email from the etl.py module run
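As a sketch, the message composition might look like the function below (the names and environment variable are hypothetical; the actual send goes through sendgrid's Mail and SendGridAPIClient, noted in the trailing comment, which I have left out to keep the snippet self-contained):

```python
import os

def build_alert(status: str, module: str = "etl.py") -> dict:
    """Compose the notification email for a run outcome ('success' or 'failure')."""
    return {
        "to": os.getenv("ALERT_EMAIL", "me@example.com"),  # hypothetical env var
        "subject": f"{module} run: {status}",
        "body": f"The {module} module finished with status: {status}.",
    }

# In alert.py the payload would then be sent with the sendgrid library, roughly:
#   message = Mail(from_email=..., to_emails=payload["to"],
#                  subject=payload["subject"], plain_text_content=payload["body"])
#   SendGridAPIClient(os.environ["SENDGRID_API_KEY"]).send(message)
```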

c. Finally, I ensured all files stored in the architecture are in the Parquet file format, to take advantage of the columnar layout and compression properties inherent to Parquet files. This effectively reduces storage costs.
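For illustration, writing a DataFrame as Parquet is a one-liner in pandas (this sketch assumes a pyarrow or fastparquet engine is installed; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "satellite": ["SAT-A", "SAT-B"],                    # hypothetical rows
    "purpose": ["Communications", "Earth Observation"],
    "launch_year": [2018, 2021],
})

def save_compact(frame: pd.DataFrame, path: str) -> None:
    """Persist with Parquet's columnar layout and compression."""
    frame.to_parquet(path, compression="snappy")  # needs pyarrow or fastparquet
```

Because Parquet stores each column contiguously and compresses it, repetitive columns like purpose shrink far more than they would in a row-oriented CSV.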

You can view the web application here and the GitHub repository here.

I want to say a big thank you to Rajvi for organizing and accepting me into the program. Also, a big thank you to Kunal for his time and for being a great mentor to me.

Thank you for reading.
