How I Orchestrated a Five-Week Data Engineering Mentorship Program: A Chronicle of Growth and Learning

Laura Funderburk
Nov 6, 2023

Creating and managing a mentorship program is a task that combines vision, organization, and a commitment to growth — both for the participants and the program itself. In this blog post, I share my experience coordinating a bilingual mentorship program focused on data, ML, and NLP engineering.

The Journey Begins: Recruitment

The vision was clear: a five-week mentorship program to delve into the realms of cutting-edge technologies, leveraging large language models (LLMs), and using open-source tools to forge innovative solutions. The call for participants was crafted to attract those passionate about data science, data analytics, and data engineering, and more specifically, people who were eager to build. Our pitch was:

“Join our mentorship for a chance to enhance your portfolio under expert guidance, gain real-world skills, network with peers, and contribute to the Hacktoberfest community. This no-cost program welcomes those ready for a 10-hour weekly dedication to growth. The program will be offered in English and Spanish.”

The program offered hands-on experience with data engineering, analysis, storytelling, and deployment, wrapping up with a polished project showcasing participants’ newfound expertise.

The Selection Conundrum

The response was overwhelming. Seventy aspirants for a mere 10 spots! A testament to the growing interest in tech mentorships. We decided to stretch our resources and opened the doors to 40–50 participants. To keep the program manageable for our team, I split the participants into 12 teams, with the option for a handful of folks to work independently. We asked participants to indicate which of two tracks they preferred:

  • Option 1: Craft a SQL database pipeline using Ploomber and JupySQL, topped with an ML model or analytics application.
  • Option 2: Develop a RAG NLP pipeline via Haystack, populating a vector DB and interfacing with an LLM for various functions.

Each team had between 2 and 5 participants, depending on their language of preference, their experience working with open source technologies (Python, Jupyter notebooks, dependency management, Git and GitHub, Docker), and the type of application they wanted to deploy (dashboard, chatbot, pipeline, API). Once the teams were formed, selected participants were notified.
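To make Option 1 concrete, here is a minimal sketch of the kind of extract-and-transform step teams started from. It calls DuckDB’s Python API directly rather than the Ploomber and JupySQL tooling the program covered, and the file, table, and column names are hypothetical:

```python
# Minimal Option 1 sketch: load raw data, transform it with SQL, persist results.
# File, table, and column names are hypothetical; teams chose their own datasets.
import duckdb

# Connect to a local DuckDB file (created if it does not exist).
con = duckdb.connect("sales.duckdb")

# Extract: ingest a raw CSV into a table; DuckDB infers the schema.
con.execute("""
    CREATE OR REPLACE TABLE raw_sales AS
    SELECT * FROM read_csv_auto('raw_sales.csv')
""")

# Transform: aggregate into a clean table that a dashboard or ML model can consume.
con.execute("""
    CREATE OR REPLACE TABLE monthly_sales AS
    SELECT date_trunc('month', sale_date) AS month,
           product,
           SUM(amount) AS total_amount
    FROM raw_sales
    GROUP BY month, product
""")

# Quick sanity check of the transformed table.
print(con.execute("SELECT * FROM monthly_sales LIMIT 5").fetchall())
```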

Onboarding: The First Step

Welcoming the selected participants into our fold was crucial. The remote nature of the program posed a risk of disengagement, so we countered this as follows:

  1. Act immediately after the notification of acceptance: welcome participants into our Slack space, send invites for the weekly meetings, and respond promptly when they arrive.
  2. Prepare a project template they can use, along with setup instructions (and the translated documentation). You can find the template here.
  3. Invite guest speakers to bring different perspectives to the participants. I invited Ploomber’s CEO Eduardo Blancas, who talked to us about his journey from mechanical engineering, to data science, to starting a business in tech. I also invited AI MakerSpace’s CEO Greg Loughnane and CTO Chris Alexiuk, who talked with us about building with LLMs.
  4. Set a first task for participants: talk to their teams, agree on a dataset to work on, and define the scope of their project. I made sure to @-mention them and created a channel for them to send questions to our team.

Bridging Linguistic Gaps: A Bilingual Program

We hosted two sessions per week in English and Spanish, supplemented by Sunday office hours. This inclusive approach ensured that language was no barrier to learning and enabled us to cater to a diverse cohort. This part of the program took the most resources, as it required planning each topic ahead of time and being on call on Sundays from noon until dusk to accommodate participants in different time zones. The topics we explored in both languages were as follows:

  • Week 1: introduction to data pipelines and LLM pipelines, collaborating effectively with Git and GitHub, setting up reproducible projects with Poetry and Docker
  • Week 2: setting up an ETL pipeline with SQL, setting up a RAG pipeline with vector databases and LLMs
  • Week 3: packaging and productionizing your ETL and RAG pipelines, and migrating from in-memory database development to cloud-based database maintenance (see the sketch after this list)
  • Week 4: setting up GitHub Actions for CI/CD and scheduled jobs, dockerizing your application, and deploying your app on Ploomber Cloud
  • Week 5: ongoing support for teams to finalize their deployments
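The Week 3 migration largely came down to swapping the connection target: the same SQL that ran against a local, in-memory DuckDB database could run against a cloud-hosted one. A hedged sketch, assuming MotherDuck’s md: connection scheme and a MOTHERDUCK_TOKEN environment variable; the database and table names are hypothetical:

```python
# Sketch of the local-to-cloud migration step. Assumes a MotherDuck account with
# a MOTHERDUCK_TOKEN environment variable set; names below are hypothetical.
import duckdb

# Development: an in-memory DuckDB database, recreated on every run.
local_con = duckdb.connect()  # no path = in-memory
local_con.execute("CREATE OR REPLACE TABLE metrics AS SELECT 42 AS answer")

# Production: the same SQL, but against a MotherDuck-hosted database.
# The 'md:' prefix tells DuckDB to open a MotherDuck connection.
cloud_con = duckdb.connect("md:my_project_db")
cloud_con.execute("CREATE OR REPLACE TABLE metrics AS SELECT 42 AS answer")

print(cloud_con.execute("SELECT * FROM metrics").fetchall())
```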

I was really happy with the participants’ engagement. Each week, I asked participants to make one submission. The submission requirements were as follows:

  • Week 1: contact your team and decide what you’d like to build. Identify suitable datasets, ensuring they are open and have clear, well-defined licenses.
  • Week 2: initialize your project’s GitHub repository using the template, then create one branch for data engineering (automating data extraction and storage into a database; we worked with DuckDB and MotherDuck for ETL apps and FAISS for RAG apps, sketched after this list) and another branch for exploratory work (visualizations, data wrangling).
  • Week 3: once the code review process has started, begin merging clean work into main (data pipelines) and start developing your application (chatbots for RAG apps; visualization dashboards or ML models for ETL apps).
  • Week 4: initialize your Docker container and set up GitHub Actions for CI/CD testing or automated data processing.
  • Week 5: deploy your final application to Ploomber Cloud, with support from our team.
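For the RAG branch, populating the vector store amounted to embedding document chunks and adding them to a FAISS index. Below is a minimal sketch of that step; random vectors stand in for embeddings that would normally come from an embedding model, and the dimensions are arbitrary:

```python
# Minimal FAISS sketch for the RAG branch: build an index over document-chunk
# embeddings and run a similarity search. Random vectors are placeholders for
# real embeddings produced by an embedding model.
import faiss
import numpy as np

dim = 384          # embedding dimensionality (depends on the embedding model)
num_chunks = 1000  # number of document chunks to index

rng = np.random.default_rng(0)
chunk_embeddings = rng.random((num_chunks, dim), dtype=np.float32)

# Flat L2 index: exact nearest-neighbour search, fine for small corpora.
index = faiss.IndexFlatL2(dim)
index.add(chunk_embeddings)

# Embed the user's question (placeholder here) and fetch the 5 closest chunks,
# which would then be passed to the LLM as context.
query_embedding = rng.random((1, dim), dtype=np.float32)
distances, ids = index.search(query_embedding, 5)

print("Closest chunk ids:", ids[0])
print("Distances:", distances[0])
```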

The Deployment Finale

The culmination of our program was the deployment week. We provided GitHub templates and project structures for seamless collaboration and code management. My role pivoted to supporting teams through code reviews, Docker setup, and GitHub Actions. Here you can find the teams’ repositories and their deployed apps:

  • RAG chatbot that lets a user ask whether certain facts have been covered by judicial rulings of the High Courts in Colombia. Repo. Deployed app.
  • ETL (Extract, Transform, Load) pipeline for data analysis automation, specific to Adidas sales and main competitor sales. Repo. Deployed app.
  • Generation of a report on proteomic profile data (i.e., protein expression) for 77 patients diagnosed with breast cancer, built with Voila Dashboards, to better understand how gene expression behaves in positive cases and to detect possible new molecular markers for early detection. Repo. App 1. App 2.
  • AirQ-Forecaster: An ETL-to-ML Pipeline for Predicting Air Quality Index. Repo. App.
  • RAG chatbot that answers questions about song lyrics. Repo. App.
  • ETL pipeline that provides insights from trend data (generated by Google Analytics) and enables interaction between the data and AI. Repo. App.
  • ETL pipeline that extracts Yahoo Finance data and generates EDA and an ML model. Repo. App.
  • A chatbot designed to help people choose a recipe to cook, connected to a database of over 200,000 recipes from Food.com. You can ask the chatbot for recipe suggestions based on criteria like ingredients, meal type, cooking time, dietary restrictions, cuisine, etc. Repository.
  • A state-of-the-art application using Haystack and Chainlit. This application leverages the power of conversational AI to provide users with detailed insights from financial reports created using XBRL and other markup languages. Repository.
  • Two more applications were developed, but the teams asked that their apps be kept private.

Reflections

The success of our mentorship program was not just in the projects produced but in the community fostered, the skills honed, and the barriers broken. We built more than just applications; we built bridges across languages, created connections, and planted the seeds for continued growth and learning. Our team at Ploomber was really happy with the participants’ work, and their feedback was highly valuable in helping us improve Ploomber Cloud — our current focus.

Embarking on such an endeavor was as much a learning experience for me and our Ploomber team as it was for the participants. The mentorship program became a living entity, evolving and adapting to the needs of its members. It was a reminder that the heart of technology is not just code and systems, but the people who breathe life into it. I look forward to fostering relationships with more developers and data and ML engineering enthusiasts, and to continuing to work with packages across the open-source ecosystem!

Laura Funderburk

DevRel at Ploomber, passionate about MLOps and data engineering