Data Engineering Zoomcamp by Data Talks Club
At the beginning of this year, I saw a post on LinkedIn from Alexey Grigorev (founder of Data Talks Club) about the launch of a Data Engineering Zoomcamp. I already knew Data Talks Club, as they organized a Machine Learning Zoomcamp last year. This time, though, the timing was perfect for me and the zoomcamp fit my new year's resolutions and goals, so I was ready to invest my time and energy.
In this blog post, I will share the topics covered during the eight-week zoomcamp and my final thoughts about Data Talks Club.
During the first six weeks of training, a variety of topics were covered to make sure all attendees built the necessary building blocks for data engineering. In the last two weeks of the eight-week program, every attendee worked independently on a capstone project to put the course content into practice.
The first week started with an introduction to containerizing applications with Docker and docker-compose, followed by infrastructure-as-code (IaC) practices using Terraform. The goals of the first week were to get familiar with Google Cloud and to create a GCS bucket with Terraform, which would be used as a data lake during the data ingestion process in the upcoming weeks.
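For a rough idea of what that provisioning step achieves, here is a minimal sketch that creates such a bucket with the google-cloud-storage Python client instead of Terraform; the project ID, bucket name, and region below are made-up placeholders, not the course setup.

```python
# Minimal sketch: creating a GCS bucket to serve as the data lake.
# The zoomcamp does this with Terraform; the Python client is shown here
# purely for illustration. Project, bucket name, and region are placeholders.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")            # hypothetical project ID
bucket = storage.Bucket(client, name="my-data-lake-bucket")   # hypothetical bucket name
bucket.storage_class = "STANDARD"                             # default storage class
client.create_bucket(bucket, location="europe-west1")         # hypothetical region
print(f"Created bucket: {bucket.name}")
```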
During the second and third weeks, the Airflow workflow management platform was used to orchestrate data pipelines. The end goal was a fully automated ETL pipeline consisting of tasks that retrieve data from an external data source (an AWS S3 bucket), apply transformations to the data files, load the transformed files into the data lake (GCS bucket), and finally move the data from the data lake into the data warehouse (Google BigQuery). The key takeaways for me from these weeks were the chance to play around with Airflow's configuration parameters and to dive deep into Airflow's scheduling mechanism. I have also written a Medium article, "Advancing Apache Airflow Workflow Schedules", about my learnings; feel free to take a look if you are curious :)
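To give a flavor of the pipeline's shape, here is a heavily simplified Airflow DAG sketch. The task bodies, names, and monthly schedule are my own placeholders rather than the actual course code.

```python
# Simplified sketch of an Airflow DAG with the same shape as the ingestion
# pipeline described above: download -> transform -> upload to the data lake
# -> load into BigQuery. All names and the schedule are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_from_s3():   # placeholder: fetch raw files from the external S3 bucket
    ...

def transform_files():    # placeholder: e.g. convert CSV files to Parquet
    ...

def upload_to_gcs():      # placeholder: copy transformed files to the GCS data lake
    ...

def load_to_bigquery():   # placeholder: load the files from GCS into a BigQuery table
    ...


with DAG(
    dag_id="ingestion_pipeline_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 6 2 * *",   # assumed monthly schedule, purely illustrative
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_from_s3", python_callable=download_from_s3)
    transform = PythonOperator(task_id="transform_files", python_callable=transform_files)
    to_gcs = PythonOperator(task_id="upload_to_gcs", python_callable=upload_to_gcs)
    to_bq = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

    download >> transform >> to_gcs >> to_bq
```

In the real pipeline, each placeholder is replaced with the actual download, conversion, and upload logic, and Airflow runs the chain on its schedule and handles retries.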
The topic of the fourth week was understanding the analytics engineer's responsibilities: creating clean, reusable data models, documentation, and dashboards. For data modeling at the data warehouse level, we used dbt (data build tool), which runs transformations inside the warehouse. At first glance, I could not see the advantages of dbt and felt we could do similar transformations and data modeling with Spark as well. Later, however, I realized that dbt lets you build a reusable transformation pipeline entirely in SQL, with the additional benefit of generating documentation automatically. Considering the time it takes to create proper documentation, dbt stands out from the alternatives for that feature alone. Once the data modeling of the BigQuery tables was done, we used Google Data Studio to create interactive dashboards.
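A dbt model is essentially a SQL SELECT statement that dbt materializes (and documents) in the warehouse. Just to illustrate that idea, and not dbt itself, here is a sketch of a comparable warehouse-level transformation issued directly through the BigQuery Python client; the dataset, table, and column names are made up.

```python
# Illustration only: the kind of warehouse-level SQL transformation that a dbt
# model encapsulates, run here directly via the BigQuery client. In dbt, the
# SELECT would live in a model file and be materialized and documented by dbt.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

sql = """
CREATE OR REPLACE TABLE analytics.monthly_trip_summary AS    -- hypothetical target table
SELECT
  DATE_TRUNC(DATE(pickup_datetime), MONTH) AS trip_month,
  COUNT(*)                                 AS trip_count,
  AVG(trip_distance)                       AS avg_distance
FROM raw.trips                                                -- hypothetical source table
GROUP BY trip_month
"""

client.query(sql).result()  # run the transformation and wait for it to finish
```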
During the fifth and sixth weeks, we worked on different data processing approaches, investigating batch processing with Spark and stream processing with Kafka. As I had some experience with both Spark and Kafka, these two weeks were a helpful refresher for me. Stream processing with Kafka was covered without any practical exercise, and I personally felt it would have been more beneficial if the homework had included a small Kafka exercise.
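As a small taste of the batch-processing side, here is a minimal PySpark sketch in the spirit of that week's material; the paths and column names are placeholders, not the course files.

```python
# Minimal PySpark batch-processing sketch: read raw files from the data lake,
# aggregate, and write the result back. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch_sketch").getOrCreate()

df = spark.read.parquet("gs://my-data-lake-bucket/raw/trips/")    # hypothetical path
summary = (
    df.filter(F.col("trip_distance") > 0)                         # hypothetical column
      .groupBy("pickup_zone")                                     # hypothetical column
      .agg(
          F.count("*").alias("trip_count"),
          F.avg("trip_distance").alias("avg_distance"),
      )
)
summary.write.mode("overwrite").parquet("gs://my-data-lake-bucket/reports/trip_summary/")

spark.stop()
```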
Finally, during the last two weeks of the training, we worked on the capstone project to practice what we had learned over the previous six weeks, from dataset selection all the way to the creation of dashboards. I used a Slack dataset for that exercise; if you are curious about the final outcome, you can find the source code of the capstone project on GitHub.
Last, but maybe the most important point I would like to mention: I loved the zoomcamp in many different ways. To be honest, everything was well managed to build a great community and a great learning experience with well-connected, like-minded people from around the world.
I would like to mention a couple of aspects that blew my mind and definitely deserve appreciation.
Not only the zoomcamp itself, but the whole Data Talks Club builds an incredible community of people who are interested in data. There is also a bunch of other activities that make it easier to reach out to experts in the data domain. For instance, with #book-of-the-week, members of the club get the opportunity to ask questions directly to a book's author in an open space and gain insights into specific topics. Being part of such a community is amazing, as everyone is open to sharing thoughts, helping each other, and creating connections across the globe. I have not seen such a connected and active community in a while, and it keeps me motivated.
Lastly, thanks to the whole Data Talks Club team and the tutors of the course, Victoria Perez Mola, Sejal Vaidya, Alexey Grigorev, and Ankush Khanna, for willingly dedicating their time and effort to creating the content and patiently answering all of our questions in Slack. I can't imagine how much effort each of the trainers put into this course, and I am really thankful that they did. ❤