7 End-to-End Data Engineering Projects That Set You Apart from the Rest
In the ever-evolving field of data engineering, standing out is about demonstrating practical skills and innovative thinking. For aspiring data engineers, this means not just understanding the theory but also showing that you can handle real-world data challenges from start to finish. Here are seven end-to-end data engineering projects that can significantly strengthen your portfolio and set you apart from the competition.
1. Realtime Change Data Capture Streaming | End to End Data Engineering Project
In this video, we dive deep into Change Data Capture (CDC) and how it can be implemented for real-time data streaming using a powerful tech stack. You will integrate Docker, Postgres, Debezium, Kafka, Apache Spark, and Slack to build an efficient, responsive data pipeline.
You will learn how to:
- Configure and save data into a PostgreSQL database
- Capture changes on PostgreSQL with Debezium
- Stream data into Kafka
- Add a streaming layer on top of Kafka with Apache Spark, Flink, Storm, or ksqlDB (a minimal sketch follows this list)
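As a taste of what that streaming layer can look like, here is a minimal sketch of reading Debezium change events from Kafka with PySpark Structured Streaming. The topic name, envelope fields, and connection details are assumptions for illustration, not the exact values used in the video.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Requires the spark-sql-kafka package matching your Spark version.
spark = SparkSession.builder.appName("cdc-streaming-sketch").getOrCreate()

# Assumed topic name produced by the Debezium Postgres connector.
TOPIC = "postgres.public.transactions"

# Simplified view of the Debezium envelope: only the fields we care about here.
payload_schema = StructType([
    StructField("op", StringType()),      # c = create, u = update, d = delete
    StructField("before", StringType()),  # row state before the change (JSON)
    StructField("after", StringType()),   # row state after the change (JSON)
])

# Read the raw change events from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", TOPIC)
       .option("startingOffsets", "earliest")
       .load())

# Kafka values arrive as bytes; parse the Debezium payload out of the JSON envelope.
changes = (raw
           .selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"),
                             StructType([StructField("payload", payload_schema)])).alias("event"))
           .select("event.payload.*"))

# Write the parsed change stream to the console; swap in your own sink (Slack alert, table, etc.).
query = changes.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```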
2. Reddit Data Pipeline Engineering | AWS End to End Data Engineering
This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and services including Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, Amazon Athena, and Amazon Redshift.
With this project, you will learn:
- Apache Airflow with Celery and PostgreSQL
- Docker
- Using the Reddit API
- AWS Glue
- Amazon S3
- Amazon Athena
- Amazon Redshift data warehousing
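To illustrate the orchestration side, here is a minimal sketch of an Airflow DAG that pulls Reddit posts and stages them in S3 for the downstream Glue, Athena, and Redshift steps. The subreddit, bucket, file paths, and credentials are hypothetical placeholders, not the project's actual code.

```python
from datetime import datetime

import boto3
import praw
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_reddit(**context):
    """Pull the latest posts from a subreddit and stage them as a local CSV."""
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder credentials
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit-pipeline-sketch",
    )
    with open("/tmp/reddit_posts.csv", "w") as f:
        f.write("id,title,score\n")
        for post in reddit.subreddit("dataengineering").hot(limit=100):
            f.write(f"{post.id},{post.title.replace(',', ' ')},{post.score}\n")

def upload_to_s3(**context):
    """Push the staged CSV to S3 so Glue/Athena/Redshift can pick it up downstream."""
    s3 = boto3.client("s3")
    s3.upload_file("/tmp/reddit_posts.csv", "my-reddit-bucket", "raw/reddit_posts.csv")

with DAG(
    dag_id="reddit_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit", python_callable=extract_reddit)
    upload = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)
    extract >> upload  # Glue crawler, Athena queries, and the Redshift load would follow `upload`
```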
3. Realtime Socket Streaming
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP sockets, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers each stage, from data acquisition and processing, through sentiment analysis with ChatGPT, to producing to a Kafka topic and indexing into Elasticsearch.
This project showcases how to:
- Transfer data over a TCP/IP socket connection
- Use OpenAI's GPT-4 with Apache Spark
- Stream and visualise data with Apache Kafka
- Replicate data into Elasticsearch
- Visualise data in Kibana (and other tools such as Power BI, Tableau, etc.)
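To show how the pieces connect, here is a minimal sketch that reads lines from a TCP socket with Spark Structured Streaming and labels sentiment through the OpenAI API inside a UDF. The host, port, model name, and prompt are assumptions for illustration; the video's actual implementation may batch requests or structure this differently.

```python
from openai import OpenAI
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("socket-sentiment-sketch").getOrCreate()

def classify_sentiment(text: str) -> str:
    """Ask the OpenAI API to label a piece of text as POSITIVE, NEGATIVE, or NEUTRAL."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Reply with one word: POSITIVE, NEGATIVE, or NEUTRAL."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

sentiment_udf = udf(classify_sentiment, StringType())

# Each line arriving on the socket becomes a row with a single `value` column.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

scored = lines.withColumn("sentiment", sentiment_udf(col("value")))

# Print to the console here; the project produces to a Kafka topic and then into Elasticsearch.
query = scored.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```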
4. Sales Analytics with Apache Flink | End to End Data Engineering Project
This repository contains an end-to-end data engineering project using Apache Flink, focused on performing sales analytics. The project demonstrates how to ingest, process, and analyze sales data, showcasing the capabilities of Apache Flink for big data processing.
You will learn how to:
- Use Apache Flink for data processing
- Set up a new project with Apache Flink
- Apply different aggregation techniques with Apache Flink
- Understand the source-sink relationship in Apache Flink
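For a flavour of the Flink side, here is a minimal PyFlink Table API sketch that aggregates sales amounts per product. The in-memory sample rows stand in for the project's real sources and sinks, and the video itself may use the Java API rather than Python.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

# Batch mode is enough for this tiny in-memory example; the project runs in streaming mode.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Sample sales rows standing in for a real Kafka or filesystem source.
sales = t_env.from_elements(
    [
        ("laptop", 1200.0),
        ("phone", 800.0),
        ("laptop", 1100.0),
        ("tablet", 450.0),
    ],
    ["product", "amount"],
)

# Total revenue per product: the kind of aggregation the project builds on.
revenue = (sales
           .group_by(col("product"))
           .select(col("product"), col("amount").sum.alias("total_amount")))

# Collect and print the results; a real pipeline would write to a sink table instead.
with revenue.execute().collect() as results:
    for row in results:
        print(row)
```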
5. End to End Data Engineering On Azure
This project delivers end-to-end data processing and visualization of visa numbers in Japan using an Azure-hosted Spark cluster with PySpark and Plotly. The Spark cluster runs in Docker containers on Azure.
You will learn how to:
- Use Docker Compose to set up a Spark cluster on Azure
- Understand the master-worker relationship in Apache Spark
- Write custom Spark SQL scripts for data processing
- Write a custom Spark sink for data output
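Here is a minimal sketch of a PySpark job that connects to a standalone Spark master (as you would get from a Docker Compose cluster) and runs a Spark SQL query. The master URL, file path, and column names are placeholders rather than the ones used in the video.

```python
from pyspark.sql import SparkSession

# "spark-master" is the assumed hostname of the master container from docker-compose.
spark = (SparkSession.builder
         .appName("visa-analytics-sketch")
         .master("spark://spark-master:7077")
         .getOrCreate())

# Load the raw data and expose it to Spark SQL as a temporary view.
visas = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("/data/visa_numbers.csv"))  # placeholder path
visas.createOrReplaceTempView("visas")

# A custom Spark SQL script: yearly totals per country (assumed column names).
yearly_totals = spark.sql("""
    SELECT country, year, SUM(number_of_visas) AS total_visas
    FROM visas
    GROUP BY country, year
    ORDER BY year, total_visas DESC
""")

# Write the aggregated output; a result like this then feeds the Plotly charts.
yearly_totals.write.mode("overwrite").parquet("/data/output/yearly_totals")

spark.stop()
```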
6. Modern Data Engineering with DBT (Data Build Tool) and BigQuery
This project showcases a deep dive into the powerful combination of DBT and BigQuery, the game-changers in modern data engineering.
You will learn how to:
- Set up DBT and BigQuery from scratch
- Link DBT and BigQuery
- Write SQL-based Transformations with DBT
- Convert Tables to Views with DBT and vice versa
- Seed data to BigQuery with DBT
- Write unit tests with DBT
- Generate Documentation with DBT
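The day-to-day DBT workflow behind those bullet points boils down to a handful of CLI commands. Here is a minimal Python sketch that drives them in order; the model selector is a hypothetical name, and in practice you would usually run these commands directly in a terminal against a BigQuery target configured in profiles.yml.

```python
import subprocess

def dbt(*args: str) -> None:
    """Run a dbt CLI command and fail loudly if it errors."""
    subprocess.run(["dbt", *args], check=True)

dbt("debug")                              # verify the BigQuery connection
dbt("seed")                               # load seed CSVs into BigQuery
dbt("run", "--select", "staging_orders")  # build a model (hypothetical name)
dbt("test")                               # run schema and data tests
dbt("docs", "generate")                   # generate the documentation site
```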
Before we go into the last project…
If you’ve come this far, consider giving me a FOLLOW, LIKE, and SUBSCRIBE on the YouTube channel for more content like this.
7. AWS EMR (Elastic Map Reduce) For Data Engineers
This project demonstrates the use of Amazon Elastic Map Reduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.
You will learn how to:
- Set up AWS EMR from scratch
- Create Spark jobs to be submitted to the cluster
- Submit Spark jobs to the cluster on AWS
- Combine the different technologies on the AWS cloud
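For a sense of the AWS side, here is a minimal boto3 sketch that spins up an EMR cluster and submits a Spark step. The region, release label, instance types, S3 paths, and script name are assumptions for illustration; the video itself works through the equivalent AWS command line instructions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Launch a small EMR cluster with Spark installed (instance types and counts are placeholders).
cluster = emr.run_job_flow(
    Name="data-engineering-sketch",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs-bucket/",  # placeholder bucket
)
cluster_id = cluster["JobFlowId"]

# Submit a Spark ETL script stored in S3 as a step on the running cluster.
emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/scripts/etl_job.py"],  # placeholder script path
        },
    }],
)
```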
The code for all the videos is available on my GitHub:
And that’s a wrap!
If you are interested in any of the topics below:
— Python
— Data Engineering
— Data Science
— SQL
— Cloud Platforms (AWS/GCP/Azure)
— Machine Learning
— Artificial Intelligence
Like and Follow me on all platforms:
- Github: airscholar
- Twitter: @YusufOGaniyu
- Linkedin: Yusuf Ganiyu
- Youtube: CodeWithYu
- Medium: Yusuf Ganiyu
I share content daily on LinkedIn, X, Medium, and YouTube.
More courses are available on datamasterylab.com