What should Data Engineering Practitioners master after Python and SQL?
In one sentence: learn database and data warehouse design techniques.
Organizations that take their data seriously generally run well-designed, robust, fault-tolerant, and well-maintained databases. Databases, data warehouses, and data lakes are the heart of any data-driven company, and data engineers play a vital role in managing them.
If you are a data engineering beginner who already has experience with the fundamentals of Python and SQL and wants to gain practical experience with database design and modeling, I recommend you start reading about:
- Different design techniques for database and data warehouse modeling, such as 3NF (Third Normal Form), dimensional modeling (star schema, snowflake schema, galaxy schema), Data Vault, and more.
- What are OLTP and OLAP?
- Normalization and denormalization of data
Once you feel a bit confident, create a project where you use your Python and SQL skills to design a database or data warehouse with any of these techniques. In my experience, most companies use 3NF for database modeling and a star schema for the data warehouse, so build your project around these two. A quick sketch of both follows.
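To get a concrete feel for the difference, here is a minimal sketch of the same sales data modeled both ways. Every table and column name here is my own illustration, not the schema used in the repository.

```python
# Minimal, hypothetical sketch: the same sales data modeled two ways.
# None of these names come from the GitHub repo.

# 3NF (operational database): each fact lives in exactly one place,
# and relationships are expressed through foreign keys.
THIRD_NORMAL_FORM_DDL = """
CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    order_date  DATE NOT NULL
);
CREATE TABLE order_items (
    order_id   INTEGER REFERENCES orders(order_id),
    product_id INTEGER,
    quantity   INTEGER,
    unit_price NUMERIC(10, 2)
);
"""

# Star schema (warehouse): one wide fact table surrounded by
# denormalized dimension tables, optimized for analytical queries.
STAR_SCHEMA_DDL = """
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date DATE,
    year      INTEGER,
    month     INTEGER
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    revenue      NUMERIC(12, 2)
);
"""
```

Notice how the 3NF layout avoids duplication for transactional writes, while the star schema trades duplication for simple, fast analytical joins.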
In this article, I will not explain the differences between databases, data warehouses, and data lakes, or the design methodologies behind the different data modeling techniques. I am neither a data guru who knows these techniques by heart nor fully proficient in all of them. Instead, I expect you to learn by building projects, because that is how I learn.
At this point, I assume you have enough theoretical knowledge about databases, data warehouses, and the different modeling techniques. To help you out practically, I have set up a GitHub repository with detailed instructions and two datasets to play with.
The repository is divided into multiple tasks. In this article, I’ve only covered the first two tasks. I will talk about the other tasks in the next article.
In the first task, you will:
- Set up the connection between your machine and the SQL database.
- Import the data from your local machine to the SQL database.
- Use Python functions to unzip the data, set up the connection, and import the data to the SQL database (a sketch of these steps follows this list).
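Here is a minimal Python sketch of what task one can look like, assuming a zipped CSV dataset, a PostgreSQL target, and pandas with SQLAlchemy for the import. The paths, connection string, and table names are placeholders, not the ones from the repository, so follow the README for the real values.

```python
import zipfile
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical paths and credentials -- replace with the ones from the repo README.
ZIP_PATH = Path("data/raw_sales_data.zip")
EXTRACT_DIR = Path("data/extracted")
CONNECTION_URL = "postgresql+psycopg2://user:password@localhost:5432/sales_db"


def unzip_data(zip_path: Path, extract_dir: Path) -> list[Path]:
    """Unzip the raw dataset and return the extracted CSV file paths."""
    extract_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
    return list(extract_dir.glob("*.csv"))


def get_engine(connection_url: str):
    """Create a SQLAlchemy engine for the target SQL database."""
    return create_engine(connection_url)


def load_csv_to_sql(csv_path: Path, engine) -> None:
    """Import one CSV file into a table named after the file."""
    df = pd.read_csv(csv_path)
    df.to_sql(csv_path.stem, engine, if_exists="replace", index=False)


if __name__ == "__main__":
    engine = get_engine(CONNECTION_URL)
    for csv_file in unzip_data(ZIP_PATH, EXTRACT_DIR):
        load_csv_to_sql(csv_file, engine)
```

Keeping each step in its own small function makes it easy to swap the target database later, which is exactly what you will do when moving the project to the cloud.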
In the second task, you will:
- Create an ETL process to build a star schema for analytics purposes.
- Extract only the transactional data from the SQL database into a staging table.
- Create the dimension and fact tables.
- Load data into the dimension and fact tables (a sketch of this flow follows this list).
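And here is a hedged sketch of the ETL flow in task two: stage only the transactional rows, then populate the dimension and fact tables from the staging table. Again, every table, column, and connection detail below is illustrative, so map the idea onto whatever star schema you design.

```python
from sqlalchemy import create_engine, text

# Hypothetical connection string, table names, and columns -- none of these
# come from the repo; adjust them to your own schema.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/sales_db")

# Extract: copy only the transactional rows into a staging table.
extract_to_staging = text("""
    CREATE TABLE IF NOT EXISTS stg_sales AS
    SELECT order_id, customer_id, order_date, quantity, unit_price
    FROM raw_sales
    WHERE record_type = 'transaction';
""")

# Load the customer dimension from the distinct customers seen in staging ...
load_dim_customer = text("""
    INSERT INTO dim_customer (customer_key, customer_name)
    SELECT DISTINCT c.customer_id, c.customer_name
    FROM stg_sales s
    JOIN customers c ON c.customer_id = s.customer_id
    ON CONFLICT (customer_key) DO NOTHING;
""")

# ... and the date dimension, using YYYYMMDD integers as surrogate keys ...
load_dim_date = text("""
    INSERT INTO dim_date (date_key, full_date, year, month)
    SELECT DISTINCT
           CAST(TO_CHAR(order_date, 'YYYYMMDD') AS INTEGER),
           order_date,
           EXTRACT(YEAR FROM order_date)::INTEGER,
           EXTRACT(MONTH FROM order_date)::INTEGER
    FROM stg_sales
    ON CONFLICT (date_key) DO NOTHING;
""")

# ... then load the fact table, deriving the date key and revenue per row.
load_fact_sales = text("""
    INSERT INTO fact_sales (customer_key, date_key, quantity, revenue)
    SELECT s.customer_id,
           CAST(TO_CHAR(s.order_date, 'YYYYMMDD') AS INTEGER),
           s.quantity,
           s.quantity * s.unit_price
    FROM stg_sales s;
""")

# Run everything in a single transaction so a failed step rolls back cleanly.
with engine.begin() as conn:
    for statement in (extract_to_staging, load_dim_customer,
                      load_dim_date, load_fact_sales):
        conn.execute(statement)
```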
Detailed step-by-step instructions are provided in the repository's README.md file.
In my next article, I'll walk through the other two tasks, in which I replicate the two tasks above on the AWS cloud platform.
Once you're done with these two tasks, you can add them to your GitHub. You can also open a pull request if you have something to share or a better way to approach this project; I'd appreciate it.
If you want to connect, find me on LinkedIn or Instagram, and if you want to know more about me, check out my website.
If you have questions to discuss, send me a message on LinkedIn. I'm happy to help based on my experience.