Published in Geek Culture

My Guide to Becoming a “Market-Ready” Data Engineer With a $100 Investment in 3 Months or Less — 2021

As you may know, the demand for Data Engineers has quadrupled in the last 2 years. There are more jobs than candidates in the market, especially for US work-authorized candidates. I have seen companies offering up to $1,500 in referral bonuses.

My goal in writing this post is to give a self-motivated beginner a “simpler” learning path to becoming a data engineer. I have outlined the most important foundational skills needed to break into a Data Engineer role (a junior role at least). The data engineering landscape is truly overwhelming, and I can confidently say that no single person has acquired all the skills mentioned in an ideal data engineer job description. For those who are curious, a comprehensive list of what data engineers need to know can be found here. However, as beginners learn the building blocks outlined below, they will start feeling confident applying for roles and will probably land a job before completing the entire training. I would encourage any beginner to occasionally test themselves against the knowledge expected in the job market. It keeps one motivated and honest about where they are.

You can accomplish anything as long as you set your mind to it

Having worked as a data engineer for 3 years at 3 different companies, I can tell you that the fundamental skills have always been very useful and important to my success. Even with the ever-changing dynamics of tools rising and falling within the span of months and some tools rapidly becoming mainstays, a good data engineer with a good grasp of the foundational skills has nothing to fear, as they can learn new tools and data services quickly.

I also wanted to list out things you can learn within a given and realistic time period that can result in a job offer. Once you get the job, nothing stops you from learning more and improving your skills on the job.

Some of the skills in data engineering are best picked up while on the job. For example, knowledge of CI/CD DevOps pipelines, Agile/Scrum, workflow scheduling, etc. is easiest and best learned through real-world experience.

In this post, after going over the vendor-agnostic foundational skills, I will describe the specific skills needed to become an Azure data engineer, since that has been my domain for the past 3 years. More generally, I would advise a beginner to focus their learning and training on tools revolving around a single cloud provider (AWS, Azure, GCP) of their choice. This is a more practical approach to landing a Data Engineering job faster. In the real world, most companies like to stick to one cloud provider and use the tools in its ecosystem.

After reviewing countless job requirements on data engineering and going through countless interviews, I have gotten a better sense of clarity on what is truly important to be a useful data engineer in most companies.

I have outlined the TOP 5 foundational skills needed to be successful, along with useful resources that I have reviewed and deemed sufficient for your consumption. I also give a realistic time allocation for absorbing the material and its estimated cost. Most of the courses I provide are on Udemy. However, feel free to use any other website or YouTube (sometimes the best content is found on YouTube). I don’t receive any referral payments from these instructors; I just hope you get the best knowledge wherever possible.

Please note that taking your time to practice is really crucial when doing self-learning, so budget about twice the course video duration.

PRO TIP: There is a hack to ensure you get the discounted price for the video courses on Udemy if it is not showing. First, try clearing your browser cookies; if that does not work, sign up for Udemy with a brand-new email address.

Foundational skills

1. Good Foundational Knowledge of any programming language, preferably Python

Why: Data engineering is not far from software engineering. The more I work in this space, the more the line between the two blurs. A good data engineer is really a software engineer with expert data skills. Creating an optimized data pipeline, writing complex business-rule transformations, and creating an automated data flow system all involve programming concepts. Even the software tools you will use for data engineering are written in some language (Python, Scala, Java, PowerShell, etc.). In reality, you probably won’t have to implement an advanced algorithm like depth-first search. However, knowledge of programming concepts like if-else statements, for and while loops, and some basic algorithms will be super important as a data engineer. Analyzing the run time and space requirements of the code you write makes you a very effective data engineer: it saves computation costs for your company and ensures timely data delivery.
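To make the point about run time concrete, here is a tiny sketch of my own (a toy example, not from any of the courses below) comparing two ways of matching records between two lists:

```python
# A toy illustration: two ways to find order IDs that appear in both lists.
# The first is O(n * m) because every lookup scans the whole list; the second
# is roughly O(n + m) because set membership checks are constant time on average.

orders = ["A1", "B2", "C3", "D4"]
shipped = ["C3", "A1", "E5"]

# Slow: every check scans the shipped list
slow_matches = [o for o in orders if o in shipped]

# Fast: build a set once, then check membership
shipped_set = set(shipped)
fast_matches = [o for o in orders if o in shipped_set]

print(slow_matches, fast_matches)  # both print ['A1', 'C3']
```

On lists this small the difference is invisible, but on tables with millions of rows this kind of choice decides whether your pipeline finishes in minutes or hours.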

Additionally, a good number of Data Engineer job interviews test your coding ability. Therefore, being comfortable in a language is important.

Please note: if you already have a background in another programming language, you are free to skip this entirely or just learn the Python syntax.

Courses

a) Beginner to Intermediate Python Course:
Complete Python Bootcamp: From Zero to Hero in Python:
This will give you a good grasp of some fundamentals of coding in Python and object-oriented programming.
https://www.udemy.com/course/complete-python-bootcamp/
Cost: 15–20 dollars
Course Time: 24hrs
Learning Time: 1 month

b) Python Algorithms and Data Structures (for Mid to Senior Data Engineers)
Python for Data Structures algorithms and interviews
This course is crucial for understanding the fundamentals of software engineering. Please note you have to be at an intermediate level before taking this course. This is essential to get through most coding interviews for mid or senior roles.
https://www.udemy.com/course/python-for-data-structures-algorithms-and-interviews/
Cost: 12–15 dollars
Course Time: 17hrs
Learning Time: 1 month

Bonus: Python for Data Analysis: NumPy, Pandas DataFrames
These free YouTube videos are very comprehensive; they go over the most popular Python libraries used in the real world for data analysis, such as Pandas and NumPy. Feel free to skip the 4-hour course and jump straight to Pandas if you don’t have time. A short Pandas sketch follows the links below.
Numpy + Pandas 4 hr course
https://youtu.be/r-uOLxNrNk8
Pandas 1 hr course
https://youtu.be/vmEHCJofslg
Pandas Advanced concepts 1 hr course
https://youtu.be/P_t8LO-KgWM
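If you want a feel for what day-to-day Pandas work looks like before watching the videos, here is a minimal sketch with made-up data, assuming pandas is installed (pip install pandas):

```python
# Minimal Pandas sketch: load a small table, handle missing values, aggregate.
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [100.0, 250.0, None, 300.0, 50.0],
})

sales["amount"] = sales["amount"].fillna(0)                      # clean missing values
totals = sales.groupby("region", as_index=False)["amount"].sum() # total sales per region
print(totals)
#   region  amount
# 0   East   150.0
# 1   West   550.0
```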

2. Good Foundational Knowledge of SQL Programming (SQL Query Writing) - Relational Database bonus

Why: SQL is the most widely spoken language among data technologists all over the world. I don’t think you can proudly say you work with data if you don’t have a good grasp of SQL. However, to be a data engineer in this day and age, you need to be not only good but an expert in SQL. Remember, you will be working a lot with Data Analysts, Data Stewards, Business Analysts, etc. These folks are already proficient in SQL to some extent, and they will look to you for help in solving some of their most difficult SQL problems. Secondly, most tools in the data engineering stack speak SQL. From the data sources and ETL tools all the way to the reporting tools, they all rely on SQL to execute their functions.
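You also don’t need to install a database server to start practicing. Here is a minimal sketch using Python’s built-in sqlite3 module; the table and query are made up, but they show the kind of aggregate-plus-filter question that often comes up in interviews:

```python
# Practice SQL without a database server using Python's built-in sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 120), ('bob', 80),
                              ('alice', 200), ('carol', 50);
""")

query = """
    SELECT customer, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY total_spent DESC;
"""
for row in conn.execute(query):
    print(row)   # ('alice', 320.0) -- the only customer over the threshold
```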

Courses

a) The Complete SQL Bootcamp 2020: Go from Zero to Hero on Udemy
This is a good first step to get you from beginner to intermediate in SQL.
https://www.udemy.com/course/the-complete-sql-bootcamp/
Cost: 11–15 dollars
Course time: 9 hrs
Learning Time: 3 weeks (Spending 10 hrs a week)

b) SQL — Beyond The Basics
This course focuses on advanced concepts that are crucial in getting through most interviews these days and having that expert level knowledge as a data engineer. It will go over the most efficient ways to write elegant queries that will optimize your ETL workloads.
https://www.udemy.com/course/sql-beyond-the-basics/
Cost: 11–15 dollars
Course time: 5hrs
Learning Time: 1.5 weeks

3. Good Foundational Knowledge of Common Data Analytics Concepts - ELT, Data Warehousing, and Data Modelling

Why ELT: Data engineers are like the engine room workers keeping all the lights on and operational in an organization that needs data. ETL stands for extract, transform, load. Recently, the industry has moved to the ELT approach due to cheap storage and the huge volumes of data being processed. People now bring code to the data instead of pushing all the data to expensive computation machines. As a data engineer, you need to know that most organizations need your help in making sure they get a fresh update of their daily reports on time. Most of the time, these reports involve consolidating data from various source systems, then transforming and modeling it in a data warehouse so that it can be easily consumed by business intelligence reports or AI/ML models. This is why knowledge of ELT/ETL, data warehousing, and data modeling is important. Without these skills in a data engineer, organizations will find it hard to integrate their disparate source systems into a consolidated, high-level analytical report for a C-level executive.
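To make the ELT idea concrete, here is a toy sketch where sqlite3 stands in for a real cloud data warehouse; in a real job the storage, warehouse, and orchestration would be managed cloud services, and the table names here are made up:

```python
# Toy ELT flow: Extract and Load the raw data first, then Transform with SQL
# inside the "warehouse" (sqlite3 standing in for a real cloud warehouse).
import sqlite3
import pandas as pd

# Extract: pretend this came from a source system export
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "country": ["US", "US", "DE"],
    "amount": [100.0, 40.0, 75.0],
})

warehouse = sqlite3.connect(":memory:")

# Load: land the raw data as-is into a staging table
raw.to_sql("stg_orders", warehouse, index=False)

# Transform: build the reporting table with SQL inside the warehouse
warehouse.executescript("""
    CREATE TABLE rpt_sales_by_country AS
    SELECT country, SUM(amount) AS total_sales
    FROM stg_orders
    GROUP BY country;
""")

print(pd.read_sql("SELECT * FROM rpt_sales_by_country", warehouse))
```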

Please note that the videos and learning materials below are samples of a much deeper topic. Personally, I only understood some of these concepts through my work experience. However, as a data engineer, you need at least some familiarity with the terminology and a basic understanding so that you won’t feel lost on the job.

a) Data Modelling Fundamentals
https://www.udemy.com/course/mastering-data-modeling-fundamentals/
Cost: 13 dollars
Course Time: 3hrs

b) Data Warehousing Fundamentals
https://www.youtube.com/watch?v=J326LIUrZM8
Time: 1hr
https://youtu.be/lWPiSZf7-uQ
Time: 1hr

c) ETL for Data Warehouse
https://www.youtube.com/watch?v=7MOU1l30lXs
Time 1hr

d) Dimensional Modelling (a small star-schema sketch follows this list)
https://www.youtube.com/watch?v=DspXXZrSVRk
https://www.youtube.com/watch?v=ajVfBJrTOxw
Time 2hr

My estimated learning time: 1 week
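To make the dimensional modeling terminology concrete, here is a toy star schema (one fact table with foreign keys into two dimension tables) with made-up data; the point is the shape, not the content:

```python
# Toy star schema: fact_sales references dim_date and dim_product by key.
import pandas as pd

dim_date = pd.DataFrame({"date_key": [20240101, 20240102],
                         "month": ["Jan", "Jan"]})
dim_product = pd.DataFrame({"product_key": [1, 2],
                            "category": ["Books", "Games"]})
fact_sales = pd.DataFrame({"date_key": [20240101, 20240101, 20240102],
                           "product_key": [1, 2, 1],
                           "amount": [20.0, 60.0, 35.0]})

# A typical report: join the fact table to its dimensions, then aggregate
report = (fact_sales
          .merge(dim_date, on="date_key")
          .merge(dim_product, on="product_key")
          .groupby(["month", "category"], as_index=False)["amount"].sum())
print(report)
```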

4. Knowledge of Distributed systems and computing architecture & Deep Understanding of Spark/Databricks

4a) Get familiar with Big Data Tools and Concepts

Working as a data engineer in the current landscape involves working with massive amounts of data, or using systems (distributed systems) built for working with massive amounts of data. Therefore, a good understanding of distributed systems architecture and computing for big data workloads is fundamental to any Data Engineer’s success. Most of these videos are about an hour long. However, they provide the context and foundational knowledge of the principles employed in the design and use of these big data tools. Many other tools share similar distributed architecture patterns, and you will notice how they translate. This basically prepares us for the next section, which is Spark.
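Before jumping into the videos, here is a toy word count illustrating the split-map-combine pattern these engines use; real engines like Hadoop MapReduce and Spark add scheduling, shuffling across machines, and fault tolerance on top of this basic idea:

```python
# Toy word count showing the split -> map -> reduce pattern behind Hadoop
# MapReduce and Spark, using local processes instead of a cluster.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk_of_lines):
    """Map step: each worker counts words in its own chunk independently."""
    counts = Counter()
    for line in chunk_of_lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    lines = ["spark is fast", "hadoop is reliable", "spark and hadoop"]
    # Split: one chunk per worker (here, one line per chunk for simplicity)
    chunks = [[line] for line in lines]

    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)   # map in parallel

    # Reduce: merge the partial results into one final answer
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```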

a) Hadoop Architecture and Ecosystem
https://www.youtube.com/watch?v=m9v9lky3zcE
Video time 1hr

b) Distributed Systems lecture
https://youtu.be/Y6Ev8GIlbxc
Video time 1hr

c) Distributed computing lecture
https://youtu.be/ajjOEltiZm4
Video time 15 mins

d) Big data File Format
https://youtu.be/jKfKmBdPuT4
Video Time 8 mins

e) Optional: Hive tutorial
https://youtu.be/nVI4xEH7yU8
Video Time: 2 hrs

f) Optional: Massive Parallel Processing Engines
https://youtu.be/NUGcAUyQY-k
Watch time 1 hr

4b) Deep Understanding of Spark/Databricks

It is a good idea to pick one of the most popular big data computing tools and know it very well. In doing this you learn more about big data and also gain a very marketable skill that is in high demand. At the height of the hype and glamour around Hadoop and its ecosystem, many tools and projects were developed; however, only a few remain prominent today. One of them is Spark, which distributes big data computation in memory across a cluster of machines. It is quite efficient and supports multiple languages such as SQL, Python, Scala, and R. Databricks, a cloud-managed platform built around Spark, is also insanely popular due to its cost-efficiency. Knowledge of this tool’s architecture and optimization is important to be useful in the job market.
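Here is a minimal PySpark sketch of what working with Spark looks like. In a Databricks notebook the `spark` session already exists; locally, this assumes you have pyspark installed, and the data is made up:

```python
# Minimal PySpark sketch: the same aggregation via the DataFrame API and Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("East", 100.0), ("West", 250.0), ("East", 50.0)],
    ["region", "amount"],
)

# DataFrame API
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

# Same thing expressed in Spark SQL
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```

Notice how far plain SQL and basic Python get you; Spark spreads the same logic across a cluster behind the scenes.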

a. Create a Databricks community edition account so you have a platform to practice on
https://community.cloud.databricks.com/login.html

b. Understand Spark architecture and the overall capabilities of Spark (Scala course):
I have not watched this video course, but it promises to go over the in-depth architecture of Spark and Scala (the language Spark is written in). Don’t worry about Scala; Spark supports SQL and Python, so you don’t need to be proficient in it.

Spark Essentials
https://www.udemy.com/course/spark-essentials/
Cost: 11–15 dollars
Course Time: 7.5hrs
Course Learning Time: 3weeks

c. Optional: PySpark Tutorial - Knowledge of SQL and Python will make learning PySpark very easy
PySpark for Spark
If your SQL is really strong, Spark SQL will be sufficient for most data warehousing use cases in Spark. The things you need PySpark for are Spark streaming use cases and machine learning. These can be learned on the job through Google searches, or you can take this course:

https://www.udemy.com/course/spark-and-python-for-big-data-with-pyspark/
Cost: 11–15 dollars
Course Time: 11hrs
Course Learning Time: 3weeks

d. Databricks/Spark Optimization: this is important because a lot of interviews ask about it (a short sketch follows below)
https://www.youtube.com/watch?v=daXEp4HmS-E&t=99s
Video time: 1 hr

Note that if you have good knowledge of SQL and Python, you can do a lot of work with Spark.
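As a taste of what “optimization” means here, this short sketch (continuing the made-up `spark`/`df` from the earlier example) shows two levers that come up in interviews: caching a reused DataFrame and controlling partition counts before writing:

```python
# Two common Spark optimization levers, reusing `df` from the earlier sketch.

filtered = df.filter(df.amount > 60).cache()   # keep reused data in memory
filtered.count()                               # materialize the cache

print(filtered.rdd.getNumPartitions())         # inspect current partitioning

# Fewer, larger partitions before writing small outputs; coalesce avoids a
# full shuffle, while repartition(n) would trigger one.
filtered.coalesce(1).write.mode("overwrite").parquet("/tmp/high_value_sales")
```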

5. Cloud Knowledge and Cloud Data Tools/Services

Now that we have gone over the cloud vendor-agnostic skills, let me delve into the marketable skills you will need in addition to the above to actually hit the ground running and begin working as a data engineer. In this case, I will focus solely on describing what you need to be an Azure Data Engineer, since that is where I have focused for the past few years.

Please note that these skills also translate to the equivalent technologies in other cloud environments.

Becoming an Azure Data Engineer

a) Azure Cloud knowledge

Learn Azure by enrolling in a certificate course like the ones below

Doing computing in the cloud is a mindset and technology-stack shift for many organizations running on-premises. It has its advantages, pitfalls, and constraints. Learning the basics of Azure cloud infrastructure really helps you understand the value the cloud brings to an organization for data analytics. You will also learn which tools to use in which scenario, along with best practices for data security, cost optimization, and resource management.

I would start with
a) Azure AZ-900: Azure Fundamentals

https://docs.microsoft.com/en-us/learn/paths/azure-fundamentals/

b) Azure Data Solution Services
https://www.youtube.com/watch?v=ohya6zTa1Hg
Watch time: 1hr

Azure Data Engineer
https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer

Learn a simple ETL tool in Azure - Azure Data Factory

Every data engineer in Azure loves this tool. Azure Data Factory is really easy to learn and use, and it is very powerful. It is also one of the most intuitive and well-documented Azure products. There is a ton of information about it on the Azure website and on YouTube. I encourage you to learn it and keep growing with it. It will simplify the ELT and data workflow orchestration and scheduling responsibilities you have as a data engineer.

Comprehensive overview playlist
https://www.youtube.com/watch?v=Mc9JAra8WZU&list=PLMWaZteqtEaLTJffbbBzVOv9C0otal1FO

Advanced Data Factory concepts (Parameterization)
https://youtu.be/K5Ak4IdtBCo

Worthy Mentions

Some readers of this post will be surprised that I have not mentioned skills like NoSQL, streaming, graph databases, machine learning, relational databases, etc. I am aware that they are important, but I don’t think they are fundamental for a beginner. Learning the above and becoming comfortable with it is hard enough, and I wanted to ensure folks do not get overwhelmed.

Anyway, for data engineers that want to learn more, try the topics below:

1. Understanding streaming technologies like Apache Kafka

2. NoSQL Databases
