How would I be a Data Engineer in 2024 and beyond?

Subhayan Ghosh
7 min read · May 5, 2024


This is the first article I am writing on Medium. However, I have been working as a data engineer at Mercedes for 5+ years now, so it is only fitting that I start with something that can help aspiring data engineers.
That said, some of the points mentioned will help experienced data engineers as well.
I'd also love to get some feedback so that I can get better at writing.

And yes, whatever I am going to write will stay relevant for the foreseeable future, so even if you're reading this after 2024, give it a read; it'll be worth it.

Okay, enough chit-chat, let's dive into the world of data.

So, if you're a complete newbie to software engineering, then follow this:

  1. Learn a programming language first.
    Here you have two choices (more than two if I am being honest, but you don't need to think about all that now).
    a. Python: comparatively easy to learn, heavily used in data science, AI/ML, etc. Good choice. 😍
    b. Scala: comparatively tricky to learn, and it gives you a harder time in the beginning, but it makes you ready for the daunting Java programming 😢 (trust me, at some point in your career you'll need it, and it's the best language for learning Object-Oriented Programming; Python can be used for that too, but it's not the best option).
    Now all you need to do is choose a language and learn the basics (just find any playlist on YouTube and finish it), practice as many problems as you can on GeeksforGeeks, and create 2–3 small projects (don't follow any course for projects or practice, seriously don't). Okay, this is getting long; I'll create a separate post on this, and I would love it if you followed me to read that. 😎
    After you are somewhat familiar with a programming language, if you are not in a rush to get a job, I would highly suggest you learn Data Structures and Algorithms (DSA) and practice questions (trust me, it'll give you an unfair edge over others who did not read this blog 😂).
    Regarding DSA, how deep you want to go determines how much time you need to spend (basics: 1–2 months; advanced might take 6–12 months).
    Once you're familiar with DSA, start working on side projects for your resume (again: don't follow any course/guided material).
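To give you a taste of the kind of thing DSA practice covers, here's one classic you'll meet in the first week of any playlist: binary search on a sorted list, which finds an element in O(log n) instead of scanning everything. (This is just an illustrative warm-up, not a substitute for proper practice.)

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # midpoint of the remaining range
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1              # discard the left half
        else:
            hi = mid - 1              # discard the right half
    return -1

print(binary_search([2, 5, 8, 12, 16, 23], 16))  # 4
print(binary_search([2, 5, 8, 12, 16, 23], 7))   # -1
```

Once this feels obvious to you, you're ready for the medium-level questions on any practice site.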

If you're not new to software engineering, then great, you're much closer to becoming a Data Engineer than you think!

So, the must-have skill for a data engineer is at least intermediate-level knowledge of SQL.
Depending on how familiar you are with the following topics, you can spend one to three weeks here (even if you know the concepts or have already worked with them, you need to dig deep).
This time I will encourage you to start with Oracle/Postgres, and if you love setting things up on your laptop, then install Postgres (it's open source and all).
If you don't, then open a free account with Oracle and start using https://livesql.oracle.com/ . Here you'll see a few databases already created, and you can easily run SQL queries on the go.
There are many other alternatives, but I use this one as I started my career as an Oracle developer.
Now that we have a playground ready, you need to master the following topics:
1. DDL, DML, DCL
2. Joins, grouping, and aggregating data
3. Order of execution of the various clauses (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY)
4. Types of sub-queries
5. How to structure your queries for maximum efficiency
6. Window functions
7. CTEs and recursive CTEs
8. Indexing and types of indexes
9. How to read a query plan
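To show what topics 6 and 7 look like in practice, here's a small scratchpad you can run with nothing but Python's built-in sqlite3 module (its SQL dialect is close enough to Postgres for these features; the table and data below are made up for illustration):

```python
import sqlite3

# In-memory database with a toy employees table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, dept TEXT, salary INT);
    INSERT INTO emp VALUES
        ('amy', 'eng', 120), ('bob', 'eng', 100),
        ('cho', 'hr', 90),  ('dev', 'hr', 95);
""")

# Topic 6 — window function: rank employees by salary within each department.
top = conn.execute("""
    SELECT name, dept,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
    FROM emp
""").fetchall()
print(top)

# Topic 7 — recursive CTE: generate the numbers 1..5.
nums = conn.execute("""
    WITH RECURSIVE seq(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 5
    )
    SELECT n FROM seq
""").fetchall()
print([n for (n,) in nums])  # [1, 2, 3, 4, 5]
```

If you can explain why RANK() needs PARTITION BY here and how the recursive part of the CTE terminates, you're on the right track for these two topics.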
I will be writing a detailed blog on learning SQL; till then, learn and master the above topics 🤓.
There are many playlists and courses for SQL as well, but I found that Kudvenkat has a playlist on SQL basics that covers all the above topics in a structured way; you can follow that. The Trendytech and Ankit Bansal channels on YouTube have playlists as well. I'll suggest going through a couple of videos from each and following the one that resonates with you.
Once you're somewhat comfortable with SQL, start practicing one or two questions each day from Ankit's channel, and try to solve them on your own before looking at the solution.

At this point you’re ready to step into the big data world.

Here, before going directly into Spark, I'll suggest you get familiar with Hadoop first.
This will help you understand why Spark was invented and build your intuition.

You can follow a playlist from Prashant Pandey sir; he has a channel named Learning Journal on YouTube, and it has a playlist that explains exactly what you need.

Now if you're not in a rush, maybe watch some videos on other Hadoop ecosystem tools like Hive and Sqoop, and get some familiarity with Linux.
For practicing Linux, you can set up WSL on Windows and play with it, or maybe install a virtual machine to run Linux.
If you can afford it, I will highly suggest you take the 6-month subscription of Itversity labs for 3,000 INR. This is a great option, as you get access to a cluster like the ones in production systems, plus access to almost all the needed big data tools at a cheaper cost than any cloud platform. You need to choose the best option for you, but having some idea of Linux will help you grow as a data engineer (knowledge of DevOps is going to be mandatory in the future, I believe).
So, give this part one week at most, then start with the industry leader: Spark.

For Spark, you need to spend a little to purchase a couple of courses from Prashant Pandey sir on Udemy ("Spark for beginners" is what you can start with; wait for Udemy discounts so that you can get it at a cheaper cost, and before purchasing, search using your browser's incognito mode 😉).
You can watch the following video as well just for getting started https://youtu.be/7ooZ4S7Ay6Y?si=QZeUPKAG3KPok3Xi

If you have taken the Itversity labs subscription, then you can use their Jupyter notebook setup for practice instead of going through the hoopla of installing Spark on your system, and that helps with taking notes as well. Take a look at the following repo to see what it will be like: https://github.com/subhayansg/PySparkInJupyterNotebook
Now for practice, just rewrite the SQL challenges you're solving daily using Spark. That's it.
This will be enough for you to get a good hold on writing code in Spark. And whenever you're stuck, go to the sparkbyexamples website; they will have tutorials for sure.
Also, this is high time to start exploring the official Apache Spark documentation and getting familiar with the theoretical aspects: how Spark runs, how memory is used in Spark, the must-know theoretical concepts of Spark, and the most-asked Spark interview questions.

Well, job is done! Congrats!!!

No, it’s far from over 😒.

Now you need to create some data engineering projects to get a feel for the real thing.
You can follow some from the startdataengineering website, or maybe anything that you would like to create. Here I'll suggest you follow a tutorial, because for this you need to learn how to write test cases for your project, how to do data quality checks, and how to monitor your jobs.
Till this point, all I was talking about is batch processing, meaning processing the data at a fixed interval. But you need to learn about stream processing, or real-time processing, as well.
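To make that distinction concrete before you pick a course, here's a toy contrast in plain Python (no Spark needed, and all the names here are made up): a batch job re-scans the full dataset on a schedule, while a streaming job keeps running state and updates it one event at a time.

```python
def batch_count(events):
    """Batch style: re-scan the whole dataset at each scheduled run."""
    counts = {}
    for key, _value in events:
        counts[key] = counts.get(key, 0) + 1
    return counts

class StreamingCounter:
    """Streaming style: keep state and update it as each event arrives."""
    def __init__(self):
        self.counts = {}

    def on_event(self, key, _value):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

events = [("clicks", 1), ("views", 1), ("clicks", 1)]

print(batch_count(events))   # {'clicks': 2, 'views': 1}

sc = StreamingCounter()
for key, value in events:
    sc.on_event(key, value)  # state is current after every single event
print(sc.counts)             # {'clicks': 2, 'views': 1}
```

Both arrive at the same answer; the difference is *when* the answer is available and what state you have to manage, which is exactly what the streaming frameworks handle for you at scale.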
For this, you should take up a course either by Prashant Pandey sir (stream processing using Scala or Python) or from the Rock The JVM series on Udemy (in Scala).
Here I’ll add a pro tip 😎.
If you have learned Scala, take the streaming course in Python; and if you have learned Python, take it in Scala. As both of these languages are used heavily in data engineering projects, it's good to know some basics of each so that you can work with either of the two, and this way you'll be able to learn much faster.
Now that you're familiar with streaming as well, add a streaming data source to the projects you have made previously. This will allow you to introduce some complexity into your project: how to handle idempotency and fault tolerance in your data pipelines, what to do in case your streaming data stops flowing into your data lake (if you're not familiar with this term, then learn a bit about it), how to run queries on streaming data (can you do all the aggregations that you can do on batch data?), the most-asked questions on Spark Streaming, etc.
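Idempotency is worth pausing on, because streaming systems often deliver the same event more than once after a retry. One common pattern is to key every write by a unique event ID and skip IDs you've already seen, so a replay can't double-count. A minimal sketch (all names here are made up; in a real pipeline the seen-ID set would live in durable storage, not memory):

```python
class IdempotentSink:
    """Skips duplicate deliveries so retries don't corrupt the totals."""
    def __init__(self):
        self.seen_ids = set()   # in production: a durable store, not memory
        self.total = 0

    def write(self, event_id, amount):
        if event_id in self.seen_ids:
            return False        # duplicate delivery: safely ignored
        self.seen_ids.add(event_id)
        self.total += amount
        return True

sink = IdempotentSink()
sink.write("evt-1", 10)
sink.write("evt-2", 5)
sink.write("evt-1", 10)   # replayed after a retry: no effect
print(sink.total)         # 15, not 25
```

When you wire a streaming source into your project, try killing the job mid-run and restarting it; if your totals stay correct, your sink is idempotent.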

Next, again I'll suggest you take another course from Prashant sir (Beyond Basics) and a course from the Rock The JVM series if you have the budget (on performance tuning; sadly, this one is only available on his website and costs about $85).

Huh! This was much longer than I anticipated it to be. But seriously, all this is just the beginner stuff and should not take more than 3–6 months. And once you're done with this, you're ready for a beginner-level data engineering job, so start giving interviews now.

Make sure to follow me for future articles where I go deep into different topics.
