Tech Skills for your first Data Engineer job

Pavan Raju
4 min read · Dec 11, 2021


I’ve been a data engineer for almost two years now at four different companies — thanks to consulting. The exposure has given me the chance to identify common patterns across the industry.

In the last few months, I’ve been approached by people in the tech and business communities wanting to become Data/ML Engineers and Data Analysts.

In this blog post, I wanted to take this chance to go over the fundamental tech skills most of these roles need as well as why they’re important.

Pipelines aren’t easy. Photo by JJ Ying on Unsplash

Background

Data engineers build data pipelines, dashboards, infrastructure and the data aggregations that help data analysts and scientists answer questions and make predictions.

Your day-to-day will vary greatly depending on the project you're working on and who you're delivering for. A non-technical user of the data might expect a dashboard to understand and digest how the business is operating, whereas a proficient data professional might expect the data to be presented in a database table so they can run queries on their own.

If you're at the beginning of a project, you might spend up to a week understanding the data, the requirements of the project and how the business operates. Understanding how a business operates is important for a data engineer and can really set you apart from other engineers. More on this in a future post though ;)

Depending on your background, you may already have some prerequisite knowledge to break into the field. There are also a bunch of skills you need that they don't teach you in university or put on job descriptions. You'll pick these up as you navigate your career, so you don't need to stress about them upfront.

If I were to start over, I’d prioritise learning the following three things.

SQL — Structured Query Language

SQL is the programming language most data professionals use to query their data from a database or data warehouse. You can write SQL to extract, transform and aggregate data to tell a story. Make sure you know the fundamentals of the syntax such as select, group by, order by, where and limit.
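To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module. The orders table and its numbers are made up purely for illustration, but the query exercises every keyword above in one go.

```python
import sqlite3

# In-memory database with a made-up orders table for practice
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 15.0), ("alice", 20.0), ("carol", 45.0)],
)

# select, where, group by, order by and limit in a single query:
# total spend per customer on orders over $10, biggest spenders first
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 10
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 2
""").fetchall()

print(rows)  # [('alice', 50.0), ('carol', 45.0)]
```

The same query would run near-identically on Postgres, Redshift or Snowflake, which is exactly why the fundamentals transfer so well.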

Once you’re comfortable with using these commands, I’d recommend tackling some case studies where there’s sample data available with some questions. As you work through harder problems, you’ll also learn about CTEs, window functions and other more complex aggregation methods.
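As a taste of what those look like, here's a small sketch combining a CTE with a window function, again via sqlite3 (window functions need SQLite 3.25+, which ships with modern Python). The sales table is invented for illustration.

```python
import sqlite3

# Hypothetical monthly revenue per region to practise on
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "2021-01", 100.0), ("north", "2021-02", 150.0),
        ("south", "2021-01", 80.0), ("south", "2021-02", 60.0),
    ],
)

# A CTE (WITH ...) names an intermediate result set; the window function
# then ranks each region's months by revenue without collapsing the rows
rows = conn.execute("""
    WITH regional AS (
        SELECT region, month, revenue FROM sales
    )
    SELECT region, month,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
    FROM regional
    ORDER BY region, rnk
""").fetchall()

print(rows)
```

Unlike a plain group by, every input row survives here; each one just gains a rank within its region.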

Most databases and data warehouse technologies share a similar SQL syntax, so learning these fundamentals will help you get rolling with most flavours of databases.

If you’re looking for SQL resources, I’d recommend Serious SQL. It has lots of questions, sample datasets, it’s easy to set up and has a Discord community.

Full disclaimer: I'm a mentor on this course and I don't earn commission on sales; it's just a great course. There is a student discount available if you send a request to support@datawithdanny.com with your student email.

If you're after free resources, you can always check out HackerRank and LeetCode too; I've used both to sharpen my skills.

Once you start nailing how to query databases, you’ll be creating tables of your own!
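A common first step into table creation is the CREATE TABLE ... AS SELECT pattern, which materialises a query result as a new table. Here's a hedged sketch with sqlite3 and an invented events table:

```python
import sqlite3

# Made-up raw event data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (username TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", "click"), ("a", "view"), ("b", "click")],
)

# CREATE TABLE ... AS SELECT turns a query into a persistent table,
# the kind of aggregation you'd hand to analysts to query directly
conn.execute("""
    CREATE TABLE clicks_per_user AS
    SELECT username, COUNT(*) AS clicks
    FROM events
    WHERE action = 'click'
    GROUP BY username
""")

rows = conn.execute(
    "SELECT * FROM clicks_per_user ORDER BY username"
).fetchall()
print(rows)  # [('a', 1), ('b', 1)]
```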

Python

Python is a multi-purpose programming language that can be used for the web, backends, cloud infrastructure, data science and data engineering. As a newcomer to the data engineering community, I'd suggest learning Python with a focus on data manipulation, processing and visualisation libraries such as pandas, numpy and matplotlib.
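As a tiny taste of the data manipulation side, here's a sketch with pandas; the signup numbers are made up for illustration.

```python
import pandas as pd

# Made-up daily signups per acquisition channel
df = pd.DataFrame({
    "channel": ["ads", "organic", "ads", "organic"],
    "signups": [120, 80, 90, 110],
})

# Aggregate total signups per channel, one line instead of a manual loop
totals = df.groupby("channel")["signups"].sum()
print(totals["ads"], totals["organic"])  # 210 190
```

This is the same select/group by/sum thinking as SQL, just expressed in Python, which is why the two skills reinforce each other.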

Many Data Scientists and ML Engineers I know develop their final models in Python too using libraries like scikit-learn.

The alternatives to Python in data pipeline development are Scala and Java; for data science, it would be R.

One of the best data-focused Python tutorials I've seen is freeCodeCamp's Data Analysis with Python on YouTube.
It goes through all the libraries I recommended above, along with some good beginner practices such as working in Jupyter notebooks.

Fundamentals of the Cloud

Many building blocks of a modern data warehouse run in the cloud. Snowflake, Redshift, Databricks and Fivetran all run on one or more of the three major cloud vendors (Amazon Web Services, Microsoft Azure and Google Cloud).

Having a basic understanding of how the cloud works and how it affects the tools you operate will really help you. You don’t need to know all this in detail upfront as you’ll pick up on more concepts through your journey.

If you’re working with AWS, I’d start by learning about S3 (Simple Storage Service), EC2 (Elastic Compute Cloud) and Redshift. S3 and EC2 are the building blocks of most other AWS services as well as cloud data warehouses such as Redshift.

There are plenty of other cloud services you’ll come across in your travels.

Parting Words

I admit that I've been overwhelmed several times by the number of things to learn as a Data Engineer. There's a new piece of tech out almost every few days or weeks, since the data space is rapidly evolving.

It's easy to stress out about learning everything under the sun (or maybe it's just me!). If you know the fundamentals of SQL, Python and the cloud, you'll be able to adapt and tackle anything else that comes your way.

Fundamentals are everything!

I hope this helps.

If you’d like to reach out, you can find me on Twitter: @pavanraju023


Pavan Raju

Data/ML Engineer from Sydney. I like data, software and BJJ.