PATH TO BECOME A DATA ENGINEER

Junaid Effendi
Apr 18, 2020

Data Engineering is one of the most in-demand jobs today. As data grows, so does the need for Data Engineers, and with technologies like Spark and Hadoop becoming common, companies are looking to hire people who can handle data efficiently. I personally believe there is still a shortage of Data Engineers in the market, and it's still a good time to consider this field if you are interested in pursuing it.

Data Engineering is not a new name; it has been around for decades, but it was not used the way we see it today. Back then, a Data Engineer's job was also to handle, manage, and transfer data; the difference was in the technologies and the size of the data. Because data volumes were small, companies used old-school SQL systems, which were enough at the time, but the growth of data in recent years called for a better solution. A few of the most common technologies you will hear about are Spark, Hive, and Presto, which are generally cheaper and faster at that scale.

This article is going to answer the most common questions I see every other day on Quora.

Typical questions are:

  • How to become a Data Engineer?
  • What’s the path to be a Data Engineer?
  • How to switch from a Software Engineer to a Data Engineer?
  • What technologies are required for a Data Engineer?

A typical path to becoming a Data Engineer includes a few important things:

  • Love for Data
  • Big Data Technologies
  • Programming

Love for Data

This is about how passionate you are about data. You should work with the type of data you love; for example, if health care data does not interest you, look for a domain you do like. Almost every type of data is out there in the market. Top examples include healthcare, finance, real estate, ad tech, and social media.

Big Data Technologies

In this space, the technologies are still not that mature: companies are still adapting and improvements arrive every year, so if you want to be in this field you will need to be proactive in keeping up with new tools and features.

Technologies can also be subdivided into categories:

Platforms

AWS and Google Cloud each comprise various technologies that help build a fully scalable and reliable system. Compute services like EMR (which runs on EC2), databases like Redshift, and query services like Athena are some of the common services used by Data Engineers. These platforms also release improvements and new tools all the time, so you have to keep an eye on them.

Data Processing

The open source Hadoop ecosystem, which includes top tools like Spark, Hive, and HDFS, is very common in the market. Spark is commonly used to process large amounts of data, but SQL still holds its place: tools like Hive and Presto (Athena in AWS) use SQL to query data sitting on a filesystem, and Spark supports SQL as well through Spark SQL. In short, SQL is still worth learning. A new Spark release, Spark 3.0, is coming up.
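
To make the point about SQL concrete, here is a minimal sketch of the kind of aggregation query a Data Engineer writes all the time. It uses Python's built-in sqlite3 module only so that it runs anywhere; the events table, its columns, and the sample rows are made up for illustration. A similar query would work in Hive, Presto, or Spark SQL, though each engine has its own dialect differences.

```python
import sqlite3


def count_events_by_user(rows):
    """Load (user_id, event_type) rows into an in-memory table and count events per user."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table used only for this example.
        conn.execute("CREATE TABLE events (user_id TEXT, event_type TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        query = """
            SELECT user_id, COUNT(*) AS event_count
            FROM events
            GROUP BY user_id
            ORDER BY event_count DESC, user_id
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


sample_rows = [("alice", "click"), ("bob", "view"), ("alice", "view")]
print(count_events_by_user(sample_rows))  # [('alice', 2), ('bob', 1)]
```

The GROUP BY pattern is the part that carries over between engines; only the way you connect and load data changes.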

Schedulers

Airflow and Luigi are the most common schedulers in the market at the moment. Both are open source projects, started at Airbnb and Spotify respectively; they do the same workflow management job but take different approaches.
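
At their core, both tools run tasks in dependency order. The sketch below illustrates only that idea using Python's standard library graphlib module (Python 3.9+); it is not the Airflow or Luigi API, and the task names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, actions):
    """Run each task after all of the tasks it depends on; return the run order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        actions[task]()
    return order


# Hypothetical three-step pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
log = []
actions = {name: (lambda name=name: log.append(name)) for name in dependencies}
print(run_pipeline(dependencies, actions))  # ['extract', 'transform', 'load']
```

Real schedulers add a lot on top of this ordering, such as retries, scheduling intervals, and monitoring, which is where their approaches differ.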

Databases

Old-school database systems are likely to go away pretty soon, as new data warehousing technologies like Snowflake are emerging very quickly. AWS Redshift is still popular among companies, though it can be costly and is harder to scale.

Programming

Yes! Data Engineers are programmers as well; they write code to support data: pre-processing logic, ETL, schedulers, and much more. That's why most Data Engineers used to be Software Engineers, and companies might prefer to hire a Software Engineer as a Data Engineer rather than someone fresh to the industry.

The top programming languages used in this field are Scala, Python, and Java. Spark is written in Scala and supports all three languages, while schedulers like Luigi and Airflow are usually written in Python.
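
As a small illustration of the pre-processing and ETL code mentioned above, here is a minimal sketch in plain Python. The column names, the cleaning rules, and the sample data are all assumptions made up for this example; a real pipeline would read from and write to actual storage such as S3 or a warehouse.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows with no amount, normalize names, cast amount to float."""
    cleaned = []
    for row in rows:
        amount = (row.get("amount") or "").strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append({"name": row["name"].strip().lower(), "amount": float(amount)})
    return cleaned


def load(rows):
    """Load: serialize rows as JSON Lines (returned as a string in this sketch)."""
    return "\n".join(json.dumps(row) for row in rows)


# Hypothetical input with one messy name and one missing amount.
raw = "name,amount\n Alice ,10.5\nBob,\nCarol,3\n"
print(load(transform(extract(raw))))
```

Keeping extract, transform, and load as separate functions makes each step easy to test on its own and to plug into a scheduler later.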

Bonus!

If you are looking for a job, these are typically the minimum set of skills a Data Engineer must have, based on my experience of looking at hundreds of job postings.

  • AWS (EMR, S3, Redshift)
  • Hadoop (Spark, HDFS)
  • SQL
  • Airflow
  • Python

This was first published on my blog a few months ago; if you are interested, don't hesitate to take a quick look at it.
