Becoming a Data Engineer
As I explained in previous articles, I am a Software Engineer working at HelloFresh. I have been with the company for five years so far. Both before HelloFresh and within the company, I have held different roles.
I started my career as a web developer doing everything, something that nowadays is called a Full Stack Developer (I have a firm opinion about that position, and it is not good, but that would be another article).
Then I switched to mobile development for one year, working for a Dutch company, and finally I became a Backend Software Engineer for many years. At HelloFresh I had the chance to switch to Data Engineering about a year and a half ago.
In this piece I want to share my experience and explain what you need to learn to become a Data Engineer. Let me describe some of these topics.
Dimensional Modeling (also known as Star Schema)
This is a different approach to data modeling, and it can be confusing if you come from the Entity-Relationship Model. With Dimensional Modeling we model the data differently because we have different needs.
The basis of Dimensional Modeling is two big components: Fact Tables and Dimensions.
In the Fact Table, as the name states, we store facts. For instance, we sold one item: that order is a fact, something that happened at some point in time.
Then, we have the Dimensions. A dimension gives some extra information related to that fact. For instance, we can have a product dimension, a customer dimension, a date dimension, etc.
The Fact Table links to each Dimension with a Surrogate Key. This key points to the dimension row as it was at some specific point in time.
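To make the idea more concrete, here is a minimal sketch of a fact table and two dimensions using Python's built-in sqlite3 module. The table names, columns, and sample rows are all made up for illustration, and a real warehouse would not run on SQLite. Note how the same product appears twice in the dimension, once per price, and each sale points at the version that was valid when it happened.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: descriptive context. Each row has a surrogate key
    -- (an integer with no business meaning) next to the natural key.
    CREATE TABLE dim_product (
        product_sk  INTEGER PRIMARY KEY,  -- surrogate key
        product_id  TEXT,                 -- natural key from the source system
        name        TEXT,
        price       REAL,
        valid_from  TEXT,
        valid_to    TEXT                  -- NULL means "current version"
    );
    CREATE TABLE dim_date (
        date_sk     INTEGER PRIMARY KEY,
        full_date   TEXT,
        year        INTEGER,
        month       INTEGER
    );
    -- Fact table: one row per thing that happened, pointing at dimensions.
    CREATE TABLE fact_sales (
        product_sk  INTEGER REFERENCES dim_product(product_sk),
        date_sk     INTEGER REFERENCES dim_date(date_sk),
        quantity    INTEGER
    );
""")

# Hypothetical data: product P1 changed its price on 2018-07-01,
# so it has two rows in the dimension with different surrogate keys.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "P1", "Veggie box", 10.0, "2018-01-01", "2018-06-30"),
        (2, "P1", "Veggie box", 12.0, "2018-07-01", None),
    ],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?, ?)",
    [(20180315, "2018-03-15", 2018, 3), (20180802, "2018-08-02", 2018, 8)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 20180315, 2), (2, 20180802, 3)],
)


def revenue_by_month(connection):
    """Join the fact table to its dimensions and aggregate revenue per month."""
    query = """
        SELECT d.year, d.month, SUM(f.quantity * p.price) AS revenue
        FROM fact_sales f
        JOIN dim_product p ON p.product_sk = f.product_sk
        JOIN dim_date d ON d.date_sk = f.date_sk
        GROUP BY d.year, d.month
        ORDER BY d.year, d.month
    """
    return connection.execute(query).fetchall()


rows = revenue_by_month(conn)
print(rows)
```

Because each sale stores the surrogate key and not the natural product id, the March sale keeps the old price even after the product changed in July.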
To learn more about it, I highly recommend the book The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling.
Knowing your infrastructure
Working with data, you will need to build your infrastructure to store and process it. Here you have two different options: use some Hadoop flavor, or go for cloud tools like AWS Redshift plus some other services.
Choosing one or the other is your decision. If you do not want to take care of the infrastructure that much, I would recommend going for the AWS solution. But if you want to have more control over the whole process, I would choose some Hadoop flavor.
About Hadoop flavors, I can talk about Cloudera. With this Hadoop distribution, it is pretty easy to start handling a cluster with some nodes and start using HDFS to store your data in a distributed way.
With Cloudera Director and Cloudera Manager, the cluster is straightforward to maintain compared with other distributions like MapR.
If you decide to go for the cloud solution, then you will probably use S3 as your data storage and AWS Redshift as the solution to run some processes and query your data.
To query the data in your Hadoop distribution using SQL syntax, you can choose between Impala and Drill. I can recommend Impala as an excellent tool to query your data using SQL in combination with the Hive metastore.
As you can see, I have mentioned only a few tools that cover the basics. So, switching to data engineering means learning a lot of new technologies.
Analyzing your data
When you have the data stored in your distributed system, the next step is to start analyzing it: building tables by transforming and aggregating the data.
Here I can only talk about Spark, because it is the only tool I have used so far for that purpose. To be more specific, PySpark.
If you choose a Hadoop distribution, you can start doing Map/Reduce tasks out of the box. But this is an old-fashioned way to accomplish your goal.
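If you have never seen the Map/Reduce model, the following sketch shows its three phases in plain Python. It only illustrates the programming model on a single machine; it does not use Hadoop or any of its APIs, and the function names and sample data are made up for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for product, quantity in records:
        yield product, quantity


def shuffle_phase(pairs):
    """Group all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: (product, quantity sold) pairs.
sales = [("box", 2), ("wine", 1), ("box", 3)]
totals = reduce_phase(shuffle_phase(map_phase(sales)))
print(totals)
```

In a real cluster the map and reduce functions run on different nodes, and the framework takes care of moving the grouped data between them.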
Nowadays, Spark is a powerful tool to run tasks in a distributed way, and it makes it easy to manipulate and transform your input. Usually, the most natural tasks are batch processes. But if you need some real-time analysis, you can use Spark Streaming in combination with a message broker like Kafka.
Another similar technology would be Flink, but to be honest, I do not have any experience with it.
One advantage of using Spark is that it supports different languages like Java, Scala, and Python.
With Spark you can push the code to your data nodes and execute the transformations in a distributed way, trying to use the memory of the nodes instead of doing I/O operations on the file system. This is an advantage compared with Map/Reduce.
Orchestrating your tasks
With the data in your distributed solution and a technology to create tasks, the next step is to orchestrate when to run each task.
In my opinion, the two biggest tools to achieve that are Airflow and Luigi. Luigi is dying and losing support, while Airflow is an Apache project with Airbnb as the main supporter.
With Airflow, you can create a Directed Acyclic Graph (DAG) to set up the order in which your tasks should execute. A DAG is a collection of tasks whose dependencies determine the order in which they run.
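To show what "running tasks in a particular order" means, here is a small sketch that uses Python's standard graphlib module (available from Python 3.9). This is not Airflow code and does not use its API; it only shows how dependencies between tasks determine a valid execution order. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_sales": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_sales"},
}


def execution_order(graph):
    """Return one valid order; graphlib raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dependencies)
print(order)
```

The two extract tasks do not depend on each other, so they can come out in either order, and an orchestrator is free to run them in parallel. That is also why the graph must be acyclic: with a cycle there is no task that can safely start first.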
A typical use for such a DAG is an ETL (Extract, Transform and Load) pipeline, because the tasks cover those three necessary steps: extract the data from a source such as an API, transform the data in some way, and at the end load the data into your data solution.
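The three steps can be sketched with nothing more than the Python standard library. Everything here is an assumption made for illustration: the payload stands in for an API response, the field names are invented, and an in-memory SQLite database plays the role of the data solution.

```python
import json
import sqlite3


def extract(raw_payload):
    """Parse the raw response; a real pipeline would call an API here."""
    return json.loads(raw_payload)


def transform(orders):
    """Keep paid orders only and convert cents to a decimal amount."""
    return [
        (order["id"], order["amount_cents"] / 100)
        for order in orders
        if order["status"] == "paid"
    ]


def load(rows, connection):
    """Write the transformed rows into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
    connection.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    connection.commit()


# Hypothetical payload standing in for an API response.
payload = json.dumps([
    {"id": 1, "amount_cents": 1250, "status": "paid"},
    {"id": 2, "amount_cents": 990, "status": "cancelled"},
])

warehouse = sqlite3.connect(":memory:")
load(transform(extract(payload)), warehouse)
loaded = warehouse.execute("SELECT id, amount FROM orders").fetchall()
print(loaded)
```

Keeping each step in its own function makes it easy to turn every one of them into a separate task later, so the orchestrator can retry a failed step without repeating the others.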
It is possible to switch your career from Backend/Frontend engineering to Data Engineering, but you should be prepared to learn a new set of technologies and concepts, very different from the ones you are used to.
So my biggest advice is to be open-minded, humble and eager to read, practice, fail and start over. In a few weeks you will start feeling more confident about all these concepts.
By the way, if you are already a Data Engineer, my company is hiring! It does not matter if you are not a Data Engineer: we have other positions open for Machine Learning Engineer, Backend Engineer, Frontend Engineer, Android/iOS Engineer, and many others.