What does a Data Engineer do?

A simple explanation of the career and what I do on a daily basis. For those who wondered about the difference between analysts, engineers, and scientists.

Published in CodeX · 7 min read · Aug 14, 2021

There are so many roles in the data industry right now that it’s easy to get lost, even if you’re in the field (it happened to me when I first heard of Amazon’s BIE role). The difference between the three big data players (analysts, engineers and scientists) has been discussed in many posts like this. Actually, I walked through these three options to choose my own path (this article really opened my mind to choosing DE instead of DS). This comparison is a great place to start studying data careers and to understand how they work together in a data-driven business. Here I talk more about my own role, the Data Engineer position.

Disclaimer: there are many DE variations focusing on infrastructure, platform, coding, pipelines or metrics. Mine has always been closer to analytics, as an AE (Analytics Engineer).

To dive into the semantics, we just combine the words Data + Engineer. You might know data as pieces of information with the potential to drive business decisions. Remember that engineering is a field of study based on applying scientific principles to design and build products (such as facilities in civil, machines in mechanical and devices in electrical engineering). Therefore, data engineers work on building solutions with data. Some address common problems, like providing reliable data tables every day, whereas others can be so innovative that customers don’t even know they have the need yet. And some work backstage, on things like performance and costs.

Real data example

Imagine a company selling its set of products (or services) in a certain place (a set of covered cities in a country). The investors put their money in expecting great return rates, so business managers have to think about the best strategies and report results to them. Decisions should be based on data analysis, so they will ALWAYS need numbers. Our friends, the Data Analysts (or Business/BI Analysts), are called to answer questions like these (really common in every business and explored in analytics courses):

In which city should we invest more in advertisement? Which products should be discontinued?

Data Analyst vs Data Engineer vs Data Scientist — Edureka

The DAs interpret requests and translate them into metrics. Success can be measured by the number of sales, distinct clients, total revenue or another indicator, as long as it makes sense in that context. This is the responsibility of the DA and an awesome job (the “link” between people and computers!). DAs usually model data with SQL, querying existing tables like sales, as in the code below. They analyse the output numbers to find insights (useful discoveries like growing trends or unusual behavior in a certain segment). This is the delivery, whose result might be presented with graphics (as a dashboard in Looker/Tableau or slides with Colab/Jupyter notebook screenshots).

SELECT
  date,
  city,
  product,
  COUNT(DISTINCT id_customer) AS customers,
  COUNT(id_sale) AS sales,
  SUM(amount) AS total_revenue
FROM sales
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
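To see what a query like this returns, here is a minimal sketch that runs it against an in-memory SQLite table. The rows, city names and product names are made up for illustration, and the column types are my assumption, since the article does not describe the real sales table. All three non-aggregated columns are grouped explicitly.

```python
import sqlite3

# Hypothetical toy rows: (id_sale, date, city, product, id_customer, amount)
ROWS = [
    (1, "2021-08-14", "Sao Paulo", "notebook", "c1", 100.0),
    (2, "2021-08-14", "Sao Paulo", "notebook", "c1", 50.0),
    (3, "2021-08-14", "Sao Paulo", "notebook", "c2", 25.0),
    (4, "2021-08-14", "Recife", "phone", "c3", 80.0),
]

QUERY = """
SELECT
  date,
  city,
  product,
  COUNT(DISTINCT id_customer) AS customers,
  COUNT(id_sale) AS sales,
  SUM(amount) AS total_revenue
FROM sales
GROUP BY date, city, product
ORDER BY date, city, product
"""


def run_report(rows):
    """Load the toy rows into an in-memory table and run the report query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id_sale INTEGER, date TEXT, city TEXT, "
        "product TEXT, id_customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


report = run_report(ROWS)
for line in report:
    print(line)
```

Each output row is one (date, city, product) combination with its distinct customers, number of sales and total revenue.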

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

Ok. So where is the DE?

Despite my bias as a former DA, I first need to explain the motivation behind the data so you understand the impact of the DE on business applications. Sometimes DEs don’t understand much of that side, and DAs don’t need to call DEs either. We even forget about DEs when everything works fine or things never change (so unrealistic!). But behind the scenes, they are there taking care of our data!

The reason why tables like sales are available every day and with correct numbers is the DEs’ work. In summary, we ensure this data is collected regularly from the right sources, extracted or streamed successfully, treated to be easier to use and validated to provide the right value (what a response, huh? 😎). It feels like the data engineers are the “gear” powering this complex workflow (called a pipeline!). Besides maintenance, the business and its processes might change, and then we need to update parts of this pipeline.

But where does data come from? Why does it have to follow a path? These are interesting questions answered by ETL/ELT (extract, transform, load) processes. Nowadays we produce data with variety, volume and velocity (three of the 5 “V”s of Big Data), and it comes from many sources, either structured in a defined format like tables or unstructured like audio, images and video. So guess what: the original data is not as simple as what you see in the final table! This is a fictitious example that mimics the record of a single sale (similar to what DEs find when connecting to APIs to access data from a particular service):

{
  "id": 1,
  "product": 504,
  "customer": "ff6626c69507a6f511cc398998905670",
  "country": "BR",
  "city": 162,
  "timestamp": "1628942263",
  "status": "completed",
  "currency": "BRL",
  "item_amount": 100,
  "quantity": 1
}

I just love how we model raw records into beautiful and powerful data!
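As a rough idea of that modelling step, here is a minimal sketch that turns a trimmed, valid-JSON version of the record above into a row shaped like the sales table. The lookup dictionaries, the city and product names, and the rule that amount equals item_amount times quantity are all my assumptions for illustration; a real pipeline would take them from other sources and business definitions.

```python
import json
from datetime import datetime, timezone

# Hypothetical lookup tables; real ones would come from other sources.
CITIES = {162: "Sao Paulo"}
PRODUCTS = {504: "notebook"}

# Trimmed version of the fictitious raw record.
RAW = """
{
  "id": 1,
  "product": 504,
  "customer": "ff6626c69507a6f511cc398998905670",
  "city": 162,
  "timestamp": "1628942263",
  "status": "completed",
  "item_amount": 100,
  "quantity": 1
}
"""


def to_sales_row(raw_json):
    """Parse one raw sale record and flatten it into a sales-table row."""
    record = json.loads(raw_json)
    # The timestamp arrives as a string of epoch seconds; read it as UTC.
    sold_at = datetime.fromtimestamp(int(record["timestamp"]), tz=timezone.utc)
    return {
        "id_sale": record["id"],
        "date": sold_at.date().isoformat(),
        "city": CITIES.get(record["city"], "unknown"),
        "product": PRODUCTS.get(record["product"], "unknown"),
        "id_customer": record["customer"],
        # Assumption: revenue of a sale is unit amount times quantity.
        "amount": record["item_amount"] * record["quantity"],
    }


row = to_sales_row(RAW)
print(row)
```

The numeric codes become readable names and the epoch timestamp becomes a calendar date, which is what makes the final table easy to query.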

Hard to follow, right? Data engineers take care of the whole lifecycle of the data. They act from collection at the producers to release to the stakeholders. We make sure we do not lose data along this flow, so it arrives with quality. We also think about every detail to promote usability, so that people love the data and want to use it in their analysis! Otherwise, what is the point if no one can use it? Data should be easy to get and use, fast to run and simple to understand.

Think about how familiar the table name sales looks. We follow good coding practices, such as naming things intuitively and without ambiguity. For me, data modelling is like a craft. During architecture design, we choose carefully how to name datasets and metrics so it’s easy for end-users to make their calculations (and for managers to understand). We design structures, choose tools and decide when to collect the data, how and where to save it, in which format, whom to give access to and so on. We can also catalog definitions and table relationships so people know where to find information. These kinds of initiatives make users more self-sufficient and autonomous in their work. Notice how we influence the data-driven mindset?

But it’s not paradise. Problems happen and we solve them! This can be time-consuming, but it’s definitely a good challenge (I enjoy the adrenaline of fixing a broken pipeline, like a hero saving the world of data analysis). Programs crash sometimes and we need to act ASAP to recover the missing data or fix the wrong data. We propose solutions and decide on strategies, such as how and when to rerun, while keeping customers informed. With great power comes great responsibility! I suffered in the beginning, not knowing what to do, but issues help you gain experience and learn. It’s part of our culture to document and share them in post-mortems. Working in a team with great people is awesome!

There is also the part about innovating and improving things. As we learn from the past, we can add validation checks to anticipate problems (there shouldn’t be negative values for price). If the company acquires a new sales platform, we’ll retrieve and incorporate data from this new source too. If a metric changes, we’ll update columns and possibly reload tables with the old values. When regulations such as LGPD came in, we started protecting data so that no sensitive information is exposed (your ID becomes a hash to hide your name). Moreover, we try to stay up to date with the newest and best technologies in the market, so we perform PoCs, migrate tools and develop in-house solutions.
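The validation check and the hashing mentioned above can be sketched in a few lines. This is only an illustration under my own assumptions: the field names, the extra quantity rule and the salt value are hypothetical, and a salted SHA-256 hash is just one common way to mask an identifier, not necessarily what any given company uses.

```python
import hashlib


def pseudonymize(customer_id, salt="demo-salt"):
    """Replace an identifier with a salted SHA-256 hex digest.

    The salt value here is a placeholder; a real one would be kept secret.
    """
    return hashlib.sha256((salt + customer_id).encode("utf-8")).hexdigest()


def validate(row):
    """Return a list of problems found in a row (empty list means it passed)."""
    problems = []
    if row["amount"] < 0:
        problems.append("negative amount")
    if row["quantity"] <= 0:
        problems.append("non-positive quantity")
    return problems


good = {"amount": 100, "quantity": 1}
bad = {"amount": -5, "quantity": 0}

print(validate(good))
print(validate(bad))
print(pseudonymize("customer-123"))
```

Rows that come back with problems can be held out of the final table and reported, instead of silently reaching the analysts.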

Sounds like a lot? These responsibilities are more related to the pipelines and analytics world. There are other initiatives in data engineering, such as automating processes, creating code patterns or templates to abstract logic and avoid repetition, testing code to prevent failures, monitoring applications to easily detect anomalies, refactoring code and resizing resources (machines, licenses, storage) to optimize performance and cost. These responsibilities are usually closer to data platform engineers.

Note that this scenario is just one example. Data Analysts are just one type of stakeholder using the data. There can also be Data Scientists and Machine Learning professionals working on more robust analyses, which rely heavily on statistical models, algorithms, neural networks and artificial intelligence (AI). We DEs set up the architecture so everyone can play with data. 🎲

How do you become able to do all that? We are originally specialized software engineers, just as databases are one area of back-end development. Well, most of us are, I’m not! Of course a CS, SWE or IT major definitely makes this journey easier by providing a solid background in computing fundamentals. But not having one is not the end of the world, as today you can learn it all online! If you want to find out more, I’m an advocate of self-teaching in these topics. Check my articles Where should i start studying Data? and How I moved from back-office to back-end in 3 years to get inspiration to start! 🤓

I hope the example made it clear what a Data Engineer (usually) does, our common challenges and how important we are to organizations! ❤️

Are you into DE too? In doubt about other careers? Please share your feedback, I’m dying to hear it. And share this article with your data-loving friends!


Written by Carolina Maia