Demystifying Data Engineering

Pedro Castillo
6 min read · Nov 26, 2023

A data engineer, dressed as a wizard, coding. Ideogram.ai

I’ve been riding the software engineering wave for a while now: tinkering with frontends, diving into backends, juggling APIs and, for the last few years, working as a data engineer.

There’s this one question that keeps popping up in different workplaces: What’s the deal with data engineering, and why’s it so different from the regular software gig?

Alright, time to drop my two cents on this.

Definition of Data Engineering

Let’s demystify Data Engineering. According to P. Platter (Elite Data Engineering):

Data Engineering, as the word itself says, means engineering the data management practice.

The DAMA-DMBOK defines:

Data Management as the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycles.

Coined in the 2010s by big tech players like MANGA & friends, Data Engineering is a specialized form of software engineering focused on data — from infrastructure to cybersecurity, mining, modeling, processing, and metadata management.

In essence, Data Engineers are software engineers navigating the data landscape, using software engineering practices to optimize data for maximum impact.

What is not Data Engineering

Let’s debunk some myths about Data Engineering. First and foremost, it’s crucial to understand that tools do not define the job. Repeat after me: SQL, Spark, Snowflake, or any other shiny tool are not Data Engineering.

It’s like saying knowing how to use Excel transforms you into an accountant or a finance expert. Likewise, just because someone can write SQL queries or whip up some Python code, it doesn’t automatically grant the title of Data Engineer. Similarly, writing some React here and there doesn’t necessarily make someone a Frontend Engineer.

Data Engineering is not about the tools; it’s about the craft. Tools are accelerators of the craft, not the essence of it. Consider this: carpenters upgraded to electric saws and CNC cutters to enhance their work, but they still resort to hammers and handsaws when needed.

Why don’t job titles make it easier?

Now, let’s address the issue with job titles. Starting a title with a focus area like “Frontend Engineer” or “Data Engineer” can create mental restrictions. I’ve seen so-called frontend “engineers” who never touched a database or Data “engineers” convinced that data presentation is solely done through Tableau or PowerBI.

I prefer titles like “Software Engineer: Backend” or “Software Engineer: Data.” These acknowledge the foundation — software engineering practices encompassing design, architecture, testing, monitoring, and performance. They first emphasize the skill of engineering software and then highlight specialization in solving a specific family of problems.

Remember, in this dynamic field, cooperation with other specialists is essential. That’s why efficient organizations often have cross-functional teams. If you need to optimize UI performance due to complex data munging, delve into backend code. If data pipelines break due to inconsistent schemas, inspect the frontend code. You don’t need to be an expert in everything (unless you want to!), but knowing enough to engage in meaningful conversations fosters collaboration. After all, whether it’s for loops or variables, the basic building blocks don’t differ much across the popular programming languages in the industry.

After venting a bit, let’s get back to the matter.

So, why is it so different from regular software development?

Whenever this question pops up, what folks are really asking is: why does working in data seem so painful compared to traditional software development?

Here’s my take based on experience:

1. Unpredictable Inputs.

In traditional development, we might joke about users using the app ‘wrong.’ In data engineering, it’s not a joke; it’s a reality. Developing a data processing pipeline often means dealing with inputs you can’t control, such as a database dump or a Kafka topic managed by another team. Fields get added or removed without your say, and the application’s data bugs become your bugs.

Defensive development is key. Explicit agreements with data sources and collaboration help minimize disruptions.
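To make "defensive development" a bit more concrete, here is a minimal sketch of checking incoming records against an agreed contract before they enter the pipeline. The field names, the EXPECTED_SCHEMA dictionary, and both helper functions are hypothetical and only for illustration; in practice a schema registry or a validation library would likely do this job.

```python
from typing import Any

# Hypothetical contract agreed with the upstream team: field name -> expected type.
EXPECTED_SCHEMA: dict[str, type] = {
    "order_id": str,
    "amount": float,
    "currency": str,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means the record is valid."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Report fields the contract does not know about instead of silently dropping them.
    for field in sorted(record.keys() - EXPECTED_SCHEMA.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


def split_valid_invalid(records):
    """Route bad records to a quarantine list instead of crashing the whole run."""
    valid, quarantined = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            quarantined.append((record, problems))
        else:
            valid.append(record)
    return valid, quarantined
```

Quarantining rather than failing is a design choice, not a rule: it keeps the pipeline running and gives you a concrete list of violations to bring to the team that owns the source.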

2. Vendor Lock-in and poor Developer Experience.

Vendors offer a variety of tools once exclusive to big tech, but the dark side is real. Some of them make local, reproducible environments an uphill battle (I’m looking at you, Snowflake). You’re stuck treating vendor resources as black boxes with limited insights. Tests fail, and you’re left wondering if it’s a code bug, unavailable vendor resources, or outdated SDKs. Even AWS Lambda, known for vendor lock-in, offers local runtime emulation for local testing and development.

Think about it — your powerful development machine is essentially a data request middleman. Every development action, CI job, staging, and production workload means cash out to the vendor. No wonder these companies have hefty valuations.

Poor developer experience leads to test shortcuts. Monolithic test suites attempt to cover all test cases in one go, hoping to validate the entire pipeline.
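One way to avoid the monolithic suite, sketched here under my own assumptions, is to keep transformation logic in plain functions that take and return ordinary data, so they can be tested on a laptop without touching any vendor resource. The daily_revenue function and its row layout are made up for illustration.

```python
from collections import defaultdict


def daily_revenue(rows):
    """Sum amounts per day. Pure function: no warehouse connection, no I/O."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["day"]] += row["amount"]
    return dict(totals)


def test_daily_revenue_groups_by_day():
    rows = [
        {"day": "2023-11-25", "amount": 10.0},
        {"day": "2023-11-25", "amount": 5.0},
        {"day": "2023-11-26", "amount": 2.5},
    ]
    assert daily_revenue(rows) == {"2023-11-25": 15.0, "2023-11-26": 2.5}


def test_daily_revenue_handles_empty_input():
    assert daily_revenue([]) == {}
```

Tests like these run in milliseconds and cost nothing. The code that reads from and writes to the vendor still needs integration tests, but there is far less of it left to cover.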

3. Slow Feedback Cycles.

Welcome to Big Data, where executing a job takes time, sometimes hours. Faced with that, some teams opt to run queries directly in production for a quicker visual check.

Data engineers focus on data artifacts, not just software artifacts. Generating meaningful test data is a headache, so data engineers need to invest in tooling that makes testing easier. Fortunately, newer formats and data tools come with cloning and sampling capabilities.
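When the tool at hand has no built-in sampling, a small hand-rolled version can go a long way. The sketch below is my own illustration, not any tool's feature: it samples rows by hashing a key, so the same key is always kept or dropped and tables sampled by the same key still join. The in_sample and sample_rows names and the customer_id field are hypothetical.

```python
import hashlib


def in_sample(key: str, percent: int) -> bool:
    """Deterministically decide whether a key belongs to the sample.

    Hashing the key (instead of drawing a random number) means the decision
    is stable across runs and across tables that share the key.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def sample_rows(rows, key_field: str, percent: int):
    """Keep only the rows whose key falls into the sample."""
    return [row for row in rows if in_sample(str(row[key_field]), percent)]


if __name__ == "__main__":
    customers = [{"customer_id": i} for i in range(1000)]
    orders = [{"order_id": i, "customer_id": i % 1000} for i in range(3000)]

    small_customers = sample_rows(customers, "customer_id", 10)
    small_orders = sample_rows(orders, "customer_id", 10)

    # Every sampled order points at a sampled customer, so joins still work.
    kept = {c["customer_id"] for c in small_customers}
    print(all(o["customer_id"] in kept for o in small_orders))
```

The trade-off is that a hash-based sample follows the key distribution: if a handful of keys own most of the rows, the sample size can swing a lot depending on which keys land in it.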

Unlike traditional software aiming for loosely coupled systems, a data processing pipeline is tightly bound to its input data. There are strategies to deal with the problem, but that’s a topic for another post.

Why would anyone want to become a data engineer?

Despite the challenges discussed earlier, being a data engineer is far from a journey for software masochists. Here’s why:

1. Data is critical for Business.

In 2001, Watts S. Humphrey declared, “Every business is a software business.” A decade later, Marc Andreessen said, “Software is eating the world.” Fast forward to today, and it’s safe to say,

“Every business will be a data business.”

With the surge of AI and, more recently, LLMs (Large Language Models), organizations worldwide recognize that effective AI implementation requires robust data management. Quality data management distinguishes companies capable of scaling AI from those unable to do so. The multi-billion (almost trillion) dollar market size of Data Management Software signals the significance of data. Businesses perceive AI as a competitive advantage, and data management forms the bedrock for AI.

2. The field is evolving quickly.

Data engineering throws some interesting challenges at you:

  • Ever tried smoothly querying Terabytes or Petabytes of data?
  • What about serving up analytical data at low latency? Decisions like what to cache, when, and for how long can be a puzzle.
  • You’ll also be optimizing processes that deal with massive amounts of data daily: figuring out the best way to use hardware and the optimal data representation.
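To give the caching question above some shape, here is a minimal time-to-live cache sketch. It answers "for how long" with a single fixed TTL, which is the simplest possible policy and only one of many; the TTLCache class and its method names are hypothetical, not taken from any library.

```python
import time


class TTLCache:
    """Cache query results for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests do not have to sleep
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value if it is still fresh, else recompute and store it."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A call could look like cache.get_or_compute("revenue:2023-11-26", run_expensive_query), where run_expensive_query is whatever function hits the warehouse. Choosing the TTL is the real puzzle: too short and the cache saves nothing, too long and dashboards show stale numbers.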

Witnessing a legacy pipeline transform from a 6+ hour runtime to mere minutes is immensely gratifying. The challenges posed aren’t hurdles but rather opportunities to elevate the entire field. There is a continuous flow of innovation, with practitioners sharing data modeling techniques, emerging technologies disrupting the space, and entirely new architectures emerging to meet evolving AI requirements.

3. Data: The Nexus of Business and Tech.

While non-data software developers may grapple with understanding organizational dynamics, as a data engineer, you’re in a unique position. You know where to query for vital information and might even be tasked with ensuring the accuracy of reported numbers. Even when this data is available organization-wide, your familiarity with tooling and conventions grants you an extra degree of freedom.

Nowadays, data isn’t confined to tech departments. Marketing, sales, operations, logistics, finance, and product specialists all rely on data for decision-making. As a data engineer, you’re not just a technologist; you’re a collaborator with non-tech functions, understanding their needs and enhancing their decision-making processes. If you relish this cross-functional collaboration, data engineering is an ideal space for you.

Wrapping up

I hope you now have a better understanding of data engineering, its challenges, and its opportunities. Despite the challenges, data engineering is not just for masochists; there are compelling reasons to pursue a career in the field.

For those collaborating with data engineers, consider their perspective before making significant changes like altering database schemas or introducing disruptive modifications to events. Please understand there is an intricate dance between data engineering and other realms of software.

And to my fellow data engineers, forge closer ties with application developers, familiarize yourselves with the code, and deep dive into the intricacies of the domain. After all, our shared goal is to produce better software.

Pedro Castillo

SW developer focused on data-driven applications that help people and businesses make better decisions. Passionate about Big Data, Cloud Computing and AI.