Hacking Analytics

All around data & analytics topics

The Distinction between Data Engineering and Software Engineering Roles

12 min read · Jul 24, 2025


Photo by Claudio Schwarz on Unsplash

Data Engineers are software engineers who work with data, right? Well, yes and no; the truth isn’t that simple. There are notable differences between Data Engineering and Software Engineering.

  1. Control over inputs
  2. Scope Uncertainty
  3. History
  4. Refinement
  5. Speed of Delivery
  6. System of Record vs. System of Copy
  7. Training
  8. Team Size

Control over inputs

Data Engineers generally have less control over their inputs compared to traditional Software Engineers. While Software Engineers often design and control both the structure and flow of data within their applications, Data Engineers must work with data produced by external systems, legacy applications, or third-party sources, often messy, incomplete, or inconsistent.

Software Engineers typically operate within a defined and controlled environment: they can shape the data models, adjust how data is collected, and modify application logic to improve data quality. In contrast, Data Engineers are tasked with making sense of data that was not necessarily created with downstream use in mind.

This means building pipelines, cleaning and transforming data, and ensuring reliability despite limited influence over how the data is generated.

To address this, approaches like Data Contracts and exception-based data engineering have emerged, aiming to bring more structure and control over upstream data inputs. These frameworks attempt to define and enforce clear expectations between producers and consumers of data. However, their success depends on consistent enforcement and organizational discipline — conditions that are not always present.

In many cases, pressure from leadership or business teams to “just get the data” undermines these efforts, pushing the burden of quality and interpretation back onto data engineers and analysts, who must make do with whatever data is available, regardless of its readiness or reliability.
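As an illustration, a data contract can be as simple as a typed field list checked at ingestion, with violating records routed aside rather than failing the whole load (the exception-based approach mentioned above). The sketch below is a minimal, hypothetical example in Python; the `CONTRACT` mapping, its field names, and the `split_batch` helper are all illustrative, and real contract frameworks are considerably richer:

```python
# A hypothetical, minimal data contract: expected fields and their types.
# Real frameworks (schema registries, expectation suites) check far more:
# nullability, ranges, freshness, referential integrity, etc.
CONTRACT = {
    "order_id": int,
    "amount": float,
    "currency": str,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations for one incoming record."""
    violations = []
    for field, expected_type in CONTRACT.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return violations

def split_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Exception-based handling: route bad records aside so the load
    proceeds, instead of failing the whole batch on the first violation."""
    good, bad = [], []
    for r in records:
        (good if not validate_record(r) else bad).append(r)
    return good, bad
```

The point is less the code than the division of responsibility it encodes: the producer commits to a shape, and violations become explicit, routable events instead of silent downstream surprises.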

Scope Uncertainty

The limited control Data Engineers have over their inputs introduces a higher degree of uncertainty around the scope of their work. Raw data must be explored, understood, cleaned, reshaped, and often re-investigated after each transformation — an inherently iterative process.

Challenges such as poor data quality, undocumented process changes, shifts in how fields are captured, or changes in field semantics are common and must be addressed during pipeline development. Unlike many backend engineering roles, which typically work with well-defined APIs and current-state data models, Data Engineers often need to dig into historical data and grapple with legacy system migrations, cross-system data harmonization, and inconsistencies in granularity or structure. These problems rarely present themselves upfront and are often discovered only through deep investigation.

Some Data Engineering teams try to sidestep this uncertainty by relying on static specifications written by business analysts based on stakeholder input. While this can reduce ambiguity in scope, it often comes at the cost of quality, overlooking hidden data issues and mismatches between intent and implementation. This approach is especially common among external consultancies or development agencies, which typically work within fixed contracts and therefore optimize for predictable delivery over deep data integrity.

Other Data Engineering teams reduce their scope to only delivering the data, passing the responsibility of cleaning and modeling the data down to Analytics Engineers or Analysts. However, the core of the work in data is in shaping and modeling the data received, no matter which function ultimately does it. This is where the highest degree of uncertainty lies.

History

I’ve previously highlighted how dealing with historical data is one of the key differences between software engineering and data engineering. However, this is just one aspect of how historical context shapes the unique challenges in data engineering.

The Role of History in Data Engineering

Two major historical concerns set data engineering apart:

  1. The need to process historical data
  2. The need to maintain a history of changes

Processing Historical Data

Software engineers typically optimize for low-latency operations, often striving for O(1) performance when handling individual user requests. In contrast, data engineers work with datasets that span days, months, or even years. They must design systems capable of scanning, aggregating, and transforming large volumes of data — often in batch mode — to achieve acceptable performance.

These batch processes may need to reprocess historical data to backfill missing fields, correct errors, or apply schema changes retroactively. Unlike point queries in software systems, data engineering workflows frequently involve reading and rewriting large portions of historical datasets, which presents significant scalability and consistency challenges.
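To make such reprocessing tractable, pipelines are commonly organized around partitions (for example, one per day) that can be overwritten wholesale, so a backfill can be rerun safely. The sketch below uses assumed names throughout (`store` as a stand-in for a partitioned storage layer, `transform` for the pipeline logic); it illustrates the idempotent-overwrite pattern, not any particular framework:

```python
from datetime import date, timedelta

def date_partitions(start: date, end: date):
    """Yield one partition key per day in the inclusive range [start, end]."""
    d = start
    while d <= end:
        yield d.isoformat()
        d += timedelta(days=1)

def backfill(store: dict, start: date, end: date, transform) -> None:
    """Reprocess each daily partition with overwrite semantics.

    Because each partition is rewritten from raw data in full, rerunning
    the backfill yields the same state (idempotent), which is what makes
    retroactive fixes and schema changes safe to apply at scale.
    """
    for key in date_partitions(start, end):
        raw = store.get(("raw", key), [])
        store[("clean", key)] = [transform(row) for row in raw]
```

The same shape applies whether the "store" is a dict, a set of Parquet files, or warehouse tables: partition the history, and make every write a full, deterministic overwrite of its partition.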

Maintaining Historical Changes (Historization)

In many use cases, data engineers must track changes to data over time — something that’s rarely required in typical software engineering roles. This often involves implementing slowly changing dimensions (e.g., SCD Type 2), designing temporal data models, or using time travel capabilities from modern storage layers like Delta Lake or Apache Iceberg.

These systems are optimized for scanning large volumes of historical records, not just the current state. As a result, maintaining an accurate view of data evolution requires data engineers to process and store entire versions of records, often across millions or billions of rows.
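As a concrete illustration of SCD Type 2, the sketch below versions a single tracked attribute by closing the current row and appending a new one whenever the value changes. The table layout (an `id` natural key, a `city` attribute, `valid_from`/`valid_to` columns) is hypothetical and far simpler than production dimension models:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # conventional marker for the open-ended current row

def scd2_apply(dimension: list[dict], updates: dict, as_of: date) -> None:
    """Minimal SCD Type 2 sketch: when a tracked attribute changes,
    close the current row and append a new version, preserving history.

    Assumes one natural key column 'id' and one tracked attribute 'city'.
    """
    for key, new_city in updates.items():
        current = next(
            (r for r in dimension
             if r["id"] == key and r["valid_to"] == HIGH_DATE),
            None,
        )
        if current is None:
            # brand-new key: insert its first version
            dimension.append({"id": key, "city": new_city,
                              "valid_from": as_of, "valid_to": HIGH_DATE})
        elif current["city"] != new_city:
            current["valid_to"] = as_of  # close the old version
            dimension.append({"id": key, "city": new_city,
                              "valid_from": as_of, "valid_to": HIGH_DATE})
        # unchanged attribute: no new version is written
```

Even this toy version shows why historization is expensive: every change appends a row, and answering "what did this record look like on date X" means scanning versions rather than reading a single current row.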

Refinement

Data engineers need to engage closely with business counterparts and develop a strong understanding of the broader business context. The data they work with is shaped not just by technical systems, but also by the underlying business processes. Moreover, the outputs they produce — reports, pipelines, data models — are often consumed directly by business stakeholders. Without this alignment, there’s a risk of building technically correct but contextually irrelevant solutions.

The inherent uncertainty in the data makes it difficult for product managers to fully define requirements upfront. Instead, requirements often evolve during the development process, as data investigations uncover new insights or constraints, sometimes only after partial implementation has already occurred. Introducing an additional intermediary at this stage can slow down progress. It’s often more efficient for data engineers to work directly with business stakeholders to understand their needs and determine how best to process the data.

Sometimes, subject matter expert (SME) product managers can serve as effective business counterparts for the Data Engineering team. However, more often than not, Data Engineering teams are responsible for a broad range of data domains while operating with limited headcount. This makes it difficult to find a single SME who has deep expertise across all relevant areas.

Speed of Delivery

Data engineers are often expected to deliver faster than software engineers, even though their work is inherently more uncertain, exploratory, and dependency-heavy.

This expectation is common, rooted in misconceptions as well as in higher management’s need to steer the business. Some of the drivers of these misconceptions:

  1. Perceived as “just plumbing”: Stakeholders often see data engineering as a support function — just getting data from A to B — so they underestimate the complexity involved.
  2. Downstream pressure from analytics: Analysts and data scientists are frequently blocked until data is available. This makes DE work feel more “urgent” and puts pressure on faster delivery.
  3. Lack of detailed specs: Because requirements are fuzzy, data engineers are often told to “just get something working.” This makes it seem like they should deliver quickly and refine later.
  4. No user-facing UI: Without a front-end or app to build, there’s a perception that DE work is more straightforward. In reality, it often involves dealing with messy, inconsistent, or poorly documented data sources.
  5. High business visibility: Data pipelines feed dashboards, KPIs, and financial reports. When these are missing or incorrect, the pressure to deliver increases.

The realities of delivering as a Data Engineer are, however, a bit different:

  1. Unclear scope: DEs often start building before requirements are fully understood, especially when working with new data sources.
  2. Hidden complexity: Data is messy — dirty joins, null logic, evolving schemas, mismatched keys, etc. — and cleaning it takes time.
  3. System-wide impact: A bug in a DE pipeline can affect entire reporting systems, AI models, or financial reporting — higher stakes mean more validation is needed.
  4. Latency vs. throughput: DEs optimize for throughput (e.g., batch ETL over millions of rows), not latency (e.g., sub-second response). That trade-off requires different design and testing cycles.

Data Engineers are often caught between the expectation of rapid delivery and the complex realities of their work. To meet tight deadlines, they may be forced to cut corners — skipping unit tests, compromising on code structure, or deferring documentation. As a result, data engineering code can feel rushed, fragile, and held together with temporary fixes. This trade-off leads to more iterations down the line, as early shortcuts accumulate technical debt that must eventually be addressed.

System of Record vs. System of Copy

The types of systems built and maintained by Software Engineers and Data Engineers differ.

Software Engineers typically build and maintain Systems of Record (SoRs) — operational systems that are the authoritative source for business transactions (e.g., CRMs, ERPs, internal apps).

Data Engineers, by contrast, do not manage SoRs. Instead, they work with copies of that data, extracted for downstream use in analytics, reporting, ML, or operational insights.

Data engineers build pipelines that extract and transform data from SoRs into Systems of Insight (data lakes, warehouses) and sometimes into Systems of Action (e.g., alerting, marketing automation).

These systems come with different operational requirements — often less strict than SoRs in terms of consistency and real-time response, but critical in terms of data quality, recovery, and lineage. Despite not managing the SoR, data engineers do manage state, for example:

  • Historization (slowly changing dimensions, time-travel)
  • Change tracking (CDC, diff logic, audit trails)
  • Incremental processing or deduplication logic
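As one example of such state management, change tracking can in the simplest case be done by diffing two snapshots of a table and classifying each key as inserted, updated, or deleted. The sketch below is a naive illustration; real CDC implementations typically read database transaction logs rather than comparing full snapshots:

```python
def snapshot_diff(previous: dict, current: dict):
    """Naive change-data-capture by snapshot comparison.

    Classifies every key as inserted, updated, or deleted between two
    snapshots keyed by primary key. Unchanged keys appear in no bucket.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes
```

Diff logic like this also underpins incremental processing: once changes are isolated, only the affected keys need to be pushed through downstream transformations, audit trails, or historization.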

Data engineers may also help build or support satellite operational systems, such as Master Data Management (MDM) platforms or Golden Record hubs. These systems maintain their own state and sometimes act as the System of Authority, curating and governing data consolidated across SoRs. Though these systems don’t generate transactions, they require strong guarantees on identity resolution, consistency, and version control.

Data Engineers must adapt to these different systems and their requirements. They must build and maintain processes for backup and restore, disaster recovery, SLAs for freshness and uptime, error handling, lineage, and schema evolution, and they must also ensure the integrity of data transfers by implementing reconciliation processes.
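A reconciliation process can be as lightweight as comparing a row count and an order-insensitive checksum between the source extract and the loaded target. The sketch below illustrates the idea with hypothetical helpers (`table_fingerprint`, `reconcile`); it is a cheap sanity check rather than an airtight guarantee (for instance, a pair of identical rows XOR-cancels in the hash, which the row count only partially catches):

```python
import hashlib

def table_fingerprint(rows: list[dict]) -> tuple[int, int]:
    """Order-insensitive fingerprint of a table: row count plus an XOR
    of per-row hashes. Two sides can compute this independently and
    compare the results without shipping the data itself."""
    acc = 0
    for row in rows:
        # canonicalize each row so field order doesn't affect the hash
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return len(rows), acc

def reconcile(source: list[dict], target: list[dict]) -> bool:
    """Check that the loaded target matches the source extract."""
    return table_fingerprint(source) == table_fingerprint(target)
```

Production setups typically compute such fingerprints per partition inside each system (e.g., via SQL aggregates) and alert on mismatches, so a broken transfer is caught before it reaches dashboards.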

Training

Lack of Structured Training Pathways

Compared to software engineering, training pathways for data engineers are far less developed and formalized. While some foundational topics — such as programming, databases, and SQL — are covered in computer science or data science degrees, these curricula generally fail to prepare graduates for the practical realities of data engineering work. Few academic programs cover the challenges of working with messy, incomplete, or unstructured data, let alone provide exposure to the tooling or infrastructure commonly used in real-world data platforms.

Data engineering remains a niche and specialized role, with much of the required skill set being acquired through hands-on experience within companies. The role requires familiarity with diverse and often proprietary tools — such as SSIS, Databricks, or Palantir Foundry — and the ability to work with distributed compute environments and cloud-based data infrastructure. These are not tools students typically encounter in academic settings. As a result, new hires frequently face steep learning curves and must ramp up rapidly once on the job.

Conflicted Organizational History

The historical underinvestment in data engineering, particularly in Europe and North America, has further compounded the training gap. Many large enterprises operated with a split model where data work was divided between back-office teams, often outsourced and focused on technical ETL work, and front-office teams closer to the business, tasked with producing reports or dashboards. This divide led to siloed expertise and a lack of investment in comprehensive training programs or internal talent development.

Additionally, the data science hype wave in the 2010s led many organizations to focus their data investments on modeling teams, often at the expense of data engineering foundations. Budgets were diverted to support exploratory data science projects, which were rarely productionized due to inadequate data infrastructure. As organizations entered the 2020s and began correcting this imbalance, they discovered a scarcity of experienced data engineers who could build stable platforms or mentor junior staff.

Consequences in Today’s Teams

Today, the consequences of this history are clear. Data engineers are expected to handle increasingly complex responsibilities — spanning software engineering, infrastructure, and business domain knowledge — yet there is no well-defined training pipeline to support this evolution. Many companies struggle to find senior data engineers who can both deliver and coach others, making it difficult to scale teams and institutionalize best practices. While the importance of data engineering is now more widely recognized, the training ecosystem lags behind, leaving organizations to fill the gap with in-house mentorship, self-guided learning, and trial by fire.

As a result, the skill level of data engineers often lags behind what companies might expect for such technically demanding roles, simply because many engineers have had limited opportunities to receive coaching or structured development. Although some companies have begun to correct for past underinvestment, this shift has been slow and uneven, and is often not backed by the same level of organizational support or funding that has historically gone to software or data science roles.

Team Size

The ratio of Software Engineers to Data Engineers ranges from 1:1 to 20:1, with financial institutions typically at the lower end (i.e., relatively more Data Engineers), particularly Hedge Funds and other trading firms where data is the product.

Nevertheless, in general, Data Engineers tend to end up in smaller teams than Software Engineers, sometimes even teams of one in smaller companies.

This has several implications.

Level of Independence: support is more limited when working in a specific area. In software development, it is more common to be working on the same pieces of code or project. This is still possible in Data Engineering, and some data engineering teams work on the same set of features as a core team, but it is generally less frequent. Data Engineers often do not get the same level of feedback and are less well supported by other, more senior engineers. These differences make it quite hard to be successful in the role without a high degree of independence.

Knowledge required: Data Engineers often need a wider breadth of knowledge to operate. They might need infrastructure knowledge to set up their Data Platform; they might need to know how a wider range of domains work, what their processes are, and the associated datasets; and they need to know both the codebase and the datasets.

Knowledge sharing: Knowledge sharing typically happens at a slower pace within Data Engineering teams, with fewer people sharing their knowledge in team meetings, reviewing merge requests, and providing mentoring. There is often a reduced feedback loop and longer learning curves.

Career growth: Compared to Software Engineering, the career path for Data Engineers tends to be more focused on a technical track than a management track. In Tech companies, it is not unusual for Data Engineering Managers to manage much smaller teams than Software Engineering Managers at the same level, typically retaining some IC responsibilities.

Single Point of Failure: Data Engineers can often be a single point of failure, particularly when embedded in new development teams. These Data Engineers often need to work quite closely with Software Engineering teams, building up a deep level of knowledge about the processes and the data generated by their application in order to build the initial datasets. This makes it particularly important for Data Engineers to document, much more than Software Engineers, who can more easily rely on tribal knowledge. The documentation burden itself is also worth highlighting: even setting aside the single-point-of-failure aspect, Data Engineers need to document code, data models, data, architecture, processes, …

Exposure to Stakeholders: With a smaller team size, often lacking product managers, Data Engineers tend to have a greater exposure to stakeholders, requiring greater communication skills. Data Engineers need to be more comfortable dealing with ambiguity and be used to more frequent interruptions and context-switching.

Summary

At first glance, it may seem that Data Engineers are simply Software Engineers who work with data, but this oversimplification ignores the profound differences in context, constraints, and organizational history that shape the data engineering role.

This post explores those differences in depth across several dimensions:

Control over inputs: Data Engineers must work with external, often unreliable data sources they don’t control, unlike Software Engineers, who typically design both input and output systems.

Scope uncertainty: The iterative and investigative nature of working with raw data introduces hidden complexity, particularly when dealing with legacy systems or undocumented changes.

History: Data Engineers not only process historical datasets but also manage historization, tracking how data evolves — a concern rarely shared by Software Engineers.

Refinement and ambiguity: Unlike in software, requirements in data projects often emerge during development. This demands close, iterative collaboration with stakeholders rather than predefined specs.

Speed of delivery: Data Engineers are under pressure to deliver quickly — despite high ambiguity, legacy complexity, and broad business impact — leading to trade-offs in code quality and sustainability.

System of copy vs. system of record: Data Engineers build on top of systems they don’t own. They transform SoRs into systems of insight and must maintain integrity, lineage, and resilience despite not owning the source.

Training and team size: The profession suffers from weak formal training pathways and a conflicted organizational history. Data Engineers are often under-supported, operate in smaller teams, and are expected to master infrastructure, software engineering, and business context, frequently without mentorship. As a result, their skills sometimes lag behind expectations, and capability-building corrections happen slowly.

Together, these factors make data engineering a uniquely challenging discipline that spans software craftsmanship, data modeling, and organizational navigation, often under-recognized in its complexity.



Written by Julien Kervizic

Living at the interstice of business, data and technology | Head of Data at iptiQ by SwissRe | previously at Facebook, Amazon | linkedin.com/in/julienkervizic/
