I deconstructed 13 data industry buzzwords. Here’s what I learned.

Olivia Iannone · Published in CodeX · Jun 1, 2022 · 11 min read

I’m a relative newbie to the data tech industry.

Of course, I’ve been aware of data in my past tech roles, and I’ve certainly used it. It’s pretty much impossible to be in tech without bumping up against data analysis and storage on a regular basis.

But data infrastructure and architecture design? The many ways in which this particular scaffolding shapes how companies of all industries operate? The seemingly endless discourse about organizing teams of data professionals?

Much of this was new territory when I started working at Estuary almost a year ago.

I work as a technical and content writer. Part of that is being a forever student. You have to read a lot, listen a lot, and ask a lot of questions. Even questions that feel silly to ask. (Especially questions that feel silly to ask.)

It’s important to interrogate the basics. It’s easy to say, yeah, sure, I’ve seen this concept before. But what does it really mean? How does it connect? What implications does it carry?

Industry jargon provides ample opportunity to ask questions like these. In data, we have technical jargon. We also have marketing jargon. We have a lot of overlap between those two categories. We have many terms that don’t tell us anything concrete, but allow us to understand the intangible business and human elements that breathe life into the technology.

I want to walk through 13 data buzzwords I’ve researched over the past year that helped me grow my knowledge in the data field. I’ll discuss my takeaways from each and link to my original articles. Hopefully, in this format, the patterns it took me months to see will be obvious to you.

A quick caveat… and a quick plug: This article focuses on the holistic, business-oriented aspects of data technology. At Estuary, we’re building a product aimed at fixing some of these high-level problems using a novel technical foundation. If you’re interested in the nitty-gritty, I recommend checking out our source code on GitHub or our docs, or setting up a free meeting with a team member.

Part 1: This vs That


“What’s the difference between this one thing and this other thing?”

These were the types of questions I got hung up on when I first began to learn about data technologies.

Many pairs of concepts get thrown around to show contrast. Often, this comes with a message that one is superior to the other, or that they can’t coexist. It implies that there’s a strong dichotomy, but when you dig into it, it turns out the reality is more nuanced than you thought.

1 & 2: ETL vs ELT

“E” is for extract, “T” is for transform, and “L” is for load. These are the three fundamental steps of a data pipeline: an element of data infrastructure that connects the various systems that create, store, and receive data.

In my first weeks of learning, I encountered messaging that ELT was generally superior to ETL, because of flexibility and other ✨reasons✨.

This is true, to an extent. But a lot of that judgment is historical: in a mostly-outdated paradigm, ETL data pipelines were typically built in-house by a team of engineers. The team was prone to being overworked, and the pipelines were prone to breakage.

ELT has been marketed as a newer, smarter alternative. We usually hear about ELT in the form of a self-hosted platform or a managed service.

But it’s simply not true that you can re-arrange a data pipeline as easily as you can flip an acronym and get unquestionably better results. And the line between ETL and ELT isn’t as clear as it seems.
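To make the contrast concrete, here’s a minimal sketch in Python. Every name in it (the source records, the transform, the destinations) is invented for illustration; the point is simply that ETL and ELT share the same three steps, just in a different order and place.

```python
# Toy illustration of ETL vs. ELT. All functions are hypothetical
# stand-ins for whatever your source, compute, and warehouse provide.

def extract():
    # Pull raw records from a source system.
    return [{"user_id": 1, "email": "  ALICE@EXAMPLE.COM "}]

def transform(records):
    # Clean up the raw records: trim whitespace, normalize case.
    return [{**r, "email": r["email"].strip().lower()} for r in records]

def load(records, destination):
    # Write records to a destination (stubbed out as a print).
    print(f"Loading {len(records)} record(s) into {destination}")

# ETL: transform *before* loading, traditionally on your own compute.
load(transform(extract()), destination="warehouse")

# ELT: load the raw data first, then transform it *inside* the
# warehouse, usually with SQL run by the warehouse itself:
load(extract(), destination="warehouse.raw_zone")
#   CREATE TABLE users_clean AS
#   SELECT user_id, lower(trim(email)) AS email FROM raw_zone.users;
```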

Read the full article here.

3 & 4: Batch processing vs real-time processing

The batch-vs-real-time dichotomy proved particularly relevant once I started working for a company focused on making real-time data pipelines more accessible. And maybe I’m a bit biased, but I think the conversation around this topic will only grow.

In a nutshell, batch data processing treats data in terms of records. Your system periodically checks in on the source data and asks what has changed — for example, polling the records of a table. It compiles a changeset and applies it elsewhere. By definition, a time delay is introduced between batches.

Real-time data processing is event-based. Your system continuously reacts to changes coming from the source and applies them with no delay. The data events are constantly “in motion.”
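Here’s a rough Python sketch of the two models. The functions passed in (poll_source, apply, event_stream) are hypothetical placeholders for whatever your systems actually expose.

```python
import time

# Batch: periodically poll the source for a changeset, then apply it.
def run_batch(poll_source, apply, interval_seconds=3600):
    while True:
        changeset = poll_source()       # e.g., rows updated since the last run
        apply(changeset)                # write the whole batch downstream
        time.sleep(interval_seconds)    # the delay between batches is built in

# Real-time: react to each event as the source emits it.
def run_realtime(event_stream, apply):
    for event in event_stream:          # blocks until the next event arrives
        apply([event])                  # handled as it happens; no polling gap
```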

Today, real-time processing is (accurately) seen as harder to implement, so it’s not yet the standard. Most organizations avoid building real-time infrastructure until they find a use case where they absolutely need it. They’ll apply real-time to that one use case, and keep a batch pipeline for everything else.

That’s where the real problem starts: it’s not that batch is always inferior. In some cases, timing really doesn’t matter that much. The problem is the splintering of infrastructure, and our focus should be on creating the most agnostic, flexible data foundation to prevent errors and breakage for both real-time and batch.

More on the basics of batch vs real-time here.

5 & 6: Database vs data warehouse vs …

There are a lot of different models of data storage systems out there these days. You have your traditional operational databases (Postgres, Oracle, etc.). You have your data warehouses (Snowflake, Redshift, etc.). You have data lakes, which I’m not even going to start discussing here. And you have all manner of storage architectures in between.

One thing is for sure: it’s highly unlikely that one single kind of data storage system will meet all of any modern organization’s needs.

This means you’ll have different sources of truth for different workflows. You can’t get rid of your storage system designed for transactions and run an entire company off a storage system designed for analytics, or vice versa.

See the original article for details.

Once again, it’s not a matter of which is better. It’s a matter of recognizing the plurality of data use cases, and finding a way to keep all your systems in sync.

We can’t always eliminate components, so how do we model and manage the complexity?
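As a rough illustration of what “keeping systems in sync” means in practice, here’s a naive Python sketch that copies recent changes from an operational database into a warehouse. The connection objects, methods, and table names are all hypothetical, and a real tool would also have to handle schema drift, failures, deletions, and ordering.

```python
# Naive one-way sync: operational database -> analytics warehouse.
# `source_db` and `warehouse` are hypothetical client objects.

def sync_once(source_db, warehouse, last_synced_at):
    # Grab every row that changed since the previous sync.
    changes = source_db.query(
        "SELECT * FROM orders WHERE updated_at > %s", (last_synced_at,)
    )
    for row in changes:
        # Upserts keep the copy consistent if a row syncs twice.
        warehouse.upsert("analytics.orders", row)
    # Track a watermark so the next run only picks up new changes.
    return max((row["updated_at"] for row in changes), default=last_synced_at)
```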

Part 2: “You’re talking a lot, but you’re not saying anything.”


Eventually, I got past the “apples vs oranges” stage of my journey. I started to reframe everything in terms of what seemed to be the overarching tension: complexity vs unification.

As I dove further into the data space, I started scanning for solutions. As I did, I started to notice that there are many terms that appear to be referencing something technical and concrete, but start to break down under pressure.

There’s a whole spectrum of buzzwords like these, at varying levels of substance and usefulness. I’ll demonstrate by walking through some examples.

7: Reverse ETL

Aha, you thought we were done with E’s, L’s, and T’s, did you? Not so fast.

Reverse ETL is my favorite example of a buzzword with little underlying substance.

It hinges on the idea that you used a pipeline to get data into storage. Now, you need a pipeline to get it out of storage and operationalize it.

Translation? More splintering of pipeline infrastructure, which is what we want to avoid. If your data stack is thoughtfully architected, you shouldn’t need to differentiate between “forward” and “reverse” pipelines.
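One way to see how thin the distinction is: structurally, a “reverse” pipeline is the same extract-and-load pattern with the endpoints swapped. A toy sketch, with every object hypothetical:

```python
# Both directions are the same pattern; only the endpoints differ.
def pipeline(source, destination):
    destination.write(source.read())

# "Forward" ETL/ELT: application database -> warehouse.
#   pipeline(source=app_db, destination=warehouse)

# "Reverse" ETL: warehouse -> operational SaaS tool (e.g., a CRM).
#   pipeline(source=warehouse, destination=crm)
```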

This is the closest I’ll get to a “hot take” in this article. Instead of laying out the details here, I’ll point you toward my original post.

8: Modern data stack

A data stack is a type of tech stack designed to facilitate the storage, access, and management of data. It’s a catch-all term to encompass all the systems a given organization uses for its data workflows. But what makes a data stack modern? And why does it seem like everyone is always talking about modern data stacks?

The most tangible answer to the former question I can offer is: cloud-based components; namely, cloud warehouses and SaaS tools.

As far as why everyone is talking about modern data stacks? It’s harder to say. To an extent, the concept of a “modern data stack” is just a handy catch-all term for marketing. But at the same time, it’s a helpful way to categorize an extremely common and relevant enterprise data use case on a high level.

That’s why I believe that the “modern data stack,” while a bit vague, is still a helpful term.

I break it down further here.

Part 3: Human systems are messy, and powerful


It’s easy to grow frustrated with ambiguity, especially when ambiguous words are used to market technology. And if we’ve learned anything so far, it’s that many of the labels we use for data systems, platforms, and strategies are less concrete than they appear.

I wish I could say that this article is about to get more concrete, but it’s not. Because you have to acknowledge the ambiguity. Data technology buzzwords are ambiguous because the reality they describe is super complicated.

Data infrastructure, management, workflows, and goals ultimately speak to the infrastructure, management, workflows, and goals of the organization or business. Businesses are made up of people, and managing people is dynamic and messy. What’s more, both people and businesses are influenced by tons of unpredictable, random variables. These are highly uncontrolled environments.

So, how could we possibly expect managing data infrastructure to be a cut-and-dried, sterile process?

But if you’re willing to work in the grey area and combine the technological and human elements? That’s how you can unlock the next level of power.

9 & 10: Data engineering, analytics engineering, and other job titles

It starts with the way we categorize humans. You don’t have to be a recruiter or a job seeker to understand that the labels we slap on ourselves are just that: labels.

But the predominance of certain job titles over others can tell us about the general trend of the industry.

Take data engineering, for example. Ten years ago, “data engineering” more or less meant wrangling bespoke ETL pipelines. As our approach to pipelining has shifted, so has the role.

At the same time, we’ve seen “analytics engineering” become the newest hot role in data. Meanwhile, we talk less about “data science” and “data analytics,” but these disciplines also remain highly relevant.

All these shifting and overlapping job functions speak to how dynamic the data field is — and how important it is to be intentional about organizing data teams.

More takeaways on the topic are here.

11: Data mesh

“Data mesh” is similar to “modern data stack” in that it implies a concrete solution to all of your organization’s data woes. Like, you should be able to open up a set of instructions and put together a data mesh like you would a bookshelf from Ikea.

But there is no such instruction manual, and you’ll need a bit more than an Allen wrench and patience to put a data mesh together.

Data mesh doesn’t describe a singular process, but rather, a set of principles. Applying these principles to your particular use case will take lots of research and strategy because your business needs, data, and teams are all unique.

But when you invest the time and thought, all of the promised benefits of data mesh become very much real. Data can be freed from an ivory tower and distributed across the company. Teams can take ownership of their own data products. Trust in data and meaningful collaboration can grow.

If anyone tries to hand you a step-by-step manual to achieve this, you should probably run away from that person.

In that spirit, here’s an article in which I do not attempt to tell you how to build a data mesh.

12: DataOps

DataOps is yet another important conceptual framework with no one-size-fits-all implementation.

In brief, DataOps is about applying the principles of agile software development — iteration, cross-functional team collaboration, and communication in pursuit of shared goals — to a company’s data workflows.

Whereas data mesh describes an architectural pattern that has a major impact on how human teams work together, DataOps is more about the simultaneous management of teams and technology, regardless of configuration. In other words, DataOps can be a big part of implementing data mesh, but it can serve any number of other goals just as well.

DataOps acknowledges that data stakeholders are diverse, and for the industry to progress, more people need to be meaningfully involved. This needs to be supported by both technology and leadership.

More on DataOps here.

13: Data democratization

Data mesh and DataOps have a lot of overlap. You can look at this in terms of the practices and systems they require. But perhaps more importantly, they share a common outcome: data democratization.

This might just sound like the vaguest term of all, but it’s one of my personal favorites. It sums up so much of what we’re working toward, and what data technology ultimately serves.

The explosion of data platforms and technologies, as well as the presence of data in basically every facet of modern life, tells us that data is not a niche thing. Treating it as such is a failure that drags us back into old and counterproductive paradigms.

Data is a tool for understanding business and life: for acting more intentionally and efficiently, and communicating more effectively.

As the uses of data explode, data engineers shouldn’t be overburdened and constantly putting out fires. Business users should be able to access the data they need without technical skills. In general, everyone should know where data comes from and be able to trust it — even if they don’t know what ELT is or understand the potential outcomes of a data mesh.

Reaching this goal is a technological problem, and it is also a very human problem. It’s kind of frustrating, but it seems pretty worthwhile, don’t you think?

More on data democracy here.

So, what have we learned?

Words matter. Even the buzzwords that seem frivolous and surface-level can tell you a lot about an industry or discipline.

When that industry or discipline is highly technical, buzzwords create an alternative entry point that can help you think of things differently, or welcome more voices to the conversation.

Here are my takeaways from reading and writing about all these words.

  • Paradigms aren’t so much shifting as they are splintering. Data stacks are complex; alternative architectures are available; many competing platforms exist. The best way forward is to architect your data stack with complexity in mind. Focus on connection and flexibility rather than ripping things out and replacing them.
  • The same terms we use to start powerful conversations can also be used for the sake of fluff. Some terms are fluffier than others, but context is most important (this is by no means a data-industry-specific phenomenon).
  • Data is ultimately a tool that helps us model our complex world. The better the data tools at our disposal, the more variables are introduced and the more chaos needs to be dealt with. We’re rewarded with more powerful results, but the process becomes harder and often messier.
  • Because the power and pitfalls of data are both growing, more people need to be meaningfully involved. Focusing both on humans and technological systems is the key to a future driven by trustworthy, shared data.

Find Estuary on GitHub | Docs | Slack

Olivia Iannone · Writer for CodeX
Writing about real-time Data Ops & more @ Estuary Technologies. Content for the full breadth of data stakeholders.