Data Engineering — Glossary of Terms

DP6 Team
Published in DP6 US
Mar 8, 2022

Introduction

The discipline of data engineering is one that I find very interesting. It’s not a new area, but the popularization of technologies like Redshift and BigQuery in the early part of the last decade caused an explosion of new technologies and ideas that addressed problems and needs, new and old. Perhaps there are people in the area who would consider other moments in its history to be more remarkable, but it was in this era that I started my journey, and a large part of my experience comes from that time.

On the other hand, one bad aspect of this area is the sheer volume of concepts, technologies and variations of approaches that a new professional is faced with. The purpose of this article is not to be a study guide for such a professional. On the contrary, I want to make it clear that it is not necessary for someone to be familiar with all the terms on this list. It’s not an exhaustive list either. It’s based on my experience (and that of some of my colleagues at DP6) over several years working in the area.

I would like this list to be used as a starting point. If you feel confused about any term, I hope you’ll find it in the list below. While compiling this list, I researched a variety of opinions and definitions, and I don’t think it is possible to consolidate the entire discussion into the brief descriptions to which I have limited myself. I recommend a similar exercise for those who truly seek to understand any of the concepts below: don’t limit yourself to my vision, look for complementary and opposing points of view.

In relation to this, I would welcome any criticism or suggestions that would correct items on the list.

Categories

Technical or market terms

This section is important because it provides an overview of the ecosystem and vocabulary of the area. Unfortunately, it also contains terms for which there is no universally accepted definition. Here, I present more general meanings and try to point out those that are a little more controversial.

Big Data

An umbrella term to designate the study, tools, solutions, etc. around datasets that are too large or complex for traditional data analysis and processing approaches. Some people define Big Data as any context where a single machine is not sufficient for the required data processing, analysis or storage. There is some valid criticism of this definition, but it's a simplified one that is worth keeping in mind.

Data Lake

A centralized repository of structured and unstructured data. It is possible to store the data in its native state, without the need for transformations, so that it can be consumed by other users later. It is not a term that is limited to storage, as it also covers functionalities required for a platform, including data analysis, machine learning, cataloging and data movement.

Data Warehouse

A repository optimized for executing analytic queries (see OLAP). Unlike a Data Lake, the premise is that data must be structured before it can be stored in a Data Warehouse.

What is structured data?

To use an analogy, unstructured data is any type of file you can upload to a Drive, for example; structured data, by contrast, must follow a format the platform expects. Exactly which formats count as structured varies from platform to platform, but generally we are talking about tables. Just as we cannot open just any file in Excel, restricting ourselves to formats such as .xls or .csv, we need a specific format to load data into a Data Warehouse.
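
To make the contrast concrete, here is a minimal Python sketch (the file names and field names are hypothetical): the CSV parses into rows with named fields, while the image is just an opaque sequence of bytes.

    import csv

    # Structured: a CSV parses into rows whose fields have names and positions.
    with open("sales.csv", newline="") as f:  # hypothetical file
        for row in csv.DictReader(f):
            print(row["product"], row["amount"])  # fields are addressable

    # Unstructured: an image is an opaque sequence of bytes; any meaning
    # (pixels, objects in the photo) must be extracted by specialized code.
    with open("photo.jpg", "rb") as f:  # hypothetical file
        blob = f.read()
    print(len(blob), "bytes, no inherent schema")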

Data Lakehouse

This is a term used to describe approaches that attempt to combine the data structure and management features of a Data Warehouse with the low cost of storage of a Data Lake, including a structured transactional layer.

This term is broader and less clear than the previous ones. I think it is better to understand the need it addresses rather than get attached to something concrete. Databricks, with its Delta Lake technology, is one of the main entities responsible for popularizing the topic, although AWS also makes good use of it in its solutions.

You will find a good article on the subject here.

Data Mart

This term is used in the context of Data Warehouses, indicating a specific organizational structure or pattern. It is a layer with specific subdivisions for each business unit or team (e.g. finance, marketing, product), through which users can consume data in a format that meets their specific needs.

Data Mesh

Most people understand this as a decentralized architecture, where different teams are responsible for their data as if it were a product. There are other definitions that place more requirements on the architecture (including the original, and some implementation proposals by AWS). It’s a relatively new term and it will be interesting to see how its usage evolves, but it’s definitely a big trend with important core ideas.

Relational database

Formally, these are databases that model the data using a relational model. In practice, the database management systems (applications such as MySQL, PostgreSQL and SQL Server) that we call "relational" support data access through the SQL language and have mechanisms to guarantee data integrity and support ACID transactions.

NoSQL database

This refers to systems that provide mechanisms for storing and retrieving data that is modeled differently from that used in relational databases. In practice, the term refers to several different types of databases that I will not list in this article.

Bear in mind that this is a comprehensive term, which relaxes some of the conditions associated with the characteristics of a relational database. Despite its name, it encompasses solutions that support SQL-like languages or transactions, while differing in other aspects (e.g. Google Datastore is a NoSQL database that supports ACID transactions, SQL-like queries and indexes, but stores data in an object database format).

Batch and streaming data

These words are used in different contexts. Some possibilities are:

  • When referring to a dataset. A data batch is a bounded set (e.g. limited in time) and a data stream is an unbounded set (in the sense of "without borders", or open). For example, a table containing the closing of a day's sales is a data batch, while the set of HTTP requests made to a server over a network is a data stream.
  • When referring to a processing model. For example, with batch processing we can schedule a daily export of a table's historical data. With a streaming processing model, we can compute moving windows of health metrics across multiple systems to manage some autoscaling (a sketch of the contrast follows this list).
  • When referring to tools for data processing. You can find a good article about this subject, and the confusion in relation to the various uses of the term, here.
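
As a rough illustration of the two processing models in plain Python (no real framework involved): a batch job sees the whole bounded dataset at once, while a streaming job consumes an unbounded source incrementally, typically over windows.

    import itertools
    import random

    def batch_total(sales):
        # Batch: the dataset is bounded, so we can aggregate it all at once.
        return sum(sales)

    def stream_windows(source, window_size=5):
        # Streaming: the source is unbounded, so we emit partial results
        # over fixed-size windows instead of waiting for an "end".
        while True:
            window = list(itertools.islice(source, window_size))
            if not window:
                break
            yield sum(window) / len(window)  # e.g. a moving health metric

    print(batch_total([10, 25, 7]))  # closing of a day's sales

    request_latencies = iter(lambda: random.random(), None)  # never-ending source
    for avg in itertools.islice(stream_windows(request_latencies), 3):
        print("window average:", avg)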

Cloud platforms

The use of cloud platforms to address a company’s data-related needs is a trend that needs no explanation. On-premise solutions (see below) are still relevant, and will likely remain so for a long time, mainly due to security, policy and process issues. Before talking about the providers, I think it’s a good idea to introduce some of the concepts and terms, to put some of the solutions in context:

Google documentation comparing the services of Amazon and Azure

This view may be partial, since it is Google's own documentation, but it was the most complete survey I found.

Important note: DP6 takes an agnostic position on technology. We believe in a plurality of solutions, considering all options equally. The list of cloud products below may not cover every vendor equally, but that doesn't mean a vendor lacks solutions for a given need, or that we believe the options we've listed are better. In particular, it is my opinion that offerings from different providers tend to balance out as a simple matter of competition. Even today, they are already very close.

There are, of course, reasons to prefer one platform over another, but those considerations are beyond the scope of this article. Multi-cloud/hybrid cloud (using interconnected services from more than one provider) is a major trend that we think is worth considering.

On-premise

This term refers to software that is installed and run on computers or servers owned by the company using it. This is in contrast to cloud software, where the customer leases computing resources from another organization, usually through virtualization (see below).

Processing and storage

Processing refers to the capacity to perform some activity (e.g. ingesting, transforming or sending data), usually tied to CPUs. Storage is the space needed to keep the data that will be processed, usually tied to disk space.

In a cloud context, we are usually talking about the benefit of separating these two functionalities, reducing waste and, consequently, costs.

Virtual machine

An emulation of a computer, with virtual resources backed by the computational resources of a more powerful physical machine. It is the primary way for a user to consume resources from a cloud platform. While virtual machines are a standalone product (Compute Engine on GCP, EC2 on AWS and Virtual Machines on Azure), they also serve as the foundation for the majority of other services.

Serverless

A cloud computing model in which resources are allocated on demand, consuming computing resources only while the application runs (e.g. during the ingestion, processing or sending of data). The name is a misnomer, as a server is still implicitly involved, but the user is not responsible for managing its resources or lifecycle (informally, "turning it on and off").

Google Cloud Platform

The products of Google Cloud Platform (often abbreviated as GCP) are divided into categories, of which we are mainly interested in data analysis, databases and storage. The most relevant certification is the Professional Data Engineer.

  • Complete list of products
  • Certification program
  • Free-use plan
  • Cloud solution guide (by sector and technology)
  • Blog

Google BigQuery

GCP’s serverless data warehouse platform, with columnar data storage, provides an SQL interface for data stored natively in the solution. The pricing model is initially based on data storage and query, but a fixed-price model, which is based on reserved processing capacity, is also available.

Google Cloud Storage

This is a solution for storing data of any size and any format. It is mainly used for dealing with unstructured data, but it is a flexible solution and not just restricted to this function.

Dataproc

A managed service that performs data processing at scale using open-source tools such as Apache Spark (among many others). More recently, a serverless option was also launched.

Dataflow

A serverless managed service for data processing at scale, using the Apache Beam framework, which was developed as a unified model for describing batch and streaming data pipelines.

Pub/Sub

This is a message queue service that implements a publisher and subscriber architecture (you will find a Microsoft document on this type of architecture here). This makes it well suited to secure communication between services (in an event-driven architecture, for example) or to managing streaming data flows (it is often used in conjunction with Dataflow in Google architectures). It supports pull and push modes (here is a Google doc explaining the difference). For communication between services, it is also possible to use Cloud Tasks (comparison of the products).
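
A minimal publishing sketch with the google-cloud-pubsub library (the project and topic names are hypothetical):

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic; the topic must already exist.
    topic_path = publisher.topic_path("my-project", "page-events")

    # Messages are raw bytes; keyword arguments become string attributes.
    future = publisher.publish(topic_path, b'{"event": "page_view"}', origin="web")
    print("published message id:", future.result())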

Cloud Composer

A managed service for orchestrating workflows using Apache Airflow (Here are some other platforms and software).

Cloud SQL

A GCP managed relational database service for MySQL, PostgreSQL and SQL Server.

Cloud Data Fusion

A visual interface tool for creating data ingestion, processing and export flows. The underlying open source technology that it uses is CDAP.

Cloud Bigtable

A system for storing large amounts of data in a key-value structure, with high read and write throughput at low latency.

Dataprep

A visual interface tool for exploring, cleaning and preparing data for analysis and machine learning. Our experience was that it works like a "super Excel", so it has a low learning curve for Excel users. It is backed by a proprietary solution, Trifacta.

Cloud Functions

A serverless computing solution that is ideal for performing simpler tasks, like extracting, transforming and/or loading data.
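
For illustration, here is a minimal HTTP-triggered function in Python, using the open-source functions-framework library that Cloud Functions builds on (the transformation itself is a hypothetical placeholder):

    import functions_framework

    @functions_framework.http
    def ingest(request):
        # 'request' is a Flask request object; a POST body arrives as JSON here.
        payload = request.get_json(silent=True) or {}
        # Hypothetical "transform" step: normalize a field before loading.
        payload["source"] = payload.get("source", "unknown").lower()
        # A real function would now load 'payload' into BigQuery, GCS, etc.
        return {"status": "ok", "received": payload}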

Other interesting tools (+links)

(Data analysis) Dataplex, Analytics Hub, Looker, (Serverless computing) Cloud Run, Workflows, (Databases) Firestore, Cloud Spanner, Datastream, (Tools for developers) Artifact Registry, Cloud Build, Cloud Source Repositories, Cloud Scheduler

Cheatsheet for GCP products

Amazon Web Services

Amazon’s cloud platform is generally referred to as AWS. The products are divided into categories, of which we are primarily interested in analytics, databases and storage.

  • Search tool for products
  • Certifications
  • AWS products focused on analytics

AWS Lambda

A serverless computing solution, which is ideal for performing simpler tasks, such as extracting, transforming and/or loading data.
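
The equivalent minimal sketch for Lambda; the handler signature is the standard one, while the event shape is a hypothetical test payload:

    import json

    def lambda_handler(event, context):
        # 'event' depends on the trigger; here we assume a hypothetical
        # JSON payload from an API Gateway request or test invocation.
        records = event.get("records", [])
        cleaned = [r for r in records if r.get("amount", 0) > 0]  # tiny "transform"
        # A real function would now load 'cleaned' into S3, Redshift, etc.
        return {"statusCode": 200, "body": json.dumps({"kept": len(cleaned)})}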

Amazon Athena

A serverless solution for querying data in S3 using SQL. The pricing model is based only on the data scanned by each query.

Amazon Redshift

An AWS Data Warehouse solution with columnar data storage and SQL support (based on PostgreSQL). The pricing model is based on the reservation of computational resources, but a serverless option was announced more recently.

Amazon SageMaker

A managed platform for building, training, and deploying machine-learning models on AWS infrastructure. It deals with many challenges that are often encountered when working with large volumes of data, leaving data scientists to focus on model development and analysis.

AWS Glue

A serverless data integration service, built around the ETL (Extract, Transform and Load) process. Glue allows you to build data integration processes with technologies such as PySpark, using the cloud infrastructure for scaling and parallel processing.
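
Glue jobs are usually written against Glue-specific libraries, but the underlying pattern is plain PySpark; here is a generic sketch of the ETL shape (the S3 paths and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-etl").getOrCreate()

    # Extract: read raw CSV files (hypothetical S3 path).
    raw = spark.read.option("header", True).csv("s3://my-bucket/raw/sales/")

    # Transform: cast, filter and aggregate in parallel across the cluster.
    daily = (
        raw.withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("amount") > 0)
           .groupBy("sale_date", "product")
           .agg(F.sum("amount").alias("total"))
    )

    # Load: write the structured result as Parquet (hypothetical path).
    daily.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_sales/")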

Amazon S3 (Simple Storage Service)

Analogous to Google's Cloud Storage, this is an agnostic data storage service that accepts any data, regardless of format. It is a reference tool when talking about Data Lakes, as a way of storing large volumes of unstructured data.

Amazon Kinesis, SQS and SNS

These are messaging services in the publisher and subscriber format. They support pull and push use cases, with specialized services for different scenarios. Simple Queue Service (SQS) creates message queues to be processed by other services, and Simple Notification Service (SNS) replicates messages to a large number of subscribers. Kinesis is a more complete service for handling streaming data (take a look at this question on StackOverflow, with different opinions comparing the products).
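
A minimal sketch of the queue pattern with boto3 and SQS (the queue URL is hypothetical, and credentials are assumed to be configured):

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    # Hypothetical queue URL.
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"

    # Producer side: publish a message to the queue.
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"event": "sale"}')

    # Consumer side: pull messages, process, then delete (acknowledge) them.
    response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    for msg in response.get("Messages", []):
        print("processing:", msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])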

Step Functions

A managed process orchestration service within AWS. It allows you to automate IT routines, machine learning workflows or data processing, for example.

Other interesting tools (+links)

(Data analysis) Amazon EMR, Amazon Kinesis, Amazon QuickSight, AWS Lake Formation, AWS Data Pipeline, (Databases) Amazon Aurora, Amazon RDS, Amazon DynamoDB, (Tools for developers) AWS CodeCommit, AWS CodePipeline

Microsoft Azure

We will follow up this article in the future to include this section. Keep an eye out for future blog posts!

Acronyms

ACID Atomicity Consistency Isolation Durability

Refers to four important properties for transactions in a database.
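
Atomicity is the easiest of the four to see in code. A sketch with Python's built-in sqlite3 (the table is hypothetical): either both statements commit or, on error, neither does.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
    con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    con.commit()

    try:
        # The 'with' block is a transaction: commit on success, rollback on error.
        with con:
            con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
            raise RuntimeError("crash between the two writes")
            con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    except RuntimeError:
        pass

    # Atomicity: the first UPDATE was rolled back, so no money was lost.
    print(dict(con.execute("SELECT name, balance FROM accounts")))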

API Application Programming Interface

This is a general term to indicate a communication interface with a system or software. It is generic, to the point that it's common to refer to a library's set of methods as its API, for example. In the context of web tools, however, it is most commonly associated with the REST standard, which is defined by a series of conventions, of which the use of the HTTP protocol is the most notable. Another common pattern is RPC.

In the context of data engineering, the use of APIs is common, both for transferring (collecting and sending) data between servers, and for automating communication between services.
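
A typical data-engineering use of a REST API, sketched with the requests library (the endpoint, parameters and response shape are hypothetical):

    import requests

    # Hypothetical reporting endpoint; real APIs also require authentication,
    # commonly an API key or OAuth token sent in a header.
    url = "https://api.example.com/v1/reports"
    headers = {"Authorization": "Bearer <token>"}

    resp = requests.get(url, headers=headers, params={"date": "2022-03-08"}, timeout=30)
    resp.raise_for_status()       # fail loudly on HTTP errors
    rows = resp.json()["data"]    # hypothetical response shape
    print(f"collected {len(rows)} rows")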

SDK Software Development Kit

This is a collection of tools that facilitates software development, usually available in a single package. In some contexts, the SDK also includes compilers, testing tools, debuggers etc. In another common usage, the manufacturer of some platform offers a collection of SDKs for different programming languages that package the communication with its APIs. The distinction between these two senses is worth keeping in mind.

DAG Directed Acyclic Graph

A specific class of directed graph (in the graph theory sense), often used in the context of orchestration (of workloads or data) because it models a dependency structure that is free of cycles, in tools such as Airflow or dbt. The abstraction itself is very generic and is used in a multitude of different contexts, sometimes unrelated to data.

ETL Extract Transform Load

This is the process of extracting data from one or, more commonly, multiple sources and transferring it to a system in a structured way, ready for analysis (see Data Warehouse above). It is sometimes used as an umbrella term for any data transformation in an architecture, or to identify a tool for that purpose (an "ETL tool"), but in these uses it loses some of its meaning through over-generalization.
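
Stripped of any specific tool, the acronym describes three functions composed in sequence; a toy sketch in plain Python (the source and destination are hypothetical stand-ins):

    def extract():
        # The source could be an API, a database dump, log files...
        return [{"product": "A", "amount": "10.5"}, {"product": "B", "amount": "-1"}]

    def transform(rows):
        # Cast types, drop invalid records, conform to the target schema.
        cleaned = [{"product": r["product"], "amount": float(r["amount"])} for r in rows]
        return [r for r in cleaned if r["amount"] > 0]

    def load(rows):
        # The destination would be a Data Warehouse table; here we just print.
        for r in rows:
            print("loading:", r)

    load(transform(extract()))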

ELT Extract Load Transform

This acronym is related to ETL, and identifies the practice, common in more modern architectures, of carrying out the heavier transformations within the analytics system itself, made possible by advances in the processing capacity of solutions such as BigQuery, Redshift and Snowflake. The concept of ETL still exists within ELT, but it is usually the responsibility of a specialized tool that extracts the data from the source in a standardized and raw (but still structured) way. It is therefore fair to say that the acronym refers to a more complete process of ELTL.

Reverse ETL

It is not an acronym per se, but a use of the term ETL to denote a solution whose objective is to send structured data from a Data Warehouse to other platforms in an automated way for day-to-day use (e.g. customer audiences), a practice called Operational Analytics. The "reverse" refers to the data-ingestion ETL of the ELT architecture, of which this is the analogous process in the opposite direction.

OLTP Online Transaction Processing

This refers to a system that manages and facilitates transactions in an application, such as e-commerce (sales, inventory) or an ATM (withdrawals, deposits). Relational databases are commonly used, but NoSQL technologies that support transactions are also possible (the term NewSQL is used in these cases).

OLAP Online Analytical Processing

This is complicated.

Originally a term for a system capable of executing multidimensional queries (via a structure called an "OLAP cube"), nowadays it is more commonly used in contrast to OLTP, describing the use of a database for analytical purposes. Today, Data Warehouses with columnar storage, like BigQuery, Redshift and Snowflake, are used for this purpose, and they are said to implement the functionalities of an OLAP system.

Other platforms and software

Apache Airflow

A platform for developing workflows programmatically in Python, using DAGs as the abstraction. It is an open-source solution (GitHub), but there are managed offerings from cloud providers (such as Cloud Composer).
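
A minimal sketch of a DAG definition, following Airflow 2.x conventions (the DAG id and task bodies are hypothetical placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source")  # hypothetical task body

    def load():
        print("load data into the warehouse")  # hypothetical task body

    with DAG(
        dag_id="daily_sales",
        start_date=datetime(2022, 3, 8),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_load = PythonOperator(task_id="load", python_callable=load)
        t_extract >> t_load  # the edge of the DAG: load depends on extract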

Apache Beam

A unified programming model for writing batch and streaming data processing pipelines. Code written with one of its SDKs is executed by a supported platform, called a runner. It is an open-source solution (GitHub) with support for the Python and Java languages.
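
A minimal batch pipeline sketch using the Python SDK, run locally on the default DirectRunner (the inline data stands in for a real source):

    import apache_beam as beam

    with beam.Pipeline() as pipeline:  # DirectRunner by default; swap for Dataflow etc.
        (
            pipeline
            | "Create" >> beam.Create(["alice,10", "bob,25", "alice,7"])
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "ToKV" >> beam.Map(lambda kv: (kv[0], int(kv[1])))
            | "SumPerKey" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )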

Apache Spark

A multi-language engine for scalable data processing with machine learning capabilities. It supports several languages, including SQL, as well as batch and streaming data processing.

Read: An article comparing Spark and Beam

Census

A reverse-ETL platform containing a series of data integrations between the main Data Warehouse platforms and platforms for sales, ads, marketing etc.

dbt

An open-source command-line tool for writing data transformation pipelines using SQL. There is a managed platform from the same vendor, dbt Cloud.

Great Expectations

An open-source data-quality framework that helps with writing tests, documentation and monitoring.
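
As a sketch of the idea, using the classic pandas-based API (newer versions of the library reorganize this interface, so treat the exact calls as illustrative):

    import great_expectations as ge
    import pandas as pd

    df = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, None], "amount": [10, -5, 3]}))

    # Each expectation is a documented, testable assertion about the data.
    result = df.expect_column_values_to_not_be_null("user_id")
    print(result.success)  # False: one null user_id

    result = df.expect_column_values_to_be_between("amount", min_value=0)
    print(result.success)  # False: a negative amount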

Snowflake

A data platform that runs on the three leading cloud providers (AWS, GCP and Azure). Originally identifying itself as a "Data Warehouse as a Service", its aim is to offer a modern cloud platform for analytics and data storage.

References

Some other interesting glossaries out there:

Complete Data Engineer’s Vocabulary | by Kovid Rathee | Towards Data Science

Data warehouse | Stitch data glossary

Data Engineering Glossary — Trifacta

Profile of the Author: Gabriel Takeshi Andrade Higa | Graduated in Computer Science from USP, with more than 5 years of experience working with data engineering and data science at DP6.

Profile of the Co-author: Eduardo Podgornik Romero | Graduated in Computer Science, with 4 years of experience working with Data Engineering at DP6. An aficionado of games, code and metrics.
