Data Engineering != Spark

Jesse Anderson
Published in 97 Things · May 16, 2019

The misconception that Apache Spark is all you’ll need for your data pipeline is common. The reality is that you’re going to need components from three different general types of technologies in order to create a data pipeline. These three general types of Big Data technologies are:

  • Compute
  • Storage
  • Messaging

Remedying this misconception is crucial to success with Big Data projects and to one's own learning about Big Data. Spark is just one part of a larger Big Data ecosystem that's necessary to create data pipelines.

Put another way:

Data Engineering = Compute + Storage + Messaging + Coding + Architecture + Domain Knowledge + Use Cases

Batch and Real-time Systems

There are generally two core problems that you have to solve in a batch data pipeline: compute and the storage of data. Spark is a good solution for handling batch compute. The more difficult problem is finding the right storage, or more precisely, finding the different storage technologies optimized for that use case.

Compute Component

Compute is how your data gets processed. These compute frameworks are responsible for running the algorithms and the majority of your code. For Big Data frameworks, they’re responsible for all resource allocation, running the code in a distributed fashion, and persisting the results.
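To make the compute role concrete, here is a toy sketch in plain Python, not Spark itself: data is split into partitions, each partition is processed independently (as a framework would do across workers), and the partial results are merged. All names here are illustrative.

```python
from collections import Counter
from functools import reduce

def map_partition(lines):
    """Count words within one partition of the data."""
    return Counter(word for line in lines for word in line.split())

def merge(a, b):
    """Combine per-partition counts into a final result."""
    return a + b

# Two "partitions" standing in for data distributed across workers.
partitions = [
    ["spark is compute", "storage is not compute"],
    ["messaging moves events"],
]

totals = reduce(merge, (map_partition(p) for p in partitions))
print(totals["compute"])  # counted per partition, then merged -> 2
```

A real compute framework adds what this sketch leaves out: resource allocation, scheduling the per-partition work across machines, and persisting the merged result.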

Storage Component

Storage is how your data gets persisted permanently. For simple storage requirements, people will just dump their files into a directory. As requirements become slightly more demanding, we start to use partitioning, which puts files in directories with specific names. A common partitioning method is to use the date of the data as part of the directory name.
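A minimal sketch of date partitioning, using the `key=value` directory naming convention popularized by Hive and understood by Spark; the file name and root directory are made up for illustration.

```python
from datetime import date
from pathlib import Path
import tempfile

def partitioned_path(root, event_date, filename):
    """Build a date-partitioned path like root/year=2019/month=05/day=16/file."""
    return (Path(root)
            / f"year={event_date:%Y}"
            / f"month={event_date:%m}"
            / f"day={event_date:%d}"
            / filename)

root = tempfile.mkdtemp()  # stand-in for a data lake root directory
target = partitioned_path(root, date(2019, 5, 16), "events.json")
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text('{"event": "click"}')
print(target)  # .../year=2019/month=05/day=16/events.json
```

The payoff is that a query for one day only has to read one directory instead of scanning everything.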

NoSQL Databases

For more optimized storage requirements, we start using NoSQL databases. The need for NoSQL databases is especially prevalent when you have a real-time system. Most companies will store data in both a simple storage technology and one or more NoSQL databases. Storing data multiple times handles the different use cases or read/write patterns that are necessary. One application may need to read everything, while another application may only need specific data.
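The write-once, store-twice pattern can be sketched as follows. This is a hedged toy in plain Python: a list stands in for simple full-scan storage and a dict keyed by user stands in for a NoSQL key-value table; no real database is involved.

```python
# Raw store: append-only list of events (cheap to scan everything).
raw_events = []

# Lookup store: dict keyed by user, standing in for a NoSQL key-value table.
events_by_user = {}

def write(event):
    """Write once, persist twice: a full-scan copy and a per-user lookup copy."""
    raw_events.append(event)
    events_by_user.setdefault(event["user"], []).append(event)

write({"user": "alice", "action": "login"})
write({"user": "bob", "action": "purchase"})
write({"user": "alice", "action": "logout"})

# An analytics job reads everything; the app reads only one user's data.
print(len(raw_events))               # 3
print(len(events_by_user["alice"]))  # 2
```

Each store is shaped for its read pattern: the raw copy serves batch jobs that scan everything, and the keyed copy serves point lookups without a scan.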

Messaging Component

Messaging is how knowledge or events get passed around in real time. You start to use messaging when there is a need for real-time systems. These messaging frameworks are used to ingest and disseminate large amounts of data. This ingestion and dissemination is crucial to real-time systems because it solves the first mile and last mile problems.
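The first-mile/last-mile flow can be sketched with a thread-safe queue standing in for a messaging topic (for example, a Kafka topic). This is an illustration of the pattern, not a real messaging framework: one thread ingests events (first mile) while another disseminates them downstream (last mile).

```python
import queue
import threading

topic = queue.Queue()  # stand-in for a messaging topic (e.g. a Kafka topic)

def producer():
    """First mile: ingest events into the topic."""
    for i in range(3):
        topic.put({"event_id": i})
    topic.put(None)  # sentinel: no more events

received = []

def consumer():
    """Last mile: disseminate events to a downstream system."""
    while (event := topic.get()) is not None:
        received.append(event)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(len(received))  # 3
```

Decoupling producer from consumer through the topic is the point: either side can scale or fail independently while events keep flowing.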

This was originally published on jesse-anderson.com.
