Introduction to Data Engineering

Jae Hyeon Bae
Published in Analytics Vidhya · Nov 23, 2019

Data engineering can be defined broadly as all the engineering work around data processing, analytics, visualization, and so on. Before going deeper, let’s talk first about why data engineering is important and what its scope is.

Data is the most important asset in a company because it’s the only ground truth with which a company’s business can be optimized. If all the data is easily accessible to all the teams, it can be leveraged in new and exciting ways. Everybody knows about Netflix’s recommendation and personalization engine, which is a perfect example of how business can be optimized by utilizing customer data. We hardly need to reiterate the importance of data; just consider the number of Big Data companies such as Splunk, Elastic, Hortonworks, Cloudera, Teradata, etc.

Data science is the process of data-driven decision making, and data engineering is its prerequisite. Data scientists handle data statistically to solve problems including, but not limited to, recommendation/personalization, e-mail targeting, and marketing-efficiency optimization. Say a delivery company runs an advertising campaign and promotions to drive users’ first purchases. The most important measure of its spending efficiency is then the cost per first purchase. That metric should be visualized as a graph over multiple dimensions, including location, properties of the targeted users, and time. Usually, data scientists focus more on statistical modeling and data analytics, while data engineers work to deliver data, extract information, and store it.
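As a concrete illustration of the kind of metric a data scientist would look at, here is a minimal sketch of computing cost per first purchase broken down by dimensions. The column names and numbers are made-up sample data, not from any real campaign.

```python
import pandas as pd

# Hypothetical ad spend per week and location.
spend = pd.DataFrame({
    "week": ["2019-W44", "2019-W44", "2019-W45"],
    "location": ["Seoul", "Busan", "Seoul"],
    "ad_spend": [1000.0, 400.0, 1200.0],
})

# Hypothetical counts of first purchases for the same dimensions.
first_purchases = pd.DataFrame({
    "week": ["2019-W44", "2019-W44", "2019-W45"],
    "location": ["Seoul", "Busan", "Seoul"],
    "first_purchase_count": [50, 10, 80],
})

# Join on the shared dimensions, then divide spend by conversions.
report = spend.merge(first_purchases, on=["week", "location"])
report["cost_per_first_purchase"] = report["ad_spend"] / report["first_purchase_count"]
print(report)
```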

Now, how does data engineering work? With the delivery-campaign example, we need to track how much money we spent on each medium where advertisements were published and shown to users. This event information should be piped to a data backend and kept somewhere; this is where data pipelines and storage systems shine. To build analytic models, the raw data should be processed and restructured for efficient computation, and what is needed here to shape data into a specific format is an Extract-Transform-Load (ETL) framework. After building the analytic models, we need to iterate on the campaign-targeting strategy: when user information is retrieved in the advertisement-matching systems, that profile is matched against the updated targeting model, so we need an indexing and retrieval engine to look up the user profile and match it with the targeting model. Finally, to measure the performance of the campaign strategy, we need to visualize a trend that shows whether it is getting better or worse.
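To make the ETL step concrete, here is a minimal sketch under assumed inputs: raw impression events arrive as newline-delimited JSON with hypothetical fields (media, date, cost), get transformed into daily spend per medium, and are loaded by writing a new file. A real ETL framework would do the same thing at scale with scheduling and fault tolerance.

```python
import json
from collections import defaultdict

def extract(path):
    # Yield one raw event per line, e.g. {"media": "app_a", "date": "2019-11-01", "cost": 0.02}
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def transform(events):
    # Aggregate raw impression costs into daily spend per medium.
    spend = defaultdict(float)
    for event in events:
        spend[(event["media"], event["date"])] += event["cost"]
    return spend

def load(spend, path):
    # Write the aggregated table back out as newline-delimited JSON.
    with open(path, "w") as f:
        for (media, date), cost in spend.items():
            f.write(json.dumps({"media": media, "date": date, "spend": cost}) + "\n")

load(transform(extract("raw_events.jsonl")), "daily_spend.jsonl")
```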

Data Pipeline

Let’s talk briefly about each data engineering component. First of all, the data pipeline is the entry point; it delivers data from “A” to “B”. “A” can be the business front end, such as a mobile app or website, or any intermediate data store. “B” should be persistent storage that makes the data durable. There are two communication models for a data pipeline: the “push” and “pull” models. With the “push” model, data is pushed from the sender to the receiver. The receiver can forward data sequentially based on its configuration, but it does not expose any interface for random access to the data. Apache Flume, Fluentd, and Netflix Suro all follow this model. The “pull” model works similarly, but it exposes APIs for consuming data at arbitrary positions. Apache Kafka and Amazon Web Services (AWS) Simple Queue Service (SQS) have APIs to consume messages via offsets or visibility timeouts.
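Here is a minimal sketch of the “pull” model using SQS through boto3; the queue URL is a placeholder and the handling function is a stand-in for real application logic. The consumer asks for messages, and anything it does not delete before the visibility timeout expires becomes visible again and can be pulled once more.

```python
import boto3

def handle(body):
    # Application-specific handling; here we just print the raw payload.
    print(body)

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/ad-events"  # placeholder URL

resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=10,     # long polling
    VisibilityTimeout=30,   # seconds this batch stays hidden from other consumers
)
for msg in resp.get("Messages", []):
    handle(msg["Body"])
    # Deleting acknowledges the message; otherwise it reappears after the
    # visibility timeout and can be consumed again.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```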

The most popular project in the data pipeline area is Apache Kafka. It consists of three parts: the producer, the broker, and the consumer. The producer sends messages to the broker under specific topics, and the broker stores them in its local file system. A topic can have multiple partitions, which can be distributed among multiple machines for load balancing. The broker exposes an API to fetch data by topic, partition, and offset. The offset can be thought of as a simple message ID, starting from 0 and incrementing monotonically. When the consumer sends a request to the broker, the broker responds with a chunk of data.
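A minimal producer/consumer sketch using the kafka-python client is shown below; the broker address and topic name are placeholders, and a production setup would add serialization, error handling, and offset commits.

```python
from kafka import KafkaProducer, KafkaConsumer

# Produce a message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("ad-events", b'{"media": "app_a", "cost": 0.02}')
producer.flush()

# Consume messages, starting from the earliest available offset.
consumer = KafkaConsumer(
    "ad-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for record in consumer:
    # Each record carries the coordinates described above: topic, partition, offset.
    print(record.topic, record.partition, record.offset, record.value)
```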

Kafka is a simple distributed message queue that can act as a buffer against backpressure from the endpoints. Say we want to store data in Elasticsearch for analytics, but the Elasticsearch ingestion rate has degraded for some reason. If messages arrive faster than Elasticsearch can ingest them, the producer side would otherwise have to keep messages in memory or drop them to avoid blowing up its memory. Kafka can hold messages for a while, as long as it has enough disk space. Kafka also performs exceptionally well among distributed message queues thanks to its contention-free architecture: as mentioned earlier, messages are consumed by topic, partition, and offset, without any coordination among consumers. This is why Kafka can support broadcast consumption, where all consumers see all messages. In addition to broadcasting, if a topic has multiple partitions, consumption from those partitions can be distributed across multiple workers. For example, with four partitions in a topic, the number of workers can range from one to four: a single worker consumes all messages from the four partitions, while with four workers each worker consumes messages from a single partition.
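As a sketch of that worker scaling, the script below could be run in one to four processes sharing the same consumer group; Kafka assigns the topic’s partitions among the group members. The topic, broker address, and group ID are placeholders, and the sink is simplified to a print statement where a real worker would write to Elasticsearch at whatever rate it can sustain, with the backlog simply waiting on Kafka’s disks.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ad-events",                        # assume this topic has four partitions
    bootstrap_servers="localhost:9092",
    group_id="es-ingestion-workers",    # all workers share this group id
)
for record in consumer:
    # Stand-in for indexing into Elasticsearch at the sink's own pace.
    print(record.partition, record.offset, record.value)
```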

Through the data pipeline, messages end up in storage systems, including NoSQL databases, aggregation engines such as Elasticsearch, and cold file storage like HDFS or AWS S3. When storing data as files, the file format should be chosen carefully. To express structured data, it is common to use text files of JSON strings; for better performance, however, a columnar file format like Parquet or ORC is more useful. If we could fully rely on the power of the aggregation engines, we would not need to store data in the file system at all. Unfortunately, aggregation engines cannot do everything, so for more flexible analysis later, all data should also be kept in cold file storage.
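The contrast between the two formats can be sketched with pandas: the same table is written once as row-oriented JSON lines and once as columnar Parquet. The sample columns are hypothetical, and writing Parquet with pandas requires an engine such as pyarrow to be installed.

```python
import pandas as pd

events = pd.DataFrame({
    "media": ["app_a", "app_b", "app_a"],
    "date": ["2019-11-01", "2019-11-01", "2019-11-02"],
    "cost": [0.02, 0.05, 0.03],
})

# Row-oriented text: one JSON object per line, human-readable and easy to append.
events.to_json("events.jsonl", orient="records", lines=True)

# Columnar: compressed, typed, and faster to scan for analytical queries.
events.to_parquet("events.parquet")
```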

Data Indexing

Let’s turn our attention now to indexing and retrieval. For traditional information retrieval built on the power of an inverted index, the most popular choice is Elasticsearch. An inverted index is a mapping from a keyword to the documents that contain it. Given a collection of documents, each document can be represented as a bag of keywords under a specific document ID; if we “invert” this relationship, we get a mapping from keywords to documents, much like the index at the back of a textbook. Elasticsearch supports real-time indexing of stream events, a variety of boolean queries, and aggregation features, and it has a great ecosystem, including Logstash (data pipeline) and Kibana (analytics UI for visualization), with open source community support.
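A toy illustration of the concept (not Elasticsearch’s actual on-disk format) is a dictionary that maps each keyword to the set of document IDs containing it; boolean queries then become set operations.

```python
from collections import defaultdict

documents = {
    1: "cheap pizza delivery",
    2: "fast grocery delivery",
    3: "cheap grocery coupons",
}

# Build the inverted index: keyword -> set of document IDs.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for keyword in text.split():
        inverted_index[keyword].add(doc_id)

print(inverted_index["delivery"])                            # {1, 2}
print(inverted_index["cheap"] & inverted_index["grocery"])   # boolean AND -> {3}
```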

Elasticsearch ingestion and aggregation performance are not perfect. Because of its indexing architecture (building an immutable index in memory, flushing it to disk, and merging segments), the process depends heavily on disk I/O. Also, Elasticsearch does not support pre-aggregated metrics; for example, it needs to do a full scan of the data to compute metrics such as minimum, maximum, and average.
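The kind of query meant here looks roughly like the sketch below: min/max/avg aggregations are computed over all matching documents at query time. The index name, field name, and URL are placeholders.

```python
import requests

query = {
    "size": 0,  # we only want the aggregation results, not the matching documents
    "aggs": {
        "min_latency": {"min": {"field": "response_time"}},
        "max_latency": {"max": {"field": "response_time"}},
        "avg_latency": {"avg": {"field": "response_time"}},
    },
}

# Elasticsearch evaluates these aggregations by scanning the field values of
# every matching document rather than reading a pre-aggregated rollup.
resp = requests.post("http://localhost:9200/api-logs/_search", json=query)
print(resp.json()["aggregations"])
```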

Metric Aggregation Engine

A nice workaround for these problems can be Druid (a “high-performance, column-oriented, distributed data store”). Druid can be configured with a time granularity and a list of defined metrics. For example, if our data is a stream of events recording the response times of specific REST APIs, we can configure Druid to ingest the data at minute granularity with an average metric on the response time. Druid then aggregates the average response time every minute, so there is no need for further post-aggregation by scanning individual records.
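As a rough sketch of the relevant pieces of a Druid ingestion spec (field and metric names are hypothetical, and a full spec has more sections): Druid rolls events up at ingestion time, and a common way to get an average is to store a sum and a count so that average response time per minute is just sum divided by count at query time, with no scan of raw records.

```python
# Fragments of a Druid ingestion spec, expressed as Python dicts for readability.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "HOUR",
    "queryGranularity": "MINUTE",  # roll events up to one row per minute per dimension combination
}

metrics_spec = [
    {"type": "count", "name": "event_count"},
    {"type": "doubleSum", "name": "response_time_sum", "fieldName": "response_time"},
]
# At query time: avg = response_time_sum / event_count for each minute bucket.
```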

However, there is no silver bullet. Even though Druid can, in some situations, perform much better than Elasticsearch, Elasticsearch has advantages over Druid. Since Elasticsearch is a search engine with a document-based model, indexing itself is idempotent, and entity-centric indexing can be used to implement various metrics efficiently. Because Druid stores each row without any document ID, its ingestion is not idempotent. Also, since Druid has many moving pieces, including a heavy dependency on ZooKeeper for distributed coordination, its maintenance and operations are not as straightforward as Elasticsearch’s.

As I have described so far, data engineering technologies are largely driven by the open source community. Since each open source project has its own pros and cons, the choice should be made carefully based on the existing environment and the engineers’ preferences. In the next blog post, I will talk about stream processing.
