Data Engineering Digest #15 (Aug 2020)

Maycon Viana Bordin
Published in data.plumbers
19 min read · Oct 1, 2020
Photo by Brett Sayles from Pexels

This edition is arriving a little later than expected, but its highlights are still fresh and interesting.

To start, Kafka has a new update in its ongoing effort to remove ZooKeeper as a dependency, a change planned for version 3.0.

We also have the paper describing the inner workings of Delta Lake, published by Databricks. It details the motivations for Delta Lake and how the authors achieved ACID properties on top of an object store and Parquet files while still enabling schema evolution, time travel, and caching, among other useful features.

And finally, we have a benchmark comparing the latency of Apache Kafka and Apache Pulsar. In a nutshell, the results show that Pulsar has more predictable latency than Kafka, for both end-to-end latency and publish latency. The full description of the experiments and results can be found in the article.

New Tools & Updates

Data Engineering Role

Courses & Training

Publications

Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Podcasts & Presentations

Real Data Architectures & Platforms

Data Culture

Data Lake

Data Architecture

Data Governance

Data Formats

Delta Lake

Apache Parquet

Apache Avro

Data Pipelines

Data Quality Tools

Data Processing

Apache Hadoop

Apache Spark

Apache Hive

Presto

MapReduce

Project Ray

Stream Processing

Apache Flink

Apache Spark Streaming

Apache Beam

Apache Kafka Streams

Ingestion

Batch

Change Data Capture

Real-Time

Messaging

Apache Kafka

Apache Pulsar

Workflow Management

Apache Airflow

Prefect

Dagster

Argo

Apache NiFi

Cloud Providers

AWS

Google Cloud

Azure

Databases

NoSQL

In-Memory & Data Grid

Modern Data Warehouses
