Data Engineering Digest #14 (Jul 2020)

By Maycon Viana Bordin
Published in data.plumbers, Aug 21, 2020 (18 min read)
Photo by Pixabay from Pexels

In this edition there’s a great article on data reliability: which metrics matter in a data platform (such as data downtime), how to measure them, and how they affect other teams.
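As a rough illustration of measuring a metric like data downtime, here is a minimal sketch that sums the time between when each data incident started and when it was resolved, then turns that into an availability ratio for the month. The incident log, the `data_downtime` helper and the 31-day window are assumptions made for this example; they are not taken from the article.

```python
from datetime import datetime, timedelta


def data_downtime(incidents):
    """Total time data was missing, stale or wrong.

    `incidents` is a hypothetical list of (started, resolved) datetimes.
    """
    return sum((resolved - started for started, resolved in incidents), timedelta())


# Made-up incident log for July 2020.
incidents = [
    (datetime(2020, 7, 1, 8, 0), datetime(2020, 7, 1, 11, 30)),
    (datetime(2020, 7, 9, 22, 0), datetime(2020, 7, 10, 1, 0)),
]

window = timedelta(days=31)           # reporting period (assumed)
downtime = data_downtime(incidents)   # 3h30 + 3h00
availability = 1 - downtime / window  # share of the month with healthy data

print(f"downtime={downtime}, availability={availability:.2%}")
```

Splitting each incident into time-to-detection and time-to-resolution would be a natural next step, if both timestamps are tracked.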

Another great article comes from Netflix, describing how they manage the cost of their data platform by combining AWS billing metrics with their S3 data inventory, data catalog and job platforms. This lets Netflix track costs per area and even suggest TTLs for table partitions, which saves money in the end.
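To make the idea concrete, here is a small sketch of the two pieces mentioned above: attributing storage cost to an owner, and suggesting a TTL from how long partitions keep being read. This is not Netflix's implementation. The partition records, the flat `PRICE_PER_GB_MONTH` value and both helper functions are hypothetical, and a real setup would pull these inputs from billing data, a storage inventory and a data catalog.

```python
from collections import defaultdict
from datetime import date

# Assumed flat storage price; real bills vary by tier and region.
PRICE_PER_GB_MONTH = 0.023

# Hypothetical inventory joined with catalog metadata:
# (table, owner_team, partition_date, size_gb, last_read_date or None)
partitions = [
    ("events", "growth", date(2020, 6, 1), 500.0, date(2020, 6, 8)),
    ("events", "growth", date(2020, 6, 2), 520.0, date(2020, 6, 30)),
    ("events", "growth", date(2020, 5, 1), 480.0, None),
    ("billing", "finance", date(2020, 6, 1), 40.0, date(2020, 7, 1)),
]


def monthly_cost_by_team(partitions, price=PRICE_PER_GB_MONTH):
    """Attribute monthly storage cost to the team that owns each table."""
    costs = defaultdict(float)
    for _table, team, _day, size_gb, _last_read in partitions:
        costs[team] += size_gb * price
    return dict(costs)


def suggest_ttl_days(partitions, table):
    """Suggest a TTL: the oldest age (in days) at which a partition was still read.

    Returns None when no partition of the table has ever been read.
    """
    ages = [
        (last_read - day).days
        for name, _team, day, _size, last_read in partitions
        if name == table and last_read is not None
    ]
    return max(ages) if ages else None


print(monthly_cost_by_team(partitions))
print(suggest_ttl_days(partitions, "events"))
```

The TTL heuristic here is deliberately naive. In practice one would likely add a safety margin and let table owners confirm the suggestion before anything is deleted.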

We also highlight the release of Flink 1.11.0, with a new Source API and support for Change Data Capture in Flink SQL. Hadoop 3.3.0 was released too, with support for ARM architectures and Java 11, as was Samza 1.5.0.

New Tools & Updates

Spark 3.0

Data Engineering Role

Courses & Training

Podcasts & Presentations

Real Data Architectures & Platforms

Data Culture

Data Lake

Data Architecture

Data Governance

Data Catalogs

Data Quality

Cost Efficiency

Data Formats

Delta Lake

Apache Parquet

Apache Hudi

Avro

Data Pipelines

ML Pipelines

Data Processing

Apache Spark

Apache Hive

MapReduce

Presto

Dask

Stream Processing

Apache Flink

Apache Spark Streaming

Apache Beam

Kafka Streams

Change Data Capture

Storage

Apache HDFS

Messaging

Apache Kafka

Apache Pulsar

Workflow Management

Apache Airflow

Luigi

Prefect

Cloud Providers

AWS

Google Cloud

Azure

Databases

NoSQL

Relational

Modern Data Warehouses
