Data Engineering Weekly #27

Published in Data Engineering Weekly

Feb 1, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 27th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on decentralized content moderation, Kafka as a database, Snowflake’s External Tables, Dagster 0.10.0, Uber’s real-time data intelligence platform, Dropbox’s Superset adoption, Cloudflare’s data center operations using Airflow, Apache Hudi’s clustering, and Trainline’s data lake.

Martin Kleppmann: Decentralized content moderation

January 2021 was an eventful month that brought a lot of debate over censorship and content moderation by social media. People have gossiped and spread misinformation for centuries, but the impact was limited to a local context. Twitter and Facebook created a Cerebro for misinformation. The author summarizes the need to rethink content moderation, moving from centralized, subjective moderation to democratic, decentralized moderation. It is an exciting space to watch how data infrastructure can evolve to improve content moderation.

Decentralised content moderation

Published by Martin Kleppmann on 13 Jan 2021. Who is doing interesting work on decentralised content moderation? With…

martin.kleppmann.com

Facebook’s Fighting Abuse @Scale 2019 conference includes some exciting talks on the same topic.

Fighting Abuse @Scale 2019 recap - Facebook Engineering

Fighting abuse presents unique challenges for large-scale organizations working to keep the people on their platforms…

engineering.fb.com

David Xiang: Kafka As A Database? Yes Or No

Apache Kafka is a vital component of modern infrastructure. Is Kafka a database? It is a hot debate that shapes the future of streaming technology. The author summarizes the merits and demerits of treating Kafka as a database. One of the conventional arguments for Kafka is that it supports read/write separation of concerns with a write-once, multi-model read pattern. At the same time, maintaining data integrity and multi-model materialization is not cheap and can further complicate the system design. Nonetheless, it is exciting to watch the evolution of streaming databases.

Kafka As A Database? Yes Or No - A Summary Of Both Sides

I recently read through a Hacker News thread discussing the article "Kafka Is Not A Database", by Arjun Narayan and…

davidxiang.com
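To make the write-once, multi-model read pattern from the summary above concrete, here is a minimal in-memory sketch. It does not use any Kafka client API: `AppendOnlyLog`, `LatestValueView`, and `RunningTotalView` are hypothetical names invented for illustration, and a Python list stands in for a topic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    key: str
    value: int


class AppendOnlyLog:
    """Hypothetical stand-in for a topic: records are appended, never updated."""

    def __init__(self) -> None:
        self._records: List[Event] = []
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        # Replay the existing records so a late subscriber can catch up,
        # then register the handler for future appends.
        for record in self._records:
            handler(record)
        self._subscribers.append(handler)

    def append(self, event: Event) -> None:
        # Single write path: store the record once, then fan it out.
        self._records.append(event)
        for handler in self._subscribers:
            handler(event)


class LatestValueView:
    """Read model 1: key-value lookup of the most recent value per key."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def apply(self, event: Event) -> None:
        self.state[event.key] = event.value


class RunningTotalView:
    """Read model 2: aggregate of all values seen per key."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = defaultdict(int)

    def apply(self, event: Event) -> None:
        self.state[event.key] += event.value


def demo():
    log = AppendOnlyLog()
    latest = LatestValueView()
    log.subscribe(latest.apply)
    log.append(Event("user-1", 5))
    log.append(Event("user-2", 7))
    # A view added later is rebuilt by replaying the log from the start.
    totals = RunningTotalView()
    log.subscribe(totals.apply)
    log.append(Event("user-1", 3))
    return latest.state, dict(totals.state)
```

Each view is a separate materialization that has to be kept consistent with the log, which hints at why the summary calls multi-model materialization "not cheap" once this leaves a single process.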

Snowflake: External Tables Are Now Generally Available On Snowflake

Cloud storage services such as AWS S3, Azure Data Lake Storage, and Google Cloud Storage are popular choices for data lake systems. Snowflake, the well-known cloud data warehouse, introduced external tables that enable Snowflake to query data in cloud storage. Snowflake also supports streaming ingestion for external datasets, similar to Apache Hudi and Delta Lake. Presto has played the role of federated query engine, unifying queries across data lakes and cloud data warehouses, so a native implementation from Snowflake is a significant development.

External Tables Are Now Generally Available on Snowflake | Snowflake

Snowflake is announcing the general availability (GA) of the External Tables, a key data lake workload feature in the…

www.snowflake.com

Dagster: Dagster 0.10.0: The Edge of Glory

Dagster released version 0.10.0, codenamed “The Edge of Glory.” It’s exciting to see Dagster’s focus on a native scheduler instead of relying on cron or Kubernetes, its support for sensors, its tight integration with Kubernetes, and the I/O manager abstraction that simplifies the development and testing phases of a pipeline.

Dagster 0.10.0: The Edge of Glory | Dagster Blog

Published on 2021-01-19 We are delighted to announce Dagster 0.10.0, codenamed "The Edge of Glory." This release's…

dagster.io
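The idea behind an I/O manager abstraction can be sketched in plain Python. This is not Dagster's API: the `IOManager` protocol, the two implementations, and `run_pipeline` are hypothetical names, and the only point is that pipeline logic stays the same while storage is swapped between a test and a file-backed run.

```python
import json
from pathlib import Path
from typing import Any, Dict, Protocol


class IOManager(Protocol):
    """Hypothetical interface: how step outputs are stored and loaded."""

    def save(self, name: str, value: Any) -> None: ...

    def load(self, name: str) -> Any: ...


class InMemoryIOManager:
    """Keeps outputs in a dict, convenient for unit tests."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def save(self, name: str, value: Any) -> None:
        self._store[name] = value

    def load(self, name: str) -> Any:
        return self._store[name]


class JsonFileIOManager:
    """Writes each output as a JSON file under base_dir."""

    def __init__(self, base_dir: Path) -> None:
        self._base_dir = base_dir

    def save(self, name: str, value: Any) -> None:
        (self._base_dir / f"{name}.json").write_text(json.dumps(value))

    def load(self, name: str) -> Any:
        return json.loads((self._base_dir / f"{name}.json").read_text())


def run_pipeline(io: IOManager) -> None:
    # The steps only compute; where results live is the manager's concern.
    io.save("raw", [3, 1, 2])
    io.save("sorted", sorted(io.load("raw")))
```

Calling `run_pipeline(InMemoryIOManager())` in a test and `run_pipeline(JsonFileIOManager(some_dir))` elsewhere exercises the same steps against different storage.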

Uber: Uber’s Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability

Uber writes about Gairos, its real-time data processing, storage, and querying platform that facilitates streamlined and efficient data exploration at scale. The total size of queryable data served by Gairos is 1,500+ TB, and the number of production pipelines is over 30. The total number of records is more than 4.5 trillion, and the total number of clusters is over 20. Over 1 million events flow into Gairos every second. The Gairos Optimization Engine is an exciting implementation that self-tunes Elasticsearch and the ingestion pipeline.

Uber's Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability

Real-time data (# of ride requests, # of drivers available, weather, game) enables operations teams to make informed…

eng.uber.com

Dropbox: Why we chose Apache Superset as our data exploration platform

Apache Superset is now a top-level Apache project. Dropbox writes about why it chose Apache Superset over competing visualization tools such as Redash, Mode, and Periscope.

Why we chose Apache Superset as our data exploration platform

Today the Apache Software Foundation announced Apache Superset as one of its official top-level projects. Apache…

dropbox.tech

Cloudflare: Automating data center expansions with Airflow

Infrastructure operations and maintenance tasks are often scheduled as cron jobs. However, cron has its limitations, and orchestration engines like Airflow provide a much more efficient scheduler for non-time-sensitive tasks. Cloudflare writes an excellent blog post about using Apache Airflow for data center operations.

Automating data center expansions with Airflow

Cloudflare's network keeps growing, and that growth doesn't just come from building new data centers in new cities…

blog.cloudflare.com

Apache Hudi: Optimize Data Lake layout using Clustering in Apache Hudi

Small files are a classic problem in data infrastructure, with an inherent impact on query performance. Apache Hudi introduced a pluggable clustering architecture that handles small files and colocates related data to improve query efficiency.

Optimize Data Lake layout using Clustering in Apache Hudi

This blog is a repost of this Hudi blog on medium.

medium.com
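As a rough illustration of what clustering buys in the Hudi item above, the sketch below merges many small "files" and rewrites them sorted by a key, so related records end up in the same, larger file. It is a toy under stated assumptions, not Hudi's implementation or API: `cluster_files` and `target_records_per_file` are hypothetical, files are plain Python lists, and size is counted in records rather than bytes.

```python
from typing import List, Tuple

Record = Tuple[str, int]  # (sort key, payload); a hypothetical record shape


def cluster_files(
    files: List[List[Record]], target_records_per_file: int
) -> List[List[Record]]:
    """Rewrite many small files into fewer files with colocated keys."""
    if target_records_per_file <= 0:
        raise ValueError("target_records_per_file must be positive")
    # Gather every record, then sort by key so equal keys sit next to each other.
    all_records = sorted(
        (record for file in files for record in file),
        key=lambda record: record[0],
    )
    # Cut the sorted run into files of the target size.
    return [
        all_records[i : i + target_records_per_file]
        for i in range(0, len(all_records), target_records_per_file)
    ]


small_files = [
    [("nyc", 1)], [("sf", 2)], [("nyc", 3)],
    [("sf", 4)], [("la", 5)], [("nyc", 6)],
]
clustered = cluster_files(small_files, target_records_per_file=3)
```

Here six single-record files become two files, and a query filtering on one key touches fewer files than before.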

Trainline: Building a data lake: from batch to real-time using Kafka

Trainline writes about the evolution of its data pipeline. It’s exciting to see a familiar data ingestion maturity model, moving from API integration to batch processing to real-time data ingestion.

Building a data lake: from batch to real-time using Kafka

We know having a single place to store and query all available data (a data lake) is a critical requirement in the…

engineering.thetrainline.com

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.