Data Engineering Weekly #25

Ananth Packkildurai
Data Engineering Weekly
Jan 11, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Weekly newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 25th edition of the data engineering newsletter. This week's edition focuses on Kleiner Perkins's look at the future of computing and data infrastructure, LinkedIn's fast ingestion with Gobblin, Intuit's data journey, AWS's PyDeequ, Alibaba Cloud's Flink infrastructure processing 4 billion records per second, Expedia's ML deployment patterns, Delta Lake vs. Hudi, handling late-arriving dimensions, entity resolution for big data, Airflow 2.0, and Debezium's year-in-review 2020.

Kleiner Perkins: Looking ahead to the future of computing and data infrastructure

Kleiner Perkins writes an excellent blog about the future of computing and data infrastructure. The cloud data warehouse, serverless architecture, the workflow-as-(no)code movement, and the lack of an end-to-end solution to optimize the ML infrastructure value chain are some of the exciting trends to watch. The author's take on data security reinforces the importance of a metadata management system.

The (data/ security) breaches showing up in the news on a near-weekly basis all seem to be rooted in the same problem — a lack of awareness of what data an organization has, where it is, and who has access to it.

You can read Data Engineering Weekly’s take on data infrastructure trends here.

LinkedIn: FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format

LinkedIn writes about the evolution of Apache Gobblin from a batch ingestion framework to a fast ingestion framework, cutting ingestion latency from 45 minutes to less than 5 minutes. The blog's narration of how Gobblin uses Apache Iceberg to guarantee read/write isolation, the tradeoffs of ORC format encoding, and continuous data publishing is an exciting read. The choice of YARN for resource management and scheduling is interesting, and I look forward to reading more about how the Gobblin replanner evolves from stop-the-world rebalancing to dynamic rebalancing.

Intuit: The Intuit Data Journey

Clean migration is a sign of effective engineering, and Intuit describes one such clean migration of its data infrastructure from on-premises to cloud-native. The blog emphasizes data infrastructure fundamentals, such as treating data as a product, and focuses on data quality, availability, performance, security, and cost-effectiveness. It's exciting to read about the challenges ahead for Intuit's data platform and its focus on the data mesh approach.

AWS: Testing data quality at scale with PyDeequ

AWS introduced Deequ, a data quality library, in early 2019. Deequ is used internally at Amazon to verify the quality of many large production datasets. Dataset producers can add and edit data quality constraints. The system computes data quality metrics regularly (with every new version of a dataset), verifies constraints defined by dataset producers, and publishes datasets to consumers in case of success.

As an evolution of Deequ, AWS has open-sourced PyDeequ, a Python wrapper on top of Deequ that integrates with PySpark to define and run data quality checks.
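To make that concrete, here is a minimal sketch based on PyDeequ's documented usage; the dataframe, column names, and checks below are illustrative rather than taken from the AWS post.

```python
# Minimal PyDeequ sketch: run a few declarative checks against a PySpark dataframe.
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Illustrative data; in practice this would be a large production dataset.
df = spark.createDataFrame(
    [(1, "thingA", 13.0), (2, "thingB", 15.0), (3, None, 9.5)],
    ["id", "product_name", "price"],
)

check = Check(spark, CheckLevel.Error, "Basic data quality checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.hasSize(lambda size: size >= 3)  # dataset-level constraint
                         .isUnique("id")                   # primary-key style check
                         .isComplete("product_name")       # no NULLs allowed
                         .isNonNegative("price"))          # value-range check
          .run())

# Each constraint's status (Success/Failure) comes back as a dataframe.
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```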

Alibaba Cloud: Four Billion Records per Second! What is Behind Alibaba Double 11 — Flink Stream-Batch Unification Practice during Double 11 for the Very First Time

Alibaba writes a great success story about Apache Flink's scalability and the effectiveness of stream-batch unification. During the last Double 11 Global Shopping Festival, the Apache Flink pipeline processed an impressive four billion records per second, with data volume reaching an incredible seven TB per second. The Flink-based stream-batch unification has successfully withstood strict tests of stability, performance, and efficiency in Alibaba's core data service scenarios. The article shares the practical experience and reviews the evolution of stream and batch unification within Alibaba's core data services.

Expedia: Accelerate Machine Learning with the Optimal Deployment Pattern

Expedia writes about ML model deployment patterns, narrating some of the significant challenges of operating ML models in production and how they differ from traditional back-end systems. The blog is an exciting read on the various deployment patterns and each pattern's pros and cons.

Lu Jiaqi: The ACID table storage layer - thorough conceptual comparisons between Delta Lake and Apache Hudi

The support for ACID transactions on top of object storage is a significant development in 2020. The blog narrates the drawbacks of the plain data lake approach and compares the ACID support in Databricks Delta Lake and Apache Hudi.

Databricks: Handling Late Arriving Dimensions Using a Reconciliation Pattern

Processing facts and dimensions is at the core of data engineering. In a typical event-sourcing setup, the producer publishes facts and dimensions on different streams. The blog narrates some of the design challenges with late-arriving dimensions, especially with fast/rapidly changing dimensions (RCD), and how a reconciliation pattern helps solve them, as sketched below.
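Here is a minimal sketch of the general reconciliation idea on Delta Lake, not the blog's exact code: the enriched_orders fact table and customer_dim dimension table are hypothetical, and the assumption is that facts land first with a NULL dimension attribute that a periodic MERGE later back-fills once the late dimension rows arrive.

```python
# Reconciliation sketch: back-fill fact rows whose dimension attributes were
# unknown at ingestion time. Assumes a SparkSession configured with Delta Lake
# and the two (hypothetical) tables already registered.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  MERGE INTO enriched_orders AS f          -- fact table, written as events arrive
  USING customer_dim AS d                  -- dimension table, possibly late
  ON f.customer_id = d.customer_id
  WHEN MATCHED AND f.customer_segment IS NULL THEN
    UPDATE SET f.customer_segment = d.segment
""")
```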

ACM Computing Surveys / The Morning Paper: An overview of end-to-end entity resolution for big data

One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. The paper presents an end-to-end view of ER workflows for big data, critically reviews the pros and cons of existing methods, and concludes with the leading open research directions.
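As a toy illustration of the two core steps such workflows share, blocking and matching, here is a small sketch; the records and the similarity rule are made up and stand in for the far more sophisticated techniques the survey reviews.

```python
# Entity resolution in miniature: block records by a cheap key, then compare
# only the candidate pairs inside each block.
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Corp", "city": "Berlin"},
    {"id": 2, "name": "ACME Corporation", "city": "Berlin"},
    {"id": 3, "name": "Globex", "city": "Springfield"},
]

# Blocking: group by the first three letters of the name to avoid the full
# quadratic comparison over all record pairs.
blocks = {}
for r in records:
    blocks.setdefault(r["name"][:3].lower(), []).append(r)

def is_match(a, b):
    # Matching: a naive hand-written rule standing in for a learned matcher.
    return (a["city"] == b["city"]
            and a["name"].split()[0].lower() == b["name"].split()[0].lower())

matches = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
    if is_match(a, b)
]
print(matches)  # [(1, 2)] -- the two Acme descriptions resolve to one entity
```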

Databand: Airflow 2.0 and Why We Are Excited at Databand

Airflow version 2.0 is a significant milestone release for the Airflow community. Databand shares a similar excitement and narrates two significant features released with Airflow 2.0: the decorator-based flows (the TaskFlow API) and scheduler performance improvements.
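For anyone who has not tried the decorator style yet, here is a minimal sketch of Airflow 2.0's TaskFlow API; the DAG, task names, and values are invented for illustration and are not Databand's example.

```python
# TaskFlow API sketch: plain Python functions become Airflow tasks, and return
# values are passed between them via XCom automatically.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2021, 1, 1), catchup=False)
def order_pipeline():

    @task
    def extract() -> dict:
        return {"order_id": 42, "amount": 100.0}

    @task
    def transform(order: dict) -> float:
        return order["amount"] * 1.1

    @task
    def load(total: float) -> None:
        print(f"loading total: {total}")

    load(transform(extract()))

dag = order_pipeline()
```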

Debezium: Debezium in 2020 -- The Recap!

Debezium, the de facto open-source distributed platform for change data capture, publishes its 2020 year-in-review. The blog post contains a rich consolidation of exciting articles about the adoption of Debezium.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
