Data Engineering Weekly #31

Ananth Packkildurai
Data Engineering Weekly
Mar 7, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Weekly newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 31st edition of the data engineering newsletter. This week's edition features a new set of articles covering Redpoint Ventures' primer on Reverse ETL, JPMorgan's data mesh implementation, DBT's view of the modern data stack, ValidIO's ML & data trends for 2021, Airbnb's work on visualizing data timeliness, Pinterest's lessons learned from running Kafka at scale, Confluent's 42 things to do once ZooKeeper is gone, LinkedIn's approach to solving the data integration problem with Apache Gobblin, Facebook's mitigation of the effects of silent data corruption, Reddit's scaling of its reporting system, and LinkedIn's GraphQL implementation in DataHub.

Redpoint Ventures: Reverse ETL — A Primer

Over the last decade, cloud and SaaS products have changed the way businesses operate. In a modern business, customer data is spread across many SaaS vendors, and a single source of truth is a myth in the modern data infrastructure. I often call this "Truth in Motion"; I shared a similar thought a year ago.

Along the same lines, the blog narrates "Reverse ETL," where data flows from the internal data warehouse back out to SaaS providers like Salesforce, Zendesk, and Intercom. It is an exciting space to watch, as success depends on how well SaaS vendors simplify ingress and deliver cost-effective time to value on the customer's data. The sketch below shows the basic shape of the pattern.
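To make the pattern concrete, here is a minimal sketch of a reverse ETL sync, using sqlite3 as a stand-in for the warehouse and a stubbed-out SaaS client. The table name, columns, and push_to_crm function are illustrative assumptions, not anything from the article.

```python
import sqlite3

# Stand-in for the data warehouse; in practice this would be
# Snowflake, BigQuery, Redshift, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_facts (email TEXT, ltv REAL, churn_risk REAL)")
conn.execute("INSERT INTO customer_facts VALUES ('a@example.com', 1200.0, 0.8)")

def push_to_crm(record: dict) -> None:
    # Stub for a SaaS API call (e.g., updating a Salesforce or
    # Intercom contact); a real sync would batch records, retry
    # on failure, and track a high-water mark for incremental runs.
    print(f"PATCH /contacts {record}")

# The "reverse" direction: warehouse -> SaaS tool.
for email, ltv, churn_risk in conn.execute(
    "SELECT email, ltv, churn_risk FROM customer_facts"
):
    push_to_crm({"email": email, "ltv": ltv, "churn_risk": churn_risk})
```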

JPMorgan Chase: Implementing a Data Mesh Architecture at JPMC

JPMC talks about the team's thinking and implementation strategy for adopting data mesh principles. The talk is a good narrative on structuring the data mesh principles, publishing a taxonomy, and the need for pragmatic compromise while adopting them.

DBT: The Modern Data Stack: Past, Present, and Future

What does the data engineering world look like beyond the Hadoop ecosystem? The blog from DBT gives a comprehensive overview of the modern data stack, starting from the introduction of Redshift and its impact on the data warehouse. The blog narrates the challenges ahead and reminds us that this space is wide open for innovation over the next decade.

VALIDIO: ML & Data Trends: Wrapping up 2020 and looking into 2021 & beyond

The blog narrates how the underlying data infrastructure influences ML development, in line with the recent trends around "Reverse ETL" and the modern cloud-native data stack. The blog also reiterates that we are still in the early stages of MLOps, data quality tooling, and unified data architecture on the path to industrializing ML development.

Airbnb: Visualizing Data Timeliness at Airbnb

Commitment, consistency, and clarity in the data pipeline are the core principles for building trust in data and empowering a data-driven culture. Airbnb writes an exciting blog about SLA Tracker and how the team took a data-driven approach to debugging data pipelines and improving their timeliness.
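As a toy illustration of the kind of check an SLA tracker automates, the sketch below flags daily partitions that landed after their committed landing time. The timestamps and the 06:00 commitment are made-up assumptions, not Airbnb's numbers or tooling.

```python
from datetime import datetime, time

# Hypothetical landing timestamps for a daily table (data date -> landed at).
landings = {
    "2021-03-01": datetime(2021, 3, 2, 4, 55),
    "2021-03-02": datetime(2021, 3, 3, 7, 20),
}

# The table's "commitment": data must land by 06:00 the next day.
sla = time(6, 0)

for ds, landed_at in landings.items():
    status = "on time" if landed_at.time() <= sla else "SLA miss"
    print(f"{ds}: landed {landed_at:%H:%M} -> {status}")
```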

Confluent/ Pinterest: Lessons Learned from Running Apache Kafka at Scale at Pinterest

Pinterest writes up its lessons learned from running Apache Kafka at scale. Broker replacement, partition rebalancing, and cost control are common challenges when running Kafka at scale, and the blog narrates how automation can take over these tasks. Pinterest's Orion is an exciting project to watch.

Confluent: 42 Things You Can Stop Doing Once ZooKeeper Is Gone from Apache Kafka

Confluent writes about how removing the ZooKeeper dependency improves Kafka infrastructure across performance, capacity planning, operations, and monitoring. KIP-500, the proposal to replace ZooKeeper with a self-managed metadata quorum, is an exciting read.

LinkedIn: Solving the data integration variety problem at scale, with Gobblin

The growing number of niche SaaS applications adds complexity to data ingestion into the data warehouse. LinkedIn writes about Apache Gobblin's distinctive approach to building data integration at scale: instead of relying on per-source connectors, its multi-stage protocol and message-format architecture seems an elegant solution to a complex problem, as the sketch below illustrates.
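Here is a minimal sketch of the stage-decomposition idea: only the extract stage is source-specific, so supporting a new source means writing one small extractor rather than a full end-to-end connector. The function names and the toy record are illustrative assumptions, not Gobblin's actual interfaces.

```python
from typing import Callable, Iterable

# A pipeline is just composed stages; the convert and write stages
# are shared across every source.
def run_pipeline(
    extract: Callable[[], Iterable[dict]],
    convert: Callable[[dict], dict],
    write: Callable[[dict], None],
) -> None:
    for record in extract():
        write(convert(record))

def salesforce_extract() -> Iterable[dict]:
    # Stubbed source; a real extractor would page through an API.
    yield {"Id": "001", "Name": "Acme", "AnnualRevenue": 100}

run_pipeline(
    extract=salesforce_extract,
    convert=lambda r: {k.lower(): v for k, v in r.items()},  # normalize field names
    write=lambda r: print("warehouse <-", r),
)
```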

Facebook: Mitigating the effects of silent data corruption at scale

In large-scale infrastructure, files are usually compressed when they are not being read and decompressed when a read request arrives. What happens when decompression fails, and how often does it fail? Facebook writes an exciting blog about its paper on silent data corruption at scale.
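The failure mode is easy to demonstrate: a flipped bit in a compressed blob can make decompression fail loudly, and without an end-to-end checksum a corrupted result could pass silently. A minimal sketch, where the payload and the flipped byte are arbitrary choices of mine:

```python
import hashlib
import zlib

payload = b"important data " * 1000
checksum = hashlib.sha256(payload).hexdigest()  # recorded when the file was written
compressed = zlib.compress(payload)

# Simulate a single flipped bit somewhere in the stored bytes.
corrupted = bytearray(compressed)
corrupted[len(corrupted) // 2] ^= 0x01

for label, blob in (("intact", compressed), ("corrupted", bytes(corrupted))):
    try:
        data = zlib.decompress(blob)
        ok = hashlib.sha256(data).hexdigest() == checksum
        print(label, "->", "checksum ok" if ok else "SILENT corruption: checksum mismatch")
    except zlib.error as exc:
        print(label, "-> loud decompression failure:", exc)
```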

Reddit: Scaling Reporting at Reddit

Reddit writes about its journey scaling its reporting platform from Redis to Apache Druid. The blog discusses the broader limitations of adopting key-value storage for serving analytics, the overhead it puts on application development, and the operational issues caused by unknown bugs.
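One way to see the key-value limitation: every dimension combination you might filter or group by has to be precomputed into its own key space, which grows combinatorially, whereas an OLAP store answers the same questions ad hoc. A toy illustration with made-up dimensions (not Reddit's actual schema):

```python
from itertools import combinations

# In a key-value store, each combination of dimensions needs its
# own precomputed (and backfilled) aggregate key space.
dimensions = ["country", "subreddit", "device", "campaign", "day"]
combos = [c for r in range(1, len(dimensions) + 1)
          for c in combinations(dimensions, r)]
print(len(combos), "key spaces to precompute")  # 31 for just five dimensions

# An OLAP store like Druid answers the same question at query time:
#   SELECT country, device, SUM(impressions)
#   FROM reports GROUP BY country, device
```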

LinkedIn: DataHub Project Updates (February 2021 Edition)

One of the challenges of adopting a modern data stack is that it stays isolated to dashboarding and reporting use cases. It is refreshing to read that the recent LinkedIn DataHub release focuses on adopting GraphQL to ease integration with the broader infrastructure components.
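To show why GraphQL eases integration, here is a sketch of fetching dataset metadata with a single declarative query from any HTTP client. The endpoint path, port, and query shape below are my assumptions about a DataHub-style GraphQL API, not its documented schema.

```python
import json
import urllib.request

# Illustrative only: endpoint and fields are assumptions, not DataHub's
# documented schema; a client asks for exactly the fields it needs.
query = """
query {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,events,PROD)") {
    name
    description
  }
}
"""

req = urllib.request.Request(
    "http://localhost:9002/api/graphql",
    data=json.dumps({"query": query}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```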

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.


Ananth Packkildurai
Data Engineering Weekly

Data Engineer. I write Data Engineering Weekly, a newsletter focused on data engineering. Subscribe at www.dataengineeringweekly.com.