Data Engineering Weekly #28

Ananth Packkildurai
Data Engineering Weekly
5 min read · Feb 7, 2021

This story is a cross-post from Data Engineering Weekly. Please subscribe to the Data Engineering Weekly newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 28th edition of the data engineering newsletter. This week’s edition covers Google’s ML for computer architecture, Microsoft’s comparison of PyTorch and TensorFlow, Capital One’s time-travel offline ML evaluation framework, Alibaba Cloud’s introduction to data lakes, PayPal’s next-gen data movement framework, Apache Pinot’s integration story with Presto, Gradient Flow on the growing importance of metadata, an overview of Metadata Day 2020, Monte Carlo Data on data pipeline SLAs, and TDD with Apache Airflow.

Google: Machine Learning for Computer Architecture

Custom accelerators like Google’s TPU and Edge TPU have significantly advanced ML workloads. To sustain these advances, the hardware accelerator ecosystem must continue to innovate in architecture design and adapt to rapidly evolving ML models and applications. Google AI writes about blending ML into the high-level system specification and architectural design stages, a pivotal contributor to a chip’s overall performance.

Microsoft: A tale of two frameworks: PyTorch vs. TensorFlow

TensorFlow and PyTorch are the two most popular machine learning frameworks. Microsoft writes a comparison article that illustrates the differences between PyTorch and TensorFlow by creating and training two simple models, focusing on how to use dynamic subclassed models with the Module API from PyTorch 1.x and the Module API from TensorFlow 2.x.
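
For a rough illustration (not taken from Microsoft’s post), here is what a minimal dynamic subclassed model looks like in each framework, using PyTorch’s nn.Module and TensorFlow 2.x’s tf.Module; the layer sizes and names are arbitrary:

```python
import torch
import tensorflow as tf


class TorchModel(torch.nn.Module):
    # PyTorch 1.x: subclass nn.Module and define the forward pass imperatively.
    def __init__(self):
        super().__init__()
        self.hidden = torch.nn.Linear(10, 32)
        self.out = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))


class TFModel(tf.Module):
    # TensorFlow 2.x: subclass tf.Module; the Keras layers build lazily on first call.
    def __init__(self):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(32, activation="relu")
        self.out = tf.keras.layers.Dense(1)

    def __call__(self, x):
        return self.out(self.hidden(x))


# Both models can be called eagerly on a batch of 10-dimensional inputs.
torch_pred = TorchModel()(torch.randn(4, 10))
tf_pred = TFModel()(tf.random.normal((4, 10)))
```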

Capital One: Time Travel is Real - Building Offline Evaluation Frameworks

In a typical system design, multiple services act on an entity to change its state. For offline ML model evaluation, adding a temporal view across all the systems’ data and processing, so that the state of customer interactions can be reconstructed at any point in time with a time-travel function, is challenging. Capital One writes an exciting blog post discussing some of the challenges of building such a system, along with a high-level reference architecture.

Alibaba Cloud: Data Lake: Concepts, Characteristics, Architecture, and Case Studies

Alibaba Cloud writes an excellent overview of data lakes. The blog summarizes what a data lake is, the characteristics of a data lake, the differences between the Lambda and Kappa architectural patterns, a comparison of commercially available data lake solutions, and a case study of Huawei’s data lake system design.

PayPal: Next-Gen Data Movement Platform at PayPal

Data that moves is alive and valuable. At rest, data is dead. PayPal writes about its journey to build its next-generation data movement platform. The design principles behind PayPal’s Risk Analytical Dynamic Datasets (RAAD) pipeline, built on top of Apache Gobblin and Apache Airflow, make for an exciting read about a self-serve, unified data platform.

Apache Pinot: Real-time Analytics with Presto and Apache Pinot

Apache Pinot writes a two-part post about Pinot’s integration with Presto. The blog narrates various design choices, the trade-off between latency and flexibility, and Pinot’s aggregation pushdown implementation, which delivers a significant performance improvement.
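
For context, and not from the Pinot post itself, the kind of aggregation query where pushdown pays off can be issued through Presto’s Python client; the coordinator host, the `pinot` catalog name, and the `events` table below are hypothetical:

```python
import prestodb  # presto-python-client

# Connect to a Presto coordinator that has the Pinot connector mounted
# as a catalog named "pinot" (hypothetical deployment details).
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="pinot",
    schema="default",
)

cursor = conn.cursor()
# With aggregation pushdown, the GROUP BY / COUNT work runs inside Pinot
# instead of streaming raw rows back to the Presto workers.
cursor.execute(
    "SELECT country, COUNT(*) AS views "
    "FROM events "
    "GROUP BY country "
    "ORDER BY views DESC "
    "LIMIT 10"
)
for country, views in cursor.fetchall():
    print(country, views)
```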

Gradient Flow: The Growing Importance of Metadata Management Systems

Metadata management is a critical feature of data infrastructure. We’ve seen several technology companies develop internal metadata management systems and share the challenges that led them to focus on metadata, including Airbnb’s Dataportal, Netflix’s Metacat, Uber’s Databook, LinkedIn’s DataHub, Lyft’s Amundsen, WeWork’s Marquez, and Spotify’s Lexikon. Gradient Flow writes an exciting blog about the importance of metadata, the current architectural patterns for metadata management, and the various vendors in the metadata landscape.

Knowledge Technologies: Review: Metadata Day 2020

LinkedIn organized Metadata Day 2020 last December as a general forum to discuss current trends in metadata management. Data Engineering Weekly covered the chronology of metadata management systems built by various companies. Echoing the event’s impact, the author writes an exciting summary of Metadata Day 2020.

Data Engineering Weekly’s Metadata Day Special Edition:

Monte Carlo Data: How to Make Your Data Pipelines More Reliable with SLAs

SLAs, SLOs, and SLIs are widely used to measure the reliability of services. Slack, for instance, provides customer credits if uptime falls below its 99.99% SLA. The author narrates how data pipelines can successfully adopt similar measures to improve data reliability and minimize data downtime.

In case you missed it, I gave a talk a couple of years back about operating data pipelines on Airflow at Slack that covers some best practices for data pipeline reliability.
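
As a minimal sketch of the idea, and not the approach from the Monte Carlo post, a per-task SLA and a miss callback can be declared directly in Apache Airflow 2.x; the DAG, task, and alerting function below are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the Airflow scheduler when any task in the DAG misses its SLA;
    # in practice this would page the on-call or post to a chat channel (hypothetical).
    print(f"SLA missed for: {task_list}")


with DAG(
    dag_id="daily_orders",                    # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    sla_miss_callback=notify_sla_miss,
    catchup=False,
) as dag:
    # The task must finish within 2 hours of the scheduled time, or an SLA miss is recorded.
    load_orders = BashOperator(
        task_id="load_orders",
        bash_command="python load_orders.py",  # hypothetical job
        sla=timedelta(hours=2),
    )
```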

Marcos Marx: How to develop data pipeline in Airflow through TDD (test-driven development)

Continuous integration and testing are a vital part of improving the productivity of data pipeline development. One of Apache Airflow’s critical success factors is the ability to write and test data pipelines programmatically. The author writes an exciting blog walking through the steps to enable test-driven development on a data pipeline using Apache Airflow.
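
As a minimal sketch of where such a TDD workflow usually starts, and not the exact steps from the post, a DAG integrity test with pytest might look like this; the `daily_orders` DAG id and task names are hypothetical:

```python
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dagbag():
    # Load every DAG file in the project's dags/ folder exactly as the scheduler would.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dagbag):
    # Any syntax error or missing dependency in a DAG file shows up here.
    assert dagbag.import_errors == {}


def test_daily_orders_dag_structure(dagbag):
    # Assert the pipeline under test exists and contains the expected task (hypothetical names).
    dag = dagbag.get_dag("daily_orders")
    assert dag is not None
    assert "load_orders" in dag.task_ids
```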

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
