Data Engineering Weekly #28

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 28th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Google’s ML for computer architecture, Microsoft’s PyTorch vs. TensorFlow, Capital One’s Time travel offline ML evaluation frameworks, Alibaba Cloud’s Data Lake introduction, PayPal’s Next-Gen data movement framework, Apache Pinot’s integration story with Presto, Gradient Flow’s growing importance of Metadata, Metadata Day 2020 overview, Monte Carlo Data’s data pipeline SLA, and TDD with Apache Airflow.

Google: Machine Learning for Computer Architecture

Custom accelerators like Google TPU and Edge TPU have significantly advanced ML workloads. To sustain these advances, the hardware accelerator ecosystem must continue to innovate in architecture design and adapt to rapidly evolving ML models and applications. Google AI writes about blending ML into the high-level system specification and architectural design stage, a pivotal contributing factor to the chip’s overall performance.

Microsoft: A tale of two frameworks: PyTorch vs. TensorFlow

TensorFlow and PyTorch are the two most popular machine learning frameworks. Microsoft writes a comparison article that illustrates the differences between PyTorch and TensorFlow by creating and training two simple models, focusing on how to use dynamic subclassed models with the Module API from PyTorch 1.x and the Module API from TensorFlow 2.x.

Capital One: Time Travel is Real-Building Offline Evaluation Frameworks

In a typical system design, multiple services act on an entity to change its state. For offline ML model evaluation, it is challenging to build a temporal view of all the systems’ data and processing so that a time travel function can identify the state of customer interactions at any point in time. Capital One writes an exciting blog post discussing some of the challenges of building such a system, with a high-level reference architecture.
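To make the "time travel" idea concrete, here is a minimal sketch of an as-of lookup over an entity's state changes. It is not taken from the Capital One post: the `EntityHistory` class, its method names, and the sample fields (`status`, `limit`) are hypothetical, and it assumes events arrive in time order and fit in memory.

```python
from bisect import bisect_right
from datetime import datetime


class EntityHistory:
    """Append-only, time-ordered log of state changes for one entity (hypothetical)."""

    def __init__(self):
        self._times = []   # event timestamps, kept in ascending order
        self._states = []  # full state snapshot after each event

    def record(self, ts, changes):
        """Apply `changes` on top of the latest state, effective at time `ts`."""
        if self._times and ts < self._times[-1]:
            raise ValueError("events must be recorded in time order")
        latest = self._states[-1] if self._states else {}
        self._times.append(ts)
        self._states.append({**latest, **changes})

    def as_of(self, ts):
        """Return the state as it was at time `ts`, or None before the first event."""
        idx = bisect_right(self._times, ts)
        return dict(self._states[idx - 1]) if idx else None


# Several services change the same customer record over time.
history = EntityHistory()
history.record(datetime(2021, 1, 1), {"status": "applied"})
history.record(datetime(2021, 1, 5), {"status": "approved", "limit": 1000})
history.record(datetime(2021, 1, 9), {"limit": 2500})

# An offline evaluation asks: what did the model see on January 6?
state_jan_6 = history.as_of(datetime(2021, 1, 6))
print(state_jan_6)
```

Storing a full snapshot per event keeps the lookup to a single binary search; a real system would more likely store deltas or partitioned snapshots and would have to handle late-arriving events, which this sketch rejects.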

Alibaba Cloud: Data Lake: Concepts, Characteristics, Architecture, and Case Studies

Alibaba Cloud writes an excellent overview of data lakes. The blog summarizes what a data lake is, the characteristics of a data lake, the differences between the Lambda and Kappa architectural patterns, a comparison of commercially available data lake solutions, and a case study from the Huawei Data Lake system design.

PayPal: Next-Gen Data Movement Platform at PayPal

Data that moves is alive and valuable. At rest, data is dead. PayPal writes about its journey to build the next generation of its data movement platform. The design principles behind PayPal’s Risk Analytical Dynamic Datasets (RAAD) pipeline, built on top of Apache Gobblin and Apache Airflow, make an exciting read about a self-service unified data platform.

Apache Pinot: Real-time Analytics with Presto and Apache Pinot

Apache Pinot writes a two-part post about Pinot’s integration with Presto. The blog narrates various design choices and the trade-off between latency and flexibility, and discusses Pinot’s aggregation pushdown implementation, which brings a significant performance improvement.

Gradient Flow: The Growing Importance of Metadata Management Systems

Metadata management is a critical feature of data infrastructure. We’ve seen several technology companies develop internal metadata management systems and share the challenges that led them to focus on metadata, including Airbnb’s Data portal, Netflix’s Metacat, Uber’s Databook, LinkedIn’s Datahub, Lyft’s Amundsen, WeWork’s Marquez, and Spotify’s Lexikon. Gradient Flow writes an exciting blog about the importance of metadata, the current architectural patterns for metadata management, and the various vendors in the metadata landscape.

Knowledge Technologies: Review: Metadata Day 2020

LinkedIn organized Metadata Day 2020 last December as a general forum to discuss current trends in metadata management. Data Engineering Weekly wrote up a chronology of the metadata management systems built by various companies. Continuing to echo Metadata Day’s impact, the author writes an exciting summary of Metadata Day 2020.

Data Engineering Weekly’s Metadata Day Special Edition:

Monte Carlo Data: How to Make Your Data Pipelines More Reliable with SLAs

SLAs, SLOs, and SLIs are widely used to measure the reliability of services. Slack, for instance, provides customer credit if uptime falls below its 99.99% SLA. The author narrates how a data pipeline can adopt similar measures to improve data reliability and minimize data downtime.
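As a rough illustration of how these terms map onto a data pipeline, the sketch below computes a freshness SLI (the fraction of runs that succeeded and landed on time) and checks it against an SLO target. The `PipelineRun` fields, the 60-minute delay budget, the 95% target, and the sample runs are all assumptions made up for this example, not figures from the Monte Carlo Data article.

```python
from dataclasses import dataclass


@dataclass
class PipelineRun:
    run_id: str
    delay_minutes: float  # how late the data landed relative to its expected time
    succeeded: bool


def freshness_sli(runs, max_delay_minutes):
    """SLI: fraction of runs that succeeded and landed within the delay budget."""
    if not runs:
        raise ValueError("need at least one run to compute an SLI")
    good = sum(
        1 for r in runs if r.succeeded and r.delay_minutes <= max_delay_minutes
    )
    return good / len(runs)


def meets_slo(sli, slo_target):
    """SLO check: is the measured indicator at or above the internal target?"""
    return sli >= slo_target


# Hypothetical week of daily runs.
runs = [
    PipelineRun("2021-01-01", 5, True),
    PipelineRun("2021-01-02", 12, True),
    PipelineRun("2021-01-03", 95, True),   # succeeded, but too late
    PipelineRun("2021-01-04", 0, False),   # failed outright
    PipelineRun("2021-01-05", 20, True),
]

sli = freshness_sli(runs, max_delay_minutes=60)
slo_ok = meets_slo(sli, slo_target=0.95)
print(f"SLI={sli:.2%}, SLO met: {slo_ok}")
```

The SLA would then be the external commitment (and its consequences) layered on top of the SLO; the same pattern extends to other indicators such as row-count or schema checks.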

In case you missed it, I gave a talk a couple of years back about operating data pipelines on Airflow @Slack; it contains some best practices on data pipeline reliability.

Marcos Marx: How to develop data pipeline in Airflow through TDD (test-driven development)

Continuous integration and testing are a vital part of improving productivity when developing a data pipeline. One of the critical factors in Apache Airflow’s success is the ability to write and test a data pipeline programmatically. The author writes an exciting blog walking through the steps to enable test-driven development of data pipelines using Apache Airflow.
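The test-first loop can be sketched without any Airflow-specific code: write the test for a task's logic first, then the plain Python function that satisfies it, and only afterwards wire that function into a DAG. The example below is an assumption-laden illustration rather than the author's walkthrough; `dedupe_latest` and its row fields are hypothetical, and the function stands in for whatever callable a task would run.

```python
import unittest


def dedupe_latest(rows):
    """Keep the most recent row per user_id (hypothetical transform step)."""
    latest = {}
    for row in rows:
        current = latest.get(row["user_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["user_id"]] = row
    return sorted(latest.values(), key=lambda r: r["user_id"])


class DedupeLatestTest(unittest.TestCase):
    """In TDD these tests are written before dedupe_latest exists."""

    def test_keeps_newest_row_per_user(self):
        rows = [
            {"user_id": 1, "updated_at": "2021-01-01", "plan": "free"},
            {"user_id": 2, "updated_at": "2021-01-02", "plan": "pro"},
            {"user_id": 1, "updated_at": "2021-01-03", "plan": "pro"},
        ]
        self.assertEqual(
            dedupe_latest(rows),
            [
                {"user_id": 1, "updated_at": "2021-01-03", "plan": "pro"},
                {"user_id": 2, "updated_at": "2021-01-02", "plan": "pro"},
            ],
        )

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(dedupe_latest([]), [])


# Run the suite programmatically so the script keeps going after the tests.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DedupeLatestTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the transform free of Airflow imports means the tests run quickly in continuous integration; DAG-level checks, such as whether the DAG file imports cleanly, can be added as a separate layer.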

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.


Ananth Packkildurai

Data Engineer. I write data engineering weekly; the weekly newsletter focused on data engineering. Subscribe at www.dataengineeringweekly.com.
