Data Engineering Weekly #2

Published in Data Engineering Weekly

4 min read · Feb 3, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Apache Pinot is gaining momentum as a realtime OLAP system for data engineering needs. In this blog post, Sapient narrates its experience benchmarking Apache Pinot. An ingestion rate crossing 120k entries/second on a single node is impressive.

Tasted Apache Pinot and we Loved it!

We were pioneering the personalized experience for our client and we were on look-out for a time series database to fit…

medium.com

Netflix open-sourced Metaflow (metaflow.org) in December 2019. Metaflow follows a layered architecture to run data workloads, in contrast to Airflow’s tightly coupled scheduler architecture. In this post, Netflix explains how it integrated the scheduler layer with AWS Step Functions.

Unbundling Data Science Workflows with Metaflow and AWS Step Functions

by David Berg, Ravi Kiran Chirravuri, Romain Cledat, Jason Ge, Savin Goyal, Ferras Hamad, Ville Tuulos

netflixtechblog.com

The Airflow operator represents a single idempotent task. Operators determine what executes when your DAG runs. One drawback of operators is that Airflow has no explicit inter-operator communication, so there is no easy way to pass messages between operators. The AIP-31 proposal adopts a functional DAG abstraction to hide that complexity. The following article explains how the functional definition can solve inter-operator communication.

AIP-31 — Airflow Functional DAG Definition

Intro — AIP-31

medium.com
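To make the idea behind the AIP-31 summary concrete, here is a toy sketch in plain Python of what a functional DAG definition buys you: tasks are ordinary functions, and passing one task's return value into another both wires the dependency and carries the data. This is not Airflow's API. `ToyDag`, `TaskResult`, and the `extract`/`transform`/`load` tasks are hypothetical names invented for illustration only.

```python
from typing import Any, Callable, Dict, List, Tuple


class TaskResult:
    """Placeholder for the future output of a task (hypothetical helper)."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToyDag:
    """Toy DAG that infers dependencies from function arguments. Not Airflow."""

    def __init__(self) -> None:
        # task name -> (function, positional arguments given at definition time)
        self.tasks: Dict[str, Tuple[Callable[..., Any], tuple]] = {}
        # task name -> names of upstream tasks, inferred from the arguments
        self.upstream: Dict[str, List[str]] = {}

    def task(self, fn: Callable[..., Any]) -> Callable[..., TaskResult]:
        def register(*args: Any) -> TaskResult:
            name = fn.__name__
            self.tasks[name] = (fn, args)
            self.upstream[name] = [a.name for a in args if isinstance(a, TaskResult)]
            return TaskResult(name)

        return register

    def run(self) -> Dict[str, Any]:
        results: Dict[str, Any] = {}
        # Tasks are stored in registration order, so upstream results already exist.
        for name, (fn, args) in self.tasks.items():
            resolved = [results[a.name] if isinstance(a, TaskResult) else a for a in args]
            results[name] = fn(*resolved)
        return results


dag = ToyDag()


@dag.task
def extract():
    return [1, 2, 3]


@dag.task
def transform(rows):
    return [r * 10 for r in rows]


@dag.task
def load(rows):
    return sum(rows)


# Composing the calls declares the dependencies and the data flow at once.
load(transform(extract()))
results = dag.run()
print(dag.upstream)
print(results["load"])
```

The point of the sketch is the last few lines: no explicit message-passing calls appear in the task bodies, yet each task receives the output of its upstream task.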

Python has become the de facto language for data science workloads. The Apache Spark community continually improves the performance of PySpark. Pinterest writes about the data infrastructure that empowers its data science workloads. The design approach of isolating the Python environment for each workload and the use of SparkMagic make for an exciting read.

Empowering Pinterest data scientists and machine learning engineers with PySpark

Tien T. Nguyen | Machine Learning Platform, Jingge Zhou | Machine Learning Platform, Zirui Li | Data Processing…

medium.com

Cost optimization is an essential engineering discipline in cloud computing. Netflix writes about its cost optimization platform in this blog post. The highlight of the post is the automated TTL recommendations, issued only for tables with material cost-saving potential.

Byte Down: Making Netflix’s Data Infrastructure Cost-Effective

By Torio Risianto, Bhargavi Reddy, Tanvi Sahni, Andrew Park

netflixtechblog.com

Square writes about using Amundsen to support users’ privacy. The post narrates the challenges of labeling columns that hold sensitive data and the use of Google’s Cloud Data Loss Prevention tool.

Using Amundsen to Support User Privacy via Metadata Collection at Square

When I started at Square, one of our primary privacy challenges was that we needed to scale and automate insights into…

developer.squareup.com

Spark 3.0 brought many improvements to Spark SQL. The article explains the internals of the Spark SQL execution plan and how to interpret the query plan to optimize execution.

Mastering Query Plans in Spark 3.0

Spark query plans in a nutshell.

towardsdatascience.com

Structured Streaming was initially introduced in Apache Spark 2.0. It has proven to be the best platform for building distributed stream processing applications. The article narrates how to troubleshoot streaming performance using the Spark 3.0 UI.

How to Better Monitor Streaming Queries with Spark 3.0 Structured Streaming

This is a guest community post from Genmao Yu, a software engineer at Alibaba. Structured Streaming was initially…

databricks.com

Support for running Spark on Kubernetes was added in version 2.3, and Spark-on-k8s adoption has been accelerating ever since. The lack of an external shuffle service is one of the drawbacks of adopting Spark on Kubernetes. Spark 3.0 added support for soft dynamic allocation to mitigate the issue. The benchmark in the blog shows the performance gap between Spark on K8s and Spark on YARN narrowing.

Performance of Apache Spark on Kubernetes has caught up with YARN

Learn our benchmark setup, results, as well as critical tips to make shuffles up to 10x faster when running on…

towardsdatascience.com
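As a rough illustration of the soft dynamic allocation mentioned above, the sketch below collects the relevant Spark 3.0 settings in a Python dictionary and renders them as `spark-submit` flags. The two `enabled` keys are standard Spark configuration properties. The executor bounds are illustrative values chosen here, not numbers from the benchmark, and `to_spark_submit_args` is a hypothetical helper.

```python
from typing import Dict, List

# Shuffle-tracking ("soft") dynamic allocation, available since Spark 3.0.
# It lets executors scale without an external shuffle service.
DYNAMIC_ALLOCATION_CONF: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Illustrative bounds only; tune them for your own workload.
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dictionary as a flat list of --conf key=value arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_spark_submit_args(DYNAMIC_ALLOCATION_CONF)
print(" ".join(submit_args))
```

The resulting list can be appended to a `spark-submit` command line. The same keys can also be set in `spark-defaults.conf` or on a `SparkConf` object.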

The modern data platform is moving from the traditional data warehouse to the data lake and on to the data mesh. This blog post offers blueprint guidance on how to move a data warehouse to the data mesh world. The blog focuses on Google Cloud offerings, but the concepts are still applicable to any cloud infrastructure.

Building a Data Platform to Enable Analytics and AI-Driven Innovation

Build a Data Mesh & Set up MLOps

medium.com

There has been growing interest across the industry lately in getting better control over one’s data ecosystem and improving its operational efficiency. Following Amundsen (Lyft), DataHub (LinkedIn), Databook (Uber), and Metacat (Netflix), Criteo published its internal data discovery system, DataDoc.

DataDoc — The Criteo Data Observability Platform

How we regained control on our data ecosystem and tackled governance issues.

medium.com

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Written by Ananth Packkildurai