Data Engineering Weekly #32

Published in

Data Engineering Weekly

4 min read · Mar 13, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 32nd edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Picnic’s Data Vault modeling, Mihail Eric’s case for why we need more data engineers, Microsoft’s onboarding checklist for data scientists, Netflix’s data movement with Google services, Redpoint Ventures’ data feedback loops with SaaS applications, DoorDash’s declarative real-time feature engineering, Uber’s application of ML to internal auditing, Pinterest’s ML techniques to fight misinformation, Monte Carlo’s new data quality rules, and Anna Anisienia’s take on the Airflow 2.0 TaskFlow API.

Let’s start this week with some fun but also the sad reality of the data engineering journey.

Picnic: Data vault - new weaponry in your data science toolkit

Emerging cloud data warehouses and the structured data approach bring back the importance of data modeling techniques like Data Vault and the Kimball methodology. Picnic writes an exciting read on how it uses these data modeling techniques on top of Snowflake to enable historical data access, time travel through historical data, and integration with the real-time pipeline.

Data vault: new weaponry in your data science toolkit

Picnic is an online grocery, where we aim to bring grocery shopping and the traditional Milkman model into the 21st…

blog.picnic.nl

It is an exciting area of study. I'm wondering how traditional data modeling techniques go hand-in-hand with modern data engineering principles such as immutable, idempotent data pipelines and data versioning. If you have thoughts, let's connect and discuss.

LinkedIn | Twitter
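One way to think about the question above is that a Data Vault load can itself be written as an insert-only, replay-safe step. The following is a minimal sketch under stated assumptions, not Picnic's implementation: the in-memory `hub` and `satellite` structures, the function names, and the `customer_id` field are all hypothetical, and a real warehouse would use tables rather than Python containers.

```python
import hashlib
from datetime import datetime


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_key(business_key: str) -> str:
    """Deterministic hub key derived from a normalised business key."""
    return _digest(business_key.strip().upper())


def hash_diff(attrs: dict) -> str:
    """Fingerprint of the descriptive attributes, used to detect changes."""
    return _digest("||".join(f"{k}={attrs[k]}" for k in sorted(attrs)))


def load_customer(hub: dict, satellite: list, record: dict, load_ts: datetime) -> None:
    """Insert-only load: nothing is updated or deleted (hypothetical helper)."""
    hk = hash_key(record["customer_id"])
    # Hub: one row per business key, written only the first time it is seen.
    hub.setdefault(hk, {"customer_id": record["customer_id"], "load_ts": load_ts})
    # Satellite: append a new version only if attributes differ from the latest one.
    attrs = {k: v for k, v in record.items() if k != "customer_id"}
    diff = hash_diff(attrs)
    latest = next((row for row in reversed(satellite) if row["hk"] == hk), None)
    if latest is None or latest["hash_diff"] != diff:
        satellite.append({"hk": hk, "hash_diff": diff, "load_ts": load_ts, "attrs": attrs})


def as_of(satellite: list, hk: str, ts: datetime):
    """Time travel: the attribute version that was current at `ts`, or None."""
    rows = [r for r in satellite if r["hk"] == hk and r["load_ts"] <= ts]
    return max(rows, key=lambda r: r["load_ts"]) if rows else None
```

Because the keys are derived deterministically and rows are only appended when something changed, replaying the same batch leaves the hub and satellite untouched, and earlier versions stay queryable through `as_of`. Replaying an old record after a newer one has landed would still append a version in this sketch, so ordering guarantees remain the pipeline's job.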

mihaileric.com: We Don't Need Data Scientists, We Need Data Engineers

It is interesting to see how data practitioners’ jobs (data and ML engineering) are distributed across companies, since it reveals the data domain’s emerging patterns. Though the author analyzed only a small set of YC startups, the underlying observation is worth noting. Modern ML frameworks like TensorFlow and PyTorch have industrialized machine learning, but data collection, cleaning, and labeling remain unindustrialized and require manual work for the most part.

https://www.mihaileric.com/posts/we-need-data-engineers-not-data-scientists/

Netflix: Data movement for Google services at Netflix

Business operations teams use multiple SaaS tools to run a business unit effectively. This brings many challenges, such as data access control, lineage tracking, and integration with other business operations. Netflix writes an exciting blog post highlighting how it tackles these challenges using a proxy service for Google Workspace app integrations.

https://netflixtechblog.medium.com/data-movement-for-google-services-at-netflix-9a77ca69f7c4

Redpoint Ventures: The Feedback Loops in Data that Will Change SaaS Architecture

As we noticed in Netflix’s Google Workspace integration journey, it’s an increasingly common pattern for an enterprise to contribute and leverage data from SaaS applications to meet its business goals. The author captures the feedback loop of data flowing across SaaS applications. It is an exciting space to watch.

The Feedback Loops in Data that Will Change SaaS Architecture

About a year ago, I wrote a post on the hub and spoke data model. The idea is that in the future SaaS applications…

www.linkedin.com

DoorDash: Building Riviera: A Declarative Real-Time Feature Engineering Framework

ML models play a significant role in improving the users’ experience. As a result, an efficient feature engineering framework is a critical part of the ML infrastructure. DoorDash writes an exciting blog post that narrates the importance of a near-real-time feature store for enriching the customer experience and how its Flink-as-a-service platform helps fulfill that mission.

Building A Declarative Real-Time Feature Engineering Framework

Allen Wang Kunal Shah In a business with fluid dynamics between customers, drivers, and merchants, real-time data helps…

doordash.engineering

Uber: Applying Machine Learning in Internal Audit with Sparsely Labeled Data

Machine learning continues to evolve, transforming the industries it touches. Uber narrates one such transformation: how ML helps its internal audit system answer questions such as the number of agents per country, the number of transactions, the total cash paid, and how these evolved over the past three years. It’s no surprise to see data availability and data labeling, rather than ML model development, mentioned as the most significant challenges.

https://eng.uber.com/ml-internal-audit/

Pinterest: How Pinterest fights misinformation, hate speech, and self-harm content with machine learning

Providing a safe and secure experience, free from health misinformation, hate speech, self-harm, and graphic violence, is a significant challenge for social platforms. Pinterest narrates the ML-driven architecture that empowers its system to automatically detect unsafe content before it’s reported.

How Pinterest fights misinformation, hate speech, and self-harm content with machine learning

Using the latest in machine learning to eliminate harmful content

medium.com

Monte Carlo: The New Rules of Data Quality

Historically, data quality checks focused on siloed, data producer-driven testing, which is essentially equivalent to unit testing. Is a unit test enough for data testing? The blog narrates some of the principles to follow when engineering data quality and emphasizes that data quality is a collective responsibility.

The New Rules of Data Quality

Introducing a better way to manage data quality at scale with testing and observability.

towardsdatascience.com
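To make the testing versus observability distinction concrete, here is a minimal sketch, assuming rows arrive as plain dictionaries and daily row counts are available as a list. The function names and the z-score threshold are hypothetical illustrations and not the method described in the Monte Carlo article.

```python
from statistics import mean, stdev


def null_rate_ok(rows: list, column: str, max_null_rate: float = 0.0) -> bool:
    """Producer-side test: a fixed rule known in advance, much like a unit test."""
    if not rows:
        return True
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows) <= max_null_rate


def volume_is_anomalous(history: list, today: int, z_threshold: float = 3.0) -> bool:
    """Observability-style check: flag a row count far from its recent history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold
```

The first function only catches problems the producer thought to write a rule for. The second needs no rule about the data's content, so a downstream team can run it on a table it does not own, which is one way the responsibility becomes shared.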

Anna Anisienia: TaskFlow API in Apache Airflow 2.0 — Should You Use It?

A data pipeline is more than a unit of execution; it often needs to share its state with downstream jobs to make pipelines composable. The blog narrates some of the TaskFlow API’s pros and cons and their practical implications, and raises some interesting points on data transformation vs. orchestration.

TaskFlow API in Apache Airflow 2.0 — Should You Use It?

Think twice before redesigning your Airflow data pipelines