Data Engineering Weekly #4

Ananth Packkildurai
Data Engineering Weekly
Feb 5, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the fourth edition of the data engineering newsletter. This week’s edition features a new set of articles on data orchestration, ML applications, tuning data workloads, and Kafka on Kubernetes.

Airflow is a huge step forward over loosely coupled cron jobs for running data pipelines. Dagster, a data-aware, typed, self-describing, logical orchestration graph, takes data orchestration to the next level by focusing on local development, testable code before production, and linking data assets to the code that produced them. The focus on data dependencies, rather than pure execution dependencies, is a data engineer’s dream come true.
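To make the model concrete, here is a minimal sketch using Dagster’s solid/pipeline API as it stood at the time of writing; the solids and data are illustrative, not taken from the Dagster post:

```python
# A minimal sketch of Dagster's typed, data-aware orchestration graph.
# The solids and sample data here are hypothetical.
from dagster import execute_pipeline, pipeline, solid

@solid
def extract_users(context) -> list:
    # Illustrative extraction step; returns raw records.
    context.log.info("extracting users")
    return [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]

@solid
def count_users(context, users: list) -> int:
    # Declares a *data* dependency on extract_users' typed output,
    # not just an execution-order dependency.
    return len(users)

@pipeline
def user_pipeline():
    count_users(extract_users())

if __name__ == "__main__":
    # Runs entirely locally, which is what makes the code testable
    # before it ever reaches production.
    result = execute_pipeline(user_pipeline)
    assert result.success
```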

Vimeo writes a post on its video social analytics infrastructure built with Apache Spark. The major challenge is integrating with external APIs guarded by strict rate limits. The practical use of micro-batching to work around the rate limits and decouple the application logic from the API data sourcing is a pragmatic approach and an exciting read.
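As a rough sketch of the pattern: split the workload into API-sized chunks and throttle the calls. The client function, batch size, and rate limit below are all hypothetical, not Vimeo’s actual code:

```python
# A hedged sketch of micro-batching around a rate-limited external API.
import time
from typing import Iterator, List

def fetch_stats(batch: List[str]) -> List[dict]:
    # Stand-in for one call to the rate-limited external API.
    return [{"video_id": v, "plays": 0} for v in batch]

def micro_batches(ids: List[str], batch_size: int = 50) -> Iterator[List[str]]:
    # Split a large id list into API-sized chunks.
    for i in range(0, len(ids), batch_size):
        yield ids[i:i + batch_size]

def fetch_all(ids: List[str], max_calls_per_sec: float = 2.0) -> List[dict]:
    results: List[dict] = []
    for batch in micro_batches(ids):
        results.extend(fetch_stats(batch))    # one API call per micro-batch
        time.sleep(1.0 / max_calls_per_sec)   # stay under the rate limit
    return results
```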

Slack writes about its ML infrastructure to prevent spam invites. The key takeaway is the simplicity of the approach and its focus on the operational aspects of the ML application.

Koalas is an open-source project that provides a drop-in replacement for pandas, focusing on scalability. Databricks writes a post on how Koalas and PySpark can work together effectively.
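For a flavor of the drop-in experience, here is a minimal sketch; it assumes a local Spark runtime and the databricks-koalas package (the project later merged into Spark as pyspark.pandas):

```python
# A minimal sketch of Koalas as a drop-in pandas replacement.
import databricks.koalas as ks

# Familiar pandas-style constructor and API...
kdf = ks.DataFrame({"user": ["a", "b", "a"], "views": [3, 1, 4]})

# ...but the aggregation is executed by Spark under the hood,
# so the same code scales past a single machine's memory.
print(kdf.groupby("user")["views"].sum())
```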

EMR is a widely used big data service from AWS. Monitoring Amazon EMR clusters is essential to detect critical issues with the applications or infrastructure in real time and identify root causes quickly. AWS writes about how to integrate EMR metrics with Prometheus and the surrounding monitoring ecosystem, such as Grafana for dashboarding and AWS SNS for notifications and alerts.

Buy vs. build is always on the table when it comes to stream processing, considering the complexity of the systems involved. Apache Kafka and AWS Kinesis are the leading contenders among message brokers. It’s (not) surprising that Apache Kafka is still years ahead in stream processing.

Strimzi is an open-source CNCF sandbox project focused on running Apache Kafka on Kubernetes, providing container images for Apache Kafka itself, ZooKeeper, and the other components that make up the Strimzi ecosystem. The blog post walks through how to move an Apache Kafka workload to Kubernetes.

The Apache Kafka consumer follows a single-threaded processing model, with one partition consumed per thread. The model simplifies ordering and processing guarantees when handling a stream of events. The downside of the approach is that it often underutilizes the CPU. Confluent writes a blog post on how to implement multi-threaded message consumption with the Apache Kafka consumer and the challenges around it.
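To illustrate the basic idea, here is a simplified Python sketch using confluent-kafka that decouples polling from processing with a thread pool. The Confluent post uses the Java consumer with finer-grained offset and pause/resume management; the broker, topic, and processing logic below are placeholders:

```python
# A simplified sketch of decoupling consumption from processing.
from concurrent.futures import ThreadPoolExecutor
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "demo-group",
    "enable.auto.commit": False,            # commit only after processing
})
consumer.subscribe(["events"])              # placeholder topic

def process(msg):
    # CPU-heavy work runs here, off the polling thread.
    return len(msg.value())

with ThreadPoolExecutor(max_workers=8) as pool:
    while True:
        msgs = consumer.consume(num_messages=100, timeout=1.0)
        if not msgs:
            continue
        # Fan the batch out to worker threads to use more cores...
        futures = [pool.submit(process, m) for m in msgs if m.error() is None]
        for f in futures:
            f.result()
        # ...then commit once the whole batch is done. This buys CPU
        # utilization at the cost of per-partition ordering within a batch,
        # one of the trade-offs the Confluent post digs into.
        consumer.commit(asynchronous=False)
```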

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.
