Data Engineering Weekly #3

Published in

Data Engineering Weekly

4 min read · Feb 3, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

I’m excited to read about a GPU-accelerated streaming platform this week. NVIDIA writes about cuStreamz, the first GPU-accelerated streaming data processing library. Written in Python, it is built on top of RAPIDS, the GPU accelerator for data science libraries.

cuStreamz: More Event Stream Processing for Less with NVIDIA GPUs and RAPIDS Software

One can view cuStreamz as a bridge that connects Python-Streaming and GPUs — with sophisticated and reliable streaming…

medium.com

Continuing with GPU-accelerated stream processing, Apache Flink 1.11 introduces a new External Resource Framework, which allows you to request external resources from the underlying resource management systems (e.g., Kubernetes) and accelerate your workload with those resources. The blog post explains how to integrate the GPU plugin, which can help build an end-to-end real-time AI workflow.

Accelerating your workload with GPU and other external resources

06 Aug 2020 Yangze Guo Apache Flink 1.11 introduces a new External Resource Framework, which allows you to request…

flink.apache.org

LinkedIn writes about learnings from a recent Hadoop incident. All modern workflow schedulers support retries, but this built-in fault tolerance can hide the resource cost of unstable infrastructure. Though the article focuses on HDFS data loss, the theory applies to all parts of the data pipeline.

Theory vs. Practice: Learnings from a recent Hadoop incident

Our disaster recovery strategy has largely been focused around replicating data from a cluster in the event of data…

engineering.linkedin.com
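To make the hidden cost of retries concrete, here is a minimal sketch that adds up the compute spent on failed attempts. The `Attempt` record, the task names, and the CPU figures are all hypothetical; they are not taken from the LinkedIn article or from any particular scheduler's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One execution attempt of a task (hypothetical record shape)."""
    task_id: str
    cpu_seconds: float
    succeeded: bool


def retry_overhead(attempts):
    """Return (useful, wasted, wasted_fraction) in CPU seconds.

    Work done by failed attempts is counted as wasted, because the
    scheduler silently repeats it on retry.
    """
    useful = sum(a.cpu_seconds for a in attempts if a.succeeded)
    wasted = sum(a.cpu_seconds for a in attempts if not a.succeeded)
    total = useful + wasted
    fraction = wasted / total if total else 0.0
    return useful, wasted, fraction


# Illustrative data: one task failed once before succeeding on retry.
attempts = [
    Attempt("load_events", 120.0, False),
    Attempt("load_events", 130.0, True),
    Attempt("aggregate", 50.0, True),
]

useful, wasted, fraction = retry_overhead(attempts)
print(f"useful={useful:.0f}s wasted={wasted:.0f}s ({fraction:.0%} of total)")
```

The pipeline above still reports success, yet a sizable share of the compute went to an attempt that produced nothing. Tracking this number per task is one way to surface instability that retries would otherwise mask.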

Tencent wrote a guest post about its Apache Kafka infrastructure, which handles 10 trillion+ messages per day. Federated Kafka clusters with logical topic mapping are emerging as a design pattern for handling large-scale Kafka infrastructure. The proxy approach for consumers contrasts with the Kafka consumer SDK approach.

How Tencent PCG Scales Massive Data Pipelines with Apache Kafka

As one of the world's biggest internet-based platform companies, Tencent uses technology to enrich the lives of users…

www.confluent.io
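The logical topic mapping idea can be sketched in a few lines. This is an illustration under my own assumptions, not Tencent's implementation: the `LogicalTopicRegistry` class, the cluster names, and the rule of laying physical partitions end to end are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalTopic:
    cluster: str      # hypothetical cluster name
    topic: str        # physical topic name on that cluster
    partitions: int   # number of partitions in the physical topic


class LogicalTopicRegistry:
    """Maps one logical topic onto physical topics in several clusters."""

    def __init__(self):
        self._mapping = {}

    def register(self, logical_name, physical_topics):
        self._mapping[logical_name] = list(physical_topics)

    def total_partitions(self, logical_name):
        return sum(pt.partitions for pt in self._mapping[logical_name])

    def route(self, logical_name, logical_partition):
        """Translate a logical partition into (cluster, topic, partition)."""
        if logical_partition < 0:
            raise IndexError("logical partition must be non-negative")
        offset = logical_partition
        # Walk the physical topics in registration order, treating their
        # partitions as one contiguous logical range.
        for pt in self._mapping[logical_name]:
            if offset < pt.partitions:
                return pt.cluster, pt.topic, offset
            offset -= pt.partitions
        raise IndexError(f"{logical_name} has no partition {logical_partition}")


registry = LogicalTopicRegistry()
registry.register("clicks", [
    PhysicalTopic("kafka-a", "clicks-part-1", 4),
    PhysicalTopic("kafka-b", "clicks-part-2", 4),
])
print(registry.route("clicks", 5))
```

Clients only ever see the logical name `clicks`. Adding capacity then means registering another physical topic on a new cluster rather than repartitioning an existing one.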

eBay writes about the Terapeak Research 2.0 platform, based on Apache Kafka and Elasticsearch. The article narrates its approach to building a fault-tolerant pipeline. The primary/secondary consumer pattern is something new to me.

Terapeak Research 2.0 - Making the Data Processing Pipeline Robust

Earlier this year in our Seller Hub, we added a new feature called Terapeak Product Research to help eBay sellers…

tech.ebayinc.com

Patterns of Distributed Systems is a refreshing read on system design. Data infrastructure engineers deal with multiple distributed systems, and the article is an exciting way to approach their design abstractly.

Patterns of Distributed Systems

Unmesh Joshi Unmesh Joshi is a Principal Consultant at ThoughtWorks. He is a software architecture enthusiast, who…

martinfowler.com

The COVID-19 outbreak has changed the landscape of many businesses and personal lives. The Expedia data visualization group writes a fantastic article about how it monitors local restrictions to predict when customers will want to travel again, as well as employees’ well-being.

How Expedia Group Is Monitoring Market Recovery During covid-19

Providing clarity during a pandemic with data visualization

medium.com

Source-to-destination validation is an essential step in an ETL pipeline. Direct Energy rewrote over 350 SQL Server stored procedures in PySpark as part of migrating its on-premises data warehouses to AWS. The article describes Pythagoras, a data reconciliation engine built using Amazon EMR and Amazon Athena.

Build a distributed big data reconciliation engine using Amazon EMR and Amazon Athena | Amazon Web…

This is a guest post by Sara Miller, Head of Data Management and Data Lake, Direct Energy; and Zhouyi Liu, Senior AWS…

aws.amazon.com
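As a small illustration of source-to-destination validation, the sketch below compares two row sets by key and by a per-row fingerprint. It is not how Pythagoras works; the function names, the column names, and the in-memory approach are assumptions chosen to keep the example self-contained.

```python
import hashlib


def row_fingerprint(row, columns):
    """Hash the selected columns of a row into a stable fingerprint."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join(repr(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def reconcile(source_rows, dest_rows, key, columns):
    """Report keys that are missing, unexpected, or different in dest."""
    src = {r[key]: row_fingerprint(r, columns) for r in source_rows}
    dst = {r[key]: row_fingerprint(r, columns) for r in dest_rows}
    shared = src.keys() & dst.keys()
    return {
        "missing_in_dest": sorted(src.keys() - dst.keys()),
        "unexpected_in_dest": sorted(dst.keys() - src.keys()),
        "mismatched": sorted(k for k in shared if src[k] != dst[k]),
    }


# Illustrative data only.
source = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 30.0},
]
dest = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.0},
    {"id": 4, "amount": 40.0},
]

report = reconcile(source, dest, key="id", columns=["amount"])
print(report)
```

At warehouse scale the same comparison would run as a distributed join rather than in Python dictionaries, but the three result buckets stay the same.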

Apache Flink writes about Pandas UDF support in PyFlink. The current version supports only scalar Pandas UDFs.

PyFlink: The integration of Pandas into PyFlink

04 Aug 2020 Jincheng Sun (@sunjincheng121) & Markos Sfikas (@MarkSfik) Python has evolved into one of the most…

flink.apache.org

Cost optimization is becoming mainstream engineering work in cloud infrastructure. Expedia’s blog series is an exciting read on optimizing the cost of running Apache Spark batch workloads.

Part 1: Cloud Spending Efficiency Guide for Apache Spark on EC2 Instances

How I saved 60% of costs in an Apache Spark job, with no increase in job time and no decrease in data processed

medium.com

Part 2: Real World Apache Spark Cost Tuning Examples

I outline the procedure for working through cost tuning

medium.com

Continuing with cost optimization: Amazon EC2 Spot Instances, which let you use unused Amazon EC2 computing capacity in the AWS Cloud, offer up to 90% savings over On-Demand Instances. Intermediate data may need to be “shuffled” to other Amazon EC2 instances to continue processing. In this article, Qubole writes about how FSx for Lustre, a high-performance parallel file system, provides a mechanism to offload and later access this data in a shared file system, which helps reduce costs and improve performance.

How Qubole optimizes cost and performance by managing shuffle data | Amazon Web Services

Ad hoc analytics, data exploration, data engineering, and machine learning (ML) workloads are often run at a massive…

aws.amazon.com

Many innovative approaches are emerging for data lifecycle management. The presentation walks through the major trends in each part of the data lifecycle, such as data pipelines, compute engines, data modeling, data products, and data quality. It’s missing data discovery and data accessibility trends, though.

The Five Important Trends in Data, and the One Megatrend Powering Them All by @ttunguz

Yesterday, Dremio hosted the Subsurface Conference, the first conference on cloud data lakes. More than 5000 people…

tomtunguz.com

dbt is gaining much momentum as a leading workflow engine for analytics engineering. The article walks through five reasons why BigQuery users should use dbt. It focuses on BigQuery, but the reasoning is applicable to any SQL database.

5 reasons why BigQuery users should use dbt

How do you implement and test data pipelines with BigQuery to create intermediate tables and manage metadata and data…

yu-ishikawa.medium.com

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Written by Ananth Packkildurai