Data Engineering Digest #8 (January 2020)

Maycon Viana Bordin

Published in data.plumbers

8 min read · Feb 7, 2020

New Tools

Introducing Flyte: Cloud Native Machine Learning and Data Processing Platform

Today Lyft is excited to announce the open sourcing of Flyte, a structured programming and distributed processing…

eng.lyft.com

Data Engineering Role

Most In Demand Tech Skills for Data Engineers

Data Engineer is the fastest growing job title according to a 2019 analysis. Which tech skills are most in demand for…

towardsdatascience.com

Courses & Training

7 Resources to Becoming a Data Engineer - KDnuggets

Data Engineering is one of the fastest growing and in-demand occupations among Data Science practitioners. The ability…

www.kdnuggets.com

Top 13 data engineer and data architect certifications

Data and big data analytics are the lifeblood of any successful business. Getting the technology right can be…

www.cio.com

Notes for Databricks CRT020 Exam Prep

As I walk through the Databricks exam prep for Apache Spark 2.4 with Python 3, I’m collating notes based on the…

medium.com

Podcasts

Change Data Capture For All Of Your Databases With Debezium

An interview about how the Debezium framework simplifies implementing change data capture for all of your database…

www.dataengineeringpodcast.com

Planet Scale SQL For The New Generation Of Applications

An interview about YugabyteDB and how it was architected to power the new generation of planet scale applications. The…

www.dataengineeringpodcast.com

Replatforming Production Dataflows

An interview about how Mayvenn replatformed their production dataflows using Ascend and improved their ability to…

www.dataengineeringpodcast.com

Pay Down Technical Debt In Your Data Pipeline With Great Expectations

An interview about how the Great Expectations framework helps you add meaningful tests and validation to your data…

www.dataengineeringpodcast.com

Real Data Architectures

Designing Production-Ready Kappa Architecture for Timely Data Stream Processing

At Uber, we use robust data processing systems such as Apache Flink and Apache Spark to power the streaming…

eng.uber.com

A Deep Dive into Unified’s Data Lake

What is a data lake? How does it work? In this post we answer these questions in the context of Unified’s data lake.

medium.com

Some Common Data Science Stacks

7 stacks from interviewing Analysts, Scientists, and Engineers.

towardsdatascience.com

Data Culture

Empower Data Owners to become a Data-Driven Enterprise

A detailed look at the missing Data Owner role that keeps organizations from becoming data driven.

medium.com

The data product lifecycle

Your organisation wants to dive head-first into data and AI but you don’t really know where to start? Data&AI is on the…

medium.com

Data Lake

What Is a Data Lakehouse? — The Databricks Blog

Over the past few years at Databricks, we’ve seen a new data management paradigm that emerged independently across many…

databricks.com

How Amazon is solving big-data challenges with data lakes

Back when Jeff Bezos filled orders in his garage and drove packages to the post office himself, crunching the numbers…

www.allthingsdistributed.com

The Distributed Data Mesh as a Solution to Centralized Data Monoliths

Instead of building large, centralized data platforms, enterprise data architects should create distributed data…

www.infoq.com

A Guide To Modern Batch Data Warehousing — Extraction

Redefining the data extraction patterns to follow “Functional Data Engineering” best practices

towardsdatascience.com

Starting out with data puddles, then we’ll think about data lakes

Comic Relief is re-thinking its data ingestion, storage and query stack with Lambda, S3 & Athena. Here is a quick intro…

medium.com

Multi-tenancy for Big Data: Part 2

Modern businesses understand that data is not just important to your business, it is your business.

blog.ellation.com

Data Governance

Observability for Data Engineering

Observability is a fast-growing concept in the Ops community that caught fire in recent years, led by major…

medium.com

How Data Quality Can Kill your Data Science Project… If You’re Not Careful

If “Data Scientist is the sexiest job in the 21st Century”, then data quality is the least sexy aspect, but it’s still…

medium.com

Towards a Data Quality Score in open data (part 1)

Why Open Data Toronto created a score to assess data quality and what it measures

medium.com

Reducing Organizational Complexity with DataOps

Organizational complexity creates significant problems, but executives in a McKinsey Survey showed little understanding…

medium.com

Data Formats

Comparison of Big Data storage layers: Delta vs Apache Hudi vs Apache Iceberg. Part#1

All you will read here is personal opinion or lack of knowledge :) Please feel free to contact me for fixing incorrect…

medium.com

Delta Lake

Is “Delta Lake” Replacing “Data Lakes”? (Ep. 6)

Delta Lake is another striking open-source project that Databricks supports. What is the value of Delta Lake?

medium.com

Migrating from Hive to Delta Lake + Hive in Hybrid Cloud Environment

Everything about migration from Hive to Delta Lake + Hive

medium.com

Partitioned Delta Lake : Part 3

A tutorial on how to use partitioning in Delta Lake

medium.com

Upsert In Delta Lake : Part 4

Welcome to the fourth part of the series on how to upsert/merge data from an Apache Spark DataFrame into a Delta table.

medium.com

Delta Lake: Extract the real value from Data Lake

Delta Lake provides great features and solves some of the biggest issues that come with a data lake. On top of all, it…

medium.com

Apache Parquet

Compaction / Merge of small parquet files

Optimising the size of Parquet files for processing by Hadoop or Spark

medium.com

Data Pipelines

4 Easy steps to setting up an ETL Data pipeline from scratch

Setting up an ETL pipeline within a few commands

towardsdatascience.com

Data Processing

Big Data Analytics: Apache Spark vs. Apache Hadoop

Learn why Apache Spark was created, and how it addresses Apache Hadoop’s shortcomings.

towardsdatascience.com

Apache Spark

How we reduced our Apache Spark cluster cost using best practices

It’s been about 3 months now since I switched over to Lisbon from Italy. I’ve been offered a chance to work with one of…

medium.com

Spark in Docker in Kubernetes: A practical approach for scalable NLP

Natural Language Processing using the Google Cloud Platform’s Kubernetes Engine

towardsdatascience.com

Spark deserves a better IDE

Authors: Raj Bains, Maciej Szpakowski

medium.com

Spark UDAF could be an option!

Calculate average on sparse arrays

medium.com

Infrastructure as Code: Introduction to Continuous Spark Cluster Deployment with Cloud Build and…

Imagine you want to start building some data pipelines in Spark or implement a model with Spark ML, the first step…

medium.com

The What, Why, and When of Apache Spark

Before-you-code Spark basics

towardsdatascience.com

Apache Hive

Herding the Elephants: moving data from PostgreSQL to Hive

In this article we are sharing learnings and practical advice for making PostgreSQL data available to Spark in an…

medium.com

Presto

Presto-Powered S3 Data Warehouse on Kubernetes

Presto is a distributed query engine capable of bringing SQL to a wide variety of data stores, including S3 object…

medium.com

Stream Processing

Apache Flink

Flink as a Service at JW Player

JW Player is the world’s largest network-independent platform for video delivery and intelligence. Our global footprint…

medium.com

Apache Flink State Types

Apache Flink is a 4th-generation open source data processing framework. Flink supports stateful and stateless…

medium.com

Timers management in Apache Flink

Introduction

medium.com

Change Data Capture

Practical Change Data Streaming Use Cases with Apache Kafka & Debezium

Gunnar Morling discusses practical matters, best practices for running Debezium in production on and off Kubernetes…

www.infoq.com

Messaging

Apache Kafka

Streaming Machine Learning with Tiered Storage

By Kai Waehner. The combination of streaming machine learning (ML) and Confluent Tiered Storage enables you to build…

www.confluent.io

Pipeline to the Cloud - On-Premises Data Streaming for Cloud Analytics

By Robin Moffatt. This article shows how you can offload data from on-premises transactional (OLTP) databases to…

www.confluent.io

Streams and Tables in Apache Kafka: Event Processing Fundamentals

By Michael Noll. Part 2 of this series discussed in detail the storage layer of Apache Kafka: topics, partitions, and…

www.confluent.io

Streams and Tables in Apache Kafka: Elasticity, Fault Tolerance & Advanced Concepts

By Michael Noll. Now that we've learned about the processing layer of Apache Kafka® by looking at streams and…

www.confluent.io

7 mistakes when using Apache Kafka

Apache Kafka is used as a message broker but can be extended by additional tools to become a whole message processing…

blog.softwaremill.com

Who and why uses Apache Kafka?

Some claim that Kafka is one of the most popular tools in the world.

blog.softwaremill.com

Kafka the afterthoughts: message encoding and schema management

In this article I share notes and thoughts, from my journey with Kafka, about data encoding and schema management.

medium.com

Event-driven Autoscaling for Kubernetes with Kafka & Keda

Autoscale Kubernetes workloads based on message count in a Kafka topic

medium.com

Apache Pulsar

Why Apache Pulsar — A Gentle Comparison with Kafka

What is Apache Pulsar?

medium.com

What are Pulsar Functions?

From Pulsar in Action by David Kjerrumgaard

medium.com

pulsar-express, a web interface for Apache Pulsar

Pulsar-express aims to be a simple web application that allows users to see information about their Apache Pulsar…

medium.com

Workflow Management

Apache Airflow

Confessions of an Airflow user

Airflow, Airflow, Airflow… how I love and hate thee. The siren calls of scale and flexibility tempt me, even as I have…

medium.com

Generalizing data load processes with Airflow

Data load processes should not be written twice; they should be generalized

towardsdatascience.com

Automatic Airflow DAG creation for Data Scientists and Analysts

TL;DR: DAG creator is a Python script that, when run, picks the latest JSON definition files and substitutes the…

towardsdatascience.com

Why Apache Airflow Is a Great Choice for Managing Data Pipelines

A look at the capabilities that make Airflow better than its predecessors

towardsdatascience.com

Reliably Upgrading Apache Airflow at Slack’s Scale

For two years we’ve been running Airflow 1.8, and it was time for us to catch up. Here’s how we did it without…

slack.engineering

Scaling DAG Creation With Apache Airflow

One of the more difficult tasks within the Data Science community is not designing a model to a well-constructed…

towardsdatascience.com

Apache Airflow is Fun For Data Engineer!

Implementing Apache Airflow at Tunaiku

medium.com

Integrating Airflow + Datadog on docker-compose

Integrating Airflow running on Docker + Datadog took way longer than I expected, so I decided to share my guide to the…

medium.com

Cloud Providers

AWS & Snowflake vs GCP: how do they stack up when building a data platform?

When we talk about data, the number of technologies available on the market is overwhelming and staying up to date is a…

medium.com

AWS

Getting Started with Data Analysis on AWS

Learn how to use AWS Glue, Amazon Athena, and Amazon QuickSight to transform, enrich, analyze, and visualize…

towardsdatascience.com

Amazon S3 Data Lake-Storing & Analyzing the Streaming Data on the go — Serverless Approach

Making an S3 data lake by storing streaming data and analyzing it on the go…

towardsdatascience.com

Publish Streaming data into Aws S3 Datalake and Query it

Consume streaming data from AWS Kinesis, build a data lake in S3, and run SQL queries from Athena.

medium.com

How to merge NoSQL and SQL using AWS Glue

How to report on data from both NoSQL and SQL at the same time without going crazy

levelup.gitconnected.com

Google Cloud

Building Real-time data pipelines with Google Cloud Pub/Sub

Motivation

medium.com

Azure

Building a Dynamic data pipeline with Databricks and Azure Data Factory

TL;DR A few simple useful techniques that can be applied in Data Factory and Databricks to make your data pipelines a…

towardsdatascience.com

Securing access to Azure Data Lake gen2 from Azure Databricks

There are a number of ways to configure access to Azure Data Lake Storage gen2 (ADLS) from Azure Databricks (ADB). This…

medium.com

Databases

NoSQL

Relational vs NoSQL and RDBMS to NoSQL Migration - DZone Database

Given the choice of a Relational Database (RDBMS) vs a NoSQL database, it has become more important to select the right…

dzone.com

The Multi-Model Knowledge Graph - DZone Database

Enterprise Knowledge Graphs (EKGs) have been on the rise and are incredibly valuable tools for harmonizing internal and…

dzone.com

DynamoDB is Not a Database

Amazon describes DynamoDB as a database, but it’s best seen as a highly-durable data structure in the cloud.

medium.com

KeyDB is a Fork of Redis that is 5X Faster

What if I told you there is a fork of Redis that can run 5x faster with nearly 5x lower latency? What if you no longer…

medium.com

MongoDB in production: How connection pool size can bottleneck application scale

Understanding how MongoDB connection pools and pool sizing works is a fundamental part of running an effective MongoDB…

medium.com

Maximizing Disk Utilization with Incremental Compaction

By Raphael S. Carvalho and Benny Halevy, January 16, 2020

medium.com

In-Memory & Data Grid

Scalable Data Grid Using Apache Ignite - DZone Big Data

In this article, I introduce the concept of a Data Grid, its properties, the services it offers, and finally how to design…

dzone.com

Relational

4 Data Sharding Strategies for Distributed SQL Analyzed - DZone Database

A distributed SQL database needs to automatically partition the data in a table and distribute it across nodes. This is…

dzone.com