Data Engineering Digest #11 (April 2020)

Maycon Viana Bordin

Published in

data.plumbers

13 min read · May 9, 2020

New & Updated Tools

ProphecyHub: Metadata re-invented with Git & GraphQL for Data Engineering

Authors: Raj Bains, Arpan Agrawal, Mayank Kotwal

medium.com

Introducing Apache Pinot 0.3.0

Built at LinkedIn, Pinot is an open source, distributed, and scalable OLAP data store that we use as our de-facto…

engineering.linkedin.com

AWS Glue now supports serverless streaming ETL

Posted On: AWS Glue now supports streaming ETL. This feature makes it easy to set up continuous ingestion pipelines…

aws.amazon.com

Data Engineering Role

Assessing and interviewing data engineers from a distance

When in-person technical interviews are no longer an option, hiring managers still have a wealth of online resources at…

blog.insightdatascience.com

Courses & Training

How to Prepare for and Clear the GCP : Professional Data Engineer Exam

Hey guys, so I passed the GCP : Professional Data Engineer exam on 17th January 2020 and I went through a really tough…

medium.com

PATH TO BECOME A DATA ENGINEER

Data Engineering is definitely one of the most demanded jobs in today’s world. As the data grows the need of Data…

medium.com

(Review) Udacity Data Engineer Nanodegree

Or A.K.A, a journey to become a modern data engineer.

medium.com

Podcasts & Presentations

Building A Knowledge Graph Of Commercial Real Estate At Cherre

An interview about how Cherre builds and maintains a knowledge graph of commercial real estate data and how it enables…

www.dataengineeringpodcast.com

Making Data Collection In Your Code Easy With Rookout

An interview with Rookout's CTO about the importance of including non-technical roles in the data collection process…

www.dataengineeringpodcast.com

Building Real Time Applications On Streaming Data With Eventador

An interview with Eventador CEO Kenny Gorman about the challenges of building a managed service for streaming data to…

www.dataengineeringpodcast.com

Taming Complexity In Your Data Driven Organization With DataOps

An interview about using a DataOps approach to reduce the technical and organizational complexity that occurs in data…

www.dataengineeringpodcast.com

Real Data Architectures

Create AWS baby datalakes to handle ongoing daily data batch

In the era of micro-service architecture, AI (or BI)-powered applications are structured as a collection of services…

medium.com

How we built a modern data platform in a digital bank from the ground zero

To start any organization’s data journey, an initial step is to build a data platform. However, it’s not rocket science…

medium.com

Data Culture

A Data Engineer’s Perspective On Data Democratization

How democratizing data shapes the data engineering effort

towardsdatascience.com

How Data Science is Boosting Netflix

When used effectively, data can transform your business in magical ways and take it to new heights.

towardsdatascience.com

When Data Science turns into Homoeopathy

A few thoughts on the importance of data literacy in data-driven companies.

medium.com

Don’t Buy Data — Invest In It

Huge sums of money are wasted on data because companies are spending it the wrong way. There’s a difference between…

towardsdatascience.com

Data Lake

Design Patterns for Data Lakes

Data Lake is the heart of big data architecture, as a result there needs to be careful planning in designing and…

medium.com

A Data Scientist’s Guide to Data Architecture

What you need to know to build a robust data process

towardsdatascience.com

Reinventing the Data Platform in the Cloud

In our first story we pointed out which architectural aspects and paradigms are crucial for a successful data platform…

medium.com

Data Governance

Data Discovery in 2020

A brief survey of data catalogs from Big Tech data teams

medium.com

Data Catalogs — Unlocking Value in your Data Lakes

It’s increasingly clear that successful data lake transformation and adoption of self-service rests on findability and…

medium.com

Five mistakes to avoid while building a Data Platform

Having spent a couple of years now in the world of data, building end to end data management platforms right from the…

medium.com

Sustainable Privacy Compliance Requires Disciplined Data Management

Data management, including meta-data management, data governance, master data management, has been advocated since the…

towardsdatascience.com

Data Formats

Delta Lake

How to optimize and increase SQL query speed on Delta Lake

There are two time-honored optimization techniques for making queries run faster in data systems: process data at a…

databricks.com

How to Build a Modern Clinical Health Data Lake with Delta Lake - The Databricks Blog

The healthcare industry is one of the biggest producers of data. In fact, the average healthcare organization is…

databricks.com

Improving Resiliency with Databricks Delta Lake & Azure

Databricks Delta Lake with few Azure features can protect our data lake & help to restore easily in case of any issues.

medium.com

Databricks Delta Architecture

As organizations nowadays have a lot of data, which could be customer data or S3 or could be unstructured data from a…

medium.com

Data Pipelines

Building a Simple ETL Pipeline with Python and Google Cloud Platform

Extracting data from an FTP server using Google Cloud Functions

towardsdatascience.com

Improve your Data Lifecycle with Metadata-Driven Pipelines

No digital transformation program is complete without a data-based initiative. With some speculating that artificial…

medium.com

Lessons learned building serverless data pipelines

Before many of the cooler features of AI products can be productionalized you need high quality and correct data…

medium.com

Data Pipelines with OpenFaas on K3s

A short story of how I used OpenFaas, Nats, and K3s to create a data pipeline for inserting data into a data lake

medium.com

Build a Scalable Data Pipeline with AWS Kinesis, AWS Lambda, and Google BigQuery

This blog details how to handle large amounts of event-triggered data for live time backend analysis with AWS Kinesis…

medium.com

ML Pipelines

Productionising ML Projects with Google BigQuery and PySpark: Predicting Hotel Cancellations

All too often, data scientists get caught up in the exploratory phase of data science — i.e. running multiple models on…

towardsdatascience.com

How to Build Machine Learning Pipelines with Airflow & Papermill

Learn to scale your machine learning workflows at will.

medium.com

Data Quality & Tools

Dirty Data — Quality Assessment & Cleaning Measures

Practical guide to understand, build, and execute data quality & cleaning process

towardsdatascience.com

Data Ingestion

Apache Sqoop

Sqoop scenarios and options

As part of the modern day big data architecture, it has become imperative to move data from RDBMS to Hadoop Distributed…

medium.com

Data Processing

Apache Spark

Six Spark Exercises to Rule Them All

Some challenging Spark SQL questions, easy to lift-and-shift on many real-world problems (with solutions)

towardsdatascience.com

Apache Spark Dataset Encoders Demystified

RDD, Dataframe and Dataset in Spark are different representations of a collection of data records with each one having…

towardsdatascience.com

Spark Optimizations for Advanced Users - Spark Cheat Sheet

I started Apache Spark learning almost 3 years back, when I was working with Android middle-ware project as part of my…

medium.com

An Apache Spark Application In Microservices Ecosystems

Articulating a problem reflects our knowledge about its domain. In the software and data world, we almost always…

medium.com

4 simple tips to improve your Apache Spark job performance!

Making your Apache Spark application run faster with minimal changes to your code!

medium.com

Using the Spark Aggregator class in Scala

Type-safe Aggregations: what are they?

towardsdatascience.com

Successful spark-submits for Python projects.

Smoothly run your project on an actual cluster, instead of that pretend one you’ve been using.

towardsdatascience.com

Apache Hadoop

Installing Hadoop 3.2.1 Single node cluster on Windows 10

This article is a step-by-step guide to install a Hadoop single node cluster on Windows 10 operating system.

towardsdatascience.com

Map-Reduce with Python & Hadoop on AWS EMR

Let’s do some basic Map-Reduce on AWS EMR, with the typical word count example, but using python and Hadoop streaming.

levelup.gitconnected.com

Apache Flume and Hbase in Hadoop

Welcome to lesson ‘Apache Flume and HBase’ of Big Data Hadoop tutorial which is a part of ‘big data training’ offered…

medium.com

Apache Hive

A POC for YouTube Data Analysis using Pig & Hive

In Today’s world as the 4 V’s of Big data (Volume,Variety,Velocity & Veracity) are very rapidly increasing it has…

medium.com

Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution

In this article, Thai Bui describes how Bazaarvoice leverages Alluxio as a caching tier on top of AWS S3 to maximize…

medium.com

Presto

Presto with Kubernetes and S3 — Benchmark

In the first part of this blog, I described how to deploy a Presto cluster with Kubernetes and configure it to access…

medium.com

Introducing our High-Performance Elasticsearch Connector for Presto

Our Presto Elasticsearch Connector is built with performance in mind. Here are some of the use-cases it is being used…

medium.com

Stream Processing

Big Data Battle: Streaming data approach using Apache Flume vs PySpark

Big Data is a term that describes an enormous volume of data…

medium.com

Kafka Stream (KStream) vs Apache Flink - DZone Big Data

Two of the most popular and fast-growing frameworks for stream processing are Flink (since 2015) and Kafka's Stream API…

dzone.com

Apache Flink

Event-Driven Supply Chain for Crisis with FlinkSQL

How Open Source Streaming technologies can help improve supply chain during Covid-19

towardsdatascience.com

Twitter Streaming using Flink

Flink is an open source stream-processing framework. It does provide stateful computation over data streams, recovery…

medium.com

Running Flink Application on Kinesis Data Analytics(KDA)- Part 1

Learn how to run a Flink stream processing application in the AWS Kinesis Data Analytics environment. Covers some best…

medium.com

Visualize fraudulent transactions via CEP with Kafka, Flink, SQL, D3.js and Mapbox.

I am not too proud to admit it — I like writing code in javascript. I’m a novice, but along with Python it’s my go-to…

medium.com

The Foundations for Building an Apache Flink Application

Understanding stream processing using Flink from bottom-up; a practical guide for coding a Flink processing Java…

medium.com

Apache Spark

Optimize Spark Structured Streaming for Scale.

Spark structured streaming production-ready version was released in spark 2.2.0. Our team was excited to test it on a…

medium.com

The connection between Spark Streaming and Apache Kafka using Java

How to connect Kafka and Spark

medium.com

Apache Kafka Streams

Bottom Up Approach To Kafka Stream Internals

Kafka Stream library needs to complete couple of steps before getting a stream application up and running. These steps…

medium.com

Kafka Stream Processing — Composing Views By Example

The sample code and instruction to setup and run, is available in GitHub. Event driven system designed with CQRS…

medium.com

Calculating speed, bearing and distance using Kafka Streams Processor API

Sometimes the classic DSL Kafka is not enough for us. The Processor API allows you to freely define the processor, and…

medium.com

Apache Storm

Introduction to Apache Storm

Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably…

medium.com

Change Data Capture

Change data Capture : Old way

Introduction

medium.com

Storage

Hadoop TeraGen Revisited

Someone asked me to help benchmark and compare throughput of on premise and cloud big data storage. Instead of just…

medium.com

Apache HDFS

HDFS Erasure Coding

Reduce storage overhead significantly in your HDFS cluster by leveraging Erasure Coding

towardsdatascience.com

Understanding Hadoop HDFS

HDFS (Hadoop Distributed File System) is a distributed file system for storing and retrieving large files with…

medium.com

Messaging

Kafka vs. RabbitMQ: Why Use Kafka?

Are all data streaming services made equal?

medium.com

Data Pipelines: Scaling a Message Broker System for Half the Cost

Tealium is a data hub platform that processes events at a large scale. We’ve seen tremendous growth in the amount of…

medium.com

Apache Kafka

Ordering of events in Kafka

Certain use cases require strict ordering of events (messages/records with data payload and/or state) to be maintained…

medium.com

Optimizing Kafka Cluster Deployments in Kubernetes

We, at Axual, pride ourselves in running high volume, mission critical Apache Kafka clusters for businesses in various…

itnext.io

Kafka 101 — An introduction to Kafka

I have been thinking of doing something this quarantine and here you go! As the title mentions, this blog is an…

medium.com

Apache Kafka in a Nutshell

Architecture, Use Cases, and a Getting Started guide — rolled into one

medium.com

An investigation into Kafka Log Compaction

Kafka log cleaner design and usage

medium.com

Kafka Connect on Kubernetes, the easy way!

This is a tutorial that shows how to set up and use Kafka Connect on Kubernetes using Strimzi, with the help of an…

itnext.io

Apache Pulsar

Why StreamSQL moved from Apache Kafka to Apache Pulsar

Apache Kafka and event streaming are practically synonymous today. Event streaming is a core part of our platform, and…

medium.com

Data-Friendly Messenger Apache Pulsar Gains Market

Open source messaging system, Apache Pulsar has been promoted from incubator status to a top-level project as its…

medium.com

Apache Pulsar 2.5.1

Apache Pulsar community has successfully released 2.5.1 version. Learn improvements and bug fixes in Apache Pulsar…

medium.com

Workflow Management

Python Data Engineering Tools: The Next Generation

A look below the surface of new data engineering / science frameworks

medium.com

Apache Airflow

Apache Airflow in a Digital bank Production

Production Background: We have 100s of data pipelines(mostly Apache spark) in production to ingest the raw data…

medium.com

Airflow’s dashboard was testing our patience, so we made our own

tl;dr: Creating/editing a pipeline visually is impossible in existing tools and it makes data engineering a chore. This…

medium.com

Apache Airflow — Programmatic platform for Data Pipelines and ETL/ELT workflows

In the current data-driven world, Data Pipelines and ETL (Extract, Transform and Load) workflows play a major role in…

medium.com

How did I resolved pip package dependency issue in Apache Airflow?

Talks about the cyclic dependency issue with pip packages, how to resolve it using PythonVirtualenvOperator.

medium.com

Apache Airflow — Plugins, SubDAGs and SLAs

Airflow Plugins

medium.com

Elastic(autoscaling) Airflow Cluster in Kubernetes

In this article, I will demonstrate how we can build an Elastic Airflow Cluster which scales-out on high load and…

itnext.io

Airflow Schedule Interval 101

The Airflow schedule interval could be a challenging concept to comprehend, even for developers who work on Airflow for a…

towardsdatascience.com

Alpaca: Airflow at JW Player

How our custom platform built on top of Airflow allows users to quickly create new Airflow DAGs

medium.com

Airflow with YAML Dags and kubernetes operator

Simplifying the creation of DAGs

medium.com

Airflow : Zero to One

In current world, we process a lot of data and the churn rate of it increases exponentially with passing time, where…

medium.com

Apache Oozie

Batch Processing of data from MYSQL to HDFS using Oozie workflow

Encountered with a challenge of automating the process of data collection from MySQL server to HDFS using Oozie…

medium.com

Cloud Providers

AWS

Serverless Data Lake: Storing and Analysing Streaming Data using AWS

Making an Amazon S3 Data Lake on Streaming Data using Kinesis, Glue, Athena and Quicksight

medium.com

Building a Data Lake with AWS Lake Formation

With growing numbers of people accessing data, it is important that data platforms are flexible and scalable. Hence…

medium.com

PEX — The secret sauce for the perfect PySpark deployment of AWS EMR workloads

How to use PEX to speed up deployment of PySpark applications on ephemeral AWS EMR clusters

towardsdatascience.com

AWS Glue: An ETL Solution with Huge Potential

AWS Glue is a fully managed serverless ETL service with enormous potential for teams across enterprise organizations.

medium.com

How I connect an S3 bucket to a Databricks notebook to do analytics.

A basic use case to connect Amazon S3 and a databricks notebook.

towardsdatascience.com

Update and Insert(upsert) Data from AWS Glue

Introduction

towardsdatascience.com

Map-Reduce with Python & Hadoop on AWS EMR

Let’s do some basic Map-Reduce on AWS EMR, with the typical word count example, but using python and Hadoop streaming.

levelup.gitconnected.com

Simplify data pipelines with AWS Glue automatic code generation and Workflows | Amazon Web Services

In this post, we discuss how to leverage the automatic code generation process in AWS Glue ETL to simplify common data…

aws.amazon.com

Exploring the public AWS COVID-19 data lake | Amazon Web Services

This post walks you through accessing the AWS COVID-19 data lake through the AWS Glue Data Catalog via Amazon SageMaker…

aws.amazon.com

A public data lake for analysis of COVID-19 data | Amazon Web Services

As the COVID-19 pandemic continues to threaten and take lives around the world, we must work together across…

aws.amazon.com

Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon…

Today we are adding a new Amazon Kinesis Data Firehose feature to set up VPC delivery to your Amazon Elasticsearch…

aws.amazon.com

Integrating AWS Lake Formation with Amazon RDS for SQL Server | Amazon Web Services

To grow and develop your business, you must collect data from a myriad of sources (such as relational and NoSQL…

aws.amazon.com

Simplify your Spark dependency management with Docker in EMR 6.0.0 | Amazon Web Services

Apache Spark is a powerful data processing engine that gives data analyst and engineering teams easy to use APIs and…

aws.amazon.com

Apache Hive is 2x faster with Hive LLAP on EMR 6.0.0 | Amazon Web Services

Customers use Apache Hive with Amazon EMR to provide SQL-based access to petabytes of data stored on Amazon S3. Amazon…

aws.amazon.com

Read your S3 access logs in AWS Athena

The last two posts, I am exploring (again) the static Web hosting through S3 + Route53 and as additional layer the…

medium.com

Migrating Big Data Workloads to AWS EMR

Overview

medium.com

How to configure Kerberos in AWS EMR ?

What is Kerberos?

medium.com

Google Cloud

Creation of an ETL in Google Cloud Platform for automated reporting

Learn how to create your own serverless and fully scalable ETL for automated reporting using PyTrends as an example

towardsdatascience.com

Migrating Data Processing Hadoop Workloads to GCP

Written by Anant Damle and Varun Dhussa

medium.com

Migrating Hive ACID tables to BigQuery

Migrating data from Hadoop to Google BigQuery is a fairly straightforward process. DistCP is usually leveraged to push…

medium.com

Testing Airflow jobs on Google Cloud Composer using pytest

A reliable CI/CD without reinventing the wheel

towardsdatascience.com

Azure

How to Secure Your Azure Machine Learning Experiments

A step-by-step guide to adopt best practices and a strong security posture when deploying the Azure Machine Learning…

medium.com

Azure SQL Database network settings (Private Link, VNET Service Endpoint) and Azure Data Factory

Azure SQL Database has a few extra settings on the Firewalls and Virtual Networks tab in addition to Private Link and…

medium.com

Web Activity (Sending Email) in Azure Data Factory

Lately, I have been using Azure Data Factory (ADF) gently. I kinda like the concept of code-less data engineering ETL…

medium.com

Databases

NoSQL

Massive Scale Databases

Introduction

itnext.io

CAP Theorem & its relevance to No SQL DB

CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a…

medium.com

Hadoop NoSQL: Hbase, Cassandra, MongoDB

7- DEMYSTIFYING THE HADOOP TECHNOLOGY

medium.com

Building a Distributed Hadoop Cluster with HBase on Amazon EC2’s from Scratch

If you want to build a Distributed Hadoop Cluster on AWS EC2 with HBase, then the best option is to use AWS EMR. But if…

medium.com

Cassandra in Kubernetes

Headless Services, KubeDNS, Init Containers, Lifecycle hooks and other K8s concepts on the way

medium.com

A Glance at Apache Cassandra

Apache Cassandra is a kind of distributed column-oriented NoSQL database and performs very high performance in vast…

medium.com

SQL and NoSQL: key differences

A guide to the two most common types of database systems

medium.com

SQL vs NoSQL : What’s the best option for your database?

One of the essential choices a developer must make is about what database technology to use for structuring and…

medium.com

Exploring MongoDB

Data is the new age fuel —MongoDB, the database for modern application can be the best fit when handling large data…

medium.com

Query Optimization in MongoDB

MongoDB is a NoSQL database most commonly referred as a document database. It was basically designed for ease of…

medium.com

Persistent Databases Using Docker’s Volumes and MongoDB

With Docker Compose version 3

medium.com

Using Apache Pinot and Kafka to Analyze GitHub Events

Pinot is the latest Apache incubated project to follow in the footsteps of other tremendously popular open source…

medium.com

In-Memory & Data Grid

Apache Ignite and CAP Theorem

Is Apache Ignite consistent, available or both?

medium.com

Apache Ignite with persistence!

From last couple of years, I have been working on this amazing product from Apache called Ignite ,which is, an in…

medium.com

Persistence Options for Apache Ignite

How to choose the right persistence store for Apache Ignite.

medium.com

Modern Data Warehouses

Guide to Data Warehousing

Short and comprehensive information about different data modeling techniques

towardsdatascience.com

Data Warehouse with a Lake view

Having had various discussions around data warehousing, data lakes and big data technologies I felt the urge to share…

medium.com