Data Engineering Digest #15 (Aug 2020)

Published in

data.plumbers

19 min read · Oct 1, 2020

This edition arrives a little later than expected, but its highlights are still fresh and interesting.

To start, we have a new Kafka release that continues the ongoing work to remove ZooKeeper as a dependency, which is planned for version 3.0.

We also have the paper describing the inner workings of Delta Lake, published by Databricks. It details the motivations for Delta Lake and how it achieves ACID properties on top of an object store and Parquet files, while still enabling schema evolution, time travel, and caching, among other useful features.
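To make the idea more concrete, here is a minimal sketch of how an append-only commit log over immutable data files can give all-or-nothing commits and time travel. It is a toy under stated assumptions, not Delta Lake's actual protocol or API: the `ToyTableLog` class, the `_toy_log` directory, and the file names are hypothetical, and a local directory stands in for the object store.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy append-only commit log. Hypothetical; NOT Delta Lake's real format."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        # Zero-padded names keep the log files sorted by version.
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, add=(), remove=()):
        """Publish the next version; raises FileExistsError if another writer took it."""
        version = self.latest_version() + 1
        # Mode "x" creates the file only if it does not exist yet, so two
        # writers can never both claim the same version number.
        with open(self._path(version), "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return version

    def snapshot(self, version=None):
        """Replay commits 0..version to get the set of live data files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(self._path(v)) as f:
                entry = json.load(f)
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live


# Usage: data files are never modified in place; only the log grows.
log = ToyTableLog(tempfile.mkdtemp())
log.commit(add=["part-0.parquet"])                             # version 0
log.commit(add=["part-1.parquet"])                             # version 1
log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])  # version 2

print(sorted(log.snapshot()))   # files visible in the latest version
print(sorted(log.snapshot(1)))  # "time travel" to version 1
```

Readers only ever see the files listed by complete log entries, so a rewrite that replaces `part-0.parquet` becomes visible all at once, and reading an older version is just a matter of replaying fewer entries. The paper itself is the reference for how the real system handles this on cloud object stores.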

Finally, we have a benchmark comparing the latency of Apache Kafka and Apache Pulsar. In a nutshell, the results show that Pulsar has more predictable latency than Kafka, both end-to-end and on publish. The full description of the experiments and results can be found in the article.

New Tools & Updates

What's New in Apache Kafka 2.6 - ZooKeeper Removal Progress and More

On behalf of the Apache Kafka ® community, it is my pleasure to announce the release of Apache Kafka 2.6.0. This…

www.confluent.io

Scylla Enterprise Release 2020.1.0

By Tzach Livyatan

medium.com

20 MORE Hot Data Tools and What They Don’t Do

In the past few months, the data ecosystem has continued to burgeon as some parts of the stack consolidate and as new…

towardsdatascience.com

Sneak Peek: Apache Flink 1.11 Is Coming Soon!

This article describes the new features, improvements, and important changes of Flink 1.11 and Flink’s future…

medium.com

Data Engineering Role

Data engineering in 2020

It is incredible how fast data processing tools are evolving. And with it, the nature of the data engineering…

medium.com

The Maturity of Data Engineers

Does the rise of data science and machine learning affect the role of data engineers?

towardsdatascience.com

The Facebook Data Engineer Interview

Facebook data engineers transform raw and complex data into actionable insights for better business decision making

towardsdatascience.com

Courses & Training

Designing Data-Intensive Applications Book Review

A modern classic for database and distributed system users

towardsdatascience.com

Basic Hive Interview Questions & Answers 2020

Big Data interviews may be conducted on general lines (wherein you must have a general idea about the popular Big Data…

medium.com

Publications

Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Podcasts & Presentations

Highlights from the Spark AI Summit

medium.com

How to Produce a Successful Virtual Event - Tips and Best Practices - The Databricks Blog

It's been a few weeks since Spark + AI Summit 2020 and we can still feel the amazing energy from this global virtual…

databricks.com

Kafka Summit 2020 Roundup and Highlights

If you know me, you know two things: first, that I am committed to remote work as an effective way to build a company…

www.confluent.io

Kafka Summit 2020: Day 1 Recap and Highlights

This year's Kafka Summit is my first and I've been lucky to have a behind-the-scenes look at the event since joining…

www.confluent.io

Real Data Architectures & Platforms

How to Build Your Data Platform like a Product

Go from a (standard dev. of) zero to a bonafide data hero

towardsdatascience.com

Architecting Modern Data Engineering using Azure Databricks

There should be no doubts in anyone’s mind about how Big Data and AI are fueling the next revolution. Data is the new oil…

medium.com

How our own Serverless DataPlatform reduced AWS bill by 93 %

Serverless servers are the servers that you don’t serve

medium.com

Data lake on GCP using Terraform

Use Terraform to set up infrastructure-as-code for a Data Lake on Google Cloud Platform.

towardsdatascience.com

Data Culture

The Chief Data Officer’s Scorecard for Digital Transformation

Data is the new oil. The last decade has seen a tremendous increase in the amount of data generated and stored by…

medium.com

How Much Can Bad Data Cost Us?

The never-ending fight to cease bad data

medium.com

Good Tales of Bad Data

When data breaks and no one hears it, does it make a sound?

towardsdatascience.com

Data Lake

A Journey of Data Engineer to the cloud — Data Lake

A generic onboarding flow of data engineers to the cloud.

medium.com

Data Lake Design Patterns on AWS — Simple, Just Right & The Sophisticated

A guide to choosing the correct data lake design on AWS for your business

towardsdatascience.com

Pattern to efficiently UPDATE terabytes of data in a Data Lake

SQL “UPDATE” equivalent for a data lake using Apache Spark execution engine. Processes ~15TB in under an hour.

medium.com

Some issues when building an AWS data lake using Spark and how to deal with these issues

A lengthy post for ALL who struggle with Spark and Data Lake

towardsdatascience.com

Data Lake Can Help Break Data Silos

By: Arvind Heda

medium.com

Building a data lake without drowning

Lessons learned that will help you avoid common pitfalls during your data lake implementation

medium.com

Build a Hybrid Multi-Cloud Data Lake and Perform Data Processing Using Apache Spark

Create a Multi-Cloud Data Lake using Terraform and run a configuration driven Apache Spark data pipeline

medium.com

Data Architecture

How to Build Your Data Platform like a Product

Go from a (standard dev. of) zero to a bonafide data hero

towardsdatascience.com

Table Design Best Practices for ETL

How to Design Source System Tables for ETL Pipelines

towardsdatascience.com

How to create a successful data platform 2.0

Read this if your data platform 1.0 was not successful as you thought it would be.

medium.com

Guiding principles for building and managing effective, scalable data pipelines for Machine…

Saravanakumar Subramaniam, Principal Data Engineer, Toby Sykes, Global Head of Data Engineering

medium.com

Data Governance

Want to Master Your Data? Here’s Why You Should Care About Metadata

Know your data better than you know yourself

towardsdatascience.com

DataOps — Fully Automated, Low Cost Data Pipelines using AWS Lambda and Amazon EMR

A Guide to completely automate data processing pipelines using S3 Event Notifications, AWS Lambda and Amazon EMR.

towardsdatascience.com

Discover your Metadata- Apache Atlas

What’s the first thing you do when you receive a parcel? — You check the senders name and address.

medium.com

Business Glossary, Data Dictionary, and Data Catalog?

You might have heard these terms before while working with businesses. I am sure you are still confused between the…

medium.com

How to improve data quality for machine learning?

The secret of building a better model

towardsdatascience.com

Partnering for data quality

How two groups at Microsoft teamed up on a data quality initiative.

medium.com

How to Calculate the Cost of Data Downtime

Introducing a better way to measure the financial impact of your bad data

towardsdatascience.com

The role of DevOps, Data Ops and ML Ops in Delivering Enterprise AI

Arijit Mitra

medium.com

Data Formats

Delta Lake

Top 5 Reasons to Convert Your Cloud Data Lake to a Delta Lake

If you examine the agenda for any of the Spark Summits in the past five years, you will notice that there is no…

databricks.com

Delta lake with Spark: What and Why?

Get to know the storage layer which enabled ACID and updates with Spark

towardsdatascience.com

Building a notebook-based ETL framework with Spark and Delta Lake

The process of extracting, transforming and loading data from disparate sources (ETL) has become critical in the last…

medium.com

Delta Lake and Data Lakes — Getting Started

A gentle introduction to data lakes and the most prominent solution used — Delta Lake. Theory and practice on newest…

medium.com

Near real-time finance data warehousing using Apache Spark and Delta Lake

Financial institutions globally deal with massive data volumes that call for large scale data warehousing and…

medium.com

Facilitating Business Demands with Databricks Delta Lake

Although data lakes have been meeting business demands for a good amount of time, there is a need for…

medium.com

Apache Parquet

Details you need to know about Apache Parquet

Parquet is a columnar file format that supports nested data. Lots of data systems support this data format because of…

medium.com

Overcoming Parquet Schema Issues

Couple approaches on how we overcame parquet schema related issues when using Pandas and Spark dataframes.

medium.com

Apache Avro

Avro in its simplicity

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop…

medium.com

All you need to know about Avro schema

medium.com

Data Pipelines

A Paved Road for Data Pipelines

Data is a key bet for Intuit as we invest heavily in new customer experiences: a platform to connect experts anywhere…

medium.com

Big Data Pipeline Recipe

Introduction

itnext.io

Productive data pipelines at Geoblink

from Jenkins monolith towards Airflow on Kubernetes

medium.com

Your first data pipeline with Kafka

Using Docker, Kafka, Python and Postgres

medium.com

Data Quality Tools

Why I Built an Opensource Tool for Big Data Testing and Quality Control

I’ve developed an open source data testing and quality tool called data-flare. In this post I’ll share why I wrote this…

medium.com

Data Validation Framework in Apache Spark for Big Data Migration Workloads

Quality Assurance Testing is one of the key areas in Bigdata

medium.com

Data Quality Libraries: The Right Fit

A high-level comparison of TensorFlow Data Validation, Great Expectations, and Deequ

medium.com

Partnering for data quality

How two groups at Microsoft teamed up on a data quality initiative.

medium.com

Data Processing

Scaling AI with Project Ray, the Successor to Spark

Usually a project at Berkeley lasts for about five years. Around 2006, some smart users/students were complaining about…

medium.com

Apache Hadoop

An Introduction to Hadoop in EMR AWS.

Big Data being an integral part of Machine Learning, here we are going to process Freddie Mac Monthly Loan dataset…

medium.com

Apache Spark

Distributed Data Processing with Apache Spark

Data processing with general purpose distributed data processing engine.

medium.com

Part 6: Summary of Apache Spark Cost Tuning Strategy

The step by step overview of the cost tuning strategy

medium.com

Part 1: Cloud Spending Efficiency Guide for Apache Spark on EC2 Instances

How I saved 60% of costs in an Apache Spark job, with no increase in job time and no decrease in data processed

medium.com

Video social analytics at scale using Apache Spark

A deep dive into how we built data pipelines to access the APIs of platforms like Facebook and YouTube at scale using…

medium.com

Running Apache Spark on Kubernetes: Best Practices and Pitfalls

This article describes Data Mechanics, Spark on Kubernetes, and cloud-native ideas and practices of Alibaba’s…

medium.com

How-to perform a spark-submit to Amazon EKS cluster with IRSA

In previous article, I have introduced how we submit a Spark job to an EKS cluster. As long as we’re using other AWS…

medium.com

The Secrets Behind the Optimized SQL Performance of EMR Spark

This article gives an overview of the latest performance and efficiency optimizations that were made to TPC-DS Perf after its…

medium.com

Boosting Apache Spark Application by Running Multiple Parallel Jobs

There might be a question in your mind from the title of this article that Apache Spark already performs data…

medium.com

Spark Integration With kafka(Batch)

Spark integration with kafka (Batch)

medium.com

Distributed Processing with PyArrow-Powered New Pandas UDFs in PySpark 3.0

How to implement high-performance Pandas-like User Defined Functions (UDFs) by using PySpark powered by Spark 3.0.0

towardsdatascience.com

Part 3: Cost Efficient Executor Configuration for Apache Spark

Find the most efficient executor configuration for your node

medium.com

How to handle bad records/Corrupt records in Apache Spark

Hi Everyone,

medium.com

PySpark: unit, integration and end-to-end tests.

Through this article I intend to show one way of creating and running PySpark tests.

medium.com

Some issues when building an AWS data lake using Spark and how to deal with these issues

A lengthy post for ALL who struggle with Spark and Data Lake

towardsdatascience.com

How to connect Jupyter Notebook to remote spark clusters and run spark jobs every day?

This is not as trivial a problem as many think

towardsdatascience.com

How Delta Lake 0.7.0 and Apache Spark 3.0 Combine to Support Metastore-defined Tables and SQL DDL …

Last week, we had a fun Delta Lake 0.7.0 + Apache Spark 3.0 AMA where Burak Yavuz, Tathagata Das, and Denny Lee…

databricks.com

Apache Hive

Hive Finally Has Flink!

Jason introduces the architecture of Hive integration in Flink, discusses problems, and how to solve them.

medium.com

An alternative way of loading or importing data into Hive tables running on top of HDFS based data…

Preceding pen down the article, might want to stretch out appreciation to all the wellbeing teams beginning from…

medium.com

Apache Hive dealing with different data file

In this Blog I will try to explain different ingestion methods for different file format (semi-structured) like…

medium.com

Create a Scale-Out Hive Cluster with a Distributed, MySQL-Compatible Database

Author: Mengyu Hu (Platform Engineer at Zhihu)

medium.com

Limitation of Hive Data Validation

In a big data world, Hive is one of the most popular data warehouse tools. Though it comes with some convenient and…

medium.com

How To Create Your Own Hive SerDe — Hive Custom Data Serialize-Deserialize Mechanism

As mentioned in my earlier blog post, SerDe is an interface which Hive uses to deserialize (read data from table’s hdfs…

medium.com

Presto

Using Starburst Presto to Federate SQL Queries Across Multiple Data Sources

The data virtualization world as we know it is full of ifs and buts…

medium.com

Getting started with Apache Presto and Apache Kudu backend on Kubernetes

Here we will see how to quickly get started with Apache Kudu and Presto on Kubernetes

medium.com

Presto in Elastic Container Service

Tempted by improving Hive queries performance, our team decided to try well known big data in-memory engine Presto…

medium.com

MapReduce

The Why and How of MapReduce

When do I need to use MapReduce? How can I translate my jobs to Map, Combiner, and Reducer?

medium.com

Project Ray

Scaling AI with Project Ray, the Successor to Spark

Usually a project at Berkeley lasts for about five years. Around 2006, some smart users/students were complaining about…

medium.com

Stream Processing

Streaming Big Data Analytics

The recent years have seen a considerable rise in connected devices such as IoT [1] devices, and streaming sensor data…

medium.com

Event Streaming and The Problem with Up-Scaling Partition Counts

A common goal of distributed and immutable log-based event brokers (such as Apache Kafka, Apache Pulsar, AWS Kinesis…

medium.com

Selecting the Right streaming Engine for your Data Pipeline

Until recently most of the data warehouses and data lakes were batch oriented where data was captured in file systems…

medium.com

cuStreamz: More Event Stream Processing for Less with NVIDIA GPUs and RAPIDS Software

One can view cuStreamz as a bridge that connects Python-Streaming and GPUs — with sophisticated and reliable streaming…

medium.com

Apache Flink

The Run-In Period for Flink and Hive

Jason addresses the bugs and compatibility issues with Flink-Hive by operating on a Hive database using Flink SQL.

medium.com

A Deep Dive into Apache Flink 1.11: Stream-Batch Integrated Hive Data Warehouse

Li Jinsong and Li Rui, Alibaba Technical Experts, talk about the features, revisions, and improvements of Apache Flink…

medium.com

Windows operator: Heart of processing infinite streams in Flink

medium.com

How to run Apache Flink locally?

Step by Step guide for local installation of Apache Flink

medium.com

Apache Spark Streaming

Understanding Spark Streaming with Kafka and Druid

As a Data Engineer I’m dealing with Big Data technologies, such as Spark Streaming, Kafka and Apache Druid. All of them…

medium.com

Structured Streaming in Spark 3.0 Using Kafka

Using Docker, Spark 3.0, Kafka and Python

medium.com

State Transformation in Spark

Spark Streaming is able to handle state-based operations, i.e. operations containing a state susceptible to be modified…

medium.com

Spark integration with kafka (Streaming)

Apache Kafka — Spark structured streaming is one of the best combinations for building real time applications. In the…

medium.com

Storing Structured Streaming Data to a Hive Table

In this article we talk about how you can read data from files using Spark Structured Streaming and store the output to…

medium.com

Streaming evolving JSON from Kafka using Spark

Background

medium.com

Apache Beam

ETL with Apache Beam — Load Data from API to BigQuery

Building a serverless, scalable pipeline with dataflow

medium.com

Writing ETL Pipelines In Apache Beam — Part 1

Write Your Pipeline Code Once & Carry it to runner of your choice!

medium.com

Windowing in Cloud Dataflow (Fixed, Sliding, Session)

Learn basics of windowing concepts in dataflow with example data and visualization

medium.com

Apache Beam fundamentals

After almost 2 years into Apache Beam, processing terabytes of data per day, millions of events per minutes, I listed…

medium.com

Apache Kafka Streams

Creating a streaming data pipeline with Kafka Streams

Creating a rule-based streaming data topology

itnext.io

Implementing custom SerDes for Java objects using Json Serializer and Deserializer in Kafka…

In this article, I will show you how to implement custom SerDes that provides serialization and deserialization in JSON…

medium.com

Testing Kafka Streams - A Deep Dive

Tools for automated testing of Kafka Streams applications have been available to developers ever since the technology's…

www.confluent.io

Ingestion

Batch

Data Ingestion with Apache SQOOP

Overview

medium.com

Change Data Capture

Change data capture

Use a tool and don’t code it; it’s harder than we think

medium.com

CDC with Postgres Debezium to Kafka Strimzi

medium.com

Debezium and KStreams to Handle Data Aggregation

Prerequisite:

medium.com

Using PostgreSQL pgoutput plugin for change data capture with Debezium on Azure

Set up a Change Data Capture architecture on Azure using Debezium, Postgres and Kafka was a tutorial on how to use…

itnext.io

How Bolt Adopted Change Data Capture with Confluent for Real-Time Data & Analytics

This article describes why Bolt, the leading European on-demand transportation platform that operates across…

www.confluent.io

Real-Time

Upload files to AWS S3 using Apache Flume

When you choose Apache flume, there is no out-of-the box S3 sink available (at least till the date of the post). But…

medium.com

Messaging

Kinesis vs. Kafka

What is better from latency/throughput perspective? Let’s find out through benchmarks!

medium.com

Performance Comparison Between Apache Pulsar and Kafka: Latency

Apache Kafka is well known for its high performance. It is able to process a high rate of messages while maintaining…

medium.com

Benchmarking Kafka vs. Pulsar vs. RabbitMQ: Which is Fastest?

Apache Kafka ® is one of the most popular event streaming systems. There are many ways to compare systems in this…

www.confluent.io

Apache Kafka

Kafka Monitoring by Zabbix

Introduction

medium.com

Diving Deep into Kafka

The objective of this blog is to build some more understanding of Apache Kafka concepts such as Topics, Partitions…

medium.com

A brief introduction of Apache Kafka.

What is Kafka? How to use Apache Kafka? How to start with this real-time stream processing tool? Is Kafka the “magic…

medium.com

Streaming data into Kafka S01/E02 — Loading XML file

In this article, we will see how to read records from XML files and load them into Kafka.

medium.com

How to Pitch Kafka

Imagine you are a senior engineer working in a company that’s running its tech stack on top of AWS. Your tech…

medium.com

Streaming data into Kafka S01/E01 — Loading CSV file

Ingesting data files in Apache Kafka is a very common task. Among all the various file formats that we can find, CSV is…

medium.com

How to Deploy Kafka Connect on Kubernetes Using Helm Charts

By Amit Yadav, Sr. Engineer, DevOps at Ignite Solutions

medium.com

Confluent Platform: Connecting SAP HANA to Kafka

Introduction

medium.com

Confluent Platform: Connecting Splunk to Kafka

Introduction

medium.com

Apache Pulsar

Apache Pulsar celebrates the 300th contributor

Apache Pulsar celebrates the 300th contributor!

medium.com

Workflow Management

Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow

Choosing a task orchestration tool

towardsdatascience.com

Apache Airflow

Is Apache Airflow good enough for current data engineering needs?

The pros and cons of Apache Airflow as a workflow management platform for ETL & Data Science and deriving from that the…

towardsdatascience.com

Twitter Data Pipeline using Apache Airflow

Build a production pipeline using Airflow on Docker

towardsdatascience.com

5 essential tips when using Apache Airflow to build an ETL pipeline for a database hosted on…

Best practices for beginners working with Airflow

towardsdatascience.com

An Apache Airflow MVP: Complete Guide for a Basic Production Installation Using LocalExecutor

Simple and quick way to bootstrap Airflow in production

towardsdatascience.com

Generating Airflow Unit Test in MOMO

TLDR

medium.com

Airflow State 101

A Gentle Overview of State in Apache Airflow

towardsdatascience.com

How Does Tokopedia Take Airflow to the next level?

AirBnB’s golden son, AirFlow, is great. But we take it to another level.

medium.com

Apache Airflow

Setting up and creating your first workflow

towardsdatascience.com

Using Airflow to dynamically schedule workflows

Apache Airflow is a platform for scalable workflow scheduling and execution with detailed monitoring and management. In…

medium.com

Running Apache Airflow DAG with Docker

medium.com

Step by step: build a data pipeline with Airflow

Build an Airflow data pipeline to monitor errors and send alert emails automatically. The story provides detailed steps…

towardsdatascience.com

The Smarter Way of Scaling With Composer’s Airflow Scheduler on GKE

We love Airflow At StoreMaven. Being a super Data-Driven company, we run hundreds of daily tasks transforming, loading…

medium.com

Prefect

The All-New Prefect Server and UI

Two key Prefect products are becoming open-source projects in our biggest release ever

medium.com

Distributed data pipelines made easy with AWS EKS and Prefect

How to set up a distributed cloud workflow orchestration system within minutes and focus on providing value rather than…

towardsdatascience.com

Dagster

Dagster: The Data Orchestrator

As machine learning, analytics, and data processing become more complex and central to organizations, improving the…

medium.com

Argo

Designing Workflows Using Argo

Orchestrate parallel jobs on K8s with container-native workflow engine.

medium.com

Apache NiFi

Deep dive into a custom Apache Nifi processor

Just a couple of years ago, software projects didn’t exceed a bunch of files! You could store a project on a Floppy…

itnext.io

Cloud Providers

AWS

Getting started with large-scale ETL jobs using Dask and AWS EMR

In this tutorial, we will walk through setting up a Dask cluster on top of EMR (Elastic MapReduce), AWS’s distributed…

towardsdatascience.com

Getting started on AWS Data Wrangler and Athena

It is a well-known fact that s3 + Athena is a match made in heaven but since data is in S3 and Athena is serverless, we…

medium.com

An Introduction to Hadoop in EMR AWS.

Big Data being an integral part of Machine Learning, here we are going to process Freddie Mac Monthly Loan dataset…

medium.com

Essential Functionalities to Guide you While using AWS Glue and PySpark!

Introduction

medium.com

Data Analysis using AWS Glue/ Athena/ Quick sight

Technologies used:

medium.com

AWS Glue Made Easy

AWS Glue is a fully managed service for ETL. This service makes it easy and cost-effective to categorize, clean…

medium.com

The Amazon Athena Overview

What is AWS Athena and how to get working with an immersive database service?

medium.com

Pricing & Cost Optimization Athena

Introduction to Athena:

medium.com

Building cheaper and more performant EMR pipelines

A story about 60% EMR cost savings with a 500% processing time improvement

medium.com

Google Cloud

How to do text similarity search and document clustering in BigQuery

Use Document embeddings in BigQuery for document similarity and clustering tasks

towardsdatascience.com

Migrate Oracle Data to BigQuery using Dataproc and Sqoop

Want to migrate Oracle Database to BigQuery? Don’t know how to move the data from Oracle to BigQuery? Want to do this…

medium.com

Key BigQuery concepts

This is the part 5 of the series, Modernising a Data Platform and BigQuery concepts. In this part and next few parts…

medium.com

Step by Step Guide to load data into BigQuery

In this Part 6 of the series, “Modernisation of a Data Platform”, we would be focussing a little more on BigQuery’s key…

medium.com

5 reasons why BigQuery users should use dbt

How do you implement and test data pipelines with BigQuery to create intermediate tables and manage metadata and data…

medium.com

How long does Google Dataflow pick and process Pub/Sub messages in “real time”?

(A test and its results)

medium.com

DATAFLOW for Google Cloud Professional Data Exam

Programming model for Apache Beam

medium.com

Google DataFlow utility pipelines( File Conversion and Streaming data generation)

Dataflow Streaming Data Generator

medium.com

Transfer Data From GCS to S3 Using Google Dataproc With Airflow

GCP and AWS are the top two public cloud providers, and many organisations are moving from on premises to the cloud. Now a…

medium.com

Running PySpark Jobs on Dataproc Cluster using Workflow Templates

Dataproc is a managed Apache Spark and Apache Hadoop service that lets user take advantage of open source data tools…

medium.com

Costs and performance lessons after using BigQuery with terabytes of data

Learned lessons that will help you to put your BigQuery’s billing under control and increase your query performance.

medium.com

Demystifying BigQuery reservations

Learn how to mix flat-rate and on-demand pricing

medium.com

Introducing BQconvert — BigQuery Schema Converter Tool

BQconvert is a Python-based open source tool. BQconvert will help you to convert any database’s schema into bigquery…

medium.com

Azure

What (t*****k) happened to our Azure Data Factory costs

First Azure joke I ever heard was: “Azure Pricing — you need a PHD to understand that”.

medium.com

New Global Parameters feature in Azure Data Factory

Some of the new enhancements which Microsoft has introduced are related with Global parameters in Azure Data Factory…

medium.com

Dynamic Pipelines in Azure Data Factory

This article is focused on creating dynamic data factory pipelines, by parameterizing the static information added to the…

medium.com

Telemetry & usage data collection with Azure ETL tools (Part 1)

What are the challenges of collecting Telemetry & usage data and how to overcome them using Azure ETL tools

medium.com

Full And Incremental Data Loading Concept From Source To Destination Using Azure Data Factory

Objective: Our objective is to load data incrementally or fully from a source table to a destination table using Azure…

medium.com

Azure Data Factory: Organize your big data workflows in the cloud

The world which we live in is changing rapidly, especially after getting hit by the COVID-19 pandemic. The availability…

medium.com

Architecting Modern Data Engineering using Azure Databricks

There should be no doubts in anyone’s mind about how Big Data and AI are fueling the next revolution. Data is the new oil…

medium.com

Azure Databricks: How to configure Databricks CLI and multiples workspaces

Some months ago I started to work with Azure Databricks and I was learning many things about Databricks from scratch. I…

medium.com

Azure Event Hubs — Azure Databricks

Let’s implement a quick and simple IBOR scenario

medium.com

Databases

SQL vs NoSQL Databases — Are They Really That Different?

In this post, I talk about the differences between SQL and NoSQL databases, and help you decide which is best suited…

medium.com

NoSQL

Druid and SuperSet for Real-time monitoring at scale

Your company generates a lot of data, about several terabytes per day, and you want to find a tool that could take…

medium.com

Leverage Plugins to Ingest Parquet Files from S3 In pinot

How to use Spark to push billions of rows in S3 to Pinot without writing a single line of code?

medium.com

Columnar Stores — When/How/Why?

Demystifying Row vs Column Big Data stores (Parquet, Postgres, Avro, etc)

towardsdatascience.com

The Architecture of Amazon’s DynamoDB and Why Its Performance Is So High

DynamoDB is a NoSQL database provided by Amazon Web Service (AWS). It can provide extremely high performance, more than…

medium.com

Exploring the NoSQL Family

A (long) primer on a growing requirement for Data Scientist interviews

medium.com

A Beginner’s Reference to SQL vs. NoSQL

Many new developers wonder what the difference is between SQL and NoSQL. The subjects come up often in theoretical…

levelup.gitconnected.com

Cassandra Read / Write Consistency & Replication Factor

Environment:

medium.com

NoSQL-Apache Casandra Architecture

We all started learning about Database with a formal definition, right! “ A database is a collection of data that is…

medium.com

Explain the meaning, architecture, and components of HBase

Hbase is an open-source non-relational database management system written in Java and runs on the top of the HDFS. It’s…

medium.com

In-Memory & Data Grid

How to Set up a Fully Replicated Apache Ignite Cluster

Does one care for yet another key-value store ? not really. Fortunately, Apache Ignite is a lot more than that. It’s an…

medium.com

Modern Data Warehouses

What is a Data Warehouse, when and why to consider one

“When should you consider getting a data warehouse?”

towardsdatascience.com

Comprehensive Guide to the Data Warehouse

Learn about the role of the data warehouse as the master store of analysis-ready datasets.

towardsdatascience.com

A tale of two data warehouses — NewsUK’s data platform migration to Google Cloud (Part 1)

NewsUK’s data platform migration to BigQuery

medium.com

ClickHouse vs Amazon RedShift Benchmark

How does ClickHouse fare against Amazon RedShift? Here’s Altinity’s benchmark report.

medium.com

ClickHouse Dictionaries Explained

One of the most useful ClickHouse features is external dictionaries. How can you build elegant designs from this…

medium.com

Compression in ClickHouse

It might not be obvious from the start, but ClickHouse supports different kinds of compressions, namely two LZ4 and…

medium.com

What’s ClickHouse anyway ?

I’m writing this short blog post in which I want to very briefly talk about ClickHouse and explain what it is. I aim to…

medium.com