Data Engineering Digest #14 (Jul 2020)

Published in data.plumbers · 18 min read · Aug 21, 2020

This edition features a great article on data reliability: which metrics matter in a data platform (such as data downtime), how to measure them, and how they can impact other teams.

Another great article comes from Netflix, describing how they manage the costs of their data platform by combining AWS billing metrics with their S3 data inventory, data catalog, and job platforms. Netflix is able to track costs per area and even suggest TTLs for table partitions, which saves money in the end.

We also highlight the release of Flink 1.11.0, with a new Source API and support for Change Data Capture in Flink SQL. Hadoop 3.3.0 was also released, with support for ARM architectures and Java 11, as was Samza 1.5.0.

New Tools & Updates

5 Trends in Big Data and SQL to Be Excited About in 2020

Distributed data processing, collaborative SQL, and open-source

medium.com

Apache Flink 1.11.0 Release Announcement

06 Jul 2020 Marta Paes (@morsapaes) The Apache Flink community is proud to announce the release of Flink 1.11.0! More…

flink.apache.org

Samza - Announcing the release of Apache Samza 1.5.0

IMPORTANT NOTE: As noted in the last release, this release contains backward incompatible changes regarding samza job…

samza.apache.org

Apache Hadoop 3.3.0

Apache Hadoop 3.3.0 incorporates a number of significant enhancements over the previous major release line…

hadoop.apache.org

ksensehq/eventnative

EventNative is an open-source data collection framework - ksensehq/eventnative

github.com

Release 4.2 · hazelcast/hazelcast-jet

We added Change Data Capture (CDC) support for PostgreSQL and MySql. New connectors use Debezium to provide a…

github.com

Spark 3.0

Accelerating Queries and Reducing Data Transfer with Databricks' New BigQuery Connector

At Databricks, we are building a unified platform for data and AI. Data in enterprises lives in many locations, and…

databricks.com

How to Better Monitor Streaming Queries with Spark 3.0 Structured Streaming

This is a guest community post from Genmao Yu, a software engineer at Alibaba. Structured Streaming was initially…

databricks.com

How to Effectively Use Dates and Timestamps in Spark 3.0

Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing…

databricks.com

Data Engineering Role

The Changing Role and Importance of Data Engineering

Bruce Philp, Head of Data Engineering North America, Tom Goldenberg, Junior Principal Data Engineer and Toby Sykes…

medium.com

I Followed Data Engineer Resumes To Learn How To Break Into The Field

Here is what 50 data engineers’ resumes say about what experience and education is required

towardsdatascience.com

Intro to Data Engineering for Data Scientists

An overview of data infrastructure which is frequently asked during interviews

towardsdatascience.com

Courses & Training

Coursera’s Data Engineering with GCP Professional Certificate: better content or better marketing?

An honest review from a recently certified data student

towardsdatascience.com

Podcasts & Presentations

A Developer's Reflections on the Spark + AI 2020 Virtual Summit

Developers attending a conference have high expectations: what knowledge gaps they'll fill; what innovative ideas or…

databricks.com

GumGum speaks at Spark + AI Summit 2020

GumGum receives around 30 billion programmatic inventory impressions amounting to 25 TB of data each day. Inventory…

medium.com

Highlights from Spark+AI Summit 2020 for Data engineers

In these takeaways focusing on the data engineering topics, I’ll provide as resources, the most interesting talks I've…

engineering.klarna.com

DataOps For Streaming Systems With Lenses.io

An interview about how the Lenses.io platform addresses the DataOps challenges for streaming systems to power…

www.dataengineeringpodcast.com

Open Source Production Grade Data Integration With Meltano

An interview about the Meltano project and their goal of building a fully open source data integration platform that is…

www.dataengineeringpodcast.com

Making Wind Energy More Efficient With Data At Turbit Systems

An interview with the founder of Turbit Systems about how they are improving the efficiency and sustainability of wind…

www.dataengineeringpodcast.com

Build More Reliable Distributed Systems By Breaking Them With Jepsen

An interview with Jepsen creator Kyle Kingsbury about what he has learned about distributed systems by breaking them. A…

www.dataengineeringpodcast.com

Real Data Architectures & Platforms

An island of truth: practical data advice from Facebook and Airbnb

How-to guide for building core datasets at your company. Explore concepts on data lakes, warehouses, table schemas, and…

towardsdatascience.com

Building A Cloud-Native, Cloud-Agnostic Data lake

towardsdatascience.com

Prophecy: Teamwork’s Data Lake

You need a Data Lake.

engineroom.teamwork.com

Our Digital Transformation towards a Successful Big Data Platform

At Kamstrup we create smart metering solutions for energy, heat and water. This post is about how we ended up with our…

medium.com

ETL Data Pipeline and Data Reporting using Airflow, Spark, Livy and Athena for OneApp

Writing ETL Batch job to load data from raw storage, clean, transform, and store as processed data.

medium.com

Stream Processing Access Logs: LoKI Stack

Airtel’s customer reach is tremendously vast, and so comes the chunk of data each application creates every day. Such…

medium.com

Data Culture

How to Scale Your Data Team with Confidence

Hint: it takes more than a few fancy algorithms

towardsdatascience.com

What is “Big Data” — Understanding the History

A tour through history, how we ended up here, what capabilities we’ve unlocked, and where do we go next? Why Google…

towardsdatascience.com

How the chief data officer role becomes a new big data imperative

Exploring all the things you should know before hiring a chief data officer.

towardsdatascience.com

The Importance Of A Data-Driven Culture In Any Organisation

There are lots of statistics floating around the web detailing how much added value embracing data could bring to any…

medium.com

The 5 Gaps to being Data Driven

A tried and trusted approach

medium.com

From Advanced to Effective Analytics

A path to understanding and aligning the Lukes and Yodas in data driven businesses

towardsdatascience.com

Data Lake

From Data Lakes to Data Reservoirs

Create Clean, Beautiful, Protected Data Resources with Apache Spark and Delta Lake

towardsdatascience.com

Data Lake Icon: Visual Reference

If you were looking for a data lake icon or data lake logo, our creative team created a few for Openbridge

blog.openbridge.com

Implementing a Data Lake or Data Warehouse Architecture for Business Intelligence?

This article explains what business intelligence is, the process to deliver BI and compare a DWH and Data Lake…

towardsdatascience.com

Building A Cloud-Native, Cloud-Agnostic Data lake

towardsdatascience.com

Complete Big Data Solution for Click Stream Events

Setting up a Centralised Data Lake at Deutsche Telekom on AWS

medium.com

Common data engineering challenges and their solutions

Last year — before COVID-19 put a stop to conferences — I attended the Strata Data Conference in San Francisco. The…

medium.com

Data Architecture

What is a Data Mesh — and How Not to Mesh it Up

A beginner’s guide to implementing the latest industry trend: a data mesh.

towardsdatascience.com

MDM Data Architecture: Unify, optimize, and accelerate

Every enterprise has its unique requirement for an MDM architectural hub because they come with different barriers to…

medium.com

The transformations (you need to know) towards sustainable data architectures — Part-1

During the past years, there were tremendous transformations in data technologies and in companies strategies to be…

medium.com

Data Architects Modern Cloud Technology Stack

Photo by Donald Giannatti on Unsplash

medium.com

Building a Data Platform to Enable Analytics and AI-Driven Innovation

Build a Data Mesh & Set up MLOps

medium.com

Securing our Big Data Platform (Part 2)

When last we saw our heroes, they were battling Hadoop authentication in DPaaS. But our intrepid knights were able to…

medium.com

Part 2 — Modernising a Data Platform & BigQuery Mastery

In Part 1, we had a brief introduction to the concept of modernisation and key concepts of a traditional DWH.

medium.com

Scaling the data infrastructure to support your growing company

Small companies have small data needs, but when you start to scale those challenges get bigger.

medium.com

Data Governance

How to choose a data governance platform

What a data access control and governance platform should provide to ensure enterprise grade security and compliance in…

blog.privacera.com

What’s Hiding In Your Data? Data Discovery Is Crucial, But Not Sexy

Some technologies are inherently sexy. Take artificial intelligence. It is likely to take over the vast majority of…

medium.com

Master Data Management: challenges and basics

A Master Data Management system is the single point of truth of all data company-wide. The problem we want to manage is…

medium.com

Data Catalogs

A metadata comparison between Apache Atlas and Google Data Catalog

Learn how your metadata is structured on both systems.

medium.com

Essential features of Data Catalogs

Everything you need to know to build a sustainable, long-lasting data catalog solution

medium.com

DataDoc — The Criteo Data Observability Platform

How we regained control on our data ecosystem and tackled governance issues.

medium.com

Data Catalogs and your data rocks

In the first article of this series, we talked about why Data Catalogs are so trendy. It is time now to put some meat…

medium.com

Castor: get tech giants data discovery tools in a click

Data assets grow exponentially

medium.com

Data Quality

How to Fix Your Data Quality Problem

Introducing a better way to prevent bad data.

towardsdatascience.com

What is Data Reliability?

And how to use it to start trusting your data.

towardsdatascience.com

Data Quality Is Paramount

Well-designed, fast analytics systems using the shiniest of new technologies are worse than…

medium.com

Understanding Data Quality With Disney

Let’s explore a few common data quality themes & issues that affect our business outcomes!

medium.com

Cost Efficiency

Byte Down: Making Netflix’s Data Infrastructure Cost-Effective

By Torio Risianto, Bhargavi Reddy, Tanvi Sahni, Andrew Park

netflixtechblog.com

Data Formats

Delta Lake

Reliable and serverless data ingestion using Delta Lake on AWS Glue

The Big Data scenario

medium.com

delta lake and athena external tables

staging data lake files to aws s3 using delta lake tables to track changes for daily upserts of data, then making…

medium.com

Is Delta lake streaming a production-ready?

Delta lake comes with awesome features to overcome the outcomes of spark or any big data platform. So what outcomes…

medium.com

Apache Parquet

Insights Into Parquet Storage

Most of you folks working on Big data will have heard of parquet and how it is optimized for storage etc. Here I will…

medium.com

Apache Hudi

Query Hudi Dynamic Dataset in AWS S3 Data Lake With Athena

Background

medium.com

Avro

Darwin, Avro schema evolution made easy!

How to simplify Avro schema evolution with an easy and lightweight library

medium.com

Schema evolution is not that complex

When I first started working with Apache Kafka and the Confluent Schema Registry, I did not focus enough on how to…

medium.com

Data Pipelines

Why Traditional ETL Tools Are Less Relevant Today

Data has been the primary reason why computers & Information Technology evolved. In the modern age Data is the key…

medium.com

How to Build Advanced SQL

Building more maintainable, readable and optimized data workflows

medium.com

Why monitoring your big data analytics pipeline is important (and how to get there)

Your Big Data analytics pipeline provides insights to your business, now gain insights into your Big Data pipeline…

towardsdatascience.com

ML Pipelines

Machine Learning Powered Data Pipeline

In the course of the last years the interest in Data Science and Machine Learning has continuously increased. Thanks to…

medium.com

Data Processing

Dask vs Vaex for Big Data

Can you really process bigger than memory datasets on your laptop? Is Dask faster than Vaex? I did some benchmarking so…

towardsdatascience.com

Spark Vs. Snowflake: The Cloud Data Engineering (ETL) Debate!

Authors: Raj Bains, Saurabh Sharma

medium.com

Explain about Pig and Hive in Hadoop and their differences

Pig hadoop and Hive hadoop have a similar function. They are tools that ease the difficulty of writing MapReduce java…

medium.com

Apache Spark

Spark SQL: Adaptive Query Execution

Altering the physical execution plan at runtime.

medium.com

Koalas, or PySpark disguised as Pandas

One of the basic Data Scientist tools is Pandas. Unfortunately, the excess of data can significantly ruin our fun. That…

medium.com

Why my Cat uses Apache Spark for handling Big Data

A Cat briefly meowing about Spark and how Data Scientist manhandle big data problems.

medium.com

How to run a PySpark job in Kubernetes (AWS EKS)

A complete tutorial on deploying an EKS cluster with Terraform and running a PySpark job using the Spark Operator

towardsdatascience.com

5 Spark Best Practices

That I Wish I Knew Before Starting My Project

towardsdatascience.com

Best practices for caching in Spark SQL

Deep dive into data persistence in Spark.

towardsdatascience.com

Mastering Query Plans in Spark 3.0

Spark query plans in a nutshell.

towardsdatascience.com

Performance of Apache Spark on Kubernetes has caught up with YARN

Learn our benchmark setup, results, as well as critical tips to make shuffles up to 10x faster when running on…

towardsdatascience.com

How To Start with Apache Spark and Apache Cassandra

Apache Cassandra is a specific database that scales linearly. This has its price: specific table modelling…

medium.com

Accelerating Spark 3.0 Google DataProc Project with NVIDIA GPUs in 6 simple steps

Spark 3.0 + GPU is here. And it is a gamechanger

towardsdatascience.com

Understanding Spark UI

Apache spark provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource…

medium.com

Deep Dive into Apache Spark Array Functions

A practical guide to using array functions

medium.com

Apache Spark With PySpark — A Step-By-Step Approach

In a recent machine learning (ML) project at Carnegie Mellon, I’ve had to process relatively large streams of movie…

medium.com

Apache Hive

Supporting Multiple Time Zones on Hive with Single Data Source

At Udemy, most internal analytics dashboards rely on data residing in Pacific Time (PT) partitioned Hive tables…

medium.com

Can Hive Union Distinct Really Cause Data Loss?

Story of a 4 month long investigation into data loss issues with Hive

medium.com

HIVE — know the unknown

1. Is Every query in hive will trigger map reduce job?

medium.com

MapReduce

The Why and How of MapReduce

It’s a programming technique for manipulating large data sets. Hadoop MapReduce is a specific implementation of this…

levelup.gitconnected.com

Presto

Can Presto SQL on Hadoop Replace Your Data Warehouse?

Of all the open source SQL on Hadoop options, I think Presto is the most technically sound. It has a lot of what you…

medium.com

Load and Query CSV Files in S3 with Presto

How to load and query CSV files in S3 with Presto

towardsdatascience.com

Running Presto with Hive Metastore on a Laptop in 10 Minutes

This tutorial guides beginners to set up a stack of Presto and Hive Metastore on your local server to query data on S3…

medium.com

Running Presto in a Hybrid Cloud Architecture

Running Presto in a hybrid cloud environment with Alluxio as a data orchestration layer to help serve data more…

medium.com

Dask

Supercharging Hyperparameter Tuning with Dask

Dask improves scikit-learn parameter search speed by over 100x, and Spark by over 40x.

towardsdatascience.com

Stream Processing

Apache Flink

Running Apache Flink with RocksDB on Azure Kubernetes Service

Recently I was looking into how to deploy an Apache Flink cluster that uses RocksDB as the backend state and found a…

towardsdatascience.com

How to Setup Pyflink on Amazon EMR

What is Flink?

medium.com

Flink Architecture — Job manager, Task manager and Job client

Apache Flink is a distributed stream processing engine. It does use the Akka framework for its distributed processing…

medium.com

Windowing in Apache Flink

Windowing is a key feature in stream processing systems such as Apache Flink. Windowing splits the continuous stream…

medium.com

Time Attributes in Apache Flink

One of the major difference between stream and batch processing is the need to explicitly handle time in stream…

medium.com

Apache Spark Streaming

Spark Structured Streaming as a Batch Job?

Using Trigger ONCE functionality in streaming

medium.com

Solving Small file problem in spark structured streaming : A versioning Approach

Streaming jobs usually creates too many small files which impacts the performance of jobs and queries reading these…

medium.com

How to start Spark Structured Streaming by a specific Kafka timestamp

In this blog post I’m going to illustrate three options for how to process kafka events by a particular timestamp

medium.com

Apache Beam

Apache Beam Pipeline for Cleaning Batch Data Using Cloud Dataflow and BigQuery

There are various technologies related to big data in the market such as Hadoop, Apache Spark, Apache Flink, etc, and…

towardsdatascience.com

A Data Engineering Perspective on Go vs. Python (Part 2 — Dataflow)

In Part 2 of our comparison of Python and go from a Data Engineering perspective, we’ll finally take a look at Apache…

towardsdatascience.com

Performing Deduplication in Real Time streaming pipeline with Apache Beam stateful processing

medium.com

Apache Beam, Google Cloud Dataflow and Creating Custom Templates Using Python

medium.com

Tensorflow Extended, ML Metadata and Apache Beam on the Cloud

A practical and self-contained example using GCP Dataflow

towardsdatascience.com

Kafka Streams

Kafka Streams Window By & RocksDB Tuning

Kafka Streams offers a feature called a window. In this post, I will explain how to implement tumbling time windows in…

medium.com

Run Real-Time ETLs Using Kafka Streams API on OpenShift

Today more and more organizations are moving away from ETL, an ETL process in the form of Extract-> Transform -> Load…

medium.com

Introduction to Stream Processing using Kafka Streams

Kafka Streams is a Java library developed to help applications that do stream processing built on Kafka. To learn about…

medium.com

Kafka Stream Basic Operations

Basic operations of Kafka Stream, a real-time data transformation library for Kafka (works well with Java Spring too).

medium.com

Change Data Capture

3 ways to capture delete operation from the database when you build data pipeline

Introduction

medium.com

Kafka Connect: How to create a real time data pipeline using Change Data Capture (CDC)

Microservices, Machine Learning & Big Data are making waves among organizations. Curiously they all share the same…

medium.com

Debezium - CDC / Oracle

Imagine the following scenario, we have a CUSTOMERS table in the ORACLE database that we need to capture instantly…

medium.com

Implementing the Transactional Outbox Pattern with Debezium in Quarkus

This is the second instalment in a series on building a microservice from the ground up with Quarkus, Kotlin and…

levelup.gitconnected.com

Change Data Capture and Kafka

Kafka (originally came from Apache) is an open source streaming platform to deliver high performance and idempotent…

medium.com

Using Change Data Capture (CDC) in Scylla

Change Data Capture (CDC) allows users to track data updates in their Scylla database. While it is similar to the…

medium.com

Storage

Apache HDFS

HDFS is dead?

Context: “HDFS is dead and object stores like S3/GCP Buckets/Blob store is the way forward” came up in our office…

medium.com

HDFS — Hadoop Distributed File System

Overview

medium.com

Messaging

Comparing Apache Kafka and Apache Pulsar

When to use Pulsar and when to use Kafka, and why.

blog.softwaremill.com

Pulsar vs. Kafka — Part 1 — A More Accurate Perspective on Performance, Architecture, and Features

This blog compares two of the most favored messaging systems on the market — Pulsar and Kafka in performance…

medium.com

Pulsar vs Kafka — Part 2 — Adoption, Use Cases, Differentiators, and Community

This is Part 2 of a two-part series in which we share our perspectives on Pulsar vs. Kafka. In Part 1, we compared…

medium.com

Apache Kafka

Building a Real-Time Leaderboard with Kafka Connect and KSQL

Pratilipi is the largest Indian language storytelling platform. Pratilipi is currently (July 2020) home to 250,000+…

medium.com

Data Streaming With Apache Kafka

Streaming platform capable of handling trillions of events, distributed, horizontally-scalable, fault tolerant, commit…

medium.com

Getting Started with Apache Kafka — Beginners Tutorial

The objective of this article is to build an understanding of What is Kafka, Why Kafka, Kafka architecture, producer…

medium.com

How To Produce/Consume Messages With Java, Apache Camel and Kafka

A simple pub-sub message system with Java using Apache Camel and Kafka

medium.com

Monitoring Confluent Cloud Kafka with Datadog — Natural Intelligence

When I came across Apache Kafka and its concept of a streaming platform I asked myself — how will I monitor it?

medium.com

How to Use Protobuf with Apache Kafka and Schema Registry

Full guide on working with Protobuf in Apache Kafka

medium.com

Event Sourcing From Static Data Using Kafka

A different distributed scheduler approach.

medium.com

Apache Pulsar

Announcing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pulsar

We are excited to announce that StreamNative and OVHcloud are open-sourcing “Kafka on Pulsar” (KoP).

medium.com

Workflow Management

Apache Airflow

AIP-31 —Airflow Functional DAG Definition

Intro — AIP-31

medium.com

Dependencies between DAGs in Apache Airflow

In Apache Airflow, we can create dependencies between different DAGs, and run downstream DAGs that depend on…

towardsdatascience.com

How to build a Data Pipeline with Airflow

Some fundamentals you need to know to start using Airflow with Python

towardsdatascience.com

Data Pipelines With Apache Airflow

Apache Airflow — The most widely used tool for workflow orchestration

medium.com

Apache Airflow in 5 minutes

A quick introduction to Apache Airflow (A beginners guide)

medium.com

APACHE AIRFLOW: What it is and why you should start using it

Things you need to know to get started with Apache Airflow. Apache Airflow’s considerable benefits and its…

medium.com

Data Pipeline Orchestration on Steroids: Apache Airflow Tutorial, Part 1

By Rafael Pierre

medium.com

An Overview of Apache Airflow Architecture

From Data Pipelines with Apache Airflow by Bas P. Harenslak and Julian Rutger de Ruiter

medium.com

AirFlow

Airbnb is a fast growing, data informed company. Our data teams and data volume are growing quickly, and so does the…

medium.com

A journey to Airflow on Kubernetes

… or how I got it to work, piece by piece, in a logical way

towardsdatascience.com

Deploy and Run Apache Airflow on AWS ECS Following Software Development Best Practices

This blog post is covering how to apply best practices in the deployment of Apache Airflow. In order to have true…

medium.com

Airflow DAG Performance and Reliability

Set up measures to ensure that data made available to the business users is always reliable when they want it.

medium.com

Orchestrating machine learning experiments for MLOps using Apache Airflow

Nowadays that more and more machine learning models are going to production, the need to operationalize the overall…

medium.com

Luigi

A Tutorial on Luigi, Spotify’s Pipeline

The world’s second most famous plumber’s best tool!

towardsdatascience.com

Prefect

Dancing in the Dark

What happens if the Prefect Cloud API goes down?

medium.com

Cloud Providers

AWS

My Review on AWS Lake Formation

How can AWS Lake Formation help you building Data Lakes?

medium.com

Real-time Data Processing With Kinesis Data Analytics

TL;DR — Using Kinesis Data Analytics to get real-time insights in Flanders’ traffic situation!

medium.com

How to stream real-time data into Snowflake with Amazon Kinesis Firehose

Businesses today can benefit in real-time from the data they continuously generate at massive scale and speed from…

towardsdatascience.com

Modeling Graph Relationships in DynamoDB

Koan is centered around goals, and how those goals connect to teams and people within a company. These connections can…

medium.com

AWS Glue: Amazon’s New ETL Tool

What is AWS Glue and do you need it?

towardsdatascience.com

Can you modify data stored on S3?

Yes, you can! By using transactional Hive ORC tables

medium.com

Getting started with EMRFS

The EMR File System (EMRFS) is an implementation of HDFS that all Amazon EMR clusters use for reading and writing…

medium.com

Connect Jupyter Notebook to AWS Glue Endpoint

If I am not wrong, then almost everyone in data engineering industry have heard of Apache Spark and if not (highly…

medium.com

The ‘Small Files Problem’ in AWS Glue

AWS Glue is a pay-as-you-go ‘serverless’ extract, transform and load (ETL) tool, using Apache Spark under the covers to…

medium.com

AWS — ETL transformation

Introduction

medium.com

Write Streaming data to multiple Data Stores- AWS Kinesis

I have realized that some of you might be a little (like I was) confused on how streaming data works. How this data is…

medium.com

Query Hudi Dynamic Dataset in AWS S3 Data Lake With Athena

Background

medium.com

Getting Started with AWS EMR Like a Boss

Guide to setup AWS EMR with Spark, YARN Queues and Zeppelin.

medium.com

Google Cloud

BigQuery: the unlikely birth of a cloud juggernaut

How 10 engineers transformed cloud data analytics

towardsdatascience.com

How Servian helped REA Group repatriate 500TB of BigQuery data

<tl;dr>: REA Group engaged Servian to help plan and successfully deliver the repatriation of its core Google Cloud data…

medium.com

BigQuery Cost and Performance Optimization

Multiply the performance while dividing the cost by the same factor.

medium.com

Get started with BigQuery and dbt, the easy way

Find here the quickest way to get started with dbt and BigQuery using only free offerings from Google Cloud.

towardsdatascience.com

BigQuery: delta to latest — all history

The Challenge

medium.com

Apache Beam Pipeline for Cleaning Batch Data Using Cloud Dataflow and BigQuery

There are various technologies related to big data in the market such as Hadoop, Apache Spark, Apache Flink, etc, and…

towardsdatascience.com

Getting Started with Bigtable on GCP

Bigtable is a fully managed NoSQL database on Google Cloud. If you’ve never looked at it before Bigtable can seem a…

medium.com

6 Steps to Migrate to Cloud Spanner

This is an overview of steps necessary to migrate to Cloud Spanner with some application downtime (zero downtime…

medium.com

Cloud Spanner: Read Statistics

Cloud Spanner is Google’s fully managed scalable relational database service. We recently announced a new feature…

medium.com

Cloud DataFlow: A Unified Model for Batch and Streaming Data Processing

Dataflow is a fully-managed service to execute pipelines within the Google Cloud Platform ecosystem. It is a service…

medium.com

Scheduled serverless dbt + BigQuery service

My colleague Felipe Hoffa recently published a blog post titled Get started with BigQuery and dbt, the easy way. More…

medium.com

Azure

Azure Synapse Analytics as a Cloud Lakehouse: A New Data Management Paradigm

Will enterprise data lake & enterprise data warehouse (EDW) coexist?

towardsdatascience.com

Deep Dive — Azure Cosmos Partitions and PartitionKey

Modeling Partition Key and Partition effectively in Azure Cosmos

medium.com

An end-end analytics solution with Azure Synapse Analytics

Ever since Microsoft announced the Azure Synapse, I had been waiting for its public preview. Recently Microsoft…

medium.com

How to implement Slow changing Dimension 1 in Azure Data factory

In this article, I will talk about how we can implement slow changing dimension (SCD Type 1).

medium.com

Azure Data Factory- Azure functions as pipeline activity- Part1

In my previous post I talked about how easy it is to use data factory to copy your data from on-premise to Azure cloud…

medium.com

Databases

Relational Databases vs NoSQL

This discussion here will give an eagle’s eye view of the difference between Relational Database and NoSQL.

medium.com

Popular Myths About Relational & No-SQL Databases Explained

What’s no longer true about relational and No-SQL databases in 2020?

medium.com

NoSQL

Apache Cassandra — Distributed Row-Partitioned Database for Structured and Semi-Structured Data

Open source distributed row-partitioned database management system (distributed DBMS) to handle large amounts of…

medium.com

Tombstones in Apache Cassandra

Tombstones are a sophisticated mechanism to handle deletes in a distributed datastore like Apache Cassandra.

medium.com

Getting the Most out of Lightweight Transactions in Scylla

By Kostja Osipov, July 15, 2020

medium.com

BigTable — Almost All You Need to Know

Say hello to the database that powers many core Google services

medium.com

Bulk Loading data into JanusGraph

Early this year, I was tasked to do a POC work to check the feasibility of switching our Identity Resolution solution…

medium.com

Apache Cassandra for Structured and Semi-Structured Data

Apache Cassandra — Distributed Row-Partitioned Database for Structured and Semi-Structured Data

medium.com

Apache Cassandra Benchmarking: 4.0 Brings the Heat with New Garbage Collectors ZGC and Shenandoah

AUTHOR: Alexander Dejanovski, DataStax Originally posted on datastax.com

medium.com

Why Apache Cassandra Rocks

Apache Cassandra provides high availability, scalability, no SPOF and even Multi-DC support out of the box.

medium.com

Relational

Indexing Very Large Tables

A short guide to the best practices around indexing large tables and how to use partitioning to ease the load on…

towardsdatascience.com

Spanner’s SQL Story

Spanner is a distributed database Google initiated a while ago to build a highly available and highly consistent…

medium.com

Modern Data Warehouses

Our journey to a new data warehouse

How come that we decided to remodel an existing data warehouse and how exactly we’ve done it

medium.com

Setting up Snowflake for your analytics stack

Getting the most out of Snowflake requires giving thought to configuration from the beginning, not 6 months down…

towardsdatascience.com

Leveraging Webhooks for Real-time Data Warehousing

Introduction to Webhooks, an event-driven alternative to Polling

medium.com