Data Engineering Digest #12 (May 2020)

Maycon Viana Bordin

Published in

data.plumbers

17 min read · Jun 19, 2020

New Tools

Use Delta Lake 0.6.0 to Automatically Evolve Table Schema and Improve Operational Metrics

Learn more about Delta Lake release 0.6.0 and how it will allow you to automatically evolve table schema in merge…

databricks.com

A multi-node, elastic, petabyte scale, time-series database on Postgres for free (and more ways we…

Today we have a big announcement: we're officially making multi-node TimescaleDB, a petabyte-scale distributed…

blog.timescale.com

TerminusDB

TerminusDB is an open source model driven graph database for knowledge graph representation designed specifically for…

terminusdb.com

MetricsDB: TimeSeries Database for storing metrics at Twitter

We covered Observability Engineering's high level overview in blog posts earlier here and its follow up here. Our time…

blog.twitter.com

25 Hot New Data Tools and What They DON’T Do

“Wait, do tool X and tool Y work together? I thought they were competitive.”

medium.com

Spark 3.0

How to Speed up SQL Queries with Adaptive Query Execution

This is a joint engineering effort between the Databricks Apache Spark engineering team - Wenchen Fan, Herman van…

databricks.com

Preview Apache Spark 3.0 Using the Databricks Runtime 7.0 Beta

We're excited to announce that the Apache Spark 3.0.0-preview2 release is available on Databricks as part of our new…

databricks.com

How Python type hints simplify Pandas UDFs in Apache Spark 3.0

Pandas user-defined functions (UDFs) are one of the most significant enhancements in Apache Spark for data science…

databricks.com

Data Engineering Role

How To Think About Data

The real difference between a data engineer and a data scientist — how they think

towardsdatascience.com

Data Engineer, Data Science and Data Analyst — What’s the Difference?

Get to know the professions in the data field

towardsdatascience.com

Complete Data Engineer’s Vocabulary

Concepts that data engineers must know in 10 words or less

towardsdatascience.com

Voicing for Data Engineering, the unsung hero

How I have switched gear to help businesses kick-start their data infrastructure and reporting pipeline

towardsdatascience.com

Data Engineering: What is it?

towardsdatascience.com

Courses & Training

Data Engineering on GCP Specialisation: A Comprehensive Guide for Data Professionals

If you are a data professional considering upskilling, there is no shortage of learning options, but if you are looking…

towardsdatascience.com

5 Free Courses to learn Apache Spark in 2020

Hello guys, if you are thinking of learning Apache Spark to start your Big Data journey and looking for some awesome free…

medium.com

How to Prepare for the Confluent Certified Operator for Apache Kafka (CCOAK) exam

Getting the Apache Kafka certification from Confluent is a great way of making sure to have your skills recognized by…

medium.com

Datastax Certified Apache Cassandra Developer | Exam tips 2020

Preparation guidelines and resources for Apache Cassandra 3.x Developer Associate Certification

medium.com

Podcasts & Presentations

Mapping The Customer Journey For B2B Companies At Dreamdata

An interview about the challenges of tracking the customer journey for B2B companies and how Dreamdata is addressing…

www.dataengineeringpodcast.com

Power Up Your PostgreSQL Analytics With Swarm64

An interview with Swarm64 CEO Thomas Richter about optimizing PostgreSQL on high performance hardware and FPGAs for…

www.dataengineeringpodcast.com

StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar

An interview with StreamNative co-founder Sijie Guo about his experience contributing to the Pulsar framework for…

www.dataengineeringpodcast.com

Enterprise Data Operations And Orchestration At Infoworks

An interview with Amar Arsikere about the complexities of data operations at enterprise scale and the approach that…

www.dataengineeringpodcast.com

Machine Learning through Streaming at Lyft

Sherin Thomas talks about the challenges of building and scaling a fully managed, self-service platform for stream…

www.infoq.com

From Batch to Streaming to Both

Herman Schaaf talks about how the streaming data platform at Skyscanner evolved over time. This platform now processes…

www.infoq.com

Kafka: A Modern Distributed System

Tim Berglund covers Kafka's distributed system fundamentals: the role of the Controller, the mechanics of leader…

www.infoq.com

Real Data Architectures

Enabling Data-Driven Decisions

A story of building central Data Platform at the Financial Times using the latest technology trends.

medium.com

Data Culture

Bye Bye Big Data!

Everyone used to say that big data was the future. Was it wrong? What about now?

towardsdatascience.com

Choose Smart Data Over Big Data to Save Your Business

From a data engineer’s perspective

towardsdatascience.com

Data Lake

A Complete Guide On Serverless Data Lake using AWS Glue, Athena and QuickSight

Step-By-Step Walkthrough on ETL Data Processing, Querying and Visualization in a Serverless Data Lake

towardsdatascience.com

Data Lake — Design For Better Architecture, Storage, Security & Data Governance

Data-driven outcomes, forecasting, and predicting business trends is essential to any business. Today we see at least…

medium.com

Datalake File Ingestion: From FTP to AWS S3

Transferring files from FTP server to AWS s3 using Paramiko in Python

towardsdatascience.com

Data Lake Analytics

A data lake is an ever-evolving set of technologies used to store structured, semi-structured and unstructured data…

medium.com

Tale of Data Warehouse, Data Lake and Data Pond

As a data engineer, you are often confused about Data Warehouse, Lake and Pond. What are they and the most important…

medium.com

Data Governance

How Privacy Killed RBAC

This is a short story about how pressures from the real world can have a grave impact on our world of technology

medium.com

Principles of lazy data documentation — and how to get your team onboard

Documenting data is a pain. Is there hope for the lazy? Here are a handful of tools and techniques for low-friction…

blog.quiltdata.com

Collaborative Data Catalog for DataOps

Modernizing data platforms can create challenges. Although traditional data catalogs can deliver some visibility across…

medium.com

Implementation of Decentralized Data Quality

A Viewpoint shift from Data Quality to Collaborative Data Quality

towardsdatascience.com

DataOps

Make your data, and your organization ready for AI

Ever since AI was thrust into the spotlight with Watson in 2007, organizations have wanted to leverage AI in their…

medium.com

A DataOps perspective on App and Data Democratization

How DataOps facilitates access to data and apps, and helps to scale a data-driven company

medium.com

Why DataOps Is Here to Stay

With DataOps, data engineers and data scientists can work together, bringing a level of collaboration and…

towardsdatascience.com

4 Easy Ways to Start DataOps Today

The primary source of information about DataOps is from vendors (like DataKitchen) who sell enterprise software into…

medium.com

What the Heck is *Ops?

A Guide to Ops Terms and Whether We Need Them

medium.com

The Seven Pillars of DataOps

What is DataOps?

medium.com

Data Formats

GAVRO — Managed Big-Data Schema Evolution

Wouldn’t it be great to build a data ingestion architecture that was resilient to change? More specifically, resilient…

towardsdatascience.com

Delta Lake

Delta Lake in production: a critical evaluation

I have seen several posts and tutorials on Delta Lake using “Hello World” kind of examples, where everything works…

medium.com

Delta Lake

Hola!

medium.com

Delta Lake: Schema Enforcement & Evolution

medium.com

Use Delta Lake 0.6.0 to Automatically Evolve Table Schema and Improve Operational Metrics

Learn more about Delta Lake release 0.6.0 and how it will allow you to automatically evolve table schema in merge…

databricks.com

Apache Avro

How to deserialize AVRO messages in Python Faust?

Faust is a stream processing library, porting the ideas from Kafka Streams to Python.

medium.com

Apache Parquet

Cool Sh*t I Just Learned: Parquet’s Predicate Pushdown

Does it really exist? A pursuit of finding Parquet’s Predicate Pushdown empirical evidence 🕵🏻‍♂️

medium.com

Data Pipelines

Self-serve data pipelining platform

By Karuna Saini (Engineer, Data Platform)

medium.com

How to build a scalable big data analytics pipeline

Set up an end-to-end system at scale

towardsdatascience.com

Real Time Data Pipeline — More Than We Expected

When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming…

medium.com

Build Your First Data Pipeline in just Ten Minutes

Step-by-step process to build your first data pipeline with a real-world use case using PDI.

medium.com

Data Quality Tools

Introducing a new pySpark’s library: owl-data-sanitizer

A library to democratize data quality within companies with pySpark data pipelines.

towardsdatascience.com

Data Processing

Getting started with Spark and batch processing frameworks

What you need to know before you dive into big data processing with Apache Spark and other frameworks.

blog.insightdatascience.com

Hadoop vs. HDFS vs. HBase vs. Hive

What’s the difference?

medium.com

Beyond Pandas: Spark, Dask, Vaex and other big data technologies battling head to head

API and performance comparison on a billion-rows dataset. What should you use?

towardsdatascience.com

Apache Spark

DIY: Apache Spark & Docker

Set up a Spark cluster in Docker from scratch

towardsdatascience.com

Apache Spark with Kubernetes and Fast S3 Access

Use Spark in a simple and portable way on-premise and in the cloud

towardsdatascience.com

Working with JSON in Apache Spark

Denormalising human-readable JSON for sweet data processing

medium.com

Revealing Apache Spark Shuffling Magic

Five Important Aspects of Apache Spark Shuffling to know for building predictable, reliable and efficient Spark…

medium.com

Apache Spark BigQuery Connector — Optimization tips & example Jupyter Notebooks

Learn how to use the BigQuery Storage API with Apache Spark on Cloud Dataproc

medium.com

Apache Spark With DynamoDB Use Cases

Code examples of Java Spark applications that write and read data from DynamoDB tables running in an AWS EMR cluster.

medium.com

The Most Complete Guide to pySpark DataFrames

A bookmarkable cheatsheet containing all the Dataframe Functionality you might need

towardsdatascience.com

Big Data: Spark, AWS & SQL

simple cloud computing with S3 & EMR

medium.com

Spark Serialization Errors

A deep dive into the causes of serialization errors in Spark

medium.com

Quickstart: Apache Spark on Kubernetes

Give your big loads a smooth sailing using the native Apache Spark Operator for Kubernetes

towardsdatascience.com

Dynamic Partition Pruning in Spark 3.0 - DZone Big Data

With the release of Spark 3.0, big improvements were implemented to enable Spark to execute faster and there came many…

dzone.com

Apache Spark: Caching

Apache Spark provides an important feature to cache intermediate data and provide significant performance improvement…

towardsdatascience.com

The Pros and Cons of Running Apache Spark on Kubernetes

Kubernetes support was only recently added for Spark. How does it compare to other deployment modes and is it worth it?

towardsdatascience.com

Five Ways to Perform Aggregation in Apache Spark

Aggregation being the widely used operator among data analytics assignments, Spark provides a solid framework for the…

medium.com

UDAF and Aggregators: Custom Aggregation Approaches for Datasets in Apache Spark

Aggregations on data records is a necessary part of a data analytics exercise and therefore Spark is designed to put…

towardsdatascience.com

How to modernize and scale VaR calculations and risk management with Apache Spark, Delta Lake and…

Managing risk within the financial services, especially within the banking sector, has increased in complexity over the…

databricks.com

Apache Hive

Understanding Hadoop Hive

Hive is a data warehouse system which is used for querying and analysing large datasets stored in HDFS. It processes…

medium.com

Hive UDAFs, or why Java’s type system sucks

Apache Hive is one of the most ubiquitous big data technologies out there; its job is to enable all kinds of data…

medium.com

Apache Hadoop

Apache YARN & Zookeeper

All about Resource Allocation and High Availability

towardsdatascience.com

Partition Management in Hadoop

Our solution to the Hadoop small files problem

medium.com

Presto

Presto sails on the Ship of Theseus

The Open Source Software, Presto, presents a real-life case study of the philosophical problem: The Ship of Theseus.

medium.com

Querying Multiple Data Sources with a Single Query using Presto’s Query Federation

In this first post in a new series, we introduce Presto and show how to use it to combine data from several sources…

medium.com

Apache Drill

Query data in Google Cloud Storage with SQL using Apache Drill

Google Cloud users are no strangers to BigQuery. It’s a petabyte-scale serverless warehouse with SQL interface, blazing…

medium.com

Apache Sqoop

RDBMS to HDFS and back

towardsdatascience.com

A Step by Step Guide for Loading Oracle Datasets into Hadoop using Docker Containers

Tutorial : Oracle To Hadoop with Docker containers

medium.com

Apache Sqoop — hide your password!

Apache Sqoop is a versatile and very useful tool when it comes to gathering data for your Big Data project.

medium.com

Stream Processing

Realtime Stream Processing Architectural Solution

In my previous post I have described the feasibility study on technology selection for a realtime stream processing…

medium.com

To stream or to not stream. That is a question.

Data streaming through Kafka is becoming an essential part of any data application. We are using Kafka mostly thanks to…

medium.com

Apache Flink

Stream Processing Best Practices with Apache Flink

Apache Flink is used for building a pipeline for streaming data analysis. This section discusses best practises I have…

medium.com

Apache Flink Series 9 — How Flink & Standalone Cluster Setup Work?

In this post, I am going to explain, how Flink starts itself, and what happens when you submit your job to the…

medium.com

Flink Map, CoMap, RichMap and RichCoMap Functions

Flink has a powerful functional streaming API which lets application developers specify high-level functions for data…

medium.com

Flink Checkpointing

State management comes out of the box for Flink and it is considered as the first-class citizen. While Flink abstracts…

medium.com

SQL Editor for Apache Flink SQL

This is the very first version of the SQL Editor for Apache Flink.

medium.com

Apache Spark

Spark Streaming with HTTP REST endpoint serving JSON data

Speed up development and testing of spark structured streaming pipelines using HTTP REST endpoint as streaming source.

medium.com

Streaming from Kafka to PostgreSQL through Spark Structured Streaming

medium.com

Integration Testing in Spark Structured Streaming

A guide for writing integration test for a Spark Structured Streaming Application

medium.com

Apache Flume

Trickle-feed unstructured data into HDFS using Apache Flume

towardsdatascience.com

Clustering & Resources

Apache YARN & Zookeeper

All about Resource Allocation and High Availability

towardsdatascience.com

Change Data Capture

Stream your data changes in MySQL into ElasticSearch using Debezium, Kafka, and Confluent JDBC…

How to stream data changes from MySQL into Elasticsearch Index

towardsdatascience.com

Faster Change Data Capture for your Data Lake

The intent of this article is to discuss and present a new, faster approach to performing Change Data Capture (CDC) for…

medium.com

Debezium

Stream your data changes in MySQL into ElasticSearch using Debezium, Kafka, and Confluent JDBC…

How to stream data changes from MySQL into Elasticsearch Index

towardsdatascience.com

Data liberation pattern using Debezium engine

Integrating legacy applications into your Event-Driven Architecture.

medium.com

CDC made Easy with KTable, Debezium and Kafka Connect

by Karthikeyan Siva Baskaran and Somanath Sankaran

medium.com

Debezium- Production Deployment Preparation

If you are working on Debezium and plan to move it to production, I will suggest you go through this self-explanatory…

medium.com

Debezium Custom Converters

Creating custom converters using Debezium’s new SPI to override value conversions

medium.com

MySQL to PostgreSQL using Kafka Connect

Our objective would be to quickly set up a data pipeline and move data from MySQL to PostgreSQL.

medium.com

Storage

Apache HDFS

Hadoop Distributed File System

A comprehensive guide to understanding HDFS and its inner workings

towardsdatascience.com

Messaging

Apache Kafka

Apache Kafka vs. Enterprise Service Bus (ESB)

Typically, an enterprise service bus (ESB) or other integration solutions like extract-transform-load (ETL) tools have…

medium.com

Is Apache Kafka a Database?

Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and…

medium.com

Kafka with AVRO vs., Kafka with Protobuf vs., Kafka with JSON Schema

Experiments with Kafka serialisation schemes — playing with AVRO, Protobuf, JSON Schema in Confluent Streaming…

medium.com

SSL Authentication with Apache Kafka

Apache Kafka is the next big thing in Event-driven architectures and Microservices ecosystem and with its fast…

medium.com

Schema and Topic Design in Event-Driven Systems (featuring Kafka!)

In the microservices world, communication between services provides a host of problems — one of them being “how do we…

medium.com

Ballerina Kafka Serialization with Avro

This article will demonstrate how to use Apache Avro serialization / deserialization in Ballerina Kafka producers and…

medium.com

Apache-Kafka — Stream Avro Serialized Objects In 6 Steps.

Set up the environment for Kafka (Kafka server, Zookeeper, Schema Registry) and Docker.

medium.com

Learn how to use Kafkacat — the most versatile Kafka CLI client

Kafkacat is an awesome tool and today I want to show you how easy it is to use it and what are some of the cool things…

medium.com

Kafka for Engineers

Here are things about Kafka that you need to understand as a software engineer

levelup.gitconnected.com

The streaming bridges — A Kafka, RabbitMQ, MQTT and CoAP example

medium.com

3 Libraries You Should Know to Master Apache Kafka in Python

First thing first. Why Kafka? Kafka is intended for boosting an event-driven architecture. It empowers the architecture…

towardsdatascience.com

Intro to streaming data and Apache Kafka

Overview of streaming data architectures and why Apache Kafka has become so popular

towardsdatascience.com

Kafka Internals For Troubleshooting

It is not necessary to know Kafka internals in order to run it but understanding these internals helps to provide…

medium.com

Apache Pulsar

One new contestant to bring down the King: Apache Pulsar

Nowadays we’re in a new age of Event-Driven Architecture rise. This is not the first time we’ve lived that. Before…

medium.com

Apache Pulsar 2.5.2

The Apache Pulsar 2.5.2 version is a huge effort from the community, with over 56 commits, general improvements and bug…

medium.com

Workflow Management

Apache Airflow

Getting started with Airflow locally and remotely

Airflow has been around for a while, but it has gained a lot of traction lately. So what is Airflow? How can you use…

towardsdatascience.com

How We Monitor Apache Airflow in Production

A quick guide on what (and how) to monitor to keep your workflows running smoothly.

blog.gojekengineering.com

3 Steps to Advanced Alerting on Airflow with Databand

As a data engineer, you need to create trust in your data. You need to be aware of problems in pipeline timeliness and…

medium.com

Build your first data warehouse with Airflow on GCP

What are the steps in building a data warehouse? What cloud technology should you use? How to use Airflow to…

towardsdatascience.com

A Complete Guide to Setting up a Local Development Environment for Airflow (with Docker and…

Collaborate on Airflow workflows with ease using this setup that includes a docker-compose, PyCharm, and DAG validation…

medium.com

A Gentle Introduction To Understand Airflow Executor

Let’s discuss details about what’s Airflow executor, compare different types of executors to help you make a decision

towardsdatascience.com

Automated Reporting System Using Airflow

Configure scheduled reports in under 15 minutes

medium.com

Apache Airflow and Kubernetes — Pain Points and Plugins to the Rescue

I explore some of the Airflow pain points we struggled with and how plugins were used to address them.

medium.com

From Zero to Apache Airflow Contribution — Part 1

How to make your first contribution to Apache Airflow project

medium.com

From Zero to Apache Airflow Contribution — Part 2

You are in part 2 of how to make your first Apache Airflow contribution. If you haven’t started in part 1, I advise you…

medium.com

How to create an ETL pipeline in Python with Airflow

Simple ETL with Airflow

medium.com

Airflow: how and when to use it (Advanced)

Beyond basic concepts of Airflow, there is a lot to consider. Choosing an operator and DAG structure is important for…

towardsdatascience.com

Machine Learning Workflow

Kubeflow 1.0 — Quick Overview

Kubeflow is an open-source and free machine learning Kubernetes-native platform for developing, orchestrating…

medium.com

Developing Machine Learning Pipelines

In Data Science, you are only as good as the way you structure your work

towardsdatascience.com

Operationalization of ML Pipelines on Apache Mesos and Hadoop using Airflow

An architecture for bringing ML models into production at NEW YORKER

towardsdatascience.com

Cloud Providers

AWS

Add Newly Created Partitions Programmatically into AWS Athena schema

Python script to load new partitions using Glue Job. Simple. Fast. Clean.

medium.com

Lessons From Processing 300 Million Messages a Day

Common problems when scaling AWS Glue

medium.com

Challenges during migration from On-Premise to AWS Cloud

In an on-premise setup, infrastructure management is one of the challenges for a company. Nowadays, companies…

medium.com

Glue ETL — Redshift Source

Redshift is a common destination for ETL pipelines. Using Redshift as an ETL source is not very common. It also creates…

medium.com

Extract, Transform, Load (ETL) — AWS Glue

Learn how to use AWS Glue for ETL operations in Spark on Novel Corona Virus Dataset

towardsdatascience.com

7 Things I Found Annoying About AWS Glue

1. Extremely slow start times

medium.com

Build first ETL solution using AWS Glue

In this post, I am going to discuss how we can create ETL pipelines using AWS Glue. We will learn what AWS Glue is, how…

medium.com

Cross-account AWS Glue Data Catalog access with Glue ETL

To process data in AWS Glue ETL, DataFrame or DynamicFrame is required. A DataFrame is similar to a table and supports…

medium.com

Implementing Glue ETL job with Job Bookmarks

AWS Glue is a fully managed ETL service to load large amounts of datasets from various sources for analytics and data…

medium.com

Process Events with Kinesis and Lambda

Processing Kinesis Events with Lambda

medium.com

AWS Data Analytics — Kinesis Part-1

How to move data on AWS?

medium.com

AWS Kinesis Data Streaming with Lambda and Serverless

Today we are going to explore AWS Kinesis Data Streaming with Lambda functions. So, Amazon Kinesis is a managed…

medium.com

Google Cloud

Loading Data from multiple CSV files in GCS into BigQuery using Cloud Dataflow (Python)

A Beginner’s Guide to Data Engineering on Google Cloud Platform

medium.com

Implementing a Data Vault in BigQuery

In my last post I went over how we went about implementing a fitness program at Pandera, the need for a custom solution…

medium.com

How to use Dynamic SQL in BigQuery

Format a string, and use EXECUTE IMMEDIATE

towardsdatascience.com

Google Cloud Data Catalog — Integrate Your On-Prem RDBMS Metadata

Code samples with a practical approach on how to ingest metadata from on-premise Relational Databases into Google Cloud…

medium.com

Where is my data? The answer is Google Data Catalog

All you want to know about GCP Data Catalog

medium.com

Optimal performance with Bigtable: changing the key of your table with Apache Beam

Bigtable is a high performance distributed NoSQL database. But what if you discover that performance is not so great in…

towardsdatascience.com

Transform JSON to CSV from Google bucket using a Dataflow Python pipeline

In this article, we will try to transform a JSON file into a CSV file using dataflow and python

medium.com

Cloud Firestore on Beam with Java

Recently, I’ve been in charge of creating a streaming data pipeline which handles data coming through Cloud Pub/Sub…

medium.com

Using the Bigtable emulator with Apache Beam and BigtableIO

Cloud Bigtable is a great high performance distributed NoSQL database that can store petabytes of data, but sometimes…

medium.com

Data Pipeline in GCP: Cloud Function Basics

Most Data Scientists prefer to own the end to end data pipeline of their models, but owning a pipeline requires a lot…

towardsdatascience.com

Azure

Continuous integration and delivery in Azure Data Factory

Definitive guide to building CI/CD pipelines for Azure Data Factory using Azure DevOps

medium.com

How to build a data platform with Azure and Snowflake

This blog explains how Azure and Snowflake provide a very powerful toolset to quickly build data platforms.

medium.com

Azure Data Lake Store Gen2 Snapshot

Data Lake Storage Gen2 is the result of converging the capabilities of our two existing storage services, Azure Blob…

medium.com

Monitoring and Reporting the real-time data in Power BI

About Kafka: Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to…

medium.com

How to Increase Azure Databricks Cluster vCPU Cores Limits

The solution of the warning: “This account may not have enough CPU cores to satisfy this request” for Azure Databricks

towardsdatascience.com

Databases

NoSQL

MongoDB 4.2.6

MongoDB is a popular distributed document database. It offers replication via a homegrown consensus protocol which…

jepsen.io

TileDB 2.0 and the Future of Data Science

The Helicopter View

medium.com

Everything you should know about NoSQL database — System Design

It is hard to choose between a relational (RDBMS) and a non-relational (NoSQL) database while designing a system. A fair…

medium.com

Firestore/Datastore: unlocked the query filter capabilities in Go

Firestore and Datastore are powerful but the query capabilities are limited. Discover a Go library that I wrote for…

medium.com

Azure Cosmos Database Concepts, Data Modelling — Part 1

NoSQL is becoming the latest trend in all applications. You must have seen a lot of changes in the last decade where…

medium.com

A Bumpy Journey To Rewrite A Bulk Upload API For Cassandra — P1: The Database

A deep dive into how Cassandra handles data writes and consistencies

medium.com

Understanding Distributed database/system using Cassandra

When I first came across the term distributed systems or database, the very reaction that came to my mind was that…

medium.com

Comparing CQL and the DynamoDB API

Six years ago, a few of us were busy hacking on a new unikernel, OSv, which we hoped would speed up any Linux…

medium.com

An introduction to Apache HBase

Overview

medium.com

HBase DB in Distributed Systems

HBase Database overview

medium.com

Apache Druid Migration: AWS to GCP — Part 2

In this article, we will discuss how to migrate Druid from AWS to GCP cloud platform. For details on how to set up…

medium.com

In-Memory & Data Grid

Distributed Cache Design : 🖥

A Cache is like short-term memory. It is typically faster than the origin data source. You know accessing data from RAM…

medium.com

Relational

How does Spanner avoid single point of failures in writes?

Google’s Spanner is a relational database with 99.999% availability, which is roughly 5 minutes of downtime a year. Spanner is a…

medium.com

Migrating from Postgres to CockroachDB: bulk loading performance

I recently migrated a large database (~400GB) from PostgreSQL over to CockroachDB. This blog is a recap of my process…

medium.com

SQLZoo: The Best Way to Practice SQL

A Wild Playground to Test Your Skills with Solutions

towardsdatascience.com

Modern Data Warehouses

Building a Data Warehouse Pipeline: Basic Concepts & Roadmap

Five processes to improve your data pipeline operability and performance

towardsdatascience.com

The Rise and Fall of the OLAP Cube

One of the biggest shifts in data analytics over the past decade is the move away from building ‘data cubes’, or ‘OLAP…

towardsdatascience.com

Making queries 100x faster with Snowflake

Why and how we migrated our product usage data from PostgreSQL to Snowflake

medium.com