Data Engineering Digest #13 (Jun 2020)
Maycon Viana Bordin · Published in data.plumbers · 18 min read · Jul 28, 2020

Photo by Samuel Wölfl from Pexels

New Tools

- Delta Engine Introduction and Overview of How it Works (databricks.com): Today, we announced Delta Engine, which ties together a 100% Apache Spark-compatible vectorized query engine to take…
- questdb/questdb - Release 5.0 (github.com): Migrated to Java 11 (#272). Fixed #154 SQL: vectorized group by hour(timestamp) (#398). SQL: cancel active read-only…
- Recent database technology that should be on your radar (part 1) (lucperkins.dev): I'm a huge fan of databases, so much so that I've…
- Spark Delight — We're building a better Spark UI (towardsdatascience.com): "The Spark UI is my favorite monitoring tool" — said no one ever.
- Hyperspace, an indexing subsystem for Apache Spark™, is now open source (cloudblogs.microsoft.com): For Microsoft's internal teams and external customers, we store datasets that span from a few GBs to 100s of PBs in our…

Spark 3.0

- Introducing Spark 3.0 - Now Available in Databricks Runtime 7.0 (databricks.com): We're excited to announce that the Apache Spark 3.0.0 release is available on Databricks as part of our new…
- Spark 3.0 — New Functions in a Nutshell (medium.com): Recently the Apache Spark community released the preview of Spark 3.0, which holds many significant new features that will…
- Spark & AI Summit and a glimpse of Spark 3.0 (towardsdatascience.com): If there is a framework that super excites me, it's Apache Spark. If there is a conference that excites me, it's the…
- Five highlights on the Spark 3.0 Release (itnext.io): Spark 3.0.0 was officially released yesterday (18/Jun/2020), and it is a major change (no pun intended) in the most…
- About Joins in Spark 3.0 (towardsdatascience.com): Tips for efficient joins in Spark SQL.
- Apache Spark 3.0: Remarkable Improvements in Custom Aggregation (medium.com): In the context of the recent official announcement on Spark 3.0, Aggregator would now become the default mechanism to…
- Spark 2.x to Spark 3.0 — Adaptive Query Execution — Part 1 (medium.com): Switching Join Strategy, by Radhwane Chebaane and Wassim Almaaoui.

Data Engineering Role

- Dream of Becoming a Big Data Engineer? Discover What Sets Us Apart From Software Engineers (towardsdatascience.com): We ain't doing the same thing.

Podcasts & Presentations

- Data Collection And Management To Power Sound Recognition At Audio Analytic (www.dataengineeringpodcast.com): An interview about how Audio Analytic is building a data set of high quality audio samples from scratch to power their…
- Bringing Business Analytics To End Users With GoodData (www.dataengineeringpodcast.com): An interview about how the GoodData platform lets you bring business analytics to your customers and end users. The…
- Data Management Trends From An Investor Perspective (www.dataengineeringpodcast.com): An interview with Astasia Myers of Redpoint Ventures on the data management industry trends that she is paying…
- Accelerate Your Machine Learning With The StreamSQL Feature Store (www.dataengineeringpodcast.com): An interview with the creator of StreamSQL on the complexities of building a feature store and the benefits that they…
- Building A Data Lake For The Database Administrator At Upsolver (www.dataengineeringpodcast.com): An interview about Upsolver's mission to build a data lake that empowers the database administrator to step into the…

Real Data Architectures & Platforms

- A Brief History of Liv Up Data Platform (medium.com): This is the journey of building a data platform, from 2017 to 2020. During these three years, our business changed a…
- How TripleLift Built an Adtech Data Pipeline Processing Billions of Events Per Day (highscalability.com): This is a guest post by Eunice Do, Data Engineer at TripleLift, a technology company…
- How Scribd Ditched the Data Center and Accelerated Its Development Velocity (databricks.com): Guest blog by R Tyler Croy, Director of Platform Engineering at Scribd. People don't tend to get excited about the data…
- How to build an Analytics & Reporting Solution on the GCP (medium.com): In our previous story we looked at how to build a modern Cloud Data Platform and which capabilities it should offer…
- Building University of Indonesia's Realtime Analytics Pipeline (medium.com): How we designed and implemented University of Indonesia's big data streaming architecture as part of my bachelor…
- Optimized Real-time Analytics using Spark Streaming and Apache Druid (medium.com): Our advertising data engineering team at GumGum uses Spark Streaming and Apache Druid to provide real-time analytics…
- How To Visualize Public Transport Using Kibana, Elasticsearch, Logstash (Elastic Stack) and Kafka (medium.com): Do you think about analyzing and visualizing geo data? Why not try Elasticsearch? The so-called ELK (Elasticsearch +…
- How to build a real-time analytical platform using Kafka, ksqlDB and ClickHouse? (medium.com): Recently at StreamThoughts, we have looked at different open-source OLAP database solutions that we could quickly…
- Four building blocks for scaling insights — Part 2: The evolution of our insight infrastructure (medium.com): A modular approach helped us scale our insight infrastructure as we grew rapidly and went from start-up to scale-up…

Data Culture

- Data Monetization (medium.com): Every organization today has access to vast amounts of data. Data about operations, finances, customers, supply chains…
- Everything you need to know about data culture (medium.com): As data has the potential to fuel innovation and produce more value, many enterprises invest in numerous technologies…
- How to empower data-driven culture on construction (1/4) (medium.com): The data management routine must be guided by the better use of the information companies already have and aim to reach…
- Do you really have a data strategy? (towardsdatascience.com): Many companies claim to have a data strategy. Let's see what makes this real.

Data Lake

- Do you really need a data lake? (towardsdatascience.com): Let me help you decide.
- Data Lake vs Data Warehouse (luminousmen.com): For a long time, I didn't understand the concepts of Data Lake and Data Warehouse. I thought it was the same thing - a…
- Data Engineer, Patterns & Architecture: The Future (towardsdatascience.com): Deep-dive into Microservices Patterns with Stream Processing.
- Data mesh (not a service mesh) (towardsdatascience.com): The speed of business today calls for data architecture evolution - from the warehouse to data lakes to data mesh.
- Business Intelligence meets Data Engineering with Emerging Technologies (towardsdatascience.com): How to make BI better with new rising technologies and twelve data engineering approaches.
- Modernize Analytics Infrastructure with a Modern Data Unification approach (towardsdatascience.com): How a collaborative Modern Data Unification approach can enable the business to scale alongside growing amounts of…

Data Architecture

- Guerrilla Data Architecture (medium.com)
- A Modern Data Architecture (medium.com): When people think of AI driven products they…

Data Governance

- Facilitating Data discovery with Apache Atlas and Amundsen (medium.com): The story of enabling a modern data discovery service within a big data democratization platform.
- Govern your data: it's a tough job, but someone has to do it! (medium.com): Data Governance: why it matters and how Quantyca approaches it with an iterative process.
- What We Got Wrong About Data Governance (towardsdatascience.com): And how we can make it right.
- How to find and organize your data from the command-line (towardsdatascience.com): Introducing metaframe: a markdown-based, git-versionable documentation tool and data catalog for data scientists.
- Why data catalogs are data governance rock stars? (medium.com): Data catalogs smell like teen spirit!
- Speed up Data Catalog Implementation with Automation and AI (medium.com): It's no doubt true that crowdsourcing is a great data catalog capability. After all, it enables teams and departments…
- Dissecting the need of Data Catalog: Top 5 reasons (medium.com): Data Catalog Solution to Manage Your Data Lake | Boost Data Access & Discovery. End-to-end data lineage. Unified Data…
- Why metadata is crucial in your data management strategy (medium.com): When data is created, so is metadata. However, this type of information is not enough to properly manage data in this…
- Definition, Benefits, and Practical Use of Master Data Management (medium.com): Data not only defined the last decade but will have a critical impact in the years to come. From serving as the…
- The Six Dimensions of Data Quality — and how to deal with them (towardsdatascience.com): Building your models and analysis on solid foundations.
- Data Quality — You're Measuring It Wrong (towardsdatascience.com): Introducing a better way: data downtime.
- Data Quality, DataOps, and the Trust Blast Radius (towardsdatascience.com): A lack of trust will dramatically impact your efforts to become data driven unless you proactively limit the Blast…
- How to ensure productivity and data quality for a phone survey at scale (medium.com): Part two in our series of lessons learned based on a 6,000 person survey in India and a 600 person survey in Kenya.
- How can AI help to make Enterprise Data Quality smarter? (towardsdatascience.com): Hardly anyone relying on data can say their data is perfect. There is always that difference between the data you have…
- Earn a Bigger Gig and Help your Business go Real-Time (medium.com): The most valuable resource driving your company today is its data. But usually there are many barriers between your…
- DataOps solves the data challenges of businesses (medium.com): DataOps started from a desire to deal with data silos and enable non-tech-savvy users to answer their questions with…
- The Importance of Testing Your Data (medium.com): In the software development process, testing plays an important role. Testing ensures the software's ultimate quality…

Data Formats

- A Data Lake new era (medium.com): Data Lake and Data Warehouse in real time and at low cost.

Delta Lake

- Delta Lake Year in Review and Overview (databricks.com): Try out Delta Lake 0.7.0 with Spark 3.0 today! It has been a little more than a year since Delta Lake became an…
- What is and Why Delta Lake? How Change Data Capture (CDC) gets benefits from Delta Lake (medium.com)

Apache Parquet

- CRUD operation on Parquet files with Azure Data Factory (medium.com): If you ever need to do a CRUD operation on a Parquet file in ADF then you can review this article for a few hints.

Data Pipelines

ML Pipelines

- Industrialization of a ML model using Airflow and Apache Beam (medium.com)

Data Processing

Apache Spark

- How to access S3 data from Spark (blog.insightdatascience.com): Getting data from an AWS S3 bucket is as easy as configuring your Spark cluster.
- Stop using Pandas and start using Spark with Scala (towardsdatascience.com): Why Data Scientists and Engineers should think about using Spark with Scala as an alternative to Pandas and how to get…
- A modern guide to Spark RDDs (medium.com): Everyday opportunities to reach the full potential of PySpark.
- Understand Spark As If You Had Designed It (towardsdatascience.com): Among the current frameworks available in the data space, just a few have achieved the status that Spark has in terms…
- Should I repartition? (towardsdatascience.com): About Data Distribution in Spark SQL.
- Extract and load of ETL jobs in Apache Spark (medium.com): If you have been working in Apache Spark and had a look at the Spark UI or Spark history server, you would know the fact…
- Extracting ZIP codes from longitude and latitude in PySpark (medium.com): Given the pair of (longitude, latitude), how could one find the corresponding US ZIP code?
- Be in charge of Query Execution in Spark SQL (towardsdatascience.com): Querying data in Spark became a luxury since Spark 2.x because of SQL and the declarative DataFrame API. Using just a few…
- Deep dive into Apache Spark Window Functions (medium.com): Window functions operate on groups of data and return values for each record or group.
- PySpark EDA Basics: Practical Parallel Processing (medium.com): don't calculate, delegate.
- Faster extract and load of ETL jobs in Apache Spark (medium.com): If you have been working in Apache Spark and had a look at the Spark UI or Spark history server, you would know the fact…
- How to pre-process large datasets for machine learning using Spark (medium.com)
- Apache Spark: Window function vs Struct function (medium.com): The goal of this article is to compare the performance of two ways of processing data: window functions and…
- Apache Spark Optimization Techniques (medium.com): Before discussing various optimization techniques, take a quick review of how Spark runs.
- Dash is an ideal front-end for your Databricks Spark Backend (medium.com): Learn how to deliver AI for Big Data using Dash & Databricks in a live webinar on July 28th at 1pm EDT.

Presto

- Presto On Azure (medium.com): Learn how Presto runs with Azure and how to set up a Presto Cluster on Azure Cloud.
- An Alternative Reporting Solution for Microservices with Presto (medium.com): During the past few years, old monolith applications started to evolve into distributed ones and many of these…

Apache Pig

- The charm of Apache Pig (towardsdatascience.com): A big data tool not to miss.

Akka Actors

- How to Write a Simple Data Processing Application With Akka Actors (levelup.gitconnected.com): When we talk about Data Processing or doing Big Data ETL, the first thing that comes to mind is using Hadoop (or Spark)…

Stream Processing

- The case for Realtime Stream Processing (medium.com): Ever since I got interested in providing insights based on data already available in the database, I've been looking…
- Handling Dead Letters in a Streaming System (blog.gojekengineering.com): How we solved the critical problem of invalid records that broke our streaming pipeline.
- Overview of the DataFlow Model (medium.com): How the Dataflow model deals with streaming data.
- Comparison between different streaming engines (medium.com)
- Zero to Streaming Application — Backend (medium.com): Note: This is part of "Zero to Streaming Application", learn about streaming applications by building a POC. Full code…

Apache Flink

- Run a Stateful Streaming Service with Apache Flink and RocksDB (medium.com): Build a stateful streaming service.
- Reading Avro files using Apache Flink (medium.com)
- Demo: How to Build Streaming Applications Based on Flink SQL (medium.com): Shows how to use Flink SQL to integrate Kafka, MySQL, Elasticsearch, and Kibana to quickly build a real-time…
- A simple way to build your Real-Time dashboard (medium.com): Building a live dashboard can be a headache due to the complex architecture and hard maintenance. Nowadays, the data…
- The Flink Ecosystem: A Quick Start to PyFlink (medium.com): This article will introduce PyFlink's architecture and provide a quick demo in which PyFlink is used to analyze CDN…
- Read from specific partitions with Flink Kafka Consumer in a Docker swarm cluster (medium.com): Apache Flink offers a powerful integration with Kafka, with a high-level wrapper for the consumer. The main…
- Flink Checkpointing and Recovery (medium.com): Apache Flink is a popular real-time data processing framework. It's gaining more and more popularity thanks to its…

Apache Beam

- Running an Apache Beam Data Pipeline on Azure Databricks (towardsdatascience.com): A brief walk-through on how to execute an Apache Beam Pipeline on Databricks.
- Reading NUMERIC fields with BigQueryIO in Apache Beam (medium.com): To read a NUMERIC from BigQuery in Beam you need to extract the scale of the number from the schema, and use that to…

Apache Spark Streaming

- Optimized Real-time Analytics using Spark Streaming and Apache Druid (medium.com): Our advertising data engineering team at GumGum uses Spark Streaming and Apache Druid to provide real-time analytics…
- Complete Stream Environment (Go + Kafka + Spark + Deltalake) (medium.com): Due to the increasing number of Microservices and IoT scenarios, we often come across the need to have a complete…
- Streaming Data from Apache Kafka Topic using Apache Spark 2.4.5 and Python (medium.com): Creating a CDC data pipeline: Part 2.
- Apache Structured Streaming for end-to-end real-time application (medium.com): Many applications today require processing data in real time and making decisions based on real-time data, such as fraud…
- Peering through the 'Window' of Structured Spark Streaming (medium.com): Time is critical in streaming applications as compared to batch. The choice to go ahead with batch or streaming always…

Change Data Capture

- Streaming Data from Microsoft SQL Server into Apache Kafka (medium.com): Creating a CDC data pipeline: Part 1.
- Change data capture: Install Debezium on K8s (medium.com): Create a namespace for the resources we're going to create…

Messaging

Apache Kafka

- Using Kafka as a Temporary Data Store and Data-loss Prevention Tool in The Data Lake (medium.com)
- Using Kafka for Collecting Web Application Metrics in Your Cloud Data Lake (towardsdatascience.com): This article demonstrates how Kafka can be used to collect metrics on data lake storage like Amazon S3 from a web…
- Why should anyone use Apache Kafka? (medium.com): This is a story told time and again.
- Connecting Kafka to a MinIO S3 bucket using Kafka Connect (medium.com): How to connect data being distributed via a web-socket to Kafka and then on to an S3 bucket.
- Welcome to Kafkaland! (medium.com): Kafka is blazing fast, you have enormous freedom, and it's easy to go wrong with it when coming from another message…
- Apache Kafka Startup Guide: System Design Architectures: Notification System, Web Activity Tracker… (medium.com): What is Kafka?
- Streaming Data Pipelines using Kafka Connect (medium.com): Streaming data pipelines are the backbone of an effective data pipeline in modern systems.
- Kafka-S3 Sink in < 3 mins (medium.com): Kafka Avro record to S3 bucket, locally.
- Kafka, for your data pipeline? Why not? (towardsdatascience.com): Create a streaming pipeline using Docker, Kafka, and Kafka Connect.
- Apache Kafka — Understanding how to produce and consume messages (medium.com): I lack motivation. Every tool I try to explore requires at least three attempts to concentrate. After a while, I lose my…
- Should I backup my Kafka cluster? And how? (blog.softwaremill.com): One of our clients recently had an interesting request: can you backup our Kafka cluster?

Apache Pulsar

- Apache Pulsar Key Shared Mode - Sticky Consistent Hashing (medium.com): Apache Pulsar is a pub-sub message streaming service created by Yahoo.
- Announcing AMQP-on-Pulsar: bring native AMQP protocol support to Apache Pulsar (medium.com): We are excited to announce that StreamNative and ChinaMobile are open-sourcing AoP, which brings the native AMQP…
- Announcing: The Apache Pulsar 2020 User Survey Report (medium.com): This survey report reveals Pulsar's accelerating rate of global adoption and highlights key features on Pulsar's…
- Understanding Pulsar Message TTL, Backlog, and Retention (medium.com): I've noticed some confusion out there around message deletion and retention in Pulsar (for example, in order to keep…

Workflow Management

- Orchestrating ETL Pipelines (medium.com): In a previous post we talked about our Data Platform, the tech choices we made throughout its implementation…

Apache Airflow

- Integrating Docker Airflow with Slack to get Daily Reporting (towardsdatascience.com): Use Airflow to deliver daily weather forecasts to Slack.
- Deploying Scalable, Production-Ready Airflow in 10 Easy Steps Using Kubernetes (levelup.gitconnected.com): Airflow, Airbnb's brainchild, is an open-source data orchestration tool that allows you to programmatically schedule…
- How Apache Airflow is helping us to evolve our data pipeline at QuintoAndar (medium.com): How we are scaling our data workflows to support the growing business analytical demands with Apache Airflow.
- Omada and Apache Airflow (medium.com): Learn how Omada leverages Apache Airflow for orchestrating various pipelines.
- Setting up Airflow on a local Kubernetes cluster using Helm (medium.com): In this post, I will cover the steps to set up a production-like Airflow scheduler, worker and webserver on a local…
- Use Airflow to project confidence in your data (medium.com): A key tenet of Raybeam's mission whenever we start at a new client is to deliver value quickly. This value often takes…
- Airflow — sharing data between tasks (towardsdatascience.com): If you look online for Airflow tutorials, most of them will give you a great introduction to what Airflow is. They will talk…
- Migration to Airflow: One year feedback (medium.com): At Maisons du Monde, we used to create and schedule our data pipelines with Rundeck, which is an automation server like…
- Airflow: How To Refresh Stock Data While You Sleep — Part 1 (towardsdatascience.com): In this first tutorial on Apache Airflow, learn how to build a data pipeline to automatically extract, transform and…
- Apache Airflow At Palo Alto Networks (medium.com): As part of the Cortex Data Lake platform team at Palo Alto Networks, we are building a batch processing pipeline to…

Prefect

- Seamless move from Local to AWS Kubernetes Cluster with Prefect (medium.com)
- Map faster! Major mapping improvements in Prefect 0.12.0 (medium.com): The second generation of Prefect's unique approach to dynamic parallel pipelines is here.
- Workflow Automation: Empowering teams through our in-house, self-service framework (medium.com): A Workflow consists of steps, configured to respect a predefined order and accomplish a specific business objective…

Cloud Providers

AWS

- Ditch the Database (towardsdatascience.com): How to use AWS S3 Select to query smarter and maybe cheaper.
- Moving Faster With AWS by Creating an Event Stream Database (medium.com): As engineers at Nike, we are constantly being asked to deliver features faster, more reliably and at scale. With these…
- Advanced monitoring of AWS Glue jobs by enabling Spark UI (towardsdatascience.com): Docker container to enable the Spark history server for monitoring Glue jobs using the Spark UI.
- Data Lake Vs Lake Formation (medium.com): Why do we actually need a Data Lake?
- My Top 10 Tips for Working with AWS Glue (medium.com): I have spent a significant amount of time over the last few months working with AWS Glue for a customer engagement. For…
- Run a Spark/Scala/Python Jar/Script using AWS Glue Job (Serverless) and Scheduling it using a… (medium.com): Easy step-by-step guide to create a Glue Job and schedule it using a Glue Trigger.
- AWS Data Lake: Build Your Business Intelligence System (medium.com)
- Some quick notes for AWS Data Analyst tools (Athena, Glue etc) (medium.com): Lessons learned doing real AWS Data Analyst projects at work.
- Extract and transform data from AWS Athena's views and load into AWS S3 as a CSV file using AWS… (medium.com): We have AWS Athena reading some data in S3, so that we can perform SQL querying for analytics purposes. I created a…

Google Cloud

- Loading and transforming data into BigQuery using dbt (medium.com): A data engineering tool to build Data Lakes, Data Warehouses, Data Marts, and Business Intelligence semantic layers in…
- BigQuery Dataset Metadata Queries (medium.com): This is a quick bit to share queries you can use to pull metadata on BigQuery datasets and tables.
- Decoupling Dataflow with Cloud Tasks and Cloud Functions (medium.com): Are you developing data pipelines on Google Cloud and do you sometimes struggle to choose the right product? Do you feel…
- Designing Data Processing Pipeline on Google Cloud Platform (GCP) — Part I (medium.com): An architectural overview of processing big data using GCP services.
- Easy pivot() in BigQuery, finally (towardsdatascience.com): Introducing the easiest way to get a pivot done in BigQuery. Did you know this is one of the most requested features…
- Load files faster into BigQuery (towardsdatascience.com): Benchmarking CSV, GZIP, AVRO and PARQUET file types for ingestion.
- Processing AVRO data using Google Cloud DataProc (medium.com): In this story, we will see how Google Cloud Platform's managed service Cloud DataProc can be leveraged to read and…
- Reading NULLABLE fields with BigQueryIO in Apache Beam (medium.com): How to read NULLABLE fields from BigQuery with Apache Beam, using GenericRecord values (that is, encoded in Avro).
- BigQuery: Creating Nested Data with SQL (towardsdatascience.com): Working with SQL on nested data in BigQuery can be very performant. But what if your data comes in flat tables like…
- Extract Nested Structs without Cross Joining Data in BigQuery (towardsdatascience.com): A short post sharing an example of less common but highly useful BigQuery Standard SQL syntax.
- How to use BigQuery API with your own dataset? (towardsdatascience.com): Using Flask and BigQuery APIs to extract data from BigQuery datasets based on user query parameters.
- Graph data analysis with Cypher and Spark SQL on Cloud Dataproc (levelup.gitconnected.com): How to read in BigQuery data and use Spark SQL and the Morpheus library to carry out graph data analysis.
- Sqoop Data Ingestion on GCP (medium.com): RDBMSes (Relational Database Management Systems) have been around for decades; many people use them to store structured…

Azure

- Azure Data Factory, a powerful Cloud ETL tool (medium.com): There is no way you can implement an Analytics project without a powerful ETL tool.
- Consuming a SOAP service using Azure Data Factory Copy Data Activity (medium.com): How to configure an Azure Data Factory Copy Data Activity to consume a SOAP service.
- Improve your Data Lifecycle with Metadata-Driven Pipelines (medium.com): No digital transformation program is complete without a data-based initiative. With some speculating that artificial…
- Azure Data Factory Pipeline (medium.com): Brief introduction on Data Pipelines in Azure Data Factory.

Databases

- The Many Flavours Of SQL (towardsdatascience.com): What the SQL landscape looks like in 2020, and what its future is.
- Recent database technology that should be on your radar (part 1) (lucperkins.dev): I'm a huge fan of databases, so much so that I've…
- Exploring OLAP on Kubernetes with Apache Pinot (medium.com): It was April 2020 when I first heard about Apache Pinot through a tweet from Kenny Bastani. Online Analytics Processing…
- Time-series data: Why (and how) to use a relational database instead of NoSQL (medium.com): Contrary to the belief of most developers, we show that relational databases can be made to scale for time-series data.

NoSQL

- NoSQL Databases: a Survey and Decision Guidance (medium.baqend.com): (At the bottom of this page, you will find a BibTeX reference to cite this article.)
- The best SQL vs NoSQL mindset I've ever heard (codarium.substack.com): TL;DR - SQL RDBMS is optimizing for storage. NoSQL is optimizing for computing power. Nowadays, computing power is…
- ElasticSearch On Steroids With Avro Schemas (towardsdatascience.com): How to tackle the interface version explosion in a large enterprise setup.
- Does Elasticsearch lie? How does Elasticsearch work? (medium.com): Elasticsearch surprises us with its capabilities and speed of action, but does it return the correct results? In this…
- MongoDB queries don't always return all matching documents! (blog.meteor.com): When I query a database, I generally expect that it will return all the results that match my query. Recently, I was…
- Choose SQL (stateofprogress.blog): Let's get straight to the point; choose an SQL database for your web application. I think I can't make myself clearer…
- A Real-Time Database Survey: The Architecture of Meteor, RethinkDB, Parse & Firebase (medium.baqend.com): Real-time databases make it easy to implement reactive applications, because they keep your critical information…
- Migrating Cassandra from one Kubernetes cluster to another without data loss (medium.com): Our experience with changing the K8s operator for Cassandra.
- Leveraging Shenandoah to cut Cassandra's tail latency (medium.com): At Outbrain, we use Cassandra extensively. Up until recently our Cassandra clusters were configured with G1 and…
- Anti-patterns which all Cassandra users must know (medium.com): No amount of performance tuning can mitigate a known anti-pattern. When you google 'antipatterns in Cassandra' you will…
- Backup and restore Cassandra cluster (medium.com): Compatible with Elassandra.
- Loading CSV Into HBase Table In Kerberized Hadoop Cluster (medium.com): Looking for a quick step-by-step method to load bulk data into an HBase table in a Kerberos-enabled Hadoop cluster…

In-Memory & Data Grid

- What an in-memory database is and how it persists data efficiently (medium.com)

Relational

- How Does PostgreSQL Implement Batch Update, Deletion, and Insertion? (medium.com): This article addresses all the frequently asked questions pertaining to batch update, insertion, and deletion in…
- How to Optimize SQL Queries (towardsdatascience.com): This article sorts out some special techniques for optimizing SQL queries.

Modern Data Warehouses

- Business Intelligence meets Data Engineering with Emerging Technologies (towardsdatascience.com): How to make BI better with new rising technologies and twelve data engineering approaches.
- Evolution of the DWH — What is a Data Lake House? (medium.com): Sometimes I miss the old days when you got a big monolith deployment of tech; a new version full of features and bug…
- How we migrated our data warehouse from Redshift to BigQuery (medium.com): We recently migrated our data warehouse at Omio from AWS Redshift to Google BigQuery.
- Data Lake vs Data Warehouse in Modern Data Management (medium.com): Distinguish data lake vs data warehouse; modernize your data management and analytics with data platforms.
- Redshift vs BigQuery vs Snowflake: A comparison of the most popular data warehouses for data-driven… (medium.com): Digital transformation is the new norm within the modern organisation where they continually challenge the status quo…
- The Death of Data Warehouse? (medium.com): With the price of compute getting cheaper, massive parallel processing advertised everywhere and "big data"…