Data Engineer Things

Insights and ideas on data and engineering.

Minds and Machines — AI for Mental Health Support, Fine-Tuning LLMs with LoRA in Practice

Explore how Large Language Models (LLMs) could change the future of mental healthcare, and learn how to fine-tune LLMs by example

May 19

Trending Now

How Twitter processes 4 billion events in real-time daily

From Lambda to Kappa

May 25

Roadmap to Learn Data Engineering: How I Would Start Again

A completely free curriculum I wished I had

Jun 7

The Hadoop Distributed File System

Everything you need to know about the HDFS

May 24

Real-Time Data Processing: Spark Streaming vs. Flink

Choosing the right tool for handling big data in real-time

Steffi Christopher

May 28

Automate Dbt Date Logic with Python — Part 2

Simplifying Our Models and Tests From Part 1 Using Meta Config

May 14

Why did Databricks build the Photon engine?

The Lakehouse, its motivation, and the difference between Photon and the existing engine.

Apr 6

Latest stories

Azure Databricks in the Enterprise Context: Networking

A Comprehensive Overview of Network Security and Compliance with Databricks

Jun 7

What Consistency Really Means in Data Systems?

Consistency varies significantly across databases, distributed systems, and streaming systems.

RisingWave Labs

Jun 7

How to build a Data Pipeline with AWS Glue and Terraform

A step-by-step guide to an ETL project that explores Australian property prices

Jun 6

Everything you need to know about MapReduce

All the key insights from the paper MapReduce: Simplified Data Processing on Large Clusters from Google

Jun 1

Bloom Filter in Short

Set.contains() at scale with some False Positives

May 30

Test Driven Development for Data Engineering (Part 1)

How to write unit tests for data engineering

Yaakov Bressler

May 28

Granular Look at Left, Semi, and Anti Joins in PySpark

In data operations, understanding the inner workings of the various types of joins can improve query performance and accuracy. Spark…

Nicholas Piesco

May 20

Customer segmentation using Spark ML and Scikit learn in Spark— part 3

May 16

Understanding Snowflake Table Locks

A hands-on look at table locks.

May 16

EDA and Data Transformation using PySpark — part 1

GitHub repository

May 16

The Inheritance Schema Design Pattern for MongoDB Data Modelling

In the world of NoSQL databases, particularly MongoDB, designing an efficient data model is crucial for optimal application performance…

May 12

How I build an ETL pipeline with AWS Glue, Lambda, and Terraform

A Step-by-Step Guide

May 12

Enhance your data quality tests with the dataform-assertions package

dbt is no longer the only choice for testing data pipelines

Fumiaki Kobayashi

May 12

My Data Pipeline Orchestrators Journey

Originally Posted at: www.junaideffendi.com

May 5

I spent 5 hours understanding more about the Delta Lake table format

All insights from the paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

May 4

What is something we have but don’t own and is never working when you need it?

Testing is difficult, but the pain can be eased with unified tooling. Here we explore the pros and cons of testing with new tools to help us

May 2

Installing (and Switching between) Different Versions of Python

How to install and switch between different Python versions.

Yaakov Bressler

May 1

I completed a Senior Data Engineer Code Challenge for fun, and this is how it went. PART II

Question: Using MySQL’s public employee sample database, create a DAG to move data from the employee’s table to BigQuery.

Apr 27

How We Integrate 1000++ Hive Tables into Data Warehouse Without ETL Seamlessly

Migrating our data warehouse to Greenplum enables us to access data from Hive in real time, eliminate storage issues, and much more!

Bernard Adhitya

Apr 26

Why Stream Processing is a Terrible Market (Yet We Are Still Investing in It)

TL;DR: It’s a challenging market, yet it holds promising prospects.

Apr 25

Introduction to Apache Iceberg | PySpark

The Story Behind a Data Lake

Apr 25

Speeding Up Power BI: The Case for Surrogate Keys in Dimensional Modeling

Exploring the transition from composite to surrogate keys for enhanced performance and maintainability in data warehousing.

Ivanna Ditlevsen Jurkiv

Apr 25

Create your own Gemini AI-chatbot with a twist using Python, Jinja2 and NiceGUI

Discover the basics of using Gemini with Python via VertexAI, creating a Web UI with NiceGUI and using Jinja2 to construct modular prompts

Apr 25

Pydantic for Experts: Reusing & Importing Validators

Advanced techniques for reusing and importing validation across Python models.

Yaakov Bressler

Apr 21

Do We Need the Lakehouse Architecture?

When data lakes and data warehouses are not enough.

Apr 20
