Data Engineer Things
Insights and ideas on data and engineering.
Minds and Machines — AI for Mental Health Support, Fine-Tuning LLMs with LoRA in Practice
Explore how Large Language Models (LLMs) could change the future of mental healthcare, and learn how to fine-tune LLMs by example
Volker Janz
May 19
Trending Now
How Twitter processes 4 billion events in real-time daily
From Lambda to Kappa
Vu Trinh
May 25
Roadmap to Learn Data Engineering: How I Would Start Again
A completely free curriculum I wish I had had
Wei Chun
Jun 7
The Hadoop Distributed File System
Everything you need to know about HDFS
Vu Trinh
May 24
Real-Time Data Processing: Spark Streaming vs. Flink
Choosing the right tool for handling big data in real-time
Steffi Christopher
May 28
Automate Dbt Date Logic with Python — Part 2
Simplifying Our Models and Tests From Part 1 Using Meta Config
Leo Godin
May 14
Why did Databricks build the Photon engine?
The Lakehouse, its motivation, and the difference between Photon and the existing engine.
Vu Trinh
Apr 6
Latest stories
Azure Databricks in the Enterprise Context: Networking
A Comprehensive Overview of Network Security and Compliance with Databricks
Eduard Popa
Jun 7
What Consistency Really Means in Data Systems?
Consistency varies significantly across databases, distributed systems, and streaming systems.
RisingWave Labs
Jun 7
How to build a Data Pipeline with AWS Glue and Terraform
A step-by-step guide to an ETL project that explores Australian property prices
Bella Jiang
Jun 6
Everything you need to know about MapReduce
All the key insights from the paper MapReduce: Simplified Data Processing on Large Clusters from Google
Vu Trinh
Jun 1
Bloom Filter in Short
Set.contains() at scale with some False Positives
Susmit
May 30
Test Driven Development for Data Engineering (Part 1)
How to write unit tests for data engineering
Yaakov Bressler
May 28
Granular Look at Left, Semi, and Anti Joins in PySpark
In data operations, understanding the inner workings of the various types of joins can improve query performance and accuracy. Spark…
Nicholas Piesco
May 20
Customer segmentation using Spark ML and Scikit learn in Spark— part 3
Introduction:
Suhaib Arshad
May 16
Understanding Snowflake Table Locks
A hands-on look at table locks.
Jonathan Duran
May 16
EDA and Data Transformation using PySpark — part 1
GitHub repository
Suhaib Arshad
May 16
The Inheritance Schema Design Pattern for MongoDB Data Modelling
In the world of NoSQL databases, particularly MongoDB, designing an efficient data model is crucial for optimal application performance…
Karen Zhang
May 12
How I build an ETL pipeline with AWS Glue, Lambda, and Terraform
A Step-by-Step Guide
Lorena Gongang
May 12
Enhance your data quality tests with the dataform-assertions package
dbt is no longer the only choice for testing data pipelines
Fumiaki Kobayashi
May 12
My Data Pipeline Orchestrators Journey
Originally posted at: www.junaideffendi.com
Junaid Effendi
May 5
I spent 5 hours understanding more about the Delta Lake table format
All insights from the paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
Vu Trinh
May 4
What is something we have but don’t own and is never working when you need it.
Testing is difficult, but the pain can be eased with unified tooling. Here we explore the pros and cons of testing with new tools to help us
Peter Flook
May 2
Installing (and Switching between) Different Versions of Python
How to install and switch between different Python versions.
Yaakov Bressler
May 1
I completed a Senior Data Engineer Code Challenge for fun, and this is how it went. PART II
Question: Using MySQL’s public employee sample database, create a DAG to move data from the employees table to BigQuery.
Jennifer Ebe
Apr 27
How We Integrate 1000++ Hive Tables into Data Warehouse Without ETL Seamlessly
Migrating our data warehouse to Greenplum enables us to access data from Hive in real time, eliminate storage issues, and much more!
Bernard Adhitya
Apr 26
Why Stream Processing is a Terrible Market (Yet We Are Still Investing in It)
TL;DR: It’s a challenging market, yet it holds promising prospects.
Yingjun Wu
Apr 25
Introduction to Apache Iceberg | PySpark
The Story Behind a Data Lake
Pavan Kumar
Apr 25
Speeding Up Power BI: The Case for Surrogate Keys in Dimensional Modeling
Exploring the transition from composite to surrogate keys for enhanced performance and maintainability in data warehousing.
Ivanna Ditlevsen Jurkiv
Apr 25
Create your own Gemini AI-chatbot with a twist using Python, Jinja2 and NiceGUI
Discover the basics of using Gemini with Python via Vertex AI, creating a web UI with NiceGUI, and using Jinja2 to construct modular prompts
Volker Janz
Apr 25
Pydantic for Experts: Reusing & Importing Validators
Advanced techniques for reusing and importing validators across Python models.
Yaakov Bressler
Apr 21
Do We Need the Lakehouse Architecture?
When data lakes and data warehouses are not enough.
Vu Trinh
Apr 20