A Beginner’s Guide to Data Engineering — Part I

Data Engineering: The Close Cousin of Data Science

Robert Chang
14 min readJan 8, 2018

--

Image credit: A beautiful former slaughterhouse / warehouse at Matadero Madrid, architected by Iñaqui Carnicero

Motivation

The more experienced I become as a data scientist, the more convinced I am that data engineering is one of the most critical and foundational skills in any data scientist’s toolkit. I find this to be true for both evaluating project or job opportunities and scaling one’s work on the job.

In an earlier post, I pointed out that a data scientist’s capability to convert data into value is largely correlated with the stage of her company’s data infrastructure as well as how mature its data warehouse is. This means that a data scientist should know enough about data engineering to carefully evaluate how her skills are aligned with the stage and need of the company. Furthermore, many of the great data scientists I know are not only strong in data science but are also strategic in leveraging data engineering as an adjacent discipline to take on larger and more ambitious projects that are otherwise not reachable.

Despite its importance, education in data engineering has been limited. Given its nascency, in many ways the only feasible path to get training in data engineering is to learn on the job, and it can sometimes be too late. I am very fortunate to have worked with data engineers who patiently taught me this subject, but not everyone has the same opportunity. As a result, I have written up this beginner’s guide to summarize what I learned to help bridge the gap.

Organization of This Beginner’s Guide

The scope of my discussion will not be exhaustive in any way, and is designed heavily around Airflow, batch data processing, and SQL-like languages. That said, this focus should not prevent the reader from getting a basic understanding of data engineering and hopefully it will pique your interest to learn more about this fast-growing, emerging field.

  • Part I (this post) is designed to be a high-level introductory post. Using a combination of personal anecdotes and expert insights, I will contextualize what data engineering is, why it is challenging, and how it can help you or your organization to scale. The primary audience of this post is for aspiring data scientists who need…

--

--

Robert Chang

Data @Airbnb, previously @Twitter. Opinions are my own.