Work-Bench Snapshot: The Rise of Data Engineering

By Priyanka Somrah · Published in Work-Bench · Feb 13, 2020

The Work-Bench Snapshot Series explores the top people, blogs, videos, and more, shaping the enterprise on a particular topic we’re looking at from an investment standpoint.

We are on the cusp of a new era in data and are excited to see the rise of data engineering. With over 35,600 open roles on Glassdoor and 119,950 on LinkedIn, demand for data engineers has never been higher. Yet from an organizational standpoint, data work is broken: data science and engineering teams rely on different, siloed workflows, and those workflows aren't even efficient.

Data scientists still spend much of their time on tasks that are not essential parts of their jobs, such as rolling out their own infrastructure and testing, validating, and maintaining their pipelines. In an ideal world, data scientists would focus on building machine learning models and deriving meaningful insights from the data they collect, while data engineers would build and maintain complex data pipelines at scale to make that data available for data science and analytics operations.

Today, the modern data landscape is shifting to bridge the gap between data science and data engineering workflows, with each discipline increasingly shaped by shared tooling and data best practices. We are now seeing a new crop of startups leveraging new tools and techniques that enable data engineers and scientists to be more productive and efficient in their work:

  1. dbt is an open source data transformation tool that empowers data analysts and data engineers to perform complex data modeling operations in SQL.
  2. Dagster is an open source data orchestrator, used for building and orchestrating data pipelines for ETL, machine learning, and analytics.
  3. Datalogue is a data pipelining platform built from the ground up to automate manual, ad hoc, and repetitive tasks necessary for data-driven insights.
  4. Preset is a data visualization platform, powered by Apache Superset, an open source analytics and data exploration platform.

These tools not only bridge the gap between data science and engineering, but also free up both teams to do what they were meant to do in the first place. Below is a collection of articles, videos, and people shaping the future of the modern data landscape:

Blog Posts

  1. Analytics Engineering for Everyone by Tristan Handy
    “Analytics engineering is fundamentally about building a bridge between technology and the business. Analytics engineers are knowledge specialists: like librarians, they curate an organization’s knowledge. These are the humans that we’ve been missing on our data projects, and this is the practice that we need to perfect if we’re to solve the core dysfunction of data. With a mature analytics engineering practice, knowledge is built up incrementally by many people in many small pieces.”
  2. Observability for Data Engineering by Evgeny Shulman
    “In our previous lives managing Data Engineering teams, we always struggled with maintaining good visibility into projects and infrastructure. We suggest giving more thought to Observability for your own data stack, and considering the factors that make it unique. This will help you build a more robust data operation by making it easier to align your team on statuses, identify issues faster, and debug more quickly.”
  3. The Future of Data Engineering by Chris Riccomini
    “There is, however, a second type of toil that I believe data engineering has not yet automated: policy toil. This kind of drudgery involves making decisions about who can have access to what data, how long data should be persisted, what kind of sensitive data is allowed to be in which data systems, and in which geographies data may reside. Data engineering is usually not the team that ultimately decides the answers to these questions, but they often must act as a liaison or driver when finding the answers. This usually means navigating requests through other parts of the organization such as security, compliance, and legal.”

Videos

  1. Tactical Data Engineering | Looker by Julian Hyde

The speaker, Julian Hyde, is the author of Apache Calcite and an architect at Looker. In this video he describes the purpose of every database management system, looking closely at patterns that re-emerge throughout the data management process.

  2. Data Engineering Principles — Build frameworks not pipelines by Gatis Seja

This presentation outlines the challenges around managing data flow from multiple sources and building large reliable data warehouses, and describes how applied data engineering principles can be used to build robust and easily manageable data products.

  3. Functional Data Engineering — A Set of Best Practices by Max Beauchemin

In this talk, Max discusses how the functional programming paradigm can be applied to data engineering to bring clarity to the process, solving some of the inherent problems of ETL and empowering data teams to tackle larger problems.
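The functional approach Max describes can be sketched in plain Python (a hypothetical illustration with made-up data, not Airflow or Dagster code): tasks are pure functions of their inputs, and each run overwrites an entire immutable partition, so reruns and backfills are safe and reproducible.

```python
from datetime import date

# Hypothetical raw event rows, keyed by daily partition.
RAW_EVENTS = {
    date(2020, 2, 13): [
        {"user": "a", "amount": 10},
        {"user": "b", "amount": 5},
        {"user": "a", "amount": 7},
    ],
}

def transform_partition(ds: date, rows: list) -> dict:
    """A pure task: its output depends only on its inputs, never on
    external mutable state, so rerunning it is idempotent."""
    totals: dict = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return {"partition": ds.isoformat(), "totals": totals}

def write_partition(store: dict, result: dict) -> None:
    """Immutable-partition write: replace the whole partition rather
    than mutating rows in place, so a backfill is just a rerun."""
    store[result["partition"]] = result["totals"]

store: dict = {}
ds = date(2020, 2, 13)
write_partition(store, transform_partition(ds, RAW_EVENTS[ds]))
# Rerunning the same partition leaves the store in the same state.
write_partition(store, transform_partition(ds, RAW_EVENTS[ds]))
print(store)  # {'2020-02-13': {'a': 17, 'b': 5}}
```

The key design choice is that the write replaces a partition wholesale instead of appending or updating, which is what makes repeated runs converge to the same result.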

People to Follow on Twitter

Nick Schrock

  • Nick Schrock is the founder of Elementl and the co-creator of GraphQL. Nick spent the bulk of his career building frameworks for product developers at Facebook, and is currently working on an open source tool called Dagster.

Max Beauchemin

  • Max is the creator of Apache Airflow, a data pipeline workflow engine, and of Apache Superset, a data visualization platform. He is a senior software engineer at Lyft and was previously on the Analytics & Experimentation Products team at Airbnb.

Kyle Kirwan

  • Kyle is the co-founder and CEO of Toro Data Labs and a former product manager working on the Data Knowledge Platform at Uber.

Note: Datalogue is a Work-Bench portfolio company
