Work-Bench Snapshot: The Last Mile of the ETL Framework

Priyanka Somrah · Published in Work-Bench · Sep 15, 2020 · 6 min read

The Work-Bench Snapshot Series explores the top people, blogs, videos, and more shaping the enterprise on a particular topic we’re looking at from an investment standpoint.

Defining the Last Mile Problem in Analytics

Here at Work-Bench, as enterprise software investors, we’ve been thinking through the last mile problem in analytics firsthand, and we’ve already explored similar topics in previous Snapshots (The Evolution of Data Discovery & Catalog and The Rise of Data Engineering).

The last mile problem in analytics broadly refers to the challenges around data generation and consumption. Data generation is traditionally owned by engineers and revolves around cleaning, validating, and transforming raw data. Data consumption, on the other hand, is the point at which clean data from those pipelines is consumed by product teams and business users, who leverage BI tools to query the data and convert it into meaningful business outcomes. This final step in the analytics chain is the most important one, because it is where data can effectively be used to generate actionable insights.

But there seems to be a disconnect between data generation and consumption. A lot of work is done upstream by data producers that downstream users, including data scientists and analysts, have no visibility into. At best, generalized data catalog frameworks help surface the most relevant datasets and tables for analysts to query, but they don’t necessarily capture the nuances required to generate metrics in a standardized and accurate way, or support any of the analytics work on top of key metrics.

This leads to a new set of challenges around the way people collect and share metrics in an organization, which amplifies the agility problem that already exists throughout the data lifecycle: multiple stakeholders take their own cuts of the data, store those slices in notebooks or Excel spreadsheets on their laptops, and report out on them as their main metrics. Because current practices around streamlining the creation of metrics are lacking, analysts across the same organization end up collecting metrics in different ways, which leads to confusion and disagreement about how a metric is defined or how to handle an outlier in the data.
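To make that divergence concrete, here is a minimal, hypothetical sketch (the data and metric are invented for illustration, not taken from the post) of how two analysts working from the same event table can report different numbers for the “same” metric, simply by handling outliers differently:

```python
# Hypothetical illustration: two analysts compute "weekly active users"
# from identical event data and get different answers.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4, 4, 4, 4],
    "week": ["2020-09-07"] * 9,
    "event_count": [3, 2, 1, 5, 4, 60, 70, 80, 90],  # user 4 looks like a bot
})

# Analyst A: any user with at least one event counts as active.
wau_a = events.groupby("week")["user_id"].nunique()

# Analyst B: excludes "outlier" users with more than 100 weekly events as suspected bots.
per_user = events.groupby(["week", "user_id"], as_index=False)["event_count"].sum()
wau_b = per_user[per_user["event_count"] <= 100].groupby("week")["user_id"].nunique()

print(int(wau_a.iloc[0]), int(wau_b.iloc[0]))  # 4 vs. 3: same data, two "weekly active users"
```

Neither analyst is wrong in isolation; the problem is that nothing forces them to agree on a single definition before the numbers reach a report.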

In essence, the key to addressing the last mile problem in data analytics is implementing the right controls over how data users slice and dice the data. But very few organizations have the discipline to maintain these controls effectively, so that not everyone ends up with their own slice of the data:

Metrics are hard to find, and aren’t usable across tools: Traditionally, metrics reporting is largely fragmented, siloed and inconsistent, mainly because curating and sharing tribal knowledge does not scale well as pipelines get more complex. Since most product teams may not have the engineering skills to build or maintain pipelines or be proficient in SQL, they often have to rely on data scientists to query metrics stored across multiple data stores.

Traditional BI tools and data dashboards are not always actionable: Over the past few years, we’ve seen the rapid adoption of data dashboards that drive business value through aesthetically pleasing, interactive graphs capturing high-level views of what the data means. While dashboards are a great tool for visualizing and digesting snapshots of data from multiple sources, they generally lack the underlying context around the data to make them actionable and trustworthy.

Trends and Tools Shaping the Analytics Ecosystem

Centralized Metrics Store
The purpose of a global metrics repository is to create a single source of truth for metric definitions, where key metrics are defined in one place only and are reusable across different tools. This centralized metrics framework solves the challenge of inconsistent metric definitions by capturing rich metadata and lineage, which enables data users to understand the provenance of each metric and how it is calculated. Creating a single source of truth significantly improves the accuracy of metric computations and frees data scientists from pipeline management.

  • Transform Data is a shared data interface that captures the end state of the data and turns it into a metric entity, creating a centralized repository that supports data lifecycle management and anomaly detection.

Other tools include Looker (LookML) and the SQL-based modeling tools dbt and Dataform.
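To illustrate the idea, here is a minimal, hypothetical sketch of a centralized metric registry in which a metric is defined exactly once and every downstream tool reads the same definition. The MetricRegistry class and the example metric are invented for this illustration; they are not the actual APIs of Transform, LookML, dbt, or Dataform:

```python
# Sketch of "define a metric once, reuse it everywhere" (hypothetical API).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Metric:
    name: str
    sql: str          # the canonical aggregation every tool should run
    description: str
    owner: str

@dataclass
class MetricRegistry:
    _metrics: dict = field(default_factory=dict)

    def register(self, metric: Metric) -> None:
        # A metric can only be defined once; a second definition is an error,
        # not a silent fork.
        if metric.name in self._metrics:
            raise ValueError(f"metric '{metric.name}' is already defined")
        self._metrics[metric.name] = metric

    def get(self, name: str) -> Metric:
        return self._metrics[name]

registry = MetricRegistry()
registry.register(Metric(
    name="weekly_active_users",
    sql="SELECT COUNT(DISTINCT user_id) FROM events WHERE week = :week",
    description="Distinct users with at least one event in the given week.",
    owner="data-platform",
))

# Any BI tool, notebook, or experiment framework asks the registry for the
# definition instead of re-implementing it.
print(registry.get("weekly_active_users").sql)
```

The point of the sketch is the contract, not the implementation: dashboards, notebooks, and experimentation systems all resolve a metric name to the same definition, metadata, and owner.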

Data Notebooks
Computational notebooks like Jupyter and Apache Zeppelin have emerged as great tools for data science and product teams to collaborate on. They enable users to create and share live code, graphics, and visualizations, and function as an interactive IDE through which users can execute code and generate meaningful insights in real time.

While next-gen BI and analytics tools like Preset, Metabase, and Redash have been gaining a lot of steam for democratizing analytics and simplifying data access for non-technical users, data notebooks offer a different set of benefits: they are process-oriented (authors can comment directly while working with their code so that everyone else can understand and validate the process), easily accessible (as long as users can code in the language the notebook supports, they can query data directly from it), and they generate better visualizations (unlike static visualizations, data notebooks let users dive in and play around with the results interactively).
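As a rough illustration of that workflow, the following sketch pulls a result set into a DataFrame, keeps the reasoning inline as comments, and produces a chart that can be re-run and tweaked cell by cell. The data here is made up for the example; in practice the cell would read from the warehouse via a SQL connector rather than an inline DataFrame:

```python
# Notebook-style cell: query result -> transformation -> chart, with the
# reasoning recorded next to the code.
import pandas as pd
import matplotlib.pyplot as plt

signups = pd.DataFrame({
    "week": pd.to_datetime(["2020-08-03", "2020-08-10", "2020-08-17", "2020-08-24"]),
    "signups": [120, 135, 128, 160],
})

# Inline commentary travels with the analysis, so reviewers see both the
# transformation and the thinking behind it.
signups["wow_growth"] = signups["signups"].pct_change()

signups.plot(x="week", y="signups", marker="o", title="Weekly signups")
plt.show()
```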

  • Count.co is a data analysis platform designed around notebooks that supports data management by creating fully documented and accessible data models that teams can collaborate on and trust.
  • Deepnote is a data science notebook that provides easy setup and management for teams to build machine learning models without spending too much time on infrastructure.

Other tools include: Hex, Noteable, Observable, Kaggle, Polynote, and Mode.

Relevant Blog Posts for Additional Reading

  1. An island of truth: practical data advice from Facebook and Airbnb by James Mayfield
    “Many organizations have made a shift toward embedding data analysts and data scientists within product development teams so they can quickly build tables, define metrics, and run experiments. This paradigm helps teams move quickly with data, but an anti-pattern is emerging where analysts and data scientists quickly prototype pipelines (in tools like Airflow) and then “throw their ETL over the wall”. BI engineers are asked to adopt pipelines with no context on what the data means or how it will be used, which creates hard feelings.”
  2. Data Analyst 3.0: The Next Evolution of Data Workflows by Sid Sharma
    “When an ad-hoc question comes in, a Data Analyst often starts the diagnosis from scratch. It’s a manual process involving SQL to fetch the granular data, adding relevant dimensions, and finally using Python/R to dig up the insight. The process is reactive, needs to start from scratch every time, and hampers decision-making velocity. Result: Businesses end up in a similar situation as the first BI wave — business stakeholders queuing in their tickets, this time to get an answer to their “why” questions.”
  3. Dashboards are Dead by Taylor Brownlow
    “Much like in my experience at a certain unnamed company, this dashboard is succeeding in getting people to do something with data, but not necessarily something meaningful with data. At said unnamed company, we tried to solve this by adding more and more dashboards, then adding more and more filters to those dashboards, then killing those dashboards when they decidedly weren’t useful. This negative feedback loop contributed to a serious mistrust of data and inter-team schisms, many of which I believe are still around if passive-aggressive LinkedIn updates are to be believed.”

People to Follow on Twitter

Anthony Goldbloom: Co-founder and CEO of Kaggle, a subsidiary of Google that provides a customizable environment for Jupyter notebooks.

Jeffrey Heer: Co-founder of Trifacta and a scientist focusing on data visualization and interactive analysis.

Nick Handel: Co-founder and CEO of Transform Data; previously Head of Data at Branch International and a Product Manager at Airbnb, where he designed Bighead, Airbnb’s end-to-end machine learning platform.

Videos

  1. How Reporting and Experimentation Fuel Product Innovation at LinkedIn by Kapil Surlaker
    In this video, Kapil Surlaker, data and analytics lead at LinkedIn, describes how LinkedIn leverages its Unified Metrics Platform to measure and run experiments efficiently, at scale, and in a secure way.
  2. Democratizing Metric Definition and Discovery at Airbnb by Lauren Chircus
    This video serves as an introduction to Airbnb’s metric experimentation framework and discusses how it is being used to scale metric definition and discovery.
  3. Beyond Interactive: Empowering Everyone with Jupyter Notebooks by Michelle Ufford
    In this video, Michelle Ufford, founder of Noteable.io, discusses the key considerations that led Netflix to implement Jupyter, the popular open-source notebook tool, at scale, and how the team leverages it internally to power data access and exploration.
