ER Maps for dbt users

Shiva Shivakumar
5 min readMay 17, 2022

--

Debug semantic gaps in modern data operations, faster!

We share how we’ve built ER Maps for dbt pipelines and saved hours debugging semantics in modern Healthcare and Retail data apps. We show how you can create ER Maps for your dbt pipelines, and save hours in < 10 minutes. Our ER Maps tool helps teams import dbt artifacts, style with colors and icons, and map schema lineage (for BigQuery, Postgres, Snowflake, and more.)

ER Maps for dbt pipelines

Why ER Maps? Data teams have used ER Diagrams (ERDs) for decades to visually communicate semantic details between tables, columns, and flows. Pictures versus thousands of words, popular and time-tested! However ERDs have one key problem — they are useful for design, but hard to maintain for operational reality.

Reinvent like Google Maps? “Classic” ERDs were invented to model ‘state of the art’ paper atlases for the physical world. If you drove or took public transit pre-2005, you used a set of artifacts — paper maps, a mental model of critical landmarks (schools and cafes), and radio traffic reports on accidents and closures. To get from point A to B, you “mentally merged” these artifacts. To get from A to B with friends, you had to convey roads, distances, and exit lanes, often with multiple semi-accurate frames of reference.

We are inspired by how Google Maps reimagined navigation for current and operational reality from 2005 onwards — recent road network and traffic conditions in one visual canvas, with user tools to search and navigate the complex physical world.

We now live in a data world with data teams operating on hundreds and thousands of data flows, tables, and column relationships, across data stacks and multi-clouds. We believe it’s time to reinvent ER Maps, to be simpler, scalable, and operationally accurate. And save users time when navigating and operating complex data flows!

Why dbt? dbt has quickly become a standard programming environment for teams rebuilding on a modern data stack. Also, the community is vibrant — analytics engineers are at the forefront of solving high-value and gnarly problems. With ER Maps, dbt teams can

  1. Quickly create a common canvas of data flows to supplement dbt’s docs and lineage graph. Style with colors and icons in code.
  2. Debug semantic bugs. Identify how every column was computed, from real SQL code.
  3. Diff across schema graph versions. What changed, when, and how? Identify any breaking changes and create audit trails.

Here’s a video demo of ER Maps and how to build your own in < 10 minutes! Ping shiva@ on dbt slack if you’d like to use the tool, or if you’d like to brainstorm about the future of ER Maps.

Preview of ER Maps

Examples from Healthcare and Retail

Here’s our first example from the Healthcare industry. Patients visit Doctors for care. We need to compute Cost and Quality metrics for visits, for insurance payments and patient care. We use a dbt pipeline to transform PII-sensitive data in a FHIR-based schema through 100+ models, in 7 stages from loading, cleaning, canonicalization, through to the metrics tables. Below is an example ER Map. Here’s how we’ve used ER Maps to save our team tens of hours debugging and operating each month.

  1. Setup familiar conventions. In Figure 1.1, we color coded and picked icons for Tables in the same ‘entity group’. For example, Patient tables are in Blue, with a User icon. Billing tables are in Green with the Dollar icon. (Auto-rendered with a code-based styles template.) We set up conventions for colors and icons (in styles.json), and we programmatically apply the same styles to every version of our ER Maps. This helps our teams connect the ‘’big picture’’ of tables and flows, and work on the same source of truth during debugging and onboarding.
  2. Debug with context of lineage. In Figure 1.1, Transaction.payment_value is a critical column. We see its lineage and relationships to upstream and downstream columns. (Auto-computed from dbt’s artifacts, based on the real SQL code.) During debugging, we scan to critical columns in seconds, and see how columns impact each other, especially before and after changes. This saves our teams time versus following chains of SQL code.
  3. Fast search. In Figure 1.2, fast catalog search with full context of tables and columns. (Auto-computed from dbt’s SQL in artifacts). We use search to quickly focus in on specific columns, and use the full context on how tables relate.
  4. Diffs when versions evolve. In Figure 1.2, when schemas change, we see Diffs on what’s changed per stage. (Auto-computed). After a sequence of changes, we track which Tables and Columns changed, when, and how? This helps analytics engineers plan downstream impact of refactoring models.
  5. Collaborate and Share. We share, export into PDF/PNGs, and collaborate on one canvas with our team. This helps our teams onboard new members fast, and our domain experts review our semantics with the latest operational changes

In Example 2, we see the popular TPC-DS benchmark from the Retail industry. The TPC-DS documentation is detailed, dense, and ~130 pages on how the retail data graph relates. Like Google Maps, each ER Maps user can focus on their specific “A to B” journeys, search and navigate key relationships at different zoom resolutions, along with the associated context of related tables and columns. For example, in 2.1, see how item_sk column relates to other columns and in Figure 2.2, how the Tables relate.

Figure 1.1: ER Map for Patient Healthcare, with styles and lineage. FHIR-based schema zoomed in (Patients, Doctors, Procedures, Transactions, …)
Figure 1.2: Catalog search
Figure 1.3: Diffs across versions (e.g., Version 7.1 to 7.5)
Figure 2.1: Retail with TPC-DS schema zoomed in (Customers, Inventory, Items, Stores, …)
Figure2.2: Zoom and pan for full set of Retail tables

--

--