Lessons learned after deploying over 200 domain-driven data products

There are two hot buzzwords competing for the data business’s 2021 Concept of the Year: Data Harmonization and Data Product.

If you’re in my field — and presumably you are — you’ve started hearing and using these terms on a daily basis. I’m even seeing job titles on LinkedIn like “Director of Data Products”.

Don’t get me wrong — these popular terms are useful! Partly because, as described in this article about harmonization, both terms refer to an abstract dream of a perfect system that does everything right with data and solves everyone’s problems. I mean, who doesn’t want that?

The definition of Data Product is changing

The term data product was initially defined by DJ Patil in 2012 — “a product that facilitates an end goal through the use of data”. Unfortunately, Patil’s definition is unspecific — doesn’t that statement essentially describe all software? What software doesn’t use some form of data to facilitate end goals?

A different, better definition for data product came in 2019 — in the original Data Mesh article:

For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets that they provide; considering their data assets as their products and the rest of the organization’s data scientists, ML and data engineers as their customers.
 — How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, by Zhamak Dehghani

Dehghani originally framed this concept with the phrase data as a product — and I’m now seeing most folks in the data architecture and applications field refer to instantiations of that concept — i.e. decentralized, domain-driven data sources (potentially combined with end-user applications that match Patil’s definition) — as data products.

The fluidity and subjectivity in the definition of data product is partially responsible for fueling the hype — as the term can mean different good things to different groups.

It’s more than just hype, of course — the advantages of this emerging architectural pattern are real and transformative. I know this, because my company Tag.bio has been designing and implementing decentralized, domain-driven data products since 2014.

At present count, our technology has helped researchers and customers design, build, maintain and use over 200 disparate, domain-driven data products.

Here’s what we’ve learned.

Making domain-driven data products a reality

How does an organization achieve the ambitious, dreamlike promise of harmonized, domain-driven data products? It takes work, of course — design, development, and iteration with domain experts.

From my perspective, the work begins with proper modeling of domain-specific data for optimal use — i.e. reporting, analytics, exploration, machine learning — all of those operations require data to be in a well-modeled, well-described, domain-driven schema.

For example, no Machine Learning (ML) engineer wants to spend weeks figuring out how to join data from 30 input tables, or how to traverse a complicated knowledge graph, just to get the right data into a data frame — they just want the data frame! A well-designed data product should give the ML engineer their desired output with minimal effort and coding — making the work faster to do once, and far cheaper to maintain over time.
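
To make that concrete, here is a minimal sketch of what that consumer experience might look like. DataProductClient and its dataframe() method are hypothetical names for illustration, not a real library:

```python
import pandas as pd

# Hypothetical client for a well-designed data product. These names
# are illustrative assumptions, not a real library.
class DataProductClient:
    def __init__(self, product_name: str):
        self.product_name = product_name

    def dataframe(self) -> pd.DataFrame:
        # In a real system this call would hit the data product's API;
        # here we return a tiny stand-in result that is already modeled.
        return pd.DataFrame(
            {"sample_id": ["s1", "s2"], "treatment": ["A", "B"], "outcome": [0.7, 0.4]}
        )

# The ML engineer gets a modeling-ready data frame in one line.
df = DataProductClient("clinical_outcomes").dataframe()
```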

The same goes for reporting and simple dashboard software, e.g. Tableau — it’s a lot easier and requires far less maintenance if the data is already represented in a single table that’s dashboard-ready.

In a basic sense, an organization can implement an initial version of domain-driven data products using a cluster of data warehouses and data views — while perhaps also enabling ownership of those assets by domain-specific groups — i.e. the marketing group owns the marketing data products.
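
As an illustration, a version-zero data product can literally be a domain-owned view. This sketch uses SQLite and made-up marketing tables as stand-ins for a real warehouse:

```python
import sqlite3

# A version-zero data product: a domain-owned view over raw warehouse
# tables. SQLite stands in for the warehouse; all names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_campaigns (campaign_id TEXT, channel TEXT, spend REAL);
CREATE TABLE raw_conversions (campaign_id TEXT, conversions INTEGER);

-- The marketing group owns and maintains this view.
CREATE VIEW marketing_performance AS
SELECT c.campaign_id, c.channel, c.spend, v.conversions
FROM raw_campaigns c
JOIN raw_conversions v ON v.campaign_id = c.campaign_id;
""")
```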

Data products must also be findable and accessible in order to be useful across an organization, so implementing a catalog of data products with metadata and access information is usually the next step in the process.
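
A catalog entry doesn’t need to be fancy to be useful. Here is a minimal sketch of the metadata and access information such an entry might carry; the fields are assumptions, not a standard:

```python
from dataclasses import dataclass, field

# A minimal catalog entry for a data product; fields are assumptions.
@dataclass
class DataProductEntry:
    name: str          # e.g. "marketing_performance"
    domain: str        # owning group, e.g. "marketing"
    owner_email: str   # who to ask for access
    endpoint: str      # where to query it
    schema_url: str    # link to the data model / docs
    tags: list = field(default_factory=list)

catalog = [
    DataProductEntry(
        name="marketing_performance",
        domain="marketing",
        owner_email="marketing-data@example.com",
        endpoint="https://data.example.com/products/marketing_performance",
        schema_url="https://data.example.com/schemas/marketing_performance",
        tags=["campaigns", "dashboard-ready"],
    )
]
```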

However, data products must be much more than findable, queryable sources of well-modeled data (a sketch follows the list below):

  • Data products are also the codebases and algorithms that query and analyze the data.
  • Data products are also responsible for data quality, observability and governance.
  • Data products are responsible for domain-specific, useful end-user experiences.
  • Data products are responsible for versioning, provenance and reproducibility of data analysis artifacts, not just data.
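
Here is that sketch: a data product as a unit of code that owns its data, its quality checks, its analysis, and the provenance of its outputs. All names are illustrative assumptions:

```python
import hashlib
import pandas as pd

# A data product as a unit of code, not just data.
class DataProduct:
    version = "1.2.0"  # the code and data model are versioned together

    def data(self) -> pd.DataFrame:
        # Well-modeled, domain-driven data (stand-in values).
        return pd.DataFrame({"sample": ["s1", "s2"], "value": [3.1, 4.7]})

    def quality_checks(self) -> bool:
        # Responsibility: data quality / observability.
        return bool(self.data()["value"].notna().all())

    def analyze(self) -> dict:
        # Responsibility: embedded, domain-specific analysis.
        df = self.data()
        result = {"mean_value": float(df["value"].mean())}
        # Responsibility: provenance -- tie the artifact to the data
        # and code version that produced it.
        data_hash = hashlib.sha256(df.to_csv().encode()).hexdigest()[:12]
        return {"result": result, "version": self.version, "data_hash": data_hash}
```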

That’s where the real challenge, cost, and time lie. The initial steps — data modeling, domain ownership of data products, and data product cataloging — are just the beginning.

And that’s exactly where using a Domain Driven Data Product Design Kit (4DK) and an out-of-the-box Data Mesh platform offers a significant cost, time, scaling, and maintenance advantage — i.e. 3 months instead of 3 years.

How to accelerate the process — and minimize maintenance later on

  • Continue to utilize your existing databases, data warehouses, and data lakes. Don’t waste time and money reinventing that wheel unless you really have to. On the other hand, it may be important to transfer ownership of these existing data resources to domain-specific groups.
  • Use a data product layer on top of those data sources to perform domain-specific modeling and integrate domain-specific applications.
  • Harmonize the technology — i.e. use the same technology — for the layer that ingests and models domain-specific data in each data product. Data quality testing, data observability, and data governance then become instantly available for every data product.
  • It’s better if the harmonized data ingestion and modeling technology is low-code. Your data engineers will thank you, engineers won’t stick you with unmaintainable code after they leave, and onboarding new data engineers will be a snap.
  • Make sure all data products speak the same API language (see the API-contract sketch after this list). With this, data products can describe themselves to the larger data catalog, and polyglot client applications can run domain-specific applications across multiple data products.
  • Embed domain-specific algorithms and applications inside of each data product. This not only significantly increases the efficiency of the algorithms, but it also makes the applications available to all consumers of a data product.
  • Embedding also allows automated testing and governance systems within a data product to extend beyond data elements to algorithms and applications.
  • Domain-specific algorithms usually require pluggable pro-code elements — in this case, low-code is not better. Let your data scientists bring the appropriate algorithms (e.g. via R/Python) into the data product (see the plugin-registry sketch after this list).
  • Use the same containerization/deployment/CI/CD process for every data product. This ensures harmonized error detection, testing, observability and governance over the entire mesh of data products.
  • Iterate on the usefulness of data products with end users and consumers of the data product. Don’t stop iterating until it’s what they need.
  • If a data product has divergent use cases which produce a design conflict, split the data product into two.
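
To illustrate the same-API-language point from the list above, here is a minimal sketch of a shared contract that every data product could implement. The interface and method names are assumptions, not Tag.bio’s actual API:

```python
from abc import ABC, abstractmethod

# A shared API contract for every data product, so the catalog can
# harvest describe() and any client can call run() the same way.
class DataProductAPI(ABC):
    @abstractmethod
    def describe(self) -> dict:
        """Self-description harvested by the data product catalog."""

    @abstractmethod
    def run(self, app_name: str, params: dict) -> dict:
        """Run a domain-specific application embedded in the product."""

class MarketingProduct(DataProductAPI):
    def describe(self) -> dict:
        return {"name": "marketing_performance", "domain": "marketing",
                "apps": ["spend_report"]}

    def run(self, app_name: str, params: dict) -> dict:
        if app_name == "spend_report":
            return {"total_spend": 12500.0}  # stand-in result
        raise ValueError(f"unknown app: {app_name}")
```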
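And to illustrate the pluggable pro-code point, here is a minimal sketch of an algorithm registry, again an assumption rather than Tag.bio’s actual mechanism, where a data scientist contributes a pro-code function that the product exposes by name:

```python
from statistics import mean, stdev

# A pluggable pro-code registry (an illustrative assumption): data
# scientists contribute algorithms; the data product wires them to
# its modeled data.
ALGORITHMS = {}

def algorithm(name):
    def register(fn):
        ALGORITHMS[name] = fn
        return fn
    return register

@algorithm("t_test")
def t_test(group_a, group_b):
    # Welch's t statistic (simplified; a real plugin might call scipy or R).
    na, nb = len(group_a), len(group_b)
    se = (stdev(group_a) ** 2 / na + stdev(group_b) ** 2 / nb) ** 0.5
    return (mean(group_a) - mean(group_b)) / se

# A consumer invokes the embedded algorithm by name.
result = ALGORITHMS["t_test"]([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```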

I’ll wrap it up here — there’s a lot more we’ve learned and integrated into the Tag.bio Data Mesh platform and 4DK — but I hope these initial lessons offer some useful advice to those of you who are rolling your own.
