Data Modelling is More than Documentation

Maintain high-quality data assets by using data catalogs to continuously improve your organisation’s data models.

Shrikant Sundaram
Slalom Data & AI
8 min read · Dec 5, 2022


A key enabler of successful enterprise data management is the practice of data modelling, which represents organisational data elements and their relationships in the appropriate context.

In this article, I explore the idea of using modern data catalogs as an enabler to continuously improve data models.

Data modelling overview

There are three levels of modelling, each with its own purpose.

Conceptual data model

A conceptual data model is a bird’s eye view of the key subject areas and entities in a domain. Its intent is to provide a simple, visual way to build mental models of enterprise data and to communicate the domain (i.e. the business area or organisation), creating shared understanding and transparency.

The audience for a conceptual data model is everyone in the organisation who has an interest in the data, from developers to architects, from designers to managers, to executives.

Conceptual data models can be written in many easily accessible diagramming tools like draw.io, Lucidchart, or PowerPoint, and typically use business-friendly language.

Logical data model

At the next level is the logical data model, which provides a detailed view of the entities, their attributes, keys, and relationships. The logical data model is independent of any database implementation or technology.

The audience for a logical data model includes stakeholders who perform hands-on work with the data and have a deep interest in its characteristics (e.g. attributes, keys, relationships).

Logical data models are often written in specialised tools like erwin by Quest, ER/Studio, or SqlDBM, which provide capabilities that support the modelling process (e.g. version control, forward and reverse engineering, and generating data definition language, or DDL). Some organisations may still choose tools like draw.io or Lucidchart as an extension to conceptual modelling, forgoing capabilities like forward and reverse engineering.

Physical data model

Finally, the physical data model defines what is actually built in the database or storage technology. Tools for physical modelling generate DDL statements or schema definitions that developers apply to the physical database or storage technology during software development. It is the ultimate source of truth for the data definition.

The physical data model depends on the underlying technology, so naming conventions, data types, and other elements may be restricted. The physical model may also cover partitioning strategy, indexing, or other optimisations suited to the specific data store.
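To make this concrete, here is a minimal sketch, assuming SQLAlchemy and a hypothetical pair of customer/order tables, of how a physical model can be expressed in code and its dialect-specific DDL generated:

    from sqlalchemy import Column, ForeignKey, Integer, MetaData, String, Table
    from sqlalchemy.dialects import postgresql
    from sqlalchemy.schema import CreateTable

    metadata = MetaData()

    # Hypothetical tables standing in for a real physical model.
    customer = Table(
        "customer", metadata,
        Column("customer_id", Integer, primary_key=True),
        Column("full_name", String(200), nullable=False),
    )
    customer_order = Table(
        "customer_order", metadata,
        Column("order_id", Integer, primary_key=True),
        Column("customer_id", Integer, ForeignKey("customer.customer_id")),
    )

    # Print the dialect-specific CREATE TABLE statements (the DDL).
    for table in metadata.sorted_tables:
        print(CreateTable(table).compile(dialect=postgresql.dialect()))

Once statements like these are applied to the database, the physical model becomes the source of truth described above.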

Summary of Data Model Levels

So, what’s the problem with data modelling?

Although data modelling is very important in the software development lifecycle, we see challenges in the adoption of data modelling processes and practices in organisations. These challenges have a wide variety of causes, but we can group them into two main categories:

  1. Prioritisation of “speed to market” for software products, and
  2. Data modelling myths.

Prioritisation of “speed to market” often results in a lack of adherence to leading practices, making incomplete or brittle data models in the design phase all too common. When these models do not adequately meet the business requirements, they are often overridden later in product development, leaving the models irrelevant and untrusted.

In other circumstances, developer myths prevail and create new challenges. Some of the myths I have seen in the past include:

Myth 1 — Agile and data modelling don’t mix well

Data modelling suffers from a perception of being a “waterfall” process to define, build, and implement — in a world of Agile software development. The need to ‘build and ship fast’ can lead to development against a base set of tables before the business domain is understood and conceptual and logical data models are created. The focus on technology makes software and product feature builds the priority, putting data design and modelling in the back seat.

Myth 2 — Data modelling is a lot of effort with little value

Another misperception is that there is limited value in the data modelling process. It’s true that data modelling requires concerted effort at the start: one must understand, analyse, and define the domain. Fear of slipping into an “analysis-paralysis” mindset makes it tempting to skip that effort and start building software with only a partial model, just to demonstrate working software.

Myth 3 — It’s not representative of the actual data

Models undergo an initial baseline and, as time passes, the software inevitably evolves with the business requirements, as does the underlying data model. Yet many organisations treat reflecting changes or deviations in the physical databases back up to the conceptual and logical data models as low-priority documentation. Stakeholders then lose trust in the data models, making the idea that models are useless a self-fulfilling prophecy.

People and process are a common breaking point

Data modelling processes need to be robust and agile to facilitate rapid, changing, and sometimes conflicting business needs.

The conceptual and logical data models should not become obsolete once a physical model is available; rather, they should represent the physical model in an abstracted form that conveys the intent of the model to a wide variety of business stakeholders. This maintenance requires a lifecycle that accommodates the inevitable evolution of data within an organisation.

There are many reasons why a software system’s physical data structures or models may need relatively quick updates, ahead of the conceptual and logical models:

  • To facilitate “speed to market”, by releasing an initial minimum viable product (MVP) and then continuously improving the software and the underlying data structures.
  • To update a relatively stable product after a long period, or when product ownership changes to a different team.
  • Emergency fixes or changes to handle a bug or a critical business change request.

In such scenarios it is possible for changes to occur in the physical data structures before the logical or conceptual data models are updated. For example, the development team may update the physical model, while the logical and conceptual model updates are left to a business analyst or architect as a secondary activity.

Rather than debate the correctness of these responsibility assignments, let’s focus on decoupling the process of updating the logical and conceptual models from the physical models.

The typical data modelling process moves from the conceptual to the logical and then to a physical data model. So how do we take physical data models and loop back to the conceptual and logical data models as a decoupled process?

“Active” data catalogs for iterative modelling

A data catalog collects metadata about the data stored in the physical database or storage location. The catalog contains a wealth of information about the physical model, which can be used to reconstruct the logical and conceptual data models. A modern “active” data catalog such as Collibra or Alation can collect metadata from the systems automatically with little to no manual intervention. This approach yields better results and stronger completeness guarantees for the collected metadata.
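As a rough illustration of the kind of harvesting such connectors perform, here is a minimal sketch that reflects physical metadata from a live database using SQLAlchemy’s inspector (the connection string is a placeholder for whatever store the catalog points at):

    from sqlalchemy import create_engine, inspect

    # Placeholder connection; a real catalog connector points at the
    # production database or storage system.
    engine = create_engine("sqlite:///example.db")
    inspector = inspect(engine)

    harvested = {}  # table name -> physical metadata
    for table_name in inspector.get_table_names():
        harvested[table_name] = {
            "columns": [
                {"name": col["name"], "type": str(col["type"]), "nullable": col["nullable"]}
                for col in inspector.get_columns(table_name)
            ],
            "primary_key": inspector.get_pk_constraint(table_name)["constrained_columns"],
            "foreign_keys": inspector.get_foreign_keys(table_name),
        }

The harvested structure is exactly the raw material an analyst needs to compare the physical model against the logical and conceptual models.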

Data catalogs also provide a graphical view of the metadata and its relationships. Using the metadata and views, a data analyst or architect can map the metadata to the existing logical and conceptual data models and update them accordingly. The physical models can then be recreated from the updated logical models to confirm that they align with the physical data model in production.

This creates an iterative loop starting from conceptual to logical to physical model and then back to logical and conceptual data models.

Using data catalogs to support updated data models.

There are features in data modelling tools, such as reverse engineering, that can enable a similar process. I focused on data catalogs for two reasons:

  1. In modern data platforms, data is stored in many formats (e.g. relational databases, CSV, JSON, key-value stores), and data catalogs provide more robust and complete support for metadata capture and collection across them (a small sketch after this list illustrates the idea).
  2. Data catalogs provide a “data-led” view of the model compared to a “concept-led” view by using metadata to showcase how data is actually used in the organisation, thus informing a data model that better matches business needs.
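Here is a minimal sketch of capturing field names from non-relational sources; the file names are hypothetical, and real catalogs do this through connectors rather than hand-written code:

    import csv
    import json

    def csv_fields(path):
        with open(path, newline="") as f:
            return next(csv.reader(f))  # header row as the field list

    def json_fields(path):
        with open(path) as f:
            first_record = json.load(f)[0]  # assume a list of flat records
        return list(first_record.keys())

    harvested_files = {
        "orders.csv": csv_fields("orders.csv"),
        "customers.json": json_fields("customers.json"),
    }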

Continuous improvement of data models with data catalogs

The validation and alignment of the models does not need to be fully automated; it can be manual (for example, when the modelling and cataloguing tools are not compatible enough to support automation). However, automation should be used as much as possible.

The exact validation guidelines and rules may differ across organisations, but general checklist items include the following (a minimal drift check in code follows the list):

  • Do the data entities align with the original intention?
    If they do not, do they make sense in the current business context?
    In other words, always confirm your original model is still relevant, or a reconstructed model may be your new source of truth.
  • Do the naming conventions align with the original intention?
  • Are original key characteristics of the data accurate, or have these drifted with implementation?
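As a sketch of what automating the last check might look like, assume the logical model is kept as a simple dict of expected column names, and compare it against metadata harvested as in the earlier catalog sketch (all names here are hypothetical):

    # Logical model: table -> expected column names.
    expected = {
        "customer": {"customer_id", "full_name"},
        "customer_order": {"order_id", "customer_id"},
    }
    # Sample harvested metadata, mirroring the inspector sketch above.
    harvested = {
        "customer": {"columns": [{"name": "customer_id"}, {"name": "full_name"},
                                 {"name": "loyalty_tier"}]},
        "customer_order": {"columns": [{"name": "order_id"}, {"name": "customer_id"}]},
    }

    for table, expected_cols in expected.items():
        actual_cols = {col["name"] for col in harvested.get(table, {}).get("columns", [])}
        missing = expected_cols - actual_cols
        unexpected = actual_cols - expected_cols
        if missing or unexpected:
            print(f"{table}: missing {sorted(missing)}, unexpected {sorted(unexpected)}")
        else:
            print(f"{table}: aligned with the logical model")

Any “unexpected” column is a candidate piece of implementation drift to take back into the logical and conceptual models.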

The outcomes of this type of validation exercise are varied and can depend on organisational dynamics. In some cases, it may be better to change the physical model and the software to align with the original intended data model. In other circumstances, it may make sense to update the logical data model to align with the changing business needs.

What else can you do?

Using a modern, active data catalog is one way to keep data models relevant and useful within an organisation, but there are several other contributing factors as well. Some of these include:

Staff skills: Building a T-shaped skills model would greatly help the overall process. Architects and analysts who develop the conceptual or logical data models should have deep skills in modelling techniques and domain knowledge (the vertical bar of the T) with secondary skills in database technologies and software engineering. Developers should have the inverse of these skills.

Executive advocacy: Apart from sponsorship and funding for the modelling exercise, advocacy from senior executives helps ensure active participation by all stakeholders, which supports active maintenance and enduring trust in the data models.

Right tools for the job: There are plenty of modelling and cataloguing tools on the market; the key is to ensure consistency in how the tools are applied across the organisation. Pick tools that improve productivity.

Consistency in naming: When building the physical data model, follow the naming and data type conventions of the logical data model as much as possible, or keep a translation available (a small sketch follows below).
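One lightweight form such a translation could take, with all names shown being hypothetical examples:

    # Logical entity names mapped to physical table names.
    LOGICAL_TO_PHYSICAL = {
        "Customer": "customer",
        "Customer Order": "customer_order",
        "Order Line Item": "order_line_item",
    }

    def physical_name(logical_name: str) -> str:
        # Fall back to a snake_case convention when no mapping is recorded.
        return LOGICAL_TO_PHYSICAL.get(
            logical_name, logical_name.strip().lower().replace(" ", "_")
        )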

Version control and repository: Ensure model DDLs are checked into a repository to maintain version control, just like software code.

Right people: Keep the logical data model developers (analysts and architects) involved in physical database design as key stakeholders, whether as reviewers or as test users of the application or system being implemented and updated.

There is a lot of opportunity for data catalog and data modelling tool vendors to integrate better and enable a continuous improvement process for data models. Using open standards to publish models and generate models from data would allow better interoperability between the two tooling areas and allow for more automation.

Summary

Hopefully this article provided some food for thought on keeping data modelling relevant and enduring in your organisation. As technologies and practices improve, we hope these processes are simplified further, enabling organisations to move faster with their data.

Slalom is a global consulting firm that helps people and organisations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.
