So, who killed data modelling?

Chris Jackson
9 min read · Oct 4, 2022

Judging by the noise on LinkedIn, it seems data modelling is in a life and death struggle. Whilst aware of the swamps left behind by lazier followers of the data lake movement, I had naively assumed that modelling was alive and well. Almost every use of data has some data model, even if only implied — though not necessarily a good model. (Personally, I’d question whether anyone who can’t describe 3NF in simple, practical terms should be working in data…)
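
For readers who want those 'simple, practical terms': below is a minimal sketch, using hypothetical order data, of what normalising to 3NF means in practice, with each fact stored once against the key it depends on.

```python
# Illustrative only: hypothetical order data, before and after normalisation to 3NF.
from dataclasses import dataclass

# Un-normalised: customer and product details repeated on every order row,
# so a change of customer address means updating many rows.
@dataclass
class OrderRowDenormalised:
    order_id: int
    customer_name: str
    customer_address: str
    product_code: str
    product_description: str
    quantity: int

# 3NF: every non-key attribute depends on the key, the whole key,
# and nothing but the key. Customer and product facts live in one place each.
@dataclass
class Customer:
    customer_id: int
    name: str
    address: str

@dataclass
class Product:
    product_code: str
    description: str

@dataclass
class Order:
    order_id: int
    customer_id: int   # foreign key to Customer

@dataclass
class OrderLine:
    order_id: int      # foreign key to Order
    product_code: str  # foreign key to Product
    quantity: int
```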

So why are there still questions about data modelling? Various causes are proposed. Some issues are at least thirty years old, while recently the finger has been pointed at the growing use of cloud data platforms and the ELT approach to analytic data architectures.

1) Lack of interest — the business really doesn’t care

Despite CIO and CEO declarations of intent to be ‘data-driven’, for some businesses the management and exploitation of data is not a key concern, at least at the senior level. This may be understandable — not every business is a ‘data business’; data may be critical, but used narrowly within specific separate domains. Some organisations are fundamentally good at other things, such as sourcing and selling stuff, or looking after people, or providing legal expertise. It’s not that they don’t use data but — for now — the way they do things seems to work for them, even if that’s in Excel spreadsheets.

This can happen both in a mature organisation, maybe dominant in its sector, and in a tech start-up, where good data is a secondary consideration to shipping product.

Data-driven steam engine

Solution: Until the organisation is suffering enough data-related pain, or senior management chooses to back a strategic data-enabled business approach, data modelling — and governance and other data stuff — are going to be done largely at project level, to achieve local goals. For more established organisations, the enthusiasm for, or resistance to, data literacy efforts may act as a litmus test of what is really going on.

If as a data person you find this frustrating, you may need to weigh up the benefits of your current position (good benefits, sane hours, lucrative stock options), accepting its limited scope, against the possibility of making more impact elsewhere. The churn in senior data leaders is, in my view, at least partly down to these types of decisions.

2) Lack of a ‘big picture’ — no broad business data model

Data modelling has often been seen as a detailed activity to underpin development of operational and analytical products, removed from data strategy, and only impacting business users as part of detailed business analysis. But if there is no high-level map of the organisation’s data landscape, how can a company be ‘data-driven’, or business domains agree on data ownership and responsibility? How is a CDO supposed to rationalise data across multiple applications or silos, each with competing aims to be the true source of ‘customers’, or to understand the reasons for specific data flows?

The 1990s answer was a vast, detailed, 3NF ‘enterprise data model’, often running to hundreds or thousands of entities. Sometimes this was bought ‘off-the-peg’ for specific industries but then required local validation and adaptation. Unsurprisingly, these exercises usually ran into the sand, overtaken by more urgent business priorities and boredom.

Solution: The art of high-level ‘business data modelling’ or ‘conceptual data modelling’ has been around for over 15 years. In the hands of an experienced practitioner, for a medium-sized business or division, it should be possible to produce a good first draft in 1–3 months, including proper interaction with all parts of the business. Typically, this can be done in conjunction with an exercise in data literacy for more senior managers and staff. Such a model can be refined and extended as more detailed data work in an area throws up a need for greater differentiation of concepts, or wholly new ones.

Highly useful in itself, starting data modelling ‘at the top’ also establishes the principle that it is fundamental to an organisation’s approach to data.

3) Data as application exhaust or afterthought

Even though many applications produce and rely on data, there has always been a tendency, especially in online applications, to see data as a necessary evil, not a first-class citizen in application design. This especially manifests in two areas:

a) Use of third-party applications to accelerate business capabilities.

Many applications have their own data model, which exists on a ‘take it or leave it’ basis — you bend your data needs (and even business processes) to suit the application’s world view. Other applications, on the other hand, actively encourage local customisation by business users, without any consideration for whether the implied data model really makes sense.

Questions of wider integration may be brushed aside, so long as the application can obtain or exchange data to suit the immediate needs, perhaps via APIs. Some applications even actively discourage data from being extracted outside their own narrow environments.

Solution: Only buy an application if it can provide a clear data model, and / or options for well-constructed extracts / data sharing for analytic purposes. I’d suggest making that part of any RFI for procurement, requiring more than a ‘Yes / No’ answer.

b) In-house application developers treating data modelling as an afterthought.

This is the in-house version of the problem, accentuated by the fact that it’s even more tempting to address a ‘single-customer’ system with short-term expediency. Developers are usually working under time pressure to deliver a bunch of screens to internal or external users, who have no immediate interest in how the data is stored.

Solution: Data modellers (or those skills) should be a core part of any application team. A draft data model should usually be a prerequisite for starting the first true agile development sprint. And the downstream use of the data which will be produced, whether for operational or analytic purposes, should be part of an overall framework. This is best practice for data-driven development and is strongly implied by the data mesh approach.

4) The rate of change — modelling just slows us down

A model is just that — a simplification of the real world. In the case of a data model, it will typically capture some implicit rules and relationships, hopefully attuned to the way the business manages its real-world interactions.

The relational modelling of the 1990s was seen as too slow, capturing a view of entities, relationships and attributes which was usually overtaken by business changes and new sources of data, and which failed to add value when capturing and transmitting online events. As organisations moved from producing purely physical products to more digital products, with regular change the norm, modelling was seen as hampering or conflicting with the experimentation needed to stay current.

Solution: In online applications, semi-structured ‘document model’ approaches have offered both the encapsulation of events and a level of flexibility in extendable schemas. (Best practice in the use of such structures implicitly acknowledges the disciplines of 3NF analysis.) Analytic data platforms have in turn moved to offer native support for formats such as JSON, with varying degrees of commitment.
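
As an illustration (the event and field names are hypothetical), a semi-structured document encapsulates a whole event and tolerates new attributes, while the flattening needed for analysis quietly recreates the parent/child structures that 3NF analysis would have given you:

```python
import json

# Hypothetical clickstream event: the document encapsulates the whole event,
# and new attributes (e.g. "campaign") can appear without a schema migration.
event = {
    "event_id": "e-1001",
    "event_type": "add_to_basket",
    "timestamp": "2022-10-04T09:15:00Z",
    "customer": {"id": 42, "segment": "retail"},
    "items": [
        {"sku": "ABC-1", "qty": 2},
        {"sku": "XYZ-9", "qty": 1},
    ],
    "campaign": "autumn-sale",   # optional field, added later without breaking readers
}

# Flattening the document for analysis recreates the familiar
# parent/child (event and event_item) structure of a relational model.
event_rows = [
    {"event_id": event["event_id"], "sku": item["sku"], "qty": item["qty"]}
    for item in event["items"]
]

print(json.dumps(event_rows, indent=2))
```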

In the analytic space, the Data Vault approach has provided agility by generalising relationships between key entities, recognising the diversity of sources and high probability of change, and building in the capture of history. It lends itself to a range of physical optimisations for large scale landing of data, and to automation of design, though arguably automation is almost essential to get it right at scale, and raw data vault does not claim to be business-facing.
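
A minimal sketch of the core Data Vault building blocks (hubs, links and satellites), using hypothetical customer and order entities; real implementations add hashing conventions, multiple satellites and loading patterns well beyond this:

```python
from dataclasses import dataclass
from datetime import datetime

# Hub: one row per business key (e.g. a customer number), nothing else.
@dataclass
class HubCustomer:
    customer_key: str      # surrogate or hash of the business key
    customer_number: str   # the business key itself
    load_ts: datetime
    record_source: str

@dataclass
class HubOrder:
    order_key: str
    order_number: str
    load_ts: datetime
    record_source: str

# Link: a generalised relationship between hubs; new sources or new
# relationships arrive as new rows or new link tables, not schema rewrites.
@dataclass
class LinkCustomerOrder:
    link_key: str
    customer_key: str
    order_key: str
    load_ts: datetime
    record_source: str

# Satellite: descriptive attributes, with history built in via load timestamps.
@dataclass
class SatCustomerDetails:
    customer_key: str
    load_ts: datetime
    record_source: str
    name: str
    address: str
```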

Data mesh proposes that the bulk of modelling be left to local domains — though also promotes bitemporal modelling approaches, and speaks about the need for common standards, a new modelling approach and even a language to enable ‘composability’ across domains. This is an area I believe is very much ‘work in progress’ for data mesh.

Ultimately, applying the right type of modelling for the use case or purpose is the best recipe for success, whether document, 3NF, Data Vault or dimensional. And while modelling is firstly a logical activity, support in the underlying data platform for a range of data modelling approaches, with good performance, can significantly simplify the logical to physical mapping, enhancing flexibility. How well does your data platform support document, 3NF, data vault, and dimensional structures, meeting your logical needs without requiring painful physical adaptations?
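
For completeness, here is a sketch of the dimensional style mentioned above: a hypothetical star schema in which the fact table carries keys and measures, and the dimensions carry the descriptive context users slice by.

```python
from dataclasses import dataclass
from datetime import date

# Dimensions: descriptive context, one row per customer, product or day.
@dataclass
class DimCustomer:
    customer_sk: int
    customer_number: str
    name: str
    segment: str

@dataclass
class DimProduct:
    product_sk: int
    sku: str
    description: str
    category: str

@dataclass
class DimDate:
    date_sk: int
    calendar_date: date

# Fact: the measurable event, carrying only foreign keys and measures,
# so BI tools can aggregate it by any combination of dimensions.
@dataclass
class FactSales:
    date_sk: int
    customer_sk: int
    product_sk: int
    quantity: int
    net_amount: float
```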

5) Just land the data — the data swamp legacy

While the big data movement was partly driven by huge, internet-generated data volumes, it was also a response to the issues of complexity and the rate of change of that data. As a few organisations started to make serious money from tracking everything, there was a growing reluctance to throw anything away — but insufficient time to model it all. The more dubious lake practitioners argued that modelling was now old school. It didn’t help that lake platforms then — and even now — were best suited to working on sets of big flat files or flattened tables. When joining data across large datasets or multi-table models is painful, the temptation to create lots of denormalised datasets will be very strong — often leading to lots of duplication. Poor granular security further encourages this trend.

Lake as swamp

Burnt by this experience, there has been some push-back against two complementary trends seen in the cloud-based ‘modern data stack’: ‘cheap’ storage and the ‘land and transform’ (ELT) paradigm.

Many of the cloud data platform players have, at least to some extent, separated storage from compute. Cloud object storage is resilient and (relatively) low-cost. The charge is that this gives rise to lazy ‘land now, worry later’ thinking. Lots of data is kept for no known reason, and raw or poorly modelled data is used directly and never properly integrated. While storage is cheap, growing data volumes drive up consumption-priced compute, giving platform providers an incentive to encourage sloppiness in their customers.

This charge can’t be fully ducked — even cheaply stored data should sometimes be deleted, whether to reduce clutter or tread more lightly on the planet. However, I’d push back on the assertion that the vendors don’t care. At least in my backyard, Snowflake, I have come across data vault thought leaders and some of the best, most methodical data architect consultants I’ve worked with.

Many organisations have moved to a layered data modelling approach where the first layer takes the data ‘raw’, whether as tables directly matching those on OLTP systems, or unrefined JSON web and IoT logs. This ELT pattern isn’t new; it has been common in data warehouse implementations on platforms such as Teradata for a decade or more. The ideal aim is that the raw layer feeds into further layers, typically a conformed layer reflecting some canonical model (3NF or Data Vault, for example) and a presentation or delivery layer targeted at the end users (typically modelled dimensionally).
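
Here is a toy sketch of that layered flow, with hypothetical source fields: raw data is landed as-is, then refined into a conformed shape and finally a presentation shape.

```python
# Hypothetical layered (ELT) flow: land raw, then refine in stages.

raw_orders = [  # raw layer: loaded exactly as the source system supplied it
    {"ORD_NO": "A-1", "CUST": "42 ", "AMT": "19.99", "TS": "2022-10-04T09:15:00Z"},
    {"ORD_NO": "A-2", "CUST": "42 ", "AMT": "5.00",  "TS": "2022-10-05T11:02:00Z"},
]

def to_conformed(rows):
    """Conformed layer: cleaned, typed and keyed to the canonical model."""
    return [
        {
            "order_id": r["ORD_NO"],
            "customer_id": int(r["CUST"].strip()),
            "net_amount": float(r["AMT"]),
            "order_ts": r["TS"],
        }
        for r in rows
    ]

def to_presentation(conformed):
    """Presentation layer: shaped for end users, e.g. sales by customer."""
    totals = {}
    for r in conformed:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["net_amount"]
    return [{"customer_id": c, "total_sales": round(t, 2)} for c, t in totals.items()]

print(to_presentation(to_conformed(raw_orders)))
```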

There are legitimate reasons for holding data for longer — regulatory (proving that what you did five years ago was legitimate), cybersecurity (attack patterns can develop over months), data science and longer-term analytics (turning raw data into new features), or simply the ability to build new downstream products from old data with history already in place. Set against this are privacy regulations and the risk of breach, and the environmental costs of holding data with a short half-life for far too long. Ultimately this comes back to data ownership and the ‘why’.

Solution: Just because you can be sloppy, it doesn’t mean you should. An organisation with solid governance, a good high-level model of its data, and sound data architecture can benefit from the land-and-transform patterns enabled by cheaper storage and easy-to-use platforms, without creating Data Swamp V2. There can be value in not rushing to over-model data in detail, spending significant compute cycles and engineer time on transformations before its value has been established — and if there is no obvious value, or only a small subset of this particular data is useful, ditch the rest.

Likewise, let’s be realistic about the useful ‘half-life’ of your data, especially raw data — few regulations require retaining more than seven years’ history, and ML models rarely need even that, unless you are looking at long-term earth events. How good is your data platform at capturing dependencies and access history? This can help identify datasets that are never or rarely used, and avoid retaining data purely out of fear of downstream consequences.
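
As a simple illustration, assuming you can export per-dataset last-access dates from your platform's access history (the dataset names and dates below are made up), flagging candidates for review is then straightforward:

```python
from datetime import date, timedelta

# Hypothetical export of (dataset, last_accessed) from the platform's
# access-history metadata; names and dates are purely illustrative.
last_access = {
    "raw.web_logs_2015": date(2016, 1, 10),
    "raw.orders": date(2022, 10, 1),
    "analytics.customer_360": date(2022, 9, 28),
}

def stale_datasets(access_map, older_than_days=365, as_of=None):
    """Return datasets not accessed within the chosen retention window."""
    as_of = as_of or date.today()
    cutoff = as_of - timedelta(days=older_than_days)
    return sorted(name for name, last in access_map.items() if last < cutoff)

# Candidates for review (archive, aggregate or delete, with the owner's sign-off).
print(stale_datasets(last_access, older_than_days=365, as_of=date(2022, 10, 4)))
```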

In summary…

Like so many good things in data, good modelling arises from organisational commitment, the skills to apply good practices and patterns appropriately, well designed processes, and technology which enables, rather than getting in the way or forcing designers down narrow alleys. One can model (or not model) disastrously in most data platforms. Thankfully I’m working with a platform which allows users to model well and in diverse ways, while providing good performance — the rest is down to the team and the organisation.

Chris Jackson

I’m a Senior Sales Engineer at Snowflake. Opinions expressed are solely my own and do not necessarily represent the views or opinions of my employer.