Simplifying 3NF

Chris Jackson
7 min readNov 10, 2022

--

In a recent post I was rude enough (albeit tongue in cheek) to question whether someone who couldn’t describe 3NF in simple terms should be working in data…. A couple of people I highly respect then told me I’d just sacked them…

It made me think — could I meet my own challenge? How recently had I reviewed the definition of normal forms, and my ability to explain them to a lay person? So I offer here my first attempts, in the hopes of provoking others out there to offer something much better….

Let’s imagine different situations where someone has asked the question: ‘What’s this about data modelling and 3NF?’ (and really seems to want an answer). This might be how I would respond verbally, possibly with the help of something like a whiteboard or napkin to scribble on.

Exec in the elevator:

“The data we hold is a like a model of our business. Data modelling is a set of disciplines which try to make that model as useful as possible, and help avoid confusion and ambiguities. It’s a critical component of being a data-driven company.”

Busy manager in a meeting:

“The way we choose to hold data will impact what we can do with it and how robust it is. Companies which take data seriously take data modelling seriously.

Normalised modelling, which includes 3NF, aims to ensure that: (a) you know you have sets of distinct ‘things’, such as individual customers or products, (b) the facts you hold about those things at any time are clear — such as name or address or size, and (c) you understand how each ‘thing’ relates to other ‘things’, such as customers to products.

As a simple example, when we store customer orders, we don’t store all the product details on every order because we could then end up holding conflicting versions of product details….”

Interested colleague over a coffee break:

“The way we choose to hold data will impact what we can do with it and how robust it is. The data we hold is like a model of the real world we operate in. Data modelling is the art of making that model useful, flexible, and unambiguous.

The basic set of rules data modellers use to test their models is called ‘normalisation’. It’s a pure maths term, but don’t worry, you don’t need a maths degree to understand and use the rules. Those rules are often summarised as ‘each fact in a table or dataset is dependent on the key (1NF), the whole key (2NF), and nothing but the key (3NF)’.

Let me give you the rules and some examples. Think of a table of data with several fields or columns (I’ll use the words interchangeably):

1NF says there has to be a key — a field or fields which uniquely identify a row in a table. 1NF also says that once you have a key, you know what will be found in each of the other fields (unless it’s empty — that’s a special case, and you can define whether it’s allowed). So, you should only have one ‘fact’ in each field, and you need to know what type of data you expect — for example a number or a date or a string.

A simple example: Have you used a service where they rely on email address as a key — but also try to hang your name and birth date off that? Suppose you and your spouse share that email address — now, what facts are they going to hold in the name and birth date fields?

2NF is relevant where a key is made up of more than one field or column. It says that the other columns in the table must depend on all the key columns, not just a subset.

3NF assumes any 2NF problems are fixed, but also says that the other columns must depend directly on the key, not via some other column.

There are several further forms of normalisation, but 90% of the time, getting to 3NF actually fixes all of them.

Why is this important? It makes sure you can capture everything about the ‘things’ in your system, consistently.

Example:

Let’s take an example — suppose we are running a cabaret with Events, and for each event a Customer can book one or more Tables (the sort you sit at).

What is the unique key here for the Bookings? Well, it’s not Event plus Customer — that’s clearly not unique. Maybe Event and Table? Let’s go with that for now, and assume each Event only plays on one Date.

So, what rules are we breaking here? There are fields which depend on just one of the key fields — such as the Event’s Date and Table’s Size (assuming it’s fixed) — breaking 2NF. There is also a field, Address which depends on Customer — breaking 3NF.

The problems with holding the data in this way are

a) we might record them inconsistently, such as with table 23’s Size or Ms Dietrich’s Address, and

b) there is no definite place to record facts which aren’t linked to bookings — other customers’ addresses, or sizes of unused tables, or the dates of events before any bookings take place. Fixing these problems is known as ‘normalization’ and almost always involves creating more tables. For example:

Of course, identifying and solving these problems can raise other important questions.

The business might tell you that the table’s Size represents the number booked for that particular Event, for which places must be prepared, not the maximum size. So we could have a Booked Size column in the Table Bookings table, and a Maximum Size column in the Tables table. (Just as in an online shopping app, the customer may have default billing or shipping addresses, but a specific delivery address for each order.)

You may also realise that we’re missing an Event Customer Bookings (or Event Customer Party Bookings) table, which captures data at that level, such as expected arrival time (or at a bare minimum the existence of the booking, perhaps before tables are allocated and numbers finalised).

Data modelling is at least partially an art form, the most interesting bit being teasing out of the business the details, rules and relationships they may never have thought about explicitly.”

Yes, but:

“A few questions which may occur to you:

a) Now we have split all this data up, is it less usable? A key aspect of ‘relational databases’ is that they are very good at joining all these facts back together, based on the shared values. (You may have heard people use the term foreign key.) It can get confusing for less data-savvy users, but all that joining can be hidden for them, behind ‘views’, which look a lot like tables but hide the SQL join language.

b) Doesn’t this slow everything down? In systems which capture data, 3NF can actually speed things up when adding or changing or deleting data, by only having one row in which you need to do that change. When reading data, good databases are very fast at joins.

But for high-speed dashboards, or analytic applications, we sometimes collapse the data back into tables which ‘break’ the rules, so we can query them faster — this is called ‘de-normalisation’. When you do that, you should ideally understand the rules you’re breaking. And you do sacrifice flexibility and reintroduce duplication when doing this.

c) Don’t these multi-field keys get very complicated? Yes — especially when capturing messy things like separating father and son who share both a name and an address, or handling things like orders for which customer, date, time, product, product colour could all be needed for a unique key. Joining tables with lots of key fields can be slower.

That’s why many systems have business keys like customer number or order number. And, under the covers, database implementers sometimes create single-column ‘surrogate’ keys for almost everything. This is fine and it can allow some flexibility — we just need to keep an eye on what the ‘real key’ might be, and make sure, for example, we aren’t squeezing two different things into one table — or at least know we’re doing it.

d) Don’t things change all the time — like people’s addresses or product prices? Absolutely — which means we often have to bring a time factor into our data and keys. Sometimes that’s needed in the front-end systems — for example the day’s product price. Sometimes it’s added in or implied at the analytic stage as things slowly change.

But I see we’ve finished our coffees and I’m afraid for more detail you’ll need to take me out for lunch…”

--

--

Chris Jackson

I’m a Senior Sales Engineer at Snowflake. Opinions expressed are solely my own and do not necessarily represent the views or opinions of my employer.