The Coddfather: Relational Database Fundamentals

Exploring relational databases and why they’re ubiquitous

Jackson Prince
Better Programming

--

In the mid-20th century — the IBM era — there was a man named Edgar F. Codd, aka “Ted”, who had a passion for data. He proposed the theory of relational databases, which fifty years later remains the primary data storage architecture for the vast majority of websites and applications. The culmination of his work, the pièce de résistance that gave us Facebook, Snapchat, Instagram, LinkedIn, YouTube, and all the other user-based, social-media web applications is one powerful concept: the self-join table.

This series is written for those interested in learning the fundamentals of relational databases and why they’re ubiquitous. It’s also aimed at those curious about self-joins and their slightly more advanced application: the comparative triple-join. I’ll demonstrate the latter using Ruby on Rails in the next two pieces:

Building Self-Joins and Triple Joins in Ruby on Rails.

Movie Comparison Website in Ruby on Rails

Let’s get to it!

First, the Fundamentals

Imagine you and I want to replicate Facebook. Except this is the mid-1960s, before Ted’s proposal, and all of our related information must be stored on an enormous Excel-like spreadsheet. Something like this:

Remember, you and I just set out to replicate Facebook (2.4 billion users, 20 billion visits in the last 6 months), which means that instead of 7 columns and 15 rows, it’s more like 10,000 columns and 100,000,000,000,000 rows. (That’s over 250 million miles of data to scroll through.)

Note: due to the unfathomable volume of Facebook’s data, they use non-relational database systems as well as relational systems. Non-relational systems allow them to get even more specific. For this article’s purposes, though, let’s throw it back to the early 2000s and assume they still rely solely on relational systems.

Every time you add a friend, comment on a post or add a movie to your favorites – that small bit of information (“Becky” or “Hey Mom!!” or “Bill and Ted’s Excellent Adventure”) is added to a cell in a new row of The Great Excel Sheet. That one addition will then trigger a set of automatic calculations rippling over the entire spreadsheet: “add 1 to my ‘total friends’ column”, for example.

Furthermore, every time you want to directly change some bit of information [“new profile picture” or “edit that last comment”] the computer will dig through billions, possibly trillions of rows, plus a few thousand columns, to find and then change that one, single cell.

Is it possible? Sure.

Is it data Armageddon? Yes. Yes, it is. In fact it’s a true nightmare . A bloody, unreliable data nightmare.

Enter Ted Codd with a radical solution and a well groomed mustache à la Don Corleone:

Ted Codd

What Ted theorized is that we can break this enormous spreadsheet up into many smaller spreadsheets that are “related” to one another.

To conceptualize a “relationship” between data, think in terms of family relations. I’m related to my parents and my children are related to me. Our whole line shares some common traits. Those traits are defined by a unique DNA strand. That DNA makes us family.

Consider this table of parents:

And these children:

The children table holds a single, important bit of information that links the child to the parent: the name of the parent associated with that child. I can see all of Janelle’s children (Louis and Tanner) by looking up all children in the children table with her name represented in the parent column.

This makes sense and is helpful. However, in reality, children are not linked uniquely to a parent through the parent’s name. Linking parents and children together based on arbitrary, mutable factors such as name is disastrous:

Suddenly Louis and Tanner’s mom has cloned herself into two people living a thousand miles apart. …But are Louis and Tanner even siblings? Maybe they just happen have mothers with the same name. In that case, whose mother is whose? Uh oh.

We therefore need a unique identifier for every single person. We need something a bit more like DNA. Even better, what if we could get that DNA in number form? Computers are especially fond of that.

Enter: Unique Ids!

Unique ids are the pillar upon which the storage of information in a relational database stands. Here’s how they’re used:

You’ll notice the “name” column in the children table is now completely useless. So we remove that column and keep the parent’s id (identifying number) in a revised column called the parent_id column.

As long as we make sure no parents ever have the same id, ids can effectively act as “DNA.” They’re the one, immutable factor assigned to each parent that makes each parent unique.

With the introduction of ids, we now know with absolute certainty that Louis’ mother, Janelle, lives in Seattle and Tanner’s mother, also Janelle, lives in Albuquerque:

In the same way a single strand of DNA contains attributes which dictate our physical appearance and behavior, each id number contains the data that defines a person. Therefore, we don’t need the latter two columns, parent_name and parent_city — that information is implied with the parent_id.

Best of all, our tables are exquisitely simple! We’re not scrolling left and right, up and down, to find what we need. We have two separate tables, both responsible for one aspect of our database: one for children, one for the parents.

Let’s see this in action. If we want to discover information about a parent, we search the parent table for one, relevant row. While there, we see the parent’s name, age, location, etc.

We’re nosy, so we’re also interested in seeing information about the parent’s children (names, ages, locations, studies). That’s easy enough: just grab the parent id (1), hop over to the children table, and search all results with a matching number (1) in the parent_id column.

We’re left with a curated selection of highly relevant data.

That’s the beauty and usefulness of a relational database system in a nutshell. It allows us to focus our activity by separating concerns. We have two tables, handling very different data, yet they’re easily related with an id.

The initial example, where we were required to include all of our information in a single source, would force us to constantly rewrite and re-link elements. With Ted’s relational system and the powerful addition of unique ids, an enormous amount of work is avoided.

What About Facebook?

Let’s go back to our quest to recreate Facebook. Now, armed with Ted’s tools, we can manage our nightmarish data pile by splitting it into many smaller, highly focused tables, each concerned with one aspect of the larger picture. users, posts, comments and likes are now individual tables linked together by a simple, numeric common denominator: unique ids.

Let’s visualize this with a users (parent) to posts (child) relationship:

Each user has many posts, and each post has a single user, so we say that posts is the child table and user the parent table. In order to correctly set this relationship, we just need to make sure that each Post contains the “DNA” of the user, inextricably linking the two: User_id.

Each instance of a post includes the id of the post, some data (in this case Text and Time) and a User_id, which establishes the relationship to user:

Nice! Even though the first and third posts were posted at the exact same time and contain the exact same text, they’re unique in terms of their id (first column) and who wrote them, represented by User_id. We know with certainty that Janelle wrote the first post (id equal to 1), and Ted wrote the third post (id equal to 3).

Additionally, because all post data exists on a single table, we can easily draw information from and analyze data about all posts or all of Ted’s posts without having to first remove irrelevant information.

Most importantly, we’re not rewriting anything. Cleaner. Clearer. Less work. Fewer errors.

Voila! A comment and post are now related via the DNA or Unique id of the post, i.e. the parent. To boot, there are no unnecessary blank spaces in our tables and the overall efficiency of locating data is drastically improved.

What’s ‘Like’ Got to Do With It?

‘Like’ is an interesting feature in social media websites. A single Like contains two important bits of information: Who created the Like and What received the Like.

If we think about this in terms of a table, we’ll see that every Like carries with it exactly one User. It also carries with it exactly one Post. A Like is therefore a child of User and Post. It contains DNA from each of them:

We describe this as: one like belongs to a single user; that same like also belongs to a single Post. It has two parents. Those parents know or are related to each other through that single like.

By creating this table, we imply the following:

The columns in grey are unnecessary, but they illustrate this table’s behavior:

Janelle’s first post (Post_id = 1) “My first post!”, written at 10:45pm, Aug 09, 2021, has accumulated two likes. One came from Janelle herself (User_id of 1), and the other came from Ted (User_id of 2).

This type of relationship surfaces frequently in relational databases. It is called a many-to-many relationship, because one post can have many likes, and each one of those likes has a user attached. Therefore, each post can have many users associated with it. Likewise, each user has many likes, and each one of those likes has a post attached. Therefore each user has many posts associated with it.

Armed with this relationship, we can look up all of Ted’s Likes by searching through the like table for all instances in which his id (2) appears. We can then easily analyze Ted’s preferences, which the real Facebook has taken to enjoying too.

What About Friends?

OK, so we have a one-to-many relationship between users and posts, and we have a many-to-many relationship via likes. What about friends? If Janelle and I become friends, who is the parent and who is the child? The answer is: neither. We are both just users. We’re both instances of a user related through our friendship.

Unlike the users to posts model, in which one user has many posts and one post has a single user, this new “friendship” challenge requires both uers to have many other users. Janelle has many friends (users). I have many Friends (users). Thus, we only have one table to work with: users.

How do we display this information? We could create a “my friends” table that has all of my friends and each of their unique ids, like this:

If we went this route, we would have to create a new “my friends” table for every single user. Instead, we want one table that contains all friendships. To do this, we create what’s called a self-join relationship.

A self-join table is created in order to join one table to itself, i.e. Janelle and I are both instances of users, with friendship being the table that joins us together. Therefore, friendship is a self-join, because it joins users to itself.

The beauty of a self-join is its simplicity. We still only need three columns: one for my id (representing me and all of my information), one for Janelle’s id (representing all of her information) and one column to describe the friendship itself as one which is entirely unique to us: a friendship id. That’s it. Here’s what it looks like:

If we abstract this out to all friendships and not just my friendships, we see something like this:

I can now easily see that my id (in this case 5) appears in 2 rows of the friendships table. Therefore, I have 2 friends. I know which friends they are because their id is in the adjacent column (Ted 2 and Dennis 3). In order to retrieve data about either one of them, all I have to do is grab their id and search for it back in the user table. The same is true for them about me!

Those three simple columns make a social media website like Facebook possible.

Creating Comparative Self-Joins and Triple Joins!

In the above example, a friendship is shared equally by both users. There’s no difference if my id is in the second or third column. That design is effective for sites like Facebook and LinkedIn. But what about Twitter, Instagram and YouTube where one User is a ‘follower’ and the other is ‘being followed.’ In that case, the first user_id becomes ‘follower_id’ and second user_id becomes something like ‘the_one_being_followed_id.’

I’m not sure of the exact terminology for this style of self-join relationship, but “comparative” self-join makes sense to me, because we’re simultaneously linking two elements from the same table and comparing them against one another: one element is X (follower), the other is Y (followed).

If you want to see this twitter-style comparative self-join in real-time, created using Ruby on Rails, here are two excellent articles I kept going back to while learning the concept:

If you want to follow along with what is, in my opinion, a more intuitive example case of the comparative self-join, then follow the link below: a demonstration from start to finish creating a Movie Comparison app.

We’ll be relating two movies to one another via a comparison table (the self-join), then relating that comparison to a specific user. The result is a triple-join.

Sneak peak:

From the above, we extrapolate that User 1 and User 2 have opposite preferences when it comes to Movies 1, 2 and 3.

Awesome.

If you’re leaving off here, then thanks for reading! Otherwise, click here and I’ll be with you momentarily.

Building Self-Joins and Triple-Joins in Ruby on Rails

Cheers to you, Ted.

Happy Coding

Jackson

--

--