How to Build an Optimized Relational Data Model?

Ram Teja Y · Published in The Startup · Jan 7, 2021 · 6 min read

Image source: industrywired.com

Introduction:

Just like the image above suggests, data can turn out to be complex! Making data talk is in our hands, and it depends on how smartly we use it. This page is for people who want to supply data to reports or dashboards in the most efficient way, and also for beginners in the data industry who want an idea of how data modeling works.

Data is available in numerous forms in this world, from ancient scripts to high-resolution videos to the thoughts and memories in our brains! Data comes in structured, unstructured and semi-structured forms. In this huge ocean of data types, I'm going to keep things simple and pick data in the form of tables in a relational database model as the example for this write-up. Let's go step by step from the moment you first look at the data you are given.

Understanding and Classifying your Tables:

Let us take, for example, that we have been given some sales-related data with the four tables below:

Example tables with Field Names

First thing we will need to do is to understand the meaning of the tables and fields in them. Here we have 4 tables:

  1. Orders: Contains order requests from customers: what item(s), how much quantity, in which unit of measurement (kgs, liters, cartons and the like) and by when the order is expected to be delivered
  2. Invoicing: Contains the invoicing details of the ordered items: what item is invoiced to which customer and on which date, the value of the invoice, whether any other amounts like tax and discounts are involved, and the customer payment terms (credit, upfront and the like)
  3. Customer: Holds basic details of the customer, including contact, shipping and billing addresses, and which customer group/channel they belong to
  4. Product: Describes the items being sold, the groups and brands they belong to, and the default unit of measurement used for the item in the organization

You can understand the meaning of the tables by conversing with the relevant DBAs or business people and, more importantly, by looking into and playing with the data. Once this is done, let us classify these tables into master and transactional tables. In data warehousing terms these are also called dimensions and facts, respectively. Keep in mind that there can be other types of tables, like lookup tables and relationship tables, which may be used indirectly to connect the main tables in your data model; these can be classified and used accordingly as well.

In our example, Orders & Invoicing are transactional tables and Product & Customer are master tables. Transactional or fact tables are usually very busy: they keep getting new data loaded continuously, and they typically hold quantified data or metrics like sales value and quantity. Master or dimension tables are the ones that provide meaning and detailed descriptions for key attributes, and they are updated less frequently; for example, a new entry is made in the product dimension only when the company introduces a new product. Another key example of master data is the very important date/time hierarchy. We can tag numerous useful fields to a date or a timestamp: week, quarter, previous year same date, month begin & end dates, fiscal year start & end dates and so on.
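As a small sketch of what a date dimension can hold, the snippet below derives a few calendar attributes for a single date. The function and column names are my own, not from any particular tool, and a real date dimension would also handle edge cases such as Feb 29 when shifting a year back.

```python
from datetime import date
import calendar

def date_attributes(d: date) -> dict:
    """Derive common date-dimension attributes for one calendar date."""
    month_end_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "date": d.isoformat(),
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,          # Jan-Mar -> 1, Apr-Jun -> 2, ...
        "iso_week": d.isocalendar()[1],
        "month_begin": d.replace(day=1).isoformat(),
        "month_end": d.replace(day=month_end_day).isoformat(),
        # Note: fails for Feb 29; a production date dimension handles that case.
        "prev_year_same_date": d.replace(year=d.year - 1).isoformat(),
    }

row = date_attributes(date(2021, 1, 7))
```

A date dimension is typically generated once for a long range of dates and joined to facts by date, so every report gets these attributes for free.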

Find out granularity and relationship between the tables:

The granularity of a table is the level at which a record or row is unique in the table. Most software, like ERP systems, generates an ID column automatically for popular tables, but it is good to work out the level of granularity while omitting the ID column, as that helps you understand the data even better.

In our case, the Orders table will be unique at the customer, product and order date level, as a single customer can request multiple products in a single order. The order line number is usually generated based on the individual items involved in the order. The order ID will also be unique per row. The Invoicing table is at the product, customer and invoice date level, and the invoice ID is also unique. The Customer table is at the customer number level, and the Product table is at the Stock Keeping Unit (SKU) level.

Now that we know the granularity at both the data and the ID level, we can proceed to find the relationships, i.e., the joins between the tables.

Relationship between tables

In our case, the naming is straightforward, so it is easy to identify Customer_ID and Item_ID as the fields to join between the dimensions and facts. There certainly can be cases where more than one field is needed for a join, and the analysis we did before this step will be useful in identifying joins in those cases. You may also wonder what the difference is between the ID and number fields in each table; you will get a better understanding of this when we discuss primary & foreign key constraints.

Though we've identified the joining columns, a very important thing to keep in mind is to use the correct join type. The most suitable choice for a fact-dimension model is a left outer join with the fact table as the left object. The reason is simple: we do not want to lose transactional values when no supporting dimension entry has been made for a corresponding transaction. It is always better to show a default value than to skip that transaction.

Maintaining optimal table structure:

The steps done so far are actually enough to supply data to a report or a dashboard, but our objective is not just to provide data but to provide it in the most efficient manner, so that neither the user experience nor the health of the database system is compromised. To do this, maintaining correct datatype lengths is important. For example, if a column called SKU_Number was created with a datatype of varchar(256) but we know for a fact that SKU_Number can only be 12 characters long, we can alter the datatype to varchar(12). These kinds of small things play a major role in speeding up retrieval time. In addition, it is good to remove unwanted or repetitive fields from a table, as they take up unnecessary space and affect performance.
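Before shrinking a column like that, it is worth checking that no existing value exceeds the new length. A minimal pre-flight check, sketched with SQLite via Python (sample SKUs are invented; the actual `ALTER TABLE ... ALTER COLUMN ... TYPE varchar(12)` syntax varies by engine, and SQLite itself does not enforce varchar lengths):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (SKU_Number VARCHAR(256), Item_Name TEXT);
INSERT INTO Product VALUES ('SKU000000001', 'Widget'),
                           ('SKU000000002', 'Gadget');
""")

# Longest SKU currently stored; if it fits in 12 characters,
# shrinking the column cannot truncate any existing data.
max_len = conn.execute(
    "SELECT MAX(LENGTH(SKU_Number)) FROM Product"
).fetchone()[0]

safe_to_shrink = max_len <= 12
```

Only run the ALTER once this check passes, otherwise the engine will either reject the change or truncate data, depending on its settings.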

Create Primary and Foreign Key Relationships:

This is another step that can drastically increase the performance of your data model. The reason for having the ID columns, even though we have number columns in each table, is this step. In SQL, joins tend to be much faster when the joining columns are integers rather than varchars. Also, creating a primary and foreign key relationship makes the join process easier for the engine, improving speed. To make these terms concrete: in the Customer table, Customer_ID is the primary key, and the Customer_ID in the Orders or Invoicing table is a foreign key referencing the Customer table. Customer_ID in the Customer table must be unique for each row for this constraint to be applied.
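Here is that primary/foreign key pair declared in SQLite via Python (note SQLite needs `PRAGMA foreign_keys = ON` per connection; other engines enforce foreign keys by default). The constraint does exactly what the text describes: an order pointing at a nonexistent customer is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite: enforcement is per-connection
conn.executescript("""
CREATE TABLE Customer (
    Customer_ID INTEGER PRIMARY KEY,      -- must be unique per row
    Customer_Name TEXT
);
CREATE TABLE Orders (
    Order_ID    INTEGER PRIMARY KEY,
    Customer_ID INTEGER REFERENCES Customer(Customer_ID),
    Qty         REAL
);
INSERT INTO Customer VALUES (100, 'Acme Ltd');
INSERT INTO Orders VALUES (1, 100, 10);
""")

# An order referencing a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (2, 999, 5)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

Beyond speed, the constraint guarantees the left join to the dimension can never silently fail to match for data-quality reasons.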

Use Indexing if required:

Indexing is generally helpful when a table has many records. An index organizes your data on a particular field or combination of fields (depending on how you create it) for faster filtering and retrieval, as the database now knows where to look instead of scanning the whole table. Indexing on a dimension column like quarter of year or product group may be useful. The choice of column for indexing depends entirely on what the data looks like and how many rows the table holds.

Conclusion:

Having covered all these points, the most important factor for successful data modeling is understanding how your data behaves in different scenarios. Keeping an eye on this will help you devise custom solutions using the right combination of the steps above.

Good luck and enjoy your data!
