
Why is normalization vital for data integrity?

Priya Yogendra Rana
Geek Culture


For those unfamiliar with normalization: normalization is the process of organizing data to maintain data integrity and eliminate duplication.

You implement RDBMS principles to ensure that data quality is upheld, not compromised. A lot has already been written about what normalization and RDBMS principles are, so here let's stick to WHY we normalize.

Let’s understand this with an example.

You are asked to extract data from multiple raw, unstructured data dumps of different departments’ databases into a single coherent table.

In the data dump tables, almost 80% of the columns are the same, and some accounts in the key column appear across multiple tables. But the data is not organized, structured, or clean: there are null values, duplicate entries, and incorrect data.

You have to create a single table or view so that a Tableau or Power BI developer can build a dashboard presenting the organization's Key Performance Indicators (KPIs) for the different departments as well as for the management of the organization.

Scenario 1:

You proceed to write one comprehensive SQL query on this raw data, merging the data from all units after applying all the business conditions. Keep in mind that, apart from the business logic, you need conditions that handle null values, duplicate entries, and incorrect data, and you must ensure mechanisms are in place so that users can only view the data they are entitled to.
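To make this concrete, here is a minimal sketch of what such a monolithic query might look like. The table names (sales_dump, service_dump), the columns, and the embedded discount rule are all hypothetical, and the FULL OUTER JOIN assumes an engine that supports it (PostgreSQL or SQL Server, for example):

```sql
-- Hypothetical Scenario 1 query: merge raw dumps, patch nulls, deduplicate,
-- and embed a business rule, all in one statement.
SELECT DISTINCT                                                  -- crude de-duplication
    COALESCE(s.account_id, v.account_id)                        AS account_id,
    COALESCE(NULLIF(TRIM(s.account_name), ''), v.account_name)  AS account_name,
    CASE                                                         -- business rule baked into the query
        WHEN s.region = 'EMEA' AND s.amount > 10000 THEN s.amount * 0.95
        ELSE s.amount
    END                                                          AS adjusted_amount,
    v.open_tickets
FROM sales_dump s
FULL OUTER JOIN service_dump v
    ON s.account_id = v.account_id
WHERE COALESCE(s.account_id, v.account_id) IS NOT NULL;          -- drop keyless rows
```

Every new rule has to be wedged somewhere inside this one statement, which is exactly where the trouble starts.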

As months (and maybe years) pass, more business changes are implemented and you keep modifying the original query. With every new business condition, the layers of complexity deepen. Eventually a point is reached where inserting one 'new' piece of business logic distorts another part of the query and produces incorrect output, and this often goes unnoticed because the output was correct before the addition.

Scenario 2:

Instead of writing one big SQL query, you normalize the databases and tables: you define master and transaction tables and specify the keys, constraints, and other necessary conditions on records. For example, all account details go into an accounts master table and sales rep details go into a sales rep master table, while transaction tables record the activities of the sales reps for their respective accounts. Only then is a SQL query written to extract data from these master and transaction tables into a single table or view that can be used in Tableau or Power BI.
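A minimal sketch of this setup, using illustrative names only (account_master, sales_rep_master, sales_transaction, v_sales_kpi), might look like this:

```sql
-- Master tables hold the reference data once, with keys and constraints.
CREATE TABLE account_master (
    account_id   INT PRIMARY KEY,
    account_name VARCHAR(100) NOT NULL,
    region       VARCHAR(20)  NOT NULL
);

CREATE TABLE sales_rep_master (
    rep_id     INT PRIMARY KEY,
    rep_name   VARCHAR(100) NOT NULL,
    department VARCHAR(50)  NOT NULL
);

-- The transaction table records each activity against valid accounts and reps.
CREATE TABLE sales_transaction (
    txn_id     INT PRIMARY KEY,
    account_id INT NOT NULL REFERENCES account_master (account_id),
    rep_id     INT NOT NULL REFERENCES sales_rep_master (rep_id),
    txn_date   DATE NOT NULL,
    amount     DECIMAL(12, 2) NOT NULL CHECK (amount >= 0)
);

-- The reporting view the Tableau / Power BI developer consumes.
CREATE VIEW v_sales_kpi AS
SELECT a.account_name, a.region, r.rep_name, t.txn_date, t.amount
FROM sales_transaction t
JOIN account_master    a ON a.account_id = t.account_id
JOIN sales_rep_master  r ON r.rep_id     = t.rep_id;
```

The keys and constraints now carry the rules that the monolithic query had to encode by hand, and the reporting query itself stays short.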

Now, let's assess both scenarios with respect to data:

1. Data Quality

Data reliability and accuracy define data quality. Consider a scenario: you have to apply one condition in one operational unit and a different condition in another.

In scenario 1, with a high number of joins, there is a higher probability that adding one join contradicts another. With a highly skilled SQL developer on the team this is attainable, but it is not a sustainable or cost-effective process in the long run. In scenario 2, these disparate conditions are handled by separate jobs that pull data from the different operational units independently of one another, so data quality is not compromised. And if a data issue does appear, you can easily identify the root cause and thereby reduce resolution time.
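As a sketch, reusing the hypothetical tables above, the unit-specific rules can live in separate load steps that never touch each other (the dump table names and the rules themselves are again invented):

```sql
-- EMEA load: its discount rule stays local to this statement.
INSERT INTO sales_transaction (txn_id, account_id, rep_id, txn_date, amount)
SELECT txn_id, account_id, rep_id, txn_date, amount * 0.95
FROM   emea_sales_dump
WHERE  amount IS NOT NULL;

-- APAC load: a different rule, in a different job, with no effect on EMEA.
INSERT INTO sales_transaction (txn_id, account_id, rep_id, txn_date, amount)
SELECT txn_id, account_id, rep_id, txn_date, amount
FROM   apac_sales_dump
WHERE  amount IS NOT NULL
  AND  txn_date >= DATE '2023-01-01';
```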

2. Data Integrity

Data quality is a subset of data integrity: integrity encompasses quality, but also contextuality, completeness, and consistency.

In scenario 1 there is a higher chance of compromising data integrity, and that includes duplication. With multiple layers of complexity, there will be not one but many instances where one change disturbs a previous condition or piece of logic, and the developer will find himself or herself treating similar issues over and over again. In scenario 2 this compromise is handled efficiently, which also ensures better use of resources and time.
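In scenario 2 the database itself rejects the kinds of defects that scenario 1 has to patch inside the query. Using the hypothetical tables from the earlier sketch:

```sql
-- A duplicate account violates the PRIMARY KEY on account_master.
INSERT INTO account_master (account_id, account_name, region)
VALUES (101, 'Acme Corp', 'EMEA');

INSERT INTO account_master (account_id, account_name, region)
VALUES (101, 'Acme Corporation', 'EMEA');          -- rejected: duplicate account_id

-- An orphan transaction violates the FOREIGN KEY to account_master.
INSERT INTO sales_transaction (txn_id, account_id, rep_id, txn_date, amount)
VALUES (9001, 999, 1, DATE '2024-03-01', 500.00);  -- rejected: account 999 does not exist
```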

3. Data Scalability

In scenario 1, every new business condition adds complexity to an already complex SQL query, which, depending on the data size and the demands placed on it, may eventually bring down the whole model.

In the same context, scenario 2 keeps the system scalable in terms of complexity: the effort of inserting additional business conditions stays the same or becomes easier. New changes may lead to the creation of more tables or the addition of more columns, but the size of the SQL query remains manageable.
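For example, still within the hypothetical schema sketched earlier, a new requirement such as account tiers becomes a small, local change rather than a rewrite (CREATE OR REPLACE VIEW is assumed here, as in PostgreSQL or MySQL):

```sql
-- New business attribute: add a column to the master table.
ALTER TABLE account_master ADD COLUMN tier VARCHAR(10) DEFAULT 'STANDARD';

-- Expose it by adjusting only the reporting view.
CREATE OR REPLACE VIEW v_sales_kpi AS
SELECT a.account_name, a.region, r.rep_name, t.txn_date, t.amount, a.tier
FROM sales_transaction t
JOIN account_master    a ON a.account_id = t.account_id
JOIN sales_rep_master  r ON r.rep_id     = t.rep_id;
```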

Yes, at times the need for denormalized tables on the Tableau and Power BI side is paramount, but I'll close by saying that ensuring data integrity and quality makes it imperative to practice normalization and implement RDBMS principles. It's better to normalize before we denormalize. Au revoir!!

Priya Yogendra Rana (Geek Culture) is an ex-software developer, MBA, and data analytics enthusiast who brings together business focus and data skills.