Data cleansing examples

From this article: you will learn what business data hygiene is and what stages it consists of, you will see concrete examples of how to clean the database, and you will understand why it is worth carrying out this process systematically.

Published in

Blog Transparent Data ENG

10 min readFeb 25, 2021

Data cleansing, also often referred to as Data cleaning, is in fact not a single activity on the database, but a whole process involving the use of several techniques. Their goal is one: to have a clean - reliable, consistent and complete - database. Clean data is nothing more than high-quality data, data that we can trust and based on which we can make the right decisions. In business, where it is estimated that 25% of all data is outdated each year, and contaminated data translates into up to 20% less revenue, this is especially important.

As a data software house that deals with the cleaning of aggregated data from trade registers and various business databases of our clients on a daily basis, we will tell you today what database hygiene is in practice: what are the various stages of B2B data cleansing and what they are all about. First, however, let's provide some good arguments why it is worth taking care of this.

Business data cleansing - invention or necessity?

Nowadays, IT systems generate and process countless amounts of data daily. From a technological point of view, this is not a big challenge for our computers, servers or cloud solutions. With such a huge amount of information, the challenge is something else: the hygiene of databases, i.e. maintaining them in excellent quality.

How great this challenge is shown by the data from the latest report by Experian 2021 Global Data Management Research:

companies estimate that approximately ⅓ of all business data about customers and potential customers is inaccurate,
55% of leaders do not trust the data their organizations own,
only 50% believe that their CRM / ERP data is clean data and can be fully used.

Moreover, as many as 95% of companies notice negative effects related to low data quality.

Benefits of data cleansing

Benefits of regular data cleansing are primarily the problems that dirty data generate in enterprises. Low quality data:

wastes resources (human and time) and generate additional costs,
lowers the credibility of analytics and the accuracy of decisions made,
causes delays in the implementation of tasks,
negatively affects the customer experience,
adversely affects the reputation and trust of customers,
hinders compliance with the rules resulting from regulatory obligations (more about RegTech).

Data cleansing in 5 steps (with examples)

Different data types require a different approach, so the techniques used to clean up data may differ slightly depending on the database you are dealing with. Nevertheless, usually the business customer databases are quite similar (they always contain company registration numbers, e-mails, addresses, etc.). Hence, in the remainder of this article, we will primarily focus on data cleansing these types of records.

B2B data cleansing is a process that usually consists of at least five steps. Those are:

Data validation
Formatting data to a common value (standardization / consistency)
Cleaning up duplicates
Filling missing data vs. erasing incomplete data
Detecting conflicts in the database

Below we describe how data cleaning looks like in each of the stage, together with simple examples of implementation.

Data cleansing Step 1: Data Validation

Any company that has business records in its database, i.e. company data, knows perfectly that many of them is data that should be (and can be) checked for its correctness. Of course, we could assume that all company identification numbers, postal codes or e-mail addresses have been entered correctly in the database or that a business register in which we have verified the contractor certainly does not contain errors, but in practice it is not. Erroneous data can happen even in the best public commercial registers and it is no different in internal databases, where records are entered manually by employees.

This is why data validation, i.e. data verification in terms of meeting certain top-down conditions and logical principles, is the first stage of database hygiene.

For example, let's take the validation of the list of tax numbers of Polish companies imported from some X system:

Data cleansing example: Data Validation of company TAX numbers (raw data)

If you do not have experience in working with company data, you may not know that the last digit of each tax identification number is not accidental in many of countries. In Poland this is called a ‘check digit’ and it is calculated on the basis of an algorithm that can be validated. Briefly, validation of the Polish check digit consists in multiplying each of the first nine digits of the tax number by weights (in sequence: 6, 5, 7, 2, 3, 4, 5, 6, 7), summing the results of this multiplication, and then dividing checksum by 11. The remainder of the division should be identical to the last digit in the tax number.

If we calculate the checksums for the tax numbers given above, it turns out that three of them are incorrect: 4980117337, 5260300292, 000000000. Therefore, they should be deleted from the database.

This step of database cleaning (validation) has passed the following tax numbers:

Data cleansing example: Data Validation of company TAX numbers (data after validation)

Data cleansing Step 2: Formatting data to a common form

The next step in improving the quality of the database is to normalize the data to a uniform form. This procedure is used primarily to facilitate the search for information about a given company in the database.

In the table we pasted above, you can see immediately that some tax numbers were written with dashes, spaces or the prefix “PL” which stands for Poland. So now you need to format all company tax numbers to a common form. How? First of all, since we know that this is a database of Polish business clients, we can safely omit the prefix with the country code. Second, the best option in this case will be to write all numbers without any special characters separating the digits.

Thus, we get the following result:

Numbers are not the only values that we can bring to a consistent form in this way. E-mail addresses or website addresses can also be brought to a common form by writing all of them in lowercase. And it is certainly worth it, because what would database hygiene be if it did not make the database more consistent and easier to use? Exactly!

Data cleansing Step 3: Cleaning up duplicates

After standardizing the data format, the next step in data cleaning is to check whether our database has some duplicates that could not be detected earlier due to a different save format.

After conducting such an analysis, we discover that in our original database it was possible to find two records with the same tax number: 7540335340 and 754 033 53 40.

Our table, after removing duplicates from it, looks as follows:

Data cleansing example: removing duplicates

The above example is limited to finding duplicates by values in one column. In practice, however, some data defines a unique record with more data arranged in different columns. For example, you can search for duplicated people by first name and last name, and in this case use two separate columns - one for the first name and the other for the last name.

Data cleansing Step 4: Filling missing data vs. erasing incomplete data

The next step in database hygiene is preventing the possession of incomplete data. Anyone who works with data at least a little knows well that the information, in addition to being reliable and up-to-date, should also be complete. Incomplete data contaminates the database, lowering its business quality.

As an example, let's take the database of B2B contractors addresses, which are saved in CRM in the following format: voivodship, commune, postal code, city and street.

Data cleansing example on addresses database: Filling missing data vs. erasing incomplete data

Let's assume that in our system we want to have only complete company addresses, i.e. complete data sets (incomplete data does not contribute anything to the business process). We can approach this topic in two ways:

delete all records that have an empty value in any field (which is not an ideal solution, because we lose a lot of information),
complete incomplete records (which is a much better choice, considering that a voivodeship or a commune can be easily completed based on the name of the city or postal code), and only what cannot be retrieved with a supplement (in this case, e.g. sets with empty street info) remove.

Of course, we decide to clean the database the second way.

In order to facilitate this task and perform it fully professionally, it is necessary to define some repetitive and exhaustive rules that will apply to this data set in turn. They take the following form:

If the voivodship field is empty, we complete it based on the city.
If the city field is empty, we check whether we can determine the city name based on the postcode field (we will not always be able to do this - there are many common postal codes for various smaller towns and villages).
If the commune/district field is empty, we complete it based on the city and postal code.
We are introducing a few rules for clearing the data in the street column, such as clearing null strings or removing values where there are no letters other than street.
In the last step, we get rid of the records that are still left with empty values in any of the fields of a single dataset.

After applying the above set of rules, our cleaned database of company addresses looks like this:

Data cleansing examples on addresses database: filling missing data vs. erasing incomplete data (table after cleansing)

Data cleansing Step 5: Detecting conflicts in the database

The last step in our data quality improvement process is the so-called conflict detection. In the terminology of working with data, conflicts are data that are contradictory or mutually exclusive. As you can easily guess, properly performed data hygiene aims to track them all down and mark them properly.

Continuing the example with the address database, we can check, for example, whether the zip code, city and commune match the voivodship entered or whether there is a conflict somewhere. Performing such a quick analysis, you will notice that one of the records is incorrect:

Data cleansing examples: Detecting conflicts in the database

In this dataset, the voivodeship does not match the rest of the address provided.

What can be done now with such a conflict? If you know who entered the data into the system, contact that person to explain the error and enter the correct values. However, if it is impossible for some reason, you should first of all properly mark this record in the database. Thanks to this, in the future it will be easier for us to decide whether to use such a record or not in further data processing. Thanks to this, if we wanted to carry out a statistical survey by provinces, for example, we would be able to simply omit such conflicting, "uncertain" records, so as not to introduce errors in the calculations.

Sometimes, the data hygiene of B2B records databases containing company numbers includes one more activity that involves the detection of conflicts, and is aimed at checking the validity of the information: namely, detecting conflicts in data with other national business registers (verifying whether a given company has an active business status in the National Court Register). Then, companies that have been deleted or suspended in the registers are appropriately marked so that we can decide later whether we want to remove them from the database.

Data hygiene, how often to do it?

The data hygiene of our clients 'and potential clients' business databases is not a topic that we can leave alone. No good manager should assume in advance that employees of various departments have never made and will not make mistakes when entering new data or that everyone will adhere to uniform recording standards. Error is a human thing, so it simply has to be time for data cleansing in the enterprise. Either it should be performed by a properly trained employee (data analyst / programmer with knowledge of the specifics of working with data), or we should outsource this task to an external company specializing in this subject, preferably one that has an ISO / IEC 27001 information security certificate.

How often should data hygiene be carried out in the company? Well, it depends on the size of the base. Medium and large enterprises with a large number of records should repeat data cleaning every 3-6 months. For smaller companies, it is enough to do data hygiene about once a year.

Authors of the article:

Mateusz Antkowiak | PHP developer Transparent Data

Professional programmer for almost a decade. He began his journeys with code from such exotic destinations as Lazarus Pascal or Delphi, and now he is drawn into the stable waters of object-oriented programming. He has extensive experience in process automation, which is useful in his current RegTech tools in Transparent Data.

Magdalena Nowaczyk | Marketing & PR Manager Transparent Data

Our brand communication person who produces words by the meter. Translates the technological complexities into the H2H language and is not afraid of difficult topics. She has been working in the marketing industry for over 9 years, of which almost 5 in our company.

You might also like:

How to deal with legacy systems? Strategies for getting out of technological debt

Building a new system from scratch, code refactoring, and code rewrite - these are three strategies for getting out of…

medium.com

The legacy systems problem — the ABC of technological debt

‘Legacy systems’ —this name perfectly reflects both their obsolete and problematic nature. Those applications and…