Cleaning Dataset In Excel

Harsh Bhise
6 min readApr 19, 2022

--

Hi everyone, in the upcoming lines we will be learning how to clean a raw dataset in excel from the very basic concepts. Before moving forward to continue with the steps you will need basic knowledge of the tools and techniques used in Excel. The dataset that I will use to demonstrate will be linked below so follow along with the steps to get a better understanding of the process.

Alright before we dive into the process we need to understand what is the need for cleaning data? why should we bother to clean our dataset and just start analysis? does cleaning data even make any difference?

The most straightforward answer to these questions is, “Garbage In Garbage Out”. We want our reports and analysis to be crystal clear so that it is easy to understand even for people who do not have much technical knowledge.

Ok Then, enough talk and let’s get our hands dirty with data!

A Quick Note — in our case, we are using an excel file that has XLSX file format, if you are using any other dataset that has CSV file format you have to perform the step that I have mentioned at the end first and then follow along.

  • Importing Our Dataset: Importing a Dataset in Excel is a straightforward process and may take up time depending on the size of your dataset and your hardware specifications.
Raw Data

Here, we need to properly space each column so that we see what type of values are there in each column.

  • Blanks Entries: Almost every raw Dataset that you will ever work on will have either blank cells or blank rows. To solve this problem we will first select all the rows and columns and use the Find and Select Function to get all the blanks selected and delete them using the Delete Sheet Rows function.

This goes well for blank rows but things get a bit complicated when we have blank cells, here we have to decide whether to delete the whole row for one blank cell or fill in values calculated using Statistics or place a dummy value, something like ‘Null’ or 0 (both the entries carries different meaning). There is no one-shot answer for this, here you need to decide what the best option is, and depending on that take your next steps.

A quick tip before moving forward if you are like me and you don’t like the visual formatting that comes with the dataset you can just clear the formatting using the Clear Formats Function.

  • Duplicate Entries: A common problem that every dataset has is duplicate values. This can hinder your reports and visualizations and even cause errors when performing Statistical Analysis on your Data. To solve this problem first we need to select our data and in the Data tab in excel use the Remove Duplicates function. Make sure that the ‘ My Data has Headers’ box is checked.
  • Cell Formats: A Very Important step of our cleaning process is verifying whether our cells are in the correct format, that is they have the correct type of data based on our column’s naming conventions. For example — checking decimals, currency, symbols, date, time, and number formats.

Before moving forward we need to make sure that every cell value is formatted otherwise we will surely encounter errors while we perform Analysis, trust me you do not want to come back to this step and redo all the formats. Alright to perform this step We will use the Formats Cells Function.

after following through all the steps of formatting cells it should look something like this.

  • Cell Spacing and Proper Case: Often we miss these small things which take our process of cleaning data to near perfection. Ensuring that all the columns which have data in the type of text are in Proper Case and have Proper Cell Spacing. We can achieve Both by using the Trim and Proper Case Function at the same time.

after applying these functions to the country column we can just copy-paste values from C Column to B Column.

Okay, till now our dataset is looking pretty clean compared to when we started our process. There are still some things we need to take care of before moving to the Analysis Part.

Sometimes some manual tweaks are needed to be done to make our work as close to perfect as possible. Here if you will observe each country’s name is in its full form but ‘USA’ is not in its proper form. There might be a chance that some entries are in the form “USA” and some in the form of “UnitedStates”, this might cause problems so let’s fix that using the Find and Replace Function.

Alright, Now our dataset is looking clean but there are still some tweaks that I like to do. As we discussed before we will change some of the visual formats of our dataset.

I have added All borders and Bolded the Column Names. It will not affect your reports in any case but I like to keep the data visually appealing (personal preference). You can also go ahead and add colors to individual columns or even the whole table I’ll leave the creative part to you.

At this point, you are Good to go but there are some extra steps you might need to check before moving forward especially if you are working with any other dataset other than the one I have linked below.

  • Combined Data: Sometimes when you import from a CSV file certain columns such as dates or names are combined. In our dataset we already have the dates separated Still, for demonstration, I will change some values so that we can perform the steps to tackle this problem.
Dates

We can do this using the Text to Columns function. The same steps are as follows for the name column. Make sure that correct Delimiters are used to separate data. When importing data from a CSV file format this will be the first step you will perform.

A Quick Note — There is a direct function for this in excel called the Split Function But When Importing Data From a CSV File You will Need To Perform The Step Shown Above.

We have finally reached the end of our data cleaning journey Good Job! Before going there are some points that I want to discuss, I do not own the rights to this data so if you are going to use this for your personal use you are on your own. I have used this only for instruction purposes.

All the steps instructed above need not be performed in any serial order or are even needed in some cases. The objective of this article is to show you how you would clean a dataset in an ideal situation, there is no straightforward way to clean every dataset, it solely depends on the data. It might happen that even after performing all these steps there might be still anomalies in your data that needs specific fixes or in the best case scenario you might need to only use 2 or 3 functions to get a clean dataset, again solely depending on data (and luck too).

Alright, Hopefully, that was informative and helpful to you in any way possible Thank you for your time.

Here is the link to the dataset

--

--