Google data analytics course 4 week 3: Data cleaning with Excel

Zainab Kareem
8 min readAug 15, 2022

--

I Know you’re wondering why I did this, I will tell you now.

I decided to do this because it would a lot of add a lot of clarity to my learning and to know if I really understood the course plus the completed dataset sent didn’t really a have work through on how it was done(merged).

The datasets

The datasets were gotten from the google data analytics course: The “Global Logistics Association” and the “International Logistics Association.

CASE SCENARIO

Two logistics company have just merged and they both have records on customers from their respective companies. By procedure these companies will have to join records as they have to share record, hence the merging of the datasets.

STEPS TAKEN

It’s always a good practice to save a copy of the dataset while working on a project. One can always make a reference back to it and also for caution in the case of data cleaning. So first I saved a copy of each dataset.

In data merging the destination data and source data should be chosen first.

I choose the global dataset has the source file and the international dataset has the destination file.

I checked for duplicates with a unique value/row in the international logistics dataset. Which is the Member ID

Using the conditional formatting. Found a duplicate value 100027. which appeared twice . I deleted one of it manually.

In the global dataset a new address column was created, by concatenating concatenating column D and E to form column F(New address). This is to ensure that the number of address in both datasets are the same.

I checked for spelling errors. In both datasets, found a spelling error in the destination dataset(international logistics). where a bracket “{“ replaces the letter “P” in professional member. Changed this the bracket “{“ to a “P”.

N.B: This step was already shown in the course video

Some of the cells share values that is, values that are spreading across multiple cells. So that needs to be adjusted as it could have an effect when extracting/merging the dataset.

N.B: The highlighted cells are the cells are cells that are merged.

The above picture shows the home tab from the global logistics dataset. where the cells are merged and centered.

After unselecting merge and center from the home button(under alignment). All merged cells are no longer merged.

EXTRACTING CELLS

Next I start merging cells from the source dataset(Global Logistics) to the destination dataset(International Logistics). I started with the First names, I used vlookup to extract the First names from the source dataset to the destination dataset.

The vlookup has 4 syntax the lookup value, table array, column index number and the range lookup. Each syntax separated by a comma.

N.B Each syntax argument has the name of the source file before it. Because this is where the values are being extracted from.

A close up image of the lookup function.

I started with the second column on the destination dataset(row 73). I used the vlookup function to extract the lastnames on the source dataset to match.

The vlookup has 4 arguments the lookup value, table array, column index number and the range lookup. Each argument separated by a comma.

N.B Each syntax has the name of the source file before it. Because this is where the values are being extracted from.

I extracted the next column next using vlookup and copied down the formula to the rest of the cell. The cell B72 in the formula. The Column B was locked down to keep the column in place.

An image showing the rest of the column being extracted. Address 2 was extracted using vlookup as well.

For the third address(Address 3) earlier I showed some columns were merged .

N.B: The highlighted cells are the cells are the ones that are merged.

After unmerging some of this columns became empty. Extracting them would cause these columns to return “0” as the values of the now empty cells. The vlookup value is nested with the “IFERROR” function

The iferror function has 2 syntax, value and value if error. I do not want zeros . i want the value returned with any empty cell to be an empty cell not the number zero.

The nested functions of iferror and vlookup
Before using the nested functions and after using them.

Moving to the next column I extracted using vlookup. Locking the table array as to keep the table value from moving around.

The Address 5 column which is the next column had some blanks cells too. The nested functions of “IFERROR” and “VLOOKUP” was used.

Extracting the next column(Dues amount), with the “VLOOKUP” function.

extracting the currecncy values below

The extracted values were not formatted like the values that they were being extracted to, this is beacause on the home tab under number its being formatted as general.

So changing from general to currency will change the formatting but adding two zeros and I do not want that.

Removing the extra zeros with a keyboard shortcut, with the extracted values still selected. I used ALT+H+9 to remove extra zeros from the selected values.

The ALT+H+9 was used twice so as to remove the two zeros.

Now the column is in the right format and the extra zeros are gone :).

The next column was extracted using the vlookup function( The function must be crying by now).

Just like the pervious column extracted, the format of the extracted column isn’t correct. The values are all right aligned which means excel recognizes it as kind of number.

I changed the format of the date on the number , from the home tab. The date is now well formatted. How do I know this? Apart from it now showing as date, it is now grouped by year when the filter dropdown menu is clicked. Which is the way date displays in excel.

As seen in the above image, one the dates “2/29/2022” isn’t formatted as a date but a text. This is because excel doesn’t recognize it as a date. It shows 29th of February 2022 but such date doesn’t exist as the year 2022 is not a leap year. To get excel to format this as a date the year either as to be changed to 2020 or 2024(Those are leap years), the date could also be changed too. I do not have enough information about this date value, so I will not be changing it.

“In Analysis there’s no guess work, If you do not know dont answer”

The last column was extracted too, using the vlookup function.

Extracting the columns from the beginning I started from the second column which means I’m yet to fill the member ID column. The reason why I did this is because I wanted to extract all rows so I would get the total number of rows.

Using CTRL+spacebar to select the column, I used a keyboard shortcut ALT+H+F+D+S+K to go to special and select all blank cells in the column.

Writing a formula with = and Pressing the up arrow key, to select the cell above plus 1. So the value of the cells can be the value, of the cell above it plus 1 till the end of the column. Pressing CTRL + enter key to fill the column down.

I want the member ID to show the actual values and not the formula.

I copied the column using CTRL+C

Double clicked so as to paste special values

Now the values are no longer showing formulas in the formula bar.

The merged dataset now have a total of a hundred and fifty three rows and eleven columns. :)

--

--