Cleaning Up Messy Data in Power Query

Harry Fry
5 min readJan 10, 2022

--

Photo by Mika Baumeister on Unsplash

I am sure we have all had it when we have been handed data in a CSV file that has made our head hit the desk upon opening. Whether this is due to poor formatting decisions, misunderstood data extraction or simply naivety, it’s important to not get too overwhelmed in the process of making tabular data readable.

In this article, I am going to take a snippet of data that I have previously come across (obviously with new data populated by myself — you will see why) and some of the formatting and data cleaning issues that I have faced.

The Data

Snapshot of the data

The data is split into each individuals weightlifting record. Each person has their progress on each movement across ten sessions. From the first name, down to the tenth session, this counts as one dataset. So, what we essentially need to do is combine each person’s weightlifting records into one dataset, where the column names are date, first name, last name, session, bench etc.

The first step we will need to do is to actually get this inside the Power Query Editor. I am most familiar with using the editor inside Power BI so this is where I will be using it. A link to the CSV can be found on my GitHub.

Columns and Headers

Same Dataset in the Power Query Editor

Once we are in the Query Editor, you will see that we actually have two blank columns on the right-hand side (Column8 and Column9) and using the ‘Column Quality’ data preview in the ‘View Ribbon’, we can see that both columns have 2 distinct values. These will likely be rogue spaces in our dataset. So what we first want to do is remove these blank columns. We can do that by holding Ctrl+clicking on both blank columns > right-clicking and selecting Remove Columns.

Since seven of our ten proposed columns are on the top row, we can also just use the first row as headers and assign these now.

Using First Row as Headers

Splitting Columns by Delimiter

Since our first name, last name and date are contained in separate columns, it would be ideal if we could extract this. Since none of our other data contains any colons, we can actually split each column by a custom delimiter. In this case, we will use a colon and then a space as our delimiter, as these are the characters that come immediately before the data we want.

To do this all we need to do is right-click on our column, in this case, it will be ‘Session’, go to ‘Split Column’ and then by Delimiter. We can then use the custom split to enter a colon and a space and then to split it at the right-most delimiter.

Example of Splitting by Delimiter

As you can see above, this has now split the first name from the column. We now need to do the same with our bench column, to get the last name and the squat column, to get the date.

The Rest of Our Columns Split by Delimiter

This has now split the data that we want into its own columns. We can now rename our columns as First Name, Last Name and Date and then drag all of them to the left-hand side. We will also want to amend the other columns that now have a .1 value behind them. We should now be left with our columns looking similar to the below.

Cleaned Up Columns

Cleaning Up the Stragglers

We should now be at a point where our columns are okay but we now have an issue with some rogue rows and null values. Since we have a completely blank first row, we can use the remove rows feature in the ‘Reduce Rows’ ribbon in the Home tab.

Using the Remove Rows Feature

Choose the remove row and select ‘remove top rows’, ensuring you only select 1 row to remove.

Specifying how Many Rows to Remove from the Top

Fill

We should now have our first name, last name, and date values in row 1. Our issue now is the null values inside those three columns. Ideally, we would want the individual's name and date repeated across each row throughout their sessions. Power Query actually has a fantastic feature in the Transform tab called ‘Fill’. Fill will essentially take the top row on your column and fill that cell of data all the way down the column until it changes, then when it registers a change, it will fill the rest of the column with that change and so on.

Using the Fill Feature

So if we hold down Ctrl and select our three first rows. We can then go to Fill > Down and you will then see the results.

Results from Fill

This has now filled our data with our values across its corresponding columns. We can then filter our Session column by the rest of the data that we want to remove from the rows (or we can delete the rows manually it’s your choice).

Filtering out the Rest

Using the Filter to get Rid of Rogue Rows

We can now change our datatypes by clicking on the left-hand side of the column name and voila! We now have our cleaned-up data ready to be loaded into Power BI.

Voila!

--

--

Harry Fry

If you can’t explain something to a five-year-old, you don’t really understand it yourself.