Power Query: Removing duplicates

Stephano Tonelli
3 min readMar 21, 2024

--

When trying to clean a dataset imported to Excel or Power BI using Power Query we can many times have duplicate values or updated values where the old register is no longer needed.

To do so using Power Query we must sort the data in a way that the relevant registers remain on the top of the dataset and the ones to be removed in the bottom.

For example purposes, I’ve downloaded a database from Kaggle called “Electric Vehicle Data 2023” (https://www.kaggle.com/datasets/utkarshx27/electric-vehicle-population-data**) and created 2 versions of the file, with a column called “Data Version” simulating an update the data could have received, and now we must keep the latest updated records with the old not updated data.**

But that’s when the tricky part comes in, if we simply sort the data the way we want and then remove duplicates, many times if not every time (in my experience with a project with over 500K rows) the data that we wanted to remove is not completely removed, it's like Power Query ignored the last command line and removed the duplicates using the original data sorting.

In order to fix that, I’ve found 2 solutions, the second one being much better in terms of memory usage.

  1. The first Solution would be to use the “Table.Buffer” Function when doing the sorting from the step before removing the duplicates. This function allows you to save the table in memory assuring that the next step will consider this table for the duplicates removal. The only problem is that depending on the size of the table it can make your query run much slower.
    https://learn.microsoft.com/en-us/powerquery-m/table-buffer
  2. On the second solution instead of using buffer that saves the whole data set into the memory, the best approach would be using “Table.StopFolding” which only forces the query to consider the last step and not the original dataset for the duplicate removal.
    https://learn.microsoft.com/en-us/powerquery-m/table-stopfolding
let
Source = Table.Combine({Electric_Vehicle_Population_Data_2023, Electric_Vehicle_Population_Data_2024}),
#"Sorted Rows" = Table.StopFolding(Table.Sort(Source,{{"Data Version", Order.Descending}})),
#"Removed Duplicates" = Table.Distinct(#"Sorted Rows", {"VIN (1-10)"})
in
#"Removed Duplicates"

--

--

Stephano Tonelli
0 Followers

A Mechanical engineer, Data analyst and Maker from Brazil, sharing my daily challenges as a Data Analyst.