Perhaps the most fundamental data cleaning steps involve working with rows — sorting and filtering, removing duplicates, and so forth. “Classic” Excel makes this process intuitive through menu-driven features, but the steps are often tedious and difficult to reproduce. For an alternative right in Excel, check out Power Query. Let’s take a look at some typical row-wise transformations:
The party planning committee has been collecting RSVPs and would like the resulting list sorted alphabetically with duplicates, blanks and misprints removed. Committee members are tired of manually re-sorting and removing unneeded rows each time more data is added.
Let’s help them out.
Loading to Power Query
The first step in the process is to convert the data to a table and load to Power Query. I will also rename the query rsvps
.
Using the Power Query Editor ribbon
From here, we can use various buttons on the Power Query Editor ribbon to clean the data. Regardless of sequence, each step will be visible on the Applied Steps menu.
It’s also worth noting that, just like in classic Excel, there are almost always multiple ways to perform the same task. If you have a preferred alternative to any of the below, go ahead and share in the comments!
Remove the missing values
Unlike basic Excel, Power Query has a dedicated field to represent missing values called null
. Our signup data has three missing values which will only cause confusion about who’s registered. To remove these blank values, head to Home on the ribbon and Remove Rows > Remove Blank Rows:
Sort the list
Next, we can sort this list A-Z so it’s easier for the Party Planning Committee to read — and for us to spot any other issues with the data.
This step should look quite familiar — click the drop-down next to the Sign-up
column and you’ll see the same old sort and filter list as in basic Excel:
Removing the duplicates
You may have noticed that Phyllis is entered multiple times in this list. To avoid any duplicates and potential confusion over headcount, let’s remove duplicates by heading back to Home > Remove Rows > Remove Duplicates:
Removing the misprint
This list is looking clean — except for one misprint. Here is where understanding the subject matter of your data is so important. Power Query can only go so far in with boilerplate steps to clean your data.
The final step is to filter out the Klevin from the data:
This is one of those things that has to be done in conjunction with your supervision. Power Query can help make the data clean but you have to specify what “clean” is for this situation.
Testing out the result
Nice work applying these steps to your query. To make the results more accessible, let’s load them back into Excel. From Home, click Close & Load:
At this point it may be helpful to rename the original sheet to something like source-data
and the new one rsvps-cleaned
.
Refreshing the query
We already dodged a couple of clunky Excel menus with Power Query, but its real benefit comes from the ability to refresh work at the click of a button. For example, let’s say two new rows get added to our RSVP list: an entry from Nate, and another blank row.
Rather than having to manually re-sort and delete your work, simply head over to your cleaned worksheet, right-click the table and click Refresh:
This workbook now contains a one-click reproducible data cleaning process that can be used for any RSVP list moving forward.
Have you used Power Query to clean up this or similar lists? What do you think of cleaning data here versus classic Excel? Let me know in the comments.
If you are looking to teach Power Query to your whole team check out my corporate training:
Leave a Reply