Adding a calculated column is one of the most typical tasks in data cleaning — profit ratios, date durations, you name it. Power Query makes this a simple part of the data cleaning process through the magic of the M language. Let’s get started:
This dataset consists of season records for every major league baseball team since 2000. We would to derive a column calculating each team’s season win record, as calculated by number of wins divided by number of wins + number of losses.
The first step, of course, is to load the data into Power Query. Then from the ribbon of the Editor, head to Add Column > Custom Column.
A Custom Column menu will appear. Let’s kick the calculated column tires a bit by writing "Hello world"
in the “Custom column formula” box:
Underneath the formula box, you will see a link that says “Learn about Power Query formulas.” Click that link and you’ll see documentation about Microsoft’s M language. This is a good reminder that everything done in Power Query is powered by M — even the steps generated by clicking through menus.
Go ahead and click OK. You should now see a column called Custom
with Hello world as an entry in each record:
Renaming the column
Between the Custom
column name and the “Hello world” contents, this calculated column is pretty useless! Go ahead and clean this column up by clicking the gearwheel next to “Added Custom” in the Applied Steps menu.
From here, we can change the column name to Wcpt
and the formula to
[W] + ([W] + [L])
You can refer to a column either by double-clicking on it in the “Available columns” box to the right of the formula area or typing it (with the help of the Intellisense auto-complete feature).
Now this is a column we can use. Let’s make one more change and load to Excel and explore.
Converting the data type
In Excel, a column can contain cells of any number of data types. In Power Query, every cell in a given column is a specific data types. Winning percentage is, of course, a percent field, so we can go ahead and change it by clicking the icon to the left of Wpct
:
Data is both easier for the end user to view, and more efficient for Excel to work with, when given the right data type.
Loading & inspecting the data
Our new column is calculated and ready to work with. On the Power Query Editor ribbon, head to Home > Close & Load > Close & Load To, then select PivotTable Report:
From here, you can crunch the data to, for example, find the average Wpct
for each team name
.
Calculated columns versus measures
While this was a satisfying and interesting analysis, it’s worth noting that what we have in the PivotTable is a straight season winning percentage average. Seasons with fewer number of games figure disproportionately. We can confirm that something’s afoot by comparing Average of Wpct
to our own derived Sum of W / (Sum of W + Sum of L)
calculation:
To avoid this, we could use a dynamic measure that is performed on the data in real-time to aggregate and calculate given specific context of the analysis. That’s what Power Pivot and DAX are for.
Given this hiccup, should you avoid calculated columns in Power Query for good? Not necessarily. They are easy to create, and efficient to compute. But if there’s any chance these columns might be aggregated in a misleading way, you’re better off with a DAX measure.
Have you derived a calculated column in Power Query before? Did you have any issues with the data types or aggregations? Let me know in the comments.
Leave a Reply