I’ve written in the past how Power Query enforces good data hygiene practices that aren’t always in Excel. One of those is the presence of a null
or dedicated missing value. This has been a real problem for people in the past:
Good news… it doesn’t have to be anymore, with Power Query. In this post we’ll look at what null
means in Power Query, and how to use it properly. Let’s practice on a a modified version of the penguins
dataset:
Go ahead and load this data into Power Query by clicking inside the table and selecting Data > Get Data > From Table/Range. (Check this post for more detailed steps.)
For this exercise, we will focus on the row ID = 4
(row 5 back in Excel). Take a look at how these values are represented in Excel versus Power Query:
The values that were blank in Excel are now null
in Power Query. (We’ll circle back to the NA
under sex
.) This is Power Query’s dedicated way to encode missing values. But what is a missing value?
It does NOT equal zero!
Possibly the most common misconception of a missing value is that it equals zero. It does not! If we knew what the value should be, we would just fill in a zero! By definition, we don’t know what the missing value stands for. To be extra careful this mistake isn’t made, we use a special value null
just for missing values. Otherwise, it’s too easy to mix up zeros and null
s.
We don’t know why it’s missing
Did you know the word data comes from the Latin for “what is given?” That means if our data is missing, it wasn’t given. It’s an unknown unknown: we don’t know why it’s missing, nor do we know what the real value is.
This seems like a trivial point, but I remember a past boss asking me why values were missing. And while it’s important to understand how the data was collected, the data itself can’t tell us that. The best we can do is represent and store that missing data the best we can. And that is by using the null
value.
We should always represent missing values as null
(which starts as a blank in Excel)
Not all coworkers or bosses are going to “get” null
. They may prefer a more “human readable” missing value like “Not available” used in the raw spreadsheet data. And, to be fair, Excel in the past was pretty ambiguous about what to use for missing values. But these days, we should always represent missing values as blanks in Excel, because those become null
in Power Query.
To understand the importance of being consistent with missing values, let’s take a look at summarizing our null
s:
We can count up null
s with column quality
With missing values, many statistical models break. They may be signs of poor data quality and collection methods. When exploring the data, we should be able to take stock of missing values easily. That’s what null
lets us do.
In the Power Query editor, head over to the View tab and check on “Column quality” under Data Preview. Here you will see a menu appear at the top of each column summarizing what percent of values are valid, contain formula errors, and are empty. And to Power Query, empty means null
and null
alone.
Recoding values as null
Those of you who have been been eating your carrots you might have seen the NA in record 4 of sex
. It appears that whoever collected this data marked some of the missing values this way, perhaps as “not available?” It makes sense as a value, but not to Power Query. If we keep these values as NA
, they won’t be identified as missing which could impact later steps of the analysis.
To fix this, we will right-click on the sex
columns and select Replace Values. This should look very similar to a Find and Replace in regular Excel, except this time we’ll replace the selected phrase with the Power Query null
:
If you look carefully, you’ll even notice that the percentage of missing values in the Column Quality menu goes from 2% to 3%. Want to take a closer look? Get the completed exercise file here:
Don’t “miss” out… use null
s!
If you use databases and SQLs, null
serves a similar function. It’s one that Excel sorely lacked and has a lot of benefits. Most importantly, it makes things unmistakably clear to data users what values are truly missing, and not just 0 or some placeholder value.
How have you used null
in the past for data analysis? How could this have helped you in Excel in the past? Let me know in the comments.
You can learn more about Power Query in my book Modern Data Analytics in Excel:
Jon Peltier
Sorry, the previous comment submitted itself too early.
I’ve tried to get Microsoft to add a NULL() or BLANK() function that a formula could return, so that other formulas would treat that result as a null value. Apparently it would require too much reengineering of the way the grid is calculated.
That little checkbox in Hidden and Empty Cells that allows you to Show #N/A as an Empty Cell? That was what they gave me (and the rest of us) when they couldn’t create a NULL() function.