In my book Advancing into Analytics I called VLOOKUP()
the “duct tape of Excel:”
Technology tactics move quickly. Excel now offers XLOOKUP()
as a modern alternative to VLOOKUP()
. But the theories and concepts behind these tools rarely change. Whether done through VLOOKUP()
or XLOOKUP()
, Excel’s “duct tape” lookup functions differ fundamentally to the welder of Power Query’s relational joins. Let’s take a look using two small sales and country tables:
Combining datasets depends on shared columns
If you’ve written any lookup function in Excel, whether VLOOKUP()
, XLOOKUP()
or something else, you know that the name of the game is returning values from one table based on corresponding matches from the other.
Take, for example, the sales
and countries
tables in the exercise file. We can “look up” the country_name
from countries
based on corresponding values of country_id
that are found in both tables:
It’s easy to repeat these lookup functions on rote when accomplishing work tasks under deadlines. But let’s take a closer look at what’s actually being done with the data here — what does it tell us about the department numbers and names in either table?
XLOOKUP()
has limited ability for missing values
You’ve likely seen when a lookup function can’t find a match in the lookup table and an #N/A
error results. Should this really be considered an error? After all, the lookup function is working as expected. There’s just no matching value in the other table.
#N/A
errors can cause real issues in workbooks. For example, trying to take an average from cells that contain an #N/A
error will just result in another error:
There are ways around this: for example, XLOOKUP()
includes a parameter to customize what value is returned when no match is found:
However, leaving a cell entirely blank raises its own questions. Is this cell meant to be blank on purpose? Is it a mistake? For all the trouble it presents, clearly marking non-matching cells with #N/A
does have a certain logic to it.
Ideally, there could be a dedicated record for missing values, similar to #N/A
, that doesn’t throw errors in subsequent analysis. Unfortunately, that’s a data type basic Excel can’t give us. But Power Query can.
XLOOKUP()
cannot return records from the lookup table
Another consideration of XLOOKUP()
— it doesn’t tell us anything about the values found in the lookup table itself. For example, country 4
, Panama, is listed in the countries
table. There are no corresponding employees in sales
. Is this a mistake? Are there just no sales in this country? Do we want to be aware it exists when reporting on this data?
Everything XLOOKUP()
does is relative to the table with lookup values — unique information found in the lookup table is not addressed.
Relational joins are like a welder
While XLOOKUP()
is useful to adding a new column to an existing table based on corresponding values in a lookup table, it lacks a couple of important features for more systematically combining data between two tables:
- It should be able to search for and return values that are found in either table.
- When a match is not found in both tables, the affected records should either be dropped or marked with a dedicated missing value symbol.
Relational joins, on the other hand, let us do precisely these things. Let’s look at the four major types using our data from above.
Left outer join: think XLOOKUP()
In this join type, all records from the first table are kept and matching values in the second are sought. If no matches are found there, a null
results. This join type is almost identical to the XLOOKUP()
except for the use of null
to represent missing values:
Inner join: only the matches
Perhaps it’s possible that if either table contains a record without a match in the other table, you’d like to throw it out entirely. This might be done to only keep complete records, to drop data with potential integrity issues.
Using these rules, the inner join will result in the fewest number of rows of all join types discussed here.
Right join: like a XLOOKUP()
in reverse
Remember how there was an extra country, Panama, in the countries
table that wasn’t found in sales
? We can use the right outer join to include it in the join results, and drop any unmatched records from sales
. This is like a “backwards XLOOKUP()
.”
Outer join: Bring it all on!
Finally, the outer join, or full outer join, will represent records found in both tables, regardless of a match. In this one we’ll see both the mystery country_id
99 and the unused country
Panama, with some null
s in the mix to represent the unmatched values.
Outer joins return the most number of records from all the join types.
From lookup joins to Power Query welders
In a future post, we’ll cover how to actually execute these welder-like relational joins right from Excel with Power Query. But for now, are the concepts making sense? Are there places in your work where this greater range of join options would be helpful? Let me know in the comments.
Leave a Reply