Fuzzy matching in Excel Power Query is a technique for comparing and linking data that may not be an exact match but instead shares similarities. This approach is particularly beneficial for addressing errors, misspellings, and variations within datasets.
Leveraging advanced algorithms, this feature intelligently discerns patterns and resemblances between data points, enhancing the accuracy of data consolidation and cleansing.
To see fuzzy matching at work, consider the following demo dataset:
Introducing the datasets
In this example there two tables: customers
and transactions
. Our goal is to merge these two datasets.
Ideally, a primary key or another unique identifier would enable an effortless merge; however, such a clear-cut connector is absent here. The closest column suitable for merging is ‘customer name,’ which unfortunately contains inconsistencies.
This scenario is quite common in the real world. Often, organizations aim to link data from multiple sources, yet they face the challenge of doing so without straightforward identifiers due to variances in data entry or recording methods. Fuzzy matching becomes a vital tool in such cases, offering a solution to synthesize information across different systems and datasets.
Using fuzzy matching in Power Query
These two datasets are already loaded into Power Query. To view them, navigate to Data > Queries & Connections from the ribbon. Then, right-click on either query and select Edit to open the Power Query Editor.
Once inside the Power Query Editor, proceed to Merge Queries, choose Merge Queries as New, link transactions
to customers
via the customer_name
column, and opt for a Left Outer Join.
Make sure to enable the option “Use fuzzy matching to perform the merge.” As implied, fuzzy matching allows the merging of tables with values that are alike but not identical.
Before we proceed, let’s review the situation and contrast typical joins with fuzzy joins. Joins are typically executed by using a shared column between tables, which is the method we’re employing here, though in a more unstructured way. Instead of relying on a tidy and controlled identifier such as a customer ID, we must enable Excel to determine probabilistically which customer names match.
Customizing the fuzzy matching options
Fuzzy matching can become quite complex, and in enterprise-level scenarios, linking records in this way can turn into an extremely demanding task. Even within Excel, there are multiple ways to tailor the fuzzy join functionality, which you can access by selecting the “Fuzzy Matching Options” dropdown in the Merge menu:
I won’t explore every option available, but it’s important to highlight the role of the similarity threshold, which governs how strict or lenient the matching criteria will be.
Additionally, I will limit the “Maximum Number of Matches” to one. Fuzzy matching uses probability to decide which rows resemble each other, which might lead to either no matches or multiple matches for a given row. By setting the limit to one match per row, we prevent the possibility of duplicate results.
For more detailed information on these and other fuzzy matching settings in Power Query, you can refer to Microsoft’s official documentation.
To learn more about these and the other fuzzy matching options in Power Query, check out Microsoft’s documentation.
Unnesting and loading the results
Similar to other joins in Power Query, to see the results integrated into your table, you will need to expand them. Navigate to the “customers” column in your new query, click the expansion icon beside the column name, and choose the columns you want to display in the results.
For thoroughness, I will include all columns in the view. However, to keep the results concise, I will opt not to prefix the column names with the original column name:
You’ll notice that Power Query attempted to match the names as closely as possible, taking certain liberties in determining what rows are considered matches. In the instance of the final record, “J Smithers,” no match was found. Modifying the similarity threshold could alter this outcome.
Optionally, you may rename this query from Merge1
to something more descriptive, such as fuzzy_match
, to clarify its purpose before loading it into your workbook.
Compare your finalized results with mine below:
Fuzzy matching and AI-Powered Excel
Is fuzzy matching a form of AI, and why is it significant?
While fuzzy matching isn’t a standalone AI application, it is intrinsically linked to artificial intelligence. This connection is due to its capabilities for human-like decision-making, learning from data, and problem-solving, as well as its integration into AI workflows. Fuzzy matching algorithms emulate human thought processes to compare and align text strings, taking into account character similarities and the ‘edit distance’ between them.
Although these algorithms can refine their performance by learning from data, they do not possess the self-learning capabilities and have a narrower decision-making range than more sophisticated AI models. Nevertheless, their incorporation into AI-driven processes and tools has led to their recognition as a component of the AI domain.
The integration of fuzzy matching into Excel significantly democratizes data handling, empowering end-users with access to algorithms that were once exclusive to top-tier business echelons. We can anticipate the continued integration of this technology into Excel, which professionals should leverage to maintain their relevance in the workforce.
Have you used fuzzy matching before, such as in Power Query or in other capacities? What are the advantages and disadvantages of this method in your view? Share your thoughts in the comments.
Leave a Reply