If you’ve ever encountered the challenge of sorting through data to find or manipulate specific text, you’ll know the potential headaches it can cause. Excel’s integration of regular expression (regex) functions has been a game changer, equipping users with powerful tools to search, extract, and replace text in ways that go beyond the capabilities of functions like FIND()
or SEARCH()
or even newer dynamic array text functions like TEXTBEFORE()
and TEXTAFTER()
.
Regex becomes crucial when you need advanced pattern matching or more nuanced text manipulation than what simple delimiters and character positions can provide.
As powerful as learning regular expressions can be, I wouldn’t advise spending too much time trying to commit the language to memory. It really is a language of its own, with special characters for matching everything from escapes and new lines to lowercase letters and so on. In fact, regex can be so daunting that it’s the basis for this well-known “joke” among computer programmers: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
Given these considerations, I won’t delve deep into explaining how the regular expressions in the following examples function. Instead, I’ll rely on generative AI tools and other resources to construct the regex strings, which we’ll then apply directly in Excel.
If your regular expressions don’t work as expected at first, that’s okay; regex often involves some trial and error and iteration. That’s part of the unpredictable nature of regular expressions; they’re tough to make foolproof against every possible exception or exclusion that might come up!
We’ll explore a basic use for each function with simple examples, focusing mainly on finding the first string that matches the regex for each case. Keep in mind these functions come with additional parameters we won’t be using here, like options for case sensitivity and whether to return only the first match or all matches within the text.
To follow along, please download the exercise file below:
This dataset consists of a few cells that contain diverse types of information like email addresses, phone numbers, and dates. Our objective is to employ regular expressions to extract each specific piece of data. This approach is particularly useful in text mining scenarios. For instance, you might sift through medical transcriptions to pinpoint all the dosage numbers recorded or scan video transcripts to identify all mentioned websites.
Testing the expression with REGEXTEST()
Let’s begin with the REGEXTEST()
function. This function will consistently return TRUE
or FALSE
, indicating whether the text matches the specified regular expression pattern.
This function is especially useful for validating text data, for instance, verifying if an email or phone number is formatted correctly, or for spotting patterns in text strings that go beyond what simple text functions can manage.
Starting with REGEXTEST()
is a smart move to initially verify if your regular expression works before delving into more specific tasks you want Excel to perform with it.
In this example, I will check if each cell contains a valid email address, returning TRUE
if it does. At this point, I’m not overly concerned with crafting the regular expression itself. we can leverage various tools for that, including generative AI right within Excel using Copilot (more details on that later!).
=REGEXTEST(A2, "^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$")
Although this function might not appear particularly thrilling, it can be very useful when combined with the FILTER()
function. This combination allows you to easily generate a list of all cells where at least one match is found (note that while there might be multiple matches. This setup won’t list them individually — the next functions we cover can be used for that purpose):
=FILTER(A2:A8, REGEXTEST(A2:A8, "^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$") = TRUE)
Extracting data with REGEXEXTRACT()
The REGEXEXTRACT()
function in Excel is designed to pull out the first match of a specified regular expression pattern from a given text. Unlike REGEXTEST()
, which only checks for the existence of a match, REGEXEXTRACT()
actually returns the matching substring from the text. This makes it invaluable for extracting specific parts of text based on complex patterns, like extracting phone numbers, dates, or any identifiable sequence within larger text bodies.
In this scenario, we’ll extract valid phone numbers from the list. Note that this regex will recognize phone numbers formatted with either parentheses or periods. If we needed to be more specific, we could adjust the pattern accordingly.
We’re assuming that the entries in rows 6 and 7 are not phone numbers with international country codes but rather dates. As you can see, defining what exactly constitutes a “valid phone number” can be quite complex. Regex requires a lot of precision and really forces you to thoroughly consider the problem at hand! That’s one aspect I particularly enjoy about coding and computational thinking: the need for precision.
=REGEXEXTRACT(A2, "\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})")
One more thing to do here. I really dislike the #N/A
errors that are returned here. They’re messy, confusing, and if we attempt any downstream calculations on these cells, like counting populated cells, we’d encounter more #N/A
errors.
To address this, we can wrap the function in an IFERROR()
to clean it up. I’m hoping future versions of this function will include a parameter for specifying what to do when no match is found, similar to XLOOKUP()
.
Replacing with REGEXREPLACE()
Lastly, the REGEXREPLACE()
function is designed to replace text that matches a specified regular expression pattern with another string. It scans through the provided text, identifying matches to the pattern, and substitutes each match with the designated replacement text. In this example, we’ll search for any dates within the cells and replace them with the word “Date”.
=REGEXREPLACE(A2, "\d{2}-\d{2}-\d{4}", "Date")
This approach is particularly effective for data masking. For instance, if you have Social Security numbers or other sensitive information you wish to conceal, you could employ regular expressions to mask them with a placeholder like XXX-YY-ZZZZ
.
Regular expressions and Python + Copilot
I mentioned earlier in this post that regular expressions are far from new in computer programming. Most programming languages include built-in regex capabilities, and Python is no exception.
Thanks to Python’s integration with Copilot in Excel, in my experience, it’s often the preferred method over native Excel regex functions when you seek assistance from Copilot to generate regular expressions. This approach works well, but it does require some knowledge of Python and prompt engineering. For example, in this case, I’m going to request all valid email addresses from our original data source:
This approach will filter out only the valid records, similar to what we did with FILTER() and REGEXTEST(). But what if that’s not what we wanted, and instead, we aimed to keep all records while creating a new column to extract the first valid email address from each? Simply use those exact words in your request to achieve what you need!
So, does this mean regular expression functions in Excel just aren’t needed or are out of date? Not really. There’s definitely still a mental barrier or stigma for many users when it comes to opting for Copilot/Python. Perhaps it’s not readable enough for some, or they might not have the necessary tools in their version of Excel. Therefore, the native regex functions in Excel are there for you to use, and they perform very well in the Excel environment.
What questions do you have about regular expressions? Can you think of any immediate use cases? What do you perceive as the trade-offs between using regex in Excel versus regex in Copilot with Python? Share your thoughts in the comments.
Resources
For additional resources and insights into regular expressions and regex functions in Excel, explore the following resources from Microsoft:
- New Regular Expression functions in Excel
- Regular Expression Language – Quick Reference
REGEXTEST()
functionREGEXEXTRACT()
functionREGEXREPLACE()
function
Leave a Reply