Converting scanned paper documents into editable computer text files is nothing new — it’s called optical character recognition (OCR) and it’s been around since the 1970s. This technology, of course, has improved tremendously since then and is available in a variety of programs — including Excel.
Importing the image into Excel
For this demo, we’ve got a handful of customer reviews that exist only as printed copies. We’d like to import them into Excel with the ultimate hope to conduct sentiment analysis on them. Let’s give it a try:
Open a new Excel workbook and select Data > Get & Transform Data > From Picture > Picture from File:
From here, you can navigate to and select the scanned_reviews.png
exercise file. Import the file and you should see a Data from Picture menu to the right of your workbook:
Reviewing possible errors
Just like magic, Excel was able to convert this into text… right? Well, OCR is not magic. It does make mistakes. And, thanks to other artificial intelligence (AI) features, Excel can even detect where these mistakes are likely located.
In this case, Excel has flagged all but one record as likely containing an error. You can click Review to scan and double-check each of them, then make any adjustments to the data. For example, the first entry starts with the letter 1
when it should be the pronoun I
:
You can continue to search through and review any possible erroneous entries. Click “Insert Data” when you’re done to move the results to Excel.
False positives and negatives
Excel’s AI does a decent job predicting when text is likely to contain an error, but it’s not perfect. For example, it could detect an error in one entry when none exist — known as a false positive in statistics. On the flipside, it could approve an entry that actually does have errors — a false negative.
The balance between flagging potential false positives and false negatives is one of the biggest topics in statistics and machine learning. For now, we’ll trust Excel to get it right, but as you continue in your analytics journey there may be places where it’s best for you to decide yourself.
Inserting and reviewing the data
Excel isn’t the most natural place to store unstructured data like text, so you may encounter some issues with inserting the text.
For the sake of organization, it’s best to keep each review in its own discrete cell, so go ahead and manually make any adjustments. For example, it appears my text on rows 6 and 7 should really just be one review:
I’ve consolidated them into one in the solution file, available in this post’s exercise files.
Images in Excel: What next?
OCR traditionally has been used to read text data — not exactly Excel’s bread-and-butter, although that may be changing as tools like sentiment analysis become available. However, a variety of number-driven applications do exist — perhaps, for example, you’ve got a financial statement you want to analyze but it only exist as a printed record, smartphone photo, or so forth. And while not a novel technology to Excel, having OCR right in Excel is as convenient as it gets.
I hope you can use this feature to save hours of busywork — but remember! AI isn’t perfect, and it’s very likely the OCR will produce false positives and false negatives. How to respond depends on the circumstances of your project. Think through the AI first, then use it and you’ll be in great shape to shine.
Have you used Excel’s OCR features, and for what purpose? How has the distinction between false positives and false negatives mattered in your work? Let me know in the comments.
Jon Peltier
Maybe I’m just grumpy, but I’m not as enthusiastic as you about Excel’s OCR capabilities. I’ve done experiments with screenshots of Excel worksheet data, and while most of the characters are successfully recognized* (which is good), the layout is usually wrong. The algorithms do not seem to use non-character visual cues (gridlines, row and column shading, text color, etc.) to help parse content into rows and columns. I would have expected better.
*Excel’s character recognition has improved. When this feature first was released, I found many of the free online OCR sites were much better than Excel, but Excel has caught up in that regard. The free sites have no row/column parsing capabilities, meaning there is even more cleanup required than for Excel’s imports.