Don’t listen to the haters. Excel is a fantastic way to learn data analytics and a valued slice of the data analytics stack.
That said, you may need to make a couple of tweaks from Excel’s “default settings” to get the most out of it. Here are two particular features that I find essential for analyzing data in Excel.
But is it tidy?
First off — most anything we say about the merits of Excel for data analysis is made moot by whether the data is “tidy.” The basic idea here is that each value in the data table is a unique combination of variable and observation. If the data’s not tidy, and we try to add these two features to it, then it’s the proverbial lipstick on a pig.
If you’re not familiar with the concept of tidy data, check out this video:
The mpg
dataset
With that in mind, let’s demonstrate on a nicely-tidy dataset of vehicle mileage. You can get the dataset and accompanying solutions at the following link.
Download the exercise file here
1. It should have an index column
When you’re analyzing data, you’re hardly ever keeping it in one place. It’s constant drilling down, rolling up, and all around manipulating.
One Excel tool that is great for this rapid data reshaping is the PivotTable. If you’re not super comfortable with PivotTables, I suggest this YouTube series from Excel Campus:
Like I said, PivotTables are great. But they are all about aggregating and rolling up the data. Sometimes you don’t necessarily want that; let’s look at an example.
Split the housing prices by airco
In many analytics questions we would like to compare observations in one group to another. For example, we might want to analyze the sales prices of homes with an air conditioner and those without. While we could try to sort and filter on the original data to do this, a PivotTable provides a quicker and more flexible way to do this.
Go ahead and insert a PivotTable inside the same worksheet to do this. The result is close but no cigar… the PivotTable aggregated all the records of each category, when we would have wanted to keep them separate.
To prevent this, we will add an index column assigning each observation in the dataset a unique observation.
Creating an index column with Flash Fill
The easiest way to do this is with Excel’s Flash Fill feature. Insert a new column at column A named id, fill in the first couple of rows with the numbers 1 and 2, then double-click to do the rest:
Just like with most things in Excel, there are quite a few other ways to create an index number (also known as a serial number). Check out this post from Excel Champs for more:
2. It should be contained as a table
OK, you’ll see that this will fix the earlier problem… but before you see that, delete the current PivotTable; we are going to start from scratch… a tabula rasa, if you will. And how fitting, because our next step is to convert this dataset into a Table. That can be done by clicking on any cell in our range and selecting Ctrl + T
.
Once you’ve done that, head to the Table Name menu toward the left of your screen. (Doesn’t the Table look so nicely formatted, by the way? What a gem.). Give the Table a more meaningful name than Table1
, such as mpg
.
Excel Tables have a lot going for them; we can’t cover all the benefits here, but you can learn more about them here:
In our case, having the data in an Excel table makes it a lot easier to build a PivotTable; all the data in our table is automatically included so we don’t have to worry about errant source ranges.
With that in mind, we can create the PivotTable layout we were initially hoping for:
I rely on these two features
They may seem a little arbitrary right now, but as you put your data through the Excel wringer, shifting it every which way for your desired analysis, you’ll see just how helpful these two features are. In fact, I insist that every Excel dataset in my book Advancing into Analytics follow this convention.
What features do you like to include to your Excel datasets? Are you finding index columns or Tables useful in other contexts? Tell us about it in the comments.
Leave a Reply