A common practice in machine learning and predictive analytics is to split a dataset into training and testing subsets. The training subset is used to build the actual model, while the testing is used to check its overall performance on new data.
Generally R or Python provide a richer environment than Excel for conducting machine learning, but as I write in Advancing into Analytics, there can be real gains to using spreadsheets as a prototyping and learning tool.
In that spirit, let’s practice splitting a housing prices dataset in Excel.
Creating a split column
Many methods exist for splitting a dataset, but the most basic is an entirely random shuffle. We can do that with the help of Excel’s RAND()
function. Go ahead and create this column in your housing
table. Important: be sure to copy-paste values once you’ve run this function to avoid later issues with splitting the datasets.
Creating the split
Typical train-test splits include 70/30 and 75/25. To avoid hard-coding any of these numbers in the workbook, create a split
worksheet where the user can key in any ratio. The training ratio will be used to drive later calculations, so it’s a good idea to name this cell split
:
Next, we’ll use dynamic arrays to create the training and testing subsets.
Creating the training and testing sets
Create a new worksheet training
in your workbook. We’ll put the 70% of training observations here. Ideally we should create a copy of these values while leaving the original intact. The FILTER()
function lets us do exactly that.
We are going to filter the housing
dataset to include only the records where RAND()
is less than or equal to .7. Because RAND()
is a uniform or evenly distributed number generator, we should expect about 70% of observations to fall in this range.
Here’s the formula:
=FILTER(housing, housing[rand]<=split)
The FILTER()
function does not carry through the column headers from the refernce dataset, so go ahead and add those with the formula:
=housing[#Headers]
Looking good! We’ll double-count our observations soon… for now, do the same with a testing
worksheet except use the inverse formula:
=FILTER(housing, housing[rand]>split)
Checking our work
Great work splitting your data into training and testing subsets using Excel! One downside to this approach is that because the numbers generated by RAND()
will be different than yours, there will be variations in what’s seen on your workbook. This can be fixed in programs like R and Python by setting the random seed.
We’ll overlook it for now and simply double-check that the number of rows in the training and testing datasets are approximately 70/30. To do that, I used the COUNT()
function to tally up how many ID numbers appear in column A of each worksheet. (Keep in mind that COUNT()
does not include text cells, so the column headers are excluded.)
It looks indeed as if the data has been approximately been split 70/30. It’s not perfect for the simple reason that 546 isn’t perfectly divisible by these numbers!
Not perfect, but a start
The FILTER()
function makes splitting an Excel dataset relatively straightforward. It’s not the most elegant solution, but it beats what would have come before dynamic arrays. From here, you can build and evaluate models using tools like the XLMiner add-in, right from Excel.
Have you needed to randomly split your dataset in Excel for machine learning or other reasons? How did you do it? What questions do you have about machine learning in Excel? Let me know in the comments.
Leave a Reply