How to create training and testing data samples in Excel

A common practice in machine learning and predictive analytics is to split a dataset into training and testing subsets. The training subset is used to build the actual model, while the testing is used to check its overall performance on new data.

Generally R or Python provide a richer environment than Excel for conducting machine learning, but as I write in Advancing into Analytics, there can be real gains to using spreadsheets as a prototyping and learning tool.

In that spirit, let’s practice splitting a housing prices dataset in Excel.

Download the exercise file here

Creating a split column

Many methods exist for splitting a dataset, but the most basic is an entirely random shuffle. We can do that with the help of Excel’s RAND() function. Go ahead and create this column in your housing table. Important: be sure to copy-paste values once you’ve run this function to avoid later issues with splitting the datasets.

Creating the split

Typical train-test splits include 70/30 and 75/25. To avoid hard-coding any of these numbers in the workbook, create a split worksheet where the user can key in any ratio. The training ratio will be used to drive later calculations, so it’s a good idea to name this cell split:

Next, we’ll use dynamic arrays to create the training and testing subsets.

Creating the training and testing sets

Create a new worksheet training in your workbook. We’ll put the 70% of training observations here. Ideally we should create a copy of these values while leaving the original intact. The FILTER() function lets us do exactly that.

We are going to filter the housing dataset to include only the records where RAND() is less than or equal to .7. Because RAND() is a uniform or evenly distributed number generator, we should expect about 70% of observations to fall in this range.

Here’s the formula:

=FILTER(housing, housing[rand]<=split)

The FILTER() function does not carry through the column headers from the refernce dataset, so go ahead and add those with the formula:

=housing[#Headers]

Looking good! We’ll double-count our observations soon… for now, do the same with a testing worksheet except use the inverse formula:

=FILTER(housing, housing[rand]>split)

Checking our work

Great work splitting your data into training and testing subsets using Excel! One downside to this approach is that because the numbers generated by RAND() will be different than yours, there will be variations in what’s seen on your workbook. This can be fixed in programs like R and Python by setting the random seed.

We’ll overlook it for now and simply double-check that the number of rows in the training and testing datasets are approximately 70/30. To do that, I used the COUNT() function to tally up how many ID numbers appear in column A of each worksheet. (Keep in mind that COUNT() does not include text cells, so the column headers are excluded.)

It looks indeed as if the data has been approximately been split 70/30. It’s not perfect for the simple reason that 546 isn’t perfectly divisible by these numbers!

Check work in train and test datasets in Excel

Not perfect, but a start

The FILTER() function makes splitting an Excel dataset relatively straightforward. It’s not the most elegant solution, but it beats what would have come before dynamic arrays. From here, you can build and evaluate models using tools like the XLMiner add-in, right from Excel.

Have you needed to randomly split your dataset in Excel for machine learning or other reasons? How did you do it? What questions do you have about machine learning in Excel? Let me know in the comments.

How to create training and testing data samples in Excel

Creating a split column

Creating the split

Creating the training and testing sets

Checking our work

Not perfect, but a start

Like this:

Related

Newsletter signup

Thank you for signing up. I look forward to becoming savvier about data with you.

Leave a ReplyCancel reply

Creating a split column

Creating the split

Creating the training and testing sets

Checking our work

Not perfect, but a start

Share this:

Like this:

Related

Newsletter signup

Thank you for signing up. I look forward to becoming savvier about data with you.

Reader Interactions

Leave a ReplyCancel reply