In a previous post, we looked at building decision trees in Excel with the help of Copilot and Python, using IBM’s employee attrition dataset.
Decision trees are great for visualizing patterns and making sense of complex datasets. But single decision trees have limitations: grown deep enough, they can easily overfit, capturing noise instead of real patterns. What if there were a way to maintain the intuitive appeal of decision trees while enhancing accuracy and reliability? That’s where random forests come in.
Random forests use a collection (or “forest”) of decision trees to improve prediction accuracy. Each tree votes on the predicted outcome, and the result with the most votes becomes the final prediction. Let’s revisit our attrition dataset and see how random forests can offer deeper insights, all within the comfort of Excel and Python, powered by Copilot. You can follow along with the download file below.
Model explanation
We’ll start our analysis with this prompt:
“Briefly explain what a random forest is and why it might improve our attrition prediction compared to a single decision tree.”
Beginning here helps establish a clear understanding of the tool we’re about to use and why it could offer a step up from the single decision tree model we built previously. Copilot neatly summarizes this benefit, highlighting how random forests reduce errors and guard against overfitting, making them especially useful for complex predictions like employee attrition.
Building the random forest model
We continue our analysis with this prompt:
“Build a random forest model. Identify which factors the model finds most predictive of attrition and explain why these features matter.”

Copilot’s output clearly identifies important features representing aspects of an employee’s workplace experience: compensation, workload and overtime demands, and more. Each factor significantly affects employee satisfaction and their ultimate decision to remain at or exit the company.
You’ll notice these findings align closely with what the single decision tree highlighted previously, as both approaches capture many of the same critical predictors. The random forest differs in one important way, however: it draws on many decision trees rather than just one. Because each tree in the forest sees a slightly different subset of the rows and features, the predictors it surfaces reflect broad, robust patterns rather than the quirks or biases of any individual tree.
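The exact code Copilot writes will vary, but a minimal sketch of this step, assuming scikit-learn is available and the dataset has already been loaded into a DataFrame named df (for example, via Python in Excel’s xl() function), might look like this:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumes the attrition data is already in a DataFrame named df,
# e.g. loaded from the worksheet with Python in Excel's xl() function
X = pd.get_dummies(df.drop(columns=["Attrition"]))  # one-hot encode text columns
y = (df["Attrition"] == "Yes").astype(int)          # 1 = employee left

# 100 trees, each fit on a bootstrap sample with a random subset of features
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Importance = how much each feature reduces impurity, averaged over all trees
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```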
Train/test split
The next step is to ensure our random forest model doesn’t just describe historical data. It must also predict future outcomes accurately. By splitting our dataset into two groups (80% training data and 20% test data), we mimic real-world scenarios where the model faces completely new, unseen employee data.
“Split the data into training and testing sets (80/20). Train your random forest on the training set and evaluate its predictive accuracy on new data. Briefly explain why this validation step matters for business decisions.”

Copilot’s output indicates that our random forest model correctly predicted employee attrition about 88% of the time on the test set. This strong performance suggests our model has effectively identified generalizable patterns, giving HR greater confidence in applying these predictions to future decision-making.
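Behind the scenes, this validation step amounts to a few lines of scikit-learn. Here is a minimal sketch that reuses the X and y variables from the previous step; the 80/20 ratio comes from the prompt, while details like the random seed are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out 20% of employees as unseen test data; stratify so both
# splits keep a similar attrition rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score the model on employees it has never seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.0%}")
```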
Comparatively, a single decision tree might perform slightly differently at this step. Left unconstrained, a single tree can grow deep enough to memorize its training data, a problem known as “overfitting”: it may be exceptionally accurate on the training set but less reliable when encountering new data. A single tree might therefore show similar test accuracy here (perhaps slightly lower), but its predictions would be less stable and more sensitive to minor fluctuations in the dataset.
The random forest’s advantage comes from combining many trees, each contributing slightly different insights, to produce stable, robust predictions. Therefore, while validation is essential for both methods, random forests typically yield more trustworthy results, making them especially valuable for critical HR decisions around retention and resource allocation.
Visualizing model results
After evaluating the accuracy of our random forest model, we want a clear, intuitive way to see which specific factors most strongly influence employee attrition across the entire dataset. To accomplish this, let’s use the following prompt:
“Create a bar chart visualizing the feature importance scores generated by our random forest model. Explain how HR can interpret this visualization to prioritize retention efforts.”

The resulting bar chart clearly ranks features like MonthlyIncome, OverTime, and Age as most influential. For HR, this makes it easy to prioritize efforts: focus on competitive salaries, manage overtime demands, and support career growth for younger staff.
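The plotting code Copilot generates for a chart like this is typically a few lines of matplotlib. A minimal sketch, reusing the importances series computed earlier:

```python
import matplotlib.pyplot as plt

# Show the ten most influential features, largest bar at the top
top = importances.sort_values(ascending=False).head(10)
ax = top.sort_values().plot.barh(figsize=(8, 5), color="steelblue")
ax.set_xlabel("Feature importance score")
ax.set_title("Top drivers of employee attrition (random forest)")
plt.tight_layout()
plt.show()
```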
But why not visualize this as a tree diagram like we did earlier? Unlike a single decision tree, a random forest is made up of many trees, often hundreds, which makes visualizing each impractical and overwhelming. Instead, random forests summarize results as feature importance scores. While we lose some visual simplicity, we gain accuracy and stability, making random forests ideal for confidently guiding HR retention strategies.
Deriving business recommendations
Finally, we’ll prompt Copilot to translate the insights we gained from the random forest analysis into clear, actionable recommendations:
“Based on this random forest analysis, suggest three clear actions HR could take to reduce employee attrition.”
These recommendations directly connect predictive insights from the random forest analysis to practical actions HR can implement to improve employee retention.
Conclusion
Random forests offer substantial strengths over single decision trees by providing more accurate, reliable, and stable predictions for complex analytical challenges. By combining insights from multiple trees, random forests effectively guard against overfitting, making them particularly useful for data analysts working with intricate, real-world datasets.
However, they do have limitations. Random forests sacrifice the intuitive, easy-to-follow visual representation of single decision trees. This loss in interpretability means analysts must rely more heavily on summarized outputs, such as feature importance scores, rather than clear visual narratives.
Looking ahead, Excel data analysts have opportunities to refine these analyses further. One natural next step is to explore additional techniques, such as logistic regression or neural networks, to compare and validate findings. Regularly retraining models on fresh data also helps keep predictions relevant and accurate.
In summary, while no single analytical method is perfect, random forests strike a strong balance between predictive performance and ease of use, giving Excel data analysts actionable insights for informed decision-making.
