In a previous post we built and evaluated forecasts in Excel using native Excel functions and features.
We split the data into training and testing sets and used MAE, RMSE, and MAPE to measure accuracy. Doing this by hand helps you see exactly how forecasting works and builds intuition, but Excel’s built-in tools can feel clunky and limited when it comes to consistency and deeper analysis.
In this post we’ll see how Python in Excel and Copilot compare to that manual approach. They’re both more powerful and easier to use, but that also means you need to be more careful about how you interpret what’s happening. We’ll keep using the same synthetic retail sales dataset from the previous post to see how these newer tools perform.
If you’ve never used the Advanced Analysis feature before, check out this post.
Building a Holt-Winters forecast with Copilot
To get started, I asked Copilot to fit a Holt-Winters exponential smoothing model, the same type used by Excel’s FORECAST.ETS function. This method captures both the overall trend in retail sales and the repeating seasonal patterns in the data. I also had Copilot hold out the last 12 months as a test set and calculate the same accuracy metrics as before (MAE, RMSE, and MAPE) so we can make a fair, side-by-side comparison of the models:
“Forecast Retail_Sales. Hold out the last 12 months as a test set, and fit a Holt-Winters exponential smoothing model with additive trend and 12-month seasonality. Forecast those 12 months and calculate MAE, RMSE, and MAPE. Plot the full training dataset together with the 12-month forecasted values for the holdout period, so that the plot shows the entire time series and how well the forecast aligns with the actuals. Explain in plain language how well the model captured trend and seasonality.”
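Under the hood, Copilot generates and runs Python for this. Here’s a minimal sketch of the equivalent statsmodels code, assuming the data sits in a DataFrame named df with a monthly DatetimeIndex and a Retail_Sales column (your sheet layout, and the exact parameters Copilot chooses, may differ):

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hold out the last 12 months as the test set
train, test = df.iloc[:-12], df.iloc[-12:]

# Additive trend and 12-month additive seasonality, per the prompt
model = ExponentialSmoothing(
    train["Retail_Sales"],
    trend="add",
    seasonal="add",
    seasonal_periods=12,
).fit()

forecast = model.forecast(12)
```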

Copilot generated a forecast that follows the overall sales pattern closely. The blue line represents actual sales, while the red dashed line shows the 12-month holdout forecast. The model captures both the upward trend and the repeating seasonal highs and lows, though it slightly overshoots a few peaks near the end.
The three accuracy metrics summarize how closely the forecast matched the real data. A lower value means a more accurate prediction. In this case, the errors are fairly small, suggesting the model performed well. Still, MAE and RMSE are expressed in the units of the data rather than on any absolute scale, so these numbers become most meaningful when compared against another model’s results.
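The metrics are also easy to recompute yourself if you want to double-check Copilot’s numbers. A sketch, reusing test and forecast from the Holt-Winters fit above:

```python
import numpy as np

# Recompute the three error metrics on the 12-month holdout; lower is better
actuals = test["Retail_Sales"].to_numpy()
errors = actuals - forecast.to_numpy()

mae = np.mean(np.abs(errors))                   # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))            # penalizes large misses more
mape = np.mean(np.abs(errors / actuals)) * 100  # average percent error
```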
When you compare these metrics to those from Excel’s FORECAST.ETS function, you will see they are similar but not identical. That is expected because both methods use exponential smoothing, but they rely on different optimization algorithms and parameter settings behind the scenes.
Understanding seasonality through decomposition
At this point, we’ve seen that Copilot can generate a solid forecast and measure how well it matches reality. But accuracy metrics alone don’t tell us why the model performed the way it did. To understand that, we need to look under the hood and separate the time series into its core ingredients: the trend, the seasonality, and the random noise.
To that end, this will be our next prompt:
“Take the same series and run a seasonal decomposition. Show me the trend, the seasonality, and the remainder (random noise). Explain in everyday terms what each piece means for my data.”
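In Python, this decomposition is essentially a one-liner from statsmodels. A minimal sketch, under the same df assumption as before:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition with a 12-month seasonal cycle
result = seasonal_decompose(df["Retail_Sales"], model="additive", period=12)
result.plot()  # four panels: observed, trend, seasonal, residual

# Each component is also available as a series for further analysis
trend, seasonal, residual = result.trend, result.seasonal, result.resid
```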
Copilot’s decomposition splits the retail sales data into four parts: observed, trend, seasonal, and remainder.

The top panel shows the raw data, where we can see both steady growth and repeating yearly cycles. The trend line smooths out those ups and downs to show long-term growth in sales. The seasonal component isolates the recurring peaks and dips that happen around the same time each year. The remainder captures what’s left: the short-term noise that doesn’t follow any pattern.
Together, these pieces confirm that the earlier forecast was capturing real structure in the data rather than random chance. If you want a deeper look at how seasonal decomposition works and what the Python code behind it is doing, check out this post.
Testing an ARIMA/SARIMA model
To dig a little deeper, I asked Copilot to try another common forecasting approach:
“Using the same dataset, difference the series if needed to make it stationary, then fit an ARIMA or SARIMA model on the training set. Forecast the last 12 months and compute MAE, RMSE, and MAPE. Plot forecast vs actuals, and explain in simple language what differencing does and why it matters.”
ARIMA (AutoRegressive Integrated Moving Average) models time series by relating each value to its past values and recent errors. SARIMA adds a seasonal component, making it useful for data that repeats in cycles, like monthly or quarterly sales. These models work from the bottom up, looking at how today’s value depends on yesterday’s, rather than smoothing trends from the top down as Holt-Winters does.
They’re a good next step when you want more flexibility or when seasonality shifts over time, since they explicitly model both short-term memory and longer seasonal patterns.
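For reference, here’s a sketch of what a SARIMA fit looks like in statsmodels. The (1, 1, 1)×(1, 1, 1, 12) order is an illustrative assumption; Copilot may choose different values when it searches for a good fit:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# order=(p, d, q): d=1 differences the series once to remove the trend.
# seasonal_order=(P, D, Q, s): D=1 at s=12 removes the repeating yearly cycle.
sarima = SARIMAX(
    train["Retail_Sales"],
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
).fit(disp=False)

sarima_forecast = sarima.forecast(12)
```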
This prompt results in a SARIMA model that focuses on patterns and cycles rather than long-term growth. Differencing removes the steady upward trend in sales so the model can focus on seasonal changes. In simple terms, it tells the model to ignore the overall climb and pay attention to how sales rise and fall around that trend.
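Differencing is easy to see in pandas: each value is replaced by its change from the previous observation, or from the same month a year earlier for the seasonal version:

```python
# First difference: month-over-month change, which removes the steady climb
diffed = df["Retail_Sales"].diff().dropna()

# Seasonal difference: change versus the same month last year
seasonal_diffed = df["Retail_Sales"].diff(12).dropna()
```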

The forecast looks solid, following the same repeating pattern seen throughout this dataset. But the accuracy metrics (MAE ≈ 45, RMSE ≈ 50, MAPE ≈ 2.3%) are slightly worse than the earlier Holt-Winters results. That’s likely because differencing stripped out some of the long-term growth information that helped the exponential smoothing model perform better.
This example shows that more advanced models are not always more accurate. Differencing can make forecasts more stable, but it can also flatten genuine upward momentum. It’s a good reminder that every model involves tradeoffs.
Comparing multiple models
To wrap up the comparison, I asked Copilot to bring all the models together:
“Compare linear, ETS, Holt-Winters, and ARIMA forecasts on the same 12-month holdout. Show all error metrics in a table and plot the forecasts vs actuals in one chart. Then summarize which method tracked best and explain in plain English why.”
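Building that comparison table yourself is a useful sanity check. A sketch, where linear_fc, ets_fc, hw_fc, and sarima_fc are hypothetical names standing in for the four holdout forecasts, however you captured them:

```python
import numpy as np
import pandas as pd

def score(actuals, forecast):
    """Return MAE, RMSE, and MAPE for one holdout forecast."""
    a = np.asarray(actuals, dtype=float)
    e = a - np.asarray(forecast, dtype=float)
    return {
        "MAE": np.mean(np.abs(e)),
        "RMSE": np.sqrt(np.mean(e ** 2)),
        "MAPE (%)": np.mean(np.abs(e / a)) * 100,
    }

# One row per model; lower is better in every column
comparison = pd.DataFrame({
    name: score(test["Retail_Sales"], fc)
    for name, fc in [
        ("Linear", linear_fc),
        ("ETS", ets_fc),
        ("Holt-Winters", hw_fc),
        ("SARIMA", sarima_fc),
    ]
}).T
```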
The resulting table lists each model’s accuracy metrics side by side. As before, lower numbers mean smaller forecast errors. The Holt-Winters model came out on top, showing the lowest MAE, RMSE, and MAPE across the board.
The chart below helps visualize why. The blue Holt-Winters line follows the black actuals most closely, keeping pace with both the upward trend and the seasonal swings. The ETS forecast does reasonably well but flattens out the seasonality, while the linear model misses both the ups and downs entirely. The ARIMA forecast tracks the rhythm of the data but slightly underplays the growth trend.

In plain terms, Holt-Winters wins here because it balances both trend and seasonality, which are strong features of this dataset. Simpler models miss these repeating patterns, and more complex ones can overcorrect by stripping out trend information. The best results often come from the model that fits the structure of the data, not necessarily the most advanced one.
Conclusion
Working with Copilot and Python in Excel shows how far forecasting in Excel has come. Instead of manually building training sets and formula chains, you can now prompt Copilot to fit models like Holt-Winters or ARIMA, generate forecasts, and calculate accuracy metrics such as MAE, RMSE, and MAPE automatically. That speed makes it easier to experiment, compare models, and focus on interpretation rather than setup.
At the same time, automation does not remove the need for judgment. The results here showed that the Holt-Winters model performed best for this dataset, while the more advanced SARIMA model actually did worse after differencing removed useful trend information. This kind of nuance is easy to miss if you rely on Copilot’s summaries without checking the math or understanding what the model is doing.
Python in Excel adds more transparency and control since you can inspect parameters, tweak assumptions, and visualize decompositions directly. It does, however, require a bit more technical understanding. The most effective workflow for now is a hybrid one: use Copilot to build quickly, then verify and refine with Python when you need more precision and clarity.
