Python in Excel: How to do principal component analysis

In a previous post, I showed you how to use Copilot in Excel to perform principal component analysis, or PCA. This time, we’re going to strip things down, dial the AI back a bit, and work through the analysis using our own Python in Excel code.

Copilot in Excel: How to do principal component analysis with Python

PCA is useful when you have several related variables that may be telling you overlapping versions of the same story. Rather than analyzing every variable separately, PCA looks for broader patterns in the data and condenses them into a smaller number of components.

For this example, we’ll work with five years of simulated monthly returns for ten familiar technology stocks, including Apple, Microsoft, Nvidia, Amazon, Meta, and Google. Although each stock has its own return series, technology stocks often move together because they are influenced by many of the same market and industry forces. Download the exercise file below to follow along:

Download the exercise file here

Our goal is to see whether we can reduce these ten columns of returns into a smaller number of underlying factors while still preserving most of the variation in the original dataset. Along the way, we’ll look at how much information each principal component captures and begin interpreting what those components may represent.

The values are simulated, so this is not intended as an investment analysis. The familiar ticker symbols simply make the PCA results easier to understand and discuss.

Preparing the data

First, import the packages we’ll need and bring the Excel table into Python:

from sklearn.decomposition import PCA
import statsmodels.api as sm

returns_df = xl("MonthlyReturns[#All]", headers=True)

Next, we’ll make a couple of transformations to prepare the data for PCA. First, we convert the Date column to a true datetime data type, which will make it easier to sort, filter, chart, or reuse later.

returns_df["Date"] = pd.to_datetime(returns_df["Date"])

stock_returns_df = returns_df.drop(columns="Date")

Then, we create a new DataFrame containing only the stock return columns. PCA looks for patterns among numeric variables, so the date should not be included in the analysis. It tells us when each observation occurred, but it is not itself a financial measurement.

Creating stock_returns_df lets us preserve the original dataset while giving PCA a clean, numeric input.

Fitting the model

Next, it’s time to actually create the PCA model. For now, we’re simply fitting it to the stock returns; we’ll look at the diagnostics and interpret the results next.

pca = PCA()

pca.fit(stock_returns_df)

Now it’s time to view some diagnostics from the fitted PCA model. To make the results easier to read in Python in Excel, we’ll organize them into a DataFrame:

explained_variance_df = pd.DataFrame({
    "Component": [f"PC{i}" for i in range(1, len(pca.explained_variance_ratio_) + 1)],
    "Explained Variance": pca.explained_variance_ratio_,
    "Cumulative Variance": pca.explained_variance_ratio_.cumsum()
})

explained_variance_df

Evaluating the model

The results show that PC1 alone explains about 66% of the variation in the ten stock returns. That suggests a large portion of their movement is driven by one broad, shared pattern.

Adding PC2 raises the cumulative variance to about 74%, while the first three components capture roughly 81%. By the time we reach five components, we have retained about 90% of the variation in the original dataset.

In other words, PCA suggests that we may be able to summarize ten stock return columns with around three to five components without losing too much information. The later components each add relatively little, with PC10 accounting for only about 1%.

Next, we’ll plot the cumulative explained variance to make the pattern easier to interpret.

plt.plot(
    explained_variance_df["Component"],
    explained_variance_df["Cumulative Variance"],
    marker="o"
)

plt.title("Cumulative Explained Variance by Principal Component")
plt.xlabel("Principal Component")
plt.ylabel("Cumulative Explained Variance")
plt.xticks(rotation=45)

The curve rises quickly across the first few components and then begins to flatten. PC1 captures about 66% of the total variance, while the first three components capture just over 80%.

By PC5, the cumulative explained variance exceeds 90%. Beyond that point, each additional component contributes relatively little, suggesting that three to five components may provide a useful summary of the ten original stock-return variables.

So far, PCA has identified the main patterns in how the stock returns move together. The next step is to express each row of the dataset in terms of those patterns.

factor_array = pca.transform(stock_returns_df)

factor_df = pd.DataFrame(
    factor_array[:, :3],
    columns=["PC1", "PC2", "PC3"]
)

factor_df.head()

Because each row in the original dataset represents one month, each row in factor_df also represents one month. The difference is that the month is now described by its scores on PC1, PC2, and PC3 rather than by ten separate stock returns.

These scores show how strongly each month aligns with the patterns captured by the first three principal components. This gives us a more compact way to represent the original data while retaining just over 80% of its variation.

Using the principal components in a regression model

As a final step, we’ll look at one possible application of the principal components we created.

factor_df.insert(0, "Date", returns_df["Date"])

factor_df.head()

y = stock_returns_df["AAPL"]

X = factor_df[["PC1", "PC2", "PC3"]]
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
summary_df = pd.DataFrame(
    model.summary().tables[1].data[1:],
    columns=model.summary().tables[1].data[0]
)

summary_df

First, we add the dates back to factor_df so each set of component scores remains connected to its original month.

We then use the first three principal components as predictors in a regression model, with Apple’s monthly return as the outcome. In other words, we are asking how much of Apple’s movement can be explained by the three broader patterns PCA identified across all ten stocks.

The output shows a coefficient for each component. A larger coefficient means Apple’s returns are more sensitive to that component, while the p-value helps us evaluate whether the relationship is statistically meaningful. The constant represents the portion of Apple’s average return not associated with the three components.

This is one practical use of PCA: reducing a large collection of correlated variables into a few factors and then using those factors in a simpler regression model. In a financial setting, this can help identify which broader patterns are most closely associated with the return or risk of an individual stock.

Conclusion

n this example, we used Python in Excel to reduce ten related stock-return columns into a smaller number of principal components. The first three components retained just over 80% of the variation in the original dataset, while five components retained more than 90%.

That is the main value of PCA: it provides a more compact representation of a dataset whose variables contain overlapping information. Instead of carrying all ten stock-return columns into every chart or model, we can work with a smaller set of broader patterns.

We also used the first three components in a regression model for Apple’s monthly returns. This shows one possible downstream application, but the example should be treated as illustrative rather than predictive. Because Apple was included in the data used to create the components, those components already contain some information from Apple’s returns. A more rigorous approach would be to exclude Apple when fitting PCA and then test how well the resulting components explain it.

Standardization is another important modeling choice. All of the variables here are monthly returns measured in the same units, so using the raw values is defensible. However, more volatile stocks may still have greater influence because their returns vary more. Standardizing the columns before fitting PCA would give each stock equal variance and place more emphasis on their correlation patterns.

Other useful next steps would include examining the component loadings to see which stocks contribute most strongly to each component, comparing standardized and unstandardized results, and using a train-and-test split to evaluate the model on unseen data.

PCA does not automatically tell us what each component means, and reducing the number of variables always sacrifices some information. Still, it can be a useful exploratory tool when a dataset contains several correlated measurements and you need a simpler view of the patterns underneath them.

Python in Excel: How to do principal component analysis

Preparing the data

Fitting the model

Evaluating the model

Using the principal components in a regression model

Conclusion

Like this:

Related

Want more Excel + AI insights? Join my newsletter.

Thank you for signing up. I look forward to becoming savvier about data with you.

Preparing the data

Fitting the model

Evaluating the model

Using the principal components in a regression model

Conclusion

Share this:

Like this:

Related

Want more Excel + AI insights? Join my newsletter.

Thank you for signing up. I look forward to becoming savvier about data with you.