Many data professionals overlook descriptive statistics, thinking their datasets aren’t suited to it or that it doesn’t add significant value. But descriptive statistics are foundational: they allow you to understand data distributions, spot anomalies, and identify patterns that directly inform business decisions.
Consider these real-world examples:
- Inventory management: Imagine a retail company using descriptive statistics to analyze inventory data. They quickly spot products with unusually high variability in sales, enabling them to adjust inventory levels proactively. This approach significantly reduces stockouts and boosts sales revenue by nearly 10%.
- Customer service: Consider a call center examining descriptive statistics from call duration data. They notice a subset of calls consistently lasting longer than average. Further analysis uncovers training gaps among certain staff members. By addressing these gaps through targeted training, the call center successfully reduces average call durations and increases overall customer satisfaction scores.
- Marketing: Picture an e-commerce business analyzing customer demographics and purchasing behaviors through descriptive statistics. They identify specific segments showing higher purchase frequency and average order values. Armed with this insight, they optimize targeted advertising campaigns, driving conversion rates up by 15%.
For these use cases and many more, I urge my learners and clients not to overlook descriptive statistics. No matter what the ultimate purpose of your dataset is, descriptive statistics should become a regular habit when sizing it up.
Unfortunately, performing descriptive statistics in Excel historically meant dealing with cumbersome manual formulas or the limited functionality of the Analysis ToolPak. But now, Python in Excel, particularly the Pandas package, changes the game completely. Descriptive analysis is easier, faster, and far more powerful.
The goal of this post is to show you how. You can follow along with the exercise file below:
By default, Python gives us summary statistics for all numeric columns automatically—no manual pointing and clicking like with Excel’s Analysis ToolPak. Notice, though, that it also includes the transaction ID, which doesn’t make much sense to analyze statistically.
sales_df = xl("sales[#All]", headers=True)
sales_df.describe()

One option here is to cast the transaction_id column to a string, preventing it from being analyzed quantitatively. Alternatively, we could explicitly specify the column we want to analyze: in this case, sales. Notice how much more flexible and interactive this approach is compared to using Excel’s ToolPak?
sales_df['sales'].describe()
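The first option mentioned above, casting the ID column to a string so that describe() skips it, can be sketched as follows. Since xl() only runs inside Excel, this sketch uses a small hypothetical DataFrame in place of the workbook's sales table:

```python
import pandas as pd

# Hypothetical sample standing in for the Excel "sales" table
sales_df = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003, 1004],
    "sales": [250.0, 310.5, 199.9, 420.0],
})

# Cast transaction_id to string so describe() no longer treats it as numeric
sales_df["transaction_id"] = sales_df["transaction_id"].astype(str)

summary = sales_df.describe()
print(summary)  # only the "sales" column is summarized
```

In Python in Excel, you would apply the same astype(str) call to the DataFrame returned by xl() before calling describe().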

Next we’ll use Pandas to get quick summary statistics for categorical rather than numeric data:
sales_df[['category', 'region']].describe()
We see the total number of entries (count), the number of distinct values (unique), the most common entry (top), and how often it appears (freq). For example, the most frequent category is “Electronics,” occurring 24 times out of 100, while “West” is the top region, appearing 32 times.
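To see how those four rows come out of describe() on categorical columns, here is a minimal, self-contained sketch with made-up data (the counts differ from the exercise file's 100-row table):

```python
import pandas as pd

# Hypothetical sample standing in for the Excel "sales" table
sales_df = pd.DataFrame({
    "category": ["Electronics", "Toys", "Electronics", "Electronics", "Toys"],
    "region":   ["West", "East", "West", "South", "West"],
})

# describe() on object (text) columns reports count, unique, top, and freq
cat_summary = sales_df[["category", "region"]].describe()
print(cat_summary)
```

Here "Electronics" is the top category (freq 3 of 5) and "West" is the top region, mirroring the structure of the output described above.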

Next, let’s generate summary statistics grouped by specific categories, a capability that’s not easily available in Excel’s Analysis ToolPak. Here, we can quickly compare average transaction amounts across regions and categories.
sales_df.groupby(['region', 'category'])['sales'].describe()

We might even extend this analysis by visually highlighting top and bottom performers using native Excel features like conditional formatting. Remember, it’s often not a choice between Python or Excel, but rather a powerful combination of both!
Next we’ll use Pandas to calculate custom percentile breakdowns for our sales data, specifically at the 10th, 25th, 50th (median), 75th, and 90th percentiles.
sales_df['sales'].quantile([0.1, 0.25, 0.5, 0.75, 0.9])

With Pandas, we have the flexibility to instantly specify any set of percentiles we want, providing highly tailored insights without the tedious manual setup typically needed in base Excel.
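As an aside, describe() itself accepts a percentiles argument, so you can fold custom percentiles into the full summary in one call. A quick sketch with hypothetical sales values:

```python
import pandas as pd

# Hypothetical sales values
sales = pd.Series([120, 180, 240, 310, 95, 400, 260, 175], name="sales")

# describe() can report any percentiles you ask for alongside
# count, mean, std, min, and max
pcts = sales.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
print(pcts)
```

This gives you the tailored breakdown and the standard summary statistics in a single output.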
Finally, we’ll identify the top N categories by average sales using a technique called method chaining, which allows us to apply multiple data transformations in a clean, step-by-step sequence.
sales_df.groupby('category')['sales'].mean().nlargest(4)

Method chaining often makes our analysis more readable compared to the deeply nested functions traditionally used in Excel formulas (though, to be fair, recent Excel additions like LET and LAMBDA, as well as PivotTables and dynamic arrays, have made this type of analysis easier).
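To make the chaining style concrete, here is a self-contained sketch with hypothetical data that writes each step on its own line, which is a common way to format longer chains:

```python
import pandas as pd

# Hypothetical sample standing in for the Excel "sales" table
sales_df = pd.DataFrame({
    "category": ["Electronics", "Toys", "Electronics", "Clothing", "Toys", "Clothing"],
    "sales": [300.0, 120.0, 450.0, 200.0, 80.0, 260.0],
})

# Each transformation reads top to bottom: group, average, rank, round
top = (
    sales_df
    .groupby("category")["sales"]
    .mean()
    .nlargest(2)
    .round(2)
)
print(top)
```

Wrapping the chain in parentheses lets you break it across lines, so each step documents itself, unlike a nested Excel formula where the innermost operation is buried in the middle.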
In conclusion, using Python in Excel, particularly through the Pandas package, offers unmatched flexibility for descriptive statistics. What we’ve seen here is just the beginning. Python in Excel not only simplifies exploratory data analysis but seamlessly bridges into confirmatory analyses and predictive analytics, tasks that Excel alone struggles to handle efficiently or at scale.
If you’re eager for more practical, actionable examples like these, check out my Gumroad course, Quick Wins for Python in Excel, designed specifically to help Excel users quickly level up their analytics capabilities.
What questions do you have about descriptive statistics with Pandas or using Python in Excel more generally? Let me know in the comments below.