Recently, I encountered a task in Excel where I needed to identify the top N values within each group. Generally, I’ve found that working with groups in Excel can be less than user-friendly for many tasks, and I was hopeful that some of the newer dynamic array functions might have improved this aspect.
Unfortunately, I came up short. While there are certainly some solutions out there that aren’t too complex, in cases like this—where I need to combine two or three functions to achieve the desired result—I often find Python to be more readable and concise.
Let’s tackle this using some basic employee sales data by region. I’d like to demonstrate how to calculate total sales by region. You can follow along by downloading the exercise file provided below:
The first step is to load the sales data into Python as sales_df
.
As a precaution, I will next group the data by region and employee, and then sum the resulting columns. This ensures that no employee is double-counted, allowing us to accurately analyze the top five values by region. This results in sales_df_grouped
.
The next line of code is crucial. Here, we group the Sales
column again, but this time only by Region
—indicated by level=0
since Python is zero-based. The group_keys=False
parameter in the groupby
method controls whether the keys used to group the DataFrame are included in the output. Setting group_keys=False
prevents the group keys from being added to the index of the resulting object.
Finally, I will append nlargest(5)
to the code to extract the five largest values per region in this dataset.
Below is a screenshot along with code snippet for your convenience and ease of copy-pasting. In the screenshot I am displaying the Excel values rather than Python object.
That was pretty cool, but can we take it a step further? I don’t see any harm in making this workbook more dynamic. First, I will enable the user to explore either the top or bottom employees by region. Next, I will allow them to define ‘N’. So, whether they want the top 3 employees by region or the bottom 7, it won’t be a problem.
To achieve this, I will set up data validation in Excel, allowing the user to choose between ‘Top’ or ‘Bottom’ and select an integer between 1 and 10. I will then establish a simple lookup table to determine whether nlargest()
or nsmallest()
should be used.
Normally, I would organize this information in an Excel table. However, since I don’t anticipate any changes to the names or options, I’m comfortable handling it without.
Now, onto the Python code! To make this setup dynamic, I will input the user’s choice of N along with their preference for top or bottom.
Next, I will define a function to determine whether to find the largest or smallest values, establish what N is, and identify the relevant dataset. This is where things get complex. I will provide the relevant Python code snippet below, and we’ll continue exploring it.
The dynamic_filter_sales()
function in the provided Python code is designed to filter and return the top or bottom n
sales records based on the choice made by the user. It takes three parameters: a DataFrame df
, an integer n
, and a string choice
.
Initially, the function groups the DataFrame by ‘Region’ and ‘Employee’, summing up the sales for these combinations to create a consolidated view of sales performance across different regions and employees. This grouped data is stored in df_grouped
.
The function then checks if the choice
parameter matches either ‘nlargest’ or ‘nsmallest’, which are used to fetch the top or bottom ‘n’ entries respectively. If a valid choice is provided, the function dynamically accesses the appropriate method (nlargest
or nsmallest
) from the Pandas library using getattr
.
This method is applied on the ‘Sales’ column of the grouped DataFrame, but only within the same region (indicated by group_keys=False
in the groupby
method). This ensures the operation is performed separately for each region. If an invalid choice is given, the function raises a ValueError
to alert the user. The result of this operation—either the top ‘n’ or bottom ‘n’ sales records per region, depending on the user’s choice—is then returned by the function.
If you execute this code and position your input cells as specified, you should see something like the below. This setup will enable you to define your parameters in Excel, allowing the Python code to run automatically and update accordingly.
Keep in mind that while this solution is relatively dynamic, it could be enhanced further. Currently, the function assumes that you are grouping by columns named ‘Region’ and ‘Employee,’ and that there will be one remaining column to sum. However, this may not always be the case. Therefore, a more robust function would allow users to select the columns for grouping and summing. The below function allows just that:
You could even introduce additional user inputs in your workbook to specify how these groupings and aggregations should be configured based on the dataset.
But for now, I’ll pause here and take any questions. What questions do you have about working with Python in Excel? Are there other Excel tasks, particularly with group-wise data, that you are struggling with and looking for a new approach? Do you see the benefits of using Python here, or do you think using regular Excel functions is sufficient? Let me know in the comments.
Additionally, if your corporate finance team is looking to advance their skills with Python, please check out my corporate workshop. Details are provided below.
Leave a Reply