As analysts, the ability to summarize and aggregate data on the fly is an essential skill. Traditionally, PivotTables have been a popular method for this purpose. However, a new, function-driven approach has emerged. In this post, we’ll explore the basics of the GROUPBY()
function for creating dynamic data summaries. You can follow along using the exercise file provided below:
What is a PivotTable, anyway?
You might not have considered this, but essentially every significant PivotTable in Excel includes at least two elements. First, there’s a category or dimension that groups the data. Although it’s possible to filter by this category, we’ll simplify the discussion by excluding it here. Consider the category
column in the example dataset. If you were to create a new PivotTable from this data and select it in the field list, it would automatically be added to the rows. While you could further manipulate this, we’ll stick with the default behavior for now:
Next, we need a measure or quantitative value, something that allows us to perform mathematical operations. We’ll focus on the sales column in the provided dataset. Go ahead and add this to your PivotTable; by default, it will be placed in the Values area of the PivotTable. While further manipulation is possible, we’ll stick to the basics for now.
Now, when you add sales to the PivotTable section, you face anotherdecision (apart from formatting the numbers!): deciding how to aggregate this data. Typically, the default aggregation is Sum, but if you double-click on Sum of sales
, you’ll access the Value Field Settings menu. From there, you can change the aggregation to average, minimum, maximum, and so forth.
So, to create these basic PivotTables, we did three things:
- Group the data by a category along the rows.
- Summarize by a value.
- Decide how to summarize that value.
With these steps in mind, you’re well-prepared to understand the GROUPBY() function.
GROUPBY()
function syntax
The syntax of the GROUPBY()
function includes seven parameters in total, with four being optional and three mandatory:
GROUPBY(row_fields, values, function, [field_headers], [total_depth], [sort_order], [filter_array])
While the presence of seven parameters might seem daunting, and we aren’t going to delve into each one here (you can refer to the Microsoft documentation for detailed explanations), the good news is that the three required parameters are exactly what you’ve just utilized in your last PivotTable.
You need row_fields
to group by, values
to aggregate, and you must specify a function
. Reflecting on our previous PivotTable, here are the corresponding components:
OK, let’s see if we can recreate this in Excel. When selecting columns for the GROUPBY()
function, I suggest using the #All
structured reference. This approach ensures that both the values and the labels are included in the data, making these labels available to the function in case you want to label your GROUPBY()
results later:
=GROUPBY(sales[[#All],[category]], sales[[#All],[sales]], SUM)
OK, I’d actually like to include the headers here! Let’s adjust the relevant fourth parameter. I will set it to 3, indicating that my source data has headers and I want to display them in the results:
OK, I think we’re getting the hang of this! From here, it’s a bit of plug and play. Experiment with it yourself and get creative with some interesting business questions you might have. For instance, if you’re interested in understanding whether there’s any age variation by country, we can easily find that out:
=GROUPBY(sales[[#All],[country]], sales[[#All],[customer_age]], STDEV.S, 3)
Adding multiple categories and values
Now that you’re getting comfortable with the basic syntax, let’s delve into something more complex by adding multiple categories and values to the results. This task is simplified if the categories and values you want to include are adjacent to each other in the source table. If they are not, you can select individual columns using the CHOOSECOLS()
function, but for now, we’ll stick to a basic example. In this case, I want to group the data by both category
and product_name
, and also obtain the sum of both quantity
and sales
:
You are welcome to validate these results by creating an equivalent PivotTable. This can be a really useful debugging tool if you’re unsure about how to handle an example.
Adding multiple aggregations
Last but not least, what if you want to analyze the same metric but with different aggregations? This requires a bit of ingenuity. For instance, if you want to calculate both the sum and average of sales in a PivotTable, you would drag and drop that value twice into the PivotTable. We’ll do something similar here. Using structured references, I’ll indicate that we want the field twice.
Now that we have two values to aggregate, I’ll use the HSTACK()
function to horizontally stack them across the columns of the results:
=GROUPBY(sales[[#All],[category]:[product_name]], sales[[#All],[sales]:[sales]], HSTACK(SUM, AVERAGE), 3)
Conclusion
In summary, the GROUPBY()
function is an exciting addition, leveraging the power of dynamic arrays in Excel. What makes it particularly impressive is that these formulas update automatically—there’s no need to manually refresh like with a regular PivotTable. This capability allows you to create highly interactive reports.
However, if you want to display a category across the columns as well as the rows, you will need to use the PIVOTBY()
function. I plan to explore this function in a future article, but in the meantime, you can refer to the Microsoft documentation for more information. Additionally, if you’re still familiarizing yourself with how dynamic arrays function, consider checking out my book, Modern Data Analytics in Excel.
Leave a Reply