← Previously: Part 1: Reading and Filtering Your First Real Dataset (14 May)
→ Next: Python for Excel Users, Part 3 (28 May)
What this post covers: replacing SUMIFS, COUNTIFS, and AVERAGEIFS with pandas groupby, running multiple aggregations in one pass, and grouping by more than one column to get something that looks like a pivot table without leaving Python. Three operations, in pandas, on a dataset Excel can't open.
The scenario, because SUMIFS doesn't scale to 2 million rows
Same dataset as last time. sales_2025.csv, 2.3 million rows, columns for order_date, category, region, customer_id, and amount. You filtered it in Part 1. Filtering is the easy half. What you actually want is the answer to "how much did we sell per category, per region, in Q4."
In Excel, this is =SUMIFS(amount, category, "Hardware", region, "South"), repeated for every combination, then assembled into a table by hand. Or a pivot table, which works fine until the source has more than a million rows, at which point it doesn't load. Or Power Query, which works but takes a separate post.
In pandas, the entire thing is one line.
The mental shift is the part worth slowing down for. SUMIFS asks "what's the sum where X equals this specific value?" You write it once per combination. Groupby asks "what's the sum for every value of X, all at once?" You write it once, total.
Groupby is SUMIFS turned inside out
Here's the move. Instead of telling pandas which filter to apply, you tell it which column to split on. Pandas then runs your aggregation against every group it finds.
import pandas as pd
df = pd.read_csv("sales_2025.csv", parse_dates=["order_date"])
df.groupby("category")["amount"].sum()Read that as: split df into groups, one per unique value in the category column, then sum the amount column within each group.
The output is a Series indexed by category, with one row per unique category. You don't write a SUMIFS per category. You don't even need to know the categories ahead of time. Pandas finds them.
category
Hardware 4152330.55
Office 1028741.20
Services 9887412.80
Software 3318990.65
Name: amount, dtype: float64
That's the result of forty-something SUMIFS, written once.
Multiple aggregations in one pass
.sum() is fine for one number, but the analytical question is rarely just one number. You want sum, count, average, and maybe the maximum, in the same table.
The method for this is .agg, and it takes a list.
df.groupby("category")["amount"].agg(["sum", "count", "mean"])
The output is a DataFrame with one row per category and one column per aggregation.
sum count mean category
Hardware 4152330.55 18402 225.59
Office 1028741.20 14911 68.99
Services 9887412.80 22087 447.65
Software 3318990.65 16554 200.50That's a SUMIFS, a COUNTIFS, and an AVERAGEIFS, run in one operation, with the output already in the shape you'd want to paste into a report.
.agg will take any function pandas knows about ("min", "max", "median", "std", "nunique"). It will also take your own function if you write one. The Excel equivalent of "your own function" is "write a LAMBDA, copy it down the column, hope for the best." Pandas treats it as a normal argument.
Naming the output columns so the result is readable
The default column names from .agg are the function names: sum, count, mean. Fine for inspection, terrible for a report someone else is going to read. The cleaner pattern uses the named-aggregation form of .agg.
df.groupby("category").agg(
total_revenue=("amount", "sum"),
order_count=("amount", "count"),
avg_order_value=("amount", "mean"),
)
Read each line as: name the output column on the left, point at the source column and aggregation on the right. The result is a DataFrame whose columns are already labelled the way you want them in the deliverable.
total_revenue order_count avg_order_value
category
Hardware 4152330.55 18402 225.59
Office 1028741.20 14911 68.99
Services 9887412.80 22087 447.65
Software 3318990.65 16554 200.50This is the form I reach for nine times out of ten. The body of the code reads like a description of the output, which is what the next person to look at the script (often you, in three months) will appreciate.
Grouping by more than one column is the pivot table
Pass a list of columns to groupby and pandas splits by every combination it finds.
df.groupby(["category", "region"]).agg(
total_revenue=("amount", "sum"),
order_count=("amount", "count"),
)The result has a hierarchical index: one row per (category, region) pair, columns for revenue and count.
total_revenue order_count
category region
Hardware East 1018220.10 4521
North 992110.40 4380
South 1018403.75 4612
West 1123596.30 4889
Office East 248120.50 3501
North 255842.10 3640
...If you want it shaped more like a traditional pivot table, regions across the top, categories down the side, use .unstack().
df.groupby(["category", "region"])["amount"].sum().unstack("region")
region East North South West
category
Hardware 1018220.10 992110.40 1018403.75 1123596.30
Office 248120.50 255842.10 261201.90 263576.70
Services 2412889.40 2455130.20 2498712.60 2520680.60
Software 820100.10 829711.50 833111.20 836067.85That's a pivot table, generated in two lines, against a dataset Excel can't load. The pivot is also a regular DataFrame, so you can sort it, filter it, sum the columns, or write it to CSV. The Excel pivot table is its own thing that needs a special menu to do any of that.
Where this breaks the first time
Three issues hit everyone the first few times, and the error messages are not always helpful.
The grouping column has nulls and they vanish.
By default, pandas drops rows where the groupby column is null. If your region column has 200 rows of missing data, the totals will be 200 rows short of the unfiltered sum, with no warning. The fix is df.groupby("region", dropna=False), which keeps a row for the nulls. Find out which way you want it before the report goes out, not after.
You grouped, you aggregated, and the result is still a DataFrame you can't merge back in.
The output of groupby().agg() has the grouping columns as the index, not as columns. If you want them back as columns, chain .reset_index() on the end. Most join and merge operations expect the key as a column, so this is the line that lets the result behave like a regular table again.
The aggregation runs without errors but the numbers look wrong.
Almost always because the amount column is being read as a string. Pandas will happily "sum" strings by concatenating them, and the answer looks plausible until you spot that "100" + "50" is "10050". Check df.dtypes if a total looks suspicious. If amount says object instead of float64, fix it on read: pd.read_csv(..., dtype={"amount": "float64"}).
Where the series goes from here
Two of three down. Grouping is the engine. The third part of the series turns it into something that travels.
Part 3 (28 May): joining datasets. VLOOKUP and XLOOKUP, but for files that don't fit in memory and joins that don't break when the source range changes.
For the broader question of when Power BI's performance problems live in the model rather than the DAX, and why that's the same lesson as this one, see the 20 May Hot Take on data model performance.
Frequently Asked Questions
Is groupby faster than a pivot table in Excel?
For datasets Excel can open, often comparable. For datasets Excel can't open, the comparison stops being interesting. On a 2 million row CSV, the groupby above runs in under a second on a normal laptop. The pivot table doesn't run at all.
What if I need a percentage of total, not just a sum?
Run the groupby for the sum, then divide by the grand total: result["pct"] = result["total_revenue"] / result["total_revenue"].sum(). There's also a transform method on groupby that handles per-group ratios, which is the right answer when you want "this row's share of its own group's total" rather than the overall.
Can I use groupby with a date column to get monthly totals?
Yes, but it's cleaner with .resample for date frequencies: df.set_index("order_date").resample("M")["amount"].sum() gives monthly totals. Groupby works too if you build a month column first (df["month"] = df["order_date"].dt.to_period("M")), but resample is the idiom built for date arithmetic.
What's the equivalent of GETPIVOTDATA?
You don't need one. The pivot result is a regular DataFrame, so you reach into it with result.loc["Hardware", "South"] and get the value directly. No special function, no broken references when the pivot rebuilds.
Does this work on Excel files, or only CSVs?
Both. Swap pd.read_csv for pd.read_excel and the rest of the script is identical. Excel files are slower to parse than CSVs, so for the 2 million row case, you'll want the CSV regardless. For a 50,000 row workbook, either works fine.
What if I want the unaggregated rows back, not the summary?
That's a filter, not a group. Use the Part 1 syntax. Groupby is for "describe each group with one number." If you need every row, you never left filtering.