Python Groupby for Excel SUMIFS Replacement

Q: Can I use groupby with a date column to get monthly totals?

Yes, but it's cleaner with .resample for date frequencies: df.set_index("order_date").resample("M")["amount"].sum() gives monthly totals. Groupby works too if you build a month column first (df["month"] = df["order_date"].dt.to_period("M")), but resample is the idiom built for date arithmetic.

Q: Does this work on Excel files, or only CSVs?

Both. Swap pd.read_csv for pd.read_excel and the rest of the script is identical. Excel files are slower to parse than CSVs, so for the 2 million row case, you'll want the CSV regardless. For a 50,000 row workbook, either works fine.

← Previously: Part 1: Reading and Filtering Your First Real Dataset (14 May)

→ Next: Python for Excel Users, Part 3 (28 May)

What this post covers: replacing SUMIFS, COUNTIFS, and AVERAGEIFS with pandas groupby, running multiple aggregations in one pass, and grouping by more than one column to get something that looks like a pivot table without leaving Python. Three operations, in pandas, on a dataset Excel can't open.

The scenario, because SUMIFS doesn't scale to 2 million rows

Same dataset as last time. sales_2025.csv, 2.3 million rows, columns for order_date, category, region, customer_id, and amount. You filtered it in Part 1. Filtering is the easy half. What you actually want is the answer to "how much did we sell per category, per region, in Q4."

In Excel, this is =SUMIFS(amount, category, "Hardware", region, "South"), repeated for every combination, then assembled into a table by hand. Or a pivot table, which works fine until the source has more than a million rows, at which point it doesn't load. Or Power Query, which works but takes a separate post.

In pandas, the entire thing is one line.

The mental shift is the part worth slowing down for. SUMIFS asks "what's the sum where X equals this specific value?" You write it once per combination. Groupby asks "what's the sum for every value of X, all at once?" You write it once, total.

Groupby is SUMIFS turned inside out

Here's the move. Instead of telling pandas which filter to apply, you tell it which column to split on. Pandas then runs your aggregation against every group it finds.

python

import pandas as pd

df = pd.read_csv("sales_2025.csv", parse_dates=["order_date"])

df.groupby("category")["amount"].sum()

Read that as: split df into groups, one per unique value in the category column, then sum the amount column within each group.

The output is a Series indexed by category, with one row per unique category. You don't write a SUMIFS per category. You don't even need to know the categories ahead of time. Pandas finds them.

plaintext

category
Hardware     4152330.55
Office       1028741.20
Services     9887412.80
Software     3318990.65
Name: amount, dtype: float64

That's the result of forty-something SUMIFS, written once.

Multiple aggregations in one pass

.sum() is fine for one number, but the analytical question is rarely just one number. You want sum, count, average, and maybe the maximum, in the same table.

The method for this is .agg, and it takes a list.

python

df.groupby("category")["amount"].agg(["sum", "count", "mean"])

The output is a DataFrame with one row per category and one column per aggregation.

plaintext

sum count mean category
Hardware    4152330.55   18402      225.59
Office      1028741.20   14911       68.99
Services    9887412.80   22087      447.65
Software    3318990.65   16554      200.50

That's a SUMIFS, a COUNTIFS, and an AVERAGEIFS, run in one operation, with the output already in the shape you'd want to paste into a report.

.agg will take any function pandas knows about ("min", "max", "median", "std", "nunique"). It will also take your own function if you write one. The Excel equivalent of "your own function" is "write a LAMBDA, copy it down the column, hope for the best." Pandas treats it as a normal argument.

Naming the output columns so the result is readable

The default column names from .agg are the function names: sum, count, mean. Fine for inspection, terrible for a report someone else is going to read. The cleaner pattern uses the named-aggregation form of .agg.

python

df.groupby("category").agg(
    total_revenue=("amount", "sum"),
    order_count=("amount", "count"),
    avg_order_value=("amount", "mean"),
)

Read each line as: name the output column on the left, point at the source column and aggregation on the right. The result is a DataFrame whose columns are already labelled the way you want them in the deliverable.

plaintext

total_revenue  order_count  avg_order_value
category
Hardware     4152330.55       18402         225.59
Office       1028741.20       14911          68.99
Services     9887412.80       22087         447.65
Software     3318990.65       16554         200.50

This is the form I reach for nine times out of ten. The body of the code reads like a description of the output, which is what the next person to look at the script (often you, in three months) will appreciate.

Grouping by more than one column is the pivot table

Pass a list of columns to groupby and pandas splits by every combination it finds.

python

df.groupby(["category", "region"]).agg(
    total_revenue=("amount", "sum"),
    order_count=("amount", "count"),
)

The result has a hierarchical index: one row per (category, region) pair, columns for revenue and count.

plaintext

total_revenue  order_count
category   region
Hardware   East             1018220.10        4521
           North             992110.40        4380
           South            1018403.75        4612
           West             1123596.30        4889
Office     East              248120.50        3501
           North             255842.10        3640
           ...

If you want it shaped more like a traditional pivot table, regions across the top, categories down the side, use .unstack().

python

df.groupby(["category", "region"])["amount"].sum().unstack("region")

plaintext

region          East        North        South        West
category
Hardware    1018220.10   992110.40   1018403.75   1123596.30
Office       248120.50   255842.10    261201.90    263576.70
Services    2412889.40  2455130.20   2498712.60   2520680.60
Software     820100.10   829711.50    833111.20    836067.85

That's a pivot table, generated in two lines, against a dataset Excel can't load. The pivot is also a regular DataFrame, so you can sort it, filter it, sum the columns, or write it to CSV. The Excel pivot table is its own thing that needs a special menu to do any of that.

Where this breaks the first time

Three issues hit everyone the first few times, and the error messages are not always helpful.

The grouping column has nulls and they vanish.

By default, pandas drops rows where the groupby column is null. If your region column has 200 rows of missing data, the totals will be 200 rows short of the unfiltered sum, with no warning. The fix is df.groupby("region", dropna=False), which keeps a row for the nulls. Find out which way you want it before the report goes out, not after.

You grouped, you aggregated, and the result is still a DataFrame you can't merge back in.

The output of groupby().agg() has the grouping columns as the index, not as columns. If you want them back as columns, chain .reset_index() on the end. Most join and merge operations expect the key as a column, so this is the line that lets the result behave like a regular table again.

The aggregation runs without errors but the numbers look wrong.

Almost always because the amount column is being read as a string. Pandas will happily "sum" strings by concatenating them, and the answer looks plausible until you spot that "100" + "50" is "10050". Check df.dtypes if a total looks suspicious. If amount says object instead of float64, fix it on read: pd.read_csv(..., dtype={"amount": "float64"}).

Where the series goes from here

Two of three down. Grouping is the engine. The third part of the series turns it into something that travels.

Part 3 (28 May): joining datasets. VLOOKUP and XLOOKUP, but for files that don't fit in memory and joins that don't break when the source range changes.

For the broader question of when Power BI's performance problems live in the model rather than the DAX, and why that's the same lesson as this one, see the 20 May Hot Take on data model performance.

Frequently Asked Questions

Is groupby faster than a pivot table in Excel?

For datasets Excel can open, often comparable. For datasets Excel can't open, the comparison stops being interesting. On a 2 million row CSV, the groupby above runs in under a second on a normal laptop. The pivot table doesn't run at all.

What if I need a percentage of total, not just a sum?

Run the groupby for the sum, then divide by the grand total: result["pct"] = result["total_revenue"] / result["total_revenue"].sum(). There's also a transform method on groupby that handles per-group ratios, which is the right answer when you want "this row's share of its own group's total" rather than the overall.

Can I use groupby with a date column to get monthly totals?

Yes, but it's cleaner with .resample for date frequencies: df.set_index("order_date").resample("M")["amount"].sum() gives monthly totals. Groupby works too if you build a month column first (df["month"] = df["order_date"].dt.to_period("M")), but resample is the idiom built for date arithmetic.

What's the equivalent of GETPIVOTDATA?

You don't need one. The pivot result is a regular DataFrame, so you reach into it with result.loc["Hardware", "South"] and get the value directly. No special function, no broken references when the pivot rebuilds.

Does this work on Excel files, or only CSVs?

Both. Swap pd.read_csv for pd.read_excel and the rest of the script is identical. Excel files are slower to parse than CSVs, so for the 2 million row case, you'll want the CSV regardless. For a 50,000 row workbook, either works fine.

What if I want the unaggregated rows back, not the summary?

That's a filter, not a group. Use the Part 1 syntax. Groupby is for "describe each group with one number." If you need every row, you never left filtering.

The scenario, because SUMIFS doesn't scale to 2 million rows

Groupby is SUMIFS turned inside out

Multiple aggregations in one pass

Naming the output columns so the result is readable

Grouping by more than one column is the pivot table

Where this breaks the first time

The grouping column has nulls and they vanish.

You grouped, you aggregated, and the result is still a DataFrame you can't merge back in.

The aggregation runs without errors but the numbers look wrong.

Where the series goes from here

Frequently Asked Questions

Is groupby faster than a pivot table in Excel?

What if I need a percentage of total, not just a sum?

Can I use groupby with a date column to get monthly totals?

What's the equivalent of GETPIVOTDATA?

Does this work on Excel files, or only CSVs?

What if I want the unaggregated rows back, not the summary?

More like this

get the goodstuff first.

get the good
stuff first.