What this post covers: opening a CSV that's too big for Excel, filtering it by date and category, and writing the result back out as a clean file. Three operations, in Python, assuming you already know what AutoFilter and Save As do.
The scenario, because toy examples teach you nothing
You have sales_2025.csv. It's 2.3 million rows. Excel opens it and silently chops everything after row 1,048,576, or it doesn't open it at all and sits there spinning. Either way, you can't filter what you can't see.
What you actually need: every row from Q4 2025, in the Hardware category, written back out as a smaller CSV that Excel can handle from there.
In Excel, this would be: open file → AutoFilter → date column → set range → category column → tick "Hardware" → copy visible rows → paste into new file → Save As. Nine steps, none hard, except the first one, which is the part that doesn't work.
In Python with pandas, it's four lines.
Opening the file is one line
Pandas is the library that turns CSVs into something you can work with. You import it once at the top of the script and never think about it again.
import pandas as pd
df = pd.read_csv("sales_2025.csv", parse_dates=["order_date"])

Two things going on here.
- The import. import pandas as pd brings in pandas and gives it the nickname pd. Every pandas script in the world does this. You'll see pd.something for the rest of your career. Don't fight it.
- The read. pd.read_csv opens the file and returns a DataFrame, which is the pandas equivalent of an Excel sheet. It has rows, it has columns with headers, and the variable df is what you'll work with from here. The parse_dates=["order_date"] argument tells pandas to treat the order_date column as actual dates, not strings. This matters more than it looks like it matters, and we'll get to why in a moment.
The file is open. Excel never had to know.
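If you want proof before moving on, a quick sanity check is worth three extra lines. A minimal sketch using standard pandas attributes (shape, dtypes, head):

# Confirm the load worked before filtering anything.
print(df.shape)    # (rows, columns) -- expect roughly 2.3 million rows
print(df.dtypes)   # order_date should say datetime64[ns], not object
print(df.head())   # first five rows, like peeking at the top of a sheet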
Boolean filtering is AutoFilter without the dropdown
Here's where the mental model shift happens. In Excel, you click a dropdown and tick boxes. In pandas, you describe the condition each row has to meet, and pandas keeps the rows where the condition is true.
filtered = df[(df["order_date"] >= "2025-10-01") & (df["category"] == "Hardware")]
Read that as: give me the rows of df where order_date is October 1 2025 or later, AND category equals Hardware.
Three details trip Excel people up the first few times.
- The square brackets reference a column. df["category"] is how you pull out a column. You can also write df.category, but stick with brackets, they handle column names with spaces, which Excel exports have plenty of.
- & is not and. Python has both, and they do different things. and is for single true/false values. & is the column-level operator pandas needs for row filtering. If you write and, pandas throws an error that looks scary but means "use the symbol instead".
- The parentheses around each condition are not optional. Python's operator precedence means & binds tighter than >=, so without the parentheses Python tries to evaluate the wrong thing and the error message is genuinely useless. Wrap each condition. Always.
Run that line and filtered is a new DataFrame with only the rows you wanted. The original df is untouched.
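Two variations on the same pattern come up almost immediately, so here they are as a sketch. Both | and .isin are standard pandas; the region column is invented for illustration:

# OR logic uses | (not Python's "or"), with the same parentheses rule.
west_or_east = df[(df["region"] == "West") | (df["region"] == "East")]

# Matching several categories at once: .isin replaces a chain of == checks.
hw_or_sw = df[df["category"].isin(["Hardware", "Software"])]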
Writing the output is to_csv, with one footnote
filtered.to_csv("hardware_q4_2025.csv", index=False)

The method is .to_csv. The filename goes in quotes. The index=False is the part nobody mentions until you've already shipped a report with a weird extra column on the left.
What's the extra column? Pandas assigns every row a numeric index starting at 0. By default, to_csv writes that index out as the first column of your output file. You don't want it. It looks like data, it isn't data, it confuses every downstream consumer. Write index=False unless you have a specific reason not to. I have shipped reports with the extra column, more than once, before this became muscle memory.
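If you want to see the difference for yourself, a toy round trip makes it obvious. Everything here is standard pandas; the two-row DataFrame is invented for the demo:

import pandas as pd

toy = pd.DataFrame({"category": ["Hardware", "Software"], "units": [3, 5]})
toy.to_csv("with_index.csv")                  # first column is 0, 1, ... (the index)
toy.to_csv("without_index.csv", index=False)  # columns are exactly your data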
The whole script
Four lines that replace the entire Excel workflow you couldn't run.
import pandas as pd
df = pd.read_csv("sales_2025.csv", parse_dates=["order_date"])
filtered = df[(df["order_date"] >= "2025-10-01") & (df["category"] == "Hardware")]
filtered.to_csv("hardware_q4_2025.csv", index=False)

Save it as filter_sales.py. Run it with python filter_sales.py from a terminal in the folder where your CSV lives. The output file appears next to it.
That's the whole thing. File filtered, file written, small enough to open in Excel.
Where this breaks, so you can fix it when it does
Three things go wrong the first few times. I have done all three; there's a diagnostic snippet for each after the list.
- The category name doesn't match exactly. "Hardware" and "hardware" are different strings. Pandas does an exact match. If your filter returns zero rows and you're sure the data is there, check the case. df["category"].unique() prints every distinct value in the column and the typo will be obvious. I've spent more time on this than I want to admit, after an upstream system silently switched "Hardware" to "hardware" between exports.
- The date filter looks like it works, but doesn't. If you forget parse_dates, the order_date column stays as strings. The comparison >= "2025-10-01" still runs, because Python can compare strings alphabetically, and ISO-format dates happen to sort correctly that way. It looks fine until a row comes in with a different date format, at which point it silently filters the wrong rows. Parse the dates explicitly, every time.
- The file isn't where you think it is. pd.read_csv("sales_2025.csv") looks in the current working directory, which is wherever you launched Python from, not where the script lives. If you get FileNotFoundError, the path is wrong, not the file. Either move into the right folder before running, or pass the full path.
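A diagnostic pass that mirrors those three, in order. Everything here is standard pandas or the standard library:

# 1. Zero rows back? Print the distinct categories, look for case or whitespace typos.
print(df["category"].unique())

# 2. Date filter suspicious? Confirm the column actually parsed as dates.
print(df["order_date"].dtype)   # want datetime64[ns], not object

# 3. FileNotFoundError? Print where Python is actually looking.
import os
print(os.getcwd())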
Where the series goes from here
This is the first of three. Reading and filtering are the foundation, they are not the interesting part. The interesting part is what you do once the data is in a DataFrame.
- Part 2 (21 May): grouping and aggregating. The pandas equivalents of pivot tables and SUMIFS, where Python starts being faster than Excel rather than just bigger.
- Part 3 (28 May): joining datasets. VLOOKUP and XLOOKUP, but for files that don't fit in memory and joins that don't break when the source range changes.
If you're not yet convinced Python is worth the switching cost, the 3 May Deep End is the post that argues the case. Read that first if you need persuading. Read this one if you're already past that and want the mechanics.
Frequently Asked Questions
Do I need to install Python before any of this works?
Yes. The shortest path is to install Anaconda, which bundles Python, pandas, and a notebook environment in one installer. Skip the "should I use pip or conda" debate for now, either works, and the decision doesn't matter until you're maintaining a much larger project.
Why pandas and not OpenPyXL or the built-in csv module?
Pandas is built for this exact shape of problem: rectangular data, columns with types, filter and transform operations. OpenPyXL is for reading and writing .xlsx files specifically. The built-in csv module reads CSVs row by row, which is fine for tiny files but painful once you need to filter or aggregate.
Does this work on Mac?
Yes, the code is identical. The only Windows-specific gotcha is file paths: if you copy a path from File Explorer, the backslashes need to be doubled, or the string needs an r prefix, like r"C:\Users\bianca\sales_2025.csv".
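A sketch of the path-safe version, if you'd rather sidestep the backslash question entirely. pathlib is in Python's standard library, and read_csv accepts Path objects:

from pathlib import Path
import pandas as pd

# Raw string avoids doubled backslashes; forward slashes also work on Windows.
csv_path = Path(r"C:\Users\bianca\sales_2025.csv")
df = pd.read_csv(csv_path, parse_dates=["order_date"])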
My CSV uses semicolons instead of commas. What changes?
pd.read_csv("sales_2025.csv", sep=";") handles it. European exports and some German-locale Excel saves default to semicolons. The same sep argument takes a tab character (sep="\t") for TSV files.
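One related flag worth knowing, since the locales that use semicolons often write decimal commas too. decimal is a standard read_csv parameter:

import pandas as pd

# Semicolon-separated file where numbers look like "1234,56".
df = pd.read_csv("sales_2025.csv", sep=";", decimal=",")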
What's the actual size limit before Excel breaks?
Excel's hard ceiling is 1,048,576 rows. In practice, performance degrades well before that: workbooks over 200,000 rows with formulas become unusably slow. CSV files over about 100 MB are the point where Python is the right answer regardless of row count.
Should I use Jupyter or just run a script?
Jupyter notebooks are better for exploring data interactively. Scripts are better for the version of the job you'll run repeatedly. For a workflow like this one, where you know what you want and you need it done, a script is fine. Switch to a notebook when you're not sure what you're looking for yet.
Part 2 lands 21 May.