What this post covers: opening a CSV that's too big for Excel, filtering it by date and category, and writing the result back out as a clean file. Three operations, in Python, assuming you already know what AutoFilter and Save As do.
The scenario, because toy examples teach you nothing
You have sales_2025.csv. It's 2.3 million rows. Excel opens it and silently chops everything after row 1,048,576, or it doesn't open it at all and sits there spinning. Either way, you can't filter what you can't see.
What you actually need: every row from Q4 2025, in the Hardware category, written back out as a smaller CSV that Excel can handle from there.
In Excel, this would be: open file → AutoFilter → date column → set range → category column → tick "Hardware" → copy visible rows → paste into new file → Save As. Nine steps, none hard, except the first one, which is the part that doesn't work.
In Python with pandas, it's four lines.
Opening the file is one line
Pandas is the library that turns CSVs into something you can work with. You import it once at the top of the script and never think about it again.
import pandas as pd
df = pd.read_csv("sales_2025.csv", parse_dates=["order_date"])

Two things going on here.
- The import. import pandas as pd brings in pandas and gives it the nickname pd. Every pandas script in the world does this. You'll see pd.something for the rest of your career. Don't fight it.
- The read. pd.read_csv opens the file and returns a DataFrame, which is the pandas equivalent of an Excel sheet. It has rows, it has columns with headers, and the variable df is what you'll work with from here. The parse_dates=["order_date"] argument tells pandas to treat the order_date column as actual dates, not strings. This matters more than it looks like it matters, and we'll get to why in a moment.
The file is open. Excel never had to know.
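If you want proof before moving on, a quick sanity check is worth three extra lines. A minimal sketch using standard pandas attributes (shape, dtypes, head):

# Confirm the load worked before filtering anything.
print(df.shape)    # (rows, columns) -- expect roughly 2.3 million rows
print(df.dtypes)   # order_date should say datetime64[ns], not object
print(df.head())   # first five rows, like peeking at the top of a sheet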
Boolean filtering is AutoFilter without the dropdown
Here's where the mental model shift happens. In Excel, you click a dropdown and tick boxes. In pandas, you describe the condition each row has to meet, and pandas keeps the rows where the condition is true.
filtered = df[(df["order_date"] >= "2025-10-01") & (df["category"] == "Hardware")]
Read that as: give me the rows of df where order_date is October 1 2025 or later, AND category equals Hardware.
Three details trip Excel people up the first few times.
- The square brackets reference a column. df["category"] is how you pull out a column. You can also write df.category, but stick with brackets, they handle column names with spaces, which Excel exports have plenty of.
- & is not and. Python has both, and they do different things. and is for single true/false values. & is the column-level operator pandas needs for row filtering. If you write and, pandas throws an error that looks scary but means "use the symbol instead".
- The parentheses around each condition are not optional. Python's operator precedence means & binds tighter than >=, so without the parentheses Python tries to evaluate the wrong thing and the error message is genuinely useless. Wrap each condition. Always.
Run that line and filtered is a new DataFrame with only the rows you wanted. The original df is untouched.
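Two variations on the same pattern come up almost immediately, so here they are as a sketch. Both | and .isin are standard pandas; the region column is invented for illustration:

# OR logic uses | (not Python's "or"), with the same parentheses rule.
west_or_east = df[(df["region"] == "West") | (df["region"] == "East")]

# Matching several categories at once: .isin replaces a chain of == checks.
hw_or_sw = df[df["category"].isin(["Hardware", "Software"])]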
Writing the output is to_csv, with one footnote
filtered.to_csv("hardware_q4_2025.csv", index=False)

The method is .to_csv. The filename goes in quotes. The index=False is the part nobody mentions until you've already shipped a report with a weird extra column on the left.
What's the extra column? Pandas assigns every row a numeric index starting at 0. By default, to_csv writes that index out as the first column of your output file. You don't want it. It looks like data, it isn't data, it confuses every downstream consumer. Write index=False unless you have a specific reason not to. I have shipped reports with the extra column, more than once, before this became muscle memory.
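If you want to see the difference for yourself, a toy round trip makes it obvious. Everything here is standard pandas; the two-row DataFrame is invented for the demo:

import pandas as pd

toy = pd.DataFrame({"category": ["Hardware", "Software"], "units": [3, 5]})
toy.to_csv("with_index.csv")                  # first column is 0, 1, ... (the index)
toy.to_csv("without_index.csv", index=False)  # columns are exactly your data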
The whole script
Four lines that replace the entire Excel workflow you couldn't run.
import pandas as pd
df = pd.read_csv("sales_2025.csv", parse_dates=["order_date"])
filtered = df[(df["order_date"] >= "2025-10-01") & (df["category"] == "Hardware")]
filtered.to_csv("hardware_q4_2025.csv", index=False)

Save it as filter_sales.py. Run it with python filter_sales.py from a terminal in the folder where your CSV lives. The output file appears next to it.
That's the whole thing. File filtered, file written, small enough to open in Excel.
Where this breaks, so you can fix it when it does
Three things go wrong the first few times. I have done all three; there's a diagnostic snippet for each after the list.
- The category name doesn't match exactly. "Hardware" and "hardware" are different strings. Pandas does an exact match. If your filter returns zero rows and you're sure the data is there, check the case. df["category"].unique() prints every distinct value in the column and the typo will be obvious. I've spent more time on this than I want to admit, after an upstream system silently switched "Hardware" to "hardware" between exports.
- The date filter looks like it works, but doesn't. If you forget parse_dates, the order_date column stays as strings. The comparison >= "2025-10-01" still runs, because Python can compare strings alphabetically, and ISO-format dates happen to sort correctly that way. It looks fine until a row comes in with a different date format, at which point it silently filters the wrong rows. Parse the dates explicitly, every time.
- The file isn't where you think it is. pd.read_csv("sales_2025.csv") looks in the current working directory, which is wherever you launched Python from, not where the script lives. If you get FileNotFoundError, the path is wrong, not the file. Either move into the right folder before running, or pass the full path.
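A diagnostic pass that mirrors those three, in order. Everything here is standard pandas or the standard library:

# 1. Zero rows back? Print the distinct categories, look for case or whitespace typos.
print(df["category"].unique())

# 2. Date filter suspicious? Confirm the column actually parsed as dates.
print(df["order_date"].dtype)   # want datetime64[ns], not object

# 3. FileNotFoundError? Print where Python is actually looking.
import os
print(os.getcwd())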
Where the series goes from here
This is the first of three. Reading and filtering are the foundation, they are not the interesting part. The interesting part is what you do once the data is in a DataFrame.
- Part 2 (21 May): grouping and aggregating. The pandas equivalents of pivot tables and SUMIFS, where Python starts being faster than Excel rather than just bigger.
- Part 3 (28 May): joining datasets. VLOOKUP and XLOOKUP, but for files that don't fit in memory and joins that don't break when the source range changes.
If you're not yet convinced Python is worth the switching cost, the 3 May Deep End is the post that argues the case. Read that first if you need persuading. Read this one if you're already past that and want the mechanics.
Frequently Asked Questions
Do I need to install Python before any of this works?
Yes. The shortest path is to install Anaconda, which bundles Python, pandas, and a notebook environment in one installer. Skip the "should I use pip or conda" debate for now, either works, and the decision doesn't matter until you're maintaining a much larger project.
Why pandas and not OpenPyXL or the built-in csv module?
Pandas is built for this exact shape of problem: rectangular data, columns with types, filter and transform operations. OpenPyXL is for reading and writing .xlsx files specifically. The built-in csv module reads CSVs row by row, which is fine for tiny files but painful once you need to filter or aggregate.
Does this work on Mac?
Yes, the code is identical. The only Windows-specific gotcha is file paths: if you copy a path from File Explorer, the backslashes need to be doubled, or the string needs an r prefix, like r"C:\Users\bianca\sales_2025.csv".
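A sketch of the path-safe version, if you'd rather sidestep the backslash question entirely. pathlib is in Python's standard library, and read_csv accepts Path objects:

from pathlib import Path
import pandas as pd

# Raw string avoids doubled backslashes; forward slashes also work on Windows.
csv_path = Path(r"C:\Users\bianca\sales_2025.csv")
df = pd.read_csv(csv_path, parse_dates=["order_date"])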
My CSV uses semicolons instead of commas. What changes?
pd.read_csv("sales_2025.csv", sep=";") handles it. European exports and some German-locale Excel saves default to semicolons. The same sep argument takes a tab character (sep="\t") for TSV files.
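One related flag worth knowing, since the locales that use semicolons often write decimal commas too. decimal is a standard read_csv parameter:

import pandas as pd

# Semicolon-separated file where numbers look like "1234,56".
df = pd.read_csv("sales_2025.csv", sep=";", decimal=",")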
What's the actual size limit before Excel breaks?
Excel's hard ceiling is 1,048,576 rows. In practice, performance degrades well before that: workbooks over 200,000 rows with formulas become unusably slow. CSV files over about 100 MB are the point where Python is the right answer regardless of row count.
Should I use Jupyter or just run a script?
Jupyter notebooks are better for exploring data interactively. Scripts are better for the version of the job you'll run repeatedly. For a workflow like this one, where you know what you want and you need it done, a script is fine. Switch to a notebook when you're not sure what you're looking for yet.
Part 2 lands 21 May.