← Previously: Part 2: Grouping, Aggregating, and Replacing Your SUMIFS (21 May)

What this post covers: replacing VLOOKUP and XLOOKUP with pandas merge, the four join types and when each one matters, joining on multiple key columns at once, and handling the many-to-many case that VLOOKUP can't represent at all. Built on two real CSV files: a customer table and an order table.

The scenario, because VLOOKUP breaks at scale

Two files. customers.csv has 180,000 rows: customer_id, name, region, signup_date. orders.csv has 2.3 million rows: order_id, customer_id, order_date, amount. The analytical question is the same one VLOOKUP exists to answer: take the orders table, and put each customer's region next to every order so we can group by region.

In Excel, the move is =VLOOKUP(customer_id, customers, 3, FALSE), dragged down 2.3 million rows. Which doesn't fit. Even if it did, the formula would recalculate every time anything in the workbook changes, and the file would take five minutes to save.

In pandas, the entire join is a single line, runs in under a second, and produces a DataFrame you can immediately group, filter, or write back to CSV.

The mental shift, though, is the part worth slowing down for. VLOOKUP is a per-row function: for this row, find the matching value in the lookup table. Merge is a set operation: combine these two tables on a shared key, and give me the result as a new table. The first thinks one row at a time. The second doesn't.

Merge is a join, not a lookup

Here's the move. Load both files, then call merge.

python
import pandas as pd

customers = pd.read_csv("customers.csv", parse_dates=["signup_date"])
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

result = orders.merge(customers, on="customer_id", how="left")

Read that as: take the orders table, attach matching rows from customers based on customer_id, and keep every row in orders whether a match is found or not.

The result is a new DataFrame that has every column from orders, plus every column from customers except the join key. 2.3 million rows in, 2.3 million rows out, region attached, no formulas, no drag.

plaintext
   order_id  customer_id order_date     amount     name  region signup_date
0    100001         4012 2025-01-03      55.20    Cobb   North  2022-04-11
1    100002         8821 2025-01-03    1102.55     Lin   South  2021-09-30
2    100003         4012 2025-01-04     220.00    Cobb   North  2022-04-11
3    100004         2247 2025-01-04      18.99     Vea    East  2024-02-15
...

That's a VLOOKUP across the whole column, run once, with the output already in a shape you can use.

The four kinds of merge

how= is the argument that does the work. It defines what happens to rows that don't have a match in the other table.

how="left" keeps every row from the left table. Unmatched rows get NaN in the columns from the right table. This is the closest analogue to VLOOKUP - "I want every order, with the customer info filled in where I can find it."

how="right" is the mirror. Every row from the right table is kept. Rarely the right answer; usually a sign you should swap the arguments and use left.

how="inner" keeps only rows where the key exists in both tables. Useful when "no match" means "this row shouldn't be here in the first place." Risky as a default - silent data loss is exactly the kind of bug that ships to a stakeholder.

how="outer" keeps every row from both tables. Unmatched rows get NaN on whichever side they're missing from. This is the right answer when you actively want to find the rows that don't match. Audit work, reconciliation, "which customers have no orders" - all outer joins.

Default is inner. The default is wrong for the most common case (VLOOKUP-style enrichment), so it's worth setting how= explicitly every time. The line that reads merge(..., how="left") is the line that won't bite you later.

Joining on multiple key columns at once

VLOOKUP and XLOOKUP take exactly one key. If your match needs two columns - say, customer ID and product category - you concatenate them in Excel and hope the helper column doesn't break.

merge takes a list.

python
result = orders.merge(
    promotions,
    on=["customer_id", "category"],
    how="left",
)

Read it as: a row in orders matches a row in promotions only if both customer_id and category are equal. No helper columns, no concatenation tricks, no quiet bugs when one of the keys has a stray trailing space.

If the columns are named differently in each table, use left_on and right_on:

python
result = orders.merge(
    customers,
    left_on="cust_id",     # name in orders.csv
    right_on="customer_id", # name in customers.csv
    how="left",
)

The result has both columns. Drop the redundant one with .drop(columns="customer_id") if you don't want it hanging around.

Many-to-many is where VLOOKUP gives up entirely

VLOOKUP returns the first match it finds. If the lookup table has more than one row per key, the rows after the first are silently invisible. The formula gives you an answer and never tells you it ignored half the data.

Merge handles this honestly. If both tables have multiple rows per key, the result is the Cartesian product of the matches: every left row paired with every right row that matches.

Quick example. customers has one row per customer. orders has many rows per customer. The orders-side join produces one result row per order, which is what you want. Two million orders go in, two million annotated orders come out.

Now suppose customer_promotions has multiple rows per customer (a customer can have several promotions). Joining orders to customer_promotions produces one row per (order, promotion) pair. An order with three eligible promotions becomes three rows in the result. The row count blows up.

Pandas can warn you about this on the way in:

python
result = orders.merge(
    customer_promotions,
    on="customer_id",
    how="left",
    validate="many_to_one",  # this is the assertion
)

validate="many_to_one" means "many on the left, one on the right." If that's false - if there are duplicate keys in customer_promotions - pandas raises a MergeError instead of silently exploding the row count. Set validate= on any merge where the row count matters and you'll catch the failure before the report goes out.

Where this breaks the first time

Three things will trip you the first few merges. The error messages are not always helpful.

The result has a _x and _y suffix on a column and you don't know why.

If both tables have a column with the same name, and that column isn't the join key, pandas keeps both, suffixed with _x (left) and _y (right). Usually unwanted. Either drop one with .drop(columns="region_y") after the merge, or rename one before: customers = customers.rename(columns={"region": "customer_region"}).

The row count doubled and you don't know why.

Duplicate keys in the right-hand table. A single row in the left ends up matching multiple rows on the right, and the output is the Cartesian product. Run customers["customer_id"].duplicated().sum() to confirm. The fix is upstream: deduplicate the lookup table, or use the validate= argument to fail loudly.

The merge runs but lots of rows have NaN in the joined columns.

The keys aren't matching. Almost always one of two things: leading or trailing whitespace, or a type mismatch. Numeric IDs read from one file as int64 and from the other as object (string) will not merge. Check df["customer_id"].dtype on both sides. If they differ, cast both: df["customer_id"] = df["customer_id"].astype("int64").

Frequently Asked Questions

Is merge the same as join?

Almost. df.merge(other) is the general-purpose function and is what to reach for by default. df.join(other) is a shortcut that joins on the index rather than on a column, which is occasionally convenient but more often confusing. If you're not sure which to use, use merge.

How do I do an XLOOKUP-style closest-match (approximate) lookup?

That's pd.merge_asof, not merge. It joins on a sorted numeric or date key and finds the nearest preceding match, which is the right shape for time-series joins ("what was the FX rate as of each transaction date"). Both tables must be sorted by the key.

What if I want to merge on a date range, not an exact match?

Same answer: merge_asof with direction="backward" for "the most recent record at or before this date." For true range joins ("any record between these two dates"), there's no built-in - the idiomatic move is a self-join with a boolean mask, or a small loop, or SQL upstream if the data is in a database.

Can I merge more than two tables?

Yes, but chain them: orders.merge(customers, on="customer_id").merge(products, on="product_id"). Each .merge returns a new DataFrame, so you can stack them as long as the join keys exist on each step.

What's the right way to handle missing matches?

Use how="left" so you keep every row, then check result["region"].isna().sum() to count unmatched rows. The fix depends on context: fill with a default (result["region"] = result["region"].fillna("Unknown")), filter them out, or flag them in a separate report. The point is that the missing matches are visible and counted, not silently dropped.

Why does merge sometimes change the order of my rows?

Because some merge strategies sort by the join key internally. If row order matters, pass sort=False (the default in modern pandas) and verify with a check column. For most analytical work, the order doesn't matter; for output to a stakeholder who expects a particular sequence, sort explicitly at the end.