I just lately and instantly closed it.
Not as a result of it was unsuitable. The code labored. The numbers checked out.
However I had no concept what was happening.
There have been variables all over the place. df1, df2, final_df, final_final. Every step made sense in isolation, however as an entire it felt like I used to be tracing a maze. I needed to learn line by line simply to grasp what I had already finished.
And the humorous factor is, that is how most of us begin with Pandas.
You study just a few operations. You filter right here, create a column there, group and combination. It will get the job finished. However over time, your code begins to really feel tougher to belief, tougher to revisit, and undoubtedly tougher to share.
That was the purpose I spotted one thing.
The hole between newbie and intermediate Pandas customers shouldn’t be about realizing extra capabilities. It’s about the way you construction your transformations.
There’s a sample that quietly modifications every thing when you see it. Your code turns into simpler to learn. Simpler to debug. Simpler to construct on.
It’s known as methodology chaining.
On this article, I’ll stroll by means of how I began utilizing methodology chaining correctly, together with assign() and pipe(), and the way it modified the way in which I write Pandas code. If in case you have ever felt like your notebooks are getting messy as they develop, this can in all probability click on for you.
The Shift: What Intermediate Pandas Customers Do In another way
At first, I assumed getting higher at Pandas meant studying extra capabilities.
Extra methods. Extra syntax. Extra methods to control knowledge.
However the extra I constructed, the extra I observed one thing. The individuals who have been really good at Pandas weren’t essentially utilizing extra capabilities than I used to be. Their code simply appeared… totally different.
Cleaner. Extra intentional. Simpler to observe.
As an alternative of writing step-by-step code with plenty of intermediate variables, they wrote transformations that flowed into one another. You might learn their code from high to backside and perceive precisely what was taking place to the information at every stage.
It nearly felt like studying a narrative.
That’s when it clicked for me. The true improve shouldn’t be about what you employ. It’s about how you construction it.
As an alternative of considering:
“What do I do subsequent to this DataFrame?”
You begin considering:
“What transformation comes subsequent?”
That small shift modifications every thing.
And that is the place methodology chaining is available in.
Technique chaining isn’t just a cleaner method to write Pandas. It’s a totally different approach to consider working with knowledge. Every step takes your DataFrame, transforms it, and passes it alongside. No pointless variables. No leaping round.
Only a clear, readable stream from uncooked knowledge to remaining end result.
Within the subsequent part, I’ll present you precisely what this seems like utilizing an actual instance.
The “Earlier than”: How Most of Us Write Pandas
To make this concrete, let’s say we need to reply a easy query:
Which product classes are producing probably the most income every month?
I pulled a small gross sales dataset with order particulars, product classes, costs, and dates. Nothing fancy.
import pandas as pd
df = pd.read_csv("gross sales.csv")
print(df.head())
Output
order_id customer_id product class amount value order_date
0 1001 C001 Laptop computer Electronics 1 1200 2023-01-05
1 1002 C002 Headphones Electronics 2 150 2023-01-07
2 1003 C003 Sneakers Vogue 1 80 2023-01-10
3 1004 C001 T-Shirt Vogue 3 25 2023-01-12
4 1005 C004 Blender Dwelling 1 60 2023-01-15
Now, right here is how I might have written this not too way back:
# Create a brand new column for income
df["revenue"] = df["quantity"] * df["price"]
# Filter for orders from 2023 onwards
df_filtered = df[df["order_date"] >= "2023-01-01"]
# Convert order_date to datetime and extract month
df_filtered["month"] = pd.to_datetime(df_filtered["order_date"]).dt.to_period("M")
# Group by class and month, then sum income
grouped = df_filtered.groupby(["category", "month"])["revenue"].sum()
# Convert Collection again to DataFrame
end result = grouped.reset_index()
# Type by income descending
end result = end result.sort_values(by="income", ascending=False)
print(end result)
This works. You get your reply.
class month income
1 Electronics 2023-02 2050
2 Electronics 2023-03 1590
0 Electronics 2023-01 1500
8 Dwelling 2023-03 225
6 Dwelling 2023-01 210
5 Vogue 2023-03 205
7 Dwelling 2023-02 180
4 Vogue 2023-02 165
3 Vogue 2023-01 155
However there are just a few issues that begin to present up as your evaluation grows.
First, the stream is difficult to observe. It’s important to preserve monitor of df, df_filtered, grouped, and end result. Every variable represents a barely totally different state of the information.
Second, the logic is scattered. The transformation is occurring step-by-step, however not in a approach that feels linked. You’re mentally stitching issues collectively as you learn.
Third, it’s tougher to reuse or check. If you wish to tweak one a part of the logic, you now need to hint the place every thing is being modified.
That is the type of code that works high quality right now… however turns into painful once you come again to it every week later.
Now examine that to how the identical logic seems once you begin considering in transformations as a substitute of steps.
The “After”: When Every part Clicks
Now let’s clear up the very same drawback once more.
Similar dataset. Similar purpose.
Which product classes are producing probably the most income every month?
Right here’s what it seems like once you begin considering in transformations:
end result = (
pd.read_csv("gross sales.csv") # Begin with uncooked knowledge
.assign(
# Create income column
income=lambda df: df["quantity"] * df["price"],
# Convert order_date to datetime
order_date=lambda df: pd.to_datetime(df["order_date"]),
# Extract month from order_date
month=lambda df: df["order_date"].dt.to_period("M")
)
# Filter for orders from 2023 onwards
.loc[lambda df: df["order_date"] >= "2023-01-01"]
# Group by class and month, then sum income
.groupby(["category", "month"], as_index=False)["revenue"]
.sum()
# Type by income descending
.sort_values(by="income", ascending=False)
)
print(end result)
Similar output. Utterly totally different really feel.
class month income
1 Electronics 2023-02 2050
2 Electronics 2023-03 1590
0 Electronics 2023-01 1500
8 Dwelling 2023-03 225
6 Dwelling 2023-01 210
5 Vogue 2023-03 205
7 Dwelling 2023-02 180
4 Vogue 2023-02 165
3 Vogue 2023-01 155
The very first thing you discover is that every thing flows. There is no such thing as a leaping between variables or making an attempt to recollect what df_filtered or grouped meant.
Every step builds on the final one.
You begin with the uncooked knowledge, then:
- create income
- convert dates
- extract the month
- filter
- group
- combination
- type
Multi functional steady pipeline.
You possibly can learn it high to backside and perceive precisely what is occurring to the information at every stage.
That’s the half that stunned me probably the most.
It isn’t simply shorter code. It’s clearer code.
And when you get used to this, going again to the previous approach feels… uncomfortable.
There are a few issues taking place right here that make this work so nicely.
We’re not simply chaining strategies. We’re utilizing just a few particular instruments that make chaining really sensible.
Within the subsequent part, let’s break these down.
Breaking Down the Sample
After I first noticed this fashion of Pandas code, it appeared a bit intimidating.
Every part was chained collectively. No intermediate variables. Loads taking place in a small house.
However as soon as I slowed down and broke it into items, it began to make sense.
There are actually simply three concepts carrying every thing right here:
- methodology chaining
assign()pipe()
Let’s undergo them one after the other.
Technique Chaining (The Basis)
At its core, methodology chaining is easy. Every step takes a DataFrame, applies a change, and returns a brand new DataFrame. That new DataFrame is instantly handed into the following step.
So as a substitute of this:
df = step1(df)
df = step2(df)
df = step3(df)
You do that:
df = step1(df).step2().step3()
That’s actually it.
However the influence is larger than it seems.
It forces you to suppose when it comes to stream. Every line turns into one transformation. You’re not leaping round or storing short-term states. You’re simply shifting ahead.
That’s the reason the code begins to really feel extra readable. You possibly can observe the transformation from begin to end with out holding a number of variations of the information in your head.
assign() — Retaining Every part within the Circulate
That is the one that basically unlocked chaining for me.
Earlier than this, anytime I wished to create a brand new column, I might break the stream:
df["revenue"] = df["quantity"] * df["price"]
That works, nevertheless it interrupts the pipeline.
assign() helps you to do the identical factor with out breaking the chain:
.assign(income=lambda df: df["quantity"] * df["price"])
At first, the lambda df: half felt bizarre.
However the concept is easy. You’re saying:
“Take the present DataFrame, and use it to outline this new column.”
The important thing profit is that every thing stays in a single place. You possibly can see the place the column is created and the way it’s used, all throughout the identical stream.
It additionally encourages a cleaner fashion the place transformations are grouped logically as a substitute of scattered throughout the pocket book.
pipe() — The place Issues Begin to Really feel Highly effective
pipe() is the one I ignored at first.
I assumed, “I can already chain strategies, why do I want this?”
Then I bumped into an issue.
Some transformations are simply too complicated to suit neatly into a sequence.
You both:
write messy inline logic
or break the chain fully
That’s the place pipe() is available in.
It means that you can move your DataFrame right into a customized perform with out breaking the stream.
For instance:
def filter_high_value_orders(df):
return df[df["revenue"] > 500]
df = (
pd.read_csv("gross sales.csv")
.assign(income=lambda df: df["quantity"] * df["price"])
.pipe(filter_high_value_orders)
)
Now your logic is cleaner, reusable and simpler to check
That is the purpose the place issues began to really feel totally different for me.
As an alternative of writing lengthy scripts, I used to be beginning to construct small, reusable transformation steps.
And that’s when it clicked.
This isn’t nearly writing cleaner Pandas code. It’s about writing code that scales as your evaluation will get extra complicated.
Within the subsequent part, I need to present how this modifications the way in which you consider working with knowledge fully.
Pondering in Pipelines (The Actual Improve)
Up till this level, it’d really feel like we simply made the code look nicer.
However one thing deeper is occurring right here.
If you begin utilizing methodology chaining persistently, the way in which you consider working with knowledge begins to vary.
Earlier than, my method was very step-by-step.
I might take a look at a DataFrame and suppose:
“What do I do subsequent?”
- Filter it.
- Modify it.
- Retailer it.
- Transfer on.
Every step felt a bit disconnected from the final.
However with methodology chaining, that query modifications.
Now it turns into:
“What transformation comes subsequent?”
That shift is small, nevertheless it modifications the way you construction every thing.
You cease considering when it comes to remoted steps and begin considering when it comes to a stream. A pipeline. Information is available in, will get remodeled stage by stage, and produces an output.
And the code displays that.
Every line isn’t just doing one thing. It’s a part of a sequence. A transparent development from uncooked knowledge to perception.
This additionally makes your code simpler to cause about.
If one thing breaks, you should not have to scan all the pocket book. You possibly can take a look at the pipeline and ask:
- which transformation may be unsuitable?
- the place did the information change in an surprising approach?
It turns into simpler to debug as a result of the logic is linear and visual.
One other factor I observed is that it naturally pushes you towards higher habits.
- You begin writing smaller transformations.
- You begin naming issues extra clearly.
- You begin enthusiastic about reuse with out even making an attempt.
And that’s the place it begins to really feel much less like “simply Pandas” and extra like constructing precise knowledge workflows.
At this level, you aren’t simply analyzing knowledge.
You’re designing how knowledge flows.
Actual-World Refactor: From Messy to Clear
Let me present you ways this really performs out.
As an alternative of leaping straight from messy code to an ideal chain, I need to stroll by means of how I might refactor this step-by-step. That is often the way it occurs in actual life anyway.
Step 1: The Beginning Level (Messy however Works)
df = pd.read_csv("gross sales.csv") # Load dataset
# Create income column
df["revenue"] = df["quantity"] * df["price"]
# Filter orders from 2023 onwards
df_filtered = df[df["order_date"] >= "2023-01-01"]
# Convert order_date and extract month
df_filtered["month"] = pd.to_datetime(df_filtered["order_date"]).dt.to_period("M")
# Group by class and month, then sum income
grouped = df_filtered.groupby(["category", "month"])["revenue"].sum()
# Convert to DataFrame
end result = grouped.reset_index()
# Type outcomes
end result = end result.sort_values(by="income", ascending=False)
Nothing unsuitable right here. That is how most of us begin.
However we are able to already see:
- too many intermediate variables
- transformations are scattered
- tougher to observe because it grows
Step 2: Cut back Pointless Variables
First, take away variables that aren’t actually wanted.
df = pd.read_csv("gross sales.csv") # Load dataset
# Create new columns upfront
df["revenue"] = df["quantity"] * df["price"]
df["month"] = pd.to_datetime(df["order_date"]).dt.to_period("M")
end result = (
# Filter related rows
df[df["order_date"] >= "2023-01-01"]
# Mixture income by class and month
.groupby(["category", "month"])["revenue"]
.sum()
# Convert to DataFrame
.reset_index()
# Type outcomes
.sort_values(by="income", ascending=False)
)
Already higher. There are fewer shifting elements, and a few stream is beginning to seem
Step 3: Introduce Primary Chaining
Now we begin chaining extra intentionally.
end result = (
pd.read_csv("gross sales.csv") # Begin with uncooked knowledge
.assign(
# Create income column
income=lambda df: df["quantity"] * df["price"],
# Extract month from order_date
month=lambda df: pd.to_datetime(df["order_date"]).dt.to_period("M")
)
# Filter for current orders
.loc[lambda df: df["order_date"] >= "2023-01-01"]
# Group and combination
.groupby(["category", "month"])["revenue"]
.sum()
# Convert to DataFrame
.reset_index()
# Type outcomes
.sort_values(by="income", ascending=False)
)
At this level, the stream is evident, transformations are grouped logically, and we’re not leaping between variables.
Step 4: Clear It Up Additional
Small tweaks make a giant distinction.
end result = (
pd.read_csv("gross sales.csv") # Load knowledge
.assign(
# Create income
income=lambda df: df["quantity"] * df["price"],
# Guarantee order_date is datetime
order_date=lambda df: pd.to_datetime(df["order_date"]),
# Extract month from order_date
month=lambda df: df["order_date"].dt.to_period("M")
)
# Filter related time vary
.loc[lambda df: df["order_date"] >= "2023-01-01"]
# Mixture income
.groupby(["category", "month"], as_index=False)["revenue"]
.sum()
# Type outcomes
.sort_values(by="income", ascending=False)
)
Now there are not any redundant conversions, there’s cleaner grouping and extra constant construction.
Step 5: When pipe() Turns into Helpful
Let’s say the logic grows. Possibly we solely care about high-revenue rows.
As an alternative of stuffing that logic into the chain, we extract it:
def filter_high_revenue(df):
# Maintain solely rows the place income is above threshold
return df[df["revenue"] > 500]
Now we plug it into the pipeline:
end result = (
pd.read_csv("gross sales.csv") # Load knowledge
.assign(
# Create income
income=lambda df: df["quantity"] * df["price"],
# Convert and extract time options
order_date=lambda df: pd.to_datetime(df["order_date"]),
month=lambda df: df["order_date"].dt.to_period("M")
)
# Apply customized transformation
.pipe(filter_high_revenue)
# Filter by date
.loc[lambda df: df["order_date"] >= "2023-01-01"]
# Mixture outcomes
.groupby(["category", "month"], as_index=False)["revenue"]
.sum()
# Type output
.sort_values(by="income", ascending=False)
)
That is the place it begins to really feel totally different. Your code is not only a script. Now, it’s a sequence of reusable transformations.
What I like about this course of is that you don’t want to leap straight to the ultimate model.
You possibly can evolve your code steadily.
- Begin messy.
- Cut back variables.
- Introduce chaining.
- Extract logic when wanted.
That’s how this sample really sticks.
Subsequent, let’s speak about just a few errors I made whereas studying this so you don’t run into the identical points.
Widespread Errors (I Made Most of These)
After I began utilizing methodology chaining, I undoubtedly overdid it.
Every part felt cleaner, so I attempted to pressure every thing into a sequence. That led to some… questionable code.
Listed here are just a few errors I bumped into so that you should not have to.
1. Over-Chaining Every part
Sooner or later, I assumed longer chains = higher code.
Not true.
# This will get exhausting to learn in a short time
df = (
df
.assign(...)
.loc[...]
.groupby(...)
.agg(...)
.reset_index()
.rename(...)
.sort_values(...)
.question(...)
)
Sure, it’s technically clear. However now it’s doing an excessive amount of in a single place.
Repair:
- Break your chain when it begins to really feel dense.
- Group associated transformations collectively
- Break up logically totally different steps
- Assume readability first, not cleverness.
2. Forcing Logic Into One Line
I used to cram complicated logic into assign() or loc() simply to maintain the chain going.
That often makes issues worse.
.assign(
revenue_flag=lambda df: np.the place(
(df["quantity"] * df["price"] > 500) & (df["category"] == "Electronics"),
"Excessive",
"Low" ) )
This works, however it isn’t very readable.
Repair:
If the logic is complicated, extract it.
def add_revenue_flag(df):
df["revenue_flag"] = np.the place(
(df["quantity"] * df["price"] > 500) & (df["category"] == "Electronics"),
"Excessive",
"Low"
)
return df
df = df.pipe(add_revenue_flag)
Cleaner. Simpler to check. Simpler to reuse.
3. Ignoring pipe() for Too Lengthy
I averted pipe() at first as a result of it felt pointless. However with out it, you hit a ceiling.
You both:
break your chain
or write messy inline logic
Repair:
- Use
pipe()as quickly as your logic stops being easy. - It’s what turns your code from a script into one thing modular.
4. Shedding Readability With Poor Naming
If you begin utilizing customized capabilities with pipe(), naming issues loads.
Dangerous:def rework(df): ...
Higher:def filter_high_revenue(df): ...
Now your pipeline reads like a narrative:.pipe(filter_high_revenue)
That small change makes a giant distinction.
5. Pondering This Is About Shorter Code
This one took me some time to comprehend. Technique chaining shouldn’t be about writing fewer strains. It’s about writing code that’s simpler to learn, cause about and are available again to later
Typically the chained model is longer. That’s high quality. Whether it is clearer, it’s higher.
Let’s wrap this up and tie it again to the “intermediate” concept.
Conclusion: Leveling Up Your Pandas Sport
When you’ve adopted alongside, you’ve seen a small shift with a huge impact.
By considering in transformations as a substitute of steps, utilizing methodology chaining, assign(), and pipe(), your code stops being only a assortment of strains and turns into a transparent, readable stream.
Right here’s what modifications once you internalize this sample:
- You possibly can learn your code high to backside with out getting misplaced.
- You possibly can reuse transformations simply, making your notebooks extra modular.
- You possibly can debug and check with out tracing dozens of intermediate variables.
- You begin considering in pipelines, not simply steps.
That is precisely what separates a newbie from an intermediate Pandas person.
You’re not simply “making it work.” You’re designing your evaluation in a approach that scales, is maintainable, and appears good to anybody who reads it—even future you.
Strive It Your self
Choose a messy pocket book you’ve been engaged on and refactor only one half utilizing methodology chaining.
- Begin with
assign()for brand new columns - Use
loc[]to filter - Introduce
pipe()for any customized logic
You’ll be stunned how a lot clearer your pocket book turns into, nearly instantly.
That’s it. You’ve simply unlocked intermediate Pandas.
The next step? Maintain practising, construct your individual pipelines, and see how your enthusiastic about knowledge transforms alongside along with your code.















