I Reduced My Pandas Runtime by 95% — Here's What I Was Doing Wrong

I've been using Pandas for a while now. Nothing too crazy though. Just basic data cleaning, exploratory data analysis, and a few essential functions. I've also explored things like method chaining for cleaner, more organized code, and operations that silently break your Pandas workflow, both of which I've written about before.

I never really thought about runtime. Honestly, if my code ran without errors and gave me the output I needed, I was happy. Even if it took a few minutes for all my notebook cells to finish, I didn't care. No errors meant no problems, right?

Then I came across the concept of vectorization. And something clicked.

I went down the rabbit hole, as I usually do. The more I read, the more I realized that "no errors" and "efficient code" are two very different things. Your Pandas code can be completely correct and still be quietly terrible at scale.

So this article is me documenting what I learned. The mistakes that slow Pandas code down, why they happen, how to fix them, and when Pandas itself might be the bottleneck. If you've ever run a notebook and just assumed the wait time was normal, this one's for you.

Why "Working Code" Isn't Good Enough

There's a reason this took me a while to believe. Pandas is designed to be forgiving. You can write code in a dozen different ways and most of them will work. You get your output, your dataframe looks right, and you move on.

But that flexibility comes with a hidden cost.

Unlike SQL or production-grade data systems, Pandas doesn't force you to think about efficiency. It doesn't warn you when you're doing something expensive. It just… does it. Slowly, sometimes. But it does it.

Think about it this way. SQL has a query optimizer. It looks at what you're asking for and figures out the most efficient way to get it. Pandas doesn't have that. It trusts you to write efficient code. And if you don't know what efficient looks like, you'll never know you're missing it.

The result is that a lot of Pandas code in the wild is what I'd call politely inefficient. It works on small datasets. It works on medium datasets with a little patience. But the moment you throw real-world data at it, something that's a few hundred thousand rows or more, the cracks start to show. What used to take seconds now takes minutes. What took minutes becomes unusable.

And the frustrating part is nothing looks wrong. No errors. No warnings. Just a slow notebook and a spinning cursor.

That's the trap. Pandas optimizes for convenience, not speed. And convenience is great, until it isn't.

So the first shift is a mindset one: working code and efficient code are not the same thing. Once that clicks, everything else follows.

Profiling: Stop Guessing, Start Measuring

Here's something I noticed while going down this rabbit hole. Most people, when they feel like their code is slow, do one of two things. They either rewrite the whole thing from scratch hoping something improves, or they just accept it and wait.

Neither of those is the right move.

The right move is to measure first. You can't optimize what you haven't identified. And more often than not, the part of your code you think is slow isn't actually the problem.

Pandas gives you a few simple tools to start with.

%timeit — Know How Long Things Actually Take

%timeit is a Jupyter magic command that runs a line of code multiple times and gives you the average execution time. It's the simplest way to compare two approaches and know, concretely, which one is faster.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'sales': np.random.randint(100, 10000, size=100_000),
    'discount': np.random.uniform(0.0, 0.5, size=100_000)
})

# Approach A
%timeit df.apply(lambda row: row['sales'] * row['discount'], axis=1)

# Approach B
%timeit df['sales'] * df['discount']

On a dataset of 100,000 rows, the difference is not subtle:

1.91 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
316 μs ± 14 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Same output. Completely different cost. That's the kind of thing you'd never notice by just running the cell once and moving on.
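If you want to convince yourself the two approaches really do produce the same thing before switching, a quick sanity check is enough. A minimal sketch, reusing the df from above:

# Sanity check: both approaches should produce the same values
slow = df.apply(lambda row: row['sales'] * row['discount'], axis=1)
fast = df['sales'] * df['discount']

print(np.allclose(slow, fast))  # True if the two results match element-wise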

df.info() and df.memory_usage() — Know What You're Carrying

Speed isn't just about computation. Memory plays a huge role too. A dataframe that's bloated with the wrong data types will slow everything down before you've even written a single transformation.

df.info()

Output:


RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   sales     100000 non-null  int64  
 1   discount  100000 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 1.5 MB

To check the memory usage:

df.memory_usage(deep=True)

Output:

Index          132
sales       400000
discount    800000
dtype: int64

Here, we can see that discount is taking up twice the space. That's because discount is stored as a "heavier" number type (float64) while sales is stored in a "lighter" type (int32).

This becomes especially important when you're working with string columns or object types that are secretly eating memory. We'll come back to this in the next section.

The Profiling Mindset

The tools themselves are simple. The shift is in how you approach your code. Before you optimize anything, ask: where is the time actually going? Measure the slow parts. Compare alternatives. Let the numbers tell you what to fix.

Because what feels slow and what is slow are often two entirely different things.

Mistake #1: Row-wise Operations (The Silent Killer)

If there's one thing I kept seeing come up repeatedly while researching this topic, it was this: people looping through Pandas dataframes row by row. And I get it. It feels natural. You think about your data one row at a time, so you write code that processes it one row at a time.

The problem is, that's not how Pandas thinks.

How Pandas Actually Works

Pandas is built on top of NumPy, which stores data in contiguous blocks of memory, column by column. This means Pandas is heavily optimized to operate on entire columns at once. When you do that, it runs fast, low-level, vectorized operations under the hood.

When you loop through rows instead, you're essentially bypassing all of that. You're dropping down into pure Python, one row at a time, with all the overhead that comes with it. On a small dataset you'll never notice. On a large one, you'll be waiting a long time.

There are two patterns that show up constantly.

.iterrows()

# Calculating a discounted price row by row
discounted_prices = []

for index, row in df.iterrows():
    discounted_prices.append(row['sales'] * (1 - row['discount']))

df['discounted_price'] = discounted_prices

This works. It will give you the right answer. But on a dataframe with 100,000 rows, it's painfully slow.

%timeit [row['sales'] * (1 - row['discount']) for index, row in df.iterrows()]

Output:

10.2 s ± 1.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

.apply(axis=1)

This one is sneakier because it looks more "Pandas-like." But applying a function across axis=1 means applying it row by row, which is essentially the same problem.

%timeit df.apply(lambda row: row['sales'] * (1 - row['discount']), axis=1)

Output:

1.5 s ± 88.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Faster than .iterrows(), but still working row by row. Still slow.

The Fix: Vectorized Operations

Here's the same calculation, done the way Pandas actually wants you to do it:

df['discounted_price'] = df['sales'] * (1 - df['discount'])

Let's time it:

%timeit df['sales'] * (1 - df['discount'])

Output:

688 μs ± 236 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That's it. One line. No loop. No lambda. And it's roughly 14,800x faster than .iterrows() and 2,180x faster than .apply(axis=1).

What's happening here is that Pandas passes the entire column to NumPy, which executes the operation at the C level across the whole array at once. No Python overhead. No row-by-row iteration. Just fast, low-level computation.
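If you want to see that handoff for yourself, here's a small sketch of the same calculation done directly on the underlying NumPy arrays; it's essentially what the one-liner above is doing for you:

# The same discounted-price calculation, done directly on the underlying NumPy arrays
sales_arr = df['sales'].to_numpy()
discount_arr = df['discount'].to_numpy()

discounted = sales_arr * (1 - discount_arr)  # one C-level pass over the whole array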

When .apply() Is Actually Fine

To be fair, .apply() isn't always the villain. When you're applying a function column-wise (axis=0, which is the default), it's often perfectly reasonable. The issue is specifically axis=1, which forces row-by-row execution.

And sometimes your logic is genuinely complex enough that a clean vectorized expression isn't obvious. In those cases, np.vectorize() or np.where() can give you something closer to vectorized performance while still letting you express conditional logic clearly.

# Instead of this
df['category'] = df.apply(
    lambda row: 'high' if row['sales'] > 5000 else 'low', axis=1
)

# Do this
df['category'] = np.where(df['sales'] > 5000, 'high', 'low')
%timeit df.apply(lambda row: 'high' if row['sales'] > 5000 else 'low', axis=1)
%timeit np.where(df['sales'] > 5000, 'high', 'low')

Output:

1.31 s ± 189 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.3 ms ± 180 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Same result. About 1,000x faster.

The Rule of Thumb

If you're writing a loop over rows in Pandas, stop and ask yourself: can this be expressed as a column operation? Nine times out of ten, the answer is yes. And when it is, the performance difference is transformative.

If you're looping through rows, you're not using Pandas. You're using Python with extra steps.

Mistake #2: Unnecessary Copies and Memory Bloat

Row-wise operations get a lot of attention when people talk about Pandas performance. Memory gets a lot less. Which is a shame, because from what I've learned, bloated memory is just as responsible for slow notebooks as bad computation.

Here's the thing. Pandas operations don't always modify your dataframe in place. A lot of them quietly create a brand new copy of your data behind the scenes. Do that enough times, and you're not just holding one dataframe in memory. You're holding several, all at once, without realizing it.

The Hidden Cost of Chained Operations

Chained operations are a common culprit. They look clean and readable, but each step can generate an intermediate copy that sits in memory until garbage collection cleans it up.

# Each step here potentially creates a new copy
df2 = df[df['sales'] > 1000]
df3 = df2.dropna()
df4 = df3.reset_index(drop=True)
df5 = df4[['sales', 'discount']]

By the time you get to df5, you potentially have five versions of your data floating around in memory simultaneously. On a small dataset this is invisible. On a large one, this is how you run out of RAM.
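One way to ease this, sketched below with the same column names, is to stop keeping every intermediate as its own named variable. Chaining the steps into a single expression (or calling del on intermediates once you're done with them) lets the temporary objects be garbage-collected right away instead of hanging around for the rest of the notebook:

# One pass: filter to the rows and columns you need, then clean up
df_clean = (
    df.loc[df['sales'] > 1000, ['sales', 'discount']]
    .dropna()
    .reset_index(drop=True)
)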

Temporary Columns That Stick Around

Another pattern that quietly eats memory is creating columns you only needed temporarily.

df['gross_revenue'] = df['sales'] * df['quantity']
df['tax'] = df['gross_revenue'] * 0.075
df['net_revenue'] = df['gross_revenue'] - df['tax']

# But you only actually needed net_revenue

gross_revenue and tax are now permanent columns in your dataframe, taking up memory for the rest of your notebook even though they were just stepping stones.

The fix is simple. Either compute directly:

df['net_revenue'] = (df['sales'] * df['quantity']) * (1 - 0.075)

Or drop them as soon as you're done:

df.drop(columns=['gross_revenue', 'tax'], inplace=True)

Wrong Data Types Are Quietly Expensive

This one surprised me when I came across it. By default, Pandas is quite generous with how much memory it assigns to each column. Integer columns get int64. Float columns get float64. String columns become object type, which is one of the most memory-hungry types in Pandas.

Let's see what that actually looks like:

df = pd.DataFrame({
    'order_id': np.random.randint(1000, 9999, size=100_000),
    'sales': np.random.randint(100, 10000, size=100_000),
    'discount': np.random.uniform(0.0, 0.5, size=100_000),
    'region': np.random.choice(['north', 'south', 'east', 'west'], size=100_000)
})

df.memory_usage(deep=True)

Output:

Index           132
order_id     400000
sales        400000
discount     800000
region      5350066
dtype: int64

That region column, which only has four possible values, is eating 5.3MB as an object type. Convert it to a categorical and watch what happens:

df['region'] = df['region'].astype('category')
df.memory_usage(deep=True)

Output:

Index          132
order_id    400000
sales       400000
discount    800000
region      100386
dtype: int64

From 5.3MB down to about 100KB. For one column. The same logic applies to integer columns where you don't need the full int64 range. If your values fit comfortably in int32 or even int16, downcasting saves real memory.

df['sales'] = df['sales'].astype('int32')
df['order_id'] = df['order_id'].astype('int32')

df.memory_usage(deep=True)

Output:

Index       128
order_id    400000
sales       400000
discount    800000
region      100563
dtype: int64

A few small type changes and your dataframe is already significantly lighter. And a lighter dataframe means faster operations across the board, because there's simply less data to move around.
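If you don't want to pick integer widths by hand, pandas can do the downcasting for you with pd.to_numeric. A small sketch:

# Let pandas pick the smallest numeric type that still fits the values
df['sales'] = pd.to_numeric(df['sales'], downcast='integer')
df['order_id'] = pd.to_numeric(df['order_id'], downcast='integer')
df['discount'] = pd.to_numeric(df['discount'], downcast='float')  # usually float32; fine for a discount column, but it does trade away precision

print(df.dtypes)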

The Quick Memory Check Habit

Before you run any heavy transformation, it's worth knowing what you're working with:

print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

It takes one second and it tells you exactly how much memory your dataframe is consuming at that point. Make it a habit before and after major transformations and you'll quickly develop an intuition for when something is heavier than it should be.
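If you find yourself typing that line a lot, it's easy to wrap in a tiny helper. Just a convenience sketch, not anything Pandas ships with:

def mem_mb(frame):
    # Deep memory usage of a dataframe, in megabytes
    return frame.memory_usage(deep=True).sum() / 1024**2

print(f"Before: {mem_mb(df):.2f} MB")
# ... run your transformation here ...
print(f"After:  {mem_mb(df):.2f} MB")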

The Insight

Slow code isn't always about computation. Sometimes your notebook is slow because it's carrying far more data than it needs to, in formats that are far more expensive than necessary. Trimming memory isn't glamorous work, but it compounds. A dataframe that's lighter to store is faster to filter, faster to merge, faster to transform.

Memory and speed are not separate problems. They're the same problem.

Mistake #3: Overusing Pandas for Everything

This one is a little different from the previous two. It's not about a specific function or a bad habit. It's about knowing the limits of your tool.

Pandas is genuinely great. For most data tasks, especially at the scale most people are working at, it's more than enough. But there's a version of Pandas usage I kept seeing described while researching this: people reaching for Pandas by default, for everything, regardless of whether it's the right fit.

And at a certain scale, that becomes a problem.

The Dataset

To make this real, I generated a synthetic e-commerce dataset with 1 million rows. Nothing exotic, just the kind of data you'd realistically encounter: orders, dates, regions, categories, sales figures, discounts, quantities and statuses.

import pandas as pd
import numpy as np

np.random.seed(42)

n = 1_000_000

regions = ['north', 'south', 'east', 'west']
categories = ['electronics', 'clothing', 'furniture', 'food', 'sports']
statuses = ['completed', 'returned', 'pending', 'cancelled']

df = pd.DataFrame({
    'order_id': np.arange(1000, 1000 + n),
    'order_date': pd.date_range(start='2022-01-01', periods=n, freq='1min'),
    'region': np.random.choice(regions, size=n),
    'category': np.random.choice(categories, size=n),
    'sales': np.random.randint(100, 10000, size=n),
    'quantity': np.random.randint(1, 20, size=n),
    'discount': np.round(np.random.uniform(0.0, 0.5, size=n), 2),
    'status': np.random.choice(statuses, size=n),
})

df.to_csv('large_sales_data.csv', index=False)

One million rows. Saved to a CSV. This is the dataset we'll be working with for the rest of the article.

Where Pandas Starts to Struggle

Pandas loads your entire dataset into memory. That's fine when your data is a few hundred thousand rows. It starts to get uncomfortable at a few million. And beyond that, you're fighting the tool.

The other issue is complex, nested transformations where you're stacking multiple operations, creating intermediate results, and generally asking Pandas to do a lot of heavy lifting in sequence. Each step adds overhead. The costs stack up.

Here's a realistic example using our dataset. Say you need to calculate a rolling average of sales per region, flag orders above a threshold, then aggregate by month:

# Step 1: Sort
df = df.sort_values(['region', 'order_date'])

# Step 2: Rolling average per region
df['rolling_avg'] = (
    df.groupby('region')['sales']
    .transform(lambda x: x.rolling(window=7).mean())
)

# Step 3: Flag high-value orders
df['high_value'] = df['sales'] > df['rolling_avg'] * 1.5

# Step 4: Monthly aggregation
df['month'] = pd.to_datetime(df['order_date']).dt.to_period('M')
monthly_summary = df.groupby(['region', 'month'])['sales'].sum()

This works. But notice that Step 2 uses .transform(lambda x: ...), which carries the same row-adjacent cost we talked about earlier. On 1 million rows, this pipeline will drag. Go ahead and time it on your machine and you'll see exactly what I mean.
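For what it's worth, that particular lambda can be avoided too: groupby objects expose .rolling() directly, which keeps the work inside Pandas instead of calling a Python function per group. A sketch of the same Step 2 without the lambda:

# Step 2 without the Python lambda: rolling mean computed per region
rolling_avg = (
    df.groupby('region')['sales']
    .rolling(window=7)
    .mean()
    .reset_index(level=0, drop=True)  # drop the group level so it aligns with df's index
)
df['rolling_avg'] = rolling_avg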

What to Reach For Instead

The good news is you don't have to abandon Pandas entirely. There are a few options depending on the situation.

Chunking
If your dataset is too large to load all at once, Pandas lets you process it in chunks. Instead of loading all 1 million rows into memory at once, you load and process a portion at a time:

chunk_size = 100_000
results = []

for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    chunk['discounted_price'] = chunk['sales'] * (1 - chunk['discount'])
    results.append(chunk.groupby('region')['discounted_price'].sum())

final_result = pd.concat(results).groupby(level=0).sum()
print(final_result)

Instead of asking Pandas to hold 1 million rows in memory simultaneously, you're feeding it 100,000 rows at a time, processing each chunk, and assembling the results at the end. It's not the most elegant pattern, but it lets you work with data that would otherwise crash your kernel.
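In the same spirit, it's worth knowing that read_csv can skip columns you don't need and assign leaner dtypes at load time, instead of fixing them after the fact. A rough sketch, assuming you only care about the regional sales picture:

# Load only the columns this analysis needs, with lean dtypes from the start
df = pd.read_csv(
    'large_sales_data.csv',
    usecols=['region', 'sales', 'discount'],
    dtype={'region': 'category', 'sales': 'int32', 'discount': 'float32'},
)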

When to Consider Other Tools
Sometimes the honest answer is that Pandas isn't the right tool for the job. That isn't a criticism, it's just scope. A few worth knowing about:

  • Polars: A modern dataframe library built in Rust, designed for speed. It uses lazy evaluation, meaning it optimizes your entire query before executing it. For large datasets it can be dramatically faster than Pandas.
  • Dask: Extends Pandas to work in parallel across multiple cores or even multiple machines. If you're comfortable with Pandas syntax, Dask feels familiar.
  • DuckDB: Lets you run SQL queries directly on your dataframes or CSV files with surprisingly fast performance. Great for aggregations and analytical queries on large data (there's a small sketch of this after the next paragraph).

The point isn't to abandon Pandas. For most everyday data work, it's the right choice. The point is to recognize when you've hit its ceiling, and know that there are good options on the other side of it.
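To give a taste of what that looks like, here's a minimal DuckDB sketch (assuming the duckdb package is installed) that runs a regional aggregation straight against the CSV, without loading the whole file into Pandas first:

import duckdb

# Aggregate discounted sales by region directly from the CSV
result = duckdb.query("""
    SELECT region, SUM(sales * (1 - discount)) AS discounted_sales
    FROM 'large_sales_data.csv'
    GROUP BY region
""").df()

print(result)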

The Real-World Refactor: From 61 Seconds to 0.33 Seconds

This is where everything we've covered stops being theoretical.

I took our 1 million row e-commerce dataset and wrote the kind of Pandas code that feels completely normal. The kind of thing you'd write on a Tuesday afternoon without thinking twice.

Then I timed it.

The Slow Version

import time

df = pd.read_csv('large_sales_data.csv')

start = time.time()

# Row-wise revenue calculation
df['gross_revenue'] = df.apply(
    lambda row: row['sales'] * row['quantity'], axis=1
)
df['tax'] = df.apply(
    lambda row: row['gross_revenue'] * 0.075, axis=1
)
df['net_revenue'] = df.apply(
    lambda row: row['gross_revenue'] - row['tax'], axis=1
)

# Row-wise flagging
df['order_flag'] = df.apply(
    lambda row: 'high' if row['net_revenue'] > 50000 else 'low', axis=1
)

# Final aggregation
result = df.groupby('region')['net_revenue'].sum()

end = time.time()
print(f"Total runtime: {end - start:.2f} seconds")

Output:

Total runtime: 61.78 seconds

Over a minute. For a four-step pipeline. And nothing looks wrong. Let's break down exactly what's making it slow.

Three mistakes, all in one pipeline:

  • First, the data types are never addressed. The region, category and status columns load as generic object types, which are memory-hungry and slow to work with. We're carrying that dead weight through every single operation.
  • Second, there are three separate .apply(axis=1) calls just to calculate revenue. Each one loops through all 1 million rows in Python, one at a time. We already saw earlier how expensive that is. Here we're doing it three times in a row.
  • Third, gross_revenue and tax are created as permanent columns even though they're just intermediate steps. They serve no purpose beyond being stepping stones to net_revenue, but they sit in memory for the rest of the pipeline anyway.

Here's how I'd fix this, step by step.

Step 1: Fix data types upfront
Before anything else, convert the obvious categorical columns:

df['region'] = df['region'].astype('category')
df['category'] = df['category'].astype('category')
df['status'] = df['status'].astype('category')

This alone reduces memory usage significantly and makes subsequent operations cheaper across the board.

Step 2: Replace .apply() with vectorized operations
Instead of three separate row-wise calls, one vectorized expression does the same work:

# Before: three .apply() calls, three passes through 1 million rows
df['gross_revenue'] = df.apply(lambda row: row['sales'] * row['quantity'], axis=1)
df['tax'] = df.apply(lambda row: row['gross_revenue'] * 0.075, axis=1)
df['net_revenue'] = df.apply(lambda row: row['gross_revenue'] - row['tax'], axis=1)

# After: one vectorized expression, no temporary columns
df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)

Step 3: Replace row-wise flagging with np.where()

# Before
df['order_flag'] = df.apply(
    lambda row: 'high' if row['net_revenue'] > 50000 else 'low', axis=1
)

# After
df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')

Same logic. Vectorized. Done.

The Fast Version

Put it all together and the pipeline looks like this:

import time

df = pd.read_csv('large_sales_data.csv')

start = time.time()

# Fix 1: Correct data types upfront
df['region'] = df['region'].astype('category')
df['category'] = df['category'].astype('category')
df['status'] = df['status'].astype('category')

# Fix 2: Vectorized revenue calculation, no temporary columns
df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)

# Fix 3: Vectorized flagging with np.where
df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')

# Final aggregation
result = df.groupby('region')['net_revenue'].sum()

end = time.time()
print(f"Total runtime: {end - start:.2f} seconds")

Output:

Total runtime: 0.33 seconds

61.78 seconds down to 0.33 seconds. A 99.5% reduction in runtime. That's roughly 187x faster.

It's not a trick. That's just how Pandas is meant to be used.

Before You Run Your Next Notebook

Everything we covered comes down to a few core habits. Not rules. Not tricks. Just a different way of thinking about your code before you write it.

  • Think in columns, not rows. If you're looping through a dataframe row by row, stop and ask whether the same thing can be expressed as a column operation. Nine times out of ten, it can.
  • Measure before you optimize. Don't guess where the slowness is coming from. Use %timeit and df.memory_usage() to let the numbers tell you what to fix.
  • Watch your memory, not just your speed. Wrong data types, unnecessary copies and temporary columns all add up. A lighter dataframe is a faster dataframe.
  • Know when to switch tools. Pandas is the right choice most of the time. But at a certain scale, the right optimization is recognizing that you've outgrown it.

I started down this rabbit hole because I kept seeing the same conversation come up in data communities. People frustrated with slow notebooks, code that worked fine on small data and fell apart on real data. I wanted to understand why.

What I found was that the code wasn't broken. It just wasn't built to scale. And the gap between code that works and code that works well isn't about being an advanced Pandas user. It's about a handful of habits applied consistently.

If you've ever waited too long for a notebook to finish and just assumed that was normal, now it doesn't have to be.

If this changed how you think about your Pandas code, I'd love to hear what bottlenecks you've been dealing with. Feel free to say hi on any of these platforms:

Medium

LinkedIn

Twitter

YouTube
