Wednesday, June 24, 2026
newsaiworld

I Spent an Hour on a Data Preprocessing Task Before Asking Gemini

Admin by Admin
June 24, 2026
in Machine Learning

A significant amount of time goes into data preparation for downstream tasks. Whether it involves data cleaning, handling missing values, feature engineering, data preprocessing, or post-processing, this phase takes a lot of time.

So, I was working on a post-processing task where I needed to create a new column in a Pandas DataFrame by extracting values from an existing column, based on the data in two other columns.

I could have asked an LLM directly to write the code (which I usually do), but this time I wanted to do it myself. It was early in the morning and I had a fresh mind, so I was in the mood to tackle some complex data operations.

Here is what I had to do. I had a DataFrame with predicted_categories, pred_category_id, and text_predicted_probs columns.

The values in the predicted_categories column are lists of five categories in “category_id - category_description” format.

['80814001 - Freze Uçları', 
 '13003106 - Freze', 
 '80805004 - Sanayi Makineleri', 
 '13003144 - Torna Makinesi', 
 '13003195 - Kumpas']

The text_predicted_probs column has the predicted probabilities of these five categories, in the same order.

[0.943, 0.018, 0.008, 0.006, 0.004]

Therefore, the first value in text_predicted_probs is the probability of the first category in predicted_categories, and so on.

The pred_category_id column shows the predicted category id from another model. What I need is the predicted probability of the category in the pred_category_id column.

I need to find the position of the pred_category_id in the predicted_categories column and then take the value at that position from the text_predicted_probs column.

The drawing below demonstrates what I want to achieve:

In this drawing, I want to get the probability of category 13003106, which is the second item in the list; its corresponding probability value is 0.018.
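With plain Python lists, the lookup described above can be sketched as follows. The values come from the example row; the variable names are illustrative and are not columns of the actual dataset.

```python
# Values taken from the example row described above
predicted_categories = [
    "80814001 - Freze Uçları",
    "13003106 - Freze",
    "80805004 - Sanayi Makineleri",
    "13003144 - Torna Makinesi",
    "13003195 - Kumpas",
]
text_predicted_probs = [0.943, 0.018, 0.008, 0.006, 0.004]
pred_category_id = 13003106

# Keep only the id part of each "id - description" string
category_ids = [item.split(" - ")[0] for item in predicted_categories]

# Position of the target id, then the probability at that position
position = category_ids.index(str(pred_category_id))
probability = text_predicted_probs[position]

print(position, probability)
```

Python positions are zero-based, so the second item in the list has position 1.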

If we asked Gemini, or another advanced model, we would probably get the answer in seconds. However, I wanted to do it on my own first and then ask Gemini.

Let’s start by reading the dataset into a Pandas DataFrame.

import ast

import pandas as pd

outcomes = pd.read_csv("prediction_results.csv")

The values in the predicted_categories column are lists of strings with category ids and category names:

outcomes.loc[0, "predicted_categories"]
# output: "['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']"

It is a list, but it is stored as a string, so we first convert it to a list object using the literal_eval function in Python’s built-in ast module:

ast.literal_eval(outcomes.loc[0, "predicted_categories"])
# output:
['80814001 - Freze Uçları',
 '13003106 - Freze',
 '80805004 - Sanayi Makineleri',
 '13003144 - Torna Makinesi',
 '13003195 - Kumpas']
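As an alternative sketch, the same parsing can be done while the file is being read, through the converters parameter of pd.read_csv. The in-memory CSV below is a made-up stand-in for prediction_results.csv, and the name parsed is only illustrative.

```python
import ast
import io

import pandas as pd

# A tiny in-memory CSV standing in for prediction_results.csv (hypothetical content)
csv_text = '''pred_category_id,predicted_categories,text_predicted_probs
80814001,"['80814001 - Freze Uçları', '13003106 - Freze']","[0.943, 0.018]"
'''

# Each converter receives the raw cell text and returns the parsed Python object
parsed = pd.read_csv(
    io.StringIO(csv_text),
    converters={
        "predicted_categories": ast.literal_eval,
        "text_predicted_probs": ast.literal_eval,
    },
)

print(type(parsed.loc[0, "predicted_categories"]))
print(parsed.loc[0, "text_predicted_probs"])
```

With this approach the list columns already hold list objects after loading, so later steps would not need to call literal_eval again.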

To extract the category ids, we can split each string in this list at the “-” character and then select the first part. Since we have a list of five categories, we do this operation in a list comprehension as follows:

[category.split("-")[0].strip()
 for category in ast.literal_eval(outcomes.loc[0, "predicted_categories"])]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']

We have done it for a single value (i.e. one row). To apply the same operation to the entire predicted_categories column, we can use another list comprehension. The result is a list comprehension inside another list comprehension (i.e. a nested list comprehension):

outcomes.loc[:, "predicted_category_ids"] = [
    [category.split("-")[0].strip() for category in ast.literal_eval(predicted_categories)]
    for predicted_categories in outcomes["predicted_categories"]
]

We now have category ids extracted from the predicted_categories column:
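As a small self-contained check of this step, the sketch below runs the same extraction on two made-up rows. The helper extract_ids and the DataFrame sample are hypothetical names introduced here to keep the nested comprehension readable.

```python
import ast

import pandas as pd

# Two hypothetical rows with lists stored as strings, as in the CSV
sample = pd.DataFrame({
    "predicted_categories": [
        "['80814001 - Freze Uçları', '13003106 - Freze']",
        "['13003144 - Torna Makinesi', '13003195 - Kumpas']",
    ]
})

def extract_ids(raw):
    """Parse a stringified list and return the id part of each item."""
    return [item.split("-")[0].strip() for item in ast.literal_eval(raw)]

# One list of ids per row
sample["predicted_category_ids"] = [
    extract_ids(raw) for raw in sample["predicted_categories"]
]

print(sample["predicted_category_ids"].tolist())
```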

The next step is to find the position of the categories in the predicted category id lists. We will then use this position to extract the predicted probability of the category.

The Python list object has an index method, which returns the index (i.e. zero-based position) of an item in the list.

outcomes.loc[0, "predicted_category_ids"]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']


outcomes.loc[0, "predicted_category_ids"].index("13003106")
# output:
1
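One detail worth keeping in mind is that list.index raises a ValueError when the item is not in the list. A minimal sketch of a guarded lookup, using a hypothetical helper named position_of:

```python
category_ids = ["80814001", "13003106", "80805004", "13003144", "13003195"]

def position_of(target, items):
    """Return the zero-based position of target in items, or None if it is absent."""
    try:
        return items.index(target)
    except ValueError:
        # list.index raises ValueError for a missing item
        return None

found = position_of("13003106", category_ids)
missing = position_of("99999999", category_ids)

print(found, missing)
```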

Once I find the index of a predicted category id, I can use it to get the probability of this category id from the text_predicted_probs column.

What we need to do:

  1. Get the index of pred_category_id in predicted_category_ids
  2. Use this index to extract the related value from text_predicted_probs

These steps can be done in a single operation by zipping the three columns. Let’s test it on the first row:

for i, j, k in zip(outcomes["pred_category_id"][:1], outcomes["predicted_category_ids"][:1], outcomes["text_predicted_probs"][:1]):
    print(j.index(str(i))) # get the index of pred_category_id in predicted_category_ids
    print(ast.literal_eval(k)[j.index(str(i))]) # get the value at this index in text_predicted_probs

# output:
0
0.943

We can confirm the output in the screenshot above. The pred_category_id is 80814001, which is the first item (i.e. index = 0) in predicted_category_ids, and the first probability value is 0.943.

The lists in the text_predicted_probs column are stored as strings as well, so we apply the literal_eval function to convert them to list objects.

We can do this operation on the entire DataFrame using a list comprehension as follows:

outcomes.loc[:, "pred_category_prob"] = [
    float(ast.literal_eval(k)[j.index(str(i))]) if str(i) in j else 0
    for i, j, k in zip(outcomes["pred_category_id"], outcomes["predicted_category_ids"], outcomes["text_predicted_probs"])
]

We basically converted the for loop in the previous code block to a list comprehension. I have only added the check “if str(i) in j else 0” because there are some rows with a pred_category_id that does not exist in predicted_category_ids. In such cases, we set the probability value to 0.
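The same per-row logic can also be written as a small named function, which may be easier to read and test than the one-line conditional. This is only a sketch: lookup_prob and its default parameter are hypothetical names, and it uses a dictionary instead of list.index, so a repeated id would keep its last probability rather than its first.

```python
import ast

def lookup_prob(pred_id, category_ids, probs_raw, default=0.0):
    """Return the probability stored for pred_id, or default if it is absent."""
    probs = ast.literal_eval(probs_raw)  # probabilities arrive as a string
    # Map each id to its probability; a repeated id would keep its last value
    prob_by_id = dict(zip(category_ids, probs))
    return float(prob_by_id.get(str(pred_id), default))

ids = ["80814001", "13003106", "80805004", "13003144", "13003195"]
probs_raw = "[0.943, 0.018, 0.008, 0.006, 0.004]"

print(lookup_prob(13003106, ids, probs_raw))
print(lookup_prob(99999999, ids, probs_raw))
```

The function would be called once per row inside the same zip-based list comprehension.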

The task is now completed, as we can see below:

How did Gemini do that?

After working for about an hour (including breaks), I completed the task. Then, I wanted to ask Gemini to see how it would handle the same task.

I wrote a prompt explaining my goal and included a small sample of the data.

Here is what Gemini produced on its first attempt:

import pandas as pd

# 1. Create the sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008, 0.006, 0.004]
    ]
}

df = pd.DataFrame(data)

# 2. Define a function to extract the probability matching the category ID
def get_matching_prob(row):
    # Convert ID to string for matching
    target_id_str = str(row['pred_category_id'])

    # Iterate through the categories to find the matching index
    for index, category in enumerate(row['predicted_categories']):
        if category.startswith(target_id_str):
            # Return the corresponding probability from the same index
            return row['text_predicted_probs'][index]

    # Return None (or 0) if no match is found to prevent the code from crashing
    return None

# 3. Create the new column
df['pred_category_prob'] = df.apply(get_matching_prob, axis=1)

# Display the result
print(df[['pred_category_id', 'pred_category_prob']])

It first created a sample DataFrame using the data I provided in the prompt. Then, it wrote a function that does exactly what I explained.

It took me about an hour to write the code myself. Gemini completed the exact same task in seconds, based on a prompt that took me only a few minutes to write.

However, there is a minor issue in Gemini’s code. It uses the apply function, which is not vectorized. When working with large datasets, using the apply function (basically a for loop) can increase the run time.
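To illustrate why apply with axis=1 behaves like a loop, the sketch below runs the same matching logic once through apply and once through an explicit zip-based list comprehension. The data is a made-up two-row sample, and matching_prob, via_apply, and via_zip are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "predicted_categories": [
        ["80814001 - Freze Uçları", "13003106 - Freze"],
        ["13003144 - Torna Makinesi", "13003195 - Kumpas"],
    ],
    "pred_category_id": [13003106, 13003144],
    "text_predicted_probs": [[0.943, 0.018], [0.6, 0.4]],
})

def matching_prob(pred_id, categories, probs):
    """Return the probability of the category whose string starts with pred_id."""
    target = str(pred_id)
    for position, category in enumerate(categories):
        if category.startswith(target):
            return probs[position]
    return None

# Row-wise apply: the function is called once per row
via_apply = df.apply(
    lambda row: matching_prob(
        row["pred_category_id"],
        row["predicted_categories"],
        row["text_predicted_probs"],
    ),
    axis=1,
)

# The same calls written as an explicit Python loop over zipped columns
via_zip = [
    matching_prob(i, c, p)
    for i, c, p in zip(
        df["pred_category_id"], df["predicted_categories"], df["text_predicted_probs"]
    )
]

print(via_apply.tolist(), via_zip)
```

Both versions call the Python function once per row, which is why neither counts as vectorized.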

But the solution is simple. I asked Gemini to write a vectorized version of it and here is what I got:

import pandas as pd

# Sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri'],
        ['13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001, 13003195],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008],
        [0.6, 0.4]
    ]
}
df = pd.DataFrame(data)

# 1. Explode both list columns simultaneously to maintain alignment between category and probability
df_exploded = df.explode(['predicted_categories', 'text_predicted_probs'])

# 2. Extract the numeric ID from the category string using vectorized regex
#    (expand=False returns a Series instead of a one-column DataFrame)
df_exploded['extracted_id'] = df_exploded['predicted_categories'].str.extract(r'^(\d+)', expand=False).astype(float)

# 3. Filter for rows where the extracted ID matches the target 'pred_category_id'
matched = df_exploded[df_exploded['extracted_id'] == df_exploded['pred_category_id']]

# 4. Dedup the index (safety net in case an ID appears twice within the same list)
matched = matched[~matched.index.duplicated(keep='first')]

# 5. Map the extracted probability column back to the original DataFrame using the index
df['pred_category_prob'] = matched['text_predicted_probs']

df

The second solution was perfectly fine and looked simpler than the code I wrote.
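One difference from my hand-written version is how a missing id is handled: mapping the matches back by index leaves NaN for rows without a match, whereas my list comprehension returned 0. A minimal sketch of aligning the two, assuming pandas 1.3 or later (needed to explode several columns at once) and a made-up sample in which the second id is deliberately absent:

```python
import pandas as pd

df = pd.DataFrame({
    "predicted_categories": [
        ["80814001 - Freze Uçları", "13003106 - Freze"],
        ["13003144 - Torna Makinesi", "13003195 - Kumpas"],
    ],
    "pred_category_id": [13003106, 99999999],  # second id is deliberately absent
    "text_predicted_probs": [[0.943, 0.018], [0.6, 0.4]],
})

# One row per (category, probability) pair; the original row index is repeated
exploded = df.explode(["predicted_categories", "text_predicted_probs"])
exploded["extracted_id"] = (
    exploded["predicted_categories"].str.extract(r"^(\d+)", expand=False).astype("int64")
)

# Keep the pairs whose id equals the target id, at most one per original row
matched = exploded[exploded["extracted_id"] == exploded["pred_category_id"]]
matched = matched[~matched.index.duplicated(keep="first")]

# Index alignment leaves NaN where no pair matched; replace it with 0.0
df["pred_category_prob"] = matched["text_predicted_probs"].astype(float)
df["pred_category_prob"] = df["pred_category_prob"].fillna(0.0)

print(df[["pred_category_id", "pred_category_prob"]])
```

Whether 0 or NaN is the better marker for a missing id depends on how the column is used downstream; NaN keeps the missing cases distinguishable from a genuine probability of 0.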

So, I spent about an hour on a task that an LLM could have completed in less than five minutes. However, if I did not know how Pandas worked, I would have accepted the first solution, which was not the optimal one. It is a good example of how LLMs can increase productivity, but only if you actually know what you are doing.


© 2024 Newsaiworld.com. All rights reserved.
