I Spent an Hour on a Data Preprocessing Task Before Asking Gemini

A significant amount of time goes into data preparation for downstream tasks. Whether it involves data cleaning, handling missing values, feature engineering, data preprocessing, or post-processing, this phase takes a lot of time.

So, I was working on a post-processing task where I needed to create a new column in a Pandas DataFrame by extracting values from an existing column, based on the data from two other columns.

I could have directly asked an LLM to write the code (which I usually do), but this time I wanted to do it myself. It was early in the morning and I had a fresh mind, so I was in the mood to tackle some complex data operations.

Here’s what I needed to do. I had a DataFrame with predicted_categories, pred_category_id, and text_predicted_probs columns.

The values in the predicted_categories column are lists of five categories in “category_id - category_description” format.

['80814001 - Freze Uçları', 
 '13003106 - Freze', 
 '80805004 - Sanayi Makineleri', 
 '13003144 - Torna Makinesi', 
 '13003195 - Kumpas']

The text_predicted_probs column has the predicted probabilities of these five categories in order.

[0.943, 0.018, 0.008, 0.006, 0.004]

Therefore, the first value in text_predicted_probs is the probability of the first category in predicted_categories, and so on.

The pred_category_id column shows the predicted category id from another model. What I need is the predicted probability of the category in the pred_category_id column.

I need to get the position of the pred_category_id in the predicted_categories column and then take the value at that position from the text_predicted_probs column.

The drawing below demonstrates what I want to achieve:

In this drawing, I want to get the probability of category 13003106, which is the second item in the list, and its corresponding probability value is 0.018.
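For a single row, the goal can be sketched in plain Python as follows. This is only an illustration: the variable names are made up for the sketch, and the values are the ones from the example row above.

```python
# Values taken from the example row above; variable names are illustrative.
predicted_categories = [
    "80814001 - Freze Uçları",
    "13003106 - Freze",
    "80805004 - Sanayi Makineleri",
    "13003144 - Torna Makinesi",
    "13003195 - Kumpas",
]
text_predicted_probs = [0.943, 0.018, 0.008, 0.006, 0.004]
pred_category_id = 13003106

# Keep only the id part of each "category_id - category_description" string.
category_ids = [item.split("-")[0].strip() for item in predicted_categories]

# Position of the target id, then the probability at the same position.
position = category_ids.index(str(pred_category_id))
pred_category_prob = text_predicted_probs[position]

print(position, pred_category_prob)  # 1 0.018
```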

If we asked Gemini, or another advanced model, we would probably get the answer in seconds. However, I wanted to do it on my own first and then ask Gemini.

Let’s start with reading the dataset into a Pandas DataFrame.

import pandas as pd

outcomes = pd.read_csv("prediction_results.csv")

The values in the predicted_categories column are lists of strings with category ids and category names:

outcomes.loc[0, "predicted_categories"]
# output: "['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']"

It’s a list but stored as a string, so we first convert it to a list object using the literal_eval function in Python’s built-in ast module:

import ast

ast.literal_eval(outcomes.loc[0, "predicted_categories"])
# output:
['80814001 - Freze Uçları',
 '13003106 - Freze',
 '80805004 - Sanayi Makineleri',
 '13003144 - Torna Makinesi',
 '13003195 - Kumpas']

To extract the category ids, we can split each string in this list at the “-” character and then select the first part after splitting. Since we have a list with five categories, we do this operation in a list comprehension as follows:

[category.split("-")[0].strip()
 for category in ast.literal_eval(outcomes.loc[0, "predicted_categories"])]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']

We’ve done it for a single value (i.e. one row). In order to apply the same operation to the entire predicted_categories column, we can use a list comprehension. It will be a list comprehension inside another list comprehension (i.e. a nested list comprehension):

outcomes.loc[:, "predicted_category_ids"] = [
    [category.split("-")[0].strip() for category in ast.literal_eval(predicted_categories)]
    for predicted_categories in outcomes["predicted_categories"]
]

We now have the category ids extracted from the predicted_categories column:

The next step is to check the order of the categories in the predicted category id lists. We will then use this order to extract the predicted probability of the category.

The Python list object has an index method, which returns the index (i.e. position) of an item in the list.

outcomes.loc[0, "predicted_category_ids"]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']


outcomes.loc[0, "predicted_category_ids"].index("13003106")
# output:
1
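One thing to keep in mind is that list.index raises a ValueError when the item is not in the list. A minimal sketch of a guarded lookup is shown below; safe_index is a hypothetical helper name used only for illustration.

```python
def safe_index(items, target):
    """Return the position of target in items, or None if it is absent."""
    # list.index raises ValueError for a missing item, so check membership first.
    return items.index(target) if target in items else None


category_ids = ["80814001", "13003106", "80805004", "13003144", "13003195"]

print(safe_index(category_ids, "13003106"))  # 1
print(safe_index(category_ids, "99999999"))  # None
```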

Once I find the index of a predicted category id, I can use it to get the probability of this category id from the text_predicted_probs column.

What we need to do:

Get the index of pred_category_id in predicted_category_ids
Use this index to extract the related value from text_predicted_probs

These steps can be done in a single operation by zipping these three columns. Let’s test it on the first row:

for i, j, k in zip(outcomes["pred_category_id"][:1], outcomes["predicted_category_ids"][:1], outcomes["text_predicted_probs"][:1]):
    print(j.index(str(i))) # get the index of pred_category_id in predicted_category_ids
    print(ast.literal_eval(k)[j.index(str(i))]) # get the value at this index in text_predicted_probs

# output:
0
0.943

We can confirm the output in the screenshot above. The pred_category_id is 80814001, which is the first item (i.e. index = 0) in predicted_category_ids, and the first probability value is 0.943.

The lists in the text_predicted_probs column are stored as strings as well, so we apply the literal_eval function to convert them to list objects.

We can do this operation on the entire DataFrame using a list comprehension as follows:

outcomes.loc[:, "pred_category_prob"] = [
    float(ast.literal_eval(k)[j.index(str(i))]) if str(i) in j else 0
    for i, j, k in zip(outcomes["pred_category_id"], outcomes["predicted_category_ids"], outcomes["text_predicted_probs"])
]

We basically converted the for loop in the previous code block to a list comprehension. I’ve only added a check, “if str(i) in j else 0”, because there are some rows with a pred_category_id that doesn’t exist in predicted_category_ids. In such cases, we set the probability value to 0.
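The same logic can also be written as a small named function, which may be easier to follow than single-letter loop variables. This is only a sketch: lookup_probability is a hypothetical name, and it assumes both list columns are still stored as strings, as they are in the CSV file.

```python
import ast


def lookup_probability(pred_category_id, predicted_categories, text_predicted_probs, default=0.0):
    """Return the probability listed for pred_category_id, or default if absent."""
    # Both list columns are assumed to be stringified lists, so parse them first.
    categories = ast.literal_eval(predicted_categories)
    probs = ast.literal_eval(text_predicted_probs)
    category_ids = [category.split("-")[0].strip() for category in categories]

    target = str(pred_category_id)
    if target not in category_ids:
        return default
    return float(probs[category_ids.index(target)])


categories_str = "['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri']"
probs_str = "[0.943, 0.018, 0.008]"

print(lookup_probability(13003106, categories_str, probs_str))  # 0.018
print(lookup_probability(99999999, categories_str, probs_str))  # 0.0
```

A function like this could be called once per row by zipping the pred_category_id, predicted_categories, and text_predicted_probs columns.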

The task is now completed, as we can see below:

How did Gemini do this?

After working for about an hour (including breaks), I completed the task. Then, I wanted to ask Gemini to see how it would handle the same task.

I wrote a prompt explaining my goal and included a small sample of the data.

Here’s what Gemini produced on its first attempt:

import pandas as pd

# 1. Create the sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008, 0.006, 0.004]
    ]
}

df = pd.DataFrame(data)

# 2. Define a function to extract the probability matching the category ID
def get_matching_prob(row):
    # Convert ID to string for matching
    target_id_str = str(row['pred_category_id'])

    # Iterate through the categories to find the matching index
    for index, category in enumerate(row['predicted_categories']):
        if category.startswith(target_id_str):
            # Return the corresponding probability from the same index
            return row['text_predicted_probs'][index]

    # Return None (or 0) if no match is found to prevent the code from crashing
    return None

# 3. Create the new column
df['pred_category_prob'] = df.apply(get_matching_prob, axis=1)

# Display the result
print(df[['pred_category_id', 'pred_category_prob']])

It first created a sample DataFrame using the data I provided in the prompt. Then, it wrote a function that does exactly what I explained.

It took me about an hour to write the code myself. Gemini completed the exact same task in seconds based on a prompt that took me only a few minutes to write.

However, there’s a minor issue in Gemini’s code. It uses the apply function, which isn’t vectorized. When working with large datasets, using the apply function (basically a for loop) can increase the run time.
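Whether the difference matters depends on the size of the data, so it may be worth measuring. The sketch below times DataFrame.apply with axis=1 against a plain list comprehension over zipped columns. The data and the row count are synthetic and made up for illustration, and timings will vary from machine to machine.

```python
import time

import pandas as pd

# Synthetic data, made up purely for timing; not the dataset from this post.
n_rows = 2000
df = pd.DataFrame({
    "predicted_categories": [["80814001 - A", "13003106 - B", "80805004 - C"]] * n_rows,
    "pred_category_id": [13003106] * n_rows,
    "text_predicted_probs": [[0.9, 0.07, 0.03]] * n_rows,
})


def get_matching_prob(row):
    """Row-wise lookup, same idea as the function Gemini wrote."""
    target = str(row["pred_category_id"])
    for index, category in enumerate(row["predicted_categories"]):
        if category.startswith(target):
            return row["text_predicted_probs"][index]
    return None


start = time.perf_counter()
via_apply = df.apply(get_matching_prob, axis=1)
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
via_loop = [
    # First probability whose category string starts with the target id, else None.
    next((p for c, p in zip(cats, probs) if c.startswith(str(target))), None)
    for target, cats, probs in zip(
        df["pred_category_id"], df["predicted_categories"], df["text_predicted_probs"]
    )
]
loop_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.4f}s, list comprehension: {loop_seconds:.4f}s")
```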

But the solution is simple. I asked Gemini to write a vectorized version of it and here’s what I got:

import pandas as pd

# Sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri'],
        ['13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001, 13003195],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008],
        [0.6, 0.4]
    ]
}
df = pd.DataFrame(data)

# 1. Explode both list columns simultaneously to maintain alignment between category and probability
df_exploded = df.explode(['predicted_categories', 'text_predicted_probs'])

# 2. Extract the numeric ID from the category string using vectorized regex
df_exploded['extracted_id'] = df_exploded['predicted_categories'].str.extract(r'^(\d+)').astype(float)

# 3. Filter for rows where the extracted ID matches the target 'pred_category_id'
matched = df_exploded[df_exploded['extracted_id'] == df_exploded['pred_category_id']]

# 4. Dedup the index (safety net in case an ID appears twice within the same list)
matched = matched[~matched.index.duplicated(keep='first')]

# 5. Map the extracted probability column back to the original DataFrame using the index
df['pred_category_prob'] = matched['text_predicted_probs']

df

The second solution was perfectly fine and looked simpler than the code I wrote.

So, I spent about an hour on a task that an LLM could have completed in less than five minutes. However, if I didn’t know how Pandas worked, I would have accepted the first solution, which was not the optimal one. It’s a good example of how LLMs can increase productivity, but only if you actually know what you’re doing.
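One difference from the list comprehension version is worth noting: with the explode approach, rows without a match end up as missing values (NaN) rather than 0, because the final assignment aligns on the index. The sketch below shows this on a two-row frame made up for illustration (99999999 is a made-up id with no match) and uses fillna(0.0) to mirror the 0 default. Passing a list of columns to explode assumes pandas 1.3 or newer.

```python
import pandas as pd

# Two illustrative rows; the second pred_category_id is made up and has no match.
df = pd.DataFrame({
    "predicted_categories": [
        ["80814001 - Freze Uçları", "13003106 - Freze"],
        ["13003144 - Torna Makinesi", "13003195 - Kumpas"],
    ],
    "pred_category_id": [13003106, 99999999],
    "text_predicted_probs": [[0.943, 0.018], [0.6, 0.4]],
})

# One row per (category, probability) pair; the original index is repeated.
exploded = df.explode(["predicted_categories", "text_predicted_probs"])
exploded["extracted_id"] = (
    exploded["predicted_categories"].str.extract(r"^(\d+)", expand=False).astype("int64")
)

# Keep the pairs whose id equals the target id, at most one per original row.
matched = exploded[exploded["extracted_id"] == exploded["pred_category_id"]]
matched = matched[~matched.index.duplicated(keep="first")]

# Index alignment leaves NaN for rows without a match; replace it with 0.0.
df["pred_category_prob"] = matched["text_predicted_probs"].astype(float)
df["pred_category_prob"] = df["pred_category_prob"].fillna(0.0)

print(df[["pred_category_id", "pred_category_prob"]])
```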