
Building a Regression Model to Predict Delivery Durations: A Practical Guide | by Jimin Kang | Dec, 2024

Data Preparation & Exploratory Analysis

Now that we've outlined our approach, let's take a look at our data and see what kind of features we're working with.
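
The output referenced below is shown as an image in the original post; here is a minimal sketch of the inspection it implies (the file name 'historical_data.csv' is an assumption; substitute your own copy of the dataset):

import pandas as pd

# load the deliveries dataset (file name assumed)
df = pd.read_csv('historical_data.csv')

# summary of columns, dtypes, and non-null counts per feature
df.info()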

From the above, we see our data contains ~197,000 deliveries, with a variety of numeric & non-numeric features. None of the features are missing a large proportion of values (lowest non-null count ~181,000), so we likely won't have to worry about dropping any features entirely.

Let's check whether our data contains any duplicated deliveries, and whether there are any observations for which we can't compute the delivery time.

print(f"Variety of duplicates: {df.duplicated().sum()} n")

print(pd.DataFrame({'Lacking Rely': df[['created_at', 'actual_delivery_time']].isna().sum()}))

We see that all the deliveries are unique. However, there are 7 deliveries that are missing a value for actual_delivery_time, which means we won't be able to compute the delivery duration for those orders. Since there are only a handful of these, we'll remove those observations from our data.
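
A minimal sketch of that removal step, assuming we drop the affected rows directly from df:

# drop the 7 rows with no recorded delivery timestamp
df = df.dropna(subset=['actual_delivery_time'])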

Now, let's create our prediction target. We want to predict the delivery duration (in seconds), which is the elapsed time between when the customer placed the order ('created_at') and when they received the order ('actual_delivery_time').

# convert columns to datetime 
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)

# create prediction target
df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()

The last thing we'll do before splitting our data into train/test is check for missing values. We already saw the non-null counts for each feature above, but let's view the proportions to get a better picture.
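
A minimal sketch of that check, assuming we simply look at the share of nulls per column:

# proportion of missing values per column, from most to least missing
print(df.isna().mean().sort_values(ascending=False))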

We see that the marketplace features ('onshift_dashers', 'busy_dashers', 'outstanding_orders') have the highest proportion of missing values (~8% missing). The feature with the second-highest missing data rate is 'store_primary_category' (~2%). All other features have < 1% missing.

Since none of the features have a high missing count, we won't remove any of them. Later on, we'll look at the feature distributions to help us decide how to appropriately deal with missing observations for each feature.

But first, let's split our data into train/test. We'll proceed with an 80/20 split, and we'll write the test data to a separate file that we won't touch until evaluating our final model.

from sklearn.model_selection import train_test_split
import os

# shuffle
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)

# split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# write test data to a separate file
directory = 'datasets'
file_name = 'test_data.csv'
file_path = os.path.join(directory, file_name)
os.makedirs(directory, exist_ok=True)
test_df.to_csv(file_path, index=False)

Now, let's dive into the specifics of our train data. We'll establish our numeric & categorical features, to make it clear which columns are being referenced in later exploratory steps.

categorical_feats = [
    'market_id',
    'store_id',
    'store_primary_category',
    'order_protocol'
]

numeric_feats = [
    'total_items',
    'subtotal',
    'num_distinct_items',
    'min_item_price',
    'max_item_price',
    'total_onshift_dashers',
    'total_busy_dashers',
    'total_outstanding_orders',
    'estimated_order_place_duration',
    'estimated_store_to_consumer_driving_duration'
]

Let's revisit the categorical features with missing values ('market_id', 'store_primary_category', 'order_protocol'). Since there was little missing data among these features (< 3%), we'll simply impute the missing values with an "unknown" category.

  • This way, we won't have to remove data from other features.
  • Perhaps the absence of feature values holds some predictive power for delivery duration, i.e. these features are not missing at random.
  • Additionally, we'll add this imputation step to our preprocessing pipeline during modeling, so that we won't have to manually duplicate this work on our test set (sketched after the code below).
missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']

train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna("unknown")
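
The fillna above handles the training set directly; here is a minimal sketch of the equivalent pipeline step referenced in the last bullet, assuming scikit-learn's SimpleImputer (not the post's final pipeline):

from sklearn.impute import SimpleImputer

# fill missing categorical values with the literal string "unknown"
# (a sketch of the pipeline step mentioned above; column handling assumed)
categorical_imputer = SimpleImputer(strategy='constant', fill_value='unknown')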

Let's take a look at our categorical features.

pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')

Since 'market_id' & 'order_protocol' have low cardinality, we can visualize their distributions easily. On the other hand, 'store_id' & 'store_primary_category' are high cardinality features. We'll take a deeper look at those later.

import seaborn as sns
import matplotlib.pyplot as plt

categorical_feats_subset = [
    'market_id',
    'order_protocol'
]

# Set up the grid
fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)

# Create barplots for each variable
for i, col in enumerate(categorical_feats_subset):
    sns.countplot(x=col, data=train_df, ax=axes[i])
    axes[i].set_title(f"Frequencies: {col}")

# Adjust layout
plt.tight_layout()
plt.show()

Some key things to note:

  • ~70% of orders placed have a 'market_id' of 1, 2, or 4
  • < 1% of orders have an 'order_protocol' of 6 or 7

Unfortunately, we don't have any additional information about these variables, such as which 'market_id' values are associated with which cities/locations, and what each 'order_protocol' number represents. At this point, asking for more data regarding this information may be a good idea, as it could help with investigating trends in delivery duration across broader region/location categorizations.

Let's take a look at our higher cardinality categorical features. Perhaps each 'store_primary_category' has an associated 'store_id' range? If so, we may not need 'store_id', as 'store_primary_category' would already encapsulate a lot of the information about the store being ordered from.

store_info = train_df[['store_id', 'store_primary_category']]

store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])

Clearly not the case: we see that 'store_id' ranges overlap across levels of 'store_primary_category'.

A quick look at the distinct values and associated frequencies for 'store_id' & 'store_primary_category' shows that these features have high cardinality and are sparsely distributed. In general, high cardinality categorical features can be problematic in regression tasks, particularly for regression algorithms that require only numeric data. When these high cardinality features are encoded, they may enlarge the feature space drastically, making the available data sparse and reducing the model's ability to generalize to new observations in that feature space. For a better & more professional explanation of the phenomenon, you can read more about it here.
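
To make the dimensionality concern concrete, here is a small illustrative check (not part of the original analysis) of how many columns a naive one-hot encoding of 'store_id' would add:

# number of dummy columns a one-hot encoding of store_id would create
print(f"Distinct store_id values: {train_df['store_id'].nunique()}")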

Let's get a sense of how sparsely distributed these features are.

store_id_values = train_df['store_id'].value_counts()

# Plot the histogram
plt.figure(figsize=(8, 5))
plt.bar(store_id_values.index, store_id_values.values, color='skyblue')

# Add titles and labels
plt.title('Value Counts: store_id', fontsize=14)
plt.xlabel('store_id', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

We see that there are a handful of stores with a lot of orders, but the majority of them have far fewer than 100.

To deal with the high cardinality of 'store_id', we'll create another feature, 'store_id_freq', that groups the 'store_id' values by frequency.

  • We'll group the 'store_id' values into 5 different percentile bins, shown below.
  • 'store_id_freq' will have much lower cardinality than 'store_id', but will retain relevant information regarding the popularity of the store the delivery was ordered from.
  • For more inspiration behind this logic, check out this thread.
import numpy as np

def encode_frequency(freq, percentiles) -> str:
    if freq < percentiles[0]:
        return '[0-50)'
    elif freq < percentiles[1]:
        return '[50-75)'
    elif freq < percentiles[2]:
        return '[75-90)'
    elif freq < percentiles[3]:
        return '[90-99)'
    else:
        return '99+'

value_counts = train_df['store_id'].value_counts()
percentiles = np.percentile(value_counts, [50, 75, 90, 99])

# apply encode_frequency to each store_id based on its number of orders
train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))

pd.DataFrame({'Count': train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')

Our encoding shows us that ~60,000 deliveries were ordered from stores categorized in the 90–99th percentile in terms of popularity, whereas ~12,000 deliveries were ordered from stores in the 0–50th percentile of popularity.

Now that we've (tried to) capture relevant 'store_id' information in a lower dimension, let's try to do something similar with 'store_primary_category'.

Let's take a look at the most popular 'store_primary_category' levels.
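
The original post shows these frequencies as a table; a minimal sketch of the lookup it implies:

# most common store_primary_category levels in the training data
print(train_df['store_primary_category'].value_counts().head(20))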

A quick glance shows us that many of these 'store_primary_category' levels are not exclusive of one another (e.g. 'american' & 'burger'). Further investigation reveals many more examples of this kind of overlap.

So, let's try to map these distinct store categories into a few basic, all-encompassing groups.

store_category_map = {
    'american': ['american', 'burger', 'sandwich', 'barbeque'],
    'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean',
              'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],
    'mexican': ['mexican'],
    'italian': ['italian', 'pizza'],
}

def map_to_category_type(category: str) -> str:
    for category_type, categories in store_category_map.items():
        if category in categories:
            return category_type
    return "other"

train_df['store_category_type'] = train_df['store_primary_category'].apply(lambda x: map_to_category_type(x))

value_counts = train_df['store_category_type'].value_counts()

# Plot pie chart
plt.figure(figsize=(6, 6))
value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)
plt.title('Category Distribution')
plt.ylabel('')  # Hide y-axis label for aesthetics
plt.show()

This grouping is perhaps brutally simple, and there could very well be a better way to group these store categories. We'll proceed with it for now for the sake of simplicity.

We've done a good deal of investigation into our categorical features. Let's take a look at the distributions of our numeric features.

# Create grid for boxplots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15))  # Adjust figure size
axes = axes.flatten()  # Flatten the 5x2 axes into a 1D array for easier iteration

# Generate boxplots for each numeric feature
for i, column in enumerate(numeric_feats):
    sns.boxplot(y=train_df[column], ax=axes[i])
    axes[i].set_title(f"Boxplot for {column}")
    axes[i].set_ylabel(column)

# Remove any unused subplots (if any)
for i in range(len(numeric_feats), len(axes)):
    fig.delaxes(axes[i])

# Adjust layout for better spacing
plt.tight_layout()
plt.show()

Boxplots for a subset of our numeric features

Many of the distributions appear more right skewed than they really are due to the presence of outliers.

In particular, there appears to be an order with 400+ items. This seems strange, as the next largest order is less than 100 items.
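
A quick sanity check on that claim (not shown in the original post):

# the few largest order sizes in the training data
print(train_df['total_items'].nlargest(5))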

Let's look more into that 400+ item order.

train_df[train_df['total_items']==train_df['total_items'].max()]