
How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset

Image by Editor

 

# Introduction

 
According to CrowdFlower’s survey, data scientists spend 60% of their time organizing and cleaning data.

In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records, each of which includes dozens of features such as delivery time, total items, and store category (e.g., Mexican, Thai, or American cuisine).

 

# Predicting Food Delivery Times with DoorDash Data

 
Predicting Food Delivery Times with DoorDash Data
 
DoorDash aims to estimate accurately how long it takes to deliver food, from the moment a customer places an order to the time it arrives at their door. In this data project, we are tasked with developing a model that predicts the total delivery duration based on historical delivery data.

However, we won’t do the whole project; that is, we won’t build a predictive model. Instead, we’ll use the dataset provided in the project and create a data cleaning pipeline.

Our workflow consists of two main steps.

 
Data Cleaning Pipeline
 

 

# Data Exploration

 
Data Cleaning Pipeline
 

Let’s begin by loading and viewing the first few rows of the dataset.

 

// Load and Preview the Dataset

import pandas as pd
df = pd.read_csv("historical_data.csv")
df.head()

 

Here is the output.

 
Data Cleaning Pipeline
 

This dataset includes datetime columns that capture the order creation time and the actual delivery time, which can be used to calculate the delivery duration. It also contains other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We can already see that there are some NaN values, which we’ll explore more closely in the following step.

 

// Explore the Columns With info()

Let’s check all column names with the info() method. We will use this method throughout the article to see the changes in column value counts; it’s a good indicator of missing data and overall data health.
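
Here is the code.

df.info()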

 

Here is the output.

 
Data Cleaning Pipeline
 

As you can see, we have 15 columns, but the number of non-null values differs across them. This means some columns contain missing values, which could affect our analysis if not handled properly. One last thing: the created_at and actual_delivery_time data types are objects; these should be datetime.

 

# Building the Data Cleaning Pipeline

 
In this step, we build a structured data cleaning pipeline to prepare the dataset for modeling. Each stage addresses common issues such as timestamp formatting, missing values, and irrelevant features.
 
Building the Data Cleaning Pipeline
 

// Fixing the Date and Time Columns’ Data Types

Before doing any data analysis, we need to fix the columns that record the time. Otherwise, the calculation we mentioned (actual_delivery_time - created_at) will go wrong.

What we’re fixing:

  • created_at: when the order was placed
  • actual_delivery_time: when the food arrived

These two columns are stored as objects, so to be able to do calculations correctly, we have to convert them to the datetime format. To do that, we can use the datetime functions in pandas. Here is the code.

import pandas as pd
df = pd.read_csv("historical_data.csv")
# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()

 

Here is the output.

 
Building the Data Cleaning Pipeline
 

As you can see from the screenshot above, created_at and actual_delivery_time are datetime objects now.
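
With proper datetime types, the delivery duration calculation mentioned earlier becomes a simple subtraction. Here is a minimal sketch; the delivery_duration_sec column name is our own and not part of the project's schema.

# Illustrative check, not part of the original pipeline: compute the total
# delivery duration in seconds using the freshly converted datetime columns
df["delivery_duration_sec"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()
df["delivery_duration_sec"].describe()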

 
Building the Data Cleaning Pipeline
 

Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we’ll focus on cleaning it first.
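
If you don’t want to read the counts off the info() output by hand, a one-liner can confirm that ordering (this check is not part of the original walkthrough):

# Missing values per column, largest first
df.isna().sum().sort_values(ascending=False).head()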

 

// Data Imputation With mode()

One of the messiest columns in the dataset, evident from its high number of missing values, is store_primary_category. It tells us what kind of food stores are available, like Mexican, American, and Thai. However, many rows are missing this information, which is a problem. For instance, it can limit how we group or analyze the data. So how can we fix it?

We will fill these rows instead of dropping them. To do that, we will use smarter imputation.

We build a mapping from each store_id to its most frequent category, and then use that mapping to fill in the missing values. Let’s see the dataset before doing that.

 
Data Imputation With mode
 

Here is the code.

import numpy as np

# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]

# Build a store-level mapping to the most frequent category (fast and robust)
store_mode = (
    df.groupby("store_id")["store_primary_category"]
      .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)

# Fill missing categories using the store-level mode, then fall back to the global mode
df["store_primary_category"] = (
    df["store_primary_category"]
      .fillna(df["store_id"].map(store_mode))
      .fillna(global_mode)
)

df.info()

 

Here is the output.

 
Data Imputation With mode
 

As you can see from the screenshot above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.

df["store_primary_category"].isna().sum()

 

Here is the output showing the number of NaN values. It’s zero; we got rid of all of them.

 
Data Imputation With mode
 

And let’s see the dataset after the imputation.

 
Data Imputation With mode

 

// Dropping Remaining NaNs

In the previous step, we corrected store_primary_category, but did you notice something? The non-null counts across the columns still don’t match!

This is a clear sign that we are still dealing with missing values in some parts of the dataset. Now, when it comes to data cleaning, we have two options:

  • Fill these missing values
  • Drop them

Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you’d have to be more careful. In that case, it’s advisable to analyze each column, establish standards (decide how missing values will be filled, using the mean, median, most frequent value, or domain-specific defaults), and then fill them.
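
For reference, here is a minimal sketch of that filling route, assuming you chose to impute the numeric columns with their medians instead of dropping rows; we do not use it in this pipeline.

# Illustrative alternative to dropna(): fill numeric gaps with column medians,
# working on a copy so the main DataFrame stays untouched
df_filled = df.copy()
num_cols = df_filled.select_dtypes(include="number").columns
df_filled[num_cols] = df_filled[num_cols].fillna(df_filled[num_cols].median())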

To remove the NaNs, we will use the dropna() method from the pandas library. We set inplace=True to apply the changes directly to the DataFrame without needing to assign it again. Let’s see the dataset at this point.

 
Dropping NaNs
 

Here is the code.

df.dropna(inplace=True)
df.info()

 

Here is the output.

 
Dropping NaNs
 

As you can see from the screenshot above, each column now has the same number of non-null values.

Let’s see the dataset after all of the changes.

 
Dropping NaNs
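
Since the goal is a pipeline rather than a set of one-off notebook cells, the steps above can also be wrapped into a single reusable function. Here is a minimal sketch under the same assumptions as before; the clean_doordash_data name is ours.

import numpy as np
import pandas as pd

def clean_doordash_data(path: str) -> pd.DataFrame:
    """Load the raw CSV and apply the three cleaning steps from this article."""
    df = pd.read_csv(path)

    # 1. Fix the timestamp data types
    for col in ["created_at", "actual_delivery_time"]:
        df[col] = pd.to_datetime(df[col], errors="coerce")

    # 2. Impute store_primary_category with the store-level mode, then the global mode
    global_mode = df["store_primary_category"].mode().iloc[0]
    store_mode = df.groupby("store_id")["store_primary_category"].agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
    )
    df["store_primary_category"] = (
        df["store_primary_category"]
          .fillna(df["store_id"].map(store_mode))
          .fillna(global_mode)
    )

    # 3. Drop the remaining rows with missing values
    return df.dropna()

clean_df = clean_doordash_data("historical_data.csv")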
 

 

// What Can You Do Next?

Now that we have a clean dataset, here are a few things you can do next:

  • Perform EDA to understand delivery patterns.
  • Engineer new features like delivery hour or busy dasher ratio to add more meaning to your analysis (see the sketch after this list).
  • Analyze correlations between variables to improve your model’s performance.
  • Build different regression models and find the best-performing one.
  • Predict the delivery duration with the best-performing model.
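
Here is a minimal sketch of the second bullet, assuming the dataset exposes total_busy_dashers and total_onshift_dashers columns; adjust the names to the actual schema.

# Illustrative feature engineering; the dasher column names are assumptions
df["order_hour"] = df["created_at"].dt.hour
df["busy_dasher_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]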

 

# Final Thoughts

 
In this article, we cleaned a real-life dataset from DoorDash by addressing common data quality issues, such as fixing incorrect data types and handling missing values. We built a simple data cleaning pipeline tailored to this data project and explored potential next steps.

Real-world datasets can be messier than you think, but there are also many methods and techniques to solve these issues. Thanks for reading!
 
 

Nate Rosidi is a data scientist and in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


