Predicting Food Delivery Time: Hackathon by IMS Proschool

Snehanshu Sengupta
Published in Code To Express
Dec 3, 2019

This article walks through ‘Predicting Food Delivery Time’, a hackathon hosted on machinehack.com. It takes you through each step in detail and helps you understand the whole ML model building process. So, let’s get started.

Problem Statement

The entire world is transforming digitally and our relationship with technology has grown exponentially over the last few years. We have grown closer to technology, and it has made our life a lot easier by saving time and effort. Today everything is online, from shopping to ordering food. As data scientists, we are going to prove that, given the right data, anything can be predicted. Here we are providing you with data from thousands of restaurants in India regarding the time they take to deliver food for online orders. As data scientists, your goal is to predict the online order delivery time based on the given factors.

Datasets

We will be using two datasets: train data and test data.

Training data (11094 rows): training data refers to the portion of data used to fit a model.

The training data is a mix of categorical and numerical columns. Some columns also contain special characters, so we have to transform the data before applying it to our model.

Test data

The test data is similar to the training data set, minus the ‘Delivery_Time’ column (to be predicted using the model).

FEATURES:

  • Restaurant: A unique ID that represents a restaurant.
  • Location: The location of the restaurant.
  • Cuisines: The cuisines offered by the restaurant.
  • Average_Cost: The average cost for one person/order.
  • Minimum_Order: The minimum order amount.
  • Rating: Customer rating for the restaurant.
  • Votes: The total number of customer votes for the restaurant.
  • Reviews: The number of customer reviews for the restaurant.
  • Delivery_Time: The order delivery time of the restaurant. (Target Classes)

Approach

We will create a pipeline and break down the solution into four simple stages.

  • Exploring the data and its features
  • Data Cleaning
  • Data Preprocessing
  • Modeling and Predicting

Python Coding

Step 1: Import the relevant libraries in Python.

Step 2: Import Train and Test data sets
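The original code for Steps 1 and 2 is not reproduced here, so what follows is only a minimal sketch of what they might look like. It assumes pandas, and it reads small in-memory CSV strings instead of the real hackathon files. The sample rows and the names `TRAIN_CSV` and `TEST_CSV` are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real hackathon files.
TRAIN_CSV = '''Restaurant,Location,Cuisines,Average_Cost,Minimum_Order,Rating,Votes,Reviews,Delivery_Time
ID_1,"Station Road, Pune","Fast Food, Rolls",$200,$50,3.5,12,4,30 minutes
ID_2,"Market Street, Delhi","North Indian, Chinese",$100,$50,NEW,-,-,45 minutes
'''
TEST_CSV = '''Restaurant,Location,Cuisines,Average_Cost,Minimum_Order,Rating,Votes,Reviews
ID_3,"Market Road, Pune",Desserts,"$1,000",$50,4.1,120,30
'''

# With real files you would pass a path instead of a StringIO object,
# for example pd.read_csv("train.csv") (hypothetical file name).
train = pd.read_csv(io.StringIO(TRAIN_CSV))
test = pd.read_csv(io.StringIO(TEST_CSV))

print(train.shape, test.shape)
```

The test frame has one column fewer than the train frame because ‘Delivery_Time’ is the value to be predicted.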

Step 3: Exploring Data and Features

While exploring the dataset thoroughly, we will try to find answers to the following questions:

  • Does the table contain any missing or null values?
  • What type of data does each column have?
  • Can new features be deduced from the existing columns?
  • What are the categorical variables that need to be encoded?
  • Does any column contain values that are irrelevant or have no significance to the context?

Key Observations:

  • The Location and Cuisines columns contain multiple values separated by commas.
  • The Average_Cost and Minimum_Order columns contain symbols and are stored as strings.
  • The Rating, Votes and Reviews columns contain invalid values such as ‘-’, ‘NEW’, etc.
  • Restaurant, Location, and Cuisines are categorical variables.
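The questions and observations above can be checked with a few pandas calls. This is a rough sketch on a made-up sample frame, not the original notebook code. The helper `non_numeric_values` is a hypothetical name introduced here.

```python
import pandas as pd

# Made-up rows that mimic the issues described in the observations.
df = pd.DataFrame({
    "Restaurant": ["ID_1", "ID_2", "ID_3"],
    "Location": ["Station Road, Pune", "Market Street, Delhi", "Station Road, Pune"],
    "Cuisines": ["Fast Food, Rolls", "North Indian, Chinese", "Desserts"],
    "Average_Cost": ["$200", "$100", "$1,000"],
    "Minimum_Order": ["$50", "$50", "$50"],
    "Rating": ["3.5", "NEW", "-"],
    "Votes": ["12", "-", "-"],
    "Reviews": ["4", "-", "-"],
})

# Does the table contain missing values, and what type is each column?
null_counts = df.isnull().sum()
column_types = df.dtypes


def non_numeric_values(series):
    """Return the distinct entries of a column that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    return sorted(series[parsed.isna()].dropna().unique())


invalid = {col: non_numeric_values(df[col]) for col in ["Rating", "Votes", "Reviews"]}

print(null_counts)
print(column_types)
print(invalid)
```

On this sample every column is read as a string (`object` dtype), which is the cue that the numeric-looking columns need cleaning before modeling.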

Step 4: Handling Categorical Variables

1. Location

Here we will use some NLP techniques to handle the categorical variables. You can study the NLP guide listed under Sources to learn more about NLP.
Removal of commas and other special characters
Bag of words details
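To make these two steps concrete, here is a small pure-Python sketch of stripping commas and special characters and then building a bag of words. The location strings are made up, and `clean_text` is a hypothetical helper, not the function used in the original notebook.

```python
import re
from collections import Counter


def clean_text(value):
    """Lowercase the text and replace commas and special characters with spaces."""
    text = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    return " ".join(text.split())


# Made-up location strings for illustration.
locations = ["Station Road, Pune", "Market Street, Delhi", "Station Road, Pune"]
cleaned = [clean_text(value) for value in locations]

# Bag of words: a shared vocabulary plus one row of word counts per document.
vocabulary = sorted({word for doc in cleaned for word in doc.split()})
bag_of_words = [[Counter(doc.split())[word] for word in vocabulary] for doc in cleaned]

print(cleaned)
print(vocabulary)
print(bag_of_words)
```

Each row of `bag_of_words` lines up with `vocabulary`, so a location becomes a fixed-length numeric vector that a model can consume.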

TF-IDF

Term Frequency (TF)

The number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequency.

Inverse Document Frequency (IDF)

The log of the total number of documents divided by the number of documents that contain the word w. Inverse document frequency gives more weight to words that are rare across the documents in the corpus.

Lastly, the TF-IDF is simply the TF multiplied by IDF.
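The three definitions above translate almost directly into code. The sketch below uses the natural log and a made-up corpus of tokenized locations. It assumes the word appears in at least one document, otherwise the IDF division would fail.

```python
import math

# Made-up corpus: each document is a list of words.
docs = [
    ["station", "road", "pune"],
    ["market", "street", "delhi"],
    ["station", "road", "pune"],
]


def term_frequency(word, doc):
    """Occurrences of the word in the document divided by the document length."""
    return doc.count(word) / len(doc)


def inverse_document_frequency(word, docs):
    """Log of (number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)


def tf_idf(word, doc, docs):
    """TF multiplied by IDF."""
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)


common = tf_idf("pune", docs[0], docs)   # "pune" appears in 2 of 3 documents
rare = tf_idf("delhi", docs[1], docs)    # "delhi" appears in 1 of 3 documents
print(common, rare)
```

Both words have the same term frequency (1/3), so the rarer word ends up with the higher score purely because of its IDF.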

We then apply the TF-IDF vectorizer to our Location field.

Finally, we can compute the TF-IDF scores for all the words in the corpus.
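The original vectorizer code is not shown here, so the following is a sketch that assumes scikit-learn's `TfidfVectorizer` with default settings. The sample frames are made up. The vectorizer is fitted on the train data only and then reused on the test data so that both share the same columns.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample data for illustration.
train = pd.DataFrame({"Location": ["Station Road, Pune", "Market Street, Delhi", "Station Road, Pune"]})
test = pd.DataFrame({"Location": ["Market Road, Pune"]})

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the train data only.
train_location = vectorizer.fit_transform(train["Location"])
# Reuse the fitted vocabulary on the test data.
test_location = vectorizer.transform(test["Location"])

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_location.toarray().round(2))
```

Note that scikit-learn's defaults lowercase the text, drop single-character tokens, smooth the IDF term and L2-normalize each row, so the scores will not match a plain TF multiplied by IDF calculation exactly.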

2. Cuisines

We repeat the same process as for ‘Location’ above. You can get the code from the GitHub link below.

Step 5: Data Cleaning

Data cleaning is a very important stage that directly affects the performance of a machine learning model. Some of the columns contain non-numeric entries such as [‘-’, ‘NEW’, ‘Opening Soon’, ‘Temporarily Closed’], while the rest of the values in the same columns are numeric, so we replace those entries with the column mean. The ‘Average_Cost’ column also contains a ‘$’ sign, which we replace with a blank.

1. Rating, Votes and Reviews

Function to get non-numerical values
Replacing the values with mean values
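One way to do this replacement is sketched below. It is not the original notebook code: `fill_invalid_with_mean` is a hypothetical helper, and it treats every entry that cannot be parsed as a number as missing, which is slightly broader than the explicit list above.

```python
import pandas as pd


def fill_invalid_with_mean(series):
    """Turn non-numeric entries into NaN, then fill them with the column mean."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.fillna(numeric.mean())


# Made-up sample data containing the invalid markers mentioned in the text.
df = pd.DataFrame({
    "Rating": ["3.5", "NEW", "4.5", "-"],
    "Votes": ["10", "-", "30", "Opening Soon"],
})

for col in ["Rating", "Votes"]:
    df[col] = fill_invalid_with_mean(df[col])

print(df)
```

When cleaning the test data, it is usually safer to fill with the mean computed on the train data so that both sets are treated the same way.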

2. Average_Cost & Minimum_Order

We replace the ‘$’ with a blank value and convert the column to integer.

Replacing the value

Some values still contain commas, so we remove those using the function below.
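Both the currency symbol and the commas can be handled in one pass by keeping only the digits. This is a sketch under that assumption, with made-up values and a hypothetical helper named `to_amount`.

```python
import pandas as pd


def to_amount(series):
    """Strip everything except digits, then parse what is left as a number."""
    digits = series.astype(str).str.replace(r"[^0-9]", "", regex=True)
    # Entries with no digits at all become NaN instead of raising an error.
    return pd.to_numeric(digits, errors="coerce")


# Made-up sample data for illustration.
df = pd.DataFrame({
    "Average_Cost": ["$200", "$1,000", "$50"],
    "Minimum_Order": ["$50", "$0", "$99"],
})

for col in ["Average_Cost", "Minimum_Order"]:
    # The integer cast is safe here because every sample value has digits;
    # columns with NaN would need filling first.
    df[col] = to_amount(df[col]).astype(int)

print(df)
```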

Once cleaned, we can finally look at the data.

Step 6: Modeling and Predicting

Finally, we move on to building a simple classifier that we can train and evaluate on our sample data. We will use a simple XGBoost classifier without any parameter tuning.

Splitting into train and test
Test data
XGBoost parameters

Final Word

In this type of problem, feature engineering and NLP are the most crucial parts. You have seen how we handled the categorical and numerical data and how we built an ML model on the same dataset.
Lastly, you can further improve the model by tuning the different parameters used in the model.
Please let me know your thoughts about this article and do comment if you face any issues.

As always, I welcome feedback and constructive criticism. I can be reached at snehanshu.sengupta1991@gmail.com

Sources :

  1. https://analyticsindiamag.com/predict-the-food-delivery-time-hackathon-solution/
  2. https://medium.com/code-to-express/flight-price-prediction-7c83616a13bb
  3. https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
