Predicting Food Delivery Time - Hackathon by IMS Proschool
This article walks through 'Predicting Food Delivery Time', a hackathon hosted on machinehack.com, and takes you through every step of the ML model-building process in detail. So, let's get started.
Problem Statement
The entire world is transforming digitally and our relationship with technology has grown exponentially over the last few years. Technology has made our lives a lot easier by saving time and effort. Today everything is online, from shopping to ordering food. As data scientists, we are going to show that, given the right data, almost anything can be predicted. Here you are provided with data from thousands of restaurants in India on the time they take to deliver food ordered online. Your goal as a data scientist is to predict the online order delivery time based on the given factors.
Datasets
We will be using two datasets — Train data and Test data
The training data is a combination of categorical and numerical features. It also contains some special characters, so we have to transform the data before feeding it to our model.
The test data is similar to the training set, minus the 'Delivery_Time' column (to be predicted using the model).
FEATURES:
- Restaurant: A unique ID that represents a restaurant.
- Location: The location of the restaurant.
- Cuisines: The cuisines offered by the restaurant.
- Average_Cost: The average cost for one person/order.
- Minimum_Order: The minimum order amount.
- Rating: Customer rating for the restaurant.
- Votes: The total number of customer votes for the restaurant.
- Reviews: The number of customer reviews for the restaurant.
- Delivery_Time: The order delivery time of the restaurant. (Target Classes)
Approach
We will create a pipeline and break down the solution into four simple stages.
- Exploring the data and its features
- Data Cleaning
- Data Preprocessing
- Modeling and Predicting
Python Coding
Step 1: Import the relevant libraries in Python.
Step 2: Import Train and Test data sets
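Steps 1 and 2 might look like the sketch below. Since the original MachineHack files are not reproduced here, we build a tiny stand-in frame with the same columns for illustration; the rows and IDs are made up.

```python
import pandas as pd

# In the hackathon you would load the provided files, for example:
#   train = pd.read_excel("Data_Train.xlsx")   # file names are assumptions
#   test  = pd.read_excel("Data_Test.xlsx")
# Here we build a small stand-in frame with the same columns.
train = pd.DataFrame({
    "Restaurant": ["ID_0001", "ID_0002"],          # hypothetical IDs
    "Location": ["Law College Road, Pune", "Sector 3, Marathalli"],
    "Cuisines": ["Fast Food, Rolls, Burger", "Ice Cream, Desserts"],
    "Average_Cost": ["$200", "$100"],
    "Minimum_Order": ["$50", "$50"],
    "Rating": ["3.5", "NEW"],
    "Votes": ["12", "-"],
    "Reviews": ["4", "-"],
    "Delivery_Time": ["30 minutes", "30 minutes"],
})
print(train.shape)
```

Note that every column loads as a string at this point, which is why the cleaning steps below are needed.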
Step 3: Exploring Data and Features
While exploring the dataset thoroughly, we will try to find answers to the following questions:
- Does the table contain any missing or null values?
- What type of data does each column have?
- Can new features be deduced from the existing columns?
- What are the categorical variables that need to be encoded?
- Does any column contain values that are irrelevant or have no significance to the context?
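The questions above map directly onto a few pandas calls. A minimal sketch, using a small illustrative frame in place of the real training data:

```python
import pandas as pd

# Illustrative frame standing in for the real training data.
train = pd.DataFrame({
    "Rating": ["3.5", "NEW", "-"],
    "Votes": ["12", "-", "45"],
    "Average_Cost": ["$200", "$100", "$150"],
})

# Does the table contain any missing or null values?
print(train.isnull().sum())

# What type of data does each column have? (All object/string here.)
print(train.dtypes)

# Which placeholder values appear in a supposedly numeric column?
print(train["Rating"].unique())
```

Running this on the real dataset is what surfaces the observations listed next: no literal nulls, all-string dtypes, and placeholder values such as '-' and 'NEW'.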
Key Observations :
- The Location and Cuisines columns contain multiple values separated by commas.
- The Average_Cost and Minimum_Order columns contain currency symbols and are stored as strings.
- The Rating, Votes and Reviews columns contain invalid values such as '-' and 'NEW'.
- Restaurant, Location, and Cuisines are categorical variables
Step 4: Handling Categorical Variables
1. Location
- Here we will use an NLP technique that is commonly applied to text-like categorical variables. You can study the article below to learn more about NLP.
TF-IDF
Term Frequency (TF)
The number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequency.
Inverse Document Frequency (IDF)
The log of the number of documents divided by the number of documents that contain the word w. Inverse document frequency determines the weight of rare words across all documents in the corpus.
Lastly, the TF-IDF is simply the TF multiplied by IDF.
Then we apply the TF-IDF vectorizer to our Location field.
Finally, we can compute the TF-IDF scores for all the words in the corpus.
2. Cuisines
We repeat the same process as for 'Location'. You can find the code in the GitHub link below.
Step 5: Data Cleaning
Data cleaning is a very important stage that directly affects the performance of a machine learning model. Some columns contain special values such as ['-', 'NEW', 'Opening Soon', 'Temporarily Closed'] mixed in with numeric values, so we replace those entries with the column mean. Similarly, the 'Average_Cost' column contains a '$' sign, which we strip out before converting the values to numbers.
1. Rating, Votes and Reviews
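One way to implement this cleaning is sketched below, assuming we replace the placeholder values with NaN, cast to float, and then fill with the column mean; the sample values are illustrative.

```python
import pandas as pd
import numpy as np

# Illustrative values; the real columns contain the same placeholders.
train = pd.DataFrame({
    "Rating": ["3.5", "NEW", "-", "4.0"],
    "Votes": ["12", "-", "45", "8"],
    "Reviews": ["4", "-", "20", "2"],
})

placeholders = ["-", "NEW", "Opening Soon", "Temporarily Closed"]
for col in ["Rating", "Votes", "Reviews"]:
    # Turn placeholders into NaN, cast to numeric, fill with the mean.
    train[col] = train[col].replace(placeholders, np.nan).astype(float)
    train[col] = train[col].fillna(train[col].mean())

print(train)
```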
2. Average_Cost & Minimum_Order
We replace the '$' sign with an empty string and convert the column to integer.
Some values still contain commas (e.g. '1,200'), so we remove those as well.
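Both replacements can be done with string operations before the integer cast; a minimal sketch with illustrative values:

```python
import pandas as pd

# Illustrative values mirroring the '$' and comma issues in the data.
train = pd.DataFrame({
    "Average_Cost": ["$200", "$1,200", "$150"],
    "Minimum_Order": ["$50", "$99", "$50"],
})

for col in ["Average_Cost", "Minimum_Order"]:
    # Strip the currency symbol and thousands separator, then cast.
    train[col] = (
        train[col]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype(int)
    )

print(train.dtypes)
```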
Once cleaned, we can finally inspect the data.
Step 6: Modeling and Predicting
Finally, we are on to building a simple classifier that can predict and evaluate on our sample data. We will use a plain XGBoost classifier without any parameter tuning.
Final Word
In this type of problem, feature engineering and NLP are the most crucial steps. You can see how we handled the categorical and numerical data, and how we built an ML model on the same dataset.
Lastly, you can further improve the model by tuning the different parameters used in it.
Please let me know your thoughts about this article and do comment if you face any issues.
As always, I welcome feedback and constructive criticism. I can be reached on snehanshu.sengupta1991@gmail.com
Sources: