Uber Case Study: EDA
An analysis of the Uber request dataset with relevant illustrations using seaborn and matplotlib visualization libraries.
Uber is a cab service provider for people wanting to travel from one place to another. Here, I have taken an Uber request dataset from Kaggle to try and perform analysis using the visualization libraries such as seaborn and matplotlib. At the end of this article, I have given a link to my Kaggle notebook where I have performed a detailed analysis of this Uber dataset.
The steps taken to perform this analysis are:
- Understanding the dataset
- Cleaning the data
- Draw meaningful conclusions using the visualization libraries.
Let us jump right into the analysis and see what can be understood to make relevant conclusions.
1.Understanding the dataset
Before moving on to understanding the fields/observations in the data, let us import the required python libraries required for this analysis.
We will import the dataset and store it as a data frame for our future analysis. To see what features the data contains, we use the head () function which by default prints the first 5 rows of the dataset.
Let us see the shape of the data for the number of columns and rows it has. Here, we use the attribute shape.
This means our dataset has 6745 rows which is the number of times users have requested for cab service and 6 columns which provide details on the different aspects of the dataset.
We will see for the numbers of NaNs or missing values in each column and also see in what percentage do they have the missing data.
Columns “Driver id” and “Drop timestamp” are the only columns with missing values with the percentages 39 and 58 respectively.
Let us see how we can get the information about the structure of the dataset into consideration. We use the function info().
Number of Rows: 6745
Number of Columns: 6
The “Dtype” of each column based on the type of data it holds: There are 2 numerical(int and float) and 4 object columns
To find the statistical summary for different parameters for both object and numerical data, we use “describe(include= “all”)” function where numerical parameters such as mean, standard deviation, minimum & maximum value and object parameters such as count, unique, top and frequency of occurrence are shown.
2. Cleaning/Handling the data
We see that for the column “Request timestamp”, the format of DateTime is different for different sections of data. Some have it separated by “-” and some have it separated by “/”. Let us first replace the “/” with “-” to have uniformity and then convert the entire column to standard DateTime format using the “pd.to_datetime” function. Refer to lines 10,11 and 12.
After doing the DateTime conversion, if we pull the info again, we see that “Request timestamp” is converted to “DateTime”.(Highlighted in yellow).
Similarly, convert the “Drop timestamp” column from “object” to “DateTime”
Pulling up the info again, now we can see both “Request timestamp” and “Drop timestamp” are converted to DateTime standard. (Highlighted in yellow).
We now add 2 columns “req_hour”(which is the Hour of the request during the day) and “req_day”(which is the day of the month) to determine and categorize a load of cab service requests.
As we had seen earlier during understanding the data step of the process, there were two columns “Driver id” and “Drop timestamp” with missing values. We have to understand whether the missing values are genuine or they are present due to something going wrong during data collection.
The NaNs/missing values in the column “Driver_Id” can be ignored.
This is because we see that since there were NO CARS AVAILABLE at point of the day after the user tried to book a cab, no driver was allotted the trip, and hence the driver_id is missing. Similarly, we can ignore the NaNs/missing values in the column “Drop timestamp” as for all of them, the trip is either CANCELLED or NO CARS AVAILABLE. In both the columns, the data is missing due to a genuine reason and not that it got lost during the data collection. Hence, I have not substituted them with any other values based on any logic
3. Visualize and Analyze
Let us see the plot of the “Status” of the trip at different hours of the day and also pick up locations. We will use the seaborn viz library for this activity.
The plot of Status of the trip at different hours of the day and also pick up locations shows that,
1) Between hours 5 AM-9 AM, the load on cabs is high with an almost equal amount of trips getting completed and canceled.
2) Between hours 5 PM-9 PM, the load on cabs is significantly high. Hence, there is a mismatch between cab demand and availability. Hence, we see more of “No cars Available Status”.
3) Between hours 5 AM-9 AM, the users from the city is significantly high.
4) Between hours 5 PM-9 PM, the users from the Airport is significantly high.
We will now add a new column “Time_Slot” to create categories of hours from the “req_hour” column.
After running the above code, we now see a new column “Time_Slot” added with the relevant time categories.
Let us see the count of each category that we have created in the above step using the value_counts( ) function.
You see from the above value counts, the “Morning_Rush” and “Evening_Rush” are the hours with maximum load.
From the above plot, we deduce that users booking for cab services in the morning are significantly high from “City” as compared to from “Airport”.
Let us take up the “Morning_Rush” hour and pickup point as “City” and see the status of the trips.
From the above pie chart, we see that nearly 49% of the users canceled their trips. Only 28% of the trips were completed and for 22% of the trips, there were no cars available.
Let us take up the “Evening_Rush” hour and pickup point as “Airport” and see the status of the trips.
From the above pie chart, we see that nearly 6% of the users canceled their trips. Only 21% of the trips were completed and for 73% of the trips, there were no cars available.
- We understood the dataset with the number of user requests that were done and the number of columns(6745,6) along with other facts such as number/percentage of NaNs in each column and format of DateTime in the request and drop timestamp columns.
- We standardized the format of DateTime in the request and drop timestamp columns.
- Figured a logical reason as to why the NaNs in Driver_id and drop timestamp columns should be ignored.
- Extracted the day number and hour from the request timestamp column to perform a deeper analysis.
- Between hours 5 AM-9 AM, the load on cabs are high with an almost equal amount of trips getting completed and canceled.
- Between hours 5 PM-9 PM, the load on cabs is significantly high. Hence, there is a mismatch between cab demand and availability. Hence, we see more of “No cars Available Status”.
- Between hours 5 AM-9 AM, the users from the city are significantly high.
- Between hours 5 PM-9 PM, the users from the Airport are significantly high.
- The “Morning_Rush” and “Evening_Rush” are the hours with maximum load(i.e more number of users requesting cab services).
- We also saw the load during “Morning_rush” and “Evening_Rush” from both pick-up points “City” and “Airport”.
- During the evening rush hour, we saw a significant number of No CARS AVAILABLE status for the trip bookings from “Airport”.
I have tried to include as many steps as possible to maintain a flow for the steps taken in the analysis. Here is the link to my Kaggle notebook where I have worked on this analysis with step by step explanation.