Data Analysis

Uber Case Study: EDA

An analysis of the Uber request dataset with relevant illustrations using seaborn and matplotlib visualization libraries.

Suhas V S
Oct 23 · 7 min read

Uber is a cab service provider for people who want to travel from one place to another. Here, I have taken an Uber request dataset from Kaggle and analyzed it using the seaborn and matplotlib visualization libraries. At the end of this article, I link to my Kaggle notebook containing the detailed analysis of this dataset.

The steps taken to perform this analysis are:

  1. Understanding the dataset
  2. Cleaning/Handling the data
  3. Visualize and Analyze

Let us jump right into the analysis and see what we can learn to draw relevant conclusions.

1. Understanding the dataset

Before moving on to the fields/observations in the data, let us import the Python libraries required for this analysis.

[Image: Importing the required Python libraries]
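Those imports would look something like this (the aliases follow the usual conventions; the exact list in the notebook may differ):

```python
# Core data-analysis stack used throughout this article
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
```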

We will import the dataset and store it as a data frame for later analysis. To see what features the data contains, we use the head() function, which by default prints the first 5 rows of the dataset.

[Image: Read the dataset and print the first 5 rows]
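A minimal sketch (the CSV file name is an assumption; use the name of your local copy of the Kaggle file):

```python
# Load the Uber request data into a DataFrame
# (file name assumed; adjust the path to your copy of the Kaggle dataset)
df = pd.read_csv("Uber Request Data.csv")

# Peek at the first 5 rows
df.head()
```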

Let us check the shape of the data, i.e., how many rows and columns it has. Here, we use the shape attribute.

[Image: df.shape]
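In code, shape is an attribute rather than a method:

```python
# (number of rows, number of columns)
df.shape  # -> (6745, 6) for this dataset
```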

This means our dataset has 6745 rows, each one a user’s request for cab service, and 6 columns that describe different aspects of each request.

Let us now look at the number of NaNs (missing values) in each column, and what percentage of the data they represent.

[Image: NaNs and their percentages per column]
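One way to compute this (a sketch; the notebook’s exact code may differ):

```python
# Missing values per column, as an absolute count and as a percentage of rows
nan_count = df.isnull().sum()
nan_pct = (100 * nan_count / len(df)).round(2)
pd.DataFrame({"NaN count": nan_count, "NaN %": nan_pct})
```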

“Driver id” and “Drop timestamp” are the only columns with missing values, at roughly 39% and 58% respectively.

To get information about the structure of the dataset, we use the info() function.

[Image: df.info()]

Extracted Info:

Number of Rows: 6745

Number of Columns: 6

The “Dtype” of each column based on the type of data it holds: there are 2 numerical columns (one int, one float) and 4 object columns

To get a statistical summary of both object and numerical data, we use the describe(include=“all”) function, which reports numerical parameters such as mean, standard deviation, and minimum and maximum values, and object parameters such as count, unique values, top value, and frequency of occurrence.

[Image: df.describe(include="all")]

2. Cleaning/Handling the data

We see that for the column “Request timestamp”, the DateTime format differs across sections of the data: some entries are separated by “-” and some by “/”. Let us first replace the “/” with “-” for uniformity and then convert the entire column to the standard DateTime format using the “pd.to_datetime” function, as sketched below.

[Image: Request timestamp column DateTime discrepancy]

[Image: Request timestamp column DateTime standardization]
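A minimal sketch of that standardization; dayfirst=True is an assumption about how the dates in this dataset are written:

```python
# Replace "/" with "-" so all dates share one separator, then parse
df["Request timestamp"] = df["Request timestamp"].astype(str).str.replace("/", "-")
df["Request timestamp"] = pd.to_datetime(df["Request timestamp"], dayfirst=True)
```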

After doing the DateTime conversion, if we pull the info again, we see that “Request timestamp” has been converted to datetime (highlighted in yellow).

[Image: Converted Dtype for the “Request timestamp” column]

Similarly, we convert the “Drop timestamp” column from “object” to datetime.

[Image: Drop timestamp column DateTime standardization]
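The drop time gets the same treatment, with one addition: this column has missing values, so errors="coerce" parses them as NaT instead of raising an error.

```python
# Same cleanup for the drop time; missing drops become NaT
df["Drop timestamp"] = df["Drop timestamp"].astype(str).str.replace("/", "-")
df["Drop timestamp"] = pd.to_datetime(df["Drop timestamp"], dayfirst=True, errors="coerce")
```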

Pulling up the info again, we can now see that both “Request timestamp” and “Drop timestamp” have been converted to the DateTime standard (highlighted in yellow).

[Image: “Request timestamp” and “Drop timestamp” columns converted to the DateTime standard]

We now add 2 columns, “req_hour” (the hour of the day at which the request was made) and “req_day” (the day of the month), to determine and categorize the load of cab service requests.

[Image: Adding columns “req_hour” and “req_day”]
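With the timestamps parsed, pandas’ .dt accessor makes these one-liners:

```python
# Hour of the day (0-23) and day of the month for each request
df["req_hour"] = df["Request timestamp"].dt.hour
df["req_day"] = df["Request timestamp"].dt.day
```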

As we saw earlier while understanding the data, two columns, “Driver id” and “Drop timestamp”, have missing values. We have to understand whether these values are genuinely missing or absent because something went wrong during data collection.

The NaNs/missing values in the “Driver id” column can be ignored: whenever NO CARS were AVAILABLE at the time the user tried to book a cab, no driver was allotted to the trip, so the driver id is missing. Similarly, we can ignore the NaNs in the “Drop timestamp” column, since for all of them the trip was either CANCELLED or had NO CARS AVAILABLE. In both columns the data is missing for a genuine reason, not because it was lost during collection, so I have not substituted the missing values based on any imputation logic.

3. Visualize and Analyze

Let us plot the “Status” of the trip at different hours of the day and across pickup locations. We will use the seaborn visualization library for this activity.

[Image: seaborn count plot of req_hour with “Status” as hue]

[Image: seaborn factor plot of req_hour with “Pickup point” as hue]
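A sketch of the two plots; note that seaborn’s factorplot has since been renamed catplot:

```python
# Hourly request counts, colored by trip status
plt.figure(figsize=(12, 5))
sns.countplot(x="req_hour", hue="Status", data=df)
plt.show()

# Hourly request counts, colored by pickup point
sns.catplot(x="req_hour", hue="Pickup point", data=df, kind="count", aspect=2.5)
plt.show()
```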

The plots of trip status at different hours of the day and across pickup points show that:
1) Between 5 AM and 9 AM, the load on cabs is high, with almost equal numbers of trips completed and canceled.
2) Between 5 PM and 9 PM, the load on cabs is significantly high, so there is a mismatch between cab demand and availability; hence, we see more of the “No Cars Available” status.
3) Between 5 AM and 9 AM, requests from the City are significantly high.
4) Between 5 PM and 9 PM, requests from the Airport are significantly high.

We will now add a new column, “Time_Slot”, to group the hours from the “req_hour” column into categories.

[Image: Hour categorization]

[Image: Python code for hour categorization]
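A sketch of that categorization. “Morning_Rush” and “Evening_Rush” are the labels used below; the remaining labels and the exact hour boundaries are my assumptions based on the 5 AM-9 AM and 5 PM-9 PM rush windows discussed above:

```python
def time_slot(hour):
    # Bucket an hour of day (0-23) into a coarse time slot.
    # Boundaries and the non-rush labels are assumptions.
    if hour < 5:
        return "Early_Morning"
    elif hour < 10:
        return "Morning_Rush"
    elif hour < 17:
        return "Day_Time"
    elif hour < 22:
        return "Evening_Rush"
    return "Late_Night"

df["Time_Slot"] = df["req_hour"].apply(time_slot)
```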

After running the above code, we now see a new column “Time_Slot” added with the relevant time categories.

[Image: “Time_Slot” column added with the relevant time categories]

Let us see the count of each category created in the above step using the value_counts() function.

[Image: value_counts() output for “Time_Slot”]
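In code:

```python
# Requests per time slot, sorted by count
df["Time_Slot"].value_counts()
```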

You can see from the above value counts that “Morning_Rush” and “Evening_Rush” are the slots with the maximum load.

[Image: Count of time-slot categories with “Status” as hue]

[Image: Morning rush cab requests by “Pickup point”]
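A sketch of these two count plots:

```python
# Requests per time slot, colored by trip status
sns.countplot(x="Time_Slot", hue="Status", data=df)
plt.show()

# Morning-rush requests, split by pickup point
morning = df[df["Time_Slot"] == "Morning_Rush"]
sns.countplot(x="Pickup point", data=morning)
plt.show()
```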

From the above plot, we deduce that morning cab bookings from the “City” significantly outnumber those from the “Airport”.

Let us take the “Morning_Rush” slot with “City” as the pickup point and see the status of those trips.

[Image: Morning rush hour, pickup from the City]
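A sketch of this pie chart, filtering on both the time slot and the pickup point:

```python
# Trip status breakdown for morning-rush requests from the City
morning_city = df[(df["Time_Slot"] == "Morning_Rush") & (df["Pickup point"] == "City")]
morning_city["Status"].value_counts().plot.pie(autopct="%1.0f%%")
plt.ylabel("")  # drop the default "Status" axis label
plt.show()
```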

From the above pie chart, we see that nearly 49% of the users canceled their trips. Only 28% of the trips were completed and for 22% of the trips, there were no cars available.

Let us take the “Evening_Rush” slot with “Airport” as the pickup point and see the status of those trips.

[Image: Evening rush hour, pickup from the Airport]
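The same snippet with the filters swapped to the evening slot and the airport:

```python
# Trip status breakdown for evening-rush requests from the Airport
evening_airport = df[(df["Time_Slot"] == "Evening_Rush") & (df["Pickup point"] == "Airport")]
evening_airport["Status"].value_counts().plot.pie(autopct="%1.0f%%")
plt.ylabel("")
plt.show()
```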

From the above pie chart, we see that nearly 6% of the users canceled their trips. Only 21% of the trips were completed and for 73% of the trips, there were no cars available.

Key Takeaways:

  1. We understood the dataset: the number of user requests and columns (6745, 6), the number and percentage of NaNs in each column, and the inconsistent DateTime formats in the request and drop timestamp columns.

  2. After cleaning, the hourly plots reveal two rush windows, 5 AM-9 AM and 5 PM-9 PM, where demand far exceeds supply.

  3. During the morning rush, most pickups are from the City and nearly half of those trips are canceled; during the evening rush, most pickups are from the Airport and about 73% of requests find no cars available.

I have tried to include as many steps as possible to maintain the flow of the analysis. Here is the link to my Kaggle notebook, where I have worked through this analysis with a step-by-step explanation.


Written by

Suhas V S

MTech Student in Data Science and Machine Learning | Ex-Ericsson | https://www.linkedin.com/in/suhasvs95/
