Exploratory Data Analysis for Beginners!

Darshan Gandhi
Published in The Quest
5 min read · Apr 29, 2020
Different Types of Graphs

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the process of analysing a dataset in depth to identify its patterns and variance, and to check whether the assumptions made about the data actually hold.

EDA on a Dataset

We will now see how to perform EDA on a dataset: first build a good understanding of the data, then dig deeper by asking a few questions of it.

Basic Steps

  1. Define the problem statement
  2. Define the datatypes and create data definitions
  3. Import the libraries
  4. Set options
  5. Read the data
  6. Understand and Prepare the Data
  7. Understand the variables
  8. Check for Missing Values
  9. Study Correlation
  10. Detect Outliers
  11. Ask questions to obtain more insights
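Steps 3 to 9 above can be sketched in a few lines of pandas. This is only a minimal illustration: the tiny DataFrame is made-up data standing in for the real file, and the filename `zomato.csv` and the column `aggregate_rating` are assumptions, not taken from the article's source code.

```python
# Step 3: import the libraries
import numpy as np
import pandas as pd

# Step 4: set options so that wide frames are printed in full
pd.set_option("display.max_columns", None)

# Step 5: read the data.
# With the real file this would be something like:
#   df = pd.read_csv("zomato.csv")   # hypothetical filename
# A small made-up frame is used here so the sketch runs on its own.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "price_range": [1, 2, 1, 4, 3],
    "photo_count": [10, 250, np.nan, 1200, 300],
    "aggregate_rating": [3.1, 4.0, 3.5, 4.6, np.nan],  # assumed column name
})

# Steps 6 and 7: understand the data and its variables
print(df.shape)
print(df.dtypes)
print(df.describe())

# Step 8: count the missing values in each column
missing = df.isnull().sum()
print(missing)

# Step 9: correlation between the numeric columns only
corr = df.select_dtypes("number").corr()
print(corr)
```

Outlier detection (step 10) and the questions (step 11) depend on the dataset, so they are best handled column by column.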

So, let us now dive right into the exploratory part and start asking questions.

Here, we will work with the Zomato India dataset and try to gain a few insights from it. The complete source code is available here.

How are the restaurants distributed across India?

Distribution of Restaurants all over India

Here, we use two parameters: the Latitude and Longitude of each restaurant in India.

The graph shows that restaurants are not evenly spread across the states: the east zone has a low number of restaurants compared to the west, south and north zones.

Further, restaurants are denser in the south and north zones, at the extreme ends of the country, than in the central zone.

A code snippet for the same

Explanation of the code:

  1. Import the libraries: matplotlib and geopandas
  2. Create a GeoDataFrame from the latitude and longitude columns
  3. Create a new DataFrame holding the available world data
  4. Keep only the data for India, and set the color of the map to black and the edge color to white
  5. Plot the graph, marking the restaurants in yellow
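As a rough stand-in for the map described above, the sketch below plots the coordinates as a plain matplotlib scatter. It is not the article's geopandas code: the country outline is left out so that only pandas and matplotlib are needed, and the black background with yellow markers simply mirrors the styling described. The column names `latitude` and `longitude` are assumptions, and the coordinates are a handful of illustrative points (roughly Delhi, Mumbai, Bengaluru, Kolkata and Chennai) plus one deliberately missing row.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; a real run would use the coordinate columns of the dataset.
df = pd.DataFrame({
    "latitude": [28.61, 19.08, 12.97, 22.57, 13.08, float("nan")],
    "longitude": [77.21, 72.88, 77.59, 88.36, 80.27, float("nan")],
})

# Rows without coordinates cannot be placed on the map, so drop them.
coords = df.dropna(subset=["latitude", "longitude"])

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_facecolor("black")  # dark background, as in the described map
ax.scatter(coords["longitude"], coords["latitude"], s=20, color="yellow")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Restaurant locations")
# plt.show()  # uncomment when running interactively
```

With the full dataset the points alone already trace the shape of the country, which is enough to compare the density between zones.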

What is the distribution of price range for the restaurants?

Pie Chart for Visualisation

From the above graph we can see that most of the restaurants fall in price range category 1, followed by 2, then 3, and finally 4.

This graph shows that most of the restaurants in India are low cost or medium cost; together these categories make up almost 75% of the total distribution.

A code snippet for the same

Explanation of the code:

  1. First we read the data
  2. Next we go to the specific column, here: “price_range”
  3. Now, we find out the value count of each element in the column
  4. Finally, we plot the pie-chart for the value counts using the plot.pie() function
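The steps above can be sketched as follows. The ten `price_range` values are made-up sample data, not the real distribution; only the column name comes from the explanation above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; with the real data this frame would come from reading the file.
df = pd.DataFrame({"price_range": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]})

# Count how often each price range occurs (sorted from most to least frequent).
counts = df["price_range"].value_counts()
print(counts)

# Draw the pie chart, labelling each slice with its percentage share.
ax = counts.plot.pie(autopct="%1.1f%%", figsize=(6, 6))
ax.set_ylabel("")  # hide the default axis label, which is just the column name
ax.set_title("Price range distribution")
# plt.show()  # uncomment when running interactively
```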

How does the number of photos vary with the type of establishment (type of restaurant), and what are the outliers?

In this question we will learn how to compare the values in two columns, and also why outliers matter.

First, let us understand what outliers are:

An outlier is a data point that differs considerably from the other observations. An outlier may be due to variability in the measurement, or it may indicate an experimental error; the latter are often excluded from the dataset.
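One common way to make "differs considerably" concrete is the interquartile range (IQR) rule, which is also the default convention box plots use for drawing individual points beyond the whiskers: a value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below applies that rule to a made-up series of photo counts; only the extreme value 17,600 echoes a figure from this article, and the rule itself is one possible choice rather than the article's stated method.

```python
import pandas as pd

# Made-up photo counts with one extreme value.
photo_count = pd.Series([5, 12, 20, 35, 40, 60, 75, 90, 120, 17600])

q1 = photo_count.quantile(0.25)  # first quartile
q3 = photo_count.quantile(0.75)  # third quartile
iqr = q3 - q1                    # interquartile range

# Anything outside these fences is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = photo_count[(photo_count < lower) | (photo_count > upper)]
print(outliers)
```

Here only the value 17600 falls outside the fences; every other count lies well inside them.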

With this in mind, the plot shows how the values vary and which of them deviate strongly from the norm. For example, when the establishment type is Lounge, the photo count goes as high as 17,600, which is far above its mean of below 1,500 photos.

Next, we also see that for a Microbrewery, Pub or Bar the number of photos uploaded is high compared to the other types. This can be due to several factors: the ambience, the food, the services provided and so on.

On the other hand, for a Bhojnalaya or Kiosk the number of photos uploaded is negligible.

This suggests that pubs and bars have a better audience reach than a Bhojnalaya or Kiosk.

A code snippet for the same

Explanation of the code:

  1. Import the library seaborn as sns
  2. Plotting a boxplot, with the x axis as “Photo Count” and the y axis as “Type of Establishment”
  3. Resizing the image for better visualisation
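A minimal sketch of these steps with seaborn follows. The column names `photo_count` and `establishment` are assumptions, and the values are made up to imitate the pattern discussed above (one extreme Lounge, busy Pubs, near-empty Kiosks).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample data; the real frame would come from the dataset.
df = pd.DataFrame({
    "establishment": ["Lounge"] * 4 + ["Pub"] * 4 + ["Kiosk"] * 4,
    "photo_count": [200, 450, 900, 17600, 300, 600, 800, 1100, 0, 1, 2, 3],
})

# A wider figure keeps the category labels readable.
fig, ax = plt.subplots(figsize=(12, 6))

# Horizontal box plot: photo count on the x axis, one box per establishment type.
sns.boxplot(data=df, x="photo_count", y="establishment", ax=ax)
ax.set_xlabel("Photo Count")
ax.set_ylabel("Type of Establishment")
# plt.show()  # uncomment when running interactively
```

Points drawn individually beyond the whiskers are the outliers; with the real data the figure height may need to grow with the number of establishment types.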

Conclusion :

We are done! Hope you had a great read and understanding of the concepts explained and the questions asked. For any further clarification or insights you can reach out to me. Thank you.
