Exploratory Data Analysis for Beginners 🤩

Ananya Agrawal
6 min read · Sep 27, 2022

Introduction

We live in an ever-evolving world, and technology has accelerated this evolution toward boundless new possibilities. The digital revolution is making the world increasingly data-driven, and to keep up with this transformation, businesses have to incorporate data analytics into their strategic vision and use it to generate insights for wise decision-making.

Image credits: https://www.gapingvoidart.com

As it turns out, the recipe 📃 for these insights lies in one acronym: EDA (Exploratory Data Analysis). As the name suggests, it is a process for probing datasets and extracting useful information from them, often in combination with data visualizations.

Why Exploratory Data Analysis?🤔

It can assist in finding glaring errors, understanding data patterns, spotting outliers or unusual occurrences, and discovering intriguing relations between variables, or even new variables altogether. Features obtained through EDA can then be used for more complex data analysis or modelling, including machine learning.

To better illustrate the idea behind EDA, let’s take a sample dataset and embark on this quest for insights, and what better than books 📚 to get started with? The dataset used here contains Amazon’s bestselling books from 2009–2019 and can be found here.

First Steps 👣

Importing necessary libraries

For our analysis here, we’ll need the pandas, NumPy, Matplotlib, and Seaborn libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Reading and understanding the dataset

We use pandas’ read_csv function to load the dataset, then head() and tail() to look at the first and last five rows of data and the variables with their respective values. The columns are the features that will be used to analyse the data.

books = pd.read_csv('/content/bestsellers with categories.csv')
books.head()
books.tail()

The .info() method gives additional details about these features, as shown. From the data types we can infer that there are four numerical features and three categorical features.

books.info()

Getting insights and visualizations

Now let’s get some statistics for the given features using .describe(),

Numerical Features Statistics:

books.describe()

Categorical Feature Statistics:

books.describe(include='object')

Using value_counts() to get the counts of fiction and non-fiction in the Genre column, and visualizing them with Seaborn’s countplot function:

genre_counts=books.Genre.value_counts()
genre_counts

Let’s analyze two features together and see the results. Here the number of reviews is plotted against genre, and it can clearly be seen that fiction books have a greater number of reviews than non-fiction books.

plt.title('Fiction vs Non Fiction')
sns.countplot(x='Genre', data=books)
plt.show()

plt.figure(figsize=(7, 7))
plt.title("Genre v/s Reviews")
sns.barplot(x='Genre', y='Reviews', data=books)
plt.show()

We can also use boxplots to visualize the distribution of books across various features; as an example, the distribution of price across genres is shown here.

plt.figure(figsize=(9, 4))
sns.boxplot(data=books, x='Price', y='Genre')
plt.title("Genre-wise Distribution of Price", fontsize=18)
plt.ylabel("Genre", fontsize=15)
plt.xlabel("Price", fontsize=15)
plt.show()

Next up is a grouped bar plot. These come in handy when we have two sets of data, such as the fiction and non-fiction books here, so that both sets can be visualized in one plot.

books_price = books.sort_values("Price", ascending=False)[['Author', 'Price', 'Genre', 'Reviews']].head(20)
plt.xticks(rotation=90)
sns.barplot(x='Author', y='Reviews', data=books_price, hue='Genre')

Here the .sort_values function is used to sort the data by book price in descending order, and the top 20 rows are kept in the newly created dataframe books_price.

books_price

There are also duplicate rows in this particular dataframe, which can be removed with the drop_duplicates() function to get a clearer picture of what one is dealing with.

books_price = books_price.drop_duplicates()
books_price
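One detail to keep in mind: drop_duplicates() returns a new dataframe rather than modifying in place, so the result must be assigned back. A minimal sketch on a toy dataframe (rows invented for illustration, not from the dataset):

```python
import pandas as pd

# Toy dataframe with one exact duplicate row (illustrative values)
df = pd.DataFrame({
    'Author': ['A', 'A', 'B'],
    'Price': [10, 10, 12],
})

deduped = df.drop_duplicates()            # full-row duplicates removed; df itself is unchanged
print(len(df), len(deduped))              # 3 2

# Restrict the comparison to selected columns with subset=
by_author = df.drop_duplicates(subset=['Author'])
print(len(by_author))                     # 2
```

The subset= parameter is useful when rows differ in incidental columns (e.g. the same author at slightly different prices) but should still count as duplicates for the analysis.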

We can also explore the statistics of each feature individually, such as Price, shown below:

print(books.Price.describe())
print()
books.Price.plot(bins=50, kind='hist')

The statistics show that the minimum price in the dataframe is 0. This is not plausible, as bestselling books cannot be free. To fix this, we first look at the count of 0 values in the Price column:

books.Price.value_counts()[0]

Output:

12

Next, we use the map function to replace all 0 values with the median of the Price column:

books['Price'] = books['Price'].map(lambda x: books.Price.median() if x == 0 else x)
books.Price.min()

Output:

1.0

If we check the minimum now, it is no longer 0.
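The same replace-with-median pattern can also be written without a lambda, using pandas’ mask() method, which substitutes a value wherever a condition holds. A small sketch on a toy series (values invented):

```python
import pandas as pd

# Toy price series with two zero entries (invented values)
prices = pd.Series([0, 4, 8, 0, 12])

# mask() replaces values where the condition is True; the median is
# computed on the original series, zeros included, like the lambda above
fixed = prices.mask(prices == 0, prices.median())
print(fixed.min())   # 4.0 — the median of [0, 4, 8, 0, 12]
```

Whether map() or mask() reads better is a matter of taste; mask() avoids the per-element lambda call and recomputing the median for every zero.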

Sometimes we want to analyze a particular category on its own, and for this purpose the dataset can be divided into sub-datasets. There are many ways to do this; here it is split into fiction and non-fiction using boolean indexing on the dataframe:

df_fiction=books[books["Genre"]=="Fiction"]
df_nonfiction=books[books["Genre"]=="Non Fiction"]
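The books[books["Genre"] == "Fiction"] pattern is plain boolean indexing: the comparison produces a True/False mask with one flag per row, and indexing with that mask keeps only the matching rows. A toy sketch (rows invented for illustration):

```python
import pandas as pd

# Toy frame to illustrate the boolean-mask split (invented rows)
toy = pd.DataFrame({'Genre': ['Fiction', 'Non Fiction', 'Fiction'],
                    'Name': ['a', 'b', 'c']})

mask = toy['Genre'] == 'Fiction'   # one True/False flag per row
fiction = toy[mask]                # rows where the flag is True
nonfiction = toy[~mask]            # ~ negates the mask: the complement
print(len(fiction), len(nonfiction))   # 2 1
```

Since the two masks are complements, the subsets together always cover every row of the original frame exactly once.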

Now we can apply the same functions to each subset to get insights for an individual category, for example the most frequent authors in each genre:

df_fiction.Author.value_counts()

It is also possible to split the data into ranges to get meaningful insights. Here the Price values are divided into three ranges using pandas’ cut function, creating a new feature called PriceRange. This is then used to get the distribution of reviews by price range, and clearly, low-priced books have the highest number of reviews.

books['PriceRange'] = pd.cut(books['Price'], 3, labels=['low range <= $35', 'mid range <= $70', 'high range <= $150'])
books[['PriceRange', 'Reviews']].groupby(['PriceRange'], as_index=False).mean().sort_values(by='PriceRange', ascending=True)

This is visualized using Seaborn’s boxplot, again to get an idea of how the reviews are distributed across the ranges.

plt.figure(figsize=(10,10))
sns.boxplot(x=books['PriceRange'],y=books['Reviews'])
plt.ylim(0, 50000)
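One caveat about the binning above: when pd.cut is given an integer instead of explicit bin edges, it creates equal-width bins spanning the observed min–max of the data, so the dollar thresholds written into the labels are descriptive rather than guaranteed. A toy illustration, using retbins=True to expose the computed edges (values invented):

```python
import pandas as pd

values = pd.Series([0, 30, 60, 90])

# Three equal-width bins over the observed 0–90 range (width 30 each)
binned, edges = pd.cut(values, 3, labels=['low', 'mid', 'high'], retbins=True)
print(list(binned))   # one category per value
print(edges)          # the computed bin edges
```

If fixed dollar thresholds are actually wanted, pass them explicitly, e.g. pd.cut(books['Price'], bins=[0, 35, 70, 150], labels=[...]).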

Just the beginning😎

In this way, we can go on and on with data exploration: applying functions, creating new features or modifying existing ones, and deriving insights. The possibilities are endless. The most important thing to keep in mind is the purpose; with a well-defined end goal, it becomes easy to carve a path to the task at hand.

No business can draw sound conclusions from raw data alone, so an exhaustive approach like EDA becomes an integral step before moving on to the modelling part of the work.

Thank you if you read all the way down here! You can find the repository with this code here on my GitHub. Do check it out! You can also respond to this story; I’d be glad to get some feedback, as this is my first blog on Medium.

Ananya Agrawal

Student | Aspiring data scientist | Writes about data analysis, Machine Learning, Deep Learning, NLP | Check out my GitHub: https://github.com/ananyasgit