EDA: Top 100 Richest People in the World
Let’s get our hands dirty by playing with the dataset named Top 100 Richest People in the World. This dataset contains information about the top 100 richest people in the world based on their net worth. It includes their rank, name, net worth, birthday, age, and nationality. The dataset is taken from Kaggle.
Note: For a quick Pandas revision, you can refer to this blog: Tutorial: Pandas
1. Import important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
2. Reading the CSV file
df = pd.read_csv("/content/top_100_richest.csv")
df.head()
Observation: The dataset has 6 columns, namely rank, name, net_worth, bday, age, and nationality.
3. Data Summary
- The info() method prints information about the DataFrame: the range index, the number of columns, the column labels, the column data types, the number of non-null cells in each column, and the memory usage.
df.info()
Observation: There are no null values in any column except bday and age.
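The same check can be done programmatically. The sketch below uses a small made-up DataFrame (the `toy` frame and its rows are hypothetical stand-ins, not the real dataset) and relies on `count()`, which returns the number of non-null cells per column, the same figure that info() reports.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: same column names, made-up rows.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "bday": ["28-Jun-71", None, "12-Jan-64", None],
    "age": [51.0, np.nan, 59.0, 70.0],
    "nationality": ["USA", "France", "India", "USA"],
})

# count() gives the number of non-null cells in each column.
non_null = toy.count()
print(non_null)

# A column whose non-null count is below the row count has missing values.
incomplete = non_null[non_null < len(toy)].index.tolist()
print(incomplete)
```

On the real dataset, the same two lines applied to `df` should single out bday and age.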
- The describe() method returns a statistical summary of the data in the DataFrame. For each numerical column, the summary contains the count of non-null values, the mean, the standard deviation, the minimum, the 25th, 50th, and 75th percentiles, and the maximum:
df.describe()
Observations:
* As seen before, age contains NULL values, so its count is less than 100
* The maximum age of a billionaire is 97 years
* The mean (average) age of a billionaire is 68 years
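To see why the count for age drops below the number of rows, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical). It shows that describe() summarises only the numeric columns by default and ignores missing values when computing each statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one non-numeric column.
toy = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "age": [51.0, np.nan, 59.0, 70.0],
    "nationality": ["USA", "France", "India", "USA"],
})

summary = toy.describe()
print(summary)

# The missing age is skipped: the count is 3, not 4,
# and the mean is taken over the three known ages.
age_count = summary.loc["count", "age"]
age_mean = summary.loc["mean", "age"]
print(age_count, age_mean)
```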
- To make the birth date easier to work with, let’s split it into separate day, month, and year columns
df[["day", "month", "year"]] = df["bday"].str.split("-", expand=True)
df.head()
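As a small illustration of what `str.split` with `expand=True` produces, the sketch below runs it on three made-up birthdays. The `dd-Mon-yy` format is an assumption inferred from the filters used later in this post, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical birthdays in an assumed dd-Mon-yy format, one of them missing.
toy = pd.DataFrame({"bday": ["28-Jun-71", "12-Jan-64", None]})

# expand=True returns one column per piece instead of a list per row.
parts = toy["bday"].str.split("-", expand=True)
toy[["day", "month", "year"]] = parts
print(toy)

# The pieces are strings, and a missing bday stays missing in all three columns.
print(toy.loc[0, "month"], toy.loc[1, "year"])
```

Note that the new columns hold strings, not numbers, which matters for any later comparison on year.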
4. Finding and Removing the missing values
The isna() function is used to detect missing values.
input:
df.isna().sum()
output:
rank 0
name 0
net_worth 0
bday 6
age 5
nationality 0
dtype: int64
Observation: There are no null values in any column except bday and age.
The dropna() method removes the rows that contain NULL values. It returns a new DataFrame unless the inplace parameter is set to True, in which case it removes the rows from the original DataFrame and returns None.
input:
df = df.dropna()
df.isna().sum()
output:
rank 0
name 0
net_worth 0
bday 0
age 0
nationality 0
dtype: int64
Observation: The null values have been removed.
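The difference between the default behaviour and `inplace=True` can be sketched on a made-up frame (the `toy` data is hypothetical). The `subset` parameter is also shown as one way to drop rows based on selected columns only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row B misses both fields, row D misses only age.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "bday": ["28-Jun-71", None, "12-Jan-64", "03-Oct-50"],
    "age": [51.0, np.nan, 59.0, np.nan],
})

# Default: a new DataFrame is returned and the original is left untouched.
cleaned = toy.dropna()
print(len(toy), len(cleaned))

# subset limits the check to the listed columns.
only_bday = toy.dropna(subset=["bday"])
print(len(only_bday))

# inplace=True modifies the frame itself and returns None.
in_place = toy.copy()
result = in_place.dropna(inplace=True)
print(result, len(in_place))
```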
5. Exploring Distribution
- Nationality Distribution
The value_counts() function returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value.
count = df.nationality.value_counts().tolist()
# The index of the value_counts() result holds the nationality names
labels = df.nationality.value_counts().index.tolist()

plt.figure(figsize=(6,10))
sns.barplot(x=count,y=labels)
plt.xlabel('Count')
plt.ylabel('Nationality')
plt.show()
Observations:
* Most billionaires are from the USA
* The top 5 countries by number of billionaires are the USA, Russia, China, France, and India
* France, the UAE, Spain, Mexico, etc. have almost the same number of billionaires
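An observation such as the top countries can be read directly from the value_counts() result instead of from the chart. The following is a minimal sketch on made-up data (the `nationality` Series is hypothetical), using `head()` on the already sorted counts.

```python
import pandas as pd

# Hypothetical nationalities; the counts are made up for illustration.
nationality = pd.Series(
    ["USA"] * 4 + ["Russia"] * 3 + ["China"] * 2 + ["Spain"]
)

vc = nationality.value_counts()

# The result is sorted in descending order, so head(n) gives the n most common values.
top2 = vc.head(2)
print(top2)

top2_names = top2.index.tolist()
print(top2_names)
```

On the real data, `df.nationality.value_counts().head(5)` would list the five most common nationalities with their counts.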
- Pie chart
def pie_chart(sizes, labels, level):
    # Pull the first slice out of the pie and leave the other `level` slices in place
    explode = [0.1]
    for i in range(level):
        explode.append(0)
    explode = tuple(explode)

    fig1, ax1 = plt.subplots(figsize=(15, 8))
    ax1.pie(sizes,
            explode=explode,
            labels=labels,
            autopct='%1.1f%%',
            shadow=True,
            startangle=90)
    ax1.axis('equal')
    plt.show()

pie_chart(count[:12], labels[:12], level=11)
Observations:
* About 44.3% of the billionaires shown in the chart (the 12 most common nationalities) are from the USA
* France, the UAE, Spain, Mexico, etc. each account for about the same share, roughly 2.5%
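One detail worth keeping in mind: a pie chart normalises by the sum of the sizes it receives, so when only the first 12 counts are passed, the percentages are shares of those 12 groups and not of the whole dataset. The sketch below illustrates the difference on made-up counts (the data is hypothetical).

```python
import pandas as pd

# Hypothetical nationalities: 5 + 3 + 2 = 10 people.
nationality = pd.Series(["USA"] * 5 + ["China"] * 3 + ["India"] * 2)
vc = nationality.value_counts()

# Share of the whole dataset.
share_of_all = vc / vc.sum() * 100

# Share among the two largest groups only, which is what a pie of vc[:2] would display.
top2 = vc.head(2)
share_of_top2 = top2 / top2.sum() * 100

print(share_of_all["USA"], share_of_top2["USA"])
```

If the share of the full dataset is wanted, `value_counts(normalize=True)` returns the fractions directly.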
- Birth distribution
count = df.month.value_counts(sort=False).tolist()
# The index of the value_counts() result holds the month names
month = df.month.value_counts(sort=False).index.tolist()

def bar_plot(sizes, labels, title):
    fig, ax = plt.subplots(figsize=(14, 4))
    ax.bar(labels, sizes)
    ax.set_ylabel('Count')
    ax.set_title(title)
    plt.show()

bar_plot(count, month, 'month')
Observations:
* Most billionaires were born in October, followed by April, August, and March
* Very few billionaires were born in December
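With `sort=False` the months appear in the order they occur in the data, which can make the bar chart harder to read. One possible way to put them in calendar order is to reindex the counts, as in this sketch. It assumes three-letter month abbreviations such as "Nov", and the `MONTHS` list and sample data are hypothetical.

```python
import pandas as pd

# Assumed month labels, in calendar order.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical birth months.
month = pd.Series(["Oct", "Apr", "Oct", "Jan", "Apr", "Oct"])

# reindex puts the counts in calendar order; months that never occur get 0.
counts = month.value_counts().reindex(MONTHS, fill_value=0)
print(counts)
```

The resulting `counts.index` and `counts.values` could then be passed to a bar plot in place of the unsorted lists.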
# Number of Indians in Top 100
df[df["nationality"]=="India"]
# List of Billionaires below the age of 40
df[df["age"]<40]
# List the Billionaires aged between 50 and 80 who are from either India or Russia
# (the nationality test is grouped so that the age range applies to both countries)
df[(df["age"]>50) & (df["age"]<80) & df["nationality"].isin(["India", "Russia"])]
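When combining conditions, keep in mind that `&` binds more tightly than `|` in Python, so an "either/or" test needs its own parentheses. The sketch below shows the effect on a made-up frame (the `toy` data is hypothetical).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [60, 45, 85, 70],
    "nationality": ["India", "India", "Russia", "USA"],
})

in_range = (toy["age"] > 50) & (toy["age"] < 80)
is_india = toy["nationality"] == "India"
is_russia = toy["nationality"] == "Russia"

# Without grouping this reads as (in_range & is_india) | is_russia,
# so every Russian row is kept regardless of age.
ungrouped = toy[in_range & is_india | is_russia]

# With grouping, the age range applies to both nationalities.
grouped = toy[in_range & (is_india | is_russia)]

print(ungrouped["name"].tolist())
print(grouped["name"].tolist())
```

Here row C (age 85, Russia) slips through the ungrouped filter but is correctly excluded by the grouped one.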
# List of Billionaires who are above the age of 80 and were either born in November or are in the top 20
df[(df["age"]>80) & (( df["month"] == "Nov") | ( df["rank"] <= 20 ))]
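# List all the Billionaires who belong to Generation X (born 1965 to 1980)
# The year column holds two-digit strings, so the bounds are strings too; both ends are included
df[(df["year"] >= '65') & (df["year"] <= '80')]
# List all the Billionaires who belong to the Millennial generation (born 1981 to 1996)
df[(df["year"] >= '81') & (df["year"] <= '96')]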
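Comparing two-digit year strings works only while every value has exactly two digits. A possibly more robust variant is sketched below: it converts the year to a four-digit integer and uses `between()`, which includes both ends by default. It assumes every birth year falls in the 1900s, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical two-digit birth years stored as strings, as produced by str.split.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "year": ["64", "65", "80", "81"],
})

# Assumption: all birth years are in the 1900s.
birth_year = 1900 + toy["year"].astype(int)

# between() is inclusive on both ends by default.
gen_x = toy[birth_year.between(1965, 1980)]
millennials = toy[birth_year.between(1981, 1996)]

print(gen_x["name"].tolist())
print(millennials["name"].tolist())
```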
References:
2. W3Schools
3. https://www.beresfordresearch.com/age-range-by-generation/
4. Dataset: https://www.kaggle.com/datasets/ayessa/top-100-richest-people-in-the-world