EDA ~Top 100 Richest People in the World

Abhishek Selokar
5 min read · Sep 29, 2022

Let’s get our hands dirty by playing with a dataset named Top 100 Richest People in the World. It contains information about the 100 richest people in the world, ranked by net worth: their rank, name, net worth, birthday, age, and nationality. The dataset is taken from Kaggle.

Note: For a quick Pandas revision, you can refer to this blog: Tutorial: Pandas


1. Import important libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Reading the CSV file

df = pd.read_csv("/content/top_100_richest.csv")
df.head()
Observation: The dataset has 6 columns, namely rank, name, net_worth, bday, age, and nationality.
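If you don't have the file at hand, the same check can be sketched on a tiny in-memory CSV. The two rows below are made up for illustration (the names, the net-worth strings, and the day-month-year birthday format are assumptions, not rows from the real dataset); only the column names come from the dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; these rows are NOT from the real dataset.
csv_text = """rank,name,net_worth,bday,age,nationality
1,Person A,$200 B,28-Jun-71,51,Country X
2,Person B,$150 B,,,Country Y
"""

sample = pd.read_csv(io.StringIO(csv_text))

# shape gives (rows, columns); columns lists the column labels
print(sample.shape)
print(sample.columns.tolist())
```

Empty fields in the CSV are read as missing values, which is why bday and age can contain nulls after loading.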

3. Data Summary

  • The info() method prints information about the DataFrame: the number of columns, column labels, column data types, memory usage, the range index, and the number of non-null values in each column.
df.info()
Observation: There are no null values in any column except bday and age.
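The non-null counts that info() prints can also be pulled out as numbers. Here is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"name": ["A", "B", "C"], "age": [50.0, None, 70.0]})

toy.info()                        # prints dtypes and non-null counts

# The same non-null counts, as a Series we can inspect or compare
non_null = toy.notna().sum()
print(non_null)
```

A column whose non-null count is lower than the number of rows has missing values.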
  • The describe() method returns a description of the data in the DataFrame. For each numerical column, the description includes the count, mean, standard deviation, minimum, quartiles, and maximum.
df.describe()
Observations:
* As seen before, age contains null values, so its count is not 100
* The maximum age among the billionaires is 97 years
* The mean age of the billionaires is 68 years
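To see why the count drops when values are missing, here is a small sketch on hypothetical numbers (the `toy` frame is made up): describe() skips missing values when computing each statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four ranks, but only three known ages
toy = pd.DataFrame({"rank": [1, 2, 3, 4], "age": [50.0, np.nan, 70.0, 90.0]})

summary = toy.describe()
print(summary.loc[["count", "mean", "max"]])
```

Here the count for rank is 4 while the count for age is 3, and the mean age is computed over the three known values only.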
  • For a clearer view, let’s split the birth date into day, month, and year columns
df[["day", "month", "year"]] = df["bday"].str.split("-", expand=True)
df.head()
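Here is a quick sketch of what str.split does with expand=True, on made-up birthdays. The day-month-year format with a month abbreviation and a two-digit year is an assumption inferred from how the columns are used in this post.

```python
import pandas as pd

# Hypothetical birthdays; the last one is missing
toy = pd.DataFrame({"bday": ["28-Jun-71", "05-Nov-30", None]})

# expand=True returns one column per piece instead of a list per row
parts = toy["bday"].str.split("-", expand=True)
parts.columns = ["day", "month", "year"]

toy = toy.join(parts)
print(toy)
```

Note that the pieces are strings, and a missing birthday yields missing values in all three new columns.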

4. Finding and Removing the missing values

The isna() function is used to detect missing values.

input:
df.isna().sum()
output:
rank 0
name 0
net_worth 0
bday 6
age 5
nationality 0
dtype: int64
Observation: There are no null values in any column except bday and age.

The dropna() method removes the rows that contain NULL values. It returns a new DataFrame unless the inplace parameter is set to True, in which case it removes the rows from the original DataFrame instead.

input:
df = df.dropna()
df.isna().sum()
output:
rank 0
name 0
net_worth 0
bday 0
age 0
nationality 0
dtype: int64
Observation: The rows with null values have been removed.
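The difference between the two ways of calling dropna() can be sketched on a made-up frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"name": ["A", "B", "C"], "age": [50.0, np.nan, 70.0]})

# Default: a new DataFrame is returned and the original is left untouched
cleaned = toy.dropna()
print(len(toy), len(cleaned))

# inplace=True: the rows are removed from the frame itself
toy_copy = toy.copy()
toy_copy.dropna(inplace=True)
print(len(toy_copy))
```

Reassigning the result, as in df = df.dropna(), has the same net effect on df as the in-place call.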

5. Exploring Distribution

  • Nationality Distribution

The value_counts() function returns an object containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring one.

# Count billionaires per nationality (sorted from most to least common)
nationality_counts = df["nationality"].value_counts()
count = nationality_counts.tolist()
labels = nationality_counts.index.tolist()
plt.figure(figsize=(6,10))
sns.barplot(x=count, y=labels)
plt.xlabel('Count')
plt.ylabel('Nationality')
plt.show()
Observations:
* Most billionaires are from the USA
* The top 5 countries by number of billionaires are the USA, Russia, China, France, and India
* France, UAE, Spain, Mexico, etc. have almost the same number of billionaires
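As a small illustration of how value_counts() orders its result, here is a sketch on a made-up Series (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical nationalities
toy = pd.Series(["USA", "India", "USA", "China", "USA", "India"], name="nationality")

counts = toy.value_counts()

# The index holds the unique values, the values hold how often each occurs
labels = counts.index.tolist()
values = counts.tolist()
print(labels, values)
```

The labels and counts come out already paired and sorted, which is handy when passing them to a plotting function.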
  • Pie chart
def pie_chart(sizes, labels, level):
    explode = [0.1]
    for i in range(level):
        explode.append(0)
    explode = tuple(explode)
    fig1, ax1 = plt.subplots(figsize=(15, 8))
    ax1.pie(sizes,
            explode=explode,
            labels=labels,
            autopct='%1.1f%%',
            shadow=True,
            startangle=90)
    ax1.axis('equal')
    plt.show()
pie_chart(count[:12], labels[:12], level=11)
Observations:
* About 44.3% of the billionaires in the 12 countries plotted are from the USA
* France, UAE, Spain, Mexico, etc. each account for about the same share, roughly 2.5%
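One thing worth keeping in mind is that a pie chart's percentages are relative to the slices passed in, not to the whole dataset. A minimal sketch with made-up counts (the `toy` Series is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical nationalities: 5 + 3 + 2 = 10 people
toy = pd.Series(["USA"] * 5 + ["India"] * 3 + ["China"] * 2, name="nationality")
counts = toy.value_counts()

# Share of each country among everyone
share_all = counts / counts.sum() * 100

# Share of each country among the top 2 only (what a pie of counts[:2] would show)
top2 = counts.head(2)
share_top2 = top2 / top2.sum() * 100

print(share_all["USA"], share_top2["USA"])
```

In this toy example the USA share is 50% overall but 62.5% when only the top two countries are plotted.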
  • Birth distribution
# Count billionaires per birth month (unsorted)
month_counts = df["month"].value_counts(sort=False)
count = month_counts.tolist()
month = month_counts.index.tolist()
def bar_plot(sizes, labels, title):
    fig, ax = plt.subplots(figsize=(14, 4))
    ax.bar(labels, sizes)
    ax.set_ylabel('Count')
    ax.set_title(title)
    plt.show()
bar_plot(count, month, 'month')
Observations:
* Most billionaires were born in October, followed by April, August, and March
* Very few billionaires were born in December
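A month chart is often easier to read in calendar order. One possible way to get that ordering, assuming the month column holds three-letter abbreviations such as "Nov" (the `toy` values below are made up), is to reindex the counts:

```python
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical birth months
toy = pd.Series(["Oct", "Apr", "Oct", "Dec", "Apr", "Oct"], name="month")

# reindex puts the counts in calendar order; months that never occur get 0
by_month = toy.value_counts().reindex(MONTHS, fill_value=0)
print(by_month)
```

The resulting index and values can then be passed to a bar plot as labels and sizes.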

# Indians in the Top 100

df[df["nationality"]=="India"]

# List of Billionaires below the age of 40

df[df["age"]<40]

# List of Billionaires aged between 50 and 80 who are from either India or Russia

df[(df["age"]>50) & (df["age"]<80) & ((df["nationality"] == "India") | (df["nationality"] == "Russia"))]
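When mixing & and | in a filter, remember that & binds more tightly than | in Python, so the "either/or" part needs its own parentheses (or isin). A minimal sketch on a made-up four-row frame (all names, ages, and nationalities here are hypothetical) compares the two:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [60, 85, 60, 45],
    "nationality": ["India", "Russia", "USA", "Russia"],
})

in_range = (toy["age"] > 50) & (toy["age"] < 80)

# Without grouping this reads as (in_range & India) | Russia,
# so every Russian row is kept regardless of age
ungrouped = toy[in_range & (toy["nationality"] == "India") | (toy["nationality"] == "Russia")]

# Grouping the nationality check applies the age filter to both countries
grouped = toy[in_range & toy["nationality"].isin(["India", "Russia"])]

print(ungrouped["name"].tolist())
print(grouped["name"].tolist())
```

Only the grouped version restricts both nationalities to the age range.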

# List of Billionaires who are above the age of 80 and were either born in November or are in the top 20

df[(df["age"]>80) & ((df["month"] == "Nov") | (df["rank"] <= 20))]

# List all the Billionaires who belong to Generation X (born 1965 to 1980)

df[(df["year"] >= '65') & (df["year"] <= '80')]

# List all the Billionaires who belong to the Millennial generation (born 1981 to 1996)

df[(df["year"] >= '81') & (df["year"] <= '96')]
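Comparing years as strings only works while every value is a two-digit string. A possible alternative is to convert the years to numbers first. The sketch below assumes every birth year falls in the 1900s and uses the commonly cited ranges of 1965 to 1980 for Gen X and 1981 to 1996 for Millennials; the `toy` rows are made up.

```python
import pandas as pd

# Hypothetical two-digit birth years stored as strings
toy = pd.DataFrame({"name": ["A", "B", "C", "D"], "year": ["64", "65", "80", "81"]})

# Assumption: all birth years are in the 1900s
birth_year = 1900 + toy["year"].astype(int)

# between() includes both endpoints by default
gen_x = toy[birth_year.between(1965, 1980)]
millennials = toy[birth_year.between(1981, 1996)]

print(gen_x["name"].tolist())
print(millennials["name"].tolist())
```

Working with full four-digit years also makes the boundaries easy to check against the reference table.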

More EDA blogs are available below; check them out to pick up new concepts.

References:

  1. Top 100 Richest — EDA | iamdatamonkey

  2. W3Schools

  3. https://www.beresfordresearch.com/age-range-by-generation/

  4. Dataset: https://www.kaggle.com/datasets/ayessa/top-100-richest-people-in-the-world

Abhishek Selokar

Masters Student @ Indian Institute Of Technology, Kharagpur || Thirsty to learn more about AI