EDA with Python Using the Forbes 2022 Dataset

Qudirah
9 min read · May 22, 2022

Hi, I totally wanted to share my Excel dashboard this weekend, but I didn’t save the file and it crashed. So, I decided to do a tutorial on how I cleaned and visualized the Forbes 2022 dataset instead.

The dataset came from Kaggle, and I found it interesting the moment I saw it. I’m not very up to date with the news, but the few people I expected to find there included Bill Gates, Elon Musk, uh, Aliko Dangote, and the Kardashians. I always thought the Kardashians were so rich until I went through this dataset. Only Kim is featured, and she is worth just a billion dollars plus a few hundred million. I said that like it’s a measly amount. Lol.

All the cleaning was done in a Jupyter notebook, and the visualization was done with matplotlib and a bit of seaborn. The first step was to import the necessary libraries and then load the CSV file. I used pandas mainly for analysis and manipulation, matplotlib and seaborn for visualization, and I imported NumPy in case I needed any mathematical computation. The pandas read_csv function reads a file in CSV format into a data frame.
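Here is a minimal sketch of the loading step. The real Kaggle file isn’t bundled with this post, so the sketch reads a few made-up rows from a string instead; the column names follow the ones described in this article, and the unnamed first column is an assumption about how the CSV is laid out.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV: made-up rows, with an unnamed first column
# that ends up as the index (an assumption about the real file's layout).
csv_text = """,rank,name,networth,age,country,source,industry
0,1,Person A,$200 B,50,United States,Example Motors,Automotive
1,2,Person B,$150 B,58,United States,Example Store,Technology
2,3,Person C,$1.5 B,41,Nigeria,Example Cement,Manufacturing
"""

# With the real file you would pass its path instead of the StringIO object.
Forbes_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(Forbes_df.head(2))  # first 2 rows
print(Forbes_df.tail(1))  # last row
print(Forbes_df.shape)    # (number of rows, number of columns)
```

In a notebook, leaving Forbes_df.head() as the last line of a cell displays the table without print.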

The file is loaded into the Forbes_df variable. The head() and tail() methods return the first and last n rows respectively (five by default). When reading the file, I also set the index_col argument to 0, which makes the first column the index. From the output, I noticed the dataset has only a few fields: rank, name, net worth, age, country, source, and industry. With a wider dataset you might need the columns attribute to see all the column names. The picture below also shows there are 2600 rows in total (note: the last index label is 2599 because the index starts from 0). This is a really small dataset, but it gets you familiar with simple functions for cleaning and visualizing data.

The iloc indexer accesses a row by its integer position and returns everything in that row. The describe method shows summary statistics for the numerical columns in the dataset. But according to the picture below, the net worth column isn’t included. This means it is stored as strings and needs to be converted to float (not int, in case there are decimals).
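A small sketch of both calls on made-up rows. It shows why a string column silently drops out of describe, which is the clue that the net worth column needs converting.

```python
import pandas as pd

# Made-up rows; networth is still text, as in the raw dataset.
df = pd.DataFrame(
    {
        "name": ["Person A", "Person B", "Person C"],
        "networth": ["$200 B", "$150 B", "$1.5 B"],
        "age": [50, 58, 41],
    }
)

first_row = df.iloc[0]   # the row at integer position 0, returned as a Series
summary = df.describe()  # by default, only numeric columns are summarized

print(first_row)
print(summary.columns.tolist())  # networth is missing from this list
print(df.dtypes)                 # confirms which columns are numeric
```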

This is done with one of the most useful pandas methods, apply. The apply method lets you run a function on every value of a column, or on every row or column of a data frame. I combine apply with a lambda function, another handy tool. A lambda is like a normal function, except it doesn’t need a name and its body is a single expression, so it fits on one line.
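To make the apply and lambda combo concrete, here is a tiny sketch on made-up numbers. The add_one helper is hypothetical and only there to show that a lambda does the same job as a named function.

```python
import pandas as pd

df = pd.DataFrame({"age": [50, 58, 41], "rank": [1, 2, 3]})  # made-up values


def add_one(x):
    """A named function, for comparison with the lambda below."""
    return x + 1


named = df["age"].apply(add_one)              # called once per value
anonymous = df["age"].apply(lambda x: x + 1)  # same thing, no name needed

# With axis=1 the function receives each row as a Series instead.
row_sums = df.apply(lambda row: row["age"] + row["rank"], axis=1)

print(anonymous.tolist())
print(row_sums.tolist())
```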

So, to change the net worth column to a float type, I replaced the ‘$’, the letter ‘B’ after each number, and the extra spaces with nothing, then changed the type to float. The code is below. Notice how apply and lambda are used.

Forbes_df['networth'] = Forbes_df['networth'].apply(lambda x: x.replace('$', '')).apply(lambda x: x.replace('B', '')).apply(lambda x: x.replace(' ', '')).astype(np.float64)
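If you want to try the conversion without the full dataset, this sketch runs the same chain on three made-up strings. It also shows one possible alternative, a single vectorized regex replace through the .str accessor; that second version is my suggestion here, not what the original notebook used.

```python
import pandas as pd

# Made-up values in the same "$<number> B" shape described above.
networth = pd.Series(["$200 B", "$150 B", "$1.5 B"], name="networth")

# The chained apply/lambda approach: drop '$', 'B' and spaces one at a time.
with_apply = (
    networth.apply(lambda x: x.replace("$", ""))
    .apply(lambda x: x.replace("B", ""))
    .apply(lambda x: x.replace(" ", ""))
    .astype(float)
)

# Alternative: one regex that removes '$', 'B' and any whitespace in a single pass.
with_str = networth.str.replace(r"[$B\s]", "", regex=True).astype(float)

print(with_apply.tolist())
print(with_str.tolist())
```

Both give the same floats; the regex version just does it in one step.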

Also, because I don’t particularly like how my column headings are all in lowercase, I renamed the columns and then called the describe method again.
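A sketch of the renaming step with the rename method. Only the name Networth($ Billion) comes from this post; the other new names are guesses for illustration, and the rows are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Person A", "Person B", "Person C"],
        "networth": [200.0, 150.0, 1.5],
        "age": [50, 58, 41],
    }
)

# "Name" and "Age" are assumed names; "Networth($ Billion)" is from the article.
df = df.rename(
    columns={"name": "Name", "networth": "Networth($ Billion)", "age": "Age"}
)

summary = df.describe()
print(summary)  # the net worth column shows up now that it is numeric
```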

Now it looks better. The describe method now includes the net worth column. Notice how I renamed it Networth($ Billion). This is because the list includes only billionaires and I don’t want lots of zeroes after each digit. The output shows the count of entries in each column, which is 2600 and matches the number of rows. It also shows the mean of each numerical column, the standard deviation, the minimum (the youngest age is 19), the first, second, and third quartiles, and the max, which shows the oldest billionaire is 100 years old.

The next thing is to check whether missing values exist. I did this by calling the isnull method and then the sum method on the result, which totals the missing cells in each column. Every column gives 0, which lessens the work as I don’t have to deal with missing values. Now we dig in; there are so many things to check.
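Here is what that check looks like in a small sketch. The Forbes data has no gaps, so I put two missing cells into made-up rows on purpose to show what a non-zero result looks like.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing name and one missing age, added on purpose.
df = pd.DataFrame(
    {
        "Name": ["Person A", "Person B", None],
        "Age": [50, np.nan, 41],
    }
)

# isnull() marks each missing cell as True; sum() counts the Trues per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```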

1. Who are the Top 10 richest in the world?

I try as much as possible to include only useful information on a chart, which is why I removed the frame and unnecessary axis labels. I also avoid playing with colors unless they carry meaning.

I used a bar chart because I’m plotting categorical data against a numeric one. If it were time-based or ordinal data, I’d use a line chart. The chart shows Elon Musk is the richest man, far ahead of Jeff Bezos, who came second. Bill Gates is fourth and Warren Buffett is fifth. Let’s move on to more interesting things.
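A rough sketch of this kind of chart with matplotlib, on made-up names and values. Using nlargest to pick the top rows, plain grey bars, and hiding three of the spines are my choices for the sketch and may differ from the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up names and net worths, for illustration only.
df = pd.DataFrame(
    {
        "Name": ["Person A", "Person B", "Person C", "Person D", "Person E"],
        "Networth($ Billion)": [12.0, 200.0, 95.5, 150.0, 3.2],
    }
)

top = df.nlargest(3, "Networth($ Billion)")  # top 3 here; use 10 on the real data

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(top["Name"], top["Networth($ Billion)"], color="grey")

# Remove most of the frame so only the bars and their labels remain.
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)

ax.set_ylabel("Net worth ($ billion)")
ax.set_title("Richest people in the sample")
```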

2. Which country has the highest number of billionaires?

First, we have to check the countries involved, and for that I used the unique method. Then I called len on the result to get the total number of unique countries, which gives 75. 75 countries is a lot. The next step is the groupby method, which groups the whole data frame by the specified column. So, I grouped the dataset by the country column and called the size method on it. This returns a Series with the number of rows for each unique country.

I saved this Series into another variable called Country_counts and used the to_frame method to convert it to a data frame. I renamed the count column, which was automatically named 0 when it became a data frame, and then sorted the values in descending order.
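The steps above can be sketched like this on a handful of made-up rows. The column name Count is an assumption; the post only says the column was renamed from 0.

```python
import pandas as pd

# Made-up rows; only the Country column matters for this step.
df = pd.DataFrame(
    {
        "Country": [
            "United States", "China", "United States",
            "India", "China", "United States",
        ]
    }
)

n_countries = len(df["Country"].unique())  # number of distinct countries

Country_counts = (
    df.groupby("Country")
    .size()                            # Series: rows per country
    .to_frame()                        # data frame with a single column named 0
    .rename(columns={0: "Count"})      # give that column a readable name
    .sort_values("Count", ascending=False)
)

print(n_countries)
print(Country_counts.head(10))
```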

This shows the US has the highest number of billionaires, about 200 ahead of China, which came second. Then I plotted the first 10 countries by count. I could have plotted everything, but the last five countries all have the same count, and a big chart full of repeated values wouldn’t be that useful. So, the first 10 should do.

Again, a bar chart because it is categorical vs numerical. India came third and Germany fourth. There are lots of Asian countries on the list, which made me wonder about African countries. Hence the next question.

3. Who is the richest in Africa?

First, the dataset does not have a continent column, so I checked Wikipedia for the list of countries in Africa and copied it into a list called ‘Africa’. Then I created another list called country_in_list, used a for loop to go through the Africa list, and appended every African country that appears in Forbes_df to country_in_list. The result shows there are seven African countries in the dataset.

Using the loc indexer to access specific rows by index label in the Country_counts data frame outputs the African countries with their counts in the Forbes dataset. This shows Egypt (6) has the highest number of billionaires in Africa, followed by South Africa (5), then Nigeria (3), and then Morocco (2). Now for the 10 richest in Africa.
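A minimal sketch of the loop and the loc lookup, on made-up rows. The Africa list is shortened to five countries here, and the exact membership test is an assumption about how the check was written.

```python
import pandas as pd

# Made-up rows; the real dataset has one row per billionaire.
Forbes_df = pd.DataFrame(
    {
        "Country": [
            "United States", "Egypt", "Nigeria", "Egypt",
            "Germany", "United States", "Egypt", "Nigeria",
        ]
    }
)
Country_counts = (
    Forbes_df.groupby("Country").size().to_frame().rename(columns={0: "Count"})
)

# Shortened list for illustration; the full list has every African country.
Africa = ["Algeria", "Egypt", "Morocco", "Nigeria", "South Africa"]

country_in_list = []
present = set(Forbes_df["Country"].unique())
for country in Africa:
    if country in present:  # keep only countries that appear in the data
        country_in_list.append(country)

# loc with a list of labels returns just those rows of the counts table.
african_counts = Country_counts.loc[country_in_list].sort_values(
    "Count", ascending=False
)
print(african_counts)
```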

I created another variable called Forbes_dfa (a for Africa) by setting the index of the original dataset (Forbes_df) to Country. After that, I used the loc indexer (again) to access the rows by country name, and then sorted them in descending order by net worth. I used seaborn for the plot here because it makes it easy to color the bars by a column. See below.
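Here is a sketch of the data preparation for that plot, using made-up names and values. The reset_index and head calls at the end are my additions to make the result easy to plot.

```python
import pandas as pd

# Made-up rows for illustration.
Forbes_df = pd.DataFrame(
    {
        "Name": ["Person A", "Person B", "Person C", "Person D", "Person E", "Person F"],
        "Country": ["Nigeria", "United States", "South Africa", "Nigeria", "Egypt", "South Africa"],
        "Networth($ Billion)": [14.0, 200.0, 7.0, 6.0, 3.0, 2.5],
    }
)
country_in_list = ["Egypt", "Nigeria", "South Africa"]

Forbes_dfa = (
    Forbes_df.set_index("Country")  # country names become the row labels
    .loc[country_in_list]           # keep only rows for the African countries
    .sort_values("Networth($ Billion)", ascending=False)
    .reset_index()                  # turn Country back into a normal column
    .head(10)                       # top 10 by net worth
)
print(Forbes_dfa)
```

From there, a seaborn call along the lines of sns.barplot(data=Forbes_dfa, x="Networth($ Billion)", y="Name", hue="Country", dodge=False) would color each bar by country.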

The bar graph shows the top 10 billionaires in Africa, with the bars colored by country. Aliko Dangote from Nigeria tops the chart with 14 billion dollars and leaves the second, Johan Rupert from South Africa, far behind. The third is also South African, while the 5th and 6th are Nigerians. That means all three Nigerians on the original Forbes billionaire list make the African top 10.

4. Which Industry has the most billionaires in it?

I’m a little curious about this because I want to know what they do and what they are into. The first step is using the unique method to check the unique industries in the dataset. Then I used the apply and lambda combo to replace blank spaces with nothing, just to tidy up how it looks. After grouping by industry, the size method returns a Series with each industry and its count. I converted the Series to a data frame, renamed the count column as I did earlier, and sorted in descending order by count.

This is all saved into the Industry_count variable, and it shows that 386 of them are into Finance and Investments, while 329 are into Technology. Let’s visualize to see all.
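A sketch of the industry count on made-up rows. Note one swap: instead of removing every space, this version uses str.strip, which only trims spaces at the start and end, so multi-word names like Finance & Investments keep their inner spaces.

```python
import pandas as pd

# Made-up rows with stray spaces around some industry names.
df = pd.DataFrame(
    {
        "Industry": [
            "Technology ", "Technology", " Finance & Investments",
            "Finance & Investments ", "Technology",
        ]
    }
)

before = len(df["Industry"].unique())  # stray spaces create fake extra categories

df["Industry"] = df["Industry"].str.strip()  # trim leading/trailing spaces only
after = len(df["Industry"].unique())

Industry_count = (
    df.groupby("Industry")
    .size()
    .to_frame()
    .rename(columns={0: "Count"})
    .sort_values("Count", ascending=False)
)
print(before, after)
print(Industry_count)
```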

This is quite obvious from the graph. The only surprising part for me was the Casino and gambling. Interesting.

5. Is there a relationship between money and age?

I was curious about this just to find out if older people were richer or something. This is visualized using a scatterplot, which shows the relationship between two numerical variables. I made the color green because what does a dollar look like?

The graph shows no clear relationship between age and money. The points are spread fairly evenly along the age axis, and there are outliers of course, such as Elon Musk, whose net worth is far ahead of the others.
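One way to back up what the scatterplot suggests is a correlation coefficient. This is an extra check I’m sketching here on made-up numbers, not something from the original notebook. Pearson’s r only measures linear association, so treat it as a rough cross-check rather than proof.

```python
import pandas as pd

# Made-up ages and net worths, including one large outlier.
df = pd.DataFrame(
    {
        "Age": [30, 45, 50, 62, 70, 85],
        "Networth($ Billion)": [2.1, 1.3, 200.0, 4.5, 1.1, 3.0],
    }
)

# Pearson correlation: values near 0 suggest little linear relationship.
r = df["Age"].corr(df["Networth($ Billion)"])
print(round(r, 3))
```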

6. The distribution of age

The distribution of a numeric variable can be visualized with a histogram. The graph below shows a unimodal distribution that even looks roughly normal. The 50–70 age group seems to have the highest frequency in the dataset. It also seems there are more older people than younger ones in the dataset.
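A minimal histogram sketch with matplotlib on made-up ages. The ten-year bins are my choice for the sketch; the original chart may have used different ones.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([19, 25, 34, 45, 52, 55, 58, 61, 63, 67, 72, 80, 95], name="Age")

fig, ax = plt.subplots()
# Ten-year bins from 10 to 100; hist returns the count in each bin.
counts, bin_edges, _ = ax.hist(
    ages, bins=list(range(10, 101, 10)), color="grey", edgecolor="white"
)
ax.set_xlabel("Age")
ax.set_ylabel("Number of people")

print(counts)
```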

7. Who is the youngest billionaire or who are the youngest billionaires?

Just as I have been doing, I sorted by age, this time in ascending order. Then I plotted.
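The sorting step, sketched on made-up rows:

```python
import pandas as pd

# Made-up names and ages for illustration.
df = pd.DataFrame(
    {
        "Name": ["Person A", "Person B", "Person C", "Person D"],
        "Age": [50, 19, 41, 25],
    }
)

# Ascending sort puts the youngest first; head keeps the first few rows.
youngest = df.sort_values("Age", ascending=True).head(3)
print(youngest)
```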

The youngest billionaire is Kevin David Lehmann, age 19, from Germany. He got rich by inheriting his father’s stake. The youngest female is Alexandra Andresen, age 25, a Norwegian heiress. Not too far from her is her sister, Katharina Andresen. Pedro Franceschi (25) and Henrique Dubugras (26) are the youngest self-made billionaires. They are both co-founders of Brex and are from Brazil. Brex offers business credit cards and cash management accounts to technology companies.

It was a long ride, but I hope you learned a thing or two and can maybe try out your own EDA too. If you think there is a better approach to any of the questions, I’d like you to share it. This is the link to my Twitter page. Until next time, and thanks for reading.

Qudirah

To be a Data Scientist is hard, to be a Nigerian data scientist is harder. Taking you through my journey because success is a must.