Nobel Prize Analysis with R
This descriptive analysis will be demonstrated into three subjects: gender, age and continents. And within each subject, there are distribution analysis and trend analysis.
Here is the outline:
1. Gender
a. Gender distribution -> Distribution
b. Gender by different award categories -> Distribution
c. Gender over time -> Trend
d. Gender over time by categories -> Distribution & Trend
2. Age
a. Age distribution by different award categories -> Distribution
b. Age distribution by year awarded -> Distribution & Trend
3. Continent
a. Continent distribution by categories -> Distribution
b. Continent distribution by gender -> Distribution
c. Continent over time -> Trend
Data, Reference & Code:
The analysis is performed based on the Nobel Prize data with R, and thanks for the code reference from Prabhat Kuma.
GitHub: https://github.com/qyaan/nobel-prize
1. Gender
Facts:
- As of 2020, 57 women won Nobel Prize.
- 58 Nobel Prize rewarded to women, since Marie Curie won twice.
- There are 57 women, 873 men and 25 organization won Nobel Prize.
1a. Gender distribution
a <- ggplot(data = nobel, aes(x=Gender))a + stat_count(width = 0.5,fill = "white", color = "DarkBlue") + labs(x = "Award", y = "Sum", title = "Gender Gap")
1b. Gender by different award categories
b <- ggplot(data = nobel, aes(x=Award, fill=Gender))b + stat_count(width = 0.5) +
facet_grid(Gender~.,scales = "free") +
labs(x = "Award", y = "Sum", title = "Gender by Category")
1c. Gender over time
p1 <- as.data.frame(table(nobel$award.year,nobel$Gender))colnames(p1) <- c("year", "gender", "Freq")p2 <- mutate(group_by(p1, gender), cumsum = cumsum(Freq))ggplot(subset(p2, gender != "Org")) + geom_point(aes(year,log(cumsum), color = gender)) + scale_x_discrete(breaks = seq(1901, 2020, 10)) + scale_color_manual(values = c("darkorange", "skyblue","Green")) + labs(x = "Year", y = "log(cumulative sum) of laureates", title = "Cumulative Sum of Nobel Laureates by Gender over Time")
1d. Gender over time by categories
p3 <- as.data.frame(table(nobel$award.year, nobel$Award, nobel$Gender))colnames(p3) <- c("year", "category", "gender", "Freq")p4<- mutate(group_by(p3, category, gender), cumsum = cumsum(Freq))ggplot(subset(p4, gender != "Org")) +
geom_point(aes(year, log(cumsum), color = gender)) + facet_grid(category ~ .) +
theme_bw() +
scale_x_discrete(breaks = seq(1901, 2020, 10)) + scale_color_manual(values = c("darkorange", "skyblue","Green")) + labs(x = "Year", y = "log(cumulative sum) of laureates", title = "Cumulative Sum of Nobel Laureates by Gender & Category over Time")
2. Age
Facts:
- Median age is over 50 for all categories
- Physics has youngest median, and economics has oldest median
- Peace is skewed to a extramely younger outlier, because of a 17-years old girl Malala Yousafzai.
- Peace is rewared to younger person over the years.
- All other rewards show an upward trend in age, espacially chemistry and physics.
2a. Age distribution by different award categories
ggplot(subset(nobel, Gender != "Org")) +
geom_violin(aes(x = Award, y = award.age), fill = "LightBlue")+ stat_summary(aes(x = Award, y = award.age),
fun.y = "median", geom = "point") +
labs(x = "Category", y = "Age (years)",
title = "Age Distribution of Nobel Laureates by Category")
2b. Age distribution by year awarded
ggplot(subset(nobel, Gender != "Org"),
aes(x = award.year, y = award.age)) +
facet_wrap(.~Award) +
geom_point() +
geom_smooth() +
labs(x = "Year", y = "Age(years) at end of year",
title = "Age of Nobel Laureates Over Time by Category")
3. Continent
It’s reasonable to think weathier continent would have more laurates, but the reality is more complex. It is hard to draw a conclution without bias, since there are some continents has less countries, and different countries has different populations.
So here just present the code and result of analysis.
3a. Continent distribution by category
ggplot(data = nobel, aes(x = award.continent)) +
stat_count(width = 0.5,aes(fill = Award)) +
labs(x = "Continent", y = "Count", title = "All Nobel Prizes by Continent and Category") +
scale_fill_manual(values = c("#ffffcc", "#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494"), name = "Category")
3b. Continent distribution by gender
ggplot(data = nobel, aes(x = award.continent)) +
stat_count(width = 0.5, aes(fill = Gender)) +
labs(x = "Continent", y = "Count",
title = "All Nobel Prizes by Continent and Gender") +
scale_fill_manual(values = c("#41b6c4", "#ffd800", "#253494"),
name = "Gender")
3c. Continent over time
p5 <- as.data.frame(table(nobel$award.year, nobel$award.continent))colnames(p5) <- c("year", "Continent", "Freq")p6<- mutate(group_by(p5, Continent), cumsum2 = cumsum(Freq))ggplot(data = p6) + geom_point(aes(year, log(cumsum2),color = Continent)) +
scale_x_discrete(breaks = seq(1901, 2020, 10)) +
labs(x = "Year", y = "log(cumulative sum) of Continent",
title = "Nobel Laureates by Continent over Time")