Does open data make you happy? An introduction to Kaggle Kernels

How much you like open and accessible data probably depends on the kind of person you are (I happen to like it a lot!). But it turns out that, at a national scale, more open and accessible government data is positively correlated with happiness.

In this post, I want to share with you how I used Kaggle Kernels — our in-browser code execution environment — to explore two very interesting open datasets on Kaggle’s Datasets platform to come to this conclusion.

In my R analysis reproduced here, I demonstrate a few things:

  • How to write an Rmarkdown report using Kaggle Kernels featuring ggplot, dplyr, and formattable
  • How to combine multiple data sources from a catalogue of over 1,000 datasets into a single “kernel” (analysis) thanks to a super cool new feature
  • And, finally, the answer to the question: Are “open data friendly” countries “happy” countries?

Let’s go!

Introduction

In the kernel I wrote, executed, and published on Kaggle, I examine the question of whether countries whose governments adopt open policies with respect to data sharing are the same countries that score highly on the world happiness index. Let’s hypothesize that the two are positively correlated!

Thanks to the new multiple data sources feature, I can easily combine datasets from a catalogue of over 1,000 sources on Kaggle’s public data platform (plus several hundred from competitions, too, if I want!).

Here are the two datasets shared on Kaggle that I’ve chosen to work with:

Open Knowledge International’s 2015 Global Open Data Index

The Global Open Data Index is an annual effort to measure the state of open government data around the world. The crowdsourced survey is designed to assess the openness of specific government datasets according to the Open Definition.

Sustainable Development Solutions Network’s World Happiness Report from 2016

The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2016 Update, which ranks 156 countries by their happiness levels, was released in Rome in advance of UN World Happiness Day, March 20th. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. They reflect a new worldwide demand for more attention to happiness as a criterion for government policy.

Now that I’ve got my question plus the data to help me answer it, I’m ready to start a new kernel on Kaggle. The next section shows how to do this, including the code for reading in, joining, and manipulating the two datasets.

Reading in Multiple Data Sources

It’s quite straightforward to read in multiple data sources in a kernel on Kaggle:

  • Click on “New Kernel” from any page including https://www.kaggle.com/kernels
  • Select “Script” for Rmarkdown (“Notebook” starts a new Jupyter notebook)
  • Search for and add data sources (you can go back and add more at any time)

A demo showing how to start a new kernel with multiple data sources. In this case it’s a notebook which seamlessly combines markdown and either Python or R code.

On any kernel, you can see its data sources by clicking on the “Input” tab. This is what you see on my kernel once it’s published:

The “Input” tab on my Happiness and Open Data kernel on Kaggle.
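
You can also check what’s attached from inside the kernel itself. Here’s a minimal sketch, assuming (as the read.csv paths later in this post suggest) that each data source shows up under ../input in a folder named after the dataset. The helper name list_inputs is my own invention for illustration, not part of Kaggle or R:

```r
# Hypothetical helper: list every file under the kernel's input directory.
# The default path is an assumption based on the "../input/..." paths
# used by the read.csv() calls in this post.
list_inputs <- function(input_dir = "../input") {
  # recursive = TRUE descends into each dataset's folder and returns
  # paths relative to input_dir, e.g. "open-data/countries.csv"
  list.files(input_dir, recursive = TRUE)
}

print(list_inputs())
```

If the directory doesn’t exist (say, when running locally), list.files simply returns an empty character vector, so this is safe to run anywhere.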

Now that I’ve selected my data sources, I can get coding. In this post, I’m sharing the code portions of my Rmarkdown file, which you can see in its entirety here.

The code below reads in the data sources and joins them together by country name. There are some country names that don’t exactly match, so I’ll leave it to you to fork this and tweak the code (click the blue “Fork” button on the kernel).

library(dplyr)

# Read in data files from `open-data` and `world-happiness` datasets
open_data <- read.csv("../input/open-data/countries.csv", stringsAsFactors = FALSE)
happiness <- read.csv("../input/world-happiness/2015.csv", stringsAsFactors = FALSE)

# Rename from "Country Name" to just "Country" so it's easier to join
colnames(open_data)[2] <- "Country"

# Join the two dataset files on "Country"
open_data_happiness <- open_data %>%
  left_join(happiness, by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  # Keep only columns I plan to use
  select(Country, Region, X2015.Score, Happiness.Score,
         Economy..GDP.per.Capita., Family, Health..Life.Expectancy.,
         Freedom, Trust..Government.Corruption., Generosity,
         Dystopia.Residual)

# Give the columns nicer names now that our data is in one dataframe
colnames(open_data_happiness) <- c("Country", "Region", "Openness", "Happiness", "GDP", "Family", "Health", "Freedom", "Trust", "Generosity", "DystopiaResidual")
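
If you do fork the kernel to tidy up the mismatched country names, here’s one possible way to go about it: a sketch using dplyr’s anti_join to list the names that found no partner, then a small lookup to patch them. The two toy data frames and the fixes lookup are made up for illustration; the spellings are not copied from the real files, so you’d need to inspect the actual mismatches yourself.

```r
library(dplyr)

# Toy stand-ins for the two files; the names are illustrative only
open_data <- data.frame(
  Country = c("Denmark", "Korea, Republic of", "Colombia"),
  stringsAsFactors = FALSE
)
happiness <- data.frame(
  Country = c("Denmark", "South Korea", "Colombia"),
  stringsAsFactors = FALSE
)

# Rows of open_data whose Country has no match in happiness
unmatched <- anti_join(open_data, happiness, by = "Country")
print(unmatched)

# Hypothetical lookup: spelling in open_data -> spelling in happiness
fixes <- c("Korea, Republic of" = "South Korea")

# Swap in the corrected spelling wherever the lookup has an entry
open_data_fixed <- open_data %>%
  mutate(Country = ifelse(Country %in% names(fixes),
                          unname(fixes[Country]),
                          Country))

# After patching, nothing should be left unmatched
still_unmatched <- anti_join(open_data_fixed, happiness, by = "Country")
print(nrow(still_unmatched))
```

Because the join in this post is a left_join, any country left unmatched stays in the data with missing happiness columns rather than being dropped, so it’s worth running a check like this before reading too much into the results.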

Now that I have the data roughly how I want it, let’s have a quick peek. I really like the formattable package for presenting information in dataframes. I’ll use it to look at the characteristics of the 10 countries rated highest for their open data sharing policies:

library(formattable)

open_data_happiness %>%
  # Which countries are the most open?
  arrange(desc(Openness)) %>%
  # Round our numeric variables to two decimal places
  # (across() needs dplyr 1.0 or later; it replaces the deprecated mutate_each()/funs())
  mutate(across(-c(Country, Region, Openness), ~ round(.x, 2))) %>%
  head(10) %>%
  formattable(list(
    Openness = color_bar("yellow"),
    Happiness = color_bar("lightgreen"),
    GDP = color_bar("deepskyblue"),
    Family = color_bar("deepskyblue"),
    Health = color_bar("deepskyblue"),
    Freedom = color_bar("deepskyblue"),
    Trust = color_bar("deepskyblue"),
    Generosity = color_bar("deepskyblue"),
    DystopiaResidual = color_bar("deepskyblue")
  ), align = "l")

A little bit of funkiness, but formattable is a much richer way to glance at your data. I highly recommend you check it out!

The top ten most “open” countries aren’t localized to one or two regions but instead span the globe, which is interesting. Do any of these countries surprise you?

I was interested to see Colombia in fourth position; you can see the country scores lower relative to other top countries on some measures including “Trust” (Government Corruption). I investigated a bit and found this report from the Open Data Barometer which does call out Colombia as a regional leader in its 2016 analysis of open data readiness, implementation, and impact among Latin American countries.

I also suspect that the United States will get kicked down a few notches in coming years, sadly. Anyway, onto answering our main question…

Are open countries happy countries?

Now we’re ready to answer our question of whether countries that index highly for data openness are also home to happy people. Let’s find out.

I used the code below to generate a plot showing the openness score and happiness score for each country as measured in 2015.

library(ggplot2)
library(ggthemes)
library(viridis)

ggplot(open_data_happiness,
       aes(x = Openness,
           y = Happiness)) +
  geom_point(aes(colour = Region),
             size = 2) +
  geom_smooth(method = "lm") +
  labs(x = "Openness Score",
       y = "Happiness Score",
       title = "Are open data friendly countries happy countries?",
       subtitle = "Data openness and happiness by country in 2015") +
  scale_color_viridis(discrete = TRUE) +
  theme_minimal() +
  theme(text = element_text(size = 16))

Overall, it looks like the answer is … yes! One thing that stands out from this plot is that Western European countries cluster above the linear fit line on “Happiness” while Sub-Saharan African countries are grouped together at the bottom left, falling lowest on the “Happiness” scale.
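
If you’d like to put a number on that trend rather than eyeball it, here’s a rough sketch in base R. It isn’t part of my kernel. It runs on a tiny toy data frame whose values are entirely invented, standing in for open_data_happiness, and it computes a Pearson correlation, the same straight-line fit that geom_smooth(method = "lm") draws, and the average residual per region (positive meaning a region tends to sit above the line).

```r
# Toy stand-in for open_data_happiness; every value here is invented
toy <- data.frame(
  Region    = c("Region 1", "Region 1", "Region 2",
                "Region 2", "Region 3", "Region 3"),
  Openness  = c(10, 20, 30, 40, 50, 60),
  # The NA mimics a country that found no match in the join
  Happiness = c(4.0, 4.5, 5.5, NA, 6.0, 7.0)
)

# Drop rows missing either score so the fit and residuals line up row for row
complete <- toy[complete.cases(toy[, c("Openness", "Happiness")]), ]

# Pearson correlation, with a test against the null of zero correlation
r_test <- cor.test(complete$Openness, complete$Happiness)

# The same kind of linear fit that geom_smooth(method = "lm") plots
fit <- lm(Happiness ~ Openness, data = complete)

# Mean residual per region: above zero means above the fit line on average
region_offset <- tapply(residuals(fit), complete$Region, mean)

print(r_test$estimate)
print(coef(fit))
print(region_offset)
```

To try this on the real data, swap toy for open_data_happiness. Keep in mind that a positive correlation here says the two scores move together across countries, not that one causes the other.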

Conclusion

I hope you’ve found this introduction to Kaggle Kernels and multiple data sources helpful. Maybe you’re even inspired to make the world a happier place through data!

Check out our public Datasets platform here (and even contribute if you’ve got a data project you want to share!). Plus, browse our Kernels page for codespo (code inspiration) from our community of data scientists.


If you’re passionate about making more open and accessible data available to the world, I’d love to hear about it. We’re looking for data spelunkers and storytellers with coding skills in Python or R to help make Kaggle Datasets the best possible place to find, analyze, and collaborate on data. You can learn more here.