Violence against women and top women chess players across the developing world

An exploratory data analysis exercise using R.

Published in

MCD-UNISON

5 min read · Dec 6, 2020

In this project we analyze two different data sets to see what they tell us about the likelihood of a top woman chess player appearing in a country, and how that relates to violence against women in that country. This requires a considerable amount of data cleaning and interpretation before arriving at a relevant correlation. We acknowledge in advance that other variables, such as per-country economic indicators, may correlate more strongly with both violence and the number of top women players.

The next block is the setup script that loads the data and sets up this notebook’s utilities.

chooseCRANmirror(ind = 52)
# EDA & Kaggle auth packages
install.packages(c("summarytools", "explore", "dataMaid", "devtools", "configr", "rsconnect", "dplyr", "readr"))
devtools::install_github("ldurazo/kaggler")

library(dplyr)
library(summarytools)
library(explore)
library(dataMaid)
library(configr)
library(readr)
library(rsconnect)
library(kaggler)

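# Make sure the download directory exists, then authenticate and download the files
dir.create("data", showWarnings = FALSE)
kgl_auth(creds_file = 'kaggle.json')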

response_violence <- kgl_datasets_download_all(owner_dataset = "andrewmvd/violence-against-women-and-girls")
download.file(response_violence[["url"]], "data/violence_temp.zip", mode = "wb")
unzipResult <- unzip("data/violence_temp.zip", exdir = "data/", overwrite = TRUE)
violence_data <- read_csv("data/makeovermonday-2020w10/violence_data.csv")

response_chessplayers <- kgl_datasets_download_all(owner_dataset = "vikasojha98/top-women-chess-players")
download.file(response_chessplayers[["url"]], "data/chess_temp.zip", mode = "wb")
unzipResult <- unzip("data/chess_temp.zip", exdir = "data/", overwrite = TRUE)
chess_data <- read_csv("data/top_women_chess_players_aug_2020.csv")

With these two files loaded, we can now generate a summary of each data set. Note that the summaries are generated HTML files, available only if you run this notebook.

Important note: the two data sets downloaded are not a one-to-one match. Many countries that appear in the violence data are missing from the data about top women chess players, and vice versa. This exercise is an attempt to match them into at least a mildly interesting, if not meaningful, peek into how the two sets relate to each other.

Sources linked below:
https://www.kaggle.com/andrewmvd/violence-against-women-and-girls

https://www.kaggle.com/vikasojha98/top-women-chess-players

Alternatively, the explore package presents interesting results in a Shiny app; run the following statements if you want to see the data.

dfSummary(violence_data, file = "data/violence_data_summary.html")
dfSummary(chess_data, file = "data/chess_data_summary.html")

explore(chess_data)
explore(violence_data)

We will need a file that maps the ISO-3166 alpha-3 country codes used in the chess data to the country names used in the violence data.

download.file("https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv", "data/iso-3166")
countries_mapping <- read_csv("data/iso-3166")
countries_mapping <- setNames(select(countries_mapping, "name", "alpha-3"), c("name", "code"))
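To make the role of this mapping table concrete, here is a minimal sketch that uses a hand-made three-row stand-in (toy values, not the downloaded file) and looks up the alpha-3 code for a few country names. The names `toy_mapping`, `toy_countries` and `toy_codes` are illustrative only.

```r
# Toy stand-in for the ISO-3166 mapping table (the real one comes from the CSV above).
toy_mapping <- data.frame(
  name = c("Mexico", "India", "Peru"),
  code = c("MEX", "IND", "PER"),
  stringsAsFactors = FALSE
)

# Look up codes by country name; names absent from the mapping come back as NA.
toy_countries <- c("India", "Peru", "Atlantis")
toy_codes <- toy_mapping$code[match(toy_countries, toy_mapping$name)]
print(toy_codes)
```

Any name that is spelled differently in the two sources behaves like "Atlantis" here and ends up without a code.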

Let’s clean up our data by removing the NA values and transforming the percentage to a number.

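# Note: na.omit() on a data frame ignores the extra "Value" argument and drops rows with an NA in any column.
violence_data <- na.omit(violence_data, "Value")
violence_data$Value <- as.numeric(sub("%", "", violence_data$Value))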
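As a small illustration of this cleaning step, the sketch below filters on the `Value` column only and then converts percentage strings to numbers. The data frame is invented for the example, and filtering with `is.na()` on a single column is an alternative to the `na.omit()` call used in the notebook, not what the notebook itself does.

```r
# Toy values: one percentage string, one plain number stored as text, one missing.
toy_violence <- data.frame(
  Country = c("A", "B", "C"),
  Value = c("17.3%", "4", NA),
  stringsAsFactors = FALSE
)

# Keep only rows where Value is present, then strip "%" and convert to numeric.
toy_clean <- toy_violence[!is.na(toy_violence$Value), ]
toy_clean$Value <- as.numeric(sub("%", "", toy_clean$Value, fixed = TRUE))
print(toy_clean)
```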

Now we want to aggregate our first data set, on violence against women, into a single “violence score” per country. We weight each answer by the respondent’s gender (0.7 for women, 0.3 for men) and then average the weighted values per country. This is a very subjective choice, and there are better techniques for building such an index, but for the sake of the exercise we will use the following score.

violence_data$WeightedValue <- ifelse(violence_data$Gender == "F", violence_data$Value * 0.7, violence_data$Value * 0.3)
violence_data_slim <- select(violence_data, "Country", "WeightedValue")
violence_data_slim_grouped <- setNames(aggregate(violence_data_slim$WeightedValue, by = list(violence_data_slim$Country), FUN = mean), c("Country", "Score"))
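To see what this score does, here is a worked example on invented numbers, using the same weights and the same `aggregate()` call as above. The `toy` data frame is hypothetical.

```r
# Toy survey answers (percent agreeing), invented for illustration.
toy <- data.frame(
  Country = c("X", "X", "X", "X", "Y", "Y"),
  Gender  = c("F", "F", "M", "M", "F", "M"),
  Value   = c(40, 20, 10, 30, 10, 10)
)

# Scale women's answers by 0.7 and men's by 0.3.
toy$WeightedValue <- ifelse(toy$Gender == "F", toy$Value * 0.7, toy$Value * 0.3)

# Plain mean of the scaled values per country.
# X: (28 + 14 + 3 + 9) / 4 = 13.5, Y: (7 + 3) / 2 = 5
toy_score <- setNames(
  aggregate(toy$WeightedValue, by = list(toy$Country), FUN = mean),
  c("Country", "Score")
)
print(toy_score)
```

Because the final step is a plain mean of scaled values, the score also depends on how many female and male rows each country has, which is one reason to treat it as a rough index only.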

With the score per country done, we need to do similar work with the chess players data frame.

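# Note: na.omit() ignores the column names passed here and drops rows with an NA in any column.
chess_data <- na.omit(chess_data, "Standard_Rating", "Rapid_rating", "Blitz_rating")
chess_data_slim <- select(chess_data, "Federation", "Standard_Rating", "Rapid_rating", "Blitz_rating")
# Grouping by chess_data_slim$Federation names the key column literally `chess_data_slim$Federation`;
# the join further down relies on that name.
chess_data_slim_grouped <- chess_data_slim %>%
  group_by(chess_data_slim$Federation) %>%
  summarise(across(ends_with("rating"), list(mean = mean, n = length, max = max, min = min)))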
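The column names produced by `across()` matter later, when we compute correlations. The sketch below runs the same summary on a three-row invented table so the resulting names are easy to inspect. It groups by the bare column name `Federation`, which is a small deviation from the code above and leaves the key column simply named `Federation`.

```r
library(dplyr)

# Toy ratings for three hypothetical players.
toy_chess <- data.frame(
  Federation      = c("IND", "IND", "PER"),
  Standard_Rating = c(2500, 2300, 2100),
  Rapid_rating    = c(2450, 2250, 2050),
  Blitz_rating    = c(2400, 2200, 2000)
)

# ends_with() ignores case by default, so it matches both "_Rating" and "_rating".
toy_grouped <- toy_chess %>%
  group_by(Federation) %>%
  summarise(across(ends_with("rating"), list(mean = mean, n = length, max = max, min = min)))

print(names(toy_grouped))
```

Each rating column yields four summary columns named with the pattern `<column>_<function>`, for example `Standard_Rating_mean` and `Blitz_rating_n`.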

Now, we need to join the tables with the countries table in order to finally obtain a single dataset.

violence_df <- left_join(violence_data_slim_grouped, countries_mapping, by = c("Country" = "name"))
violence_df %>% arrange(!is.na(violence_df$code))

Notice that there are a few exceptions where the mapping did not match; in this instance we will fix them by hand:
- Bolivia
- Congo Democratic Republic
- Cote d'Ivoire
- Kyrgyz Republic
- Moldova
- Tanzania

violence_df <- within(violence_df, code[Country == "Bolivia"] <- "BOL")
violence_df <- within(violence_df, code[Country == "Congo Democratic Republic"] <- "COD")
violence_df <- within(violence_df, code[Country == "Cote d'Ivoire"] <- "CIV")
violence_df <- within(violence_df, code[Country == "Kyrgyz Republic"] <- "KGZ")
violence_df <- within(violence_df, code[Country == "Moldova"] <- "MDA")
violence_df <- within(violence_df, code[Country == "Tanzania"] <- "TZA")
violence_df %>% arrange(!is.na(violence_df$code))
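The same find-and-patch routine can be sketched more compactly with a named lookup vector. Everything below is toy data: the country names in `toy_mapping` are chosen only to mimic the kind of spelling mismatch seen above, and `manual_codes` is a hypothetical helper.

```r
library(dplyr)

# Toy scores and a toy mapping whose names do not all match.
toy_scores <- data.frame(
  Country = c("India", "Bolivia", "Moldova"),
  Score = c(10, 12, 8),
  stringsAsFactors = FALSE
)
toy_mapping <- data.frame(
  name = c("India", "Bolivia (Plurinational State of)", "Moldova, Republic of"),
  code = c("IND", "BOL", "MDA"),
  stringsAsFactors = FALSE
)

toy_joined <- left_join(toy_scores, toy_mapping, by = c("Country" = "name"))

# Countries that did not get a code from the join.
unmatched <- toy_joined$Country[is.na(toy_joined$code)]
print(unmatched)

# Patch the missing codes from a hand-written lookup vector.
manual_codes <- c("Bolivia" = "BOL", "Moldova" = "MDA")
needs_fix <- is.na(toy_joined$code) & toy_joined$Country %in% names(manual_codes)
toy_joined$code[needs_fix] <- manual_codes[toy_joined$Country[needs_fix]]
print(toy_joined)
```

Keeping the manual fixes in one vector makes it easier to review them when the source data changes.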

Now, assuming the FIDE and ISO-3166 codes are the same, let’s see what the joined data looks like. (This assumption does not hold for every federation: FIDE uses GER for Germany, for example, while ISO 3166 uses DEU.) Because the countries that have top women chess players may not overlap with the countries surveyed in the violence data set, I expect many of these missed intersections to show up as NA values. For this analysis we will start from the violence score aggregation and see which of those countries have top chess players, rather than starting from all FIDE federations and ignoring the violence score for countries that have no chess players.

merged_df <- left_join(violence_df, chess_data_slim_grouped, by = c("code" = "chess_data_slim$Federation"))
merged_df %>% arrange(desc(merged_df$Score))
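One way to check how far the "same codes" assumption holds is to list the federation codes that have no ISO counterpart. The sketch below does this on three toy codes; on the real data the two vectors would be the distinct values of the chess `Federation` column and the `code` column of the mapping table.

```r
# Toy vectors: FIDE federation codes and ISO 3166-1 alpha-3 codes for the same three countries.
fide_codes <- c("GER", "IND", "NED")
iso_codes  <- c("DEU", "IND", "NLD")

# Federation codes that would silently fail to join on the ISO code.
not_in_iso <- setdiff(fide_codes, iso_codes)
print(not_in_iso)
```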

The generated output is very small because the intersection between surveyed countries and countries with top women chess players is very small, but we will still look for a correlation between the violence score and both the rating and the number of players.

But first, a little data visualization.

Now, let’s see the correlation values.

merged_df_no_na <- na.omit(merged_df)
print(cor(merged_df_no_na$Score, merged_df_no_na$Standard_Rating_n))
print(cor(merged_df_no_na$Score, merged_df_no_na$Standard_Rating_mean))
print(cor(merged_df_no_na$Score, merged_df_no_na$Rapid_rating_mean))
print(cor(merged_df_no_na$Score, merged_df_no_na$Blitz_rating_mean))

Output:

[1] 0.09331977
[1] -0.3560483
[1] -0.507026
[1] -0.6013585

The previous results, together with a peek into the aggregated data, suggest two observations:

- 1) Because of the tiny size of the sample, the correlation between the violence score and the number of players (about 0.09) is meaningless.
- 2) There is a weak-to-moderate negative correlation between the violence score and the mean rating of players in all three categories (about -0.36 for standard, -0.51 for rapid and -0.60 for blitz). That is, as the violence score increases, the players of that country tend to have lower ratings, although with so few countries this association should not be read as a causal effect.
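With a sample this small, it can help to look at the uncertainty around a correlation and not only at the point estimate. The sketch below uses base R’s `cor.test()` on six invented countries; the numbers are made up for illustration and are not taken from `merged_df`.

```r
# Toy data: six hypothetical countries (values invented for illustration only).
toy_df <- data.frame(
  Score = c(5, 8, 10, 12, 15, 20),
  Blitz_rating_mean = c(2100, 2050, 2080, 1990, 2010, 1950)
)

# Pearson correlation, as in the article.
r <- cor(toy_df$Score, toy_df$Blitz_rating_mean)

# cor.test() adds a 95% confidence interval and a p-value for the same estimate.
test <- cor.test(toy_df$Score, toy_df$Blitz_rating_mean)

print(r)
print(test$conf.int)
print(test$p.value)
```

On the real data the same call would be `cor.test(merged_df_no_na$Score, merged_df_no_na$Blitz_rating_mean)`; a wide interval would support reading the correlations above with caution.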

Now we will create a dataset out of our aggregated data.

write.csv(merged_df,"data/violence_chess_ds.csv", row.names = TRUE)
write.csv(chess_data_slim_grouped,"data/chess_aggregate_ds.csv", row.names = TRUE)

Before wrapping up, let’s create a data report of our dataset.

makeDataReport(merged_df,
               render = FALSE,
               file = "codebook.Rmd",
               codebook = TRUE,
               replace = TRUE,
               reportTitle = "Violence index and women chess players across the world")

External resources:

You can find the project repository with helpful outputs and other instructions here:

ldurazo/mcd-EDA

Please refer to main.rmd file to see all instructions to play and see the exploratory data analysis. A pdf version has…

github.com

And also a dashboard containing tools to visualize some of the data and aggregates, here:

https://ldurazo.shinyapps.io/mcd-eda/

Written by Luis Durazo