Let's tidy the Tuesday: A look at the GDPR violations

Philgee
Apr 30, 2020

Yay, first post! To kick things off, let's start with a recent TidyTuesday dataset about GDPR violations. As someone who is, theoretically, a licensed GDPR expert, this should be right up my alley, though I have to admit it is not the most fascinating topic for me. But whatever, it's more about playing around with data!

In principle I could use the tidytuesdayR package, but as someone leaning a little towards the suckless ideology, I prefer to download the data directly.
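For completeness, the package route would look roughly like this (a sketch, assuming the tidytuesdayR API; not what I actually ran below):

# Hypothetical alternative (not used here): let tidytuesdayR fetch
# this week's data in one call.
tt <- tidytuesdayR::tt_load("2020-04-21")
gdpr_violations <- tt$gdpr_violations
gdpr_text <- tt$gdpr_text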

library(tidyverse)
library(lubridate)
library(DT)
# Get the Data
gdpr_violations <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv')
gdpr_text <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_text.tsv')

Let's look at the data frames. Peeking at the data, I saw that one column contains only the country flags in SVG format. We have the names of the countries, so that information is redundant, at least for this analysis:

gdpr_violations %>% select(-picture)
## # A tibble: 250 x 10
## id name price authority date controller article_violated type source
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Pola… 9380 Polish N… 10/1… Polish Ma… Art. 28 GDPR Non-… https…
## 2 2 Roma… 2500 Romanian… 10/1… UTTIS IND… Art. 12 GDPR|Ar… Info… https…
## 3 3 Spain 60000 Spanish … 10/1… Xfera Mov… Art. 5 GDPR|Art… Non-… https…
## 4 4 Spain 8000 Spanish … 10/1… Iberdrola… Art. 31 GDPR Fail… https…
## 5 5 Roma… 150000 Romanian… 10/0… Raiffeise… Art. 32 GDPR Fail… https…
## 6 6 Roma… 20000 Romanian… 10/0… Vreau Cre… Art. 32 GDPR|Ar… Fail… https…
## 7 7 Gree… 200000 Hellenic… 10/0… Telecommu… Art. 5 (1) c) G… Fail… https…
## 8 8 Gree… 200000 Hellenic… 10/0… Telecommu… Art. 21 (3) GDP… Fail… https…
## 9 9 Spain 30000 Spanish … 10/0… Vueling A… Art. 5 GDPR|Art… Non-… https…
## 10 10 Roma… 9000 Romanian… 09/2… Inteligo … Art. 5 (1) a) G… Non-… https…
## # … with 240 more rows, and 1 more variable: summary <chr>

Ok, we see that the violations against the GDPR are listed for several nations. A first, quite simple thing to do is to visualize how many violations the different countries have:

number_countries <- gdpr_violations %>%
  mutate(date = as_date(date, format = "%m/%d/%Y", tz = "UTC")) %>%
  group_by(name) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  mutate(name = factor(name, levels = name))  # fix the ordering for the plot
number_countries %>% ggplot(aes(name, n, fill = name)) +
  geom_col(show.legend = F) +
  labs(x = "", y = "Number of GDPR violations") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1.1, vjust = 1.2))

Ok, we already learn something: most of the cases are in Spain. Other things we might look at:

  • When did these violations happen? Is there a peak time?
  • What kind of violations were committed most often? And where?

When did the violations happen?

When we try to plot the number of violations over time, however, we see something strange:

gdpr_violations %>%
  mutate(date = as_date(date, format = "%m/%d/%Y", tz = "UTC")) %>%
  group_by(date) %>%
  summarize(n = n()) %>%
  arrange(date) %>%
  ggplot(aes(date, n)) +
  geom_point() +
  labs(x = "", y = "GDPR incidents") +
  theme_minimal()
Now that's ugly data…

A look inside the data reveals that in 15 cases the date was set to 1970-01-01:

gdpr_violations %>%
  mutate(date = as_date(date, format = "%m/%d/%Y", tz = "UTC")) %>%
  group_by(date) %>%
  summarize(n = n()) %>%
  arrange(date)
## # A tibble: 140 x 2
## date n
## <date> <int>
## 1 1970-01-01 15
## 2 2018-05-12 1
## 3 2018-07-17 1
## 4 2018-09-27 1
## 5 2018-10-25 1
## 6 2018-11-01 1
## 7 2018-11-21 1
## 8 2018-12-01 1
## 9 2018-12-12 1
## 10 2018-12-17 1
## # … with 130 more rows

Since 1970-01-01 is the start of Unix time, it can be assumed that the dates were simply not filled in here. To confirm that, let's examine these 15 cases a little further.
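As a quick sanity check on the epoch claim, day zero of Unix time is exactly this date:

# Day 0 of Unix time: values coerced from 0 or from missing input
# tend to land on the epoch.
as_date(0)
## [1] "1970-01-01"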

What is quite lucky for us is that the data set contains links to sources discussing each GDPR incident:

gdpr_violations %>%
  mutate(date = as_date(date, format = "%m/%d/%Y", tz = "UTC")) %>%
  filter(date == as_date("1970-01-01")) %>%
  select(date, source)
## # A tibble: 15 x 2
## date source
## <date> <chr>
## 1 1970-01-01 https://www.aepd.es/resoluciones/PS-00331-2018_ORI.pdf
## 2 1970-01-01 https://www.aepd.es/resoluciones/PS-00121-2019_ORI.pdf
## 3 1970-01-01 https://www.aepd.es/resoluciones/PS-00411-2018_ORI.pdf
## 4 1970-01-01 https://www.aepd.es/resoluciones/PS-00074-2019_ORI.pdf
## 5 1970-01-01 https://theword.iuslaboris.com/hrlaw/insights/spain-video-surveil…
## 6 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava…
## 7 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava…
## 8 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava…
## 9 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava…
## 10 1970-01-01 https://www.etrend.sk/ekonomika/gdpr-zacina-hryzt-telekomunikacny…
## 11 1970-01-01 https://www.pingdigital.de/blog/2019/03/29/implodierende-aufsicht…
## 12 1970-01-01 https://indd.adobe.com/view/d639298c-3165-4e30-85d8-0730de2a3598
## 13 1970-01-01 https://www.uoou.cz/kontrola-zpracovani-osobnich-udaju-bankou-uni…
## 14 1970-01-01 https://www.uoou.cz/kontrola-zpracovani-osobnich-udaju-po-odvolan…
## 15 1970-01-01 https://www.uoou.cz/kontrola-zabezpeceni-osobnich-udaju-pri-provo

And what is even better: some of the source URLs contain explicit information about the date:

gdpr_violations %>%
  mutate(date = as_date(date, format = "%m/%d/%Y", tz = "UTC")) %>%
  filter(date == as_date("1970-01-01")) %>%
  select(date, source) %>%
  filter(grepl("201", source))
## # A tibble: 9 x 2
## date source
## <date> <chr>
## 1 1970-01-01 https://www.aepd.es/resoluciones/PS-00331-2018_ORI.pdf
## 2 1970-01-01 https://www.aepd.es/resoluciones/PS-00121-2019_ORI.pdf
## 3 1970-01-01 https://www.aepd.es/resoluciones/PS-00411-2018_ORI.pdf
## 4 1970-01-01 https://www.aepd.es/resoluciones/PS-00074-2019_ORI.pdf
## 5 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava_…
## 6 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava_…
## 7 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava_…
## 8 1970-01-01 https://www.dataprotection.gov.sk/uoou/sites/default/files/sprava_…
## 9 1970-01-01 https://www.pingdigital.de/blog/2019/03/29/implodierende-aufsichts…

For these cases we are lucky: for some of them, we might derive the correct date from the URL. I could use some regex magic but… to be honest, I fuckin hate regular expressions and am too lazy (not too busy ;)) to get better at them. Seriously: for these few particular cases it would be wasted effort.
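If one did go the regex route, a minimal sketch could look like this; url_year is a hypothetical helper column, and the pattern just grabs the first four-digit year in the URL, so it recovers the year at best:

# Sketch only: pull the first plausible year out of each source URL.
gdpr_violations %>%
  mutate(url_year = str_extract(source, "20[0-9]{2}")) %>%
  select(source, url_year)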

However, what about the other ones? My first idea was to apply a little bit of web scraping (rvest to the rescue!), but some of these links are invalid. For the remaining ones, the less exciting task of finding the date manually has to be done (if somebody knows of an alternative way, please let me know!). And even there we run into problems: the sources don't always reveal a specific date.
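For the record, the scraping idea might have looked something like this rough sketch; get_page_date is my own hypothetical helper, and the date pattern is just an example:

library(rvest)
# Rough sketch of the abandoned scraping idea: fetch a source page and
# grep its text for a date like 29.03.2019. Dead links return NA.
get_page_date <- function(url) {
  txt <- tryCatch(html_text(read_html(url)),
                  error = function(e) NA_character_)
  str_extract(txt, "\\d{1,2}\\.\\d{1,2}\\.\\d{4}")
}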

So my conclusion was to fix what I could and exclude the rest. With the dates recovered from the source URLs, the number of cases with an incorrect date shrinks to 8, and those get dropped.

gdpr_violations_cleaned <- gdpr_violations %>%
  mutate(date = as_date(date, format = "%m/%d/%Y", tz = "UTC")) %>%
  mutate(wrong_date = date == as_date("1970-01-01")) %>%
  # dates recovered manually from the source URLs:
  mutate(date = if_else(grepl("25.maj_2018", source) & wrong_date, as_date("2018-05-25"), date)) %>%
  mutate(date = if_else(grepl("2019/03/29", source) & wrong_date, as_date("2019-03-29"), date)) %>%
  mutate(date = if_else(grepl("https://theword.iuslaboris.com/hrlaw/insights/spain-video-surveillance-and-data-protection-in-the-workplace", source) & wrong_date, as_date("2019-09-20"), date)) %>%
  mutate(date = if_else(grepl("https://www.etrend.sk/ekonomika/gdpr-zacina-hryzt-telekomunikacny-operator-dostal-pokutu-40-tisic-eur.html", source) & wrong_date, as_date("2019-09-27"), date)) %>%
  # drop the remaining cases that still sit on the epoch
  filter(date > as_date("1970-01-01"))
gdpr_violations_cleaned %>%
  mutate(monthyear = floor_date(date, unit = "months")) %>%
  group_by(monthyear) %>%
  summarize(n = n()) %>%
  ggplot(aes(monthyear, n)) +
  geom_point() +
  labs(x = "", y = "GDPR incidents", title = "Development of GDPR violations over time") +
  theme_minimal()

Ok, we see that beginning in 2018 the monthly number of GDPR incidents has risen significantly.

One last thing about the outliers: the countries with the most broken dates are Spain and Slovakia, with Germany and the Czech Republic close behind:

gdpr_violations %>%
  mutate(date = as_date(date, format = "%m/%d/%Y", tz = "UTC")) %>%
  filter(date == as_date("1970-01-01")) %>%
  group_by(name) %>%
  summarise(n = n()) %>%
  ggplot(aes(name, n, fill = name)) +
  geom_col(show.legend = F) +
  labs(x = "", y = "Outliers", title = "Occurrences of unrealistic dates") +
  theme_minimal()

What kind of violations happened?

So much for the question of when, and the who. What about the "what"? Let's see what kind of incidents happened:

gdpr_violations_cleaned %>% select(article_violated)
## # A tibble: 242 x 1
## article_violated
## <chr>
## 1 Art. 28 GDPR
## 2 Art. 12 GDPR|Art. 13 GDPR|Art. 5 (1) c) GDPR|Art. 6 GDPR
## 3 Art. 5 GDPR|Art. 6 GDPR
## 4 Art. 31 GDPR
## 5 Art. 32 GDPR
## 6 Art. 32 GDPR|Art. 33 GDPR
## 7 Art. 5 (1) c) GDPR|Art. 25 GDPR
## 8 Art. 21 (3) GDPR|Art. 25 GDPR
## 9 Art. 5 GDPR|Art. 6 GDPR
## 10 Art. 5 (1) a) GDPR|Art. 6 (1) a) GDPR
## # … with 232 more rows

Ok, now it gets a little messy. We see that a lot of incidents violated several articles, which are separated by a pipe. My goal now is to tidy this data set up: each article should get its own row. Furthermore, we are only interested in the article number itself.

Again, regular expressions are my nemesis (kinda ironic for a vim/vis user, lol), so please excuse the horrible way this tidying is set up; I'll sketch a tidier alternative after the result. But in the end, it's the result that counts:

gdpr_articles <- gdpr_violations_cleaned %>%
  select(date, name, article_violated) %>%
  # split on the literal "Art" so each violated article gets its own column
  separate(article_violated, sep = "Art", into = c("A1", "A2", "A3", "A4", "A5", "A6", "A7")) %>%
  # then stack those columns into one long article column
  pivot_longer(-c(name, date), names_to = "names", values_to = "GDPR_article", values_drop_na = T) %>%
  select(date, name, GDPR_article) %>%
  filter(GDPR_article != "") %>%
  # strip everything except the article number itself
  mutate(GDPR_article = gsub(" GDPR*", "", GDPR_article)) %>%
  mutate(GDPR_article = gsub("\\. ", "", GDPR_article)) %>%
  mutate(GDPR_article = gsub(" \\(*.*", "", GDPR_article)) %>%
  mutate(GDPR_article = gsub("\\|", "", GDPR_article)) %>%
  mutate(GDPR_article = gsub("\\.", "", GDPR_article)) %>%
  mutate(GDPR_article = gsub("\\(.*.", "", GDPR_article)) %>%
  # leftover words mean the article is unknown
  mutate(GDPR_article = ifelse(GDPR_article %in% c("Data", "Failure", "Unknown"), "", GDPR_article)) %>%
  mutate(GDPR_article = ifelse(str_length(GDPR_article) == 0, "Unknown", GDPR_article)) %>%
  # zero-pad single digits so the articles sort correctly
  mutate(GDPR_article = ifelse(str_length(GDPR_article) == 1, paste0("0", GDPR_article), GDPR_article))
## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 242 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
gdpr_articles
## # A tibble: 385 x 3
## date name GDPR_article
## <date> <chr> <chr>
## 1 2019-10-18 Poland 28
## 2 2019-10-17 Romania 12
## 3 2019-10-17 Romania 13
## 4 2019-10-17 Romania 05
## 5 2019-10-17 Romania 06
## 6 2019-10-16 Spain 05
## 7 2019-10-16 Spain 06
## 8 2019-10-16 Spain 31
## 9 2019-10-09 Romania 32
## 10 2019-10-09 Romania 32
## # … with 375 more rows
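By the way, had I made peace with regular expressions, a more compact route could use tidyr::separate_rows plus a single stringr extraction. This is only a sketch, not what produced the table above; non-matching entries simply become "Unknown":

# Sketch of a tidier alternative: one row per violated article,
# extracting just the number after "Art." and zero-padding it.
gdpr_violations_cleaned %>%
  select(date, name, article_violated) %>%
  separate_rows(article_violated, sep = "\\|") %>%
  mutate(GDPR_article = str_extract(article_violated, "(?<=Art\\. )\\d+")) %>%
  mutate(GDPR_article = str_pad(GDPR_article, 2, pad = "0")) %>%
  mutate(GDPR_article = replace_na(GDPR_article, "Unknown")) %>%
  select(date, name, GDPR_article)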

Fantastic, now we have all the articles for each country and each date. Note that the variable GDPR_article is of type character, despite holding numbers. That is intentional: the articles are a discrete variable, not a continuous one, and we want to treat them that way.

gdpr_articles %>%
  group_by(GDPR_article) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  mutate(GDPR_article = factor(GDPR_article, levels = GDPR_article)) %>%
  ggplot(aes(GDPR_article, n, fill = GDPR_article)) +
  geom_col(show.legend = F) +
  labs(x = "Art. in GDPR", y = "Violations") +
  theme_minimal()

We see that Articles 5, 6 and 32 make up most of the cases. What exactly are these articles? Here the gdpr_text data frame comes in handy:

gdpr_text %>% filter(article %in% c(5,6,32), sub_article==1) %>% select(article,article_title)
## # A tibble: 3 x 2
## article article_title
## <dbl> <chr>
## 1 5 Principles relating to processing of personal data
## 2 6 Lawfulness of processing
## 3 32 Security of processing

Of course we could dive deeper and look at the sub-articles etc. (a quick taste follows below), but this article is already quite long, so let's look at the where and when and then wrap it up.
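Just as a taste, listing the sub-articles of Art. 5 is a short pipeline; I'm assuming here that the full text sits in a column named gdpr_text, as in the TidyTuesday file:

# A taste of going deeper: the sub-articles of Art. 5 with a snippet
# of their wording.
gdpr_text %>%
  filter(article == 5) %>%
  select(sub_article, gdpr_text) %>%
  mutate(gdpr_text = str_trunc(gdpr_text, 60))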

Favourite crimes for different countries

Let's look at which violations are the favourites in the different countries:

gdpr_articles %>%
  group_by(name, GDPR_article) %>%
  summarise(n = n()) %>%
  arrange(name, desc(n)) %>%
  slice(1) %>%  # most violated article per country
  arrange(desc(n)) %>%
  ungroup() %>%
  slice(1:7) %>%
  mutate(name = factor(name, levels = name)) %>%
  ggplot(aes(name, n, fill = GDPR_article)) +
  geom_col() +
  labs(x = "", y = "Incidents", fill = "Art. GDPR", title = "Most frequent violation for the top 7 countries") +
  theme_minimal()

As expected, Articles 5, 6 and 32 of the GDPR are the ones that take the cake. What's quite interesting is that most of the violations of Art. 32 GDPR stem from Romania.

Temporal development of GDPR cases

And now, finally, let's look at how the number of different GDPR violations developed over time.

gdpr_articles %>%
  mutate(monthyear = floor_date(date, unit = "months")) %>%
  group_by(monthyear, GDPR_article) %>%
  summarise(n = n()) %>%
  ggplot(aes(monthyear, n, fill = GDPR_article)) +
  geom_col(position = "fill") +
  labs(x = "", y = "Relative share", fill = "GDPR article", title = "Share of GDPR violations over time") +
  theme_minimal()

What we learned

In this article we learned the following:

  • Spain is the country with the most reported GDPR violations.
  • The findings peaked in 2019, but the general upward trend is not yet broken.
  • Most violations are against Art. 5 GDPR.

The last point is easy to explain, as Art. 5 is the most general article. That GDPR violations are still on the rise is also plausible: the GDPR is a new thing and it takes a while for companies to adapt. Why Spain leads the pack is unknown to me; it could be that the authorities there are stricter than in other countries.

So much for this small excursion through the TidyTuesday data on GDPR violations. I hope it was as exciting for you as it was for me!
