Where and what do you eat wrong?

Mohit Singh
Published in datascape
5 min read · Feb 16, 2017

A poster, built in R, that visualizes foodborne disease outbreaks around the world from 1927 to 2016

The main questions this analysis sets out to answer:

Which countries have had the most foodborne disease outbreaks since 1927?

Where were people eating when it happened: at school, in a restaurant, on an airplane, etc.?

What food caused it?

Which bacteria were responsible?

Where do people eat in major US states when they catch foodborne diseases?

We all eat out these days, not only at restaurants but also at hotels, on airplanes, and in hospitals and schools.

Well, I am a big-time foodie myself. Sometimes we just can't avoid eating out, and even when we “think” we are eating healthy, we get sick and wonder why.

I also wanted to visualize the data so that the focus narrows step by step, to get deeper insight into the reasons for foodborne disease outbreaks around the WORLD. So I start at the worldwide level: in which country was the case reported, where did people eat, what did they eat, and what bacteria were responsible?

Then I also wanted to see what the trend looks like in major US states. Where do people eat when they get sick most often? Some of the results surprised me.

First, using the worldwide foodborne disease data, I visualized which countries report the most cases and where those cases usually occur.

Plot showing which countries worldwide have the most reported cases and where they occurred

I love the alluvial plot here, since it shows the flow from the countries with the most reported foodborne disease cases to the locations where those cases occurred. The thickness of each line reflects the number of cases.
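To make the "thickness equals number of cases" idea concrete, here is a minimal sketch of how such a flow table can be built and drawn. The data frame `outbreaks` and its numbers are made up for illustration and are not the real dataset; only the `alluvial()` call mirrors the one used for the poster.

```r
# Toy data: one row per outbreak (made-up values, not the real dataset)
outbreaks <- data.frame(
  Country = c("USA", "USA", "USA", "Spain", "Spain", "Japan"),
  Setting = c("Restaurant", "Restaurant", "School", "Airplane", "Airplane", "School"),
  cases   = c(40, 25, 30, 120, 60, 15)
)

# Total cases for each Country -> Setting pair; each row becomes one ribbon
flows <- aggregate(cases ~ Country + Setting, data = outbreaks, FUN = sum)

# Draw only if the alluvial package is installed
if (requireNamespace("alluvial", quietly = TRUE)) {
  alluvial::alluvial(
    flows[, c("Country", "Setting")],
    freq  = flows$cases,  # ribbon thickness is proportional to the case total
    col   = ifelse(flows$Setting == "Airplane", "#ab5796", "#bfd7b5"),
    alpha = 0.6
  )
}
```

Aggregating first keeps one ribbon per country and setting pair, which is easier to read than one thin ribbon per outbreak.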

As we can see, Spain has the most cases reported on airplanes.

So, avoid eating on a flight to Spain (haha).

NEXT STEP: Extend this relationship to the food that caused it.

Plot to represent what food caused the disease at a particular location

I then dug deeper to identify which bacteria in each food caused the disease, and represented that with a bubble plot.
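The code at the end of this post sizes each bubble with `sqrt(sqrt(cases / pi))`. The sketch below, on made-up case counts, compares that with the plain area-proportional radius `sqrt(cases / pi)`. The second square root compresses the range, so very large outbreaks do not swamp the small ones. The organism and food names here are placeholders, not values from the dataset.

```r
# Made-up case counts spanning two orders of magnitude
cases <- c(10, 100, 1000)

# Radius that makes the bubble AREA proportional to the case count
radius_area <- sqrt(cases / pi)

# Radius used in this post: a second square root compresses large values
radius_compressed <- sqrt(sqrt(cases / pi))

# Relative sizes compared with the smallest bubble
round(radius_area / radius_area[1], 2)              # ratios: 1, 3.16, 10
round(radius_compressed / radius_compressed[1], 2)  # ratios: 1, 1.78, 3.16

# Optional plot, drawn only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  toy <- data.frame(
    Microorganism = c("Organism A", "Organism B", "Organism C"),
    Vehicle       = c("Rice", "Fish", "Rice"),
    radius        = radius_compressed
  )
  p <- ggplot2::ggplot(toy, ggplot2::aes(Microorganism, Vehicle)) +
    ggplot2::geom_point(ggplot2::aes(size = radius * 7.5), shape = 21, fill = "#e3a75d") +
    ggplot2::scale_size_identity()
  print(p)
}
```

`scale_size_identity()` tells ggplot2 to use the computed value directly as the point size instead of rescaling it.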

NEXT: Determine where the most cases were reported in major US states from 1998 to 2015. I ran my analysis on around 12 US states, then analyzed the top 5 places where people ate, with the most cases reported in Michigan, California, New York and Texas.

In Michigan, the most cases were reported from eating in schools; in Texas, at the workplace; in New York, at home; and in California, at restaurants.

I was also surprised that California and Michigan had a lot of cases reported in nursing homes or religious facilities.
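The "top places per state" step can be sketched in base R as follows. This is only an illustration under assumptions: the data frame `fd_toy` and the helper `top_locations()` are hypothetical, the numbers are made up, and the column names `State`, `loc` and `Illnesses` simply mirror the ones used in the code at the end of this post.

```r
# Toy records (made-up numbers), one row per outbreak
fd_toy <- data.frame(
  State     = c("California", "California", "California", "Texas", "Texas", "Texas"),
  loc       = c("Restaurant", "Restaurant", "Nursing home", "Workplace", "Restaurant", "Workplace"),
  Illnesses = c(30, 20, 25, 40, 10, 15)
)

# Total illnesses for every state and location pair
totals <- aggregate(Illnesses ~ State + loc, data = fd_toy, FUN = sum)

# Hypothetical helper: the n locations with the most illnesses in one state
top_locations <- function(totals, state, n = 5) {
  one_state <- totals[totals$State == state, ]
  head(one_state[order(-one_state$Illnesses), ], n)
}

top_ca <- top_locations(totals, "California")
top_tx <- top_locations(totals, "Texas", n = 1)
```

Using `head()` instead of indexing with `1:5` avoids rows of `NA` when a state has fewer than five locations.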

I used radar plots (also called spider plots) to represent the weight of cases reported and the number of deaths (inner dotted line) at the top 6 locations in each state where cases were reported.
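By default, `radarchart()` from fmsb expects the first two rows of the data frame to hold the maximum and minimum of every axis. Here is a minimal sketch of that layout. The counts are made up, and `counts`, `upper` and `radar_df` are names chosen only for this example.

```r
# Made-up counts: one column per location, one row per measure
counts <- data.frame(
  Restaurant = c(120, 30, 2),
  School     = c(80, 10, 0),
  Home       = c(60, 20, 1),
  Workplace  = c(40, 5, 0),
  Hospital   = c(20, 15, 3),
  row.names  = c("Illnesses", "Hospitalizations", "Deaths")
)

# Row 1 = axis maximum, row 2 = axis minimum, then the actual data
upper <- max(counts)
radar_df <- rbind(rep(upper, ncol(counts)), rep(0, ncol(counts)), counts)

# Draw only if the fmsb package is installed
if (requireNamespace("fmsb", quietly = TRUE)) {
  fmsb::radarchart(
    radar_df,
    axistype    = 1,
    caxislabels = seq(0, upper, length.out = 5),  # labels match the 0..upper scale
    plty        = c(1, 2, 3)                      # a different line type per measure
  )
}
```

Using the same maximum on every axis keeps the locations comparable. Passing `maxmin = FALSE` instead lets fmsb compute the range of each axis from the data.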

The code is included below for reference.

R Code for the Visualizations of the poster:

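# Packages used for the poster
library(alluvial)      # alluvial (flow) diagrams
library(sqldf)         # SQL queries on data frames
library(RColorBrewer)
library(ggplot2)       # bubble grid
library(fmsb)          # radar charts
library(reshape2)      # dcast()

# fd is used for the US radar plots, fdo for the worldwide alluvial and bubble plots
fd <- read.csv("FoodData.csv")
str(fd)
fdo <- read.csv("fd_outbreak.csv")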
# Grid bubble plot: microorganism vs. food vehicle
colnames(fdo)[10] <- "cases1"
colnames(fdo)[2] <- "Vehicle"
# Total cases per microorganism. With SELECT *, SQLite fills the other columns
# (including Vehicle) from a single row of each group.
fdo_bpp <- sqldf("SELECT *, SUM(cases1) AS cases FROM fdo GROUP BY Microorganism")
unique(fdo_bpp$Vehicle)
# Drop uninformative categories
fdo_bpp <- fdo_bpp[fdo_bpp$Vehicle != "Unknown", ]
fdo_bpp <- fdo_bpp[fdo_bpp$Vehicle != "Multiple foods", ]
fdo_bpp <- fdo_bpp[fdo_bpp$Microorganism != "Unknown", ]
ll <- c("Salad,chicken", "Custard", "Fish", "Meat", "Pastry", "Pizza", "Rasberries", "Rice", "Salad", "Sandwich", "Shrimp", "Turkey")
# ss <- subset(fdo_bpp, Vehicle == ll)
# fdo_bpp <- subset(fdo_bpp, Vehicle %in% ll)
# View(ss)
# Keep the 20 largest totals and derive a bubble radius from the case count
fdo_bpp <- head(fdo_bpp[order(fdo_bpp$cases, decreasing = TRUE), ], 20)
fdo_bpp$radius <- sqrt(sqrt(fdo_bpp$cases / pi))
# Refer to columns by name inside aes() rather than with fdo_bpp$...
ggplot(fdo_bpp, aes(Microorganism, Vehicle)) +
  geom_point(aes(size = radius * 7.5), shape = 21, fill = "#e3a75d") +
  scale_size_identity() +
  theme(panel.grid.major = element_line(linetype = 5, color = "#787794"),
        panel.background = element_rect(fill = "#19153a"),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))
# Which country did it occur in? Where did it occur? What did they eat there?
# Column 10 keeps the name "cases1" so that the summed column can be called
# "cases" without clashing with a column already returned by SELECT *.
colnames(fdo)[10] <- "cases1"
colnames(fdo)[2] <- "Vehicle"
fdo1 <- sqldf("SELECT *, SUM(cases1) AS cases FROM fdo GROUP BY Country, Vehicle")
fdo1 <- fdo1[fdo1$Vehicle != "Multiple foods", ]
fdo1 <- fdo1[fdo1$Vehicle != "Unknown", ]
fdo1 <- fdo1[fdo1$Setting != "Unknown", ]
# fdo["cases"] <- sqrt(fdo["cases"])
# fdo <- fdo[which(fdo$Country == "USA"
#   | fdo$Country == "Canada" | fdo$Country == "Brazil"), ]

# Keep only countries, settings and vehicles that appear in enough rows
cutoff <- 2
fdo2 <- subset(fdo1, ave(ID., Country, FUN = length) > cutoff)
cutoff2 <- 3
fdo2 <- subset(fdo2, ave(ID., Setting, FUN = length) > cutoff2)
cutoff1 <- 1
fdo2 <- subset(fdo2, ave(ID., Vehicle, FUN = length) > cutoff1)
fdo21 <- fdo2[, "Vehicle"]
unique(fdo21)
vl <- unique(fdo2$Vehicle)
vl
# One ribbon color per setting; any setting not listed gets the fallback color
setting_colors <- c(
  "Airplane" = "#ab5796", "Bakery" = "#513082", "Camp" = "#de826b",
  "Catered function" = "#30ab7c", "Home" = "#b8cc64", "Hotel" = "#5ea381",
  "Restaurant" = "#db5072", "School" = "#bfd7b5"
)
k <- unname(setting_colors[as.character(fdo2$Setting)])
k[is.na(k)] <- "#f9f8f7"

unique(fdo2$Vehicle)
# Country -> Setting flows, then Setting -> Vehicle flows; flows under 20 cases are hidden
alluvial(fdo2[, c("Country", "Setting")], freq = fdo2$cases, col = k, hide = fdo2$cases < 20, alpha = .6, border = FALSE)
alluvial(fdo2[, c("Setting", "Vehicle")], freq = fdo2$cases, col = k, hide = fdo2$cases < 20, alpha = .6)

# Radar plot
# Libraries
library(fmsb)
library(reshape2)
# By default, fmsb expects 2 extra rows at the top of the data frame: the max and
# the min of each axis. With maxmin = FALSE, radarchart() derives them from the data.

# California
colnames(fd)[7] <- "loc"
fd1 <- subset(fd, State == "California")
# Total the counts per location (a plain SELECT * ... GROUP BY loc would keep
# the values of a single row per location instead of the totals)
fd1 <- sqldf("SELECT loc, SUM(Illnesses) AS Illnesses, SUM(Hospitalizations) AS Hospitalizations, SUM(Deaths) AS Deaths FROM fd1 GROUP BY loc")
fd1 <- fd1[order(-fd1$Illnesses), ]
# fd1 <- fd1[row.names(unique(fd1[, c("loc")])), ]

# Top 5 locations by illnesses
fd1 <- fd1[1:5, ]
rownames(fd1) <- fd1$loc
fd2 <- t(fd1)
View(fd2)
fd2 <- as.data.frame(fd2)
include_list <- c("Illnesses", "Hospitalizations", "Deaths")
df3 <- subset(fd2, rownames(fd2) %in% include_list)
df3 <- as.data.frame(df3)
colnames(df3) <- c("a", "b", "c", "d", "e")
# par(mar = c(1, 2, 2, 1)) # decrease default margin
# layout(matrix(1:4, ncol = 2)) # draw 4 plots to device
# max(df3)
# df3 <- rbind(rep(20, 6), rep(0, 6), df3)
# t() turned the values into text, so convert every column back to numbers
df3[] <- lapply(df3, function(x) as.numeric(as.character(x)))
View(df3)
radarchart(df3, maxmin = FALSE)

# Texas
colnames(fd)[7] <- "loc"
fd1 <- subset(fd, State == "Texas")
# Total the counts per location (a plain SELECT * ... GROUP BY loc would keep
# the values of a single row per location instead of the totals)
fd1 <- sqldf("SELECT loc, SUM(Illnesses) AS Illnesses, SUM(Hospitalizations) AS Hospitalizations, SUM(Deaths) AS Deaths FROM fd1 GROUP BY loc")
fd1 <- fd1[order(-fd1$Illnesses), ]
# fd1 <- fd1[row.names(unique(fd1[, c("loc")])), ]

# Top 6 locations by illnesses
fd1 <- fd1[1:6, ]
rownames(fd1) <- fd1$loc
fd2 <- t(fd1)
fd2 <- as.data.frame(fd2)
include_list <- c("Illnesses", "Hospitalizations", "Deaths")
df3 <- subset(fd2, rownames(fd2) %in% include_list)
df3 <- as.data.frame(df3)
colnames(df3) <- c("a", "b", "c", "d", "e", "f")
# par(mar = c(1, 2, 2, 1)) # decrease default margin
# layout(matrix(1:4, ncol = 2)) # draw 4 plots to device
# max(df3)
# df3 <- rbind(rep(20, 6), rep(0, 6), df3)

# t() turned the values into text, so convert every column back to numbers,
# then treat missing counts as 0
df3[] <- lapply(df3, function(x) as.numeric(as.character(x)))
df3[is.na(df3)] <- 0
radarchart(df3, maxmin = FALSE)
dd <- dcast(fd, Illnesses ~ loc, value.var = "Illnesses")
# Customize the radar chart
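colors_in <- c(rgb(0.2, 0.5, 0.5, 0.4), rgb(0.8, 0.2, 0.5, 0.4), rgb(0.7, 0.5, 0.1, 0.4))
# Prepend the max (200) and min (0) rows that radarchart() expects, one value per column
df3 <- rbind(rep(200, ncol(df3)), rep(0, ncol(df3)), df3)
radarchart(df3, axistype = 1,
  # customize the grid; axis labels run from the min (0) to the max (200)
  cglcol = "#787794", cglty = 1, axislabcol = "#787794", caxislabels = seq(0, 200, 50), cglwd = 0.4,

  # customize the polygons
  pcol = colors_in,
  pfcol = colors_in,
  plwd = 3,

  # customize the labels
  vlcex = 0.8
)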

The full code is on my GitHub.

Data Source: www.foodsafety.gov, www.cdc.gov/foodsafety
