Amateur Investigations of The Kaggle Chicago West Nile Virus Dataset
West Nile Virus, far from its discovery in Uganda, has recently become a recurring problem for the unfortunate people of Chicago. The Chicago West Nile dataset on Kaggle, compiled by the Chicago Department of Public Health, allows data scientists of all levels to try their hand at predicting the presence of WNV in mosquitos collected at traps around the city on different dates throughout mosquito season. To accomplish this, modelers fit classifiers on training data (entries chronologically preceding the testing data) that includes the locations of traps, the date of collection, the species of mosquito in the trap, and whether the collected mosquitos were viral hosts. A good classifier identifies salient predictors in the training data and assigns accurate probabilities to entries in the testing set, where the incidence of WNV is unknown. The scoring metric for the Kaggle competition is the area under the ROC curve (AUC). The ROC curve plots the true positive rate of your predictions against their false positive rate as the classification threshold varies. A baseline prediction, one no better than chance, has a true positive rate equal to its false positive rate at every threshold and an area of 0.5.
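To make the 0.5 baseline concrete, here is a minimal sketch of the area under the ROC curve, computed as the share of positive/negative pairs that the scores rank correctly (ties count as half). The function name `roc_auc` and the toy labels and scores are made up for illustration; they are not competition data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = WNV present) and two sets of predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
constant = [0.5] * len(labels)               # says the same thing for every row
perfect = [0.1, 0.2, 0.9, 0.3, 0.8, 0.7]     # every positive outranks every negative

print(roc_auc(labels, constant))
print(roc_auc(labels, perfect))
```

A model that gives every row the same probability lands exactly on the baseline, which is why beating 0.5 is the first bar to clear.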
Being in my fifth week of General Assembly Boston’s Data Science Immersive, and limited to less than a day, I was intent on building a classifier that beat the baseline number. The Kaggle datasets also include Chicago’s weather and spray data for the dates in question, but I initially (and, it turns out, ultimately) ignored those in favor of getting a result first. Sparing you the fits and starts, I built a basic logistic regression that split traps, mosquito species, and blocks into two groups each. The idea with these groups was to signal the likelihood of WNV recurring and correspondingly tamp down the likelihood of first-time incidence. Regarding model selection, logistic regression offered both better performance than k-nearest neighbors, another classifier, and interpretable results.
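As a rough illustration of what "interpretable" means here, the sketch below fits a logistic regression by plain gradient descent on made-up data with two 0/1 group flags. This is a hand-rolled stand-in rather than the library fit I actually ran, and `fit_logistic`, the flags, and the rows are all hypothetical. Because both flags sit on the same 0/1 scale, the fitted weights can be compared directly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Fit weights and a bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            # Prediction error for this row: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - target
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up rows: [informative_flag, uninformative_flag].
# The first flag shifts the positive rate from 1 in 4 to 3 in 4;
# the second flag has no relationship with the label.
X, y = [], []
for informative in (0, 1):
    for uninformative in (0, 1):
        positives = 3 if informative else 1  # positives out of 4 rows per cell
        for k in range(4):
            X.append([informative, uninformative])
            y.append(1 if k < positives else 0)

weights, bias = fit_logistic(X, y)
print(weights, bias)
```

The flag that actually tracks the label ends up with a large positive weight, while the unrelated flag's weight stays near zero. Reading weights this way is the kind of interpretation k-nearest neighbors does not offer.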
With respect to the model features, in the case of the traps and blocks, I simply looked for any past incidence of WNV, grouping those exposed and those not. Most, but not all, traps in Chicago had seen exposure. As for mosquitos, the majority in the testing data were either Culex Pipiens, Culex Pipiens/Restuans, or Culex Restuans. The species categories that include Culex Pipiens yielded 3–4x higher rates of WNV presence than Culex Restuans, and other mosquitos in the sample demonstrated no WNV presence at all. While a quick search reveals that the other Culex types can also be carriers, the virus is most famously associated, at least in the Chicago area, with Culex Pipiens and Culex Restuans. For the regression, I made the assumption that the other types of mosquitos were unlikely carriers, and likewise that Culex Restuans was a markedly less likely carrier of WNV. I grouped the Pipiens mosquitos together and put the other mosquito species in a second group. Lastly, and rounding out the list, I retained the raw latitude and longitude values as predictors.
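The grouping described above can be sketched in a few lines of plain Python. The field names (`Trap`, `Block`, `Species`, `Latitude`, `Longitude`, `WnvPresent`) are assumed to mirror the competition's CSV columns, while `build_features` and the rows themselves are made up for illustration and are not my original code.

```python
def build_features(train_rows, rows):
    """Turn raw rows into [trap_exposed, block_exposed, is_pipiens, lat, lon]."""
    # Traps and blocks with any past WNV incidence in the training data.
    exposed_traps = {r["Trap"] for r in train_rows if r["WnvPresent"] == 1}
    exposed_blocks = {r["Block"] for r in train_rows if r["WnvPresent"] == 1}
    features = []
    for r in rows:
        features.append([
            1 if r["Trap"] in exposed_traps else 0,
            1 if r["Block"] in exposed_blocks else 0,
            # Pipiens group: any species label that mentions Pipiens.
            1 if "PIPIENS" in r["Species"].upper() else 0,
            r["Latitude"],
            r["Longitude"],
        ])
    return features

# Hypothetical rows, not taken from the dataset.
train_rows = [
    {"Trap": "T001", "Block": 10, "Species": "CULEX PIPIENS",
     "Latitude": 41.95, "Longitude": -87.80, "WnvPresent": 1},
    {"Trap": "T002", "Block": 22, "Species": "CULEX RESTUANS",
     "Latitude": 41.90, "Longitude": -87.65, "WnvPresent": 0},
]
test_rows = [
    {"Trap": "T001", "Block": 10, "Species": "CULEX RESTUANS",
     "Latitude": 41.95, "Longitude": -87.80},
    {"Trap": "T099", "Block": 22, "Species": "CULEX PIPIENS/RESTUANS",
     "Latitude": 41.70, "Longitude": -87.60},
]

for row in build_features(train_rows, test_rows):
    print(row)
```

Note that a trap or block that never appears in the training data falls into the "not exposed" group by default, which is a design choice worth revisiting when the testing data contains new traps.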
Indeed, past incidence of WNV in a trap was far and away the highest weighted coefficient in the regression, with mosquito species a clear second. Grouping blocks, traps, and species can get you to a classifier that substantively beats the baseline, and you can see roughly how far I got in the ROC Curve above. (That score is on the training data. The testing data includes new traps, blocks, and mosquito species, so a model won’t retain its efficacy unless it classifies new cases as well as it classifies cases in the training set, which have the benefit of indicating whether WNV was present.) But a better classifier would take into account the aforementioned weather and spray data, location data about WNV hotspots, mosquito gestation cycles, and information about the likelihood of carrying WNV by mosquito type, among a seemingly infinite number of other factors. My classmates included some of this information in their submissions and scored higher than I did, but only by a razor-thin margin, and with the help of an extension of our deadline. The lesson is that more and better data can yield a better model, yet the added features can just as easily amount to so much complexity and noise.