Using Data Visualization To Monitor Disease Outbreak

Jason Pham
Feb 8, 2019

It is crucial to get the right information to make the right choice at the right time if you want to sustain business growth. For example, a marketing director can produce better-targeted ad campaigns if he has better insight into the enormous mass of customer response data showing which types of ads perform best. A manager of a retail store can improve customer satisfaction and profits effectively if she knows which performance metrics to focus on. A small tech start-up can make better products and generate more sales if it has evidence-based data on what its customers really want.

In each of those cases, having the right information helps them make the right decisions. This is why data science is a vital and growing field in business: it can answer these questions by analyzing big data and presenting the results in clear visualizations. For this project, I aim to demonstrate the usefulness of data visualization in monitoring disease outbreaks. The goal is to distill complex issues that may not be obvious from the original data into a concise and actionable visual.

Infectious and parasitic diseases world map — DALY — WHO2004

The data I chose came from the California Health and Human Services Open Data Portal. The dataset contains 138,237 rows of counts and rates for Centers for Infectious Diseases-related disease cases among California residents by county, disease, sex, and year.
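As a rough sketch of how such a table can be prepared, the snippet below computes a rate per 100,000 population from counts. The column names (`Disease`, `County`, `Year`, `Sex`, `Count`, `Population`) and the `"Total"` value for combined sexes are assumptions made for illustration, and the rows are made-up toy values, not figures from the portal.

```python
import pandas as pd

# Toy rows standing in for the real CSV. Column names and the "Total"
# label are assumptions; every number here is made up.
df = pd.DataFrame({
    "Disease": ["Salmonellosis", "Salmonellosis", "Salmonellosis", "Giardiasis"],
    "County": ["Alameda", "Alameda", "Alameda", "Alameda"],
    "Year": [2001, 2001, 2001, 2001],
    "Sex": ["Female", "Male", "Total", "Total"],
    "Count": [60, 40, 100, 50],
    "Population": [500_000, 500_000, 1_000_000, 1_000_000],
})


def rate_per_100k(frame):
    """Return cases per 100,000 residents for each row."""
    return frame["Count"] * 100_000 / frame["Population"]


# Keep only the combined-sex rows so cases are not counted twice.
totals = df[df["Sex"] == "Total"].copy()
totals["Rate"] = rate_per_100k(totals)
print(totals[["Disease", "County", "Year", "Rate"]])
```

With the real file, the toy frame would be replaced by something like `pd.read_csv(...)` pointed at the downloaded CSV.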

First, I used a Seaborn heatmap to get an overview of the rate of occurrence of every infectious disease per 100,000 population in California by year. From the chart, we can see that Infant Botulism, Campylobacteriosis, Giardiasis, HIV, Chronic Hepatitis B, Chronic Hepatitis C, Salmonellosis, Shigellosis, and Tuberculosis are the most prevalent diseases. Furthermore, the rates of occurrence for potentially fatal diseases such as HIV and Hepatitis C seem to decline in most counties as time progresses.
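A minimal sketch of this kind of heatmap, assuming a long-format table with hypothetical `Disease`, `Year`, and `Rate` columns. The numbers are invented toy values, and this is not necessarily how the original notebook built its chart.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy long-format data (made-up rates per 100,000).
df = pd.DataFrame({
    "Disease": ["Salmonellosis"] * 3 + ["Tuberculosis"] * 3,
    "Year": [2001, 2002, 2003] * 2,
    "Rate": [11.0, 12.0, 13.0, 9.0, 8.0, 7.0],
})


def disease_year_matrix(frame):
    """Pivot to one row per disease and one column per year."""
    return frame.pivot_table(
        index="Disease", columns="Year", values="Rate", aggfunc="mean"
    )


matrix = disease_year_matrix(df)

fig, ax = plt.subplots(figsize=(6, 3))
sns.heatmap(matrix, cmap="Reds", annot=True, ax=ax,
            cbar_kws={"label": "Cases per 100,000"})
ax.set_title("Rate by disease and year (toy data)")
fig.tight_layout()
# fig.savefig("disease_by_year.png")  # uncomment to write the image
```

Pivoting first keeps the plotting call short: each cell of the heatmap is simply one entry of `matrix`.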

Next, I explored the data further with a Seaborn heatmap that breaks down infectious disease occurrences by location. The chart indicates that foodborne illnesses are widespread and that their infection rates are increasing in most counties in California.
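One way to prepare a by-location view is to restrict the table to the foodborne diseases and pivot by county, as sketched here. The column names are assumptions and the rates are invented; the resulting matrix can be passed to a heatmap call in place of a disease-by-year matrix.

```python
import pandas as pd

# Subset of the foodborne diseases named in this post.
FOODBORNE = ["Campylobacteriosis", "Salmonellosis", "Infant Botulism"]

# Toy data (made-up rates per 100,000); column names are assumptions.
df = pd.DataFrame({
    "Disease": ["Campylobacteriosis"] * 4 + ["Tuberculosis"] * 2,
    "County": ["Alameda", "Alameda", "Fresno", "Fresno", "Alameda", "Fresno"],
    "Year": [2001, 2015, 2001, 2015, 2015, 2015],
    "Rate": [20.0, 30.0, 25.0, 45.0, 8.0, 6.0],
})


def county_year_matrix(frame, diseases):
    """Sum the rates of the chosen diseases per county and year,
    with the counties that have the highest latest-year rate first."""
    subset = frame[frame["Disease"].isin(diseases)]
    matrix = subset.pivot_table(
        index="County", columns="Year", values="Rate", aggfunc="sum"
    )
    latest_year = matrix.columns.max()
    return matrix.sort_values(by=latest_year, ascending=False)


matrix = county_year_matrix(df, FOODBORNE)
print(matrix)
```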

To confirm my preliminary finding, I used Seaborn linear regression plots to determine the trendlines of all the previously mentioned prevalent diseases. The graphs show that the infection rates of Chronic Hepatitis B, Chronic Hepatitis C, Shigellosis, Tuberculosis, and HIV are all decreasing. This is great news, since most of them are serious diseases with fatal consequences if left untreated. However, the graphs also show a rising trend in foodborne illnesses in California from 2001 to 2015; the common ones are Infant Botulism, Salmonellosis, Campylobacteriosis, E. coli Other STEC (non-O157), and Vibrio Infection (non-Cholera).
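The trendline check can be sketched as follows. `sns.regplot` draws the fitted line, and a separate least-squares fit with `np.polyfit` gives the slope as a number, so "rising" or "falling" can be read from its sign. The yearly values are invented, and the `Year` and `Rate` column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy statewide yearly rates for one disease (made-up values).
yearly = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2004, 2005],
    "Rate": [10.0, 11.0, 12.0, 13.0, 14.0],
})


def trend_slope(frame):
    """Least-squares slope of Rate against Year (rate change per year)."""
    # Centering the years keeps the fit well conditioned; the slope is unchanged.
    centered_years = frame["Year"] - frame["Year"].mean()
    slope, _intercept = np.polyfit(centered_years, frame["Rate"], deg=1)
    return slope


fig, ax = plt.subplots()
sns.regplot(data=yearly, x="Year", y="Rate", ax=ax)  # scatter plus fitted line

slope = trend_slope(yearly)
direction = "rising" if slope > 0 else "falling or flat"
print(f"slope = {slope:.2f} per year ({direction})")
```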

To dig deeper, I used a Matplotlib scatterplot to visualize the relationships among the common foodborne illnesses in California and how their infection rates trended from 2001 to 2015. The graph clearly shows that Campylobacteriosis, Salmonellosis, and Infant Botulism are the three most prevalent.
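A comparison like this can be drawn with one `scatter` call per disease, as in the sketch below. The data frame is a made-up stand-in, and the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy yearly rates per 100,000 for three diseases (made-up values).
df = pd.DataFrame({
    "Disease": (["Campylobacteriosis"] * 3
                + ["Salmonellosis"] * 3
                + ["Infant Botulism"] * 3),
    "Year": [2001, 2008, 2015] * 3,
    "Rate": [15.0, 18.0, 22.0, 12.0, 13.0, 14.0, 5.0, 6.0, 8.0],
})

fig, ax = plt.subplots()
# One scatter series per disease so each gets its own color and legend entry.
for disease, group in df.groupby("Disease"):
    ax.scatter(group["Year"], group["Rate"], label=disease)

ax.set_xlabel("Year")
ax.set_ylabel("Cases per 100,000")
ax.legend()
```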

Furthermore, I used Matplotlib Basemap to map the rate of occurrence of the most prevalent foodborne illness, Campylobacteriosis, and identify which counties were most affected in 2015. The rate of infection per 100,000 population is depicted on a color scale from white (lowest) to dark red (highest).
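The snippet below illustrates only the color-scale step, not the map drawing itself, and it uses plain Matplotlib rather than Basemap. Each county's rate is normalized between the lowest and highest value and mapped onto a white to dark red gradient. County rates are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # no window needed; only colors are computed
import pandas as pd
from matplotlib.colors import LinearSegmentedColormap, Normalize, to_hex

# Toy 2015 rates per 100,000 by county (made-up values).
rates = pd.Series({"Alameda": 30.0, "Fresno": 45.0, "Modoc": 10.0})

# Gradient from white (lowest rate) to dark red (highest rate).
cmap = LinearSegmentedColormap.from_list("white_to_darkred", ["white", "darkred"])
norm = Normalize(vmin=rates.min(), vmax=rates.max())

# Map every county's rate to a hex color that a map layer could use as fill.
rgba = cmap(norm(rates.to_numpy()))
county_colors = {county: to_hex(color) for county, color in zip(rates.index, rgba)}
print(county_colors)
```

In a full map, each color in `county_colors` would be used as the fill of the matching county polygon.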

To simplify my finding further, I used Matplotlib Animation to visualize how the Campylobacteriosis rate of occurrence rose across counties in California from 2001 to 2015.
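As a simplified sketch of the animation idea, the example below uses `FuncAnimation` to redraw a bar chart once per year. It animates bars rather than a map, and the counties, years, and rates are made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Toy rates per 100,000 for three counties in three years (made-up values).
counties = ["Alameda", "Fresno", "Modoc"]
rates_by_year = {
    2001: [20.0, 25.0, 8.0],
    2008: [24.0, 33.0, 9.0],
    2015: [30.0, 45.0, 10.0],
}
years = sorted(rates_by_year)

fig, ax = plt.subplots()
bars = ax.bar(counties, rates_by_year[years[0]], color="darkred")
ax.set_ylim(0, 50)
ax.set_ylabel("Cases per 100,000")


def draw_year(frame_index):
    """Update bar heights and the title for the year at this frame index."""
    year = years[frame_index]
    for bar, rate in zip(bars, rates_by_year[year]):
        bar.set_height(rate)
    ax.set_title(f"Campylobacteriosis rate, {year} (toy data)")
    return list(bars)


animation = FuncAnimation(fig, draw_year, frames=len(years),
                          interval=500, repeat=False)
# animation.save("rates.gif", writer="pillow")  # optional; needs Pillow installed
```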

Conclusion

If you live in California and have been careless about foodborne illnesses, these graphs suggest that you should be more cautious about your food sources, since the rate of occurrence trended up in most counties from 2001 to 2015. Overall, data visualization is a powerful technique that helps convey the significance of data by placing it in a visual context. One can quickly understand complex patterns in data, such as the infectious disease case reports from California, if the visual is clear and concise. As data output increases and becomes an increasingly important vehicle for transformative endeavors, data visualization will continue to be an essential communication tool in data science for conveying information to the masses.

The Python notebook used for this blog can be found here: https://extrajp2014.github.io
