Data for Good — Zika Hackathon

Ari Kahn, genomics subject matter expert at TACC

On September 9th, more than 70 volunteers united in a Data for Good hackathon to explore new ways to use data in the fight against, and the prevention of, the Zika virus. The event took place at the Texas Advanced Computing Center (TACC) at the University of Texas in Austin, Texas. Cloudera volunteers helped organize the event through the Cloudera Cares program for social causes. Cloudera partners Qlik and Bardess also participated in the event. Both are already helping organizations use big data in the life sciences, genomics and pharma world to discover new ways to improve quality of life.

Through these hackathons we build awareness around the Zika virus and promote data sharing and collaboration. At a prior Zika hackathon on May 15th, we analyzed data on Aedes mosquito habitat, breeding conditions, weather and travel to the US from countries with Zika-infected mosquitoes. The data was used to plot potential hotspots in the US, and the analysis found Miami and Houston at critical risk. This finding was made before the recent outbreak of Zika in Miami.

Zika cases in the US

At the latest hackathon we saw more ambitious data for good projects, with research in the clinical and epidemiological areas of Zika. Projects ranged from identifying Zika in water samples using metagenomic data to exploring the Zika protein and docking to identify potential drugs to fight Zika virus infections.

Volunteers with various skills and knowledge formed groups and collaborated on these Zika-related projects.

Zika Metagenomic Portal Frontend — Goal: Create a web portal where people who are collecting metagenomic data can submit it to a service that searches the data for traces of Zika. The portal uses the Agave API to connect to TACC's Wrangler system and other computational resources.

Zika Metagenomics Portal Backend — Goal: Check water samples against Zika serum using sample training sets to train the model. The team was blown away when they actually found Zika in publicly available data samples, a breakthrough achievement for the hackathon.

Project Hydro — Goal: Cross-compare various data sources (floodplain data, women of childbearing age, vegetation density) for Harris and Hidalgo counties in Texas to assess the Zika risk posed to pregnant women based on location.

Zika protein and docking research for drug discovery

Medicines to Zika Protein — Goal: Use high-performance computing to facilitate the docking process involved in Zika virus drug discovery. The team deviated from that plan and used machine learning to identify the most efficient drug.

Zika Demo part 2 — Goal: Add new datasets to the demo created at the first Zika hackathon and provide a platform that can be used to promote Zika awareness and the need for open data sets, with help from our partners Qlik, Bardess and data.world.
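
To make the kind of cross comparison described for Project Hydro more concrete, here is a minimal sketch in Python of one way several location-based indicators could be combined into a single relative risk score. It is not the team's actual method. The min-max normalization, the equal weights, the function names, the area names and all of the numbers are illustrative assumptions and placeholder values, not hackathon data.

```python
from typing import Dict, List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to the range [0, 1]; identical values all map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_risk(
    areas: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted average of normalized indicators for each area (hypothetical scoring)."""
    names = list(areas)
    total_weight = sum(weights.values())
    scores = {name: 0.0 for name in names}
    for indicator, weight in weights.items():
        normalized = min_max_normalize([areas[n][indicator] for n in names])
        for name, value in zip(names, normalized):
            scores[name] += weight * value / total_weight
    return scores


# Placeholder values for three hypothetical areas (NOT real county data).
areas = {
    "area_a": {"floodplain_fraction": 0.40, "women_childbearing_age": 12000, "vegetation_density": 0.70},
    "area_b": {"floodplain_fraction": 0.10, "women_childbearing_age": 3000, "vegetation_density": 0.20},
    "area_c": {"floodplain_fraction": 0.25, "women_childbearing_age": 7500, "vegetation_density": 0.45},
}

# Equal weights are an assumption; a real study would need to justify them.
weights = {
    "floodplain_fraction": 1.0,
    "women_childbearing_age": 1.0,
    "vegetation_density": 1.0,
}

scores = composite_risk(areas, weights)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

The scores are only relative to the areas being compared, so they rank locations against each other rather than estimate an absolute probability of infection.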

The identification of Zika in publicly available water sample data was a huge discovery and proof that these projects have the potential to make a significant scientific impact. We hope these projects are the seed of future discoveries and insights.

Yet one of the major challenges observed at the hackathons is the lack of public Zika data sets. For example, at the first hackathon we had to write scripts to scrape the CDC website for data. We reached out to the CDC to request access to the raw data in any file format, but the CDC does not share these data files publicly.

New organizations like data.world are improving access to data with easier ways to share and discover datasets. Data.world, which also participated in the hackathon, made Zika datasets available on its platform, but this is just the start. We need more organizations like the CDC, WHO and ECDC to post their datasets in downloadable file formats to promote research and discovery. Data.gov is a great resource for public data sets, with over 186,467 datasets and growing, but there is not much on Zika. If you search for “zika” today, you will find only one result, and it is not a Zika-specific dataset.

Texas Advanced Computing Center (TACC) at the University of Texas

TACC is also making access to petabytes of data storage easier and promoting collaboration. TACC’s systems, while mostly used by academia today, are also available to private enterprises. Home to some of the top supercomputers in the world, TACC offers systems with support for Apache Hadoop that are hungry for data science projects and data for good research.

The Zika hackathon was a huge success, and it was great to see all the volunteers collaborate, share knowledge and unite in a data for good cause. If a small group of people can gather for a few hours and accomplish these results, just imagine what can be done by the health and life sciences industry at large. Cloudera is a big supporter of President Obama’s Precision Medicine Initiative, and with hackathons like these we promote the use of new big data technologies for this type of research, as we saw at this hackathon with the use of metagenomic data for Zika identification.

Cloudera Cares volunteers

The entire world can benefit from open source data platforms like Apache Hadoop with self-service analytical and machine learning tools like Apache Spark MLlib. Often it is the underdeveloped countries lacking resources where these data for good projects can make the most impact, and hackathons like these help promote awareness and provide examples of how to tackle tough social problems with data for good and open source data. Get involved in a data for good project near you and be part of the change.