10 things we learned from organizing a hackathon on healthcare data
While data scientists likely lack the skill to build orphanages, they can use their knowledge to contribute to social good in many other ways. This is because a vast number of important questions — from healthcare and agriculture to energy and education — are best addressed by using insights distilled from data.
Making sense of data is the core activity of any data scientist, so on a cloudy weekend in February we hosted around 40 data scientists who applied their skills to problems such as predicting the spread of mosquitoes that might carry the dengue virus in the Philippines, or assessing factors that influence the quality of pregnancy care in Kenya. (For an overview of all projects and the schedule, see this.)
What seemed like a straightforward idea in the beginning taught us quite a few lessons along the way, and these might help others who organize similar events in the future. Below are our Top 10!
1. Meetup is the place to be
We had never organized a hackathon before, so we tried ALL the channels: we created a Facebook event and shared it in relevant groups, we shared information on LinkedIn (and got over 100 likes), we advertised the event on local data science mailing lists, and we created a Meetup event in a well-established Meetup group that we partnered with. In the end, over ⅔ of all participants joined through Meetup. Next time we organize an event like this, we will focus even more on Meetup as a channel, but still advertise the Meetup event through Facebook, LinkedIn, and email.
2. Make data security the first priority
Doing data science on health data can have a great positive impact for individuals receiving care, but it can also have a great negative impact if sensitive data gets into the wrong hands. We took a three-layered approach to ensure security in this setting: First, we thoroughly anonymized all data sets. Second, we kept all data and analyses in the cloud to discourage participants from keeping data on their personal machines. Third, we asked every participant to sign an NDA to legally ensure that data is not downloaded. Of course, this does not guarantee 100% security (data cannot be 100% anonymized, we cannot prevent people from taking pictures of their analyses, etc.), but we are confident that the measures taken were enough to avoid breaches in this hackathon.
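To make the first layer more concrete, here is a minimal sketch of what one anonymization step could look like. It is an illustration only, not the exact procedure used for the hackathon: the column names (`patient_id`, `name`, `phone`, `date_of_birth`), the secret key, and the sample record are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be generated randomly and
# never shared with participants.
SECRET_KEY = b"replace-with-a-random-secret"

# Hypothetical column names, for illustration only.
DIRECT_IDENTIFIERS = {"name", "phone"}


def pseudonymize_id(patient_id: str) -> str:
    """Replace an identifier with a keyed hash so records can still be linked."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_record(record: dict) -> dict:
    """Drop direct identifiers, pseudonymize the ID, and coarsen the birth date."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["patient_id"] = pseudonymize_id(record["patient_id"])
    # Keep only the birth year (expects an ISO date such as "1990-04-12").
    cleaned["birth_year"] = int(cleaned.pop("date_of_birth")[:4])
    return cleaned


# Made-up example record.
record = {
    "patient_id": "P-001",
    "name": "Jane Doe",
    "phone": "555-0100",
    "date_of_birth": "1990-04-12",
    "visits": 3,
}
anonymized = anonymize_record(record)
print(anonymized)
```

A keyed hash keeps the same patient linkable across tables without exposing the original ID. As noted above, steps like these reduce risk but do not make data 100% anonymous, since combinations of the remaining fields can still identify people.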
3. Planning food is difficult
The hackathon covered two days: Saturday and Sunday, morning to evening. To plan the amount of food required, we asked people to register in two stages: first a general registration and then, a week before the event, a confirmation of attendance. That left us with 80 registered people, of whom 40 confirmed. Because we didn’t set a clear deadline for the confirmation, we ended up ordering food for 80. In the end, a few more than 40 people joined (some of them had registered but not confirmed their attendance, some had confirmed but did not show up). Next time we organize a hackathon, we will communicate a clear confirmation deadline and adjust our food planning accordingly.
4. Self-organization works
In total, we had 7 data sets for this hackathon from a variety of organizations. This alone was already useful because it brought together operational business people interested in data. To match participants to data sets, we organized 5-minute pitches at the beginning of the hackathon (which turned out to be 10-minute pitches) and then asked participants to simply join the data set that they were most excited about. To our surprise, self-organization worked and every data set ended up with at least four participants. Data sets that attracted more than 8 people were split between two groups. In our experience, the best pitches focused on the problem to be solved and a quick overview of the data, and left most of the details for individual meetings afterwards.
5. Don’t forget to socialize
When planning the hackathon, we focused a lot on the technicalities: interesting data, secure infrastructure, enough food and beer. What we didn’t plan was time to actually drink that beer. We assumed that participants would at some point just stop working, but everybody was so engaged that there wasn’t much time left to socialize. Next time, we will plan more time for semi-formal socializing in order to bring all these amazing people closer together and build a community.
6. Cloud-native hackathons are easy
To provide the best data security, we wanted to avoid any data being stored locally on participants’ laptops. To achieve this, we set up an individual AWS SageMaker instance for each team, connected to anonymized data stored on S3. In the end, we paid less than 200 Euros for 8 instances running for a weekend. Some tips: ask the participants to close old notebooks to keep memory free, have an engineer on the ground to handle short-term requests (we needed to spin up a PostgreSQL database for one group), and check out our infrastructure setup here.
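For anyone budgeting a similar setup, a back-of-the-envelope estimate could be sketched as follows. Only the 8 instances and the 200 Euro ceiling come from our experience. The hourly rate and the number of running hours are assumptions for illustration, not real AWS prices or our actual usage.

```python
def estimate_cost_eur(instances: int, hours: float, hourly_rate_eur: float) -> float:
    """Total cost if every instance runs for the whole period."""
    return instances * hours * hourly_rate_eur


def max_hourly_rate_eur(budget_eur: float, instances: int, hours: float) -> float:
    """Highest per-instance hourly rate that still fits the budget."""
    return budget_eur / (instances * hours)


INSTANCES = 8        # from the hackathon: 8 instances in total
HOURS = 48.0         # assumption: instances run for the whole weekend
ASSUMED_RATE = 0.40  # assumption: EUR per instance-hour, not a real AWS price
BUDGET = 200.0       # our bill stayed below this amount

estimated = estimate_cost_eur(INSTANCES, HOURS, ASSUMED_RATE)
rate_ceiling = max_hourly_rate_eur(BUDGET, INSTANCES, HOURS)
print(f"Estimated cost: {estimated:.2f} EUR")
print(f"Rate ceiling for the budget: {rate_ceiling:.3f} EUR/hour")
```

Check the current pricing of the instance type you pick before relying on numbers like these. If instances are stopped overnight, the hours term shrinks accordingly.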
7. Keep the data simple
For this hackathon, we had 7 different datasets, ranging from a few tables on mosquito measurements to combinations of complex surveys and hospital visits to satellite images. Our learnings here are threefold. First, simpler data is better: the more varied the datasets were, the more time the teams needed to understand the data before starting their analyses, and that is valuable time if you only have two days. Second, less data is better: each model iteration on the large satellite image dataset took very long to train, which limited the team in exploring parameters. Third, fewer datasets are better: in the two cases where we had more than one team working on the same dataset, the teams worked together and exchanged knowledge. We are sure that this would have helped many other teams as well.
8. Know your data (or know somebody who knows your data)
Many teams struggled during the hackathon to make sense of the data. We think we could have done a lot better here by providing more subject matter expertise. For some teams this was possible through in-person meetings with data owners, which allowed them to quickly resolve questions and move much faster. Having subject matter experts on the ground should always be the first priority. Where this is not possible, we believe that documentation of the data can help as well, but it may be difficult to find a balance between purely high-level descriptions and overly detailed codebooks that are difficult to digest in the limited time. The perfect combination is having subject matter experts on the ground who can walk the team through detailed documentation of the data. Thus, prepare the data beforehand and have somebody on the ground to explain it.
9. Presentations in notebooks are difficult
To save time and allow the teams to focus the little time they had on analyses, we decided to base the final 5-minute presentations on the notebooks that the analyses were done in. This led to a wide range of results: Some teams were able to abstract from the analyses and present background and results with clear headings, expressive figures, and helpful voice-overs. Others, however, got lost in the technical details, went overtime, and were unable to make the connection to the underlying business needs. Our takeaway is to give an example of a good notebook presentation at the start of the hackathon and to stress more heavily the importance of not going overtime.
10. Focus on explainability
The winning team, CareNCode (picture above), went through an interesting journey: While they were the last to be set up technically with access to data and running notebooks, they were the ones to finish first in the final presentations. What were the winning factors? We think there are three, which also applied to the runner-up MosquitoPie: their data set had limited complexity and they had access to a subject matter expert; they focused on simple models and explainable analyses, adding some more complex analyses with the remaining time; and they managed to keep their presentation focused on the problem they were solving, not on the methods they solved it with. Congratulations!
11. You rock!
Finally, we want to stress again that we were blown away by the dedication and skills that the participants brought to bear on very difficult and important problems. Thank you!
If you, dear reader, are interested in joining a network of data scientists who do pro bono work for organisations that advance the social good, check out Correlaid Nederland or send an email to firstname.lastname@example.org to join Correlaid’s Slack channel.