NYU researchers invent new real-time data analysis system for humanitarian agencies
“Too much time is spent collecting data,” explain NYU doctoral student Kien Pham and CDS’s Juliana Freire in their co-authored paper, “and not enough time is spent making sense of it.”
How can we efficiently help those in need? A difficult reality facing humanitarian agencies is that they cannot immediately address all of the world’s crises, particularly when constrained by limited financial and human resources. This is why they prioritize which emergencies they respond to. But the prioritization process, which involves collecting, organizing, and analyzing vast volumes of secondary data produced by public institutions, NGOs, and news media about each crisis around the globe, is highly time-consuming and relies entirely on manual labor.
A new targeted information retrieval system, however, aims to automate these tasks for humanitarian workers so that they can make decisions about aid delivery and disaster response more quickly. Invented by NYU doctoral student Kien Pham and a research team comprising Juliana Freire, the Executive Director of the Moore-Sloan Data Science Environment at the NYU Center for Data Science, and experts from IBM’s Thomas J. Watson Research Center, the new system contains four main components: a focused web crawler, a metadata extractor, a content classifier, and a feedback mechanism.
Crawlers generally strive to cover as many pages as possible, but “a focused crawler,” as the researchers explain, “is a web crawler that is optimized to seek web pages that are relevant to predefined topics.” Using a binary classifier, the crawler labels each webpage as relevant or irrelevant to the user’s search topic and passes the relevant webpages on to the metadata extractor. And, because emergency situations tend to change rapidly, the researchers also designed a real-time re-crawling strategy for the system.
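To make the idea of a focused crawler concrete, here is a minimal sketch in Python. It is not the researchers’ implementation: the in-memory `TOY_WEB`, the `CRISIS_TERMS` vocabulary, the keyword-ratio scorer, and the 0.2 threshold are all assumptions made for illustration, and a real system would fetch live pages and use a trained binary classifier.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("humanitarian news portal", ["a", "b"]),
    "a": ("flood displaces thousands, emergency aid needed", ["c"]),
    "b": ("celebrity gossip and film reviews", ["d"]),
    "c": ("earthquake relief: refugee camps report food shortage", []),
    "d": ("football scores", []),
}

# Assumed topic vocabulary standing in for a trained relevance model.
CRISIS_TERMS = {"flood", "earthquake", "refugee", "emergency",
                "aid", "relief", "shortage", "humanitarian"}


def relevance(text):
    """Return the fraction of words on the page that are crisis terms."""
    words = [w.strip(".,:;").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in CRISIS_TERMS for w in words) / len(words)


def focused_crawl(seed, web, threshold=0.2, max_pages=10):
    """Visit pages best-first and follow links only from relevant pages."""
    frontier = [(-1.0, seed)]   # max-heap via negated priority
    seen = {seed}
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        fetched += 1
        score = relevance(text)
        if score < threshold:
            continue            # binary decision: irrelevant, do not expand
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Links inherit the score of the page that pointed to them.
                heapq.heappush(frontier, (-score, link))
    return relevant


print(focused_crawl("seed", TOY_WEB))  # ['seed', 'a', 'c']
```

In this toy run the crawler never visits page `d`, because the only page linking to it was judged irrelevant. That pruning is what distinguishes a focused crawler from one that tries to cover everything.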
The extractor concentrates on mining the textual data of those webpages. It singles out the title, content, publication date, and mentioned countries from the relevant webpages that the crawler passed on; a content classifier then analyzes and labels each webpage according to the type of crisis it describes.
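The extraction and labeling steps might look roughly like the following Python sketch. Everything here is a simplifying assumption rather than the system’s actual method: the short `COUNTRIES` list, the `CRISIS_KEYWORDS` table, the regular expressions, and the keyword-counting labeler are hypothetical stand-ins for proper HTML parsing and a trained classifier.

```python
import re

# Assumed, illustrative lookup tables (not from the original system).
COUNTRIES = ["Nepal", "Haiti", "Syria", "South Sudan"]
CRISIS_KEYWORDS = {
    "earthquake": {"earthquake", "aftershock"},
    "flood": {"flood", "flooding"},
    "conflict": {"conflict", "fighting", "shelling"},
}


def extract_metadata(html):
    """Pull title, plain-text content, first ISO date, and countries."""
    title = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", html)
    # Strip tags, then collapse runs of whitespace.
    content = " ".join(re.sub(r"<[^>]+>", " ", html).split())
    return {
        "title": title.group(1).strip() if title else None,
        "date": date.group(1) if date else None,
        "content": content,
        "countries": [c for c in COUNTRIES if c in content],
    }


def classify_crisis(content):
    """Label the text with the crisis type whose keywords occur most."""
    tokens = re.findall(r"[a-z]+", content.lower())
    counts = {
        label: sum(t in words for t in tokens)
        for label, words in CRISIS_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "other"


page = ("<html><head><title>Aftershock hits Nepal</title></head>"
        "<body><p>Published 2015-05-12. A strong aftershock struck "
        "Nepal today.</p></body></html>")
meta = extract_metadata(page)
print(meta["title"], meta["date"], meta["countries"])
print(classify_crisis(meta["content"]))
```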
Because the system’s efficiency hinges on the accuracy of the content classifier, the researchers built a vital feedback cycle into the system, which collects user feedback so that the classifier can improve over time. “This especially increases the robustness of the page classifier,” the researchers explain, “as well as the adaptivity of the crawler.”
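One simple way to picture such a feedback cycle is an online classifier that adjusts its weights whenever a user corrects one of its predictions. The sketch below uses a basic perceptron-style update over bag-of-words features. That is a technique chosen here for illustration, not one the researchers are reported to use, and the class and method names are hypothetical.

```python
class FeedbackClassifier:
    """Toy relevance classifier that learns from user corrections."""

    def __init__(self):
        self.weights = {}  # word -> weight, all implicitly 0.0 at start

    @staticmethod
    def _features(text):
        return set(text.lower().split())

    def predict(self, text):
        """Return True if the page is judged relevant."""
        score = sum(self.weights.get(w, 0.0) for w in self._features(text))
        return score > 0

    def feedback(self, text, is_relevant):
        """Update weights only when the prediction disagrees with the user."""
        if self.predict(text) == is_relevant:
            return
        delta = 1.0 if is_relevant else -1.0
        for w in self._features(text):
            self.weights[w] = self.weights.get(w, 0.0) + delta


clf = FeedbackClassifier()
print(clf.predict("cholera outbreak in camp"))   # False: nothing learned yet
clf.feedback("cholera outbreak in camp", True)   # user marks page relevant
clf.feedback("camp music festival", False)       # user rejects a false hit
print(clf.predict("cholera outbreak in camp"))   # True
print(clf.predict("camp music festival"))        # False
```

Each correction nudges the model a little, so accuracy can improve as analysts keep using the tool.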
The researchers recently implemented a fully operational prototype of their system for humanitarian experts at the Assessment Capacities Project (ACAPS), an organization that supports crisis responders by providing needs assessments and analysis.
While more work still needs to be done to tailor the system for domain-specific needs, the researchers hope that it will be widely implemented at humanitarian agencies in the future and that it will also incorporate social media data into its processes.
by Cherrie Kwok