How does a public disaster management unit begin to identify and structure useful information from online sources, and more importantly begin to understand the true scope of an event? For data scientists, one of the obvious answers is natural language processing.
But developing a taxonomy of keywords to subset and structure text-based data sets, such as social media posts, is challenging, and it is a taller order still for languages that are not as tech-resource-rich as English. Building and refining such a taxonomy through human effort alone can be tedious.
Preempting such a challenge in a real-life crisis, the Korea Advanced Institute of Science and Technology (KAIST) and Pulse Lab Jakarta developed a human-assisted machine learning framework that produces a set of candidate keywords using text modeling techniques and relies on human language expertise to filter out irrelevant ones. A blend of human and machine intelligence, the framework is fittingly dubbed SILK (statistical iterative learning of keywords).
SILK essentially runs multiple repetitions of its model to find new keywords, which feed into the next iteration and so gradually expand the taxonomy. To highlight SILK’s practicality in a local crisis and explain how it works, consider the hypothetical, yet often real, case of haze events in parts of Indonesia.
Unlike English, Bahasa Indonesia is not often considered a tech-resource-rich language. Therefore, to build a list of keywords related to a haze crisis in Central Kalimantan, SILK would first rely on a set of seed keywords from speakers of the local language, which can then be used to locate a set of related posts. Next, it uses a topic model, such as the Hierarchical Dirichlet Scaling Process (HDSP), to identify related keywords and present them back to the language expert.
In the hypothetical haze crisis, the Bahasa Indonesia language expert may supply the translation of the English portmanteau “smog”, which in colloquial terms is a combination of kabut (fog) and asap (smoke). After the HDSP topic model finds topics near the given keywords and identifies possible words linked to the haze event, the human expert would assess whether those machine-proposed words are in fact relevant to the crisis. Words deemed relevant would be accepted and added to the initial list of haze-related keywords supplied by the language expert. From there, the process loops into another iteration, repeating until enough keywords associated with the haze event have been gathered for further analysis.
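The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's implementation: a simple co-occurrence count stands in for the HDSP topic model, a callback named `is_relevant` stands in for the human expert, and the function names, sample posts and relevance judgments are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Set


def propose_candidates(posts: List[str], keywords: Set[str],
                       rejected: Set[str], top_n: int) -> List[str]:
    """Stand-in for the topic model (NOT HDSP): rank words that
    co-occur with known keywords in the same post."""
    counts: Counter = Counter()
    for post in posts:
        tokens = set(post.lower().split())
        if tokens & keywords:  # the post mentions at least one keyword
            counts.update(tokens - keywords - rejected)
    # Sort by count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def silk_loop(posts: List[str], seeds: Set[str],
              is_relevant: Callable[[str], bool],
              max_iters: int = 5, top_n: int = 5) -> Set[str]:
    """Grow a keyword set: propose candidates, let the expert filter
    them, feed accepted words into the next iteration."""
    keywords = set(seeds)
    rejected: Set[str] = set()
    for _ in range(max_iters):
        candidates = propose_candidates(posts, keywords, rejected, top_n)
        accepted = {w for w in candidates if is_relevant(w)}
        rejected.update(set(candidates) - accepted)
        if not accepted:  # nothing new was accepted, so stop early
            break
        keywords |= accepted
    return keywords


# Hypothetical posts and expert judgments, for illustration only.
posts = [
    "kabut asap tebal di palangkaraya",
    "asap kebakaran hutan makin parah",
    "kebakaran lahan gambut di kalimantan",
    "macet di jakarta pagi ini",
]
expert_accepts = {"kebakaran", "hutan", "gambut", "lahan"}
result = silk_loop(posts, {"kabut", "asap"}, lambda w: w in expert_accepts)
print(sorted(result))
```

In this toy run the third post contains neither seed word, so it is only reached after "kebakaran" is accepted in the first round, which is the point of iterating. The fourth post never matches, so its words are never proposed.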
There are other approaches to developing crisis wordlists from social media such as Twitter, the most common being crowdsourcing and pseudo-relevance feedback. The latter method, though, typically depends on annotated tweets as training data and thus presents further limitations. In contrast, SILK bypasses the need for training data, requiring only a couple of seed keywords related to the topics. The research leads behind the SILK framework, namely JinYeong Bak and Alice Oh (KAIST) and Imaduddin Amin and Jonggun Lee (Pulse Lab Jakarta), are still exploring ways to improve methods for expanding crisis-related keywords, as well as ways of using the taxonomies to provide insights for disaster response authorities and the general public.
The team’s paper detailing the mechanics of the SILK framework was accepted to the Interactive Machine Learning and Semantic Information Retrieval workshop at the International Conference on Machine Learning. The workshop is scheduled to take place on 11 August 2017 in Sydney, Australia.
Pulse Lab Jakarta is grateful for the generous support of the Government of Australia.