OpenGov at KDD-2016

Gabor Melli
OpenGov Developers

--

This August 13th through 17th, all of OpenGov’s data scientists attended the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2016). Our goal was to discover theoretical and technical advances that could help us improve our data-driven solutions.

The conference has grown to include several hundred individual presentations (from tutorials to invited talks to long-form papers) covering a wide variety of topics (tasks, algorithms, and systems). This post summarizes, or simply lists, some of the relevant presentations from the perspective of facilitating cloud-based, self-service analysis of government outcomes and financials.

Beyond this post, you can find many of the presentation recordings on KDD’s videolectures.net page or on YouTube, and the published papers either in the ACM Digital Library or on KDD’s website.

Natural Language Processing

One of the more important topics for OpenGov is the automated analysis of text, such as that found in accounting taxonomies and budget documents.

  • A good place to start, then, is the hands-on tutorial on “Big Natural Language Processing” that Matthew Seal (also with OpenGov) and I gave. Stay tuned to our Engineering Blog for Matt’s upcoming summary post on the tutorial; if you are keen to peek at the presentation slides, you can find them here: goo.gl/m4MVsT.
  • Next, the paper “Multi-layer Representation Learning for Medical Concepts” (pdf) reported on a system that learns embedded (vector) representations for both medical codes and medical visits from a large electronic health record database. One of its innovations is the interpretability of the learned representations (subsequently validated by domain experts). A minimal sketch of the general idea behind code embeddings follows this list.
  • Finally, the paper “Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding” (pdf) dealt with fine-grained classification of entity mentions into hundreds of categories. Rather than relying only on distant supervision, it also learns ‘shallow’ vector-space embeddings from text corpora and knowledge bases.
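
To make the embedding idea concrete, here is a minimal sketch of one classic way to learn vector representations for co-occurring codes: build a code-by-code co-occurrence matrix and factorize it with truncated SVD. This is a simplification for illustration, not the paper’s multi-layer neural method, and the toy “visits” and code names are invented.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy "visits": each visit is a set of co-occurring codes
# (stand-ins for the medical codes in the paper, or for
# accounting codes in OpenGov's setting).
visits = [
    ["401.9", "250.00", "V58.67"],
    ["250.00", "V58.67"],
    ["401.9", "428.0"],
    ["428.0", "250.00", "401.9"],
]

codes = sorted({c for visit in visits for c in visit})
index = {c: i for i, c in enumerate(codes)}

# Build a symmetric code-by-code co-occurrence matrix.
cooc = np.zeros((len(codes), len(codes)))
for visit in visits:
    for a in visit:
        for b in visit:
            if a != b:
                cooc[index[a], index[b]] += 1.0

# Factorize to get low-dimensional embeddings; codes that occur
# in similar contexts should end up with nearby vectors.
svd = TruncatedSVD(n_components=2, random_state=0)
embeddings = svd.fit_transform(cooc)
for code, vec in zip(codes, embeddings):
    print(code, np.round(vec, 3))
```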

Applications in Government

Interestingly, there were several presentations on applications of data science to government (largely attributable to the University of Chicago’s “Data Science for Social Good” program). Two approaches were typical. The first focused on using trained models to replace manual tasks where people tediously scavenge for information. The second focused on analyzing large public datasets.

  • The paper “Designing Policy Recommendations to Reduce Home Abandonment in Mexico” investigated the factors that reduce forfeitures of homes in Mexico (to aid in policy recommendations). It found that access to particular services within walking distance was critical for sustaining communities, and it recommended steps that could be taken to curb the growing trend.
  • The paper “Identifying Earmarks in Congressional Bills” used text mining techniques to locate hidden funding requests directed to a particular organization or project in a bill author’s home state; one common way to frame such a task is sketched after this list.
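
The earmarks work suggests a pattern worth reusing: framing the detection of earmark-like language as supervised text classification. Below is a hedged sketch of that framing with TF-IDF features and logistic regression; the bill snippets and labels are invented, and the authors’ actual method is richer than this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented bill snippets: label 1 = contains an earmark-like request.
texts = [
    "provided that $2,000,000 shall be made available to the harbor project",
    "funds appropriated for general administrative expenses",
    "of which $500,000 is for the downtown arts center in the member's district",
    "the secretary shall submit an annual report to congress",
]
labels = [1, 0, 1, 0]

# TF-IDF over unigrams and bigrams feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["$750,000 shall be made available to the riverfront museum"]))
```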

Other relevant papers included:

Maturation of Data Science

There was continued evidence that the field of data science is maturing from an artisanal trade (of pet tools and reinvention) into an engineering practice that systematically applies best-practice processes.

  • Joe Hellerstein of UC Berkeley addressed the theme of holistic, end-to-end data analysis pipelines and their application to the development of several data entry/scraping and ETL products, such as Trifacta (video).
  • Ingo Mierswa of RapidMiner spoke on an analysis of thousands of data science pipelines. This analysis, for example, validated the common best practice of first applying decision trees and k-means (video); a minimal version of such a first pass is sketched after this list.
  • Greg Papadopoulos of NEA spoke on how for-profit organizations can attain value from data, along with a sense for which problems are no longer hard (video).
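
For reference, here is what that kind of first pass might look like in practice: an interpretable decision tree for a supervised look at the data, and k-means for an unsupervised one. This is a minimal sketch on a stock dataset; the hyperparameters are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised first pass: a shallow, interpretable decision tree baseline.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))

# Unsupervised first pass: k-means to surface rough cluster structure.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```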

Large-Scale Distributed Data Science Systems

There was continued work on bridging large-scale distributed systems and statistics research to help data scientists analyze data distributed across hundreds to millions of computers worldwide. Recent research focuses on how to preserve privacy and confidentiality while still allowing globally distributed data to be mined for societal benefit, and on how to achieve this in environments with constraints on communication bandwidth, power consumption, per-node computation, and so on.

  • The Spark 2.0 hands-on tutorial covered both an introduction to the increasingly popular data processing platform and how to embed machine learning algorithms into it (video); see the pipeline sketch after this list.
  • Jennifer Chayes of Microsoft Research spoke on non-parametric modeling of massive sparse networks and its application to the analysis of large graphs, such as the Web or social networks (video).
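
As a taste of the tutorial’s theme, the sketch below embeds a machine learning algorithm into a Spark 2.0 pipeline: assemble features, fit a logistic regression, and score, with the same code able to run on a laptop or a cluster. It assumes a local PySpark installation, and the rows are toy data.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("kdd-sketch").getOrCreate()

# Toy rows: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 0.2, 1.0), (0.1, 0.9, 0.0), (0.9, 0.3, 1.0), (0.2, 0.8, 0.0)],
    ["f1", "f2", "label"],
)

# Feature assembly and model fitting as one pipeline, so the same
# code runs unchanged whether the data is local or distributed.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
model.transform(df).select("f1", "f2", "prediction").show()

spark.stop()
```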

Deep Learning Advances

More tangentially relevant to OpenGov is the ‘hot’ field of deep neural networks, which continued its incursion into state-of-the-art solutions for supervised predictive modeling. Here are some of the talks and papers that delved into its applications:

  • Nando de Freitas, a professor at Oxford University, gave a keynote presentation that summarized many of the recent advances based on the incorporation of memory, attention, and adversarial learning. He particularly emphasized thinking of deep learning as “modular,” with modules that use “automatic differentiation” (video); a toy illustration of that view follows.
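
To illustrate the “modular” view, here is a toy NumPy sketch (not de Freitas’s code): each module implements a forward pass and a backward pass, and gradients flow through the chain of modules by the chain rule, which is the essence of reverse-mode automatic differentiation.

```python
import numpy as np

class Linear:
    """A module: forward computes y = Wx; backward applies the chain rule."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)  # dL/dW, kept for an update step
        return self.W.T @ grad_out                # dL/dx, passed to the module upstream

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, grad_out):
        return grad_out * self.mask

rng = np.random.default_rng(0)
net = [Linear(3, 4, rng), ReLU(), Linear(4, 1, rng)]

x = rng.standard_normal(3)
out = x
for layer in net:            # forward pass through the modules
    out = layer.forward(out)

grad = np.ones(1)            # dL/dout for a dummy loss L = out
for layer in reversed(net):  # backward pass: chain rule, module by module
    grad = layer.backward(grad)

print("output:", out, "input gradient:", grad)
```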

Other relevant papers and tutorials included:

Interpretable Models

Finally, one of the more unexpected and intriguing topics was the ability to interpret predictive models. Two papers on this topic were:

Conclusion

As you can see, KDD-2016 was impressively rich in content for data science practitioners working on data-driven government analytics.

Looking ahead, next year the conference moves to Halifax, Nova Scotia (and to London, England in 2018). We look forward to both!
