Google Summer of Code — The lift off

Niloy Purkait
6 min read · Jun 2, 2020


https://summerofcode.withgoogle.com/

What’s this about?

Greetings all! This article documents my experience starting off in the Google Summer of Code (GSoC) program. GSoC is a global initiative that offers any aspiring developer or software engineer an exciting opportunity to brush up on their coding skills over the summer. Participants are given a period of three months during which they work alongside an enthusiastic open source community, contributing to a specific project. Naturally, I am very excited to be part of this journey, especially since this will be the first concrete open source project in which I’ll have the opportunity to participate! I truly look forward to internalizing new skills, learning from a curated community of brilliant minds, and working with people residing in all corners of the globe.

How does GSoC work?

https://summerofcode.withgoogle.com/how-it-works/

As an aspiring researcher in the field of Artificial Intelligence, I could not think of a better opportunity to familiarize myself with real-world problems and tackle them in a structured, collaborative manner. One may look at the GSoC program as an open source apprenticeship, where several organizations present various use cases, seeking software-based solutions from the online community. As an applicant, one is expected to pick a specific use case from a specific organization and submit a proposal comprising a detailed plan of attack. Once this plan is approved by Google mentors, the battle may begin! You establish contact with your mentors and start brainstorming ways to approach the given project. In my case, I was introduced to Dr. Diego Moussallem and Dr. Thiago Castro Ferreira, both established researchers and friendly, helpful people!

What is the DBpedia project?

https://wiki.dbpedia.org/

There exist many fields (e.g. NLP), sub-fields (e.g. interactive dialogue systems) and meta-fields (e.g. transfer learning) in the realm of artificial intelligence, each equally exciting in its own regard. Personally, I have always been fascinated by the human ability to extract, distill, categorize, and represent knowledge from raw data. Hence, for me, a natural choice of organization to work with was DBpedia. These folks have made a name for themselves by systematically extracting structured content from various online resources (e.g. Wikidata, the New York Times, the CIA World Factbook) and representing the facts in knowledge graphs made available to any and all. Such an effort opens the door for researchers and industry practitioners alike to bridge the gap between how humans and machines communicate, providing for intuitive and interactive interfaces. Some of you might wonder at this point: what is so special about knowledge graphs? How can such a data structure be leveraged to improve our day-to-day interactions with machines, be it googling an exotic recipe or asking Siri for directions to the nearest shop that is not out of hand sanitizer?

The LOD cloud diagram, a Knowledge Graph that manifests as a Semantic Web of Linked Data. This image shows datasets that have been published in Linked Data format, which is a format used by DBpedia.

What’s so special about DBpedia’s Knowledge Graphs?

A knowledge graph (KG) is essentially a special kind of database, one that stores knowledge in a machine-readable format. As it turns out, such a representation is quite useful for searching, organizing, sharing and consuming any kind of information from the web. In fact, Google uses a similar approach to generate the knowledge cards you see during a search.
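To make “machine-readable” a little more concrete, here is a minimal sketch using the rdflib Python library. It stores a single fact as a (subject, predicate, object) triple and then looks it up programmatically; the DBpedia-style URIs are purely illustrative.

```python
from rdflib import Graph, Namespace

# Illustrative DBpedia-style namespaces for resources and ontology terms.
DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

# A tiny in-memory knowledge graph holding a single fact as a
# (subject, predicate, object) triple.
g = Graph()
g.add((DBR.Berlin, DBO.country, DBR.Germany))

# Because the fact is structured, a machine can answer
# "which country is Berlin in?" mechanically.
for _, _, country in g.triples((DBR.Berlin, DBO.country, None)):
    print(country)  # http://dbpedia.org/resource/Germany
```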

A little snippet of text from DBpedia’s description on the GSoC organizations page will give you an idea of what this is all about:

“DBpedia currently describes 38.3 million “things” of 685 different “types” in 125 languages, with over 3 billion “facts”. It is interlinked to many other databases (e.g., Wikidata, New York Times, CIA World Factbook). DBpedia provides tools that allow you to create, maintain, improve, integrate and use KGs to build applications. For example, BBC has created the World Cup 2010 website by interconnecting textual content and facts from their knowledge base. Data provided by DBpedia was greatly involved in creating this knowledge graph. More recently, IBM’s Watson used DBpedia data to win the Jeopardy challenge. DBpedia data is served as Linked Data, which is revolutionizing the way applications interact with the Web. One can navigate this Web of facts with standard Web browsers, automated crawlers or pose complex queries with SQL-like query languages (such as SPARQL). Have you thought of asking the Web about all cities with low criminality, warm weather and open jobs? That is the kind of query we are talking about.”
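To give a feel for what such a query looks like in practice, here is a small sketch that asks DBpedia’s public SPARQL endpoint for a few large cities, using the SPARQLWrapper Python library. The population filter is just an illustrative stand-in: criteria like criminality or open jobs from the quote are not standard DBpedia properties.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query DBpedia's public SPARQL endpoint for a handful of large cities.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT ?city ?population WHERE {
        ?city a dbo:City ;
              dbo:populationTotal ?population .
        FILTER (?population > 5000000)
    }
    LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```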

What are you actually doing?

So, this is all well and good, but where does my project fit in? Long story short, while machines are quite happy working with knowledge graphs, humans would much rather have a nice snippet of text explaining the semantic relations between entities. Consider the example provided below, where one may see a simplified version of such a knowledge graph and its corresponding verbalization.

Example of a KG and its English language verbalization, borrowed from here

The latter is much more palatable to humans than the former. The problem is that a lot of information on the web is organised as the former. The question then becomes: how can we make machines take such knowledge graphs as inputs and produce natural language verbalizations? After all, we would like machines to be able to communicate intuitively with humans. We want them to be able to reason about, and respond to, complex search queries by assimilating relevant information from various online sources. This is where my contribution comes in.

The Idea

Graph-to-text Natural Language Generation (NLG) is the computational process of generating natural language from non-linguistic data. The data in question here takes the form of Resource Description Framework (RDF) triples, which permit a graphical representation. Below, we can see a training instance, comprising two RDF triples, borrowed from the WebNLG dataset.
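As a rough sketch of what such an instance boils down to, here are two (subject | predicate | object) tuples written in plain Python. The entities are inferred from the verbalization shown a little further below, so treat them as an illustration rather than the exact WebNLG record.

```python
# Two RDF triples of the kind found in WebNLG, written as plain Python tuples.
# (Illustrative entities; the actual dataset stores them as full RDF resources.)
triples = [
    ("Arròs_negre", "country", "Spain"),
    ("Arròs_negre", "ingredient", "White_rice"),
]
```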

Once these triples are converted into a graphical representation, we can represent relations between entities using the structure of the graph itself (e.g. edges and vertices). Essentially, we wish to approximate a function capable of mapping the information contained in a graph (i.e. various entities and their interrelations) to an adequate natural language verbalization. In other words, we would like systems that can come across such knowledge graphs and generate human-readable summaries of the information contained therein. For the above example, a summary could look something like this:

‘White rice is an ingredient of Arròs negre which is a traditional dish from Spain.’
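Building the graph view that the model actually consumes from such triples is straightforward; below is a minimal sketch using the networkx library, with the same illustrative triples as above. Each triple becomes a labelled, directed edge between two entity nodes.

```python
import networkx as nx

# Same illustrative triples as above; each one becomes a labelled,
# directed edge between two entity nodes.
triples = [
    ("Arròs_negre", "country", "Spain"),
    ("Arròs_negre", "ingredient", "White_rice"),
]

kg = nx.DiGraph()
for subj, pred, obj in triples:
    kg.add_edge(subj, obj, label=pred)

print(kg.nodes())              # the entities of the graph
print(kg.edges(data="label"))  # the relations connecting them
```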

And wherever there’s a complex function to be approximated, provided enough data, deep learning is probably the way to go. Of course, the kind of network one would want to employ depends largely on the nature of the task and the available data itself. In our case, the WebNLG corpus provides ample RDF triples of (subject | predicate | object), mapped to English-language text. Specifically, my approach seeks to develop an RDF-to-text system using Generative Adversarial Networks, which can estimate the quality of the textual outputs accurately and efficiently and back-propagate that signal through the model. The exact architecture, training methodology and evaluation of the system will be revealed in the blog posts to come. This article simply marks the beginning of my project, which, as of the 1st of June, has entered the coding period!
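To make the idea a little more tangible, below is a heavily simplified, illustrative PyTorch sketch of the kind of setup I have in mind: a graph convolutional encoder, a recurrent (GRU) decoder acting as the generator, and a recurrent discriminator whose score is used as a reward in a REINFORCE-style policy-gradient update (the training scheme described in the next section). All class names, dimensions and the single-layer architectures are assumptions made for the sake of illustration, not the final implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNEncoder(nn.Module):
    """A single graph-convolution layer: node features are mixed along the
    edges given by a (normalised) adjacency matrix."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, hid_dim)

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, in_dim), adj: (num_nodes, num_nodes)
        return torch.relu(self.linear(adj @ node_feats))


class RecurrentDecoder(nn.Module):
    """GRU generator: conditioned on the encoded graph, it emits a
    distribution over the vocabulary at every decoding step."""
    def __init__(self, vocab_size, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, h0):
        # tokens: (batch, steps), h0: (1, batch, hid_dim)
        outputs, _ = self.gru(self.embed(tokens), h0)
        return self.out(outputs)                       # (batch, steps, vocab)


class RecurrentDiscriminator(nn.Module):
    """GRU discriminator: scores a token sequence as human-written vs.
    generated, providing the reward signal for the generator."""
    def __init__(self, vocab_size, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, tokens):
        _, h = self.gru(self.embed(tokens))
        return torch.sigmoid(self.score(h[-1]))        # (batch, 1)


if __name__ == "__main__":
    vocab, hid, nodes, steps = 1000, 64, 3, 12
    encoder = GCNEncoder(hid, hid)
    generator = RecurrentDecoder(vocab, hid)
    discriminator = RecurrentDiscriminator(vocab, hid)

    node_feats = torch.randn(nodes, hid)               # dummy entity embeddings
    adj = torch.eye(nodes)                             # dummy adjacency (self-loops only)
    graph_repr = encoder(node_feats, adj).mean(dim=0)  # pool node states into one vector
    h0 = graph_repr.view(1, 1, hid)                    # initial decoder state, batch of 1

    # Pretend these token ids were sampled step by step from the generator.
    tokens = torch.randint(0, vocab, (1, steps))
    logits = generator(tokens, h0)

    # REINFORCE: the discriminator's score acts as the reward that scales the
    # log-likelihood of the sampled sequence; minimising this loss nudges the
    # generator towards sequences the discriminator finds realistic.
    reward = discriminator(tokens).detach()                                  # (1, 1)
    token_logp = F.log_softmax(logits, dim=-1).gather(
        -1, tokens.unsqueeze(-1)).squeeze(-1)                                # (1, steps)
    generator_loss = -(reward.squeeze(-1) * token_logp.sum(dim=1)).mean()
    generator_loss.backward()
```

In the real system, the discriminator would of course be trained in alternation with the generator, and decoding would happen token by token rather than on a pre-sampled sequence; both are omitted here for brevity.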

Up next

We will be experimenting with Graph Convolutional encoders and recurrent decoders for the generator network, and plan to train it using policy gradients borrowed from the field of reinforcement learning. The discriminator will also be a recurrent architecture, and will be used to supply our generator agent with a reward signal, guiding its convergence. I look forward to sharing more details of this project in my following GSoC blog posts, and I hope you enjoyed this introductory article. Until next time!

Written by: Niloy Purkait


