Google Summer Of Code 2021
Link to Repository : Link
I am Paritosh Singh, an undergraduate student in Computer Science and Engineering pursuing my B.Tech at NIT Kurukshetra, with a deep interest in programming, development and data engineering. I am an open-source enthusiast and truly believe in its capacity to influence and contribute to the world. This blog describes my journey through the project I pursued under Google Summer of Code 2021.
Google Summer of Code (GSoC) matches students with open-source organizations to write code and contribute. When the program was announced for 2021, I jumped right into searching for projects proposed by participating organizations. My interests and skill set led me to the organization Wikiloop, and I immediately got in contact with the mentors of the project Improving Google Search on Buddhist Entities by Ingesting BDRC Database Into Wikidata.org.
The mentors to the project were:
Forest Jiang — Employee at Google
Élie Roux — Technical lead of the Buddhist Digital Resource Center
Elan Hourticolon-Retzler — Employee at Google
Wikidata has proven to be one of the most efficient open sources of structured knowledge; its data is consumed directly by Google's search algorithm and directly affects Google search results. As a document-oriented database, Wikidata stores information in a structured manner that search engines can process efficiently. Optimized and complete Wikidata content improves the chances of ranking higher on a Google search engine results page. Knowledge Graphs pull in data from a variety of sources such as Wikidata and Wikipedia; this data is then used to understand user search intent and to answer search queries directly as a Google Knowledge Graph card.
The Buddhist Digital Resource Center (BDRC) maintains a well-curated database. The entities that already existed on Wikidata often lacked a name label in Tibetan. The project focused on adding missing as well as new entities, with Tibetan, Chinese and English labels and aliases, to Wikidata from BDRC. Élie Roux (mentor), the technical lead at BDRC, has worked on this database from its construction through its ongoing maintenance.
The goal of the project was to dramatically improve the usability of both Google and Wikidata for Tibetans and the wider community. The contributions made throughout the project have proven very helpful, and the results continue to improve over time.
The repository used for the project: Link.
The extracted data can be found in this directory of the repository: Link
GSoC’21 Journey Start:
Student Application Period:
During the application phase of Google Summer of Code, I began by establishing an understanding of the data present on BDRC:
- Persons Data — a subgroup of BDRC's data repositories containing multiple directories, with a .ttl file for each person entity.
- Places Data — a subgroup of BDRC's data repositories containing multiple directories, with a .ttl file for each place entity.
- RDF vocabulary (and language tags) used by BDRC.
I created and submitted the proposal for evaluation, and it was accepted by the mentors. I was given the amazing opportunity to pursue the suggested project.
Community Bonding Period:
Once the community bonding period started, I wasted no time: with the help of the mentors and members of the community, I dived right into creating the mapping between BDRC and Wikidata for the people and places data.
This was my first encounter with Resource Description Framework (RDF). RDF is a general method of describing data by defining relationships between data objects. Élie Roux greatly and generously helped me in getting familiar with the data and ontology followed on BDRC before starting the data extraction process.
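To make the triple idea concrete, RDF data in Turtle (.ttl) syntax describes an entity through subject–predicate–object statements. The snippet below is a simplified, hypothetical person record (the prefixes follow BDRC's public namespaces, but the ID and exact property choices here are illustrative, not the exact BDRC ontology):

```turtle
@prefix bdr:  <http://purl.bdrc.io/resource/> .
@prefix bdo:  <http://purl.bdrc.io/ontology/core/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Hypothetical person record, simplified for illustration
bdr:P0001 a bdo:Person ;
    skos:prefLabel "ཚེ་རིང་"@bo ;    # Tibetan label
    skos:altLabel  "Tsering"@en .    # English alias
```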
After learning and understanding the data we were dealing with, we decided to divide the complete project into three parts:
- Extraction and uploading of People data from BDRC to Wikidata.
- Extraction and upload of Places data from BDRC to Wikidata.
- Extraction of relationships and other properties of people data, namely kinship relations and teacher–student relations. (All the extracted data can be found in the GitHub repository.)
With a decent understanding of the ontology on both BDRC and Wikidata established, the next big task was to learn the RDFLib Python library. RDFLib is a Python library for working with RDF. Working from the available documentation and previously written code, I was able to kick-start the coding process with a script that extracts only the names of entities from a single directory: Link.
Learning while implementing, and improving along the way, turned out to be the key to a smooth-sailing project. The next two weeks were spent refining the mapping and writing the code for extracting the data.
In the following weeks, code was written, through trial and error, to extract the persons data. Code Link
The CSV sheet created by the extraction code includes the following data for a person entity:
- Label Name in Tibetan, Chinese and English
- All alias Names in Tibetan, Chinese and English
- Date of Birth & Death
- Tradition followed by the person
- Role/Occupation — mapping
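The shape of one output row can be sketched like this. The field names and values below are illustrative stand-ins for the sheet columns listed above; the separator used for multi-valued fields is also an assumption:

```python
import csv
import io

# Hypothetical extracted record mirroring the sheet's columns
person = {
    "label_bo": "ཚེ་རིང་",
    "label_en": "Tsering",
    "aliases_en": ["Tshe ring"],   # multi-valued field
    "birth": "1850",
    "death": "1920",
    "tradition": "Nyingma",
    "role": "author",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(person))
writer.writeheader()
# Join multi-valued fields with a separator the next tool can split on
writer.writerow({**person, "aliases_en": "|".join(person["aliases_en"])})
row = buf.getvalue()
```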
RDFLib does not come with the best documentation, and efficient use of the library comes only with experience and practice. Learning it was therefore a major task in the coding process, but once concrete knowledge was formed, with help from the mentors, the task was smooth sailing.
OpenRefine (previously Google Refine) is a tool used to clean, analyze and process large amounts of data. Its major application in this project was reconciliation of data against entities already present on Wikidata and creation of the Wikidata schema for our extracted data. With the few tutorials available on using OpenRefine, I was able to master the tool quickly and create the Wikidata schema for people entities. This video, created by me, explains the complete creation of the schema in OpenRefine.
Ingestion of data into Wikidata could be done in two ways: exporting through OpenRefine, or using the QuickStatements tool. For a data set as large as ours, the most efficient and problem-free option was QuickStatements (QS). QS is a tool for uploading schema-structured data to Wikidata, and it lets users revert whole batches in case any error occurs.
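QuickStatements batches are plain tab-separated commands. A minimal, hypothetical batch creating one person entity might look like the following (the label and alias text are illustrative; `P31` → `Q5` is Wikidata's "instance of: human" statement, and `Lbo`/`Len`/`Abo` set labels and aliases per language):

```text
CREATE
LAST	Lbo	"ཚེ་རིང་"
LAST	Len	"Tsering"
LAST	Abo	"ཚེ་རིང་པ་"
LAST	P31	Q5
```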
Using this workflow, 16,938 person entities and 41,109 names and aliases were uploaded to Wikidata, along with multiple other properties for each entity.
All the batches of uploads can be found here: Link
By now my understanding of the data on BDRC and Wikidata was pretty concrete, and writing the code for places data extraction was not a difficult task. The CSV sheet generated by the places extraction code contains the following data:
- Type of Place — mapped here
- Label Name in Tibetan, Chinese and English
- All alias names for the place in Tibetan, Chinese and English
- Coordinate Location of the place.
- Tradition associated with the Place
- Date of establishment of the place and date of conversion
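The coordinate column is worth a note: QuickStatements expects globe coordinates in an `@LAT/LON` form, so the extracted latitude/longitude pairs need a small formatting step. A sketch, with an illustrative helper name and example coordinates:

```python
def qs_coordinate(lat: float, lon: float, precision: int = 5) -> str:
    """Format a latitude/longitude pair in QuickStatements' @LAT/LON syntax."""
    return f"@{round(lat, precision)}/{round(lon, precision)}"

# Roughly the coordinates of Lhasa, for illustration
print(qs_coordinate(29.65, 91.1))  # @29.65/91.1
```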
Following a similar pattern to the persons data, the places data was uploaded by first creating the schema in OpenRefine and then using QuickStatements to push the data to Wikidata. 5,042 place entities were uploaded in the process.
Mapping and extraction code were then written for the other relationships present in the persons data, namely kinship relations and teacher–student relations between people entities already uploaded. The code written for this process and the sheets it produced are listed below:
- Extract Person Kinship Relations: Sheet1, Sheet2
- Extract Teacher–Student Relationship: Sheet
Due to bad reconciliation in OpenRefine, multiple BDRC IDs were added to 254 people entities. The only way to resolve this issue without disturbing the other uploads was to remove the extra IDs manually. The problem was resolved, and care was taken that it did not recur in any further uploads. The entities causing the issue were recorded here and uploaded to Wikidata.
Improvement in Search Results:
To measure the improvement in search results, this code takes as input a set of Tibetan name strings (labels or aliases), conducts a Google search for each, and takes a screenshot of the results.
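The query-building half of that idea can be sketched as below; the screenshot step itself would require a headless browser (e.g. Selenium), which is out of scope here. The function name and example query are illustrative:

```python
from urllib.parse import quote_plus

def google_search_url(query: str) -> str:
    """Build a Google search URL for a (possibly Tibetan) query string."""
    # quote_plus percent-encodes the UTF-8 bytes of the query
    return "https://www.google.com/search?q=" + quote_plus(query)

url = google_search_url("ཚེ་རིང་")
```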
By taking screenshots of the search results over various periods of time, we can observe and analyze the impact of the uploaded data.
The screenshots taken can be found here: Link
Future Scope in the Project:
Works such as scriptures and books could also be uploaded from BDRC to Wikidata in the future, to improve their visibility in Google Search.
It has been a wonderful experience, and I am very pleased that I was able to contribute to completing this project. Tibetan community members will now get better results when using Google Search, and the labels and aliases added will make Google Search and Wikidata more effective when searching with Tibetan strings.
Collaborating with a huge and diverse community ranging from programmers, editors, and various volunteers from across the globe has helped me learn and improve my soft skills and has given me an insight into the working of an organization. I have come to truly believe in the vision, “Imagine a world where we can all share freely in the sum of all knowledge”.
Finally, a 10-week-long learning extravaganza has come to an end. It was a very enlightening and experience-filled process and will surely help me in the future. I would once again like to thank my mentors for believing in me, helping me throughout the project, and making this summer program a successful one. I will cherish this great learning experience, and I hope to keep contributing in the spirit of open knowledge for all.