Summarizing GSoC 2018 with InterMine

Ankit Lohani
12 min read · Aug 13, 2018


This is a summary of the work I have been doing for the past 3 months and a report for the final evaluation process. Hold your breath and read it patiently, because it contains a lot of information.

What is the project all about?

People accessing the WWW every day!

Around 85% of the world's data is considered unstructured (Digital Reasoning website, IBM), and arguably only the remaining 15% is effectively utilized. Search engines have long struggled to retrieve the best results for user queries. Services such as Web search, question-answering, product retrieval on eCommerce websites, flight reservations, etc. that operated on unstructured content had access only to HTML. Applications requiring access to the structured data underlying these Web pages had to build custom extractors to convert plain HTML into structured data. These efforts were often laborious, and the scrapers were fragile and error-prone, breaking every time a site changed its layout. In 2011, the major search engines — Google, Bing and Yahoo — came together and launched a standard markup, Schema.org, for web content publishers to improve consistency across search engines relying on the data. The schema.org site “provides a collection of schemas that webmasters can use to markup their pages in ways recognized by major search providers. Search engines…rely on this markup to improve the display of search results, making it easier for people to find the right web pages.”

The structure has been evolving since then. The World Wide Web contains immense amounts of data from all domains — technology, politics, history, etc. — and Schema.org is working to provide schema markup for all of these data structures. In a similar vein, the Bioschemas.org initiative is working to provide markup for data in the Life Sciences domain. Once a markup is approved within the Bioschemas community, it is pushed to Schema.org for acceptance. The markup is meant to be adopted by biobanks and data warehouses publishing biological, medical and life-sciences data on the WWW, so that the data becomes easily discoverable by its users — researchers, enthusiasts, universities, etc.

This brings InterMine into the picture. InterMine is an open-source data warehouse system used to create databases of biological data accessed by sophisticated web query tools. It can build databases from a single data set or integrate multiple sources of data, and it includes a user-friendly web interface that works ‘out of the box’ and can be easily customized (Wikipedia). InterMine took the initiative to provide an interface for searching bioschemas.org-marked-up data on the Internet as an independent project — Buzzbang.

Buzzbang Logo

Buzzbang is a Google-like search engine for Life Sciences data marked up with bioschemas.org markup. Data from different sources following this markup is collected and indexed in a search tool, which can be accessed via a search interface. The project is in its early stages, since the bioschemas.org markup is itself still evolving and has not yet been adopted by many bio-data banks. We encourage webmasters publishing such data to follow the markup and reach a bigger audience, while the users of these data — research labs, universities and organisations — can use Buzzbang to access them more easily.

Project Architecture

The project has three parts:

  1. a web scraper to scrape the data from websites marked up with bioschemas.org
  2. an indexer to index the scraped data in a search tool — Apache Solr
  3. and a UI, built on the Flask framework, to retrieve the query results

Buzzbang Architecture

Where was the project when I picked it up? When I started, the project already had a similar architecture, but the scraper was slow — a basic BeautifulSoup-based crawler backed by an sqlite3 database — the Solr indexing was minimal, and the UI was a prototype for basic searching. At that stage the project was a proof of concept to show to different stakeholders.

What is the current status? At the end of the three-month GSoC timeline, the scraper has been revamped: it can now scrape a million links from a single domain in 3–4 hours on a personal laptop (8 GB RAM), with configurable settings to optimize the crawl for different hardware. The indexer can now index documents against the latest bioschemas markup, with generalized indexes for the profiles already developed — Beacon, Sample, DataCatalog, etc. In addition, spell-check and suggester modules have been integrated with Solr to assist users in posting queries. Finally, the UI has been given an improved look to display results, suggest spelling corrections and paginate through the results dynamically. These three modules live in separate GitHub repositories to keep the project modular and developer friendly. Below, I explain the developments in each segment in detail so that the active and interested open-source community can contribute to this project further.

Scraper

Repository Link — https://github.com/buzzbangorg/bsbang-crawler-ng

  1. Scrapy

Scrapy has been chosen as the framework for scraping the websites. The reasons for choosing Scrapy are explained in detail in the last three blog posts I wrote on it — [1] [2] [3]. To start with, SitemapSpider is used to crawl through the nested sitemaps that list the data pages on a given site. Almost all websites today publish sitemaps and reference them in their robots.txt file, along with the privileges granted to scrapers. The Scrapinghub community has also developed a small module — extruct — to extract embedded markup (microdata, JSON-LD, etc.) from HTML pages. Using these two modules, the basic scraping module was developed; the implementation can be tracked in this file.
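As a rough illustration of how these two pieces fit together (not the actual bsbang-crawler-ng code), the sketch below combines a SitemapSpider with extruct to pull JSON-LD out of every page listed in a sitemap. The spider name and sitemap URL are placeholders; the real crawler reads its targets from configuration.

```python
import extruct
from scrapy.spiders import SitemapSpider


class BioschemasSitemapSpider(SitemapSpider):
    name = "bioschemas_sitemap"
    # Hypothetical target site; SitemapSpider follows nested sitemap indexes automatically.
    sitemap_urls = ["https://www.example.org/sitemap.xml"]

    def parse(self, response):
        # extruct parses embedded JSON-LD (and other syntaxes) out of the raw HTML.
        data = extruct.extract(
            response.text, base_url=response.url, syntaxes=["json-ld"]
        )
        for jsonld in data.get("json-ld", []):
            yield {"url": response.url, "jsonld": jsonld}
```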

  2. MongoDB

MongoDB is used as the database for temporary storage of the raw data scraped from different sources. My mentor and I are collaborating on a common structure for the data ingested into MongoDB, so that different crawlers can use the same structure for temporary storage before indexing into any search engine tool; the progress can be tracked here. Scrapy provides a very simple way to connect to MongoDB through its item pipeline, as depicted in this script.
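For readers unfamiliar with Scrapy item pipelines, here is a minimal sketch of how scraped items can be written to MongoDB. The settings keys, database and collection names are illustrative, not the project's actual values.

```python
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting names; Scrapy passes its settings object here.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "buzzbang"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert on URL so re-crawls overwrite stale markup instead of duplicating it.
        self.db["pages"].update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item
```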

  3. Middlewares and Extensions

Two powerful features of Scrapy that accelerated the scraping process are middlewares and extensions. The middleware is designed so that if a URL has already been scraped and stored in the DB within the last 7 days, the request is dropped. Extensions provide the best way to log requests and responses: the crawler has a detailed logger built on Scrapy signals. The Scrapy settings used for optimisation — concurrent requests per domain and total concurrent requests — are tuned based on these results. The logger records various scraping metadata after each crawl for developers. You can see a snapshot of a log here —

Scraping Logger

The code snippets for the middleware and the extension are here and here.
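The drop-if-recently-scraped idea can be sketched as a downloader middleware like the one below. It assumes the MongoDB collection above stores a "last_scraped" timestamp per URL; the connection details and field name are assumptions, not the project's actual code.

```python
import datetime

import pymongo
from scrapy.exceptions import IgnoreRequest


class SkipRecentlyScrapedMiddleware:
    """Drop requests for URLs already scraped and stored within the last 7 days."""

    def __init__(self):
        # Illustrative connection; the real crawler would read this from settings.
        self.pages = pymongo.MongoClient()["buzzbang"]["pages"]

    def process_request(self, request, spider):
        cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=7)
        doc = self.pages.find_one({"url": request.url})
        if doc and doc.get("last_scraped", datetime.datetime.min) > cutoff:
            raise IgnoreRequest(f"{request.url} was scraped within the last 7 days")
        return None  # returning None lets Scrapy continue processing the request
```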

  4. Scheduler for automated re-crawling

Data on the websites of interest is updated frequently, so these websites need to be re-crawled and their markup re-ingested into MongoDB for re-indexing. For this purpose, an automated job scheduler is used to schedule the task on cloud machines. Python has a crontab module for managing scheduled jobs via Cron; its functions let us access Cron, create jobs, set restrictions, remove jobs, and more. This script schedules a crawl every 2 days (as per the current settings) and updates the DB. The scheduler accepts the basic optimization parameters to tune the crawl to the hardware.
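A minimal sketch of scheduling a re-crawl every 2 days with the python-crontab module might look like this. The command, working directory and job comment are placeholders for the project's actual job.

```python
from crontab import CronTab

cron = CronTab(user=True)  # the current user's crontab

# Replace any previous copy of the job rather than stacking duplicates.
cron.remove_all(comment="bsbang-recrawl")

job = cron.new(
    command="cd /opt/bsbang-crawler-ng && scrapy crawl bioschemas_sitemap",
    comment="bsbang-recrawl",
)
job.setall("0 0 */2 * *")  # at midnight, every 2 days
cron.write()
```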

These are the 4 pillars of the new-generation bsbang-crawler. The code is up to date in the repository and ready to use. Please refer to the mind map to see the list of implementations and the scope for improvements.

Indexer

Repository Link — https://github.com/buzzbangorg/bsbang-indexer

Apache Solr UI
  1. Solr Setup, Configuration & Schema Generation
    This part was implemented by my mentor before my GSoC started and is fairly straightforward. We create a Solr core and connect to it using the APIs/curl commands. After creating the core, the schema for Solr needs to be defined. The previous indexer was limited to a hard-coded schema covering only a few of the specifications defined by Bioschemas. In the present indexer, all the specifications defined by Bioschemas have been converted to XML format, and all of them are used for indexing the documents. This makes it possible to index any website using the current bioschemas.org markup with ease. The indexes are created in the following format (a sketch of registering such fields via the Solr Schema API follows this list) —
    <PROFILE>.<PROPERTY>
    E.g.: beacon.dataset, beacon.name, beacon.url
    To Be Done — In the current code base, the specifications have been converted to XML manually. We need to discuss with the bioschemas community how to publish the specification drafts/releases in a parsable format so that the indexer module can fetch the current version before every configuration.
  2. Data Ingestion and Indexing Optimizations
    The raw data from MongoDB is fetched and indexed in Solr. The data is, however, flattened before indexing: the JSON-LD extracted from the websites is nested and thus cannot be stored in Solr directly. Solr does provide methods for indexing nested documents, but to keep the indexes simple and more useful for searching, weighting results and faceting, a simple flattening method is used. The flattening logic is scripted here. Most importantly, the previous indexer posted one document at a time, which was slow: fetching data from MongoDB takes almost no time, but each POST to the Solr core takes seconds, so the indexing script took hours to index data that could have been indexed in minutes. This observation is well explained in this video. The issue can be tackled in two ways. The first is to paginate through MongoDB, retrieving a few hundred documents at a time and posting each batch to Solr in a single request, as programmed here; this is what was proposed for the GSoC period and has been successfully implemented in the current project (a flatten-and-batch-index sketch also follows this list). The second way, multi-threading, is covered next.
  3. Multi-threading (in progress)
    This was not part of my GSoC proposal, but I tried to optimize the indexer further to index millions of documents even faster. The approach uses two threads — one retrieves documents from MongoDB and stores them in a queue, while the other pushes documents from that queue to the Solr core. After spending a few days on it, the tests still failed, so this part has been kept as a future target. The code used is here. Check it out if you are willing to contribute.
  4. De-duplication
    This, again, was not part of my initial proposal, but I felt it would be necessary to prevent duplicate documents from being indexed in the Solr core. Solr natively supports de-duplication of this type via the Signature class and allows new hash/signature implementations to be added easily. In the current implementation, the Lookup3Signature method is used to create signatures for the indexed documents. It is a 64-bit hash used for exact-duplicate detection, much faster than MD5 and smaller to index. It can be configured by adding a small piece of XML (here) to the solrconfig.xml file of the core being used for indexing. This file can be edited in two ways: either open it in a text editor by going into the Solr core directory where the core was created, or script the change using the Solr Config API. The latter has not yet been done and is To Be Done.
  5. Query, Spell-check and Suggester Module
    Three scripts — for querying, spelling corrections and auto-suggestions — have been prepared and kept in the bsbang-indexer repository. They have no direct use within this repository; they are called from the main app.py script in the bsbang-frontend repository. Since these modules are developed and tested within the indexer, it is worth understanding them here so that they can be called directly from the Flask app. The Apache Solr community provides several built-in query parsers; after going through them, the extended dismax (edismax) parser was found to be the most suitable for our use case, as its features make it the best fit for general search-engine query parsing, as discussed in the Solr documentation. The spell-check and suggester modules, on the other hand, have to be integrated with the Solr core in the solrconfig.xml file. Like the de-duplication setup, these two modules can be integrated by editing the config file as depicted here and here. The script to edit the config file is yet to be prepared; users can edit the file manually for now. Contributions are welcome for this part as well.
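As promised in point 1, here is a hedged illustration (not the indexer's actual code) of registering a few <PROFILE>.<PROPERTY> fields, such as beacon.name, through the Solr Schema API. The core name, field list and field types are assumptions; the real indexer derives the fields from the Bioschemas specifications rather than hard-coding them.

```python
import requests

SOLR_SCHEMA_URL = "http://localhost:8983/solr/bsbang/schema"  # hypothetical core

fields = [
    {"name": "beacon.name", "type": "text_general", "stored": True},
    {"name": "beacon.dataset", "type": "text_general", "stored": True},
    {"name": "beacon.url", "type": "string", "stored": True},
]

for field in fields:
    # The Solr Schema API accepts JSON "add-field" commands over HTTP.
    response = requests.post(SOLR_SCHEMA_URL, json={"add-field": field})
    response.raise_for_status()
```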
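And here is a simplified sketch of the flatten-then-batch-index idea from point 2, using pysolr. The connection strings, Mongo collection, dotted-key convention and batch size are illustrative.

```python
import pymongo
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/bsbang", always_commit=False)
pages = pymongo.MongoClient()["buzzbang"]["pages"]


def flatten(jsonld, prefix=""):
    # Flatten nested JSON-LD into dotted keys, e.g. beacon -> name becomes beacon.name.
    flat = {}
    for key, value in jsonld.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


BATCH_SIZE = 100
batch = []
for page in pages.find():
    batch.append(flatten(page["jsonld"]))
    if len(batch) >= BATCH_SIZE:
        solr.add(batch)  # one HTTP request per batch instead of one per document
        batch = []
if batch:
    solr.add(batch)
solr.commit()
```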

This summarizes the work done in the bsbang-indexer repository. This repository is also ready to use; please raise issues if you face any set-up challenges. Please follow the MindMap section in the repository README to track what has been implemented and what is still left to be done. Contributions are welcome.

Frontend

Repository Link — https://github.com/buzzbangorg/bsbang-frontend

Buzzbang search UI
  1. Improvements in the UI and generic result display
    The image above shows the current look of the Buzzbang UI. Several improvements have been made:
    a. The number of search results displayed per page has been added
    b. 10 results are displayed per page, and the old primitive look has been replaced with a more professional layout for the results; compare it with the old version — buzzbang.science
    c. The pages have navigation buttons — previous and next
    d. In case of spelling mistakes, the user is first shown suggestions, from which they can select the correct option and be redirected to its results
    The frontend is built with Bootstrap libraries and components, whereas the earlier version of the project used primitive CSS styles.
  2. Handling misspellings, suggesting corrections and redirecting to the right results
    The user query is submitted from the UI to the Flask app using the WTForms library, and the app sends a request to the Solr core. The request goes to the spell-check module first; if the spell-check module suggests changes, the suggestions are displayed, and the user is expected to choose the correct alternative, which redirects them to the corresponding results page. This script performs this request (a combined sketch of the query and pagination flow follows this list).
  3. Advanced Pagination
    Pagination is achieved via two key functions, next_results() and prev_results(), in main.py. Suppose a query returns 143 results: they are not all retrieved at once by the UI, since that can slow down the page response when a query returns thousands of hits. Instead, the query returns only 10 results at a time, and the next button fetches the following 10.
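The sketch below condenses the flow described in points 2 and 3: an edismax query with spellcheck enabled, fetching 10 results per page via start/rows. The route, query fields, core URL and template name are assumptions, not the actual app.py code.

```python
import pysolr
from flask import Flask, render_template, request

app = Flask(__name__)
solr = pysolr.Solr("http://localhost:8983/solr/bsbang")  # illustrative core URL

PAGE_SIZE = 10


@app.route("/search")
def search():
    query = request.args.get("q", "")
    page = int(request.args.get("page", 1))
    results = solr.search(
        query,
        **{
            "defType": "edismax",                # extended dismax query parser
            "qf": "beacon.name beacon.dataset",  # illustrative query fields
            "spellcheck": "true",
            "spellcheck.collate": "true",
            "start": (page - 1) * PAGE_SIZE,     # fetch only the requested page
            "rows": PAGE_SIZE,
        },
    )
    # Solr's flat spellcheck format alternates the literal "collation" with each
    # suggested query, so keep only the suggestion strings themselves.
    collations = (results.spellcheck or {}).get("collations", [])
    suggestions = [c for c in collations if c != "collation"]
    return render_template(
        "results.html",  # hypothetical template name
        docs=results.docs,
        total=results.hits,
        page=page,
        suggestions=suggestions,
    )
```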

These are all the developments on the frontend of the Buzzbang project. I have also created a small screen recording showing the UI. Please have a look at the MindMap section in the repository README to track the developments and get ideas on how to contribute to this project further.

Summary

Finally, I would like to thank Google and InterMine for giving me the opportunity to work on such an exciting project. My mentor, Justin Clarke-Casey, has been very supportive throughout the past three months of the GSoC journey. I really hope that more and more people start following the bioschemas markup and that this project is developed further for professional use cases. I will keep contributing to this project and the bioschemas community in the future. Anyone interested in collaborating on and contributing to this project is heartily welcome. Please feel free to reach out to me via email — lohani.1575@gmail.com

Signing off!

Innovationchef
