Interested in Google searching the biological web? — InterMine — GSoC 2018

Ankit Lohani
2 min readMay 23, 2018

--

Why InterMine?

I have never worked on an open source project and I realized that GSoC is the best place to start learning and seeing my stuff at work. Honestly, while looking for organizations, in which I may be able to contribute, I came across InterMine and the various projects enlisted here. The application domain of InterMine is very appealing and I could relate myself with this organization because of two key reasons — first, my past internship was on information retrieval on clinicaltrials.gov data. I touched upon various topics like — semantics, ontologies, UMLS (Unified Medical Language System), PubMed, Named-Entity Recognition for biological terms etc. Secondly, because the technologies used in this project were something I have been familiar with as a part of my course and term projects, like Solr, elasticsearch, docker. Apart from these, the project itself has got a unique potential to create a breakthrough in the way complex scientific data is organized on the Internet.

Plans for this summers

My project — Buzzbang, is significantly different from all other InterMine instances and it focuses on scraping all the data we have on internet marked with bioschemas.org and indexing them in a search tool — Apache Solr. So far, a basic scraping module and an indexing engine are up and running. I am planning to integrate “Scrapy” for crawling and indexing new paths and upgrading the Solr search tool in this project. Towards the end of this project, I will make sure all the changes are reflected in the front-end as well.

Anticipated Challenges

I believe there could be some serious challenges that I might face with Scrapy. Making a generalized scraping tool looks easy with the data having bioschemas.org markup, but, the organization of this data on various domains varies, and crawling across some of those domains might not be a simple task. Moreover, we are also planning to introduce some degree of parallel processing to this module. Though my focus would be on EBI biosamples domain, which should make my task easier, I will try to keep the crawler as general and powerful as I can. Additionally, I suspect I would need some help in planning the architecture for the re-crawling and re-indexing part from the community. I am not very sure about what level of automation would be desirable in this project with respect to the previous point.

--

--

Ankit Lohani

When there‘ s so much that could be done, there’s only so much that you could do!