A Twitter bot for keeping up with your academic research subfield (1/)
Welcome to my first Medium post! Here I will explain why and how I built biophotonicat, a Twitter bot that scrapes research articles, screens them for relevance to biophotonics using machine learning, and tweets out articles on a regular schedule.
Motivation
I had recently completed a PhD in the field of biophotonics (also known as biomedical optics), which, broadly speaking, is the use of light (from light bulbs to lasers) for imaging, sensing, or measurement in biology/medicine. For my first postdoctoral stint I moved into machine learning.
Among the many culture shocks in this very different field, I was struck by the community’s widespread and effective use of social media, particularly Twitter. Almost all leading researchers tweet regularly on the latest papers, new developments in the field, and their personal musings on research trends and societal implications. In addition to an incredibly active and congenial group of human tweeters, several bots (automated Twitter accounts) trawl the internet for new papers and hot discussions. In a field growing this explosively, social media would seem to be one of the most effective tools for blasting out the latest and greatest.
I was also pleasantly surprised to find vibrant Twitter niches in microscopy, neuroscience, and parts of biology and physics, but there was relatively little Twitter activity in other areas of biophotonics. As someone who recently deleted Facebook and uses Instagram for memes and cat photos, I am certainly guilty of underutilizing social media for the good of science.
A related observation was how powerful Twitter is at disseminating new papers and results. Most people in other areas, including biophotonics, rely on email alerts from journals, repositories, and news aggregator sites including the excellent OCT News. I use RSS feeds to follow not only the most relevant journals (e.g. Biomedical Optics Express) but also more general journals (e.g. Optics Express, Optics Letters, Scientific Reports) and higher-impact journals (e.g. Optica, LSA, Nature Photonics/Methods/Communications, Science Advances). The difficulty with the latter is that only a very small fraction of the papers in these broadly scoped journals is relevant to my research subfield of biophotonics (fewer than 1 in 10, possibly 1 in 100 for some journals).
These rare occurrences in high-impact journals are almost always exciting and important advances, but the irony of reporting them in a high-impact journal with broad scope is that the research community in that subfield may not see them for some time (perhaps not until a talk is given during a specific topic session at a conference). While some academics on Twitter are skilfully using the platform to publicize their work, most other research subfields lack the critical mass of active Twitter users needed to make manual publicity (“hey check out my new paper!”) effective.
Most journals do not group or tag articles into smaller subcategories (‘Engineering’ or ‘Biophysics’ is usually as far down as they will go). I usually end up manually scrolling through hundreds of RSS posts on my Feedly app every couple of weeks just to pick out the 10 or fewer articles in biophotonics. Scrolling through email alerts is similarly mind-numbing. I already knew that a Twitter solution could make this a lot more fun. A bot could find new and relevant papers and tweet them, so that I and others would easily see them without needing to wait for everyone to become friends on Twitter. I just had to build it.
The model behind the bot is a logistic regression classifier with 3 classes (0: non-optics, 1: non-biomedical optics, 2: biomedical optics), trained on 68,000 titles (last updated: 3 May 2019) scraped from PubMed and arXiv. Its accuracy on holdout test data is 80.7%.
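To give a feel for what a 3-class title classifier looks like, here is a minimal sketch using scikit-learn. It is not the actual biophotonicat code: the TF-IDF features, the `classify` helper, and the handful of made-up training titles are all assumptions for illustration, and a real model would be trained on the tens of thousands of scraped titles described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Class labels as described in the post.
LABELS = {0: "non-optics", 1: "non-biomedical optics", 2: "biomedical optics"}

# Hypothetical toy titles (invented for this sketch, not real training data).
train_titles = [
    "Deep reinforcement learning for robotic grasping",
    "Gut microbiome diversity in infants",
    "A survey of graph neural networks",
    "Protein folding dynamics in yeast",
    "Silicon photonic waveguide for telecom lasers",
    "Quantum dot laser with low threshold",
    "Fiber laser frequency comb metrology",
    "Photonic crystal waveguide resonators",
    "Optical coherence tomography of the human retina",
    "Fluorescence microscopy of live tissue",
    "Photoacoustic imaging of tumor vasculature",
    "Multiphoton microscopy of mouse brain tissue",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Assumed feature choice: TF-IDF over single words and word pairs,
# fed into a multi-class logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_titles, train_labels)


def classify(title):
    """Return the predicted class (0, 1, or 2) for a single paper title."""
    return int(model.predict([title])[0])


if __name__ == "__main__":
    for title in [
        "Optical coherence tomography of tumor tissue",
        "Low threshold fiber laser",
        "Graph neural networks for protein folding",
    ]:
        print(title, "->", LABELS[classify(title)])
```

A bot built this way would only tweet titles predicted as class 2. With real data you would also hold out a test set (for example with scikit-learn's `train_test_split`) to measure accuracy, as in the figure quoted above.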
In this series (WIP), I will describe the different components of the project and how they were set up. I built biophotonicat in my free time after work, so you can similarly tackle your project in bite-sized chunks. Each part is somewhat distinct, so I will split the series into multiple sections for the convenience of readers who may be here for a specific procedure or tip. Code will be provided on GitHub once I finish this series. I am pretty new to nearly all of this (Python, machine learning, Twitter, Medium), so please forgive any inaccuracies or non-optimal implementations (leave comments to let me know!). With this information, you will be able to build a bot for your own academic research subfield.
1/ A Twitter bot for keeping up with your academic research subfield (this post)
2/ Scraping big data from public research repositories e.g. PubMed, arXiv
3/ Building a machine learning model to classify research paper titles
4/ Reading and writing to a simple database using the Google Sheets API
5/ Running a bot on a regular schedule using Heroku
Have fun, and feel free to get in touch!
And here’s the paper in Science Advances. Did you find it?