Playing with GPT3 in CACTUS hackathon: Semantic Search
CACTUS organized a GPT3 hackathon as soon as we got access to the OpenAI tool. Employees with a diverse range of roles and backgrounds came together in teams to generate ideas and build prototypes.
The team that I was part of built two things:
- A description generator for scientific concepts, and
- Free-text semantic search functionality on a huge corpus of scientific publications.
Project 1: Generating descriptions for concepts
As a warm-up, we tried out GPT3’s famous text generation function by having it generate descriptions of concepts. Concepts extracted from scholarly writing play an important role in several CACTUS/UNSILO products, such as Research Discovery and Classify. Being able to get a description of an unfamiliar concept will be of great help, especially to non-expert users.
We experimented with concepts from the biomedical domain. We used Wikipedia to help us manually craft some good example descriptions, in order to show GPT3 what to do. This worked well for concepts that GPT3 knows enough about, i.e. concepts that were well represented in the data GPT3 was trained on.
As GPT3 was trained in the fall of 2019, it turned out that it didn’t know anything about COVID-19. We also got bad results for some of the more obscure MeSH concepts. We therefore tried to give GPT3 the opportunity to learn from our own collection of scientific articles. We used our in-house Concept Extraction API to identify articles in which the concept we wanted to describe (e.g. ‘COVID-19’) was prominent. Then we extracted the sentences from those articles’ abstracts that mention the concept and provided them as extra information to GPT3. The results got a lot better.
Concept: ‘COVID-19’
2019-nCov is a human protein encoded by the 2019 gene. It has been shown to interact with CDK5R1, CENPF and NCOA3.
The novel coronavirus (2019-nCoV) is a member of the family Coronaviridae, genus Betacoronavirus. It was first identified in September 2019 in China and has since spread to other countries. The virus causes severe acute respiratory syndrome (SARS)-like symptoms.
Concept: ‘Casein Kinase I alpha’
Casein kinase I alpha (CKIA) is an enzyme that in humans is encoded by the “CSNK1A1” gene.
Casein kinase I (CKI) is a family of serine/threonine protein kinases that phosphorylate the N-terminal amino acids of caseins. Casein kinase I alpha (CKIα) is one member of this family, and it has been shown to be involved in the regulation of apoptosis.
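The approach of combining hand-crafted examples with extracted context sentences can be sketched roughly as follows. The exact prompt layout we used is not reproduced here, so `build_prompt` and its formatting are illustrative assumptions, not the actual implementation:

```python
# Illustrative sketch of few-shot prompt construction for GPT3. The prompt
# layout below is an assumed format, not the one actually used in the hackathon.

def build_prompt(concept, context_sentences, examples):
    """Build a description-generation prompt: hand-crafted example
    descriptions first, then sentences that mention the target concept
    (extracted from article abstracts), then the request itself."""
    parts = []
    for example_concept, example_description in examples:
        parts.append(f"Concept: '{example_concept}'\n"
                     f"Description: {example_description}\n")
    # Extra context: abstract sentences in which the concept appears.
    parts.append("Context:\n" + "\n".join(context_sentences))
    # The open-ended ending invites GPT3 to complete the description.
    parts.append(f"Concept: '{concept}'\nDescription:")
    return "\n".join(parts)
```

The key idea is that the context block lets GPT3 describe concepts absent from its training data, while the worked examples pin down the expected output style.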
It was really exciting to see that GPT3 can get the gist of a new concept in this way. Some caution is advised though.
GPT3 will make things up that are not always correct. For example, in one run it correctly guessed that COVID-19 started in China, even though this information was not included in the extracted sentences, but it also stated that its fatality rate was 50%! This is an often-observed issue with GPT3’s text generation. Here is a recent research project that tries to address it by coaxing GPT3 into providing evidence for its claims.
Project 2: ‘Ask Discovery’, free-text semantic search
Research Discovery was primarily conceived as a notification service, suggesting relevant articles to read based on a user’s interests. At the time of the hackathon, it was possible to search the huge collection of available articles using the concepts extracted from them by UNSILO Concept Extraction, our document fingerprinting tool for finding related documents.
Being restricted to concepts as input gives a rather limited search experience, though. With our hackathon project Ask Discovery we set out to enable free-text search using GPT3’s semantic search functionality. (Here is an enthusiastic story about GPT3’s semantic search, boasting that you can even look for abstract things like ‘humorous situations’ and get sensible results.)
It is common practice in search to use a cheap method to retrieve a manageable number of potentially relevant documents as a first step. A more sophisticated method then re-ranks the results of this first search, bringing the most relevant documents to the top of the list. This is what we needed to do here as well, because GPT3 can search only 200 documents at a time. The only search functionality we had available for our document collection was our Concept Search, which was already built into Research Discovery. So we decided to use the Concept Search API to narrow down the number of documents for GPT3 to search in. As we wanted to accept free-text queries, we needed to extract concepts from the query first. Fortunately, we have an API for that as well. The free-text semantic search thus works as follows:
- A user enters a free-text query
- Concept Extraction API extracts concepts from the query
- Concept Search API finds the 200 most relevant documents for those concepts
- GPT3 re-ranks those 200 documents according to how well they match the user’s original free-text query
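The four steps above could be wired together roughly like this. The callables `extract_concepts`, `concept_search`, and `gpt3_rerank` are hypothetical stand-ins for the Concept Extraction API, the Concept Search API, and GPT3’s semantic search; they are injected as parameters so the sketch stays self-contained:

```python
# Sketch of the Ask Discovery pipeline. The three callables are placeholder
# names for the real services (in practice, HTTP calls to the respective
# APIs), not actual client functions.

def ask_discovery(query, extract_concepts, concept_search, gpt3_rerank,
                  candidate_limit=200):
    """Free-text search: concept-based retrieval, then semantic re-ranking."""
    concepts = extract_concepts(query)                      # step 2
    candidates = concept_search(concepts, candidate_limit)  # step 3 (max 200)
    return gpt3_rerank(query, candidates)                   # step 4
```

Passing the services in as parameters is just a way to keep the sketch runnable and testable; the structure mirrors the retrieve-then-re-rank pattern described above.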
Building semantic search on top of Concept Extraction is of course a hack. We can easily fail to extract an important concept from the query. However, as long as we extract some concepts correctly there is still a decent chance that there will be some very relevant documents in the initial search results. And if that is the case, GPT3 can do its magic. Leveraging the skills of our different team members, we were able to build this on top of our existing system, with a UI that ensured a good user experience.
The results were very encouraging. Many were impressed by Ask Discovery, and it won the prize for second-best project in the hackathon. Below are some examples of how re-ranking with GPT3 improves search results.
Follow-up after the hackathon
We followed up on the Ask Discovery project after the hackathon ended. As GPT3 is a bit pricey for use in a free service, we investigated whether we could get similar results using Google’s universal sentence encoder (GUSE). (This article suggests that the embeddings from GUSE are not necessarily much worse than GPT3 embeddings, and they are orders of magnitude cheaper.)
Whereas GPT3 provides a full semantic search experience, GUSE provides only embeddings. Therefore, it was not immediately clear how to get the best results. Should we compare the query to the abstracts in our data set? Or should we rather look for abstracts that contain a sentence that is very similar to the query? Maybe a combination of both factors? We couldn’t find anywhere how GPT3’s semantic search works under the hood, so we couldn’t use that as inspiration. We would need to do some thorough testing to find the right setup, which we haven’t yet done.
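As a sketch of what such testing might compare, here is an illustrative embedding-based re-ranker using plain cosine similarity over toy NumPy vectors. The 50/50 blend of whole-abstract similarity and best-sentence similarity is an arbitrary assumption standing in for the setup we would still need to find, not a tested configuration:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(query_vec, docs, weight=0.5):
    """Re-rank documents by blending query-vs-abstract similarity with
    query-vs-best-sentence similarity. The default 50/50 weight is an
    untested assumption.

    docs: list of (doc_id, abstract_vec, sentence_vecs) tuples, where the
    vectors would come from an encoder such as GUSE."""
    scored = []
    for doc_id, abstract_vec, sentence_vecs in docs:
        whole = cosine(query_vec, abstract_vec)
        best_sentence = max(cosine(query_vec, s) for s in sentence_vecs)
        scored.append((weight * whole + (1 - weight) * best_sentence, doc_id))
    return [doc_id for _, doc_id in
            sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

Varying `weight` between 0 (best-sentence only) and 1 (whole-abstract only) is exactly the kind of experiment the thorough testing mentioned above would need to cover.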
Instead, we are now implementing a more standard free-text search functionality for Research Discovery, which removes the hack of depending on concept search. Once this is in place, adding semantic re-ranking using GUSE or GPT3 will be the logical next step. Especially on longer queries, we expect this to have a bigger impact than re-ranking models based on the user's past behavior, as it is important to deliver semantically appropriate results. Our project helped prioritize this, as we created a hacked-together but fairly complete free-text search experience for Discovery that people within CACTUS could try out.
Takeaways
- GPT3 is fun and easy to play around with, but also very much a black box.
- The output generated by GPT3 can look really impressive, but the quality can vary a lot from run to run, and it is not easy to distinguish between information that can be found somewhere and claims that are made up by the system.
- GPT3’s semantic search looks very promising, but you do need to first reduce the search space to a maximum of 200 documents.
- A hackathon is a great way to meet colleagues you would not otherwise work with and try out something new.
- Showing an actually working demo helps to convince people that an idea is worth pursuing and can influence priorities in the mid to long term.
Oh, and if you do want to be part of a team of tinkerers and dreamers making and breaking things with the likes of GPT3 and its fellow transformers, CACTUS is always hiring!