Using the latest large language models like GPT to do Q&A over webpages or papers (I)

Aaron Tay
Academic librarians and open access
11 min read · Feb 27, 2023


One early commentary on the “threat” of large language model (LLM) tools like ChatGPT is that such tools currently tend to make up references.

Already we see librarians reporting students asking them to help find fake references generated by ChatGPT.

Of course, as I have noted, this state of affairs is likely to be temporary, given that it is fairly straightforward to mitigate the issue by asking such models to:

  1. Do a search over documents or webpages (text or semantic matching)
  2. Rank the best ones (in practice this means extracting the best sentences or passages that answer the query)
  3. Extract and summarise the answers from the documents or webpages and cite them.
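
The three steps above can be sketched in plain Python. This is a toy illustration, not any particular product's pipeline: the scoring here is simple word overlap (a stand-in for real lexical or semantic ranking), and the final step just builds the prompt that a real system would send to an LLM.

```python
def score(query: str, doc: str) -> float:
    """Toy lexical score: fraction of query words present in the doc."""
    q_words = set(query.lower().replace(".", "").split())
    d_words = set(doc.lower().replace(".", "").split())
    return len(q_words & d_words) / len(q_words)

def answer(query: str, docs: list[str], top_k: int = 2) -> str:
    # Steps 1-2: search the documents and rank the best candidates.
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    best = ranked[:top_k]
    # Step 3: in practice, an LLM is prompted with the best passages
    # and asked to answer the query, citing the passages it used.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(best))
    return f"Answer the question using these sources:\n{context}\nQ: {query}"

docs = [
    "Aerobic exercise improves cardiovascular health.",
    "Strength training builds muscle mass.",
    "Regular aerobic exercise reduces blood pressure.",
]
print(answer("does aerobic exercise help blood pressure", docs))
```
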

This was indeed done for past unreleased models like OpenAI’s WebGPT, while Google’s LaMDA used a knowledge base to help answer questions. Now, of course, Bing’s chatbot (supposedly code-named “Sydney”) combines Bing Search with a version of OpenAI’s GPT.

In theory, combining a real live search with the natural language processing capabilities of GPT gives us the best of both worlds. Unlike ChatGPT, whose training data only runs up to a certain date in 2021 (training LLMs from scratch is expensive!), a model hooked up to a search engine can give answers as up to date as what it can find.

Also, I suspect that LLMs asked to summarise answers from retrieved documents tend to be relatively accurate (perhaps 70–80% accurate, based on some informal tests with such tools), as opposed to asking an LLM like ChatGPT questions whose answers rely on the “learning” from its pretraining, where it can “hallucinate” and make up answers.

While I have not been able to get access to the new Bing+GPT service, I saw the following example posted, where the user asked a question about summarising studies on aerobic exercise.

While the example above looks good, you will notice the sources are general web domains that probably aren’t the best sites you want your evidence to come from (a scholarly source would be much better).

While OpenAI’s WebGPT was trained to try to learn which websites are reputable (the model has full access to a Bing API, which it can use to send queries and extract webpages), this might not be something built into the current Bing model.

One obvious idea in the case of doing academic queries is to do this search only over academic websites/papers.

This is indeed what tools like Elicit and Scispace do. See my post on academic Q&A systems covering Scispace, Elicit, and Galactica.

In practice, if you are looking for health-related evidence, it might even be better to run the query only over systematic reviews and meta-analyses! Elicit allows you to filter to that class of items.

Explaining the Q&A system in detail

Take Elicit as an example. The data source it uses is the metadata (title, abstract) and full text (for selected open access papers) from Semantic Scholar. When you type a search query, it first needs to find possibly relevant papers/documents that might be useful for answering the question.

Traditionally this would use information retrieval ranking algorithms like TF-IDF or the more modern BM25, which are quick but mostly keyword-based (also known as lexical search). Systems may then do a second reranking step, typically using more expensive “neural” search methods (also known as semantic search), which produce state-of-the-art results for relevance ranking.

Technically speaking, traditional lexical search generates sparse vectors (vectors are just strings of numbers that represent tokens/words), while semantic methods (using neural nets) learn dense vector embeddings. The latest embeddings are contextual (e.g. BERT embeddings) as opposed to static (e.g. word2vec, GloVe). Matching is done by checking how “close” the query vectors are to the vectors representing the documents, e.g. using cosine or dot-product similarity.

Dense vectors tend to outperform sparse vectors because they embed “meaning” in their representation, but they are expensive to process and hard to interpret. More on this in a future post.
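
A toy numpy sketch of the difference. The dense values below are made up for illustration, not output from a real model: the point is that synonyms get zero lexical overlap but high embedding similarity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sparse one-hot style vectors over a vocab ["car", "automobile", "banana"]:
# "car" and "automobile" share no terms, so lexical match scores them 0.
sparse_car = np.array([1.0, 0.0, 0.0])
sparse_auto = np.array([0.0, 1.0, 0.0])

# Dense embeddings (toy values): synonyms land close together in the space.
dense_car = np.array([0.9, 0.1, 0.0])
dense_auto = np.array([0.85, 0.15, 0.05])

print(cosine(sparse_car, sparse_auto))  # 0.0 — no lexical overlap
print(cosine(dense_car, dense_auto))    # close to 1.0
```
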

In practice these days, this expensive neural/semantic search means using contextual embeddings from state-of-the-art LLMs from OpenAI, or open-source variants like BLOOM, OPT, Google FLAN-T5, etc.

One possibility, for example, is to use OpenAI’s new embedding API, which is relatively fast and cheap. At the time of writing, text-embedding-ada-002 is recommended. Elicit’s documentation used to claim that it uses the built-in Semantic Scholar keyword API search to get the top 1,000 results before applying other LLM/neural methods to rerank them. While this works, it might be better to do the whole ranking with semantic/LLM-based methods from the very beginning.

The latest Elicit instead takes a copy of the Semantic Scholar corpus, converts titles and abstracts to embeddings (using paraphrase-mpnet-base-v2, a sentence-embedding model derived from a BERT-style model) and stores this copy locally.

This allows a semantic type of match, so the first step for your query uses sentence-transformer embeddings directly for matching. The 400 closest embeddings are retrieved in this first step.

You can read further details in Elicit’s documentation on how this ranking works: it then runs the 400 results through the GPT-3 Babbage search endpoint for the first ranking step, then a fine-tuned T5 model for the second ranking step. The ranking step takes titles and abstracts as input and computes how relevant each (title, abstract) pair is to your question.

Essentially, it translates your query into embeddings via the LLMs, does the same for the top results, and ranks by similarity using something like cosine or dot-product similarity.
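
That ranking step can be sketched with numpy. The embeddings below are random stand-ins for real model output; the shape (400 candidates, as in Elicit's first retrieval step) and the dot-product ranking are the point.

```python
import numpy as np

# 400 candidate (title, abstract) embeddings, one row each, 64-dim here.
# Unit-normalising rows makes dot product equal to cosine similarity.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(400, 64))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# The query is embedded with the same model, then normalised too.
query = rng.normal(size=64)
query /= np.linalg.norm(query)

# One matrix-vector product scores every candidate at once.
scores = doc_embeddings @ query
top10 = np.argsort(scores)[::-1][:10]  # indices of the 10 best candidates
print(top10)
```
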

Beyond ranking, how do LLMs give an exact answer?

So, we have seen how Elicit or search engines rank papers. But how do such systems actually know how to answer questions with direct answers?

In the query below, Elicit is asked about the use of Google Scholar for systematic reviews. Semantic search allows it to rank the best papers, but how does it know what to write about them?

Elicit query: “can you use google scholar alone for systematic reviews?”

My evaluation of the generated text: while it does identify the key papers and mostly correctly summarises what they say, I would have preferred it to explain why, say in Giustini 2013, Google Scholar is not enough to be used alone for systematic reviews.

Roughly speaking, when you do a search, the system will try to find the best match between your query’s embedding and the best passage or sentence’s embedding (it may find the closest k documents first and then narrow down).

It will then take the best passages and ask the LLM to answer the query given those passages.
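
A minimal sketch of this step, i.e. stuffing the found passages into a prompt for the LLM. The prompt wording here is my own, not any particular tool's; a real system would send the resulting string to a completion API.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble retrieved passages and the question into one LLM prompt."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the sources below, "
        "citing them as [1], [2], ...\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

passages = [
    "Google Scholar's coverage is broad, but its search interface "
    "lacks features needed for reproducible systematic searches.",
]
prompt = build_prompt("Can Google Scholar alone support systematic reviews?",
                      passages)
print(prompt)
```
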

Here’s another explanation from Simon Willison

You can also see OpenAI’s sample code notebooks showing the code needed in Python: “Question Answering using Embeddings” and “Web Crawl Q&A”.

Of course, Elicit is currently unique among academic search tools in that it goes further and even extracts specific details of each top paper, such as “takeaways” and “population”, and displays them as separate columns, generally using LLMs. This is mostly done, as you might expect, by prompting LLMs with questions about each paper.

In detail, they take an item like “Does the study have a placebo?”, think about how a researcher might figure this out, and break it down into steps. Each step is then queried with the LLMs. For example, to check if a study has a placebo, a researcher might try to find sections on “trial arms”, extract the relevant paragraphs, and check those paragraphs for the presence of a placebo. The results are compared against a gold standard; if the results are not good, they manually check what went wrong and break the steps down further. Each step can in turn be attempted with machine learning/LLMs.

There are some default pre-available columns you can add (I assume Elicit has done fine-tuning and prompt-engineering work to improve results there), but you can always create your own column with your own custom prompt.

Elicit extracts information about papers and displays it in additional columns

Other tools like Scispace and scite are similar. The major difference between scite and the rest is that it doesn’t use full text, but just citation statements/citances, as the raw material for answering questions.

scite “Ask a question” (beta) search: “can you use google scholar alone for systematic reviews?”

My evaluation of the generated text: because the evidence is generated from a mix of both abstracts and citances (scite has no full text), it can cite a paper whose citances say that another paper said X (a secondary citation). This can be confusing.

Pre-ingesting the corpus and converting it into embeddings

As currently described in the documentation, Elicit relies heavily on the Semantic Scholar keyword API to pull out the right documents. After all, if the top 1,000 documents miss a relevant article, no amount of further reranking with more advanced methods will help.

Fortunately, Semantic Scholar keyword search uses quite an advanced algorithm (Elasticsearch followed by its own ML-based reranker), but this may not be so for other search APIs, which may use simple keyword matching.

Given that the latest contextual embeddings are superior for matching, if you are building a Q&A search over your documents or website, it is theoretically better to harvest all the documents and precompute embeddings for them before any search, so that matching can be done without a first-step keyword match (assuming speed isn’t an issue).

Here a blogger does this using the free, open gtr-t5-large embedding model, converting 6,400+ blog posts into embeddings from the T5 model. He does not require a first keyword step and can directly find the most similar match between the document embeddings and his query embedding.

Building a Q&A over your website or a group of papers

As I write this, many people are trying similar ideas.


There’s even one called S2QA that works directly with the Semantic Scholar API and is very close to a base version of Elicit.

Here’s another

Don’t want to code at all? Try the Perplexity Chrome extension.

If you understand a bit of Python and are comfortable with API use, a lot of the examples above are very doable if you want to create a good, state-of-the-art Q&A search over a set of documents or, say, over webpages in your domain.

The main issue is that the LLM APIs, specifically OpenAI’s, are paid. Alternatively, you could opt for an open-source model like Google FLAN-T5 or OPT, but this takes even more skill and a powerful GPU to run the model.

Let me show you an experimental no-code method that might be interesting, so you can do Q&A over your website or even just academic papers.

Restricting results by domain using Perplexity and similar engines

When ChatGPT sprang up, a couple of search engines did the obvious thing that Bing would soon do: combine LLMs with search, as I described above.

To create a no-code Q&A search using such engines or Perplexity, simply restrict the search by domain.

An example search: “how many books can i borrow as a undergraduate”

I am unsure if it honours the site: operator like Google does… but I’ve found that if you just put in the domain name, it will show other pages that have the domain name in the text, and it happily shows 10 result hits from my domain and extracts the answer from there! (Note the answer may or may not be right!) If you try the site: operator and there is no result, it will say something like “it seems you are looking for something on the domain … but no results were found”, though I can’t tell if this is just the LLM mimicking what it has seen.

Note that the source page presents the answer in a table of loan entitlements, which might be hard for the engine to interpret.

Similarly, you can try the same trick with other such engines.

You can try a similar trick with the Perplexity Chrome extension by going to a specific domain (e.g. your website) and selecting “This domain”

And of course, you can use a similar trick to make Perplexity or similar engines generate results only from certain domains hosting academic papers.

But what if we want the results to come only from research papers? You can try the same trick over preprint server domains like arXiv, but is there something that covers more?

In the example below, I make Perplexity get results only from CORE, one of the biggest aggregators of OA papers, which hosts the full text on the same domain.

Below I ask a specific question, “Which paper first coined the term ‘bronze OA’?”, with results restricted to CORE, and it surfaced the right paper (which was OA) as well.

Perplexity search over CORE: “which paper first coined the phrase ‘bronze OA’”

See next blog post for further testing.


I think that in the next five years, Q&A systems will become commonplace. While the results from Perplexity and similar engines are often inaccurate, they are far, far better than the promises of “semantic search” in the 2000s and 2010s.

Some reasons we can expect them to get better:

  1. These are very early implementations (some tools have more mature ones).
  2. We do not know which LLMs are behind these tools, and they may be slightly behind the state of the art. (Perplexity is using OpenAI’s GPT-3.)
  3. We are also unsure how big Perplexity’s coverage is, and how good the algorithm used for the initial retrieval of documents (later used by the LLM to answer questions) is. It might be simple keyword matching, which can be improved much further.
  4. Lastly, I don’t think these systems are at all well tuned for specific use cases, including academic use.




A Librarian from Singapore Management University. Into social media, bibliometrics, library technology and above all libraries.