AI Powered Literature Review

Alon Diament Carmel
11 min read · Apr 17, 2023

Learn how to generate automatic scientific literature reviews using ChatGPT

Photo by Susan Q Yin on Unsplash

In the age of hallucinating chatbots, alternative facts, and far too many academic papers published daily, I often wish for a short overview of the current state of the literature on new and old topics that I’m curious about. This recently motivated me to build an automatic pipeline, based on ChatGPT, that can generate such literature reviews (the work predates the release of GPT-4 and ChatGPT plugins). In this post I’ll share this project with you, describe each of the pipeline’s steps, discuss the challenges in the process, and explain the central prompt design considerations.


  1. Motivation
  2. Overview of the proposed solution
  3. Defining the review’s scope: research questions and search queries
  4. Paper summary: is it relevant at all?
  5. Wikipedia search ranking
  6. Wikipedia summary: overcoming the limited context of LLMs
  7. One prompt to bind them all
  8. Experiment
  9. Conclusion


Motivation

ChatGPT is a truly amazing tool, for many reasons. However, some of the common problems that users run into with the model are: (1) It sometimes makes up facts; (2) Even for correct information, it doesn’t cite sources and instead imagines papers or websites that never existed; (3) It doesn’t have access to recent information.

For example, in the literature review below, ChatGPT (March 14 version) cites very influential authors who are related to the subject of the review, but I couldn’t find any of the cited papers in order to read more or validate the information.

Example: ChatGPT invents papers. (Source: Alon Diament Carmel)

Bing AI is better at this, but from my experience it is more focused on website search and is not optimized for academic papers.

The purpose of the project was to overcome these common issues, while utilizing ChatGPT’s strengths.

Overview of the proposed solution

First, the user provides a free text description of the subject they want to review. GPT then generates research questions and search queries based on this input, with the aim of finding relevant academic papers and Wikipedia pages. Next, GPT evaluates each search result to determine its relevance, and related information is collected as context for the review. Finally, GPT generates a concise, cohesive review, and concludes with suggested follow-up questions.

The diagram below provides a visual representation of the entire process. The flow for academic papers is on the left, while the flow for Wikipedia is on the right. At each step along the way, GPT (highlighted in orange) plays a crucial role in the processing of information. The model used in this app is OpenAI’s gpt-3.5-turbo, which also runs ChatGPT.

The next sections detail all the prompts that are used in the process, highlighting the importance of prompt engineering in achieving accurate and relevant results. The complete code is available here (Joplin note app extension) and here (VS Code extension).

Diagram: Main flow, showing multiple GPT uses throughout the process. (Source: Alon Diament Carmel)

Defining the review’s scope: research questions and search queries

The initial prompt given to GPT is crucial as it sets the tone for the entire review and impacts subsequent steps. In response, GPT generates 3 key research questions, 3 search queries to retrieve relevant literature, and a title. You can see below that the instructions are well defined and detailed. For example, GPT is encouraged to use the operators of the search engine (Scopus) to optimize results. This enables searching for papers that were published before or after a given year, as well as other API-specific functions. Generating multiple queries increases the likelihood of finding relevant literature, even if some queries are not successful. In addition, each Scopus query usually contains multiple AND/OR conditions to make it more flexible. For example: (“Dictatorship” OR “Authoritarianism” OR “Totalitarianism”) AND (“Rise to power” OR “Ascension to power”) AND (“Factors” OR “Contributing factors”) AND (“Historical” OR “Economic” OR “Social”) AND (“Case study” OR “Empirical evidence”). The output follows a clearly defined format for consistency and ease of parsing.

Here is the complete template for the initial prompt:

first, list a few research questions that arise from the prompt below.

next, generate a few valid Scopus search queries, based on the questions and prompt, using standard Scopus operators.
try to use various search strategies in the multiple queries. for example, if asked to compare topics A and B, you could search for ("A" AND "B"),
and you could also search for ("A" OR "B") and then compare the results.
only if explicitly required in the prompt, you can use additional operators to filter the results, like the publication year, language, subject area, or DOI (when provided).
try to keep the search queries short and simple, and not too specific (consider ambiguities).

{insert user prompt}
use the following format for the response.

# [Title of the paper]

## Research questions

1. [main question]
2. [secondary question]
3. [additional question]

## Queries

1. [search query]
2. [search query]
3. [search query]
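
Because the response format is fixed, it can be parsed with a few lines of code. The following is a minimal sketch in Python, not the code of the extensions themselves; the names `ReviewScope` and `parse_scope` are hypothetical, and the sketch assumes that GPT follows the requested layout closely.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ReviewScope:
    """Hypothetical container for the parsed response."""
    title: str = ""
    questions: list[str] = field(default_factory=list)
    queries: list[str] = field(default_factory=list)


def parse_scope(response: str) -> ReviewScope:
    """Parse the '# title / ## Research questions / ## Queries' layout."""
    scope = ReviewScope()
    section = None  # the list currently being filled, if any
    for raw in response.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("## "):
            heading = line[3:].strip().lower()
            if heading.startswith("research questions"):
                section = scope.questions
            elif heading.startswith("queries"):
                section = scope.queries
            else:
                section = None  # unknown heading: ignore its items
        elif line.startswith("# "):
            scope.title = line[2:].strip()
        elif section is not None:
            # Accept numbered items such as "1. text" or "1) text".
            match = re.match(r"^\d+[.)]\s*(.+)$", line)
            if match:
                section.append(match.group(1).strip())
    return scope
```

A caller would typically check that `questions` and `queries` are non-empty, and ask GPT again if the format was not respected.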

Paper summary: is it relevant at all?

Even a well-defined search can return many irrelevant papers, which makes the review harder to write. This is where the following step comes in. For each paper, a GPT prompt determines whether the paper is relevant to answering the research questions and then summarizes only the relevant content, effectively filtering out the noise. Because the prompt requires GPT to explain explicitly why a paper is not relevant, the results are considerably more accurate.

Scopus search returns papers ranked by relevance, number of citations, and date. Our approach samples at random from the highest-ranked papers, adding them one by one until the paper quota for the review is met (about 50% of the maximal context that GPT can handle). For most papers, only the abstract is available via APIs. For papers with full text available, the discussion section is used to provide a more in-depth summary, as summarizing an entire paper can be expensive (more on that later).

you are a helpful assistant doing a literature review.
if the study below contains any information that pertains to topics discussed in the research questions below,
return a summary in a single paragraph of the relevant parts of the study.
only if the study is completely unrelated, even broadly, to these questions,
return: 'NOT RELEVANT.' and explain why it is not helpful.
{insert research questions}
{insert paper text}
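
The filtering and quota logic described above can be sketched as follows. This is an illustration under assumptions rather than the extension's actual code: `summarize` stands for any function that sends the prompt above to GPT and returns its reply, `top_n` is a hypothetical cutoff for "highest-ranked", and tokens are estimated from the word count using the rough ratio of 4,096 tokens to 3,000 words.

```python
import math
import random
from typing import Callable, Optional, Sequence

NOT_RELEVANT = "NOT RELEVANT"


def is_relevant(reply: str) -> bool:
    """Treat a reply as irrelevant only if it opens with the marker.

    Checking the start of the reply avoids false alarms on summaries that
    merely mention that a study is "not directly relevant" to one question.
    """
    head = reply.lstrip().lstrip("'\"").upper()
    return not head.startswith(NOT_RELEVANT)


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): 4,096 tokens is about 3,000 words."""
    return math.ceil(len(text.split()) * 4096 / 3000)


def collect_summaries(
    papers: Sequence[str],
    summarize: Callable[[str], str],
    context_tokens: int = 4096,
    quota: float = 0.5,
    top_n: int = 20,
    rng: Optional[random.Random] = None,
) -> list[tuple[str, str]]:
    """Sample top-ranked papers in random order until the quota is filled."""
    rng = rng or random.Random()
    candidates = list(papers[:top_n])
    rng.shuffle(candidates)
    budget = int(context_tokens * quota)
    kept: list[tuple[str, str]] = []
    used = 0
    for paper in candidates:
        reply = summarize(paper)
        if not is_relevant(reply):
            continue  # filtered out as noise
        cost = estimate_tokens(reply)
        if used + cost > budget:
            break  # quota met
        kept.append((paper, reply))
        used += cost
    return kept
```

A real tokenizer would give a more precise budget than the word-count estimate used here.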

The generated paper summaries are later used to write the final literature review. Here is a positive summary that GPT generated for a relevant paper. Notice how it refers to each of the research questions:

The study titled "How India institutionalised democracy and Pakistan promoted
autocracy" is **relevant to the first research question** on the historical,
economic, and social factors contributing to the rise of dictators. The study
analyzes the factors that led to the institutionalization of democracy in
India and the promotion of autocracy in Pakistan. It examines the social
origins of pro- and anti-democratic movements from 1885-1919, the imagining
and institutionalizing of new nations from 1919-1947, and the organizing of
alliances from 1919-1947. The study also explores the institutionalization of
alliances in India, Pakistan, and beyond. However, the study **is not directly
relevant to the second and third research questions** on the manipulation of
media and propaganda and the role of foreign intervention in the rise and fall
of dictators.

Wikipedia search ranking

Two prompts are used for searching Wikipedia. The first one selects relevant keywords based on the user prompt:

define the main topic of the prompt.
{insert user prompt}
use the following format.
TOPIC: [main topic]

The second prompt selects the most relevant page out of the search results (which is often not the first one listed), based on the text excerpts that Wikipedia provides for each page:

you are a helpful assistant doing a literature review.
we are searching for the single most relevant Wikipedia page to introduce the topics discussed in the research questions below.
return only the index of the most relevant page in the format: [index number].
{insert research questions}
{insert page excerpts}
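
Both replies are short and easy to parse. Here is a minimal sketch with hypothetical helper names. It assumes that the page excerpts were numbered from 0 when they were inserted into the prompt, which the prompt itself does not specify, and it returns `None` whenever the reply cannot be used, so that the caller can fall back to the first search result.

```python
import re
from typing import Optional


def parse_topic(reply: str) -> Optional[str]:
    """Extract the text after 'TOPIC:' from the first prompt's reply."""
    match = re.search(
        r"^\s*TOPIC:\s*(.+?)\s*$", reply, flags=re.MULTILINE | re.IGNORECASE
    )
    if not match:
        return None
    # Drop brackets in case the model echoes the "[main topic]" placeholder.
    return match.group(1).strip("[] ") or None


def parse_page_index(reply: str, n_pages: int) -> Optional[int]:
    """Extract the first integer in the reply and validate it.

    Assumes 0-based numbering of the excerpts (an assumption of this sketch).
    """
    match = re.search(r"\d+", reply)
    if not match:
        return None
    index = int(match.group())
    return index if 0 <= index < n_pages else None
```

Validating the index against the number of search results guards against replies that point at a page that was never listed.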

Wikipedia summary: overcoming the limited context of LLMs

Wikipedia provides a comprehensive and easily accessible source of common general knowledge that can serve as a useful starting point for conducting a literature review. However, lengthy sources such as Wikipedia articles pose a real challenge. While the maximal total context length (input + output) that large language models (LLMs) such as ChatGPT can handle is limited (4,096 tokens, or roughly 3,000 words, in our case), some Wikipedia articles can be much longer. To address this challenge, I use an iterative approach that involves processing each part of the article separately and updating the summary of the entire article each time. This method enables the integration of information from the entire article into the final summary, making it more comprehensive and informative. While there are alternative ways to merge the summaries of various sections, this approach was interesting to apply and explore.

Here is the Wikipedia summary prompt:

here is a section from an article, research questions, and a draft summary of the complete article.
if the section is not relevant to answering these questions, return the original summary unchanged in the response.
otherwise, add to the summary information from the section that is relevant to the questions, and output the revised summary in the response.
in the response, do not remove any relevant information that already exists in the summary, and describe how the article as a whole answers the given questions.
{insert part of the wikipedia page}
{insert research questions}
{insert current version of the summary}
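
The iterative approach can be sketched as a simple loop: split the article into parts that fit the context, and feed each part to the model together with the running summary. This is a sketch under assumptions, not the extension's code: `llm` is any function that takes a prompt string and returns the model's reply, the article is split on blank lines, and `max_words` is a hypothetical per-section limit chosen to leave room for the questions, the summary, and the output.

```python
from typing import Callable, Iterator

SUMMARY_PROMPT = (
    "here is a section from an article, research questions, and a draft summary of the complete article.\n"
    "if the section is not relevant to answering these questions, return the original summary unchanged in the response.\n"
    "otherwise, add to the summary information from the section that is relevant to the questions, and output the revised summary in the response.\n"
    "in the response, do not remove any relevant information that already exists in the summary, and describe how the article as a whole answers the given questions.\n"
    "\n{section}\n\n{questions}\n\n{summary}"
)


def split_into_chunks(text: str, max_words: int) -> Iterator[str]:
    """Group paragraphs into chunks of at most max_words words each."""
    chunk: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        words = para.split()
        # An oversized paragraph is cut into windows of max_words words.
        for i in range(0, len(words), max_words):
            piece = words[i:i + max_words]
            if chunk and count + len(piece) > max_words:
                yield "\n\n".join(chunk)
                chunk, count = [], 0
            chunk.append(" ".join(piece))
            count += len(piece)
    if chunk:
        yield "\n\n".join(chunk)


def summarize_article(
    article: str,
    questions: str,
    llm: Callable[[str], str],
    max_words: int = 1000,
) -> str:
    """Refine one running summary while walking through the article."""
    summary = ""
    for section in split_into_chunks(article, max_words):
        prompt = SUMMARY_PROMPT.format(
            section=section, questions=questions, summary=summary
        )
        summary = llm(prompt).strip()
    return summary
```

Since each call sees only one section plus the current summary, the prompt length stays bounded regardless of the article length, at the cost of one model call per section.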

Here are the first and final versions of an article’s summary. Notice how most of the relevant information did not appear in the first few paragraphs of the article, and was therefore missing from the first summary draft:

**First summary draft:** The article discusses the structure, formation,
and types of dictatorships. It explains that a dictatorship is a form of
government characterized by a leader or group of leaders with few to no
limitations on their power. Dictatorships can be formed through military
coups or self-coups, and they are authoritarian or totalitarian. The article
also identifies different types of dictatorships, including military, one-
party, and personalist dictatorships. The use of propaganda and media
manipulation is not discussed in the article.
**Final summary:** The article explores the historical events that have
contributed to the rise of dictatorships, including societal upheaval caused by
World War I and the unstable peace it produced, which further contributed to
instability that benefited extremist movements and rallied support for their
causes. The aftermath of World War I resulted in a major shift in European
politics, establishing new governments, facilitating internal change in older
governments, and redrawing the boundaries between countries, allowing
opportunities for these movements to seize power. The decolonization of Africa
and Asia prompted the creation of new governments, many of which became
dictatorships in the 1960s and 1970s. Military coups were also a common
occurrence after decolonization, with 14 African countries experiencing at
least three successful military coups between 1959 and 2001. Economic turmoil,
such as the Great Depression, produced populist dictatorships in several Latin
American countries. The article notes that economic factors, poverty, and
natural resource-based economies can affect the functioning of a dictatorship.
The article does not discuss the role of propaganda and media manipulation in
the rise of dictatorships.

The last Wikipedia prompt decides whether the final summary is relevant:

decide if the following summary is relevant to any of the research questions below.
only if it is not relevant to any of them, return "NOT RELEVANT", and explain why.
{insert summary}
{insert research questions}

One prompt to bind them all

The final prompt is where everything comes together. The summaries gathered in the previous steps are aggregated to fit within the model’s maximum token limit. The task at hand is to create a flowing and cohesive review that answers the research questions and brings all the elements together. To keep GPT grounded in facts, the prompt explicitly instructs it to cite only the provided papers, while allowing uncited information that might be considered common knowledge. With this prompt, GPT does not generate fictional papers as long as at least 3–4 papers have been chosen:

write a response to the prompt. address the research questions.
use all relevant papers listed below, and cite what you use in the response.
DO NOT cite papers other than the provided ones,
but you may add additional uncited information that might be considered common knowledge.
try to explain acronyms and definitions of domain-specific terms.
finally, add a section of "## Follow-up questions" to the response.

{insert wikipedia summary}
{insert paper summaries}

## Prompt
{insert user prompt}

## Research questions
{insert research questions}

## Review
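
Assembling this prompt within a token budget might look like the sketch below. How the summaries are trimmed to fit is an assumption of this sketch (the last summaries in the list are dropped first), as are the function names, the `citation: summary` line format, and the default `max_prompt_tokens`, which should leave room for the generated review.

```python
import math

INSTRUCTIONS = (
    "write a response to the prompt. address the research questions.\n"
    "use all relevant papers listed below, and cite what you use in the response.\n"
    "DO NOT cite papers other than the provided ones,\n"
    "but you may add additional uncited information that might be considered common knowledge.\n"
    "try to explain acronyms and definitions of domain-specific terms.\n"
    'finally, add a section of "## Follow-up questions" to the response.'
)


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): 4,096 tokens is about 3,000 words."""
    return math.ceil(len(text.split()) * 4096 / 3000)


def build_review_prompt(
    user_prompt: str,
    questions: str,
    wiki_summary: str,
    paper_summaries: list[tuple[str, str]],
    max_prompt_tokens: int = 3000,
) -> str:
    """Fill the template, dropping trailing summaries until it fits."""

    def render(papers: list[tuple[str, str]]) -> str:
        parts = [INSTRUCTIONS, ""]
        if wiki_summary:
            parts += [wiki_summary, ""]
        for citation, summary in papers:
            parts += [f"{citation}: {summary}", ""]
        parts += [
            "## Prompt", user_prompt, "",
            "## Research questions", questions, "",
            "## Review", "",
        ]
        return "\n".join(parts)

    papers = list(paper_summaries)
    while papers and estimate_tokens(render(papers)) > max_prompt_tokens:
        papers.pop()  # drop the last summary and try again
    return render(papers)
```

Ending the prompt with the `## Review` heading invites the model to continue directly with the review text.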


Experiment

The following review was generated using the proposed pipeline. Notice that the literature review is based entirely on the list of papers, and that all of the references but the last 2 made it into the review.

Also notice that the first paper in the reference list (Acemoglu and Robinson, 2001) is possibly one of the papers that inspired ChatGPT in the review from the first section (it cited the same authors and year). However, this time it’s a genuine paper that can be traced and, as far as I can tell, the takeaway from the paper is more accurate here (although still incomplete). The paper says that during recessions revolutions may overthrow an authoritarian ruling party, while vanilla ChatGPT claimed that during economic hardships people are willing to accept authoritarian rule.

Example: A review generated by the pipeline. (Source: Alon Diament Carmel)


Conclusion

In this post we have seen how to generate automatic scientific literature reviews with the help of ChatGPT and freely available search APIs. This post highlights the importance of prompt engineering in achieving accurate and relevant results, and demonstrates how multiple calls to GPT can be used to build a complete workflow. The context that was extracted from papers enabled GPT to write a fact-based short review, covering 8–10 relevant papers. The metadata that was extracted from papers enabled the user to trace the source of each piece of information. It is worth noting that the results are stochastic and a different review will be generated each time.

The entire process is made possible by the GPT-3.5 model — which is relatively fast and cheap, enabling us to run many queries to generate a single review — but it can be calibrated for other large language models (such as GPT-4, or LaMDA). The pipeline was implemented as text editor extensions, but may potentially be implemented as a ChatGPT plugin in the future. The pipeline can also be adapted to other types of databases, open or proprietary.

I hope that this inspired you to go and read about a new field of study.