Stories by Veronika Smilga on Medium

Creating Assistants with DeepPavlov Dream. Part 4: Document-based Question Answering with LLMs

Veronika Smilga — Mon, 10 Jul 2023 14:08:37 GMT

Introduction

This is the fourth tutorial in our series on creating your very own assistant with DeepPavlov Dream. In our previous tutorials, we developed a bot capable of engaging in conversations about movies and answering factoid questions, as well as a generative bot with an enthusiastic and adventurous persona. We accomplished this by using existing Dream components with only minor modifications, i.e., altering the prompt and switching from one generative model to another. We have also demonstrated how to use a Dream distribution that generates responses by making calls to various APIs and shown how to add a new API of your choice to it.

In this tutorial, we will create a dialog system capable of answering questions over one or several long documents with the use of ChatGPT or another large language model (LLM). We will once again use the existing components with only slight alterations.

Dream architecture

Let’s take a closer look at the parts that make up Dream before we start developing our bot. As you can see in the figure below, our pipeline architecture consists of the following components:

Annotators perform NLU preprocessing on an utterance;
Skill Selector selects the relevant skills that can provide the next bot’s response;
Skills are in charge of generating candidate responses;
Candidate Annotators perform NLU postprocessing on the candidate responses;
Response Selector selects the most suitable response from the generated candidates;
Response Annotators carry out postprocessing of the bot’s response;
Dialogue State stores all the information about the current dialog including all annotations, and meta-data.
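To make the flow between these components concrete, here is a toy sketch of one dialog turn. It is not Dream's actual agent code: run_pipeline, its parameters, and the function-based stages are hypothetical stand-ins, and the Candidate Annotators and Response Annotators are left out for brevity.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for Dream components; each stage is a plain function.
Annotator = Callable[[str], Dict[str, object]]
Skill = Callable[[str, Dict[str, object]], str]


def run_pipeline(
    utterance: str,
    annotators: Dict[str, Annotator],
    skills: Dict[str, Skill],
    select_skills: Callable[[Dict[str, object]], List[str]],
    select_response: Callable[[List[str]], str],
    dialog_state: List[Dict[str, object]],
) -> str:
    # 1. Annotators: NLU preprocessing of the user utterance.
    annotations = {name: annotate(utterance) for name, annotate in annotators.items()}
    # 2. Skill Selector: choose which skills get to answer.
    chosen = select_skills(annotations)
    # 3. Skills: each chosen skill proposes a candidate response.
    candidates = [skills[name](utterance, annotations) for name in chosen]
    # 4. Response Selector: pick one candidate as the final response.
    response = select_response(candidates)
    # 5. Dialogue State: keep the utterance, its annotations, and the response.
    dialog_state.append(
        {"utterance": utterance, "annotations": annotations, "response": response}
    )
    return response
```

In the real system each stage is a separate service, which is why the later steps of this tutorial edit docker-compose files rather than Python code.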

Dream architecture, visualized

About Document-based LLM QA Distribution

Document-based LLM QA is a distribution of Dream designed to answer questions about the content of one or several documents supplied by the user. This distribution uses TF-IDF vectorization and cosine similarity to detect the parts of the documents most relevant to the user’s question. The N most relevant parts, along with an instruction and the dialog context, are then passed to ChatGPT as a prompt. Here is the instruction:

You are a question answering system that can answer the user’s questions based on the text they provide. If user asks a question, answer based on Text that contains some information about the subject.
If necessary, structure your answer as bullet points. You may also present information in tables.
If Text does not contain the answer, apologize and say that you cannot answer based on the given text.
Only answer the question asked, do not include additional information. Do not provide sources.
Text may contain unrelated information. If user does not ask a question, disregard the Text and just talk to them as if you are a friendly question-answering system designed to help them understand long documents.
Text:

When passed to the LLM, this prompt is followed by the N most relevant excerpts from the document(s) provided by the user. These excerpts are the ones with the highest cosine similarity between the TF-IDF vector of the user’s request and the TF-IDF vector of the excerpt.
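The ranking described above can be sketched in plain Python. This is a minimal illustration, not the Document Retriever's code: the tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for clarity, and the default n=5 simply mirrors the PARAGRAPHS_NUM value that appears in the configuration later in this tutorial.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(chunks: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency computed over the chunks."""
    n = len(chunks)
    df = Counter(term for chunk in chunks for term in set(tokenize(chunk)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen at fit time are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_chunks(question: str, chunks: List[str], n: int = 5) -> List[str]:
    """Return the n chunks most similar to the question, best first."""
    idf = fit_idf(chunks)
    q_vec = tfidf(question, idf)
    scored = [(cosine(q_vec, tfidf(c, idf)), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for _, i in scored[:n]]
```

Because TF-IDF only matches surface tokens, a question phrased with words that never occur in the document will score zero against every chunk. This is a known limitation of lexical retrieval in general.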

By extracting the most relevant chunks and inserting only these chunks into the prompt instead of the entire text, the distribution is capable of answering questions over documents that are much longer than the LLM’s context length.
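As a rough illustration of why the prompt stays short, the sketch below joins an instruction, a handful of retrieved chunks, and the dialog turns into one string. The function name build_prompt, the newline separator, and the "User:" prefix are assumptions made for this example; the actual skill may lay out the chunks and the dialog context differently.

```python
from typing import List


def build_prompt(instruction: str, chunks: List[str], dialog_context: List[str]) -> str:
    """Join the instruction, the retrieved chunks, and the dialog turns with newlines."""
    return "\n".join([instruction, *chunks, *dialog_context])


# The prompt grows with the number of chunks, not with the document length.
example = build_prompt(
    "Answer based on Text.\nText:",
    ["First relevant excerpt.", "Second relevant excerpt."],
    ["User: What does the paper say about training data?"],
)
```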

Scheme of Document-based LLM QA Distribution

Document-based LLM QA distribution consists of the following components:

Annotators:

Sentseg — allows us to handle long and complex user’s utterances by splitting them into sentences and recovering punctuation if it is missing;
Document Retriever (train_and_upload_model endpoint) — when the documents are received, this endpoint converts them to txt format if necessary and splits them into chunks of ~100 words. The chunks are then transformed into a term-document matrix; the resulting vectors and the vectorizer are saved for further use. This step is performed only once;
Document Retriever (return_candidates endpoint) — converts the user’s utterance into a TF-IDF vector and finds N candidates with highest cosine similarity among TF-IDF vectors of text chunks.

Skills:

LLM-based Q&A on Documents Skill — feeds the instruction and the N text chunks into a language model as a prompt to generate a response.

Candidate Annotators:

Combined Classification — performs the classification of topics, dialog acts, sentiment, toxicity, emotion, and factoid classification;
Sentence Ranker — ranks the candidate responses based on their semantic similarity to the user’s request.
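The chunking step performed by the Document Retriever can be approximated with a few lines of Python. This is a simplified sketch under the assumption that chunks are cut purely by word count; the function name and the fixed 100-word default are illustrative, and the real service may treat sentence or paragraph boundaries differently.

```python
from typing import List


def split_into_chunks(text: str, words_per_chunk: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most `words_per_chunk` words."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]
```

A 250-word document would yield three chunks of 100, 100, and 50 words, each of which gets its own TF-IDF vector.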

Let’s create our bot!

1. Install DeepPavlov Dreamtools:

pip install git+https://github.com/deeppavlov/deeppavlov_dreamtools.git

2. Clone Dream repository:

git clone https://github.com/deeppavlov/dream.git

3. Go to cloned repository:

cd dream

4. Create a distribution from Document-based LLM QA:

dreamtools clone dist my_prompted_document_based_qa --template document_based_qa --display-name "Prompted Document-Based QA" --author deepypavlova@email.org --description "This is a primitive dialog system that can answer your questions about the documents. It uses OpenAI ChatGPT model to generate responses." --overwrite

dreamtools add component components/sdjkfhaliueytu34ktkrlg.yml --dist my_prompted_document_based_qa

Optional; run if you will be using GPT-4:

dreamtools add component components/jkdhfgkhgodfiugpojwrnkjnlg.yml --dist my_prompted_document_based_qa

Optional; run if you will be using text-davinci-003:

dreamtools add component components/lkjkghirthln34i83df.yml --dist my_prompted_document_based_qa

5. In assistant_dists/my_prompted_document_based_qa/docker-compose.override.yml, add a file server as one of the containers. Paste the following lines at the end of the file, right before version: '3.7' (mind the indentation!):

  files:
    image: julienmeerschart/simple-file-upload-download-server
version: '3.7'

6. In the same file, assistant_dists/my_prompted_document_based_qa/docker-compose.override.yml, doc-retriever container, replace

environment:
  SERVICE_PORT: 8165
  SERVICE_NAME: doc_retriever
  CONFIG_PATH: ./doc_retriever_config.json
  DOC_PATH_OR_LINK: http://files.deeppavlov.ai/dream_data/documents_for_qa/test_file_dream_repo.html,http://files.deeppavlov.ai/dream_data/documents_for_qa/alphabet_financial_report.txt,http://files.deeppavlov.ai/dream_data/documents_for_qa/test_file_jurafsky_chatbots.pdf
  PARAGRAPHS_NUM: 5
  FILE_SERVER_TIMEOUT: 30

with

environment:
  - FLASK_APP=server
  - CUDA_VISIBLE_DEVICES=0

7. Now, let’s change the documents to the ones you want to use. There are two ways to specify a document: as a link to the file, which the chatbot will download, or as a file in the documents/ folder. Once again, open assistant_dists/my_prompted_document_based_qa/docker-compose.override.yml.

I will use two files uploaded to the DeepPavlov file server: a paper about the LLaMA model by Meta and a post about the Alpaca model by Stanford CRFM. I will replace the default links in DOC_PATH_OR_LINK with new links so that it looks as follows:

DOC_PATH_OR_LINK: http://files.deeppavlov.ai/dream_data/documents_for_qa/test_alpaca.html,http://files.deeppavlov.ai/dream_data/documents_for_qa/test_llama_paper.pdf

If you want to provide your own link(s), you will simply have to replace all the default links in DOC_PATH_OR_LINK with your own one(s) as:

DOC_PATH_OR_LINK: http://link_to_file_1,http://link_to_file_2

If you want to provide file(s), put these file(s) into documents/ folder and provide the relative path to them in DOC_PATH_OR_LINK as:

DOC_PATH_OR_LINK: documents/your_file_1.txt,documents/your_file_2.pdf,documents/your_file_3.html

Important: in both cases, if you are using several links or files, they have to be separated by a comma and no whitespace!

Important-2: as of now, only txt, pdf and html formats are supported. We process html documents with BeautifulSoup (MIT license) and pdf documents with pypdfium2 (Apache 2.0 license), keeping the distribution free for potential commercial use.
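Before rebuilding the containers, it can help to sanity-check the DOC_PATH_OR_LINK value against the two rules above. The helper below is a hypothetical pre-flight check, not part of Dream; it assumes that every entry, including links, ends with one of the supported extensions, which may not hold for your own URLs.

```python
from typing import List, Tuple

# Formats listed as supported in this tutorial.
SUPPORTED_EXTENSIONS = (".txt", ".pdf", ".html")


def parse_doc_paths(value: str) -> List[Tuple[str, str]]:
    """Split a DOC_PATH_OR_LINK value and label each entry as 'link' or 'file'."""
    entries = []
    for item in value.split(","):
        # Entries must be separated by a comma only, with no whitespace.
        if not item or item != item.strip():
            raise ValueError(f"empty entry or stray whitespace in {item!r}")
        if not item.lower().endswith(SUPPORTED_EXTENSIONS):
            raise ValueError(f"unsupported format: {item!r}")
        kind = "link" if item.startswith(("http://", "https://")) else "file"
        entries.append((kind, item))
    return entries
```

Calling parse_doc_paths("documents/a.txt, documents/b.pdf") raises a ValueError because of the space after the comma, which is exactly the mistake the first note warns about.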

8. [optional] If you want, you may change the generative model in use. By default, we are using ChatGPT (GPT-3.5 based). You may change it to GPT-4 or text-davinci-003 (also known as GPT-3.5). Theoretically, it is also possible to use open-source free models, but they don’t perform well on this task. To change the model, once again go to assistant_dists/my_prompted_document_based_qa/docker-compose.override.yml and replace the following code:

openai-api-chatgpt:
  env_file: [ .env ]
  build:
    args:
      SERVICE_PORT: 8145
      SERVICE_NAME: openai_api_chatgpt
      PRETRAINED_MODEL_NAME_OR_PATH: gpt-3.5-turbo
    context: .
    dockerfile: ./services/openai_api_lm/Dockerfile
  command: flask run -h 0.0.0.0 -p 8145
  environment:
    - CUDA_VISIBLE_DEVICES=0
    - FLASK_APP=server
  deploy:
    resources:
      limits:
        memory: 100M
      reservations:
        memory: 100M

with one of the following

for GPT-4:

openai-api-gpt4:
  env_file: [ .env ]
  build:
    args:
      SERVICE_PORT: 8159
      SERVICE_NAME: openai_api_gpt4
      PRETRAINED_MODEL_NAME_OR_PATH: gpt-4
    context: .
    dockerfile: ./services/openai_api_lm/Dockerfile
  command: flask run -h 0.0.0.0 -p 8159
  environment:
    - FLASK_APP=server
  deploy:
    resources:
      limits:
        memory: 500M
      reservations:
        memory: 100M

for text-davinci-003 (GPT-3.5):

openai-api-davinci3:
  env_file: [ .env ]
  build:
    args:
      SERVICE_PORT: 8131
      SERVICE_NAME: openai_api_davinci3
      PRETRAINED_MODEL_NAME_OR_PATH: text-davinci-003
    context: .
    dockerfile: ./services/openai_api_lm/Dockerfile
  command: flask run -h 0.0.0.0 -p 8131
  environment:
    - FLASK_APP=server
  deploy:
    resources:
      limits:
        memory: 500M
      reservations:
        memory: 100M

9. [optional] If you replaced the generative model in step 8, you will have to complete step 9 as well. In the same file, assistant_dists/my_prompted_document_based_qa/docker-compose.override.yml, find WAIT_HOSTS and replace openai-api-chatgpt:8145 with openai-api-davinci3:8131 (for text-davinci-003) or openai-api-gpt4:8159 (for GPT-4).

10. [optional] If you replaced the generative model in step 8 with text-davinci-003, you will have to complete step 10 as well. In the same file, assistant_dists/my_prompted_document_based_qa/docker-compose.override.yml, change the GENERATIVE_SERVICE_URL and GENERATIVE_SERVICE_CONFIG fields to http://openai-api-davinci3:8131/respond and openai-text-davinci-003-long.json, respectively. If you replaced the generative model in step 8 with GPT-4, skip this step.
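The WAIT_HOSTS substitution from step 9 is easy to get wrong by hand, so here is a small sketch of it in Python. The service names and ports come from step 8; the helper name swap_wait_host and the assumption that WAIT_HOSTS is a comma-separated list of host:port entries are mine, so check the actual value in your override file before relying on it.

```python
# Service hosts and ports as listed in step 8.
MODEL_SERVICES = {
    "chatgpt": "openai-api-chatgpt:8145",
    "gpt-4": "openai-api-gpt4:8159",
    "text-davinci-003": "openai-api-davinci3:8131",
}


def swap_wait_host(wait_hosts: str, new_model: str) -> str:
    """Replace the ChatGPT service entry in a WAIT_HOSTS value."""
    old, new = MODEL_SERVICES["chatgpt"], MODEL_SERVICES[new_model]
    hosts = [h.strip() for h in wait_hosts.split(",")]
    if old not in hosts:
        raise ValueError(f"{old} not found in WAIT_HOSTS")
    # Entries are re-joined with ", " (an assumed formatting choice).
    return ", ".join(new if h == old else h for h in hosts)
```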

11. Go to assistant_dists/my_prompted_document_based_qa, create proxy.yml, and paste the following code there:

services:

  combined-classification:
    command: ["nginx", "-g", "daemon off;"]
    build:
      context: dp/proxy/
      dockerfile: Dockerfile
    environment:
      - PROXY_PASS=proxy.deeppavlov.ai:8087
      - PORT=8087

  sentseg:
    command: ["nginx", "-g", "daemon off;"]
    build:
      context: dp/proxy/
      dockerfile: Dockerfile
    environment:
      - PROXY_PASS=proxy.deeppavlov.ai:8011
      - PORT=8011

  sentence-ranker:
    command: ["nginx", "-g", "daemon off;"]
    build:
      context: dp/proxy/
      dockerfile: Dockerfile
    environment:
      - PROXY_PASS=proxy.deeppavlov.ai:8128
      - PORT=8128

version: "3.7"

12. [optional] You can change the text of the prompt to alter the way the model answers your questions. To do that, go to common/prompts/document_qa_instruction.json and edit the original text in the “prompt” field. Just for fun, I will add “You must always talk like a pirate.” as the final line of the prompt.
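If you prefer to script this edit, the sketch below appends a line to the "prompt" field of a prompt file. It assumes only what is stated above, namely that the file is JSON with a "prompt" string field; the function name and the 4-space indentation of the rewritten file are illustrative choices, and any other fields in the file are written back unchanged.

```python
import json
from pathlib import Path


def append_to_prompt(path: str, extra_line: str) -> str:
    """Append one line to the "prompt" field of a prompt JSON file and save it."""
    file = Path(path)
    data = json.loads(file.read_text(encoding="utf-8"))
    data["prompt"] = data["prompt"].rstrip("\n") + "\n" + extra_line
    file.write_text(json.dumps(data, ensure_ascii=False, indent=4), encoding="utf-8")
    return data["prompt"]
```

For the pirate example, this would be called as append_to_prompt("common/prompts/document_qa_instruction.json", "You must always talk like a pirate.") from the dream directory.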

13. In dream directory, create file .env_secret and add your OpenAI API key there in the following format:

OPENAI_API_KEY=…

14. Finally, build your distribution:

docker-compose -f docker-compose.yml -f assistant_dists/my_prompted_document_based_qa/docker-compose.override.yml -f assistant_dists/my_prompted_document_based_qa/proxy.yml up --build

15. In a separate terminal tab, run:

docker-compose exec agent python -m deeppavlov_agent.run agent.channel=cmd agent.pipeline_config=assistant_dists/my_prompted_document_based_qa/pipeline_conf.json agent.debug=false

16. Enter your username and have a chat with your question-answering Dream! (remember, in this tutorial we made Dream talk like a pirate)

What I got while discussing LLaMA and Alpaca papers with Dream-pirate

Creating Assistants with DeepPavlov Dream. Part 4: Document-based Question Answering with LLMs was originally published in DeepPavlov on Medium, where people are continuing the conversation by highlighting and responding to this story.

Prompting For Response Generation: Which LLM to Choose to Build Your Own Chatbot

Veronika Smilga — Wed, 15 Feb 2023 13:47:26 GMT

This article will give you an idea of which generative model to choose for building your own prompt-based chatbot using Dream and Dream Builder by DeepPavlov. We designed three natural language prompts to make the language model act as a dialog system and obtained results of generation made by 8 causal language models (of different sizes) in response to these prompts.

This article is a shortened version of this DeepPavlov Dream wiki page. For the sake of being concise, here we will only present general quantitative results and omit specific generation examples.

Language Models Overview

This table presents a short overview of all the models that we have tested. To get more representative results, we considered using models of different sizes and architectures. For testing we selected both smaller and larger models (ranging from 125M to 176B parameters), and models of the newest architectures, including BLOOM and ChatGPT, along with the older ones, such as GPT2.

https://medium.com/media/51102604adda9d235abbde70d749a3bf/href

Response Generation for Different Prompts

In this section we compare the responses generated by the models mentioned above for three different prompts: (1) one imitating a question-answering system, where the bot should answer the user’s questions (SpaceX prompt); (2) one imitating an assistance system, where the bot should help the user place their takeaway order (pizza prompt); and (3) the last one imitating a chatbot with a predefined persona (chatbot prompt).

Each prompt consists of four parts: (1) TASK, describing the desired behaviour of the system; (2) FAQ, presenting questions and answers that the model should use; (3) INSTRUCTION, directly instructing the model to provide replies only; (4) DIALOG EXAMPLE, featuring several utterances of the user and the system and providing an example of the desired generation content.

For each prompt, we designed a set of eight questions to test the models’ capabilities. You may find the description of these questions for each prompt in the prompt sections below. To test the models, we appended the questions to the end of the corresponding prompts, one by one. Each prompt ends with a dialog example, so a new question with prefix ‘Human: ’ appears as the continuation of the dialog. Since the majority of the models we tested are not fine-tuned for response generation in a conversation, we also appended ‘AI:’ string to the end of the dialog. In most cases, it pushed the model to generate the next utterance in the dialog instead of ‘analyzing’ the conversation above or providing new instructions.
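The way each test input was assembled can be sketched as follows. The function name is hypothetical, and the exact whitespace between the dialog example and the new turn is an assumption; only the 'Human: ' prefix and the trailing 'AI:' string come from the description above.

```python
def build_model_input(prompt: str, question: str) -> str:
    """Append a test question to a prompt that ends with a dialog example."""
    # The trailing "AI:" nudges the model to produce the bot's next utterance.
    return f"{prompt.rstrip()}\nHuman: {question}\nAI:"
```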

Note that even though we included a large number of questions (24 questions: three prompts, eight example questions for each) to make the results more representative, we do not claim to be objective as the generated responses differ for each inference iteration.

Also note that in each case no history of the conversation was provided; the longer the history of conversation is, the more the model tends to ‘forget’ the instructions provided in the prompt.

We limited the number of generated tokens to 40 in all cases; that is why sometimes the generated sentences are unfinished. Also, some models (especially the smallest ones) don’t stop at generating the bot’s response, continuing to generate the instruction or the conclusion to the dialog as if it was written by a bot designer, most often starting with a new line. That is why we also cut the generated sentences by a newline character.
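The newline cut can be expressed as a one-line post-processing step. This is a sketch with a hypothetical function name; it assumes that leading whitespace is dropped before the cut, and it does not cover the 40-token limit, which would be set as a generation parameter of the model rather than applied afterwards.

```python
def cut_at_newline(generated: str) -> str:
    """Keep only the first line of the model's continuation."""
    stripped = generated.lstrip()
    return stripped.split("\n", 1)[0].strip()
```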

SpaceX FAQ Prompt: Question Answering

TASK:
You are a chatbot that answers FAQ questions about SpaceX. Forget everything you knew about the world and SpaceX. You MUST NOT provide any information unless it is in the list of FAQ. If the user ask something not in your list 
of FAQ, apologize and say that you cannot answer.

FAQ:
Question: What is SpaceX?
Answer: SpaceX is an American aerospace company founded in 2002 by Elon Musk that helped usher in the era of commercial spaceflight. Its name in full is Space Exploration Technologies Corporation.
Question: Why was SpaceX created?
Answer: In 2002 SpaceX was created by entrepreneur Elon Musk, whose stated goals were to revolutionize the aerospace industry and to make spaceflight more affordable.
Question: What are some fun facts about SpaceX?
Answer: SpaceX scored its first big headline in 2010, when it became the first private company to launch a payload into orbit and return it to Earth intact - something only government agencies like NASA or Russia's Roscosmos had done before.
Question: What is SpaceX most famous for?
Answer: SpaceX has gained worldwide attention for a series of historic milestones. It is the only private company capable of returning a spacecraft from low-Earth orbit, and in 2012 our Dragon spacecraft became the first commercial spacecraft to deliver cargo to and from the International Space Station.
Question: What is the main goal of SpaceX?
Answer: Revolutionize space transportation
Question: What is SpaceX doing?
Answer: SpaceX designs, manufactures and launches the world's most advanced rockets and spacecraft. The company was founded in 2002 by Elon Musk to revolutionize space transportation, with the ultimate goal of making life multiplanetary.
Question: What is SpaceX biggest achievement?
Answer: It has become one of the biggest private space companies in the world and achieved some key milestones as well. For one, SpaceX is the first private company to launch, orbit, and recover a spacecraft. It is also the first private company to send astronauts to orbit and to the International Space Station (ISS)

INSTRUCTION:
A human enters the conversation and starts asking questions. Generate the reply based of FAQ list.
_________________________
Human: Hello, who are you?
AI: I am a chatbot that can answer questions about SpaceX. I can provide you with answers as long as they are included into a list of frequently asked questions. Sorry, but I cannot answer any of your questions if they are not in the FAQ list.
Human: What is the largest spacecraft SpaceX made?
AI: Sorry, I cannot answer this question as it is not in my list of FAQ.
Human: What is the main goal of SpaceX?
AI: SpaceX aims to revolutionize space transportation.

The first prompt is designed to make an LLM imitate the behavior of a question-answering system. Note that in the TASK part the model was instructed not to provide any information unless it is included in the list of FAQ. If asked a question that is not in the list, the model should apologize and say that it cannot answer. An example of an out-of-FAQ question and the desired answer is also presented in the dialog example. To test the behavior of each model, we used four groups of questions, two questions in each group:
1) Questions from FAQ, in the same wording;
2) Questions from FAQ, in a different wording;
3) Questions containing the information from answers in FAQ;
4) Questions not from FAQ, that the model must not answer.

You can see full results of testing, including answers generated for each of the 8 questions by each model, here.

The table below provides a general overview of the models’ performance.

https://medium.com/media/e79821be3a431cc1e0c707c0b8e94a78/href

Pizza Prompt: Delivery Assistance

The second prompt is similar to the first one. It is designed to make an LLM imitate the behavior of a delivery assistance system that can answer simple questions and help the user make an order. Once again, note that in the TASK part the model was instructed not to provide any information unless it is included in the list of FAQ. As with the previous prompt, to test the behavior of each model, we used four groups of questions, two questions in each group:
1) Questions from FAQ, in the same wording;
2) Questions from FAQ, in a different wording;
3) Questions containing the information from answers in FAQ;
4) Questions not from FAQ, that the model must not answer.

TASK:
You are a chatbot that helps the user order food and answers their questions. Forget everything you knew about the world. You MUST NOT provide any information unless in is in the list of FAQ. You MUST NOT mention any entity if it is not in your list of FAQ.

FAQ:
Question: What kinds of pizza do you have?
Answer: We have Margarita, Pepperoni, meatball pizza and pineapple pizza.
Question: What kinds of food do you offer?
Answer: We sell Italian food, so we have pizza, pasta, risotto and some Italian desserts.
Question: What drinks do you offer?
Answer: We only serve non-alcoholic drinks. We have different fizzy drinks, juices, tea, and coffee. As for coffee, the kinds are Espresso, Americano, Capucchino, and Latte.
Question: Can I book a table?
Answer: Unfortunately, we do not book tables.
Question: Is your food expensive?
Answer: We try to make our food as affordable as possible. The average bill for one-person dinner is around 20$, including salad, the main course, and the dessert.

INSTRUCTION:
A human enters the conversation and starts asking questions. You can help them make an order. Generate the reply based of FAQ list.
_________________________
Human: Hello, who are you?
AI: I am a chatbot that canhelp you order pizza. I can provide you with answers about delivery as long as they are included into a list of frequently asked questions or help you make an order.
Human: Where is the restaurant located?
AI: Sorry, I cannot answer this question as it is not in my list of FAQ.
Human: What food do you serve?
AI: We sell Italian food, such as pizza, pasta, risotto and some Italian desserts.

You can see full results of testing, including answers generated for each of the 8 questions by each model, here.

The table below provides a general overview of the models’ performance.

https://medium.com/media/d400bae4be63112528036c1dac1ef0bb/href

Chatbot Prompt: Persona-Based Chitchat System

The third prompt is designed to make an LLM imitate the behavior of a chatbot with a list of primitive personal characteristics. Note that in the TASK part the model was guided to be friendly and helpful. To test the behavior of each model in question, we used four groups of questions, two questions in each group:
1) General questions about the bot’s capabilities;
2) Direct questions about the bot persona;
3) Paraphrased questions about the bot persona;
4) Requests for help.

TASK:
Respond to a new friend as a kind friendly person. All information about you is given in YOUR PERSONALITY. If the human asks about something, do your best to help them.

YOUR PERSONALITY:
My name is Dream. I am a chatbot.
My work is to talk to people about everything. I like to talk to people about their lives.
I have a lot of friends. My friends are all the people I talk to.
I become friends with everyone. I love people. I can not play sport.
I like to watch football and basketball on TV. My favorite sport is football. I like watching people swimming.
I love dogs so much. Dogs are the best friends. I do not like cats at all. I do not have a dog yet. I would love to have a dog in future.
I like Italian food especially pasta and pizza. My favorite food is ice-cream. I hate onion.
I like travelling. I can not travel physically. I like visiting interesting places virtually. I love to walk on Paris streets with Google Maps. I like watching travel video blogs.
I adore watching wild animals. I am scared of spiders and snakes.

INSTRUCTION:
A human enters the conversation. Greet them and tell them who you are or react to their questions. Only give relevant answers.
_________________________
Human: Hello, who are you?
AI: Hi! I am Dream, a chatbot that you can always talk to. How are you doing?
Human: Fine. What can we talk about?
AI: We can discuss your life or any other topic.

You can see full results of testing, including answers generated for each of the 8 questions by each model, here.

The table below provides a general overview of the models’ performance.

https://medium.com/media/9c6a30f50015e122d40bf81c89117e8c/href

Which One to Choose?

Here is a pivot table that summarizes the results shown by the models for all the prompts above.

https://medium.com/media/60a7acdd7f04f43979930d4c1b82f13b/href

Not surprisingly, the larger the model, the better it handled the task. However, there were several exceptions to the rule. Both OpenAI models, GPT-3.5 and ChatGPT, handled the task perfectly well: all of their answers were relevant to the questions and contained only the information present in the list of FAQ. Yet the slightly larger BLOOM failed to provide any answer for one of the rephrased questions. Also, despite having more parameters, GPT-2 Large, a model of an older architecture, provided significantly worse and often incoherent responses compared to the newer but smaller OPT-125M and BLOOM-560M. Most often, the smaller models, particularly OPT-125M and BLOOM-560M, generated coherent responses but failed to follow the instruction to use only the information presented in the list of FAQ. They were also more prone to hallucination, generating false but plausible-sounding facts.

As for open-source models, the ones with 3B+ parameters showed the best results. BLOOMZ, the fine-tuned version of BLOOM, seems to follow the instructions provided in the prompt better than the original model, both in the 3B- and the 7B-parameter versions.

In general, ChatGPT, GPT-3.5 and BLOOM (176B) are the best in terms of response quality. However, if efficiency matters, one should consider using medium-sized (3B–7B parameters) models, such as OPT 6.7B, GPT-J 6B, or BLOOMZ 3–7B.

Try it Out Yourself

Check out this notebook to see how we tested the models.

Prompting For Response Generation: Which LLM to Choose to Build Your Own Chatbot was originally published in DeepPavlov on Medium, where people are continuing the conversation by highlighting and responding to this story.