Newspaper Chatbot (BellingChat?)

Jack Harding
Mar 1, 2024 · 7 min read
Photo by Filip Mishevski on Unsplash

This is a chatbot tailored for the news organization Bellingcat. I use ChromaDB to ground a custom Large Language Model (LLM) application with Retrieval Augmented Generation (RAG), and the result is available as an online Streamlit dashboard.

Information Retrieval

The process begins with the retrieval of relevant information from a pre-existing database based on a given input or query. In this case, I web-scraped all of Bellingcat’s articles and listed all the URLs in a CSV file.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from functools import reduce

BASE_URL = "https://www.bellingcat.com"
BELLINGCAT_START_YEAR = 2014  # earliest article on site
HTML_CLASS = "news_item__image"

def list_months_article(year: int, month: int):
    url = f"{BASE_URL}/news/{year}/{month:02d}/"  # zero-pad the month so October is "10", not "010"
    res = requests.get(url)
    articles = BeautifulSoup(res.content, "html.parser")
    news_item_tags = articles.find_all("div", {"class": HTML_CLASS})

    create_object = lambda tag: {
        "year": year,
        "month": month,
        "url": tag.findChild("a")["href"],  # link to the article
    }

    return [create_object(t) for t in news_item_tags]

def flatten_list(x, y):
    return x + y

def list_years_articles(year: int):
    nested_links = [list_months_article(year, i) for i in range(1, 13)]
    return reduce(flatten_list, nested_links)

def list_all_articles():
    nested_links = [list_years_articles(y) for y in range(BELLINGCAT_START_YEAR, 2024)]
    return reduce(flatten_list, nested_links)

all_articles = list_all_articles()
df = pd.DataFrame(all_articles)

After listing all the articles, I used newspaper3k to download and parse all the HTML content into plain text.

from newspaper import Article

def get_article_text(url: str):
    article = Article(url)
    article.download()
    article.parse()

    return {
        "text": (
            article.text.split(article.title)[1]
            if article.title in article.text
            else article.text
        ),  # strip the title from the body text if it is repeated there
        "publish_date": article.publish_date,
        "title": article.title,
    }

# download and parse every article's text
articles_text = df.url.map(get_article_text)

# unpack the dictionaries into new DataFrame columns
extract_dict_key = lambda s, key: s.apply(lambda x: x[key])

df["articles_text"] = extract_dict_key(articles_text, "text")
df["publish_date"] = extract_dict_key(articles_text, "publish_date")
df["title"] = extract_dict_key(articles_text, "title")

# save to CSV
df.to_csv("all-bellingcat-articles.csv", index=False)

The full Jupyter notebook is available below and the data output is also available here.

Article Scraping Notebook

Vector Databases

In the context of machine learning, a vector is an array of numerical values representing some features or characteristics of an object. Vectors can represent various types of data, such as images, text, or numerical features. Vector databases excel at efficiently querying and retrieving these vectors, enabling faster and more accurate information retrieval. An SQL database uses a primary key to search for an exact match; a NoSQL database might use key-value search; a vector database instead returns the entries closest in distance to the query. Pinecone offers a great comparison to the other paradigms.

Pinecone docs
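As a toy illustration of that closest-in-distance lookup, the sketch below ranks made-up three-dimensional vectors by cosine similarity (real embeddings have hundreds of dimensions, and real vector databases use approximate indexes rather than a brute-force scan):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" keyed by article title.
database = {
    "MH17 investigation": np.array([0.9, 0.1, 0.0]),
    "Syria chemical attacks": np.array([0.1, 0.8, 0.2]),
}
query = np.array([0.85, 0.15, 0.05])

# Return the stored entry nearest to the query, not an exact key match.
best = max(database, key=lambda title: cosine_similarity(database[title], query))
print(best)  # MH17 investigation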

ChromaDB

ChromaDB is an open-source embedding database. This application uses it to search for relevant article titles.

Embeddings

Embeddings are a specific type of vector representation derived from a more extensive dataset. They are used to capture relationships and similarities between items in a dataset. Below is an example of how to implement it in Python.

from chromadb.utils.embedding_functions import DefaultEmbeddingFunction

embedding_function = DefaultEmbeddingFunction()
embedding_function(["This is a string"])
# [[-0.012367671355605125, 0.0822967141866684,...
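ChromaDB's default embedding function wraps the all-MiniLM-L6-v2 sentence-transformer model, so each string becomes a 384-dimensional vector. A quick check:

len(embedding_function(["This is a string"])[0])
# 384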

Collections

Collections are the database part of ChromaDB: they store IDs, metadata and embeddings. Below, a collection is filled with article data in batches of 250.
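The collection that the function below fills has to exist first; a minimal sketch of creating it (the collection name is my choice, not from the article):

import chromadb

client = chromadb.Client()  # in-memory ChromaDB client
collection = client.get_or_create_collection("bellingcat-articles")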

def fill_chroma_collection(articles: list[dict], batch_size: int = 250):
    for i in range(0, len(articles), batch_size):
        batch = articles[i : i + batch_size]

        batch_titles = [story["title"] for story in batch]

        # Upsert the ids, metadata, title strings and their embeddings into ChromaDB.
        collection.upsert(
            ids=[str(story["id"]) for story in batch],
            metadatas=[dict(time=story["publish_date"]) for story in batch],
            documents=batch_titles,
            embeddings=embedding_function(batch_titles),
        )
    return collection

def query_collection(collection, query: str, n_results: int = 10):
    results = collection.query(query_texts=[query], n_results=n_results)

    if results:
        return "\n".join(results["documents"][0])

The collection can be queried like any other database. The number of results was limited to 10 for performance reasons, although such a cap probably isn't advisable in a live system. Those top 10 results are what gets fed to the LLM.
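For instance, retrieving context for the query used later in this article looks something like this (a usage sketch):

# fetch the ten stored titles closest to the query
print(query_collection(collection, "buk launcher"))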

LLM Choice

Large Language Models (LLMs) like GPT-3.5 have demonstrated promising capabilities in natural language understanding and generation. However, enhancing their performance in specific tasks requires more than just large-scale training. Integrating ChromaDB with LLMs enables the creation of custom applications that leverage the strengths of both technologies.

Llama2

Llama2 is Meta’s answer to OpenAI’s ChatGPT and Google’s Bard. One of its benefits is that it’s open-source and available on GitHub; it is also trained on more recent data than ChatGPT, which might make it more suitable for a news application.

Mistral-7b

Mistral-7b is Mistral AI’s own 7-billion-parameter model (it shares Llama2’s general architecture but was trained from scratch, not retrained from it). It outperforms Meta’s 7-billion and 13-billion parameter models on all benchmarks, as well as the 34-billion parameter Llama on most. Mistral-7b will be our LLM in this application.

Performance metrics from Mistral.ai

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation is a paradigm that combines the strengths of information retrieval and language generation. In the context of LLMs, RAG involves retrieving relevant information from a database using vectors and custom data to enhance the generation of responses. This approach allows for more nuanced and context-aware responses.
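In this app, that boils down to three steps. A minimal sketch of the loop (query_llm is a hypothetical stand-in for the Replicate call shown later):

def rag_answer(question: str) -> str:
    # 1. Retrieve: pull the stored article titles closest to the question.
    context = query_collection(collection, question)
    # 2. Augment: splice the retrieved context into the prompt.
    prompt = f"Use these articles as context:\n{context}\n\nQuestion: {question}"
    # 3. Generate: have the LLM answer with that extra grounding.
    return query_llm(prompt)  # hypothetical LLM call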

Creating a Chatbot for Bellingcat

This chatbot was developed in Streamlit, as it already provides chat elements suited to a chatbot app, and it is easy to use and fast. The code below starts a chat window, uses the prompt to search all article titles, and returns the top 10.

prompt = st.chat_input("Enter prompt...")
if prompt:
    # query_collection here is a method on the app class wrapping the function above
    relevant_articles = self.query_collection(prompt)
    st.write(relevant_articles)

Titles were used instead of full article content for performance reasons; the same reasoning applies to the number of results returned from the vector database. More testing will be done to judge how much switching to full article content, or returning more results, affects the quality of the response.

The prompt chosen was “buk launcher”. The Buk launcher investigation, conducted by Bellingcat and other OSINT researchers, helped unravel the mystery behind the downing of Malaysia Airlines Flight 17. Investigators used a combination of satellite data, social media and traffic videos to find the culprits. The full article is definitely worth a read.

Relevant article titles

Assuming there are some entries in the Bellingcat article database, the prompt will be formatted in the Mistral format below and sent to the LLM.
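The snippets below reference a MISTRAL_URL constant and a replicate_client attribute that are never defined in the article; a plausible setup is sketched here (the model slug, class name and environment variable are my assumptions):

import os
import replicate

MISTRAL_URL = "mistralai/mistral-7b-instruct-v0.2"  # assumed Replicate model slug

class BellingChatApp:  # hypothetical name for the app class
    def __init__(self):
        # Replicate client authenticated via an API token from the environment.
        self.replicate_client = replicate.Client(api_token=os.environ["REPLICATE_API_TOKEN"])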

def generate_prompt(user_prompt: str, relevant_articles: str) -> str:
    # NOTE: The [INST] and [/INST] tags are required for mistral-7b-instruct
    # to leverage its instruction fine-tuning.
    return f"""[INST]
You are an expert in all things Bellingcat. Your goal is to give me a summary of the top results. You will be given a USER_PROMPT, and a series of RELEVANT_ARTICLES.

USER_PROMPT: {user_prompt}

RELEVANT_ARTICLES: {relevant_articles}

SUGGESTIONS:

[/INST]
"""

def query_replicate(
    self,
    user_prompt: str,
    relevant_articles: str,
    temperature: float = 0.75,
    max_new_tokens: int = 2048,
) -> str:
    # Prompt the mistral-7b-instruct LLM hosted on Replicate.
    mistral_response = self.replicate_client.run(
        MISTRAL_URL,
        input={
            "prompt": generate_prompt(user_prompt, relevant_articles),
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    )
    # Replicate streams tokens back, so concatenate them into a single string.
    return "".join([str(s) for s in mistral_response])

if relevant_articles:
    llm_response = self.query_replicate(prompt, relevant_articles)
    st.write(llm_response)

The response from the LLM is below. It breaks the answer into a numbered list, where each item is one of the results returned from the semantic search of the ChromaDB collection.

Prompt: “buk launcher”

I tried it again and the response came back in a different format. Customising the prompt to give the LLM more detail about what I want would make the results more reproducible.

Prompt: “buk launcher” (second time)

This chatbot can answer queries related to news articles and investigations, and provide in-depth contextual information. By leveraging the speed of ChromaDB, the chatbot can quickly retrieve and analyse relevant data, ensuring that responses are not only accurate but also up-to-date.

Testing

I’m new to LLMs, so I plan to cover this in another article.

Limitations & Cloud Implementation

Below is an email I received from Streamlit saying the app is using more than its allocated 2.7GB of memory. This could be due to a memory leak, but is more likely down to the heavy resource load an app like this needs.

Email from Streamlit

Every time the app starts, it downloads the Bellingcat dataset from GitHub, converts it to JSON, and creates and fills a Chroma database. All of this happens before the LLM is even queried, which itself uses a lot of processing power. I made a quick sketch of how this app might work in a cloud implementation.

AWS Cloud Implementation

The above implementation could also be avoided entirely by uploading the model to Replicate, which has a built-in API. Another approach might be to load the data into a persistent ChromaDB at Streamlit build time; the database could then be queried at runtime, as sketched below.
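A minimal sketch of that second approach, assuming the filled Chroma directory ships with the app (the path and collection name are my choices) and using Streamlit's resource cache so the database is opened only once per container:

import chromadb
import streamlit as st

@st.cache_resource  # opened once, then shared across user sessions
def load_collection():
    client = chromadb.PersistentClient(path="chroma-db")  # hypothetical on-disk path
    return client.get_or_create_collection("bellingcat-articles")

collection = load_collection()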

Conclusion

The creation of a custom chatbot for Bellingcat demonstrates the practical application of these technologies in real-world scenarios, promising a future where intelligent systems easily comprehend large datasets, providing open-source researchers and journalists with insights and assistance.
