Building a Smarter Documentation Chatbot: A Practical Guide Using Firecrawl and KDB.AI

Michael Ryaboy
KX Systems
Jul 25, 2024 · 6 min read

Robot searching documentation, wishing they had a faster way!

I’ll be honest, I hate reading documentation. Why should I read docs when a Large Language Model can read them for me? This guide explores how to build a documentation chatbot that does exactly that, using Firecrawl for web scraping and KDB.AI for vector storage and retrieval.

The Documentation Challenge

Before we dive into the technical details, let’s discuss why we’re building this chatbot. Many organizations face similar issues with documentation:

  1. Information overload: Extensive documentation can overwhelm users.
  2. Relevance: Traditional search often returns too many irrelevant results.
  3. Context: Users struggle to find information relevant to their specific situation.
  4. Currency: Outdated information misleads users.

A well-implemented chatbot can address these issues by providing targeted, up-to-date responses. Let’s explore how Firecrawl and KDB.AI make this possible.

Firecrawl: Simplifying Web Scraping

Web scraping is usually a complex task, and extracting content for Retrieval Augmented Generation (RAG) can be even more challenging. Firecrawl handles the crawling and scraping for you, extracting the relevant content in markdown format. This streamlines the entire process into just a few lines of code:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your-firecrawl-api-key')

crawl_result = app.crawl_url(
    'https://code.kx.com/kdbai',
    params={
        'crawlerOptions': {
            'limit': 10,
            'includes': ['kdbai/*']
        },
        'pageOptions': {
            'onlyMainContent': True
        }
    },
    wait_until_done=True
)

This code snippet demonstrates several of Firecrawl’s key features:

  • limit: Controls the extent of the crawl. Useful for testing or smaller document sets.
  • includes: Focuses the crawler on relevant sections.
  • onlyMainContent: Extracts primary content, reducing noise.

Firecrawl handles JavaScript rendering and complex site structures, which often pose challenges for traditional scrapers. This efficiency enables more frequent updates to your knowledge base, ensuring your chatbot always has the latest information.
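Before indexing anything, it's worth sanity-checking what the crawl returned. A minimal sketch, assuming `crawl_result` is a list of dicts with `'content'` (markdown) and `'metadata'` keys — the same shape this guide relies on later — with hypothetical sample data standing in for a real crawl:

```python
# Sketch: inspecting crawl output. The sample below stands in for a real
# crawl_result; the keys mirror the shape used later in this guide.
sample_crawl_result = [
    {
        "content": "# Getting Started\nInstall KDB.AI Server...",
        "metadata": {"title": "Getting Started",
                     "sourceURL": "https://code.kx.com/kdbai/gettingStarted.html"},
    },
    {
        "content": "",  # pages with no main content can appear; skip them
        "metadata": {"title": "Empty Page",
                     "sourceURL": "https://code.kx.com/kdbai/empty.html"},
    },
]

# Keep only pages that actually contain markdown content
pages = [item for item in sample_crawl_result if item["content"].strip()]

for item in pages:
    print(item["metadata"]["sourceURL"], "->", len(item["content"]), "chars")
```

Filtering out empty pages early keeps junk out of the vector store, which matters more than it sounds: an empty document still gets an embedding and can surface in search results.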

Firecrawl: A Closer Look at Its Capabilities

While we’ve covered some basics, Firecrawl provides a variety of advanced features that distinguish it in the field of web scraping:

Advanced Crawling Options

crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'crawlerOptions': {
            'limit': 100,
            'includes': ['/blog/*', '/docs/*'],
            'excludes': ['/admin/*', '/login/*'],
            'maxDepth': 3,
            'mode': 'fast'
        },
        'pageOptions': {
            'onlyMainContent': True,
            'includeHtml': True,
            'screenshot': True,
            'waitFor': 5000
        }
    }
)

Let’s break down these options:

Selective Crawling:

  • includes and excludes allow you to target specific sections of a website, ensuring you only collect relevant data.
  • maxDepth controls how deep the crawler goes, balancing between comprehensive coverage and efficiency.

Performance Modes:

  • The fast mode can crawl websites without a sitemap 4x faster, though it may be less accurate for heavily JavaScript-rendered sites.

Content Extraction:

  • onlyMainContent automatically removes clutter like headers, footers, and sidebars.
  • includeHtml option allows you to retain the HTML structure when needed.

Visual Capture:

  • screenshot capability can be crucial for visual documentation or debugging.

JavaScript Handling:

  • waitFor ensures that JavaScript-rendered content is properly loaded before scraping, addressing a common pain point in web scraping.

Structured Data Extraction

Firecrawl goes beyond simple text extraction by pulling out structured data using LLM-powered extraction:

result = app.scrape_url(
    "https://example.com",
    params={
        "extractorOptions": {
            "mode": "llm-extraction",
            "extractionPrompt": "Extract key product information.",
            "extractionSchema": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "price": {"type": "number"},
                    "in_stock": {"type": "boolean"}
                }
            }
        }
    }
)

This feature allows you to:

  • Extract specific, structured information from web pages.
  • Customize the extraction to suit your needs using natural language prompts.
  • Ensure consistency in the format of the extracted data.
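Because the extraction schema is known up front, you can verify each extracted record with a few lines of plain Python before trusting it downstream. This is an illustrative sketch — `matches_schema` and the sample record are not part of Firecrawl's API:

```python
# Sketch: a minimal type check for records against the product schema above.
# The 'extracted' dict is a hypothetical example of llm-extraction output.
schema_properties = {
    "product_name": str,
    "price": (int, float),
    "in_stock": bool,
}

def matches_schema(record: dict) -> bool:
    """Return True if every schema field is present with the expected type."""
    return all(
        key in record and isinstance(record[key], expected)
        for key, expected in schema_properties.items()
    )

extracted = {"product_name": "Widget", "price": 19.99, "in_stock": True}
print(matches_schema(extracted))                 # a well-formed record passes
print(matches_schema({"product_name": "Widget"}))  # missing fields fail
```

A check like this is cheap insurance: LLM-powered extraction is far more reliable with a schema, but validating the output before it enters your pipeline still catches the occasional malformed record.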

Handling Dynamic Content

Many modern websites rely heavily on JavaScript to render content, which traditional scrapers struggle with. Firecrawl addresses this issue effectively:

  • It can wait for specific elements to load before scraping.
  • The waitFor option allows you to specify a delay, ensuring all dynamic content is loaded.
  • Firecrawl can interact with the page (e.g., clicking buttons, scrolling) to reveal content before scraping.
  • Firecrawl also integrates with Browserbase (headless browser for AI agents), so it can scrape pages that require a login!

KDB.AI: Powerful Vector Storage and Retrieval

KDB.AI offers a powerful solution for content storage and retrieval. (Disclosure: I work as a Developer Advocate for KDB.AI.) To obtain your KDB.AI API keys, sign up at kdb.ai. Registration is free and your access won’t expire.

import kdbai_client as kdbai

session = kdbai.Session(endpoint="your-kdbai-endpoint", api_key="your-kdbai-api-key")

schema = {
    "columns": [
        {"name": "document_id", "pytype": "bytes"},
        {"name": "text", "pytype": "bytes"},
        {
            "name": "embedding",
            "vectorIndex": {
                "type": "flat",
                "metric": "L2",
                "dims": 1536
            }
        },
        {"name": "title", "pytype": "bytes"},
        {"name": "sourceURL", "pytype": "bytes"},
        {"name": "lastmod", "pytype": "datetime64[ns]"}
    ]
}

table = session.create_table("documentation", schema)

Our schema is designed with two key components to optimize search and retrieval:

  1. Embedding Column:
  • Contains vector representations of content
  • Utilizes a flat index for efficient, exhaustive vector search
  • Enables similarity-based queries to find conceptually related information
  2. Metadata Fields:
  • title: Descriptive name of the content
  • sourceURL: Origin of the information
  • lastmod: Last modification timestamp
These fields facilitate precise filtering and provide crucial context for responses.

This dual structure enables powerful, context-aware searches that combine semantic similarity with specific metadata constraints.
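The schema above maps one-to-one onto the rows you insert. If you want to populate the table directly rather than through LlamaIndex, `kdbai_client` tables accept a pandas DataFrame whose columns match the schema. This is a hedged sketch: `build_rows`, `fake_embed`, and the sample docs are all illustrative, and the commented-out `table.insert(df)` is the assumed insertion call.

```python
import datetime
import pandas as pd

def build_rows(docs, embed):
    """Assemble a DataFrame matching the documentation schema; embed() is a
    stand-in for a real embedding call returning a 1536-dim vector."""
    return pd.DataFrame({
        "document_id": [d["id"].encode() for d in docs],
        "text": [d["text"].encode() for d in docs],
        "embedding": [embed(d["text"]) for d in docs],
        "title": [d["title"].encode() for d in docs],
        "sourceURL": [d["url"].encode() for d in docs],
        "lastmod": [d["lastmod"] for d in docs],
    })

fake_embed = lambda text: [0.0] * 1536  # placeholder; swap in a real model
docs = [{
    "id": "doc-1",
    "text": "KDB.AI Server hardware requirements...",
    "title": "Setup",
    "url": "https://code.kx.com/kdbai/setup.html",
    "lastmod": datetime.datetime(2024, 7, 1),
}]
df = build_rows(docs, fake_embed)
# table.insert(df)  # push the rows into the KDB.AI table created above
print(list(df.columns))
```

Note the `bytes` columns in the schema: string fields are encoded before insertion, which is why the sketch calls `.encode()` on each one.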

The Power of Metadata Filters

Metadata filters are where KDB.AI really shines. They allow the chatbot to understand the context of user queries and provide more relevant answers. Here’s how:

import datetime

# Query with URL filtering
filtered_query_engine = index.as_query_engine(
    filters=[("in", "sourceURL", "https://code.kx.com/kdbai/latest/gettingStarted/kdb-ai-server-setup.html")]
)
print(filtered_query_engine.query("What are the hardware requirements for KDB.AI Server?"))

# Query with date filtering
three_months_ago = datetime.datetime.now() - datetime.timedelta(days=90)
recent_docs_engine = index.as_query_engine(
    filters=[("lastmod", ">", three_months_ago)]
)
print(recent_docs_engine.query("What are the latest features added to KDB.AI?"))

You can also combine multiple metadata fields to create highly specific queries.
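Since each filter is just a tuple, combining metadata fields is a matter of passing several tuples in the `filters` list (assuming, as the examples above suggest, that multiple filters are ANDed together; the URL here is illustrative):

```python
import datetime

# Sketch: a compound filter — restrict by source URL AND recency.
three_months_ago = datetime.datetime.now() - datetime.timedelta(days=90)

url_filter = ("in", "sourceURL",
              "https://code.kx.com/kdbai/latest/gettingStarted/kdb-ai-server-setup.html")
date_filter = ("lastmod", ">", three_months_ago)

combined = [url_filter, date_filter]
# engine = index.as_query_engine(filters=combined)  # assumed usage
print(len(combined))
```

This kind of compound filter is what lets the chatbot answer questions like "what changed recently on the setup page?" instead of searching the whole corpus.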

Putting It All Together

By integrating Firecrawl’s efficient web scraping with KDB.AI’s advanced querying capabilities, we’ve developed a chatbot that can:

  1. Keep your documentation knowledge base up to date.
  2. Understand the context of user queries.
  3. Provide relevant and timely responses.

Here’s a basic implementation that integrates everything using LlamaIndex, a library for building LLM applications:

Chatbot Architecture Flowchart

We can implement this in only a few lines of code:

# !pip install firecrawl-py llama_index kdbai_client llama-index-vector-stores-kdbai
from firecrawl import FirecrawlApp
import kdbai_client as kdbai
from llama_index.core import VectorStoreIndex, Document, StorageContext
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding

# Firecrawl setup and crawling
app = FirecrawlApp(api_key='your-firecrawl-api-key')
crawl_result = app.crawl_url(
    'https://code.kx.com/kdbai',
    params={
        'crawlerOptions': {
            'limit': 10,
            'includes': ['kdbai/*']
        },
        'pageOptions': {
            'onlyMainContent': True
        }
    },
    wait_until_done=True
)

# KDB.AI setup
session = kdbai.Session(endpoint="your-kdbai-endpoint", api_key="your-kdbai-api-key")

# Our schema includes extra metadata fields in case we want to filter by them
schema = {
    "columns": [
        {"name": "document_id", "pytype": "bytes"},
        {"name": "text", "pytype": "bytes"},
        {
            "name": "embedding",
            "vectorIndex": {
                "type": "flat",
                "metric": "L2",
                "dims": 1536
            }
        },
        {"name": "title", "pytype": "bytes"},
        {"name": "sourceURL", "pytype": "bytes"},
        {"name": "lastmod", "pytype": "datetime64[ns]"}
    ]
}
table = session.create_table("documentation", schema)

# Process and index documents
documents = [Document(text=item['content'], metadata=item['metadata']) for item in crawl_result]
vector_store = KDBAIVectorStore(table)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding()
)

# Create query engine
query_engine = index.as_query_engine()

# Example query
response = query_engine.query("What are the system requirements for KDB.AI?")
print(response)

Looking Forward

While this implementation provides a solid foundation, there is always room for improvement:

  1. User Interface: Develop a user-friendly front-end for easier interaction.
  2. Continuous Crawling: Recrawl and index our data whenever our documents change.
  3. Performance Optimization: Enhance retrieval performance on technical documentation by introducing Hybrid Search, which combines keyword and semantic search. For an example, see KDB.AI’s Hybrid Search documentation.

Final Thoughts

Firecrawl’s advanced web scraping capabilities ensure that your chatbot always has access to the latest and most relevant information. Its ability to handle dynamic content, extract structured data, and scale efficiently makes it a powerful tool for keeping your knowledge base up-to-date.

KDB.AI’s vector storage and advanced querying capabilities, especially its metadata filtering, enable your chatbot to deliver context-aware, relevant responses. This blend of semantic search and precise filtering ensures users find exactly what they need, exactly when they need it.

By using these tools together, you can create a powerful system to navigate and understand your documentation more effectively, potentially eliminating the need to read docs again!

Connect with me on LinkedIn for more AI Engineering tips.
