Practical ETL Pipeline for Chatbots: From Confluence to ChromaDB using GitLab

Jaouher Kharrat
EQS Engineering Blog
12 min read · Dec 24, 2023

--

Confluence has long stood as a pillar for organizations to store and share their knowledge. Yet, simply having a rich knowledge base is only the start. With the rising popularity of chatbots, there’s a pressing need to make this knowledge easily accessible through conversational interfaces. The first step in this evolution? Migrating that knowledge into a structured format, such as a vector database, which paves the way for intelligent, interactive information exchange within the organization.

GitLab pipelines

This article isn’t about mere data migration; it’s about refining and supercharging access to knowledge, enhancing both search and day-to-day operations. We will explore ETL pipeline creation, using GitLab for seamless Confluence-to-vector-database integration, in a guide that aims to light the path for data engineers and knowledge managers. Ready to bridge the gap from static to dynamic knowledge use?

A. Before we dive in:

a. What is a Vector Database?

Vector databases, part of the NoSQL family, manage data as vectors: mathematical constructs representing data points in multi-dimensional spaces. These databases excel at handling complex data, like machine learning embeddings, and typically implement one or more Approximate Nearest Neighbor (ANN) algorithms.
Their hallmark? Swift similarity searches. Submit a query vector, and they’ll pinpoint its closest matches in the dataset. Search for “kitten”, for example, and you will find “cat” as its nearest neighbor, followed by “tiger”.

A very simple visualization of similar words in a vector space
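For intuition, here is a toy sketch of that ranking with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and come from a model, so the numbers below are purely illustrative):

const embeddings = {
  cat: [0.9, 0.8, 0.1],
  tiger: [0.8, 0.6, 0.2],
  car: [0.1, 0.2, 0.9],
};
const kitten = [0.88, 0.82, 0.12]; // hypothetical query vector for "kitten"

// Cosine similarity: how closely two vectors point in the same direction.
const cosine = (a, b) => {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
};

// Rank every stored word by similarity to the query vector.
const ranked = Object.entries(embeddings)
  .map(([word, vector]) => ({ word, score: cosine(kitten, vector) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.map((r) => r.word)); // [ 'cat', 'tiger', 'car' ]

A vector database does essentially this, but over millions of high-dimensional vectors and with ANN indexes instead of a brute-force loop.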

b. What is an ETL pipeline?

  • Extract: Pull data from varied sources like databases or spreadsheets.
  • Transform: Process the data, enhancing its consistency and reliability through cleaning, enriching, and restructuring steps.
  • Load: Deposit the curated data into a destination (like Chroma DB) for optimized access and use.

B. Prerequisites

Before exploring the ETL pipeline development for integrating a Confluence Knowledge Base with a vector database, let’s identify what we need first:

  • Confluence API Access: To extract data from a Confluence Knowledge Base, secure API access is a must. For security reasons, we require a user who can access only internally public data, safeguarding against the disclosure of confidential information.
  • Vector Database: We will use a hosted ChromaDB instance, accessible from the GitLab pipeline. To ensure secure access, it’s crucial to keep it behind a gateway and use API keys.
  • GitLab: We will require an adequately privileged account and the relevant CI/CD configuration. This entails setting up runners, storing environment variables like API keys, and constructing a “.gitlab-ci.yml” file that defines the pipeline securely and efficiently.

1. Extracting Knowledge from Confluence

Obtaining information from Confluence is intricate, particularly when handling diverse data forms such as text, video, attachments, and so on. In this article, however, we’ll focus solely on the textual content of the HTML output.

a. Connecting to Confluence API

The extraction process commences with the fetchConfluenceData function. It’s designed to make authenticated HTTP GET requests to the Confluence API using axios. Authentication is handled using a Basic Auth token:

const fetchConfluenceData = async (url, authToken) => {
  // … other code …
  const response = await axios.get(url, {
    headers: {
      Authorization: `Basic ${authToken}`,
      'Content-Type': 'application/json',
      Accept: 'application/json',
    },
  });
  // … other code …
};

This token is generated using the Confluence username and password, then encoded in Base64 and passed in the request header.
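For reference, a minimal sketch of building that token, assuming the credentials are exposed as the INSIGHT_USERNAME and INSIGHT_PASSWORD environment variables used later in the GitLab pipeline:

// Build a Basic Auth token from credentials injected via CI/CD variables.
const authToken = Buffer.from(
  `${process.env.INSIGHT_USERNAME}:${process.env.INSIGHT_PASSWORD}`
).toString('base64');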

b. Fetching Active Spaces

Identifying active spaces is a pivotal step, conducted by fetchAllActiveSpaces. It interrogates the /rest/api/space endpoint of the API:

const fetchAllActiveSpaces = async (baseUrl, authToken) => {
  const url = `${baseUrl}/rest/api/space?type=global&limit=500`;
  // … other code …
  const response = await fetchConfluenceData(url, authToken);
  // … other code …
};

PS: If specific spaces are preconfigured in config.insight_space, the function sidesteps this API call; otherwise, it retrieves all spaces, respecting the pagination limit set by Confluence.

c. Retrieving Pages from Spaces

With the space keys in hand, each space’s content is fetched through fetchAllPagesInSpace, which continuously requests pages until all are retrieved:

const fetchAllPagesInSpace = async (baseUrl, spaceKey, authToken) => {
  // … other code …
  while (hasMoreData) {
    const url = `${baseUrl}/rest/api/content?spaceKey=${spaceKey}&limit=${limit}&start=${start}`;
    const data = await fetchConfluenceData(url, authToken);
    // … other code …
  }
  // … other code …
};

This method is robust against API limitations by incrementing the start parameter to loop through paginated results.
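To put the pieces together, here is a hedged sketch of how these helpers could be combined to produce the data_raw.json artifact consumed by the later pipeline stages. The extractConfluence wrapper is an illustrative assumption, as is the assumption that fetchAllActiveSpaces resolves to an array of space objects exposing a key property:

import { writeFile } from 'fs/promises';

// Illustrative wrapper: fetch every active space, then every page in each
// space, and persist the raw result for the Transform stage.
const extractConfluence = async (baseUrl, authToken) => {
  const spaces = await fetchAllActiveSpaces(baseUrl, authToken);

  const allPages = [];
  for (const space of spaces) {
    const pages = await fetchAllPagesInSpace(baseUrl, space.key, authToken);
    allPages.push(...pages);
  }

  await writeFile('data_raw.json', JSON.stringify(allPages, null, 2));
  return allPages;
};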

d. Challenges and their Mitigation

1. Rate Limiting:

  • Challenge: Confluence API might have rate limits, which can hinder the extraction process if you’re making frequent requests.
  • Mitigation: Implementing rate limit handling in your Node.js scripts, using retries with exponential backoff, or pre-fetching data at non-peak times can help (a retry sketch follows this list).

2. Incomplete or Inconsistent Data:

  • Challenge: Not all data in Confluence might be structured or complete, leading to potential gaps in the extracted data.
  • Mitigation: Data validation checks during extraction can identify and flag incomplete data, allowing for manual intervention or automated fill mechanisms.

3. Large Attachments:

  • Challenge: Some attachments in Confluence might be heavy, causing delays or potential failures during extraction.
  • Mitigation: Breaking down large files into chunks, optimizing the extraction process to deal with large files separately, or temporarily storing them in ChromaDB before processing can help.

4. Data Security and Privacy Concerns:

  • Challenge: Extracting sensitive or private data might raise security and compliance issues.
  • Mitigation: Ensure that all data extraction adheres to GDPR, CCPA, or other regional data privacy laws. Additionally, anonymize or mask sensitive data during the extraction phase if needed.
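As referenced under rate limiting, here is a minimal retry sketch with exponential backoff, wrapping the fetchConfluenceData helper. The fetchWithRetry name and the 429/5xx handling are assumptions for illustration:

const fetchWithRetry = async (url, authToken, maxRetries = 5) => {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fetchConfluenceData(url, authToken);
    } catch (error) {
      const status = error.response?.status;
      const retryable = status === 429 || status >= 500;
      if (!retryable || attempt === maxRetries - 1) throw error;

      // Back off 1s, 2s, 4s, 8s, ... before the next attempt.
      const delayMs = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
};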

2. Transforming Data for Vector Database

Once the data is extracted from Confluence, the transform phase manipulates this raw data into a more analytically valuable form. This process involves sanitizing text, enriching data, and preparing it for loading into ChromaDB.

Text Sanitization and Normalization

The raw HTML content from Confluence pages is converted into plain text to simplify analysis. This conversion might look something like the following:

const transformPageContent = (htmlContent) => {
  const textContent = htmlToText(htmlContent, {
    wordwrap: null,
    ignoreImage: true,
    uppercaseHeadings: false,
  });
  return textContent.trim();
};

Here, htmlToText is a function that converts HTML to plain text, stripping out any formatting and images to leave behind the raw, unstyled content.

Data Structuring

During the transformation phase, it’s essential to structure the data into a more coherent form that reflects the requirements of the target data store or the analytical tools to be used later. This might involve creating a uniform schema or data model:

const structureData = (pageData) => {
  return {
    title: pageData.title,
    content: transformPageContent(pageData.body.storage.value),
    metadata: {
      author: pageData.version.by.displayName,
      createdDate: pageData.version.when,
      url: `ConfluenceURL/pages/viewpage.action?pageId=${pageData.id}`,
      // other relevant metadata
    },
  };
};

This structureData function demonstrates the process of taking the unstructured page data and organizing it into a more defined format.

If your data has inherent relationships (like hierarchies in Confluence pages or related tags), understand how these relationships can be represented within the vector database.
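For example, a hedged sketch of capturing page hierarchy as metadata, assuming the pages were requested with expand=ancestors so that pageData.ancestors is available (the addHierarchyMetadata helper and the field names are illustrative):

const addHierarchyMetadata = (structuredData, pageData) => {
  // The last entry in ancestors is the direct parent page in Confluence.
  if (pageData.ancestors?.length) {
    const ancestors = pageData.ancestors;
    structuredData.metadata.parentId = ancestors[ancestors.length - 1].id;
    structuredData.metadata.breadcrumb = ancestors.map((a) => a.title).join(' / ');
  }
  return structuredData;
};

Storing the parent ID and breadcrumb alongside each chunk lets you later filter or group results by where they sit in the space hierarchy.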

Content Enrichment

The transformation phase may also include enriching the data with additional context or metadata, which could involve appending related tags, categories, or any other relevant data that can provide more depth to the analysis.

const enrichData = (structuredData) => {
  // Example of adding a tag based on content analysis or other criteria
  // you can also leverage LLMs to make such categorization
  if (structuredData.content.includes("urgent")) {
    structuredData.metadata.priority = "High";
  }
  return structuredData;
};

The enrichData function showcases a simple conditional enrichment, where a priority level is assigned to the data based on the content.

Data Filtering

Sometimes, the transformation phase includes filtering the data to remove unnecessary or irrelevant information:

const filterData = (structuredData) => {
  return structuredData.filter((page) => page.content.length > 100);
};

In this filterData function, only pages whose content exceeds 100 characters are kept; shorter pages are filtered out as likely less valuable.

Chunking data

Now we split the documents into chunks of at most 512 characters, with an overlap of 20 characters between consecutive chunks.

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 20,
});
const splitDocs = await textSplitter.splitDocuments(rawData);

Chunking makes the original article IDs ambiguous, since several chunks now share the same source page. The simplest fix is to append an auto-incrementing integer suffix to the original ID:

let previousBaseId = '';
let suffix = 0;

splitDocs.forEach((doc) => {
  const baseId = doc.metadata.id;

  // If the baseId is different from the previous one, reset the suffix
  if (baseId !== previousBaseId) {
    suffix = 0;
    previousBaseId = baseId;
  }

  // Generate newId by appending the suffix to the baseId
  const newId = `${baseId}-${suffix}`;

  // Adding the calculated glId to the document's metadata
  doc.metadata.glId = newId;

  // Increment suffix for next document
  suffix++;
});
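Finally, a small sketch of persisting the result for the Load stage (data_transformed.json is the artifact name used by the GitLab pipeline below; the exact write step is an assumption):

import { writeFile } from 'fs/promises';

// Persist the chunked, ID-stamped documents for the Load stage.
await writeFile('data_transformed.json', JSON.stringify(splitDocs, null, 2));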

Possible Challenges and Measures

1. Identifying and Addressing Null Values:

  • Missing or null values can hinder the efficiency of a vector database. Identify these gaps and either fill them with default values, mean/median imputation or use advanced techniques like predictive filling.

2. Removing Duplicates:

  • Confluence data might have redundancies, especially with repeated information across different pages or sections. Identify and eliminate these duplicates to ensure uniqueness.

3. Standardizing Formats / Data Normalization:

  • Ensure that date, currency, text formats, etc., are consistent. This is especially important when dealing with global data that might have regional variations.

4. Handling Non-textual Data:

  • Images, videos, and other rich media extracted from Confluence can be represented in vector form using feature extraction techniques. Tools like convolutional neural networks (CNN) for images can be instrumental in this regard.

5. Categorization and Tagging:

  • Assign categories or tags as metadata during transformation (as in the enrichment step above), so that documents can later be filtered or grouped by these attributes when querying the vector database.

3. Loading Data into Vector Database

After the data has been extracted from Confluence and transformed into a structured and enriched format, the final step in the ETL pipeline is to load the data into ChromaDB.

Preparation

The code begins by importing the necessary modules and reading the data_transformed.json file containing the transformed data ready to be loaded.

const splitDocs = JSON.parse(await readFile('data_transformed.json', 'utf-8'));

Embedding and Database Initialization

ChromaDB leverages machine learning models to embed the documents for similarity searches. The code initializes a Hugging Face model to perform the embedding:

const model = new TransformersEmbeddingFunction({
  model: "Xenova/bge-small-en-v1.5",
});

Then, it connects to ChromaDB and either retrieves or creates a new collection to store the documents:

const chroma = new ChromaClient({path: config.chroma_server_host});
const collection = await chroma.getOrCreateCollection({name: "confluence", embeddingFunction: model});

Loading Documents

Documents are added to the Chroma database in batches. The code iterates over each document and adds it to the specified collection:

await addDocumentsToChromaDB(collection, splitDocs);

Each document consists of its content and associated metadata, including a globally unique identifier (glId).
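The article does not reproduce addDocumentsToChromaDB in full; here is a hedged sketch of what such a batched helper could look like, assuming each entry in splitDocs carries pageContent plus the metadata (including glId) prepared during transformation:

const addDocumentsToChromaDB = async (collection, docs, batchSize = 100) => {
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);

    // collection.add accepts parallel arrays of ids, documents, and metadatas;
    // the embedding function attached to the collection vectorizes the documents.
    await collection.add({
      ids: batch.map((doc) => doc.metadata.glId),
      documents: batch.map((doc) => doc.pageContent),
      metadatas: batch.map((doc) => doc.metadata),
    });
  }
};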

Strategies for Efficient Data Loading

1. Batch Processing:

  • Instead of loading data in real-time or one-by-one, utilize batch processing. This approach aggregates data and loads it in chunks, ensuring efficiency and reduced strain on the database.

2. Parallel Loading:

  • If your infrastructure and the vector database support it, parallelize the data loading process. This means multiple streams of data can be loaded simultaneously, speeding up the entire operation (see the sketch after this list).

3. Error Handling:

  • Always anticipate potential issues during loading, such as data mismatches or server timeouts. Implement error handling mechanisms that can either retry, log, or notify about such disruptions.

4. Prioritization:

  • If specific data segments are more critical than others, prioritize their loading. This ensures that the most vital data is available for querying even if the loading process isn’t complete.
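As mentioned under parallel loading, here is a hedged variant of the batching helper above that issues several batches concurrently. The loadInParallel name and the concurrency value are illustrative assumptions, and this presumes the Chroma server tolerates parallel writes:

const loadInParallel = async (collection, docs, batchSize = 100, concurrency = 4) => {
  // Slice the documents into fixed-size batches first.
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }

  // Send up to `concurrency` batches at a time.
  for (let i = 0; i < batches.length; i += concurrency) {
    await Promise.all(
      batches.slice(i, i + concurrency).map((batch) =>
        collection.add({
          ids: batch.map((doc) => doc.metadata.glId),
          documents: batch.map((doc) => doc.pageContent),
          metadatas: batch.map((doc) => doc.metadata),
        })
      )
    );
  }
};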

4. Validation and Verification Procedures

Loading data into a vector database might seem like the last step of the ETL journey, but in many ways, it’s just the beginning of realizing the true value of your data. A few checks can confirm that the previous steps went well.

Data Integrity Checks:

  • Once data is loaded, compare a sample against its transformed counterpart to ensure no data loss or corruption has occurred during the loading phase (a quick spot check is sketched after this list).

Performance Benchmarks:

  • Conduct speed and performance tests to verify that data can be retrieved efficiently. This will help you gauge whether the loading was not just successful but also optimized for future operations (the sketch below includes a simple latency check).

User Feedback:

  • Allow a subset of users to interact with the newly loaded data. Their feedback can provide insights into any anomalies, errors, or inefficiencies that might have been overlooked.

Auditing:

  • Maintain logs of the data loading process. This will not only aid in troubleshooting but also ensure compliance and offer transparency about the data’s lineage and any transformations it underwent.
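As referenced above, a hedged sketch of a couple of post-load spot checks against the “confluence” collection created earlier. The sample query text and the assertions are illustrative assumptions, not a full test suite:

// Integrity: compare counts and one sampled document against the transformed data.
const count = await collection.count();
console.assert(count === splitDocs.length, 'Loaded count differs from transformed count');

const sample = splitDocs[0];
const stored = await collection.get({ ids: [sample.metadata.glId] });
console.assert(stored.documents[0] === sample.pageContent, 'Sample content mismatch');

// Performance: a rough latency measurement for a typical similarity query.
const started = Date.now();
await collection.query({ queryTexts: ['How do I request vacation?'], nResults: 5 });
console.log(`Query latency: ${Date.now() - started} ms`);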

5. Orchestrating ETL with GitLab

This section will showcase how to set up a GitLab CI/CD pipeline to streamline the ETL process for data from Confluence, with additional stages for Testing and Resetting the environment.

Stages Definition

The stages keyword is used to define the order of execution:

stages:
  - Extract
  - Transform
  - Load
  - Test
  - Reset

Extract Stage

The “Extract” stage is defined by the job 💫 E-confluence. It is responsible for pulling data from a Confluence instance:

💫 E-confluence:
  stage: Extract
  script:
    - npm install
    - npm run confluence:extract --option=$EXTRACT_OPTION
  variables:
    INSIGHT_URL: $INSIGHT_URL
    INSIGHT_PASSWORD: $INSIGHT_PASSWORD
    INSIGHT_USERNAME: $INSIGHT_USERNAME
  when: manual # This line makes the job manual
  allow_failure: false # If you want the job to be marked as 'failed' if any script fails
  artifacts:
    paths:
      - data_raw.json
      - node_modules/ # Add the node_modules directory as an artifact
    expire_in: 24 hours # Artifacts expire in 24 hours

Key features:

  • Installation of project dependencies using npm install.
  • Manual trigger (when: manual), which means this job runs when initiated by the user.
  • Use of project-specific variables like INSIGHT_URL, which are assumed to be set in the CI/CD settings.
  • Artifact creation for data_raw.json and the node_modules directory with a 24-hour expiry.

Transform Stage

The “Transform” stage is defined by the job 🔧 T-confluence, which modifies the extracted data into a structured format:

🔧 T-confluence:
  stage: Transform
  script:
    - npm run confluence:transform
  when: manual # This line makes the job manual
  artifacts:
    paths:
      - data_transformed.json # Assuming this file is produced by the transform.js script
  dependencies:
    - 💫 E-confluence

Key features:

  • The job is also manually triggered.
  • It depends on the successful completion of the 💫 E-confluence job.
  • The transformed data is saved in data_transformed.json, which is preserved as an artifact.

Load Stage

The “Load” stage is defined by the job ✍ L-confluence, which loads the transformed data into a vector database:

✍ L-confluence:
  stage: Load
  script:
    - npm install
    - npm run confluence:load
  when: manual # This line makes the job manual
  dependencies:
    - 🔧 T-confluence

Key features:

  • This stage is manually triggered.
  • It depends on the 🔧 T-confluence job to ensure data is transformed before loading.

Test Stage

The “Test” job 👨‍🎓 T-confluence handles the testing of the ETL process to ensure the integrity of the data:

👨‍🎓 T-confluence:
  stage: Test
  script:
    - npm install
    - npm run confluence:test
  # ...other actions...

Key features:

  • The job is manually triggered.
  • It is dependent on the 💫 E-confluence job, suggesting that it tests the extraction process or the end-to-end ETL pipeline.

Reset Stage

Finally, the “Reset” job ⚠⚠⚠ RESET ⚠⚠⚠ resets the environment, cleaning up or reinitializing as needed:

⚠⚠⚠ RESET ⚠⚠⚠:
  stage: Reset
  script:
    - npm install
    - npm run confluence:reset
  # ...other configurations...

Key features:

  • Manually triggered for controlled cleanup operations.
  • This is typically used to clear databases, remove temporary files, or revert settings to a baseline state.

Possible enhancements

1. Trigger-based Workflows:

  • Automate your ETL workflows to initiate based on specific triggers, such as data updates in Confluence or periodic time intervals (see the rules sketch after this list).

2. Rollbacks and Error Handling:

  • Configure your pipeline to handle errors gracefully, with potential rollback mechanisms or notifications for manual intervention.

3. Logging and Reporting:

  • Ensure detailed logging of each ETL step, aiding in troubleshooting, performance benchmarking, and audit compliance.

4. Alerts and Notifications:

  • Set up notifications to be alerted about any disruptions, failures, or successful completion of ETL jobs.
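For trigger-based workflows, one hedged option is a GitLab pipeline schedule combined with rules, sketched here for the extract job. This would replace the plain when: manual shown earlier and is an assumption, not the exact pipeline used in this article:

💫 E-confluence:
  stage: Extract
  rules:
    # Run automatically on scheduled pipelines (e.g. a nightly schedule)...
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
      when: always
    # ...and stay manual on every other pipeline.
    - when: manual
      allow_failure: false
  script:
    - npm install
    - npm run confluence:extract --option=$EXTRACT_OPTION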

Conclusion

In closing, remember that the integration of Confluence, vector databases, and GitLab is not just a technical upgrade but a step towards a more innovative and efficient future. This is an evolving field, and the possibilities are vast. Stay curious and keep exploring new ways to enhance data accessibility and knowledge sharing. Your willingness to innovate and adapt will not only benefit your organization but also contribute to the broader community’s growth in this exciting domain. Let’s embrace these tools and approaches with enthusiasm and a spirit of discovery, continually pushing the boundaries of what’s possible in knowledge management.

I’d love to hear your thoughts on this process. Have you embarked on similar journeys? Any challenges or successes you’d like to share? Let’s enrich this discussion with diverse experiences!

Passionate about shaping the future of SaaS products? At EQS Group, we’re on the hunt for enthusiastic individuals ready to bring their talents to our dynamic team. Discover opportunities tailored for you: https://eqs-group.personio.de/recruiting/positions.
