Streamlining Complex PDF Management Locally with Mixtral 8x7B: A Complete Guide

Mollel Michael (PhD)
10 min read · Dec 30, 2023


Repo: link

Navigating the complexities of PDF management can often feel daunting, especially when dealing with intricate documents containing extensive data and tables. In a world where digital documentation is paramount, the need for efficient and reliable tools to handle such tasks is more pressing than ever. This is where Mixtral 8x7B within the LLaMA-Index comes into play.

Recently, the AI community has witnessed a surge in the capabilities of Large Language Models (LLMs) thanks to groundbreaking advancements like Mixtral. These models have challenged the dominance of established, high-budget corporations and opened doors for individuals and small businesses to leverage cutting-edge technology, even without access to expensive hardware. In this article, I aim to demystify the process of implementing Mixtral 8x7B on a local Windows setup, tailored explicitly for complex PDF workflows.

From a comprehensive setup guide to practical strategies and troubleshooting tips, this guide is designed to empower you with the knowledge and tools to manage PDF documents efficiently. We will explore how to harness the power of Mixtral in a CPU-only environment, overcoming the usual barriers of financial constraints and hardware limitations. Whether you’re a seasoned developer or a curious enthusiast, this guide promises to provide insights into innovative and efficient data processing, making complex PDF management a breeze.

This concept was inspired by the insightful tutorial from LLM wizard Chris Alexiuk and AI Makerspace pioneer Greg Loughnane, which can be found here: Chris Alexiuk and Greg Loughnane's tutorial. Their approach to engaging with complex PDFs using OpenAI APIs offers a glimpse into the practical capabilities of LLMs.

I would like to point out that certain segments of the code, particularly those involving the conversion of PDFs to HTML format, require execution on a Linux-based system. I suggest setting up Windows Subsystem for Linux (WSL) on your computer to accommodate this. As a helpful addition to this tutorial, I will provide a concise guide on installing pdf2htmlEX within WSL. For those interested in a more in-depth exploration of WSL, including its advanced setup and utilization, I recommend the following resources:

However, it’s important to note that the function for converting PDF to HTML doesn’t need to be integrated into the main function. If you already have RAG data in HTML format that has been transformed from a PDF, you can proceed without installing WSL. To simplify things, I have included these HTML documents in the repository, allowing you to bypass this step. Feel free to use these resources to streamline your setup process.

Installing pdf2htmlEX on WSL

  1. First, navigate to LINK to access the most recent version of pdf2htmlEX. Once there, download the package and save it to a working folder or another location where you can easily recall the path. Select the one with a .deb file extension among the various packages on the website.

2. After downloading the package and placing it in the desired folder, open the Windows Subsystem for Linux (WSL) from PowerShell. Your Windows drives are mounted under /mnt inside WSL, so navigate to the directory where you've stored the package by executing this command:

username@DESKTOP-QPTTKI7Z:~$ cd /mnt/c/Users/username/LLM/deciphiMistra

To install the setup, run:

sudo apt install "./pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb" -y
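Once the install finishes, it is worth confirming that the binary is on your PATH before moving on; running it with its version flag should print the installed version:

pdf2htmlEX --version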

Next, verify the conversion functionality by executing the Python file for document conversion. Run this file (testpdf2html.py) in the WSL terminal to test if it successfully converts the document as expected.

python3 testpdf2html.py

The file testpdf2html.py is as follows:

import subprocess

def convert_pdf_to_html(pdf_path, html_path):
    # Call the pdf2htmlEX CLI; the generated HTML is written into the --dest-dir folder
    command = f"pdf2htmlEX {pdf_path} --dest-dir {html_path}"
    subprocess.call(command, shell=True)

input_pdf = "data/Market-Report-21-December-2023.pdf"
output_html = "data/Market-Report-21-December-2023"
convert_pdf_to_html(input_pdf, output_html)

You can convert all your PDF documents to HTML using this method or integrate this conversion code into the main RAG script. It’s important to run this conversion process on WSL to produce the HTML files. Once you have the HTML files in your data folder, you can execute the rest of the code on your native Windows system. Alternatively, if you prefer to conduct the entire conversion within the main RAG file, you can run the whole RAG pipeline on WSL. Just make sure to install all the required libraries beforehand.
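If you have a whole folder of PDFs to convert, a minimal batch-conversion sketch could look like the following. It assumes your PDFs sit in a data/ folder and reuses the convert_pdf_to_html function defined above; adjust the paths to match your own layout.

from pathlib import Path

# Convert every PDF in the data/ folder, writing each document's HTML into a folder named after it
for pdf_file in Path("data").glob("*.pdf"):
    out_dir = pdf_file.with_suffix("")
    out_dir.mkdir(exist_ok=True)  # make sure the destination folder exists
    convert_pdf_to_html(str(pdf_file), str(out_dir))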

Main File (RAG your complex PDF)

  1. Begin by installing and importing the essential libraries for this setup. The required libraries include i) llama-index, ii) langchain, iii) llama-hub, iv) unstructured==0.10.18, and v) lxml, among others. Remember to upgrade llama-cpp-python by running pip install --upgrade llama-cpp-python to keep your setup up-to-date.
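For reference, the installs for the packages listed above come down to a couple of commands (only unstructured is pinned to a specific version):

pip install llama-index langchain llama-hub unstructured==0.10.18 lxml
pip install --upgrade llama-cpp-python

With the packages in place, the imports at the top of the main file are: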
from llama_index.llama_pack import download_llama_pack
from embedded_tables_unstructured_pack.base import EmbeddedTablesUnstructuredRetrieverPack
import subprocess

2. It’s necessary to download the llama_pack from llama-index to tailor the local model and embeddings for our needs. This step is essential because, by default, llama_pack utilizes OpenAI’s resources for Large Language Models (LLMs) and embeddings.

# Uncomment these lines the first time you run the script; once the pack has been downloaded, comment them out again.
# #*****************
# EmbeddedTablesUnstructuredRetrieverPack = download_llama_pack(
# "EmbeddedTablesUnstructuredRetrieverPack",
# "./embedded_tables_unstructured_pack",
# )
# #*************

Remember that after downloading, you can comment out these lines to prevent re-downloading on future runs. I have included the downloaded pack in my repository, so if you clone my repo, there's no need for an additional download. The pack's files have been modified to work with local LLMs and embeddings, so make sure to leave the download lines commented out as they are.

3. I mentioned earlier that we utilize the document featured in AI Makerspace's tutorial. We'll pose identical questions to compare the responses generated by Mixtral against OpenAI, but you can also test the capability of other small models. Mixtral 8x7B is a relatively large model that has been shown to outperform GPT-3.5 on various tasks. The document 'quarterly-nvidia.pdf' used for this purpose is already included in my repository. There's no need to worry about executing the conversion code for this document, as I have pre-converted it to HTML for ease of use. However, if you're interested in experimenting with a different document, feel free to uncomment the relevant code in the main file.

# Comment these lines out to skip the conversion; uncomment them if you wish to convert another document.
def convert_pdf_to_html(pdf_path, html_path):
    command = f"pdf2htmlEX {pdf_path} --dest-dir {html_path}"
    subprocess.call(command, shell=True)

input_pdf = "quarterly-nvidia.pdf"
output_html = "quarterly-nvidia"

convert_pdf_to_html(input_pdf, output_html)

4. Initializing the Index and Query Engine:

i) Use the EmbeddedTablesUnstructuredRetrieverPack as the wrapper. This pack includes a variety of functions. For more detailed information about its capabilities and usage, refer to the LlamaIndex documentation.

embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "data/quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="data/quarterly-nvidia/nvidia-quarterly.pkl",
)

ii) Returning to step 2, the EmbeddedTablesUnstructuredRetrieverPack you downloaded contains several key files, including i) base.py and ii) requirements.txt. The base file has been altered to support local LLMs and embedding configurations. The primary change involves specifying the LLM to be used, and you can choose from three different models by commenting out the other two. It's important to note that if you plan to utilize the mixtral-8x7b model, there are various considerations to keep in mind. For a more comprehensive understanding, please refer to my article on mixtral-8x7b, available at this link. Additionally, to access the quantized models, you can follow the links below (a short download sketch follows the list):

a) Deci/DeciLM-7B-instruct-GGUF link is here

b) TheBloke/phi-2-GGUF link is here

c) TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF link is here
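If you prefer to script the download instead of grabbing the file in a browser, a minimal sketch using huggingface_hub could look like the following. The repo ID comes from link c) above and the filename matches the quantized file referenced later in base.py; adjust both for the model and quantization you actually pick.

from huggingface_hub import hf_hub_download

# Download the 2-bit quantized Mixtral file into the current folder
hf_hub_download(
    repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
    filename="mixtral-8x7b-instruct-v0.1.Q2_K.gguf",
    local_dir=".",
)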

Another important aspect of this setup is using local embeddings; specifically, I’ve implemented ‘gte-large’ as the embedding choice. I encourage experimenting with various embeddings to observe the differences in results and performance enhancements. You can download these embeddings from the provided link. Once the local LLM and embeddings are initialized in the service_context, I configure the global service context to replace the default OpenAI LLM and model settings. Finally, I initialize the query_engine using this new service_context. The fully modified ‘base.py’ is presented below:

 
class EmbeddedTablesUnstructuredRetrieverPack(BaseLlamaPack):
    """Embedded Tables + Unstructured.io Retriever pack.

    Use unstructured.io to parse out embedded tables from an HTML document, build
    a node graph, and then run our recursive retriever against that.

    **NOTE**: must take in a single HTML file.

    """

    def __init__(
        self,
        html_path: str,
        nodes_save_path: Optional[str] = None,
        **kwargs: Any,
    ) -> None:
        """Init params."""
        self.reader = FlatReader()

        docs = self.reader.load_data(Path(html_path))

        ## Uncomment this if you want to experiment with DeciLM
        # set_global_tokenizer(
        #     AutoTokenizer.from_pretrained("Deci/DeciLM-7B").encode
        # )

        ## Uncomment this if you want to experiment with Phi-2
        # set_global_tokenizer(
        #     AutoTokenizer.from_pretrained("microsoft/phi-2").encode
        # )

        # Make sure to comment this out if you decide to go with an alternate model.
        set_global_tokenizer(
            AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1").encode
        )

        self.llm = LlamaCPP(
            # Optionally, you can pass in the URL to a GGUF model to download it automatically
            model_url=None,
            # Set the path to a pre-downloaded model instead of model_url
            # model_path='./decilm-7b-uniform-gqa-q8_0.gguf',  # DeciLM local path
            model_path='./mixtral-8x7b-instruct-v0.1.Q2_K.gguf',  # Mixtral local path
            # model_path='./phi-2.Q8_0.gguf',  # Phi-2 local path
            temperature=0.0,
            max_new_tokens=2000,  # increased to support longer responses
            context_window=8048,  # kept around 8K here; Mixtral 8x7B supports a much larger window
            # kwargs to pass to __call__()
            generate_kwargs={},
            # kwargs to pass to __init__()
            # set n_gpu_layers to at least 1 to use the GPU; -1 offloads all layers when a GPU is available
            model_kwargs={"n_gpu_layers": -1},
            # transform inputs into the Llama2/Mistral instruct format
            messages_to_prompt=messages_to_prompt,
            completion_to_prompt=completion_to_prompt,
            verbose=True,
        )

        embed_model = LangchainEmbedding(
            HuggingFaceEmbeddings(model_name="thenlper/gte-large")
            # HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
            # HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
        )

        # embed_model = HuggingFaceEmbedding(model_name="./gte-large")

        service_context = ServiceContext.from_defaults(
            llm=self.llm,
            embed_model=embed_model,
        )

        # service_context = ServiceContext.from_defaults(
        #     llm=self.llm,
        #     embed_model="local:./embed_model",
        # )

        set_global_service_context(service_context)

        self.node_parser = UnstructuredElementNodeParser()
        if nodes_save_path is None or not os.path.exists(nodes_save_path):
            self.node_parser.llm = self.llm
            raw_nodes = self.node_parser.get_nodes_from_documents(docs)
            if nodes_save_path is not None:
                # only cache the parsed nodes if a save path was provided
                pickle.dump(raw_nodes, open(nodes_save_path, "wb"))
        else:
            raw_nodes = pickle.load(open(nodes_save_path, "rb"))

        base_nodes, node_mappings = self.node_parser.get_base_nodes_and_mappings(
            raw_nodes
        )
        # construct top-level vector index + query engine
        vector_index = VectorStoreIndex(base_nodes)
        vector_retriever = vector_index.as_retriever(similarity_top_k=1)
        self.recursive_retriever = RecursiveRetriever(
            "vector",
            retriever_dict={"vector": vector_retriever},
            node_dict=node_mappings,
            verbose=True,
        )
        # self.query_engine = RetrieverQueryEngine.from_args(self.recursive_retriever)
        # service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4-1106-preview"))

        self.query_engine = RetrieverQueryEngine.from_args(
            self.recursive_retriever,
            service_context=service_context,
        )

    def get_modules(self) -> Dict[str, Any]:
        """Get modules."""
        return {
            "node_parser": self.node_parser,
            "recursive_retriever": self.recursive_retriever,
            "query_engine": self.query_engine,
        }

    def run(self, *args: Any, **kwargs: Any) -> Any:
        """Run the pipeline."""
        return self.query_engine.query(*args, **kwargs)

Finally, within the setup, I've adjusted the global tokenizer to align with our chosen LLM through the set_global_tokenizer call. Be sure to tailor this setting to match your specific model.

set_global_tokenizer(
    AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1").encode
)
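With the pack initialized and the tokenizer set, you can inspect the components the pack bundles and issue a first query against the indexed document. The question string below is just a placeholder, not one of the questions used in the comparison that follows.

# List the modules exposed by the pack: node_parser, recursive_retriever, query_engine
modules = embedded_tables_unstructured_pack.get_modules()
print(list(modules.keys()))

# Run a query through the recursive retriever + query engine
response = embedded_tables_unstructured_pack.run("What was NVIDIA's total revenue this quarter?")
print(str(response))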

Model Responses Showcased
I’ll present the responses from the llama-index default OpenAI model and Mixtral 8x7B for demonstration purposes. However, I encourage you to engage with different models and different questions to explore their responses further. The questions I fed to the models are as follows.

Question 1:

response = embedded_tables_unstructured_pack.run("Revenue?")

i. The response from the default LLM in llama-index: “The revenue for the third quarter of fiscal year 2024 was $2.86 billion for Gaming, $416 million for Professional Visualization, and $261 million for Automotive.”

ii. The Response from Mixtral 8x7B:

Comparing the two responses, Mixtral provides an answer that more closely aligns with the expected response than the default LLM in llama-index. The term ‘Revenue’ appears 119 times in the document, making the query overly broad. Nonetheless, Mixtral managed to generate a generalized response that appropriately addresses the query as posed.

Question 2:

response = embedded_tables_unstructured_pack.run("Revenue from sales outside of USA?")

i. The response from the default LLM in llama-index: “The company generated 65% and 62% of its revenue during the third quarter and first nine months of fiscal year 2024, respectively, from sales outside of the United States.”

ii. The Response from Mixtral 8x7B:

Both responses seem to convey the same information regarding the company’s revenue sources. The slight difference is that Mixtral elaborates on the implication of the data, emphasizing the importance of international sales to the company’s revenue.

Question 3:

response = embedded_tables_unstructured_pack.run("Any policy changes?")

i. The response from the default LLM in llama-index: “There may be policy changes that could impact the company’s operations. Changes to laws, regulations, and their interpretation and enforcement could lead to increased compliance costs and further restrictions on manufacturing and supplying products. Additionally, revisions to laws or regulations could result in increased taxation, trade sanctions, import duties or tariffs, and other retaliatory actions. Government actions, including trade protection and national security policies, could affect the company’s ability to ship products and provide services. The increasing focus on the risks and strategic importance of AI technologies has also resulted in regulatory restrictions that may impact some or all of the company’s product and service offerings.”

ii. The Response from Mixtral 8x7B:

Both responses adequately address the question posed.

Question 4:

response = embedded_tables_unstructured_pack.run("What are the research and development expenses, and what percentage of the net revenue do they represent?")

i. The response from the default LLM in llama-index: “The research and development expenses for the three months ended October 29, 2023, were $2,294 million, representing 12.7% of net revenue. For the nine months ended on the same date, the expenses were $6,210 million, which was 16.0% of net revenue.”

ii. The Response from Mixtral 8x7B:

General Comment

The Mixtral model exhibits impressive performance, yet it stands to benefit from refined prompt crafting. My observations from this experiment indicate that the precision and relevance of Mixtral’s responses can be considerably enhanced through careful prompt engineering. Consequently, I suggest fine-tuning your prompts to maximize the quality of the answers you receive from the Mixtral model for your specific use case.
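As a hypothetical illustration of what that refinement can look like, a narrower, more explicit query tends to produce a tighter answer than the broad “Revenue?” used in Question 1. The wording below is only an example, not one of the prompts used in the runs above.

# A more specific prompt that points the model at the relevant tables and asks for exact figures
response = embedded_tables_unstructured_pack.run(
    "Using the income statement tables, summarize total revenue for the third quarter "
    "of fiscal year 2024, broken down by segment, and quote the exact figures."
)
print(str(response))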
