How Unstructured and LlamaIndex can help bring the power of LLMs to your own data

Jerry Liu
7 min read · Mar 9, 2023

(co-authored by Jerry Liu, creator of LlamaIndex, and Brian Raymond, CEO of Unstructured)

In this tutorial, we’ll show you how to easily obtain insights from SEC 10-K filings, using the power of a few core components: 1) Large Language Models (LLMs), 2) Data Parsing through Unstructured, and 3) Data Indexing through LlamaIndex! We show how the LlamaIndex tooling can help you build an amazing query interface over your own data.

Large Language Models (LLMs) are starting to revolutionize how users can search for, interact with, and generate new content. The incredible popularity of tools like ChatGPT has only served to accelerate value discovery along all of these dimensions. By simply entering a search query into a text box, users have the ability to interact with information from an incredible array of data sources baked into the model during training. Moreover, the value of LLMs isn’t simply that users can access this information for search and retrieval purposes; users can also leverage it for a variety of generation and reasoning tasks, from distilled question-answering and summarization to writing long-form content, style transfer, and even making decisions and performing actions.

There is one challenge though: how do users easily apply LLMs to their own data? LLMs are pre-trained on enormous amounts of publicly available natural language data, such as Wikipedia articles, Stack Overflow-style coding questions, Reddit posts, and more; but they haven’t been trained or optimized on domain- or organization-specific data.

A key technique that has emerged to exploit non-public or specialized data is “In-Context Learning,” in which the user supplies the relevant context directly in the prompt alongside the task (e.g. question answering, summarization, or generation). In this way, users can creatively exploit the LLM’s reasoning abilities rather than spending time and money to fine-tune/retrain the LLM on domain- or organization-specific data.
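
To make the idea concrete, here is a minimal sketch of what an in-context learning prompt can look like; the template wording and helper function below are purely illustrative, not a specific library’s API:

```python
# A minimal sketch of in-context learning: retrieved context is pasted directly
# into the prompt, and the LLM reasons over it at inference time.
PROMPT_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information and no prior knowledge, "
    "answer the question: {question}\n"
)

def build_prompt(context: str, question: str) -> str:
    # The caller is responsible for retrieving the relevant context
    # (e.g. the most relevant chunks of a 10-K filing).
    return PROMPT_TEMPLATE.format(context=context, question=question)
```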

At a high level, In-Context Learning requires two steps:

  1. Data Ingestion and Transformation: Files containing natural language must be extracted, cleaned, and transformed into a format and an initial structure that an LLM can understand (e.g. JSON). LLMs also tend to perform better if one is able to isolate the most important natural language data (e.g. body text) and discard irrelevant data (e.g. image captions, headers/footers, advertisements).
  2. Data Indexing: The ingested data must then be stored in a data structure that can manage it and expose a query interface, so that the “right” context can be provided to the LLM at query time!

Unstructured and LlamaIndex help provide the tooling for 1) and 2) respectively!

An additional note: figuring out the best formats for data ingestion and indexing is non-trivial. We will show you how to use our tooling to get the best answers to your queries.

Let’s dive into an example use case where we want to use the power of LLMs to answer various questions about a set of SEC 10-K filings. Below, we’ll walk through each step for preprocessing, transforming, and loading some sample documents.

Data Ingestion and Transformation

We first download the raw UBER 10-K HTML filings from Dropbox, and then ingest the raw natural language data through the Unstructured library.

Unstructured is an open source library that reduces the amount of data engineering required to ingest and preprocess files containing natural language. Unstructured offers a range of upstream data connectors as well as tools to transform a wide range of file types (e.g. PDF, HTML, MSFT Office, PNG/JPEG) into JSON or CSV. They’ve focused on going beyond traditional packages by making it easy to clean up OCR’d text, scraped web data, or other file types, and render files ready for ingestion or an embeddings endpoint downstream. Critically, they also detect and classify key document elements such as body text, lists, headlines, and more. Their code is available under an open source license in their GitHub repository, where you’ll also find specialized preprocessing pipelines, such as those that have been fine-tuned to process XBRL versions of S-1, 10-K, and 10-Q SEC filings.
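
As a quick illustration of what Unstructured does (the file path below is a placeholder), a single call can partition a local 10-K HTML filing into typed elements:

```python
from unstructured.partition.html import partition_html

# Partition a local 10-K HTML filing into a list of document elements.
# Each element carries cleaned text plus a category (Title, NarrativeText, ...).
elements = partition_html(filename="data/UBER/UBER_2020.html")  # placeholder path

for element in elements[:5]:
    # The element's class name doubles as its category label.
    print(type(element).__name__, "=>", element.text[:80])
```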

LlamaIndex provides a simple wrapper over Unstructured in order to easily retrieve the parsed content and convert it into a format that LlamaIndex can ingest.

Here, we use Unstructured’s HTML parser, which is able to parse the raw HTML DOM into both a clean object structure and nicely formatted, cleaned text. The final output is a Document object that we can then use within an index data structure. We create a Document object for each 10-K filing.
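
A sketch of this step is below, assuming an early-2023 LlamaIndex release in which the Unstructured wrapper is fetched via download_loader (exact class names vary across versions, and the file paths are placeholders):

```python
from pathlib import Path
from llama_index import download_loader

# Fetch LlamaIndex's wrapper around the Unstructured library.
UnstructuredReader = download_loader("UnstructuredReader")
loader = UnstructuredReader()

years = [2017, 2018, 2019, 2020, 2021, 2022]
doc_set = {}
for year in years:
    # Parse each 10-K filing into a single LlamaIndex Document object.
    doc_set[year] = loader.load_data(
        file=Path(f"data/UBER/UBER_{year}.html"), split_documents=False
    )
```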

Data Indexing

We now want to “index” the data so that it can easily be used with a downstream LLM! A simple way of doing this is to combine an embedding-based retrieval model with a language model. In this approach, we would first split each document into “text chunks”, create an embedding for each chunk (for instance through OpenAI’s embeddings API), and store the chunks in a vector store.

LlamaIndex supports this capability; you can choose to either store the vectors in a simple in-memory structure, or use a third-party vector store: Pinecone, Weaviate, Chroma, Qdrant, and more. Our “vector store” indices handle the text chunking and storage automatically under the hood.

We first create a separate vector store index for each document:
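
A minimal sketch, continuing from the loading step above (class names follow an early-2023 LlamaIndex API and may differ in your installed version):

```python
from llama_index import GPTVectorStoreIndex

# Build one vector store index per year's filing; text chunking and
# embedding happen under the hood.
index_set = {}
for year in years:
    index_set[year] = GPTVectorStoreIndex.from_documents(doc_set[year])
```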

We also create a global vector index for all documents.
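
Continuing the same sketch:

```python
# Combine every year's documents into a single global vector index.
all_docs = [doc for year in years for doc in doc_set[year]]
global_index = GPTVectorStoreIndex.from_documents(all_docs)
```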

Asking Some Initial Queries

With our indices set up, we’re ready to ask some initial “queries”! A “query” is simply a general task passed to the index, which will eventually make its way into the LLM prompt.

Here, we show an example of asking about risk factors over the 2020 SEC 10-K.
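
A sketch of the query, under the same version assumptions as above (newer LlamaIndex releases expose queries through a query engine, while older ones call index.query directly; the query string is illustrative):

```python
# Query the 2020 filing's index about risk factors.
query_engine_2020 = index_set[2020].as_query_engine()
response = query_engine_2020.query(
    "What were some of the biggest risk factors in 2020?"
)
print(response)
```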

The response highlights that the predominant risk factor in 2020 was the rise of COVID-19.

One question is how general an interface the vector store index provides. Can the global vector index answer the same question, but synthesize the information across the different documents corresponding to different years? Let’s take a look!
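
Here is the same kind of query issued against the global index, with the retriever configured to return the top 3 chunks (again, exact method names depend on the LlamaIndex version):

```python
# Ask across all years at once, retrieving the top 3 most similar chunks.
global_query_engine = global_index.as_query_engine(similarity_top_k=3)
response = global_query_engine.query(
    "What were some of the biggest risk factors in each year?"
)
print(response)
```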

As shown, even with a top-k of 3, the returned response focuses on risks in 2019, and we don’t have full confidence that the index is explicitly traversing the risk factors in every document.

We turn to composability to help us answer such cross-document queries.

Composing a Graph

The composability framework of LlamaIndex allows users to more explicitly define a structure over their data. For this specific use case, we show how users can define a graph to synthesize answers across 10-K filings. The user defines a “list” index over the vector store indices corresponding to each document. Since querying a list index involves going through every node in the list, every query sent to this graph will go through every subindex (a vector store index), and within each subindex, the top-k results will be retrieved.
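
A sketch of the graph construction, under the same version assumptions as the earlier snippets (the index summaries are illustrative):

```python
from llama_index import GPTListIndex
from llama_index.indices.composability import ComposableGraph

# One-line summaries describe what each subindex contains.
index_summaries = [f"UBER 10-K filing for fiscal year {year}" for year in years]

# Compose a list index over the per-year vector indices: a query against the
# graph visits every subindex and retrieves top-k chunks within each one.
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [index_set[year] for year in years],
    index_summaries=index_summaries,
)
```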

This allows us to define our query with the expectation that it will traverse every document.
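
For example (the query string is illustrative):

```python
# Querying the graph routes the question through every per-year subindex.
graph_query_engine = graph.as_query_engine()
response = graph_query_engine.query(
    "Compare and contrast the risk factors described across the different years."
)
print(response)
```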

The full response is below:

For 2017, 2018, and 2019, the risk factors included economic uncertainty, geopolitical tensions, and the potential for natural disasters.

For 2020, the risk factors include the COVID-19 pandemic and the impact of actions to mitigate the pandemic, which has adversely affected and continues to adversely affect business, financial condition, operating results, and prospects. This includes the potential for reduced demand for products and services, disruption of the supply chain, reduced liquidity, increased costs, and reduced revenue.

For 2021 and 2022, the risk factors include the potential for Drivers to be classified as employees, workers or quasi-employees instead of independent contractors, the highly competitive nature of the mobility, delivery, and logistics industries, and the need to lower fares or service fees and offer Driver incentives and consumer discounts and promotions in order to remain competitive in certain markets.

Overall, the risk factors have remained relatively consistent across the years 2017, 2018, 2019, 2021, and 2022. The primary risk factors have been the potential for Drivers to be classified as employees, workers or quasi-employees instead of independent contractors, the highly competitive nature of the mobility, delivery, and logistics industries, and the need to lower fares or service fees and offer Driver incentives and consumer discounts and promotions in order to remain competitive in certain markets. Additionally, economic uncertainty, geopolitical tensions, and the potential for natural disasters have been risk factors in each of these years. The COVID-19 pandemic has been a risk factor since 2020, and has included the potential for reduced demand for products and services, disruption of the supply chain, reduced liquidity, increased costs, and reduced revenue.

It’s clear that the answer has been synthesized across multiple years!

Concluding Thoughts

The example above uses 10K filings, but you can use these tools with a vast array of data, from .pdf files to .pptx, to API calls. Unstructured provides the tooling to quickly and easily preprocess data into LLM-compatible formats, and LlamaIndex provides the toolset to help connect this data with your LLM tasks.

The Colab notebook is here if you want to try it out! https://colab.research.google.com/drive/1uL1TdMbR4kqa0Ksrd_Of_jWSxWt1ia7o?usp=sharing

We’re incredibly excited to see more use cases emerge as this interface between LLMs and data develops further.
