10-KGPT: Creating an investing tool and learning LLMs

9 min readNov 17, 2023

Like many techies, I’ve been looking for an excuse to play with LLMs/Gen AI. Ben Royce, Head of AI Service at Google Cloud, thinks GenAI is the most impactful tech since the internet. I’m not quite there yet, but I’d liken it to the release of the iPhone. Time will tell, we’re still in the middle of it and it’s moving fast. So fast, that after Open AI’s Dev Day last week some techies are declaring startups in this space are dead.

Before the tech nerdery, I want set the stage on why 10-k’s with a different sort of nerdery: investing. In 2017 my wife and I went on a sabbatical to travel around the world (and surf!). During that time, we didn’t work, but I wanted to learn something new. The year before we went to a financial advisor who told us to put our money into an index fund. Perfectly good advice. In fact, it’s what Warren Buffett and Charlie Munger recommend most people do. But I’d flirted enough with investing and made the dumb rookie mistakes so was motivated to learn. Plus, I’m a glutton for punishment. Furthermore, if I found it engaging, investing is a great skill to have for a very very long time (Munger is 99 and sharp as a tack). Unlike my surfing and mountain bike skills, I can continue to get better for a long time. Learning is like compound interest and if I start now, by the time I retire I‘d be good enough to keep me in a lifestyle I’m accustomed to. So long as I don’t make any life altering mistakes (ie lose a-lot of money). I’ll spare you all the deep dive I went into and continue to go down on learning to invest my own money, but most contemporary financial pundits might call me a concentrated value investor. I subscribe to Seeking Alpha, Guru Focus, Bloomberg, WSJ, read books on investing, read 10-Ks…I spend a decent amount of time investing. Which is why index funds are better for most. I like the “investigative journalism” aspect of investing. There is so much you learn about industries, it’s fascinating. I’ve learned: electric arc furnaces used for greener steelmaking, semi-conductor supply chain, vehicle salvage market, autoparts retail, e-commerce/retail, and many others. Then there’s how to analyze a business: 3 financial statements, how to read a 10-k, assess management, how to value it, how to check my bias, and the different kinds of competitive advantage. Finally, I’ve found there are a circle of investors that are both wise and persons of character. They’re like modern day philosophers in action that require a world view and balance of left/right brain thinking that’s extremely engaging. I’ve learned an incredible amount about business and philosophy and still understand so little. Investing keeps me quite engaged.

Did I lose you? Let’s get down to what I’m building. As part of my investment process, I do research on companies and industries. I seek to build my understanding of the market, a company’s position in that market, and how to value that company. I think LLMs can be of great use for research. I can’t rely on chatGPT’s corpus to give me what I need, plus I want the ability to control the inputs. In the world of more data, I believe the final 20% will be built on quality data + product presentation. Whereas generative AI has a great general capability, it pairs best with domain expertise and that’s where I think things are very exciting right now in many industries. My venn diagram of CTO + investor is cut out for this particular project.

I’ve started by feeding 10k’s in, but I’ll eventually add earnings call transcripts, and 10q’s. My initial use-case is for new companies. I still need to spend alot of time reading (summaries won’t cut it), but I want a tool that aides my investigation. My initial process will be like this:

Start with the financials and look for anomalies/trends
Interrogate the 10-Ks for those anomalies
General inquiries into competition, management quality, congruence with my values, and my capability to understand this company/industry.

The What

Screenshot that shows financial history and notable highlights

This is a screenshot from the Streamlit app I built to do my research. I’m calling it 10-kGPT for now, but for you product aficionados this is neither a branding nor a UX exercise 😅. At the top is key financial information I look at, and at the bottom is a summary of trends/anomalies from that table. Check out the 2nd bullet:

Net Income: QCOM’s net income has also shown a positive trend, although it has been more volatile compared to revenue. The company experienced a significant drop in net income in 2018, with a loss of $4,964 million. However, it recovered and reached $12,936 million in 2022.

Let’s dive into that.

Interesting. I knew about issues with Apple (and them trying to building their own chips), but I hadn’t realized it affected recorded royalties. I also have a reference here if I want to go read more.

Here’s a couple more questions I’d ask

Pretty interesting! I’d definitely like to check out their competitors, validate their moat, and see if the SEC investigation denotes any issues with management.

You can see I’m referencing sources here, but what I’d like to do in the future is provide a link to the 10-k + section and go right to there in a browser for further reading. Put it on the backlog.

The How

Parsing/Chunking/Embedding

I used Python’s Beautifulsoup. I started going through section by section for just the interesting bit, but I ran into issues with inconsistencies across time and companies so I just parse everything now. I grab GAAP/financial metrics from another API, but I still wanted the option to make table information searchable. So I spent a bit of time making sure I could parse table information and read it while debugging by converting tables into an ASCII format. This was one of those exercises that lead me down a rabbit hole, but I think it was worth it for debugging and parsing. Even though I have metrics, I can still ask things like “what was the cash from operations in 2022?” and so far no hallucinations 🤞.

Much ink has been spilled about Retrieval Augmented Generation (RAG), so I won’t go too far into detail. I chunk my text into 1000 token chunks using from a bit of borrowed code from Open AI’s embedding cookbook below. Works pretty well for chunking text, but I did have to make a few tweaks.

GPT_MODEL = "gpt-3.5-turbo"  # only matters insofar as it selects which tokenizer to use


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str, str]:
    """Split a string in two, on a delimiter, trying to balance tokens on each side."""
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  # no delimiter found
    elif len(chunks) == 2:
        return chunks  # no need to search for halfway point
    else:
        total_tokens = num_tokens(string)
        halfway = total_tokens // 2
        best_diff = halfway
        for i, chunk in enumerate(chunks):
            left = delimiter.join(chunks[: i + 1])
            left_tokens = num_tokens(left)
            diff = abs(halfway - left_tokens)
            if diff >= best_diff:
                break
            else:
                best_diff = diff
        left = delimiter.join(chunks[:i])
        right = delimiter.join(chunks[i:])
        return [left, right]


def truncated_string(
    string: str,
    model: str,
    max_tokens: int,
    print_warning: bool = True,
) -> str:
    """Truncate a string to a maximum number of tokens."""
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    truncated_string = encoding.decode(encoded_string[:max_tokens])
    if print_warning and len(encoded_string) > max_tokens:
        print(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens.")
    return truncated_string


def split_strings_from_subsection(
    subsection: tuple[list[str], str],
    max_tokens: int = 1000,
    model: str = GPT_MODEL,
    max_recursion: int = 5,
) -> list[str]:
    """
    Split a subsection into a list of subsections, each with no more than max_tokens.
    Each subsection is a tuple of parent titles [H1, H2, ...] and text (str).
    """
    titles, text = subsection
    string = "\n\n".join(titles + [text])
    num_tokens_in_string = num_tokens(string)
    # if length is fine, return string
    if num_tokens_in_string <= max_tokens:
        return [string]
    # if recursion hasn't found a split after X iterations, just truncate
    elif max_recursion == 0:
        return [truncated_string(string, model=model, max_tokens=max_tokens)]
    # otherwise, split in half and recurse
    else:
        titles, text = subsection
        for delimiter in ["\n\n", "\n", ". "]:
            left, right = halved_by_delimiter(text, delimiter=delimiter)
            if left == "" or right == "":
                # if either half is empty, retry with a more fine-grained delimiter
                continue
            else:
                # recurse on each half
                results = []
                for half in [left, right]:
                    half_subsection = (titles, half)
                    half_strings = split_strings_from_subsection(
                        half_subsection,
                        max_tokens=max_tokens,
                        model=model,
                        max_recursion=max_recursion - 1,
                    )
                    results.extend(half_strings)
                return results
    # otherwise no split was found, so just truncate (should be very rare)
    return [truncated_string(string, model=model, max_tokens=max_tokens)]

Next, just like in the cookbook, I compute the embeddings, but unlike the cookbook, I stored into the vector database: ChromaDB. The simplicity/opensourciness appealed to me. It reminded me of DuckDB, so I tried it and been happy so far.

I opted not to use Llama_Index because it feels like an unnecessary wrapper that gets in the way of powerful tech. Reminds me of ORMs in that sense.

Prompting/LLMs stuff

This is where RAG comes in. When I ask a question, I create I convert the question to an embedding, query my vector database, append the top 4 rows (~4k text input), and insert into my prompt like below. That’s it, pretty rad.

    prompt = f"""Use the sections below from {symbol} set of 10-K's to answer concisely the subsequent question.  Cite references.  Tables have been converted to ascii. If the answer cannot be found, write "I don't know."

        Sections:
        \"\"\"
        {res}
        \"\"\"
        Question: {question}"""

    messages = [
        {"role": "system", "content": f"You answer questions about {symbol}'s business"},
        {"role": "user", "content": truncated_string(prompt, max_tokens=MAX_GPT_MODEL_TOKEN - 500)}
    ]
    response = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"

Streamlit

I don’t have alot to say about Streamlit, other than it’s awesome for the kind of things I like to build: data driven prototypes. I don’t have to futs around with HTML/CSS and it’s much better than WSYWIG. I could also see it being useful for internal apps in an org or analytics dashboard where Jupyter is too technical and Tableau is too stuffy/expensive.

Streamlit has good examples of layouts (some purpose built for chat apps) and I learned how infinitely customizable datatables are in Python (more a feature of Pandas than Streamlit).

APIs

I’ve subscribed to GuruFocus for a long time. I was suprise to see they have an API that comes with my subscription and it’s pretty good. It has a (reasonable) rate limit, so I do some caching with this to cut down on repeat hits. But they capture so much financial history and valuation metrics, it would be a waste not to use this as my core financial engine.

Prototypes != Production

It’s easy to get get prototypes going, it’s much harder to build something that scales. My app is for me: N=1. Maybe it will be useful for others someday, but that’s not why I made it. I get to shortcut some things and put in hacks so I stave off having to go down technical rabbit holes. For example, retrieval (how to query your vector store) is an important topic. And I’ve run into issues with my approach. At some point I may need a more traditional sparse vector search like Solr/Lucene so I can do customized ranking (ie: good ole fashioned key words). Pinecone claims to have the capability, I just don’t really want to have to use a DBaaS yet (or is it VDBaaS?)…I like the warm comfort of local development. I may cave on that. Also, with the length of my chunks being 1k, I may need to experiment with that a bit. BUT, there’s a super cool paper that just came out that claims to extract the necessary content from a chunk for feeding into LLM. Like I said, this space is moving FAST.

What’s next?

I’m keen to start using this for my investment research. Along the way I’ll keep building and tinkering. I’m interested in what other’s think of the What or the How. Leave me a comment or DM me if you want to nerd out.

10-KGPT: Creating an investing tool and learning LLMs

The What

The How

What’s next?

Written by Trent Niemeyer