Stories by Levi Stringer on Medium

The Invisible Denominator: On Measuring What Language Models Actually Cost

Levi Stringer — Mon, 15 Dec 2025 15:31:08 GMT

The engineer showed me her terminal. Fourteen microservices, each making between two and two hundred LLM calls per user session. “We have no idea what anything costs,” she said. “We just get a bill at the end of the month.”

This is the state of the art.

In 1854, John Snow mapped cholera deaths in London. He did not theorize about miasma or debate the merits of various humoral imbalances. He counted. He located. He drew. The resulting map, which I first saw in Edward Tufte’s The Visual Display of Quantitative Information, shows dots clustered around the Broad Street pump. It demonstrated what argument could not. The pump handle was removed. The epidemic ended.

The lesson is not about cholera. The lesson is about the relationship between measurement and action. You cannot optimize what you cannot see. You cannot debug what you cannot trace. You cannot manage what you do not count.

Modern software teams have, for the most part, internalized this. We instrument our services. We trace our requests. We know, to the millisecond, how long each database query takes. We can tell you which endpoint is slow, which user is hammering the API, which deploy introduced the regression.

Then we call an LLM, and all of that rigor evaporates.

Consider the anatomy of a language model API call. You send tokens. You receive tokens. You are charged. The provider tells you how many tokens you consumed. But this information arrives detached from context, a usage object appended to a response, logged nowhere, correlated with nothing.

The aggregate bill arrives weeks later. By then, the code that generated the costs has been modified seventeen times. The feature that caused the spike shipped three sprints ago. The engineer who wrote the recursive summarization loop has moved to a different team.

This is not a monitoring problem. It is an attribution problem. The data exists; it simply isn’t connected to anything useful.

Let us be concrete. Here is a table of prices, current as of December 2025:

The market has shifted dramatically in 2025. GPT-5 now costs less than GPT-4o did at launch. Claude Opus 4.5 is cheaper than Claude 3 Opus despite being far more capable, a 66% price reduction at the flagship tier. Google’s Gemini 2.0 Flash delivers strong performance at a tenth of a dollar per million input tokens.

Yet the ratio between cheapest and most expensive still spans two orders of magnitude. A task that costs a fraction of a cent on Gemini Flash might cost twenty-five cents on Claude Opus 4.5. Same input, same output structure, same business logic, different model selection.

Most teams default to a single model for everything. They choose based on capability benchmarks or, more often, based on what the first engineer to touch the codebase happened to use. Then they discover, potentially months later, in a costs meeting, that 80% of their calls were classification tasks that any model could handle.

The information needed to make better decisions existed in every API response. It was simply never recorded.

There is a pattern here that extends beyond language models. Organizations consistently fail to measure costs at the point of decision. They measure in aggregate, at the end of the accounting period, when the money has already been spent and the code has already been written.

This is backwards. Cost information is most valuable before the decision, not after. The engineer choosing between GPT-4o and GPT-4o-mini needs to know, at that moment, what each option costs for this particular task, assuming of course that the cheaper model can perform the task as well. The product manager prioritizing features needs to understand the marginal infrastructure cost of each option. The architect designing a new pipeline needs visibility into which stages dominate the budget.

None of this requires sophisticated tooling. It requires only that someone write down the numbers.

I built a small library to do exactly this. It is perhaps 500 lines of Python. It stores data in SQLite: a single file, no infrastructure, no external dependencies. Each API call is logged with its token counts, its calculated cost, and whatever attribution metadata you care to attach: feature name, user ID, session, experiment cohort.

The implementation is trivial:

from llm_costs import track

track("gpt-4o", input_tokens=500, output_tokens=100, feature="chat")

One line. The token counts come from the API response you already receive. The feature name comes from your own code. The cost calculation is arithmetic. The storage is append-only writes to a local database.
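The arithmetic behind that cost calculation can be sketched in a few lines. This is a minimal sketch and not the library's actual code: the `PRICES` table, the `cost_usd` helper, the model names, and the per-million-token rates are all illustrative placeholders, so substitute your provider's current price sheet.

```python
from decimal import Decimal

# Hypothetical USD prices per one million tokens, as (input, output).
# Placeholder values for illustration only; check your provider's price sheet.
PRICES = {
    "example-large": (Decimal("2.50"), Decimal("10.00")),
    "example-small": (Decimal("0.15"), Decimal("0.60")),
}

MILLION = Decimal(1_000_000)


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one call: each token count times its per-token rate, summed."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / MILLION


# 500 input tokens and 100 output tokens on the hypothetical large model.
print(cost_usd("example-large", 500, 100))
```

Using `Decimal` rather than floats keeps sub-cent amounts exact when thousands of small calls are summed later.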

The value is not in the code. The value is in the data it produces.

The output token asymmetry deserves particular attention. Across all major providers, generating tokens costs substantially more than consuming them. This reflects the underlying economics: input processing is parallelizable, output generation is sequential.

The implication is that verbose responses are expensive responses. A model that returns a 500-word explanation costs roughly five times as much in output tokens as one that returns a 100-word summary. Yet most prompts do not specify length. They ask for “analysis” or “explanation” without constraint, then accept whatever verbosity the model provides.

Structured outputs — JSON rather than prose — typically reduce output tokens by 60–80%. The information content is identical. The cost is not.

Verbose: "Based on my analysis of the customer feedback provided, 
I believe the overall sentiment expressed is negative, primarily 
due to concerns about shipping delays and customer service 
responsiveness..."  (47 tokens, continuing for many more)

Structured: {"sentiment": "negative", "issues": ["shipping", "support"]}
(15 tokens, complete)

This is not optimization. This is basic hygiene. But it requires visibility into the data to recognize the opportunity.
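To put a number on the example above, here is a small sketch of the savings calculation. The helper name `output_savings` is my own, and the 47-token figure is a lower bound for the verbose answer, since that response continues past the excerpt.

```python
def output_savings(verbose_tokens: int, structured_tokens: int) -> float:
    """Fraction of output tokens saved by the structured response."""
    if verbose_tokens <= 0:
        raise ValueError("verbose_tokens must be positive")
    return 1 - structured_tokens / verbose_tokens


# Token counts taken from the sentiment example: 47 verbose, 15 structured.
saving = output_savings(verbose_tokens=47, structured_tokens=15)
print(f"{saving:.0%} fewer output tokens")
```

Even against the truncated verbose answer, the structured version saves about 68% of the output tokens, which sits inside the 60–80% range quoted earlier.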

The deeper problem is organizational. Engineering teams are evaluated on velocity and reliability. Cost efficiency is someone else’s concern, reviewed quarterly, in a different meeting, by different people. The feedback loop between “code that spends money” and “money that was spent” is measured in months.

Contrast this with performance optimization. When a page loads slowly, the engineer sees it immediately. The feedback is visceral, instantaneous. Slow code feels slow.

Expensive code feels like nothing at all. It executes in milliseconds. The response is correct. The user is satisfied. The expense is invisible until it becomes someone else’s problem.

Closing this feedback loop requires making cost visible at the point of creation. Not in a dashboard three clicks away. Not in a monthly report. In the logs, in the terminal, in the development environment where decisions are actually made.

The December 2025 pricing landscape offers more levers than ever. All major providers now offer batch processing at 50% discounts for non-urgent workloads. Prompt caching delivers 50–90% savings on repeated context — a system prompt sent with every request can be cached once and reused thousands of times. The cheapest capable models have dropped below $0.10 per million input tokens: Gemini 2.0 Flash, Cohere Command R7B, Groq’s hosted Llama variants.

But these savings require knowing where your tokens go. A team using prompt caching on 30% of their calls saves nothing if the other 70% — the expensive 70% — remain uninstrumented and unexamined.
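A rough sketch of why coverage matters: the blended input cost depends on how much of your traffic actually hits the cache. The function name, parameters, and the workload figures below are hypothetical, and real providers may price cache writes and reads differently.

```python
def effective_input_cost(
    base_cost: float,
    cached_fraction: float,
    cache_discount: float,
) -> float:
    """Input cost after discounting the cached share of the tokens.

    base_cost: what the input tokens would cost with no caching.
    cached_fraction: share of input tokens served from the cache (0 to 1).
    cache_discount: discount on cached tokens (0.5 to 0.9 per the text).
    """
    cached = base_cost * cached_fraction * (1 - cache_discount)
    uncached = base_cost * (1 - cached_fraction)
    return cached + uncached


# Hypothetical workload: $100 of input tokens, 30% cacheable at a 90% discount.
print(effective_input_cost(100.0, cached_fraction=0.30, cache_discount=0.90))
```

Under these assumptions the bill only drops from $100 to roughly $73, because the uncached 70% still pays full price.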

The tooling implications are straightforward. Every LLM call should be logged with:

The model used
Input and output token counts
Calculated cost
Attribution metadata (feature, user, session)
Timestamp

This is perhaps fifty lines of wrapper code around any SDK. The data should be queryable by dimension: cost by model, by feature, by user, by day. The queries are simple aggregations. The storage can be SQLite for small deployments, anything with SQL semantics for larger ones.
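As a minimal sketch of that wrapper, assuming nothing beyond Python's standard library: the table name `llm_calls`, the column names, and the helpers `open_db`, `log_call`, and `cost_by` are hypothetical and are not the schema of the library mentioned earlier. The costs in the usage lines are made-up numbers.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema covering the five fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    ts            TEXT NOT NULL,
    model         TEXT NOT NULL,
    input_tokens  INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cost_usd      REAL NOT NULL,
    feature       TEXT,
    user_id       TEXT,
    session_id    TEXT
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_call(conn, model, input_tokens, output_tokens, cost_usd,
             feature=None, user_id=None, session_id=None):
    """Append one row per API call; the caller supplies the computed cost."""
    conn.execute(
        "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), model, input_tokens,
         output_tokens, cost_usd, feature, user_id, session_id),
    )
    conn.commit()


def cost_by(conn, dimension: str):
    """Total cost grouped by one attribution column, most expensive first."""
    allowed = {"model", "feature", "user_id", "session_id"}
    if dimension not in allowed:  # never interpolate unchecked column names
        raise ValueError(f"unsupported dimension: {dimension}")
    query = (f"SELECT {dimension}, SUM(cost_usd) FROM llm_calls "
             f"GROUP BY {dimension} ORDER BY SUM(cost_usd) DESC")
    return conn.execute(query).fetchall()


conn = open_db()
log_call(conn, "example-large", 500, 100, 0.00225, feature="chat")
log_call(conn, "example-small", 800, 50, 0.00015, feature="classify")
log_call(conn, "example-large", 1200, 300, 0.006, feature="chat")
print(cost_by(conn, "feature"))
```

Passing a file path to `open_db` instead of the in-memory default gives you the single-file deployment described earlier.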

The organizational implications are harder. Someone must look at the data. Someone must have authority to act on what they find. Someone must care.

In my experience, the most effective intervention is not a dashboard or an alert. It is a weekly email to the engineering team: “Here is what we spent on LLMs this week. Here is the breakdown by feature. Here is the breakdown by model.” No commentary. No judgment. Just numbers.
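The body of such an email can be generated from the aggregated numbers. This is only a sketch: the function `weekly_digest`, its input shapes, and the figures in the usage lines are assumptions of mine, and sending the email is left out.

```python
def weekly_digest(total: float, by_feature: dict, by_model: dict) -> str:
    """Render a plain-text spend summary: numbers only, no commentary."""

    def section(title: str, costs: dict) -> list:
        rows = [f"{title}:"]
        # Most expensive entries first.
        for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
            share = cost / total if total else 0.0
            rows.append(f"  {name}: ${cost:.2f} ({share:.0%})")
        return rows

    lines = [f"LLM spend this week: ${total:.2f}", ""]
    lines += section("By feature", by_feature)
    lines.append("")
    lines += section("By model", by_model)
    return "\n".join(lines)


# Made-up weekly totals for illustration.
digest = weekly_digest(
    120.0,
    {"classify": 30.0, "chat": 90.0},
    {"example-large": 100.0, "example-small": 20.0},
)
print(digest)
```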

Engineers, given data, will optimize. They do not need to be told. They need to be informed.

There is a final point worth making about the nature of these costs. Unlike traditional infrastructure — servers, databases, bandwidth — LLM costs scale with usage, not capacity. You do not pay for what you provision. You pay for what you use.

This is both blessing and curse. The blessing: no idle capacity, no overprovisioning, perfect elasticity. The curse: no ceiling, no budget you can set and forget, no way to guarantee costs will not exceed some threshold.

The only defense is measurement. Continuous, granular, attributed measurement. Not because measurement alone solves the problem, but because measurement is the prerequisite to solving anything.

John Snow did not cure cholera. He identified the pump. The identification was the contribution. Everything else followed from seeing clearly what had previously been obscured.

The pump handle, in this case, is the API call. Remove the handle — which is to say, instrument the call — and the path forward becomes visible.

The code discussed in this post is available at https://github.com/levi-katarok/llm-costs. It is MIT licensed, dependency-free, and perhaps useful.

Simplifying RAG Context Windows with Conversation Buffers — How to Stop Your Agent Forgetting…

Levi Stringer — Mon, 08 Dec 2025 00:07:00 GMT

Simplifying RAG Context Windows — How to Stop Your Agent Forgetting

Building a RAG chatbot that works in a demo is straightforward. Building one that still works after twenty messages is a whole different story.

This post covers four independent solutions:

conversation buffers
document reordering
semantic caching
multi-agent orchestration

These solutions address what I’ve heard called the RAG memory wall, and the math is unforgiving.

Ten message exchanges at 300 tokens each gives you 3,000 tokens of history.
Add five retrieved chunks of documents at 500 tokens each for another 2,500.
Include your system prompt for 1,000 more.

You’re at 6,500 tokens before the model generates a single word. You’re hemorrhaging money before the conversation even starts!
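The same arithmetic as a quick sketch, so you can plug in your own numbers. The function name `prompt_tokens` and its parameters are hypothetical; the inputs are the figures from the text.

```python
def prompt_tokens(exchanges: int, tokens_per_exchange: int,
                  chunks: int, tokens_per_chunk: int,
                  system_prompt_tokens: int) -> int:
    """Tokens sent before the model generates anything."""
    history = exchanges * tokens_per_exchange
    retrieved = chunks * tokens_per_chunk
    return history + retrieved + system_prompt_tokens


# Figures from the text: 10 exchanges, 5 chunks, a 1,000-token system prompt.
total = prompt_tokens(10, 300, 5, 500, 1000)
print(total)  # 6500

# History is the only term that grows with the conversation.
for n in (10, 20, 40):
    print(n, prompt_tokens(n, 300, 5, 500, 1000))
```

Retrieval and the system prompt are a fixed 3,500 tokens per turn here, so every additional exchange adds another 300 tokens to each subsequent request.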

Prompt Context Overload

You may be thinking: who cares? We’ve got models with context windows so large that we could fit all of Homer’s The Odyssey and still have hundreds of thousands of tokens to spare. But large windows come with their own problems. LLMs exhibit what researchers call U-shaped attention: model performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access information in the middle, even for models explicitly designed for long contexts. More context means more middle, more places for your carefully retrieved documents to get ignored.

Companies running RAG at scale have figured this out.

DoorDash combined smart context management with guardrails and saw 90% fewer hallucinations.
LinkedIn added knowledge graphs to their support ticket system and cut resolution times by 28%.

That’s not incremental improvement, that’s a different product!

The patterns they use aren’t complicated — they just require thinking about context as a resource to manage rather than a bucket to fill.

Throughout this post we’ll walk through four independent solutions for managing RAG context windows. Each one addresses a different aspect of the problem, and you can implement any of them on their own or combine them based on your needs.

Solution 1: Compresses conversation history while preserving critical facts.

Solution 2: Reorders documents to combat attention degradation.

Solution 3: Adds semantic caching to avoid redundant work.

Solution 4: Distributes complex queries across multiple focused agents.

Start with whichever one addresses your most pressing problem.

Setup and Requirements

Node.js Environment: We’ll be using TypeScript throughout. Make sure you have Node.js 18+ installed. I recommend using a package manager like pnpm, but npm works fine.

API Keys: You’ll need an OpenAI API key (used for both the embeddings and the chat model). Store it in an environment variable and never hardcode it.



export OPENAI_API_KEY='your_key_here'

npm install openai

Solution 1: Conversation Summary Buffer

Problem: Your context window fills up as conversations get longer, forcing you to either truncate history (losing important early context) or pay for ever-larger context windows.

The core problem with long conversations is simple: you can’t keep everything. Sliding window approaches that keep the last N messages lose important early context. A user mentions they’re in Canada in message two, asks about shipping in message twelve — but your window discarded the location. Your chatbot just developed amnesia mid-conversation.

Conversation Summary Buffer

The solution is to maintain recent messages verbatim while summarizing older content, with explicit entity tracking for facts that must never be lost. This approach draws from Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models which demonstrated that recursively generating summaries to compress older dialogue while preserving key information significantly improves response consistency in long conversations.

Think of it like how you’d summarize a long meeting: you’d compress the general discussion into key points, but you’d make sure specific action items and names don’t get lost in the summary.

import OpenAI from "openai";

interface Message { role: "user" | "assistant"; content: string }

class ConversationSummaryBuffer {
  private client = new OpenAI();
  private messages: Message[] = [];
  private summary = "";
  // Facts that must survive summarization, tracked outside the summary.
  private entities = {
    ids: new Set<string>(),
    locations: new Set<string>(),
    products: new Set<string>()
  };

  constructor(private maxRecent = 6) {}

  async addMessage(role: "user" | "assistant", content: string) {
    this.messages.push({ role, content });

    // Extract entities before any compression
    content.match(/(?:account|order|ticket)[\s#:]*(\d{6,})/gi)?.forEach(id => {
      const num = id.match(/\d{6,}/)?.[0];
      if (num) this.entities.ids.add(num);
    });
    content.match(/(?:in|from|shipping to)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)/g)?.forEach(loc => {
      this.entities.locations.add(loc.replace(/^(in|from|shipping to)\s+/i, ""));
    });
    content.match(/\b(Pro|Enterprise|Basic|Premium)\b/gi)?.forEach(p =>
      this.entities.products.add(p.toLowerCase())
    );

    if (this.messages.length > this.maxRecent) await this.compress();
  }

  private async compress() {
    const old = this.messages.slice(0, -this.maxRecent);
    this.messages = this.messages.slice(-this.maxRecent);
    const text = old.map(m => `${m.role}: ${m.content}`).join("\n");

    const res = await this.client.chat.completions.create({
      model: "gpt-4o-mini",
      max_tokens: 500,
      messages: [{
        role: "user",
        content: `Summarize concisely:\n${text}\nPrevious: ${this.summary || "None"}`
      }],
    });
    this.summary = res.choices[0].message.content || "";
  }

  buildContext(docs: string[]): Message[] {
    const parts = ["You are a helpful assistant."];
    if (this.summary) parts.push(`\n\nHistory:\n${this.summary}`);

    const active = Object.entries(this.entities).filter(([, s]) => s.size > 0);
    if (active.length) {
      parts.push(`\n\nKey info:\n${active.map(([k, s]) =>
        `- ${k}: ${[...s].join(", ")}`
      ).join("\n")}`);
    }

    if (docs.length) parts.push(`\n\nDocs:\n${docs.join("\n\n---\n\n")}`);

    return [
      { role: "user", content: parts.join("") },
      { role: "assistant", content: "I understand." },
      ...this.messages
    ];
  }
}

The key insight here is separating what can be compressed (general conversation flow) from what cannot (specific identifiers, locations, product references). The entity extraction is deliberately simple: a handful of regular expressions run on each message before any compression happens.

When to use this: Implement this solution when you’re seeing quality degradation in longer conversations, when users report the chatbot “forgetting” things they mentioned earlier, or when your token costs scale linearly with conversation length.

Solution 2: Strategic Document Ordering

Problem: Your retrieved documents are being ignored even when they contain the right information, because they’re landing in the middle of your context where the model pays less attention.

Even with perfect context management, you can still lose information to the U-shaped attention problem. The foundational research here is Lost in the Middle: How Language Models Use Long Contexts.

They demonstrated that LLM performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access information in the middle, even for models explicitly designed for long contexts.

Reordering Documents

Follow-up research in Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization confirmed this is due to an intrinsic attention bias: LLMs exhibit a U-shaped attention pattern where tokens at the beginning and end of input receive higher attention regardless of their relevance to the query.

To me, the practical implication is clear: if your most relevant retrieved document happens to land in the middle of your context, the model might not give it the weight it deserves. My fix is straightforward: put your best documents at the beginning, second-best at the end, and weakest in the middle. Given the U-shaped pattern described above, this gives your most relevant retrievals the best chance of actually influencing the response.

interface ScoredDoc { content: string; score: number }

function reorderForAttention(docs: ScoredDoc[]): string[] {
  const sorted = [...docs].sort((a, b) => b.score - a.score);

  // Deal documents alternately to the front and the back so the best lands
  // first, the second-best last, and the weakest end up in the middle.
  const front: ScoredDoc[] = [];
  const back: ScoredDoc[] = [];
  sorted.forEach((doc, i) => {
    if (i % 2 === 0) front.push(doc);
    else back.unshift(doc);
  });
  return [...front, ...back].map(d => d.content);
}

For cases where you have clearly primary vs. supplementary documents, explicit sections work even better. The headers aren’t just for human readability — they help the model understand the relative importance of different sections.

function buildStructuredContext(
  primary: string[],
  secondary: string[]
): string {
  let context = "## Primary Sources\n\n" + primary.join("\n\n---\n\n");

  if (secondary.length) {
    context += "\n\n## Additional Context\n\n" + secondary.join("\n\n---\n\n");
  }
  return context;
}

When to use this: This solution is worth implementing when you’re retrieving multiple documents per query and notice that highly relevant information sometimes gets overlooked. It’s a low-effort change — just a reordering step — that can noticeably improve response quality without any additional API calls or infrastructure.

In short, put your worst docs where the model won’t notice them anyway!

Solution 3: Semantic Caching

Problem: You’re paying for retrieval and generation repeatedly for queries that are semantically identical, just phrased differently.

Many RAG queries mean the same thing even when they use different words. “What’s your refund policy?” and “How do I get my money back?” Same question, same answer, but you just paid twice!

The paper GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching formalized this approach: by storing embeddings of user queries, systems can efficiently identify semantically similar questions and retrieve pre-generated responses without redundant API calls. The technique achieves notable reductions in operational costs while significantly enhancing response times.

Semantic Caching

More recent work in Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data explored how specialized, fine-tuned embedding models can further improve cache effectiveness. They note that semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency, but I think it’s worth the tradeoffs.

The concept is similar to traditional caching, but instead of checking if two strings are identical, we check if their vector embeddings are similar enough to be considered the same question.

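import { createHash } from "crypto";
import OpenAI from "openai";

interface CacheEntry { response: string; embedding: number[] }

class SemanticCache {
  private openai = new OpenAI();
  private cache = new Map<string, CacheEntry>();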
  private stats = { hits: 0, misses: 0 };

  constructor(private threshold = 0.92) {}

  private async embed(text: string) {
    const res = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: text
    });
    return res.data[0].embedding;
  }

  private cosine(a: number[], b: number[]) {
    const dot = a.reduce((s, v, i) => s + v * b[i], 0);
    const magA = Math.sqrt(a.reduce((s, v) => s + v * v, 0));
    const magB = Math.sqrt(b.reduce((s, v) => s + v * v, 0));
    return dot / (magA * magB);
  }

  async get(query: string): Promise<string | null> {
    const emb = await this.embed(query);
    let best = { score: 0, response: "" };

    for (const c of this.cache.values()) {
      const sim = this.cosine(emb, c.embedding);
      if (sim > best.score && sim >= this.threshold) {
        best = { score: sim, response: c.response };
      }
    }

    if (best.response) { this.stats.hits++; return best.response; }
    this.stats.misses++;
    return null;
  }

  async set(query: string, response: string) {
    const hash = createHash("sha256").update(query).digest("hex");
    this.cache.set(hash, { response, embedding: await this.embed(query) });
  }
}

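A few implementation notes worth highlighting. The threshold of 0.92 is conservative, yes. But nobody ever got fired for a cache that was too careful. Also note that every lookup embeds the query and scans every cached entry, which is fine for a small cache but would need a vector index at larger sizes.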

When to use this: Implement semantic caching when you have repetitive query patterns (common in customer support), when latency is a concern, or when you’re looking to reduce API costs. It pairs well with Solution 1 — the summary buffer manages conversation state while the cache handles repeated questions.

Solution 4: Multi-Agent Architecture

Problem: Complex queries that span multiple domains produce shallow answers because cramming all the relevant context into a single call overwhelms the model’s ability to reason effectively.

Some queries need information from multiple sources. “Compare our enterprise pricing to competitors and summarize recent customer feedback” requires pulling from pricing docs, competitor analysis, and support tickets. If you retrieve all of that and stuff it into one context window, you get shallow answers that don’t do justice to any of the sources.

The alternative is to spawn specialized sub-agents, each with their own focused context window, then synthesize their findings. Think of it like delegating research tasks to a team: each person goes deep on their area of expertise, then you bring everyone together to synthesize insights.

Simple Multi-Agent Design

This approach is part of a broader trend toward “Agentic RAG.” Singh et al.’s 2025 survey Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG provided a comprehensive exploration of how autonomous AI agents can be embedded into RAG pipelines, leveraging agentic design patterns like reflection, planning, tool use, and multi-agent collaboration to dynamically manage retrieval strategies and adapt workflows to meet complex task requirements. Pretty cool!

You wouldn’t ask one person to be an expert in pricing, competitive intel, and customer sentiment. Why ask that of one context window?

More specifically, A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data proposed a multi-agent RAG system where specialized agents, each optimized for a specific data source, handle query generation for relational, NoSQL, and document-based systems. The authors demonstrated that this approach becomes more efficient than single-agent architectures when dealing with diverse data sources.

Recent work in MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning showed that orchestrating a collaborative set of specialized agents — Planner, Step Definer, Extractor, and QA Agents — each responsible for a distinct stage of the RAG pipeline, significantly outperforms standalone LLMs and existing RAG methods on multi-hop and ambiguous question-answering benchmarks.

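import OpenAI from "openai";

// A retriever takes a subquery and returns the matching document texts.
type Retriever = (query: string) => Promise<string[]>;

class MultiAgentRAG {
  private client = new OpenAI();

  async orchestrate(query: string, retrievers: Record<string, Retriever>) {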
    // Step 1: Plan - decompose query into subtasks
    const planRes = await this.client.chat.completions.create({
      model: "gpt-4o",
      max_tokens: 500,
      messages: [{
        role: "user",
        content: `Break down for retrieval.
Query: ${query}
Sources: ${Object.keys(retrievers).join(", ")}
Respond JSON: {"subtasks": [{"source": "...", "subquery": "..."}]}`
      }],
    });

    const planText = planRes.choices[0].message.content || "{}";
    const plan = JSON.parse(planText.match(/\{[\s\S]*\}/)?.[0] || '{"subtasks":[]}');

    // Step 2: Execute subtasks in parallel
    const results = await Promise.all(
      plan.subtasks.map(async (t: { source: string; subquery: string }) => {
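        // The planner may name a source we don't have; fall back to no docs.
        const retrieve = retrievers[t.source];
        const docs = retrieve ? await retrieve(t.subquery) : [];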

        const res = await this.client.chat.completions.create({
          model: "gpt-4o-mini",
          max_tokens: 300,
          messages: [{
            role: "user",
            content: `Answer: ${t.subquery}\n\nDocs:\n${docs.join("\n\n")}`
          }],
        });

        return {
          source: t.source,
          findings: res.choices[0].message.content || ""
        };
      })
    );

    // Step 3: Synthesize findings into final answer
    const synthRes = await this.client.chat.completions.create({
      model: "gpt-4o",
      max_tokens: 1000,
      messages: [{
        role: "user",
        content: `Question: ${query}

Findings:
${results.map(r => `**${r.source}**:\n${r.findings}`).join("\n\n")}

Synthesize a comprehensive answer.`
      }],
    });

    return {
      answer: synthRes.choices[0].message.content || "",
      plan,
      results
    };
  }
}

Usage Example

const retrievers = {
  pricing: async (q: string) => ["Enterprise: $99/mo", "Pro: $49/mo"],
  competitors: async (q: string) => ["Competitor A: $120/mo"],
  feedback: async (q: string) => ["92% satisfaction", "NPS +15"],
};

const rag = new MultiAgentRAG();
const { answer } = await rag.orchestrate(
  "Compare enterprise pricing to competitors and summarize feedback",
  retrievers
);

This architecture uses significantly more tokens than a single-agent approach, roughly 15x more for a complex query with multiple subtasks. That’s a real tradeoff: you’re paying more per query to get substantially better answers on questions that would otherwise produce shallow or incomplete responses.

When to use this: Reserve multi-agent orchestration for queries that genuinely span multiple domains. Simple factual questions don’t need it and would just waste tokens. The pattern shines when users ask analytical questions that require synthesizing information from different sources — comparing products across multiple dimensions, investigating issues that span documentation and support history, or generating reports that need multiple data sources.

Wrapping Up

Context management separates RAG demos from production systems. The solutions here — conversation buffers, strategic document ordering, semantic caching, multi-agent architectures — each address a different aspect of the problem. You don’t need all of them, and you don’t need to implement them in any particular order.

If your main issue is conversations getting worse over time, start with Solution 1. If you’re retrieving good documents but the model seems to ignore them, try Solution 2. If you’re seeing repetitive queries driving up costs, Solution 3 will help. If complex questions consistently get shallow answers, consider Solution 4.

Your users don’t get less demanding on message twenty. Your RAG shouldn’t get less capable! Start with measuring your baseline, implement the solution that addresses your most pressing problem, and add others as needed. Each piece works independently, and you can stop wherever your requirements are met.

Troubleshooting Common Issues

Here are issues you’ll likely encounter while implementing these solutions.

Token Counting Confusion: Different models tokenize differently. Claude and GPT don’t produce identical token counts for the same text. When setting buffer limits, leave headroom — if your limit is 8,000 tokens, aim to stay under 7,000.

Entity Extraction False Positives: The regex patterns in Solution 1 are intentionally simple. You’ll extract some noise. For production, consider using a small model to extract entities more accurately, or tune the patterns to your specific domain.

Cache Threshold Tuning: The semantic cache in Solution 3 uses a 0.92 similarity threshold. Too high and you’ll miss valid cache hits. Too low and you’ll serve wrong answers. One banking chatbot reported reducing false positives from 99% to 3.8% by focusing on cache design over threshold tuning — start conservative and adjust based on your data.

Summarization Hallucinations: When compressing conversation history, the summarization model can occasionally invent details. The entity extraction in Solution 1 exists partly to catch this — critical facts get preserved separately from the summary.

References

Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://arxiv.org/abs/2307.03172
Hsieh, C.Y., et al. (2024). Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. https://arxiv.org/abs/2406.16008
Wang, Q., et al. (2023). Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models. https://arxiv.org/abs/2308.15022
Maharana, A., Lee, D.H., Tulyakov, S., & Bansal, M. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents. https://arxiv.org/abs/2402.17753
Regmi, S., et al. (2024). GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching. https://arxiv.org/abs/2411.05276
Gill, W., et al. (2025). Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data. https://arxiv.org/abs/2504.02268
Singh, A., et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. https://arxiv.org/abs/2501.09136
Nguyen, T., et al. (2025). MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning. https://arxiv.org/abs/2505.20096
Salve, A., et al. (2024). A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data. https://arxiv.org/abs/2412.05838

How I Built a Local AI-Powered Mint Alternative

Levi Stringer — Wed, 26 Nov 2025 03:55:15 GMT

When Canada’s best free budgeting app shut down, I spent a weekend building my own. Here’s what happened.

Mint shut down in Canada about a year and a half ago.

The email was brief: “We’re discontinuing Mint. Try Credit Karma instead.”

I tried Credit Karma. It was mostly a credit card sales pitch with some budgeting features tacked on. Pass.

So I looked at the alternatives. YNAB was $100/year USD. Monarch was beautiful but expensive. PocketSmith had a limited free tier. For Americans, there are dozens of options. For Canadians? Our fintech has always been an afterthought. Plaid barely works with our banks. Most apps don’t support Interac. We’re not a priority market.

I’m an AI engineer, and I’d been curious about these new local language models — small enough to run on a laptop, supposedly capable enough for real tasks. Transaction categorization seemed like a good test case. I think small models like this are the future.

Could I build something that worked?

What I Actually Needed

Here’s what I used Mint for: automatic transaction categorization.

That’s it. Not budgeting advice. Not “ways to save” recommendations. Just: tell me where my money went this month so I can see the pattern.

What I wanted:

Works with bank exports (PDF/CSV from TD, RBC, Scotiabank — whatever my bank spits out)
Automatically categorizes transactions
Shows spending trends over time
Spots anomalies (duplicate charges, subscription increases)
Runs locally on my computer

Dashboard

The Local AI Part

I started with SmolLM3-3B, a 3GB language model from HuggingFaceTB. It’s free and runs on my MacBook Pro.

First test: 50 transactions from my credit card statement.

Result: 48 correct, 2 edge cases (a coffee shop inside a bookstore got tagged as “shopping” instead of “food”). Still needed a bit of prompt tuning.

The model understood things I never taught it:

FIZZ MOBILE → Utilities

UBER TRIP vs UBER EATS → Transportation vs Food

TTC PRESTO → Transit

LCBO #482 → Alcohol

INTERAC E-TRANSFER → Transfer (not an expense)

Processing speed: 50–100ms per transaction. Fast enough.

I’ve since added support for other models too: OpenAI and Anthropic if you want cloud, or various Ollama models locally. The system is model-agnostic.
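To make “model-agnostic” concrete, here is a minimal sketch of one way to structure it, not the app’s actual code. The category list, the `Complete` type, and the function names are assumptions for illustration. The only real API used is Ollama’s local `/api/generate` endpoint, and the model tag passed to it is a placeholder.

```typescript
// Illustrative category list; the app's real categories are not given in the article.
const CATEGORIES = [
  "Food", "Transportation", "Transit", "Utilities",
  "Alcohol", "Shopping", "Transfer", "Other",
] as const;
type Category = (typeof CATEGORIES)[number];

// Any backend (Ollama, OpenAI, Anthropic) only has to turn a prompt into text.
type Complete = (prompt: string) => Promise<string>;

function buildPrompt(description: string): string {
  return [
    "Categorize this bank transaction.",
    "Allowed categories: " + CATEGORIES.join(", ") + ".",
    "Reply with the category name only.",
    "Transaction: " + description,
  ].join("\n");
}

// Models sometimes add whitespace or punctuation, so match the reply against
// the allowed list and fall back to "Other" instead of trusting free text.
function parseCategory(reply: string): Category {
  const normalized = reply.trim().toLowerCase().replace(/^[^a-z]+/, "");
  for (const category of CATEGORIES) {
    if (normalized.indexOf(category.toLowerCase()) === 0) return category;
  }
  return "Other";
}

async function categorize(description: string, complete: Complete): Promise<Category> {
  return parseCategory(await complete(buildPrompt(description)));
}

// Hypothetical Ollama backend. Assumes Ollama is listening on its default port
// and that `model` is a tag you have already pulled.
function ollamaBackend(model: string): Complete {
  return async (prompt) => {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    const data = (await res.json()) as { response: string };
    return data.response;
  };
}
```

Swapping providers then means writing another function of type `Complete`; the prompt building and reply parsing stay the same.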

The Build

Simple tech stack:

Bun + Elysia backend (TypeScript)
PostgreSQL with Drizzle ORM
Ollama to run the AI model locally
React/Next.js frontend with Recharts

Over two weeks of evenings, most of the time went into tweaking prompts to improve categorization accuracy and handling different bank statement formats.

Minimal by Design

Most finance apps try to do everything. Budgets, goals, savings challenges, investment tracking, credit score monitoring, “insights” that are really just ads.

I wanted the opposite: show me my spending, let me ask questions, get out of my way.

The dashboard is one screen. Category breakdown on the left, trend chart on the right. No gamification. No badges. No ads. No credit card promos. No “you saved $12 this month!” notifications. You get the idea.

Document Processing

Ask Questions in Plain English

I can ask: “How much did I spend on coffee last summer?”

The system parses the intent, queries my transaction database, and returns an answer with a suggested chart. It understands relative time (“last month”, “this year”, “last 6 months”) and fuzzy category matching.
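The relative-time part is the most mechanical piece of this, so a small sketch may help. This is an assumed approach, not the app’s code: `resolveRelativeRange` is a hypothetical helper that maps a handful of phrases to a half-open UTC date range, and it interprets “last 6 months” as the six full calendar months before the current one. Phrases it does not recognize, such as “last summer”, return null and would need extra rules or the model’s help.

```typescript
interface DateRange {
  start: Date; // inclusive
  end: Date;   // exclusive
}

// First instant of a month in UTC. Date.UTC rolls over out-of-range months,
// so month -1 becomes December of the previous year.
function monthStart(year: number, month: number): Date {
  return new Date(Date.UTC(year, month, 1));
}

function resolveRelativeRange(phrase: string, now: Date): DateRange | null {
  const p = phrase.trim().toLowerCase();
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth();

  if (p === "this month") return { start: monthStart(y, m), end: monthStart(y, m + 1) };
  if (p === "last month") return { start: monthStart(y, m - 1), end: monthStart(y, m) };
  if (p === "this year") return { start: monthStart(y, 0), end: monthStart(y + 1, 0) };
  if (p === "last year") return { start: monthStart(y - 1, 0), end: monthStart(y, 0) };

  // "last N months": the N full calendar months before the current month.
  const lastN = /^last (\d+) months$/.exec(p);
  if (lastN) {
    const n = Number(lastN[1]);
    return { start: monthStart(y, m - n), end: monthStart(y, m) };
  }

  return null; // unrecognized phrase
}
```

The resulting range can go straight into a query of the form `date >= start AND date < end`, which avoids off-by-one errors at month boundaries.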

Talk with your finances. Locally

Financial Time Machine

Upload two years of statements and it finds patterns I never noticed:

Seasonal spending (I spend 40% more in December — not just gifts, but “treating myself” purchases too)
Subscription creep over time
Lifestyle changes (“You started ordering delivery regularly in March 2023”)

It can compare arbitrary time periods: “How does Q1 2024 compare to Q1 2023?” with category-level breakdowns showing what changed.

Finance Time Machine

Smart Transfer Detection

Mint always counted my checking-to-savings transfers as spending. The AI understands “TRANSFER TO SAVINGS” and “INTERAC E-TRANSFER to Mom” aren’t expenses. It needed a few manual rules and tweaks, but the power was in my hands.
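The manual-rules side of this can be very small. A sketch, with patterns invented from the two examples above (the real rule set is not shown in this post), and with `isTransfer` and `totalSpendingCents` as hypothetical names:

```typescript
// Hypothetical rule patterns; extend these per bank as new formats show up.
const TRANSFER_PATTERNS: RegExp[] = [
  /\bTRANSFER TO (SAVINGS|CHEQUING|CHECKING)\b/i,
  /\bINTERAC E-?TRANSFER\b/i,
];

function isTransfer(description: string): boolean {
  return TRANSFER_PATTERNS.some((pattern) => pattern.test(description));
}

interface Txn {
  description: string;
  amountCents: number; // positive = money leaving the account
}

// Spending total that leaves transfers out, so moving money between
// your own accounts does not inflate the monthly number.
function totalSpendingCents(txns: Txn[]): number {
  return txns
    .filter((t) => !isTransfer(t.description))
    .reduce((sum, t) => sum + t.amountCents, 0);
}
```

Running rules like these before the model sees a transaction keeps the obvious cases deterministic and leaves the model for the ambiguous ones.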

What Five Years of Data Revealed

Once it worked, I uploaded every bank statement I could find. Five years, roughly 5,000 transactions.

Processing took all of 8 minutes!

What I learned:

My grocery spending varies wildly. I thought: “I spend about $400/month on groceries.” Reality: anywhere from $280 to $650, with no consistent pattern.

Subscription creep. 2020: 4 subscriptions, $38/month. 2024: 11 subscriptions, $127/month. Added one at a time, so I never noticed the total. Lifestyle creep is real!

Small recurring charges. $4.85 coffee × 3 days/week = $756.60/year.

The Friday pattern. Spending consistently spiked $40–60 every Friday. A few too many nights of “screw it, let’s order dinner.”
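The arithmetic behind those numbers is worth making explicit, since small amounts are easy to wave away. A quick sketch, working in cents to avoid floating-point drift; the function names are illustrative:

```typescript
// Annual cost of a small charge that recurs a few times a week.
function annualCostCents(priceCents: number, timesPerWeek: number, weeksPerYear: number = 52): number {
  return priceCents * timesPerWeek * weeksPerYear;
}

// Annual cost of a monthly charge.
function annualFromMonthlyCents(monthlyCents: number): number {
  return monthlyCents * 12;
}

function formatDollars(cents: number): string {
  return "$" + (cents / 100).toFixed(2);
}

// $4.85 coffee, 3 days a week
const coffeePerYear = annualCostCents(485, 3); // 75660 cents

// Subscriptions grew from $38/month to $127/month
const extraSubsPerYear = annualFromMonthlyCents(12700 - 3800); // 106800 cents

console.log(formatDollars(coffeePerYear));    // $756.60
console.log(formatDollars(extraSubsPerYear)); // $1068.00
```

By this arithmetic, the subscription creep alone works out to a bit over a thousand dollars a year more than in 2020.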

Local AI Lessons

What I learned about running AI models locally:

They’re capable enough. A 3GB model gets 95%+ accuracy on transaction categorization. You don’t need large models for this.

They’re fast. Faster than cloud APIs when processing batches. No network latency.

They understand context. Better than keyword matching. Knows “AMZN KINDLE” is different from “AMZN FRESH.”

They work offline. Download the model once, process statements anywhere.

The Manual Upload Flow…

See all your financial documents in one place

No automatic bank syncing means I do have to download statements manually each month. Takes 30 seconds per account.

I think this turned out to be a feature, not a bug.

When Mint auto-synced, I’d check once a month, glance at charts, close the app. Now I have to download the files, which means I actually see the transactions. Notice things. Ask questions. In the future, an automated email reminder would also be nice.

Why $240 at Home Depot? (That shelf project I never finished.) Why is my phone bill $65 now? (Price increase I missed.)

The friction makes me pay attention.

Plus, in Canada, automatic bank syncing is unreliable anyway. Plaid support is spotty. Banks break connections constantly. The PDF workflow is simple but it just… works.

What Actually Matters

Building this taught me something: the AI isn’t the valuable part.

The valuable part is looking at your spending. Really looking at it.

The local AI makes the tedious parts faster — categorizing hundreds of transactions, spotting patterns across years, flagging anomalies. But I still have to upload statements, review the results, think about what they mean.

That’s what matters.

Mint’s shutdown forced me to pay attention again. Building this tool forced me to pay even more attention — to both my spending and what local AI can actually do.

Worth a couple late nights.

Right now this runs on my laptop. If there’s enough interest from other Canadians stuck in post-Mint limbo, I might turn it into something others can use — either self-hosted or as a proper app.

If you’d want early access, shoot me a message!

What It Takes to Start an AI Consulting Corporation in Canada

Levi Stringer — Sun, 23 Nov 2025 03:08:39 GMT

In May 2023, after spending six months traveling and thinking about what I wanted next, I incorporated my AI engineering consulting practice in Canada. I wanted to challenge myself and see if there was more out there for me.

Here’s what I learned spending about a week and about $500 setting everything up, plus the decisions I’d make differently knowing what I know now.

Important Disclaimer: I’m an AI consultant, not a lawyer, accountant, or financial advisor. This post describes my personal experience incorporating in Canada in 2023. Laws, regulations, and tax rules change frequently and vary by province. Don’t use this as legal or financial advice — consult with actual professionals (lawyer, accountant, or CPA) before making decisions about your business.

Minimum Viable List

Before we dive into the details, here is a shortlist of everything you need to incorporate and get up and running:

Legal & Incorporation

[ ] NUANS name search (or accept a numbered company like I did)
[ ] Federal or provincial incorporation filing
[ ] Corporate minute book kit
[ ] Business insurance (I use Zensurance)
[ ] Contract templates (I use LawDepot)

Banking & Financial

[ ] Business bank account
[ ] International payment solution (if needed)
[ ] Accounting software
[ ] Bookkeeper (I found mine on Upwork)
[ ] Accountant for year-end (also found on Upwork)

Operations

[ ] Time tracking software (I use Harvest)
[ ] Invoicing system
[ ] Client contracts and SOW templates

Total timeline: ~1 week from filing to reaching out to clients

Total setup cost: $500–1,000

How to start a business in 7 days.

Federal vs Provincial: I Went Federal (Here’s Why)

Everyone debates this. Provincial is cheaper upfront ($200 vs $300 federal), keeps your ownership private, and works fine if you’re only in Ontario.

I went federal anyway.

The reason: I’m an AI consultant working with tech companies. Many are in BC, some are in Alberta, and most are in the US. Having a federal incorporation just felt more professional when talking to clients across the country.

There are some practical benefits: automatic name protection Canada-wide (no one else can register my corporate name, not that it mattered in my case anyways…), and if I ever want to work physically in another province, I don’t need extra paperwork. Federal costs $200 to file online, processes in one business day, and the annual return is only $12.

The downsides can be real, though. Federal corporations now file public shareholder registers, so technically anyone can look up who owns your company. That wasn’t something I was concerned about. The name approval process also means waiting for a government examiner to review a proposed name, which takes three business days instead of same-day like Ontario. Again, probably not a big deal.

The Numbered Company Name Mistake (Learn From Me)

Here you are, ready to register your company. You’ve probably already got a name! Very exciting! But here’s something nobody tells you until it’s too late: if you just click through all the registration forms, you may reach the end and get assigned a numbered company name.

That’s how I ended up as 1506596–8 Canada Inc.…

I didn’t realize I needed to have my business name sorted out before filing. I must have missed it during the incorporation process when it asked for my corporate name. If you’re not ready with one that’s been approved through a NUANS search, they just give you a number.

It’s not the end of the world: you can operate under a trade name or “doing business as” (DBA) name. But it looks less professional on invoices, contracts, and bank accounts. When clients see a numbered company, it sometimes raises questions.

If I could do it over, I’d spend the extra day doing the NUANS name search ($13.80) and have a proper business name from day one. Or I’d just embrace the numbered company from the start and operate under a clean trade name.

Lesson: Don’t rush the incorporation. Get your business name figured out first. It kind of matters!

The Incorporation Process Takes One Business Day (Federal Edition)

I incorporated online through the Corporations Canada website. The actual filing took about an hour. You need:

Your proposed corporate name (or accept a numbered company like I did)
Articles of Incorporation specifying your share structure. You can also just do this within the process.
Your registered office address (can be your home, but must include the province)
Director information (just me, and since I’m a Canadian resident, I met the 25% Canadian director requirement)

For the share structure, I kept it simple: unlimited common shares. You don’t need complicated Class A and Class B share structures unless you have specific tax planning or ownership arrangements. Most solo consultants can just go with a simple unlimited common share structure.

The system is pretty straightforward once you figure out the interface. I submitted everything in the morning, and got my Certificate of Incorporation by email the next business day.

Very official looking

What I didn’t realize until much later: you immediately need a corporate minute book. This is legally required under the Canada Business Corporations Act and contains your incorporation documents, by-laws, shareholder registers, and meeting minutes. I bought a pre-made kit for $250 rather than having a lawyer prepare everything for $1,500+.

The organizational resolutions were straightforward: approve the by-laws, appoint myself as president, authorize banking, issue my shares to myself, set my fiscal year-end. Took maybe an hour with the kit templates.

Lawyer Consultation (Optional but Worth Considering):

I didn’t do a lawyer consultation during incorporation, and the process was fine without one. I do have friends who have started corporations with friends, and they strongly recommend starting with a lawyer and doing it right. Having a corporate lawyer review your setup can be valuable, especially for:

Understanding salary vs dividend strategies for your specific situation
Fiscal year-end optimization
What triggers CRA audits
Complex ownership structures if you have co-founders or investors

If you’re doing a straightforward solo incorporation, you can probably skip it. If you want the advice, budget $300–600 for 1–2 hours of consultation, ideally focused on year-end tax planning rather than the incorporation mechanics themselves. You can also do tax planning with a CPA.

Total incorporation cost: $200 filing fee + $250 minute book = $450 (or $464 if you do the NUANS name search).

Banking: Why I Went Old School (RBC)

Everyone talks about Float, Venn, and other fintech solutions with high interest rates and no fees. I looked at all of them.

But in the end I walked into RBC and opened a plain old business account.

My thought process: I was starting my first corporation. I had questions. Lots of them. What documents do I actually need? How do I collect US dollars from clients? How can I get a company credit card? Can I integrate with accounting software?

The banker sat with me for 45 minutes and walked through everything. Showed me how their online banking worked for businesses. Explained the different account types. Answered my dumb questions about corporate vs personal accounts. Ultimately everything is basically the same as your chequing account.

One downside is that the interest rate is basically zero, which sucks, especially if you pay yourself in dividends and have to set aside substantial amounts of cash for corporate tax time. The accounts also carry fees, which is annoying. But the peace of mind early on was valuable. Now that I understand how everything works, would I switch to Float or another high-interest account? Maybe. But I don’t regret starting with a real bank.

For international clients (I have several) and for accepting USD, I opened a Wise Business account. No monthly fees, and foreign exchange is about 90% cheaper than at RBC. When a US client pays me, I’m not losing 2.5% on the conversion. That saves thousands each year.

My banking setup: RBC for Canadian operations ($12/month), Wise for international payments (free monthly, pay per transaction). Total monthly cost: $12.

Having both means I can accept payments however clients want to send them, convert currencies cheaply, and have someone to call when I need help. Good enough.

The Tools That Helped Me Out

QuickBooks — Financials

Handles everything I need:

Invoicing clients
Tracking expenses
Recording transactions
Calculating HST/GST
Generating year-end tax forms

But truthfully the main reason I went with QuickBooks? It’s incredibly easy to find talented bookkeepers who already know the software. When I hired a bookkeeper on Upwork, every single qualified candidate knew QuickBooks. That means less training time and they can hit the ground running.

Bookkeeping

I once spent 8 hours pulling my hair out trying to reconcile my books, and categorize rough transactions. I eventually realized this is not what I was put on this earth for and promptly hired a bookkeeping service. My bookkeeper charges around $100–200/month depending on transaction volume, which includes categorizing expenses, reconciling accounts, and preparing everything for my accountant at year-end.

Harvest ($12/month) — Time Tracking

Harvest is purpose-built for consultants who bill hourly or need detailed time records. It makes hourly billing straightforward: you can track time by project and client, then create invoices directly from your time entries. Way cleaner than trying to do time tracking inside accounting software.

LawDepot — Contracts and Legal Documents

For me LawDepot was great as it provides Canadian-specific templates for service agreements, NDAs, independent contractor agreements, and more. Much cheaper than paying a lawyer $500–1000 per contract, and the templates are solid for standard consulting work.

Total monthly cost with software and bookkeeping: around $220–280. Total time saved: probably 15 hours per month not doing manual bookkeeping, paperwork, and payroll calculations.

Business Insurance (Don’t Skip This)

One thing I was glad I did early: getting business insurance through Zensurance.

As a consultant, you’re exposed to professional liability risk. What if your code recommendation causes a client to lose money? What if there’s a security issue with something you built? What if a client claims you missed a deadline that cost them a deal?

Professional liability insurance (also called Errors & Omissions or E&O insurance) protects you against these claims. General liability covers things like if you spill coffee on a client’s laptop during a meeting.

I pay about $600–700 annually for coverage. I compared several quotes, and Zensurance made it easy: I filled out a quick online form about my business, got quotes from multiple insurers, and was covered within a day.

Many larger clients (especially US companies) won’t sign contracts without proof of insurance. Having it ready means you don’t lose opportunities because you can’t provide a certificate of insurance quickly.

There are places where you can go ultra-budget, such as time tracking. Before I had Harvest, I used Clockify (free) for the first few months. Unlimited time tracking, unlimited projects, zero cost. Worked perfectly fine.

The Salary vs Dividend Decision Isn’t What You Think

Every online guide says the same thing: dividends are more tax-efficient. Pay yourself dividends, not salary.

Disclaimer: This is what worked for my situation based on conversations with my accountant. Your optimal salary/dividend split depends on your income level, retirement goals, province, and personal financial situation. Talk to a CPA or tax accountant about your specific circumstances — don’t just copy my numbers.

Here’s what they don’t tell you: most accountants actually recommend salary for Canadian consultants. When I talked to several accountants, they pretty much all said the same thing: salary is often better than the internet makes it sound.

Here’s why: salary builds CPP retirement benefits. Salary creates RRSP contribution room. Salary helps you qualify for mortgages. Dividends do none of these things.

After running the actual numbers with my accountant, I pay myself $75,000 in salary annually, paid quarterly. This puts about $6,000 into CPP (which I’ll get back in retirement), creates $13,500 in RRSP room, and keeps my quarterly payroll remittances manageable rather than dealing with monthly paperwork.

Everything above that $75,000 I take as dividends for flexibility. Need extra cash for a big expense? Declare a dividend. Slow client month? Skip it. No payroll calculations required.
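The RRSP figure follows from the standard rule that contribution room is 18% of earned income, capped at an annual limit. A minimal sketch of that arithmetic and of the salary-first split described above. The function names are illustrative, the cap is passed in rather than hard-coded, and the 30,780 in the example is assumed to be the 2023 limit and should be verified with CRA.

```typescript
const RRSP_RATE = 0.18; // share of earned income that becomes RRSP room

// New RRSP room created by a year's salary, capped at that year's limit.
function rrspRoom(earnedIncome: number, annualLimit: number): number {
  return Math.min(Math.round(earnedIncome * RRSP_RATE), annualLimit);
}

interface Compensation {
  salary: number;
  dividends: number;
}

// Salary first, up to a target; anything beyond that is taken as dividends.
function splitCompensation(totalDraw: number, salaryTarget: number): Compensation {
  const salary = Math.min(totalDraw, salaryTarget);
  return { salary, dividends: Math.max(0, totalDraw - salaryTarget) };
}

// Assumed 2023 limit; check the current figure before using it.
const ASSUMED_LIMIT_2023 = 30780;

console.log(rrspRoom(75000, ASSUMED_LIMIT_2023)); // 13500
console.log(splitCompensation(100000, 75000));    // salary 75000, dividends 25000
```

Dividends do not count as earned income for this purpose, which is why only the salary portion feeds `rrspRoom`.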

The administrative difference is huge though. Salary means:

Calculating CPP contributions
Income tax withholding
Monthly remittances to CRA

I did this myself when I was first starting out and had few clients. Now my bookkeeper handles it, and it takes about an hour per quarter. Dividends require a T5 slip at year-end if I take more than $50 annually.

NO EI Premiums: as the owner of more than 40% of your corporation, you’re exempt from EI premiums. Don’t withhold or remit EI. This saves about $2,500 annually in combined employee and employer premiums that would be wasted money.

The Tax Obligations

Tax Disclaimer: Tax rates and rules change. The rates mentioned here were accurate for Ontario in 2023 when I incorporated. Your province may have different rates, and federal/provincial tax rules change regularly. Verify current rates with an accountant.

Your corporation pays 12.2% corporate tax on the first $500,000 of active business income in Ontario (9% federal + 3.2% Ontario). This is the Small Business Deduction rate. Way better than personal tax rates.

But there are four different tax filing obligations to track:

T2 Corporate Tax Return — Due 6 months after your fiscal year-end, though you pay the actual tax 3 months after year-end
GST/HST Returns — Required once you hit $30,000 in revenue (I registered voluntarily at $15,000 to claim Input Tax Credits on all my software and equipment purchases)
T4 Slips — If you pay yourself salary, due February 28 annually
T5 Slips — If you pay yourself dividends over $50, due February 28 annually

Important GST/HST Note for Consultants with US Clients:

If you’re selling services to US clients (like I do with software consulting), your services are zero-rated exports. This means:

You charge 0% GST/HST to US clients (not 13%)
You still must file GST/HST returns if registered
You can still claim Input Tax Credits (ITCs) on your Canadian business expenses
You need to keep proper documentation proving your client is outside Canada

This is actually beneficial. You recover the 13% HST on all your Canadian expenses (software, equipment, office supplies) but don’t need to collect any HST from clients. I immediately claimed back the 13% HST on my $3,000 MacBook Pro, $1,200 in software subscriptions, and $800 in office furniture. That’s $650 cash back from CRA.
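The refund arithmetic is simple enough to check by hand, but here it is as a sketch. It assumes the three purchase amounts are pre-tax prices and that every client is a zero-rated US client, so no HST is collected. The function names are illustrative.

```typescript
const HST_RATE = 0.13; // Ontario HST rate used in this post

// Input Tax Credits on a list of pre-tax purchase amounts, in cents.
function inputTaxCreditsCents(preTaxCents: number[], rate: number = HST_RATE): number {
  const total = preTaxCents.reduce((sum, c) => sum + c, 0);
  return Math.round(total * rate);
}

// Net tax = HST collected minus ITCs. A negative result is a refund.
function netTaxCents(collectedCents: number, itcCents: number): number {
  return collectedCents - itcCents;
}

// $3,000 laptop, $1,200 software, $800 furniture
const itcCents = inputTaxCreditsCents([300000, 120000, 80000]); // 65000 cents = $650

// Zero-rated exports: nothing collected from US clients
const netCents = netTaxCents(0, itcCents); // -65000, a $650 refund
```

With Canadian clients in the mix, `collectedCents` would be positive and the ITCs would reduce what is owed rather than produce a refund.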

Mandatory electronic filing started January 1, 2024 — you must file your GST/HST returns electronically through CRA’s My Business Account.

What It Actually Costs (Full Breakdown)

Here’s what I spent in year one:

Federal incorporation filing: $200
Minute book kit: $250
Business banking (RBC): $72/year
QuickBooks: $240/year
Harvest time tracking: $144/year
LawDepot contracts: $80/year
Zensurance business insurance: $600/year
Bookkeeper (Upwork): ~$800/year
Accountant for year-end (Upwork): $300
Federal annual return: $12

Total first-year cost: $2,500–3,000 (can be done for $800–1,200 if you go ultra-lean)
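Adding up the itemized list is a quick sanity check on that range. A small sketch; the amounts are the ones listed above, while the one-time versus recurring labels are an assumption about which items repeat in later years.

```typescript
interface CostItem {
  label: string;
  amount: number; // CAD
  oneTime: boolean;
}

const yearOne: CostItem[] = [
  { label: "Federal incorporation filing", amount: 200, oneTime: true },
  { label: "Minute book kit", amount: 250, oneTime: true },
  { label: "Business banking (RBC)", amount: 72, oneTime: false },
  { label: "QuickBooks", amount: 240, oneTime: false },
  { label: "Harvest time tracking", amount: 144, oneTime: false },
  { label: "LawDepot contracts", amount: 80, oneTime: false },
  { label: "Zensurance business insurance", amount: 600, oneTime: false },
  { label: "Bookkeeper (Upwork)", amount: 800, oneTime: false },
  { label: "Accountant for year-end (Upwork)", amount: 300, oneTime: false },
  { label: "Federal annual return", amount: 12, oneTime: false },
];

function sumCosts(items: CostItem[]): number {
  return items.reduce((sum, item) => sum + item.amount, 0);
}

const total = sumCosts(yearOne);                              // 2698
const setup = sumCosts(yearOne.filter((i) => i.oneTime));     // 450
const recurring = sumCosts(yearOne.filter((i) => !i.oneTime)); // 2248
```

The listed items come to $2,698, inside the stated $2,500–3,000 range, and the $450 one-time portion matches the incorporation cost quoted earlier.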

There are several ways you could go ultra-lean here: cheaper accounting software, doing the bookkeeping yourself, leaner insurance, and so on.

The Timeline From Zero to Operational

Day 1: Incorporated online through Corporations Canada website in the morning. Ordered minute book kit from Amazon.

Day 2: Received Certificate of Incorporation by email (one business day processing). Booked appointment at RBC for the following week (they needed a few days for business banking appointments). Applied for Business Number through CRA’s Business Registration Online. Got my number instantly. Registered for GST/HST and payroll accounts at the same time.

Day 3: Minute book kit arrived. Spent 2 hours completing organizational resolutions and setting up corporate records.

Day 4: RBC appointment. Opened business account. Got temporary debit card same day, permanent card arrived a week later. Opened Wise Business account online (approved same day).

Day 5: Set up QuickBooks. Connected to RBC account. Created invoice templates with my new GST/HST number.

Day 6: Fully operational.

The whole process took about a week from deciding to incorporate to starting to find clients!

What I’d Do Differently

Three things:

First, I’d take the extra day to get a proper business name instead of ending up with a numbered company. The NUANS search is only $13.80 and would have saved me the awkwardness of invoicing clients as “1506596–8 Canada Inc.”

Second, I’d have found my bookkeeper and accountant on Upwork from day one instead of doing anything myself. I am not an accountant, nor do I want to be! Leave it to the professionals! Platforms like Upwork make this very accessible and let you hire just for what you need.

Third, I’d start with Wave Accounting (free) for the first 3–6 months instead of immediately paying for QuickBooks. Wave handles Canadian taxes correctly and costs nothing. Upgrade once you’re consistently hitting $10,000+ monthly revenue.

Remember: This is my experience from 2023–2025. Your situation, province, business model, and goals are different. Use this as a starting point for your research, not as instructions! Hire professionals for the important stuff.

Common Mistakes

Overthinking the federal vs provincial decision (just pick one)
Paying themselves only dividends and missing CPP/RRSP benefits
Not getting insurance until a client requires it (get it day one)
Doing their own bookkeeping for too long (hire help at $5k+ revenue/month)
Not registering for GST/HST early when they have big expenses

The Bottom Line

Incorporating takes about a week and $2,500 in first-year costs to do properly. Provincial incorporation makes more sense than federal for most consultants, unless you’re working with clients nationally like I do. The hybrid salary + dividend approach (favoring salary despite what the internet says) optimizes both taxes and retirement planning. Free or low-cost Canadian tools (Wave, Wise, Harvest) combined with Upwork professionals can keep costs manageable.

The biggest surprise: corporate administration takes way less time than expected when you have a good bookkeeper. About 2–3 hours per month on my end for reviewing numbers and payroll sign-offs. The liability protection, professional credibility, and 12.2% corporate tax rate make it worthwhile once you’re consistently billing $75,000+ annually.

2.5 years in, I’m glad I incorporated. The structure forces better financial discipline, tax planning saves $15,000–20,000 annually compared to sole proprietorship, and clients take you more seriously when you have “Inc.” after your company name (even if it’s a numbered company).

Just don’t overcomplicate it!

This covers the mechanics of setting up a consulting corporation in Canada. But the setup is just the beginning. The real journey is figuring out what to charge, finding clients who value your work, building systems that scale, learning when to say no, and navigating the psychological shift from employee to business owner. That’s a story for next time!

I Built a Route Planning App for Cyclists Who Hate Stopping

Levi Stringer — Mon, 17 Nov 2025 04:48:38 GMT

I love cycling but I also love solving problems. And I had a problem.

I don’t have a flexible way to search, decide on a route, hop on my bike and go. So I decided to build something to fix it.

I’m relatively new to Montreal. Back home, I could leave my house and be on open road in 10 minutes. Want a nice 20k loop? I knew exactly where to go. Here? I’m starting from scratch.

A couple months ago, I’m cycling to Laval for some longer stretches. Google Maps: “10 km to bike-friendly route.” Easy warmup, right?

Ten minutes later, I’m stopped at my sixth red light, unclipping again, and breathing exhaust.

The problem isn’t the apps. It’s that none of them let you describe the type of ride you want. They give you “fastest route” or “bike-friendly.” But what does bike-friendly even mean? Shared lane with cars? Protected path? Side streets?

I don’t want the fastest route. I want a route with flow. Minimal stops. Actual bike lanes. Away from traffic.

So I built something that lets me describe exactly that.

The Core Idea: Just Tell It What Kind of Ride You Want

Sometimes I want:

A quick 20-minute flow session with minimal stops
A chill route away from traffic
Something flat because my legs are toast
The quickest way to just put kilometers on my bike

Existing apps make you pick from their categories or search segments. What I wanted was simple: question → solution. Something you just talk to:

“Find me a chill route from Plateau to Mile End”
“Quick ride to Old Port and back, avoid traffic”
“Flat route, no hills, I’m tired”

No clicking through menus. No toggling 15 different options. Just get a route.

How It Works: Streaming AI + Google Maps APIs

The AI Layer: I integrated an LLM API that streams responses. When you type a query, you get real-time, conversational responses:

You: “I want a short ride to Mile End, maybe 20 minutes”

App: (streaming) “I’ll find you a relaxed route to Mile End, aiming for around 20 minutes. Looking for options with fewer stops…”

It should feel natural. It’s flexible. Works in English or French.

The Google Maps Integration: Once the agent understands what you want, it taps into Google’s APIs to build the route. A few things I like:

Alternative Routes — Google gives you multiple alternatives. I use provideRouteAlternatives: true and let the agent pick the best match based on the query.
Bike Layer — Google knows where actual bike infrastructure is. Protected lanes, shared lanes, bike paths. You can see which routes are actually bike-friendly vs “you’re sharing with cars and praying.”
Traffic Layer — Real-time traffic data matters WAY more for bikes than people think. Heavy car traffic = you’re stopping at every light. Light traffic = better flow.
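A rough sketch of this step, under stated assumptions. The request object uses real `DirectionsRequest` fields from the Maps JavaScript API (`travelMode`, `provideRouteAlternatives`), written as a plain object so the snippet runs without the Google SDK. Everything else is hypothetical: `RouteSummary`, `Preference`, and `pickRoute` are stand-ins for whatever the agent actually does when it chooses among alternatives.

```typescript
// Shape of the request passed to the Directions service.
// "BICYCLING" is the string value of google.maps.TravelMode.BICYCLING.
const request = {
  origin: "Plateau-Mont-Royal, Montreal",
  destination: "Mile End, Montreal",
  travelMode: "BICYCLING",
  provideRouteAlternatives: true, // ask for several candidate routes
};

// Hypothetical per-route summary, derived from each returned alternative.
interface RouteSummary {
  id: string;
  durationMin: number;
  elevationGainM: number;
  trafficLevel: 0 | 1 | 2; // 0 = light, 1 = moderate, 2 = heavy
}

type Preference = "fast" | "chill" | "flat";

// Deterministic stand-in for the agent's choice: lower cost wins.
function pickRoute(routes: RouteSummary[], pref: Preference): RouteSummary | undefined {
  const cost = (r: RouteSummary): number => {
    switch (pref) {
      case "fast":
        return r.durationMin;
      case "flat":
        return r.elevationGainM;
      case "chill":
        // Traffic dominates; duration only breaks ties.
        return r.trafficLevel * 100 + r.durationMin;
    }
  };
  return [...routes].sort((a, b) => cost(a) - cost(b))[0];
}
```

In a setup like this, the LLM’s job reduces to mapping free text (“avoid traffic”, “my legs are toast”) onto a `Preference`, which keeps the final choice inspectable.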

Map View With Traffic Overlap

The Route Feed: Instead of a list, routes appear in a feed. Like Instagram, but for bike routes. Each route shows:

Map preview (full-screen, interactive)
Distance, time, elevation
Traffic level indicator (green/yellow/red dot)
Safety score (0–100)

You can iterate until you find a route you like.

The Safety Score

Each route gets scored 0–100 based on real-time data from Google Maps. Here’s what affects it:

Road types: Residential streets and bike lanes increase the score. Highways and major boulevards decrease it.

Time of day: Routes score highest mid-day. Rush hour (6–9am, 4–7pm) and nighttime rides get penalized for traffic and visibility.

Traffic patterns: The algorithm analyzes turn frequency as a proxy for traffic density. More turns usually means quieter residential areas, which boosts the score.

Route warnings: Google flags things like missing sidewalks or “use caution” conditions. Each warning drops the score.

So when browsing your feed:

Route A: Fast (12 min) but safety score: 45/100 (busy boulevard, rush hour timing)
Route B: Slower (15 min) but safety score: 85/100 (residential streets, protected bike lanes)

The score updates in real-time based on current conditions. Same route at 2pm might score 80, but at 5pm during rush hour? Maybe 55.
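To show how factors like these could combine, here is a hypothetical scoring function. None of the weights, the base score of 50, or the night-time hours come from the app; they are placeholders chosen so the behavior matches the description (quieter roads and more turns help, warnings and rush hour hurt).

```typescript
// Hypothetical inputs, assumed to be derived from the route's steps and warnings.
interface RouteFeatures {
  quietShare: number;     // 0..1, fraction of distance on bike lanes or residential streets
  majorRoadShare: number; // 0..1, fraction on highways or major boulevards
  turnsPerKm: number;     // proxy for quieter residential areas
  warningCount: number;   // number of route warnings
}

// Rush hours as described above (6-9am, 4-7pm); the night window is an assumption.
function timeOfDayPenalty(hour: number): number {
  const rushHour = (hour >= 6 && hour < 9) || (hour >= 16 && hour < 19);
  const night = hour >= 21 || hour < 6;
  if (rushHour) return 20;
  if (night) return 15;
  return 0;
}

function safetyScore(f: RouteFeatures, hour: number): number {
  let score = 50;                          // neutral starting point
  score += 40 * f.quietShare;              // bike lanes and residential streets help
  score -= 40 * f.majorRoadShare;          // highways and boulevards hurt
  score += Math.min(f.turnsPerKm, 5) * 2;  // more turns, capped so it cannot dominate
  score -= 5 * f.warningCount;             // each warning drops the score
  score -= timeOfDayPenalty(hour);
  return Math.max(0, Math.min(100, Math.round(score)));
}

const sideStreets: RouteFeatures = { quietShare: 0.8, majorRoadShare: 0.1, turnsPerKm: 3, warningCount: 0 };
console.log(safetyScore(sideStreets, 14)); // 84
console.log(safetyScore(sideStreets, 17)); // 64
```

Because the hour is an argument rather than baked into the route, re-scoring the same route as conditions change only means calling the function again.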

What This Actually Looks Like

Real example: I want to go from my place in the Plateau to grab coffee in Mile End.

What I type: “Chill ride to Mile End, avoid traffic”

What the AI responds: (streaming) “I’ll find you a relaxed route to Mile End with less traffic. One moment…”

Then it shows the route:

Distance: 2.8 km
Time: 16 minutes
Elevation: 35m gain
Traffic: Light (green indicator)
Safety score: 78/100

I tap it, see the map. Route goes through side streets (Clark/Waverly), not the main drags. Perfect.

I start riding. Minimal stops. Quiet streets. Just flow.

Why This Works for Route Planning

The key difference: you describe what you want, not what other people have done.

Google Maps optimizes for speed. Great for getting somewhere fast, but “bike-friendly” could mean anything.

Strava is built for tracking and performance. Route planning is paywalled, and the free tier is amazing for analyzing rides you’ve already done, not for “I have 30 minutes and want something chill.”

What I built:

Normal queries that match how you think about rides
Multiple route options in a feed so you can quickly compare
Safety scores and traffic indicators visible before you start
Real-time data that updates as conditions change

It does one thing well: help you find the type of ride you actually want, right now.

The Air Quality Bonus

Montreal summer air can be rough. I added air quality data (1–10 scale) to each route. Now I can check before I leave: Is it a good day to ride? Or should I stick to the metro?

The API was already there (WeatherAPI.com), took maybe 30 minutes to wire up.

Was This Worth It?

Honestly? Yeah.

I use it before almost every ride now. Sometimes just to check traffic. Sometimes to find a new route. Sometimes because I want a specific vibe and I know it’ll find something that matches.

But more importantly: I’m learning Montreal. I’m building that cyclist’s intuition I had back home.

Is this a Strava killer? No, probably not. Strava is for tracking, competing, analyzing performance, flexing on your friends.

This is for planning. For knowing. For getting the type of ride you actually want.

And as someone who is still getting used to a new city and wants to ride without stopping every 200 meters? That’s everything.

What’s Next?

This does what I need, but three features would make it even better:

1. Route saving — Star favorites so I stop re-discovering the same routes
2. Traffic light counter — Show exactly how many stops to expect (the dream)
3. Community routes— Let other riders share their no-stop gems. For free!

Whether I build these depends on whether anyone else actually wants this. If you’d use features like these, let me know in the comments.

For now, it solves my problem. That’s enough.

And if you’re in Montreal and have route suggestions — especially ones with minimal stops and good flow — drop a comment. I would be stoked to find some new ones.

The Tech Stack

Frontend:

TypeScript + React 19
Vite (blazing fast builds)
Tailwind CSS (utility-first styling)
Mobile-first responsive design

APIs:

Google Maps APIs (Directions, Elevation, Maps JavaScript)
Weather API (air quality data)
Custom LLM API (streaming chat responses)

Architecture:

Route Feed (Instagram-style UI)
Chat Interface (streaming AI responses)
Natural Language Parser (extracts locations and preferences)
Route Orchestrator (coordinates all services)

The whole thing is on GitHub

Built this because I needed it. If you’re also trying to figure out cycling in a new city, hope this helps.

If this resonated with you, give it a clap. If you have questions about the APIs or cycling, drop a comment. And if you’re also new to a city and missing your old routes, I feel you.

Claude Code The Experiment: Atomic Task Breakdown Test

Levi Stringer — Thu, 06 Nov 2025 02:05:13 GMT

TLDR: I gave Claude Code identical bug fixes twice.

First run: $1.82, complete with custom hashing algorithms.
Second run: $0.77, five lines of validation logic. Both solved the problem.

Task decomposition sounds boring until you’re staring at a production bug at 2 AM.

You’re wondering whether to throw the entire problem at Claude or methodically carve it into digestible chunks. Most developers default to the first option because it’s faster. But as we all know… faster isn’t always smarter.

I was trying to figure out what to test on: it had to be a true production app. And then it hit me. Why not use the library so many thousands of developers rely on every day, Vercel’s AI SDK? So I tested both approaches using a genuine bug from the backlog (issue #8132): bundle everything into one mega-prompt, or atomize it into focused sessions. The results caught me off guard.

The Bug That Kicked This Off

OpenAI’s Responses API enforces a strict 40-character limit on tool call IDs. Sure, why not.

The Vercel AI SDK’s `web_search_preview` tool? Blissfully unaware. It generated 51-character IDs like it was no one’s business.

Straightforward problem. Generate shorter IDs while maintaining uniqueness. How complex could this possibly get?

Approach A: The “Just Fix It” Bundle

My first prompt kept it simple.

Fix GitHub issue #8132 in the vercel/ai repository.

The problem: The web_search_preview tool generates tool_call IDs that exceed OpenAI’s 40-character limit, causing this error: ‘Invalid messages[36].tool_calls[0].id: string too long. Expected maximum length 40, got length 51 instead.’

No constraints. No guidance. Just point Claude at the problem. Keep in mind that I’m using almost exclusively Sonnet 4.5.

Cost: $1.82

Time: 15 minutes

Files changed: 6

Lines added: +340

Claude Got Creative

It built an entire utility module from scratch. Hash-based truncation system included.

// packages/provider-utils/src/truncate-id.ts (43 lines)
export function truncateId(id: string, maxLength: number = 40): string {
  if (id.length <= maxLength) {
    return id;
  }
  const separatorIndex = id.indexOf('_');
  const prefix = separatorIndex > 0 ? id.substring(0, separatorIndex) : '';
  // Custom hash function
  const simpleHash = (str: string): number => {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  };
  const hash = simpleHash(id);
  const hashSuffix = hash.toString(36); // Base36 encoding
  const availableSpace = maxLength - (prefix.length + 1);
  const truncatedHash = hashSuffix.substring(0, availableSpace);
  return prefix ? `${prefix}_${truncatedHash}` : truncatedHash;
}

Then it applied this truncation machinery across six different files. Added 125 lines of unit tests. Updated integration test snapshots. Built a full abstraction layer for what should’ve been a simple fix.

Takeaways

I had to prevent it from reading a file it wasn’t supposed to read.
Tests failed on the first pass.
Required multiple iterations before everything clicked.
Ended up with custom bitwise operations and base36 encoding when the problem boiled down to “make shorter IDs.”

It worked. But I’d just commissioned a Swiss Army knife when I needed scissors.

Approach B: Atomic Sessions

This time I carved it into three surgical strikes.

Session 1:

> Write a test that reproduces issue #8132. The test should generate a tool call ID using the current code and assert that it exceeds 40 characters. Only create the test — don’t fix anything yet.

Session 2:

> Fix the ID generation to enforce 40-character maximum. Modify the existing createIdGenerator function to auto-adjust the size parameter if it would exceed the limit. Only modify the generation logic.

Session 3:

> Add validation to prevent regression. Create a validateIdLength function that throws if an ID exceeds 40 chars. Add it where IDs are generated.

Cost: $0.77

Time: 16 minutes

Files changed: 2

Lines added: ~90

Claude Kept It Real

This time? It addressed the actual problem.

// packages/provider-utils/src/generate-id.ts
export const validateIdLength = (id: string): void => {
  const MAX_ID_LENGTH = 40;
  if (id.length > MAX_ID_LENGTH) {
    throw new InvalidArgumentError({
      argument: 'id',
      message: `Generated ID length (${id.length}) exceeds maximum allowed length (${MAX_ID_LENGTH})`,
    });
  }
};

export const createIdGenerator = ({
  prefix,
  size = 16,
  alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
  separator = '-',
}: {
  prefix?: string;
  separator?: string;
  size?: number;
  alphabet?: string;
} = {}): IdGenerator => {
  const MAX_ID_LENGTH = 40;
  // Auto-adjust size if it would exceed maximum
  let actualSize = size;
  if (prefix != null) {
    const prefixLength = prefix.length + separator.length;
    if (prefixLength + size > MAX_ID_LENGTH) {
      actualSize = MAX_ID_LENGTH - prefixLength; // ← The fix
    }
  } else if (size > MAX_ID_LENGTH) {
    actualSize = MAX_ID_LENGTH;
  }
  const generator = () => {
    const alphabetLength = alphabet.length;
    const chars = new Array(actualSize);
    for (let i = 0; i < actualSize; i++) {
      chars[i] = alphabet[(Math.random() * alphabetLength) | 0];
    }
    const id = chars.join('');
    validateIdLength(id); // ← Defensive check
    return id;
  };
  // … rest of function
};

Zero hash functions. Zero base36 encoding shenanigans. Just prevent the problem where it starts.
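To see the arithmetic on its own, here is the same clamping rule ported to Python as a small sketch. The function name clamp_id_size is hypothetical, and the two-character prefix with a `_` separator is only an example I picked so the total comes out to the 51 characters from the error message; it is not necessarily what the SDK uses.

```python
from typing import Optional

MAX_ID_LENGTH = 40  # the limit quoted in the error message above


def clamp_id_size(size: int, prefix: Optional[str] = None, separator: str = "-") -> int:
    """Return how many random characters fit without exceeding MAX_ID_LENGTH."""
    if prefix is not None:
        prefix_length = len(prefix) + len(separator)
        if prefix_length + size > MAX_ID_LENGTH:
            return MAX_ID_LENGTH - prefix_length
        return size
    return min(size, MAX_ID_LENGTH)


# Hypothetical case: a 48-character random part plus a 3-character prefix and
# separator would total 51 characters, so the random part is trimmed to fit.
print(clamp_id_size(48, prefix="ws", separator="_"))
```

Like the snippet above, this sketch does not guard against a prefix that is itself 40 characters or longer, in which case the returned size would be zero or negative.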

Takeaways

From my notes I had:

Cleaner code quality
I felt like I had better code understanding
Built checkpoints to check and course correct.

Each session delivered one clear outcome:

✅ Session 1: Test passes showing IDs exceed limits
✅ Session 2: Generator auto-adjusts size dynamically
✅ Session 3: Defensive validation guards against regression

At every checkpoint, I could verify we weren’t building a Rube Goldberg machine. That’s powerful, and I think it’s important to hold on to as an engineer: understanding the code and staying an active part of the process.

Why Atomization Crushed Bundling

Bundles Breed Complexity

Give Claude an unconstrained problem and watch it flex. Custom hash function with bitwise operations? Uh huh. Base36 encoding for compactness? Absolutely. Separate truncation utility applied after-the-fact across multiple files? Sure, why not.

Every line was technically defensible. That’s exactly the problem, I think: when everything is justifiable in isolation, you lose sight of whether you’re solving the right problem. I don’t care about fancy engineering; it basically solved a phantom problem. We didn’t need to truncate existing IDs. We needed to stop generating oversized ones in the first place.

Zero checkpoints meant zero chances to ask the crucial question: “Wait, is this the simplest solution?” I had a great mentor one time who told me: KISS.

“Keep It Simple Stupid”

Sessions Enforced Simplicity

Breaking tasks into sessions created natural inflection points.

After Session 1: “Alright, we can reproduce the bug. IDs exceed 40 characters.”

After Session 2: “Hold on — why not just prevent generating long IDs?”

After Session 3: “Add a defensive guard. Ship it.”

Each checkpoint gave me a chance to course-correct before complexity metastasized into the codebase. That’s the difference between code that works and code that endures. It also reinforces my ability to speak to the code and defend it when it comes to PR-time.

Where Complexity Shows Its True Cost

Code that works isn’t always code that ships.

Code Review Tells the Story

Reviewing the bundled approach triggers questions:

Why implement a custom hash function?
Could truncating IDs risk collision scenarios?
Now ID logic lives in two places — generation and truncation
Should we use an established hashing library instead?
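The collision question can be put in rough numbers. The simpleHash in the bundled solution produces a 32-bit signed value, and Math.abs folds that into about 2^31 distinct outputs. The sketch below applies the standard birthday-bound approximation; it assumes the hash spreads inputs uniformly, which is optimistic for a simple string hash, so real collision rates could be worse.

```python
import math


def collision_probability(n_ids: int, space: int) -> float:
    """Approximate chance that at least two of n_ids share a hash value."""
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))


SPACE = 2 ** 31  # distinct values left after Math.abs on a 32-bit signed hash

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} truncated IDs: {collision_probability(n, SPACE):.2%} chance of a collision")
```

The probability stays small for a handful of IDs but climbs quickly as the count reaches the tens of thousands, which is the kind of question a reviewer would reasonably raise.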

Reviewing the session-based approach:

Prevent size from exceeding 40 characters. Sensible.
Validation provides a safety net.
ID logic stays centralized in the generator.

One PR sparks five debates. The other sparks zero.

Deciding When to Atomize

Running this experiment crystallized my framework.

Split Tasks When:

You’d review intermediate state anyway → Transform those review points into distinct sessions

The solution approach remains unclear → Run a research session before implementation. Plan, plan, plan.

Incremental validation matters → Chain them: Test session (Replicate bug) → Fix session → Validation session

Stay Bundled When:

Single file with obvious changes → “Add null check to this function” doesn’t need decomposition

You want to iterate quickly → “Change this header, this table, and the colour to blue”. Quick visual mockups aren’t worth the trouble of breaking down.

Well-defined API contracts exist → “Implement this interface” provides clear boundaries. Certainly leverage existing context.

The One-Sentence Litmus Test

Can’t articulate the task in ONE sentence with ONE measurable outcome? Split it.

❌ “Fix the ID length issue” → Vague endpoint
✅ “Write failing test when IDs exceed 40 characters” → Crystal clear
✅ “Modify createId to auto-adjust size based on prefix length” → Surgical precision

Try it yourself:

Next time you’re tempted to throw Claude a complex task:

1. Draft your prompt

2. Ask yourself: “Can I verify success in one step?”

3. No? Engage in a planning session and work to split it.

4. Identify natural review checkpoints

5. Convert each checkpoint into its own session

What This Really Means

The session-based approach cost less than half as much ($0.77 versus $1.82). Generated cleaner code. But here’s what actually matters: I felt confident at every checkpoint.

Bundled finished in 15 minutes. Session-based took 16. One minute difference.

Yet one approach generated technical debt I’d need to defend during code review. The other produced production-ready code that shipped without debate.

Speed doesn’t equal value.

Breaking tasks into atomic sessions forced both Claude and me to think clearly about each step. That clarity manifested directly in the codebase. No over-engineering. No unnecessary abstractions. Just clean, maintainable solutions.

Try atomizing your next complex feature. You’ll invest maybe one extra minute planning sessions. You’ll recover hours in code review and maintenance cycles.

Running DeepSeek Locally: A Practical Example

Levi Stringer — Thu, 30 Jan 2025 16:09:23 GMT

TLDR: Want to chat with DeepSeek quickly?

curl -fsSL https://ollama.com/install.sh | sh
ollama -v #check Ollama version
ollama run deepseek-r1:1.5b

That’s all it takes! Read on for more details.

Let’s Dive in

Have you heard the buzz about DeepSeek, or have you been blissfully unaware? Well, if you’re like me, keeping up with emerging language models is part of your job. Otherwise, you may be wondering what in the world is going on.

DeepSeek-R1, released last week by DeepSeek, has been gaining traction for its cost-effectiveness (~free) and strong performance.

What Is DeepSeek?

It marks a big drop in the cost of high-performing models. This release was a huge step forward in the commoditization of LLMs. Think of it as a “generic-brand” version of OpenAI’s GPT models: capable models you can run locally, without relying on cloud-based APIs.

Key benefits:

💰 Cost Effective: High performance without hefty monthly invoices
🖥️ Local Execution: Run inference directly on your hardware.
🔧 Flexible Choices: Available in different sizes (1.5B, 7B, 671B) to fit your hardware capabilities.

Now, you might be wondering: “What’s in it for me?”
Simple — commoditization makes technology more accessible and cheaper, driving down costs for everyone. To highlight this effect, I’ll show you how to run DeepSeek locally on a MacBook Pro M3.

Step 1: Install Ollama

Ollama’s Llama

Download Ollama. This is probably the easiest way to get started with open-source models. Ollama is an open-source project that allows you to quickly and easily run various LLMs on your local machine, without relying on any cloud-based platforms.
Installation on macOS is straightforward:

brew install ollama

After installing, verify it works by running ollama --help.

Step 2: Choose the Right Model for Your Hardware

Next, figure out what size of DeepSeek model is right for your hardware. If you have limited RAM or an older GPU, you might start with a smaller model like deepseek-r1:1.5b or deepseek-r1:7b.

You can see the models on Ollama’s page.

Pro Tip: Check out the GPU requirement guide to see if your hardware can handle larger models.

Model Versions

For this illustration, I’m using a MacBook Pro M3 with 18 GB of RAM, and I was just able to get away with running:

ollama run deepseek-r1:7b

A Note on Quantization

By default, ollama run deepseek-r1:7b uses 4-bit quantization, meaning it requires far less memory than full 16-bit or 32-bit versions.

Want higher precision? Experiment with 8-bit or 16-bit quantization if you have the memory/GPU capacity.
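A back-of-the-envelope way to compare these options is parameters × bits per weight ÷ 8. The sketch below treats "7B" as exactly 7 billion parameters and counts the weights only, both simplifying assumptions; real quantized files tend to be somewhat larger.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (4, 8, 16):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB of weights")
```

These figures are a lower bound: the context cache and runtime overhead come on top, so leave headroom when picking a precision.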

⚠️ Fun fact: The largest DeepSeek model currently has 671 billion parameters; at FP16 precision that requires about 1.3TB of storage. Safe to say, your MacBook chip ain’t handling that.

Step 3: Install the Model & Run It

Once you’ve picked a model, running it for the first time will trigger an automatic download:

ollama run deepseek-r1:7b

⏳ Note: The first run may take several minutes as it downloads and caches the model. After that, you should see a prompt where you can type a question or statement and get a response.

Common Issues and Solutions

1. Ollama Connection Fails

Make sure Ollama is running via ollama serve.
Verify you’re connecting to the right URL (http://localhost:11434/v1).
Check if the model is loaded: ollama list.

2. Insufficient RAM or GPU

Try a smaller model variant, or use 4-bit quantization.
Close other memory-intensive apps.

3. Model Download Is Slow

The larger the model, the longer it will take. Consider using smaller model sizes first.
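For the first issue above, a quick programmatic check can save some guessing. This is a minimal sketch that asks Ollama's local HTTP API for its model list via the /api/tags endpoint; the helper name list_ollama_models is my own, and the default port is the one mentioned above.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the names of locally available models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [model.get("name") for model in data.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; try starting it with `ollama serve`.")
else:
    print("Available models:", models)
```

If the list comes back empty, the server is up but no model has been pulled yet, which points to a different fix than a connection failure.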

Integration Example: 🚀 Better Commit Messages with DeepSeek

Now that you have the model running, how do you really put it to use? Most developers hate writing commit messages. It’s easy to just type:

git commit -m "changes"

I’ve never much enjoyed writing or reading commit messages. I know what you’re thinking: ‘Really? It’s just one sentence.’ And you’re not wrong. But I’ve seen developers put effort into elegant features and bug fixes, only to end up with commit messages like git commit -m 'changes'. “Changes”? Six months later, that message is useless. So for this example I took this problem of mine and did what engineers do best: fix it.

Example Solution: commitry-cli

I built commitry-cli, a simple tool that generates meaningful commit messages using DeepSeek. Here’s how to use it:

1. Clone the repo & install

git clone git@github.com:levi-katarok/commitry-cli.git
cd commitry-cli
python -m venv venv && source venv/bin/activate
pip install -e .

2. Ensure Ollama is running on the default port:

ollama serve
lsof -i :11434

Sample Command Line Output Upon Success

3. Test it:

You can now try and make a change in your repo and see what you get!

Here I make a change in my repo

def calculate_radius_of_planet(planet):
    if planet == "Earth":
        return 6371
    elif planet == "Mars":
        return 3389
    else:
        return 0

Then I staged the function

git add .

Sample of Staged Changes

Pretty simple change, right? Now we can see how DeepSeek will do by running my handy CLI tool, commitry-cli. You can see the truncated output below.
Run commitry-cli:

commitry-cli                          


Alright, so I'm trying to understand this code that was generated by a user. Let me read through it again.

The function is called calculate_radius_of_planet and it takes 'planet' as an argument. The first condition checks if planet == "Earth". If that's true, it returns 6371—which must be in kilometers because I recall Earth's radius is about 6,371 km.

Then there's an elif clause checking if planet == "Mars" and returning 3389. Yeah, Mars' radius is roughly around there—maybe like 3,389 km? That seems right.

If none of those conditions are met, it goes to else and returns 0. Hmm, so for any other planets not specified here, the function just gives a zero radius. I guess that's by design because they only have data for Earth and Mars.

Wait a second, in the code, after the else clause, there was another 'else' statement: return 0 again? No, no— actually, let me check the indentation. Oh, right, it seems like the first function is from line 3, so maybe that's the correct structure. But wait, looking back, I think there might have been a mistake in how these code blocks were presented.

Wait, the original code given starts with a function and then has those conditionals inside. So after handling Earth and Mars, if none are matched, it returns 0. There isn't an extra 'else' for another planet because any other input would just hit the last else clause.

I think this is correct as per the user's data. The code correctly maps Earth to ~6371 km and Mars to ~3389 km. Any other planet not specified returns 0, which could be a placeholder or default value if they don't have radius information for others.

So the function works as intended based on what was provided.


The code defines a function `calculate_radius_of_planet` that returns the radius of a given planet in kilometers, returning 6371 km for Earth and 3389 km for Mars, with all other unsupported planets returning 0.

You’ll get a suggested commit message generated by DeepSeek. Below is an example of the truncated output:

Suggested commit message:
The code defines a function `calculate_radius_of_planet` that returns the radius of a given planet in kilometers, returning 6371 km for Earth and 3389 km for Mars, with all other unsupported planets returning 0.

Not bad! It’s certainly more descriptive than the typical “updates” or “changes.” For comparison, if you have an OpenAI API key, you can configure the CLI to call an OpenAI model as well. In one test, when I called GPT-4o-mini:

commitry-cli --model='gpt-4o-mini'

It returned:

Suggested commit message:
Add a function to calculate the radius of a planet based on its name.

Also not bad. Honestly I think OpenAI might have won in this case. Both are decent commit messages. One of the big perks here is that DeepSeek can run entirely offline — no need to rely on a third-party API or pay for tokens.
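The long passage in the DeepSeek output above is the model's reasoning trace, which DeepSeek-R1 emits between `<think>` and `</think>` tags before its answer. A tool that wants only the commit message can drop that section. The sketch below shows one way to do it; it is my assumption about how such post-processing could look, not a copy of commitry-cli's code.

```python
import re

# Non-greedy match so each <think>...</think> section is removed on its own.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_commit_message(raw_output: str) -> str:
    """Remove the reasoning section and return the remaining text."""
    return THINK_BLOCK.sub("", raw_output).strip()


raw = "<think>\nThe function maps planet names to radii...\n</think>\n\nAdd planet radius lookup."
print(extract_commit_message(raw))
```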

Complete Flow:

git clone git@github.com:levi-katarok/commitry-cli.git
cd commitry-cli
python -m venv venv && source venv/bin/activate
pip install -e .
ollama serve
commitry-cli
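For readers curious how a tool like this might talk to the local model, here is a rough sketch using only the standard library. It sends the staged diff to Ollama's OpenAI-compatible chat endpoint at the URL mentioned earlier. The function names, the system prompt, and the default model are assumptions for illustration; the real commitry-cli may work differently.

```python
import json
import subprocess
import urllib.request

# Ollama's OpenAI-compatible endpoint on the default port.
OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_request(diff: str, model: str = "deepseek-r1:7b") -> dict:
    """Build an OpenAI-style chat payload asking for a one-line commit message."""
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": "Write a concise, one-sentence git commit message for this diff.",
            },
            {"role": "user", "content": diff},
        ],
        "stream": False,
    }


def suggest_commit_message(model: str = "deepseek-r1:7b") -> str:
    """Send the staged diff to the local model and return its raw reply."""
    diff = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(build_request(diff, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling suggest_commit_message() inside a repository with staged changes, while ollama serve is running, would return the model's reply, reasoning trace included.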

Conclusion

The rise of DeepSeek represents another milestone in the commoditization of large language models. If you are curious about why DeepSeek went with this strategy, I recommend reading The Law of Commoditize Your Tech. As models become cheaper and more accessible, we gain more flexibility to run high-quality AI locally, at the edge.

This is just one potential use case. But what other scenarios could benefit from high-quality, locally-run models?

As always, please share your thoughts, questions, or any corrections in the comments below. I hope this was helpful.


Simplifying LLaMA on AWS Bedrock

Levi Stringer — Tue, 24 Sep 2024 15:28:54 GMT

Created by Midjourney 2024

In this tutorial, I’ll guide you through setting up and using Meta’s LLaMA model on AWS Bedrock, showcasing a semi-practical use case: generating recipes based on available ingredients. This guide is for developers and data scientists who are either familiar with AWS or looking for a hosted service to deploy alternative models such as the LLaMA series. If you prefer to see the code directly, you can check out the GitHub repo here.

Note: I used LLaMA here simply because it was the model I found it hardest to find an example for elsewhere, but this process will work for any model hosted on AWS Bedrock. You can see a complete list on AWS here.

Prerequisites

An AWS account with appropriate permissions and billing set up
Python 3.7 or later

Step 1: Set Up Your AWS Environment

Though AWS configuration can sometimes be challenging, this process for Bedrock was surprisingly straightforward.

1. Log in to your AWS Console.
2. Navigate to the IAM (Identity and Access Management) dashboard:

IAM allows you to control who can access your AWS resources. You’ll create a user with programmatic access and apply the AmazonBedrockFullAccess policy to enable usage of Bedrock services.

3. Create or update an IAM user:

Create a new IAM user or use an existing one with programmatic access.
Attach the AmazonBedrockFullAccess policy to this user.
Note down the Access Key ID and Secret Access Key.

4. Enable model access:

Visit the Bedrock model access page and enable the Meta models (or others as needed).

5. Verify region availability:

Ensure you are using the correct AWS region (e.g., us-west-2), as model availability may vary by region.

Step 2: Configure Your Local Environment

Use the requirements.txt from GitHub or copy the following installation command:

pip install boto3 python-dotenv pydantic

Step 3: Create Python Script

1. Create a .env file with your AWS credentials:

AWS_SECRET_ACCESS_KEY=********
AWS_ACCESS_KEY_ID=****
AWS_REGION=us-west-2

2. Create a Python file named recipe_generator.py with the following code. In the script, we first load the environment variables, then set up the AWS Bedrock Runtime client for making API calls. The generate_recipe function prompts the LLaMA model using an ingredient list and expects the output in JSON format conforming to the Recipe model schema.

import boto3
import json
import os
from typing import List
from pydantic import BaseModel
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set up AWS credentials
aws_access_key = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
aws_region = os.environ.get("AWS_REGION")

# Create a Bedrock Runtime client
client = boto3.client(
    "bedrock-runtime",
    region_name=aws_region,
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key
)

# Define Recipe model
class Recipe(BaseModel):
    name: str
    ingredients: List[str]
    instructions: List[str]

# Create JSON schema for Recipe
schema = Recipe.model_json_schema()
schema_json = json.dumps(schema, indent=4)

# Set the model ID for Llama 3.1 8B Instruct
MODEL_ID = "meta.llama3-1-8b-instruct-v1:0"

def generate_recipe(ingredients):
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are a helpful assistant. Generate a recipe using the following ingredients. The output should be formatted as a JSON instance.<|eot_id|><|start_header_id|>user<|end_header_id|>
    Ingredients: {', '.join(ingredients)}. 
    Output in JSON format only. The output should be formatted as a JSON instance that conforms to the JSON schema below. {schema_json}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
 
    try:
        response = client.invoke_model(
            modelId=MODEL_ID,
            body=json.dumps({
                "prompt": prompt,
                "max_gen_len": 512,
                "temperature": 0.7,
                "top_p": 0.9,
            })  
        )
        
        response_body = json.loads(response['body'].read().decode('utf-8'))
        recipe_json = json.loads(response_body['generation'])
        return Recipe(**recipe_json)
    except Exception as e:
        print(f"Error generating recipe: {str(e)}")
        return None

# Example usage
if __name__ == "__main__":
    ingredients = ["chicken", "rice", "bell peppers", "onions", "potatoes"]
    recipe = generate_recipe(ingredients)
    if recipe:
        print(recipe)
    else:
        print("Failed to generate recipe.")

Tip: Using the recommended prompt structure from Meta can significantly improve the quality of the output, especially when generating structured JSON.
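The prompt in generate_recipe spells out Meta's special tokens by hand. A small helper can keep them in one place. This is a sketch with a hypothetical name, build_llama3_prompt; the blank line after each header reflects my understanding of Meta's published Llama 3 template, so check it against the model card before relying on it.

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Wrap a system and a user message in the Llama 3 chat template tokens."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        # Ending on the assistant header cues the model to write its reply next.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt(
    system="You are a helpful assistant. Output JSON only.",
    user="Ingredients: chicken, rice, bell peppers.",
)
print(prompt)
```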

Step 4: Run the Script

Execute the script by running:

python recipe_generator.py

Upon execution, the script will send the ingredient list to the LLaMA model via Bedrock and return a JSON-formatted recipe.

Step 5: Analyze the Results

The script prints the generated recipe as a Recipe object, parsed from the JSON the model returns. You can further customize the output by modifying the prompt or tweaking the generate_recipe function to fit your application's needs. Here’s an example of what the Pydantic output might look like.

Recipe(
    name='Chicken and Vegetable Stir Fry',
    ingredients=['chicken', 'rice', 'bell peppers', 'onions', 'potatoes'],
    instructions=[
        'Heat oil in a pan and sauté the onions and bell peppers until tender.',
        'Add the chicken and cook until browned.',
        'Add the potatoes and cook until they start to soften.',
        'Add the rice and stir-fry for 2-3 minutes.',
        'Season with salt and pepper to taste.',
        'Serve hot and enjoy!'
    ]
)
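One practical wrinkle: models sometimes wrap the JSON in a sentence or two, in which case the plain json.loads call in generate_recipe would raise and the function would return None. A more forgiving parser is sketched below. The helper name extract_json_object is hypothetical, and the approach (take everything from the first `{` to the last `}`) is a simple heuristic, not something Bedrock provides.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the outermost {...} span in text, ignoring any prose around it."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])


generation = (
    'Here is your recipe: {"name": "Stir Fry", "ingredients": ["rice"], '
    '"instructions": ["Cook."]} Enjoy!'
)
print(extract_json_object(generation)["name"])
```

The resulting dict could then be passed to Recipe(**...) exactly as in the script, so schema validation still happens in one place.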

Conclusion

You’ve now successfully set up and used Meta’s LLaMA model on AWS to generate recipes based on given ingredients! This example shows an alternative to other APIs, and how easy it is to connect to LLaMA on AWS Bedrock and use it for various use cases.

While this is a simple example, I always strive for flexibility in my LLM applications, and having access to a hosted LLaMA is a great tool in the tool belt.

Note: Remember to always review and test the generated recipes, as AI models may sometimes produce inedible, or impractical meals... Happy cooking and coding!

Building Stateful Conversations with Postgres and LLMs

Levi Stringer — Tue, 12 Mar 2024 21:21:07 GMT

Image generated by DALLE March 2024

In an era dominated by AI-driven interactions, creating LLM applications or chatbots that can handle multiple users and remember past conversations is crucial. This guide dives into the challenge of ensuring continuity and context in Large Language Model (LLM) interactions, revealing how PostgreSQL can be the key to managing chat histories effectively. By integrating a database, you not only enhance user experience but also equip your chatbot with the ability to engage in more meaningful, context-aware dialogs. Let’s explore how to elevate your chatbot from a simple question-answer machine to an intelligent conversational agent that remembers and learns from each interaction.

Complete solution at the end of the article.

Why Chat History Enhances User Experience

George Santayana wisely pointed out “Those who cannot remember the past are condemned to repeat it”. A concept as true in conversations as it is in history.

Imagine the frustration of chatting with a support system that asks you the same questions every time you reach out. Maybe you don’t have to imagine it, as I suspect many of us have been there before. By avoiding that repetition, incorporating history in a chatbot offers significant benefits that elevate any application.

What are some technical benefits of having a saved session history?

Contextual Understanding: Large language models thrive on context. Including chat history in prompts allows LLMs to grasp the nuances of ongoing dialogue, ensuring answers are not only accurate but also highly relevant. This keeps the conversation fluid and coherent.
Improved Continuity: In the real world, life is full of interruptions. We step away from a conversation, only to return later with new questions or to continue where we left off. With chat history, a bot can pick up precisely where the conversation left off, regardless of breaks, ensuring a coherent and continuous user experience without missing a beat.
RESTful Chatbots: In the stateless realm of RESTful applications, where each request stands alone, chatbots face the challenge of remembering past interactions. This is where integrating chat history becomes critical. Storing the history in a database lets a RESTful chatbot overcome its inherent lack of memory and give coherent, contextually relevant responses, even after interruptions. Isolated requests become a connected, ongoing conversation, which is crucial for user engagement and satisfaction, while the application itself stays stateless and free to attend to other incoming requests.

Designing Your Chat History Feature

Now, of course you can go with any database and any format you’d like. I’m going to be providing an example with Postgres. It is reliable and free (check out https://supabase.com/ if you want a great cloud hosted solution). You can always set up and install a local version.

PostgreSQL is renowned for its robustness and strong compliance with SQL standards. It offers features such as transactional integrity, complex queries, and support for JSON data types, making it an excellent choice for applications requiring reliable data storage and retrieval mechanisms, such as AI-powered chat applications.

For a robust chat history management system, we’ll focus on three essential components:

Users Table: Stores user-specific information.
Sessions Table: Keeps track of each chat session, noting start and end times, and linking back to the respective user.
Messages Table: Records the details of every message within a session, including sender, text, and timestamps.

CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE sessions (
    session_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    start_time TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    end_time TIMESTAMP WITH TIME ZONE
);

CREATE TABLE messages (
    message_id SERIAL PRIMARY KEY,
    session_id INTEGER REFERENCES sessions(session_id),
    sender VARCHAR(50) NOT NULL,
    message_text TEXT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

With these three tables (users, sessions, and messages) we lay a solid foundation for managing chat histories. This structure lets us identify unique users and, when a client supplies a session_id, efficiently retrieve that session's message history. Simple yet powerful, this setup is all you need to start implementing robust chat history management in your chatbot applications.
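
To make the retrieval path concrete, here is a minimal sketch of looking up a session's history by session_id. It is an illustration under stated assumptions, not the setup used in the rest of this post: it swaps in Python's built-in sqlite3 with an in-memory database so it runs without a server, simplifies the column types, and orders by message_id instead of a timestamp. The helper names fetch_session_history and demo are hypothetical.

```python
import sqlite3

# SQLite stand-in for the Postgres schema above (column types simplified for illustration).
SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    username TEXT UNIQUE NOT NULL
);
CREATE TABLE sessions (
    session_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id)
);
CREATE TABLE messages (
    message_id INTEGER PRIMARY KEY,
    session_id INTEGER REFERENCES sessions(session_id),
    sender TEXT NOT NULL,
    message_text TEXT NOT NULL
);
"""


def fetch_session_history(conn, session_id):
    """Return (sender, message_text) pairs for one session, oldest first."""
    cursor = conn.execute(
        "SELECT sender, message_text FROM messages "
        "WHERE session_id = ? ORDER BY message_id",
        (session_id,),
    )
    return cursor.fetchall()


def demo():
    """Create one user, one session, two messages, then read the history back."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    user_id = conn.execute("INSERT INTO users (username) VALUES ('ada')").lastrowid
    session_id = conn.execute(
        "INSERT INTO sessions (user_id) VALUES (?)", (user_id,)
    ).lastrowid
    for sender, text in [("User", "Hi"), ("AI", "Hello! How can I help?")]:
        conn.execute(
            "INSERT INTO messages (session_id, sender, message_text) VALUES (?, ?, ?)",
            (session_id, sender, text),
        )
    return fetch_session_history(conn, session_id)
```

Calling demo() returns the two stored messages in the order they were written. The same query shape carries over to Postgres, where you would normally order by created_at.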

Integrating LLMs and Chat History with FastAPI

Choosing Your Tech Stack: While FastAPI is my framework of choice for its speed and ease of use, the concepts outlined here are adaptable across various technologies. You can follow similar steps in Node, Ruby, Rust, PHP… the list goes on.

Setting Up Your Database: With PostgreSQL, we structure our chat history using three key tables: Users, Sessions, and Messages. These tables are crucial for tracking user interactions, session details, and the messages exchanged.

FastAPI Implementation Overview:

Database Connection: Use the databases library for asynchronous database interactions, ensuring smooth operation in concurrent environments.
Model Definitions: Define Pydantic models for structured data exchange. MessageRequest captures incoming messages, while AIResponse handles the LLM's output.
API Endpoints: The /send_message endpoint is central to our implementation. It saves incoming messages to the database, interacts with the LLM (e.g., OpenAI) for response generation, and logs the response back in the database.

So now that we’ve got our table schemas defined, here is an example of how you might send a message and receive a response from an LLM while managing conversation history with PostgreSQL.

Note that we’re just going to start by saving messages and sessions with raw SQL statements. In a production application you’d likely use an ORM layer like SQLAlchemy, which simplifies database interactions.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import databases
import openai
from typing import Optional

DATABASE_URL = "postgresql://user:password@localhost/dbname"
database = databases.Database(DATABASE_URL)

app = FastAPI()

# Define Pydantic models for request and response data
class MessageRequest(BaseModel):
    user_id: int
    session_id: Optional[int] = None
    message: str

class AIResponse(BaseModel):
    ai_response: str
    session_id: Optional[int] = None

# Initialize OpenAI
openai.api_key = 'your_openai_api_key'

@app.on_event("startup")
async def startup():
    await database.connect()

@app.on_event("shutdown")
async def shutdown():
    await database.disconnect()

@app.post("/send_message", response_model=AIResponse)
async def send_message(request: MessageRequest):
    # One parameterized INSERT, reused for both sides of the exchange
    query = "INSERT INTO messages (session_id, sender, message_text) VALUES (:session_id, :sender, :message)"
    await database.execute(query=query, values={"session_id": request.session_id, "sender": "User", "message": request.message})

    ai_response = get_openai_response(request.message)

    # Record AI's response (tagged as 'AI' so it is not stored as a user message)
    await database.execute(query=query, values={"session_id": request.session_id, "sender": "AI", "message": ai_response})

    return AIResponse(ai_response=ai_response, session_id=request.session_id)

def get_openai_response(message: str) -> str:
    # Legacy Completions interface: requires the pre-1.0 openai package
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=message,
        temperature=0.7,
        max_tokens=150,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].text.strip()

Okay, great! So now we’re just saving messages. But that doesn’t help us keep track of sessions or chat history. Let’s update our send_message endpoint to check if a session_id is provided.

from datetime import datetime, timezone

@app.post("/send_message", response_model=AIResponse)
async def send_message(request: MessageRequest):
    if request.session_id is None:
        # No session_id provided, create a new session
        create_session_query = "INSERT INTO sessions (user_id, start_time) VALUES (:user_id, :start_time) RETURNING session_id"
        # Use a timezone-aware timestamp to match the TIMESTAMP WITH TIME ZONE column
        session_id = await database.execute(query=create_session_query, values={"user_id": request.user_id, "start_time": datetime.now(timezone.utc)})
    else:
        # Use the provided session_id
        session_id = request.session_id

    # Record user message in the database
    insert_message_query = "INSERT INTO messages (session_id, sender, message_text) VALUES (:session_id, :sender, :message)"
    await database.execute(query=insert_message_query, values={"session_id": session_id, "sender": "User", "message": request.message})

    # Get AI response
    ai_response = get_openai_response(request.message)

    # Record AI's response in the database, tagged with the 'AI' sender
    await database.execute(query=insert_message_query, values={"session_id": session_id, "sender": "AI", "message": ai_response})

    # Return the session_id so the client can send it with the next message
    return AIResponse(ai_response=ai_response, session_id=session_id)

Progress is clear: we’re now adeptly saving messages and tracking conversations. You might wonder, ‘How do we ensure conversations continue smoothly, especially when revisiting past interactions?’ A valid concern, indeed! To address this, let’s enhance our setup by incorporating a Pydantic model specifically designed for managing message history.

class MessageHistory(BaseModel):
    sender: str
    message_text: str
    created_at: datetime

Now, once again, update our send_message endpoint:

@app.post("/send_message", response_model=AIResponse)
async def send_message(request: MessageRequest):
    try:
        # Manage session existence
        if request.session_id is None:
            query = """
            INSERT INTO sessions (user_id, start_time) VALUES (:user_id, CURRENT_TIMESTAMP)
            RETURNING session_id
            """
            session_id = await database.execute(query=query, values={"user_id": request.user_id})
        else:
            session_id = request.session_id

        # Fetch the last 5 messages for context - helps maintain continuity
        recent_messages_query = """
        SELECT sender, message_text FROM messages
        WHERE session_id = :session_id
        ORDER BY created_at DESC
        LIMIT 5
        """
        recent_messages = await database.fetch_all(query=recent_messages_query, values={"session_id": session_id})

        # Building context from recent messages; the query returns newest first,
        # so reverse it to present the exchange in chronological order
        context = " ".join([f"{message['sender']}: {message['message_text']}" for message in reversed(recent_messages)])

        # Forming a context-rich prompt for the LLM
        full_prompt = f"{context} User: {request.message}"
        ai_response = get_openai_response(full_prompt)

        # Logging the conversation
        messages_query = """
        INSERT INTO messages (session_id, sender, message_text)
        VALUES (:session_id, :sender, :message_text)
        """
        await database.execute(query=messages_query, values={"session_id": session_id, "sender": "User", "message_text": request.message})
        await database.execute(query=messages_query, values={"session_id": session_id, "sender": "AI", "message_text": ai_response})

        return AIResponse(ai_response=ai_response, session_id=session_id)
    except Exception as e:
        # Consider more specific exception handling and logging
        raise HTTPException(status_code=500, detail=str(e))

By including the sender’s role in the context string, you give the LLM additional information about the dialogue structure, which can help it generate responses that are more in line with the conversation’s flow. This approach mimics a more natural conversation pattern, where each participant’s contributions are clearly identified.
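
As a small illustration of that idea, the sketch below isolates the prompt-building step into a pure function so it can be tested without a database or an API key. The function name build_prompt, the newline separator, and the trailing "AI:" cue are my assumptions for illustration; the endpoint above joins with spaces and has no cue.

```python
def build_prompt(recent_rows, new_message):
    """Format stored history plus the new user message as one prompt string.

    recent_rows: list of dicts with 'sender' and 'message_text' keys, newest
    first (the order produced by ORDER BY created_at DESC).
    """
    # Reverse so the model reads the exchange in the order it happened.
    chronological = list(reversed(recent_rows))
    lines = [f"{row['sender']}: {row['message_text']}" for row in chronological]
    lines.append(f"User: {new_message}")
    # Assumption: a trailing "AI:" cue nudges a completion model to answer as the assistant.
    lines.append("AI:")
    return "\n".join(lines)


# Example usage with two stored messages, newest first
rows = [
    {"sender": "AI", "message_text": "Paris."},
    {"sender": "User", "message_text": "What is the capital of France?"},
]
prompt = build_prompt(rows, "And Spain?")
```

Putting each turn on its own line keeps the speaker labels easy to tell apart, which is the point of including the sender in the first place.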

Complete Solution

And there you have it! If you’ve skimmed down to this point, no worries. We’ve outlined a straightforward chatbot utilizing minimal external tools, ensuring clarity without resorting to complex abstractions (🦜). This foundational setup equips you with everything necessary to develop a session-based chatbot designed to enhance user experience without causing frustration.

Consider this a starting point. As you become more comfortable, you can expand and tailor the chatbot to meet specific needs or incorporate advanced features. Remember, the core principles covered here are adaptable and scalable.

I hope this guide proves useful in your development journey. Should you encounter any challenges or wish to share feedback, I’m happy to help and connect!

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import databases
import openai
from datetime import datetime

DATABASE_URL = "postgresql://user:password@localhost/dbname"
database = databases.Database(DATABASE_URL)

app = FastAPI()

# Define Pydantic models for structured data exchange
class MessageRequest(BaseModel):
    user_id: int
    session_id: Optional[int] = None
    message: str

class AIResponse(BaseModel):
    ai_response: str
    # Ensure the session_id is included in the response for continuity
    session_id: Optional[int]

# Model for tracking message history not immediately used but defined for future expansion
class MessageHistory(BaseModel):
    sender: str
    message_text: str
    created_at: datetime

# Initialize OpenAI with your API key
openai.api_key = 'your_openai_api_key'

@app.on_event("startup")
async def startup():
    await database.connect()

@app.on_event("shutdown")
async def shutdown():
    await database.disconnect()

@app.post("/send_message", response_model=AIResponse)
async def send_message(request: MessageRequest):
    session_id = request.session_id
    if session_id is None:
        # Create a new session if one is not provided
        session_query = "INSERT INTO sessions (user_id, start_time) VALUES (:user_id, CURRENT_TIMESTAMP) RETURNING session_id"
        session_id = await database.execute(query=session_query, values={"user_id": request.user_id})

    # Fetch recent messages for context
    recent_messages_query = """
    SELECT sender, message_text FROM messages
    WHERE session_id = :session_id
    ORDER BY created_at DESC
    LIMIT 5
    """
    recent_messages = await database.fetch_all(query=recent_messages_query, values={"session_id": session_id})

    # Build context from recent messages; the query returns newest first,
    # so reverse it to present the exchange in chronological order
    context = " ".join([f"{message['sender']}: {message['message_text']}" for message in reversed(recent_messages)])

    # Create a context-rich prompt for the LLM
    full_prompt = f"{context} User: {request.message}"
    ai_response = get_openai_response(full_prompt)

    # Log the conversation in the database
    message_query = "INSERT INTO messages (session_id, sender, message_text) VALUES (:session_id, :sender, :message_text)"
    await database.execute(query=message_query, values={"session_id": session_id, "sender": "User", "message_text": request.message})
    await database.execute(query=message_query, values={"session_id": session_id, "sender": "AI", "message_text": ai_response})

    # Return the AI's response along with the current session_id for continuity
    return AIResponse(ai_response=ai_response, session_id=session_id)

def get_openai_response(message: str) -> str:
    # Legacy Completions interface: requires the pre-1.0 openai package
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=message,
        temperature=0.7,
        max_tokens=150,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].text.strip()

Simplifying RAG with PostgreSQL and PGVector

Levi Stringer — Tue, 23 Jan 2024 05:16:15 GMT

Image generated by DALLE Jan 2024

Authors: Levi Stringer

Source code: https://github.com/levi-katarok/simplified-rag

Building an application powered by Retrieval Augmented Generation (RAG) can be difficult, time-consuming, and expensive. Spending a lot of time in the LLM space, you begin to crave simplicity and avoid bloated libraries. When building a RAG application, I wanted only the necessities to get started. To achieve this I’ve begun using PGVector within PostgreSQL. PostgreSQL is a tried and true database, and it is a good fit if, like me, you don’t want to manage several database providers and want to get started quickly. These are the preliminary steps in getting the right context to give to an LLM. Here is a fantastic article if you’re unfamiliar with the concepts of RAG.

Goal

Develop a basic application that:

Extracts text from documents (e.g., PDFs).
Converts text into embeddings using a pre-trained model.
Stores embeddings in a PostgreSQL database using PG Vector.
Queries the database to find documents whose embeddings are most similar to the embedding of a query text.

This post outlines the process of transforming textual content from PDF documents into vectorized forms and querying them for similarity, using PostgreSQL with PG Vector and SQLAlchemy.

This will be useful for those who prefer sticking to a single database like PostgreSQL and avoiding the overhead of juggling multiple database systems. We’ll be exploring a straightforward example that demonstrates how to create embeddings from text data and utilize PostgreSQL for efficient querying. A simplified overview can be seen below.

Interaction between components (Application, PostgreSQL, OpenAI)

Setup and Requirements

PostgreSQL Database: Make sure you have PostgreSQL installed and running. This method assumes you’re working with a local or remote PostgreSQL instance where you have the privileges to install extensions and create tables. Ensure you have PG Vector installed in your PostgreSQL database; enabling it usually requires superuser access. Once the extension is installed on the server, connect to your database and run the following to enable it (the statement does nothing if it is already enabled):

CREATE EXTENSION IF NOT EXISTS vector;

Python Environment: A Python environment (preferably Python 3.6 or newer) is essential for running the scripts. I highly recommend using a virtual environment for your project to manage dependencies efficiently and avoid conflicts with system-wide packages. If you’re unfamiliar with Python virtual environments, they allow you to create isolated spaces on your machine, tailor-made for your project’s requirements. To set up a virtual environment, run:

python3 -m venv myenv
source myenv/bin/activate  # On Windows, use `myenv\Scripts\activate`

Libraries and Modules:
sqlalchemy for interacting with the PostgreSQL database.
pgvector.sqlalchemy for integrating PG Vector with SQLAlchemy.
openai for calling OpenAI's embedding models for vectorization. (Read the embedding model description below.)
pypdf for reading PDF documents.
pandas for handling data manipulations.
numpy for numerical operations.

Note: You can do this with simple Python lists instead of pandas and numpy. In general NumPy arrays use less memory than Python lists for numerical calculations. We aren’t doing any calculations on our embeddings here, I’ve just gotten in the habit of using them.

Embedding Model: I’ve used OpenAI’s text-embedding-ada-002 model. You’ll need an API key from OpenAI to use their models for generating embeddings for this example. You can sign up and get an OpenAI API key here. Note that this process will work for any embedding model you like. The choice of embedding model affects the dimensionality and quality of your data embeddings, so select one that aligns with your application's requirements. Other paid providers include Cohere, Hugging Face, and Google’s Vertex AI, and there is a list of open source models here.

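To install these libraries:

# The PyPI package is pgvector (it provides the pgvector.sqlalchemy module);
# psycopg2-binary is the driver SQLAlchemy uses by default for postgresql:// URLs
pip install numpy pandas sqlalchemy pgvector openai pypdf psycopg2-binary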

When using embedding models, it’s vital to adjust the N_DIM parameter to match the dimensionality of the embeddings produced by your chosen model. For OpenAI's text-embedding-ada-002 model, the dimension is 1536. This information is crucial when defining your database schema to store embeddings correctly. If you opt for a different model, consult the model's documentation to find the correct dimensionality and adjust the N_DIM in your code accordingly. This ensures that your database is correctly set up to store the vector data without loss or truncation.
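
One way to catch a mismatch early is to check vector lengths before anything reaches the database. The sketch below is a minimal, hypothetical guard (the function name check_embedding_dimensions is mine, not part of any library), assuming embeddings arrive as plain Python lists.

```python
N_DIM = 1536  # output size of text-embedding-ada-002, as stated above


def check_embedding_dimensions(embeddings, n_dim=N_DIM):
    """Return the embeddings unchanged, or raise ValueError on a wrong length."""
    for index, embedding in enumerate(embeddings):
        if len(embedding) != n_dim:
            raise ValueError(
                f"embedding {index} has {len(embedding)} dimensions, expected {n_dim}"
            )
    return embeddings


# Example usage: a correctly sized vector passes straight through
vectors = check_embedding_dimensions([[0.0] * N_DIM])
```

Failing in Python with a clear message is usually easier to debug than an insertion error surfacing from the database layer.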

Troubleshooting Common Issues

As you embark on setting up your RAG application, you might encounter a few common hurdles:

PG Vector Installation Issues: If you run into problems while installing PG Vector, ensure you have the necessary permissions on your PostgreSQL instance. Sometimes, cloud-hosted databases require specific steps to enable extensions. Check your hosting provider’s documentation for details.
Python Dependency Conflicts: In case of conflicts between installed libraries, ensure that you’re using a virtual environment as recommended. If issues persist, try updating your packages to their latest versions, or consult the package documentation for compatibility information.
API Key Security: Keep your OpenAI API key secure. Avoid hard-coding it into your scripts. Instead, use environment variables or secure vaults to store sensitive information. Don’t push your API key to GitHub!
Database Connection Errors: Ensure your database connection strings are correct. Connection issues often arise from incorrect credentials, hostnames, or firewall settings blocking access to your database.
Model Dimensionality Mismatch: If you encounter errors related to the size of the data being inserted into the database, double-check the N_DIM parameter against your embedding model's output dimensionality. Mismatches here can lead to data truncation or insertion failures.
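
For the API key point in particular, here is a minimal sketch of reading the key from an environment variable instead of hard-coding it. The helper name load_api_key is hypothetical; OPENAI_API_KEY is the conventional variable name, but any name works as long as your deployment sets it.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

You would then assign the result wherever the examples in this post use a placeholder string, for instance openai.api_key = load_api_key().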

Part 1: Extracting Text

For this example I’ve chosen to demonstrate how to extract text from a PDF. After Part 1, Step 1, the process remains exactly the same for text extracted from any document format: .txt, .docx, .html, etc. So for us, the first part of our process involves reading a PDF document and extracting its text. This is a crucial step, as the extracted text will be used for generating embeddings.

Step 1: Reading a PDF Document

We’ll use the pypdf library to read a PDF document from your local machine. PDF parsing has been a battleground for a long time, and there are many different libraries that can do it (this post outlines a few). I chose pypdf as I found it had the largest community support.

from pypdf import PdfReader

def read_pdf(file_path):
    pdf_reader = PdfReader(file_path)
    text = ""

    for page in pdf_reader.pages:
        extracted_text = page.extract_text()
        if extracted_text:  # Check if text is extracted successfully
            text += extracted_text + "\n"  # Append text of each page

    return text

# Example usage
pdf_text = read_pdf("path_to_your_pdf.pdf")
print(pdf_text)

Step 2: Splitting Text into Manageable Chunks

After extracting the text, we’ll split it into smaller, manageable chunks. This is particularly useful for large documents, as embedding generation typically has a limit on the text length. For this tutorial I’ve just used a naive chunking strategy; this article by Roie Schwaber-Cohen outlines why and when different strategies should be used. TL;DR: it will take some experimentation to figure out the best way to chunk your documents.

def split_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    end = 0

    while end < len(text):
        end = start + chunk_size
        if end > len(text):
            end = len(text)
        chunks.append(text[start:end])
        start = end - overlap  # Overlap chunks

    return chunks

# Example usage
text_chunks = split_text(pdf_text)
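
The character-based splitter above can cut a word in half at a chunk boundary. As one possible refinement (a sketch of a different strategy, not the one used in the rest of this post), the variant below backs each boundary up to the nearest space. The name split_text_on_words is hypothetical, and it assumes words are separated by single spaces and are shorter than chunk_size.

```python
def split_text_on_words(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters, breaking at spaces."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text) and text[end] != " ":
            # The cut would land inside a word: back up to the previous space.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break
        # Start the next chunk at a word boundary inside the overlap window,
        # always moving forward so the loop terminates.
        window_start = max(end - overlap, start + 1)
        space = text.find(" ", window_start, end)
        start = space + 1 if space != -1 else end
    return chunks


# Example usage
sample_chunks = split_text_on_words("alpha beta gamma delta epsilon", chunk_size=12, overlap=0)
```

With overlap greater than zero, the tail words of one chunk are repeated at the head of the next, which serves the same purpose as the character overlap in split_text.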

Part 2: Generating Text Embeddings

After splitting the text into manageable chunks, we’ll generate embeddings for each chunk using OpenAI’s text embedding model. This involves sending the chunks to the model and receiving a vector representation for each. The vector dimension for OpenAI’s text-embedding-ada-002 embeddings is 1536.

import openai

# Legacy interface: openai.Embedding.create requires the pre-1.0 openai package
openai.api_key = 'your_openai_api_key'

def generate_embeddings(text_chunks):
    embeddings = []
    for chunk in text_chunks:
        response = openai.Embedding.create(
            input=chunk,
            model="text-embedding-ada-002"
        )
        embeddings.append(response['data'][0]['embedding'])
    return embeddings
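
The loop above makes one API request per chunk. Since the embeddings endpoint also accepts a list of inputs, a common refinement is to send chunks in batches. The sketch below shows only the batching logic and leaves the API call behind a hypothetical embed_batch callable (a function you would write that takes a list of strings and returns one vector per string), so the example runs without network access. The default batch_size of 100 is an arbitrary assumption.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_embeddings_batched(text_chunks, embed_batch, batch_size=100):
    """Embed chunks batch by batch, preserving input order.

    embed_batch is a hypothetical callable: list[str] -> list[list[float]].
    """
    embeddings = []
    for batch in batched(text_chunks, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings


# Example usage with a stand-in embedder that maps each text to its length
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

demo_vectors = generate_embeddings_batched(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
```

Keeping the API call behind a callable also makes it easy to swap embedding providers later without touching the batching code.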

Part 3: Storing Embeddings in PostgreSQL

To store the embeddings in PostgreSQL, we’ll first need to ensure the PG Vector extension is enabled in our database. Then, using SQLAlchemy and pgvector.sqlalchemy, we’ll create a table to hold our text embeddings. Typically you’ll also want to store the content in some fashion to augment your LLM prompt. This can be useful for citations as well. I used N_DIM=1536 here as that is the output dimension of OpenAI’s text-embedding-ada-002 model.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from pgvector.sqlalchemy import Vector
import numpy as np

Base = declarative_base()
N_DIM = 1536

class TextEmbedding(Base):
    __tablename__ = 'text_embeddings'
    id = Column(Integer, primary_key=True, autoincrement=True)
    content = Column(String)
    # Wrap the pgvector type in Column so SQLAlchemy maps it to a table column
    embedding = Column(Vector(N_DIM))

# Connect to PostgreSQL
engine = create_engine('postgresql://user:password@localhost/dbname')
Base.metadata.create_all(engine)

# Create a session
Session = sessionmaker(bind=engine)
session = Session()

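To insert embeddings along with the text they were generated from:

def insert_embeddings(text_chunks, embeddings):
    # Pair each chunk with its embedding so the content can be reused in prompts and citations
    for content, embedding in zip(text_chunks, embeddings):
        new_embedding = TextEmbedding(content=content, embedding=embedding)
        session.add(new_embedding)
    session.commit()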

Part 4: Querying for Similar Text Embeddings

Finally, we’ll query the stored embeddings to find similar text. PG Vector supports various similarity and distance metrics that can be leveraged for this purpose.

In the find_similar_embeddings function, k and similarity_threshold refine the search for embeddings similar to a given query:

k (default 5): Specifies the maximum number of results to retrieve. Setting k = 5 means the query will return at most five embeddings that pass the distance filter. It controls the breadth of the search results, balancing between relevance and quantity.
similarity_threshold (0.7): Despite its name, this is a maximum cosine distance, not a minimum similarity. Cosine distance is 1 minus cosine similarity, so smaller values mean closer matches, and 0 means the vectors point in the same direction. A threshold of 0.7 keeps only embeddings whose distance to the query is below 0.7; lowering it makes the filter stricter, and raising it admits weaker matches.

Together, k and similarity_threshold fine-tune the search by determining the quantity and relevance of the returned embeddings, allowing for a balanced retrieval based on the specific needs of the application.

def find_similar_embeddings(query_embedding, k=5, similarity_threshold=0.7):
    # Cosine distance between each stored embedding and the query (smaller is closer)
    distance = TextEmbedding.embedding.cosine_distance(query_embedding)
    return (
        session.query(TextEmbedding, distance.label("distance"))
        .filter(distance < similarity_threshold)
        .order_by(distance)
        .limit(k)
        .all()
    )

This method allows you to pass a query embedding and retrieve the most similar embeddings from the database, based on a specified similarity threshold and ordered by similarity.
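
To see what the query computes without a database, here is a pure-Python sketch of the same filter, sort, and limit steps over an in-memory list. The names cosine_distance, top_k_similar, and max_distance are mine for illustration; in the real setup PG Vector does this work inside PostgreSQL.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_similar(query, rows, k=5, max_distance=0.7):
    """Return up to k (content, distance) pairs, closest first.

    rows is a list of (content, embedding) pairs held in memory.
    """
    scored = [(content, cosine_distance(query, embedding)) for content, embedding in rows]
    # Keep only rows closer than the threshold, mirroring the .filter() call
    kept = [pair for pair in scored if pair[1] < max_distance]
    # Smallest distance first, mirroring .order_by(distance)
    kept.sort(key=lambda pair: pair[1])
    return kept[:k]


# Example usage with tiny two-dimensional vectors
rows = [
    ("same direction", [1.0, 0.0]),
    ("close", [1.0, 1.0]),
    ("orthogonal", [0.0, 1.0]),
    ("opposite", [-1.0, 0.0]),
]
matches = top_k_similar([1.0, 0.0], rows)
```

Here the orthogonal and opposite vectors have distances of 1 and 2, so the 0.7 threshold drops them and only the two nearby vectors come back.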

Conclusion

By following these steps, you can efficiently transform textual content into vectorized forms and perform similarity queries using PostgreSQL, PG Vector, and SQLAlchemy. Hopefully this setup simplifies the architecture for applications requiring text similarity searches while leveraging a free database such as PostgreSQL.

To further enhance this solution, consider integrating more advanced text preprocessing, experimenting with different embedding models, and optimizing database queries for performance.

Remember to replace placeholder values (e.g., database connection strings, OpenAI API key) with your actual configuration details.

Please share your thoughts, questions, or any corrections in the comments below. I hope this was helpful!