LLM Snapshot Architecture

Blueprint for Deterministic, Scalable, Environmentally Friendly, and Safe Use of GenAI

Vadim Berman
Tisane Labs
14 min read · Sep 28, 2023

--

It’s been almost a year since ChatGPT and Generative AI took the world by storm.

Since then, we’ve witnessed arguments, promises, lobbying road shows, LinkedIn becoming even less usable, Microsoft getting nostalgic for Clippy, venture capitalists forgetting about crypto and Web 3.0, and yet not too many significant changes outside the tech world. Even the spam quality didn’t improve much. Boo!

Can Generative AI live up to its transformative promises and, if yes, how?

Roadblocks

Economics

LLM Benchmarks is a handy little portal built by LLM Monitor that compares responses to various prompts sent to the major available GenAI foundational models. The prompts vary from telling jokes (look up “penguin” and “tractor” here for some food for thought) to explanations, political questions (spoiler: Facebook won’t be allowed in China anytime soon), and standard NLP tasks like entity extraction.

A simple task like extracting the entity Nike from a short sentence takes over half a second; in some cases, as long as 20+ seconds. “ChatGPT, make a joke starting with ‘1970s called’.”

Yes, much is happening under the hood. Yes, every one of these foundational models is an amazing technological feat. But in the real world, barring a handful of scenarios, monetizable tasks are specific and narrow: find this, generate that. How many tasks in the same LLM Monitor portal can be a significant source of stable revenue in the real world? You got it, NLP and maybe coding.

I find it hard to believe that the same large customers will require generating haiku about crocodiles in space, writing Python code, and discussing recent advances in genetics. Theoretically, some of them may need all that, but how many and how often? With a few exceptions (e.g. drafting legal documents), how many small, private customers will ever pay to generate cookie-cutter content? Being ridiculously expensive and damaging to the environment to train, maintain, and host does not help the cause, either.

We already have jacks-of-all-trades able to talk about a variety of subjects (sometimes being right), slowly solve a variety of tasks (sometimes correctly), and need a drink after answering 5 questions. They are called “humans”.

Why do we need a software equivalent that is also slow, unreliable, expensive, 99% full of irrelevant knowledge, and has 8,237,124 new ways of doing something horribly wrong? Someone has to pay for all the labeling and data processing, so why should I pay for the labeling and training costs of crocodiles in space if I want to parse code? How will a company that paid $100m to label fiction and news articles (ignoring copyright suits for the sake of simplicity) recoup its expenses if the vast majority of its paying customers use its software for other needs?

Most users just want their issue to go away, as fast and as cheaply as possible. Technology does not have to copy nature. A plane neither flaps its wings nor sounds like a nightingale: it’s not about creating a super-bird, it’s about a better way to get from point A to point B. AGI is a concept sourced in fiction; fiction rarely bothers to look into the economics.

Specialized tools usually beat generic ones. It’s more about engineering universals than about perfecting the technology: were it not the case, we’d all be using the Wenger 16999 Giant Swiss Army Knife, and smartphones would be used to film Marvel blockbusters.

“Where is the flugelhorn?”, asks a reviewer. “It is actually the bike horn that is located between the warp drive and crème brûlée torch”, replies another.

Optimization: Curb Your Enthusiasm

Every engineer trying to tune a slow piece of machinery sooner or later has to confront the merciless reality of diminishing returns: it starts promising, but very soon you run into a brick wall, and subsequent titanic efforts result in only minor improvements.

And then there’s the uncomfortable question: is there even an incentive for the big players to make it more economical?

Think about the transformer architecture and LLMs: they have been around for years. They are resource-hungry to the point of bankrupting companies. Yet the solution to make it all faster amounts to more GPUs and more cloud resources. With cloud hosting and GPUs being a major source of income for the large players in the space, why would they complain? Instead, they can provide freebie credits to the new cloud addicts with ever-increasing consumption needs, and grant 0.000001% of their revenue to an NGO to offset the environmental damage caused by the insatiable thirst for processing power.

When Old is Good Enough: How Many Will Upgrade?

Paper checks are still widely used in the US (often in conjunction with banking apps equipped with sophisticated OCR to scan these checks).

Japan still relies on fax machines.

8" floppy disks were used to control nuclear launches until 2019 (with the code being 00000; go on, google it).

In 2020, 1970s mainframes were still processing unemployment checks in the US.

While these examples may be extreme, technological superiority is rarely a guarantee of adoption:

  1. Upgrading “Stuff That Works” is a daring and thankless office adventure.
  2. If the gains are minor (let alone uncertain), then considering the disruption and the upgrade expenses, it’ll likely be a net loss.

“Oh, but our system is 5% more accurate!” So the customer needs 2 fewer people paid minimum wage, great! All at the price of 3 developers and 2 executives spending a net of 3 weeks each, plus hardware upgrade expenses. Not to worry, in 2 years it’ll pay for itself. (If everything works as advertised, there are no new issues, and the implementation is done properly.)

“Oh, but it’ll make your products more competitive!” If the cost of production is higher, then at least one aspect of competitiveness will suffer. And will it move the needle, really? Some people indeed could have moved to Bing from Google (I don’t know anyone personally, but let’s assume it happened), but with all the expenses, did the effort actually pay off?

In other words, GenAI will need to provide outstanding benefits with none of its current drawbacks to justify the upgrades. (Unless it’s a “corporate Louis Vuitton” type of project to show off “innovation”.)

Any Sufficiently Reliable System is Distinguishable from Magic

Sam Altman in his recent interview with Marc Benioff said:

Hallucinations are part of the “magic” of generative AI

“One of the sort of non-obvious things is that a lot of value from these systems is heavily related to the fact that they do hallucinate. If you want to look something up in a database, we already have good stuff for that.”

With all the early talk about mitigating hallucinations, this sounds like an acceptance that the issue will not be solved anytime soon.

Large enterprises are still reluctant to store large amounts of their content anywhere but on-prem. High-profile hacking incidents don’t help to alleviate their anxieties. Now imagine the reaction of the same people to a suggestion to use a system as consistent as a Magic 8 Ball in production. “Can we test the reliability across the board?” No: “LLMs are unpredictable”, says, verbatim, a company providing testing services for GenAI. GenAI is also really, really slow and expensive to run.

In other words, wide adoption of GenAI in its current form in the enterprise is in question.

One Weird Trick: GenAI as Tool Factory

Am I writing it all off? Will Generative AI end up like Wenger 16999, usable only in a fraction of use cases?

I surely hope not, because we use it.

And unlike with most standard use cases, it is making a significant difference for us. Better yet, the way we use GenAI may solve its own nagging headaches: throughput, costs, repeatability, explainability, misuse, and environmental damage.

We can look at the situation in a slightly different way.

  1. There is a versatile but unreliable tech that can (kinda-sorta) solve a wide range of arbitrary tasks.
  2. There is a bunch of tasks, often narrowly-defined and pedestrian, that we want to solve better.

Except the powerful unreliable tech does not have to be the same tech that the end users interact with. What if we use Generative AI (tech 1) to generate tech 2 (the tech to solve the narrow tasks)? Tech 2 may be light, deterministic, consistent, and explainable. For a tangible physical analogy, replace the Wenger 16999 with a “tool factory” that quickly generates single-use tools on the fly (knives, scissors, etc.).

Very much like Java and .NET compile to an intermediate bytecode, we can create small, targeted subsets of “LLM wisdom”.
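To make the pattern concrete, here is a minimal sketch in Python, with a hypothetical ask_llm callable standing in for whichever provider is used: the LLM is consulted once, at design time, its answers are validated and frozen into a small artifact, and the runtime never calls the LLM at all.

```python
import json
from typing import Callable

def build_snapshot(concepts: list[str],
                   ask_llm: Callable[[str], str]) -> dict[str, str]:
    """Design time: ask the LLM constrained, machine-checkable questions,
    validate the answers, and freeze the result into a snapshot."""
    snapshot = {}
    for concept in concepts:
        answer = ask_llm(
            f"Answer with one lowercase word only: is '{concept}' "
            f"an animal, a plant, or an artifact?"
        ).strip().lower()
        if answer in {"animal", "plant", "artifact"}:
            snapshot[concept] = answer
        # anything else goes to human review instead of the snapshot
    with open("snapshot.json", "w") as f:
        json.dump(snapshot, f)
    return snapshot

def classify(concept: str) -> str | None:
    """Runtime: no LLM anywhere - a deterministic, explainable lookup."""
    with open("snapshot.json") as f:
        return json.load(f).get(concept)
```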

No, we’re not talking about training another ML engine. While there is some interesting work done in that direction, some original issues still remain, the process is not cheap, and there are better ways.

It will take far fewer resources to generate a condensed formalism and proofread a domain-constrained knowledge base. Yes, as in expert knowledge base / “good old-fashioned AI” / rule-based systems.

It’s the 21st century, Generative AI is all the rage, even pre-transformer architecture is considered obsolete; am I out of my mind?

Well… think of recent history. Neural networks went in and out of fashion for decades, until better hardware and other factors aligned.

Symbolic AI’s Achilles’ heel has historically been the human factor. Humans have to be tutored and supervised, and, to make it even worse, the subject matter experts in many areas may not be comfortable with hard logic; not everyone is a coder. Their output is often full of holes. Employing humans to process large amounts of data is also expensive, even in developing markets, let alone difficult to manage.

Just like the early 2010s brought GPUs, cloud, and big data, symbolic AI now has an additional tool at its disposal: generic, semi-skilled Generative AI, acting like “infinite interns” and serving as a smart repository of general knowledge. A tool occupying a niche between more versatile but slower humans and less versatile but more reliable deterministic software.

Mass producing expert systems is not far from where the mainstream Generative AI startup economy is headed: the bulk of Y Combinator hopefuls pitch GenAI copilots. The value of these startups is in wrapping and honing the GenAI for the task they focus on.

Case Study: Generating Better NLP Engines

Even though it’s mostly employed for content moderation and law enforcement applications, Tisane is more than “a filter for nasties”. Under the hood, Tisane is a spaCy-like complete NLU platform with a simple-to-grasp principle:

  1. Text is ingested.
  2. A graph of disambiguated word senses and relations between them is generated.
  3. Higher-level patterns based on a proprietary formalism (Finite State Automata where the word-sense is a unit) are executed over the graph.

These “higher-level patterns” can be virtually anything: entities, sentiment, disambiguation cues, or detection of problematic content. Meaning, the platform is generic in nature and can go beyond the standard set of entities. Because the formalism is efficient, every one of these higher level patterns can contain a human-readable explanation.

Example of a higher-level extraction pattern that detects utterances like “ping me on FB”, “message them on Instagram”, etc. across languages
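For a rough idea of how such a pattern works, here is a drastically simplified toy in Python. It is not Tisane’s actual formalism; it merely illustrates matching concept-level steps against a sequence of disambiguated word senses, so that one pattern covers “ping me on FB”, “message them on Instagram”, and so on.

```python
from dataclasses import dataclass

@dataclass
class Sense:
    lemma: str
    hypernyms: frozenset  # ancestors in the concept taxonomy

def is_a(sense: Sense, concept: str) -> bool:
    return sense.lemma == concept or concept in sense.hypernyms

# hypothetical pattern: a contact verb, then a person, then a social medium
PATTERN = ("contact_verb", "person", "social_medium")

def matches(senses: list[Sense]) -> bool:
    """True if the pattern steps occur in order among the word senses."""
    it = iter(senses)
    return all(any(is_a(s, step) for s in it) for step in PATTERN)

# "ping me on FB" after disambiguation ("FB" resolved to Facebook):
utterance = [
    Sense("ping", frozenset({"contact_verb"})),
    Sense("me", frozenset({"person"})),
    Sense("on", frozenset()),
    Sense("Facebook", frozenset({"social_medium"})),
]
print(matches(utterance))  # True - and the pattern itself is the explanation
```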

The trouble, of course, is that assembling the language models is a long process. And so, today we focus on a few applications where finding and modifying training datasets rapidly is challenging (like content moderation or digital forensics). Still, we provide far more capabilities than our peers; think tweezers vs. shovels.

With GenAI, these language models are becoming comprehensive enough to handle a wider variety of tasks and expand the scope to a new set of tasks, reviving the dream of running arbitrary queries on text. A bit like word embeddings, but predictable, transparent, and editable. Not to mention, faster and cheaper to host than GenAI by a factor of a thousand or more.

What kind of additional NLP applications can be created?

  • Arbitrary structured JSONPath/XPath/SQL-style queries on text referring to concepts and word features: “Is this sentence mentioning any kind of domestic animal being exchanged for money?” “Are there Swiss German regionalisms in the text?” “Is the text a eulogy of a high-level military officer on active duty in which there is no mention of a prior illness?”
  • Assigning a range of dates when the text could have been created, based on the concepts and the wording of the text. E.g. “because it mentions ChatGPT, the text was created after December 2022”. (See the sketch after this list.)
  • Stylometry, determining the author(s), their demographics, or linguistic environment where they spent enough time to copy language and idioms.
  • Cost-effective deep analysis of communications in order to identify spam (both “hard spam” and “soft spam”, like outreach of outsourcing companies or even pig-butchering scams).
  • Explainable/annotated machine translation that allows the reader to deep-dive into a translation, explaining foreign concepts, word play, and nuances of use in the user’s native tongue; possibly complete with automatically generated illustrations or animations visually explaining the point, creating a new form of art.
  • Last but not least, building NLU or translation systems for languages and communication systems with scarce or non-existent training data. One day we’ll be able to generate a rough language model in a matter of hours or minutes from a series of prompts based on the info ingested by an LLM from dictionaries and textbooks.

And more.
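As a back-of-the-envelope sketch of the dating idea above (the introduction dates below are illustrative placeholders, not a real knowledge base):

```python
from datetime import date

# Illustrative placeholders; a real system would derive these from the KB.
CONCEPT_INTRODUCED = {
    "ChatGPT": date(2022, 11, 30),
    "smartphone": date(1994, 8, 16),
    "World Wide Web": date(1989, 3, 12),
}

def earliest_plausible_date(concepts: set[str]) -> date | None:
    """The text cannot predate the newest concept it mentions."""
    known = [CONCEPT_INTRODUCED[c] for c in concepts if c in CONCEPT_INTRODUCED]
    return max(known) if known else None

print(earliest_plausible_date({"ChatGPT", "smartphone"}))  # 2022-11-30
```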

Putting It All Together

On one hand, there’s an NLP platform with a proprietary formalism and a word-sense oriented structure aligned across languages. On the other hand, there is GenAI that can understand a complex question but not necessarily answer it correctly.

The runtime remains the same: the structured data is easy to work with, predictable, and flexible enough for multiple applications.

GenAI is employed on the design side to take over tedious tasks that don’t require too much reasoning or analysis. Essentially, it looks up centuries’ worth of linguistic knowledge copied from textbook to textbook and from one dictionary to another. Its suggestions are then double-checked by human experts.

The tasks are as follows.

Verifying low-confidence portions of the lexicon. When constructing a new language model, we would use a combination of automatic import with some built-in cleanup logic and human review of ambiguous parts. This approach allows covering the important parts of the generic, colloquial lexicon.

With GenAI validation, it is possible to verify much more, and much faster. The number of calls we need to issue is a small fraction of the calls needed to label data. Even fewer, if we know which ones are proper nouns (which we do). Among 150,000 non-proper-noun concepts (a decent amount), maybe 20,000 are ambiguous enough to require verification; among these, even if GenAI singles out 20% (a pretty extreme result), it means 4,000 to quarantine and validate.
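A minimal sketch of that validation loop, again with a hypothetical ask_llm standing in for the provider call: GenAI only sees the ambiguous slice of the lexicon, and anything it flags is quarantined for human review rather than trusted blindly.

```python
from typing import Callable

def verify_lexicon(ambiguous: dict[str, str],
                   ask_llm: Callable[[str], str]) -> list[str]:
    """`ambiguous` maps a lemma to the gloss the importer assigned to it."""
    quarantine = []
    for lemma, gloss in ambiguous.items():
        verdict = ask_llm(
            f"Answer yes or no only: does the word '{lemma}' "
            f"commonly mean '{gloss}'?"
        )
        if not verdict.strip().lower().startswith("yes"):
            quarantine.append(lemma)  # a human gets the final say
    # e.g. 20,000 ambiguous entries, ~20% flagged => ~4,000 to review
    return quarantine
```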

Building proprietary formalism. The question we need answered is roughly, “if the structure is X in <already supported language>, what is the equivalent in <language we’re building>”?

This task has proven to be a bigger challenge than most. Despite all the bragging and the optimistic papers on arXiv about few-shot learning, the results are hit-and-miss. Our workaround was to focus on an intermediate representation that did not require learning our formalism.

Filling out missing portions of the lexicon. Not every word has a direct equivalent in every other language (how does one say “g’day” in Latin or Navajo?). With words having different word-senses in every language, the relationship is many-to-many, not one-to-one. Therefore, it makes no sense to look for a translation of every missing word.

Thankfully, with a data structure centered on concepts and a human-readable explanation for every word-sense, we know which parts of the lexicon are missing. We can also prioritize those parts and ignore some word-senses that can fall back to another close sense, if required.

It is worth noting that, unlike the task that involved selecting from a closed set, this one seems more challenging to most GenAI services.

Issues, Gotchas, and Workarounds

GenAI has its own vagaries and we had to come up with workarounds.

Prompt Building: Apples vs. Pears

Building prompts that actually work feels part like SQL query coding, part like breaking bad news to a manager with a short fuse. In other words, wording is important. Even if two prompts are semantically the same to you, they may not be the same to GenAI.

My favorite example: when trying to generate the Russian term for apple (the fruit, not the company), the word Russian was misspelled as Russain. GenAI understood the request, but in its response, the Russian word for pear was provided. When the misspelling was corrected, the apple became an apple again.

Colloquial Language

Assume that GenAI is utterly clueless about jargon in languages other than English. Not only dirty words and slurs, but often also jargon of the benign kind, especially if there is no agreed-upon translation to English. This part is best left to humans. (It’s not a big deal though, as the share of jargon, contrary to the common misconception, is not high as far as the count of words goes.)

Mix and Match

Until recently, GPT was an absolute champion as far as the tasks that mattered to us went. However, GPT-4 is quite expensive (many times more than GPT-3.5, for example), and others are catching up or even exceeding its knowledge in the areas that matter to us. Therefore, it’s a waste to use it for tasks where others can do more or less OK.

Because of commercial considerations, the terms and conditions that allow building specific types of applications vary. One day you may find yourself locked out. (In practice, with the initial fascination fading, it’s not likely they will chase after vertical application builders, but who knows!)

For these reasons, we created a set of adapter classes for different providers with a common abstract base class. We alternate the providers in use depending on what works best. For trickier cases, we envision a chorus of 2–3 GenAI providers (“artificial crowdsourcing”) to check for a consensus.
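The shape of that abstraction, sketched below; this is not our production code, and the provider calls are stubbed out:

```python
from abc import ABC, abstractmethod
from collections import Counter

class GenAIProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Send the prompt to the provider, return the raw completion."""

class ProviderA(GenAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call provider A's SDK here")

class ProviderB(GenAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call provider B's SDK here")

def chorus(prompt: str, providers: list[GenAIProvider],
           quorum: int = 2) -> str | None:
    """'Artificial crowdsourcing': accept an answer only on consensus."""
    votes = Counter(p.complete(prompt).strip().lower() for p in providers)
    answer, count = votes.most_common(1)[0]
    return answer if count >= quorum else None  # no consensus: escalate
```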

Content Filters

…are annoying. I realize the need to screen the prompts and the GenAI output. But too much of the logic is keyword-based. We had innocent queries like “what is the word in language X meaning ‘to exercise restraint’” being blocked. I wish the operators understood that when users don’t have much control over the result, a filter only has to misfire a couple of times to become a nuisance.

If you integrate GenAI into internal processes and don’t want accidental misfires, try to opt out.

Validation

GenAI is not the only way to automate. We already had nightly benchmarking and testing processes in place. But it never harms to add more checks if they are simple and straightforward enough.

Don’t Repeat Yourself

While using GenAI to augment KB applications is inexpensive, there is no sense in wasting money on repeat requests. Since the GenAI output needs to be validated by humans, repeat GenAI requests are even less desirable.

Store the history of calls in a table or a datastore, and check whether a call was already made before issuing another.
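A minimal sketch of such deduplication, keyed on the normalized prompt and backed by SQLite (the names are illustrative):

```python
import hashlib
import sqlite3

db = sqlite3.connect("llm_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS calls (key TEXT PRIMARY KEY, answer TEXT)")

def cached_ask(prompt: str, ask_llm) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    row = db.execute("SELECT answer FROM calls WHERE key = ?", (key,)).fetchone()
    if row:                   # already asked: no new cost, no new human review
        return row[0]
    answer = ask_llm(prompt)  # first and only paid call for this prompt
    db.execute("INSERT INTO calls VALUES (?, ?)", (key, answer))
    db.commit()
    return answer
```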

Other LLM Snapshot Applications

What other applications can be “3D-printed” or “co-piloted” in a similar way?

Pretty much any combination of a large-scale taxonomy and proprietary logic where a significant portion is publicly available content. For example:

  • proofreading invoices, e.g. do the amounts on these invoices make sense? GenAI can be prompted to provide approximate ranges for different items in different locations, and then the system will be able to figure out that $83.71 is too much for a cup of coffee (see the sketch below).
  • drug discovery, but based on an expert system
  • extracting common anomalies in OCR output and creating autocorrection rules
  • scanning large collections of text for logical contradictions

That is in addition to numerous domain-specific systems built on NLU platforms with structured output, like Tisane.
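As a sketch of the invoice item above: the ranges would be obtained from GenAI once and snapshotted (the figures below are made up), leaving a plain deterministic check at runtime.

```python
# item -> (low, high) in USD; a real system would key by location as well
PRICE_RANGES = {
    "coffee": (1.50, 8.00),
    "hotel night": (60.00, 450.00),
}

def audit_line(item: str, amount: float) -> str:
    low, high = PRICE_RANGES.get(item, (None, None))
    if low is None:
        return "unknown item: route to a human"
    if not (low <= amount <= high):
        return f"${amount:.2f} is outside the plausible range for {item}"
    return "ok"

print(audit_line("coffee", 83.71))  # flags: too much for a cup of coffee
```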

It is worth stressing that the result is only as good as the architecture of the intermediary application. Highly structured solutions are less a pile of black-box data scanned by classifiers and more a set of cogs, levers, and pulleys that needs to be architected, debugged, and maintained.

Conclusion

Despite its spectacular success, GenAI has serious roadblocks to clear and no obvious way of clearing them. With the astronomical amounts invested, the economics of GenAI are proving to be far from certain.

While some may find going back to old paradigms not to their taste, the commercial benefits of our approach are nothing short of life-changing for GenAI businesses. The solutions can be retrofitted into the most conservative of organizations.

We know that the last, technical part may be too short and high-level to answer all questions. If you’d like to know more, feel free to comment or contact us and we’ll be happy to help!
