Getting Work Done with GenAI: Just do the opposite — 10 contrarian rules that may actually work

Darren Oberst
11 min read · Sep 2, 2024


What if the keys to unlocking the value of generative AI were to do the exact opposite of conventional wisdom on how to use it?

Here are 10 contrarian ‘opposite’ lessons that may actually work:

  1. Use Small Models, Not Large Ones. The basic equation in favor of small model use is a statement in the form of 80% of the value/accuracy at 1% of the cost. We can debate the 80% as well as the 1%, but it is difficult to argue with the basic math: costs scale by orders of magnitude (for mega-models greater than 50 billion parameters compared to smaller models in the 1–10 billion parameter range) while accuracy scales by relatively incremental percentages, for most use cases. Small models share the same architecture (i.e., the same 'core underlying technology'), and are generally trained on substantially the same training materials, with training costs that are several orders of magnitude smaller. The impact of this equation is that small models are accessible to essentially every researcher, student, university, business, government, developer, domain expert and hobbyist across the world, and will be innovated, trained, fine-tuned and tested at a rate that will be difficult to match for larger models. Furthermore, as applications focus more on specific use cases and implementations of specialized "agents" and "functions", it is notable that most 'specialized' transformer-based encoders and classifiers are still in the 50–500 million parameter range, and demonstrate a solid ability to model most facets of language patterns at much smaller sizes. The incredible jump in 2021–2022 of OpenAI's DaVinci to 175 billion parameters was a "moon shot" that gave credence to the "bigger is always better" mantra, but one that with the passage of time should be viewed with more scrutiny: for most use cases, are models larger than 10 billion parameters worth it, or are there rapidly diminishing marginal returns? In other words, how efficiently can the core patterns of language be encoded and compressed using current architectures (e.g., 5–10 gigabytes?) … In short, don't look for the biggest model. Start with the opposite principle: keep it simple, easy to iterate and adjust, and always look for the smallest possible solution for the problem at hand. We see many consultants start their POCs with OpenAI for convenience, with the expectation of shifting to a small, open source model in production. We would recommend the opposite: start with the smallest possible model in the POC, and only move up to bigger models as needed.
  2. Use language models for language (closed-context), not knowledge (open-context). Most of the hype, and most of the problems, from language models come from mis-using them. Language models are great at language, yet we keep trying to use them for implicit knowledge and imputed understanding. There is an old joke/saying, "I am not a doctor, but I play one on TV." That pretty much sums it up. An actor could do a pretty compelling job of sounding like a doctor (and could probably pass a simple Turing test), but if you needed treatment, obviously, you want a doctor with real training in the foundational principles and concepts of medicine and the human body. After years of the GPT revolution, there is still scant evidence (or understanding) of how a language model encodes the structured ontologies and interconnections between concepts that we would typically regard as the preconditions for real knowledge and understanding. Despite this, language models are such extraordinary language pattern recognizers that they can easily create the impression of "knowing" and "understanding." Take the opposite view: expect that the language model does not know anything. Use language models with "closed context" grounded sources, and expect the model to read and analyze, rather than answer "open context" general inquiries (when the model is essentially a more interactive, stochastic, compressed plagiarizer of Wikipedia and other sources used in its training!). A lot of AI safety, hallucination, and copyright issues start to diminish once we focus on language, not knowledge, and use the model for what it is really designed for. (A short closed-context prompt sketch follows this list.)
  3. Short contexts, not long ones. One of the major developments over the last year has been increased flexibility in model architectures that permit extraordinarily large context windows. While large context windows make things easier for the developer (whew, don't have to worry about that, just throw it all in there!), is it the right way to go for using language models to solve real problems? Large context windows increase the opacity of the model's activity, and will almost inevitably lead to inaccuracies and "gap in the middle" kinds of challenges. Huge context windows also lead to a lot of sloppy data pipelines, challenges in reproducibility, and hoping for "magic" from the model, rather than building well-designed, detailed and repeatable solutions. It is difficult to encode and assess meaningful patterns as the chunk of language becomes too large, especially if the models lack formal encoding of concepts and knowledge: the statistical correlations between tokens that are very far apart are simply weaker and less predictable. While various training techniques can improve and create the appearance of "following" a large text, it seems almost inevitable that the longer the text, the more likely that errors will occur. So, do the opposite of the trend: keep context passages in the 500–1500 token range, e.g., a couple of pages of text, and accuracy/quality improves dramatically across virtually all models immediately (a simple chunking sketch follows this list). Usually the use of very large context windows is masking a real problem in the data pipeline; keep the context short, and focus on fixing the data pipeline instead.
  4. Short instructions, if any, not prompt magic. The mega-model makers have encouraged a really fun, but truly weird, phenomenon of prompt magic, and a debatable premise that language models should be guided entirely by long, natural language instructions. Anyone who has taught, coached, parented, or managed realizes that short, clear, repeatable instructions work better than long ones. If getting good results requires heavily bespoke, customized, lengthy prompts, generative AI will likely continue to struggle to deliver the consistency, auditability and portability required for most real-world use cases. Language models have not reversed the basic principles of technical, process or operational governance. Awesome 'prompt magic' is intrinsically model-specific and does not travel well. The amazing insight behind "few-shot" learning is likely over-blown. System instructions like "You are a good and helpful assistant" are outright silly. So, do the opposite: keep your instructions clear and simple. Small models with smaller context windows generally like it that way too, and you can build pipelines that are far more portable across different models. Models are going to keep evolving and changing, so keep prompts as simple as possible so that they can work across almost any model without change.
  5. Workflow, not chat. Chat has become synonymous with generative AI, but it is only one use case, and a really tough and limited one at that. Most hard automation problems are not solved with dialog. We believe that the killer use case for generative AI is centered on information compression: models can "read" nearly infinite amounts of information to help identify key signals and "needles in the haystack," or eliminate redundant extraction, document analysis and report preparation activities. Most workflow and automation is not chat or interaction-focused, but rather happens behind the scenes and integrates between multiple steps and decision points. Generative AI can be a killer application for a lot of "mundane" back-office automation. Furthermore, once you move beyond chat, the imperative of real-time response diminishes, which opens up a much wider range of computing architectures, and complex multi-step processes, that may unfold over a number of seconds (or even minutes). So do the opposite: stop squeezing all generative AI use cases into chatbots, and start picturing generative AI as just another tool in your software tool kit to build applications and workflows (a simple batch-workflow sketch follows this list). Moving beyond the constraints of chat opens up a lot of exciting and practical use cases.
  6. No sampling, no temperature: use deterministic generation. In our testing, if you want to reduce inaccuracies, inconsistencies and hallucination, turning off sampling is the best way to do it, and will usually lead to a 5–10% improvement in accuracy, as well as creating reproducible results. If you are using generative AI in truly creative contexts, then sampling is a big part of the fun, but for most 'enterprise' use cases, the result should be deterministic. Sampling started several years ago when the models were a lot less powerful, and sampling from a probability distribution (e.g., adding some randomness) led to a lot more interesting results, along with enabling strategies like beam search which led to higher quality outputs. If you are using the model for RAG or a fact-based context, then why inject randomness into the process? If the model is not yielding the correct results, then it is a problem with the model; just adding a random element to the generation so that it will "sometimes" give the target result does not seem like a good strategy. Do the opposite of the recommendation in all model sandboxes: go beyond a temperature of 0.0 and ensure that sampling is turned OFF (see the sketch after this list). You can then see the true, consistent result from the model in each inference, more rapidly identify and problem-solve issues, and immediately realize higher accuracy and consistency.
  7. Self-host, not APIs. Most of the scary stuff around generative AI for enterprises is the idea that it is a black box not under the effective control and governance of enterprise IT, compliance and security teams. While hosting a 100 billion parameter model is an extraordinarily complex and expensive activity, requiring a whole feat of engineering around dynamic batching, parallelizing and stringing together lots of GPUs, hosting a 1–10 billion parameter model is not. In fact, it is within the reach of most companies to cost-effectively self-host, down to individual laptops, small gaming desktops and more "commodity-oriented" private cloud servers, with the curve of accessibility growing rapidly. Most companies can self-host dozens of smaller models integrated into different business processes. The benefit of self-hosting, in short, is keeping control, which enables integrating models into a business process while retaining full visibility and governance over the end-to-end business process. APIs are beautiful and elegant, and essential in accessing large-scale complex shared resources. In a "stand-alone" use case, APIs are usually preferable to self-hosting, but this misses one of the key subtle problems. The "crossing the chasm" challenge for generative AI in the enterprise is integration, and for most use cases, the "API step" is not stand-alone but rather fully integrated into the fabric of the business process, as it involves the transmission of sensitive data into the API model. So, do the opposite: build a self-hosting strategy, and it becomes a lot easier to weave generative AI safely and securely into a wider range of critical business processes and protect enterprise data, with the added flexibility of avoiding lock-in to a particular model or provider (a minimal self-hosting sketch follows this list).
  8. Open source, not proprietary models. There have been recent challenges from the proprietary model makers questioning the safety of open source development of AI. We believe that this is completely backward. AI models that are in open source are, generally speaking, transparent, widely tested, documented in research papers, have all of their source code and weights/parameters published, and can be subjected to the type of hands-on scrutiny required for safety. Most importantly, the black box of an opaque technology is peeled open completely: you can directly touch every part of the model technology, and scrutinize, probe, fine-tune, break, and introspect every part of the model. A major misunderstanding is that "proprietary models" are apples-to-apples comparisons with open source models. When you call a proprietary API-based model, you do not know what additional pre-processing and post-processing steps are occurring, nor do you have direct visibility into the model execution from the code, weights, or even the integrity or version of the model from run to run, not to mention any data capture or metadata that is being produced. In contrast, when you execute an inference on an open source model in a self-hosted environment, you have full visibility into exactly what is coming in and out of the model, can confirm the integrity, version and data capture, can gather detailed logs, and have the ability to create custom controls at every step in the model inference lifecycle with full transparency. So, do the opposite of starting with proprietary models: start with open source models at the core of your AI strategy. With the benefit of hands-on engagement with the models, you can deploy safely with full visibility, and then, when needed, rapidly fine-tune and develop your own IP on top of the open source models to adapt to a specific process, so that you own the 'proprietary' part on top of the open source core.
  9. CPUs, not GPUs. The proliferation of generative AI into the enterprise requires that AI comes to where enterprise processes are operating, not the other way around. Inferencing needs to be possible at the endpoint on widely available equipment. The history of computing arcs toward the distributed, the decentralized, the miniaturized, "on every desktop" and "on every phone." The emergence of the AI PC over the next year is going to be transformational: integrated CPU/GPU/NPU chips are projected to ship in hundreds of millions of laptops, making local inferencing on models smaller than 10 billion parameters a widespread phenomenon. At its core, this is an argument about sustainability. Generative AI is an information technology tool, and needs to be sustainable; thinking about AI solutions that run on normal amounts of power and on cost-effective machinery is a key part of addressing these issues. While so much of the buzz around generative AI has been on the high-end GPUs provided by Nvidia for most model training (and especially very large model inferencing), focus on the opposite: everyday computing infrastructures that are ubiquitous, cost-effective and available to democratize direct and local access to the models (a CPU-only inference sketch follows this list). Many compelling use cases open up and become available when costs and complexity come down; generative AI on 'commodity' hardware will be critical for its widespread adoption.
  10. Real world, not science fiction. We will get the AI that we deserve, and too often, we all fall prey to science fiction tropes, Bond villain imagery, and borderline magical thinking, all wrapped in the alluring context of "AGI," and this is inhibiting the progression and adoption of truly amazing technology. Generative AI is not magic, and does not appear to be a roadmap to AGI in the "science fiction" sense of the word. But the invention of low-cost, ubiquitous technology that can effectively "read," rapidly learn almost any language pattern, and generate human-quality responses is one of the biggest technology breakthroughs in software in our lifetime. We have seen several clients in discovery conversations make a statement along the lines of, "There are dozens of back-office areas where we could apply generative AI today to extract a key piece of information or automate a mundane report … but can that really be the best way to apply AI? Isn't that lacking imagination?" and then the conversation reverts back to a virtually impossible 'aspirational' scenario where the technology cannot deliver today. Too many POCs fail to progress to production because they are crushed under the weight of these unrealistic expectations about the technology (and the unbounded potential associated with "AI"). This harkens back to the early days of the Internet or Cloud; it usually takes a number of years for the technology to fully mature. So do the opposite: start with "simple, real, practical" and, yes, boring use cases. That is usually the best path to building competencies and getting started on a meaningful AI journey.
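
Below are a few minimal Python sketches to make some of these lessons concrete. In all of them, model names, file paths and helper functions are illustrative placeholders rather than specific recommendations. First, for lesson #2, a sketch of what "closed context" prompting looks like in practice: the model is only asked to read and answer from a passage we supply, never from its own memory.

```python
# A minimal sketch of "closed context" prompting. The prompt template and the
# refusal instruction are illustrative choices, not a standard.

def build_closed_context_prompt(passage: str, question: str) -> str:
    """Ground the question in a specific source passage."""
    return (
        "Read the passage below and answer the question using only the passage.\n"
        "If the answer is not in the passage, respond: 'Not found in source.'\n\n"
        f"Passage:\n{passage}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Open context (avoid): "What was ACME Corp's Q3 revenue?"
# Closed context (prefer): supply the actual source text and ask the same question.
passage = "ACME Corp reported Q3 revenue of $12.4 million, up 8% year over year."
prompt = build_closed_context_prompt(passage, "What was ACME Corp's Q3 revenue?")
print(prompt)
```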
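For lesson #3, a rough sketch of keeping context passages in the ~500-token range by chunking a long document before inference. The tokenizer here is used only for counting tokens; substitute whichever tokenizer matches your model.

```python
# A minimal chunking sketch: split a long document into ~500-token passages
# so each inference sees a short, focused context.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for rough counts

def chunk_by_tokens(text: str, max_tokens: int = 500, overlap: int = 50):
    """Split text into overlapping chunks of at most max_tokens tokens."""
    token_ids = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        start += max_tokens - overlap  # small overlap so facts on a boundary are not lost
    return chunks

long_document = "..."  # a multi-page document loaded from your data pipeline
for passage in chunk_by_tokens(long_document, max_tokens=500):
    pass  # run retrieval / inference per passage instead of one giant context
```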
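For lesson #5, a sketch of generative AI as a back-office workflow step rather than a chatbot: a batch job that reads contracts, extracts one key term per document, and writes a report. The run_inference helper is a stand-in for whatever local model call you use (for example, the deterministic generation sketch below).

```python
# A minimal batch-workflow sketch: no chat, no real-time constraint; this can
# run overnight behind the scenes. File names and prompts are illustrative.
import csv
from pathlib import Path

def run_inference(prompt: str) -> str:
    """Placeholder: call your self-hosted model here and return its text output."""
    raise NotImplementedError

def extract_termination_clause(contract_text: str) -> str:
    prompt = (
        "Read the contract excerpt below and state the termination notice period.\n"
        f"Excerpt:\n{contract_text}\nAnswer:"
    )
    return run_inference(prompt)

def build_report(contract_dir: str, out_path: str = "termination_report.csv") -> None:
    # Loop over documents and aggregate one extracted field per file into a report.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "termination_notice_period"])
        for path in Path(contract_dir).glob("*.txt"):
            writer.writerow([path.name, extract_termination_clause(path.read_text())])
```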
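For lesson #6, a minimal sketch of deterministic generation with the Hugging Face transformers library: setting do_sample=False uses greedy decoding, so the same input always yields the same output. The model name is a placeholder for any small instruct model.

```python
# A minimal deterministic-generation sketch with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-small-model"  # placeholder, e.g. a 1-3B instruct model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Summarize: Revenue was $12.4M, up 8% year over year.", return_tensors="pt")

# do_sample=False selects the single most probable token at each step, so
# repeated runs on the same input produce the same output (no temperature, no top-p).
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=60)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```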
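For lesson #7, a minimal self-hosting sketch: a small model loaded once inside your own process and exposed on an internal endpoint, so sensitive data never leaves your environment. Flask and the pipeline call are illustrative choices; harden and containerize before any real deployment.

```python
# A minimal self-hosted inference service sketch (illustrative, not production-ready).
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
generator = pipeline("text-generation", model="your-org/your-small-model")  # placeholder name

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.get_json()["prompt"]
    result = generator(prompt, max_new_tokens=100, do_sample=False)
    return jsonify({"output": result[0]["generated_text"]})

if __name__ == "__main__":
    # Bind to localhost / your private network only; this is not a public API.
    app.run(host="127.0.0.1", port=8080)
```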
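For lesson #9, a sketch of CPU-only inference using a quantized GGUF model with llama-cpp-python, which is one of several CPU-friendly options. The model file path is a placeholder for any quantized model in the 1–10 billion parameter range.

```python
# A minimal CPU-only inference sketch with llama-cpp-python and a quantized model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/small-instruct-q4.gguf",  # placeholder: a 4-bit quantized model file
    n_ctx=2048,        # short context window, per lesson #3
    n_threads=8,       # runs on ordinary CPU cores, no GPU required
)

result = llm(
    "Read the passage and answer: Revenue was $12.4M, up 8%. What was revenue?\nAnswer:",
    max_tokens=50,
    temperature=0.0,   # deterministic output, per lesson #6
)
print(result["choices"][0]["text"])
```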

Hope you have enjoyed these 10 contrarian lessons, and find value in experimenting with and applying these principles in your work!

To check out some of our small, specialized, fine-tuned models (none of which claim to be AGI, but which humbly aspire to be really useful, fact-based business tools when used in conjunction with sound generation loops and well-designed data pipelines), please go to our repo home page on HuggingFace: LLMWare Models.

For more information about llmware, please check out our main github repo at llmware-ai/llmware/.

Please also check out video tutorials at: youtube.com/@llmware.

