Towards a Control Framework for Small Language Model Deployment
Addressing GenAI safety, security and data privacy issues
It usually takes no more than 20 minutes into a presentation on generative AI before The Question is asked: “This is all really exciting, but how do we ensure that generative AI is safe, that our data is secure, that we can protect users, that we can safeguard enterprise information, that we avoid prompt injections, minimize the risk of accidental information disclosures, prevent models from misappropriating our data, and generally avoid situations that potentially spiral out of control and create unbounded risk and unforeseen consequences?”
What is striking to us about The Question is how pervasive, yet how inchoate, these concerns remain — and they can overwhelm and derail almost any generative AI use case if not carefully addressed. It is easy to fear the worst, especially in the absence of an analytical framework to look at the key steps in the end-to-end model lifecycle, and understand the control options available at each step to mitigate specific risks.
Sometimes The Question spirals into even larger existential and societal concerns — but it always comes down to a simpler, more fundamental question at its core:
How do we get control over generative AI?
While there has been a lot of focus on myriad piecemeal safety and guardrail solutions for OpenAI and commercial API-based models, there remain relatively few best practices, and little practical understanding, of how to build a control framework for managing open source, self-hosted small language models. In fact, there is sometimes even a (mis)conception that smaller, open source models present a greater risk profile.
Through our work with llmware over the last year, we have tested, fine-tuned, experimented with and built end-to-end generative inferencing pipelines for literally hundreds of open source models, and worked extensively with PyTorch, GGUF, ONNX and OpenVino as different underlying model inferencing technologies. Out of this work, we see at least 10 distinct control points that are common across the model lifecycle and present opportunities to bring generative AI under control practically and cost-effectively.
Model LifeCycle: Models are Just Math Functions
Sometimes, the “black box” nature of generative AI can obscure the individual steps in the process of producing inference outputs, but in our view, a model is just a giant math function, implemented as a series of low-level operations and parameters, and just like any piece of software, it needs to be discovered, downloaded, loaded into memory, configured, and run. When we refer to the Model LifeCycle, we are referring to the end-to-end process that starts with a user or system discovering and accessing a particular model, continues through preparing the model to an inference-ready state, and then covers all of the steps involved in preparing the prompt, executing the generation loop, and finally saving all of the various inputs and outputs to create an auditable inference history. (Note: we are looking at the lifecycle from an inferencing perspective, and not considering the training or fine-tuning pipeline, which is generally performed separately from the production roll-out of a model.)
Here is a visual representation of the 10 Major Steps in the Model Lifecycle:
Let’s review each of these steps in order, and look at the key decisions and controls available at each step:
1. Model Catalog — Discovery Controls — the first step in the model lifecycle is the precondition for all of the others, and is oftentimes viewed as an implicit step, but in our view, it is the most fundamental control that an enterprise (or an application / use case) provides, namely: which models will be available to be consumed. This is, in effect, a new “master data source”, much akin to master data sources that have emerged in other domains over the years. We typically refer to this model master data as a Model Catalog, which can be as simple as a list of models (including a list of one), but to be useful, it should provide the following information:
- Unique Model Name — the model catalog should provide a unique model identifier and naming scheme that can be used to retrieve a model, so that calling users and applications can unambiguously identify a particular model. This may include both an external “friendly display name” and potentially a more complex “internal reference id”, and will also likely need to include the ‘formal’ model name of origin, which is oftentimes not presented in the most ‘user-friendly’ format (e.g., ‘llmware/bling-tiny-llama-v0’);
- Model Description — available both as a human readable “thumbnail” so a user can read and select among different models, as well as a programmatic description that includes key features that can be used by calling applications to filter and identify;
- Loading Instructions — how to access the model: if over API, the key details for identifying the endpoint and calling the model; if physically loaded, where to get the model artifacts;
- Configuration Information — virtually all models will require a set of configurations to correctly instantiate and prepare the model for inference, including the model type, tokenizer, generation settings, and the correct prompt template to be used.
The Model Catalog should also provide an abstraction layer that ensures consistency across deployments of different underlying model technologies. The Model Catalog will be constantly evolving, so flexibility and ease of deployment are critical to ensuring ongoing alignment.
Once a Model Catalog has been defined and implemented, you have established the main precondition for getting control over generative AI: the only models that will be used in a process are those that are in the Model Catalog. You have the ability to fully vet and evaluate any model before including it in the Model Catalog, and just as importantly, to remove a model if there are any issues with performance, accuracy or quality. Conversely, if you don’t have a Model Catalog in place, it is difficult to implement any of the controls that follow with consistent processes.
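To make this concrete, here is a minimal sketch of what a Model Catalog entry and lookup could look like. The `CatalogEntry` fields and the `ModelCatalog` class are illustrative assumptions for this article, not llmware's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # unique identifiers: friendly display name plus the formal model name of origin
    display_name: str
    model_name: str                      # e.g., "llmware/bling-tiny-llama-v0"
    description: str
    # loading instructions: "api" or "local", plus where to find the model
    access_type: str
    location: str
    # configuration needed to reach an inference-ready state
    config: dict = field(default_factory=dict)   # tokenizer, prompt template, generation settings

class ModelCatalog:
    def __init__(self):
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.display_name] = entry

    def lookup(self, name: str) -> CatalogEntry:
        # only models registered in the catalog can be requested downstream
        if name not in self._entries:
            raise KeyError(f"model '{name}' is not in the approved Model Catalog")
        return self._entries[name]
```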
2. Routing & Optimization — after a user (or application) discovers a model in the catalog and initiates a request for a particular model, there is an important step before beginning to fulfill that request, which is a routing and optimization set of checks. At this point in the process, we know what the user has requested, but given the reality of change in implementation, the second control is an optimization and routing sequence that provides the ability to override default configurations, and based on the current state, either apply the default configurations in the model catalog or provide an alternative. The routing control can modify the user request, or append it with additional information that will be accessible to the downstream processes.
A few common situations and use cases:
- Execution Route — if an API is available for the model, then fulfill the request with the API, but if not available, then prepare to load the model into a local environment, or potentially initiate a “pop up” instantiation of the model on an inference server to fulfill the user request;
- Optimize for Environment — detect the user operating environment, and optimize for that environment, e.g., if the target environment is an end user AI PC with x86 GPU, then send the OpenVino version of the model, while if a Linux environment with CUDA available, then send the PyTorch or GGUF version;
- Availability/Price Checks — there may be a fallback option that should be applied in the event that certain conditions are met, whether availability of the model, current costing, or potentially the nature of the use case;
- Dynamic Updates — a user application may have ‘hard-coded’ a request for v2 of a particular model, and when v3 is released and added to the catalog, rules can be applied to automatically route to v3; similar rules can handle cases where ‘llama2’ is deprecated in favor of ‘llama3’, or ‘slim-extract’ is upgraded to ‘slim-extract-phi3’;
- User/Role/App Specific — based on a user profile, role, department, or specific type of app/use case, you may want the ability to substitute or adapt the fulfillment request.
Oftentimes, these routing policies will evolve over time, and will be set at a central level to apply across all GenAI applications. One of the biggest benefits of inserting routing and optimization controls is the agility to avoid hard-coding these types of rules into specific applications, and to enable dynamic updates and controls even after an application has been written and deployed. Models and deployment options will likely continue to evolve rapidly over the next 12–24 months, so the ability to “write a use case once” while adapting aspects of the model deployment is a necessity for production roll-outs.
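As a simple illustration, building on the catalog entry sketch above, a routing control might pick the execution route and package format based on the detected environment and user profile. The rules below are hypothetical examples, not prescribed policies.

```python
import platform

def resolve_route(entry, user_profile: dict) -> dict:
    """Hypothetical routing check: choose an execution route and packaging format
    for a catalog entry based on the runtime environment and the requesting user."""
    route = {
        "model_name": entry.model_name,
        "execution": "api" if entry.access_type == "api" else "local",
    }

    # environment optimization: prefer OpenVino on x86 machines, GGUF elsewhere (illustrative rule)
    if route["execution"] == "local":
        route["package"] = "openvino" if platform.machine() in ("x86_64", "AMD64") else "gguf"

    # dynamic update: transparently upgrade deprecated models (assumed catalog field)
    if entry.config.get("deprecated_in_favor_of"):
        route["model_name"] = entry.config["deprecated_in_favor_of"]

    # user/role-specific rule: e.g., certain departments only pull from the private repository
    if user_profile.get("department") in ("sales", "finance"):
        route["repository"] = "private"

    return route
```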
3. Fetch Controls — after resolving the route optimization rules, it is time to fulfill the request for a model, and either provide the model over an API, or prepare to pull the model from a repository and download it to the target machine. (We will focus on the physical download process in this control, although resolving the API and providing checks around its availability and safety are important preparatory steps as well for a self-hosted API-based model.)
If we need to physically instantiate the model, then there must be some form of download activity in which the model is moved from a central repository, whether public or private, into a target deployment environment, whether a private cloud server or edge device. In many respects, this is the single most important step in protecting the enterprise in self-hosted generative AI, and good controls are essential to providing safety:
- Model Repository. Most AI power users are comfortable with pulling models from open source repositories such as HuggingFace, but this is not necessarily the best practice for a wider community of enterprise users or a good recipe for scaling access, e.g., you probably don’t want your sales or finance teams downloading models from open source repositories. We recommend a simple two-step process: (1) a small set of expert AI CoE power users who access, experiment, evaluate and fine-tune models pulled directly from open source repositories, and (2) creation of a subset of enterprise safe models in a private repository, which can then be used for scaling generative AI use cases to a wider set of users and departments across the firm. Creating a private repository can be as simple as assembling all of the model components in an organized set of buckets (AWS) or containers (Azure) within a private security zone, and some straightforward secure access mechanisms to those model bits, with the ability to leverage existing enterprise policies around those cloud storage resources.
- Role-based Access Rights. As outlined above, at a testing/development stage, you may want to route requests from power users to a public open source repository, and then change access at the time of deployment to a private repository, without having to make code changes. Implementing good controls should enable user/role based access to different repositories, and the flexibility to adapt repository locations and implementations over time without having to re-write applications. Also, as deployments scale, you may want the flexibility to implement multiple repositories, perhaps for BCP, or access to different departments, regions or users.
- Model Integrity. This is the most important safety check in the fetching process: ensuring that the model bits are not modified from the time that they are entered into the model repository. We would recommend implementing a straightforward hash check on the key model components, and keeping those records for easy access at various steps in the downstream process, along with the ability to warn or block depending upon breaks in the hashes (see the sketch after this list). There are a number of detailed decisions to be taken around the scope and frequency of hashing. For example, a configuration file or README in a model package may be adjusted frequently, and should likely not be hashed, while other configurations may need to be hashed to mitigate risks (e.g., generation config or tokenizer). In our experience, we would recommend keeping it simple: implement consistently, and focus on the core model components — generally the graph interface (e.g., model class code) and the binary weights/parameters. If you keep model bits clean on entering into a private repository, hash the major determinative components on entry, and check periodically, then the risk of tampering with the model is greatly reduced.
- Download Management. From the moment that the model is pulled from the central repository to the target environment, there is a lot that can go wrong, and needs to be carefully managed. The user needs to be kept updated on the status of the download, the files need to be instantiated on the machine in a consistent location and path, and if there is an error, there needs to be communication so that it can be triaged, whether at the central repository, the network connection, or the end user environment. Critically, the central model hashes need to be validated at the time of download to confirm that the model bits downloaded and installed on the target environment match the hashes from the central repository.
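Below is a minimal sketch of the hash-check portion of these controls. It assumes the private repository publishes a manifest of SHA-256 hashes for the core model components; the manifest format and file names are illustrative.

```python
import hashlib
import json
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_download(model_dir: str, manifest_path: str) -> list:
    """Compare downloaded model components against the hashes recorded in the central
    repository manifest; return the list of files whose hashes break."""
    manifest = json.loads(Path(manifest_path).read_text())   # e.g., {"model.gguf": "ab12...", ...}
    breaks = []
    for filename, expected in manifest.items():
        actual = sha256_file(Path(model_dir) / filename)
        if actual != expected:
            breaks.append(filename)
    return breaks   # the caller decides whether to warn or block on any breaks
```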
4. Instantiating Model Class — at a code level, the first step towards deploying the model is the invocation of a model class, or set of model classes, that wrap the underlying inferencing engine, carry the state and configuration information for the model, and expose common inferencing methods for basic inferencing, streaming, function calling, prompt wrapping, and managing the underlying inferencing engine (e.g., GGUF, OpenVino, ONNX, PyTorch). Upon instantiating the model class, and prior to the physical loading of the model parameters from disk to memory, there is a crucial opportunity to capture and check key metadata and insert controls before moving to fully activate the model:
- Environment variables and policies — is a GPU available, is the environment memory constrained, or are there specific conflicts in configuration variables that need to be resolved?
- Usage metrics and charge-backs — this is a good opportunity upon loading a model class to capture usage and access by department — for example to validate how many times a specific model class has been used to confirm charge-backs or other usage metrics. There is also the opportunity to set department-specific parameters that may apply to a particular model, e.g., token limits or alignment with generation policies (e.g., sampling, temperature, output limits).
- Out-of-date architecture — stop the loading process if the architecture or model class is out of date or deprecated, before taking the next step of actually loading the model weight parameters.
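The sketch below illustrates the kind of pre-load checks that can run at model class instantiation. The deprecation flag, memory threshold and usage-metrics hook are assumptions for illustration, and torch and psutil are treated as optional dependencies.

```python
import importlib.util

def preload_checks(entry, min_free_gb: float = 2.0) -> None:
    """Illustrative checks run when the model class is instantiated,
    before any weights are pulled from disk into memory."""
    # stop early if the architecture has been deprecated in the catalog (assumed field)
    if entry.config.get("deprecated", False):
        raise RuntimeError(f"model '{entry.model_name}' uses a deprecated architecture")

    # detect whether a CUDA-capable GPU is usable in this environment (torch is optional)
    has_gpu = False
    if importlib.util.find_spec("torch"):
        import torch
        has_gpu = torch.cuda.is_available()
    entry.config["use_gpu"] = has_gpu

    # check available memory before committing to a large load (psutil is optional)
    if importlib.util.find_spec("psutil"):
        import psutil
        free_gb = psutil.virtual_memory().available / 1e9
        if free_gb < min_free_gb:
            raise RuntimeError(f"only {free_gb:.1f} GB free memory; below the {min_free_gb} GB policy")

    # a usage-metrics / charge-back hook would record (user, department, model, timestamp) here
```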
5. Loading Model Bits into Memory. This is the last step in preparing the model to an “inference ready” state, and it consists of applying the loading instructions from the Model Catalog for that particular model, and using the correct inferencing technology to load the model weights and instantiate the model in memory. There are several things that could go wrong at this step, and we recommend the following checks:
- Hash validation of loading the model from disk to memory. We checked the model integrity on the first step of downloading the model into the target environment, and now have the opportunity to run a secondary check upon loading the model from the local disk into memory. In some use cases, this may be overkill, but in many others, it is an essential step to confirm that the model has not been tampered with or corrupted since it was installed on the local machine.
- Inferencing engine version, dependencies and updates. The underlying inferencing engine is connected to the model at this step, and there are opportunities to validate the versions, dependencies, and other environment variables to confirm that the correct engine version is being used. All of the popular inferencing back-ends are platform-specific compiled C/C++ code, and are constantly evolving and improving (including support for new model types, e.g., “qwen2”), so ensuring that the correct inferencing back-end is used on a particular deployment platform is critical to a successful inference.
- Tokenizer loading and configuration. Tokenizers and other supporting processing models can be bundled with the main model, but oftentimes require a parallel loading process to confirm that all of the elements for successful inference generation are loaded into memory. Tokenizers, in particular, are critical to generation, and key parameters such as the end-of-text token(s) are vital to correctly stopping a generation loop (see the sketch after this list).
- Error handling. If something goes wrong at this stage, warnings and notifications are needed to enable triage and resolution.
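A small sketch of load-time validation along these lines follows. It assumes a GGUF model served through the llama-cpp-python back-end and a Hugging Face-style generation_config.json; both are illustrative choices, and the version floor is arbitrary.

```python
import json
import re
from importlib import metadata
from pathlib import Path

def validate_load(model_dir: str, min_engine_version: tuple = (0, 2, 0)) -> None:
    """Illustrative pre-inference checks: inferencing engine version and stop-token configuration."""
    # confirm the compiled inferencing back-end meets the minimum version policy
    version_digits = re.findall(r"\d+", metadata.version("llama-cpp-python"))[:3]
    engine_version = tuple(int(x) for x in version_digits)
    if engine_version < min_engine_version:
        raise RuntimeError(f"llama-cpp-python {engine_version} is older than required {min_engine_version}")

    # confirm end-of-text token(s) are configured, so the generation loop can stop correctly
    gen_cfg_path = Path(model_dir) / "generation_config.json"   # common, but not universal, file name
    if gen_cfg_path.exists():
        gen_cfg = json.loads(gen_cfg_path.read_text())
        if gen_cfg.get("eos_token_id") is None:
            raise RuntimeError("no eos_token_id configured; generation may not terminate cleanly")
```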
Now that we have completed the first five steps in detail, we have fully prepared the model, and are in a position for the fun stuff — accepting user input, building prompts and generating inferences. Oftentimes, the first five preparatory steps may be completed once upon initial setup, and may not need to be repeated again until the models are unloaded and subsequently restarted. However, the next five steps will occur on every single inference, and will generally be completed in mere seconds of total time.
6. Prompt Creation. Once a user enters an input prompt, this is when the real fun and magic of the generative inference process begins. In the prompt creation stage, there are a number of data transformation steps that occur, usually behind the scenes, to package and “engineer” a complete prompt that will be passed to the model. This is also where RAG and other retrieval and context-building steps occur to pull information programmatically that will be appended and integrated into the engineered prompt. Key steps:
- Prompt Template — one of the key artifacts of model training or fine-tuning is a set of “wrappers” around the core prompt that signify to the model the respective roles and separations between different sections of a prompt. In our fine-tuning, we use a convention that started with early open source GenAI leaders (but is now old-school, e.g., “so 2023”), which is a simple “<human>“ and “<bot>“ wrapper around our prompts. (Side-note: we actually find that it yields better results, so we have kept it rather than updating to system-user-assistant frameworks.) There are still a lot of variations across different models, from the Llama2 simple classic “INST” separator (still used by Mistral), to a number of variants that create three roles — “system”, “user”, and “assistant” — to very complex special token regimes as now implemented in Llama3 and several other newer base models. This is a rules-based “regex” style process, but critical to consistent results (a minimal sketch follows after this list). There is nothing that screws up model generations faster than getting this template a little wrong, and something as small as a stray “\n” or missing “:” can oftentimes lead to immediate degradation in model quality. In most cases, errors in the template are purely accidental, but this is also a great place to “lock down” templates, as it would be an easy place for a bad actor to significantly degrade the quality of generation or insert spurious instructions.
- Prompt Assembly & RAG — while not purely a model inference process, the retrieval and building of context information to integrate into the Prompt is oftentimes the most important step in getting good, consistent, high-quality results. Quite simply, this is where “garbage in = garbage out” for most generative AI. As it relates to the control framework, we believe that the key elements are the transparency and control of the retrieval process, along with all of the metadata associated with the sources, and tracking the lineage of that data throughout the end-to-end inference. It should be possible at the end of the process to reconstruct the sources (and all of their key bibliographic info — document, page number, etc.) that were reviewed, to identify how the model generation aligns to those sources, and to subsequently audit and check the consistency.
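To illustrate the template step, here is a minimal sketch of a rules-based prompt wrapper. The template strings are simplified examples of the conventions described above, not the exact sequences any particular model was trained on.

```python
# illustrative prompt templates keyed by wrapper style; real models may differ in exact tokens
PROMPT_TEMPLATES = {
    "human_bot": "<human>: {context}\n{question}\n<bot>:",
    "inst": "[INST] {context}\n{question} [/INST]",
}

def build_prompt(question: str, context_passages: list, wrapper: str = "human_bot") -> str:
    """Assemble retrieved context passages and the user question into an engineered prompt."""
    context = "\n".join(context_passages)
    return PROMPT_TEMPLATES[wrapper].format(context=context, question=question)

# example usage with a single retrieved passage (source lineage would be tracked alongside)
prompt = build_prompt(
    "What is the base salary?",
    ["Executive Employment Agreement, p.2: ... annual base salary of $200,000 ..."],
)
```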
7. Preview Prompt Remediation and Risk Assessment. After the prompt has been fully packaged into a single engineered ‘prompt’ with the appropriate model-specific template wrapper, there is an opportunity to do a “preview” check to assess the appropriateness of the prompt, catching accidental PII disclosures, risks of prompt injection, or other malicious (or frivolous) behavior, with the ability to either warn, remediate or block the prompt from being passed to the model. We typically think of the preview controls as consisting of the following configurable set of interventions:
- Pattern-based recognizers and redactors — identify regex and string-based pattern matches of common sensitive data types and remediate on the fly before progressing with the prompt, e.g., email addresses, social security numbers, other sensitive PII, white space and other harmful indicators. Notably, these patterns may differ by use case and user. For example, it may be perfectly appropriate for an HR professional to view this type of PII, while it should be blocked for processes involving other business or external users.
- Classifier-based tests — evaluate toxicity, prompt injection risks and other potentially sensitive areas that should not be progressed in enterprise systems. In most cases, we do not recommend inserting classifier tests before generation unless there is a perceived high risk, as running the classifier adds latency; but we do see the ability to run a number of classifier-based tests to evaluate risk before starting generation as a key control opportunity that can be used where warranted, especially to filter out extreme threats of prompt injection or toxic/inappropriate content.
- Exclusion lists — there may be key phrases and terms (e.g., a presidential candidate in a current election, or key customer or executive names) that should be black-listed and automatically removed from a prompt. This can be done quickly, and is an easy way to “blacklist” undesirable information or items from prompt generation and downstream capture.
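A minimal sketch of a pattern-based preview control follows, with two illustrative regex recognizers and a tiny exclusion list; a real deployment would carry a much larger pattern library, with per-role and per-use-case policies.

```python
import re

# illustrative recognizers; production systems would use a fuller, per-use-case set
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

EXCLUSION_LIST = {"project falcon"}   # hypothetical black-listed terms

def preview_remediate(prompt: str):
    """Redact sensitive patterns and excluded terms from the prompt; return the remediated
    prompt plus a list of triggered controls for the downstream audit record."""
    triggered = []
    for name, pattern in PATTERNS.items():
        prompt, hits = pattern.subn(f"[REDACTED-{name.upper()}]", prompt)
        if hits:
            triggered.append(name)
    for term in EXCLUSION_LIST:
        if term.lower() in prompt.lower():
            prompt = re.sub(re.escape(term), "[REMOVED]", prompt, flags=re.IGNORECASE)
            triggered.append(f"exclusion:{term}")
    return prompt, triggered
```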
8. Generation Loop. We are now ready for the model to actually start doing some work inside of our generation loop that repeatedly runs a forward pass on the model, updates the input context, and produces the output, using the sampling and stopping conditions as set in the generation process. This is an important place to see “behind the curtain” of what is really happening inside the model. There are a number of parameters and potential controls that need to be considered:
- Deterministic generation — one of the outdated legacies of the early days of generative AI is the widespread use of stochastic sampling as an integral part of creating “interesting” outputs. If you are in a truly creative generation use case, then turn up the temperature and use sampling, but if you are in a fact-based use case, looking for repeatable results, turn off sampling and run deterministic “return highest logit value” in each forward pass. For enterprises looking to improve model accuracy and reduce perceived hallucinations, this is the fastest and easiest “quick win” — and will generate meaningful improvements in accuracy immediately.
- Capture logits — at each model output, the model returns an array of logits, which, when normalized, can be interpreted as the probability distribution over the tokens in the vocabulary. This is extremely useful data, and oftentimes the 5–10 highest logit ‘probabilities’ for each forward pass can provide great insight into the choices that the model evaluated, and where there are potential inaccuracies. We look specifically at key “value windows” in a generation output, e.g., the value in a function-calling output, and study those logit probabilities as a great indicator of potential biases and errors, and a way to evaluate potential improvements. As an example, if the top-ranking logit value for a sentiment classification activity is “positive” at 0.52 and the second highest is “negative” at 0.48, with negligible probability for all other values, then we have a reasonable interpretation of the model’s confidence (right or wrong) in the sentiment classification. If, on the other hand, the model was evaluating other tokens with higher normalized values, then it is an indicator that the model may need further fine-tuning. We believe that there is a lot of opportunity for capturing logit data as part of an inference history, as an auditable record trail of exactly what was produced by the model, and of the other choices considered by the model, to try to reconstruct where things may have gone wrong (see the sketch after this list).
- Grammars/templates and other logit biases — this is optional, but for some use cases, it is an area of active research and study whether to bias the logit output distribution in certain ways, perhaps as simply as removing undesirable repetitions.
- Stopping conditions — correctly setting the end-of-text tokens and limits on output generation size. As generations and overall contexts get larger, the risk of errors generally increases, so there are a lot of reasons to set consistent policies on output lengths, especially for different use cases.
- Metadata capture — usage metrics, and potential time/performance metrics, are valuable to capture for future evaluation and analysis.
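The sketch below shows a bare-bones version of such a generation loop, combining deterministic (greedy) decoding, top-k logit capture, and an explicit stopping condition. The `model.forward` and tokenizer calls are hypothetical stand-ins for a real inferencing engine.

```python
import math

def softmax(logits):
    """Normalize raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 128, top_k_log: int = 5):
    """Deterministic generation loop that also captures the top-k normalized logit values
    at each step for the inference history (model and tokenizer APIs are illustrative)."""
    tokens = tokenizer.encode(prompt)
    logit_history = []
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)                # one forward pass over the current context
        ranked = sorted(enumerate(softmax(logits)), key=lambda kv: kv[1], reverse=True)
        logit_history.append(ranked[:top_k_log])      # auditable record of the choices considered
        next_token = ranked[0][0]                     # greedy: always take the highest value
        if next_token in tokenizer.eos_token_ids:     # stopping condition from the tokenizer config
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens), logit_history
```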
9. Postview Remediation and Risk Assessment. Now that the model has generated an output response, there is the opportunity to review the model output, provide fact-checking and reconciliation against the input materials, filter out inappropriate responses, and warn, redact or block the inference from being returned to the user. Key control steps:
- Fact-check reconciliation — one of the simplest, but most effective, ‘quality control’ steps is to run a series of comparisons between the input context and the generated output for overlap of key items and numbers (see the sketch after this list). For example, if numbers in the generative output are not found anywhere in the input source materials, this should be flagged as a risk (although in some aggregation and counting activities there is a good explanation). If there is an extremely low match between tokens in the generative output and the input, this should be flagged as a risk. In an exceptional case, if the output fails certain basic criteria, then it could be rejected entirely before returning the results to the end user, and replaced with a general “failure” message.
- Filtering activities — all of the steps from Preview can be applied on Postview to the output generation — pattern-based, classifier-based and exclusion lists — to redact, block or warn on output generations before returning them to the user and passing them forward in the process.
- Error correction — there are common error corrections that could be applied such as repetition filtering or other common issues in generation quality.
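Here is a minimal sketch of the number-reconciliation check described above; the matching rule is deliberately simple and would be tuned per use case (e.g., to tolerate legitimate aggregations).

```python
import re

NUMBER_PATTERN = re.compile(r"\d[\d,.]*")

def fact_check_numbers(source_text: str, generated_text: str) -> list:
    """Illustrative post-generation check: flag any numbers in the model output
    that cannot be found anywhere in the input source material."""
    source_numbers = {n.strip(",.") for n in NUMBER_PATTERN.findall(source_text)}
    output_numbers = [n.strip(",.") for n in NUMBER_PATTERN.findall(generated_text)]
    return [n for n in output_numbers if n not in source_numbers]   # flagged for review
```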
10. Inference MetaData Capture & Analytics. Finally, the inference is complete, and all of the metadata gathered in steps 1–9 should be assembled and entered as a new row in an inference history database to provide support for compliance, reporting, analytics and continuous improvement. This is where the value of the end-to-end lifecycle control truly comes together, as metadata gathered at each step is now available upon completion of the process, and can be used for analytics and compliance. We would recommend capturing some or all of the following:
- Model information — extracted from model catalog, along with key configuration settings used in the generation;
- User information — whatever metadata is captured per policy in terms of user, department, location, etc.;
- Input information — library or document repository evaluated, specific document sources retrieved and their bibliographical information, and the final prompt context created and assembled;
- Output information — the generation output, along with potentially selected logit information, time-stamp, and generation parameters;
- Post-processing reviews — comparisons between inputs and outputs and any notable risks identified; and
- Safety controls — any controls activated during the process and any remediation steps that were taken.
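As a final sketch, an inference history record can be as simple as one row per inference in a lightweight database; the schema below is illustrative, mirroring the categories above rather than prescribing a fixed format.

```python
import json
import sqlite3
import time

def log_inference(db_path: str, record: dict) -> None:
    """Append one inference event to a local history table for compliance and analytics."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inference_history "
        "(ts REAL, model TEXT, user TEXT, sources TEXT, prompt TEXT, output TEXT, controls TEXT)"
    )
    conn.execute(
        "INSERT INTO inference_history VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            time.time(),
            record.get("model"),                    # model name + key generation settings
            record.get("user"),                     # user / department / location metadata
            json.dumps(record.get("sources", [])),  # document and page-level lineage
            record.get("prompt"),                   # final assembled prompt context
            record.get("output"),                   # generation output (and selected logits)
            json.dumps(record.get("controls", {})), # safety controls triggered + remediation steps
        ),
    )
    conn.commit()
    conn.close()
```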
At last, we are done with the model lifecycle — inferencing is completed, time to unload the model, release the memory pointers, and move on to our next activity (until we run it all over again)!
Different use cases will need different sensitivity to each of these controls, along with the ability to “overlay” different control policies for each application and potential use case.
While this may seem like a lot to process, we believe that with a unified framework for managing the end-to-end model lifecycle, these policies can be implemented consistently and in relatively straightforward ways — to provide the right level of control for each use case — and ultimately provide a solid set of answers to The Question.
While The Question is simple, the key to answering it is all in the details — breaking down the process step-by-step, and at each step, applying the right set of controls to address the risk, and calibrate by the nature of the use case.
We are continuing to expand on this framework throughout llmware, as well as in our commercial ModelHQ offering. We welcome feedback, as always, on this framework, and ideas to continue to build on safety, privacy and control in the use of Generative AI.
Let’s all work together to make the world safe for small language models!
About llmware
Our mission is helping enterprises deploy small language models productively, privately, safely and with high accuracy. To date, our focus has been model fine-tuning, building small model optimized software pipelines, and making it as easy as possible to deploy scalable solutions at radically lower cost and complexity.
For more information about llmware, please check out our main github repo at llmware-ai/llmware/.
Please also check out video tutorials at: youtube.com/@llmware.
You can also contact us on our website: www.llmware.ai.