Model Depot
Leading Gen AI Models Packaged for AI PC Deployment
We have recently launched Model Depot, one of the largest and most comprehensive collections of generative AI models pre-packaged in OpenVINO and ONNX formats. These models have been quantized, tested, and optimized for fast, high-quality inferencing in resource-constrained edge environments, especially on AI PCs and, more generally, on x86 architectures.
The collection is hosted in the llmware repository on Hugging Face, and can be found here.
The collection includes over 100 state-of-the-art open source models, including:
- Leading Generative Models — generative decoder models from 1B to 14B+ parameters across leading open source series: Llama 3.2/3.1/3.0/2, Qwen 2.5/2, Mistral 0.3/0.2/0.1, Phi-3, Gemma-2, Yi 1.5/1.0, StableLM, and TinyLlama, along with popular fine-tunes including Zephyr, Dolphin, Bling, OpenHermes, Wizard, OpenOrca, Nemo, and Dragon;
- Specialized Models — specialized fine-tuned models in math and programming, including Mathstral, Qwen Code-7B, and CodeGemma;
- Multimodal Models — Qwen2-VL-7B, Qwen2-VL-2B, and Llama 3.2 11B Vision, designed for edge deployment of vision+text-to-text models;
- Function-Calling Models — specialized function-calling SLIM models for multi-model, multi-step agent-based workflows (see the sketch after this list); and
- Encoders — embedding models, rerankers, and classifiers.
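As a quick illustration of the function-calling SLIM models, here is a minimal sketch using the llmware ModelCatalog interface (covered in more detail below); the model name "slim-sentiment-tool" and the exact structure of the returned dictionary are assumptions based on the llmware catalog and may vary by version:
from llmware.models import ModelCatalog
# load a SLIM function-calling model from the llmware catalog
# ("slim-sentiment-tool" is an illustrative choice; other SLIM tools follow the same pattern)
model = ModelCatalog().load_model("slim-sentiment-tool")
# function_call runs a structured classification over the passage and
# returns a python dictionary rather than free-form text
response = model.function_call("The stock market rallied today on strong earnings reports.")
print("slim function call response: ", response)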
All of the models are prepackaged in “inference ready”, x86-optimized formats (OpenVINO and ONNX), quantized to int4 with “smart” quantization ratios to mitigate quality impacts (e.g., keeping some parameters at 8-bit).
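As a rough illustration of this kind of mixed-precision weight quantization, here is a minimal sketch using the optimum-intel library; this is an assumption about a typical workflow rather than the exact recipe used to produce the Model Depot models, and the model ID and ratio value are examples only:
# pip3 install optimum[openvino]
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
# bits=4 quantizes most weights to int4, while ratio=0.8 keeps roughly 20% of
# the most quality-sensitive weights at 8-bit (illustrative values only)
quantization_config = OVWeightQuantizationConfig(bits=4, ratio=0.8)
# export a Hugging Face model to OpenVINO IR with the quantization applied
model = OVModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # example model id
    export=True,
    quantization_config=quantization_config,
)
# save the quantized OpenVINO model locally
model.save_pretrained("tinyllama-1.1b-ov-int4")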
All of the models are open source, licensed on permissive terms consistent with those of the underlying models, and made available as a resource for the wider community to use in its own deployments.
Pulling the Models from Hugging Face Directly
The easiest way to pull the models programmatically is using the huggingface_hub library, e.g.,
pip3 install huggingface_hub
Any model can then be pulled using the simple recipe below:
from huggingface_hub import snapshot_download
# select a model from the Model Depot collection, using Huggingface repo ID
model_id = "llmware/bling-tiny-llama-ov"
my_local_path = "C:\\my_local_path\\"
# downloads the repo folder and all files into local directory
snapshot_download(model_id, local_dir=my_local_path)
Please note that the canonical transformers “AutoModel.from_pretrained” formulation will not work in most cases, so if you wish to pull the models directly into a project, we recommend using the approach outlined above. Also, the model files generally have no dependency on transformers or torch, so in virtually all cases you can run inferencing on the models directly and solely with the OpenVINO and/or ONNX Runtime libraries.
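For example, here is a minimal sketch of running the OpenVINO package downloaded above directly with the openvino_genai library; the device string and generation parameters are illustrative assumptions:
# pip3 install openvino_genai
import openvino_genai as ov_genai
# point at the local folder created by snapshot_download above
model_path = "C:\\my_local_path\\"
# load the OpenVINO model on CPU (use "GPU" to target an Intel GPU)
pipe = ov_genai.LLMPipeline(model_path, "CPU")
# run a simple generation
response = pipe.generate("What are the benefits of running models locally on an AI PC?",
                         max_new_tokens=100)
print("openvino_genai response: ", response)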
Using llmware
For those who wish to use llmware, we provide a high-level interface with “out of the box” integration into Model Depot, so you can get started right away with hybrid inferencing strategies across PyTorch, GGUF, ONNX, and OpenVINO, and even “mix and match” backends in the same example with minimal code change.
Here is a representative get started example in llmware:
# clone repo or pip install llmware >= 0.3.8
pip3 install llmware
# to use openvino
pip3 install openvino
pip3 install openvino_genai
# to use onnxruntime models
pip3 install onnxruntime_genai
from llmware.models import ModelCatalog
context = "Services Vendor Inc. \n100 Elm Street Pleasantville, NY "
"\nTO Alpha Inc. 5900 1st Street "
"Los Angeles, CA \nDescription Front End Engineering "
"Service $5000.00 \n Back End Engineering"
" Service $7500.00 \n Quality Assurance Manager "
" $10,000.00 \n Total Amount $22,500.00 \n"
"Make all checks payable to Services Vendor Inc. "
" Payment is due within 30 days."
"If you have any questions concerning this invoice, "
" contact Bia Hermes. "
"THANK YOU FOR YOUR BUSINESS! INVOICE "
"INVOICE # 0001 DATE 01/01/2022 FOR Alpha Project P.O. # 1000"
question = "What is the total amount of the invoice?"
prompt = f"{context}\n{question}"
# use gguf model (integrated llama.cpp backend in llmware)
model = ModelCatalog().load_model("bling-answer-tool")
response = model.inference(prompt)
print("gguf model response: ", response)
# use openvino model
model = ModelCatalog().load_model("bling-tiny-llama-ov")
response = model.inference(prompt)
print("openvino model response: ", response)
# use onnx model
model = ModelCatalog().load_model("bling-tiny-llama-onnx")
response = model.inference(prompt)
print("onnx model response: ", response)
Please check out the llmware GitHub repository for more examples and use cases, e.g., getting started with OpenVINO and getting started with ONNX. You can also bring your own OpenVINO and ONNX models, e.g., adding custom OpenVINO and ONNX models.
If you are an enterprise looking for a packaged “point and click” solution, or private deployment of these models, please contact us about our ModelHQ offering, now in private preview.
This Model Depot collection began as an internal project, and as it grew in size and scale, we decided to release it to the wider OpenVINO and ONNX open source communities to support adoption and provide value back to those platforms. We are extremely thankful for all of the support that we have received in launching this project, with a very special note of gratitude to the OpenVINO team for their expertise and advice on optimizing these models for extremely fast deployment on Intel GPUs.
We welcome feedback and ideas on how to grow and evolve this collection over time, and see this as just the starting point. If any model developer would like help packaging their models in these formats, or if there are requests for specific models to be added to the collection, we are always happy to engage, and want to ensure that Model Depot grows as a resource for the wider community.
Happy Edge Inferencing on x86!
About llmware
Our mission is to help enterprises deploy small language models productively, privately, safely, and with high accuracy. To date, our focus has been on model fine-tuning and on building software pipelines optimized for small models. With Model Depot, we are addressing the third key piece of the puzzle, which is deployment, and building close integration between the models, pipelines, and deployment platforms to unlock new use cases at radically lower cost and complexity.
For more information about llmware, please check out our main github repo at llmware-ai/llmware/.
Please also check out video tutorials at: youtube.com/@llmware.
You can also contact us on our website: www.llmware.ai.