Maxime Labonne in Towards Data Science: "Create Mixtures of Experts with MergeKit." Combine multiple models into a single MoE. (Mar 27)
Benjamin Marie: "Google's Gemma: Fine-tuning, Quantization, and Inference on Your Computer." More training tokens and a huge vocabulary. (Feb 26)
Yeyu Huang in Level Up Coding: "The 2-bit Quantization is Insane! See How to Run Mixtral-8x7B on Free-tier Colab." A quick tutorial for AQLM 2-bit quantization and its implementation. (Feb 20)
Yash Bhaskar in Cubed: "Groq Inference Engine — 18x Faster Than GPUs." Let's delve into Groq's technology, its implications for various industries, and the transformative potential it holds for the future of AI. (Feb 22)
Benjamin Marie in Towards Data Science: "Run Llama 2 70B on Your GPU with ExLlamaV2." Finding the optimal mixed-precision quantization for your hardware. (Sep 29, 2023)
Matthew Harris in Towards Data Science: "Some Thoughts on Operationalizing LLM Applications." A few personal lessons learned from developing LLM applications. (Jan 27)
Nikita Kiselov in Towards Data Science: "Building, Evaluating and Tracking a Local Advanced RAG System | Mistral 7b + LlamaIndex + W&B." Learn how to build, evaluate, and track an advanced RAG system using a local Mistral-7b, LlamaIndex, and W&B; a step-by-step guide with code. (Jan 19)
Dennis Bakhuis in Towards Data Science: "The Most Simple Way to Set Up ChatGPT Locally." The secret to running LLMs on consumer hardware! (Jan 17)
Vishal Rajput in AIGuys: "Mamba: Can it replace Transformers?" Solving the quadratic scaling problem of self-attention. (Jan 8)
Wenqi Glantz in Towards Data Science: "Democratizing LLMs: 4-bit Quantization for Optimal LLM Inference." A deep dive into model quantization with GGUF and llama.cpp, and model evaluation with LlamaIndex. (Jan 15)
Gao Dalie (高達烈) in Artificial Intelligence in Plain English: "CrewAI + Solar/Hermes + Langchain + Ollama = Super AI Agent." As technology booms, AI agents are becoming game changers, quickly becoming partners in problem-solving, creativity, and innovation, and… (Jan 14)
Marcello Politi in Towards Data Science: "Deploy Tiny-Llama on AWS EC2." Learn how to deploy a real ML application using AWS and FastAPI. (Jan 12)
Benjamin Marie in Towards Data Science: "Run Mixtral-8x7B on Consumer Hardware with Expert Offloading." Finding the right trade-off between memory usage and inference speed. (Jan 11)
Iulia Brezeanu in Towards Data Science: "How to Cut RAG Costs by 80% Using Prompt Compression." Accelerating inference with prompt compression. (Jan 4)
Mandar Karhade, MD, PhD in Towards AI: "Run Mixtral 8x7b on Google Colab Free." A clever trick allows offloading some layers. (Dec 31, 2023)
Alon Agmon in Towards Data Science: "Streamlining Serverless ML Inference: Unleashing Candle Framework's Power in Rust." Building a lean and robust model-serving layer for vector embedding and search with Hugging Face's new Candle framework. (Dec 21, 2023)
Martin Thissen: "Mixtral 8x7B on Your Local Computer | Free GPT-4 Alternative." In this article I will point out the key features of the Mixtral 8x7B model and show you how you can run the Mixtral 8x7B model on your… (Dec 17, 2023)
Aaron 0928: "Hugging Face has written a new ML framework in Rust, now open-sourced!" Recently, Hugging Face open-sourced a heavyweight ML framework, Candle, which is a departure from the usual Python approach to machine… (Aug 14, 2023)