Pinned
Run Mixtral-8x7B on Consumer Hardware with Expert Offloading
Finding the right trade-off between memory usage and inference speed
Benjamin Marie in Towards Data Science · Jan 11

Run and Serve Faster VLMs Like Pixtral and Phi-3.5 Vision with vLLM
Understanding how much memory you need to serve a VLM
Benjamin Marie in Towards Data Science · 4h ago

FLUTE: Faster QLoRA Fine-tuning with NF4 Models
Finally, NF4 models have a reasonable latency
Benjamin Marie in Stackademic · 5d ago

AdEMAMix: Achieve the Same Results as with AdamW Using Only Half as Many Training Tokens
With two momentum terms
Benjamin Marie · 6d ago

GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU
Fast and accurate GGUF models for your CPU
Benjamin Marie in Towards Data Science · Sep 13

Better Prioritize LLM Tasks for Higher System Throughput
How to replace the naive “first-come-first-serve” rule
Benjamin Marie · Sep 5

Enhanced SSM Training Through Initialization with a Pre-trained Transformer
The Mamba in the Llama
Benjamin Marie in Stackademic · Sep 4

Zamba2-1.2B: A Smaller Hybrid SSM/Transformer
Very fast and memory-efficient inference
Benjamin Marie · Sep 3