Pinned · Benjamin Marie in Towards Data Science · Run Mixtral-8x7B on Consumer Hardware with Expert Offloading · Finding the right trade-off between memory usage and inference speed · Jan 11
Benjamin Marie in Stackademic · FLUTE: Faster QLoRA Fine-tuning with NF4 Models · Finally, NF4 models have a reasonable latency · 2d ago
Benjamin Marie · AdEMAMix: Achieve the Same Results as with AdamW Using Only Half as Many Training Tokens · With two momentum terms · 2d ago
Benjamin Marie in Towards Data Science · GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU · Fast and accurate GGUF models for your CPU · Sep 13
Benjamin Marie · Better Prioritize LLM Tasks for Higher System Throughput · How to replace the naive “first-come-first-serve” rule · Sep 5
Benjamin Marie in Stackademic · Enhanced SSM Training Through Initialization with a Pre-trained Transformer · The Mamba in the Llama · Sep 4
Benjamin Marie · Zamba2-1.2B: A Smaller Hybrid SSM/Transformer · Very fast and memory-efficient inference · Sep 3
Benjamin Marie in Stackademic · Jamba 1.5: Two New Hybrid Transformers/SSM of 52B and 398B Parameters · Huge but very efficient, especially for long-context processing · Aug 29