Pinned: Benjamin Marie in Towards Data Science, "Run Mixtral-8x7B on Consumer Hardware with Expert Offloading": Finding the right trade-off between memory usage and inference speed (Jan 11)
Benjamin Marie, "Better Prioritize LLM Tasks for Higher System Throughput": How to replace the naive "first-come, first-served" rule (3 days ago)
Benjamin Marie in Stackademic, "Enhanced SSM Training Through Initialization with a Pre-trained Transformer": The Mamba in the Llama (4 days ago)
Benjamin Marie, "Zamba2-1.2B: A Smaller Hybrid SSM/Transformer": Very fast and memory-efficient inference (5 days ago)
Benjamin Marie in Stackademic, "Jamba 1.5: Two New Hybrid Transformer/SSM Models of 52B and 398B Parameters": Huge but very efficient, especially for long-context processing (Aug 29)
Benjamin Marie in Towards Data Science, "Mistral-NeMo: 4.1x Smaller with Quantized Minitron": How pruning, knowledge distillation, and 4-bit quantization can make advanced AI models more accessible and cost-effective (Aug 29)
Benjamin Marie in Stackademic, "Falcon Mamba 7B: SSM (Attention-Free) Models Are Getting Better": Attention-free models for faster inference (Aug 20)