Pinned: Benjamin Marie in Towards Data Science. Run Mixtral-8x7B on Consumer Hardware with Expert Offloading. Finding the right trade-off between memory usage and inference speed. Jan 11.
Benjamin Marie in Stackademic. Jamba 1.5: Two New Hybrid Transformer/SSM Models of 52B and 398B Parameters. Huge but very efficient, especially for long-context processing. 1d ago.
Benjamin Marie in Towards Data Science. Mistral-NeMo: 4.1x Smaller with Quantized Minitron. How pruning, knowledge distillation, and 4-bit quantization can make advanced AI models more accessible and cost-effective. 1d ago.
Benjamin Marie in Stackademic. Falcon Mamba 7B: SSM (Attention-Free) Models Are Getting Better. Attention-free models for faster inference. Aug 20.
Benjamin Marie. FlexAttention: A Flexible PyTorch API for Implementing Attention Optimizations. It’s going to be easier to optimize attention computation. Aug 12.
Benjamin Marie in Towards Data Science. Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA. What you can do with only 2x24 GB GPUs and a lot of CPU RAM. Aug 8.
Benjamin Marie. ThinK: KV Cache Pruning for Memory-Efficient Inference. A promising approach if combined with KV cache quantization. Aug 8.
Benjamin Marie in Towards Data Science. Serve Multiple LoRA Adapters with vLLM. Without any increase in latency. Aug 3.