Pinned · Benjamin Marie in Towards Data Science · Mistral 7B: Recipes for Fine-tuning and Quantization on Your Computer · Cheap supervised fine-tuning with an impressive LLM · Oct 26, 2023
Pinned · Benjamin Marie in Towards Data Science · Run Mixtral-8x7B on Consumer Hardware with Expert Offloading · Finding the right trade-off between memory usage and inference speed · Jan 11
Benjamin Marie in Stackademic · Falcon Mamba 7B: SSM (attention-free) Models Are Getting Better · Attention-free models for faster inference · Aug 20
Benjamin Marie · FlexAttention: A Flexible PyTorch API for Implementing Attention Optimizations · It’s going to be easier to optimize attention computation · Aug 12
Benjamin Marie in Towards Data Science · Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA · What you can do with only 2x24 GB GPUs and a lot of CPU RAM · Aug 8
Benjamin Marie · ThinK: KV Cache Pruning for Memory Efficient Inference · A promising approach if combined with KV cache quantization · Aug 8
Benjamin Marie in Towards Data Science · Serve Multiple LoRA Adapters with vLLM · Without any increase in latency · Aug 3
Benjamin Marie · More Evidence that Ternary LLMs Are Good Enough · -1, 0, and 1 are all you need to make good LLMs · Jul 25
Benjamin Marie in Towards Data Science · Function Calling: Fine-Tuning Llama 3 on xLAM · Fast and memory-efficient thanks to QLoRA · Jul 23