Pinned · Published in TDS Archive · Run Mixtral-8x7B on Consumer Hardware with Expert Offloading · Finding the right trade-off between memory usage and inference speed (Jan 11, 2024)
How Well Does Qwen3 Handle 4-bit and 2-bit Quantization? · Let’s review Qwen3 and check which quantization you should use (May 3)
Published in TDS Archive · 2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy · Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU (Jan 31)
Published in Stackademic · Hymba: Combining Attention Heads and SSM Heads within the Same Layer · Faster and better LLMs (Dec 3, 2024)
Is BFloat16’s Precision Not Good Enough for RoPE? · Maybe not, according to a new study (Dec 2, 2024)
Judge Arena: A New Leaderboard for LLMs as Evaluators · The battle of the judges! (Nov 25, 2024)
Published in TDS Archive · DPO Full Training vs. LoRA: How Good is LoRA for DPO Training? · One model, two adapters (Nov 20, 2024)
More Training Tokens => More Difficult to Quantize! · But only with GPTQ and AWQ? (Nov 18, 2024)