Run Mixtral-8x7B on Consumer Hardware with Expert Offloading. Finding the right trade-off between memory usage and inference speed. Published in TDS Archive, Jan 11, 2024.
RAG with Qwen3 Embedding and Qwen3 Reranker. How to use embedding and reranker models to efficiently retrieve only the most relevant chunks or documents given a user query. Jun 26.
No Verifier? No Problem: Reinforcement Learning with Reference Probabilities. RLPR: Extrapolating RLVR to General Domains without Verifiers. Jun 25.
How Well Does Qwen3 Handle 4-bit and 2-bit Quantization? Let’s review Qwen3 and check which one you should use. May 3.
2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy. Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU. Published in TDS Archive, Jan 31.
Hymba: Combining Attention Heads and SSM Heads within the Same Layer. Faster and better LLMs. Published in Stackademic, Dec 3, 2024.