Grids, Threadblocks, and Kernel Programming: A Quick Explanation of the Basics of CUDA Coding (Data Science Collective, Jun 20)
Luminal Internal Representation Explained: Going Through How Luminal Represents and CodeGens Naive Matmul (Jun 18)
Online Softmax to Flash Attention — and Why it Matters: Connecting the Key Optimizations from Online Softmax to Flash Attention (Data Science Collective, May 26)
What Working at Amazon Taught Me: and what I had to learn on my own (May 23)
PyTorch Tensors Explained: From Memory Usage to AutoGrad in PyTorch (Data Science Collective, May 10)
Exploring How DINOv2 was Trained: Diving Deep into the Loss Equations and Data Pipeline Meta Used (Mar 9)
Exploring DeepSeek’s R1 Training Process: Open-Source Intelligence on Par with Proprietary Models (TDS Archive, Jan 29)
Apollo and Design Choices of Video Large Multimodal Models (LMMs): Let’s Explore Major Design Choices from Meta’s Apollo Paper (TDS Archive, Jan 23)
Step-By-Step, Let’s Fine-Tune Flux.1: Fine-Tuning Black Forest’s Vision Transformer Flux.1 Dev (Jan 22)
LoRA Fine-Tuning On Your Apple Silicon MacBook: Let’s Go Step-By-Step Fine-Tuning On Your MacBook (TDS Archive, Nov 20, 2024)