Suvasis Mukherjee

Pooling in Embedding (1d ago)
Pooling is typically used in natural language processing (NLP) to reduce the dimensionality of feature representations, making them more…
Do We Need a GPU? (Sep 13)
Neural networks can significantly reduce their number of parameters through pruning, which helps preserve accuracy despite large reductions…
Efficient LLMs at Inference Time (Aug 9)
There are many LLMs on the market; the challenge is getting a model to run on commodity hardware. The Deja Vu paper proposes a method…
Economics of the KV Cache in Transformers (Aug 7)
OpenAI charges twice as much per input token for GPT models with longer contexts. This API pricing reflects one of the…
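The memory behind that pricing can be estimated with the standard KV-cache size formula. A back-of-envelope sketch, assuming a GPT-3-175B-like shape (96 layers, 96 attention heads, head dimension 128, fp16 values); these shape parameters are assumptions, not stated in this listing:

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # The factor of 2 accounts for the separate K and V tensors cached per layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical GPT-3-like shape: 96 layers, 96 heads, head_dim 128, fp16
per_token = kv_cache_bytes(96, 96, 128, seq_len=1)
full_ctx = kv_cache_bytes(96, 96, 128, seq_len=2048)
print(f"{per_token / 2**20:.1f} MiB per token")   # 4.5 MiB per token
print(f"{full_ctx / 2**30:.1f} GiB at 2048 tokens")  # 9.0 GiB at 2048 tokens
```

Because the cache grows linearly with context length, a doubled context roughly doubles the per-request memory a provider must reserve, which is consistent with charging more per token at longer contexts.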
Short-Wave Infrared Camera (Jul 30)
Imagine a produce distributor needing to inspect frozen peas for any debris. If a small piece of plastic, similar in shape, size, and color…
Cost of Training GPTs on the Nvidia A100 80GB GPU (Jul 25)
Why training LLMs is so expensive and why contextual sparsity is important during inference.
How Many A100 40GB or 80GB GPUs Are Needed to Hold GPT-3 or LLaMA for Training? (Jul 25)
Back-of-envelope calculation:
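The kind of back-of-envelope calculation the title describes can be sketched as follows. Assumptions (not taken from this listing): GPT-3 has 175B parameters, fp16 weights take 2 bytes per parameter, and mixed-precision Adam training needs roughly 16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two Adam moment tensors):

```python
import math

def gpus_needed(n_params, bytes_per_param, gpu_gb):
    """Minimum GPU count just to hold the given bytes-per-parameter footprint."""
    total_gb = n_params * bytes_per_param / 1e9
    return math.ceil(total_gb / gpu_gb)

GPT3_PARAMS = 175e9

# Inference, fp16 weights only (2 bytes/param): 350 GB of weights
print(gpus_needed(GPT3_PARAMS, 2, 80))   # 5 x A100 80GB
print(gpus_needed(GPT3_PARAMS, 2, 40))   # 9 x A100 40GB

# Training with Adam in mixed precision (~16 bytes/param): 2.8 TB
print(gpus_needed(GPT3_PARAMS, 16, 80))  # 35 x A100 80GB
```

Note this counts only parameter and optimizer state; activations, KV caches, and framework overhead push the real GPU count higher.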
How to Comprehend a 50-Page Financial Report in Milliseconds (Jul 19)
In finance, companies often release extensive 50-page documents. The stock market needs to react swiftly to this information, requiring…
GPU Bandwidth and Calculating TFLOPS for the V100 and A100 (Jun 26)
Compiled code always has scalar instructions. Here is a thread binary that has only scalar instructions; it is run as many copies as many…
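The TFLOPS calculation the title refers to follows from counting FMA units and clocks. A minimal sketch using Nvidia's published FP32 specs (5120 CUDA cores at ~1.53 GHz boost for the V100, 6912 CUDA cores at ~1.41 GHz boost for the A100); the helper name is mine:

```python
def peak_tflops(cuda_cores, boost_ghz):
    # Each CUDA core retires one fused multiply-add (2 FLOPs) per clock,
    # so peak FLOP/s = cores * 2 * clock; divide by 1000 for GHz -> TFLOPS.
    return cuda_cores * 2 * boost_ghz / 1000

print(round(peak_tflops(5120, 1.53), 1))  # V100: ~15.7 FP32 TFLOPS
print(round(peak_tflops(6912, 1.41), 1))  # A100: ~19.5 FP32 TFLOPS
```

These are theoretical peaks; real kernels are usually limited by memory bandwidth well before they reach them, which is why the bandwidth half of the title matters.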
NVLink (Jun 24)
Today’s server GPUs are typically connected by the PCI Express (PCIe) bus, which provides a communication bandwidth of 12 GB per second…