Gong Ys, "Running NFSoRDMA in a Kubernetes cluster": NFS is a popular shared file system. We can use it to store models and training data. The distributed training program running on each GPU… (Oct 19, 2023)
Gong Ys, "Test ROCE network for LLM Training or fine-tuning in Kubernetes clusters": Training or fine-tuning LLMs needs RDMA or RoCE networking because it requires low latency and high throughput. Although there… (Oct 18, 2023)
Gong Ys, "A walkthrough to tune LLMs with Ray Clusters in the on-premise K8S platform": A Ray cluster consists of a single head node and any number of connected worker nodes: (Sep 24, 2023)
Gong Ys, "Setting up a Kubernetes cluster to run AI applications": LLM training and fine-tuning are popular these days. It is demanding to prepare a Kubernetes cluster to support these tasks quickly… (Sep 20, 2023)