Gong Ys · Running NFSoRDMA in a Kubernetes cluster · NFS is a popular shared file system. We can use it to store models and training data. The distributed training program running on each GPU… · 8 min read · Oct 19, 2023
Gong Ys · Test ROCE network for LLM Training or fine-tuning in Kubernetes clusters · Training or fine-tuning LLMs needs RDMA or ROCE networking because of the requirements of low latency and high throughputs. Although there… · 8 min read · Oct 18, 2023
Gong Ys · A walkthrough to tune LLMs with Ray Clusters in the on-premise K8S platform · A Ray cluster consists of a single head node and any number of connected worker nodes: · 10 min read · Sep 24, 2023
Gong Ys · Setting up a Kubernetes cluster to run AI applications · LLM training or fine-tuning are popular these days. It is demanding to prepare one Kubernetes cluster to support these tasks quickly… · 12 min read · Sep 20, 2023