VALDI’s smart cloud GPU network set to power infrastructure for Generative AI
Generative AI, a type of AI that focuses on generating new data or content, is all the rage. Fast-growing, accessible datasets, open-source software, and cheaper and faster computers have led to some amazing developments (e.g., AI art). Startups are realizing the immense power of Generative AI, making apps like NFT generators, advertising copywriters, AI artists, and homework assistants. It’s like 2007 and 2008 all over again, when the iPhone and then the App Store launched, kickstarting a whole new generation of startups and use cases.
While Generative AI models can be based on anything from relatively simple statistical techniques to complex neural networks, building and using one is always a two-step process. First, a dataset (generally of increasing size) is used to “train” the model using techniques such as hybrid, tensor, or data parallelism. Once trained, the model can then be used to generate outputs given a new, “unseen” input (e.g., generating an image from some text description). Training requires many high-powered GPU instances (such as several racks of NVIDIA A100s) working together. This is a very expensive process that can take several weeks. The second, so-called “inference” step, on the other hand, can often be handled by a single mid-tier GPU and is relatively inexpensive. Nonetheless, costs add up, especially given the high demand for such applications.
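To make the two steps and the idea of data parallelism concrete, here is a minimal sketch in plain Python. It is not how VALDI or any customer trains a model: a tiny linear model stands in for a real generative network, the "workers" are simulated in a loop rather than running on separate GPUs, and every name (`make_dataset`, `shard`, `local_gradient`, `train`, `infer`) is hypothetical. The point is only the pattern: each worker computes a gradient on its own shard of the data, the gradients are averaged, and every worker applies the same update.

```python
import random


def make_dataset(n, seed=0):
    """Toy data for illustration only: y = 3x + 1 with x drawn from [-1, 1]."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 1.0) for x in xs]


def shard(data, num_workers):
    """Split the dataset so each simulated worker sees a different slice."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(params, batch):
    """Mean-squared-error gradient computed by one worker on its own shard."""
    w, b = params
    grad_w = grad_b = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        grad_w += 2.0 * err * x
        grad_b += 2.0 * err
    n = len(batch)
    return grad_w / n, grad_b / n


def train(data, num_workers=4, steps=500, lr=0.1):
    """Step 1 (training): average per-worker gradients, then update once."""
    params = (0.0, 0.0)
    shards = shard(data, num_workers)
    for _ in range(steps):
        grads = [local_gradient(params, s) for s in shards]
        # On real hardware this averaging would be an all-reduce across GPUs.
        grad_w = sum(g[0] for g in grads) / num_workers
        grad_b = sum(g[1] for g in grads) / num_workers
        params = (params[0] - lr * grad_w, params[1] - lr * grad_b)
    return params


def infer(params, x):
    """Step 2 (inference): one cheap forward pass with the trained parameters."""
    w, b = params
    return w * x + b
```

Calling `train(make_dataset(400))` runs the expensive, repeated part, while `infer(params, 2.0)` is a single cheap evaluation, which mirrors why training needs many GPUs and inference often fits on one.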
In practice, to achieve the state of the art in Generative AI, a diffusion model is now indispensable. Diffusion models have two main advantages over models such as GANs: (1) they need no adversarial component in training, which can make training results unstable and hard to reproduce; (2) they can scale to far more parameters than GANs. The larger parameter count yields much better quality and higher resolution in the generated results. Training a diffusion model requires a vast dataset: from hundreds of millions to several billion images at resolutions higher than 512×512, together with embeddings of their text descriptions or titles, which are used as conditioning.
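For readers who have not seen how a diffusion model is trained, the core idea is to add noise to training images in small steps and teach a network to undo it. The post does not say which formulation is used, so the sketch below assumes the widely used DDPM-style forward process with a linear noise schedule, where a noised sample is x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The function names and the default schedule values are illustrative assumptions, and a short list of floats stands in for an image.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed default values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s up to and including t."""
    alpha_bars = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to timestep t: mix the clean sample with Gaussian noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    # A denoising network would be trained to predict `noise` from `xt` and t.
    return xt, noise
```

With 1,000 steps under this schedule, almost none of the original signal remains at the final timestep, which is why generation can start from pure noise and denoise step by step, guided by the text embedding.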
*This image was generated from a VALDI customer’s Generative AI model using hybrid parallelism. Much of the text in the above post was also co-written using Generative AI.
VALDI currently has over 10 PFLOP/s of computational processing power, including over 12k high-powered GPUs, committed to its smart cloud network. It has been helping companies power the inference step for their diffusion models using a vast array of otherwise idle NVIDIA RTX 3090 and V100 GPUs. Customers benefit from capped costs under VALDI’s startup-friendly monthly subscription plans, which include unlimited machine learning at a very low cost. VALDI is also working on supporting data parallelism for training such diffusion models on a stack of A100 GPUs to further lower costs.