How IT Teams Can Set the Right Foundation for AI Projects

Using hardware from NVIDIA and Pure Storage to consolidate infrastructure silos.

Emily Potyraj
The Startup
3 min read · Jul 1, 2020

Joint post with Tony Paikeday from NVIDIA.

Photo by Shane McLendon on Unsplash

We’ve discussed the importance of MLOps in a previous post. Once businesses commit to integrating DevOps practices into their AI projects, what’s the first step IT teams can take toward successful MLOps?

How can IT teams set themselves up for success?

Goal #1: Support a Range of Applications

An AI platform needs to support more than TensorFlow, and more than model-development workloads alone. It also needs to provide testing pipelines, versioning, sandbox environments, monitoring, and more.

For example, you might start by creating a Kubernetes cluster for AI workloads. That cluster will run a wide set of applications that need access to a variety of datasets and compute hardware, and likely a variety of protocols as well.
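As a concrete illustration, here is a minimal sketch of scheduling one such workload with the official Kubernetes Python client. The namespace, pod name, container image, and NFS server address are hypothetical placeholders; the GPU request uses the standard nvidia.com/gpu resource exposed by NVIDIA’s device plugin.

```python
# A minimal sketch, not a production manifest: schedule a GPU training pod
# with the official Kubernetes Python client. The namespace, image tag, and
# NFS server address below are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", namespace="ai-workloads"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:23.10-py3",  # placeholder tag
                command=["python", "train.py"],
                # GPUs are requested via the resource name exposed by
                # NVIDIA's Kubernetes device plugin.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2"}
                ),
                volume_mounts=[
                    client.V1VolumeMount(name="datasets", mount_path="/data")
                ],
            )
        ],
        # Shared datasets arrive over NFS here; other pods in the same
        # cluster might reach the same data over an object protocol instead.
        volumes=[
            client.V1Volume(
                name="datasets",
                nfs=client.V1NFSVolumeSource(server="10.0.0.10", path="/datasets"),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ai-workloads", body=pod)
```

Even this single pod touches GPUs, shared data, and cluster plumbing at once. Multiply that across training, testing, monitoring, and serving workloads, and the risk of silos becomes clear.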

An AI platform includes everything from developer IDEs to cluster monitoring. If IT teams aren’t careful, they’ll end up supporting many silos of infrastructure across the platform.

Goal #2: Scalable & Self-service Infrastructure

As with any platform hosted by IT and DevOps teams, an AI platform should support application scalability and resiliency. Ideally, data scientists should also have self-serve access to new environments.
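What could self-service look like in practice? Here’s one hedged sketch, again using the Kubernetes Python client: an IT-owned helper that provisions a per-team namespace with a GPU quota, so a data science team gets a fresh environment without filing a ticket. The naming scheme and quota values are illustrative assumptions, not a prescribed design.

```python
# An illustrative sketch of self-service provisioning: create a team
# namespace plus a ResourceQuota capping its GPU usage. The "ds-<team>"
# naming scheme and default quota are assumptions for this example.
from kubernetes import client, config


def provision_team_env(team: str, gpu_limit: int = 4) -> None:
    """Create a namespace and a GPU ResourceQuota for one team."""
    config.load_kube_config()
    core = client.CoreV1Api()

    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=f"ds-{team}"))
    )
    core.create_namespaced_resource_quota(
        namespace=f"ds-{team}",
        body=client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name="gpu-quota"),
            spec=client.V1ResourceQuotaSpec(
                # Quota keys for extended resources use the requests. prefix.
                hard={"requests.nvidia.com/gpu": str(gpu_limit)}
            ),
        ),
    )


provision_team_env("recommendations", gpu_limit=4)
```

Wrap something like this in a portal or chat command, and "I need an environment" stops being a ticket in IT's queue.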

→ Without a cohesive plan to support the production pipeline as a unified project, individual application silos often become inefficient, unscalable, and fragile.

Step back and ask, “How can we make this set of disparate workloads as easy to manage and to scale as possible?”

Solution: Consolidate Workloads

If you’re an IT leader, you have an incredible opportunity. The success of your company’s AI-fueled ambitions requires you to enable developers in a new way.

Get in front of the productionization crisis by making architectural choices that centralize AI infrastructure, consolidating people, process, and technology.

On the storage side, use the same centralized storage underneath all of the applications in the platform. For example, Pure Storage’s FlashBlade handles a wide range of I/O patterns and provides performant access for both file and object workloads, which makes it well suited to any of these components.
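To make the file-and-object point concrete, here’s a minimal sketch of one dataset reached both ways: as an ordinary file path on an NFS mount, and through an S3-compatible endpoint via boto3. The mount point, endpoint URL, credentials, bucket, and key are all placeholders.

```python
# A minimal sketch of reaching one shared storage system over both
# protocols: file (an NFS mount) and object (an S3-compatible API).
# Every path, URL, credential, and key below is a placeholder.
import boto3

# File access: training code reads datasets from an NFS mount as plain paths.
with open("/mnt/datasets/images/batch-000.tar", "rb") as f:
    header = f.read(1024)

# Object access: the same storage exposed over an S3-compatible endpoint,
# handy for analytics tools and pipelines that speak object APIs.
s3 = boto3.client(
    "s3",
    endpoint_url="https://flashblade.example.com",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",                 # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)
s3.download_file("datasets", "images/batch-000.tar", "/tmp/batch-000.tar")
```

One storage system, two access paths, zero extra silos to manage.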

Likewise, NVIDIA’s DGX A100 brings consolidation to the compute hardware. With DGX A100, NVIDIA consolidated what used to be three separate silos of legacy compute infrastructure, each sized and designed to support only one specific workload: training, inference, or analytics. DGX A100 supports all of these workloads on a single universal system type.
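As a purely illustrative sketch of that consolidation, the snippet below runs either a training step or an inference pass on the same GPU node, with nothing but the job name changing. The tiny model is a stand-in, not a real workload. (On A100-class GPUs, Multi-Instance GPU partitioning can further carve one physical GPU into isolated slices for smaller jobs.)

```python
# An illustrative sketch: the same GPU node handles training or inference
# depending on the job it is handed. The tiny linear model is a stand-in.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)


def run(job: str, batch: torch.Tensor) -> torch.Tensor:
    batch = batch.to(device)
    if job == "train":
        # Training: gradients on, one optimizer step.
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.detach()
    # Inference (or analytics-style batch scoring): gradients off.
    with torch.no_grad():
        return model(batch)


run("train", torch.randn(32, 512))  # this week: training
run("infer", torch.randn(32, 512))  # next week: inference, same hardware
```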

Now you have just two building blocks to manage: one for storage and one for compute. This infrastructure simplicity is what lowers the barrier to getting models into production; there’s already a place where new workloads can run. With the AIRI reference architecture from Pure Storage and NVIDIA, you can support the end-to-end AI lifecycle, from development to deployment, on one elastic infrastructure.

Takeaway

By using a flexible yet homogeneous infrastructure built on one system type, IT teams can adapt to the demands of the business over time. If you need to go heavy on analytics this week, pivot to training next week, and shift toward inference the week after, you can support it all without any physical change to the infrastructure. Set a strong foundation for AI projects with compute and storage that flexibly adapt to workload demands as they change over time.

Learn more about deploying Pure Storage with NVIDIA for AI: AI Solutions.
