An overview of machine learning infrastructure

Building machine learning infrastructure that works in the real world.

10 min read · May 14, 2022

It may come as no surprise that there are many tools available for automating machine learning workflows with scripts and event triggers. In these pipelines, data is processed, models are trained, monitoring tasks are performed, and finally models are deployed. These tools let teams focus on more complex tasks while standardizing processes and improving efficiency.
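
To make this concrete, here is a minimal, self-contained sketch in Python of what such a scripted, event-triggered pipeline can look like. The stage functions and the trigger event are hypothetical stand-ins rather than any particular tool's API; real stages would call your own data and training tooling.

```python
# A minimal, self-contained sketch of a scripted ML pipeline: each stage is a
# plain function, and the pipeline runs them in order when an event fires.
# The stage logic is deliberately trivial; real stages would call your data
# and training tooling.

from typing import Any, Callable

def ingest(event: dict) -> list[float]:
    # In practice this would read the files referenced by the trigger event.
    return event.get("values", [])

def preprocess(values: list[float]) -> list[float]:
    # Scale by the maximum as a stand-in for real feature engineering.
    top = max(values) or 1.0
    return [v / top for v in values]

def train(features: list[float]) -> float:
    # Stand-in "model": just the mean of the features.
    return sum(features) / len(features)

def deploy(model: float) -> None:
    # Stand-in for pushing the trained artifact to serving.
    print(f"deploying model artifact: {model:.3f}")

PIPELINE: list[Callable[[Any], Any]] = [ingest, preprocess, train, deploy]

def on_event(event: dict) -> None:
    """Run every stage in order; a scheduler or event trigger would call this."""
    artifact: Any = event
    for stage in PIPELINE:
        artifact = stage(artifact)

on_event({"values": [3.0, 1.0, 2.0]})
```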

Machine learning infrastructure is the infrastructure used to develop and operate machine learning models: the systems, resources, and tools that support developing, training, deploying, and running them. How it is implemented varies with the nature of the project.

It supports machine learning workflows at every stage. With it, data scientists, engineers, and DevOps teams can control and manage all of the resources and processes needed to build and deploy machine learning models.

Data science teams expect more from their machine learning infrastructure as machine learning becomes more sophisticated. Today, machine learning drives businesses rather than being used primarily for research. While the base of machine learning platforms remains the same (management, monitoring, tracking experiments and models), there are a few things to consider when designing an infrastructure for scalability and elasticity. Machine learning infrastructure must be built as efficiently as possible, with as little technical debt as possible, to increase the speed at which models can be developed. The purpose of this blog is to give an overview of how modern machine learning infrastructures look and how they can be built to scale.

How can a machine learning infrastructure be built in the most scalable way?

To start the MLOps model lifecycle, always begin by identifying what business problem(s) you are trying to solve. After all, if you don't know where you're going, how can you get there?

MLOps team members work across different departments in an organization to develop models. Establish business objectives and goals for every model: with clear direction, your team stays aligned and on schedule.

Following the MLOps principle of asking "What is your goal?" at the beginning of every model lifecycle also prevents building models that don't benefit the organization. If we know beforehand what we need to identify from video data, for example, it is much easier for the MLOps team to deliver real value.

Building machine learning infrastructure requires a few critical components. To build on top of your existing machine learning stack, the infrastructure needs to be designed for scalability and visibility. First there is the AI fabric: the computing resources, the orchestration platforms such as Kubernetes or OpenShift, and the way the fabric integrates with your machine learning workflows. ML infrastructure should also integrate solutions for data management and data version control, and provide a machine learning workbench where data scientists can train models, run their research, and optimize models without friction.

The final component of a scalable machine learning infrastructure is an easy, intuitive way to deploy models to production. A huge share of models today never make it to production because of hidden technical debt. For machine learning to be successful, the infrastructure must be agnostic and integrate easily with existing and future technology stacks. If it is portable and uses containers for deployment, your data scientists should be able to run experiments and workloads in one click. We will examine the main components of building a scalable machine learning infrastructure in the following sections.

What are the biggest challenges facing machine learning infrastructure?

The biggest challenge facing AI and machine learning today is doing data science at scale, because data scientists spend little of their time actually doing data science. The majority of a data scientist's day goes to configuring hardware, GPUs, and CPUs, configuring orchestration tools like Kubernetes and OpenShift, and configuring containers. Hybrid cloud infrastructures are also gaining popularity for scaling AI, and they add complexity to the machine learning stack: you now have to manage resources across multiple clouds, multi-cloud and hybrid-cloud setups, and other complicated configurations.

Resource management is now part of a data scientist's responsibilities. Sharing an on-premises GPU server among a team of five data scientists is difficult, and figuring out how to share those GPUs efficiently and effectively takes a lot of time. This difficulty in allocating compute resources holds data science back. Managing the models that machine learning produces is also challenging: versioning data and models, managing and deploying models, and working with open source tools and frameworks all take effort.

A poorly equipped machine learning infrastructure hampers AI and ML results. The major challenge in machine learning is managing the machine learning workflow. In enterprises today there are two main workflows, and they are disconnected and ineffective. The first is the DevOps, or MLOps, workflow: resource management, infrastructure, orchestration, visualization of models in production, and integration with existing IT tools such as Git and Jira. The second is the data science workflow: data selection, data preparation, model research, training, validation, tuning, and finally deployment. Each of these pipelines involves many steps and components.

These two flows are today completely disconnected and are often managed by separate teams. Consequently, enterprises carry a lot of technical debt from these broken workflows, which affects both time to production and cost. Workflows become more complex as an organization grows, and with teams working on different projects around the world, the infrastructure ends up completely siloed. A scalable machine learning infrastructure should therefore be streamlined across all projects and teams in a company.

What steps do you take to make sure your machine learning infrastructure follows MLOps best practices?

Before addressing these challenges, you must first understand what MLOps is. Machine learning operations reduce friction and bottlenecks between the teams developing ML models and the engineering teams operationalizing them. MLOps applies DevOps practices to machine learning and AI development; as a discipline, it aims to systematize the entire machine learning process. Enterprises use MLOps to productionize machine learning models, automate DevOps tasks, and free data scientists to focus on high-impact models rather than nagging technical challenges.

When building your machine learning infrastructure, you should consider two key questions. How can it be made easy to use for data scientists without a background in DevOps? And how do you build an enterprise stack that gives the DevOps engineers overseeing it high scalability and performance? MLOps answers some of these questions. Compute comes first: machine learning requires a significant amount of computing power, and to scale, the infrastructure must be compute agnostic, whether you are using GPU clusters, CPU clusters, Spark clusters, or cloud resources. In many enterprise environments, there is a shared pool of resources used to develop machine learning applications.

This pool is split between different teams but still contains GPUs, GPU clusters, CPU clusters, and cloud resources. It can be used to train models, preprocess data, serve models, run inference, and perform other machine learning work. A typical machine learning pipeline starts with a dataset, and one worker is assigned to preprocess the data. Some workers are then designated to train models such as ResNet, VGG16, YOLO, InceptionV3, or InceptionV4; for better deep learning performance these are GPU workers, and they do the actual training. During this time you may also need to run some Jupyter notebooks, which consume computing power of their own, as well as deploy models to the cloud. This is only one pipeline, but at enterprise scale multiple pipelines and projects run concurrently, which makes things much trickier.

Compute consumption grows even bigger if you dive deeper into a single pipeline. Each algorithm will include hyperparameter optimization, which means that with VGG16 we aren't running a single TensorFlow run; we are running close to 500 runs of the TensorFlow code, so the compute is multiplied by 500. A scalable machine learning infrastructure should make running those 500 experiments feasible: by running and tweaking more models, you are likely to achieve better results and better accuracy. With schedulers and meta-schedulers in the infrastructure, data scientists can self-manage these workloads, which provides scalability and self-service workload management.
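
To illustrate how a single training step fans out under hyperparameter optimization, here is a small Python sketch that enumerates a 500-combination grid for a VGG16-style run and hands each configuration to a scheduler. The grid values and the submit_run function are illustrative assumptions, not a specific framework's API.

```python
# Sketch of how one "train VGG16" step fans out into hundreds of runs once
# hyperparameter optimization is added. The grid below yields 500 combinations;
# submit_run() is a hypothetical stand-in for whatever scheduler or queue hands
# each run to a GPU worker.

from itertools import product

learning_rates = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]   # 5 values
batch_sizes    = [16, 32, 64, 128, 256]            # 5 values
dropouts       = [0.1, 0.2, 0.3, 0.4, 0.5]         # 5 values
weight_decays  = [0.0, 1e-5, 1e-4, 1e-3]           # 4 values
# 5 * 5 * 5 * 4 = 500 distinct training runs

def submit_run(run_id: int, config: dict) -> None:
    # A real implementation would package this config into a containerized
    # training job and hand it to the cluster scheduler.
    print(f"queueing run {run_id}: {config}")

for run_id, (lr, bs, do, wd) in enumerate(
    product(learning_rates, batch_sizes, dropouts, weight_decays)
):
    submit_run(run_id, {
        "model": "vgg16",
        "learning_rate": lr,
        "batch_size": bs,
        "dropout": do,
        "weight_decay": wd,
    })
```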

Here, we will explore how to build a machine learning architecture that scales for enterprise workloads.

1. Containers

Running machine learning workloads in containers is essential for flexibility and portability. Containers let you distribute workloads across different compute resources, so you can assign GPUs, cloud GPUs, accelerators, or any other resource to each workload and spread jobs across whatever resources you have at your disposal. DevOps engineers love containers because they make workloads easier to manage and keep them portable.

Containers also give you reproducible environments, which means reproducible data science and data analysis. Cloud native technologies let you run the containers anywhere: on Kubernetes clusters, bare metal platforms, Docker hosts, or cloud resources that support containers. An orchestration platform such as OpenShift makes it easier to run and execute containers across a cluster.
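
As a rough illustration of this portability, the sketch below launches a containerized training job with the Docker SDK for Python. The image name, command, environment variable, and GPU request are assumptions for the example; the point is that the same container image can be scheduled onto a laptop, an on-prem GPU server, or a cloud VM unchanged.

```python
# Sketch: launch a containerized training job with the Docker SDK for Python
# (pip install docker). Image, command, and environment are illustrative.

import docker

client = docker.from_env()

container = client.containers.run(
    image="my-registry/train-vgg16:latest",                  # hypothetical training image
    command=["python", "train.py", "--epochs", "10"],
    environment={"DATASET_URI": "s3://my-bucket/dataset"},   # hypothetical data location
    device_requests=[                                        # ask the runtime for one GPU
        docker.types.DeviceRequest(count=1, capabilities=[["gpu"]])
    ],
    detach=True,                                             # return immediately; job runs in background
)
print(f"started training container {container.id}")
```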

2. Orchestration

For orchestration, it's imperative to build something independent of the compute resources underneath. Despite Kubernetes' popularity as a machine learning orchestration tool, there are many flavors of it: you can run Rancher, OpenShift, or vanilla Kubernetes, and you can deploy MicroK8s or Minikube for small deployments. So when designing your own infrastructure, you must decide which orchestration platforms you want to support both now and in the future. In other words, you have to design the stack to fit your existing infrastructure while also considering future infrastructure requirements.

Whatever infrastructure you design must also be able to take advantage of the compute resources that already exist within your enterprise. If you need to support large Hadoop clusters, Spark clusters, or bare-metal servers that aren't running on Kubernetes, such as CPU clusters, you need to be able to do that too. It is important to build an infrastructure that integrates with the Hadoop cluster, leverages Spark and YARN, and can take advantage of all the technology your organization already possesses. To make all of these compute resources accessible and usable by data scientists throughout the organization, you should also consider how to manage them in one place.
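
For comparison, here is a sketch of submitting that same training container as a Kubernetes Job with the official Python client. The namespace, image, and GPU resource key are illustrative assumptions; whether the cluster is Rancher, OpenShift, or vanilla Kubernetes, the submission code stays the same, which is what orchestration-agnostic design buys you.

```python
# Sketch: submit a training container as a Kubernetes Job with the official
# Python client (pip install kubernetes). Names and resources are illustrative.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="train",
    image="my-registry/train-vgg16:latest",        # hypothetical training image
    args=["python", "train.py", "--epochs", "10"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}              # request one GPU from the scheduler
    ),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="vgg16-train"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=0,                            # do not retry failed runs automatically
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-team", body=job)
```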

3. Hybrid cloud and multi-cloud infrastructure

Why is a hybrid cloud infrastructure beneficial for machine learning? I could easily write a whole post about this. In short, a hybrid cloud infrastructure suits machine learning workloads because they are usually stateless. A training job may run for a day, or for two weeks, and then terminate; as long as the models and data are stored, you can terminate the machine and forget about it. For this reason the cloud for machine learning is unlike the cloud for traditional software, where, as a developer, you need to ensure your database is shared across hybrid environments. In hybrid cloud machine learning, controlling your resources lets you make the most of the compute you already own. As an example, assume a company has eight GPUs on-premises and ten data scientists. Ideally, the organization would use all eight of its GPUs and burst to the cloud only when it reached 100% utilization or allocation. Bursting to the cloud lets organizations run more experiments while keeping cloud costs down, and it lets data scientists scale their machine learning activities easily.
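
Below is a minimal sketch of that placement rule: fill the on-prem GPUs first and burst to the cloud only once local allocation hits 100%. The capacities and submit functions are hypothetical placeholders for whatever scheduler your stack actually uses.

```python
# Minimal sketch of cloud bursting: prefer the GPUs you already own, and pay
# for cloud GPUs only when on-prem allocation is full. All names are placeholders.

ON_PREM_GPUS = 8          # GPUs the company already owns
allocated_on_prem = 0     # GPUs currently handed out to running jobs

def submit_on_prem(job: str) -> None:
    print(f"{job}: scheduled on-prem")

def submit_to_cloud(job: str) -> None:
    print(f"{job}: bursting to cloud")

def place_job(job: str, gpus_needed: int = 1) -> None:
    """Prefer owned hardware; use cloud GPUs only when on-prem is full."""
    global allocated_on_prem
    if allocated_on_prem + gpus_needed <= ON_PREM_GPUS:
        allocated_on_prem += gpus_needed
        submit_on_prem(job)
    else:
        submit_to_cloud(job)

# Ten data scientists each submitting one single-GPU job: the first eight
# land on-prem, the last two burst to the cloud.
for i in range(10):
    place_job(f"experiment-{i}")
```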

4. Agnostic & open infrastructure

Machine learning evolves at an extremely fast pace, so your platform must be flexible and easily extendable. Design your machine learning infrastructure so that you can integrate a new technology, a new operator, or a new platform without restructuring the entire stack. If you take away one thing from this guide, choose your technologies carefully, make sure they are agnostic, and make sure they scale, so that you can adapt quickly as new technologies and operators emerge.

Even if your infrastructure is agnostic, you should also consider the interface you expose to data scientists; a poorly designed interface will keep them from using the new technology in your infrastructure. Don't forget that data scientists aren't DevOps engineers. Many of them hold PhDs in math and do not want to deal with YAML files, namespaces, deployments, or deployment scripts; their first priority is working on their models, which is what they were hired for. Especially when Kubernetes is used, you need to abstract the interface for data scientists while still giving them the flexibility and control they need. If there are data scientists or DevOps engineers on your team who want access to Kubernetes' internals, you should be able to permit that as well. Ultimately, it is about helping your teams become better professionals by supporting their data science and engineering efforts.
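
As a sketch of the kind of abstraction this implies, the snippet below shows a single Python function a data scientist might call with model-level arguments, while the platform team hides the Kubernetes objects behind it. The function name, defaults, and arguments are hypothetical, not any product's API.

```python
# Sketch of an abstraction layer for data scientists: one function call with
# model-level arguments, no YAML, namespaces, or deployment scripts in sight.
# Internally, the platform team would build and submit a Kubernetes Job
# (as sketched in the orchestration section). Everything here is hypothetical.

from typing import Optional

def submit_training(
    script: str,
    image: str = "my-registry/ml-base:latest",   # hypothetical default training image
    gpus: int = 1,
    env: Optional[dict] = None,
) -> str:
    """Package a training script as a cluster job and return its job name."""
    job_name = f"train-{abs(hash((script, gpus))) % 10_000}"
    print(f"submitting {script} as {job_name} on {gpus} GPU(s) "
          f"using {image}, env={env or {}}")
    return job_name

# What a data scientist writes: one line, no infrastructure details.
submit_training("train_vgg16.py", gpus=2, env={"DATASET_URI": "s3://my-bucket/dataset"})
```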

For more details, use this book.

Written by Dnyanesh Walwadkar

Computer Vision Scientist | MS, Big Data Science | Artificial Intelligence Researcher | Data Scientist | Worked as Machine Learning Engineer
