AI/ML Is Dead. Long Live AI/ML

With Over 85% of AI/ML Projects Ending Life as a PoC, Cloud Is the Only Way to Bring Them Back to Life

Brian Johnson
The Startup
8 min read · Aug 26, 2020


In recent years, large organizations have committed billions to AI/machine learning (AI/ML) investment. According to CIO Magazine, the retail and banking sectors estimated that their 2019 spend on AI/ML would cumulatively exceed $11.6 billion, and the healthcare sector estimated an investment of approximately $36 billion by 2025 (final totals are still pending).

Even with these huge financial commitments, some analysts predict that 87% of AI/ML projects will fail to deliver as promised or never make it into production.

Of particular note, the vast majority of AI/ML projects today are targeted for internal datacenter deployment. This legacy, “inside the corporate fabric” model imposes a number of constraints, and those constraints are a primary reason projects fail:

1) Organizations can’t keep up with AI/ML infrastructure needs. AI/ML demand for computing and storage is huge and accelerating. Unfortunately, corporate datacenters impose handcuffs: a lack of elastic scalability, slow access to suitable infrastructure, and a perpetual need to evergreen hardware and upskill staff. Together, these make the infrastructure demands of AI/ML untenable.

2) AI/ML demand for data, and in particular sensitive data, is stretching the capability of corporate security. The inability to satisfy, in a practical manner, the security constraints required to access sensitive data makes data availability cumbersome at best and, more often, prohibitive.

3) Corporate processes, and in particular immature DevOps practices, seriously inhibit the rapid, iterative deployment of models needed to sustain the business value AI/ML is expected to deliver.

However, there are a few things we know for certain. The need for computing and storage will continue to increase. The need for sensitive data will increase. And the need for speed and agility in delivering AI/ML projects — and business results — will continue to be unrelenting.

Unfortunately, existing (and likely proprietary) datacenter AI/ML capability is hopelessly out of date for all but the simplest of tasks. It won’t scale. It is more fragile than agile. It may work today, but it can’t grow to address tomorrow’s needs.

So, most AI/ML practitioners are faced with a simple decision. Is it worthwhile to continue to invest in the AI/ML tools and processes that are currently in use? Or, is it time to rethink the enterprise AI/ML approach and consider a scalable, resilient, and evergreening platform: cloud-native AI/ML?

We cast our lot firmly with the latter. Cloud scales. Cloud is largely self-serve and agile. And, cloud will grow and seamlessly evolve to address tomorrow’s business needs.

The Need for Scale

Unlike Google, Microsoft, Amazon, and the other cloud titans, an enterprise’s primary business is probably not IT. Equally important, the bursty infrastructure demand created by AI/ML often leaves enterprises with a large infrastructure investment sitting fallow in times of lesser demand. So it isn’t surprising that organizations frequently fail to deliver the scalability AI/ML demands.

To compound the scalability problems, technology advancements in infrastructure continue to outpace procurement cycles. Exacerbating this further is the difficulty of continually reskilling workforces to build and support newer AI/ML platforms. In short, companies cannot keep up, and organizations trying to “catch up” typically incur more costs, lose competitive advantage and, in their rush to bridge technology gaps, often make mistakes that damage their brands and customer loyalty. They need to get out of the IT business and focus on the products and services they deliver to their customer base.

To address this situation, more and more organizations are migrating their AI/ML workloads to the cloud, leveraging a cloud vendor’s economies of scale, technical competency, and cost effectiveness that their internal IT groups lack.

Cost Effectiveness is Essential

Another consequence of the insatiable demand for scalable infrastructure, and its associated costs, is an environment in which data scientists and data engineers (typically business-focused resources) attempt to run AI/ML “outside the operational fabric”. They develop, train, and run models in localized environments (desktop machines, small servers, or even laptops) that are often not supported by the enterprise. Typically, these environments can accommodate only limited data and, from a regulatory perspective, usually only public data. Their limited scalability, coupled with relatively small datasets, depletes the efficacy of models and thereby reduces business value.

To further compound this issue, because the environments are outside of the fabric of operational monitoring, the data that is resident in these environments is at a higher risk of being compromised.

Public cloud infrastructure provides the ability to mitigate infrastructure costs, protect data, and improve the efficacy of models. Providers like Microsoft, Google, Amazon, and IBM offer elastic scalability in a truly operationalized capacity. The capability to store large datasets, along with the compute capacity to engineer them and train models, is a key differentiator over “on prem” infrastructure.

Safety and Security of Data

Migration to the cloud for AI/ML undertakings is not without its challenges. Security of data must be of paramount importance and, often, migration of any project to a public cloud provider is met with resistance from those tasked with protecting data. Recent data breaches, like those at Capital One and Desjardins, have rightly heightened these concerns. This is particularly true for data classified as Personally Identifiable Information (PII). In financial services, although activities like credit card and mutual fund processing often require movement of PII outside the corporate fabric, cloud is still viewed as a new technology and often perceived by information security organizations as “risky”.

Given this perception, it’s not surprising that data security is top of mind for Information Security and Data Governance Groups. Bridging the real and perceived security gaps is essential for any Cloud based AI/ML initiative.

It is true that vendors spend generously on security. Microsoft alone spends over $1 billion on cyber security and fends off approximately 7 trillion cyber threats a day. These cloud vendors also provide a robust set of APIs for integrating back to an organization’s enterprise security and environment-monitoring software, allowing companies to be aware of events and changes on the platform. Still, companies migrating to the cloud need to take certain steps to protect themselves and their customers.

Data encryption at rest and in motion, physical security, network security, Virtual Private Clouds (VPCs), and configuration monitoring of infrastructure and network are basic table stakes. However, Identity Access Control is probably the most critical aspect of security in an AI/ML cloud ecosystem.

Identity Access Control allows data scientists and data engineers to operate on a project-by-project basis and can restrict read access to data by specific roles, so that data access is truly on a “need to know” basis. Even more importantly, Identity Access Control can limit write access so that data is modified only through applications. While this won’t entirely eliminate the potential for “bad actors” who do have authorized access, it does mitigate most external attacks. Essentially, it greatly limits the “blast radius” of any potential threat.
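As a concrete illustration, this project-scoped, role-based model can be sketched in a few lines of Python. All names here (the roles, the application identifiers, the policy class itself) are hypothetical; in practice this would be expressed in a cloud provider’s IAM policy language rather than application code.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Principal:
    """A human user and the roles granted to them (names are illustrative)."""
    name: str
    roles: frozenset

@dataclass
class ProjectPolicy:
    """Per-project policy: read is role-gated, write is application-gated."""
    project: str
    read_roles: set = field(default_factory=set)   # roles allowed to read data
    write_apps: set = field(default_factory=set)   # only vetted apps may write

    def can_read(self, principal: Principal) -> bool:
        # "Need to know": read requires an explicitly granted role.
        return bool(self.read_roles & principal.roles)

    def can_write(self, app_id: str) -> bool:
        # Writes happen only through approved applications, never ad hoc.
        return app_id in self.write_apps

policy = ProjectPolicy("fraud-detection",
                       read_roles={"data-scientist"},
                       write_apps={"ingest-pipeline"})

alice = Principal("alice", frozenset({"data-scientist"}))
bob = Principal("bob", frozenset({"marketing-analyst"}))

print(policy.can_read(alice))        # True: role grants need-to-know access
print(policy.can_read(bob))          # False: no read role on this project
print(policy.can_write("notebook"))  # False: interactive writes are blocked
```

The key property is the last check: even a fully authorized user cannot write data outside an approved application, which is what contains the “blast radius” of a compromised account.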

It is also recommended, at least for training environments, to implement one-way data flows: data can be moved onto and stored in the platform, but can only be accessed and visualized through “Citrix-like” tools.

In the event that data must move out of the training environment, multiple executive approvals should be required, and the movement should only be possible programmatically, with multiple individuals involved in each step of the process. This will complicate the plans of any nefarious actor.
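A minimal sketch of such a programmatic egress gate might look like the following; the approver roles and function names are entirely hypothetical, and a real implementation would sit behind audited workflow tooling rather than a single function.

```python
def approve_egress(dataset: str, approvals: set, required_approvers: set) -> bool:
    """Permit data movement out of the training environment only when every
    required approver (distinct individuals/roles) has signed off."""
    missing = set(required_approvers) - set(approvals)
    if missing:
        # No partial approvals: block and report exactly who has not signed off.
        raise PermissionError(
            f"egress of {dataset!r} blocked; awaiting: {sorted(missing)}")
    return True

required = {"cdo", "ciso", "project-executive"}  # illustrative approver roles

# A single approval is not enough; the gate refuses and names the gap.
try:
    approve_egress("training-extract", {"ciso"}, required)
except PermissionError as err:
    print(err)

# Only with every distinct approval present does the movement proceed.
print(approve_egress("training-extract",
                     {"cdo", "ciso", "project-executive"}, required))
```

Because approvals are a set keyed by distinct roles, one individual cannot be counted twice, which is the property that forces multiple people into each step.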

An Agile Culture is a Critical Success Factor

While technology and process will secure AI/ML during steady-state operation, building the environment in the first place requires that the “build team” be structured for success. As with any change, adopting cloud and new technology raises cultural issues, and it is important to get organizational buy-in. This is particularly true with information security.

To bridge these cultural issues, engage information security and privacy directly: make them part of the AI/ML implementation team and let them determine the benchmarks that define when the enterprise is suitably protected. This facilitates buy-in and brings the real experts to the table. The key is to ensure that the information security resource on the team is senior enough, and suitably empowered, to make the necessary decisions.

The Need for Reproducibility, Traceability, Verifiability and Explainability

Enterprises’ needs for model reproducibility, traceability, verifiability, and explainability are driving changes to the traditional AI/Machine Learning delivery lifecycle and have become fundamental requirements for data science in large enterprises.

Why are these requirements important? In financial services, reproducibility, traceability, and verifiability are explicitly regulated (e.g., in the European Union, United States, and Canada) and cannot be overlooked. Similar requirements are found in many other industries, from healthcare and biotech to government security. In fact, even enterprises in modestly regulated industries are now finding that the benefits of reproducibility, traceability, and verifiability far outweigh their costs.

The level of confidence needed can typically be achieved through a documented and automated model lifecycle (aka MLOps or ModelOps) that continually challenges the efficacy of models, trains new ones and, at user request, deploys new models to the inference/runtime environment, all at the click of a mouse.
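One way to picture such a lifecycle is a champion/challenger loop: retrain on fresh data, evaluate against held-out data, and promote a challenger only when it outperforms the incumbent. The sketch below is illustrative only; the training and scoring functions are stand-ins, and real MLOps tooling would add logging, versioned artifacts, and approval gates around each step.

```python
from dataclasses import dataclass

@dataclass
class Model:
    version: int
    accuracy: float  # efficacy on the current validation set

def retrain(prev_version: int, training_data) -> Model:
    # Stand-in for a real training job on fresh data.
    return Model(version=prev_version + 1, accuracy=0.91)

def evaluate(model: Model, validation_data) -> float:
    # Stand-in for real scoring against held-out data.
    return model.accuracy

def lifecycle_step(champion: Model,
                   training_data=None, validation_data=None) -> Model:
    """One automated pass: challenge the champion, promote only on a win."""
    challenger = retrain(champion.version, training_data)
    if evaluate(challenger, validation_data) > evaluate(champion, validation_data):
        # In a real pipeline this promotion is logged with data and code
        # lineage, which is what makes the deployment reproducible and traceable.
        return challenger
    return champion

champion = Model(version=1, accuracy=0.88)
champion = lifecycle_step(champion)
print(champion.version)  # 2: the challenger outperformed and was promoted
```

Running the step on a schedule (or on data drift) turns model refresh into a routine, auditable operation rather than a manual project.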

How does cloud address this? First, cloud is the only cost-effective way to store the massive data files (and their multiple versions and lineage) consumed in AI/ML. Second, only cloud offers the scale required for the intensely iterative AI/ML lifecycle that supports repeated verification and explainability testing. Finally, only cloud offers the self-serve capabilities that enable an agile culture and automated infrastructure.

As AI/ML becomes an essential service to any organization, this level of automation will be needed to scale deployments and condense model iteration cycle times to address growing demand. Organizations that fail to implement suitable automation will see AI/ML become a victim of its own success.

It’s not What’s Next…It’s What’s Now

Moving the AI/ML ecosystem to the cloud is not a step for the future; it is a step for today. To make AI/ML a production-grade service that is a key component of decision making, governance, problem solving, and profitability, it is imperative to move beyond the PoC stage and take the next step in the evolution of this promising technology.

Cloud not only promises to move AI/ML out from underneath the desks of data scientists and data engineers. The scalability and hardened, production-grade nature of these platforms will allow industry to accelerate realization of the other benefits of this exciting technology. Teams constrained today by simplistic “under desk” AI/ML environments will gain the capability to exploit complexities like feature stores, deep learning, and more robust natural language processing. This will ultimately result in a better value proposition for the technology.


Brian Johnson
The Startup

Driving delivery of enterprise banking solutions and focusing on API Platforms and AI/Machine Learning