Make Your Data And Organization Ready For AI

John Emmert · Published in Cloud Pak for Data · 8 min read · May 5, 2020

Ever since AI was thrust into the spotlight with Watson in 2011, organizations have wanted to leverage AI in their business, but many have struggled. Often they are unable to exploit the full value of AI because their data is not ready for AI workloads.

Through thousands of client engagements around Data and AI, and through work with analyst firms, consultants, and system integrators, the client problem areas for making data ready for AI boil down to three key themes: Data Management, Data Fabric, and MLOps. In the sections that follow, I’ll explore the current state of each problem, its consequences, and how to move forward.

Data Management

Over the past 5–10 years, conversations with clients about their data estates have gone one of two ways: “We are moving EVERYTHING to the cloud,” or, “We are not cloud-first and never will be.” The reality is somewhere in between. Organizations have adopted a hybrid multicloud approach in which some persistence stores reside in the public cloud and others in a private cloud or behind the firewall. Because of the disparate nature of these accumulated data sources, and the way data estates have been architected over time, organizations have tended to lock data down rather than make it accessible.

Current State and Related Consequences

With the introduction of new data platforms such as Hadoop, cloud object storage, and NoSQL databases, companies generally have multiple disconnected data stores. Each store contains different data, which makes it challenging to make that data available to the users who desperately want it. Because the data lives in multiple environments and formats, it typically has to be moved, either by users extracting it onto their laptops to combine sources, or through complex integration processes performed by IT. And because the data is scattered across disparate systems, it is not straightforward for users to find the data they want or to verify the integrity of the data they intend to use.

Given this current state, business users depend on IT to find data, integrate it with other sources, and make it available for consumption, a process that is typically labor-intensive and slow. The result is a sub-par experience for business users and analysts, which in turn leads them to spin up their own “self-service data marts.” Moving data also creates security risk, especially when it is transferred to end consumers’ laptops or to external data stores outside the purview of enterprise governance. Finally, because data is often combined and manipulated manually, its integrity is frequently compromised, creating trust issues with the data and the resulting analytics. Without access to quality data, the enterprise cannot build AI.

Future State

To be successful in delivering on the future state, an enterprise requires:

1) A collaboration zone that allows users to reuse assets and work that their peers have already built, accelerating time to value for the organization.

2) Data virtualization technology and processes to ensure that data is moved intelligently and efficiently and that all persistence stores contribute compute resources to accelerate insights (see the sketch after this list).

3) An intelligent data catalog that allows all users in the organization to access information in all persistence stores, powered by a recommendation engine that helps users find the data they need.

4) A policy enforcement engine that controls access to assets and can redact or obfuscate information within them, ensuring maximum compliance alongside maximum access.
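
As a rough illustration of point 2, the sketch below shows what consuming a virtualized view could look like from Python. It assumes the virtualization layer exposes an ODBC endpoint; the DSN, schema, and view names here are hypothetical, not any specific product’s API.

```python
# A minimal sketch, assuming the virtualization layer exposes a standard
# ODBC endpoint. The DSN "DATA_VIRT" and the view names are hypothetical.
import pyodbc

# One connection to the virtualization layer instead of one per data store.
conn = pyodbc.connect("DSN=DATA_VIRT;UID=analyst;PWD=secret")  # placeholder credentials

# A single federated query: TRANSACTIONS might live in cloud object storage,
# CUSTOMERS in an on-prem warehouse; the virtualization layer pushes work
# down to each store instead of copying data to the client.
sql = """
    SELECT c.customer_id, c.segment, SUM(t.amount) AS total_spend
    FROM   TRUSTED.CUSTOMERS    AS c
    JOIN   TRUSTED.TRANSACTIONS AS t ON t.customer_id = c.customer_id
    GROUP  BY c.customer_id, c.segment
"""

for customer_id, segment, total_spend in conn.execute(sql):
    print(customer_id, segment, total_spend)
```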

The ideal future state focuses on increasing end users’ ability to access all data in the enterprise, to leverage that data with minimal movement and minimal integration processes, and to ensure that access to that data happens in a governed and highly performant manner.

Realizing Self-Service with a Data Fabric

Typically, organizations have used governance policies to lock down data stores, achieving high levels of governance but low levels of access. This inaccessibility hinders the organization’s ability to deliver self-service, because the IT and governance organizations must be involved in every request for data access. Optimally, an organization would use enforceable policy controls that allow access to ALL data sets when it makes sense. Data quality and data lineage, and more specifically the automation of delivering high-quality assets to users, are paramount for an organization that wants to leverage AI.
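
To make “enforceable policy controls” concrete, here is a deliberately simplified sketch of the idea: rather than denying access outright, a policy decides per column whether to pass data through, mask it, or redact it. The classifications, roles, and rules are invented for illustration, not any product’s policy model.

```python
# A minimal sketch of column-level policy enforcement. Classifications,
# roles, and rules are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    classification: str  # e.g. "PII", "FINANCIAL", "PUBLIC"

# Rule: which classifications each role may see in the clear.
CLEAR_ACCESS = {
    "data_scientist": {"PUBLIC", "FINANCIAL"},
    "marketing_analyst": {"PUBLIC"},
}

def enforce(role: str, column: Column, value: str) -> str:
    """Return the value as-is, masked, or redacted, instead of denying access."""
    if column.classification in CLEAR_ACCESS.get(role, set()):
        return value
    if column.classification == "PII":
        return value[:1] + "*" * (len(value) - 1)  # obfuscate, keep shape
    return "[REDACTED]"

row = {"name": "Alice", "balance": "1024.50"}
cols = [Column("name", "PII"), Column("balance", "FINANCIAL")]
print({c.name: enforce("marketing_analyst", c, row[c.name]) for c in cols})
# -> {'name': 'A****', 'balance': '[REDACTED]'}
```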

Current State and Related Consequences

Most organizations have implemented governance controls intended to lock down data stores and restrict user access in order to reduce risk, resulting in a lack of access to data and therefore sub-optimal analytics and AI. Due to the siloed nature of most organizations, there is no standard view of data quality and data lineage, resulting in an ad-hoc and expensive approach to making data ready for AI. IT teams, meanwhile, are burdened with complex integration builds and cannot continually deliver high-quality data assets. Users cannot access high-quality assets with the frequency and speed needed to develop AI.

These policies, implementations, and processes lead to a substantial loss of revenue: reduced access to vital data assets produces inaccurate or incomplete analytics and AI. IT costs rise steadily as IT groups are forced to deliver suboptimal integrations, and users endure lengthy waits when requesting data access to build insights. Finally, there is a lack of collaboration across the enterprise; each group works independently on its analytics and AI, which wastes both time and resources. The inability to access the appropriate assets means only small amounts of data are ever leveraged, which represents a considerable opportunity loss in the application of AI.

To be successful in delivering on the future state, an enterprise requires:

1) An intelligent data catalog that allows users of all types to access information in all persistence stores in the organization, backed by a policy engine that controls access to assets and can redact or obfuscate information within them to ensure maximum compliance alongside maximum access.

2) Automatic classification of content within data assets to ensure that sensitive information is appropriately marked, so policies can enforce access restrictions or mask confidential information (a toy classifier follows this list).

3) A single user experience across all personas in the organization, including data scientists, data engineers, citizen analysts, and power users.

4) The ability to connect to and classify unstructured sources such as HDFS and content stores.
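
As a concrete, if simplified, illustration of point 2, the sketch below tags columns as sensitive by pattern-matching sampled values. Real catalogs combine many more signals (column names, reference data, ML classifiers); the regexes and labels here are assumptions for illustration.

```python
# A minimal sketch of automatic content classification. The patterns and
# class labels are illustrative; production classifiers use far more signals.
import re

PATTERNS = {
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "CREDIT_CARD": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Label a column with the first class matching most sampled values."""
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in sample_values if pattern.match(v))
        if sample_values and hits / len(sample_values) >= threshold:
            return label
    return "UNCLASSIFIED"

print(classify_column(["123-45-6789", "987-65-4321"]))      # US_SSN
print(classify_column(["a@example.com", "b@example.org"]))  # EMAIL
print(classify_column(["hello", "world"]))                  # UNCLASSIFIED
```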

The optimal future state is one in which organizations reduce time to value and fuel continuous innovation for the business by enabling self-service access to trusted, high-quality data for all users, delivered through an automated pipeline.

MLOps

Typically, organizations try to move to this step before completing, or getting full control of, the previous two themes. The weaker an organization’s foundation, the lower its ability to deliver AI. That is very apparent in this theme, as clients generally have little success building, deploying, and running AI models in production. The lack of a robust data foundation makes it difficult to gain access to the appropriate data sets, and the problem is magnified by the fact that typically very few people in an organization have the skills to build machine learning and deep learning models.

Current State and Related Consequences

Currently, users cannot find and reuse data and ML assets, collaborate little if at all with one another, and are unable to participate in building AI due to a lack of skills. There is little to no accountability for AI models: no way to understand why a model makes particular decisions, whether it is biased, or how to rectify bias and correct its decisions. Deploying and managing AI models at scale is nearly impossible without an MLOps-style approach that provides version control, an understanding of models, promotion and demotion of models in production, and an easy path for deploying models into production applications. Organizations currently lack the skills to build AI, the processes to operationalize it, and the ability to trust the AI they deploy.

Due to these issues, organizations see little to no ROI on their highly skilled data science resources: most models never make it to production, and the models that do are often unusable due to a lack of trust. The majority of enterprise personas cannot participate in building and deploying AI, lacking both the skills and an integrated environment that caters to all skill levels. Finally, high regulatory fines for unmonitored, biased, and untrustworthy models either make deploying models too risky or erase their ROI with penalties. Organizations miss the opportunity to implement AI more widely, resulting in increased expenses and decreased returns on investment.

To be successful in delivering on the future state, an enterprise requires:

1) A deeply integrated, consistent user experience that spans finding data, building AI models, and deploying models that can be trusted.

2) Capabilities that allow AI to create AI, enabling both data scientists and users of other skill levels to accelerate their build-to-deploy timeline, with multiple build options: coding, a canvas approach, and a click-through GUI.

3) Explainability and bias tracking for models, ensuring that models are not making unethical decisions in production.

4) An MLOps practice that treats machine learning and AI the way DevOps treats application development (a toy model registry is sketched below).
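
To ground point 4, here is a deliberately small sketch of the version-control and promote/demote mechanics an MLOps practice relies on. The registry, stage names, and API are hypothetical, not any specific MLOps product.

```python
# A minimal sketch of MLOps-style model versioning and promotion/demotion.
# The registry and stage names are hypothetical, for illustration only.
from datetime import datetime, timezone

STAGES = ("staging", "production", "archived")

class ModelRegistry:
    def __init__(self):
        self._versions = {}  # (name, version) -> metadata

    def register(self, name, artifact_uri, metrics):
        """Record a new immutable version of a model."""
        version = 1 + sum(1 for n, _ in self._versions if n == name)
        self._versions[(name, version)] = {
            "artifact": artifact_uri,
            "metrics": metrics,
            "stage": "staging",
            "registered_at": datetime.now(timezone.utc),
        }
        return version

    def transition(self, name, version, stage):
        """Promote or demote a version; production holds at most one version."""
        assert stage in STAGES, f"unknown stage: {stage}"
        if stage == "production":  # demote the current production version
            for (n, v), meta in self._versions.items():
                if n == name and meta["stage"] == "production":
                    meta["stage"] = "archived"
        self._versions[(name, version)]["stage"] = stage

registry = ModelRegistry()
v1 = registry.register("churn", "s3://models/churn/1", {"auc": 0.81})
registry.transition("churn", v1, "production")
v2 = registry.register("churn", "s3://models/churn/2", {"auc": 0.85})
registry.transition("churn", v2, "production")  # v1 is demoted automatically
```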

An optimal future state allows organizations to decrease time to deployment, increase accuracy, and increase trust in AI models through an integrated toolset that enables users of all skill levels to build and deploy AI.

Organizations that have struggled with AI implementations can have great success by following the above principles to ensure that their data estate is mature enough to leverage everything AI can deliver. It is extremely important that organizations have a strong Information Architecture before they undertake projects that put AI into widespread production. In the next article in the series, we will explore the most common applications of AI, as seen through many enterprise engagements.

Learn more at https://www.ibm.com/analytics/data-modernization

John Emmert leads Global Sales and Strategy for IBM Information Architecture within IBM Data and AI. He lives in Raleigh, NC, with his wife Sarah and their three boys, Jack (7), Liam (7), and Cormac (1).
