Data Privacy Across AI Lifecycle with Mukta Singh
This interview is part of Women in AI Ethics podcast series sponsored by IBM. In today’s podcast, we will explore the different dimensions of data privacy across the entire data and AI life cycle with Mukta Singh, Product Management Executive — Data and AI at IBM.
The views on this podcast are those of the person being interviewed and don’t necessarily represent IBM’s positions, strategies or opinions.
Mia Dand: Mukta, you’ve been with IBM for over 20 years in various roles. Tell us more about your journey and how you got started in the AI and data governance space.
Mukta Singh: Thank you for the question and for having me here. I started off as a Development Engineer and a Development Manager early in my career at IBM and went through product development and application development in the data management field. Then the shift happened towards data science and data analytics, and it became more prominent how we make it easier for data users or data consumers to acquire the data they need and use it in a very easy manner. Working with different clients, I became more aware that this is something they need every day in their businesses.
Working towards that, and shifting towards their business problem of making data easily available and consumable while addressing their privacy needs, became a well-defined area that caught my interest. That’s how I moved from Data Management to Data and AI and Data Privacy in the Product Management role. It’s been an evolution of addressing the market needs and the client needs for data consumption and ease of use.
Mia Dand: You’re a pioneer in this space because you’ve been here right from the beginning as these conversations got started. Let’s talk about two predictable and yet less discussed topics in AI and Data Management, which are now growing in interest, and rightly so. Can you help us understand, in broad strokes, what IBM’s approach is to managing data privacy and security?
Mukta Singh: A key focus has been to help clients with self-service of their data: being able to access their data, collect it, and organize it more efficiently, with a lot of the process automated for them, and then consume that data as they analyze it.
Self-serve across this whole journey, from data collection to data consumption, is IBM’s key focus. As we do that, we make sure that we enforce governance and privacy management of the data available to consumers, so that organizations are able to abide by compliance regulations and rules, whether through PII discovery or through governance that addresses the changing regulations out there. Addressing these two key areas, self-serve and ease of use, in the journey of data management is IBM’s focus with the advent of AI and ML, which has always been present in different shapes and sizes in the data management market and has become even more prominent today.
In the last few years, it’s become pretty clear that citizen data scientists, or data consumers, have their own initial skills, and one needs to serve their needs so that AI and ML activities can be carried out in an easier and more focused manner, where they do not have to worry about governance and privacy.
So self-serve access to data is the key focus for IBM, provided in such a way that we enforce governance and privacy as the data is managed and provided to the users. That’s the focus in everything we do.
Mia Dand: It’s a critical area, as is making it easier for companies to implement these measures, including the governance measures.
One specific area that you mentioned in your approach is privacy. Everybody talks about privacy, but it is very complex, and it’s a multi-dimensional challenge that all organizations are facing at individual levels. There’s also a growing awareness of why it matters.
How does IBM help companies manage that? Starting with consent: how is data gathered, and how do you get consent for private data, given that you also mentioned PII?
Mukta Singh: As we collect the data, the platform we use is Cloud Pak for Data. Essentially, the outcome we are after is a faster time to comply with privacy regulations, with automation for every piece of data that is acquired or collected, including automated PII discovery across the different data elements in the single catalog where we bring in the metadata.
The advancement around discovery, classification, and mapping of these different business entities is done in a specific way to detect the sensitive data elements, as well as to enforce rules based on the organization’s particular governance needs. There is the ability to identify the sensitive PII data elements and automatically mask them based on privacy regulations or rules that users can easily define across the PII data and across different data elements or data categories. That is how we solve it: a lot of automation around discovery and classification, as well as detection of the sensitive elements, and then allowing the user to define the privacy rules and automatically masking the data every time it is presented, based on the right access controls and the privacy rules that are defined.
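The detect-then-mask flow described in this answer can be illustrated with a minimal Python sketch. It is not how Cloud Pak for Data implements the feature: the regex detectors, the `MaskingRule` type, the role names, and the `present_record` helper are all hypothetical names chosen for illustration, and real PII discovery relies on far richer classification than two regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List, Set

# Hypothetical detectors: simple regexes standing in for a real PII classifier.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass(frozen=True)
class MaskingRule:
    """A user-defined privacy rule: who may see a PII category unmasked."""
    category: str
    allowed_roles: FrozenSet[str]


# Illustrative rules; the role names are assumptions, not product roles.
RULES = [
    MaskingRule("email", frozenset({"privacy_officer"})),
    MaskingRule("us_ssn", frozenset({"privacy_officer"})),
]


def classify(value: str) -> Set[str]:
    """Discovery step: return the PII categories detected in a string."""
    return {name for name, pattern in PII_PATTERNS.items() if pattern.search(value)}


def mask_value(value: str, role: str, rules: List[MaskingRule]) -> str:
    """Mask every category the given role is not allowed to see."""
    masked = value
    for rule in rules:
        if role not in rule.allowed_roles:
            placeholder = f"[MASKED:{rule.category}]"
            masked = PII_PATTERNS[rule.category].sub(placeholder, masked)
    return masked


def present_record(record: Dict[str, Any], role: str, rules: List[MaskingRule]) -> Dict[str, Any]:
    """Apply masking each time a record is presented; non-strings pass through."""
    return {
        key: mask_value(val, role, rules) if isinstance(val, str) else val
        for key, val in record.items()
    }
```

Masking happens at presentation time rather than at storage time, which mirrors the point in the interview that the same data can be shown differently depending on the access controls and privacy rules in force.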
Mia Dand: That seems like a very practical approach, given that the privacy rules and regulations across global regions differ. Also, it depends on the organization and the company as to what their governance policy is.
Can you talk us through other challenges in data privacy? We know that they look different from an organizational and a global region perspective, but are there other challenges as well? Can you share any IBM privacy tools that you’ve built or deployed to help with this?
Mukta Singh: Talking about challenges, data sets are suddenly in so many different forms and shapes. They contain personal, sensitive information about all kinds of individuals, such as their health records or financial records, and they are very diverse, whether they are formatted as structured datasets or are in the form of free text. The whole realm of data in today’s landscape is so wide and ever-changing that harnessing that information in a simplified way, one that takes away the complexity of the data, is a big challenge that users face today.
IBM Data Privacy tools are part of the governance and data privacy capabilities within our Cloud Pak for Data platform, which provides a lot of this automation to classify all the data that one collects and organize it in a central catalog, so that discovery of PII happens across diverse datasets. The datasets are constantly increasing in volume, and their characteristics are heavily interconnected across these different datasets, so allowing interconnected datasets to be identified is another big challenge that has to be tackled. And then guiding access management across different stakeholders, along with workflow management of the data and identifying the risks of the data, is another element that we tackle. So the different IBM tools are all part of our singular platform, Cloud Pak for Data, which has different microservices such as our Watson Knowledge Catalog and our Data Privacy capabilities, as well as Open Data Management tools and AI tools, that allow ease of data gathering as well as consumption, with the appropriate privacy applied.
Mia Dand: Got it. Thank you for that comprehensive answer. The good news is there’s a lot of data. The bad news is there’s a lot of data; it needs to be managed. It’s the same with AI, right? AI is everywhere but also poses a challenge. As a way of governing these AI algorithmic systems, there’s been a lot of interest in algorithmic auditing and reporting in this space. So I would love to hear your thoughts on one of the components I’ve seen in your information architecture stack, which is continuous auditing and reporting. At what stage of the AI Development Cycle does the audit happen and what is the goal? Are you trying to make sure the system is accurate? What are the parameters that you use in your audits and reporting?
Mukta Singh: We allow data scientists to manage the whole AI model lifecycle, which certainly includes bias mitigation as they build the different models. There’s a lot of automation around configuring and designing the right models, and at the same time AI is infused not only to collect the data and train on it but also to deploy the models. That addresses the problem of testing the models appropriately for any biases and eliminating those biases. It also allows indirect bias detection, which is used to detect models that are exhibiting biases with respect to sensitive attributes, and to tweak and retrain the model appropriately to remove the bias.
For ML and AI, it’s certainly critical to have a well-defined and robust method in a platform to support that AI life cycle. Businesses can even fail in their own method or logic if they have to do each piece of their AI life cycle individually.
So we have a comprehensive AI Model Life Cycle Management tool that we provide on the same platform that I talked about, Cloud Pak for Data. Essentially, our Watson OpenScale and Watson Knowledge tools that are part of the platform allow for that whole AI model lifecycle management, giving enterprises a simplified path to AI, with ML behind the scenes getting them where they need to go in a couple of minutes versus the multiple days they could otherwise spend on the whole AI life cycle.
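As a rough illustration of the bias testing mentioned in this answer, the Python sketch below computes a disparate impact ratio between two groups defined by a sensitive attribute. This is one widely used fairness measure, not a description of what the IBM tooling does internally, and it checks direct group disparity rather than the indirect bias detection Mukta refers to. The function names and the 0.8 threshold (the conventional "four-fifths rule") are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def favorable_rates(predictions: Sequence[int], groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Share of favorable predictions (label 1) within each group."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        if pred == 1:
            favorable[group] += 1
    return {group: favorable[group] / totals[group] for group in totals}


def disparate_impact(predictions: Sequence[int], groups: Sequence[Hashable],
                     monitored: Hashable, reference: Hashable) -> float:
    """Ratio of the monitored group's favorable rate to the reference group's."""
    rates = favorable_rates(predictions, groups)
    if rates[reference] == 0:
        raise ValueError("reference group has no favorable outcomes")
    return rates[monitored] / rates[reference]


def is_flagged(ratio: float, threshold: float = 0.8) -> bool:
    """Flag the model for review when the ratio falls below the threshold."""
    return ratio < threshold
```

A ratio close to 1.0 means both groups receive favorable outcomes at similar rates. A flagged model would then be a candidate for the tweaking and retraining described above.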
Mia Dand: I’m so glad we are having this conversation because so much of the machine learning and the AI discourse is very superficial without a lot of thought about how it actually happens and what’s happening behind the scenes. We are down to our last question and we may have time for one bonus question after. How do you help enterprise clients balance their multiple goals? Now their goals are data privacy, security, accessibility, but they also have to balance that against the cost, right? No one talks about the costs of managing their systems responsibly and ethically. What I’m really asking is, in my long-winded way, can enterprises have it all? And what are some of the trade-offs that they need to consider?
Mukta Singh: That’s a very complex question. The complete realm of orchestrating the different pieces you talked about, whether that be access management, data management, or what people call data fabric today, ultimately points to what the Gartners and Forresters and other analysts of the world are asking people to focus on: unless you have your data well organized and ready with the appropriate privacy overlay, you cannot consume it in an efficient manner for your AI and ML. Hence the whole emphasis on the data fabric as a tool, or as an architecture, I should say, to allow the data to be intelligently and securely provisioned for self-serve. That’s the definition that the Forresters and Gartners of the world are calling out now.
IBM has been pioneering data management for a long time, but this complex problem of bringing it all together in a simplified way, as you said, can be a very expensive exercise. Large organizations spend months and months bringing in services and lots of private consultants and engineering to stitch everything together, and that can be a very complex exercise. So the focus that we bring from IBM’s point of view today is to address the top use cases for data fabric, such as the customer 360 view, and to stitch that all together on this platform of Cloud Pak for Data. That allows you not only to have your data available in a simplified way and collected in one place, but also to stitch the data across different silos through one singular metadata platform with privacy as an overlay, and then to give your data science tools and your AI lifecycle tools a direct connection to that single catalog for that level of analytics and AI.
The idea of IBM’s data fabric is to have an open tool that allows organizations not only to have their data organized in a simplified way, with the automation and ML that we provide in our tooling, but also to have simplified security and privacy rules and definitions.
A lot of it is suggested, but then they can tweak it to their organization’s particular needs. So that’s the kind of focus we have: to simplify that journey and minimize the cost of this large architecture that most organizations have to stitch together.
Mia Dand: Simplification. I think it’s music to many people’s ears just because this is a very complex issue and the larger the organization, the harder it is. I’m glad you mentioned data fabric, because I did have a question for you about that. We have been hearing a lot about data fabric, and there is some skepticism. Is this just another term that the technologists have come up with? Can you share with us how this is an improvement over previous approaches and previous data privacy management tools?
Mukta Singh: So data fabric is an architecture. It’s not a product, and it’s not a solution. Data fabric is an architecture that orchestrates different elements of data management and data operations, organizing the data and bringing it together to address key use cases, whether that be your single customer view, real-time analytics, fraud detection, or whatever the key data management use case is. The data fabric is all about bringing your data layer together through data integration and other means, while you have a simplified metadata management view with governance and privacy as an overlay. That’s what data fabric is all about. That’s what analysts are advocating. Some may argue that this has always been a realm of data management. Yes. But this is now building an architecture that allows for that stitched, simplified metadata and data view, which can give you that privacy overlay on governance and access management in a simplified way. I understand that some people may challenge that.
At the same time, bringing in this architecture will make things easier for people. And the amount of ML that it prescribes bringing in today allows for a simpler, dynamic orchestration of the data for self-serve. It’s all about architecture and allowing the applications and analytical use cases to be stitched together to leverage the data on the platform.
Mia Dand: Got it. Thank you for clarifying that. And I liked the point you made about simplification because it’s relative, right? It’s relative to what it was and it’s also relative to who’s going to be using it and other business metrics that go along with it.
So talking about data security. I read a report by Gartner, which found that 78% of CISOs have 16 or more tools in their cybersecurity vendor portfolio, and 12% of those CISOs have 46 or more. Is this because the current tools are not comprehensive and nobody’s taken the time to simplify this, or is this the time when you’re seeing more consolidation in this space? What are your thoughts about the proliferation of so many tools?
Mukta Singh: Having worked with hundreds of thousands of clients, I can tell you from experience that it’s the diversity of business problems that organizations are trying to solve. They may have solved a smaller use case and, over the years, collected many different tools. Now someone high up at the C level is trying to bring a simplified and organized definition of their enterprise data and is realizing how many different security and privacy tools they have gathered over the years. Each may address part of their use cases in different pockets, for the use case it was brought in for at that particular point in time, and it might have been relevant at that time. But with today’s focus on the end-to-end view, from data to AI, from data collection to analysis, with automated governance and discovery and the whole self-serve realm, I think automation is the key that all our C-levels are realizing.
The Chief Data Officers, Chief Privacy Officers, and CSOs of the world want to shorten their path to ML, address the business problems they are after, and do that in a way that shortens their decision-making cycle. Hence, consolidation has become even more critical and important today. It’s not that the tools are inadequate; I won’t say that all those tools were inadequate at that point in time. But given the speed of business and the accelerated pace at which businesses have to perform their tasks, this has become almost a mandate for all Chief Data Officers, and they are looking for this to be a simplified journey for their data scientists. Hence that big focus on self-serve and on helping citizen data scientists, who are a big community out there, to be self-sufficient.
Mia Dand: Absolutely. The more you empower them, the better they can be at actually deploying these models and systems in a way that is ethical and responsible, and that benefits their customers and the individuals who will ultimately be impacted by them.
I enjoyed our conversation. Thank you so much for taking the time. I believe that you can’t talk about responsible or ethical AI or machine learning models unless you talk about data management practices, which include everything you talked about today: privacy, security, accessibility, and the governance model.
Join us again next week, when we’ve invited Beth Rudden, Distinguished Engineer and Principal Data Scientist, Cognitive and AI Services at IBM, for a deep dive into her organization’s approach to AI ethics and to learn more about the tools and resources IBM has created to deploy responsible AI at scale.