Evaluate and Improve Generative AI-based Products— An Introduction for Product Managers

Introduction

7 min readDec 4, 2023

In this article, I have explored the realm of multimodal AI evaluation methods to ensure content safety, moderation, responsible AI content generation, deepfake detection, and its implications for our day-to-day lives. Personally, this is an important area of work, as one of my dear ones was a victim of identity theft and personal data morphing, which was a painful time in my life to handle and left a permanent attitude change in our relationship. Also, as a Product manager working on building machine learning-based products, identifying the key metrics and evaluating them before launching, is crucial to meet the customer needs plus the ever-changing AI Governance policies.

Source: Newspaper articles regarding the latest concerns about AI content in the media and society

AI Governance is evolving across the world and the market size is set to grow at a CAGR of 28% from 2023–28.

Source: https://www.mordorintelligence.com/industry-reports/ai-governance-market

Various startups like Credo.ai, Fiddler.ai, Reality Defender, etc. are working on supporting enterprises to meet the AI governance norms. Moreover, big enterprise players like Microsoft Azure, Google Cloud, and AWS are also building AI safety products for their customers.

First, let's see how Generative AI content is generated, and then we will see how to evaluate and improve them. I have covered text and image modalities of Generative AI in this article.

How does Generative AI work?

I have covered the basics of Generative AI in detail in my previous article here — https://medium.com/@kvskellogg/generative-ai-an-intro-for-product-managers-733c6a1a6d72

What are the criteria to evaluate and improve a Generative AI-based application?

As a Product Manager, building AI-based products and applications, it is important to consider the following aspects for monitoring Generative AI models.

A breakdown of the types of evaluation, specific examples related to the evaluation, and the mitigation/improvement strategies are provided in the table below. These criteria would eventually drive the Technical Product Strategy of the business.

Robustness

Evaluating and ensuring the robustness of AI applications, particularly those based on generative AI, is of paramount importance for delivering reliable and effective solutions. Begin by subjecting the AI to rigorous evaluations, such as sentiment analysis to gauge its understanding of emotions and sentiments, and grammar correctness checks to ensure impeccable language use. Implement duplicate sentence detection to eliminate redundancy and maintain user engagement, and employ natural language inference to test logical coherence within the generated content. Assess the AI’s adaptability to diverse knowledge domains through multi-task knowledge evaluations and evaluate its reading comprehension, translation accuracy, and mathematical calculations. To enhance robustness, consider integrating advanced techniques like fine-tuning with adversarial loss, and exposing the AI to challenging cases during training. In-context learning with input perturbations can make your application more adaptable to varying contexts and inputs. Real-time monitoring and user feedback mechanisms are essential for prompt issue detection and iterative improvements. Data augmentation diversifies training data, reducing biases, while regular retraining keeps the model up-to-date.

Security

In the context of security evaluation, it’s crucial to be vigilant about potential adversarial attacks. These attacks can take various forms, including prompt injection, jailbreaking, data poisoning, and backdoors, which can compromise the integrity and security of your AI system. To measure the effectiveness of your application’s defenses against such threats, metrics like Attack Success Rate (ASR) and cross-entropy loss can be employed. These metrics help quantify the model’s vulnerability and the extent to which it can be compromised by malicious inputs. To mitigate and improve the robustness of your generative AI application, consider implementing strategies like jailbreaking detection and engaging in red-teaming exercises. Jailbreaking detection mechanisms can help identify compromised environments and prevent your AI from operating in insecure settings. Red-teaming involves simulated attacks by security experts to uncover vulnerabilities and weaknesses in your system proactively.

Deepfakes

Deepfakes pose a multifaceted threat across diverse domains, from pornography to politics, banking to social media, and even face recognition and data fraud. One of the primary strategies to evaluate robustness against deepfakes is the deployment of multimodal deepfake detection systems. These systems utilize a combination of techniques, such as analyzing not only visual but also audio and metadata components of multimedia content to flag potential deepfakes.

Source: https://arxiv.org/pdf/1901.08971.pdf

Bias

Bias and unfairness can manifest in various forms, including harmful stereotypes, unfair discrimination, exclusionary norms, toxic language, and performance disparities that disproportionately affect specific social groups. To ensure the ethical and equitable operation of your generative AI-based application, a multifaceted evaluation approach is crucial. One key strategy to evaluate and mitigate bias and unfairness is fine-tuning the model. This process involves retraining the AI system with a focus on addressing specific biases, by incorporating diverse and representative training data that spans different demographics and perspectives. Additionally, counterfactual data augmentation techniques can help create scenarios that highlight potential biases and enable the system to learn from them. An essential aspect of mitigating bias and ensuring fairness is in-context learning. This approach involves continuous monitoring of model outputs in real-world scenarios to detect and address any emerging biases or fairness concerns promptly.

Transparency

Evaluating and enhancing the robustness of generative AI-based applications involves addressing transparency concerns, especially in critical domains like healthcare and legal. Transparency ensures that AI systems provide clear explanations for their decisions, fostering trust and accountability. To achieve this, product managers can implement mitigation and improvement strategies such as explainer models and neuron explainers. These techniques enable users to understand the rationale behind AI-generated outputs, making the application’s decision-making process more transparent and interpretable. By incorporating transparency measures, AI applications can better serve users in sensitive areas where clear explanations are essential for informed decision-making and compliance with regulatory requirements. If an LLM for eg: needs to be measured for its political neutrality, there are metrics like Refused to Answer % ,Toxicity Score and Political Psychology tests that could be used to measure its performance.

Privacy and Copyright Implications

Evaluating and enhancing the robustness of generative AI-based applications also extends to addressing privacy and copyright implications, especially concerning the reproduction of personal identifiable information (PDIs) from pre-trained data. To ensure compliance with privacy regulations and copyright laws, product managers should consider mitigation and improvement strategies like differentially private fine-tuning or training and deduplication of training data. These measures help protect user privacy and intellectual property rights by minimizing the risk of unintentional data exposure or copyright infringement. By implementing these strategies, AI applications can navigate privacy and copyright concerns, making them more trustworthy and legally sound, which is crucial for maintaining user trust and avoiding legal complications.

Measurement Strategies of Generative Models

Different metrics used to measure and evaluate Generative Text Models

In the evaluation of Large Language Models (LLMs), various metrics are used to assess their performance and ethical considerations. “Model Drift” refers to the tendency of LLMs to produce inconsistent or biased output over time, highlighting the need for continuous monitoring and mitigation strategies. “SHAP” (SHapley Additive exPlanations) is a technique used for understanding the contribution of different input features to model predictions, aiding in transparency and fairness.

Metrics like “Agreement Ratio” measure the degree of consensus among multiple LLMs, promoting consistency. “Defect Rate” evaluates the occurrence of harmful or biased content generated by LLMs, emphasizing the importance of minimizing adverse outcomes. “Confusion Matrix” helps assess the model’s performance by quantifying the accuracy of its predictions and identifying potential biases.

*Source* — Microsoft -A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications https://arxiv.org/pdf/2310.17750.pdf

In the context of generative text, metrics like “NLL” (Negative Log-Likelihood) and “Perplexity” evaluate the fluency and coherence of generated text, with lower values indicating better performance. “BLEU” assesses the quality of generated text by comparing it to human-written references, while “Self-BLEU” measures diversity by comparing the generated text against itself. “MS-Jaccard” evaluates the uniqueness of generated text compared to reference text, providing insights into content originality.

Different metrics used to measure and evaluate Generative Image Models — There are metrics like Fr´echet Inception Distance (FID) and Inception Score, which uses thousands of generated data and passes it through a pre-defined evaluation neural network model that compares it with real images and comes up with a score that defines the closeness of the real and generated images. Source: Large Scale Qualitative Evaluation of Generative Image Model Outputs, Yannick et.al

System Level metrics of Generative AI-based applications

For applications built over GenAI, multiple techniques can be used to evaluate :

a. Human-assisted evaluation: Comparing results of human-generated vs AI-generated content.

b. AI-assisted evaluation: Using other LLM models to evaluate the output of an LLM under tests

Summary

Since more and more policies and regulations are put in place for AI applications, it is important to know the latest happening in the field of evaluation and improvement of Generative AI. Product Managers have to be aware of the opportunities and limitations of the technology and build products that meet the evaluation criteria.