AI Models Can Get It Wrong: Why Comparing Them Could Save Your Business — or Even Lives

Vinci Vinni
4 min read · Sep 19, 2024


Compare answers from different AI models

AI models have become an integral part of our daily lives, influencing decisions in healthcare, finance, and even personal relationships. Yet many of us take it for granted that these models are infallible. The truth is, they aren't. Each AI model operates differently, trained on unique datasets and carrying its own set of biases. Understanding these differences isn't just a technical concern: it can have real-world consequences.

Why Do AI Models Differ?

At their core, AI models are mathematical constructs designed to recognize patterns and make predictions. However, the way they’re built varies significantly. For instance, developers choose different activation functions — like ReLU, GELU, or Swish — that affect how the model learns from data. The number of attention heads, which help the model focus on different parts of the input data, can also vary. GPT-3 uses 96 attention heads, while other models might use fewer due to resource constraints.
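These choices are easy to see in code. Below is a minimal PyTorch sketch, with illustrative values rather than any production model's, showing how the same input comes out differently under each activation function, and how head count is just another constructor argument:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)

# Each activation reshapes the same signal differently,
# which changes what downstream layers learn from it.
print(F.relu(x))   # hard cutoff at zero
print(F.gelu(x))   # smooth, probabilistic gating
print(F.silu(x))   # SiLU is PyTorch's name for Swish

# Head count is likewise a free design choice: GPT-3 uses 96 heads,
# while a small model might use 8 (embed_dim must divide evenly).
attn = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8)
```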

Layer depth and hidden size are other critical factors. GPT-3, for example, has 96 layers with a hidden size of 12,288. Smaller models might have fewer layers and smaller dimensions, affecting their capacity to understand complex patterns. Techniques like dropout rates, which prevent overfitting, and positional encoding methods also differ among models. Even the choice of optimizers — whether it’s Adam, AdaFactor, or Lamb — can influence how a model converges during training.
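To make the scale of these differences concrete, here is a rough side-by-side of GPT-3's published hyperparameters next to a hypothetical small model. The GPT-3 figures come from its paper; everything in the second column is illustrative, not a quote from any real system:

```python
# GPT-3 figures are as published; the "small model" is hypothetical.
gpt3 = {
    "layers": 96,
    "hidden_size": 12288,
    "attention_heads": 96,
    "activation": "gelu",
    "optimizer": "adam",
    "positional_encoding": "learned",
}

small_model = {
    "layers": 12,             # far less depth
    "hidden_size": 768,       # far smaller representations
    "attention_heads": 12,
    "activation": "swish",
    "dropout": 0.1,           # regularization against overfitting
    "optimizer": "adafactor",
    "positional_encoding": "rotary",
}
```

Every one of these knobs nudges what the model is good at, which is why two models given the same prompt can land in very different places.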

Moreover, the training datasets themselves are a significant source of variation. Different models are trained on different slices of the internet, literature, and curated databases. These datasets imbue the models with unique behaviors and nuances, leading to varying responses to the same prompt.

The Risks of Relying on a Single Model

Imagine making a critical business decision or a life-altering medical choice based solely on the output of one AI model. The consequences could be dire. Let’s consider a real-world example:

Prompt: Provide the most recent FDA-approved treatments for non-small cell lung cancer (NSCLC), including information about their mechanism of action, efficacy from clinical trials, and any notable side effects.

This is not just an academic question; it's a matter of life and death for patients and a critical concern for healthcare providers. When different AI models are given this prompt, their answers vary significantly.

You can see each answer compared side by side here: https://overallgpt.com/s/lba1ptxJjlrbPHMo5fxW

OverallGPT: Compares answers from different AI models side-by-side
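Under the hood, a comparison like this boils down to sending the identical prompt to each provider and lining up the answers. Here is a minimal sketch, assuming the official OpenAI and Anthropic Python SDKs are installed and API keys are set in the environment; the model names are illustrative and change over time:

```python
from openai import OpenAI
import anthropic

PROMPT = (
    "Provide the most recent FDA-approved treatments for non-small cell "
    "lung cancer (NSCLC), including information about their mechanism of "
    "action, efficacy from clinical trials, and any notable side effects."
)

def ask_openai(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_anthropic(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

for name, ask in [("OpenAI", ask_openai), ("Anthropic", ask_anthropic)]:
    print(f"=== {name} ===")
    print(ask(PROMPT))
```

The same pattern extends to Google's and Meta's APIs; the point is that the prompt is held constant while the model varies.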

ChatGPT’s Response:

ChatGPT lists treatments like Amivantamab, Sotorasib, Pralsetinib, Capmatinib, and Lurbinectedin, providing detailed mechanisms, efficacy rates from specific trials, and side effects.

Anthropic’s Response:

Anthropic mentions Sotorasib, Amivantamab, Tepotinib, Pralsetinib, and Capmatinib. While there is overlap with ChatGPT's list, it omits immunotherapies such as Pembrolizumab and provides less detail on clinical trial outcomes.

Google’s Response:

Google introduces treatments like Pembrolizumab combined with chemotherapy, Lorlatinib, Osimertinib, and Atezolizumab with Bevacizumab. These are important immunotherapies not mentioned by the other models.

Meta’s Response:

Meta focuses on treatments like Sotorasib, Amivantamab, Tepotinib, Capmatinib, Pralsetinib, and Selpercatinib, offering detailed efficacy metrics but missing the immunotherapies that Google includes.

The Value of Comparing Model Outputs

By analyzing these responses, it’s evident that no single model provides a complete picture. ChatGPT and Meta offer detailed efficacy rates and side effects but miss out on some immunotherapies. Google’s model includes newer treatments and combination therapies, highlighting options that others overlook. Anthropic provides a good summary but lacks depth in trial-specific metrics.

Completeness of Information: Comparing models ensures a more comprehensive understanding of available treatments. Google's inclusion of immunotherapies like Pembrolizumab adds valuable insights that could influence treatment decisions (a short code sketch after this list makes this check concrete).

Currency of Data: Some models may not be updated with the latest FDA approvals. Google’s model mentions treatments approved as recently as 2022, while others focus on drugs from 2020–2021.

Specificity and Detail: Models like ChatGPT and Meta provide specific data from clinical trials, such as objective response rates and progression-free survival times, which are crucial for medical professionals.

Diverse Perspectives: Different models may emphasize various classes of treatments — some focus on targeted therapies while others highlight immunotherapies. This diversity can lead to more personalized and effective decision-making.
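That completeness check can even be automated. The sketch below takes the drug names each model mentioned, transcribed from the responses summarized above rather than pulled fresh from any API, and computes which treatments appear in only one answer:

```python
# Drug lists transcribed from the model responses summarized above.
responses = {
    "ChatGPT":   {"Amivantamab", "Sotorasib", "Pralsetinib",
                  "Capmatinib", "Lurbinectedin"},
    "Anthropic": {"Sotorasib", "Amivantamab", "Tepotinib",
                  "Pralsetinib", "Capmatinib"},
    "Google":    {"Pembrolizumab", "Lorlatinib", "Osimertinib",
                  "Atezolizumab", "Bevacizumab"},
    "Meta":      {"Sotorasib", "Amivantamab", "Tepotinib",
                  "Capmatinib", "Pralsetinib", "Selpercatinib"},
}

# Treatments every model agrees on (empty here: Google's
# immunotherapy-heavy list shares nothing with the others).
print("Consensus:", sorted(set.intersection(*responses.values())))

# Treatments only a single model surfaced -- the blind spots.
for model, drugs in responses.items():
    others = set().union(*(d for m, d in responses.items() if m != model))
    unique = drugs - others
    if unique:
        print(f"Only {model} mentioned:", sorted(unique))
```

An empty consensus set is exactly the warning sign this article is describing: no single answer can be treated as complete.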

The Real-World Impact

Relying on a single AI model can lead to incomplete or outdated information, especially in high-stakes fields like medicine. A healthcare provider might miss out on recommending a life-saving treatment simply because their AI assistant didn’t mention it. Businesses could make poor strategic decisions if they don’t have all the relevant data.

Conclusion

In an era where AI models influence critical decisions, it’s imperative not to put all your trust in a single source. Each model has its strengths and limitations, shaped by its unique architecture, training data, and inherent biases. By comparing outputs from multiple models, you enhance accuracy, reduce bias, and mitigate risks.

Don’t gamble with decisions that could affect your business outcomes or, more importantly, people’s lives. Always seek multiple perspectives to ensure you’re making well-informed choices. After all, when it comes to AI, a second opinion isn’t just beneficial — it’s essential.

OverallGPT lets you compare answers side by side and synthesizes them into key takeaways. Check it out at https://overallgpt.com.
