AI’s Triumph: When Your ML Model Beats Human Skills

Your ML model has exceeded human performance. What next?

Armaanjeet Singh Sandhu
The Research Nest
6 min read · Sep 15, 2023


Scenario: Imagine that you create an ML model to solve a specific problem. To assess the model’s performance, you need a benchmark. The natural benchmark is the Bayes error, the lowest error any model could possibly achieve on the dataset. Estimating the Bayes error directly is usually impractical, so you decide to use human error as a proxy for it.
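As a rough illustration of how that proxy can be measured, here is a minimal sketch in Python (the label arrays and annotators are hypothetical): each annotator’s error rate against the ground-truth labels is computed, and the best human performance stands in for the Bayes error.

    import numpy as np

    # Hypothetical ground-truth labels and predictions from two human annotators
    y_true      = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
    annotator_a = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 1])  # typical annotator
    annotator_b = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # domain expert

    error_a = np.mean(annotator_a != y_true)  # 0.20
    error_b = np.mean(annotator_b != y_true)  # 0.10

    # The best observed human error serves as the proxy for the Bayes error
    human_error = min(error_a, error_b)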

Understanding Bias and Variance

  • Bias: Error due to overly simplistic assumptions in the learning algorithm, which cause the model to miss relevant patterns and underfit.
  • Variance: Error due to excessive complexity in the learning algorithm, which makes the model sensitive to small fluctuations in the training data and leads to overfitting: it performs well on the training data but poorly on unseen data.

If the error on both the training and dev sets exceeds human error, diagnosing bias or variance issues is straightforward. This situation maps onto the well-known “bias-variance trade-off,” and you can refine the model iteratively.
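A minimal sketch of that diagnosis in Python (the error values are placeholders you would replace with your own measurements):

    # Hypothetical error rates; substitute your measured values
    human_error = 0.05   # proxy for the Bayes error
    train_error = 0.12
    dev_error   = 0.15

    avoidable_bias = train_error - human_error  # 0.07 -> the dominant problem here
    variance       = dev_error - train_error    # 0.03

    if avoidable_bias > variance:
        print("Focus on bias: try a bigger model, longer training, or a better architecture")
    else:
        print("Focus on variance: try regularization or more training data")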

You can determine your model’s position on a two-by-two matrix (referred to here as an Eisenhower-style framework) and make adjustments accordingly: depending on which quadrant the model falls into, regularization or acquiring more training data may be the more viable strategy.

In Simple Terms

If your model makes many more mistakes on both the training and dev datasets than a human would, something is wrong.

  • If your model is too simple and does not capture the necessary patterns in the data, it has a high bias issue. To fix this, you may need to make your model more complex.
  • If your model is too complex, trying to fit the training data too closely, it has a high variance issue. To address this, you can either simplify your model using regularization techniques or try to get more data to train your model.

You’d use a framework (here referred to as the Eisenhower framework) to figure out which issue (bias or variance) is more prominent and then take steps to fix it by adjusting your model accordingly.

You diligently work to narrow the room for improvement by fine-tuning hyperparameters, optimizing algorithms, and occasionally reworking parts of your system. Eventually, your model surpasses human performance through your efforts and a bit of luck.

Congratulations! What now?

At this point, you have two main choices:

  • Conclude your project, deploy it to production, and move on, or
  • Continue to improve the model

Choosing the second option presents a more complex task. Diagnosing bias or variance becomes challenging. Previously, we compared the Bayes error, training error, and cross-validation error for diagnosis.

  • If the training set error was much higher than the baseline error, it suggested a high bias problem.
  • If the dev set error exceeded the training set error or the baseline error by a large margin, it indicated a high variance problem.
(Figure: sample model evaluation)

But now, both the training set error and the dev set error are lower than the baseline error. Without a systematic method for diagnosis, how can you determine the direction to refine your model?
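To make the difficulty concrete with hypothetical numbers: suppose the estimated human error is 5%, the training error is 3%, and the dev error is 4%. Both gaps that previously drove the diagnosis are now negative, so these figures alone no longer tell you whether bias or variance is the better lever to pull, or whether the human-error estimate was simply too loose a proxy for the Bayes error in the first place.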

You have a couple of approaches to consider at this point:

  • Alternative Evaluation Metrics: Instead of relying solely on accuracy, explore other informative evaluation metrics like precision, recall, or F1-score (see the sketch after this list). These metrics give a deeper picture of where the model excels and where it struggles. Additionally, reconsider the human error rate itself: if it was initially measured with non-experts, re-measure it with experts in the task to get a more accurate surrogate for the Bayes error.
  • Human-in-the-Loop Evaluation: In tasks involving subjective judgment or complex decision-making, involve human intervention to evaluate the model’s performance. Perform manual error analysis to identify patterns in the model’s errors on the development set.
  • Traditional Techniques: You can also apply conventional methods to address bias, variance, or data mismatch. These methods include regularization, early stopping, feature selection, and more. Since you may not be certain about the specific issue, compare the cost function’s value before and after applying these methods. When using this exploratory approach, remember the concept of orthogonalization, systematically isolating and testing individual changes to understand their impact on the model’s performance.
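Before walking through the orthogonalization workflow, here is a minimal sketch of the first option above, using scikit-learn’s standard metric functions (the label and prediction arrays are placeholders for your dev set):

    from sklearn.metrics import precision_score, recall_score, f1_score

    # Placeholder dev-set labels and model predictions
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

    print("Precision:", precision_score(y_true, y_pred))  # share of predicted positives that are correct
    print("Recall:   ", recall_score(y_true, y_pred))     # share of actual positives that are found
    print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of the two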

Here’s what that orthogonalization process would look like:

  1. Identify Different Ideas or Changes: Start by listing the different ideas or changes you have in mind for improving your ML model. These could include changes to hyperparameters, feature engineering, architecture modifications, or even different algorithms.
  2. Isolate Each Idea: Take one idea or change at a time and implement it in isolation while keeping all other aspects of your model constant. This means that you should hold everything else steady, such as data preprocessing, other hyperparameters, and the model architecture.
  3. Evaluate and Measure: Train and evaluate your model with the isolated change. Use appropriate evaluation metrics to assess its performance on your validation or test data. Record the results, including any changes in performance compared to your baseline model.
  4. Repeat for Each Idea: Repeat steps 2 and 3 for each of your different ideas or changes. Ensure that you have a clear and consistent way to evaluate the impact of each individual modification.
  5. Compare Results: After implementing and evaluating each idea in isolation, you will have a clear understanding of how each one affects your model’s performance. You can then compare the results to identify which changes had a positive impact, which had no significant effect, and which may have worsened performance.
  6. Combine Complementary Changes: If you find that some changes are complementary and improve performance together, you can combine them. However, be cautious about combining changes that may have a detrimental impact when used together.
  7. Iterate and Refine: Based on your findings, iterate on the ideas that positively impacted your model. You can fine-tune these changes further or try additional variations to maximize performance.
  8. Document and Communicate: Keep detailed records of your experiments, including the changes made, the results obtained, and any insights gained. This documentation will help you make informed decisions and communicate your findings to others.
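A minimal sketch of what steps 2–5 could look like in code, using a logistic-regression model with regularization strength as the isolated change (the data, model, and candidate changes are placeholders, not a prescription):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Placeholder data; substitute your own training and dev sets
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.25, random_state=0)

    # Each candidate change is tested in isolation; everything else stays fixed
    candidate_changes = {
        "baseline":    {"C": 1.0},
        "stronger_l2": {"C": 0.1},   # more regularization
        "weaker_l2":   {"C": 10.0},  # less regularization
    }

    results = {}
    for name, params in candidate_changes.items():
        model = LogisticRegression(max_iter=1000, **params).fit(X_train, y_train)
        results[name] = f1_score(y_dev, model.predict(X_dev))

    # Compare results to see which single change helped, hurt, or made no difference
    for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:12s} dev F1 = {score:.3f}")

In a real project, each entry in candidate_changes would be a genuinely different idea (a new feature set, an architecture tweak, a changed hyperparameter), logged alongside its result as described in step 8.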

Sounds great! But why go through all this effort?

Achieving superhuman-level performance in a machine learning task is rare, but there are compelling reasons to diagnose and potentially improve your model, even after surpassing human performance:

  • Ethical Considerations: Human-level performance doesn’t guarantee ethical behavior or fairness. Models can still exhibit bias, discrimination, or undesirable behaviors that require diagnosis and rectification for ethical reasons.
  • Robustness and Generalization: Models can overfit training or dev data, leading to decreased performance with unfamiliar or edge cases. Diagnosis helps identify vulnerabilities and enhances robustness and generalization.
  • Changing Data: Real-world data distributions can evolve over time. What’s considered human-level performance today may not hold in the future. Regular diagnosis and fine-tuning ensure adaptation to shifting data distributions.
  • Resource Efficiency: Highly accurate models may demand significant computational resources. Diagnosis and improvements in efficiency reduce computational costs, which is crucial in resource-constrained environments.
  • Business Goals: Beyond human-level performance, business objectives like customer satisfaction, cost reduction, response times, and regulatory compliance drive ongoing model enhancement efforts.
  • Competitive Advantage: Maintaining a technological edge in competitive fields motivates continuous improvement, even post-human-level performance.
  • Research and Development: In research settings, pushing boundaries and advancing the state of the art remains a motivation. Continuing model work explores new techniques and deepens problem understanding.

Depending on your context and requirements, you can adapt your evaluation metrics, either in conjunction with accuracy or as its replacement, to address these considerations.
