The Elephant in the Room: Machine Learning without Ground Truth

Diego Perez Sastre
Published in Clarity AI Tech · 9 min read · May 19, 2023

As machine learning (ML) continues to revolutionize various industries, one significant challenge remains: the difficulty of obtaining high-quality, labeled data for training and validating models. This is particularly true for sustainability-tech startups like Clarity AI, where the quality of publicly available data can be far from ideal. In this blog post, we will delve deeper into the complexities of working with imperfect sustainability data and the innovative techniques that machine learning engineers (MLEs) can employ to overcome these challenges.

The year 2022 was a significant one for Artificial Intelligence, with numerous models grabbing the attention of both the media and the general public. It’s easy for non-experts to form a distorted perception of what’s possible with AI. The reality of building machine learning models, however, can be harsh: one of the biggest challenges is the scarcity of high-quality data (i.e. correctly labeled, unbiased, informative, …), which is essential for model development.

In the following sections, we will discuss the background, context, and core concepts of working with limited data, explore key components and architecture of proposed solutions, and delve into implementation and best practices. We will also introduce the novel concept of “Expert in the Loop” as a means of ensuring data quality and compare it with the traditional “Human in the Loop” approach. By acknowledging the elephant in the room — machine learning without ground truth — we hope to equip MLEs working in data-scarce industries with the tools and techniques necessary to build robust, accurate models even when faced with less-than-ideal data.

Background, context, and core concepts

The Importance of High-Quality Sustainability Data

At Clarity AI, a digital-native firm, we leverage machine learning to analyze more than 2 million data points bi-weekly, enabling us to quickly adapt to the evolving landscape of sustainability assessment and reporting requirements. At the core of our offerings is our powerful, scalable AI technology that performs reliability checks and runs estimation models at scale. This cutting-edge approach allows us to offer reliable, transparent, and unique data coverage, including 70,000+ companies, 360,000+ funds, 198 countries, and 199 local governments.

To ensure the accuracy and effectiveness of our sustainability scores and insights, at Clarity AI we rely on scientific, evidence-based methodologies powered by a global team of sustainability and data science experts. These experts work together to innovate, create, deploy, and maintain our tools and scores. Our platform’s robust, fully customizable tech kit allows users to assess, analyze, and report on sustainability factors efficiently and confidently.

At Clarity AI, high-quality sustainability data is paramount. Obtaining accurate and reliable sustainability data can be challenging due to several factors, such as missing or incomplete data, inconsistent labeling, and poorly defined categories. To address these challenges, we must not only embrace a data-centric approach but also actively involve domain experts and employ innovative techniques to ensure the quality and relevance of the data that powers our AI-driven sustainability assessments.

The Shift from Algorithm-Centric to Data-Centric AI

As the field of artificial intelligence has evolved, there has been a notable shift from an algorithm-centric approach to a data-centric one. This change highlights the importance of high-quality, labeled data for training and validating machine learning models. In the context of sustainability, machine learning engineers (MLEs) must navigate the complexities of imperfect data and develop creative techniques to address these challenges, such as transfer learning, data augmentation, or active learning (as described below).

The Role of Domain Experts in Machine Learning

Domain experts play a crucial role in the development of machine learning models, especially when dealing with complex or novel topics. They bring valuable knowledge and insights that can help MLEs make informed decisions during the model development process. Involving domain experts throughout the process is essential for ensuring the accuracy and reliability of the models created, particularly when working with imperfect sustainability data.

Expert in the Loop vs. Human in the Loop: Addressing Complex Topics

Traditionally, the “Human in the Loop” approach has been used to involve subject matter experts in the data labeling process, ensuring that the data used for model training is accurate and reliable. However, when dealing with new or complex topics, this method may not be sufficient. In these cases, the “Expert in the Loop” concept is more appropriate, as it emphasizes the expert’s role in providing continuous feedback throughout the model development process.

This approach allows MLEs to iteratively refine their models, incorporating expert knowledge to overcome the challenges of working with imperfect data. By actively involving domain experts in the process, MLEs can more effectively address data quality issues and build robust, accurate models even in the face of data scarcity and imperfections.

The Expert in the Loop approach underscores the principle of “garbage in, garbage out”: when your models are trained on subpar data, the results will likely mirror that lack of quality. This method allows us to elevate the quality of the training data, directly improving the results. One drawback worth acknowledging is the increased cost of engaging domain experts compared to typical non-expert annotators. Despite this, we argue that for complex subjects, the improvement in output justifies the extra expenditure.

In conclusion, understanding the unique challenges and core concepts associated with sustainability data is essential for MLEs working in this field. By embracing a data-centric approach, actively involving domain experts, and employing innovative techniques, they can develop effective strategies for building accurate and reliable machine learning models, even when faced with less-than-ideal data.

Key Components, Architecture, and Proposed Solution

The Challenge of Data Quality in Machine Learning

In our experience at Clarity AI, we have faced the challenge of obtaining high-quality data for training and validating our machine learning models. Initially, we collected data from humans through a “Mechanical Turk” approach. However, we soon realized that this data often contained inaccuracies, inconsistencies, and mislabeling, which negatively impacted the performance of our models.

Introducing the Expert in the Loop: A Two-Stage Solution

To address the issue of data quality, we introduced the Expert in the Loop concept into our machine learning pipeline in two key stages. This approach involves domain experts at both the data input and model validation stages, ensuring that the quality of the data used for training and the performance of the models meet the required standards.

Expert-Ensured Data Input

In the first stage, the domain expert reviews a percentage of the input data collected from humans. This step allows us to identify and correct inaccuracies, inconsistencies, and mislabeling, including errors intrinsic to the nature of the topic (i.e. systematic, non-random mistakes that non-experts are more likely to make), before feeding the data into our machine learning models. By doing so, we significantly improve the quality of the data used for training, leading to more accurate and reliable models.

Expert-Validated Model Performance Metrics

In the second stage, after training the models with the expert-reviewed data, the domain expert validates the performance metrics of the models. This step helps ensure that our models are performing at the desired level of accuracy and reliability, and allows us to make any necessary adjustments or improvements before deploying the models in production.

The Machine Learning Lifecycle with Expert in the Loop

By incorporating the Expert in the Loop approach into our machine learning lifecycle, we have created a more robust and reliable process that ensures the quality of both our input data and the performance of our models. The revised lifecycle is as follows:

  1. Data coming from humans
  2. Expert ensuring data quality (reviewing a percentage of the input data)
  3. Training the models with the expert-reviewed data
  4. Expert validating the performance metrics of the models
  5. Production deployment
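
As a concrete sketch of steps 1–2, the function below samples a fraction of crowd-sourced records for expert review. The function name, record format, and the 20% default are illustrative assumptions, not Clarity AI’s actual pipeline; the expert’s corrections themselves would happen outside this snippet.

```python
import random

def sample_for_expert_review(records, review_fraction=0.2, seed=42):
    """Split crowd-labeled records into an expert-review batch and the rest.

    `records` is a list of (text, crowd_label) pairs; the 20% default is
    purely illustrative and would depend on expert availability.
    """
    rng = random.Random(seed)
    n_review = max(1, int(len(records) * review_fraction))
    review_idx = set(rng.sample(range(len(records)), n_review))
    to_review = [r for i, r in enumerate(records) if i in review_idx]
    to_keep = [r for i, r in enumerate(records) if i not in review_idx]
    return to_review, to_keep

# 10 crowd-labeled records; 20% go to the expert for correction.
records = [(f"doc {i}", "label") for i in range(10)]
to_review, to_keep = sample_for_expert_review(records)
print(len(to_review), len(to_keep))  # prints: 2 8
```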

In conclusion, integrating the Expert in the Loop into our machine learning pipeline has allowed us to overcome the challenges of data quality, resulting in more accurate and reliable models that can better support Clarity AI’s sustainability assessment and reporting solutions.

Implementation and Best Practices

Leveraging Advanced Techniques to Improve Data Quality

When working with sustainability data in a small company like Clarity AI, it’s essential to employ advanced techniques to address common data issues such as missing data, inconsistent labeling, or poorly defined categories. Some of these techniques include transfer learning, data augmentation, and active learning.

Transfer Learning

Transfer learning is the process of using pre-trained models as a starting point and fine-tuning them for a specific task. By leveraging the knowledge gained from previous tasks, transfer learning can significantly reduce the amount of labeled data required for training and improve model performance.
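
As a toy sketch of the idea (not a production recipe), the snippet below “pre-trains” a TF-IDF representation on a larger source corpus and then trains only a small classifier head on two labeled target examples. In practice one would fine-tune a pre-trained transformer instead, but the division of labor is the same: reuse a learned representation and re-train only the task-specific part. All corpora and labels here are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# "Pre-training": fit the text representation on a larger source corpus.
source_corpus = [
    "solar energy output", "wind turbine capacity",
    "quarterly revenue growth", "annual profit margin",
]
vectorizer = TfidfVectorizer().fit(source_corpus)

# "Fine-tuning": train only the classifier head on a tiny target set.
target_texts = ["solar panel efficiency", "net profit this quarter"]
target_labels = [0, 1]  # 0 = environmental, 1 = financial (illustrative)
X = vectorizer.transform(target_texts)  # reuse the learned representation
clf = LogisticRegression().fit(X, target_labels)

# Classify a new, unseen text with the transferred representation.
print(clf.predict(vectorizer.transform(["solar output levels"])))
```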

Data Augmentation

Data augmentation is the technique of generating new training samples by applying various transformations to the existing data. In the context of NLP problems, this process can help increase the amount and diversity of text data available for training, leading to more robust and accurate models.

One approach to data augmentation in NLP is using generative language models, such as GPT-4. These models can generate new text samples that are similar in structure and content to the original data. To ensure that the generated data is correct and adheres to the complex definitions of the classes, we provide the generative model with the complete definition of the class, along with a few examples of correctly labeled text.

By doing so, the generative model can create new training samples that closely align with the class definitions and follow the same patterns as the correctly labeled examples. This augmented data can be used to improve the training process and ultimately enhance the performance of the machine learning model in the context of sustainability data.
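
A minimal sketch of how such a prompt might be assembled is shown below. The class name, definition, and example are invented for illustration, and the call to the generative model itself (via whatever LLM API is in use) is deliberately omitted.

```python
def build_augmentation_prompt(class_name, class_definition, examples, n_new=3):
    """Assemble a prompt asking a generative model for new samples of a class.

    Argument names and prompt wording are illustrative, not a fixed template.
    """
    example_lines = "\n".join(f"- {text}" for text in examples)
    return (
        f"Class: {class_name}\n"
        f"Definition: {class_definition}\n"
        f"Correctly labeled examples:\n{example_lines}\n"
        f"Write {n_new} new sentences that also belong to this class."
    )

# Invented class definition and example, for illustration only.
prompt = build_augmentation_prompt(
    class_name="Water stress disclosure",
    class_definition="Statements describing a company's exposure to water scarcity risks.",
    examples=["Our plants in the region face seasonal water shortages."],
)
print(prompt)
```

The model’s responses would then be reviewed, ideally by the expert, before joining the training set.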

Active Learning

Active learning is an approach where the model actively selects the most informative examples from the unlabeled data for the expert to label. This method allows us to efficiently utilize the expert’s time and knowledge while maximizing the impact of their input on the model’s performance.
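
A minimal uncertainty-sampling sketch on synthetic one-dimensional data, assuming a least-confidence query strategy; libraries such as modAL wrap this pattern, but the core idea fits in a few lines of scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D data: a few labeled points and an unlabeled pool.
rng = np.random.default_rng(0)
X_labeled = np.array([[0.0], [1.0], [9.0], [10.0]])
y_labeled = np.array([0, 0, 1, 1])
X_pool = rng.uniform(0, 10, size=(20, 1))  # unlabeled candidates

# Least-confidence strategy: query the points the model is least sure about.
model = LogisticRegression().fit(X_labeled, y_labeled)
proba = model.predict_proba(X_pool)
uncertainty = 1.0 - proba.max(axis=1)
query_idx = np.argsort(uncertainty)[-3:]  # 3 most uncertain points

# These are the pool points closest to the decision boundary; they are
# the ones an expert would be asked to label next.
print(X_pool[query_idx].ravel())
```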

Expert in the Loop: Best Practices for Implementation

To effectively implement the Expert in the Loop approach in your machine learning pipeline, consider the following best practices:

  1. Establish clear communication channels and processes between the MLEs and domain experts to facilitate collaboration and knowledge sharing.
  2. Determine the appropriate percentage of input data to be reviewed by the expert, considering the expert’s availability and the importance of data quality for your specific use case.
  3. Prioritize the review of data points that are most likely to be mislabeled, ambiguous, or have a high impact on the model’s performance.
  4. Continuously monitor the model’s performance metrics and adjust the expert’s involvement as needed to ensure the desired level of accuracy and reliability.
  5. Encourage an iterative approach, where the expert’s feedback is incorporated into the model training process, and the model’s performance is re-evaluated after each iteration.
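
For practice 3, one simple, hypothetical way to surface likely-mislabeled items is to rank them by annotator disagreement, as in the sketch below (item ids and labels are invented):

```python
from collections import Counter

def prioritize_for_review(annotations):
    """Rank items so the most contested crowd labels come first.

    `annotations` maps an item id to the list of labels given by
    different non-expert annotators; names here are illustrative.
    """
    def agreement(labels):
        # Fraction of annotators agreeing with the majority label.
        return Counter(labels).most_common(1)[0][1] / len(labels)

    return sorted(annotations, key=lambda item: agreement(annotations[item]))

votes = {
    "doc_a": ["green", "green", "green"],   # unanimous: low priority
    "doc_b": ["green", "brown", "brown"],   # some disagreement
    "doc_c": ["green", "brown", "social"],  # fully contested: top priority
}
print(prioritize_for_review(votes))  # prints: ['doc_c', 'doc_b', 'doc_a']
```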

By following these best practices and leveraging advanced techniques such as transfer learning, data augmentation, and active learning, MLEs can overcome the challenges of obtaining high-quality sustainability data and develop more accurate and reliable models to drive their organization’s success.

Conclusion

In this blog post, we have discussed the challenges of obtaining high-quality data for machine learning applications, particularly in startups like Clarity AI. By introducing the concept of “Expert in the Loop” and comparing it to the traditional “Human in the Loop” approach, we have highlighted the importance of involving domain experts in the data labeling and model validation process, especially when dealing with new or complex topics.

We have also explored various advanced techniques, such as transfer learning, data augmentation using generative language models, and active learning, that can help machine learning engineers overcome the challenges of working with imperfect sustainability data. By following the best practices for implementing the Expert in the Loop approach and leveraging these advanced techniques, MLEs can develop more accurate and reliable models to drive their organization’s success in the sustainability-tech industry.

Bonus Track: Tools and Resources

To assist MLEs in overcoming the challenges of working with sustainability data, here are some popular tools, libraries, and resources related to the techniques discussed in this blog post:

Transfer Learning:

  • Hugging Face Transformers (https://huggingface.co/transformers/): A popular library providing pre-trained models for various NLP tasks, including text classification, summarization, and translation.

Data Augmentation:

  • OpenAI GPT-4 (https://www.openai.com/): The official website of OpenAI, the organization behind the GPT-4 model, where you can find resources and guides on how to use generative language models for data augmentation.

Active Learning:

  • modAL (https://modal-python.readthedocs.io/): A Python library for active learning built on top of scikit-learn, offering various query strategies and tools for implementing custom active learning workflows.

By exploring these tools and resources, MLEs can continue to develop their skills and knowledge in the field of machine learning and successfully apply them to sustainability data challenges.
