Published in Nerd For Tech

How to Avoid Data Bias in AI Development

Biased data leads to biased results; that is a simple fact of AI development. However, any method of gathering data may unintentionally introduce bias into the AI model, which makes the situation much more complicated. For example, suppose a customer is building a voice recognition model, perhaps for use in cars. Speech varies in tone, accent, filler words, and grammar (not to mention language and dialect). If the voice recognition model is meant to work for drivers of different demographics and backgrounds, the customer needs varied data that represents each case.

Many AI companies are willing to purchase existing data sets. They use them to build an early engine, hoping to enter the market quickly and at low cost. The disadvantage of a purchased data set is that it is already fixed and cannot be changed easily, so it is not optimized for any specific scenario.

In the early stage of developing a natural language understanding or speech recognition engine, using existing data sets is a good way to start. When adapting the engine to a specific scenario at a later stage, you need to supplement it with customized data. For example, if most of the data a customer has collected consists of male voices, the voice recognition model will often struggle to recognize female voices.

Mainstream voice-based products on the market have this issue because the algorithm does not see enough variety of data during the training stage. The challenge for a company is therefore how to assemble a complete data set that covers all cases, including edge cases.
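As a rough illustration of checking coverage before training, the sketch below counts how many samples fall into each demographic group and flags groups whose share is below a threshold. The function name `underrepresented_groups`, the `min_share` threshold, and the sample labels are all hypothetical and chosen only for this example; a real project would pick groups and thresholds to match its own scenario.

```python
from collections import Counter


def underrepresented_groups(labels, expected_groups, min_share):
    """Return {group: share} for every expected group whose share of
    the samples is below min_share. Groups with no samples get 0.0."""
    counts = Counter(labels)
    total = len(labels)
    shares = {
        group: (counts[group] / total if total else 0.0)
        for group in expected_groups
    }
    return {group: share for group, share in shares.items() if share < min_share}


# Hypothetical metadata: one speaker-gender label per recording.
speaker_gender = ["male"] * 85 + ["female"] * 15

# Assumed threshold: each group should make up at least 30% of the data.
flagged = underrepresented_groups(speaker_gender, ["male", "female"], min_share=0.3)
print(flagged)
```

Passing the expected groups explicitly means a group that is missing entirely is still reported, which a plain count of the labels present would not reveal. The same check can be repeated for accent, age band, or dialect labels.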

Case study: Korean two-person dialogue voice collection

Customer needs: hundreds of hours of Korean voice conversation data

Project introduction: two-person interactive dialogue around a given theme (A: customer service agent for an organization related to the theme; B: customer)

Content: customer service conversation

Duration: 15 minutes for each conversation

Participant age requirement: 18–60 years old

Recording requirements: record in a quiet environment to ensure a clear call recording

For data collection, ByteBridge has an extensive overseas network covering Asia, Southeast Asia, the Middle East, North America, South America, Europe, Africa, and other regions. On request, we can find thousands of collection personnel in a short time.

Following the requirements, we gathered a group of Korean participants to record by phone. They covered multiple topics, including aviation, agriculture, delivery services, finance, banking, and health, and completed the recordings as required.
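The requirements listed in the case study lend themselves to an automated check on each submitted recording. The following is a minimal sketch, not ByteBridge's actual process: the function `validate_recording` and the one-minute tolerance on duration are assumptions made for illustration, while the age range and target duration come from the requirements above.

```python
MIN_AGE, MAX_AGE = 18, 60   # from the stated age requirement
TARGET_MINUTES = 15         # from the stated conversation duration
TOLERANCE_MINUTES = 1       # assumption: the brief gives no allowed deviation


def validate_recording(speaker_ages, duration_minutes):
    """Return a list of human-readable problems; an empty list means
    the recording meets the age and duration requirements."""
    problems = []
    for index, age in enumerate(speaker_ages):
        if not MIN_AGE <= age <= MAX_AGE:
            problems.append(
                f"speaker {index} age {age} is outside {MIN_AGE}-{MAX_AGE}"
            )
    if abs(duration_minutes - TARGET_MINUTES) > TOLERANCE_MINUTES:
        problems.append(
            f"duration {duration_minutes} min is not within "
            f"{TOLERANCE_MINUTES} min of {TARGET_MINUTES} min"
        )
    return problems


# Hypothetical submissions: (ages of speakers A and B, duration in minutes).
print(validate_recording([25, 40], 15.2))
print(validate_recording([17, 61], 10))
```

Requirements such as a quiet recording environment cannot be verified from metadata alone and would still need a listening check or a signal-level measurement.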

NLP Service

We provide different types of NLP services for e-commerce, retail, search engines, social media, and other fields. Our services include Voice Classification, Sentiment Analysis, Text Recognition, and Text Classification (Chatbot Relevance).

Partnered with more than 30 language communities across the globe, ByteBridge now provides data collection and text annotation services covering languages such as English, Chinese, Spanish, Korean, Bengali, Vietnamese, Indonesian, Turkish, Arabic, Russian, and more.


Outsource your data labeling tasks to ByteBridge and get high-quality ML training datasets cheaper and faster!

  • Free Trial Without Credit Card: get your sample results with a fast turnaround, check the output, and give feedback directly to our project manager.
  • 100% Human Validated
  • Transparent & Standard Pricing: clear pricing is available (labor cost included)

Why not give it a try?





