Closed Beta Results — A Deep Dive

Synesis One · Sep 25, 2022

A detailed summary of our learnings from Closed Beta

TLDR:

  1. Quality of the data submissions matters much more than quantity
  2. The utterance approval rate will decline to reflect these learnings
  3. Train2Earn is off to a great start, and it’s actually working!

Overview

From August 1st to August 15th, Synesis One conducted a closed beta of the first-ever Train2Earn application. The campaigns were set up by Mind AI, our first customer and sister company.

Closed beta had two primary objectives:

  1. Find bugs, identify user pain points, and solicit feedback on the app so we can fix and improve it
  2. Find out whether community members can provide data of high enough quality for AI companies to use in actual use cases

In order to achieve these objectives, we had one Architect (the role for businesses that need data), Mind AI, creating campaigns based on its data needs. We onboarded 46 unique Builders (the role for anyone who provides data collection, labeling, or annotation services) by inviting a limited number of members from our own community and our partner communities to participate.

Campaigns — Raw data collection campaigns from Mind AI

Because Mind AI was the sole Architect in the closed beta, all the campaigns were focused on collecting natural language data. Each campaign had one topic sentence, and the prompt asked Builders to rephrase the topic sentence in more specific or more general ways, as well as to provide some possible causes and effects of the topic sentence.

For anyone unfamiliar with Mind AI campaigns or who has not participated in our beta yet, here’s what a sample campaign and its approved submissions look like.
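As a rough, invented illustration (the topic sentence and utterances below are made up for this sketch, not actual approved submissions), a campaign pairs one topic sentence with utterances in the four variation categories used during the closed beta:

```python
# Illustrative only: a hypothetical campaign record with the four
# variation categories from the closed beta. Field names and example
# utterances are invented for this sketch.
sample_campaign = {
    "topic_sentence": "My laptop is not working.",
    "submissions": {
        "specific": ["My work laptop won't turn on this morning."],
        "general":  ["My computer is broken."],
        "cause":    ["I spilled water on my laptop."],
        "effect":   ["I can't finish my report on time."],
    },
}

for category, utterances in sample_campaign["submissions"].items():
    print(f"{category}: {len(utterances)} submitted utterance(s)")
```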

Results and stats from the closed beta

Here are the high-level stats from our two-week closed beta:

Duration: 15 days

Campaigns: 25

Unique participants: 46

Total submitted utterances: 3029 (407 specific / 866 general / 628 cause / 1128 effect)

Total approved utterances: 2165 (293 specific / 728 general / 419 cause / 725 effect)

Total approval rate: 71.5% (72% specific / 84% general / 67% cause / 64% effect)

Total SNS rewarded: 28,674 SNS
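For reference, the approval rates above are simply the approved counts divided by the submitted counts. The short snippet below reproduces that arithmetic from the numbers reported here:

```python
# Reproduce the closed beta approval rates from the counts reported above.
submitted = {"specific": 407, "general": 866, "cause": 628, "effect": 1128}
approved = {"specific": 293, "general": 728, "cause": 419, "effect": 725}

for category in submitted:
    rate = approved[category] / submitted[category]
    print(f"{category}: {rate:.0%}")

total_rate = sum(approved.values()) / sum(submitted.values())
print(f"total: {total_rate:.1%}")  # ~71.5%
```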

Understanding what is actually needed by Mind AI

To understand why these campaigns are structured the way they are, and why these types of data are needed, we need to explore how Mind AI uses the data. Mind AI is an AI engine that uses natural language to mimic human reasoning and contextualize information. This means that Mind AI doesn’t run statistical computations to come up with a predictive value (as ML/DL systems do), but rather uses a totally new natural-language-based data structure to understand and reason logically.

Being an AI that uses natural language, the lowest-hanging fruit for commercialization is conversational AI, a.k.a. chatbots. Mind AI currently provides conversational AI services for customer service, online ordering, and recommendations for various companies in different domains. In order for the AI to understand and process customer inquiries better, it requires a robust database of knowledge around natural language.

For example, if I asked 5 different people how they would order pizza using an online chatbot, there would be 5 different variations of the conversation. When you start considering typos, slang, dialects, geographies, etc., you can see how quickly the need for these types of sentence variations grows for something as simple as ordering pizza.

Additionally, Mind AI collects entailment sentences (formerly Cause and Effect), which are different from specific and general sentence variations. Teaching the AI about potential causes and effects allows it to understand and logically reason with the given information, improving its ability to diagnose issues or make its own hypotheses regarding a topic sentence. If I teach the AI that “I spilled water on my laptop” is a cause of the topic sentence “My laptop is not working,” then the AI will understand that spilling water on your laptop results in the laptop not working. Therefore, if a customer service chatbot receives information that the customer spilled water on his or her laptop, then the AI can hypothesize that the laptop may be broken.
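As a toy illustration of that reasoning pattern (a sketch for intuition only; Mind AI’s engine uses its own canonical data structures, not a lookup table), a minimal cause-to-hypothesis mapping might look like this:

```python
# Toy sketch only: a naive cause -> effect lookup standing in for the kind
# of entailment knowledge described above. Not Mind AI's actual engine.
entailments = {
    "i spilled water on my laptop": "my laptop is not working",
}

def hypothesize(observation: str):
    """Return a possible effect entailed by the observed statement, if known."""
    return entailments.get(observation.lower().rstrip("."))

print(hypothesize("I spilled water on my laptop."))
# -> "my laptop is not working"
```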

Taking a more detailed look at the results of the closed beta

As mentioned above, Mind AI created 25 campaigns throughout the two weeks that the closed beta was open. To properly evaluate the quality of the data collected through these campaigns, we asked the computational linguists and ontologists from Mind AI to test it.

Test procedures for evaluating the quality of collected data

The testing methodology is somewhat technical, so we’ll cover it at a surface level, in an understandable way.

First, the Mind AI team takes all of the utterances and converts the natural language into novel data structures called “Canonicals” that the AI engine can work with. These utterances are classified into their respective domains/industries and topics. The computational linguists then draft 30 or more test inquiries (which can be viewed as utterances) to test the AI engine with. The inquiries are sent to the Mind AI engine, which uses the pool of newly created canonicals, built from the data collected through Synesis One, to process the inquiries correctly.

The test measures precision and recall to evaluate the information retrieval performance of the AI engine using the newly provided dataset. Roughly speaking, precision tells us how relevant the results are, and recall tells us how complete the results are (high coverage). Taking the harmonic mean of precision and recall gives us an “F-measure,” which summarizes the test’s performance. Lastly, there’s another measure you will see in the results, called “Z_Total accuracy.” Accuracy is the fraction of correct classifications, usually expressed as a percentage. For example, if a classifier makes ten predictions and eight of them are correct, the accuracy is 80%.
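To make those metrics concrete, here is a small generic sketch of how precision, recall, F-measure (the harmonic mean of the two), and accuracy are computed from test outcomes. The counts used are made up for illustration and are not the closed beta results:

```python
# Generic metric definitions (illustrative counts, not closed beta data).
def precision(tp: int, fp: int) -> float:
    # Of the results the engine returned, how many were relevant?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all relevant results, how many did the engine return?
    return tp / (tp + fn)

def f_measure(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

def accuracy(correct: int, total: int) -> float:
    # Fraction of correct classifications.
    return correct / total

p, r = precision(tp=25, fp=2), recall(tp=25, fn=6)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))
print(accuracy(correct=8, total=10))  # 8 of 10 correct -> 0.8, as in the text
```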

Looking at the results from closed beta

Let’s take a look at how the campaigns performed in the closed beta.

The results above raise a few very interesting questions:

  • How is it that the most accurate campaign has over 200 approved utterances, but the 2nd and 3rd most accurate campaigns only have 61 and 27 approved utterances, respectively?
  • Why do some campaigns with over 100 approved utterances have such low accuracies?
  • What is the right amount of approved utterances we need to collect for each campaign?

So, how many utterances are enough?

Let’s start at the highest level by answering the third question. According to the linguists at Mind AI, the ideal number of utterances per topic sentence is between 250 and 300, assuming these are high-quality utterances. According to the available data from its currently commercialized services, having 250–300 high-quality utterances allows Mind AI to outperform its competitors, including Google Dialogflow (90%+ accuracy vs. 75–80% from Dialogflow).

Additionally, the team indicated that as the AI engine improves, the ideal number of utterances per campaign will decrease.

Another important factor that is directly correlated with the number of utterances we accumulate is accuracy. In order to sufficiently service Mind AI’s commercial clients, the accuracy for each topic needs to be at least 85%.

But why do recall, f-measure, and accuracy differ so much across our results?

The answer to the first and second questions depends heavily on the quality of the data collected. To explore what low- or high-quality data looks like, let’s take some actual examples from the first and second campaigns in the results.

Campaign #1: Do you have halal food at your grocery store?

Result: 208 approved utterances, 0.83 recall, 0.909 f-measure

Sample approved utterances for Campaign 1

Campaign #2: Do you have gluten-free cat food?

Result: 61 approved utterances, 0.73 recall, 0.846 f-measure

Sample approved utterances for Campaign 2

Let’s compare the specific utterances from campaigns 1 and 2. If you look at campaign 1’s specific utterances, they are extremely similar:

  • Do you have halal meat roll at your grocery store?
  • Do you have halal sausage at your grocery store?
  • Do you have halal beef at your grocery store today?
  • Do you have halal beef at your grocery store at the moment?

Syntactically, there’s not much difference. The only thing happening here is swapping out nouns or adding a little more detail. These are considered low-quality utterances. Now let’s look at the specific examples from campaign 2:

  • Will you have gluten free cat food next week?
  • Do you stock gluten-free cat food suitable for kittens?
  • Do you sell gluten-free cat food in different package sizes?

The first and second utterances are syntactically different, although the meaning and intent are the same. These variations are harder to come up with than simply swapping nouns or synonyms, and they are considered much higher-quality utterances.

Let’s look at it from a real-world, practical sense. If many different customers of a grocery store are chatting with a customer service agent or bot, it’s very unlikely that they will all use the same speech/syntax structure. Therefore, teaching the AI different ways to say things, not just synonyms, is much more beneficial.

The explanation above for the specific utterances applies in the same way to the other types of utterance variations. This is why campaigns 1 and 2 have very similar accuracies but very different numbers of collected utterances. Campaign 1 has a large number of utterances, many of them lower-quality ones that just replace a few words. Campaign 2 has a robust variety of higher-quality utterances, making the data much more efficient for the AI to comprehend.

This also explains why some campaigns had over 100 approved utterances but such low recall and f-measures. As we can see from the closed beta, quality is much more important for the AI than quantity.
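One simple way to see this quality difference in numbers (an illustrative check of our own, not the validation logic used in the app) is to measure word overlap between submissions: near-duplicate utterances like campaign 1’s share most of their words, while syntactically varied ones like campaign 2’s do not:

```python
# Illustrative check: Jaccard word overlap between two utterances. High
# overlap suggests "swap a noun" near-duplicates; low overlap suggests
# real syntactic variety. A sketch for intuition, not the app's validator.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

campaign_1 = [
    "Do you have halal meat roll at your grocery store?",
    "Do you have halal sausage at your grocery store?",
]
campaign_2 = [
    "Will you have gluten free cat food next week?",
    "Do you stock gluten-free cat food suitable for kittens?",
]

print(round(jaccard(*campaign_1), 2))  # ~0.73: high overlap, low variety
print(round(jaccard(*campaign_2), 2))  # ~0.20: low overlap, more variety
```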

So what are the learnings and changes going forward?

The biggest takeaways from this evaluation of the data collected from the closed beta are as follows:

  1. It is totally, 100% possible to crowdsource data capable of powering AI for commercial use cases
  2. We have to focus on getting higher-quality data through more education and stricter validation

To address the first takeaway, the crowdsourced data for campaign 1 reached high levels of recall and f-measure, indicating that it is on its way to being sufficient for commercial usage. The data collected for campaign 2 shows even more promise: it reached extremely high levels of recall and f-measure with only 61 approved utterances. Imagine the potential if we had even just double the approved utterances at the same quality. Of course, for these datasets to be fully ready for commercial use, they’ll require some refining and supplementation from the experts, but this shows progress and potential. In fact, we will be launching some new campaigns in our open beta that will go directly into supporting one of Mind AI’s commercial clients (you heard it here first!).

Secondly, as many of the participants of the open beta may already know, validation has become a lot stricter. The lower-quality utterances that may have been accepted in the closed beta will no longer be approved moving forward. There are big changes for the Mind AI campaigns as well: there will no longer be an “Effect” variation category, and “Cause” will be changed to “Entailments.” These changes are accompanied by new and improved guidelines for the specific, general, and entailment utterances.

There are also a lot of new product improvements that will help Mind AI collect a robust dataset, such as utterance quantity requirements based on the type of utterance (specific/general/entailment) and continuous education around coming up with high-quality utterances.

Synesis One is a data crowdsourcing platform where anyone can earn by completing micro-tasks that train AI.