Executive’s Corner with Vikrant Singh Tomar

Fluent.ai Inc.

--

The Challenges of Building Dependable AI

Building artificial intelligence technology takes time and patience. Efficient neural networks require a tremendous amount of time and effort, first in designing the network architectures and then in training those architectures on what are typically large amounts of data.
In lab settings, these networks are tuned to near-perfect accuracy on test datasets. Even then, it can be difficult to be absolutely certain of the accuracy of an AI solution once it is deployed. A recent research paper suggests the issue of misbehaving AI runs deep and could mean rethinking how neural networks are trained from the bottom up. Fluent.ai Founder and CTO Vikrant Singh Tomar gives his thoughts on what needs to be done to make sure AI is not only accurate across all applications, but can also be relied on and trusted as these technologies become more commonplace.

What is “underspecification” in AI training? What issues does it create for applications of AI?

Vikrant Singh Tomar (VST): In simple terms, “underspecification” is when the problem your model is trying to solve is not specified in enough detail, or there isn’t enough data available for the model to arrive at a specific answer. You may have some information about the problem, and based on that information you could find multiple solutions that satisfy the requirements, but the information might not be specific enough to single out the one exact answer.

It may be easier to think of an example from image recognition. Let’s take a set of identical twins. My “twin” and I look exactly the same; however, I have a cut on one side of my face and my identical twin doesn’t. We have been using pictures of us from the front to train our facial recognition model, but at the angle we were using, the cut on my face was not visible. So when we use the system to identify us among millions of people, the model will be able to say, “oh, it’s one of the two”, but it won’t know which twin it is, because it doesn’t have specific enough information about what actually makes us distinct from each other. It doesn’t have the detail of the cut on my face specified.

Underspecification is a major problem in training AI, and perhaps one that we haven’t finished exploring as an industry and a research field. What are some of the other issues behind how AI models are trained?

VST: There are a number of problems that can come up in AI training. A related problem is known as a “data-mismatch” or “data-shift”. This happens when there’s some mismatch between the training data and the test data or the real world scenario. In simple terms, when you train a network, you’re trying to expose it to the variability that might exist in the real world scenario in order to teach the network how to solve the problem. For example, in speech recognition, models can be trained with clean speech or with specific kinds of noise, such as for smart home settings. However, if we take a model that was trained using speech in a smart home environment with background noise typical of that environment and apply it in a smart car, then that model won’t perform with the same accuracy. That’s because in a car there is driving noise and the noise of the highway; the general acoustic environment is very different. That is what is called “mismatch” or “data-shift” between the training environment and the deployment environment.

In some cases the data-shift is easy to detect, but in others it can be less obvious. For example, if you train your model to use a specific signal processing front-end that cleans the signal, removes noise, even gets rid of some unwanted characteristics of the signal, and then you try it on another front-end, there might be a mismatch between the two, depending on the differing specifications of the front-end pieces. So typically you have to do some sort of tuning, either of the front-end or of the speech recognition solution, to minimize that mismatch.
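
A minimal sketch of how such a mismatch might be surfaced in practice, assuming acoustic feature matrices (for example, MFCC frames) have already been extracted from the training environment and from the deployment environment. The feature arrays and the threshold mentioned in the comments are illustrative stand-ins, not Fluent’s actual pipeline.

```python
import numpy as np

def feature_shift(train_feats: np.ndarray, deploy_feats: np.ndarray) -> float:
    """Rough per-dimension shift score between two sets of acoustic features.

    Both arrays are (num_frames, num_coeffs), e.g. feature frames collected
    from smart-home training data and from in-car recordings. We compare the
    per-coefficient means, scaled by the training standard deviation, and
    return the average drift measured in "training standard deviations".
    """
    mu_train = train_feats.mean(axis=0)
    mu_deploy = deploy_feats.mean(axis=0)
    sigma_train = train_feats.std(axis=0) + 1e-8  # avoid division by zero
    return float(np.mean(np.abs(mu_deploy - mu_train) / sigma_train))

# Toy stand-ins for real feature matrices (40-dim features, 10k frames each).
rng = np.random.default_rng(0)
smart_home = rng.normal(0.0, 1.0, size=(10_000, 40))
in_car = rng.normal(0.6, 1.8, size=(10_000, 40))   # louder, broader noise

print(f"shift score: {feature_shift(smart_home, in_car):.2f}")
# A score well above ~0.5 (an illustrative threshold) would flag a likely
# train/deploy mismatch worth investigating before shipping the model.
```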

Another important issue, and one that is getting a lot of attention these days, is bias. Bias can be mostly avoided by carefully and vigilantly curating the data that is used for training and verifying a model. What makes a “good” training dataset is in part ensuring that the data comes from multiple sources across diverse backgrounds and groups, so that it appropriately represents the entire target demographic. This is actually a classic statistical sampling problem; people routinely use bias-removing techniques when, for example, sampling a large population for a survey or doing election polling. The results will be biased and misleading if the selected samples are not a true representation of the target population.

Cleaning data before feeding it to a model is incredibly important in order to avoid AI learning certain unsavoury aspects of human behavior, such as sexism and racism, which can be reflected in data from the internet. Similarly, a carefully selected test or evaluation dataset helps surface biases that might have been missed during model training. Finally, in the absence of a “good” training dataset, teaching the model specifically what to avoid will further help produce better results.
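
As a rough illustration of treating this as a sampling problem, the sketch below resamples a labelled dataset so that each demographic group appears in proportion to a target distribution. The group names and proportions are invented for the example and are not drawn from any Fluent dataset; in practice, collecting more data for under-represented groups is preferable to resampling the same recordings.

```python
import random
from collections import defaultdict

def rebalance(samples, target_share, total, seed=0):
    """Resample (with replacement) so each group matches its target share.

    samples      : list of (utterance, group) pairs
    target_share : dict mapping group -> desired fraction of the final set
    total        : desired size of the rebalanced dataset
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for utt, group in samples:
        by_group[group].append(utt)
    balanced = []
    for group, share in target_share.items():
        pool = by_group.get(group, [])
        if not pool:
            raise ValueError(f"no training data at all for group '{group}'")
        balanced.extend((rng.choice(pool), group) for _ in range(round(share * total)))
    rng.shuffle(balanced)
    return balanced

# Invented example: raw data over-represents one accent group 9-to-1.
raw = [("utt_a", "accent_A")] * 900 + [("utt_b", "accent_B")] * 100
balanced = rebalance(raw, {"accent_A": 0.5, "accent_B": 0.5}, total=1000)
```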

So underspecification is a very different challenge from data-shift or bias?

VST: One of the characteristics of underspecification that makes it different from other issues in AI training is that we don’t know that we’re missing something. With mismatch or data-shift, we can say very clearly, “Hey, this variable or condition is missing”; for example, the car background noise when using a model trained for smart home noise.

Whereas with underspecification, we don’t know what the network is missing in order to find the exact answer to the problem it is dealing with. It all has to do with how neural networks work, which in some sense resembles the way the human brain operates. A neural network has two main components: “neurons” and the mechanisms that connect them. Essentially, the neurons determine what to do with the information they receive. The connections between neurons determine how information flows through the network, and how that information flow is controlled within the network is the key to how the AI will behave.

The neurons (circles) determine what to do with the information that they receive. The connections (lines) determine how information flows through the network.

In the structure of a neural network, the role of the neurons is determined by an equation. The role of the connections is to determine what the flow of information should be; these are called the weights, or parameters, of the neural network. At the very beginning of the neural network building process, you have to give an “initialization value” to the weights. Deciding how to minimize the gaps in accuracy that emerge when people initialize their neural networks differently used to be a very active research area in AI. Those gaps have largely been reduced, but the issue remains an active research area, because initialization can still have a substantial impact on the accuracy of a neural network. Two identical neural networks, trained on the same data but with minor variations in their initialization weights, can behave completely differently in the real world. For example, with speech, we have found that if the subject speaks too fast, two neural network AI solutions trained on the same datasets can behave differently based on their initialization weights.
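
A toy illustration of this effect, using scikit-learn and a generic two-dimensional dataset rather than speech or anything Fluent ships: two identical small networks, trained on the same data and differing only in their random initialization seed, score almost identically on an in-distribution test set yet disagree on a noticeable fraction of shifted inputs.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Same data for both models.
X_train, y_train = make_moons(n_samples=2000, noise=0.15, random_state=0)
X_test, y_test = make_moons(n_samples=1000, noise=0.15, random_state=1)

# Identical architectures; only the initialization seed differs.
net_a = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=1).fit(X_train, y_train)
net_b = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=2).fit(X_train, y_train)

print("in-distribution accuracy:", net_a.score(X_test, y_test), net_b.score(X_test, y_test))

# "Deployment" data the training set never specified: the same problem with
# much heavier noise (a stand-in for, say, unusually fast speech).
X_shift, _ = make_moons(n_samples=1000, noise=0.45, random_state=2)
disagree = np.mean(net_a.predict(X_shift) != net_b.predict(X_shift))
print(f"models disagree on {disagree:.1%} of shifted inputs")
```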

Even with extensive efforts, it is difficult to develop a comprehensive test set that captures all the variabilities or the scenarios that the AI network will encounter in the real world. This is why it is hard to anticipate when unpredictable behaviour might occur in real world usage. So underspecification really hints at such unpredictable outcomes, or the unknown unknowns.

How can Fluent and the rest of the AI community approach their training processes to minimize these issues within the models?

VST: There are several aspects of training that we can fine-tune or tweak. One is to adjust the network initialization process. At Fluent, we use a popular and accepted method to make our initialization values and outcomes consistent.

But really, it’s all about stress testing. That is what is going to make the biggest impact. At Fluent, we have increased how much testing we do, and we really stress test the AI solutions under different conditions. Does it fire in silence? What happens when we say just the wake word and nothing else? Maybe you say the wake word and then something unrelated to the model, even just “blah blah blah”, and we want to see how the model behaves. We want to test stress conditions like different noise scenarios, fast speech, slow speech; all of these ultimately help us ascertain that our networks behave properly.
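
A bare-bones sketch of what such a stress harness can look like. Here `detect_wake_word` is a placeholder for a real detector, and the condition clips are synthetic arrays standing in for recorded audio; none of this reflects Fluent’s actual test suites.

```python
import numpy as np

def run_stress_suite(detect_wake_word, suite):
    """Count false accepts / false rejects per stress condition.

    detect_wake_word : callable taking a 1-D audio array, returning True/False
    suite            : dict mapping condition name -> list of (audio, should_fire)
    """
    for condition, clips in suite.items():
        false_accepts = sum(detect_wake_word(a) and not expected for a, expected in clips)
        false_rejects = sum(not detect_wake_word(a) and expected for a, expected in clips)
        print(f"{condition:<20} FA={false_accepts:3d}  FR={false_rejects:3d}  n={len(clips)}")

# Synthetic placeholder clips; in practice these would be recordings covering
# silence, wake word only, wake word plus babble, noise, fast and slow speech.
rng = np.random.default_rng(0)
silence = [(np.zeros(16000), False) for _ in range(50)]
babble = [(rng.normal(0, 0.1, 16000), False) for _ in range(50)]
wake = [(rng.normal(0, 0.3, 16000), True) for _ in range(50)]

dummy_detector = lambda audio: float(np.abs(audio).mean()) > 0.1  # placeholder model
run_stress_suite(dummy_detector, {"silence": silence, "unrelated babble": babble, "wake word": wake})
```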

Another important thing is that we use several metrics to track how a neural network behaves. We are not only looking at average performance over a test set; we are also interested in the margin of error when testing hundreds, or even thousands, of different variations of a command. Let’s say my model is performing at 97% accuracy on average; however, I’m 95% confident that my real error on real-world data is within plus or minus 2–5% of that. The more we test, the more confident we can be about the model’s behaviour. So we are also starting to do this kind of analysis.
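
For the margin-of-error part, here is a small sketch of the kind of interval one can report alongside a raw accuracy number, using the standard Wilson score interval for a binomial proportion; the counts are made up for illustration and are not Fluent results.

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for an accuracy measured as correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# Made-up numbers: 970 of 1000 command variations recognized correctly.
lo, hi = wilson_interval(970, 1000)
print(f"accuracy 97.0%, 95% interval [{lo:.1%}, {hi:.1%}]")

# With only 100 test utterances, the same 97% headline accuracy carries a much
# wider interval, which is exactly why more testing buys more confidence.
print(wilson_interval(97, 100))
```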

We are also starting to do a longtail test analysis, meaning we try to test for the many different and sometimes unconventional ways people might speak a command like “turn on light” (less common) vs “turn the light on” (more common). Someone might even say “it’s too dark in here” or “I can’t see properly” and the model is expected to react and increase the brightness. Looking at these commands that would be less commonly spoken could help improve accuracy.
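
One simple way to look at this longtail, sketched below with invented numbers: score each phrasing of a command separately and inspect the worst performers, rather than only the overall average.

```python
def worst_phrasings(per_phrasing_results, k=3):
    """per_phrasing_results: dict mapping phrasing -> (correct, total)."""
    accuracies = {phrase: c / t for phrase, (c, t) in per_phrasing_results.items()}
    return sorted(accuracies.items(), key=lambda kv: kv[1])[:k]

# Invented test counts for variations of a single "lights on" intent.
results = {
    "turn the light on": (495, 500),
    "turn on light": (180, 200),
    "it's too dark in here": (61, 100),
    "I can't see properly": (48, 100),
}
for phrase, acc in worst_phrasings(results):
    print(f"{acc:6.1%}  {phrase!r}")
```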

As we learn more about how challenging it truly is to train an AI model to function properly in all scenarios, it chips away at people’s trust in AI in general. What do you think the industry as a whole, or we here at Fluent, can do to build that trust with the people who are using the AI systems we build?

VST: There have been a lot of issues with AI and it’s primarily because I think people are expecting too much. Everybody has to take a step back; this technology is in its infancy. We are only just starting to scratch the surface of AI. There are a lot of things that aren’t fully developed yet. For example, with visual recognition, let’s say you have a chair that is sitting straight — most models would have no problem recognizing it. But if you have a chair that’s hanging upside down from a wire or in other odd positions, the model might not know how to process it; it thinks, “Is this still a chair?”.

The point is that, overall, people might be expecting too much too soon from AI, and this is where we see problems like bias in training when it comes to recognition across racial groups. You can minimize surprises by testing properly and by doing proper analysis. There is a spectrum of stress tests that can be applied: going beyond just average-case testing and doing margin-of-error, confidence, and control analysis. All of those things come into the picture and allow us to be more confident about the performance of our AI models. We should be careful not to deploy AI before it is sufficiently mature.
