Five Mistakes to Avoid When Training Your AI

Anand Tamboli® · Published in tomorrow++ · 7 min read · Apr 30, 2021

If you want your system to perform better, you must teach it better. And how do you verify that you have trained it well? By testing it better!

Most AI solutions learn in two different stages. The first stage of learning happens while working with controlled datasets and formulating base models. The second occurs on the go, or periodically, through user interactions in the form of feedback.

A sophisticated AI system will usually follow this two-stage learning mechanism, whereas a simple AI system may have only the formative stage. The two stages in which AI learns are:

1. Training

2. Feedback

The training stage is where you form the basic machine learning models for the first time and train the system using those models. The accuracy of these models is highly dependent on the training dataset.

In contrast, once you deploy the AI system, some (or all) of the incoming data can be fed back to the system for continuous learning, which we call feedback-stage learning. This stage is relatively vulnerable to ongoing risks: no matter how accurate your base models were, new data arriving through the feedback loop can cause them to readjust and become better, or worse. Much also depends on how the machine learns in each stage.

Avoid these five mistakes when training your AI

Meticulous teaching is the fundamental requirement for consistently excellent performance. However, there are a few costly mistakes to which traditional (and contemporary) data-analytics or statistical methods are susceptible. Collectively, these issues are often described as “garbage in, garbage out.”

1. Not having enough data to train

How much data does one need to train an AI system effectively? Well, it depends!

It’s not the answer you would expect when you are at the pointy end of your machine learning effort. Nevertheless, it does depend on the complexity of your problem and of the algorithm you plan to use. Either way, the best approach is empirical investigation: experiment and arrive at an optimal number.

You may want to use standard sampling methods to collect the required data, and you may be tempted to use the sample-size calculators found in standard statistical analysis tools. However, due to the nature of machine learning algorithms, the amount of data they suggest is often insufficient. You will most likely need more than what a standard sample-size formula tells you.
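To make the empirical approach concrete, here is a minimal sketch (assuming a scikit-learn setup; the synthetic dataset stands in for your real features and labels) that grows the training set step by step and watches where validation accuracy stops improving:

```python
# Empirical investigation of "how much data is enough": fit the model on
# growing slices of the data and watch where validation accuracy plateaus.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; replace with your real features and labels.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} samples -> validation accuracy {score:.3f}")
# When adding more samples stops improving validation accuracy,
# you have a rough empirical estimate of "enough" data.
```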

Having more data is rarely as big a problem as having too little. You have to make sure there is enough data to reasonably capture the relationships among input parameters (a.k.a. features) and between inputs and outputs.

You may also use your domain expertise to assess how much data is needed to exhibit a full cycle of your business problem; it should cover all possible seasonality and variations.

The model developed from this data will only be as good as the data you provide for training, so make sure it is adequately available. If you feel the data is not enough, which may be a rare scenario in today’s big-data world, don’t rush. Wait until you have enough of it.

2. Not cleaning and validating the dataset

Too much data is of no use if it is of poor quality, which can mean one or more of the following three things:

1. Data has noise, i.e., too much conflicting and misleading information: confounding variables or parameters are present, while essential variables are missing. Cleaning this type of data typically requires collecting additional data points, because on its own it is unusable and insufficient.

2. It is dirty data, i.e., several values are missing (though the parameters are available), or the data has inconsistencies, errors, or a mix of numerical and categorical values in the same column. This type of data needs careful manual cleaning by subject matter experts and may often need re-validation. Depending on resource availability, you may find it easier to obtain additional data instead of cleaning the dirty data.

3. Inadequate or sparse data is the scenario where very few data points have actual values, and a significant part of the dataset is full of nulls or zeroes.

The type of issue present in a dataset is often not apparent from the dataset itself, which is why I always recommend applying exploratory analysis and visualization at the outset. Doing this first pass gives you a level of confidence in data quality and can tell you if something is amiss.
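Parts of this first pass can be automated before plotting anything. The sketch below (using pandas; the file name, column choices, and thresholds are hypothetical) screens for the three issues above: missing values, mixed types in one column, and sparsity:

```python
# A first-pass data-quality scan, sketched with pandas.
import pandas as pd

df = pd.read_csv("loan_applications.csv")  # hypothetical input file

# 1. Dirty data: share of missing values per column.
print(df.isna().mean().sort_values(ascending=False))

# 2. Mixed numerical/categorical values hiding in one column.
for col in df.columns:
    kinds = df[col].map(type).nunique()
    if kinds > 1:
        print(f"{col}: {kinds} different value types")

# 3. Sparse data: share of rows that are mostly nulls or zeroes.
mostly_empty = ((df.isna() | (df == 0)).mean(axis=1) > 0.8).mean()
print(f"{mostly_empty:.1%} of rows are >80% null/zero")

# Follow up with visualization (histograms, df.describe()) and ask:
# do you see what you expected to see?
```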

Based on the visual representation, an interesting question would be — do you see what you expected to see?

If the answer is “No,” then your data may be of poor quality and needs cleaning.

If the answer is “Yes,” it might even yield some preliminary insights. This validation of the dataset is essential before you proceed, and you should never skip it.

3. Not having enough spread in data

Having a large amount of data is not always a good thing unless it represents all the possible use cases and scenarios. If the data lacks variety, it can lead to problems down the road: you increase the chances of missing low-frequency, high-risk scenarios.

There is a point of diminishing returns for traditional predictive analysis as you obtain more and more data for training. Your data science team can often spot this point empirically.

However, since machine learning is an inductive process, your base model can only cover what it has seen in the data. If you miss the long tail, a.k.a. edge cases, your model will not support them, which simply means your AI will fail when those scenarios occur. That is the most crucial reason your training data should have enough spread to represent the real population.
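One simple way to check the spread, sketched below, is to compare the category mix of your training data against a reference sample of the wider population and flag underrepresented groups. The file names, column names, and the 50% threshold are all assumptions for illustration:

```python
# Compare training-data coverage against the population it should represent.
import pandas as pd

train = pd.read_csv("train.csv")            # hypothetical training sample
population = pd.read_csv("population.csv")  # hypothetical reference sample

for col in ["region", "product_type"]:      # hypothetical categorical features
    train_share = train[col].value_counts(normalize=True)
    pop_share = population[col].value_counts(normalize=True)
    comparison = pd.DataFrame({"train": train_share, "population": pop_share})
    # Categories present in the population but absent (NaN) or severely
    # underrepresented in training are long-tail cases the model won't learn.
    flagged = comparison["train"].fillna(0) < 0.5 * comparison["population"]
    print(comparison[flagged])
```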

4. Ignoring near-misses and overrides

During initial training, it is hard to identify near-misses and disregarded data points. However, in a continuous learning loop with feedback, it becomes essential to pay close attention to near-misses and to human or machine overrides.

When you first deploy your AI system, only the base model governs its performance. As operation continues, however, the feedback loop feeds in live data, and the system starts to adjust, either live or at regular intervals.

If the model misses predicting or calculating an output by just a little, and the decision changes as a result, that is a near-miss. For example, in a loan approval system where a score of 88.5% means “loan approved” but 88.6% results in “loan declined,” any decision made that close to the boundary is a near-miss. From a purely technical and statistical point of view, the decision is correct; from a real-life perspective, however, the margin of error may play a significant role. If contested by the affected party, such as a loan applicant, the chances of the decision changing are high. These data points are therefore of particular interest, and you should not ignore them.
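A minimal sketch of how such near-misses could be flagged is shown below. It assumes the model outputs a risk score where crossing 88.6% flips the decision to “declined”; the cut-off and margin are illustrative values, not from any real system:

```python
# Flag decisions that fall within a small margin of the decision boundary
# so they can receive special attention in the feedback loop.
THRESHOLD = 0.886
MARGIN = 0.005  # how close to the cut-off still counts as a near-miss

def decide(risk_score: float) -> dict:
    """Return the decision plus a near-miss flag for the feedback loop."""
    decision = "declined" if risk_score >= THRESHOLD else "approved"
    near_miss = abs(risk_score - THRESHOLD) < MARGIN
    return {"score": risk_score, "decision": decision, "near_miss": near_miss}

print(decide(0.885))  # approved, but only 0.001 below the cut-off: near-miss
print(decide(0.700))  # approved with a comfortable margin: not a near-miss
```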

The same applies when a human operator supervising the AI system’s output decides to override it. You must treat an operator override as a special-case scenario and feed it back into the training model. Such scenarios either highlight inadequacies in the base model or capture new situations that never existed before. Ignoring overrides can degrade model performance over time.
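One way to make sure overrides actually reach the retraining pipeline is to log each one as a labeled special case. The sketch below is an illustration under assumptions; the record fields and the overrides.jsonl file are hypothetical:

```python
# Capture operator overrides as special-case training data.
import json
from datetime import datetime, timezone

def log_override(features: dict, model_output: str, human_output: str) -> None:
    """Append an override event so it can be fed back into retraining."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "model_output": model_output,
        "human_output": human_output,
        "label_source": "human_override",  # weight these cases during review
    }
    with open("overrides.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

log_override({"income": 54000, "score": 0.885}, "declined", "approved")
```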

5. Conflating correlation and causation

In statistics, we often say that “correlation does not imply causation.” It refers to the inability to legitimately deduce a cause-and-effect relationship between input variables and output. The resulting conclusion may still be correct, but the failure to establish this relationship is often an indicator of a lurking problem.

Similarly, your model’s predictive power does not necessarily imply that you have established a true cause-and-effect relationship. Your model may very well be picking up correlations among input parameters and predicting the output based on those.

You may think, “As long as it works, it shouldn’t matter.” However, the distinction matters, since many machine learning algorithms pick up on parameters simply because they are highly correlated. Inferring causality from correlations is tricky and can lead to contradictory conclusions. It is much better to be able to show that a causal relationship truly exists.

These days, however, developers and data scientists often rely on statistical patterns alone. Many of them fail to recognize that those patterns are merely correlations within vast amounts of data, not causal truths or natural laws governing the real world.

So, how do you deal with conflation?

Try this: during initial training and model building, when you find a correlation, don’t conclude too quickly. Take time to look for other underlying factors, find the hidden ones, verify whether they hold up, and only then conclude.
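A toy simulation shows why this matters. In the sketch below (all data is synthetic), a hidden factor z drives both x and y, so x and y correlate strongly even though neither causes the other, and controlling for z makes the apparent relationship vanish:

```python
# Demonstration of a lurking confounder: x and y correlate only because
# a hidden common cause z drives both of them.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)          # hidden common cause
x = 2 * z + rng.normal(size=10_000)  # x depends on z, not on y
y = 3 * z + rng.normal(size=10_000)  # y depends on z, not on x

print(f"corr(x, y) = {np.corrcoef(x, y)[0, 1]:.2f}")  # high, roughly 0.85

# Regress z out of both variables and correlate the residuals: the
# apparent x-y relationship disappears, exposing the confounder.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
print(f"partial corr given z = {np.corrcoef(x_resid, y_resid)[0, 1]:.2f}")  # ~0.00
```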

The basis for trust

As technology improves day by day, it places powerful tools in the hands of people who do not understand how they work. This creates significant business as well as societal risks.

Developers and data scientists are increasingly detached from the intricacies of the tools they use and the systems they create.

“An AI system is a black box” is becoming commonly accepted rhetoric. The only sure-fire way to trust this black box is to train it meticulously and test it rigorously!

Note: This article is part 8 of a 12-article series on AI. The series was first published by EFY magazine last year and is now also available on my website at https://www.anandtamboli.com/insights.
