Five mistakes to avoid when training your AI system

If you want your system to perform better, you ought to teach it better. And how do you verify that you have trained it well? By testing it better!

Anand Tamboli®
Aug 31, 2019 · 7 min read

Meticulous teaching is the fundamental requirement for consistently excellent performance. However, traditional (and contemporary) data analytics and statistical methods are susceptible to a few damaging mistakes. Collectively, such issues are often known as garbage in, garbage out.

Here are five mistakes that you should avoid when training your AI system.


1. Not having enough data to train

You may ask, “How much data does one need to train the AI system effectively?”

I’d say, “It depends!”.

It may be an unsatisfying answer, especially if you are at the sharp end of your machine learning project. Nevertheless, it does depend on the complexity of the problem at hand as well as the complexity of the algorithm you plan to use. Either way, the best approach is to investigate empirically and arrive at an optimal number.
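One common way to run that empirical investigation is a learning curve: train on growing slices of the data and watch the error on a held-out set. The sketch below is only an illustration under my own assumptions: the data is synthetic (y = 2x + 1 plus noise), the model is a plain least-squares line, and the names `fit_line`, `learning_curve` and `make_points` are hypothetical helpers, not part of any library.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, points):
    """Average squared prediction error of the fitted line on `points`."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

def learning_curve(train, validation, sizes):
    """Validation error after training on the first `size` rows, for each size."""
    return [(size, mean_squared_error(fit_line(train[:size]), validation))
            for size in sizes]

# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)

def make_points(count):
    return [(x, 2 * x + 1 + rng.gauss(0, 1))
            for x in (rng.uniform(0, 10) for _ in range(count))]

train, validation = make_points(1000), make_points(200)
curve = learning_curve(train, validation, [5, 20, 100, 500, 1000])
for size, error in curve:
    print(f"{size:>5} training rows -> validation MSE {error:.3f}")
```

When the validation error stops improving as the training size grows, more data of the same kind is unlikely to help much; if it is still falling at the largest size, you probably do not have enough yet.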

You may want to use standard sampling methods to collect the required data, along with standard sample size calculators of the kind found in statistical analysis tools. However, due to the nature of machine learning algorithms, the amount of data these suggest is often insufficient. You will most likely need more than a standard sample size formula tells you.

Having too much data is rarely as big a problem as having too little. You have to make sure that there is enough data to reasonably capture the relationships that might exist among the input parameters (a.k.a. features) and between inputs and outputs.

You may also use your domain expertise to reasonably assess how much data is enough to exhibit a full cycle of your business problem. It should cover all the possible seasonality and variations.

The model developed from this data will only be as good as the data you provide for training, so make sure that enough of it is available. If you feel that the data is not enough, which may be a rare scenario in the current big-data world, don’t rush; wait until you have enough of it.

2. Not cleaning & validating the dataset

A large amount of data is of no use if it is of poor quality. Poor quality can mean one or more of the following three things:

  1. Data has noise, i.e. there is too much conflicting and misleading information. Confounding variables or parameters are present, and essential variables are missing. Cleaning this type of data requires additional data points because the current set is unusable and hence not enough.
  2. It is dirty data, i.e. several values are missing (though the parameters are available), or the data has inconsistencies, errors, and a mix of numerical and categorical values in the same column. This type of data needs careful manual cleaning by subject matter experts and may often need re-validation. Depending on resource availability, you may find it easier to obtain additional data than to clean dirty data.
  3. The data is inadequate or sparse, i.e. very few data points have actual values, and a significant part of the dataset is full of nulls or zeroes.

The types of issues present in a dataset are often not clear from the dataset itself, which is why I always recommend applying exploratory analysis and visualisation at the outset. This first pass not only gives you a level of confidence in data quality but can also tell you if something is amiss.
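A first pass does not have to be elaborate. As a minimal sketch, assuming the dataset is a list of dictionaries in which every row carries every column, the hypothetical helper `profile_columns` below reports the share of missing values per column and whether a column mixes value types. The sample rows and the set of values treated as "missing" are made up for illustration.

```python
def profile_columns(rows, missing=(None, "", "NA")):
    """Summarise each column: share of missing values and whether types are mixed."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(name, {"missing": 0, "total": 0, "types": set()})
            stats["total"] += 1
            if value in missing:
                stats["missing"] += 1
            else:
                stats["types"].add(type(value).__name__)
    return {
        name: {
            "missing_share": stats["missing"] / stats["total"],
            "mixed_types": len(stats["types"]) > 1,
        }
        for name, stats in columns.items()
    }

# Made-up sample rows: "age" mixes numbers and text, "income" is half empty.
rows = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": "thirty", "income": None, "region": "south"},
    {"age": 41, "income": None, "region": ""},
    {"age": 29, "income": 61000, "region": "north"},
]
report = profile_columns(rows)
for name, summary in report.items():
    print(name, summary)
```

Here the report flags `age` as a mixed-type column and shows that half of the `income` values are missing, which are the dirty and sparse patterns described in the list above.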

Based on the visual representation, an interesting question to ask is: do you see what you expected to see?

If the answer is “No”, then your data may be of poor quality and needs cleaning.

If the answer is “Yes”, the data might be useful in finding some preliminary insights. This validation of the dataset is essential before you proceed, and you should never skip it.

3. Not having enough spread in data

Having a large amount of data is not always a good thing unless it represents all the possible use cases or scenarios. If the data lacks variety, it can lead to problems in the future: you increase the chance of missing low-frequency, high-risk scenarios.

For traditional predictive analysis, there is a point of diminishing returns as you obtain more and more data for training. Your data science team can often spot this point empirically.

However, since machine learning is an inductive process, your base model can only cover what it has seen in the data. So, if you miss long-tail cases (a.k.a. edge cases), your model will not support them. It simply means your AI will fail when such a scenario occurs. That is the most crucial reason why your training data should have enough spread to represent the real population.
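A simple way to check spread is to list the scenarios you expect to meet in production and count how often each one appears in the training data. The sketch below assumes each training row is already tagged with a scenario label; the helper `underrepresented`, the scenario names and the minimum count of 30 are all hypothetical choices for illustration.

```python
from collections import Counter

def underrepresented(observed, expected, min_count):
    """Return expected scenarios that appear fewer than `min_count` times."""
    counts = Counter(observed)
    return {scenario: counts[scenario]
            for scenario in expected
            if counts[scenario] < min_count}

# Made-up scenario labels attached to 1,000 training rows.
observed = ["card_present"] * 950 + ["online"] * 48 + ["chargeback"] * 2
expected = ["card_present", "online", "chargeback", "account_takeover"]

gaps = underrepresented(observed, expected, min_count=30)
print(gaps)  # {'chargeback': 2, 'account_takeover': 0}
```

A thousand rows looks like plenty, yet one expected scenario is almost absent and another is missing entirely. Those are the long-tail cases the model will not have learned.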

4. Ignoring near-misses and overrides

During initial training, it is hard to identify near-misses and disregarded data points. However, in a continuous learning loop with feedback, it becomes essential to pay close attention to near-misses and to human or machine overrides.

When you deploy your AI system for the first time, it has only a base model that governs its performance. However, as the system continues to operate, the feedback loop feeds in live data and the system starts to adjust, either live or at regular intervals.

If the model misses the correct prediction or calculation by only a small amount, and the decision changes as a result, it is a near-miss. For example, in a loan approval system, if a score of 88.5% means “loan approved” and 88.6% results in “loan declined”, then this scenario is a near-miss. From a technical and purely statistical point of view, the decision is correct; however, from a real-life perspective, the margin of error may play a significant role. If the affected party, such as the loan applicant, contests the decision, the chances of it being changed are higher. Therefore, these types of data points are of particular interest, and you should not ignore them.
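In practice, flagging near-misses can be as simple as checking how close a score sits to the decision cut-off. This is a minimal sketch, not a prescribed method: the cut-off of 88.55 (placed between the two scores in the example) and the margin of 0.5 are assumptions of mine, and `is_near_miss` is a hypothetical helper.

```python
def is_near_miss(score, threshold, margin):
    """True when the score is within `margin` of the decision threshold."""
    return abs(score - threshold) <= margin

THRESHOLD = 88.55  # hypothetical cut-off between the two scores in the example
MARGIN = 0.5       # hypothetical width of the "too close to call" band

scores = [70.0, 88.5, 88.6, 95.2]
flagged = [score for score in scores if is_near_miss(score, THRESHOLD, MARGIN)]
print(flagged)  # [88.5, 88.6]
```

Flagged cases can then be routed to a human reviewer or logged for the next round of training, rather than being treated like any other clear-cut decision.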

The same applies when a human operator supervises the AI system’s output and can decide to override it. A human operator overriding the output of an AI should always be treated as a special-case scenario, and you must feed it back into the training of the model. Each of these scenarios either highlights inadequacies in the base model or provides new situations that never existed before. Ignoring overrides can degrade the model’s performance over time.
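Feeding overrides back starts with capturing them. As a rough sketch, assuming each decision is logged with both the model’s output and the final outcome after human review, the overridden cases can be pulled out as new labelled examples. The `Decision` record and the `override_examples` helper are hypothetical names, and the loan data is made up.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    features: dict
    model_output: str
    final_output: str  # what was actually decided after human review

def override_examples(decisions):
    """Return (features, human label) pairs for every overridden decision."""
    return [(d.features, d.final_output)
            for d in decisions
            if d.final_output != d.model_output]

# Made-up decision log for a loan approval system.
log = [
    Decision({"score": 91.0}, "declined", "declined"),
    Decision({"score": 88.6}, "declined", "approved"),  # operator override
    Decision({"score": 72.3}, "approved", "approved"),
]
retraining_rows = override_examples(log)
print(retraining_rows)  # [({'score': 88.6}, 'approved')]
```

How these rows are weighted or reviewed before retraining is a separate design decision; the point is simply that they are recorded and not lost.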

5. Conflating correlation and causation

In statistics, we often say, “correlation does not imply causation”. It usually refers to the inability to legitimately deduce a cause-and-effect relationship between input variables and output. The resulting conclusion may still not be incorrect or false, but the failure to establish this relationship is often an indicator of a lurking problem.

Similarly, the predictive power of your model does not necessarily imply that you have established an exact cause-and-effect relationship in your model. Your model may very well be conflating correlation with causation and predicting output on that basis.

You may think, “As long as it works, it shouldn’t matter”. However, the distinction matters because many machine learning algorithms pick up on parameters simply because there is a high correlation. Determining causality based on correlations can be very tricky and can potentially lead to contradictory conclusions. It would be much better to be able to prove that there truly is a causal relationship.

However, these days, many developers and data scientists rely merely on statistical patterns. Many of them fail to recognise that those patterns are only correlations within vast amounts of data, rather than causative truths or natural laws that govern the real world.

So, how do you deal with it?

Try this: during initial training and model building, don’t jump to a conclusion as soon as you find a correlation. Take time to look for other underlying or hidden factors, verify whether they hold, and only then conclude.
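One way to test a suspected hidden factor is a partial correlation: measure how strongly two variables move together once the third has been accounted for. The sketch below is an illustration on synthetic data of my own making, in which `x` and `y` never influence each other and are both driven by a hidden factor `z`. The helpers `pearson` and `partial_correlation` are written out by hand here and are not taken from any library.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic data (an assumption): z drives both x and y; x never affects y.
rng = random.Random(42)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [value + rng.gauss(0, 0.5) for value in z]
y = [value + rng.gauss(0, 0.5) for value in z]

raw = pearson(x, y)
adjusted = partial_correlation(x, y, z)
print(f"correlation of x and y:            {raw:.2f}")
print(f"same, after controlling for z:     {adjusted:.2f}")
```

The raw correlation between `x` and `y` is strong, but it largely disappears once `z` is controlled for. A model trained on `x` alone would predict `y` well and still tell you nothing about what causes it. This only works for factors you have thought to measure, which is why domain knowledge matters at this step.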


Conclusion

If you have to trust someone with their performance, there are two ways to do it. The first is to train them effectively so that their performance is assured.

The second, if you doubt the training even a little, is to test them rigorously to ensure better performance. Moreover, if you do both, i.e. train meticulously and test rigorously, you can be confident about the performance, and it forms a better basis for trust.


About the Author: I am many things packed inside one person: a serial entrepreneur, an award-winning published author, a prolific keynote speaker, a savvy business advisor, and an intense spiritual seeker. I write boldly, talk deeply, and mentor startups, empathetically.

If you liked this article, subscribe to my newsletter for more such articles and connect with me on LinkedIn.

tomorrow++

It is time we started thinking beyond tomorrow…

Anand Tamboli®

Written by

Award-Winning Author • Keynote Speaker • Transformation Specialist • Tech Futurist ⋆ https://www.anandtamboli.com
