Common AI Training or Data Training Problems That You Will Face Often

Lakshmi Prakash
Design and Development
5 min read · Dec 12, 2022

As someone who has been working on training AI for a few years now, I have learnt that some problems in this job occur repeatedly. Now that I have identified these common issues, I know that I need to be prepared for them, and I know, to some extent, how to prevent them as well.

In this post, I’d like to share these problems with you: issues that you will most likely face often while training your AI or model.

Not Enough Data: This is probably the most common problem you will face while training a model: a lack of data. Without data, what would a machine or AI learn from? From the simplest AI project to something as powerful as Google Search, data is involved in every one of these projects, as you probably know already. “How much data is enough?” is the natural next question, and a valid one. The answer varies from project to project. Think of research and thesis publications: not all researchers need or use the same fixed number of samples for their data collection, right?

Problems You Might Encounter with Data Training/Training Your AI

Not Having Relevant, Valid Data: In the era of “big data”, how can one not have enough data, one might ask. Well, when we say that there is not enough data, we mean there is not enough valid or relevant data. People might not be able to understand this right away, so let me explain using a few simple examples.

Let us say you are running a company and have a few hundred employees and thousands of customers. Let us assume that your employees have employee IDs and that your customers are uniquely identified in your database by user name or customer ID. Now, there is no way we can get this data from a commonly available dataset on the Internet. This is your data, to which only you have (and only you should have) access. You need to tell us the format of your employee IDs and the format of your customer IDs so that we can use valid regexes to train an AI for you. This was just a simple example, but I hope you got the point.
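To make the ID example a little more concrete, here is a minimal sketch in Python. The two formats (`EMP-` followed by five digits, `CUST-` followed by two letters and six digits) are invented purely for illustration; the real formats can only come from the client, which is exactly the point.

```python
import re

# Hypothetical ID formats, assumed for illustration only.
# The real patterns must be supplied by whoever owns the data.
EMPLOYEE_ID = re.compile(r"EMP-\d{5}")            # e.g. EMP-00421
CUSTOMER_ID = re.compile(r"CUST-[A-Z]{2}\d{6}")   # e.g. CUST-IN004211


def classify_id(value: str) -> str:
    """Return 'employee', 'customer', or 'unknown' for a candidate ID string."""
    value = value.strip()
    if EMPLOYEE_ID.fullmatch(value):
        return "employee"
    if CUSTOMER_ID.fullmatch(value):
        return "customer"
    return "unknown"
```

Anything that comes back as "unknown" is worth showing to the client early, since it usually means the format you were told is incomplete.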

Not Spending Enough Time on Data Cleaning: This is another problem that those working with data often face. It could be because they do not have enough time and are compelled to finish the task quickly. When you do not clean your data, it can lead to a lot of mistakes, wrong or false results, confusion, etc.

Seniors involved in the project must also understand the importance of data cleaning and data analysis. You don’t just collect data and input the data the way you got it.
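What "cleaning" means depends on the project, but a minimal sketch might look like the one below. It assumes each record is a dictionary with hypothetical `text` and `label` fields; your data will have its own fields and its own rules.

```python
def clean_records(records):
    """Trim whitespace, drop incomplete rows, and remove duplicates.

    Duplicates are detected ignoring case and surrounding whitespace.
    The 'text' and 'label' field names are assumptions for this sketch.
    """
    cleaned, seen = [], set()
    for row in records:
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip().lower()
        if not text or not label:
            continue  # incomplete row: nothing useful to learn from
        key = (text.lower(), label)
        if key in seen:
            continue  # same example already kept
        seen.add(key)
        cleaned.append({"text": text, "label": label})
    return cleaned
```

Even a small pass like this catches the empty rows and repeated examples that would otherwise quietly skew your results.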

Different People Working on and Making Changes to the Data or the Model: If different people have to work on a dataset that is used in a project, then there had better be some shared knowledge among them: who is authorized to work with the data, who is making what changes, whether the others are kept updated on the changes that have been made, etc. You don’t want different people working on a dataset, with nobody in the team knowing who does what, and suddenly one day, your AI starts showing different results, so someone has to “quickly fix the problem, ASAP, immediately!”. No.
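One lightweight way to keep everyone informed is to record a fingerprint of the dataset every time someone changes it. The sketch below is only one possible approach, assuming the dataset fits in memory as a list of JSON-serializable rows; the function names are made up for this example, and a proper version-control or data-versioning tool would do the job better on a real project.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Hash the dataset contents so any change produces a different value."""
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def log_change(log, rows, author, note):
    """Append who changed the data, why, and the resulting fingerprint."""
    entry = {
        "author": author,
        "note": note,
        "fingerprint": fingerprint(rows),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    log.append(entry)
    return entry


def changed_since_last_entry(log, rows):
    """True if the data no longer matches the last logged fingerprint."""
    return not log or log[-1]["fingerprint"] != fingerprint(rows)
```

If `changed_since_last_entry` returns True before a training run, someone edited the data without telling the team, and you find out before the model starts behaving differently rather than after.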

The Plan, The Solution Design, and The Data Available Not Being in Sync with One Another: When I encountered this problem a few times initially, I knew there was some problem somewhere that was glaring at me, but I couldn’t understand it. There was too much confusion internally and with the client, too. Everyone seemed to be confused regarding why there was no progress.

Once again, let me try to explain this using a different example. Let us say that, according to the data your client has provided, they have thousands of candidates classified into “men” and “women” (two classes), and that is all the data they have given you. Now, imagine that another person from the client’s team and your solutions architect or developers have worked out a plan for the AI, which produces results based on the individual’s education and place of birth. Yes, this again means that we don’t have valid, relevant data. But if your developers have already started building the product when you do not have enough data, and your data/ML team, on the other side, is working on data analysis, then when you reach the stage where you need to use the model in the product that has been built, all of you will be confused about what is present, what must be done, and where to even begin.

It is to avoid this kind of confusion that I suggest that the data analysts, ML team, and the solutioning team or developers sit down together from the very beginning of the project and make sure that they all have the information they need and that they are all on the same page with the client.
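A quick check like the following, run on day one, would have exposed the mismatch in the example above. It is a rough sketch: the field names (`gender`, `education`, `place_of_birth`) are assumptions taken from the example, and the records are assumed to be plain dictionaries.

```python
def missing_features(planned_features, records):
    """Return the planned features that are absent or empty in every record."""
    missing = []
    for feature in planned_features:
        has_value = any(row.get(feature) not in (None, "") for row in records)
        if not has_value:
            missing.append(feature)
    return missing


# The client's data only says whether each candidate is a man or a woman.
records = [
    {"name": "Candidate A", "gender": "woman"},
    {"name": "Candidate B", "gender": "man"},
]

# The solution design expects education and place of birth.
gaps = missing_features(["education", "place_of_birth"], records)
```

Here `gaps` lists both planned features, which tells the whole team that the plan and the data do not match before anyone starts building.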

Of course, the examples given here are all pretty simple ones, but I hope you get the point. Imagine working with a huge amount of data and highly complicated code, involving many, many APIs, and a model that takes several features into account. Problems that seem simple could easily multiply and give you a nightmare, so it is better to prevent them when you can, at the start of the project, rather than wait for silly little mistakes to snowball into a big bunch of interconnected issues, with the whole project failing when tested!

Lakshmi Prakash
Design and Development

A conversation designer and writer interested in technology, mental health, gender equality, behavioral sciences, and more.