Why the Data You Use Is More Important Than the Model Itself

A short read to keep in mind when you build your next ML model

Mahak Agarwal
The Startup
5 min read · Jul 7, 2020


Photo by Kevin Ku on Unsplash

Suppose you decide to work on an image classification problem. There are several datasets available, each consisting of thousands of images, that you can use for your training and test sets. You choose one according to the computation power you have available. All is well: you preprocess the data, scaling features or encoding categorical variables as numbers, to make it suitable for the model. You even try different models to see which performs better, but somehow, when you deploy the model in a real-world setting, it seems to falter.

If you have ever experienced this, please read on.

More often than not, we don't have the option of collecting the data we want to work with. If you're not working for a company (say, you're building up your skills in ML or studying something relevant at a university), you will have to use datasets already compiled by a third party.

I have divided this article into two scenarios according to the options you have when you are working on a problem:

  • Case I: You have the option of collecting the data.
  • Case II: You use an already made dataset.

Case I

When you do have the freedom of collecting the data for the problem at hand, then the most important thing to keep in mind apart from the objective is to take steps to prevent data leakage.

Although there isn't a single, clear definition that covers all of its cases, I would like to convey it as clearly as I can in simple words:

Data leakage occurs when your training process has access to information it should not have: information about one or more attributes of the test set, information that indirectly reveals the target variable, or features that would not normally be available at the time a prediction is made but that hint at the target class or value.

Data leakage adversely affects your results and can lead to poor real-world performance even when you use decent models and regularization techniques.
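To make this concrete, here is a minimal, self-contained sketch (my own illustration, not taken from any specific project) of one of the most common ways leakage creeps in: fitting a scaler on the whole dataset before cross-validation, which lets statistics from the validation rows influence training. Wrapping the preprocessing in a scikit-learn pipeline keeps each fold clean.

```python
# A minimal sketch of preprocessing leakage (illustrative toy data only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Leaky: the scaler is fitted on ALL rows, so statistics from the
# validation folds influence the features the model trains on.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Cleaner: put the scaler inside a pipeline so it is re-fitted on the
# training folds only, within each cross-validation split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print("with leakage:   ", leaky_scores.mean())
print("without leakage:", clean_scores.mean())
```

The gap here is small because the toy data is easy, but on real datasets the leaky version can look deceptively good.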

There are other ways, too, in which data leakage can find its way into your model even after you have taken all the obvious precautions to prevent it. These are often far more subtle and can be overlooked even by experienced professionals. Examples of such cases, along with some methods to follow during data collection, are highlighted in this paper. One key method I would like to highlight from it is to timestamp the data as much as possible. This helps ensure that you are not training on features that would only become available after the target value has to be predicted.
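In practice, timestamps make it easy to split the data chronologically so that the model never trains on records that come after the ones it is evaluated on. A rough sketch (the column names below are hypothetical, purely for illustration):

```python
# A rough sketch of a time-based split; "timestamp", "feature", and "target"
# are hypothetical column names used only for illustration.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=10, freq="D"),
    "feature": range(10),
    "target": [0, 1] * 5,
})

df = df.sort_values("timestamp").reset_index(drop=True)

# Train on the earliest 80% of records, evaluate on the most recent 20%,
# so nothing in the training set postdates the test period.
split_idx = int(len(df) * 0.8)
train, test = df.iloc[:split_idx], df.iloc[split_idx:]

print(train["timestamp"].max(), "<", test["timestamp"].min())
```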

When data leakage exists, the model might perform well on the training set, and even on the test set, yet perform poorly once deployed in a real-world setting. A suspiciously optimistic score can be due to data leakage, although we rarely come to that conclusion and often just label it overfitting.

Case II

You have finalized one or more datasets based on the problem and the features you want, and you are ready to work on your problem.

Even after trying different models, tuning hyperparameters, and performing cross-validation, you see that the model is not performing as you expected it to. There could be a few reasons for this, one of them being that the dataset is imbalanced.

Imbalanced datasets are common when the target class occurs rarely, as in tumor detection in medicine. One of the most common examples is the credit card fraud detection dataset: more than 99% of its instances belong to the majority class (no fraud) and less than 1% to the minority class (fraud). In most of these datasets, it's relatively easy to count the instances of each target value and then apply appropriate procedures to overcome the imbalance.
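That check is usually just a couple of lines, for instance with pandas (assuming the credit card fraud data is saved locally as creditcard.csv and has its usual Class column, where 1 marks fraud):

```python
# Quick check of the class balance; assumes the credit card fraud CSV is
# available locally and has its usual "Class" column (1 = fraud).
import pandas as pd

df = pd.read_csv("creditcard.csv")
print(df["Class"].value_counts())                # raw counts per class
print(df["Class"].value_counts(normalize=True))  # fraction of each class
```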

For example, to address the imbalance you could:

  • undersample the majority class,
  • oversample the minority class, or
  • generate synthetic data (using the SMOTE technique, for example).

Detailed information on these methods is beyond the scope of this article and is available here.
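Still, as a rough sketch of what these three options look like in code (using the imbalanced-learn library, which is my own choice here, not a requirement):

```python
# A minimal sketch of undersampling, oversampling, and SMOTE on toy data,
# using imbalanced-learn (installed via `pip install imbalanced-learn`).
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy data: roughly 99% of one class and 1% of the other.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)

# Resample only the training split; the test set should keep its original
# distribution so that evaluation reflects real-world conditions.
```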

Now, these methods certainly do work in some cases, but imbalance is harder to handle in image processing, where a dataset may consist of thousands of images along with some accompanying information. Sometimes that accompanying information is enough to make sense of the data, because it records the number of instances of each category; in other cases it is not, which makes it difficult to tell not only whether the dataset is balanced but also whether it is fit for our problem.

Another practice to follow when facing an imbalanced dataset is to consider metrics other than accuracy. On platforms like LinkedIn and GitHub, where people describe their models and projects, the first instinct to prove that a model performs well is to mention its accuracy. I honestly don't understand why.

Accuracy can be a good metric, but it should never be reported alone; other metrics like precision, recall, or the F1 score should be mentioned along with it, depending on the objective.

They not only give more in-depth information on the performance of the model but also remove any false hopes that accuracy alone could project, hence increasing credibility.
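A tiny made-up example shows why: a model that always predicts the majority class scores 90% accuracy while never catching the rare class at all.

```python
# A tiny made-up example: accuracy looks great, but precision and recall
# reveal that the rare class is never detected.
from sklearn.metrics import accuracy_score, classification_report

y_true = [0] * 9 + [1]   # 90% negatives, 10% positives
y_pred = [0] * 10        # a "model" that always predicts the majority class

print("Accuracy:", accuracy_score(y_true, y_pred))            # 0.9
print(classification_report(y_true, y_pred, zero_division=0))
# Precision and recall for class 1 are both 0.0, which accuracy alone hides.
```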

One other problem that might arise from using an already made dataset is a difference in objectives. It is possible that the dataset was compiled for a different objective than the one you are using it for. Even if the date-time interval or other information aligns with your problem statement, that does not necessarily mean the dataset should be used. This difference can be crucial, as it could lead to leakage in some way or to underperforming models. An example of this would be using a dataset like Flickr8k for classifying objects in an image. Although it is a collection of over 8,000 images and could be used for training, the actual objective of the dataset is different: each picture is labeled with five captions that describe the image. It could be an ideal dataset for something like image caption generation, but not for object classification.

Conclusion

In conclusion, I would like to highlight that data plays a far more important role than just being the starting point of your project. Before jumping into the problem, it is important to spend a considerable amount of time analyzing all the files of a dataset, or the data you have collected. Doing so gives a lot more insight, which can be useful for feature selection and model building.

Please do comment with your views on the content of this article; I would be more than happy to read them :)
