Dataset & Dataset Splitting in Machine Learning

Md. Asifur Rahman
Nov 22, 2022 · 4 min read


Basic Concept of Dataset

In simple terms, a dataset is a collection of data. A dataset has multiple rows and columns.

Columns are referred to as Features or Attributes.

Rows are referred to as Data Samples or Records.

Example of a dataset:

Here, PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked are called Features/Attributes.

And the information arranged horizontally, from left to right, is called a Data Sample/Record. For example, the entire second row is one Data Sample.

The Distribution of Dataset

We can tell whether a dataset is normally distributed by plotting a histogram and comparing its shape with the Gaussian (normal) distribution.

To be considered normally distributed, the dataset has to form a bell-shaped, symmetrical curve centered around the mean.

Here is a visual example of a normal distribution (bell-shaped curve):

Here is an example of a Histogram:

In a normal distribution, the position and shape of the histogram are determined by the mean ‘m’ and the standard deviation ‘s’, respectively. Here the term “position” means the location of the center of the “bell” along the horizontal axis. The bell is centered at the mean (which, in a normal distribution, is also the value of the median and the mode). The term “shape” means the spread or width of the bell-shaped curve, which is a measure of the variability in the data. In other words, the standard deviation s measures how spread out the data are from the center of the data, i.e. the mean. Because m can be any real number and s can be any positive real number, there are infinitely many unique normal distributions.
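A minimal sketch of this idea, assuming NumPy is available: we draw samples from two normal distributions and confirm that the sample mean locates the center of the bell while the standard deviation measures its spread.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two normal distributions with different means (position of the bell)
# and different standard deviations (spread/shape of the bell).
a = rng.normal(loc=0.0, scale=1.0, size=100_000)   # m = 0, s = 1
b = rng.normal(loc=5.0, scale=2.0, size=100_000)   # m = 5, s = 2

# The sample mean locates the center of each bell ...
print(a.mean(), b.mean())   # ≈ 0.0 and ≈ 5.0
# ... and the sample standard deviation measures its width.
print(a.std(), b.std())     # ≈ 1.0 and ≈ 2.0
```

Plotting `a` and `b` with a histogram (e.g., `matplotlib.pyplot.hist`) would show two bells: one centered at 0 and narrow, one centered at 5 and twice as wide.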

Summary:

If the histogram forms a bell-shaped, symmetrical curve, the distribution is normal.

Dataset Splitting

There is a myth in Machine Learning that the model has to be trained with 80% of the data and then tested with the remaining 20%. But this idea is incomplete: behind the scenes, some of the data has to be set aside for Validation.

Most of us assume that Test data and Validation data are the same, but that is not the case. We can understand the difference by understanding how Epochs work in Machine Learning.

Epochs

An Epoch: one complete pass of the entire training data through the Machine Learning algorithm is called an epoch.

But since the full training set is usually too big for the computer to process at once, each epoch is divided into several smaller batches. Training typically runs for many epochs, which is why we see epoch: 1, epoch: 2, and so on.
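A minimal sketch of how one epoch is broken into batches (the helper `batches` and the toy data are illustrative, not part of any library):

```python
def batches(data, batch_size):
    """Yield successive batches from `data`; one epoch is one full pass."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(10))   # toy stand-in for 10 training samples

# One epoch = every sample seen exactly once, in smaller batches.
print([batch for batch in batches(data, batch_size=4)])
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Running this loop again over the same data would be epoch 2, and so on.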

In every epoch, the parameters of the model change in order to achieve a higher accuracy score. Behind the scenes, during epoch: 1 the model learns from the Train data, and at the end of epoch: 1 it evaluates itself on the Validation data (not the Test data). So the accuracy score we get after the completion of every epoch is the Validation accuracy (not the Test accuracy).

We have to stop the epochs before the Validation accuracy peaks and begins to drop. If we keep training in pursuit of 100% validation accuracy, the model will be overfitting.
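This stopping rule is usually called early stopping. A minimal sketch, assuming we already have the per-epoch validation accuracies (the function name `early_stop_index` and the `patience` parameter are illustrative; real frameworks offer equivalents, e.g. a callback that monitors validation metrics):

```python
def early_stop_index(val_accuracies, patience=2):
    """Return the epoch index at which training should stop.

    Stop once the validation accuracy has not improved for
    `patience` consecutive epochs.
    """
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch   # accuracy plateaued; stop here
    return len(val_accuracies) - 1

# Validation accuracy rises, then starts dropping after epoch 2.
accs = [0.62, 0.71, 0.78, 0.77, 0.76, 0.75]
print(early_stop_index(accs))   # → 4
```

The model weights saved at the best epoch (here, epoch 2) are the ones we would keep.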

Therefore, we generally use 60% of the data for training, 20% for validation, and the remaining 20% for testing.

Summary:

  1. Train data — 60%
  2. Validation data — 20%
  3. Test data — 20%
  4. Accuracy after every Epoch is called Validation accuracy.
  5. Stop training (stop the epochs) before the Validation accuracy peaks and begins to drop.
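The 60/20/20 split above can be sketched with plain NumPy (the helper `split_dataset` is illustrative; `sklearn.model_selection.train_test_split` called twice is a common alternative):

```python
import numpy as np

def split_dataset(data, train=0.6, val=0.2, seed=42):
    """Shuffle `data` and split it into train/validation/test parts.

    `train` and `val` are fractions; the test set gets the remainder.
    """
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))          # reproducible random order
    n_train = int(len(data) * train)
    n_val = int(len(data) * val)
    train_set = data[idx[:n_train]]
    val_set = data[idx[n_train:n_train + n_val]]
    test_set = data[idx[n_train + n_val:]]
    return train_set, val_set, test_set

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))   # → 60 20 20
```

Shuffling before slicing matters: without it, any ordering in the original data (e.g., sorted labels) would leak into the splits.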

Image Splitting & Augmentation

Generally, image datasets are very limited, and with such a limited dataset the model wouldn’t be able to perform well. Therefore, we use a very common technique called Image Augmentation. But we have to be very careful about when to use this technique.

Image augmentation is a technique to enrich the training dataset and improve the performance of computer vision algorithms. In image augmentation, we transform existing images (for example, by changing their angle) to create new images.

But we have to be very careful. If we apply the augmentation technique before splitting the dataset, the accuracy can appear to be nearly 100%.

No need to get excited about it, because augmented copies of the same image will end up in the train data, validation data, and test data. The model is then evaluated on near-duplicates of images it has already seen during training (this is called data leakage), and that is why we get the false impression of near-100% accuracy. It is an illusory accuracy.

So, the question is how can we solve this problem?

It’s pretty simple, actually. First we split the image data and keep the train, validation, and test data in separate folders. Only then do we perform the Image Augmentation technique.

Summary:

  1. Image Augmentation — transforming existing images (e.g., changing their angle) to create more images.
  2. By doing Image Augmentation, we enrich the dataset.
  3. Always split the dataset before doing image augmentation.
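The split-then-augment order can be sketched with file names standing in for images (the helper `flip_name` is a hypothetical stand-in for a real transformation such as a flip or rotation):

```python
import random

def flip_name(name):
    # Hypothetical augmentation: stands in for flipping/rotating an image.
    return name + "_flipped"

# Suppose we have 10 original images (file names stand in for the images).
images = [f"img_{i}.png" for i in range(10)]

# 1) Split FIRST (60/20/20), so no original appears in two splits.
random.seed(0)
random.shuffle(images)
train, val, test = images[:6], images[6:8], images[8:]

# 2) Augment ONLY after splitting (typically just the training set).
train = train + [flip_name(name) for name in train]

# No augmented copy of a validation/test image can leak into training.
leaked = {n.replace("_flipped", "") for n in train} & set(val + test)
print(len(train), len(leaked))   # → 12 0
```

If we augmented before splitting, `img_3.png` could land in the training folder while `img_3.png_flipped` landed in the test folder, and the evaluation would be meaningless.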

See you in the next Article. That’s it for now.

If you found this article interesting or helpful, or if you learned something from it, please follow, comment, and leave feedback.

If you want to connect with me, here is my LinkedIn.


Md. Asifur Rahman

Graduate Research Assistant @York University, Canada | Data Engineer | Machine Learning Expert | www.linkedin.com/in/asifur-rahman-ar/ |