What’s the Role of Datasets in ML?

Fred Malack · Published in unpack · 4 min read · Mar 8, 2021

Datasets

To understand what a dataset is and the role it plays in machine learning (ML), we must first discuss its components. A dataset, or data set, is simply a collection of data. The simplest and most common format for datasets you’ll find online is a spreadsheet or CSV file: a single file organized as a table of rows and columns. But some datasets are stored in other formats, and they don’t have to be a single file. Sometimes a dataset is a zip archive or folder containing multiple tables of related data.
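As a minimal sketch, here is how such a table can be read with Python’s standard `csv` module; the tiny inline table is invented for illustration:

```python
import csv
import io

# A tiny CSV "file": one table of rows and columns, header first.
raw = """name,age,city
Alice,34,Nairobi
Bob,28,Mombasa
"""

# csv.DictReader turns each data row into a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(raw)))

print(rows[0]["name"])  # first row, 'name' column
print(len(rows))        # number of data rows
```

In practice you would pass `open("yourfile.csv")` instead of the `io.StringIO` wrapper used here to keep the sketch self-contained.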

In simple terms, we can conclude that a dataset is the food for a machine learning model.

Are there different types of datasets?

Datasets come in many forms, but machine learning models rely on five primary types: numerical, bivariate, categorical, multivariate, and correlation data sets.

  • Numerical data sets

A numerical data set contains data expressed in numbers rather than in natural-language descriptions. Sometimes called quantitative data, numerical data is always collected in number form, e.g. the weight and height of a person.

  • Bivariate data sets

A data set that has two variables is called a bivariate data set. It deals with the relationship between the two variables, e.g. the percentage score and gender of the students in a class, where score and gender are the two variables.

  • Categorical data sets

Categorical data sets represent data that can be divided into groups, e.g. a person’s gender (male or female).

  • Multivariate data sets

Multivariate data consists of individual measurements acquired as a function of more than two variables. E.g. to measure the length, width, height, and volume of a rectangular box, we have to use multiple variables to distinguish between those quantities.

  • Correlation data sets

A set of values that demonstrate some relationship with each other forms a correlation data set. E.g. a tall person tends to be heavier than a short person, so here the weight and height variables are dependent on each other.
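The height/weight relationship can be quantified with a correlation coefficient. A minimal sketch in plain Python, using made-up measurements:

```python
import math

# Hypothetical measurements: heights (cm) and weights (kg) of five people.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 65, 72, 80]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(heights, weights)
print(f"correlation between height and weight: {r:.3f}")
```

A value near +1 means the two variables rise together, as weight and height do in this invented sample; a value near −1 would mean one falls as the other rises.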

What is a reference dataset?

As discussed above, datasets come in different formats, and so do their categories. Reference data sets are data used to classify or categorize other data. Typically, they are static or change slowly over time.

Reference data is different from master data. While both provide context for business transactions, reference data is concerned with classification and categorization, while master data is concerned with business entities.

The following are examples of reference data sets.

  • Units of measurement
  • Corporate codes
  • Country codes
  • Fixed conversion rates e.g., weight, temperature, and length
  • Calendar structure and constraints
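In code, a reference data set can be as simple as a lookup table used to classify other records. A minimal sketch, with a hypothetical set of country codes:

```python
# A hypothetical reference data set: country codes used to classify
# other records. Reference data like this is static or slowly changing.
COUNTRY_CODES = {
    "KE": "Kenya",
    "TZ": "Tanzania",
    "UG": "Uganda",
}

def classify(record):
    """Attach a country name to a transaction record via the reference set."""
    name = COUNTRY_CODES.get(record["country_code"], "Unknown")
    return {**record, "country": name}

order = {"id": 1, "country_code": "KE"}
print(classify(order)["country"])  # prints "Kenya"
```

Note how the reference data provides classification context for the order, while the order itself (the business entity) would belong to master data.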

What does it take to create your own dataset?

When you want to create a dataset it might be because you have a database or other tabular data that you want to analyze and share. But data from a database isn’t the only kind of data you can put in a dataset.

Steps for creating custom data sets:

  • Choose your collection method

You can build your own data set using internal resources or third-party services you hire. To collect the data, you can use automation, you can do it manually, or you can choose a combination of both. You may use your own devices, such as cameras or sensors.

  • Collect data in tiers

At this stage, you work with smaller datasets to analyze the effectiveness of your predictive model and adjust it as necessary. Start by breaking down the larger data set you have into smaller sets. For example, if you are aiming to work with 500,000 images, collect the data in tiers of 20,000–50,000 and increase that gradually or aggressively depending on the results of your model after training.
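The tiering idea above can be sketched as a simple batching helper, using the counts from the example:

```python
def tiers(items, tier_size):
    """Split a large collection into smaller tiers for staged training."""
    return [items[i:i + tier_size] for i in range(0, len(items), tier_size)]

image_ids = list(range(100_000))      # stand-in for 100,000 image references
batches = tiers(image_ids, 20_000)

print(len(batches), len(batches[0]))  # 5 tiers of 20,000 each
```

You would train and evaluate on the first tier, then fold in subsequent tiers as the model’s results justify the extra collection effort.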

  • Validate the data

The purpose of validation is to ensure you’ve met the data quality metrics (i.e., variance, quality, quantity, density) you initially sought to achieve. This is the perfect time to prevent biases and collect data again before beginning annotation.
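Validation can start as a short checklist run over the collected records. A minimal sketch, with thresholds and field names invented for illustration:

```python
# Hypothetical quality thresholds chosen for this sketch.
MIN_ROWS = 3
REQUIRED_FIELDS = {"image_id", "label"}

def validate(rows):
    """Return a list of data-quality problems; an empty list means the checks pass."""
    problems = []
    if len(rows) < MIN_ROWS:
        problems.append(f"too few rows: {len(rows)} < {MIN_ROWS}")
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            problems.append(f"row {i} missing fields: {sorted(missing)}")
    labels = {row.get("label") for row in rows}
    if len(labels) < 2:
        problems.append("only one class present; check for sampling bias")
    return problems

sample = [
    {"image_id": 1, "label": "cat"},
    {"image_id": 2, "label": "dog"},
    {"image_id": 3, "label": "cat"},
]
print(validate(sample))  # prints [] -> checks pass
```

Failing checks here (too little data, only one class) are exactly the kind of signal that tells you to collect again before paying for annotation.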

  • Annotate the data

Once you have validated, during the collection stage, that you have acquired the appropriate amount and variety of data, you will begin working on the most time-consuming task of your project: data annotation. You will have done some annotation during the earlier stages of this process, as you collected and tested the data for use with your algorithm.
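An annotation can be stored as a small record linking each raw item to its label and annotator, so labels can be audited later. A minimal sketch; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class Annotation:
    """One label applied to one item, with the annotator recorded for auditing."""
    item_id: int
    label: str
    annotator: str

ann = Annotation(item_id=42, label="cat", annotator="alice")
print(asdict(ann))  # serializable dict, ready to write out as CSV/JSON
```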

  • Validate your model

At this stage, you will validate the quality of your algorithm. This is a key step for determining if the data you labeled is a good fit for the algorithm you are creating.
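Model validation typically holds out part of the labeled data and measures performance on it. A minimal sketch with synthetic data and a trivial majority-label "model" (the point is the holdout split, not the model):

```python
import random

random.seed(0)

# Synthetic labeled examples: (feature, label).
data = [(x, "big" if x > 50 else "small") for x in range(100)]
random.shuffle(data)

# Hold out 20% of the labeled data for validation.
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]

# A trivial "model": always predict the most common training label.
majority = max(
    {label for _, label in train},
    key=lambda label: sum(1 for _, l in train if l == label),
)
accuracy = sum(1 for _, label in holdout if label == majority) / len(holdout)
print(f"holdout accuracy: {accuracy:.2f}")
```

If accuracy on the holdout set is poor, that points back at either the model or the labels, which is why this step gates whether the labeled data fits the algorithm.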

  • Repeat

Machine learning is not a one-and-done exercise, so you will repeat the collection, annotation, and validation steps again and again.

