Tips for Preparing the Best Dataset

Souman Roy
MetaInsights
2 min read · Dec 25, 2017

To solve any machine learning problem, the most crucial thing to know is that the dataset you use depends entirely on the problem you are trying to solve.

What is Data?
Data is raw information. It is a representation of human and machine observations of the world. Almost anything can be represented as data: art, literature, perception. We are surrounded by data.

Here are the important considerations before you jump into machine learning or data science:

# Quantity of Data
When you teach a child to recognize an apple, 3 to 6 examples are typically enough before he or she starts responding accurately. Computers are different from humans: you may need to provide thousands to millions of examples to train even a small model to recognize an apple. The quantity of data required is completely application dependent. In general, you should avoid training your model on too little data.
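The effect of sample size can be illustrated with a toy experiment. The sketch below is not a real training run: it repeatedly estimates the mean of a noisy quantity from small and large samples and compares how much the estimates vary. The function name `estimate_spread`, the sample sizes, and the noise distribution are all assumptions chosen for illustration.

```python
import random
import statistics


def estimate_spread(sample_size, trials=200, seed=0):
    """Return how much the sample mean varies across repeated trials.

    Each trial draws `sample_size` noisy observations (true mean 0.0,
    standard deviation 1.0) and records their average.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(sum(sample) / len(sample))
    # Standard deviation of the estimates: smaller means more reliable.
    return statistics.stdev(means)


small = estimate_spread(sample_size=5)
large = estimate_spread(sample_size=500)
print(f"spread with   5 examples per estimate: {small:.3f}")
print(f"spread with 500 examples per estimate: {large:.3f}")
```

Estimates built from only a handful of examples swing much more widely than those built from hundreds, which is one reason small datasets tend to produce unreliable models.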

# Cleaning and Structuring Datasets
You should remove unnecessary data. The next important thing is to give the data a consistent structure, and to make sure the dataset eliminates or reduces potential sources of bias and variance.
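As a minimal sketch of what cleaning and structuring can look like, the example below drops incomplete rows, normalizes text values, and removes duplicates. The helper `clean_rows` and the fruit records are hypothetical, and real projects usually need rules specific to their own data.

```python
def clean_rows(rows, required_fields):
    """Drop incomplete rows, normalize text, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Skip rows where a required field is absent, None, or blank.
        if any(
            row.get(field) is None or str(row.get(field)).strip() == ""
            for field in required_fields
        ):
            continue
        # Give every row the same structure: only the required fields,
        # with string values stripped and lowercased.
        normalized = {
            field: row[field].strip().lower()
            if isinstance(row[field], str)
            else row[field]
            for field in required_fields
        }
        key = tuple(normalized[field] for field in required_fields)
        if key in seen:
            continue  # duplicate of a row we already kept
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


raw = [
    {"fruit": " Apple ", "weight_g": 182},
    {"fruit": "apple", "weight_g": 182},    # duplicate after normalization
    {"fruit": "Banana", "weight_g": None},  # missing value
    {"fruit": "", "weight_g": 120},         # blank label
    {"fruit": "Pear", "weight_g": 178},
]
cleaned = clean_rows(raw, ["fruit", "weight_g"])
print(cleaned)
```

Of the five raw rows, only the first apple row and the pear row survive. Whether to drop incomplete rows or fill in the missing values is a judgment call that depends on the application.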

# Feature Selection
Feature selection also plays a crucial role in making an algorithm work well. Consider an everyday example: suppose you have to purchase a car. There are many factors that can affect your decision, and feature selection means picking out the ones that actually matter to the outcome.
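One simple way to screen features is to rank them by how strongly each one correlates with the outcome. The sketch below applies that idea to the car example using made-up numbers: the feature names (`price_k`, `mileage_kmpl`, `paint_code`) and the `buyer_rating` values are invented for illustration, and correlation is only one of many possible selection heuristics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data for six cars (all values are invented).
cars = {
    "price_k": [10, 14, 12, 22, 25, 16],
    "mileage_kmpl": [20, 17, 19, 12, 11, 15],
    "paint_code": [1, 2, 3, 1, 2, 3],
}
buyer_rating = [9, 7, 8, 4, 3, 6]  # how much the buyer liked each car

scores = {name: pearson(values, buyer_rating) for name, values in cars.items()}
# Rank features by the strength of the relationship, ignoring its sign.
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:+.2f}")
```

In this toy data, price and mileage track the rating closely while the paint code barely relates to it, so the paint code would be the first candidate to drop.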

# Overcoming Overfitting and Underfitting in Machine Learning Datasets

Bias vs Variance

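Bias: a set of erroneous assumptions in the learning algorithm. High bias leads to underfitting.

Variance: the sensitivity of the model to noise in the training data rather than to the important features of the relationship between input and output. High variance leads to overfitting.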
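The two failure modes can be seen side by side in a small experiment. The sketch below, written under assumed settings (a noisy straight-line relationship, 30 training points, 200 test points), compares three toy models: a constant prediction that ignores the input (high bias), a nearest-neighbor lookup that memorizes every training point including its noise (high variance), and a least-squares line. All function names are illustrative.

```python
import random


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Noisy samples from the assumed relationship y = 2x + 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    # High bias: always predicts the average output, ignoring the input.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    # Ordinary least-squares straight line.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest_neighbor(xs, ys):
    # High variance: copies the output of the closest training point.
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(42)
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(200, rng)

results = {}
for name, fit in [
    ("constant", fit_constant),
    ("line", fit_line),
    ("nearest_neighbor", fit_nearest_neighbor),
]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name}: train MSE {results[name][0]:.3f}, test MSE {results[name][1]:.3f}")
```

The constant model has a large error on both sets, which is the signature of underfitting. The nearest-neighbor model has zero training error but a clearly nonzero test error, and that gap between training and test performance is the usual warning sign of overfitting.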

These are a few points you should keep in mind while solving data science problems.

Thanks for reading.

Souman Roy
MetaInsights

Business Intelligence practitioner | Problem Solver | Founder MetaInsights, Solve for India