Creating a Simulated Dataset from Scikit-learn

You’ll need to create a synthetic data set.

3 min read · Jul 16, 2022

Scikit-learn provides several functions in its sklearn.datasets module for generating simulated data. Three of them are especially useful.

1. make_regression() :

make_regression() is a good choice when we want a dataset designed for linear regression.

Parameters :

n_samples : int, default=100
The number of samples.

n_features : int, default=100
The number of features.

noise : float, default=0.0
The standard deviation of the Gaussian noise applied to the output.

I've focused on the most important ones, but there are plenty of other parameters to consider.
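To make the parameters concrete, here is a minimal sketch of calling make_regression(). The specific values (200 samples, 5 features, a noise level of 10.0, and random_state=42) are my own assumptions for illustration, and fitting LinearRegression is only one way to check the result.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Illustrative values only; coef=True also returns the true coefficients
# of the underlying linear model, and random_state makes the data reproducible.
X, y, coef = make_regression(
    n_samples=200,
    n_features=5,
    noise=10.0,
    coef=True,
    random_state=42,
)

print(X.shape, y.shape)  # (200, 5) (200,)

# Fit a linear model and compare its coefficients with the true ones.
model = LinearRegression().fit(X, y)
print("true coefficients:  ", coef.round(2))
print("fitted coefficients:", model.coef_.round(2))
```

With a modest noise level the fitted coefficients should land close to the true ones. Raising noise makes the problem harder.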

2. make_classification() :

Using make_classification(), we can generate a simulated dataset for classification tasks.

Parameters :

n_samples : int, default=100
The number of samples.

n_features : int, default=20
The total number of features.

n_classes : int, default=2
The number of classes (or labels) of the classification problem.
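A short sketch of make_classification() in use might look like the following. The sample count and random_state are assumptions chosen for illustration, while n_features and n_classes are simply left at their documented defaults.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Illustrative values only; random_state makes the data reproducible.
X, y = make_classification(
    n_samples=200,
    n_features=20,
    n_classes=2,
    random_state=42,
)

print(X.shape)  # (200, 20)

# Count how many samples fall into each class label.
print(Counter(y.tolist()))
```

The two classes come out roughly balanced by default. The weights parameter can be used if an imbalanced dataset is needed.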

3. make_blobs() :

Scikit-learn gives us make_blobs() if we want a dataset that works well with clustering techniques.

Parameters :

n_samples : int, default=100
The number of samples.

n_features : int, default=2
The number of features for each sample.

centers : int, default=None
The number of centers to generate.

The centers parameter sets the number of clusters that are generated. We can see the clusters made by make_blobs() by plotting them with the matplotlib library.
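As a rough sketch of such a plot, the snippet below generates three blobs and colors each point by its cluster label. The sample count, the number of centers, random_state, and the output file name blobs.png are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Illustrative values only: 300 two-dimensional points around 3 centers.
X, y = make_blobs(n_samples=300, n_features=2, centers=3, random_state=42)

print(X.shape)  # (300, 2)

# Scatter plot of the two features, colored by cluster label.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, s=15)
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
ax.set_title("Clusters generated by make_blobs()")

# Save to a file (hypothetical name); use plt.show() in an interactive session.
fig.savefig("blobs.png")
```

Changing centers changes how many groups appear in the plot, and the cluster_std parameter controls how tightly each group is packed.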

I hope you find this article helpful and have learned some new things ❤



Written by Suraj Yadav