Creating a Simulated Dataset from Scikit-learn
You will often need to create a synthetic dataset.
Scikit-learn provides several functions for generating simulated data. Three of them are especially useful.
1. make_regression() :
make_regression() is a good choice when we want a dataset designed for linear regression.
Parameters:
- n_samples : int, default=100
  The number of samples.
- n_features : int, default=100
  The number of features.
- noise : float, default=0.0
  The standard deviation of the Gaussian noise applied to the output.
These are the most important parameters, but there are many others to consider.
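As a minimal sketch of the parameters above, the snippet below generates a small regression dataset. The specific values (3 features, noise of 10.0, random_state=0) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_regression

# Illustrative values only: 100 samples, 3 features, some Gaussian noise on y.
# random_state is fixed so that repeated runs give the same data.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

print(X.shape)  # feature matrix: one row per sample, one column per feature
print(y.shape)  # continuous target: one value per sample
```

Here X holds the features and y holds the continuous target that a regression model would try to predict.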
2. make_classification() :
With make_classification(), we can generate a simulated dataset for classification tasks.
Parameters:
- n_samples : int, default=100
  The number of samples.
- n_features : int, default=20
  The total number of features.
- n_classes : int, default=2
  The number of classes (or labels) of the classification problem.
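The following is a small sketch of a three-class dataset. Besides the parameters listed above, it sets n_informative and n_redundant, which are other make_classification() parameters not covered in this article. They are set here because the default of 2 informative features is too few for 3 classes with the default 2 clusters per class. All values are illustrative assumptions.

```python
from sklearn.datasets import make_classification

# Illustrative values only. n_informative=3 is needed so that
# n_classes * n_clusters_per_class (3 * 2 = 6) does not exceed 2 ** n_informative (8).
X, y = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

print(X.shape)              # feature matrix: 200 rows, 5 columns
print(sorted(set(y.tolist())))  # the integer class labels present in y
```

Here y contains integer class labels rather than continuous values, which is what a classifier expects.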
3. make_blobs() :
Scikit-learn gives us make_blobs() if we want a dataset that works well with clustering techniques.
Parameters:
- n_samples : int, default=100
  The number of samples.
- n_features : int, default=2
  The number of features for each sample.
- centers : int or array-like, default=None
  The number of centers to generate, or the fixed center locations.
The centers parameter sets the number of clusters that are generated. The resulting clusters can be visualized with the matplotlib library.
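One possible way to draw the clusters is sketched below. The sample count, the seed, and the output file name blobs.png are assumptions made for this example. The non-interactive Agg backend is used so the script also runs without a display. In a notebook you would typically drop the backend line and call plt.show() instead of saving to a file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Illustrative values only: 150 points in 2 dimensions around 3 centers.
X, y = make_blobs(n_samples=150, n_features=2, centers=3, random_state=42)

# Scatter plot of the two features, colored by the cluster label in y.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y)
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
ax.set_title("Clusters generated by make_blobs()")
fig.savefig("blobs.png")  # hypothetical output file name
```

Each color in the plot corresponds to one of the generated centers.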
I hope you find this article helpful and have learned some new things ❤
References:
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
- https://www.oreilly.com/library/view/machine-learning-with/9781491989371/