
In this post, you will learn how to create toy datasets for regression, classification, and clustering in scikit-learn.
Create Dataset for Regression –
To create a dataset for regression, we use the make_regression function in scikit-learn.
# import library
from sklearn.datasets import make_regression

# create features and target
features, target = make_regression(n_samples=100,
                                   n_features=10,
                                   n_informative=5,
                                   n_targets=1,
                                   random_state=42)

# print the first five samples of features and target
print("Features:\n", features[:5])
print("Target:\n", target[:5])
output -
Features:
[[ 1.06548038 1.44697788 0.19655478 -0.43973106 -0.18687164 0.8896308
-1.48556037 0.08228399 1.03184454 0.26705027]
[-0.07444592 -0.34271452 -0.80227727 -0.42064532 -1.41537074 0.17457781
0.40405086 0.25755039 -0.16128571 1.8861859 ]
[ 0.65655361 -0.68002472 0.2322537 0.34644821 0.25049285 0.47383292
-0.71435142 -1.1913035 0.29307247 1.86577451]
[ 1.11957491 -1.51519106 1.36687427 -2.30192116 -0.2750517 0.31125015
-0.24903604 3.07888081 1.64496771 0.57655696]
[-1.19787789 -0.98572605 0.50404652 0.95514232 -0.0626791 -1.03524232
-0.79287283 -0.55364931 -0.53025762 -0.10703036]]
Target:
[ 101.5074193 -107.10336154 -18.12691438 -23.34650396 -19.86185267]
Here, n_features is the number of features to generate, and n_informative is the number of features that are actually used to build the linear model underlying the target. We set it to 5, so out of 10 features only 5 influence the target; the remaining 5 are pure noise. n_targets is the dimension of the target vector, and n_samples is the number of samples in the dataset.
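If you want to see the linear model behind the data, make_regression can also return the true coefficients it used when you pass coef=True; the uninformative features get a coefficient of exactly zero. A quick sketch using the same parameters as above:

# also return the true coefficients of the underlying linear model
features, target, coef = make_regression(n_samples=100,
                                         n_features=10,
                                         n_informative=5,
                                         n_targets=1,
                                         coef=True,
                                         random_state=42)

# the 5 uninformative features have a true coefficient of exactly 0
print("True coefficients:\n", coef)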
Create Dataset for Classification –
To create a dataset for classification, we use the make_classification function in scikit-learn.
# import library
from sklearn.datasets import make_classification

# create features and target
features, target = make_classification(n_samples=100,
                                       n_features=10,
                                       n_informative=10,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[0.3, 0.7],
                                       random_state=42)

# print the first five samples of features and target
print("Features:\n", features[:5])
print("Targets:", target[:5])
output -
Features:
[[ 0.34616844 -0.00643522 0.61870661 -2.69344346 -0.69738434 2.01201405
-1.6681384 2.44836537 -1.82661678 2.4024435 ]
[ 0.06430876 2.61516685 -1.77179534 0.71196963 -0.50610832 -1.22212198
2.71551779 -1.48152975 3.6699482 2.49850127]
[-0.12542346 1.09702698 -3.58583695 -0.93441303 4.35226485 4.05243758
0.23626167 0.6824358 -0.39699933 -3.41368058]
[-1.59102699 -3.7605661 -2.79522881 -1.94146604 -2.55798687 -1.99565707
1.14428147 -1.42656534 0.66292419 -0.82558689]
[ 0.62665427 0.10382127 0.03313668 -0.315958 -1.36010698 2.36139973
0.59534839 1.78052086 2.78400145 -1.25503482]]
Targets: [0 1 1 1 1]
Here, setting n_classes to 2 makes this a binary classification problem. The weights=[0.3, 0.7] argument means that roughly 30% of the observations belong to the first class and 70% to the second, which is handy for simulating imbalanced datasets.
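To confirm the split, you can count the samples per class with NumPy (a quick sanity check, not part of the original example):

# count how many samples fall into each class; expect roughly 30 and 70
import numpy as np
print("Class counts:", np.bincount(target))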
Create Dataset for Clustering –
To create a dataset for clustering, we use the make_blobs function in scikit-learn.
# import library
from sklearn.datasets import make_blobs

# create features and target
features, target = make_blobs(n_samples=100,
                              n_features=2,
                              centers=3,
                              random_state=42)

# print the first five samples of features and target
print("Features:\n", features[:5])
print("Target:\n", target[:5])
output -
Features:
[[-7.72642091 -8.39495682]
[ 5.45339605 0.74230537]
[-2.97867201 9.55684617]
[ 6.04267315 0.57131862]
[-6.52183983 -6.31932507]]
Target:
[2 1 0 1 2]
The centers parameter specifies the number of clusters (blobs) to generate.
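Note that centers also accepts explicit coordinates instead of an integer, and cluster_std controls how tightly the points scatter around each center. A small sketch (the coordinates and the features_2/target_2 names here are just for illustration):

# place three clusters at chosen coordinates with a tighter spread
features_2, target_2 = make_blobs(n_samples=100,
                                  centers=[[-5, -5], [0, 0], [5, 5]],
                                  cluster_std=0.8,
                                  random_state=42)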
Let’s visualize the clusters.
# visualize the blobs, coloring each point by its cluster label
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.scatter(features[:, 0], features[:, 1], c=target)
plt.show()
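If you also want the true cluster centers, for example to overlay them on the plot, recent scikit-learn versions (0.23+) can return them via return_centers=True. A sketch:

# regenerate the blobs, also returning the true cluster centers
features, target, centers = make_blobs(n_samples=100,
                                       n_features=2,
                                       centers=3,
                                       return_centers=True,
                                       random_state=42)

# mark each true center with a red cross
plt.figure(figsize=(10, 8))
plt.scatter(features[:, 0], features[:, 1], c=target)
plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=200)
plt.show()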
