How to Create a Toy Dataset for Machine Learning in sklearn


In this post, you will learn how to create toy datasets for regression, classification, and clustering in scikit-learn.

Create Dataset for Regression –

To create a dataset for regression, we use the make_regression function in scikit-learn.

# import library
from sklearn.datasets import make_regression

# create features and targets
features, target = make_regression(n_samples=100,
                                  n_features=10,
                                  n_informative=5,
                                  n_targets=1,
                                  random_state=42)

# print features and target
print("Features:\n", features[:5])
print("Target:\n", target[:5])

output - 
Features:
 [[ 1.06548038  1.44697788  0.19655478 -0.43973106 -0.18687164  0.8896308
  -1.48556037  0.08228399  1.03184454  0.26705027]
 [-0.07444592 -0.34271452 -0.80227727 -0.42064532 -1.41537074  0.17457781
   0.40405086  0.25755039 -0.16128571  1.8861859 ]
 [ 0.65655361 -0.68002472  0.2322537   0.34644821  0.25049285  0.47383292
  -0.71435142 -1.1913035   0.29307247  1.86577451]
 [ 1.11957491 -1.51519106  1.36687427 -2.30192116 -0.2750517   0.31125015
  -0.24903604  3.07888081  1.64496771  0.57655696]
 [-1.19787789 -0.98572605  0.50404652  0.95514232 -0.0626791  -1.03524232
  -0.79287283 -0.55364931 -0.53025762 -0.10703036]]
Target:
 [ 101.5074193  -107.10336154  -18.12691438  -23.34650396  -19.86185267]

Here, n_features is the number of features you need and n_informative is the number of features that are actually used to build the underlying linear model. We set it to 5, so out of 10 features only 5 contribute to the target and the remaining 5 are noise. n_targets is the dimension of the target vector and n_samples is the number of samples in the dataset.
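
If you also want to see the true coefficients of the underlying linear model, make_regression can return them when you pass coef=True. A minimal sketch (the variable name coefficients below is just for illustration):

# also return the true coefficients of the underlying linear model
from sklearn.datasets import make_regression

features, target, coefficients = make_regression(n_samples=100,
                                                 n_features=10,
                                                 n_informative=5,
                                                 n_targets=1,
                                                 coef=True,
                                                 random_state=42)

# the non-informative features have a coefficient of 0
print("True coefficients:\n", coefficients)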

Create Dataset for Classification –

To create a dataset for classification, we use the make_classification function in scikit-learn.

# import library
from sklearn.datasets import make_classification

# create features and target
features, target = make_classification(n_samples=100,
                                      n_features=10,
                                      n_informative=10,
                                      n_redundant=0,
                                      n_classes=2,
                                      weights=[0.3, 0.7],
                                      random_state=42)

# print features and target
print("Features:\n", features[:5])
print("Targets:", target[:5])

output - 
Features:
 [[ 0.34616844 -0.00643522  0.61870661 -2.69344346 -0.69738434  2.01201405
  -1.6681384   2.44836537 -1.82661678  2.4024435 ]
 [ 0.06430876  2.61516685 -1.77179534  0.71196963 -0.50610832 -1.22212198
   2.71551779 -1.48152975  3.6699482   2.49850127]
 [-0.12542346  1.09702698 -3.58583695 -0.93441303  4.35226485  4.05243758
   0.23626167  0.6824358  -0.39699933 -3.41368058]
 [-1.59102699 -3.7605661  -2.79522881 -1.94146604 -2.55798687 -1.99565707
   1.14428147 -1.42656534  0.66292419 -0.82558689]
 [ 0.62665427  0.10382127  0.03313668 -0.315958   -1.36010698  2.36139973
   0.59534839  1.78052086  2.78400145 -1.25503482]]
Targets: [0 1 1 1 1]

Here, setting n_classes to 2 means this is a binary classification problem. The weights=[0.3, 0.7] means that roughly 30% of the observations belong to one class and 70% belong to the other.
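
You can confirm the class imbalance produced by weights=[0.3, 0.7] by counting the labels; a quick sketch using NumPy:

# count how many samples fall in each class
import numpy as np

print("Class counts:", np.bincount(target))
# roughly 30 samples of class 0 and 70 samples of class 1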

Create Dataset for Clustering –

To create a dataset for clustering, we use the make_blobs function in scikit-learn.

# import library
from sklearn.datasets import make_blobs

# create features and target
features, target = make_blobs(n_samples=100,
                             n_features=2,
                             centers=3,
                             random_state=42)

print("Features:\n", features[:5])
print("Target:\n", target[:5])

output - 
Features:
 [[-7.72642091 -8.39495682]
 [ 5.45339605  0.74230537]
 [-2.97867201  9.55684617]
 [ 6.04267315  0.57131862]
 [-6.52183983 -6.31932507]]
Target:
 [2 1 0 1 2]

The centers parameter specifies the number of clusters (blob centers) to generate.

Let’s visualize the clusters.

import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
plt.scatter(features[:,0], features[:,1], c=target)
plt.show()
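
make_blobs also accepts a cluster_std parameter that controls how spread out each cluster is. A minimal sketch with tighter clusters (cluster_std=0.5 is just an example value):

# smaller cluster_std produces tighter, more separable clusters
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

features, target = make_blobs(n_samples=100,
                              n_features=2,
                              centers=3,
                              cluster_std=0.5,
                              random_state=42)

plt.figure(figsize=(10,8))
plt.scatter(features[:,0], features[:,1], c=target)
plt.show()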
