How to load Scikit-Learn dataset for Machine Learning


Scikit-learn ships with several example datasets that make it easy to get started with Machine Learning. In this post, you will learn how to load an example dataset in scikit-learn and inspect its contents.

1 . Scikit-Learn dataset for regression –

Let’s first read a dataset for regression, and then look at how to read a dataset for classification.

We will start by reading the California housing dataset.

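To access the California housing dataset from the scikit-learn datasets module, type the following. The first call downloads the data and caches it locally, so it needs an internet connection: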

from sklearn import datasets
housing = datasets.fetch_california_housing()

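To list all the available dataset fetchers, type the following in IPython or a Jupyter notebook (the trailing ? is IPython's wildcard search and is not valid in a plain Python script):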

datasets.fetch_*?
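
Outside IPython, the same listing can be produced with ordinary Python. The snippet below is a minimal sketch, assuming it is enough to filter the public names of the datasets module by prefix; the variable name loaders is only illustrative.

```python
from sklearn import datasets

# load_* functions read datasets bundled with scikit-learn;
# fetch_* functions download the data on first use and cache it.
loaders = sorted(
    name for name in dir(datasets)
    if name.startswith(("load_", "fetch_"))
)

for name in loaders:
    print(name)
```

The exact list depends on the installed scikit-learn version.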

To view the description of a dataset, print its DESCR attribute:

print(housing.DESCR)

Viewing the California housing dataset –

To get the data and its shape, type:

housing.data
housing.data.shape
output - (20640, 8)

This means the dataset has 20640 observations, and each observation has 8 features.

To get the feature names (the column names), type:

housing.feature_names

To get the target (the median house value of each district, in units of $100,000), type:

housing.target
output - array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

Target shape

housing.target.shape
output - (20640,)

Each observation has exactly one target value, which is why the target has 20640 entries.

Convert to Pandas DataFrame –

To convert this dataset to a pandas DataFrame, type:

import pandas as pd
features = pd.DataFrame(housing.data, columns=housing.feature_names)
features.head()
target = pd.Series(housing.target)
target
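
It is often convenient to keep the features and the target in a single DataFrame. Here is a minimal sketch of that idea. The helper bunch_to_frame and the column name "target" are hypothetical names chosen for this example, and load_diabetes is used only as a stand-in because it ships with scikit-learn and needs no download.

```python
import pandas as pd
from sklearn import datasets


def bunch_to_frame(bunch, target_name="target"):
    """Return one DataFrame holding the features plus a target column."""
    frame = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    frame[target_name] = bunch.target
    return frame


# Stand-in dataset (an assumption for this sketch): load_diabetes is
# bundled with scikit-learn, so this runs without a network connection.
diabetes = datasets.load_diabetes()
df = bunch_to_frame(diabetes)

print(df.shape)
print(df.head())
```

The same helper should work on the housing object loaded earlier, for example bunch_to_frame(housing). Recent scikit-learn versions also accept as_frame=True in the loaders, which returns pandas objects directly.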

2 . Scikit-Learn dataset for classification –

Loading a classification dataset is very similar to loading a regression dataset, so I will go through it quickly.

Let’s load the famous iris dataset.

from sklearn import datasets
iris = datasets.load_iris()

To get the data, type:

iris.data

To get the feature names, type:

iris.feature_names

To get the shape (rows, columns) of the data, type:

iris.data.shape
output - (150, 4)

The dataset contains 150 iris flowers, and each flower has 4 features.

To get the target, type:

iris.target
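
When only the two arrays are needed, the loader can return them directly. This is a small sketch, assuming the usual X and y naming convention; return_X_y is an existing parameter of load_iris.

```python
from sklearn import datasets

# return_X_y=True skips the Bunch object and returns (data, target).
X, y = datasets.load_iris(return_X_y=True)

print(X.shape)
print(y.shape)
```

The first dimension of both arrays is the same, one label per flower.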

To get the target names, type:

iris.target_names
output - array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

So 0 refers to the setosa species, 1 refers to versicolor and 2 refers to virginica.
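
The mapping from integer labels to species names can be applied to the whole target at once. A minimal sketch, assuming NumPy integer indexing is acceptable here; the variable name species is only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Index the array of class names with the integer labels to get
# one species name per flower.
species = iris.target_names[iris.target]

print(species[:3])
print(species[-3:])
```

This works because iris.target_names is a NumPy array, so indexing it with the array of labels looks up every label in one step.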
