How to load a Scikit-Learn dataset for Machine Learning


scikit-learn ships with several example datasets that help you get started quickly with Machine Learning. In this post, you will learn how to load an example dataset in scikit-learn.

1 . Scikit-Learn dataset for regression –

Let’s first load a dataset for regression, then look at how to load a dataset for classification.

We will start by reading the California housing dataset.

To access the California housing dataset from scikit-learn's datasets module (the data is downloaded on first use and cached locally), type

from sklearn import datasets
housing = datasets.fetch_california_housing()

To list all the available datasets, type
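
One way to do this, as a minimal sketch: the loaders are functions in the datasets module whose names start with load_ or fetch_, so filtering dir(datasets) by those prefixes gives a readable list. The variable name available is only illustrative, and the exact list depends on the installed scikit-learn version.

```python
from sklearn import datasets

# Keep only the loader functions: load_* read local data (mostly datasets
# bundled with scikit-learn), fetch_* download their data on first use.
available = [
    name for name in dir(datasets)
    if name.startswith(("load_", "fetch_"))
]

for name in available:
    print(name)
```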


To view information about a dataset

from pprint import pprint
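
As a minimal sketch of what can be printed: each dataset object behaves like a dictionary, and its DESCR entry holds a plain-text description. The example assumes load_diabetes, a small regression dataset bundled with scikit-learn, purely so that it runs without a download; housing.keys() and housing.DESCR work the same way.

```python
from pprint import pprint
from sklearn import datasets

# Stand-in dataset (an assumption for this sketch): bundled, no download needed.
diabetes = datasets.load_diabetes()

# The returned object is dictionary-like; list the entries it contains.
keys = sorted(diabetes.keys())
pprint(keys)

# DESCR is a plain-text description of the dataset.
description = diabetes.DESCR
print(description)
```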

Viewing the California housing dataset –

To get the data, type

housing.data

To get its shape (rows, columns), type

housing.data.shape
output - (20640, 8)

This means the dataset has 20640 observations, and each observation has 8 features.

To get the feature names (the column names), type

housing.feature_names


To get the target, type

housing.target
output - array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

To get the target shape, type

housing.target.shape
output - (20640,)

Each observation has exactly one target value, which is why the target has 20640 entries.

Convert to Pandas DataFrame –

To convert this dataset to a pandas DataFrame, type

import pandas as pd
features = pd.DataFrame(housing.data, columns=housing.feature_names)
target = pd.Series(housing.target)
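
As an alternative sketch, scikit-learn 0.23 and later accept as_frame=True in the dataset loaders, which returns pandas objects directly. The example assumes load_diabetes as a stand-in so that it runs without a download; fetch_california_housing accepts the same parameter.

```python
from sklearn import datasets

# Stand-in dataset (an assumption for this sketch): bundled, no download needed.
diabetes = datasets.load_diabetes(as_frame=True)

features = diabetes.data    # pandas DataFrame, one column per feature
target = diabetes.target    # pandas Series
combined = diabetes.frame   # features and target in a single DataFrame

print(features.shape)
print(combined.columns.tolist())
```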

2 . Scikit-Learn dataset for classification –

Loading a classification dataset is very similar to loading a regression dataset, so I will go through it quickly.

Let’s load the famous iris dataset.

from sklearn import datasets
iris = datasets.load_iris()

To get the data, type

iris.data

To get the feature names, type

iris.feature_names


To get the shape (rows, columns) of the data, type

iris.data.shape
output - (150, 4)

The data covers 150 iris flowers, and each flower has 4 features.

To get the target, type

iris.target
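
As a quick sketch of what the target contains: it holds one integer class label per flower, and counting the labels with NumPy shows how many flowers fall into each class. The variable names labels and counts are illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# One integer label per flower.
print(iris.target.shape)

# Distinct labels and how often each one occurs.
labels, counts = np.unique(iris.target, return_counts=True)
print(labels)
print(counts)
```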

And to get the target names, type

iris.target_names
output - array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

So 0 refers to the setosa species, 1 refers to versicolor and 2 refers to virginica.
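
This mapping can be applied to the whole target at once. A minimal sketch, assuming pandas is installed: indexing target_names with the integer labels gives one species name per flower, which can then sit next to the features in a DataFrame. The column name species is an illustrative choice.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Index the array of names with the integer labels (NumPy fancy indexing),
# so 0 becomes 'setosa', 1 'versicolor' and 2 'virginica'.
species = iris.target_names[iris.target]

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = species

print(df.head())
print(df["species"].value_counts())
```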
