How to load and view the iris dataset?

In this post, we will learn how to load or read the iris dataset for machine learning.

Load the iris dataset –

First, we need to import the libraries required for our analysis.

# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter/IPython only: render plots inline (omit this line in a plain Python script)
%matplotlib inline

Next, we will load the iris dataset from the scikit-learn datasets module.

# load the iris dataset
from sklearn import datasets
iris = datasets.load_iris()

The scikit-learn datasets module also contains many other datasets for machine learning, which you can load the same way we loaded iris.

To check which datasets are available, type the following in an IPython or Jupyter session –

datasets.load_*?
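
The wildcard search above relies on IPython, so it will not work in a plain Python script. As a rough alternative, the sketch below lists every function in the datasets module whose name starts with load_. The variable name loaders is just an illustrative choice, and not every match is a bundled dataset, since a few of these functions read data from files instead.

```python
from sklearn import datasets

# Collect every attribute of the module whose name starts with "load_"
loaders = [name for name in dir(datasets) if name.startswith("load_")]
print(loaders)
```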

Let’s say that you want to read the digits dataset. To do that, write

# read the digits dataset
digits = datasets.load_digits()

To view a description of the dataset, type

print(iris.DESCR)
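
The object returned by load_iris behaves like a dictionary, so besides the description we can also list the fields it carries. A minimal sketch, which prints only the first 500 characters of the description (an arbitrary cut-off chosen here to keep the output short):

```python
from sklearn import datasets

iris = datasets.load_iris()

# List the fields stored in the returned object
print(list(iris.keys()))

# DESCR is one long string, so print() renders its line breaks properly
print(iris.DESCR[:500])
```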

Viewing the iris dataset –

Now that we have loaded the iris dataset, let’s examine what is in it.

To access the observation variables (the features), type

iris.data

This outputs a NumPy array.

Let’s also examine the shape of this NumPy array.

iris.data.shape
(150, 4)

So we have 150 observations, and each observation has 4 features.

You can access any particular row by its index, like this

# access the first row
iris.data[0]
array([5.1, 3.5, 1.4, 0.2])

To determine what each of these values means, we can look at the feature names by typing

iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

For example, the first flower has a sepal length of 5.1 cm, a sepal width of 3.5 cm, a petal length of 1.4 cm and a petal width of 0.2 cm.
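
Instead of matching the names and values by eye, we can pair them up in code. This is a small sketch using zip; the name first_flower is only an illustrative variable.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Pair each feature name with the matching value from the first row
first_flower = dict(zip(iris.feature_names, iris.data[0]))
for name, value in first_flower.items():
    print(f"{name}: {value}")
```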

We can also look at the target variable, which holds the class label of each observation.

iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Let’s see what these numbers refer to.

iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The iris.target_names variable gives the species names for the numbers in the iris.target variable. The number 0 corresponds to setosa, 1 to versicolor and 2 to virginica.
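
Because iris.target_names is a NumPy array, we can index it with the integer labels to get one species name per flower, and count how many flowers fall in each class. A short sketch of both steps (the names species, labels and counts are illustrative):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Index the names array with the integer labels: one name per observation
species = iris.target_names[iris.target]
print(species[:3])

# Count how many observations fall in each class
labels, counts = np.unique(iris.target, return_counts=True)
for label, count in zip(labels, counts):
    print(label, iris.target_names[label], count)
```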

Viewing the iris dataset with pandas –

We can also convert this iris dataset to a pandas dataframe for easier exploration.

import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.head()

This looks much better than the raw NumPy array.
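
The dataframe built this way holds only the four features. If we also want the class of each flower in the same table, one option is to add a categorical column built from the target codes. The sketch below uses a separate dataframe called labeled_df and a column called species; both names are illustrative choices.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Build the feature table, then attach the class name of each row
labeled_df = pd.DataFrame(iris.data, columns=iris.feature_names)
labeled_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(labeled_df.head())

# Mean sepal length per species
print(labeled_df.groupby("species", observed=True)["sepal length (cm)"].mean())
```

With the species column in place, per-class summaries become one-line groupby calls.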

Let’s take a quick look at the histogram of the values in the dataframe for sepal length.

iris_df['sepal length (cm)'].hist(bins=30);

You can also color the histogram by the target variable.

for class_number in np.unique(iris.target):
    plt.figure(1)
    iris_df['sepal length (cm)'].iloc[np.where(iris.target == class_number)[0]].hist(bins=30);

Here, we iterate over the class numbers in the target variable and draw a histogram for each class on the same figure, so the classes appear in different colors. The line below

np.where(iris.target == class_number)[0]

finds the positions (row indices) of the observations that belong to the given class.
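
The plot above has no legend, so it is hard to tell which color belongs to which species. As one possible variation, the sketch below selects the rows with a boolean mask instead of np.where and labels each histogram. The bin count and the transparency value are arbitrary choices made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

fig, ax = plt.subplots()
for class_number, class_name in enumerate(iris.target_names):
    # Boolean mask selecting the rows that belong to this class
    mask = iris.target == class_number
    ax.hist(iris_df.loc[mask, "sepal length (cm)"], bins=15, alpha=0.6, label=class_name)

ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("count")
ax.legend()
```

In a notebook the figure is displayed automatically; in a plain script you would call plt.show() at the end.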
