How to Randomly Sample Data from a Pandas DataFrame?

The sample method in Pandas lets us randomly sample data from a DataFrame.

Syntax –

dataframe.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

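n – The number of rows to return. Defaults to 1 if frac is not given. Cannot be used together with frac.

frac – The fraction of rows to return, like 0.5 for 50% of the rows. Cannot be used together with n.

replace – Whether to sample with or without replacement. By default, sampling is done without replacement.

weights – The sampling weights. By default, every row is equally likely to be selected; a larger weight makes a row (or column) more likely to be sampled.

random_state – The seed for the random number generator.

axis – Whether to sample rows or columns. By default, rows are sampled.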
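The examples further down all sample rows, so here is a minimal sketch of the axis parameter sampling columns instead. The toy DataFrame and its column names are made up for illustration and are not part of the dataset used in this post.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the axis parameter
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "d": [10, 11, 12],
})

# axis=1 samples columns; every row is kept
two_cols = toy.sample(n=2, axis=1, random_state=0)
print(two_cols.shape)  # (3, 2)
```

The result has all 3 rows but only 2 of the 4 columns.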

Example –

Let’s read a dataset to work with.

import pandas as pd

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/clothing_store_sales.csv'
df = pd.read_csv(url)
df.head()

1. Randomly Sample N Data Points from the DataFrame –

To randomly sample n data points from the dataframe, we can use the n parameter of the sample method in pandas.

Let’s say we want to randomly sample 10 data points from the dataframe.

df.sample(n=10, random_state=42)

We use the random_state parameter for reproducibility. If you run the above code again, you will get the same set of 10 data points.
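To check the reproducibility claim without downloading anything, here is a small sketch on a made-up DataFrame (the data and the column name value are assumptions for illustration). It draws two samples with the same seed and compares them.

```python
import pandas as pd

# Hypothetical data: 100 rows with a single column
toy = pd.DataFrame({"value": range(100)})

first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)

# Same seed gives the same rows in the same order
print(first.equals(second))  # True
```

If you leave out random_state, each call will generally return a different set of rows.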

2. Randomly Sample a Fraction of Data Points from the DataFrame –

To randomly sample a fraction of the data points, we can use the frac parameter of the sample method.

Let’s say we want to randomly sample 20% of the data from the dataframe.

df.sample(frac=0.2, random_state=42)

Only the top few rows of the output are shown here.
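To see how frac translates into a row count, here is a minimal sketch on a made-up 50-row DataFrame (hypothetical data, not the sales dataset). It also shows frac=1, which returns every row in random order and is a common way to shuffle a DataFrame.

```python
import pandas as pd

# Hypothetical data: 50 rows
toy = pd.DataFrame({"value": range(50)})

subset = toy.sample(frac=0.2, random_state=42)
print(len(subset))  # 10, i.e. 20% of 50 rows

# frac=1 keeps all rows but returns them in random order
shuffled = toy.sample(frac=1, random_state=42)
print(len(shuffled))  # 50
```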

3. Random Sampling Without Replacement –

By default, pandas does random sampling without replacement, i.e. the same data point can’t be selected more than once. You can set this explicitly using the replace parameter.

df.sample(n=5, replace=False)
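One consequence of sampling without replacement is that you cannot ask for more rows than the DataFrame has. The sketch below uses a made-up 5-row DataFrame (an assumption for illustration) to show both the uniqueness of the sampled rows and the error pandas raises when n is too large.

```python
import pandas as pd

# Hypothetical data: 5 rows
toy = pd.DataFrame({"value": range(5)})

sample = toy.sample(n=5, replace=False, random_state=0)
print(sample.index.is_unique)  # True: no row appears twice

# Asking for 6 rows out of 5 without replacement raises a ValueError
try:
    toy.sample(n=6, replace=False)
except ValueError as err:
    print("ValueError:", err)
```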

4. Random Sampling With Replacement –

To do random sampling with replacement, set the replace parameter to True, i.e. the same data point can be selected more than once.

df.sample(n=5, replace=True)
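With replacement, the sample can even be larger than the DataFrame itself. Here is a minimal sketch on a made-up 5-row DataFrame (hypothetical data) that draws 8 rows, so at least one row has to repeat.

```python
import pandas as pd

# Hypothetical data: 5 rows
toy = pd.DataFrame({"value": range(5)})

big = toy.sample(n=8, replace=True, random_state=0)
print(len(big))             # 8
print(big.index.is_unique)  # False: 8 draws from 5 rows must repeat
```

This is the idea behind bootstrap resampling, where you repeatedly draw samples of the same size as the original data with replace=True.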

5. Using a DataFrame Column as Weights –

To weight the sampling by a column, pass the column name to the weights parameter. Rows with larger values in that column are more likely to be sampled.

df.sample(n=5, weights='Net Sales')
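To make the effect of weights easier to see, here is a small sketch on a made-up DataFrame (the item and sales columns are hypothetical). Rows with a weight of zero can never be selected, so only the two rows with non-zero sales can appear in the sample.

```python
import pandas as pd

# Hypothetical data: two rows have zero weight
toy = pd.DataFrame({
    "item": ["a", "b", "c", "d"],
    "sales": [0, 0, 10, 90],
})

picked = toy.sample(n=2, weights="sales", random_state=1)
print(sorted(picked["item"]))  # ['c', 'd']
```

The weights do not need to add up to 1, because pandas normalizes them before sampling.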
