The sample method in pandas lets us randomly sample data from a DataFrame.
dataframe.sample(n, frac, replace, weights, random_state, axis)
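n – The number of rows to return. Defaults to 1 when frac is not given. Cannot be used together with frac.
frac – The fraction of rows to return, like 0.5 for 50% of the rows.
replace – Whether to sample with or without replacement. By default, sampling is done without replacement.
weights – Sampling weights for the rows or columns. Items with larger weights are more likely to be picked. By default, all items are equally likely.
random_state – The seed of the random generator, used to make results reproducible.
axis – Whether to sample rows or columns. By default, rows are sampled.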
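As a small illustration of the axis parameter, the sketch below samples columns instead of rows. The demo DataFrame and its values are made up for this example and are not the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the axis parameter.
demo = pd.DataFrame({
    "Customer": [1, 2, 3],
    "Items": [2, 5, 1],
    "Net Sales": [25.0, 90.5, 12.0],
})

# axis=1 samples columns; the default samples rows.
two_cols = demo.sample(n=2, axis=1, random_state=0)
print(two_cols.columns.tolist())
```

All three rows are kept; only two of the three columns are returned.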
Let’s read a dataset to work with.
import pandas as pd

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/clothing_store_sales.csv'
df = pd.read_csv(url)
df.head()
1. Randomly Sample N Data Points from the DataFrame –
To randomly sample n data points from the dataframe, we can use the n parameter of the sample method.
Let’s say we want to randomly sample 10 data points from the dataframe.
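A minimal sketch of this call is shown below. To keep it runnable without a download, it uses a made-up demo DataFrame in place of df, and the seed 42 is an arbitrary choice. With the dataset loaded earlier, the equivalent call would be df.sample(n=10, random_state=42).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the clothing store data: 100 rows of made-up values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Customer": np.arange(1, 101),
    "Net Sales": rng.uniform(10, 200, size=100).round(2),
})

# Draw 10 rows; a fixed random_state makes the draw repeatable.
sample_10 = demo.sample(n=10, random_state=42)
sample_10_again = demo.sample(n=10, random_state=42)

print(sample_10)
print(sample_10.equals(sample_10_again))  # same seed, same rows
```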
We use the random_state parameter for reproducibility. Running the same call again with the same seed returns the same set of 10 data points.
2. Randomly Sample a Fraction of Data Points from the DataFrame –
To randomly sample a fraction of the data points, we can use the frac parameter of the sample method.
Let’s say we want to randomly sample 20% of the data from the dataframe.
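This could look roughly like the following. The demo DataFrame is a hypothetical 100-row stand-in for df, and the seed is arbitrary. With the real data the call would be df.sample(frac=0.2, random_state=42).

```python
import numpy as np
import pandas as pd

# Hypothetical 100-row stand-in for the article's dataset.
demo = pd.DataFrame({
    "Customer": np.arange(1, 101),
    "Net Sales": np.linspace(10.0, 200.0, 100),
})

# frac=0.2 asks for 20% of the rows, so 20 of the 100 rows here.
sample_20pct = demo.sample(frac=0.2, random_state=42)

print(len(sample_20pct))
print(sample_20pct.head())
```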
Only the top few rows are shown here.
3. Random Sampling Without Replacement –
By default, pandas does random sampling without replacement, i.e. the same data point can't be selected more than once. You can set this explicitly using the replace parameter.
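Here is a short sketch on a made-up five-row DataFrame (not the article's data). It sets replace=False explicitly and also shows that, without replacement, asking for more rows than exist raises a ValueError.

```python
import pandas as pd

# Hypothetical toy data with five rows.
demo = pd.DataFrame({"Net Sales": [25.0, 40.0, 55.0, 70.0, 85.0]})

# Without replacement, every sampled row is distinct.
no_repeat = demo.sample(n=5, replace=False, random_state=42)
print(no_repeat.index.is_unique)

# Requesting 6 rows from 5 is impossible without replacement.
try:
    demo.sample(n=6, replace=False)
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```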
4. Random Sampling With Replacement –
To do random sampling with replacement, set the replace parameter to True. The same data point can then be selected more than once.
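As a sketch, the example below draws 10 rows from a made-up five-row DataFrame (not the article's data), which is only possible with replacement. Since there are more draws than rows, at least one row has to repeat.

```python
import pandas as pd

# Hypothetical toy data with five rows.
demo = pd.DataFrame({"Net Sales": [25.0, 40.0, 55.0, 70.0, 85.0]})

# With replace=True, n may exceed the number of rows.
with_repeat = demo.sample(n=10, replace=True, random_state=42)

print(with_repeat)
print(with_repeat.index.duplicated().any())  # repeats are unavoidable here
```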
5. Using a DataFrame Column as Weights –
Passing a column name to the weights parameter makes rows with larger values in that column more likely to be sampled.
df.sample(n=5, weights='Net Sales')
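To make the effect of the weights visible, here is a sketch on a made-up four-row DataFrame. It reuses the 'Net Sales' column name, but the values are hypothetical. Weights that do not sum to 1 are normalized by pandas, and a row with weight 0 is never drawn.

```python
import pandas as pd

# Hypothetical toy data: A and B have zero sales, D has the most.
demo = pd.DataFrame({
    "Customer": ["A", "B", "C", "D"],
    "Net Sales": [0.0, 0.0, 30.0, 70.0],
})

# Zero-weight rows cannot be picked, so only C and D are candidates.
picked = demo.sample(n=2, weights="Net Sales", random_state=1)
print(sorted(picked["Customer"]))

# Over many draws with replacement, D should appear more often than C.
draws = demo.sample(n=1000, replace=True, weights="Net Sales", random_state=1)
counts = draws["Customer"].value_counts()
print(counts)
```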