
In this post you will learn how to randomly select rows from a DataFrame in PySpark.
sample method –
Sometimes you may want to randomly select rows from a DataFrame. You can do this with the sample method on a DataFrame, and you can sample with or without replacement.
Let’s read a dataset to work with. We will use the clothing store sales data.
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the clothing store sales CSV, treating the first row as a header
df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)
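
Before sampling, it can be useful to know how many rows the full dataset has, so we can sanity-check the size of the sample later. A minimal check, assuming the CSV loaded correctly into df:

# Count the total number of rows in the full DataFrame
total_rows = df.count()
print(total_rows)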

Let’s say we want to sample roughly 50% of the data without replacement. For that we will write:
# Sampling parameters: no replacement, keep roughly 50% of the rows,
# and fix the seed so the sample is reproducible
seed = 26
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).show(5)
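
Note that fraction is the probability of including each row, not an exact row count, so the result will contain approximately 50% of the rows rather than exactly half. To sample with replacement instead, where the same row may be selected more than once, a sketch along the same lines (reusing the seed defined above) would be:

# Sample roughly 50% of the rows, allowing a row to be picked more than once
df.sample(True, 0.5, seed).show(5)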
