In this post you will learn how to randomly select rows from a DataFrame in PySpark.
The sample method –
Sometimes you may want to randomly select rows from a DataFrame. You can do this with the DataFrame sample method, which lets you sample either with or without replacement.
Let’s read a dataset to work with. We will use the clothing store sales data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)
Let’s say we want to sample 50% of the data without replacement. For that we write:
seed = 26
withReplacement = False
fraction = 0.5

df.sample(withReplacement, fraction, seed).show(5)