How to Randomly Select Rows from a DataFrame in PySpark?

In this post, you will learn how to randomly select rows from a DataFrame in PySpark.

sample method –

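Sometimes you may want to randomly select rows from a DataFrame. You can do this with the DataFrame's sample method, which takes three arguments: withReplacement, fraction, and seed. Sampling can be done with or without replacement. Note that fraction is the probability of keeping each row, so the result contains approximately, not exactly, that share of the rows.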

Let’s read a dataset to work with. We will use the clothing store sales data.

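from pyspark.sql import SparkSession

# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Read the CSV, treating the first line as column names
df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)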

Let’s say we want to sample roughly 50% of the data without replacement. For that we will write:

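seed = 26  # fixed seed so the sample can be reproduced
withReplacement = False  # each row can be picked at most once
fraction = 0.5  # probability of keeping each row
df.sample(withReplacement, fraction, seed).show(5)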
