Sometimes you may want to randomly split a PySpark dataframe into several parts. A common example is in machine learning, where you want to create a training, validation, and test set. For this purpose you can use the randomSplit method in PySpark.
In this example we will split the dataframe into three parts. Because this method is randomized, we will also specify a seed so the split is reproducible. It's important to note that if the weights you pass don't add up to one, they will be normalized so that they do.
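To illustrate the normalization, here is a minimal sketch of the arithmetic PySpark applies to the weights: each weight is divided by the sum of all weights, so [2.0, 1.0, 1.0] behaves the same as [0.5, 0.25, 0.25].

```python
# Weights that don't sum to one, as you might pass to randomSplit
weights = [2.0, 1.0, 1.0]

# Normalize: divide each weight by the total
total = sum(weights)
normalized = [w / total for w in weights]

print(normalized)  # [0.5, 0.25, 0.25]
```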
Let’s read a dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)
Now, let's use randomSplit to split this dataframe into training, validation, and test sets.
dataframes = df.randomSplit([0.6, 0.2, 0.2], seed=26)