In our previous post we learned how to read a CSV file in PySpark. In this post we will learn how to write a PySpark DataFrame to a CSV file.
Write PySpark DataFrame to a CSV file –
Let’s first read a CSV file. We will use the Titanic dataset.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format('csv').option('header', 'true').load('../data/titanic.csv')
df.show(5)
```
Now, to write this DataFrame to a CSV file, we will write:
or we can write
Write a PySpark DataFrame to a CSV file with Header –
By default, PySpark doesn’t include the header (the column names) when saving a DataFrame to a CSV file. To change this we have to use an option in PySpark.
To include the header we have to write:
Options when writing to a csv file –
We already saw the header option, but there are many other options when writing to a CSV file in PySpark. They are listed at the end of the post.
Let’s say you want to save the DataFrame as a TSV (tab-separated) file. We can easily do this with the sep option.
Save Modes –
Save modes specify what happens if Spark finds data at the specified location.
append – Appends the output files to the list of files that already exist at that location.
overwrite – Will completely overwrite any data that already exists there.
errorIfExists – Throws an error and fails the write if data or files already exist at the specified location.
ignore – If data or files exist at the location, do nothing with the current DataFrame.
Let’s say you want to overwrite if a file already exists.
CSV Options –
As I said before, there are many options when reading or writing a CSV file in PySpark. All are listed below.