How to Write a PySpark DataFrame to a CSV File?

In our previous post we learned how to read a CSV file in PySpark. In this post we will learn how to write a PySpark DataFrame to a CSV file.

Write PySpark DataFrame to a CSV file –

Let’s first read a CSV file into a DataFrame. We will use the Titanic dataset.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.read.format('csv').option('header','true').load('../data/titanic.csv')
df.show(5)

Now, to write this DataFrame to a CSV file, we call the csv method of the DataFrame writer:

df.write.csv('../data/titanic1.csv')

or we can write

df.write.format('csv').save('../data/titanic2.csv')

Write a PySpark DataFrame to a csv file with Header –

By default, PySpark does not include the header (the column names) when saving a DataFrame to a CSV file. To include it, we have to set the header option.

To include the header we have to write

df.write.format('csv').option('header','true').save('../data/titanic3.csv')

Options when writing to a csv file –

We have already seen the header option, but PySpark supports many other options when writing a CSV file. Each one is set with option, in the same way as header.

Let’s say you want to save the DataFrame as a TSV (tab-separated) file. We can easily do this by setting the sep option to a tab character.

df.write.format('csv').option('header','true').option('sep','\t').save('../data/titanic.tsv')

Save Modes –

The save mode specifies what happens if Spark finds data at the specified location.

append – Appends the output files to the list of files that already exist at that location.

overwrite – Will completely overwrite any data that already exists there.

errorIfExists – Throws an error and fails the write if data or files already exist at the specified location. This is the default mode.

ignore – If data or files exist at the location, do nothing with the current DataFrame.

Let’s say you want to overwrite the output if data already exists at the location.

df.write.format('csv').option('header','true').mode('overwrite').save('../data/titanic3.csv')

CSV Options –

As mentioned before, there are many options when reading or writing a CSV file in PySpark. The complete list is in the Spark documentation for the CSV data source.

Related Posts –

  1. How to Read a CSV File into a DataFrame in PySpark ?
