In our previous post we learned how to read a JSON file in PySpark. In this post we will learn how to write a PySpark DataFrame to a JSON file.
Write a PySpark DataFrame to a JSON File –
Writing JSON files is just as simple as reading them. Let’s first read a dataset to work with. We will use the flights data file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format('json').load('../data/flight-data.json')
df.show(5)
Now, to write this DataFrame to a JSON file, we use the write attribute of the DataFrame together with the json format.
Save Modes –
Save modes specify what happens if Spark finds data already present at the specified location.
append – Appends the output files to the list of files that already exist at that location.
overwrite – Will completely overwrite any data that already exists there.
errorIfExists – Throws an error and fails the write if data or files already exist at the specified location.
ignore – If data or files exist at the location, does nothing with the current DataFrame.
Let’s say you want to overwrite if a file already exists.
JSON Options –
There are various options available when reading or writing JSON files in PySpark. You can set them with option(), like this.
df = spark.read.format('json').option('multiLine', 'true').load('../data/flight-data.json')
The complete list of available options is given below.
Related Posts –
- How to Read a JSON File into a DataFrame in PySpark ?
- How to Read a CSV File into a DataFrame in PySpark ?
- How to Write a PySpark DataFrame to a CSV File ?