How to Read and Write Parquet Files in PySpark?

Parquet Files –

Parquet is an open-source, column-oriented file format that provides a variety of storage optimizations, especially for analytics workloads. It offers columnar compression, which saves storage space and allows individual columns to be read instead of entire files. It works exceptionally well with Apache Spark and is in fact Spark's default file format. Spark recommends writing data out to Parquet for long-term storage because reading from a Parquet file is generally far more efficient than reading from JSON or CSV.
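As a quick illustration of the columnar layout at work, here is a minimal sketch that reads a Parquet file and selects just two columns; the physical plan printed by explain() should show a ReadSchema limited to those columns. The path and column names are assumed here for illustration, and the SparkSession created below already exists as spark in the PySpark shell.

from pyspark.sql import SparkSession

# create (or reuse) a SparkSession; in the PySpark shell one already exists as `spark`
spark = SparkSession.builder.appName('parquet-demo').getOrCreate()

# selecting two columns lets Spark scan only those columns from the Parquet file
# (the path and column names are assumed, for illustration)
df = spark.read.format('parquet').load('../data/2010-summary.parquet')
df.select('DEST_COUNTRY_NAME', 'count').explain()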

Reading Parquet Files –

Reading a Parquet file is very similar to reading CSV files; all you have to do is change the format option when reading the file.

To read a Parquet file in PySpark, you can write:

# read a parquet file
df = spark.read.format('parquet').load('../data/2010-summary.parquet')
df.show(5)
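PySpark's DataFrameReader also has a parquet() shortcut, so the same read can be written more compactly:

# shorthand equivalent of format('parquet').load(...)
df = spark.read.parquet('../data/2010-summary.parquet')
df.printSchema()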

Writing Parquet Files –

Writing Parquet is just as easy as reading it. We simply specify the format, a save mode, and the location for the file:

df.write.format('parquet').mode('overwrite').save('../data/flight-data.parquet')
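If you expect to filter later on a particular column, the output can also be partitioned on that column at write time. A minimal sketch, assuming the DataFrame has a DEST_COUNTRY_NAME column (the column name and output path below are illustrative):

# partition the output by a column so that filters on it can skip whole directories
# (the column name and output path are assumed, for illustration)
df.write.format('parquet') \
    .mode('overwrite') \
    .partitionBy('DEST_COUNTRY_NAME') \
    .save('../data/flight-data-partitioned.parquet')

Queries that later filter on the partition column can then prune entire directories instead of scanning the full dataset.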

Related Posts –

  1. How to Read a CSV File into a DataFrame in PySpark?
  2. How to Read a JSON File into a DataFrame in PySpark?
