Parquet Files –
Parquet is an open-source, column-oriented file format that provides a variety of storage optimizations, especially for analytics workloads. It provides columnar compression, which saves storage space, and its columnar layout allows reading individual columns instead of entire files. It works exceptionally well with Apache Spark and is in fact Spark's default file format. Writing data out to Parquet is recommended for long-term storage because reading from a Parquet file is generally more efficient than reading from JSON or CSV.
Reading Parquet Files –
Reading a Parquet file is very similar to reading a CSV file. All you have to do is change the format when reading the file.
To read a Parquet file in PySpark, you can write:
# read a parquet file
df = spark.read.format('parquet').load('../data/2010-summary.parquet')
df.show(5)
Writing Parquet Files –
Writing a Parquet file is as easy as reading one. Use the DataFrame's writer, set the format to parquet, and specify the location to save to.