# How to Read and Write ORC Files in PySpark?

## ORC Files

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but has integrated support for finding required rows quickly. ORC has no file-format-specific read options because Spark understands the format quite well. An often asked question is: what is the difference between ORC and Parquet? For the most part they are quite similar; the fundamental difference is that Parquet is further optimized for use with Spark, whereas ORC is further optimized for Hive.

To read an ORC file in PySpark, we have to write

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the ORC file into a DataFrame
df = spark.read.format('orc').load('../data/flight-data.orc')
df.show(5)

# Write the DataFrame back out as ORC, replacing any existing output
df.write.format('orc').mode('overwrite').save('../data/flight-data.orc')
```