
ORC Files –
ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, with integrated support for finding required rows quickly. ORC exposes essentially no read options because Spark understands the file format quite well. A frequently asked question is: what is the difference between ORC and Parquet? For the most part they are quite similar; the fundamental difference is that Parquet is further optimized for use with Spark, whereas ORC is further optimized for Hive.
Reading ORC Files –
To read an ORC file in PySpark, we write
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

# Load the ORC file into a DataFrame and show the first 5 rows
df = spark.read.format('orc').load('../data/2010-summary.orc')
df.show(5)
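The reader also exposes an orc() shorthand that is equivalent to the format('orc').load(...) call above; a minimal sketch, reusing the same sample path:

df = spark.read.orc('../data/2010-summary.orc')
df.show(5)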
Writing ORC Files –
And to save a PySpark DataFrame to an ORC file, we write
# Write the DataFrame as ORC, overwriting any existing output at the path
df.write.format('orc').mode('overwrite').save('../data/flight-data.orc')
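ORC writes accept the usual DataFrameWriter options; for example, the compression codec can be set with option('compression', ...) and the output can be split into directories with partitionBy. A brief sketch, where the column name DEST_COUNTRY_NAME and the output path are assumptions for illustration:

# 'snappy' is one of the codecs the ORC writer accepts; DEST_COUNTRY_NAME is
# a hypothetical column here, replace it with a column from your DataFrame
df.write.format('orc') \
    .mode('overwrite') \
    .option('compression', 'snappy') \
    .partitionBy('DEST_COUNTRY_NAME') \
    .save('../data/flight-data-partitioned.orc')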
Related Posts –
- How to Read a CSV File into a DataFrame in PySpark?
- How to Read a JSON File into a DataFrame in PySpark?
- How to Read and Write Parquet Files in PySpark?