How to Read a Parquet File in Pandas?


To read a Parquet file in pandas, we use the read_parquet() function, which returns a DataFrame.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop.

Syntax –

pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=False, **kwargs)

Parameters –

  • path: The path to the Parquet file. It can also point to a directory containing multiple files, or be a URL. Valid URL schemes include http, ftp, s3, gs, and file.
  • engine: The Parquet library to use. Available options are auto, pyarrow, and fastparquet. With auto, pandas tries pyarrow first and falls back to fastparquet if pyarrow is unavailable.
  • columns: A list of column names to read into the data frame. If None (the default), all columns are read.
  • storage_options: Extra options for a certain storage connection, such as host, port, username, password, and so on.
  • use_nullable_dtypes: A boolean parameter. If True, the resulting data frame uses dtypes that have pd.NA as the missing value indicator. This parameter is deprecated as of pandas 2.0 in favor of dtype_backend.

Examples –

Before we read a Parquet file in pandas, we need a Parquet engine installed. This example uses pyarrow (fastparquet is the other supported option). Let’s use pip to install it.

pip install pyarrow

Once installed, we can use the read_parquet() function to read a Parquet file in pandas.

import pandas as pd
# Load the Parquet file into a data frame and preview the first five rows
df = pd.read_parquet('clothing_store_sales.parquet')
df.head()

