# How to Read a Parquet File in Pandas?

To read a Parquet file in pandas, we use the read_parquet() function.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop.

### Syntax –

```python
pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=False, **kwargs)
```

### Parameters –

• path: The path to the Parquet file. It can also point to a directory containing multiple files, or be a valid file URL. Valid URL schemes are http, ftp, s3, gs, and file.
• engine: This parameter indicates which parquet library to use. Available options are auto, pyarrow, or fastparquet.
• columns: This parameter indicates the columns to be read into the data frame.
• storage_options: Extra options for a certain storage connection, such as host, port, username, password, and so on.
• use_nullable_dtypes: This is a boolean parameter. If True, the resulting data frame uses nullable dtypes that use pd.NA as the missing value indicator.

## Examples –

Before we can read a Parquet file in Pandas, we need a Parquet engine such as pyarrow. Let’s use pip to install it.

```shell
pip install pyarrow
```

Once installed, we can use the read_parquet() function to read a Parquet file in Pandas.

```python
import pandas as pd

# Read the Parquet file into a DataFrame and preview the first rows
df = pd.read_parquet('clothing_store_sales.parquet')
df.head()
```

### Related Posts –

1. How to Write a Pandas DataFrame to Parquet File?
