To write a pandas DataFrame to a Parquet file, we use the to_parquet() method in pandas.
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop.
DataFrame.to_parquet(path=None, engine='auto', compression='snappy', index=None, partition_cols=None, storage_options=None, **kwargs)
path: This is the path of the Parquet file to write. If None, the result is returned as bytes instead of being written to disk.
engine: This parameter indicates which Parquet library to use. The available options are 'auto', 'pyarrow', and 'fastparquet'. The default, 'auto', tries pyarrow first and falls back to fastparquet.
compression: This parameter indicates the type of compression to use. The available options are 'snappy', 'gzip', and 'brotli', or None for no compression. The default compression is 'snappy'.
index: This is a boolean parameter. If True, the DataFrame's indexes are written to the file. If False, the indexes are ignored. If None (the default), the indexes are saved, but a plain RangeIndex is stored as metadata rather than as a column.
partition_cols: These are the names of the columns that partition the DataFrame. The order in which the columns are given determines the order in which they are partitioned.
storage_options: These are the extra options for a certain storage connection, such as a host, port, username, password, and so on.
Let’s read a dataset in pandas.
import pandas as pd

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/clothing_store_sales.csv'
df = pd.read_csv(url)
df.head()
Before we can write a DataFrame to a Parquet file, we need to install pyarrow or fastparquet. Let's install pyarrow using pip.
pip install pyarrow
Now, we can write a pandas dataframe to a parquet file using the to_parquet() method.
# write to parquet file
df.to_parquet("clothing_store_sales.parquet")