How to Write a Pandas DataFrame to a Parquet File?

To write a pandas DataFrame to a Parquet file, we use the to_parquet() method in pandas.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop.

Syntax –

DataFrame.to_parquet(path=None, engine='auto', compression='snappy', index=None, partition_cols=None, storage_options=None, **kwargs)

Parameters –

  • path: String or path object pointing to the output file (or to a directory, when partition_cols is used). If None, the result is returned as bytes instead of being written to disk.
  • engine: This parameter indicates which Parquet library to use. The available options are auto, pyarrow, and fastparquet; auto (the default) tries pyarrow first and falls back to fastparquet if pyarrow is not installed.
  • compression: This parameter indicates the type of compression to use. The available options are snappy, gzip, brotli, or None for no compression. The default compression is snappy.
  • index: This is a boolean parameter. If True, the DataFrame’s index(es) are written to the file; if False, they are omitted. If None (the default), the behaviour is left to the engine.
  • partition_cols: These are the names of the columns by which to partition the DataFrame. The order in which the columns are given determines the order in which they are partitioned (see the sketch after this list).
  • storage_options: These are the extra options for a particular storage connection, such as a host, port, username, password, and so on.
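To make engine, compression, index, and partition_cols concrete, here is a minimal sketch. The DataFrame and the output directory name sales_by_city are invented for this illustration, and it assumes pyarrow is installed (installation is covered in the Examples section below):

import pandas as pd

# a small, invented DataFrame just for this illustration
df_example = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "sales": [120, 340, 90],
})

# writes a directory sales_by_city/ containing one subdirectory per
# city value (city=Delhi/, city=Pune/), each holding the rows for
# that city; files are gzip-compressed and the index is dropped
df_example.to_parquet(
    "sales_by_city",
    engine="pyarrow",
    compression="gzip",
    index=False,
    partition_cols=["city"],
)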

Examples –

First, let’s read a dataset into pandas.

import pandas as pd

# load a sample clothing store sales dataset from GitHub
url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/clothing_store_sales.csv'
df = pd.read_csv(url)
df.head()

Before we can write a DataFrame to a Parquet file, we need to install either pyarrow or fastparquet. Let’s install pyarrow using pip.

pip install pyarrow
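fastparquet works just as well. If you would rather use it, install it the same way and either leave engine='auto' or pass engine='fastparquet' explicitly:

pip install fastparquet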

Now we can write the DataFrame to a Parquet file using the to_parquet() method.

# write to parquet file
df.to_parquet("clothing_store_sales.parquet")
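To check that the write round-trips cleanly, we can read the file back with pd.read_parquet(); the file name matches the one written above:

# read the Parquet file back into a DataFrame to verify the write
df_check = pd.read_parquet("clothing_store_sales.parquet")
df_check.head()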

Related Posts –

  1. How to Read a Parquet File in Pandas?
