How to Create a DataFrame in PySpark?


In this post we will learn various ways to create a DataFrame in PySpark.

Create a DataFrame in PySpark –

Let’s first see how to create a DataFrame manually in PySpark.

Here we have some data about restaurants.

Now, let's say we want to create a PySpark DataFrame using this data.

To do that, we first define the schema of the DataFrame.

from pyspark.sql.types import StructType, StructField, StringType, LongType

ManualSchema = StructType([
    StructField("Restaurant", LongType(), True),
    StructField("Quality Rating", StringType(), True),
    StructField("Mean Price", LongType(), True)
])

Here, we define the name of each column, the data type it holds, and a Boolean flag that specifies whether the column can contain missing or null values.

Now, we will create Rows that hold the data.

from pyspark.sql import Row

data = [
    Row(1, "Good", 18),
    Row(2, "Very Good", 22),
    Row(3, "Good", 28),
    Row(4, "Excellent", 38),
    Row(5, "Very Good", 33)
]

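Now, we can use the createDataFrame method to create the DataFrame.

from pyspark.sql import SparkSession

# Reuse the active SparkSession if one exists, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

# Combine the rows with the schema defined earlier
df = spark.createDataFrame(data, ManualSchema)
df.show()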

Create a PySpark DataFrame from a Pandas DataFrame –

Let's create a Pandas DataFrame before converting it to a PySpark DataFrame.

import pandas as pd

file_path = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/Restaurant.csv'
pandasDf = pd.read_csv(file_path)
pandasDf.head()

Now, we can convert this Pandas DataFrame to a PySpark DataFrame by passing it directly to the createDataFrame method.

sparkDf = spark.createDataFrame(pandasDf)
sparkDf.show(5)

Create a PySpark DataFrame from a CSV file –

You can create a PySpark DataFrame from a CSV file like this.

df = spark.read.format('csv').option('header','true').load('../data/Restaurant.csv')
df.show(5)

To learn more about it, read this post: How to Read a CSV File into a DataFrame in PySpark?

Create PySpark DataFrame from a JSON file –

You can also create a PySpark DataFrame from a JSON file like this.

df = spark.read.format('json').load('../data/flight-data.json')
df.show(5)

To learn more about it, read this post: How to Read a JSON File into a DataFrame in PySpark?
