
In this post we will learn various ways to create a DataFrame in PySpark.
Create a DataFrame in PySpark –
Let’s first see how to create a DataFrame manually in PySpark.
Here we have some data about restaurants.

Now, let’s say we want to create a PySpark DataFrame using this data.
To do that, we first define the schema of the DataFrame.
from pyspark.sql.types import StructType, StructField, StringType, LongType
ManualSchema = StructType([
    StructField("Restaurant", LongType(), True),
    StructField("Quality Rating", StringType(), True),
    StructField("Mean Price", LongType(), True)
])
Here, for each column we define its name, the data type it holds, and a Boolean flag that specifies whether the column can contain missing or null values.
Now, we will create Rows that hold the data.
from pyspark.sql import Row
data = [
    Row(1, "Good", 18),
    Row(2, "Very Good", 22),
    Row(3, "Good", 28),
    Row(4, "Excellent", 38),
    Row(5, "Very Good", 33)
]
Now, we can use the createDataFrame method to create the DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, ManualSchema)
df.show()

Create a PySpark DataFrame from Pandas DataFrame –
Let’s create a pandas DataFrame before converting it to a PySpark DataFrame.
import pandas as pd
file_path = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/Restaurant.csv'
pandasDf = pd.read_csv(file_path)
pandasDf.head()

Now, we can convert this pandas DataFrame to a PySpark DataFrame by passing it directly to the createDataFrame method.
sparkDf = spark.createDataFrame(pandasDf)
sparkDf.show(5)

Create PySpark DataFrame from a CSV file –
You can create a PySpark DataFrame from a CSV file like this.
df = spark.read.format('csv').option('header','true').load('../data/Restaurant.csv')
df.show(5)

To learn more, read this post – How to Read a CSV File into a DataFrame in PySpark?
Create PySpark DataFrame from a JSON file –
You can also create a PySpark DataFrame from a JSON file like this.
df = spark.read.format('json').load('../data/flight-data.json')
df.show(5)

To learn more, read this post – How to Read a JSON File into a DataFrame in PySpark?