
In this post you will learn how to read a CSV file into a DataFrame in PySpark and the various options available when reading one.
Reading a CSV file in PySpark –
CSV stands for comma-separated values. It is a common text file format in which each line represents a single record and commas separate the fields within a record.
To read a CSV file, like any other format, we must first create a DataFrameReader for that format. Here we specify the format as csv since we are reading a CSV file.
spark.read.format('csv')
After this, we can optionally specify a schema, a read mode, and other options.
Let’s read a CSV file to illustrate this. We will use the titanic dataset.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv').load('../data/titanic.csv')
df.show(5)

You can see that we have successfully read the CSV file, but the column names look a little strange. Instead of PassengerId and Survived we get _c0 and _c1. This is because the header option is false by default, so the first row is treated as data rather than column names. To correct this we need to use options when reading the CSV file.
Options when reading a CSV file –
To get the correct column names (headers), we have to set the header option to true.
df = spark.read.format('csv').option('header', 'true').load('../data/titanic.csv')
df.show(5)

If you want, you can chain multiple options like this. Let’s use the inferSchema option to infer the column types when reading a file.
df = spark.read.format('csv').option('header', 'true').option('inferSchema','true').load('../data/titanic.csv')
Or use the options shortcut method, which takes the options as keyword arguments.
df = spark.read.format('csv').options(header='true', inferSchema='true').load('../data/titanic.csv')
If you want, you can also define the schema manually.
Let’s create the schema first.
from pyspark.sql.types import StructType, StructField, StringType, LongType, FloatType
manualSchema = StructType([
    StructField("PassengerId", LongType(), True),
    StructField("Survived", LongType(), True),
    StructField("Pclass", LongType(), True),
    StructField("Name", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("Age", FloatType(), True),  # Age can be fractional (e.g. 0.42), so a float type avoids nulls
    StructField("SibSp", LongType(), True),
    StructField("Parch", LongType(), True),
    StructField("Ticket", StringType(), True),
    StructField("Fare", FloatType(), True),
    StructField("Cabin", StringType(), True),
    StructField("Embarked", StringType(), True)
])
Now let’s apply it.
df = spark.read.format('csv').option('header','true').schema(manualSchema).load('../data/titanic.csv')
df.show(5)
CSV Options –
There are various CSV options available when reading or writing a CSV file in PySpark; the full list can be found in the Spark documentation for the CSV data source.