How to Read a CSV File into a DataFrame in PySpark?

In this post, you will learn how to read a CSV file into a DataFrame in PySpark and explore the various options available when reading a CSV file.

Reading a CSV file in PySpark –

CSV stands for comma-separated values. It is a common text file format in which each line represents a single record and commas separate the fields within a record.
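For example, the first few lines of such a file might look like this (a simplified illustration showing only a few of the Titanic columns):

PassengerId,Survived,Pclass,Name
1,0,3,"Braund, Mr. Owen Harris"
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"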

To read a CSV file, as with any other format, we must first create a DataFrameReader and specify the format. Here we set the format to csv since we are reading a CSV file.

spark.read.format('csv')

After this, we can optionally specify a schema as well as a read mode via options.
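For example, the mode option controls how malformed records are handled: PERMISSIVE (the default) keeps the record and sets fields it cannot parse to null, DROPMALFORMED skips the record entirely, and FAILFAST raises an error on the first bad record. A minimal sketch:

# Fail immediately if a malformed record is encountered
df = spark.read.format('csv').option('mode', 'FAILFAST').load('../data/titanic.csv')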

Let’s read a CSV file to illustrate this. We will use the Titanic dataset.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv').load('../data/titanic.csv')
df.show(5)

You can see that we have successfully read the CSV file, but the column names look a little odd. Instead of PassengerId and Survived, they are _c0 and _c1. The reason is that the header option is false by default, so the first line of the file is treated as data rather than as column names. To correct this we need to use options when reading the CSV file.

Options when reading a CSV file –

To get the correct column names (the headers), we have to use the header option.

df = spark.read.format('csv').option('header', 'true').load('../data/titanic.csv')
df.show(5)

If you want, you can chain multiple options like this. Let’s use the inferSchema option to infer the column types when reading the file.

df = spark.read.format('csv').option('header', 'true').option('inferSchema','true').load('../data/titanic.csv')
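Without inferSchema, every column is read as a string; with it, Spark samples the data and assigns types such as integer and double. You can check the result with printSchema:

df.printSchema()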

Or use the options method as a shortcut, passing all options as keyword arguments.

df = spark.read.format('csv').options(header='true', inferSchema='true').load('../data/titanic.csv')

If you want, you can also specify the schema manually instead of inferring it.

Let’s create the schema first. Each StructField takes the column name, its type, and a flag indicating whether the column may contain nulls.

from pyspark.sql.types import StructType, StructField, StringType, LongType, FloatType

manualSchema = StructType([
    StructField("PassengerId", LongType(), True),
    StructField("Survived", LongType(), True),
    StructField("Pclass", LongType(), True),
    StructField("Name", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("Age", LongType(), True),
    StructField("SibSp", LongType(), True),
    StructField("Parch", LongType(), True),
    StructField("Ticket", StringType(), True),
    StructField("Fare", FloatType(), True),
    StructField("Cabin", StringType(), True),
    StructField("Embarked", StringType(), True)
])

Now let’s apply it.

df = spark.read.format('csv').option('header','true').schema(manualSchema).load('../data/titanic.csv')
df.show(5)
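Once the columns have proper types, numeric operations behave as expected. As a quick sanity check (just a sketch; any aggregation would do), we can compute the average fare:

from pyspark.sql import functions as F
df.select(F.avg('Fare')).show()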

CSV Options –

There are many other options available when reading or writing a CSV file in PySpark, such as sep (the field delimiter), nullValue, dateFormat, and mode. The full list can be found in the official Spark documentation for the CSV data source.
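As a sketch of how a few of these options might be combined, here is a read of a hypothetical semicolon-delimited file (the path, delimiter, and null marker are assumptions for illustration):

df = spark.read.format('csv').options(header='true', sep=';', nullValue='NA', dateFormat='yyyy-MM-dd').load('../data/other_data.csv')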
