What is Schemas in PySpark?

Spread the love

Schemas –

A schema defines the column names and types of a DataFrame. We can either let a data source define the schema (called schema-on-read) or we can define it explicitly ourselves.

Let’s read a csv dataset in PySpark.

df = spark.read.format('csv').option('header','true').load('../data/titanic.csv')
df.show(5)

Now we can look at the schema of this dataframe using DataFrame.schema

df.schema
StructType(List(StructField(PassengerId,StringType,true),StructField(Survived,StringType,true),StructField(Pclass,StringType,true),StructField(Name,StringType,true),StructField(Sex,StringType,true),StructField(Age,StringType,true),StructField(SibSp,StringType,true),StructField(Parch,StringType,true),StructField(Ticket,StringType,true),StructField(Fare,StringType,true),StructField(Cabin,StringType,true),StructField(Embarked,StringType,true)))

A schema is a StructType made up of a number of fields , StructFields, that have a name, type, a Boolean flag which specifies whether that column can contains missing or null values and finally users can optionally specify associated metadata with that column.

Schemas can contain other StructTypes (Spark’s complex types). If the types in the data at runtime do not match the schema, spark will throw an error.

Manually Defining the Schema –

When using spark for production Extract, Transform and Load (ETL), it is often a good idea to define your schemas manually, especially when working with untyped data sources like CSV and JSON because schema inference can vary depending on the type of data that you read in.

Let’s see how to define a Schema manually.

from pyspark.sql.types import StructType, StructField, StringType, LongType, FloatType

manualSchema = StructType([
    StructField("PassengerId", LongType(), True),
    StructField("Survived", LongType(), True),
    StructField("Pclass", LongType(), True),
    StructField("Name", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("Age", LongType(), True),
    StructField("SibSp", LongType(), True),
    StructField("Parch", LongType(), True),
    StructField("Ticket", StringType(), True),
    StructField("Fare", FloatType(), True),
    StructField("Cabin", StringType(), True),
    StructField("Embarked", StringType(), True)
])
df = spark.read.format('csv').schema(manualSchema).option("header", "true").load('../data/titanic.csv')
df.show(5)

Rating: 1 out of 5.

Leave a Reply