How to Compute Pearson Correlation Coefficient in PySpark?


To compute the Pearson correlation coefficient in PySpark, we use the corr() function from pyspark.sql.functions, or the DataFrame method df.stat.corr().

Syntax –

corr(column1, column2)
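Under the hood, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations, giving a value between -1 and 1. A minimal plain-Python sketch of that formula (illustrative only, not PySpark code):

```python
import math

def pearson(xs, ys):
    # Pearson r = cov(x, y) / (std(x) * std(y))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear -> 1.0
```

A result near 1 means a strong positive linear relationship, near -1 a strong negative one, and near 0 little or no linear relationship.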

Read a Dataset –

Let’s read a dataset to work with. We will use the clothing store sales data.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load('../data/clothing_store_sales.csv')
df.show(5)

Compute Pearson Correlation Coefficient in PySpark –

Let’s compute the Pearson correlation coefficient between the Net Sales and Age columns.

from pyspark.sql.functions import corr
df.select(corr("Net Sales", "Age")).show()

You can also compute it with df.stat.corr(), which returns the coefficient directly as a Python float instead of a one-row DataFrame –

df.stat.corr("Net Sales", "Age")
#output
-0.010635891709415892

A value this close to 0 indicates essentially no linear relationship between Net Sales and Age in this dataset.
