
To compute the Pearson correlation coefficient between two columns in PySpark, we use the corr() function.
Syntax –
corr(column1, column2)
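The corr() function is available both in pyspark.sql.functions (for use inside select() or agg()) and as DataFrame.stat.corr(), which returns the coefficient directly as a Python float; both variants are shown below.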
Read a Dataset –
Let’s read a dataset to work with. We will use the clothing store sales data.
from pyspark.sql import SparkSession

# create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

# read the CSV file, treating the first row as a header and inferring column types
df = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load('../data/clothing_store_sales.csv')

df.show(5)
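Before computing the correlation, it is worth confirming that the two columns we need were inferred as numeric types (this assumes the file contains Net Sales and Age columns, which we use below):

df.printSchema()  # Net Sales and Age should appear with numeric (e.g. double / integer) types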

Compute the Pearson Correlation Coefficient in PySpark –
Let’s compute the Pearson correlation coefficient of the Net Sales and Age columns.
from pyspark.sql.functions import corr

# corr() returns a one-row DataFrame holding the Pearson correlation coefficient
df.select(corr("Net Sales", "Age")).show()
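The result comes back as a one-row DataFrame. If you prefer a plain Python number, you can alias the column and extract it with first(); a minimal sketch (the alias pearson_corr is just an illustrative name):

from pyspark.sql.functions import corr

# alias the correlation column and pull the value out of the single Row
pearson = df.select(corr("Net Sales", "Age").alias("pearson_corr")).first()["pearson_corr"]
print(pearson)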

You can also compute it like this –
df.stat.corr("Net Sales", "Age")
# output
-0.010635891709415892
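
df.stat.corr() returns the coefficient directly as a Python float rather than a DataFrame. A value this close to zero suggests there is essentially no linear relationship between Net Sales and Age in this data.

Since corr() is also an aggregate function, you can compute the coefficient per group with groupBy() and agg(). A quick sketch, assuming the dataset has a Gender column (swap in any grouping column from your own data):

from pyspark.sql.functions import corr

# Pearson correlation of Net Sales and Age within each group
df.groupBy("Gender") \
    .agg(corr("Net Sales", "Age").alias("pearson_corr")) \
    .show()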