How to Calculate Correlation in Python

What is Correlation in Statistics?

Correlation in statistics refers to the degree to which two or more variables fluctuate together. If the change in one variable systematically relates to the change in another variable, we say that these variables are correlated.

Correlation can be positive (the variables tend to increase or decrease together), negative (one variable tends to increase as the other decreases), or zero (no linear relationship). The correlation coefficient is a statistical measure that quantifies the strength and direction of the relationship between two variables. Its values range from -1.0 to 1.0.

  • A correlation of 1.0 indicates a perfect positive correlation: whenever one variable increases, the other increases with it in a perfectly consistent way.
  • A correlation of -1.0 indicates a perfect negative correlation: whenever one variable increases, the other decreases in a perfectly consistent way.
  • A correlation of 0 indicates no linear relationship between the two variables. The variables may still be related in a non-linear way.
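The three cases listed above can be reproduced numerically. The sketch below uses the textbook definition of Pearson’s coefficient (the sum of products of the centered values, divided by the square root of the product of their sums of squares). The helper name pearson_from_definition and the sample data are made up for illustration.

```python
import numpy as np

def pearson_from_definition(a, b):
    """Pearson's r: co-variation of a and b, scaled by their individual variation."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_positive = pearson_from_definition(x, 2 * x + 1)     # exact line, positive slope
r_negative = pearson_from_definition(x, -3 * x + 10)   # exact line, negative slope
r_zero = pearson_from_definition(x, (x - 3) ** 2)      # symmetric U shape around x = 3

print(f"positive: {r_positive:.3f}")
print(f"negative: {r_negative:.3f}")
print(f"U shape:  {r_zero:.3f}")
```

The last case is worth a second look: y is completely determined by x, yet the coefficient is 0 because the relationship is not linear.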

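The most commonly used method to calculate correlation is Pearson’s correlation coefficient, which measures the strength of the linear relationship between two variables. Computing the coefficient does not require normally distributed data, but the usual significance test (p-value) for it assumes that the variables are approximately normal, and the coefficient is sensitive to outliers. Spearman’s rank correlation makes no assumption about the distribution of the variables and is more robust to outliers, because it is based on the ranked values of the variables rather than their actual values. It measures how well the relationship can be described by a monotonic function, whether linear or not.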
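To see the difference between the two coefficients, here is a small sketch on made-up data where y always increases with x, but not along a straight line. Because Spearman’s coefficient only looks at the ordering of the values, it reports a perfect relationship, while Pearson’s coefficient comes out noticeably lower.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative data: y grows exponentially, so it is strictly
# increasing in x but far from linear.
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```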

It’s important to note that correlation does not imply causation. Two variables being correlated does not mean that a change in one causes the change in the other.

How to Calculate Pearson’s Correlation Coefficient in Python?

Calculating Pearson’s correlation coefficient in Python is quite straightforward with the help of the scipy library. Here is a simple example using two lists of numbers:

from scipy.stats import pearsonr

# Two lists of numbers:
x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 6]

# pearsonr() returns the correlation coefficient and a p-value;
# the p-value is discarded here by assigning it to "_".
corr, _ = pearsonr(x, y)

print("Pearson's correlation: %.3f" % corr)  # prints: Pearson's correlation: 1.000

In this example, pearsonr(x, y) calculates Pearson’s correlation coefficient between the two lists of numbers. Because y is exactly x + 1, the result is 1.0. The pearsonr() function returns two values:

  1. The correlation coefficient.
  2. The p-value for a test of the null hypothesis that the two variables are uncorrelated.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.
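Here is a short sketch that keeps the p-value instead of discarding it. The data (hours studied and exam scores) is invented for illustration, and the 0.05 threshold is only a common convention, not something pearsonr() imposes.

```python
from scipy.stats import pearsonr

# Hypothetical data, made up for this example.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [52, 55, 61, 58, 70, 72, 75, 83]

r, p_value = pearsonr(hours_studied, exam_score)

alpha = 0.05  # assumed significance level; choose one that suits your analysis

print(f"r = {r:.3f}, p-value = {p_value:.5f}")
if p_value < alpha:
    print("The correlation is statistically significant at the chosen level.")
else:
    print("There is not enough evidence to rule out zero correlation.")
```

Keep in mind that with very small samples the p-value can be large even for a high coefficient, so it is worth reporting both numbers.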

If you’re dealing with a pandas DataFrame, you can calculate Pearson’s correlation coefficient for every pair of columns like this:

import pandas as pd

# Create a simple dataframe
df = pd.DataFrame({
   'A': [1, 2, 3, 4, 5],
   'B': [2, 3, 4, 5, 6],
   'C': [3, 2, 4, 2, 1]
})

# Compute pairwise correlation of columns, excluding NA/null values
corr_matrix = df.corr()

print(corr_matrix)

The df.corr() method uses Pearson’s correlation by default and returns a DataFrame that contains the correlation coefficient between each pair of columns. The diagonal of this matrix is 1, since a column is perfectly correlated with itself. The exception is a column whose values are all the same: it has zero variance, so its correlations are reported as NaN.
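If you only need the coefficient for one pair of columns, you can either look it up in the matrix or compute it directly with the corr() method of a pandas Series. A minimal sketch, reusing the same example DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 3, 4, 5, 6],
    'C': [3, 2, 4, 2, 1],
})

corr_matrix = df.corr()

# Look up one pair in the full matrix...
a_vs_c = corr_matrix.loc['A', 'C']

# ...or compute just that pair with Series.corr(), skipping the other columns.
a_vs_c_direct = df['A'].corr(df['C'])

print(f"A vs C from matrix: {a_vs_c:.3f}")
print(f"A vs C directly:    {a_vs_c_direct:.3f}")
```

Both approaches give the same number. The direct call is handy when the DataFrame has many columns and you care about only one pair.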

How to Calculate Spearman Rank Correlation in Python?

Calculating Spearman rank correlation in Python is quite simple using the scipy library. Here’s an example with two lists of numbers:

from scipy.stats import spearmanr

# Two lists of numbers:
x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 6]

# spearmanr() returns the correlation coefficient and a p-value;
# the p-value is discarded here by assigning it to "_".
corr, _ = spearmanr(x, y)

print("Spearman's correlation: %.3f" % corr)  # prints: Spearman's correlation: 1.000

In this example, spearmanr(x, y) calculates the Spearman rank correlation coefficient between the two lists of numbers. Because both lists are in the same order, the result is 1.0. The spearmanr() function returns two values:

  1. The correlation coefficient.
  2. The p-value for a test of the null hypothesis that the two variables are uncorrelated.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets.
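One way to build intuition for Spearman’s coefficient is that it is Pearson’s coefficient applied to the ranks of the data. The sketch below checks this on made-up lists, using scipy’s rankdata() function, which by default assigns tied values the average of their ranks.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Illustrative data; y contains a tie (the value 1 appears twice).
x = [10, 20, 30, 40, 50]
y = [3, 1, 4, 1, 5]

# Spearman's coefficient computed directly.
rho, _ = spearmanr(x, y)

# The same value obtained by ranking first, then applying Pearson's formula.
x_ranks = rankdata(x)
y_ranks = rankdata(y)
r_on_ranks, _ = pearsonr(x_ranks, y_ranks)

print(f"spearmanr:        {rho:.6f}")
print(f"pearsonr on ranks: {r_on_ranks:.6f}")
```

The two coefficients match. The p-values are not compared here, since the two functions may compute them differently.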

If you’re working with a pandas DataFrame, you can calculate the Spearman rank correlation coefficient for every pair of columns like this:

import pandas as pd

# Create a simple dataframe
df = pd.DataFrame({
   'A': [1, 2, 3, 4, 5],
   'B': [2, 3, 4, 5, 6],
   'C': [3, 2, 4, 2, 1]
})

# Compute pairwise correlation of columns using Spearman rank correlation, excluding NA/null values
corr_matrix = df.corr(method='spearman')

print(corr_matrix)

The df.corr(method='spearman') call returns a DataFrame that contains the Spearman rank correlation coefficient between each pair of columns. As with the Pearson version, the diagonal of this matrix is 1, since a column is perfectly correlated with itself, unless the column is constant, in which case its correlations are reported as NaN.
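The comment in the code above mentions that NA/null values are excluded. Here is a minimal sketch of what that means in practice, using a made-up two-column DataFrame with one missing value: the incomplete row is left out of the calculation for that pair of columns instead of turning the result into NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, np.nan, 4, 5, 6],  # one missing value
})

# corr() skips the row where B is missing when comparing A and B.
rho_with_nan = df.corr(method='spearman').loc['A', 'B']

# With only two columns, this is the same as dropping incomplete rows first.
rho_after_dropna = df.dropna().corr(method='spearman').loc['A', 'B']

print(f"with NaN present: {rho_with_nan:.3f}")
print(f"after dropna():   {rho_after_dropna:.3f}")
```

With more than two columns the exclusion happens pair by pair, so different cells of the matrix may be based on different numbers of rows.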
