
What is Point-Biserial Correlation in Statistics?
The point-biserial correlation coefficient (usually written r_pb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous. Y can be “naturally” dichotomous, like whether a coin lands heads or tails, or artificially dichotomous, like whether a test score is higher or lower than the median score. The other variable (e.g. X) is continuous, measured at the interval or ratio level.
The point-biserial correlation measures the strength and direction of the association between the two variables. It is mathematically equivalent to the Pearson correlation: if you code the dichotomous variable as 0 and 1 and compute the Pearson correlation coefficient between it and the continuous variable, the result is the point-biserial correlation coefficient.
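The equivalence can be checked numerically. The sketch below uses made-up data (the group and score arrays are purely illustrative) and compares scipy's pointbiserialr with pearsonr, and with the common textbook formula r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where M1 and M0 are the group means and s_n is the population standard deviation of the continuous variable.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Illustrative data (not from the article): a 0/1 indicator and a continuous score.
group = np.array([0, 0, 0, 1, 1, 1, 1, 0])
score = np.array([2.1, 3.4, 2.9, 4.8, 5.1, 3.9, 4.4, 3.0])

# Both functions return (coefficient, p-value); keep only the coefficient.
r_pb, _ = pointbiserialr(group, score)
r_pearson, _ = pearsonr(group, score)

# Textbook formula using the two group means and the population SD (ddof=0).
m1 = score[group == 1].mean()
m0 = score[group == 0].mean()
n1 = int((group == 1).sum())
n0 = int((group == 0).sum())
n = len(score)
r_formula = (m1 - m0) / score.std(ddof=0) * np.sqrt(n1 * n0 / n**2)

print(f"pointbiserialr: {r_pb:.6f}")
print(f"pearsonr:       {r_pearson:.6f}")
print(f"formula:        {r_formula:.6f}")
```

All three printed values should agree up to floating-point rounding.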
The value of the point-biserial correlation coefficient lies between -1 and +1, just like the Pearson correlation. A positive value indicates a positive relationship between the variables, while a negative value indicates a negative relationship. A value close to 0 indicates little or no linear relationship between the variables.
Here are some general guidelines for interpreting the size (absolute value) of a correlation coefficient:
- 0.00-0.19: very weak
- 0.20-0.39: weak
- 0.40-0.59: moderate
- 0.60-0.79: strong
- 0.80-1.00: very strong
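If you want to apply these bands in code, a small helper can map a coefficient to its label. This is only a sketch: describe_strength is a hypothetical name, and it assumes each band starts at its lower bound (0.20, 0.40, and so on), so a value such as 0.195 is treated as very weak.

```python
def describe_strength(r):
    """Return the rule-of-thumb label for a correlation coefficient r."""
    size = abs(r)  # the sign gives the direction, not the strength
    if size > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


print(describe_strength(-0.45))
print(describe_strength(0.83))
```

These labels are conventions rather than strict rules, so the appropriate cutoffs can differ from one field to another.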
As always, it’s important to remember that correlation does not imply causation. A correlation between two variables does not necessarily mean that changes in one variable cause changes in the other.
How to Calculate Point-Biserial Correlation in Python?
Calculating the point-biserial correlation in Python is straightforward using the pointbiserialr function from the scipy library. Here’s an example:
from scipy.stats import pointbiserialr
# Here's a continuous variable and a binary variable:
x = [1, 2, 3, 4, 5]
y = [0, 1, 0, 1, 0]
# You can calculate the point-biserial correlation coefficient with scipy's pointbiserialr() function:
corr, _ = pointbiserialr(x, y)
print('Point-Biserial correlation: %.3f' % corr)
In this example, pointbiserialr(x, y) calculates the point-biserial correlation coefficient between the two lists of numbers. The pointbiserialr() function actually returns two values:
- The correlation coefficient.
- The p-value for testing non-correlation.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a point-biserial correlation at least as extreme as the one computed from these datasets.
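To use both return values, you can unpack them and compare the p-value against a significance level. The sketch below uses made-up data (the passed and hours lists are illustrative), and the 0.05 threshold is a common convention chosen here as an assumption.

```python
from scipy.stats import pointbiserialr

# Illustrative data: whether a student passed (0/1) and hours studied.
passed = [0, 0, 0, 0, 1, 1, 1, 1]
hours = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6.5]

# Unpack the correlation coefficient and the p-value.
corr, p_value = pointbiserialr(passed, hours)

alpha = 0.05  # assumed significance level
print(f"Point-biserial correlation: {corr:.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < alpha:
    print("The correlation is statistically significant at the chosen level.")
else:
    print("There is not enough evidence to reject non-correlation.")
```

With so few observations, the p-value should be read with caution. The example only shows how the two values are typically used together.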
Please note that the point-biserial correlation is sensitive to the assumption that the continuous variable is normally distributed within each group defined by the binary variable. If your data doesn’t meet this assumption, you may need to use a non-parametric alternative, like the rank-biserial correlation.
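As a rough illustration of the non-parametric alternative, the rank-biserial correlation can be computed with the simple difference formula: the proportion of cross-group pairs in which the group-1 value is larger, minus the proportion in which it is smaller. The function name rank_biserial and the sample data are hypothetical. This is a minimal sketch, not a scipy function.

```python
import numpy as np


def rank_biserial(binary, values):
    """Rank-biserial correlation via the simple difference formula."""
    binary = np.asarray(binary)
    values = np.asarray(values, dtype=float)
    g1 = values[binary == 1]
    g0 = values[binary == 0]
    if g1.size == 0 or g0.size == 0:
        raise ValueError("both groups need at least one observation")
    # Compare every group-1 value with every group-0 value.
    diff = g1[:, None] - g0[None, :]
    favorable = np.sum(diff > 0)    # pairs where the group-1 value is larger
    unfavorable = np.sum(diff < 0)  # pairs where the group-1 value is smaller
    # Tied pairs count toward neither side but stay in the denominator.
    return (favorable - unfavorable) / diff.size


# Illustrative data: a 0/1 indicator and a continuous variable.
y = [0, 1, 0, 1]
x = [1, 2, 3, 4]
print(rank_biserial(y, x))
```

Because it depends only on the ordering of the values, this measure does not rely on the continuous variable being normally distributed within each group.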