The difference between correlation and regression coefficient

Categories: Intro to Stats, Regression

Author: Nicolai Berk

Published: January 12, 2020

Often, students confuse the regression coefficient \(\hat{\beta}\) in OLS models with the correlation coefficient Pearson’s r. This brief guide should provide some clarity.

An example

First, let’s look at some data from the iris dataset in R and compare the correlation coefficient r to the regression coefficient (\(\hat{\beta}\)) in an OLS model:

data(iris)
cor(iris[, c("Petal.Length", "Sepal.Length")])
             Petal.Length Sepal.Length
Petal.Length    1.0000000    0.8717538
Sepal.Length    0.8717538    1.0000000

Correlation coefficients are given by the cor() function (the output could be wrapped in corrtable() to make it visually more appealing). We can see that the correlation coefficient Pearson’s r equals 0.87. Let’s compare that to the regression coefficient \(\hat\beta\):

summary(lm(Petal.Length ~ Sepal.Length, data = iris))

Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.47747 -0.59072 -0.00668  0.60484  2.49512 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  -7.10144    0.50666  -14.02 <0.0000000000000002 ***
Sepal.Length  1.85843    0.08586   21.65 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8678 on 148 degrees of freedom
Multiple R-squared:   0.76, Adjusted R-squared:  0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 0.00000000000000022

\(\hat\beta\) equals 1.86 for the same two variables! Since both measure the association between the two variables, how can they be so different? To answer this question, let’s consult the formulas.

The common correlation coefficient: Pearson’s r

Pearson’s coefficient is given by:

\[\begin{align} \displaystyle r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \end{align}\]

This basically means that the correlation coefficient of \(x\) and \(y\) (\(r_{xy}\)) equals the sum (\(\sum_{i=1}^{n}\)) of the products of the deviations of \(x\) and \(y\) from their respective means \(\bar{x}\) and \(\bar{y}\) (this numerator is the covariance of \(x\) and \(y\), up to a factor of \(n-1\)), divided by the square root of the summed squared deviations of \(x\) from its mean \(\bar{x}\), multiplied by the same quantity for \(y\). In other words, this measure estimates the covariation of \(x\) and \(y\), standardised by the variation in each of the two variables. This becomes more obvious in an alternative specification of the same formula:

\[\begin{align} r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n}(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \end{align}\]

Note that \(\frac{x_i - \bar{x}}{s_x}\) is exactly the formula for \(x\)’s z-scores, i.e. a standardised representation of the variable (and similarly for \(y\)).
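To make the two formulations concrete, here is a quick manual check in R (a minimal sketch using the two iris variables from above; the object names r_manual and r_zscores are mine):

x <- iris$Sepal.Length
y <- iris$Petal.Length

# Equation (1): summed cross-products over the product of the two root sums of squares
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# Equation (2): average product of the z-scores (scale() standardises a variable)
r_zscores <- sum(scale(x) * scale(y)) / (length(x) - 1)

c(r_manual, r_zscores, cor(x, y))  # all three give roughly 0.87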

The regression coefficient \(\hat{\beta}\)

Let’s compare this to the formula of our regression coefficient:

\[\begin{align} \hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \end{align}\]

It looks eerily similar to equation (1) for \(r_{xy}\)! In fact, the numerator is identical - the covariance term for \(x\) and \(y\). What differs is the denominator (the bottom part of the equation), which contains only the variation in \(x\). If we replaced \(\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\) with \(\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\) in (1), the two formulas would be fully identical! Hence these two measures differ only in whether the covariance is standardised by the variation of both \(x\) and \(y\) (\(r_{xy}\)) or of \(x\) alone (\(\hat\beta\)). This is also the reason why regressing \(y\) on \(x\) does not usually produce the same estimate as regressing \(x\) on \(y\) - the denominators differ. The formulas are so similar that we can express \(\hat{\beta}\) as a function of Pearson’s r:

\[\begin{align} \hat{\beta} = r_{xy}\frac{s_y}{s_x} \end{align}\]

Only the ratio of the standard deviations is needed to transform one into the other! This means that if the standard deviations of \(x\) and \(y\) are equal (\(s_x = s_y\)), then \(\hat{\beta} = r_{xy}\)! Remember that the correlation coefficient \(r_{xy}\) is an indicator of the overall strength of a relationship, which varies between -1 and 1, whereas \(\hat{\beta}\) indicates the change in the expected value of \(y_i\) given a one-unit change in \(x_i\). In essence, they mainly differ by scale, as indicated in (4).
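A brief numerical check of equations (3) and (4) in R (a sketch; the object names are mine, and x and y are again the two iris variables):

x <- iris$Sepal.Length  # predictor
y <- iris$Petal.Length  # outcome

# Equation (3): summed cross-products divided by the summed squared deviations of x
beta_manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Equation (4): Pearson's r rescaled by the ratio of the standard deviations
beta_from_r <- cor(x, y) * sd(y) / sd(x)

c(beta_manual, beta_from_r, coef(lm(y ~ x))[2])  # all roughly 1.86

# Regressing x on y standardises the covariance by the variation in y instead,
# so the slope differs (here roughly 0.41)
coef(lm(x ~ y))[2]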

What if x and y are standardised?

So how do we put \(x\) and \(y\) on the same scale? You might already have guessed that standardising the variables should do the trick, since the deviations of both variables are then measured in standard deviations from their respective means. Let’s check this in R:

iris$sep.l.std <- scale(iris$Sepal.Length, center = TRUE, scale = TRUE)
iris$pet.l.std <- scale(iris$Petal.Length, center = TRUE, scale = TRUE)
summary(lm(pet.l.std ~ sep.l.std, data = iris))

Call:
lm(formula = pet.l.std ~ sep.l.std, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.40343 -0.33463 -0.00379  0.34263  1.41343 

Coefficients:
                         Estimate            Std. Error t value
(Intercept) 0.0000000000000004644 0.0401386997212241151    0.00
sep.l.std   0.8717537758865827602 0.0402731681036472208   21.65
                       Pr(>|t|)    
(Intercept)                   1    
sep.l.std   <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4916 on 148 degrees of freedom
Multiple R-squared:   0.76, Adjusted R-squared:  0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 0.00000000000000022

We can see that the estimate is now 0.87 - equal to our correlation coefficient. Since we re-scaled the variables, a one-unit change in \(x\) now corresponds to a one-standard-deviation change in \(x\), and the coefficient gives the associated change in \(y\) in standard deviations of \(y\).
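Conversely, equation (4) lets us recover the correlation from the raw regression output by hand (a small sketch; the object name b_raw is mine, and the values correspond to the models estimated above):

b_raw <- coef(lm(Petal.Length ~ Sepal.Length, data = iris))[2]  # roughly 1.86

# Multiplying by s_x / s_y undoes the rescaling in equation (4)
b_raw * sd(iris$Sepal.Length) / sd(iris$Petal.Length)           # roughly 0.87

cor(iris$Petal.Length, iris$Sepal.Length)                       # the same value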