Nicolai Berk
January 12, 2020
Often, students confuse the regression coefficient \(\hat{\beta}\) in OLS models with the correlation coefficient Pearson’s r. This brief guide should provide some clarity.
First, let’s look at some data from the iris dataset in R and compare the correlation coefficient r to the regression coefficient (\(\hat{\beta}\)) in an OLS model:
data(iris)
cor(iris[, c("Petal.Length", "Sepal.Length")])

             Petal.Length Sepal.Length
Petal.Length    1.0000000    0.8717538
Sepal.Length    0.8717538    1.0000000
Correlation coefficients are given by the cor() function (this could be wrapped in corrtable() to make the output visually more appealing). We can see that the correlation coefficient Pearson’s r equals 0.87. Let’s compare that to the regression coefficient \(\hat\beta\) from a simple OLS model.
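The output below is the printed summary of that model; judging from the Call line, it should be reproducible with:

summary(lm(Petal.Length ~ Sepal.Length, data = iris))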
Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.47747 -0.59072 -0.00668  0.60484  2.49512 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  -7.10144    0.50666  -14.02 <0.0000000000000002 ***
Sepal.Length  1.85843    0.08586   21.65 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8678 on 148 degrees of freedom
Multiple R-squared: 0.76,  Adjusted R-squared: 0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 0.00000000000000022
\(\hat\beta\) equals 1.86 for the same two variables! Since both are supposed to capture the relationship between the same two variables, how can they be so different? To answer this question, let’s consult the formulas.
Pearson’s coefficient is given by¹:
\[\begin{align} \displaystyle r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \end{align}\]
Which basically means that the correlation coefficient of \(x\) and \(y\) (\(r_{xy}\)) equals the sum (\(\sum_{i=1}^{n}\)) of the products of the deviations of \(x\) and \(y\) from their respective means \(\bar{x}\) and \(\bar{y}\) (up to a constant factor, this upper part is the covariance of \(x\) and \(y\)), divided by the square root of the summed squared deviations of \(x\) from its mean \(\bar{x}\), multiplied by the square root of the summed squared deviations of \(y\) from its mean \(\bar{y}\). In other words, this measure estimates the covariation of \(x\) and \(y\), standardised for the variation in each of these variables. This becomes more obvious in an alternative specification of the same formula:
\[\begin{align} r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n}(\frac{x_i - \bar{x}}{s_x})(\frac{y_i - \bar{y}}{s_y}) \end{align}\]
Note that \((\frac{x_i - \bar{x}}{s_x})\) is exactly the formula for \(x\)’s z-scores, i.e. a standardised representation of \(x\) (and similarly for \(y\)).
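Both versions of the formula are easy to check by hand in R. The sketch below is only an illustration - the objects x, y, r1, and r2 are hypothetical shorthands introduced here, not part of the original analysis - and all three printed values should match the 0.8717538 reported by cor() above:

x <- iris$Sepal.Length
y <- iris$Petal.Length

# equation (1): covariance term divided by the two root sums of squared deviations
r1 <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# equation (2): product of the z-scores, averaged with the n - 1 correction
r2 <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (length(x) - 1)

c(r1, r2, cor(x, y))  # all three should equal 0.8717538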
Let’s compare this to the formula of our regression coefficient:
\[\begin{align} \hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \end{align}\]
It looks eerily similar to equation (1) for \(r_{xy}\)! In fact, the numerator is identical - the covariance term of \(x\) and \(y\). What differs is the denominator (the bottom part of the equation): it now contains only the summed squared deviations of \(x\), i.e. the variation in \(x\). And if we replaced \(\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\) with \(\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\) in (1), the two formulas would be fully identical! Hence these two measures differ only in whether the covariance is standardised by the variation of both \(x\) and \(y\) (\(r_{xy}\)) or of \(x\) alone (\(\hat\beta\)). This is also the reason why regressing \(y\) on \(x\) usually does not produce the same estimate as regressing \(x\) on \(y\) - the denominators differ. The formulas are so similar that we can express \(\hat{\beta}\) as a function of Pearson’s r:
\[\begin{align} \hat{\beta} = r_{xy}\frac{s_y}{s_x} \end{align}\]
Only the ratio of the standard deviations is needed to transform one into the other! This means that if the standard deviations of \(x\) and \(y\) are equal (\(s_x = s_y\)), then \(\hat{\beta} = r_{xy}\)! Remember that the correlation coefficient \(r_{xy}\) is an indicator of the overall strength of a relationship, which varies between -1 and 1. \(\hat{\beta}\), in contrast, indicates the change in the expected value of \(y_i\) given a one-unit change in \(x_i\). In essence, they mainly differ by scale, as indicated in (4).
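This relationship can also be verified numerically. The sketch below reuses the hypothetical x and y objects from the earlier snippet; it computes \(\hat\beta\) by hand via equation (3), converts Pearson’s r via equation (4), and compares both with the slope reported by lm(). It also illustrates the point about reversing the regression:

# equation (3): covariance term divided by the summed squared deviations of x
b_hand <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# equation (4): rescale Pearson's r by the ratio of the standard deviations
b_from_r <- cor(x, y) * sd(y) / sd(x)

c(b_hand, b_from_r, coef(lm(y ~ x))["x"])  # all three should equal 1.85843

# swapping x and y changes the denominator (now the variation in Petal.Length),
# so the reverse regression yields a different slope
coef(lm(x ~ y))["y"]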
So how do we put \(x\) on the same scale as \(y\)? You might already have guessed that standardising the variables should do the trick, as the deviations of both variables are then measured in the same unit: standard deviations from their own mean. Let’s check this in R:
# z-standardise both variables (mean 0, standard deviation 1)
iris$sep.l.std <- scale(iris$Sepal.Length, center = TRUE, scale = TRUE)
iris$pet.l.std <- scale(iris$Petal.Length, center = TRUE, scale = TRUE)

# regress standardised petal length on standardised sepal length
summary(lm(pet.l.std ~ sep.l.std, data = iris))
Call:
lm(formula = pet.l.std ~ sep.l.std, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.40343 -0.33463 -0.00379  0.34263  1.41343 

Coefficients:
                         Estimate            Std. Error t value            Pr(>|t|)    
(Intercept) 0.0000000000000004644 0.0401386997212241151    0.00                   1    
sep.l.std   0.8717537758865827602 0.0402731681036472208   21.65 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4916 on 148 degrees of freedom
Multiple R-squared: 0.76,  Adjusted R-squared: 0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 0.00000000000000022
We can see that the estimate is now 0.87 - equal to our correlation coefficient. Since we re-scaled the variables, a one-unit change in \(x\) now corresponds to a one-standard-deviation change in \(x\).
¹ You can simply find these formulas in the articles on Pearson’s r and linear regression on Wikipedia.