In this chapter, we’ll analyze the computation and the properties of the OLS estimator when the number of covariates (\(K\)) is at least 2. Actually, we’ll analyze in depth the case when \(K=2\) (generalizing from \(K=2\) to \(K>2\) being quite simple) and, compared to the previous two chapters, we’ll emphasize two important points:
the use of matrix algebra, which makes the computation of the estimator and the analysis of its properties elegant and compact,
the correlation between the covariates; actually we’ll show that the multiple and the simple linear models are different only if there exists such a correlation.
We’ll roughly follow the same plan as for the simple linear model: Section 3.1 presents the structural model and the data set that we’ll use throughout the chapter, Section 3.2 the computation of the estimator, Section 3.3 the geometry of the multiple linear model, Section 3.4 the computation of the estimator with R, Section 3.5 its statistical properties and Section 3.6 the inference methods (confidence interval and tests). Finally, Section 3.7 presents system estimation and the constrained least squares estimator.
3.1 Model and data set
To illustrate the multiple regression model, we’ll use the example of the estimation of a model explaining economic growth, using a cross-section of countries.
Structural model
One of the most popular growth models in the economic literature is the Solow-Swan model. The production \(Y\) (or more precisely the added value or GDP) is performed using two production factors, labor \(L\) and capital \(K\). Physical labor is transformed into effective labor using a term denoted \(A\). \(A\) is time-varying (typically increasing) and therefore represents the effect of technical progress, which increases the productivity of labor. The functional form of the production function is a Cobb-Douglas:
\[
Y(t) = K(t) ^ \kappa \left(A(t) L(t)\right) ^ {1 - \kappa}
\]
The coefficients for capital and labor are respectively \(\kappa\) and \(1-\kappa\). They represent the elasticity of production with respect to each factor, but also the share of each factor in the national income (\(\kappa\) is therefore the share of profits and \(1-\kappa\) the share of wages).
Each variable is a continuous function of time \(t\). We’ll denote, for each variable, \(\dot{V}=\frac{d V}{d t}\) the derivative with respect to time. Finally, we’ll denote \(y(t) = \frac{Y(t)}{A(t)L(t)}\) and \(k(t)=\frac{K(t)}{A(t)L(t)}\) the production and capital per unit of effective labor. We therefore have:
\[
y(t) = k(t) ^ \kappa
\tag{3.2}\]
We’ll hereafter omit \((t)\) to make the notation less cluttered. Variation of capital is investment less depreciation. We assume that investment equals savings and that a constant percentage of income (\(i\)) is saved every year. The depreciation rate is denoted by \(\delta\). We then have:
\[
\dot{K} = \frac{d K}{d t} = i Y - \delta K
\]
The growth of the capital stock per unit of effective labor \(k\) is then:
Denoting \(n = \frac{\dot{L}}{L}\) and \(g = \frac{\dot{A}}{A}\) the growth rates of \(L\) and \(A\), i.e., the demographic growth rate and the technological progress rate, we finally have:
At the steady state, the growth rate of capital per unit of effective labor is 0. Solving \(\dot{k}(t)=0\) we get the steady state value of \(k(t)\), denoted by \(k^*\):
Replacing \(k(t)\) by \(k^*\), \(A(t)\) by \(A(0)e^{gt}\) and taking logs, we get:
\[
\ln\frac{Y(t)}{L(t)} = \ln A(0) + g t + \frac{\kappa}{1-\kappa} \ln i
- \frac{\kappa}{1 - \kappa} \ln (n + g + \delta)
\]
Finally, let’s denote \(\ln A(0) = a + \epsilon\). \(\epsilon\) is an error term which represents the dispersion between countries in terms of the initial level of technical progress. With \(C=a + gt\) and \(v = n + g + \delta\), the linear model that we wish to estimate is finally:
\[
\ln\frac{Y}{L} = C + \frac{\kappa}{1-\kappa} \ln i - \frac{\kappa}{1 - \kappa} \ln v + \epsilon
\tag{3.3}\]
We therefore get a multiple regression model for which the response is the log of GDP per capita (\(\ln \frac{Y}{L}\)) and the two covariates are \(\ln i\) (the log of the saving rate) and \(\ln v\) (the log of the sum of the demographic growth, technical progress and depreciation rates). Moreover, the structural model imposes some restrictions on the coefficients that can be tested. The two slopes are, in terms of the structural parameters of the theoretical model: \(\beta_i = \frac{\kappa}{1-\kappa}\) and \(\beta_v = -\frac{\kappa}{1 - \kappa}\). Moreover, \(\kappa\) is the elasticity of GDP with respect to capital and also the share of profits in GDP. A common approximation for the value of this parameter is about 1/3, which implies: \(\beta_i = - \beta_v = \frac{\kappa}{1-\kappa}=0.5\).
Mankiw, Romer, and Weil (1992) proposed a generalization of the Solow-Swan model that includes human capital, denoted by \(H\). The production function is now:
\(\lambda\) is the share of human capital in the GDP and the share of labor is now \((1 - \kappa - \lambda)\). The model is very similar to the one we previously developed. We first compute the growth rate of physical and human capital (\(\dot{k}\) and \(\dot{h}\)) per unit of effective labor, we set these two growth rates to 0 to get the stocks of physical and human capital at the steady state per unit of effective labor (\(k^*\) and \(h^*\)) and we introduce these two values in the production function to get:
\[
\ln\frac{Y}{L} = C + \frac{\kappa}{1-\kappa-\lambda} \ln i +
\frac{\lambda}{1-\kappa-\lambda} \ln e - \frac{\kappa + \lambda}{1 - \kappa-\lambda} \ln v + \epsilon
\tag{3.4}\]
where \(e\) is the per capita level of human capital. The model now contains three covariates and three slopes (\(\beta_i\), \(\beta_e\) and \(\beta_v\)). Moreover, the structural model implies a structural restriction (\(\beta_i + \beta_e + \beta_v = 0\)) that is testable.
Data set
growth contains the data used by Mankiw, Romer, and Weil (1992). It consists of 121 countries for 1985.
# A tibble: 121 × 9
country group gdp60 gdp85 gdpgwth popgwth inv school growth
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Algeria other 2485 4371 4.8 0.026 0.241 0.045 0.565
2 Angola lqdata 1588 1171 0.8 0.021 0.058 0.018 -0.305
3 Benin lqdata 1116 1071 2.2 0.024 0.108 0.018 -0.0412
# ℹ 118 more rows
This data set contains a variable called group which enables the selection of subsamples. The levels of this variable are:
"oil": for countries for which most of the GDP is linked to oil extraction,
"oecd": for OECD countries,
"lqdata": for countries with low quality data,
"other": for other countries.
The variables used in the following regressions are per capita GDP in 1985 (gdp85), investment rate (inv) and growth rate of the population (popgwth). To get the variable denoted by v in the previous section, we need to add to the growth rate of the population the technical progress rate and the rate of depreciation. As these two variables are difficult to measure consistently, the authors assume that they don’t exhibit any cross-country variation and that they sum to 5%. Therefore, v equals popgwth + 0.05.
We first investigate the relationship between the two covariates: inv and popgwth. Figure 3.1 presents the scatterplot, the size of the points being proportional to GDP per capita.
There is a weak negative correlation between the two variables, and rich countries are in general characterized by a low demographic growth rate and a high investment rate. We also remark that there is an outlier, Kuwait, which has a very high demographic growth rate. We then compute the variable v and we rename inv and school as i and e, as in Equation 3.4:
growth <- mutate(growth, v = popgwth + 0.05) %>% rename(i = inv, e = school)
3.2 Computation of the OLS estimator
In this section, we present the computation of the OLS estimator with a number of covariates \(K \geq 2\). We start with the case where \(K=2\), for which it is possible to compute the estimators using roughly the same notations as those used for the simple linear regression model. Then we’ll perform the computation in the general case using matrix algebra.
For one observation, denoting \(x_{n1}\) and \(x_{n2}\) the two covariates for observation \(n\), the model is:
\[
y_n = \alpha + \beta_1 x_{n1} + \beta_2 x_{n2} + \epsilon_n
\]
Each observation is now a point in the 3-D space defined by \((x_1, x_2, y)\) and \(\gamma ^ \top = (\alpha, \beta_1, \beta_2)\) are the coordinates of a plane that returns the expected value of \(y\) for a given value of \(x_1\) and \(x_2\). The residual sum of squares is:
\[f(\alpha, \beta) =\sum_{n = 1} ^ N\left(y_{n}-\alpha-\beta_1x_{n1}-\beta_2x_{n2}\right)^2\] which leads to the following three first-order conditions:
which means that the sample mean is on the regression plane, or that the sum of the residuals is zero. For the last two lines of Equation 3.5, the terms in parentheses are the residual for one observation and, as its mean is 0, they indicate that the sample covariance between the residuals and both covariates should be 0. Therefore, we get exactly the same conditions as those obtained for the simple linear regression model. Subtracting Equation 3.6 from the last two lines of Equation 3.5, we get:
\[
\left\{
\begin{array}{l}
\sum_{n = 1} ^ N x_{n1}\left[(y_{n}-\bar{y})-\beta_1(x_{n1}-\bar{x}_1) -
\beta_2(x_{n2} - \bar{x}_2)\right]=0 \\
\sum_{n = 1} ^ N x_{n2}\left[(y_{n}-\bar{y})-\beta_1(x_{n1}-\bar{x}_1) -
\beta_2(x_{n2} - \bar{x}_2)\right]=0 \\
\end{array}
\right.
\] or, replacing \(x_{nk}\) by \(x_{nk} - \bar{x}_k\) and expanding terms:
We therefore have a system of two linear equations with two unknown parameters (\(\beta_1\) and \(\beta_2\)) that could be solved, for example, by substitution. However, the use of matrix algebra enables such a problem to be solved in a much simpler way. Denote:
\[
X=\left(\begin{array}{cc}
x_{11} & x_{12} \\
x_{21} & x_{22} \\
\vdots & \vdots \\
x_{N1} & x_{N2} \\
\end{array}\right)
\;\;
y=\left(\begin{array}{c}
y_1 \\
y_2 \\
\vdots \\
y_N \\
\end{array}\right)
\;\;
\beta=\left(\begin{array}{c}
\beta_1 \\
\beta_2 \\
\end{array}\right)
\]\(X\) is a matrix with two columns (\(K\) in the general case) and \(N\) rows (the number of observations) and \(y\) is a vector of length \(N\). We define \(\tilde{I} = I - J / N\) where \(I\) is a \(N \times N\) identity matrix and \(J\) a \(N\times N\) matrix of ones. For example, for \(N = 3\):
\[
\tilde{I} = \left(\begin{array}{ccc}
2/3 & -1/3 & -1/3 \\
-1/3 & 2/3 & -1/3 \\
-1/3 & -1/3 & 2/3 \\
\end{array}\right)
\]
Premultiplying a vector \(z\) of length \(N\) by \(J/N\), we get a vector of length \(N\) containing the sample mean of \(z\), \(\bar{z}\), repeated \(N\) times. As premultiplying \(z\) by the identity matrix returns \(z\), premultiplying \(z\) by \(\tilde{I}\) returns a vector of length \(N\) containing the \(N\) values of \(z\) in deviation from the sample mean. Note that \(\tilde{I}\) is idempotent, which means that \(\tilde{I} \times \tilde{I} = \tilde{I}\). This can be checked by direct multiplication, but also by recalling that premultiplying a vector by \(\tilde{I}\) removes the sample mean from the values of the vector. The transformed vector then has a zero mean, so that applying the same premultiplication one more time will leave it unchanged. Therefore, \(\tilde{I}(\tilde{I} z) = \tilde{I}z\). Denoting \(\tilde{z} = \tilde{I} z\), we get:
\(S_{kk}\) is the total variation of \(x_k\), \(S_{kl}\) the covariation of \(x_k\) and \(x_l\) and \(S_{ky}\) the covariation between covariate \(k\) and the response. Note that the quantities \(S_{kk}\) and \(S_{ky}\) were already present in the simple linear model as \(S_{xx}\) and \(S_{xy}\) (there are now two of them). The new term is \(S_{kl}\), which measures the covariation (and hence the correlation) between the two covariates.
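As a quick numerical illustration of the demeaning matrix \(\tilde{I}\), here is a minimal R sketch (the object names and the vector z are purely illustrative):
N <- 3
z <- c(1, 4, 7)
I_tilde <- diag(N) - matrix(1, N, N) / N   # I - J / N
drop(I_tilde %*% z)                        # z in deviation from its sample mean
all.equal(I_tilde %*% I_tilde, I_tilde)    # idempotency: I_tilde x I_tilde = I_tilde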
Equation 3.7 can then be written in matrix form as:
Note that premultiplying the vector \(\tilde{X}^\top \tilde{y}\) by the inverse of \(\tilde{X} ^ \top \tilde{X}\) is a natural extension of the computation we’ve performed for the simple linear model, which consisted of dividing \(S_{xy}\) by \(S_{xx}\). To understand this formula, we write \(\tilde{X} ^ \top \tilde{X}\) and \(\tilde{X} ^ \top \tilde{y}\) as:
which is exactly the same formula that we had for the unique slope in the case of the simple regression model. This means that, if \(x_1\) and \(x_2\) are uncorrelated in the sample, regressing \(y\) on \(x_1\) or on \(x_1\) and \(x_2\) leads exactly to the same estimator for the slope of \(x_1\).
The general formula for \(\hat{\beta}_1\) in term of the estimator of the simple linear model \(\hat{\beta}_1^S\) (Equation 1.10) is:
We have \(\mid\hat{\rho}_{12} \hat{\rho}_{1y}\mid \leq \mid\hat{\rho}_{2y}\mid\), or \(\left| \frac{\hat{\rho}_{2y}}{\hat{\rho}_{12}\hat{\rho}_{1y}}\right| \geq 1\).
Consider the case where the two covariates are positively correlated with the response. As an example, consider the wage equation with \(x_1\) education and \(x_2\) a dummy for males. Two cases should then be analyzed:
the two covariates are positively correlated (males are more educated than females on average). In this case \(\hat{\rho}_{2y}/(\hat{\rho}_{12}\hat{\rho}_{1y})>1\), and the numerator of Equation 3.11 is lower than \(1 - \hat{\rho}_{12} ^ 2\), so that \(\hat{\beta}_1<\hat{\beta}_1^S\). \(\hat{\beta}_1 ^ S\) is upward biased because it estimates the sum of the positive direct effect of education on wage and a positive indirect effect (more education leads to a subpopulation with a higher share of males and therefore higher wages),
the two covariates are negatively correlated (males are less educated than females on average). In this case, \(\hat{\rho}_{2y}/(\hat{\rho}_{12}\hat{\rho}_{1y})<0\) and the numerator of Equation 3.11 is greater than 1, so that \(\hat{\beta}_1>\hat{\beta}_1^S\). \(\hat{\beta}_1 ^ S\) is downward biased because it estimates the sum of the positive direct effect of education on wage and a negative indirect effect (more education leads to a subpopulation with a lower share of males and therefore lower wages).
The general derivation of the OLS estimator can be performed using matrix algebra, denoting \(j_N\) a vector of ones of length \(N\), \(Z = (j_N, X)\) the matrix formed by binding a column of ones to the matrix of covariates and \(\gamma^\top = (\alpha, \beta ^ \top)\) the vector of parameters obtained by adding the intercept \(\alpha\) to the vector of slopes:
\[
f(\gamma) = (y - Z\gamma)^\top (y - Z\gamma) = y ^ \top y + \gamma ^ \top Z
^ \top Z \gamma -2 \gamma ^ \top Z ^ \top y
\]
The \(K+1\) first-order conditions are:
\[
\frac{\partial f}{\partial \gamma} = 2 Z ^ \top Z \gamma - 2 Z ^ \top
y = -2 Z ^ \top (y - Z\gamma) = 0
\]
The last expression indicates that all the columns of \(Z\) are orthogonal to the vector of residuals \(y - Z\gamma\). Solving for \(\gamma\), we get:
\[
\hat{\gamma} = (Z ^ \top Z) ^ {- 1} Z ^ \top y
\tag{3.12}\]
The matrix of second derivatives is:
\[
\frac{\partial^2 f}{\partial \gamma\partial^\top \gamma} =
2 Z ^ \top Z
\]
It is a positive semi-definite matrix, so that \(\hat{\gamma}\) is a minimum of \(f\). Comparing Equation 3.12 and Equation 3.8, we can see that the formula of the OLS is the same with different matrices (respectively \(\tilde{X}\) and \(Z\)), the first equation returning \(\hat{\beta}\) and the second one \(\hat{\gamma} ^ \top = (\hat{\alpha}, \hat{\beta} ^ \top)\).
3.3 Geometry of least squares
Geometry of the multiple regression model
The geometry of the multiple regression model is presented in Figure 3.2.
We now have \(N = 3\) and \(K = 2\) and therefore each variable is a vector in the 3-D space. As the two covariates \(x_1\) and \(x_2\) are linearly independent, they span a subspace of dimension 2, which is a plane in this 3-D space. \(\hat{y}\) is the orthogonal projection of \(y\) on this subspace and \(\hat{\epsilon}\) is the vector that links \(\hat{y}\) and \(y\). \(\hat{\epsilon}\) is therefore the projection of \(y\) on the complement to the subspace defined by \((x_1, x_2)\), which is a straight line orthogonal to the plane spanned by \((x_1, x_2)\). Therefore \(\hat{\epsilon}\) is orthogonal to \(x_1\) and to \(x_2\), which means that the residuals are uncorrelated with the two covariates. The decomposition of \(y\) into the sum of two orthogonal vectors \(\hat{\epsilon}\) and \(\hat{y}\) doesn’t depend on the two variables \(x_1\) and \(x_2\) per se, but on the subspace spanned by \(x_1\) and \(x_2\). This means that any pair of linearly independent combinations of \(x_1\) and \(x_2\) will lead to the same subspace as the one defined by \(x_1\) and \(x_2\) and therefore to the same residuals and the same fitted values.
More formally, as \(\hat{\beta} = (X ^ \top X) ^ {-1} X ^ \top y\), we have \(\hat{y} = X \hat{\beta} = X (X ^ \top X) ^ {-1} X ^ \top y = P_X y\). \(P_X\) is sometimes called the “hat” matrix, as it “puts a hat” on \(y\): it transforms the vector of responses into the vector of fitted values. As \(\hat{\epsilon} = y - \hat{y}\), we also have \(\hat{\epsilon} = y - P_X y = (I-P_X) y=M_Xy\). We therefore consider two matrices \(P_X\) and \(M_X\):
\[
\left\{
\begin{array}{rcl}
P_X &=& X (X ^ \top X) ^ {-1} X ^ \top \\
M_X &=& I - X (X ^ \top X) ^ {-1} X ^ \top \\
\end{array}
\right.
\tag{3.13}\]
that are square and symmetric of dimension \(N \times N\), which means that they are large matrices that are never computed in practice. However, they have very interesting analytical features. First, they are idempotent, which means that \(P_X \times P_X = P_X\) and \(M_X \times M_X = M_X\). This means that premultiplying a vector by such a matrix projects it onto a subspace. For example, premultiplying \(y\) by \(P_X\) gives \(\hat{y}\), the vector of fitted values. It is a linear combination of \(x_1\) and \(x_2\) and therefore belongs to the subspace spanned by \(x_1\) and \(x_2\). Therefore, \(P_X \hat{y}\) obviously equals \(\hat{y}\), so that \(P_X \times P_X = P_X\). Except for the identity matrix, idempotent matrices are not full rank. Their rank can be easily computed using the fact that the rank of an idempotent matrix is equal to its trace (the sum of the diagonal elements) and that the trace of a product of matrices is invariant to any circular permutation of the matrices: \(\mbox{tr}\, ABC = \mbox{tr}\, BCA = \mbox{tr}\, CAB\).
For a regression with an intercept, the model matrix \(Z\) has \(K + 1\) columns, the first one being a column of ones. In this case, the ranks of \(P_Z\) and \(M_Z\) are: \(\mbox{rank} \,P_Z = \mbox{tr}\, P_Z = \mbox{tr}\, Z (Z ^ \top Z) ^ {-1} Z ^ \top = \mbox{tr}\, (Z ^ \top Z) ^ {-1} Z ^ \top Z = \mbox{tr}\, I_{K+1} = K + 1\) and \(\mbox{rank} \,M_Z = \mbox{tr}\, (I_N - P_Z) = \mbox{tr}\, I_N - \mbox{tr}\, P_Z = N - K - 1\).
Finally, the two matrices are orthogonal: \(P_ZM_Z= P_Z(I - P_Z)=P_Z-P_Z=0\), which means that they perform the projection of a vector on two orthogonal subspaces.
Getting back to Figure 3.2, \(P_X\) projects \(y\) on the 2-D subspace (a plane) spanned by \(x_1\) and \(x_2\) and \(M_X\) projects \(y\) on a 1-D subspace (the straight line orthogonal to the previous plane). \(M_X\) and \(P_X\) therefore perform an orthogonal decomposition of \(y\) into \(\hat{y}\) and \(\hat{\epsilon}\), which means that \(\hat{y} + \hat{\epsilon} = y\) and that \(\hat{y}^\top\hat{\epsilon} = 0\).
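A small numerical sketch can make these properties concrete; the matrix X below is arbitrary and the computation is only illustrative:
set.seed(1)
X <- cbind(rnorm(5), rnorm(5))                  # N = 5, K = 2
P_X <- X %*% solve(crossprod(X)) %*% t(X)
M_X <- diag(5) - P_X
all.equal(P_X %*% P_X, P_X)                     # idempotency
sum(diag(P_X))                                  # rank of P_X = trace = K = 2
max(abs(P_X %*% M_X))                           # orthogonality of the two projections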
Frisch-Waugh theorem
Consider the regression of \(y\) on a set of regressors \(X\) which, for some reason, is separated into two subsets \(X_1\) and \(X_2\). Suppose that we are only interested in the coefficients \(\beta_2\) associated with \(X_2\). The Frisch-Waugh theorem states that the same estimator \(\hat{\beta}_2\) is obtained:
by regressing \(y\) on \(X_1\) and \(X_2\),
by first regressing \(y\) and each column of \(X_2\) on \(X_1\), then taking the residuals \(M_1y\) and \(M_1{X}_2\) of these regressions and finally regressing \(M_1y\) on \(M_1X_2\).
The first regression is the regression of \(y\) on \(x_1\) and \(x_2\). We then get an orthogonal decomposition of \(y\) into the vector of fitted values \(\hat{y}=P_{12}y\) and the vector of residuals \(\hat{\epsilon}_{12}=M_{12}y\). We also show in this figure the decomposition of \(\hat{y}\) on \(x_1\) and \(x_2\), which is represented by the sum of the two vectors \(\hat{\beta}_1 x_1\) and \(\hat{\beta}_2 x_2\). \(\hat{\beta}_2\) is the estimated coefficient of \(x_2\) in this first regression and is represented by the ratio of the lengths of \(\hat{\beta}_2 x_2\) and \(x_2\).
The second regression is the regression of \(M_1y\) on \(M_1x_2\). \(M_1x_2\) is the residual of the regression of \(x_2\) on \(x_1\). Therefore, this vector lies in the line that is in the plane spanned by \(x_1\) and \(x_2\) and is orthogonal to \(x_1\). \(M_1y\) is the residual of \(y\) on \(x_1\); it is therefore orthogonal to \(x_1\).
As both \(M_1y\) and \(M_{12}y\) are orthogonal to \(x_1\), so is the vector that joins those two vectors. Therefore, this vector is parallel to \(M_1x_2\), and it is therefore the fitted value of the second regression (\(M_1y\) on \(M_1x_2\)), \(P_{M_1x_2}y\). It is also equal to \(\tilde{\beta}_2 M_1x_2\), \(\tilde{\beta}_2\) being the estimator of \(x_2\) on the second regression. Note also that \(\hat{\epsilon}_{12} = M_{12}y\) is the residual of the second regression, which is therefore the same as the residuals of the first regression.
Finally, consider the regression of \(\hat{\beta}_2 x_2\) on \(x_1\). The residual of this regression is \(\hat{\beta}_2M_1x_2\). As it lies on the plane spanned by \(x_1\) and \(x_2\) and is orthogonal to \(x_1\), it is parallel to \(P_{M_1x_2}y=\tilde{\beta}_2M_1x_2\). Moreover, the Frisch-Waugh theorem states that both vectors have the same length and are therefore identical, which means that \(\tilde{\beta}_2 = \hat{\beta}_2\) and that the two regressions give identical estimators.
The Frisch-Waugh theorem is easily demonstrated using some geometric arguments. Consider the regression with all the covariates:
\(M_1 X_1 \hat{\beta}_1\) is 0, as \(X_1 \hat{\beta}_1\) is obviously in the subspace spanned by \(X_1\) and therefore its projection on the orthogonal complement is 0. \(M_{12} y\) is orthogonal to the subspace spanned by \(X\) and is therefore also orthogonal to the subspace spanned by \(X_1\). Therefore \(M_1 M_{12} y = M_{12} y\). We therefore have:
which is exactly the estimation obtained by regressing \(M_1 y\) on \(M_1 X_2\). We finally note (an important result that will be used in Section 3.6.3), that \(\hat{\epsilon}_1=M_1y\), \(\hat{\epsilon}_{12}=M_{12}y\) and \(P_{M_1x_2}y\) form a right triangle, \(\hat{\epsilon}_1\) being the hypotenuse. Therefore, using the Pythagorean theorem, we have:
3.4 Computation of the estimator with R
To estimate the multiple linear model, we use the lm function, as for the simple linear model; the difference is that now, on the right side of the formula, we have several variables (here two), separated by the + operator. Actually, formulas have a much richer syntax that includes other operators, for example * and :. This will be discussed in Chapter 4.
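As an illustration, a minimal sketch of such a call for the growth model of Section 3.1 could be (the object name slw is illustrative):
slw <- lm(log(gdp85) ~ log(i) + log(v), growth)
coef(slw)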
The estimator can also be computed “by hand”, using matrix algebra. To start, we use the model.frame function which, as lm, has formula and data arguments. For pedagogical purposes, we add the group variable in the formula.
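A sketch of this step (the object name mf follows the text):
mf <- model.frame(log(gdp85) ~ log(i) + log(v) + group, growth)
dim(mf)   # fewer rows than growth: observations with missing values are dropped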
model.frame returns a data frame that contains the data required for estimating the model described in the formula. More precisely, it performs three tasks:
it selects only the columns of the initial data frame that are required for the estimation,
it transforms these variables if required; here gdp85, i and v are transformed into logarithms and the columns are renamed accordingly,
it selects only the observations for which there are no missing values for the relevant variables; note here that growth has 121 rows and mf only 107 rows.
Several interesting elements of the model can be extracted from the model frame. The \(Z\) matrix is obtained using the model.matrix function, which also uses a formula/data interface, the data being the model frame mf:
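For instance, reusing the model frame mf built above (a sketch):
Z <- model.matrix(log(gdp85) ~ log(i) + log(v) + group, mf)
head(Z, 2)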
Note that the model matrix includes an intercept, and that the group variable, which is a categorical variable, is transformed into a set of dummy variables. More precisely, a dummy variable is created for each level except the first one ("oecd"). The response is obtained using the model.response function:
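A sketch of the extraction of the response:
y <- model.response(mf)
head(y, 2)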
\(Z^\top Z\) could be computed as t(Z) %*% Z, but crossprod is more efficient, \(A^\top B\) being obtained using crossprod(A, B) and \(A^\top A\) using either crossprod(A, A) or crossprod(A). Therefore, \(Z^\top Z\) and \(Z^\top y\) are respectively obtained using:
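A sketch, reusing the Z and y objects built above:
ZtZ <- crossprod(Z)
Zty <- crossprod(Z, y)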
Moreover, solve can be used to solve a system of linear equations: solve(A, z) computes the vector \(w\) such that \(Aw=z\), which is \(w=A^{-1}z\). Therefore, the OLS estimator can be computed using the following, more efficient and compact, code:
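A sketch of this computation:
solve(crossprod(Z), crossprod(Z, y))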
The efficient method used by lm to compute the OLS estimate is the QR decomposition: the matrix of covariates \(Z\) can be written as the product of an orthonormal matrix \(Q\) of dimension \(N \times (K + 1)\) (therefore \(Q^\top Q = I\)) and an upper triangular matrix \(R\) of dimension \((K+1) \times (K+1)\). Then, the linear model can be written as:
with \(\delta = R \gamma\). With \(Q\) as the matrix of covariates, the OLS estimator is \(\hat{\delta} = (Q ^ \top Q ) ^ {-1} Q ^ \top y = Q^\top y\), which is obtained without matrix inversion. Then \(\hat{y} = Q \hat{\delta} = Q Q ^ \top y\) and \(\hat{\epsilon} = (I - Q Q ^ \top) y\). With \(Z\) as the matrix of covariates, the OLS estimator is \(\hat{\gamma} = R ^ {-1} \hat{\delta}\). The inverse of a triangular matrix can be obtained very easily and with a very high numerical accuracy. Moreover, \(\hat{\gamma}\) can be obtained without inverting \(R\). With \(K = 2\), we have:
We illustrate the computation of the OLS estimator using the QR decomposition and using the previously estimated growth model, without the group covariate. The QR decomposition is performed using the qr function:
qr returns an object of class qr, and the qr.R and qr.Q functions can be used to retrieve the two matrices. We check that \(R\) is upper triangular and that \(Q^\top Q = I\):
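A minimal sketch of these computations, rebuilding the model matrix without the group covariate (the object names are illustrative):
mf_slw <- model.frame(log(gdp85) ~ log(i) + log(v), growth)
Z <- model.matrix(log(gdp85) ~ log(i) + log(v), mf_slw)
y <- model.response(mf_slw)
QR <- qr(Z)
R <- qr.R(QR)
Q <- qr.Q(QR)
all(R[lower.tri(R)] == 0)                  # R is upper triangular
max(abs(crossprod(Q) - diag(ncol(Q))))     # Q'Q = I (up to numerical precision)
backsolve(R, crossprod(Q, y))              # gamma_hat, obtained without inverting Z'Z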
3.5 Statistical properties of the OLS estimator
In this section, we’ll briefly analyze the statistical properties of the OLS estimator with more than one covariate. Most of these properties are similar to those described in Chapter 2. They’ll be presented in this section using matrix algebra.
Unbiasedness of the OLS estimator
The vector of slopes can be written as a linear combination of the vector of responses and then of the vector of errors:
The unbiasedness condition is therefore that \(\mbox{E}(\epsilon \mid X) = 0\), which is a direct generalization of the result obtained for the simple linear regression model, namely \(\epsilon\) has a constant expected value (that can be set to 0 without any restriction) whatever the value of the covariates. It implies also that the population covariance between the errors and any of the covariates is 0.
Variance of the OLS estimator
The variance of \(\hat{\beta}\) is now a matrix of variances and covariances:
This is a sandwich formula, the meat, \(\frac{1}{N}\mbox{E}\left(\tilde{X}^\top\epsilon \epsilon ^ \top \tilde{X} \mid X\right)\), being surrounded by two slices of bread, \(\left(\frac{1}{N} \tilde{X}^\top\tilde{X}\right)^{-1}\). Note that the two matrices are square of dimension \(K \times K\). The bread is just the inverse of the covariance matrix of the covariates. The meat is the variance of the score vector, i.e., the vector of the first-order conditions. For \(K = 2\), it is the expected value of:
\[
\small{
\frac{1}{N}
\left(
\begin{array}{cc}
\left(\sum_{n=1} ^ N (x_{n1} - \bar{x}_1) \epsilon_n\right) ^ 2 &
\left(\sum_{n=1} ^ N (x_{n1} - \bar{x}_1) \epsilon_n\right)
\left(\sum_{n=1} ^ N (x_{n2} - \bar{x}_2) \epsilon_n\right) \\
\left(\sum_{n=1} ^ N (x_{n1} - \bar{x}_1) \epsilon_n\right)
\left(\sum_{n=1} ^ N (x_{n2} - \bar{x}_2) \epsilon_n\right) &
\left(\sum_{n=1} ^ N (x_{n2} - \bar{x}_2) \epsilon_n\right) ^ 2
\end{array}
\right)
}
\]
which is a generalization of the single regression case, where the “meat” reduces to the scalar \(\left(\sum_{n=1} ^ N (x_n - \bar{x}) \epsilon_n\right) ^ 2\). As for the simple regression model, the formula of the variance simplifies with the hypothesis that the errors are homoskedastic (\(\mbox{E}(\epsilon_n ^ 2 \mid x) = \sigma_\epsilon ^ 2\)) and uncorrelated (\(\mbox{E}(\epsilon_n \epsilon_m \mid x) = 0 \; \forall \; m \neq n\)). In this case, the meat reduces to \(\sigma_\epsilon ^ 2 \frac{1}{N} \tilde{X} ^ \top \tilde{X}\), i.e., up to a scalar, to the covariance matrix of the covariates, and Equation 3.16 becomes:
\[
\sigma_{\hat{\beta}_k}=
\frac{\sigma_\epsilon}{\sqrt{N}\hat{\sigma}_k\sqrt{1-\hat{\rho}_{12}^2}}
\mbox{ for }k = 1, 2,\;
\hat{\sigma}_{\hat{\beta}_1\hat{\beta}_2}=
-\frac{\hat{\rho}_{12}\sigma_\epsilon^2}{N\hat{\sigma}_1\hat{\sigma}_2(1-\hat{\rho}_{12}^2)}
\mbox{ and } \hat{\rho}_{\hat{\beta}_1\hat{\beta}_2} = -\hat{\rho}_{12}
\tag{3.18}\]
First, note that if \(\hat{\rho}_{12} = 0\), which means that the two covariates are uncorrelated, the formula for the standard deviation of a slope in the multiple regression model reduces to the formula of the simple regression model, which means that the standard deviation is proportional to:
the standard deviation of the error,
the inverse of the standard deviation of the corresponding covariate,
the inverse of the square root of the sample size.
When the two covariates are correlated, the last term \(1/\sqrt{1 - \hat{\rho}_{12}^2}\) is added and inflates the standard deviation. This means that the more the covariates are correlated (whatever the sign of the correlation), the larger the standard deviation of the slope. The intuition is that, if the two covariates are highly correlated, it is difficult to estimate precisely the separate effect of each of them.
\(\sigma_\epsilon^2\) being unknown, \(\sigma^2_{\hat{\beta}}\) can’t be computed. If the errors were observed, a natural estimator of \(\sigma_\epsilon^2\) would be \(\sum_{n=1}^N (\epsilon_n-\bar{\epsilon}) ^ 2/N\). As the errors are unknown, one can use the residuals instead, which are related to the errors by the relation: \(\hat{\epsilon} = M_Z y = M_Z (Z\gamma + \epsilon) = M_Z \epsilon\), the last equality holding because \(Z\gamma\) is a vector of the subspace defined by the columns of \(Z\) and therefore \(M_ZZ\gamma = 0\). Therefore, we have: \(\hat{\epsilon} ^ \top \hat{\epsilon} = \epsilon ^ \top M_Z \epsilon\); as it is a scalar, it is equal to its trace. Using the permutation rule for the trace, we get: \(\hat{\epsilon} ^ \top \hat{\epsilon} = \epsilon ^ \top M_Z \epsilon = \mbox{tr}\, M_Z \epsilon \epsilon ^ \top\). With spherical disturbances, we have \(\mbox{E}(\hat{\epsilon} ^ \top \hat{\epsilon}) = \mbox{tr}\, M_Z \sigma_\epsilon ^ 2 I = \sigma_\epsilon ^ 2 \mbox{tr}\, M_Z = (N - K - 1) \sigma_\epsilon ^ 2\). Therefore, an unbiased estimator of \(\sigma_\epsilon^2\) is:
\[
\dot{\sigma}_\epsilon^2 = \frac{\hat{\epsilon}^\top\hat{\epsilon}}{N - K - 1} =
\frac{\sum_{n=1}^N \hat{\epsilon}_n ^ 2}{N - K - 1}
\tag{3.19}\]
and replacing \(\sigma_\epsilon ^ 2\) by \(\dot{\sigma}_\epsilon ^ 2\) in Equation 3.17, we get an unbiased estimator of the covariance matrix of the estimators:
We now go back to the estimation of the growth model. As in Mankiw, Romer, and Weil (1992), we use a restricted sample by excluding countries for which most of the GDP is linked to oil extraction and those with low quality data.
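A sketch of the sample selection and of the estimation; the object names growth_sub and slw_tot are those used in the rest of the chapter, and the selection follows the description in the text:
growth_sub <- filter(growth, ! group %in% c("oil", "lqdata"))
slw_tot <- lm(log(gdp85) ~ log(i) + log(v), growth_sub)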
The summary method computes detailed results of the regression, in particular the table of coefficients, a matrix that can be extracted using the coef method:
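For instance:
coef(summary(slw_tot))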
The Gauss-Markov theorem
The demonstration made in Section 2.1.4 for the simple linear model can easily be extended to the multiple regression model. The OLS estimator is: \(\hat{\gamma} = (Z^\top Z)^{-1}Z^\top y = \gamma + (Z^\top Z)^{-1}Z^\top \epsilon\). Consider another linear estimator:
\[
\tilde{\gamma} = \left[(Z^\top Z)^{-1}Z^\top + D\right] y
\]
The unbiasedness of \(\tilde{\gamma}\) implies that \(DZ=0\), so that:
\[
\tilde{\gamma}-\gamma= \left[(Z^\top Z)^{-1}Z^\top + D\right]\epsilon = (\hat{\gamma}- \gamma) + D\epsilon = (\hat{\gamma}- \gamma) + Dy
\] because \(DZ=0\) implies that \(D\epsilon=Dy\). Therefore, \(\tilde{\gamma}- \hat{\gamma}=Dy\). The covariance between the OLS estimator and the vector of differences of the two estimators is:
\[
\mbox{E}\left[(\tilde{\gamma}-\hat{\gamma})(\hat{\gamma} - \gamma)^\top\right] = \mbox{E}\left[D\epsilon\epsilon^\top Z(Z^\top Z) ^ {-1}\right]
\] with spherical disturbances, this reduces to:
\[
\mbox{E}\left[(\tilde{\gamma}-\hat{\gamma})(\hat{\gamma} - \gamma)^\top\right] = \sigma_\epsilon ^ 2DZ(Z^\top Z) ^ {-1} =0
\] Therefore, we can write the variance of \(\tilde{\gamma}\) as:
\[
\mbox{V}(\tilde{\gamma}) = \mbox{V}\left[\hat{\gamma} + (\tilde{\gamma}-\hat{\gamma})\right] = \mbox{V}(\hat{\gamma}) + \mbox{V}(Dy)
\] as \(\mbox{V}(Dy)\) is a covariance matrix, it is positive semi-definite and, therefore, the difference between the variance matrix of any unbiased linear estimator and that of the OLS estimator is a positive semi-definite matrix, which means that the OLS estimator is the most efficient linear unbiased estimator.
Asymptotic properties of the OLS estimator
Asymptotic properties of the multiple regression model are direct extensions of those we have seen for the simple regression model. With \(\mbox{E}(\hat{\beta}) = \beta\) and \(\mbox{V}(\hat{\beta}) = \frac{\sigma_\epsilon ^ 2}{N} \left(\frac{1}{N}\tilde{X}^ \top \tilde{X}\right) ^ {- 1}\), the OLS estimator is consistent (\(\mbox{plim} \;\hat{\beta} = \beta\)) if the covariance matrix of the covariates \(\frac{1}{N}\tilde{X} ^ \top \tilde{X}\) converges to a finite matrix. The central-limit theorem implies that:
Coefficient of determination
From Equation 1.14, one way to write the coefficient of determination for the simple linear regression model is:
\[
R^2 = 1 - \frac{\sum_{n=1} ^ N \hat{\epsilon}_n^2}{\sum_{n=1} ^ N (y_n - \bar{y})^2} = 1 - \frac{\hat{\sigma}_\epsilon^2}{\hat{\sigma}_y^2}
\]
The last term is obtained by dividing the residual sum of squares and the total variation of \(y\) by the sample size \(N\). This gives rise to biased estimators of the variances of the errors and of the response. Consider now the unbiased estimators: the numerator should then be divided by \(N - K - 1\) and \(\dot{\sigma}_\epsilon ^ 2\) is obtained. Similarly, the denominator should be divided by \(N-1\) to obtain an unbiased estimation of the variance of the response. We then obtain the adjusted coefficient of determination:
\[
\bar{R}^2 = 1 - \frac{\dot{\sigma}_\epsilon^2}{\dot{\sigma}_y^2}= 1 - \frac{\sum_{n=1} ^ N \hat{\epsilon}_n^2 / (N-K-1)}{\sum_{n=1} ^ N (y_n - \bar{y})^2 / (N-1)} =
1 - \frac{N - 1}{N - K - 1}\frac{\sum_{n=1} ^ N \hat{\epsilon}_n^2}{\sum_{n=1} ^ N (y_n - \bar{y})^2}
\]\(R^2\) necessarily increases when one more covariate is added to the regression because, even if this covariate is irrelevant, its sample correlation with the response will never be exactly 0 and therefore the residual sum of squares will decrease. This is not the case with \(\bar{R}^2\) because \(\dot{\sigma}^2_\epsilon\) is the ratio of two terms which both decrease when a covariate is added. Note also that \(\bar{R}^2\) is not necessarily positive.
3.6 Confidence interval and test
In Section 2.3, we have seen how to compute a confidence interval for a single parameter and how to perform a test for one hypothesis. The confidence interval was a segment, i.e., a range of values that contains the true value of the parameter with a given probability, and the tests were performed using a normal or a Student distribution. Of course, the same kind of analysis can be performed with a multiple regression. But, in this latter case:
a confidence interval can also be computed for several coefficients at a time (a confidence region),
tests of multiple hypotheses can be performed, using a chi-squared or a Fisher distribution.
To illustrate these points, we’ll use the Solow model. Remember that the model to estimate is:
\[
\ln y = \alpha + \beta_i \ln i + \beta_v \ln v + \epsilon
\]
where \(y\) is the per capita GDP, \(i\) the investment rate and \(v\) the sum of the labor force growth rate, the depreciation rate and the technical progress rate. Moreover, the relation between \(\beta_i\), \(\beta_v\) and the structural parameter \(\kappa\), which is the share of profits in GDP, is \(\beta_i = - \beta_v =\kappa / (1 -\kappa)\). We’ll analyze the confidence interval for the couple of coefficients \((\beta_i, \beta_v)\) and we’ll test two hypotheses:
the first is imposed by the model; we must have \(\beta_i + \beta_v = 0\),
the second corresponds to a reasonable value of the share of profits, which is approximately one-third; therefore, we’ll test the hypothesis that \(\kappa = 1/3\), which implies that \(\beta_i = 0.5\).
Note that these two hypotheses imply that \(\beta_v=-0.5\).
Simple confidence interval and test
The asymptotic distribution of the estimators is a multivariate normal distribution. The distribution of one estimator (say \(\hat{\beta}_1\)) is a univariate normal distribution with, for the two covariates case, a standard deviation equal to: \(\sigma_{\hat{\beta}_1} = \frac{\sigma_\epsilon}{\sqrt{N}\hat{\sigma}_1\sqrt{1 - \hat{\rho}_{12} ^ 2}}\). Therefore \((\hat{\beta}_1 - \beta_1) / \sigma_{\hat{\beta}_1}\) follows (exactly if \(\epsilon\) is normal) a standard normal distribution. Confidence intervals and tests for a coefficient are therefore computed exactly as in the case of the simple linear regression model. In particular, using Equation 3.18 and Equation 3.19, we get:
that follows exactly a Student distribution with \(N-K-1\) degrees of freedom if the errors are normal and asymptotically a normal distribution whatever the distribution of the errors. Therefore, \(1 - \alpha\) confidence intervals are: \(\hat{\beta}_k \pm \mbox{cv}_{1-\alpha / 2} \dot{\sigma}_{\hat{\beta}_k}\) where \(\mbox{cv}_{1-\alpha / 2}\) is the critical value of either a Student or a normal distribution. For a given hypothesis: \(H_0:\beta_k = \beta_{k0}\), \(t_{k0}=(\hat{\beta}_k - \beta_{k0})/\hat{\sigma}_{\hat{\beta}_k}\) is a draw on a normal or a Student distribution if \(H_0\) is true.
Remember that “reasonable” values of the two slopes in our growth model should be \(\pm 0.5\). We check whether these values are in the confidence intervals, using the confint function:
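For instance:
confint(slw_tot)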
\(0.5\) and \(-0.5\) are not in the 95% confidence intervals for respectively \(\hat{\beta}_i\) and \(\hat{\beta}_v\). The hypotheses that the true values of the parameters are 0 are very easy to test, as the corresponding t statistics are routinely returned by the summary method for lm objects. But, in our example, the hypothesis that \(\beta_i = 0\) is the hypothesis that the share of profits is 0, which is of little interest. More interestingly, we could test the hypothesis that the coefficients are equal to \(\pm 0.5\). In this case, we can “manually” compute the test statistics by extracting the relevant elements in the matrix returned by coef(summary(x)):
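A minimal sketch of this computation (the object names t_i and t_v are illustrative; the row names follow the covariate names of the fitted model):
ct <- coef(summary(slw_tot))
t_i <- (ct["log(i)", "Estimate"] - 0.5) / ct["log(i)", "Std. Error"]
t_v <- (ct["log(v)", "Estimate"] + 0.5) / ct["log(v)", "Std. Error"]
2 * pt(abs(c(i = t_i, v = t_v)), df.residual(slw_tot), lower.tail = FALSE)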
which confirms that both hypotheses are rejected. The same kind of linear hypothesis on one coefficient can be tested more simply by using a different parametrization of the same model. The trick is to write the model in such a way that a subset of coefficients is 0 if the hypothesis is true. For example, in order to test the hypothesis that \(\beta_i = 0.5\), we must have in the model \((\beta_i - 0.5)\ln i\); therefore, the term \(-0.5\ln i\) is added on the right side of the formula and should therefore be added also on the left side. Adding also \(0.5 \ln v\) on both sides of the formula, we finally get:
\[
(\ln y - 0.5 \ln i + 0.5 \ln v) = \alpha + (\beta_i - 0.5) \ln i + (\beta_v + 0.5) \ln v + \epsilon
\]
and the two slopes are now equal to 0 under \(H_0\).
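A sketch of the corresponding regression (the object name slw_rep is illustrative):
slw_rep <- lm(I(log(gdp85) - 0.5 * log(i) + 0.5 * log(v)) ~ log(i) + log(v), growth_sub)
coef(summary(slw_rep))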
which gives exactly the same values for the \(t\) statistics.
A simple hypothesis may also concern a linear combination of several coefficients and not the value of one coefficient. For example, the structural growth model implies that \(\beta_i + \beta_v = 0\). If the hypothesis is true, \(\mbox{E}(\hat{\beta}_i + \hat{\beta}_v) = 0\), the variance being \(\mbox{V}(\hat{\beta}_i + \hat{\beta}_v) = \hat{\sigma}_{\hat{\beta}_1}^2 + \hat{\sigma}_{\hat{\beta}_2}^2 + 2 \hat{\sigma}_{\hat{\beta}_1\hat{\beta}_2}\), the statistic can then be computed by extracting the relevant elements of the covariance matrix of the fitted model:
v <- vcov(slw_tot)
v_sum <- v[2, 2] + v[3, 3] + 2 * v[2, 3]
stat_sum <- (coef(slw_tot)[2] + coef(slw_tot)[3]) %>% unname
t_sum <- stat_sum / sqrt(v_sum) %>% unname
pval_sum <- 2 * pt(abs(t_sum), df = df.residual(slw_tot), lower.tail = FALSE)
c(stat = stat_sum, t = t_sum, pv = pval_sum)
##    stat       t      pv 
## -0.6996 -1.1372  0.2592
The hypothesis is therefore not rejected, even at the 10% level. Once again, such a linear hypothesis can be more easily tested using a different parametrization. Introducing in the model the term \((\beta_i + \beta_v) \ln i\), subtracting \(\beta_v \ln i\) and rearranging terms:
\[
\begin{array}{rcl}
\ln y &=& \alpha + (\beta_i + \beta_v) \ln i + \beta_v \ln v -
\beta_v \ln i+ \epsilon \\
& = & \alpha + (\beta_i + \beta_v) \ln i + \beta_v
\ln \frac{v}{i} + \epsilon
\end{array}
\] We then have a model for which the two covariates are now \(\ln i\) and \(\ln (v / i)\), the hypothesis being that the coefficient associated with \(\ln i\) is equal to 0.
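A sketch of this second reparametrization; the coefficient of log(i) now estimates \(\beta_i + \beta_v\) (the object name slw_sum is illustrative):
slw_sum <- lm(log(gdp85) ~ log(i) + log(v / i), growth_sub)
coef(summary(slw_sum))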
In the simple regression model, subtracting the expected value and dividing by the standard deviation, we get a standard normal deviate. Taking the square, we get a \(\chi^2\) with 1 degree of freedom. If the \(K\) slopes are uncorrelated, \(\sum_{k=1} ^ K(\hat{\beta}_k - \beta_{k0}) ^ 2/\sigma_{\hat{\beta}_k} ^ 2\) is a \(\chi^2\) with \(K\) degrees of freedom. If the slopes are correlated, this correlation should be “corrected”; more precisely, a quadratic form of the vector of slopes in deviation from its expectation with the inverse of its variance should be computed:
the last equality resulting from the fact that the coefficient of correlation of two slopes is the opposite of the coefficient of correlation of the corresponding covariates (see Equation 3.18). Therefore, a \((1-\alpha)\) confidence interval for a couple of coefficients \((\beta_1, \beta_2)\) is the set of values for which Equation 3.23 is lower than the critical value of a \(\chi ^ 2\) with 2 degrees of freedom, which is 5.99 for \(1-\alpha = 0.95\). Equating Equation 3.23 to this critical value, we get the equation of an ellipse, with two nested special cases:
if \(\hat{\rho}_{12} = 0\), i.e., if the two covariates and therefore the two coefficients are uncorrelated, the expression reduces to the sum of squares of the t statistics, which is an ellipse with vertical and horizontal tangents,
if \(\hat{\rho}_{12} = 0\) and \(\hat{\sigma}_1 = \hat{\sigma}_2\), the expression reduces further to the equation of a circle.
Equation 3.23 can’t be computed as it depends on \(\sigma_\epsilon\) which is unknown. Replacing \(\sigma_\epsilon\) by \(\dot{\sigma}_\epsilon\), we get:
where \(\hat{q}_2\) now follows asymptotically a \(\chi ^ 2\) distribution with 2 degrees of freedom. If the errors are normal, dividing by \(K\) (here 2), we get an exact Fisher \(F\) distribution with \(2\) and \(N - K - 1\) degrees of freedom. In the simple case of no correlation between the covariates, the \(\chi^2\) and the \(F\) statistics are therefore the sum and the mean of the squares of the t statistics. Just as the Student distribution converges in distribution to a normal distribution, \(K \times\) the \(F\) statistic converges in distribution to a \(\chi ^ 2\) with \(K\) degrees of freedom. For example, with \(K=2\), the critical value for a \(F\) distribution with \(2\) and \(\infty\) degrees of freedom is half the corresponding \(\chi^2\) value with 2 degrees of freedom (5.99 for the 95% confidence level), which is \(2.996\). The confidence interval is represented in Figure 3.4; the circle point is the point estimate, the vertical and horizontal segments are the separate confidence intervals for both coefficients at the 95% level. We’ve also added a diamond point that corresponds to the hypothesis \(\kappa = 1/3\), which implies that \(\beta_i = - \beta_v = \frac{\kappa}{1-\kappa} = 0.5\). Finally, we add a line with a slope equal to \(-1\) and an intercept equal to 0, which corresponds to the hypothesis that \(\beta_i = - \beta_v\).
The confidence ellipse is a “fat sausage” with the long part of the sausage slightly oriented in the lower-left/upper-right direction. This is because, as we have seen previously (see Figure 3.1), the two covariates exhibit a small negative correlation, which implies a small positive correlation between \(\hat{\beta}_i\) and \(\hat{\beta}_v\). The ellipse is also “higher” than “wide”, because \(v\) has a smaller variance than \(i\) and therefore \(\hat{\beta}_v\) has a larger variance than \(\hat{\beta}_i\). Note the difference between a set of two simple hypotheses and a joint hypothesis. Consider for example:
\((\beta_i = 1.1, \beta_v = -1)\), represented by a square point: neither simple hypothesis is rejected (the two values are in the unidimensional confidence intervals), but the joint hypothesis is rejected, as the corresponding square point is outside the 95% confidence ellipse,
\((\beta_i = 1.25, \beta_v = -3.1)\), represented by a triangle point: the simple hypothesis \(\beta_i = 1.25\) is not rejected, the simple hypothesis \(\beta_v = -3.1\) is rejected, but the joint hypothesis is not rejected, the triangle point being inside the 95% confidence ellipse.
The hypothesis that the two coefficients sum to 0 is not rejected, as some points of the straight line that represents this hypothesis are in the confidence ellipse. Concerning the hypothesis that \(\beta_i = 0.5\) and \(\beta_v = -0.5\), the two simple hypotheses and the joint hypothesis are rejected at the 95% confidence level; the two hypothesized values are neither within the segments representing the simple confidence intervals nor inside the ellipse representing the joint confidence region.
Joint hypothesis
To test a joint hypothesis for the values of a couple of parameters \((\beta_{10}, \beta_{20})\), we have just seen that we can simply check whether the corresponding point is inside or outside the confidence interval ellipse. We can also compute the statistic given in Equation 3.24 for \((\beta_1=\beta_{10}, \beta_2=\beta_{20})\) and compare it to the critical value.
The statistic is computed using the elements of the matrix returned by coef(summary(x)), which contains in particular the estimates and their standard deviations. We first compute the t statistics corresponding to the two simple hypotheses \(\beta_i = 0.5\) and \(\beta_v = -0.5\):
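A sketch of the computation, reusing the t_i and t_v statistics computed above for \(H_0: \beta_i = 0.5\) and \(\beta_v = -0.5\):
chi2_approx <- t_i ^ 2 + t_v ^ 2
F_approx <- chi2_approx / 2
c(chi2 = chi2_approx, F = F_approx)
c(qchisq(0.95, 2), qf(0.95, 2, df.residual(slw_tot)))   # critical values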
The critical values being much smaller than the computed statistics, the joint hypothesis is clearly rejected. The exact formula corrects for the correlation between the two estimators using the coefficient of correlation between the two covariates. We compute this coefficient using the data frame of the fitted model, which is obtained using the model.frame function. Using the initial data frame growth wouldn’t give the correct value, as the estimation is not performed on the full data set: some observations have missing values and some groups of countries are excluded:
mf <- model.frame(slw_tot)
r_iv <- summarise(mf, r = cor(`log(i)`, `log(v)`)) %>% pull(r)
r_iv
## [1] -0.352
Note that the names of the covariates are not regular names as they contain parentheses; therefore, they should be surrounded by backticks (`). The coefficient of correlation between the two covariates is \(-0.352\). We obtain:
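A sketch of one way to compute the exact statistic, using the fact that the correlation of the two estimators is \(-\hat{\rho}_{12}\) (see Equation 3.18):
q2 <- (t_i ^ 2 + t_v ^ 2 + 2 * r_iv * t_i * t_v) / (1 - r_iv ^ 2)
c(chi2 = q2, F = q2 / 2)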
The statistic (23.125) is greater than the approximate value (15.475) computed previously; we therefore reject once again the null hypothesis.
The three tests
In the previous subsection, we started with a general (unconstrained) model, we constructed a confidence ellipse for the two parameters, and we were able to test a set of hypotheses, either by checking whether the values of the parameters corresponding to the hypotheses were inside the confidence ellipse, or by computing the value of the statistic for the tested values of the parameters. Actually, this testing principle, based on the unconstrained model, is just one way of testing a hypothesis. The geometry of least squares and the Frisch-Waugh theorem highlight the fact that any set of hypotheses can be tested using the fact that this set of hypotheses gives rise to two models: a constrained model, which imposes the hypotheses, and an unconstrained model, which doesn’t impose them. The same test can be performed using the constrained model, the unconstrained model or both, which gives rise to three test principles:
the Wald test is based only on the unconstrained model,
the Lagrange multiplier (or score) test is based on the constrained model,
the likelihood ratio test is based on the comparison between the two models.
A set of \(J\) linear hypotheses is written as:
\[
R \gamma = q
\]
where \(R\) is a matrix of dimension \(J \times (K + 1)\) and \(q\) is a vector of length \(J\). \(J\), the number of hypotheses, is necessarily less than or equal to \(K\). Actually, a set of \(J\) hypotheses can always be rewritten as a model of the form \(y = X_1\beta_1 + X_2\beta_2 + \epsilon\), the hypothesis being \(\mbox{H}_0: \beta_2 = 0\). In this setting, the three tests are easily constructed using the Frisch-Waugh theorem.
Wald test
The Wald test is based on the unconstrained model, for which a vector of slopes \(\hat{\beta}_2\) is estimated. Using the Frisch-Waugh theorem, this vector can be obtained as the regression of the residuals of \(y\) on \(X_1\) (\(M_1y\)) on the residuals of every column of \(X_2\) on \(X_1\) (\(M_1X_2\)). Then, \(\hat{\beta}_2 = (X_2^\top M_1 X_2) ^ {-1} X_2^\top M_1 y\), with expected value and variance equal to \(\beta_2\) (0 under \(H_0\)) and \(\mbox{V}(\hat{\beta}_2) = \sigma_\epsilon^2 (X_2^\top M_1 X_2) ^ {-1}\). Convergence in distribution implies that:
The distribution of the quadratic form of a centered vector of normal random variables of length \(J\) with the inverse of its covariance matrix is a \(\chi^2\) with \(J\) degrees of freedom:
Score test
Consider the constrained model, which imposes \(\beta_2 = 0\). It is therefore obtained by regressing \(y\) on \(X_1\) only, and the vector of residuals is \(\hat{\epsilon}_1 = M_1 y\). The idea of this test is that, if \(H_0\) is true, \(X_2 ^ \top \hat{\epsilon}_1\) should be close to zero, \(\hat{\epsilon}_1\) being “almost” orthogonal to the subspace spanned by \(X_2\). Therefore, we consider the vector \(X_2 ^ \top \hat{\epsilon}_1 = X_2 ^ \top M_1 y\), which, under \(H_0\), should have a 0 expected value. The variance of this vector is \(\sigma_\epsilon ^ 2 X_2^\top M_1X_2\), so that, applying the central-limit theorem:
Likelihood ratio test
The likelihood ratio test is based on the comparison of the objective function (the sum of squared residuals) for the constrained and the unconstrained models. Remember, from Equation 3.14, that:
The first term on the left and the term on the right are residual sums of squares (respectively for the unconstrained and the constrained model). Therefore, the likelihood ratio test is based on: \(\mbox{SSR}_c - \mbox{SSR}_{nc} = ||P_{M_1X_2}y|| ^ 2 = y ^ \top P_{M_1X_2} y\). Dividing by \(\sigma_\epsilon^2\), we get exactly the same statistic as previously.
Computation of the three tests
Consider the augmented Solow model: \(y = \alpha + \beta_i i + \beta_v v + \beta_e e + \epsilon\), with \(\beta_i = \frac{\kappa}{1 - \kappa - \lambda}\), \(\beta_v = -\frac{\kappa + \lambda}{1 - \kappa - \lambda}\) and \(\beta_e = \frac{\lambda}{1 - \kappa - \lambda}\). For convenience, we compute the logarithm of the variables before using lm:
growth_sub2 <- growth_sub %>%
  mutate(y = log(gdp85), i = log(i), v = log(v), e = log(e))
mrw2 <- lm(y ~ i + v + e, growth_sub2)
We consider two hypotheses:
\(\beta_i + \beta_v + \beta_e = 0\), this hypothesis is directly implied by the structural model,
\(\kappa = 1 /3\), the share of profits has a reasonable value.
The second hypothesis implies that \(\beta_i = \frac{1/3}{2/3 - \lambda}\) and \(\beta_e = \frac{\lambda}{2/3 - \lambda}\), and therefore that \(\beta_e = 2 \beta_i - 1\). The model can be reparametrized in such a way that two slopes are 0 if the two hypotheses are satisfied:
\[
y + e - v = \alpha + \beta_i (i + 2e - 3v) + (\beta_e-2\beta_i+1)(e-v)+(\beta_i + \beta_e+\beta_v)v + \epsilon
\]
Therefore, the unconstrained model can be written as a model with \(y + e - v\) as the response and \(i + 2 e - 3 v\), \(e-v\) and \(v\) as the three covariates, and the constrained model as a model with the same response but with \(i + 2 e - 3 v\) as the unique covariate.
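A sketch of the unconstrained reparametrized regression, whose coefficients are displayed in the table below (the names y2, i2, e2 and mrw_nc are illustrative):
growth_sub2 <- growth_sub2 %>%
  mutate(y2 = y + e - v, i2 = i + 2 * e - 3 * v, e2 = e - v)
mrw_nc <- lm(y2 ~ i2 + e2 + v, growth_sub2)
coef(summary(mrw_nc))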
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.79131 1.1924 6.5340 8.301e-09
i2 0.70037 0.1506 4.6510 1.488e-05
e2 0.32981 0.3611 0.9133 3.642e-01
v -0.06886 0.4654 -0.1480 8.828e-01
The two hypotheses can be tested one by one, as the test is that:
the slope of \(e2 = e-v\) equals 0 for the hypothesis that \(\kappa = 1/3\),
the slope of \(v\) equals 0 for the hypothesis that \(\beta_i+\beta_e+\beta_v=0\).
Neither hypothesis is rejected, even at the 10% level. To test the joint hypothesis, we can first use the approximate formula which is the sum or the mean of the squares of the t statistics (respectively a \(\chi^2\) with 2 degrees of freedom and an F with 2 and 71 degrees of freedom):
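A sketch of this approximation, using the t statistics of the e2 and v slopes of the reparametrized model fitted above:
ct2 <- coef(summary(mrw_nc))
chi2_a <- ct2["e2", "t value"] ^ 2 + ct2["v", "t value"] ^ 2
c(chi2 = chi2_a, F = chi2_a / 2)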
Based on this approximation, the joint hypothesis is clearly not rejected. We now turn to the computation of the statistics, using the three test principles.
Wald test
Considering the initial unconstrained model, for which the formula is \(y \sim i + v + e\), the set of the two hypotheses can be written in matrix form as:
\[
\left(
\begin{array}{cccc}
0 & 2 & 0 & - 1 \\
0 & 1 & 1 & 1 \\
\end{array}
\right)
\left(
\begin{array}{c}
\alpha \\ \beta_i \\ \beta_v \\ \beta_e
\end{array}
\right) =
\left(
\begin{array}{c}
1 \\ 0
\end{array}
\right)
\] The matrix \(R\) and the vector \(q\) are created in R:
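A direct transcription of the two restrictions:
R <- matrix(c(0, 2, 0, -1,
              0, 1, 1, 1),
            nrow = 2, byrow = TRUE)
q <- c(1, 0)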
We then have \(H_0: R\gamma - q = 0\) and, for the fitted unconstrained model, we get: \(R\hat{\gamma} - q\). Under \(H_0\) the expected value of this vector is 0 and its estimated variance is \(R \hat{V}_{\hat{\gamma}}R^\top=\hat{\sigma}_\epsilon ^ 2R(Z^\top Z)^{-1}R^\top\). Then:
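A sketch of the computation of the Wald statistic for the fitted model mrw2:
d <- R %*% coef(mrw2) - q
Vd <- R %*% vcov(mrw2) %*% t(R)
wald <- drop(t(d) %*% solve(Vd, d))
c(chi2 = wald, F = wald / 2)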
car::linearHypothesis performs Wald tests with a nice syntax: the first argument is a fitted model, and the second one is a vector of characters which contains the character representation of the hypothesis:
car::linearHypothesis(mrw2, c("i + v + e = 0", "e = 2 * i - 1")) %>% gaze
## F = 0.423, df: 2-71, pval = 0.657
Likelihood ratio test
The computation of the likelihood ratio statistic is very simple once the two models have been estimated. The residual sums of squares of the two models are extracted using the deviance method, and we divide the difference of the two sums of squares by \(\dot{\sigma}_\epsilon^2\) (the sigma method is used to extract the residual standard error) and by 2 to get an F statistic.
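A sketch of this computation, assuming mrw_c denotes the constrained model (same response, \(i + 2e - 3v\) as the unique covariate):
mrw_c <- lm(y2 ~ i2, growth_sub2)
(deviance(mrw_c) - deviance(mrw2)) / sigma(mrw2) ^ 2 / 2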
Score test
For the score test, we consider the reparametrized model and we define \(Z_1\) as a matrix containing a column of ones and \(i + 2 e - 3 v\), and \(X_2\) as a matrix containing \(e-v\) and \(v\).
The test is based on the vector \(X_2 ^ \top \hat{\epsilon}_c\), where \(\hat{\epsilon}_c\) is the vector of the residuals for the constrained model. Under \(H_0\), the expected value of \(X_2 ^ \top \hat{\epsilon}_c\) is 0 and its variance is \(\sigma_\epsilon ^ 2 X_2^ \top M_1 X_2\). \(M_1 X_2\) is a matrix of residuals of all the columns of \(X_2\) on \(Z_1\).
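A sketch of these computations; the matrices are built from the model frame of the unconstrained model, so that the same 75 complete observations are used throughout (the object names are illustrative):
mf_mrw <- model.frame(mrw2)
y2 <- with(mf_mrw, y + e - v)
Z1 <- with(mf_mrw, cbind(1, i + 2 * e - 3 * v))
X2 <- with(mf_mrw, cbind(e - v, v))
eps_c <- resid(lm(y2 ~ Z1 - 1))   # residuals of the constrained model
M1X2 <- resid(lm(X2 ~ Z1 - 1))    # residuals of each column of X2 on Z1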
Note that we’ve used lm with only a formula argument. In this case, the response and the covariates are vectors or matrices and not columns of a tibble. Note also that the left-hand side of the formula is a matrix and not a vector; in this case, each column is treated as a response and lm fits as many models as there are columns in this matrix. Then, the resid method no longer returns a vector, but a matrix, each column being the vector of residuals for one of the fitted models.
The statistic is then \(\hat{\epsilon}_c^\top X_2 \left[X_2^ \top M_1 X_2\right]^{-1} X_2^\top \hat{\epsilon}_c / \sigma_\epsilon^2\); it is computed using an estimator of \(\sigma_\epsilon^2\) and dividing by \(J=2\) to get an F statistic.
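A sketch of the computation, with \(\sigma_\epsilon\) estimated from the constrained model:
s <- crossprod(X2, eps_c)
sigma_c <- sigma(lm(y2 ~ Z1 - 1))
drop(t(s) %*% solve(crossprod(M1X2), s)) / sigma_c ^ 2 / 2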
The statistic is slightly different from the one computed previously, the difference being only due to the fact that the estimation of \(\sigma_\epsilon\) is based, for the score test, on the constrained model.
Testing that all the slopes are 0
The test that all the slopes are 0 is routinely reported by software performing OLS estimation. It can be computed using any of the three test principles, but the likelihood ratio test is particularly appealing in this context, the constrained model being a model with only an intercept: \(y_n = \alpha + \epsilon_n\). In this case, \(\hat{\alpha} = \bar{y}\) and \(\hat{\epsilon}_n = y_n - \bar{y}\). Therefore, the residual sum of squares for the constrained model is \(S_{yy} = \sum_n (y_n - \bar{y}) ^ 2\) and is also denoted by \(\mbox{TSS}\) for total sum of squares. The statistic is then: \((\mbox{TSS} - \mbox{RSS})/\sigma_\epsilon ^ 2 \sim \chi ^ 2_K\), which is a \(\chi ^ 2\) with \(K\) degrees of freedom if the hypothesis that all the slopes are 0 is true. To compute this statistic, \(\sigma_\epsilon^2\) has to be estimated. A natural estimator is \(\mbox{RSS} / (N - K - 1)\) but, if \(H_0\) is true, \(\mbox{TSS} / (N - 1)\) is also an unbiased estimator. Moreover, dividing by the sample size (\(N\)) and not by the number of degrees of freedom leads to biased but consistent estimators. Using the first estimator of \(\sigma_\epsilon^2\) and dividing by \(K\), we get the \(F\) statistic with \(K\) and \(N-K-1\) degrees of freedom:
\[
\frac{(\mbox{TSS} - \mbox{RSS}) / K}{\mbox{RSS} / (N - K - 1)} \sim F_{K, N - K - 1}
\]
Using \(\mbox{TSS} / N\) as an estimator of \(\sigma_\epsilon^2\), we get a very simple statistic that is asymptotically a \(\chi^2\) with \(K\) degrees of freedom:
\[
N \frac{\mbox{TSS} -\mbox{RSS}}{\mbox{TSS}} \stackrel{a}{\sim} \chi^2_{K}
\]
These two statistics are closely related to the \(R^2\) which is, using this notation, equal to \(1 - \mbox{RSS} / \mbox{TSS} = (\mbox{TSS} - \mbox{RSS}) / \mbox{TSS}\). We can then write the \(F\) statistic as:
\[
\frac{R^2}{1 - R ^ 2}\frac{N - K - 1}{K}
\] and the asymptotic \(\chi^2\) statistic as \(N R^2\).
There is no dedicated extractor function for the \(R^2\) and the \(F\) statistic in base R, but both are computed and stored by the summary method of lm.
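For instance, assuming mrw2 is the fitted model used above, they can be retrieved from the summary object:

smry <- summary(mrw2)
smry$r.squared    # coefficient of determination
smry$fstatistic   # F statistic, with its numerator and denominator degrees of freedom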
The interest of this testing strategy is not limited to the hypothesis that all the slopes of the original model are 0. It can also be used to test any set of linear hypotheses, using reparametrization and the Frisch-Waugh theorem. Consider for example the joint hypothesis that \(\kappa = 0.5\) and \(\beta_i + \beta_e + \beta_v=0\). We have seen previously that, after reparametrization, this corresponds to the model:
with, if the two hypotheses are true, the two slopes associated with e2 and v equal to 0. Now, using the Frisch-Waugh theorem, and denoting \(Z_1 = (j, i_2)\) and \(X_2 = (e_2, v)\):
We get exactly the same estimators as previously for e2 and v, but now the joint hypothesis is that all the slopes of the second model are 0. Therefore, the test is based on the F statistic of this second regression and doesn’t require any further computation:
mrw_ncb %>% ftest %>% gaze
## F = 0.435, df: 2-73, pval = 0.649
Actually, a degrees of freedom correction should be performed to get exactly the same result, because mrw_ncb has \(75 - 2 = 73\) degrees of freedom, whereas the correct number of degrees of freedom is \(75 - 4 = 71\).
3.7 System estimation and constrained least squares
Very often in economics, the phenomenon under investigation is not well described by a single equation, but by a system of equations. Moreover, there may be inter-equation constraints on the coefficients. This is particularly the case in the microeconometrics of consumption and production. For example, the behavior of a producer is described by a minimum cost equation along with factor demand equations, and the behavior of a consumer is described by a set of demand equations.
System of equations
We consider therefore a system of \(L\) equations denoted by \(y_l=Z_l\beta_l+\epsilon_l\), with \(l=1\ldots L\). In matrix form, the system can be written as follows:
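\[
\left(\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_L \end{array}\right) =
\left(\begin{array}{cccc} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_L \end{array}\right)
\left(\begin{array}{c} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_L \end{array}\right) +
\left(\begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_L \end{array}\right)
\]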
Therefore, the whole system can be estimated directly by stacking the vector of responses and by constructing a block-diagonal matrix of covariates, each block being the matrix of covariates for one equation.
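As a simple illustration of this principle, consider a toy two-equation system, unrelated to the application developed below (a minimal sketch, not the book’s code):

library(Matrix)
set.seed(1)
N <- 100
Z1 <- cbind(1, rnorm(N))                  # covariates of equation 1
Z2 <- cbind(1, rnorm(N), rnorm(N))        # covariates of equation 2
y1 <- Z1 %*% c(1, 2) + rnorm(N)
y2 <- Z2 %*% c(-1, 0.5, 3) + rnorm(N)
Z <- as.matrix(bdiag(Z1, Z2))             # block-diagonal covariate matrix
y <- c(y1, y2)                            # stacked responses
coef(lm(y ~ Z - 1))                       # identical to equation-by-equation OLS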
As an example, consider the analysis of production characteristics (returns to scale, elasticities of substitution between pairs of inputs). The modern approach to production analysis consists of first considering the minimum cost function, which depends on the level of production and on input unit prices, \(C(y, p_1, \ldots, p_J)\), and then of deriving the input demands using Shephard's lemma:
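\[
x_j(y, p_1, \ldots, p_J) = \frac{\partial C}{\partial p_j}(y, p_1, \ldots, p_J)
\]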
The cost function is obviously homogeneous of degree 1 in input unit prices, which means that, for a given level of output, if all the prices increase proportionally, the quantities of the different inputs are unchanged and therefore the cost increases by the same proportion. Formally, \(C(y, \lambda p_1, \ldots \lambda p_J) = \lambda C(y, p_1, \ldots, p_J)\) and \(x_j(y, \lambda p_1, \ldots, \lambda p _J) = x_j(y, p_1, \ldots p_J)\); the latter relation indicates that the input demands are homogeneous of degree 0 in input unit prices. Among the different functional forms that have been proposed to estimate the cost function, the translog specification is the most popular. It can be considered as a second-order approximation of a general cost function:
\[
\ln C = \alpha + \beta_y \ln y + \frac{1}{2}\beta_{yy} \ln^2 y +\sum_i \beta_i \ln p_i + \frac{1}{2}
\sum_i\sum_j \beta_{ij} \ln p_i \ln p_j
\]
Using Shephard's lemma, the cost share of input \(i\) is the derivative of \(\ln C\) with respect to \(\ln p_i\).
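With symmetric second-order coefficients (\(\beta_{ij} = \beta_{ji}\)), this derivative, the cost share \(s_i\), is:
\[
s_i = \frac{\partial \ln C}{\partial \ln p_i} = \beta_i + \sum_j \beta_{ij} \ln p_j
\]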
Homogeneity of degree 1 in input prices implies that the cost shares don’t depend on the level of prices. Therefore, \(\sum_j^I \beta_{ij} = 0\), or \(\beta_{iI} = - \sum_j^{I-1} \beta_{ij}\) and:
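\[
s_i = \beta_i + \sum_{j = 1}^{I - 1} \beta_{ij} \ln p_j^*, \quad\mbox{with } p_j^* = p_j / p_I
\]
where \(p_I\) is the price of the input chosen as the numeraire; the cost function is rewritten in the same way, with \(C / p_I\) as the response and the \(I - 1\) normalized prices \(p_j^*\) as covariates. This is the usual way of imposing linear homogeneity on the translog.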
There are 14 parameters to estimate in total (8 in the cost function and 3 in each of the cost share equations), but there are 6 linear restrictions; for example, the coefficient of \(p_1^*\) in the cost equation should be equal to the intercept of the cost share for the first factor.
We estimate the translog cost function with the apples data set of Ivaldi et al. (1996) who studied the production cost of apple producers. Farms in this sample produce apples and other fruits (respectively apples and otherprod). The authors observe the sales of apples and other fruits as well as the quantity of apples produced. Therefore, they are able to compute the unit price of apples. Both sales are divided by this unit price, so that apples is measured in apple quantity, and otherprod is measured in “equivalent” apple quantities. Therefore, they can be summed in order to have a unique output variable y. The expenses in the three factors are given by capital, labor and materials and the corresponding unit prices are pc, pl and pm. The data set is an unbalanced panel of 173 farms observed for three years (1984, 1985 and 1986). We consider only one year (1985) and, for a reason that will be clear later, we divide all the variables by their sample mean:
These equations can be estimated one by one using OLS, but in this case, the cross-equation restrictions are ignored. The whole system can also be estimated directly by stacking the three vectors of responses and constructing a block-diagonal matrix of covariates, each block being the relevant set of covariates for one equation. We use for this purpose the Formula package (Zeileis and Croissant 2010). This package extends usual formulas in two directions: first, several responses can be indicated on the left-hand side of the formula (using the + operator), and second, several parts can be defined on both sides of the formula, using the | operator. For example:
y_1 + y_2 | y_3 ~ x_1 + x_2 | x_3 | x_4 + x_5
This formula has two sets of responses, the first containing \(y_1\) and \(y_2\), and the second \(y_3\). Three sets of covariates are defined on the right-hand side of the formula.
For our production analysis, we first create a “meta” formula which contains the three responses on the left side and the whole set of covariates on the right side. Then, we extract, using model.matrix, the three model matrices for the three equations (Z_c, Z_l and Z_m respectively for the cost, labor share and material share equations). The column names of these matrices are customized using the nms_cols function which, for example, turns the original column names of Z_l ((Intercept), pl and pm) to sl_cst, sl_pl and sl_pm. We then construct the block diagonal matrix (using the Matrix::bdiag function) and use our customized names:
Formula::model.part makes it possible to retrieve any part of the model. Here, we want to extract the three responses, which form the single left-hand part of the formula. Therefore, we set lhs and rhs respectively to 1 and 0. The result is a data frame with three variables ct, sl and sm. We then use tidyr::pivot_longer to stack the three responses in one column. Note the use of the optional argument cols_vary, set to "slowest", so that the elements are stacked columnwise:
In order to avoid having to write the whole long list of covariates, the dot can be used on the right-hand side of the formula, which means in this context all the variables (except the response which is on the left-hand side of the formula). The intercept and the equation variable should be omitted from the regression.
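As an illustration of this shortcut (the names stacked_data, value and eq for the stacked data frame, the stacked response and the equation identifier are assumptions, not necessarily the objects used here):

lm(value ~ . - 1 - eq, data = stacked_data)   # dot = all other columns; drop intercept and eq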
Linear restrictions on the vector of coefficients to be estimated can be represented using a matrix \(R\) and a numeric vector \(q\): \(R\gamma = q\), where \(\gamma ^ \top = (\gamma_1 ^ \top, \ldots, \gamma_L ^ \top)\) is the stacked vector of the coefficients for the whole system of equations. The OLS estimator is now the solution of a constrained optimization problem. Denoting \(\lambda\) a vector of Lagrange multipliers and \(\epsilon\) the errors, the objective function is:
\[
L = \epsilon^\top \epsilon + 2\lambda^\top(R\gamma-q)
\]
The Lagrangian can also be written as:
\[
L = y^\top y - 2 \gamma^\top Z^\top y + \gamma^\top Z^\top Z \gamma +
2\lambda^\top(R\gamma-q)
\]
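Setting the derivatives of the Lagrangian with respect to \(\gamma\) and \(\lambda\) to 0, we obtain the partitioned linear system:
\[
\left(\begin{array}{cc} Z^\top Z & R^\top \\ R & 0 \end{array}\right)
\left(\begin{array}{c} \gamma \\ \lambda \end{array}\right) =
\left(\begin{array}{c} Z^\top y \\ q \end{array}\right)
\]
Denoting \(A_{11} = Z^\top Z\), \(A_{12} = A_{21}^\top = R^\top\) and \(A_{22} = 0\), the partitioned inverse formula gives the first row of blocks of the inverse as \(B_{11} = A_{11}^{-1} + A_{11}^{-1}A_{12}F_2A_{21}A_{11}^{-1}\) and \(B_{12} = -A_{11}^{-1}A_{12}F_2\),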
with \(F_2=\left(A_{22}-A_{21}A_{11}^{-1}A_{12}\right)^{-1}\). We have here \(F_2=-\left(R(Z^\top Z)^{-1}R^\top\right)^{-1}\). The constrained estimator is then: \(\hat{\gamma}_c=B_{11}Z^\top y+ B_{12}q\), with \(B_{11} = (Z^\top Z)^{-1}\left(I-R^\top(R(Z^\top Z)^{-1}R^\top)^{-1}R(Z^\top Z)^{-1}\right)\) and \(B_{12}=(Z^\top Z)^{-1}R^\top\left(R(Z^\top Z)^{-1}R^\top\right)^{-1}\)
The unconstrained estimator being \(\hat{\gamma}_{nc}=\left(Z^\top Z\right)^{-1}Z^\top y\), we finally get:
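\[
\hat{\gamma}_c = \hat{\gamma}_{nc} - (Z^\top Z)^{-1}R^\top\left[R(Z^\top Z)^{-1}R^\top\right]^{-1}\left(R\hat{\gamma}_{nc} - q\right)
\tag{3.27}\]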
The difference between the constrained and unconstrained estimators is therefore a linear combination of \(R\hat{\gamma}_{nc} - q\), i.e., of the amount by which the unconstrained estimator violates the linear constraints. For the system of cost and factor shares for apple production previously described, we have 14 coefficients and 6 restrictions:
For example, the first row of R computes the difference between the fourth coefficient (cost_pl) and the ninth coefficient (sl_cst). As the vector \(q\) is 0 in our example, this means that the coefficient of \(p_l^*\) in the cost equation should equal the intercept of the labor share equation. The constrained estimates are then obtained by applying Equation 3.27.
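A minimal sketch of this computation, assuming Z is the block-diagonal covariate matrix, y the stacked response, R the 6 × 14 restriction matrix and q a vector of six zeros (these names follow the text but are not guaranteed to match the book’s code):

ZZ1 <- solve(crossprod(Z))                                 # (Z'Z)^(-1)
g_nc <- ZZ1 %*% crossprod(Z, y)                            # unconstrained estimator
W <- R %*% ZZ1 %*% t(R)
g_c <- g_nc - ZZ1 %*% t(R) %*% solve(W, R %*% g_nc - q)    # Equation 3.27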
More simply, micsr::clm can be used: it computes the constrained least squares estimator and takes as arguments a lm object (the unconstrained model) and R, the matrix of restrictions (and optionally a vector q):
ols_const <- clm(ols_unconst, R)
and returns a lm object.
Finally, the systemfit package (Henningsen and Hamann 2007) is devoted to system estimation and provides a systemfit function. Its main arguments are a list of equations and a data frame, but it also has restrict.matrix and restrict.rhs arguments to provide respectively \(R\) and \(q\).
systemfit::systemfit(list(cost = eq_ct, labor = eq_sl, materials = eq_sm), data = ap, restrict.matrix = R)
The output of systemfit is large and is not reproduced here. The full strength of this package will be presented in later chapters, while describing the seemingly unrelated regression (Section 6.4.4) and the three-stage least squares (Section 7.4) estimators.
Davidson, Russell, and James G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.
Greene, William H. 2018. Econometric Analysis. 8th ed. Pearson.
Henningsen, Arne, and Jeff D. Hamann. 2007. “systemfit: A Package for Estimating Systems of Simultaneous Equations in R.” Journal of Statistical Software 23 (4): 1–40. https://www.jstatsoft.org/v23/i04/.
Ivaldi, Marc, Norbert Ladoux, Hervé Ossard, and Michel Simioni. 1996. “Comparing Fourier and Translog Specifications of Multiproduct Technology: Evidence from an Incomplete Panel of French Farmers.” Journal of Applied Econometrics 11 (6): 649–67. https://doi.org/10.1002/(sici)1099-1255(199611)11:6<649::aid-jae416>3.0.co;2-4.
Stock, James H., and Mark W. Watson. 2015. Introduction to Econometrics. Pearson.
Zeileis, Achim, and Yves Croissant. 2010. “Extended Model Formulas in R: Multiple Parts and Multiple Responses.” Journal of Statistical Software 34 (1): 1–13. https://www.jstatsoft.org/v34/i01/.
As seen in Section 2.1.4.2; to remove it, one has to use either + 0 or - 1 in the formula.↩︎
See Davidson and MacKinnon (1993), section 1.5, pp. 25-31.↩︎
If the errors are normal, the exact distribution of the estimators is normal (see Section 2.2.2).↩︎
Or of a linear combination of several estimators.↩︎
This section is largely based on Davidson and MacKinnon (1993), section 3.6, pp. 88-94. The three “classical” tests are often understood as tests suitable for models estimated by maximum likelihood (see Section 5.3.1). Davidson and MacKinnon (1993) advocate the presentation of the three test principles for other estimators, including the linear regression model.↩︎
# Multiple regression model {#sec-mult_reg_chapter}```{r }#| include: false#| label: multiple_regressionsource("../_commonR.R")```In this chapter, we'll analyze the computation and the properties ofthe OLS estimator when the number of covariates ($K$) is atleast 2. Actually, we'll analyze in depth the case when $K=2$(generalizing from $K=2$ to $K>2$ being quite simple) and, comparedto the previous two chapters, we'll insist on two important points:- the use of matrix algebra, which makes the computation of the estimator and the analysis of its properties elegant and compact,- the correlation between the covariates; actually we'll show that the multiple and the simple linear models are different only if there exists such a correlation.We'll roughly follow the same plan as for the simple linear model: @sec-model_data_multiple presents the structural model and the data set that we'll usethroughout the chapter, @sec-comp_ols_mult the computation of theestimator, @sec-geometry_multiple_ols thegeometry of the multiple linear model, @sec-computation_R_multiple the computation of the estimator with **R**, @sec-properties_ols_multiple its statistical properties and@sec-confint_test_multiple the inference methods (confidence interval and tests). Finally, @sec-system_equation presents system estimation and the constrained least squares estimator.## Model and data set {#sec-model_data_multiple}To illustrate the multiple regression model, we'll use the example ofthe estimation of a model explaining economic growth, using across-section of countries.### Structural modelOne of the most popular growth models in the economic literature is the Solow-Swan model.The production $Y$ (or more precisely the added value or GDP) isperformed using two production factors, labor $L$ and capital $K$.Physical labor is transformed in effective labor using a term called$A$. $A$ is time-varying (typically increasing) and therefore representsthe effect of technical progress which increases the productivity oflabor. The functional form of the production function is a Cobb-Douglas:$$Y(t) = K(t) ^ \kappa \left[A(t)L(t)\right]^{1 - \kappa}$$ {#eq-production_cobb_douglas}Coefficients for capital and labor are respectively $\kappa$ and$1-\kappa$. They represent the elasticity of the production respectiveto each factor, but also the share of each factor in the national income($\kappa$ is therefore the share of profits and $1-\kappa$ the share ofwages).Each variable is a continuous function of time $t$. We'll denote, foreach variable $\dot{V}=\frac{d V}{d t}$ the derivative with respectto time. Finally, we'll denote $y(t) = \frac{Y(t)}{A(t)L(t)}$ and $k(t)=\frac{Y(t)}{A(t)L(t)}$production and capital per unit of effective labor. We therefore have:$$y(t) = k(t) ^ \kappa$$ {\#eq-production_function}We'll hereafter omit $(t)$ to make the notation less cluttered.Variation of capital is investment less depreciation. We assume thatinvestment equals savings and that a constant percentage of income ($i$) issaved every year. The depreciation rate is denoted by $\delta$. 
We thenhave:$$\dot{K} = \frac{d K}{d t} = i Y - \delta K$$The growth of the capital stock per unit of effective labor $k$ is then:$$\dot{k} = \frac{d \frac{K}{AL}}{dt}=\frac{\dot{K}AL - (A\dot{L} +\dot{A}L)K}{A^2L^2}= \frac{\dot{K}}{AL} -\frac{K}{AL}\left(\frac{\dot{A}}{A} + \frac{\dot{L}}{L}\right) = (iy - \delta k)- k\left(\frac{\dot{A}}{A} + \frac{\dot{L}}{L}\right)$$Denoting $n = \frac{\dot{L}}{L}$ and $g = \frac{\dot{A}}{A}$ the growth rates of$L$ and $A$, i.e., the demographic growth rate and the technologicalprogress rate, we finally have:$$\dot{k}(t)=iy(t)-(n+g+\delta)k(t) = ik(t) ^ \kappa - (n+g+\delta)k(t)$$At the steady state, the growth rate of capital per unit of effectivelabor is 0. Solving $\dot{k}(t)=0$ we get the steady state value of$k(t)$, denoted by $k^*$:$$k^* = \left(\frac{i}{n + g+ \delta}\right) ^ \frac{1}{1-\kappa}$$Or:$$\left(\frac{K}{Y}\right)^*= \frac{k^*}{y^*}=k^{*(1-\kappa)} = \frac{i}{n + g+ \delta}$$From @eq-production_cobb_douglas, the production per capita is:$$\frac{Y(t)}{L(t)} = A(t) k(t) ^ {\kappa}$$Replacing $k(t)$ by $k^*$, $A(t)$ by $A(0)e^{gt}$ and taking logs, we get:$$\ln\frac{Y(t)}{L(t)} = \ln A(0) + g t + \frac{\kappa}{1-\kappa} \ln i- \frac{\kappa}{1 - \kappa} \ln (n + g + \delta)$$Finally, let's denote $\ln A(0) = a + \epsilon$. $\epsilon$ is an error term which represents the initial dispersion betweencountries in terms of initial value of technical progress. With$C=a + gt$, and $v = \ln(n + g + \delta)$, the linear model that we wish to estimate is finally:$$\ln\frac{Y}{L} = C + \frac{\kappa}{1-\kappa} \ln i - \frac{\kappa}{1 - \kappa} \ln v + \epsilon$$ {#eq-solow_equation}We therefore get a multiple regression model for which the response isthe log of GPD per capita, ($\ln \frac{Y}{L})$ and the two covariates are$\ln i_n$ (the saving rate) and $\ln v_n$ (the sum of the demographic growth, the technical progress and the depreciation rates). Moreover, the structural model imposes somerestrictions on the coefficients that can be tested. The two slopes are,in terms of the structural parameters of the theoretical model:$\beta_i = \frac{\kappa}{1-\kappa}$ and $\beta_v = -\frac{\kappa}{1 - \kappa}$. Moreover, $\kappa$ is theelasticity of the GDP with the capital and also the share ofprofits in GDP. A common approximation for the value of this parameter is about1/3, which implies: $\beta_i = - \beta_v = \frac{\kappa}{1-\kappa}=0.5$.@MANK:ROME:WEIL:92\index[author]{Mankiw}\index[author]{Romer}\index[author]{Weil} proposed a generalization of the Solow-Swan modelthat includes human capital, denoted by $H$. The production function isnow:$$Y(t) = K(t) ^ \kappa H(t) ^ \lambda \left[A(t)L(t)\right]^{1 - \kappa - \lambda}$$$\lambda$ is the share of human capital in the GDP and theshare of labor is now $(1 - \kappa - \lambda)$. The model is verysimilar to the one we previously developed. We first compute the growth rate of physicaland human capital ($\dot{k}$ and $\dot{h}$) per unit of effective labor, we set thesetwo growth rates to 0 to get the stocks of physical and human capital at thesteady state per unit of effective labor ($k^*$ and $h^*$) and we introduce these two values in theproduction function to get:$$\ln\frac{Y}{L} = C + \frac{\kappa}{1-\kappa-\lambda} \ln i + \frac{\lambda}{1-\kappa-\lambda} \ln e - \frac{\kappa + \lambda}{1 - \kappa-\lambda} \ln v + \epsilon $$ {#eq-growth_equation}where $e$ is the per capita level of human capital. The model now contains three covariates and three slopes ($\beta_i$, $\beta_e$ and $\beta_v$). 
Moreover, the structural model implies a structural restriction ($\beta_i + \beta_e + \beta_v = 0$) that is testable.### Data set {#sec-data_solow_mrw}`growth`\idxdata{growth}{micsr.data} contains the data used by @MANK:ROME:WEIL:92\index[author]{Mankiw}\index[author]{Romer}\index[author]{Weil}. Itconsists of 121 countries for 1985.```{r }#| label: growth_printgrowth %>%print(n =3)```This data set contains a variable called `group` which enables theselection of subsamples. The modalities of this variable are:- `"oil"`: for countries whose most part of the GDP is linked to oil extraction,- `"oecd"`: for OECD countries,- `"lqdata"`: for countries with low quality data,- `"other"`: for other countries.The variables used in the following regressions are per capita GDP in1985 (`gdp85`), investment rate (`inv`) and growth rate of thepopulation (`popgwth`). To get the variable denoted by `v` in the previoussection, we need to add to the growth rate of the population thetechnical progress rate and the rate of depreciation. As these twovariables are difficult to measure consistently, the authors assume thatthey don't exhibit any cross-country variation and that they sum to 5%. Therefore,`v` equals `popgwth + 0.05`.We first investigate the relationship between the two covariates: `inv`and `popgwth`. @fig-invpop presents the scatterplot, the sizeof the points being proportional to GDP per capita.\idxfun{ggplot}{ggplot2}\idxfun{geom\_point}{ggplot2}\idxfun{geom\_smooth}{ggplot2}\idxfun{stat\_ellipse}{ggplot2}```{r }#| label: fig-invpop#| fig.cap: "Investment rate and demographic growth"growth %>%ggplot(aes(popgwth, inv)) +geom_point(aes(size = gdp85, shape = group)) +stat_ellipse() +geom_smooth(color ="black")```There is a weak negative correlation between the two variables and richcountries are in general characterized by a low demographic growth rateand a high investment rate. We also remark that there is an outlier,Kuwait, which has a very high demographic growth rate. We then computethe variable `v` and we rename `inv` and `school` in `i` and `e` as in @eq-growth_equation:\idxfun{mutate}{dplyr}\idxfun{rename}{dplyr}```{r }#| label: mutate_and_rename_growthgrowth <-mutate(growth, v = popgwth +0.05) %>%rename(i = inv, e = school)```## Computation of the OLS estimator {#sec-comp_ols_mult}In this section, we present the computation of the OLS estimator with a number of covariates $K \geq 2$. We start with the case where $K=2$, for which it is possible to compute the estimators using roughly the same notations as the one used for the simple linear regression model. Then we'll perform the computation in the general case using matrix algebra.For one observation, denoting $x_{n1}$ and $x_{n2}$ the two covariates for observation $n$, the model is:$$ y_{n}=\alpha+\beta_1 x_{n1}+\beta_2 x_{n2}+\epsilon_{n}$$Each observation is now a point in the 3-D space defined by $(x_1, x_2, y)$ and $\gamma ^ \top = (\alpha, \beta_1, \beta_2)$ are the coordinates of a plane that returns the expected value of $y$ for a given value of $x_1$ and $x_2$. 
The residual sum of squares is:$$f(\alpha, \beta) =\sum_{n = 1} ^ N\left(y_{n}-\alpha-\beta_1x_{n1}-\beta_2x_{n2}\right)^2$$which leads to the following three first-order conditions:$$\left\{\begin{array}{rcl}\frac{\partial f}{\partial \alpha} &=& -2 \sum_{n = 1} ^ N \left(y_{n}-\alpha-\beta_1x_{n1}-\beta_2x_{n2}\right) = 0 \\\frac{\partial f}{\partial \beta_1} &=& -2\sum_{n = 1} ^ N x_{n1}\left(y_{n}-\alpha-\beta_1x_{n1}-\beta_2x_{n2}\right)=0 \\\frac{\partial f}{\partial \beta_2} &=& -2\sum_{n = 1} ^ Nx_{n2}\left(y_{n}-\alpha-\beta_1x_{n1}-\beta_2x_{n2}\right)=0 \end{array}\right.$$ {#eq-multiple_ols_foc}Dividing the first line of @eq-multiple_ols_foc by the sample size, we get:$$\bar{y} - \alpha - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2 = 0$$ {#eq-multiple_ols_mean_point}which means that the sample mean is on the regression plane, or that the sum of the residuals is zero. For the last two lines of @eq-multiple_ols_foc, the terms in parentheses are the residual for one observation and as its mean is 0, they indicate that the sample covariance between the residuals and both covariates should be 0. Therefore, we get exactly the same conditions as the one obtained for the simple linear regression model. Subtracting @eq-multiple_ols_mean_point from the last two lines of @eq-multiple_ols_foc, we get:$$\left\{\begin{array}{l}\sum_{n = 1} ^ N x_{n1}\left[(y_{n}-\bar{y})-\beta_1(x_{n1}-\bar{x}_1) -\beta_2(x_{n2} - \bar{x}_2)\right]=0 \\\sum_{n = 1} ^ N x_{n2}\left[(y_{n}-\bar{y})-\beta_1(x_{n1}-\bar{x}_1) -\beta_2(x_{n2} - \bar{x}_2)\right]=0 \\\end{array}\right.$$or replacing $x_{nk}$ by $x_{nk} - \bar{x}_k$ and developing terms:$$\small{ \left\{\begin{array}{lcl}\sum_{n = 1} ^ N (x_{n1} - \bar{x}_1)(y_{n} - \bar{y}) &=& \beta_1\sum_{n=1} ^ N (x_{n1} - \bar{x}_1) ^ 2 +\beta_2 \sum_{n = 1} ^ {N} (x_{n1} - \bar{x}_1)(x_{n2} - \bar{x}_2) \\\sum_{n = 1} ^ N (x_{n2}- \bar{x}_2)(y_{n}-\bar{y}) &=& \beta_1\sum_{n = 1} ^ N (x_{n1} - \bar{x}_1)(x_{n2} - \bar{x}_2) +\beta_2 \sum_{n = 1} ^ N (x_{n2} - \bar{x}_2) ^ 2 \\\end{array}\right.}$$ {#eq-multiple_ols_foc_covariates}\newpageWe therefore have a system of two linear equations with two unknownparameters ($\beta_1$ and $\beta_2$) that could be solved, forexample, by substitution. However, the use of matrix algebra enablesto solve such a problem in a much simpler way. Denote:$$X=\left(\begin{array}{cc}x_{11} & x_{12} \\x_{21} & x_{22} \\\vdots & \vdots \\x_{N1} & x_{N2} \\\end{array}\right)\;\;y=\left(\begin{array}{c}y_1 \\y_2 \\\vdots \\y_N \\\end{array}\right)\;\;\beta=\left(\begin{array}{c}\beta_1 \\\beta_2 \\\end{array}\right)$$$X$ is a matrix with two columns ($K$ in the general case) and $N$ lines (the number of observations) and $y$ is a vector of length $N$. We define $\tilde{I} = I - J / N$ where $I$ is a $N \times N$ identity matrix and $J$ a $N\times N$ matrix of ones. For example, for $N = 3$:$$\tilde{I} = I - J / N = \left(\begin{array}{cccc}1 & 0 & 0 \\0 & 1 & 0 \\0 & 0 & 1\end{array}\right) - \left(\begin{array}{cccc}\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\end{array}\right)$$Premultiplying a vector $z$ of length $N$ by $J/N$, we get a vectorof length $N$ containing the sample mean of $z$ $\bar{z}$ repeated $N$times. 
As premultiplying $z$ by the identity matrix returns $z$,premultiplying $z$ by $\tilde{I}$ returns a vector of length $N$containing the $N$ values of $z$ in difference from the sample mean.Note that $\tilde{I}$ is **idempotent**\index[general]{indempotent matrix}, which means that $\tilde{I} \times \tilde{I}$. It can be checked by using direct multiplication, but also by reminding that premultiplying a vector by $\tilde{I}$ removes the sample mean from the values of the vector. The transformed vector then has a zero mean, so that applying the same premultiplication one more time will leave it unchanged. Therefore, $\tilde{I}(\tilde{I} z) = \tilde{I}z$. Denoting $\tilde{z} = \tilde{I} z$, we get:$$\tilde{X}= \tilde{I} X =\left(\begin{array}{cc}x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2\\x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2\\\vdots & \vdots \\x_{N1} - \bar{x}_1 & x_{N2} - \bar{x}_2\\\end{array}\right),\;\;\tilde{y} = \tilde{I} y =\left(\begin{array}{c}y_1 - \bar{y}\\y_2 - \bar{y}\\\vdots \\y_N - \bar{y}\\\end{array}\right)$$Then:$$\tilde{X} ^ \top \tilde{X} = \left(\begin{array}{cc}\sum_{n = 1} ^ N (x_{n1} - \bar{x}_1) ^ 2 & \sum_{n = 1} ^ N (x_{n1} - \bar{x}_1)(x_{n2} - \bar{x}_2)\\\sum_{n = 1} ^ N (x_{n1} - \bar{x}_1)(x_{n2} - \bar{x}_2) & \sum_{n = 1} ^ N (x_{n2} - \bar{x}_2) ^ 2\\\end{array}\right)=\left(\begin{array}{cc}S_{11} & S_{12}\\S_{12} & S_{22}\\\end{array}\right)$$and $$\tilde{X}^\top \tilde{y} =\left(\begin{array}{c}\sum_{n = 1} ^ N (x_{n1} - \bar{x}_1)(y_{n}- \bar{y})\\\sum_{n = 1} ^ N (x_{n2} - \bar{x}_2)(y_{n}- \bar{y})\\\end{array}\right)=\left(\begin{array}{cc}S_{1y} \\S_{2y} \\\end{array}\right)$$$S_{kk}$ is the total variation of $x_k$, $S_{kl}$ the covariation of $x_k$ and $x_l$ and $S_{ky}$ the covariation between covariate $k$ and the response. Note that the quantities $S_{kk}$ and $S_{ky}$ were already present in the simple linear model as $S_{xx}$ and $S_{xy}$ (there are now two of them). The new term is $S_{kl}$, which measures the correlation between the two covariates. @eq-multiple_ols_foc_covariates can then be written in matrix form as:$$\tilde{X}^\top \tilde{y} = \tilde{X} ^ \top \tilde{X} \beta$$And the OLS estimator is obtained by premultiplying both sides of the equation by the inverse of $\tilde{X} ^ \top \tilde{X}$:$$\hat{\beta} = \left(\tilde{X} ^ \top \tilde{X}\right) ^ {- 1} \tilde{X}^\top \tilde{y}$$ {#eq-mls}Note that premultiplying the vector $\tilde{X}^\top \tilde{y}$ by the inverse of $\tilde{X} ^ \top \tilde{X}$ is a natural extension of the computation we've performed for the simple linear model, which consisted of dividing $S_{xy}$ by $S_{xx}$. 
To understand this formula, we write $\tilde{X} ^ \top \tilde{X}$and $\tilde{X} ^ \top \tilde{y}$ as:$$\tilde{X} ^ \top \tilde{X} = \left(\begin{array}{cc}S_{11} & S_{12}\\S_{12} & S_{22}\\\end{array}\right)=N\left(\begin{array}{cc}\hat{\sigma}_1^2 & \hat{\sigma}_{12}\\\hat{\sigma}_{12} & \hat{\sigma}_2^2\\\end{array}\right)=N\hat{\sigma}_1\hat{\sigma}_2\left(\begin{array}{cc}\frac{\hat{\sigma}_1}{\hat{\sigma}_2} & \hat{\rho}_{12}\\\hat{\rho}_{12} & \frac{\hat{\sigma}_2}{\hat{\sigma}_1}\\\end{array}\right)$$ {\#eq-XpX}and $$X^\top y =\left(\begin{array}{cc}S_{1y} \\S_{2y} \\\end{array}\right)=N\left(\begin{array}{cc}\hat{\sigma}_{1y} \\\hat{\sigma}_{2y} \\\end{array}\right)=N\hat{\sigma}_y\left(\begin{array}{cc}\hat{\sigma}_1\hat{\rho}_{1y} \\\hat{\sigma}_2\hat{\rho}_{2y} \\\end{array}\right)$$- the first formulation uses the total sample variations / covariations,- the second one divides every term by $N$ to obtain sample variances and covariances,- the third one divides the covariances by the product of the standard deviations to get sample coefficients of correlation.To compute the estimator, we need to compute the inverse of $\tilde{X} ^ \top \tilde{X}$, which is:$$\left(\tilde{X} ^ \top \tilde{X}\right) ^ {- 1} = \frac{\left(\begin{array}{cc}S_{22} & -S_{12}\\-S_{12} & S_{11}\\\end{array}\right)}{S_{11} S_{22} - S_{12} ^ 2}=\frac{\left(\begin{array}{cc}\hat{\sigma}_2^2 & -\hat{\sigma}_{12}\\-\hat{\sigma}_{12} & \hat{\sigma}_1^2\\\end{array}\right)}{N (\hat{\sigma}_1 ^ 2 \hat{\sigma}_2 ^ 2 - \hat{\sigma}_{12}^2)}=\frac{\left(\begin{array}{cc}\displaystyle\frac{\hat{\sigma}_2}{\hat{\sigma}_1} & - \hat{\rho}_{12}\\- \hat{\rho}_{12} & \displaystyle\frac{\hat{\sigma}_1}{\hat{\sigma}_2}\\\end{array}\right)}{N \hat{\sigma}_1 \hat{\sigma}_2 (1 - \hat{\rho}_{12} ^ 2)}$$ {#eq-XpXm1}@eq-mls finally gives:$$\left\{\begin{array}{l}\hat{\beta}_1 = \displaystyle\frac{S_{22}S_{1y} - S_{12}S_{2y}}{S_{11}S_{22} - S_{12} ^ 2} = \frac{\hat{\sigma}_2 ^ 2 \hat{\sigma}_{1y} - \hat{\sigma}_{12} \hat{\sigma}_{2y}}{\hat{\sigma}_1 ^ 2 \hat{\sigma}_2 ^ 2 - \hat{\sigma} _ {12} ^ 2}= \frac{\hat{\rho}_{1y} - \hat{\rho}_{12}\hat{\rho}_{2y}}{1 - \hat{\rho}_{12} ^ 2}\frac{\hat{\sigma}_y}{\hat{\sigma}_1} \\\hat{\beta}_2 = \displaystyle\frac{S_{11}S_{2y} - S_{12}S_{1y}}{S_{11}S_{22} - S_{12} ^ 2} = \frac{\hat{\sigma}_1 ^ 2 \hat{\sigma}_{2y} - \hat{\sigma}_{12} \hat{\sigma}_{1y}} {\hat{\sigma}_1 ^ 2 \hat{\sigma}_2 ^ 2 - \hat{\sigma}_{12} ^ 2}= \frac{\hat{\rho}_{2y} - \hat{\rho}_{12} \hat{\rho}_{1y}}{1-\hat{\rho}_{12}^2} \frac{\hat{\sigma}_y}{\hat{\sigma}_2}\\\end{array}\right.$$If the two covariates are uncorrelated in the sample($S_{12} = \hat{\sigma}_{12} = \hat{\rho}_{12} = 0$), we have:$$\left\{\begin{array}{l}\hat{\beta}_1 = \displaystyle\frac{S_{1y}}{S_{11}} = \frac{\hat{\sigma}_{1y}}{\hat{\sigma}_1 ^ 2}= \hat{\rho}_{1y}\frac{\hat{\sigma}_y}{\hat{\sigma}_1} \\\hat{\beta}_2 = \displaystyle\frac{S_{2y}}{S_{22}} = \frac{\hat{\sigma}_{2y}} {\hat{\sigma}_2 ^ 2}= \hat{\rho}_{2y}\frac{\hat{\sigma}_y}{\hat{\sigma}_2}\end{array}\right.$$which is exactly the same formula that we had for the unique slope inthe case of the simple regression model. 
This means that, if $x_1$ and$x_2$ are uncorrelated in the sample, regressing $y$ on $x_1$ or on$x_1$ and $x_2$ leads exactly to the same estimator for the slope of$x_1$.The general formula for $\hat{\beta}_1$ in term of the estimator of the simple linear model $\hat{\beta}_1^S$ (@eq-slrbeta) is:$$\hat{\beta}_1 = \frac{\hat{\rho}_{1y} - \hat{\rho}_{12}\hat{\rho}_{2y}}{1 - \hat{\rho}_{12} ^ 2}\frac{\hat{\sigma}_y}{\hat{\sigma}_1} =\hat{\beta}_1^s \frac{1 - \displaystyle\frac{\hat{\rho}_{12}\hat{\rho}_{2y}}{\hat{\rho}_{1y}}}{1 - \hat{\rho}_{12} ^ 2}=\hat{\beta}_1^s \frac{1 - \hat{\rho}_{12} ^ 2 \displaystyle\frac{\hat{\rho}_{2y}}{\hat{\rho}_{12}\hat{\rho}_{1y}}}{1 - \hat{\rho}_{12} ^ 2}$$ {\#eq-mult_simple}We have$\mid\hat{\rho}_{12} \hat{\rho}_{1y}\mid \leq \mid\hat{\rho}_{2y}\mid$, or$\left| \frac{\hat{\rho}_{2y}}{\hat{\rho}_{12}\hat{\rho}_{1y}}\right| \geq 1$.Consider the case where the two covariates are positively correlatedwith the response. As an example, consider the wage equation with $x_1$ education and $x_2$ a dummy for males. Two cases should then be analyzed:- the two covariates are positively correlated (males are more educated than females on average). In this case $\hat{\rho}_{2y}/(\hat{\rho}_{12}\hat{\rho}_{1y})>1$, and the numerator of @eq-mult_simple is lower than $1 - \hat{\rho}_{12} ^ 2$, so that $\hat{\beta}_1<\hat{\beta}_1^S$. $\hat{\beta}_1 ^ S$ is upward biased because it estimates the sum of the positive direct effect of education on wage and a positive indirect effect (more education leads to a subpopulation with a higher share of males and therefore higher wages),- the two covariates are negatively correlated (males are less educated than females on average). In this case, $\hat{\rho}_{2y}/(\hat{\rho}_{12}\hat{\rho}_{1y})<0$ and the numerator of @eq-mult_simple is greater than 1, so that $\hat{\beta}_1>\hat{\beta}_1^S$. $\hat{\beta}_1 ^ S$ is downward biased because it estimates the sum of the positive direct effect of education on wage and a negative indirect effect (more education leads to a subpopulation with a lower share of males and therefore lower wages).The general derivation of the OLS estimator can be performed using matrix algebra, denoting $j_N$ a vector of 1 of length $N$, $Z = (j_N, X)$ a vector formed by binding a vector of 1 to the matrix of covariates and $\gamma^\top = (\alpha, \beta ^ \top)$ the vector of parameters obtained by adding the intercept $\alpha$ to the vector of slopes:$$f(\gamma) = (y - Z\gamma)^\top (y - Z\gamma) = y ^ \top y + \gamma ^ \top Z^ \top Z \gamma -2 \gamma ^ \top Z ^ \top y$$The $K+1$ first-order conditions are:$$\frac{\partial f}{\partial \gamma} = 2 Z ^ \top Z \gamma - 2 Z ^ \topy = -2 Z ^ \top (y - Z\gamma) = 0$$The last expression indicates that all the columns of $Z$ are orthogonalto the vector of residuals $y - Z\gamma$. 
Solving for $\gamma$, weget:$$\hat{\gamma} = (Z ^ \top Z) ^ {- 1} Z ^ \top y$$ {#eq-multiple_ols_gamma}The matrix of second derivatives is:$$\frac{\partial^2 f}{\partial \gamma\partial^\top \gamma} = 2 Z ^ \top Z$$It is a positive semi-definite matrix, so that $\hat{\gamma}$ is a minimum of $f$.Comparing @eq-multiple_ols_gamma and @eq-mls, we can see that the formula of the OLS is the same with different matrices (respectively $\tilde{X}$ and $Z$), the first equation returning $\hat{\beta}$ and the second one $\hat{\gamma} ^ \top = (\hat{\alpha}, \hat{\beta} ^ \top)$.## Geometry of least squares {#sec-geometry_multiple_ols}### Geometry of the multiple regression model\index[general]{geometry of least squares!multiple linear regression model|(}The geometry of the multiple regression model is presented in@fig-multregmodel.```{r }#| label: fig-multregmodel#| fig-cap: "Geometry of the multiple regression model"#| echo: falseknitr::include_graphics("./tikz/fig/OLS3D.png", auto_pdf =TRUE)```We now have $N = 3$ and $K = 2$ and therefore each variable is a vectorin the 3-D space. As the two covariates $x_1$ and $x_2$ are linearlyindependent, they span a subspace of dimension 2, which is a plane in this3-D space. $\hat{y}$ is the orthogonal projection of $y$ on this subspaceand $\hat{\epsilon}$, is the vector that links $\hat{y}$ and $y$.$\hat{\epsilon}$ is therefore the projection of $y$ on the complement tothe subspace defined by $(x_1, x_2)$, which is a straight lineorthogonal to the plane spanned by $(x_1, x_2)$. Therefore$\hat{\epsilon}$ is orthogonal to $x_1$ and to $x_2$, which means thatthe residuals are uncorrelated with the two covariates. Thedecomposition of $y$ on the sum of two orthogonal vectors$\hat{\epsilon}$ and $\hat{y}$ doesn't depend on the two variables $x_1$and $x_2$ per se, but on the subspace spanned by $x_1$ and $x_2$. This meansthat any couple of independent linear combination of $x_1$ and $x_2$will leads to the same subspace as the one defined by $x_1$ and $x_2$and therefore to the same residuals and the same fitted values.More formally, as $\hat{\beta} = (X ^ \top X) ^ {-1} X ^ \top y$, we have$\hat{y} = X \hat{\beta} = X (X ^ \top X) ^ {-1} X ^ \top y = P_X y$. $P$is sometimes called the "hat" matrix, as it "puts a hat" on $y$. Thismatrix transforms the vector of response on a vector of prediction. As$\hat{\epsilon} = y - \hat{y}$, we also have $\hat{\epsilon} = y - P_X y = (I-P_X) y=M_Xy$.We therefore consider two matrices $P_X$ and $M_X$:$$\left\{\begin{array}{rcl}P_X &=& X (X ^ \top X) ^ {-1} X ^ \top \\M_X &=& I - X (X ^ \top X) ^ {-1} X ^ \top \\\end{array}\right.$$ {#eq-projmatrix}that are square and symmetric of dimension $N \times N$, which meansthat they are in practice large matrices, and are therefore nevercomputed in practice. However, they have very interesting analytical features. First,they are idempotent, which means that $P_X \times P_X = P_X$ and$M_X \times M_X = M_X$. This means that while premultiplying a vector by sucha matrix, this vector is projected in a subspace. For example, premultiplying $y$ by $P_X$ gives $\hat{y}$, the vector of fitted values. It is a linear combination of $x_1$ and $x_2$ and therefore belongs to the subspace spanned by$x_1$ and $x_2$. Therefore, $P_X \hat{y}$ obviously equal $\hat{y}$ and therefore $P_X \times P_X = P_X$. Except for theidentity matrix, idempotent matrices are not full rank. 
Their rank canbe easily computed using the fact that the rank of a matrix is equal toits trace (the sum of the diagonal elements) and that the trace of aproduct of matrices is invariant to any permutation of the matrices:$\mbox{tr} ABC = \mbox{tr} BCA = \mbox{tr} CAB$.For a regression with an intercept, the model matrix$Z$ has $K + 1$ column, the first one being a column of one. In this case, the rank of $P_Z$ and $M_Z$ are:$\mbox{rank} \,P_Z = \mbox{tr}\, P_Z = \mbox{tr}\, Z (Z ^ \top Z) ^ {-1} Z ^ \top = \mbox{tr}\, (Z ^ \top Z) ^ {-1} Z ^ \top Z = \mbox{tr}\, I_{K+1} = K + 1$and$\mbox{rank} \,M_Z = \mbox{tr}\, (I_N - P_X) = \mbox{tr}\, I_N - \mbox{tr}\, P_Z = N - K - 1$. Finally, the two matrices are orthogonal: $P_ZM_Z= P_Z(I - P_Z)=P_Z-P_Z=0$, whichmeans that they perform the projection of a vector on two orthogonalsubspaces.Getting back to @fig-multregmodel, $P_X$ project $y$ on the 2-Dsubspace (a plane) spanned by $x_1$ and $x_2$ and $M_X$ project $y$ on a 1-Dsubspace (the straight line orthogonal to the previous plane).$M_X$ and $P_X$ perform therefore an orthogonal decomposition of $y$ in$\hat{y}$ and $\hat{\epsilon}$, which means that$\hat{y} + \hat{\epsilon} = y$ and that$\hat{y}^\top\hat{\epsilon} = 0$.\index[general]{geometry of least squares!multiple linear regression model|)}### Frisch-Waugh theorem\index[general]{Frisch-Waugh theorem|(}Consider the regression of $y$ on a set of regressors $X$ which, for somereasons, is separated in two subsets $X_1$ and $X_2$. Suppose that we areonly interested in the coefficients $\beta_2$ associated with $X_2$. TheFrisch-Waugh theorem states that the same estimator $\hat{\beta}_2$ isobtained:- by regressing $y$ on $X_1$ and $X_2$,- by first regressing $y$ and each column of $X_2$ on $X_1$, then taking the residuals $M_1y$ and $M_1{X}_2$ of these regressions and finally regressing $M_1y$ on $M_1X_2$.@fig-Frischwaugh illustrates the Frisch-Waugh theorem.```{r }#| label: fig-Frischwaugh#| fig-cap: "Frisch-Waugh theorem"#| echo: false#| out.width: "80%"knitr::include_graphics("./tikz/fig/frishWaugh.png", auto_pdf =TRUE)```The **first regression** is the regression of $y$ on $x_1$ and $x_2$. We then get an orthogonal decomposition of $y$ in the vector of fitted values $\hat{y}=P_{12}y$ and of the residuals $\hat{\epsilon}_{12}=M_{12}y$. We also show in this figure the decomposition of $\hat{y}$ in $x_1$ and $x_2$, which isrepresented by the sum of the two vectors $\hat{\beta}_1 x_1$ and$\hat{\beta}_2 x_2$. $\hat{\beta}_2$ is the estimator of $x_2$ on this first regression and is represented by the ratio between $\hat{\beta}_2 x_2$ and $x_2$.The **second regression** is the regression of $M_1y$ on $M_1x_2$. $M_1x_2$ is the residual of the regression of $x_2$ on $x_1$. Therefore, this vector lies in the line that is in the plane spanned by $x_1$ and $x_2$ and is orthogonal to $x_1$. $M_1y$ is the residual of $y$ on $x_1$; it is therefore orthogonal to $x_1$. As both $M_1y$ and $M_{12}y$ are orthogonal to $x_1$, so is the vector that joins those two vectors. Therefore, this vector is parallel to $M_1x_2$, and it is therefore the fitted value of the second regression ($M_1y$ on $M_1x_2$), $P_{M_1x_2}y$. It is also equal to $\tilde{\beta}_2 M_1x_2$, $\tilde{\beta}_2$ being the estimator of $x_2$ on the second regression. Note also that $\hat{\epsilon}_{12} = M_{12}y$ is the residual of the second regression, which is therefore the same as the residuals of the first regression.Finally, consider the regression of $\hat{\beta}_2 x_2$ on $x_1$. 
The residual of this regression is $\hat{\beta}_2M_1x_2$. As it lies on the plane spanned by $x_1$ and $x_2$ and is orthogonal to $x_1$, it is parallel to $P_{M_1x_2}y=\tilde{\beta}_2M_1x_2$. Moreover, the Frisch-Waugh theorem states that both vectors have the same length and are therefore identical, which means that $\tilde{\beta}_2 = \hat{\beta}_2$ and that the two regressions give identical estimators.The Frisch-Waugh is easily demonstrated using some geometric arguments.Consider the regression with all the covariates:$$y = X_1 \hat{\beta}_1 + X_2 \hat{\beta}_2 + M_{12}y$$Then, premultiply both sides of the model by $M_1$:$$M_1 y = M_1 X_1 \hat{\beta}_1 + M_1 X_2 \hat{\beta}_2 + M_1 M_{12}y$$$M_1 X_1 \hat{\beta}_1$ is 0 as $X_1 \hat{\beta}_1$ is obviously in thesubset spanned by $X_1$ and therefore its projection on the orthogonalcomplement is 0. $M_{12} y$ is orthogonal to the subset spanned by $X$and is therefore also orthogonal to the subset spanned by $X_1$.Therefore $M_1 M_{12} y = M_{12} y$. We therefore have:$$M_1 y = M_1 X_2 \hat{\beta}_2 + M_{12}y$$For which the estimation is:$$\hat{\beta}_2 = (X_2^\top M_1 X_2) ^ {-1} X_2 ^ \top M_1 y$$which is exactly the estimation obtained by regressing $M_1 y$ on$M_1 X_2$. We finally note (an important result that will be used in @sec-three_tests), that $\hat{\epsilon}_1=M_1y$, $\hat{\epsilon}_{12}=M_{12}y$ and $P_{M_1x_2}y$ form a right triangle, $\hat{\epsilon}_1$ being the hypotenuse. Therefore, using the Pythagorean theorem, we have:$$\mid\mid \hat{\epsilon}_{12}^2\mid\mid + \mid\mid P_{M_1X_2}y\mid\mid = \mid\mid \hat{\epsilon}_{1}^2\mid\mid$$ {\#eq-pyth_ssr}\index[general]{Frisch-Waugh theorem|)}\newpage## Computation with R {#sec-computation_R_multiple}To estimate the multiple linear model, we use as for the single linearmodel the `lm` function; the difference being that now, on the right side ofthe formula, we have several variables (here two), separated by the `+`operator. Actually, formulas have a much richer syntax that includes other operators, for example `*` and `:`. This will be discussed in [Chapter -@sec-interpretation_chapter].\idxfun{lm}{stats}\idxfun{coef}{stats}```{r }#| label: estimation_solow_model#| collapse: trueslw_tot <-lm(log(gdp85) ~log(i) +log(v), growth)slw_tot %>% coef```### Computation using matrix algebraThe estimator can also be computed "by hand", using matrix algebra. To start, we use the `model.frame` function which, as `lm`, has `formula` and `data` arguments. For pedagogical purposes, we add the `group` variable in the formula.\idxfun{head}{utils}\idxfun{model.frame}{stats}\idxfun{nrow}{base}```{r }#| label: model_frame_growth#| collapse: truemf <-model.frame(log(gdp85) ~log(i) +log(v) + group, growth)head(mf, 3)nrow(mf)nrow(growth)````model.frame` returns a data frame that contains the data required for estimating the model described in the formula. More precisely, it performs three tasks:- it selects only the columns of the initial data frame that are required for the estimation,- it transforms these variables if required, here `gdp85`, `i` and `v` are transformed in logarithms and the columns are renamed accordingly,- it selects only the observations for which there are no missing values for the relevant variables; note here that `growth` has 121 rows and `mf` only 107 rows.Several interesting elements of the model can be extracted from themodel frame. 
The $Z$ matrix is obtained using the `model.matrix` function, which also uses a `formula`/`data` interface, the data being the model frame `mf`:\idxfun{model.matrix}{stats}\idxfun{head}{utils}```{r }#| label: model_matrix_growthZ <-model.matrix(log(gdp85) ~log(i) +log(v) + group, mf)head(Z, 3)```Note that the model matrix includes an intercept^[As seen in @sec-remove_intercept; to remove it, one has to use either `+ 0` or `- 1` in the formula.] and the `group` variable, which is a categorical variable is transformed into a set of dummy variables. More precisely, a dummy variable is created for all the modalities except the first one (`"oecd"`).The response is obtained using the `model.response` function:\idxfun{model.response}{stats}\idxfun{head}{utils}```{r }#| label: model_response_growth#| collapse: truey <-model.response(mf)head(y, 3)```Once the model matrix and the response vector are created, the estimator can easily be computed using matrix operators provided by `R`. In particular:- `%*%` is the matrix product operator (`*` performs an element per element product),- `t()` transposes a matrix,- `solve()` solves a linear system of equation or computes the inverse of a matrix,- `crossprod` takes the inner products of two matrices (or of one matrix and a vector).The most straightforward formula to get the OLS estimator is:\idxfun{solve}{base}```{r }#| label: ols_straight_formulasolve(t(Z) %*% Z) %*%t(Z) %*% y```But `crossprod` is more efficient, $A^\top B$ being obtained using`crossprod(A, B)` and $A^\top A$ is either `crossprod(A, A)` or`crossprod(A)`. Therefore, $Z^\top Z$ and $Z^\top y$ are respectively obtained using:\idxfun{crossprod}{base}```{r }#| results: hide#| label: using_crossprodcrossprod(Z)crossprod(Z, y)```Moreover, `solve` can be used to solve a system of linear equations: `solve(A, z)` compute the vector $w$ such that $Aw=z$, which is $w=A^{-1}z$. Therefore, the OLS estimator can be computed using the more efficient and compact following code:\idxfun{solve}{base}\idxfun{crossprod}{base}```{r}#| results: hide#| label: using_crossprod_solvesolve(crossprod(Z), crossprod(Z, y))```### Efficient computation: QR decomposition\index[general]{QR decomposition|(}The efficient method used by `lm` to compute the OLS estimate is the QR decomposition:^[See @DAVI:MACK:93\index[author]{Davidson}\index[author]{McKinnon}, section 1.5, pp. 25-31.] the matrix of covariates $Z$ can be written as the product of an orthonormal matrix $Q$ of dimension $N \times (K + 1)$ (therefore $Q^\top Q = I$) and an upper triangular matrix $R$ of dimension $(K+1) \times (K+1)$. Then, the linear model can be written as:$$y = Z\gamma + \epsilon = QR\gamma + \epsilon = Q \delta + \epsilon$$with $\delta = R \gamma$. With $Q$ as the matrix of covariates, the OLS estimator is $\hat{\delta} = (Q ^ \top Q ) ^ {-1} Q ^ \top y = Q^\top y$, which is obtained without matrix inversion. Then $\hat{y} = Q \hat{\gamma} = Q Q ^ \top y$ and $\hat{\epsilon} = (I - Q Q ^ \top) y$. With $Z$ as the matrix of covariates, the OLS estimator is $\hat{\gamma} = R ^ {-1} \delta$. The inverse of a triangular matrix can be very easily obtained with a very high numerical accuracy. Moreover, $\hat{\gamma}$ can be obtained without inverting $R$. 
With $K = 2$, we have:$$\left(\begin{array}{c} \delta_0 \\ \delta_1 \\ \delta_2 \end{array}\right) = \left(\begin{array}{ccc}r_{11} & r_{12} & r_{13} \\0 & r_{22} & r_{23} \\0 & 0 & r_{33}\end{array}\right) \left(\begin{array}{c} \alpha \\ \beta_1 \\ \beta_2 \end{array}\right)= \left(\begin{array}{c} r_{11} \alpha + r_{12} \beta_1 + r_{13} \beta_2 \\r_{22} \beta_1 + r_{23} \beta_2 \\r_{33} \beta_2\end{array}\right)$$which can be solved recursively:- $\beta_2 = \delta_2 / r_{33}$,- $\beta_1 = (\delta_1 - r_{23}\beta_2) / r_{22}$,- $\alpha = (\delta_0 - r_{12} \beta_1 - r_{13} \beta_2) / r_{11}$.Finally, $(X^\top X) ^ {-1} = (R^\top Q^ \top Q R) ^ {-1} = (R ^ \top R) ^{-1} = R ^ {-1} R ^ {- 1 \top}$.We illustrate the computation of the OLS estimator using the QR decomposition and using the previously estimated growth model, without the `group` covariate. The QR decomposition is performed using the `qr` function:\idxfun{model.matrix}{stats}\idxfun{qr}{base}```{r}#| label: qr_decompositionZ <-model.matrix(log(gdp85) ~log(i) +log(v), mf)qrZ <-qr(Z)````qr` returns an object of class `qr`, and the `qr.R` and `qr.Q` function can be used to retrieve the two matrices. We check that $R$ is upper triangular and that $Q^\top Q = I$:\idxfun{qr.R}{base}\idxfun{qr.Q}{base}\idxfun{crossprod}{base}```{r}#| label: extracting_Q_R#| collapse: falseR <-qr.R(qrZ)Q <-qr.Q(qrZ)crossprod(Q)R```$\hat{\delta}$ is then obtained as the cross-products of $Q$ and $y$:\idxfun{drop}{base}\idxfun{crossprod}{base}```{r}#| collapse: truehdelta <-crossprod(Q, y) %>% drophdelta```and $\hat{\gamma}$ is obtained by solving recursively $\hat{\delta} = R \hat{\gamma}$:```{r}#| collapse: truebeta_2 <- hdelta[3] / R[3, 3]beta_1 <- (hdelta[2] - R[2, 3] * beta_2) / R[2, 2]alpha <- (hdelta[1] - R[1, 2] * beta_1 - R[1, 3] * beta_2) / R[1, 1]c(alpha, beta_1, beta_2)```\index[general]{QR decomposition|)}## Properties of the estimators {#sec-properties_ols_multiple}In this section, we'll briefly analyze the statistical properties of the OLS estimator with more than one covariate. Most of these properties are similar to the one we have described in @sec-stat_prop_slm. 
They'll be presented in this section using matrix algebra.### Unbiasedness of the OLS estimator\index[general]{unbiasedness!multiple linear regression model|(}The vector of slopes can be written as a linear combination of the vector of responseand then of vector of the errors:$$\begin{array}{rcl}\hat{\beta}&=&(X^\top\tilde{I}X)^{-1}X^\top\tilde{I}y \\&=&(X^\top\tilde{I}X)^{-1}X^\top\tilde{I}(X\beta+\epsilon)\\&=&\beta+(X^\top\tilde{I}X)^{-1}X^\top\tilde{I}\epsilon \\&=&\beta+(\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\epsilon \\\end{array}$$ {#eq-hbeta}<!-- $\tilde{I} X ^ \top \epsilon = \tilde{X}^\top\epsilon$ is a $K$-length vector containing the product of --><!-- every covariates (the column of $X$) in deviation from the sample mean --><!-- and the vector of errors: --><!-- $$ --><!-- \tilde{X}^\top\epsilon = --><!-- \left( --><!-- \begin{array}{c} --><!-- \sum_{n=1} ^ N (x_{n1} - \bar{x}_1) \epsilon_n \\ --><!-- \sum_{n=1} ^ N (x_{n2} - \bar{x}_2) \epsilon_n \\ --><!-- \vdots \\ --><!-- \sum_{n=1} ^ N (x_{n1} - \bar{x}_K) \epsilon_n --><!-- \end{array} --><!-- \right) --><!-- = \sum_{n = 1} ^ N \psi_n --><!-- $$ --><!-- $\sum_{n = 1} ^ N \psi_n$, evaluated for $\hat{\beta}$ the vector of --><!-- slopes estimates is a K-length vector of 0 (i.e., the vector of the --><!-- first-order conditions for minimizing the sum of squares residuals, also --><!-- called the vector of scores). -->The expected value of $\hat{\beta}$ conditional on $X$ is:$$\mbox{E}(\hat{\beta}\mid X) = \beta +(\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\mbox{E}(\epsilon \mid X)$$\newpageThe unbiasedness condition is therefore that$\mbox{E}(\epsilon \mid X) = 0$, which is a direct generalization of theresult obtained for the simple linear regression model, namely $\epsilon$ has aconstant expected value (that can be set to 0 without any restriction) whatever the value of the covariates. It implies also that the population covariance between the errors and any of the covariates is 0. \index[general]{unbiasedness!multiple linear regression model|)}### Variance of the OLS estimator {#sec-variance_ols}The variance of $\hat{\beta}$ is now a matrix of variances and covariances:$$\mbox{V}(\hat{\beta}\mid X) = \mbox{E}\left[(\hat{\beta}- \beta)(\hat{\beta}- \beta)^\top\mid X\right]$$Using @eq-hbeta:$$\mbox{V}(\hat{\beta}\mid X) = \mbox{E}\left[(\tilde{X} ^ \top\tilde{X})^{-1}\tilde{X}^\top\epsilon \epsilon ^ \top\tilde{X} (\tilde{X}^\top\tilde{X})^{-1} \mid X\right]$$$$\mbox{V}(\hat{\beta}\mid X)=\frac{1}{N}\left(\frac{1}{N}\tilde{X}^\top\tilde{X}\right)^{-1}\left[\frac{1}{N}\mbox{E}(\tilde{X}^\top \epsilon \epsilon ^\top \tilde{X} \mid X)\right] \left(\frac{1}{N}\tilde{X}^\top\tilde{X}\right) ^ {-1}$$ {#eq-general_variance}\index[general]{sandwich!multiple linear regression|(}This is a **sandwich** formula, the **meat**:$\frac{1}{N}\mbox{E}\left(X'\bar{I}\epsilon \epsilon ^ \top \bar{I} X \mid X\right)$being surrounded by two slices of **bread**:$\left(\frac{1}{N} \tilde{X}^\top\tilde{X}\right)^{-1}$. Note that the two matrices aresquare and of dimension $K$. The bread is just the inverse of thecovariance matrix of the covariates.The meat is the variance of the score vector, i.e., the vector of thefirst-order conditions. 
For $K = 2$, it is the expected value of:$$\small{\frac{1}{N}\left(\begin{array}{cccc}\left(\sum_{n=1} ^ N (x_{n1} - \bar{x}_1) \epsilon_n)\right) ^ 2 &\left(\sum_{n=1} ^ N (x_{n1} - \bar{x}_1) \epsilon_n)\right) \left(\sum_{n=1} ^ N (x_{n2} - \bar{x}_2) \epsilon_n)\right) \\\left(\sum_{n=1} ^ N (x_{n1} - \bar{x}_1) \epsilon_n)\right) \left(\sum_{n=1} ^ N (x_{n2} - \bar{x}_2) \epsilon_n)\right) &\left(\sum_{n=1} ^ N (x_{n2} - \bar{x}_2) \epsilon_n)\right) ^ 2\end{array}\right)}$$which is a generalization of the single regression case where the "meat"reduces to the scalar$\left(\sum_{n=1} ^ N (x_n - \bar{x}) \epsilon_n)\right) ^ 2$.As for the simple regression model, the formula of the variance simplifies with the hypothesis that the errors are homoskedastic ($\mbox{E}(\epsilon_n ^ 2 \mid x) = \sigma_\epsilon ^ 2$) anduncorrelated ($\mbox{E}(\epsilon_n \epsilon_m \mid x) = 0 \; \forall \; m \neq n$).In this case, the meat reduces to$\sigma_\epsilon ^ 2 \frac{1}{N} \tilde{X} ^ \top \tilde{X}$, i.e., up to a scalar to the matrix of covariance of the covariates, and @eq-general_variance becomes:\index[general]{sandwich!multiple linear regression|)}\index[general]{homoskedasticity}\index[general]{uncorrelation}$$V(\hat{\beta})= \sigma_\epsilon^2(\tilde{X}^\top\tilde{X})^{-1}$$ {\#eq-vbeta}which can be rewritten, using @eq-XpXm1:$$\mbox{V}(\hat{\beta})=\frac{\sigma_\epsilon^2}{N\hat{\sigma}_1\hat{\sigma}_2(1-\hat{\rho}_{12}^2)}\left(\begin{array}{cc}\frac{\hat{\sigma}_2}{\hat{\sigma}_1} & -\hat{\rho}_{12}\\-\hat{\rho}_{12} & \frac{\hat{\sigma}_1}{\hat{\sigma}_2}\\\end{array}\right)$$from which we get:$$\sigma_{\hat{\beta}_k}=\frac{\sigma_\epsilon}{\sqrt{N}\hat{\sigma}_k\sqrt{1-\hat{\rho}_{12}^2}}\mbox{ for }k = 1, 2,\;\hat{\sigma}_{\hat{\beta}_1\hat{\beta}_2}=-\frac{\hat{\rho}_{12}\sigma_\epsilon^2}{N\hat{\sigma}_1\hat{\sigma}_2(1-\hat{\rho}_{12}^2)}\mbox{ and } \hat{\rho}_{\hat{\beta}_1\hat{\beta}_2} = -\hat{\rho}_{12}$$ {#eq-var_covar_slopes}First remark that if $\hat{\rho}_{12} = 0$, which means that the twocovariates are uncorrelated, the formula for the standard deviation of aslope in the multiple regression model reduces to the formula of thesingle regression model, which means that the standard deviation isproportional to:- the standard deviation of the error,- the inverse of the standard deviation of the corresponding covariate,- the inverse of the square root of the sample size.When the two covariates are correlated, the last term$1/\sqrt{1 - \hat{\rho}_{12}^2}$ is added and inflates the standarddeviation. This means that the more the covariates are correlated(whatever the sign of the correlation) the larger is the standarddeviation of the slope. The intuition is that, if the two covariates are highly correlated, it is difficult to estimate precisely the separate effect of each of them.$\sigma_\epsilon^2$ being unknown, $\sigma^2_{\hat{\beta}}$ can't be computed. If the error were observed, a natural estimator of $\sigma_\epsilon^2$ would be $\sum_{n=1}^N (\epsilon_n-\bar{\epsilon}) ^ 2/N$. As the errors are unknown, one can use the residuals instead, which are related to the errors by the relation: $\hat{\epsilon} = M_Z y = M_Z (Z\gamma + \epsilon) = M_Z \epsilon$, the lastequality standing because $Z\gamma$ is a vector of the subspace defined bythe columns of $Z$ and therefore $M_ZZ\gamma = 0$. Therefore, we have:$\hat{\epsilon} ^ \top \hat{\epsilon} = \epsilon ^ \top M_Z \epsilon$; asit is a scalar, it is equal to its trace. 
Using the rule ofpermutation, we get:$\hat{\epsilon} ^ \top \hat{\epsilon} = \epsilon ^ \top M_Z \epsilon = \mbox{tr}\, M_Z \epsilon \epsilon ^ \top$.With spherical disturbances, we have$\mbox{E}(\hat{\epsilon} ^ \top \hat{\epsilon}) = \mbox{tr}\, M_Z \sigma_\epsilon ^ 2 I = \sigma_\epsilon ^ 2 \mbox{tr} M_Z = (N - K - 1) \sigma_\epsilon ^ 2$.Therefore, an unbiased estimator of $\sigma_\epsilon^2$ is:$$\dot{\sigma}_\epsilon^2 = \frac{\hat{\epsilon}^\top\hat{\epsilon}}{N - K - 1} = \frac{\sum_{n=1}^N \hat{\epsilon}_n ^ 2}{N - K - 1}$$ {#eq-unbiased_res_se}and replacing $\sigma_\epsilon ^ 2$ by $\dot{\sigma}_\epsilon ^ 2$ in @eq-vbeta, we get an unbiased estimator of the covariance matrix of the estimators:$$\hat{V}(\hat{\beta}) = \dot{\sigma}_\epsilon^2(\tilde{X}^\top\tilde{X})^{-1}$$ {\#eq-hvbeta}We now go back to the estimation of the growth model. As in @MANK:ROME:WEIL:92\index[author]{Mankiw}\index[author]{Romer}\index[author]{Weil}, we use a restricted sample by excluding countries for which most of the GDP is linked to oil extraction and those with low quality data. \idxdata{growth}{micsr.data}\idxfun{filter}{dplyr}\idxfun{lm}{stats}```{r }#| label: growth_subsamplegrowth_sub <- growth %>%filter(! group %in%c("oil", "lqdata"))slw_tot <-lm(log(gdp85) ~log(i) +log(v), growth_sub)```The covariance matrix of the estimators is obtained using the`vcov` function:\idxfun{vcov}{stats}```{r }#| label: vcov_function_growthvcov(slw_tot)```The `summary` method computes detailed results of the regression, in particular the table of coefficients, a matrix that can be extracted using the `coef` method:\idxfun{coef}{stats}```{r}#| label: summary_growthslw_tot %>% summary %>% coef```### The OLS estimator is BLUE\index[general]{best linear unbiased estimator!multiple linear regression model|(}The demonstration made in @sec-simple_ols_blue for the simple linear model can easily be extended tothe multiple regression model. The OLS estimator is:$\hat{\gamma} = (Z^\top Z)^{-1}Z^\top y = \gamma + (Z^\top Z)^{-1}Z^\top \epsilon$.Consider another linear estimator:$$\tilde{\gamma}=Ay=\left[(Z^\top Z)^{-1}Z^\top + D\right]y = (I + DZ)\gamma + \left[(Z^\top Z)^{-1}Z^\top + D\right]\epsilon$$ The unbiasedness of $\tilde{\gamma}$ implies that $DZ=0$, so that:$$\tilde{\gamma}-\gamma= \left[(Z^\top Z)^{-1}Z^\top + D\right]\epsilon = (\hat{\gamma}- \gamma) + D\epsilon = (\hat{\gamma}- \gamma) + Dy$$ because $DZ=0$ implies that $D\epsilon=Dy$. Therefore,$\tilde{\gamma}- \hat{\gamma}=Dy$. 
The covariance between the OLS estimator and the vector of differences of the two estimators is:

$$
\mbox{E}\left[(\tilde{\gamma}-\hat{\gamma})(\hat{\gamma} - \gamma)^\top\right] = \mbox{E}\left[D\epsilon\epsilon^\top Z(Z^\top Z) ^ {-1}\right]
$$

With spherical disturbances, this reduces to:

$$
\mbox{E}\left[(\tilde{\gamma}-\hat{\gamma})(\hat{\gamma} - \gamma)^\top\right] = \sigma_\epsilon ^ 2DZ(Z^\top Z) ^ {-1} =0
$$

Therefore, we can write the variance of $\tilde{\gamma}$ as:

$$
\mbox{V}(\tilde{\gamma}) = \mbox{V}\left[\hat{\gamma} + (\tilde{\gamma}-\hat{\gamma})\right] = \mbox{V}(\hat{\gamma}) + \mbox{V}(Dy)
$$

As $\mbox{V}(Dy)$ is a covariance matrix, it is positive semi-definite; therefore, the difference between the variance matrix of any unbiased linear estimator and that of the OLS estimator is a positive semi-definite matrix, which means that the OLS estimator is the most efficient linear unbiased estimator.
\index[general]{best linear unbiased estimator!multiple linear regression model|)}

### Asymptotic properties of the OLS estimator

\index[general]{asymptotic properties!multiple linear regression model|(}
Asymptotic properties of the multiple regression model are direct extensions of those we have seen for the simple regression model. With $\mbox{E}(\hat{\beta}) = \beta$ and $\mbox{V}(\hat{\beta}) = \frac{\sigma_\epsilon ^ 2}{N} \left(\frac{1}{N}\tilde{X}^ \top \tilde{X}\right) ^ {- 1}$, the OLS estimator is consistent ($\mbox{plim} \;\hat{\beta} = \beta$) if the covariance matrix of the covariates $\frac{1}{N}\tilde{X} ^ \top \tilde{X}$ converges to a finite, non-singular matrix. The central-limit theorem implies that:

$$
\sqrt{N}(\hat{\beta}_N - \beta) \xrightarrow{d} \mathcal{N}\left(0,\sigma_\epsilon ^ 2 \left(\frac{1}{N}\tilde{X}^\top\tilde{X}\right)^ {-1}\right)
$$

or

$$
\hat{\beta}_N \stackrel{a}{\sim} \mathcal{N}\left(\beta, \frac{\sigma_\epsilon ^ 2}{N} \left(\frac{1}{N}\tilde{X}^\top\tilde{X}\right)^ {-1}\right)
$$

\index[general]{asymptotic properties!multiple linear regression model|)}

### The coefficient of determination

From @eq-r2_slrm, one way to write the coefficient of determination for the simple linear regression model is:

$$
R^2 = 1 - \frac{\sum_{n=1} ^ N \hat{\epsilon}_n^2}{\sum_{n=1} ^ N (y_n - \bar{y})^2} = 1 - \frac{\hat{\sigma}_\epsilon^2}{\hat{\sigma}_y^2}
$$

The last term is obtained by dividing the residual sum of squares and the total variation of $y$ by the sample size $N$. This gives rise to biased estimators of the variances of the errors and of the response. Consider now the unbiased estimators: the numerator should then be divided by $N - K - 1$, which gives $\dot{\sigma}_\epsilon ^ 2$. Similarly, the denominator should be divided by $N-1$ to obtain an unbiased estimator of the variance of the response. We then obtain the **adjusted coefficient of determination**:
\index[general]{coefficient of determination!adjusted}

$$
\bar{R}^2 = 1 - \frac{\dot{\sigma}_\epsilon^2}{\dot{\sigma}_y^2}
= 1 - \frac{\sum_{n=1} ^ N \hat{\epsilon}_n^2 / (N-K-1)}{\sum_{n=1} ^ N (y_n - \bar{y})^2 / (N-1)} =
1 - \frac{N - 1}{N - K - 1}\frac{\sum_{n=1} ^ N \hat{\epsilon}_n^2}{\sum_{n=1} ^ N (y_n - \bar{y})^2}
$$

$R^2$ necessarily increases when one more covariate is added to the regression because, even if this covariate is irrelevant, its sample correlation with the response will never be exactly 0 and therefore the residual sum of squares will decrease. This is not the case with $\bar{R}^2$, because $\dot{\sigma}^2_\epsilon$ is the ratio of two terms (the residual sum of squares and the number of degrees of freedom) that both decrease when a covariate is added, so that $\bar{R}^2$ may either increase or decrease. Note also that $\bar{R}^2$ is not necessarily positive.
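As a quick numerical check of this formula (a minimal sketch using the `slw_tot` regression fitted above; the chunk label and variable names are ours), $\bar{R}^2$ can be computed by hand and compared with the value reported by `summary`:

```{r}
#| label: adj_r2_check
#| collapse: true
# residual sum of squares and total sum of squares of the growth regression
e2 <- sum(resid(slw_tot) ^ 2)
yy <- model.response(model.frame(slw_tot))
tss <- sum((yy - mean(yy)) ^ 2)
N <- length(yy) ; K <- length(coef(slw_tot)) - 1
c(manual = 1 - (e2 / (N - K - 1)) / (tss / (N - 1)),
  summary = summary(slw_tot)$adj.r.squared)
```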
\newpage

## Confidence interval and test {#sec-confint_test_multiple}

\index[general]{confidence interval!multiple linear regression model|(}
In @sec-confint_test_slm, we have seen how to compute a confidence interval for a single parameter and how to perform a test for one hypothesis. The confidence interval was a segment, i.e., a range of values that contains the true value of the parameter with a given probability, and the tests were performed using a normal or a Student distribution. Of course, the same kind of analysis can be performed with a multiple regression. But, in this latter case:

- a confidence interval can also be computed for several coefficients at a time,
- tests of multiple hypotheses can be performed, using a chi-squared or a Fisher distribution.

To illustrate these points, we'll use the Solow model. Remember that the model to estimate is:

$$
\ln y = \alpha + \beta_i \ln i + \beta_v \ln v + \epsilon
$$

where $y$ is the per capita GDP, $i$ the investment rate and $v$ the sum of the labor force growth rate, the depreciation rate and the technical progress rate. Moreover, the relation between $\beta_i$ and $\beta_v$ and the structural parameter $\kappa$, which is the share of profits in the GDP, is $\beta_i = - \beta_v =\kappa / (1 -\kappa)$. We'll analyze the confidence interval for the pair of coefficients $(\beta_i, \beta_v)$ and we'll test two hypotheses:

- the first is imposed by the model; we must have $\beta_i + \beta_v = 0$,
- the second corresponds to a reasonable value of the share of profits, which is approximately one-third; therefore, we'll test the hypothesis that $\kappa = 1/3$, which implies that $\beta_i = 0.5$.

Note that these two hypotheses imply that $\beta_v=-0.5$.

### Simple confidence interval and test

The asymptotic distribution of the estimators is a multivariate normal distribution.^[If the errors are normal, the exact distribution of the estimators is normal (see @sec-clt).] The distribution of one estimator (say $\hat{\beta}_1$)^[Or of a linear combination of several estimators.] is a univariate normal distribution with, for the two covariates case, a standard deviation equal to $\sigma_{\hat{\beta}_1} = \frac{\sigma_\epsilon}{\sqrt{N}\hat{\sigma}_1\sqrt{1 - \hat{\rho}_{12} ^ 2}}$. Therefore $(\hat{\beta}_1 - \beta_1) / \sigma_{\hat{\beta}_1}$ follows (exactly if $\epsilon$ is normal) a standard normal distribution. Confidence intervals and tests for a coefficient are therefore computed exactly as for the case of the simple linear regression model. In particular, using @eq-var_covar_slopes and @eq-unbiased_res_se, we get:

$$
t_k = \frac{\hat{\beta}_k - \beta_k}{\hat{\sigma}_{\hat{\beta}_k}} = \frac{\hat{\beta}_k - \beta_k}{\frac{\dot{\sigma}_\epsilon}{\sqrt{N}\hat{\sigma}_k\sqrt{1 -\hat{\rho}_{12}^2}}}
$$ {#eq-student_multiple}

which follows exactly a Student distribution with $N-K-1$ degrees of freedom if the errors are normal, and asymptotically a normal distribution whatever the distribution of the errors. Therefore, $1 - \alpha$ confidence intervals are $\hat{\beta}_k \pm \mbox{cv}_{1-\alpha / 2} \dot{\sigma}_{\hat{\beta}_k}$, where $\mbox{cv}_{1-\alpha / 2}$ is the critical value of either a Student or a normal distribution. For a given hypothesis $H_0:\beta_k = \beta_{k0}$, $t_{k0}=(\hat{\beta}_k - \beta_{k0})/\hat{\sigma}_{\hat{\beta}_k}$ is a draw from a normal or a Student distribution if $H_0$ is true. Remember that "reasonable" values of the two slopes in our growth model should be $\pm 0.5$.
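As an illustration of this formula (a minimal sketch; the `confint` function used just below performs the same computation for all coefficients), the 95% interval for $\hat{\beta}_i$ can be built by hand from the coefficient table and the Student critical value:

```{r}
#| label: manual_confint_beta_i
#| collapse: true
# point estimate +/- critical value times standard error for log(i)
sci <- coef(summary(slw_tot))
sci[2, 1] + c(-1, 1) * qt(0.975, df = df.residual(slw_tot)) * sci[2, 2]
```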
We check whether these values are in the confidence intervals, using the `confint` function:
\idxfun{confint}{stats}

```{r}
#| label: confint_growth
confint(slw_tot, level = 0.95)
```

$0.5$ and $-0.5$ are not in the 95% confidence intervals for $\hat{\beta}_i$ and $\hat{\beta}_v$, respectively. The hypotheses that the true values of the parameters are 0 are very easy to test, as they are based on the t statistics that are routinely returned by the `summary` method for `lm` objects. But, in our example, the hypothesis that $\beta_i = 0$ is the hypothesis that the share of profits is 0, which is of little interest. More interestingly, we could test the hypothesis that the coefficients are equal to $\pm 0.5$. In this case, we can "manually" compute the test statistics by extracting the relevant elements in the matrix returned by `coef(summary(x))`:
\idxfun{coef}{stats}

```{r}
#| label: extraction_summary_coef
#| collapse: true
v <- slw_tot %>% summary %>% coef
(v[2:3, 1] - c(0.5, -0.5)) / v[2:3, 2]
```

which confirms that both hypotheses are rejected. The same kind of linear hypothesis on one coefficient can be tested more simply by using a different parametrization of the same model. The trick is to write the model in such a way that a subset of coefficients are 0 if the hypothesis is true. For example, in order to test the hypothesis that $\beta_i = 0.5$, we must have in the model $(\beta_i - 0.5)\ln i$; therefore, the term $-0.5\ln i$ is added on the right side of the formula and should therefore be added also on the left side. Adding also $0.5 \ln v$ on both sides of the formula, we finally get:

$$
(\ln y - 0.5 \ln i + 0.5 \ln v) = \alpha + (\beta_i - 0.5) \ln i + (\beta_v + 0.5) \ln v + \epsilon
$$

and the two slopes are now equal to 0 under $H_0$.
\idxfun{coef}{stats}\idxfun{lm}{stats}

```{r}
#| label: constrained_coefs_pm05
slw_totb <- lm(log(gdp85 * sqrt(v / i)) ~ log(i) + log(v), growth_sub)
slw_totb %>% summary %>% coef
```

which gives exactly the same values for the $t$ statistics.

A simple hypothesis may also concern a linear combination of several coefficients and not the value of one coefficient. For example, the structural growth model implies that $\beta_i + \beta_v = 0$. If the hypothesis is true, $\mbox{E}(\hat{\beta}_i + \hat{\beta}_v) = 0$, the variance being $\mbox{V}(\hat{\beta}_i + \hat{\beta}_v) = \hat{\sigma}_{\hat{\beta}_1}^2 + \hat{\sigma}_{\hat{\beta}_2}^2 + 2 \hat{\sigma}_{\hat{\beta}_1\hat{\beta}_2}$. The statistic can then be computed by extracting the relevant elements of the covariance matrix of the fitted model:
\idxfun{coef}{stats}\idxfun{vcov}{stats}\idxfun{df.residual}{stats}\idxfun{unname}{base}\idxfun{pt}{stats}\idxfun{sqrt}{base}

```{r}
#| label: test_sum_two_coefficients
#| collapse: true
v <- vcov(slw_tot)
v_sum <- v[2, 2] + v[3, 3] + 2 * v[2, 3]
stat_sum <- (coef(slw_tot)[2] + coef(slw_tot)[3]) %>% unname
t_sum <- stat_sum / sqrt(v_sum) %>% unname
pval_sum <- 2 * pt(abs(t_sum), df = df.residual(slw_tot), lower.tail = FALSE)
c(stat = stat_sum, t = t_sum, pv = pval_sum)
```

The hypothesis is therefore not rejected, even at the 10% level.
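The same statistic can also be obtained more compactly in matrix form, which anticipates the $R\gamma = q$ notation used later in this chapter (a minimal sketch; the vector `a` is ours): with $a^\top = (0, 1, 1)$, the statistic is $a^\top\hat{\gamma} / \sqrt{a^\top \hat{V}(\hat{\gamma}) a}$.

```{r}
#| label: sum_two_coef_matrix_form
#| collapse: true
# t statistic for the linear combination a' gamma = beta_i + beta_v
a <- c(0, 1, 1)
drop(a %*% coef(slw_tot)) / sqrt(drop(t(a) %*% vcov(slw_tot) %*% a))
```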
Once again, such a linear hypothesis can be more easily tested using a different parametrization. Introducing the term $(\beta_i + \beta_v) \ln i$ in the model, subtracting $\beta_v \ln i$ and rearranging terms:

$$
\begin{array}{rcl}
\ln y &=& \alpha + (\beta_i + \beta_v) \ln i + \beta_v \ln v -\beta_v \ln i + \epsilon \\
& = & \alpha + (\beta_i + \beta_v) \ln i + \beta_v \ln \frac{v}{i} + \epsilon
\end{array}
$$

We then have a model for which the two covariates are now $\ln i$ and $\ln (v / i)$, the hypothesis being that the coefficient associated with $\ln i$ is equal to 0.
\idxfun{lm}{stats}\idxfun{coef}{stats}

```{r}
#| label: sum_two_coef_repar
slw_totc <- lm(log(gdp85) ~ log(i) + log(v / i), growth_sub)
slw_totc %>% summary %>% coef
```

```{r}
#| echo: false
#| label: confellipse
confellipse <- function(object){
    v <- car::confidenceEllipse(object, level = c(0.99, 0.95, 0.9), draw = FALSE)
    growth_ell <- Reduce("rbind",
                         lapply(1:3, function(x)
                             cbind(level = as.character(names(v)[x]),
                                   data.frame(v[[x]])))) %>% as_tibble
    ci <- confint(object, level = 0.95)
    ebx <- data.frame(x = ci[2, 1], xend = ci[2, 2], y = coef(object)[3])
    eby <- data.frame(y = ci[3, 1], yend = ci[3, 2], x = coef(object)[2])
    growth_ell %>% ggplot(aes(x, y)) +
        geom_path(aes(linetype = level)) +
        geom_segment(data = ebx, aes(x = x, xend = xend, y = y, yend = y), lwd = 1) +
        geom_segment(data = eby, aes(x = x, xend = x, y = y, yend = yend), lwd = 1) +
        geom_point(data = data.frame(x = coef(object)[2], y = coef(object)[3]),
                   shape = 1, size = 4)
}
```

### Joint confidence interval and test of joint hypothesis

#### Joint confidence interval

We now consider the computation of a confidence interval for more than one parameter. The distribution of $\hat{\beta}$ is:

$$
\hat{\beta} \stackrel{a}{\sim} \mathcal{N}\left(\beta, \frac{\sigma_\epsilon ^ 2}{N} \left(\frac{1}{N}\tilde{X}^\top\tilde{X}\right)^ {-1}\right)
$$

In the simple regression model, subtracting the expected value and dividing by the standard deviation, we get a standard normal deviate. Taking the square, we get a $\chi^2$ with 1 degree of freedom. If the $K$ slopes are uncorrelated, $\sum_{k=1} ^ K(\hat{\beta}_k - \beta_{k}) ^ 2/\sigma_{\hat{\beta}_k} ^ 2$ is a $\chi^2$ with $K$ degrees of freedom. If the slopes are correlated, this correlation should be corrected for; more precisely, a quadratic form of the vector of slopes in deviation from its expectation with the inverse of its variance should be computed:

$$
q_K = (\hat{\beta}-\beta) ^ \top\mbox{V}(\hat{\beta})^{-1} (\hat{\beta}-\beta)=(\hat{\beta}-\beta) ^ \top \frac{N}{\sigma_\epsilon ^ 2}\left(\frac{1}{N}\tilde{X}^\top\tilde{X}\right) (\hat{\beta}-\beta)\sim \chi^2_K
$$

<!-- Note that, if $K=1$, $\hat{\beta}- \beta$ and $\mbox{V}(\hat{\beta})$ are scalars and the expression reduces to $(\hat{\beta}- \beta)^2 / \sigma_{\hat{\beta}}^2$, which is the square of the t statistic. -->
For $K=2$, we have:

$$
q_2 = \left(\hat{\beta}_1 - \beta_1, \hat{\beta}_2 - \beta_2 \right)
\frac{N}{\sigma_\epsilon ^ 2}
\left(\begin{array}{cc}
\hat{\sigma}_1 ^ 2 & \hat{\sigma}_{12} \\
\hat{\sigma}_{12} & \hat{\sigma}_2 ^ 2
\end{array}\right)
\left(\begin{array}{c}
\hat{\beta}_1 - \beta_1 \\
\hat{\beta}_2 - \beta_2
\end{array}\right)
$$

$$
q_2 = \frac{N}{\sigma_\epsilon ^ 2}\left[\hat{\sigma}_1^2(\hat{\beta}_1 - \beta_1) ^ 2+\hat{\sigma}_2^2(\hat{\beta}_2 - \beta_2) ^ 2+ 2 \hat{\sigma}_{12}(\hat{\beta}_1 - \beta_1)(\hat{\beta}_2 - \beta_2)\right]
$$ {#eq-ellipsebeta}

Using @eq-student_multiple, this expression can also be rewritten in terms of the Student statistics $t_k$:

$$
q_2 = \frac{1}{1 - \hat{\rho}_{12} ^ 2}\left(t_1 ^ 2 + t_2 ^ 2 + 2 \hat{\rho}_{12} t_1 t_2\right) = \frac{1}{1 - \hat{\rho}_{\hat{\beta}_1\hat{\beta}_2} ^ 2}\left(t_1 ^ 2 + t_2 ^ 2 - 2 \hat{\rho}_{\hat{\beta}_1\hat{\beta}_2} t_1 t_2\right)
$$ {#eq-ellipse}

the last equality resulting from the fact that the coefficient of correlation of the two slopes is the opposite of the coefficient of correlation of the corresponding covariates (see @eq-var_covar_slopes). Therefore, a $(1-\alpha)$ confidence interval for a pair of coefficients $(\beta_1, \beta_2)$ is the set of values for which @eq-ellipse is lower than the critical value of a $\chi ^ 2$ with 2 degrees of freedom, which is 5.99 for $1-\alpha = 0.95$. Equating @eq-ellipse to this critical value, we get the equation of an ellipse, with two particular nested cases:

- if $\hat{\rho}_{12} = 0$, i.e., if the two covariates and therefore the two coefficients are uncorrelated, the expression reduces to the sum of the squares of the t statistics, which is an ellipse whose axes are parallel to the coordinate axes,
- if $\hat{\rho}_{12} = 0$ and $\hat{\sigma}_1 = \hat{\sigma}_2$, the expression reduces further to the equation of a circle.

@eq-ellipse can't be computed as it depends on $\sigma_\epsilon$, which is unknown. Replacing $\sigma_\epsilon$ by $\dot{\sigma}_\epsilon$, we get:

$$
\hat{q}_2 = \frac{1}{1 - \hat{\rho}_{12} ^ 2}\left(\hat{t}_1 ^ 2 + \hat{t}_2 ^ 2 + 2 \hat{\rho}_{12} \hat{t}_1 \hat{t}_2\right)
$$ {#eq-ellipse_est}

where $\hat{q}_2$ now follows asymptotically a $\chi ^ 2$ distribution with 2 degrees of freedom. If the errors are normal, dividing by $K$ (here 2), we get an exact Fisher $F$ distribution with $2$ and $N - K - 1$ degrees of freedom. In the simple case of no correlation between the covariates, the $\chi^2$ and the $F$ statistics are therefore the sum and the mean of the squares of the t statistics. As the Student distribution converges in distribution to a normal distribution, $K$ times the $F$ statistic converges in distribution to a $\chi ^ 2$ with $K$ degrees of freedom. For example, with $K=2$, the critical value of an $F$ distribution with $2$ and $\infty$ degrees of freedom is half the corresponding $\chi^2$ value with 2 degrees of freedom (5.99 for the 95% confidence level), which is $2.996$.

The confidence interval is represented in @fig-ellipse; the circle point is the point estimate, and the vertical and horizontal segments are the separate confidence intervals for both coefficients at the 95% level. We've also added a diamond point that corresponds to the hypothesis $\kappa = 1/3$, which implies that $\beta_i = - \beta_v = \frac{\kappa}{1-\kappa} = 0.5$.
Finally, we add a line with a slope equal to $-1$ and an intercept equal to 0, which corresponds to the hypothesis that $\beta_i = - \beta_v$.

```{r}
#| label: fig-ellipse
#| echo: false
#| fig.width: 4
#| fig.height: 6
#| out.width: "30%"
#| fig-cap: "Confidence ellipse for the two coefficients"
confellipse(slw_tot) +
    geom_point(data = data.frame(x = 0.5, y = -0.5), shape = 5, size = 4) +
    geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) +
    geom_abline(intercept = 0, slope = -1) +
    geom_point(data = tibble(x = c(1.1, 1.25), y = c(-1, -3.1)),
               shape = c(0, 2), size = 3) +
    labs(x = "ln i", y = "ln v", linetype = NULL)
```

The confidence ellipse is a "fat sausage"^[@STOC:WATS:15, p. 235.]\index[author]{Stock}\index[author]{Watson} with the long part of the sausage slightly oriented in the lower-left/upper-right direction. This is because, as we have seen previously (see @fig-invpop), the two covariates exhibit a small negative correlation, which implies a small positive correlation between $\hat{\beta}_i$ and $\hat{\beta}_v$. The ellipse is also "higher" than "wide", because $v$ has a smaller variance than $i$ and therefore $\hat{\beta}_v$ has a larger variance than $\hat{\beta}_i$.

Note the difference between a set of two simple hypotheses and a joint hypothesis. Consider for example:

- $(\beta_i = 1.1, \beta_v = -1)$, represented by a square point: neither simple hypothesis is rejected (the two values are in the unidimensional confidence intervals), but the joint hypothesis is rejected, as the corresponding square point is outside the 95% confidence ellipse,
- $(\beta_i = 1.25, \beta_v = -3.1)$, represented by a triangle point: the simple hypothesis $\beta_i = 1.25$ is not rejected, the simple hypothesis $\beta_v = -3.1$ is rejected, but the joint hypothesis is not rejected, the triangle point being inside the 95% confidence ellipse.

The hypothesis that the two coefficients sum to 0 is not rejected, as some points of the straight line that represents this hypothesis lie inside the confidence ellipse. Concerning the hypothesis that $\beta_i = 0.5$ and $\beta_v = -0.5$, the two simple hypotheses and the joint hypothesis are rejected at the 5% level; the hypothesized values are neither in the segments representing the simple confidence intervals nor inside the ellipse representing the joint confidence interval.
\index[general]{confidence interval!multiple linear regression model|)}

#### Joint hypothesis

\index[general]{test!multiple linear regression model|(}
To test a joint hypothesis for the values of a pair of parameters $(\beta_{10}, \beta_{20})$, we have just seen that we can simply check whether the corresponding point is inside or outside the confidence ellipse. We can also compute the statistic given in @eq-ellipse_est for $(\beta_1=\beta_{10}, \beta_2=\beta_{20})$ and compare it to the critical value. The statistic is computed using the elements of the matrix returned by `coef(summary(x))`, which contains in particular the estimates and their standard deviations.
We first compute the t statistics corresponding to the two simple hypotheses $\beta_i = 0.5$ and $\beta_v = -0.5$:
\idxfun{coef}{stats}

```{r}
#| label: student_coefs_pm05
#| collapse: true
sc <- slw_tot %>% summary %>% coef
t_i <- (sc[2, 1] - 0.5) / sc[2, 2]
t_v <- (sc[3, 1] + 0.5) / sc[3, 2]
```

We then apply the simplified formula, which is the sum or the mean of the squares of the t statistics:

```{r}
#| label: chisqf_approx_coef_pm05
#| collapse: true
St2 <- t_i ^ 2 + t_v ^ 2
Mt2 <- (t_i ^ 2 + t_v ^ 2) / 2
c(St2, Mt2)
```

for which the distributions under $H_0$ are respectively $\chi ^ 2$ and $F$, with the following critical values:
\idxfun{qchisq}{stats}\idxfun{qf}{stats}\idxfun{df.residual}{stats}

```{r}
#| collapse: true
#| label: quantiles_chisq_f
qchisq(0.95, df = 2)
qf(0.95, df1 = 2, df2 = df.residual(slw_tot))
```

The critical values being much smaller than the computed statistics, the joint hypothesis is clearly rejected. The exact formula corrects for the correlation between the two estimators using the coefficient of correlation between the two covariates. We compute this coefficient using the data frame of the fitted model, which is obtained using the `model.frame` function. Using the initial `growth` data frame wouldn't give the correct value, as the estimation is not performed on the full data set: some observations are dropped because of missing data and some groups of countries are excluded from the subsample:
\idxfun{summarise}{dplyr}\idxfun{model.frame}{stats}\idxfun{pull}{dplyr}

```{r}
#| collapse: true
#| label: coef_corr_i_v
mf <- model.frame(slw_tot)
r_iv <- summarise(mf, r = cor(`log(i)`, `log(v)`)) %>% pull(r)
r_iv
```

Note that the names of the covariates are not regular names, as they contain parentheses; therefore they should be surrounded by the \texttt{`} sign. The coefficient of correlation between the two covariates is $`r round(r_iv, 3)`$. We obtain:

```{r}
#| label: fstat_coef_pm05
#| collapse: true
Fstat <- (t_i ^ 2 + t_v ^ 2 + 2 * r_iv * t_i * t_v) / (1 - r_iv ^ 2) / 2
Fstat
```

The statistic (`r round(Fstat, 3)`) is slightly greater than the approximate value (`r round(Mt2, 3)`) previously computed; we therefore once again reject the null hypothesis.
\index[general]{test!multiple linear regression model|)}

### The three tests {#sec-three_tests}

\index[general]{test!three test principles!linear regression models|(}

<!-- !!! clarify the notation Z vs X -->

In the previous subsection, we started with a general (unconstrained) model, we constructed a confidence ellipse for the two parameters, and we were able to test a set of hypotheses, either by checking whether the values of the parameters corresponding to the hypotheses were inside the confidence ellipse, or by computing the value of the statistic for the tested values of the parameters. Actually, this testing principle, based on the unconstrained model, is just one way of testing a hypothesis. The geometry of least squares and the Frisch-Waugh theorem highlight the fact that any set of hypotheses gives rise to two models: a **constrained model**, which imposes the hypotheses, and an **unconstrained model**, which doesn't.
\index[general]{constrained model}\index[general]{unconstrained model}
The same test can be performed using the constrained model, the unconstrained model or both, which give rise to three test principles:^[This section is largely based on @DAVI:MACK:93, section 3.6, pp. 88-94.
The three "classical" tests are often understood as tests suitable for models estimated by maximum likelihood (see @sec-three_tests_ml). @DAVI:MACK:93 advocate the presentation of the three test principles for other estimators, including the linear regression model.]

- the **Wald** test is based only on the unconstrained model,
- the **Lagrange multiplier** or **score** test is based on the constrained model,
- the **likelihood ratio** test is based on the comparison between the two models.

A set of $J$ linear hypotheses is written as:

$$
R \gamma = q
$$

where $R$ is a matrix of dimension $J \times (K + 1)$ and $q$ is a vector of length $J$. $J$, the number of hypotheses, is necessarily at most equal to $K$. Actually, a set of $J$ hypotheses can always be rewritten as a model of the form $y = X_1\beta_1 + X_2\beta_2 + \epsilon$, the hypothesis being $\mbox{H}_0: \beta_2 = 0$. In this setting, the three tests are easily constructed using the Frisch-Waugh theorem.

#### Wald test

\index[general]{Wald test|(}
The Wald test is based on the unconstrained model, for which a vector of slopes $\hat{\beta}_2$ is estimated. Using the Frisch-Waugh theorem, this vector can be obtained by regressing the residuals of $y$ on $X_1$ ($M_1y$) on the residuals of every column of $X_2$ on $X_1$ ($M_1X_2$). Then, $\hat{\beta}_2 = (X_2^\top M_1 X_2) ^ {-1} X_2^\top M_1 y$, with expected value and variance equal to $\beta_2$ (0 under $H_0$) and $\mbox{V}(\hat{\beta}_2) = \sigma_\epsilon^2 (X_2^\top M_1 X_2) ^ {-1}$. Convergence in distribution implies that, under $H_0$:

$$
\hat{\beta}_2 \stackrel{a}{\sim} \mathcal{N} \left(0, \mbox{V}(\hat{\beta}_2)\right)
$$

The distribution of the quadratic form of a centered vector of normal random variables of length $J$ with the inverse of its covariance matrix is a $\chi^2$ with $J$ degrees of freedom:

$$
\mbox{wald} = \hat{\beta}_2 ^ \top \mbox{V}(\hat{\beta}_2) ^ {-1}\hat{\beta}_2=\frac{\hat{\beta}_2 ^ \top(X_2^\top M_1 X_2)\hat{\beta}_2}{\sigma_\epsilon^2}
$$

which is also, replacing $\hat{\beta}_2$ by its expression:

$$
\mbox{wald} = \frac{y ^ \top M_1 X_2 (X_2^\top M_1 X_2) ^ {-1} X_2^\top M_1 y}{\sigma_\epsilon^2} = \frac{y ^ \top P_{M_1X_2} y}{\sigma_\epsilon^2}
$$

\index[general]{Wald test|)}

#### Lagrange multiplier or score test

\index[general]{score test|(}
Consider the constrained model, which imposes $\beta_2 = 0$. It is therefore obtained by regressing $y$ on $X_1$ only, and the vector of residuals is $\hat{\epsilon}_1 = M_1 y$. The idea of this test is that, if $H_0$ is true, $X_2 ^ \top \hat{\epsilon}_1$ should be close to zero, $\hat{\epsilon}_1$ being "almost" orthogonal to the subspace spanned by $X_2$. Therefore, we consider the vector $X_2 ^ \top \hat{\epsilon}_1 = X_2 ^ \top M_1 y$, which, under $H_0$, should have a 0 expected value.
The variance of this vector is $\sigma_\epsilon ^ 2 X_2^\top M_1X_2$, so that, applying the central-limit theorem:

$$
X_2 ^ \top \hat{\epsilon}_1 \equiv X_2 ^ \top M_1 y \stackrel{a}{\sim} \mathcal{N} \left(0, \sigma_\epsilon ^ 2 X_2 ^ \top M_1 X_2\right)
$$

The statistic is obtained, as previously, by computing the quadratic form:

$$
\mbox{score}=y ^ \top M_1 X_2 \left(\sigma_\epsilon ^ 2 X_2 ^ \top M_1X_2\right) ^ {-1} X_2 ^ \top M_1 y=\frac{y ^ \top M_1 X_2 (X_2^\top M_1 X_2) ^ {-1} X_2^\top M_1 y}{\sigma_\epsilon^2} = \frac{y ^ \top P_{M_1X_2} y}{\sigma_\epsilon^2}
$$

\index[general]{score test|)}

#### Likelihood ratio test

\index[general]{likelihood ratio test|(}
The likelihood ratio test is based on the comparison of the objective function (the sum of squared residuals) for the constrained and the unconstrained model. Remember, from @eq-pyth_ssr, that:

$$
\mid\mid \hat{\epsilon}_{12}\mid\mid ^ 2 + \mid\mid P_{M_1X_2}y\mid\mid ^ 2 = \mid\mid \hat{\epsilon}_{1}\mid\mid ^ 2
$$

The first term on the left and the term on the right are residual sums of squares (respectively for the unconstrained and the constrained model). Therefore, the likelihood ratio test is based on $\mbox{SSR}_c - \mbox{SSR}_{nc} = ||P_{M_1X_2}y|| ^ 2 = y ^ \top P_{M_1X_2} y$. Dividing by $\sigma_\epsilon^2$, we get exactly the same statistic as previously.
\index[general]{likelihood ratio test|)}

### Computation of the three tests

Consider the augmented Solow model: $y = \alpha + \beta_i i + \beta_v v + \beta_e e + \epsilon$, with $\beta_i = \frac{\kappa}{1 - \kappa - \lambda}$, $\beta_v = -\frac{\kappa + \lambda}{1 - \kappa - \lambda}$ and $\beta_e = \frac{\lambda}{1 - \kappa - \lambda}$. For convenience, we compute the logarithms of the variables before using `lm`:
\idxfun{mutate}{dplyr}\idxfun{lm}{stats}

```{r}
#| label: compute_logs
growth_sub2 <- growth_sub %>%
    mutate(y = log(gdp85), i = log(i), v = log(v), e = log(e))
mrw2 <- lm(y ~ i + v + e, growth_sub2)
```

We consider two hypotheses:

- $\beta_i + \beta_v + \beta_e = 0$; this hypothesis is directly implied by the structural model,
- $\kappa = 1 /3$; the share of profits then has a reasonable value.

The second hypothesis implies that $\beta_i = \frac{1/3}{2/3 - \lambda}$ and $\beta_e = \frac{\lambda}{2/3 - \lambda}$, and therefore that $\beta_e = 2 \beta_i - 1$. The model can be reparametrized in such a way that two slopes are 0 if the two hypotheses are satisfied:

$$
y + e - v = \alpha + \beta_i (i + 2e - 3v) + (\beta_e-2\beta_i+1)(e-v)+(\beta_i + \beta_e+\beta_v)v + \epsilon
$$

Therefore, the unconstrained model can be written as a model with $y + e - v$ as the response and $i + 2 e - 3 v$, $e-v$ and $v$ as the three covariates, and the constrained model as a model with the same response but with $i + 2 e - 3 v$ as the unique covariate.
\idxfun{lm}{stats}\idxfun{coef}{stats}

```{r}
#| label: constrained_unconstr_models_growth
growth_sub2 <- growth_sub2 %>%
    mutate(y2 = y + e - v, i2 = i + 2 * e - 3 * v, e2 = e - v)
mrw_c <- lm(y2 ~ i2, growth_sub2)
mrw_nc <- lm(y2 ~ i2 + e2 + v, growth_sub2)
coef_nc <- mrw_nc %>% summary %>% coef
coef_nc
```

The two hypotheses can be tested one by one, as the test is that:

- the slope of $e2 = e-v$ equals 0 for the hypothesis that $\kappa = 1/3$,
- the slope of $v$ equals 0 for the hypothesis that $\beta_i+\beta_e+\beta_v=0$.

Neither hypothesis is rejected, even at the 10% level.
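As a minimal check of this statement, the corresponding p-values can be read directly from the last column of `coef_nc`:

```{r}
#| label: pvalues_e2_v
#| collapse: true
# p-values of the t tests for the slopes of e2 and v
coef_nc[c("e2", "v"), 4]
```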
To test the joint hypothesis, we can first use the approximate formula, which is the sum or the mean of the squares of the t statistics (respectively a $\chi^2$ with 2 degrees of freedom and an F with 2 and 71 degrees of freedom):
\idxfun{pf}{stats}\idxfun{pchisq}{stats}

```{r}
#| label: approx_results_joint_hyp
#| collapse: TRUE
appr_chisq <- coef_nc[3, 3] ^ 2 + coef_nc[4, 3] ^ 2
appr_f <- appr_chisq / 2
c(appr_chisq, appr_f)
pchisq(appr_chisq, df = 2, lower.tail = FALSE)
pf(appr_f, df1 = 2, df2 = 71, lower.tail = FALSE)
```

Based on this approximation, the joint hypothesis is clearly not rejected. We now turn to the computation of the statistics, using the three test principles.

#### Wald test {#sec-wald_test_example}

\index[general]{Wald test|(}
Considering the initial unconstrained model, for which the formula is $y \sim i + v + e$, the set of the two hypotheses can be written in matrix form as:

$$
\left(\begin{array}{cccc}
0 & 2 & 0 & - 1 \\
0 & 1 & 1 & 1 \\
\end{array}\right)
\left(\begin{array}{c}
\alpha \\ \beta_i \\ \beta_v \\ \beta_e
\end{array}\right) =
\left(\begin{array}{c}
1 \\ 0
\end{array}\right)
$$

The matrix $R$ and the vector $q$ are created in **R**:
\idxfun{matrix}{base}

```{r}
#| label: R_constraints_growth
R <- matrix(0, 2, 4) ; q <- c(1, 0)
R[1, 2] <- 2 ; R[1, 4] <- -1 ; R[2, 2:4] <- 1
```

We then have $H_0: R\gamma - q = 0$ and, for the fitted unconstrained model, we compute $R\hat{\gamma} - q$. Under $H_0$, the expected value of this vector is 0 and its estimated variance is $R \hat{V}_{\hat{\gamma}} R^\top=\hat{\sigma}_\epsilon ^ 2R(Z^\top Z)^{-1}R^\top$. Then:

$$
(R\hat{\gamma} - q)^\top \left[R \hat{V}_{\hat{\gamma}}R ^ \top\right]^{-1}(R\hat{\gamma} - q)
$$ {#eq-wald_test_formula}

is asymptotically a $\chi^2$ with $J=2$ degrees of freedom. Dividing by $J$, we get an F statistic with $2$ and $71$ degrees of freedom.
\idxfun{coef}{stats}\idxfun{solve}{base}\idxfun{vcov}{stats}

```{r}
#| label: wald_test_growth
#| collapse: true
t(R %*% coef(mrw2) - q) %*% solve(R %*% vcov(mrw2) %*% t(R)) %*%
    (R %*% coef(mrw2) - q) / 2
```

`car::linearHypothesis` performs Wald tests with a nice syntax: the first argument is a fitted model, and the second one is a character vector containing the character representation of the hypotheses:
\idxfun{linearHypothesis}{car}\idxfun{gaze}{micsr}

```{r}
#| label: linear_hypothesis
#| collapse: true
car::linearHypothesis(mrw2, c("i + v + e = 0", "e = 2 * i - 1")) %>% gaze
```

\index[general]{Wald test|)}

#### Likelihood ratio test

\index[general]{likelihood ratio test|(}
The computation of the likelihood ratio statistic is very simple once the two models have been estimated.
The residual sums of squares of the two models are extracted using the `deviance` method; we divide the difference of the two sums of squares by $\dot{\sigma}_\epsilon^2$ (the `sigma` method is used to extract the residual standard error) and by 2 to get an F statistic.
\idxfun{deviance}{stats}\idxfun{sigma}{stats}

```{r}
#| label: lr_test_growth
#| collapse: true
(deviance(mrw_c) - deviance(mrw2)) / sigma(mrw2) ^ 2 / 2
```

\index[general]{likelihood ratio test|)}

#### Score test

\index[general]{score test|(}
For the score test, we consider the reparametrized model and we define $Z_1$ as a matrix containing a column of ones and $i + 2 e - 3 v$, and $X_2$ as a matrix containing $e-v$ and $v$.
\idxfun{model.matrix}{stats}\idxfun{model.frame}{stats}

```{r}
#| label: score_test_model_matrix
Z1 <- model.matrix(~ i2, model.frame(mrw_nc))
X2 <- model.matrix(~ e2 + v - 1, model.frame(mrw_nc))
```

The test is based on the vector $X_2 ^ \top \hat{\epsilon}_c$, where $\hat{\epsilon}_c$ is the vector of residuals of the constrained model. Under $H_0$, the expected value of $X_2 ^ \top \hat{\epsilon}_c$ is 0 and its variance is $\sigma_\epsilon ^ 2 X_2^ \top M_1 X_2$. $M_1 X_2$ is the matrix of the residuals of the regressions of all the columns of $X_2$ on $Z_1$.
\idxfun{resid}{stats}\idxfun{lm}{stats}

```{r}
ec <- resid(mrw_c)
M1X2 <- resid(lm(X2 ~ Z1))
```

Note that we've used `lm` with only a formula argument. In this case, the response and the covariates are vectors or matrices, not columns of a tibble. Note also that the left-hand side of the formula is a matrix and not a vector; in this case, each column is treated as a response and `lm` fits as many models as there are columns in this matrix. Then, the `resid` method no longer returns a vector, but a matrix, each column being the vector of residuals of one of the fitted models.

The statistic is then $\hat{\epsilon}_c^\top X_2 \left[X_2^ \top M_1 X_2\right]^{-1} X_2^\top \hat{\epsilon}_c / \sigma_\epsilon^2$; it is computed using an estimator of $\sigma_\epsilon^2$ and divided by $J=2$ to get an F statistic.
\idxfun{crossprod}{base}\idxfun{sigma}{stats}\idxfun{solve}{base}

```{r}
#| collapse: true
t(crossprod(X2, ec)) %*% solve(crossprod(M1X2)) %*% crossprod(X2, ec) /
    sigma(mrw_c) ^ 2 / 2
```

The statistic is slightly different from the one computed previously, the difference being only due to the fact that the estimation of $\sigma_\epsilon$ is based, for the score test, on the constrained model.
\index[general]{score test|)}

### Testing that all the slopes are 0

\index[general]{F statistic|(}
The test that all the slopes are 0 is routinely reported by software performing OLS estimation. It can be computed using any of the three test principles, but the likelihood ratio test is particularly appealing in this context, the constrained model being a model with only an intercept: $y_n = \alpha + \epsilon_n$. In this case, $\hat{\alpha} = \bar{y}$ and $\hat{\epsilon}_n = y_n - \bar{y}$. Therefore, the residual sum of squares for the constrained model is $S_{yy} = \sum_n (y_n - \bar{y}) ^ 2$, also denoted by $\mbox{TSS}$ for total sum of squares. The statistic is then $(\mbox{TSS} - \mbox{RSS})/\sigma_\epsilon ^ 2 \sim \chi ^ 2_K$, which is a $\chi ^ 2$ with $K$ degrees of freedom if the hypothesis that all the slopes are 0 is true. To compute this statistic, $\sigma_\epsilon^2$ has to be estimated. A natural estimator is $\mbox{RSS} / (N - K - 1)$ but, if $H_0$ is true, $\mbox{TSS} / (N - 1)$ is also an unbiased estimator.
Moreover, dividing by the sample size ($N$) instead of the number of degrees of freedom leads to biased but consistent estimators. Using the first estimator of $\sigma_\epsilon^2$ and dividing by $K$, we get the $F$ statistic with $K$ and $N-K-1$ degrees of freedom:

$$
\frac{\mbox{TSS} -\mbox{RSS}}{\mbox{RSS}}\frac{N - K - 1}{K} \sim F_{K, N-K-1}
$$

Using $\mbox{TSS} / N$ as an estimator of $\sigma_\epsilon^2$, we get a very simple statistic that is asymptotically a $\chi^2$ with $K$ degrees of freedom:

$$
N \frac{\mbox{TSS} -\mbox{RSS}}{\mbox{TSS}} \stackrel{a}{\sim} \chi^2_{K}
$$

These two statistics are closely related to the $R^2$, which is, using this notation, equal to $1 - \mbox{RSS} / \mbox{TSS} = (\mbox{TSS} - \mbox{RSS}) / \mbox{TSS}$. We can then write the $F$ statistic as:

$$
\frac{R^2}{1 - R ^ 2}\frac{N - K - 1}{K}
$$

and the asymptotic $\chi^2$ statistic as $N R^2$. There is no extractor function for the $R^2$ and the $F$ statistic in base **R**; both are computed by the `summary` method for `lm` objects:
\idxfun{names}{base}

```{r}
#| label: summary_lm_names
slm <- slw_tot %>% summary
slm %>% names
```

and can be extracted "manually" from the list returned by `summary` using the `$` operator:

```{r}
#| label: extracting_r2_f
#| collapse: true
slm$r.squared
slm$adj.r.squared
slm$fstatistic
```

More simply, `micsr::rsq` and `micsr::ftest` can be used:
\idxfun{rsq}{micsr}\idxfun{ftest}{micsr}\idxfun{gaze}{micsr}

```{r}
#| label: extracting_r2_f_micsr
#| collapse: true
slw_tot %>% rsq
slw_tot %>% rsq(type = "adj")
slw_tot %>% ftest %>% gaze
```

The interest of this testing strategy is not limited to the test that all the slopes of a real model are 0. It can also be used to test any set of hypotheses, using reparametrization and the Frisch-Waugh theorem. Consider for example the hypothesis that $\kappa = 1/3$ and $\beta_i + \beta_e + \beta_v=0$. We have seen previously that, after reparametrization, this corresponds to the model:
\idxfun{lm}{stats}\idxfun{coef}{stats}

```{r}
#| label: constrained_reparam_growth
#| collapse: true
mrw_nc <- lm(y2 ~ i2 + e2 + v, growth_sub2)
mrw_nc %>% coef
```

with, if the two hypotheses are true, the two slopes associated with `e2` and `v` equal to 0. Now, using the Frisch-Waugh theorem, and denoting $Z_1 = (j, i_2)$ and $X_2 = (e_2, v)$:
\idxfun{lm}{stats}\idxfun{coef}{stats}\idxfun{resid}{stats}

```{r}
#| label: frisch_waugh_growth
#| collapse: true
y2b <- lm(y2 ~ i2, growth_sub2) %>% resid
e2b <- lm(e2 ~ i2, growth_sub2) %>% resid
vb <- lm(v ~ i2, growth_sub2) %>% resid
mrw_ncb <- lm(y2b ~ e2b + vb - 1)
mrw_ncb %>% coef
```

We get exactly the same estimators as previously for `e2` and `v`, but now the joint hypothesis is that all the slopes of the second model are 0. Therefore, the test is based on the F statistic that is returned by `summary` and doesn't require any further computation:
\idxfun{ftest}{micsr}\idxfun{gaze}{micsr}

```{r}
#| label: frisch_waugh_growth_fstat
#| collapse: true
mrw_ncb %>% ftest %>% gaze
```

Actually, a degrees of freedom correction should be performed to get exactly the same result, because `mrw_ncb` is estimated with only two covariates and no intercept and therefore has $75 - 2 = 73$ degrees of freedom, whereas the correct number of degrees of freedom is $75 - 4 = 71$.
\index[general]{F statistic|)}
\index[general]{test!three test principles!linear regression models|)}

## System estimation and constrained least squares {#sec-system_equation}

\index[general]{system estimation|(}
Very often in economics, the phenomenon under investigation is not well described by a single equation, but by a system of equations. Moreover, there may be inter-equation constraints on the coefficients.
This is particularly the case in the field of the microeconometrics of consumption or production. For example, the behavior of a producer is described by a minimum cost equation along with factor demand equations, and the behavior of a consumer is described by a set of demand equations.

### System of equations {#sec-sys_eq_ols}

We therefore consider a system of $L$ equations denoted by $y_l=Z_l\gamma_l+\epsilon_l$, with $l=1\ldots L$. In matrix form, the system can be written as follows:

$$
\left(
\begin{array}{c}
y_1 \\ y_2 \\ \vdots \\ y_L
\end{array}
\right)=
\left(
\begin{array}{cccc}
Z_1 & 0 & \ldots & 0 \\
0 & Z_2 & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & Z_L
\end{array}
\right)
\left(
\begin{array}{c}
\gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_L
\end{array}
\right)+
\left(
\begin{array}{c}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_L
\end{array}
\right)
$$ {#eq-system_equation}

Therefore, the whole system can be estimated directly by stacking the vectors of responses and by constructing a block-diagonal matrix of covariates, each block being the matrix of covariates of one equation. As an example, consider the analysis of production characteristics (returns to scale, elasticities of substitution between pairs of inputs). The modern approach to production analysis consists of first considering the minimum cost function, which depends on the level of production and on the input unit prices, $C(y, p_1, \ldots p_J)$, and then of deriving the input demands using Shephard's lemma:

$$
\frac{\partial C}{\partial p_j} = x_j(y, p_1, \ldots, p_J)
$$

The cost function is obviously homogeneous of degree 1 in input unit prices, which means that, for a given level of output, if all the prices increase proportionally, the quantities of the different inputs are the same and therefore the cost increases by the same percentage. This can be written as $C(y, \lambda p_1, \ldots \lambda p_J) = \lambda C(y, p_1, \ldots, p_J)$ and $x_j(y, \lambda p_1, \ldots, \lambda p _J) = x_j(y, p_1, \ldots p_J)$; the latter relation indicating that the input demands are homogeneous of degree 0 in input unit prices.

Among the different functional forms that have been proposed to estimate the cost function, the translog specification is the most popular. It can be considered as a second-order approximation of a general cost function:

$$
\ln C = \alpha + \beta_y \ln y + \frac{1}{2}\beta_{yy} \ln^2 y +
\sum_i \beta_i \ln p_i + \frac{1}{2}\sum_i\sum_j \beta_{ij} \ln p_i \ln p_j
$$

Using Shephard's lemma, the cost share of input $i$ is the derivative of $\ln C$ with respect to $\ln p_i$:

$$
s_i = \frac{\partial \ln C}{\partial \ln p_i} = \frac{p_i x_i}{C} = \beta_i + \beta_{ii} \ln p_i + \frac{1}{2}\sum_{j\neq i} (\beta_{ij} + \beta_{ji}) \ln p_j
$$

Homogeneity of degree 1 in input prices implies that the cost shares don't depend on the level of prices.
Therefore, $\sum_j^I \beta_{ij} = 0$, or $\beta_{iI} = - \sum_j^{I-1} \beta_{ij}$ and:

$$
\begin{array}{rcl}
\sum_i^I\sum_j^I \beta_{ij} \ln p_i \ln p_j &=&\sum_i^I \ln p_i \left[\sum_j^{I-1} \beta_{ij}\ln p_j + \beta_{iI}\ln p_I\right] \\
&=& \sum_i^I \ln p_i \sum_j^{I-1} \beta_{ij} \ln\frac{p_j}{p_I} \\
&=& \sum_i^{I-1} \ln p_i \sum_j^{I-1} \beta_{ij} \ln\frac{p_j}{p_I} + \ln p_I \sum_j^{I-1} \beta_{Ij} \ln\frac{p_j}{p_I} \\
&=& \sum_i^{I-1} \ln p_i \sum_j^{I-1} \beta_{ij} \ln\frac{p_j}{p_I} + \ln p_I \sum_j^{I-1} \beta_{jI} \ln\frac{p_j}{p_I} \\
&=& \sum_i^{I-1} \ln p_i \sum_j^{I-1} \beta_{ij} \ln\frac{p_j}{p_I} -\ln p_I \sum_i^{I-1} \sum_j^{I-1}\beta_{ji} \ln\frac{p_j}{p_I} \\
&=& \sum_i^{I-1} \sum_j^{I-1} \beta_{ij} \ln \frac{p_i}{p_I}\ln\frac{p_j}{p_I}
\end{array}
$$

Moreover, the cost shares sum to 1 whatever the value of the prices, so that $\sum_i \beta_i = 1$. Therefore, the cost function can be rewritten as:

$$
C^* = \alpha + \beta_y \ln y + \frac{1}{2}\beta_{yy} \ln^2 y + \sum_i ^{I-1}\beta_i p_i^* + \frac{1}{2} \sum_{i=1} ^ {I - 1} \beta_{ii} {p_i^*} ^ 2 + \sum_{i=1} ^ {I - 1} \sum_{j>i} ^ {I - 1} \beta_{ij} p_i^* p_j^*
$$

where $z^* = \ln (z / p_I)$, and the cost shares are:

$$
s_i = \beta_i + \sum_{j=1} ^ {I-1} \beta_{ij} p_j ^ *
$$

Consider the case where $I=3$. In this case, the complete system of equations is:

$$
\left\{
\begin{array}{rcl}
C^* &=& \alpha + \beta_y \ln y + \frac{1}{2}\beta_{yy} \ln^2 y + \beta_1 p_1 ^ * + \beta_2 p_2 ^ *+ \frac{1}{2} \beta_{11} {p_1^*} ^ 2 + \beta_{12} p_1^* p_2^* + \frac{1}{2} \beta_{22} {p_2^*} ^ 2 \\
s_1 &=& \beta_1 + \beta_{11} p_1 ^ * + \beta_{12} p_2 ^ * \\
s_2 &=& \beta_2 + \beta_{12} p_1 ^ * + \beta_{22} p_2 ^* \\
\end{array}
\right.
$$

There are 14 parameters to estimate in total (8 in the cost function and 3 in each of the cost share equations), but there are 6 linear restrictions; for example, the coefficient of $p_1^*$ in the cost equation should be equal to the intercept of the cost share equation for the first factor. We estimate the translog cost function with the `apples` data set \idxdata{apples}{micsr} of @IVAL:LADO:OSSA:SIMI:96\index[author]{Ivaldi}\index[author]{Ladoux}\index[author]{Ossard}\index[author]{Simioni}, who studied the production cost of apple producers. Farms in this sample produce apples and other fruits (respectively `apples` and `otherprod`). The authors observe the sales of apples and other fruits as well as the quantity of apples produced. Therefore, they are able to compute the unit price of apples. Both sales are divided by this unit price, so that `apples` is measured in apple quantity, and `otherprod` is measured in "equivalent" apple quantities. Therefore, they can be summed in order to have a unique output variable `y`. The expenses on the three factors are given by `capital`, `labor` and `materials`, and the corresponding unit prices are `pc`, `pl` and `pm`. The data set is an unbalanced panel of 173 farms observed for three years (1984, 1985 and 1986).
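Before transforming the data, it may help to glance at the raw variables for the year that will be retained (a minimal sketch, assuming **dplyr** is attached as in the rest of the chapter; the chunk label is ours):

```{r}
#| label: apples_1985_glimpse
#| collapse: true
# sales, factor expenses and unit prices for the 1985 cross-section
apples %>%
    filter(year == 1985) %>%
    select(apples, otherprod, capital, labor, materials, pc, pl, pm) %>%
    head(2)
```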
We consider only one year (1985) and, for a reason that will become clear later, we divide all the variables by their sample means:
\idxfun{filter}{dplyr}\idxfun{transmute}{dplyr}

```{r}
#| label: apples
ap <- apples %>%
    filter(year == 1985) %>%
    transmute(y = otherprod + apples,
              ct = capital + labor + materials,
              sl = labor / ct, sm = materials / ct,
              pk = log(pc / mean(pc)), pl = log(pl / mean(pl)) - pk,
              pm = log(pm / mean(pm)) - pk, ct = log(ct / mean(ct)) - pk,
              y = log(y / mean(y)), y2 = 0.5 * y ^ 2,
              ct = ct, pl2 = 0.5 * pl ^ 2,
              pm2 = 0.5 * pm ^ 2, plm = pl * pm)
```

We then create three formulas corresponding to the system of three equations (the cost function and the two factor shares):

```{r}
#| label: apples_equations
eq_ct <- ct ~ y + y2 + pl + pm + pl2 + plm + pm2
eq_sl <- sl ~ pl + pm
eq_sm <- sm ~ pl + pm
```

These equations can be estimated one by one using OLS, but in this case the cross-equation restrictions are ignored. The whole system can also be estimated directly by stacking the three vectors of responses and constructing a block-diagonal matrix of covariates, each block being the relevant set of covariates for one equation. We use for this purpose the **Formula** package [@ZEIL:CROI:10]\index[author]{Zeileis}\index[author]{Croissant}. This package extends usual formulas in two directions: first, several variables can be indicated on the left-hand side of the formula (using the `+` operator); second, several parts can be defined on both sides of the formula, using the `|` sign. For example:

```{r}
#| results: false
y_1 + y_2 | y_3 ~ x_1 + x_2 | x_3 | x_4 + x_5
```

This formula has two sets of responses, the first containing $y_1$ and $y_2$, and the second $y_3$. Three sets of covariates are defined on the right-hand side of the formula. For our production analysis, we first create a "meta" formula which contains the three responses on the left-hand side and the whole set of covariates on the right-hand side. Then, we extract, using `model.matrix`, the three model matrices for the three equations (`Z_c`, `Z_l` and `Z_m`, respectively for the cost, labor share and material share equations). The column names of these matrices are customized using the `nms_cols` function which, for example, turns the original column names of `Z_l` (`(Intercept)`, `pl` and `pm`) into `sl_cst`, `sl_pl` and `sl_pm`. We then construct the block-diagonal matrix (using the `Matrix::bdiag` function) and use our customized names:
\idxfun{Formula}{Formula}\idxfun{model.frame}{stats}\idxfun{model.matrix}{stats}\idxfun{paste}{base}\idxfun{bdiag}{Matrix}\idxfun{head}{utils}

```{r}
#| label: bdiag_covariates
library(Formula)
eq_sys <- Formula(ct + sl + sm ~ y + y2 + pl + pm + pl2 + plm + pm2)
mf <- model.frame(eq_sys, ap) ; Z_c <- model.matrix(eq_ct, mf)
Z_l <- model.matrix(eq_sl, mf) ; Z_m <- model.matrix(eq_sm, mf)
nms_cols <- function(x, label) paste(label, c("cst", colnames(x)[-1]), sep = "_")
nms_c <- nms_cols(Z_c, "cost") ; nms_l <- nms_cols(Z_l, "sl")
nms_m <- nms_cols(Z_m, "sm")
Zs <- Matrix::bdiag(Z_c, Z_l, Z_m) %>% as.matrix
colnames(Zs) <- c(nms_c, nms_l, nms_m)
head(Zs, 2)
```

`Formula::model.part` makes it possible to retrieve any part of the model. Here, we want to extract the three responses, which are on the single left-hand side of the formula. Therefore, we set `lhs` and `rhs` respectively to 1 and 0. The result is a data frame with three variables, `ct`, `sl` and `sm`. We then use `tidyr::pivot_longer` to stack the three responses in one column.
Note the use of the optional argument `cols_vary`, which is set to `"slowest"`, so that the elements are stacked columnwise:
\idxfun{model.part}{Formula}\idxfun{pivot\_longer}{tidyr}

```{r}
#| label: stack_responses
Y <- model.part(eq_sys, mf, rhs = 0, lhs = 1)
ys <- Y %>% pivot_longer(1:3, cols_vary = "slowest",
                         names_to = "equation", values_to = "response")
print(ys, n = 2)
```

The estimation can be performed using the response vector and the matrix of covariates:
\idxfun{lm}{stats}

```{r}
#| results: false
#| label: system_estimation_matrix
lm(ys$response ~ Zs - 1)
```

However, a nicer interface is obtained by constructing a tibble, binding the columns of `ys` and `Zs`, and then using the usual formula-data interface.
\idxfun{bind\_cols}{dplyr}

```{r}
stack_data <- ys %>% bind_cols(Zs)
stack_data %>% print(n = 2)
```

In order to avoid having to write the whole long list of covariates, the dot can be used on the right-hand side of the formula, which means, in this context, all the variables except the response (which is on the left-hand side of the formula). The intercept and the `equation` variable should be omitted from the regression.
\idxfun{lm}{stats}\idxfun{coef}{stats}

```{r}
#| label: system_estimation_tibble
#| collapse: true
ols_unconst <- lm(response ~ . - 1 - equation, stack_data)
ols_unconst %>% coef
```

### Constrained least squares {#sec-constrained_ls}

\index[general]{constrained least squares|(}
Linear restrictions on the vector of coefficients to be estimated can be represented using a matrix $R$ and a numeric vector $q$: $R\gamma = q$,^[See @sec-wald_test_example.] where $\gamma ^ \top = (\gamma_1 ^ \top, \ldots, \gamma_L ^ \top)$ is the stacked vector of the coefficients for the whole system of equations. The OLS estimator is now the solution of a constrained optimization problem. Denoting $\lambda$ a vector of Lagrange multipliers^[These multipliers are multiplied by 2 in order to simplify the first-order conditions.] and $\epsilon$ the errors, the objective function is:

$$
L = \epsilon^\top \epsilon + 2\lambda^\top(R\gamma-q)
$$

The Lagrangian can also be written as:

$$
L = y^\top y - 2 \gamma^\top Z^\top y + \gamma^\top Z^\top Z \gamma +2\lambda^\top (R\gamma-q)
$$

The first-order conditions are:

$$
\left\{
\begin{array}{rcl}
\frac{\partial L}{\partial \gamma}&=&-2Z^\top y + 2 Z^\top Z \gamma + 2R^\top \lambda =0\\
\frac{\partial L}{\partial \lambda}&=&2(R\gamma-q)=0
\end{array}
\right.
$$

which can also be written in matrix form as:

$$
\left(\begin{array}{cc}
Z^\top Z & R^\top \\
R & 0 \\
\end{array}\right)
\left(\begin{array}{c}
\gamma \\ \lambda
\end{array}\right)=
\left(\begin{array}{c}
Z^\top y\\ q
\end{array}\right)
$$

The constrained OLS estimator can be obtained using the formula for the inverse of a partitioned matrix:^[See @GREE:18\index[author]{Greene}, online appendix p. 1076.]
\index[general]{inverse of a partitioned matrix}

$$
\left(
\begin{array}{cc}
A_{11} & A_{12} \\
A_{21} & A_{22}
\end{array}
\right)^{-1}=
\left(
\begin{array}{cc}
B_{11} & B_{12} \\
B_{21} & B_{22}
\end{array}
\right)=
\left(
\begin{array}{cc}
A_{11}^{-1}(I+A_{12}F_2A_{21}A_{11}^{-1}) & - A_{11}^{-1}A_{12}F_2 \\
-F_2A_{21}A_{11}^{-1} & F_2
\end{array}
\right)
$$

with $F_2=\left(A_{22}-A_{21}A_{11}^{-1}A_{12}\right)^{-1}$. We have here $F_2=-\left(R(Z^\top Z)^{-1}R^\top\right)^{-1}$.
The constrained estimator is then $\hat{\gamma}_c=B_{11}Z^\top y+ B_{12}q$, with $B_{11} = (Z^\top Z)^{-1}\left(I-R^\top(R(Z^\top Z)^{-1}R^\top)^{-1}R(Z^\top Z)^{-1}\right)$ and $B_{12}=(Z^\top Z)^{-1}R^\top\left(R(Z^\top Z)^{-1}R^\top\right)^{-1}$. The unconstrained estimator being $\hat{\gamma}_{nc}=\left(Z^\top Z\right)^{-1}Z^\top y$, we finally get:

$$
\hat{\gamma}_c=\hat{\gamma}_{nc} - (Z^\top Z)^{-1}R^\top(R(Z^\top Z)^{-1}R^\top)^{-1}(R\hat{\gamma}_{nc}-q)
$$ {#eq-const_lm}

\index[general]{constrained least squares|)}
The difference between the constrained and the unconstrained estimators is therefore a linear combination of $R\hat{\gamma}_{nc}-q$, i.e., of the extent to which the unconstrained estimator violates the linear constraints. For the system of cost and factor share equations for apple production previously described, we have 14 coefficients and 6 restrictions:
\idxfun{matrix}{base}

```{r}
#| label: restrict_matrix_apples
R <- matrix(0, nrow = 6, ncol = 14)
R[1, c(4, 9)] <- R[2, c(5, 12)] <- R[3, c(6, 10)] <- R[4, c(7, 11)] <-
    R[5, c(7, 13)] <- R[6, c(8, 14)] <- c(1, -1)
```

For example, the first row of `R` returns the difference between the fourth coefficient (`cost_pl`) and the ninth coefficient (`sl_cst`). As the vector $q$ is 0 in our example, this means that the coefficient of $p_l^*$ in the cost equation should equal the intercept in the labor share equation. Applying @eq-const_lm, we get:
\idxfun{coef}{stats}\idxfun{crossprod}{base}\idxfun{solve}{base}\idxfun{drop}{base}\idxfun{model.matrix}{stats}

```{r}
#| label: constrained_ls_manual
#| collapse: true
excess <- R %*% coef(ols_unconst)
XpX <- crossprod(model.matrix(ols_unconst))
beta_c <- coef(ols_unconst) -
    drop(solve(XpX) %*% t(R) %*% solve(R %*% solve(XpX) %*% t(R)) %*% excess)
beta_c
```

More simply, `micsr::clm` \idxfun{clm}{micsr} can be used; it computes the constrained least squares estimator with, as arguments, a `lm` object (the unconstrained model) and `R`, the matrix of restrictions (and optionally a `q` vector):

```{r}
#| label: constrained_ls_clm
ols_const <- clm(ols_unconst, R)
```

and returns a `lm` object. Finally, the **systemfit** package [@HENN:HAMA:07]\index[author]{Henningsen}\index[author]{Hamann} is devoted to system estimation and provides a `systemfit`\idxfun{systemfit}{systemfit} function. Its main arguments are a list of equations and a data frame, but it also has `restrict.matrix` and `restrict.rhs` arguments to provide respectively $R$ and $q$.

```{r}
#| results: hide
#| label: system_est_systemfit
systemfit::systemfit(list(cost = eq_ct, labor = eq_sl, materials = eq_sm),
                     data = ap, restrict.matrix = R)
```

The output of `systemfit` is large and is not reproduced here. The full strength of this package will be presented in later chapters, when describing the seemingly unrelated regression (@sec-sur) and the three-stage least squares (@sec-three_sls) estimators.

<!-- Once the constrained least squares estimator has been computed, one can test the validity of the constraints, using @eq-wald_test_formula: -->
<!-- ```{r} -->
<!-- R %*% vcov(ols_const) %*% t(R) -->
<!-- ``` -->

\index[general]{system estimation|)}
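As a sanity check (a minimal sketch reusing the objects computed above; the chunk label is ours), the constrained coefficients satisfy the restrictions up to numerical precision, whereas the unconstrained ones don't:

```{r}
#| label: check_restrictions
#| collapse: true
# R %*% gamma should be (approximately) 0 for the constrained estimator
cbind(unconstrained = drop(R %*% coef(ols_unconst)),
      constrained = drop(R %*% beta_c))
```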