1 Simple linear regression model
Obviously, there will rarely be a deterministic relationship between a response and one or several covariates. Education has a positive causal effect on wage, but there are many other variables that affect wages: sex, ethnicity, experience, abilities, to name a few. Even if a large set of relevant covariates is observed, some are not (this is in particular the case for abilities). Therefore, we won’t try to model the value of \(y\) as a function of covariates \(x\) and some unknown parameters \(\gamma\): \(y = f(x, \gamma)\), but, more modestly, the conditional expectation of \(y\):
\[ \mbox{E}(y | x) = f(x, \gamma) \tag{1.1}\]
Back to our education / wage model, the conditional expectation is the mean value of the wage in the population for a given value of education. It is therefore a function of \(x\) which takes as many values as there are distinct values of \(x\).
In this chapter, we discuss the simplest econometric model, the simple linear regression model. It is a simple model because there is only one covariate \(x\), and a linear model because the \(f\) function is assumed to be linear. Equation 1.1 can then be rewritten as follows:
\[ \mbox{E}(y | x) = \alpha + \beta x \tag{1.2}\]
Section 1.1 details the notion of conditional expectation and its relation with the notion of covariance. Section 1.2 presents the model and the data set that will be used throughout this chapter. Section 1.3 is devoted to the computation of the ordinary least squares estimator. Section 1.4 presents the geometry of least squares. Section 1.5 explains how to compute the ordinary least squares estimator using R. Finally, Section 1.6 presents the notion of data generating process and explains how simulations can be usefully performed.
1.1 Conditional expectation and covariance
To estimate the two unknown parameters \(\alpha\) (the intercept) and \(\beta\) (the slope), we need a data set that contains different individuals for which \(x\) and \(y\) are observed. Typically, we’ll consider a random sample, which is a small subset of the whole population consisting of a set of individuals randomly drawn from this population.
Linear models: two examples
To understand what the hypothesis of the simple linear model implies, we consider two data sets. The first one is called birthwt (Mullahy 1997) and contains birth weights (birthwt, in ounces) of babies and the smoking habits of their mothers. The cigarettes variable is the number of cigarettes smoked per day during pregnancy (for about 85% of the mothers, it is equal to 0). We first compute a binary variable for smoking mothers:
birthwt <- birthwt %>% mutate(smoke = ifelse(cigarettes > 0, 1, 0))
and consider a simple linear regression model with birthwt as the response and smoke as a covariate. As the covariate takes only two values, so does the conditional expectation. In a linear regression model, Equation 1.2 will therefore return two values: \(\alpha\) for \(x = 0\) and \(\alpha + \beta\) for \(x = 1\). \(\alpha\) and \(\alpha + \beta\) are therefore the expected birth weights for non-smoking and smoking mothers, respectively. Note that, because we have as many parameters as values of the covariate, the linearity hypothesis is not restrictive: a straight line always fits the two conditional means exactly. Natural estimators of the conditional expectations are the conditional sample means, i.e., the average birth weights in the sample for non-smoking and smoking women:
cond_means <- birthwt %>%
group_by(smoke) %>%
summarise(birthwt = mean(birthwt))
cond_means
# A tibble: 2 × 2
smoke birthwt
<dbl> <dbl>
1 0 120.
2 1 111.
Therefore, our estimation of \(\alpha\) is \(\hat{\alpha} = 120.1\) and the estimation of \(\alpha + \beta\) is \(111.1\), so that \(\hat{\beta} = 111.1 - 120.1 = -8.9\) ounces, which is the estimated loss of birth weight caused by smoking during pregnancy. This can be illustrated using a scatterplot (see Figure 1.1), with smoke on the horizontal axis and birthwt on the vertical axis, all the points having only two possible abscissa values, 0 and 1. The conditional means are represented by square points, and we also drew the line that contains these two points. The intercept is therefore the y-value of the square point for \(x = 0\), which is \(\hat{\alpha} = 120.1\). The slope of the line is the ratio of the vertical and horizontal distances between the two square points, which are respectively \(\hat{\beta}\) and \(1 - 0 = 1\); the slope is therefore equal to \(\hat{\beta}\).
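As a quick check, both estimates can be recovered directly from the cond_means tibble computed above (a minimal sketch using only objects already defined):
alpha_hat <- cond_means %>% filter(smoke == 0) %>% pull(birthwt)
beta_hat <- (cond_means %>% filter(smoke == 1) %>% pull(birthwt)) - alpha_hat
c(alpha = alpha_hat, beta = beta_hat)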
Consider now the relationship between education and wage. The adoptees data set (Plug 2004) contains the number of years of education (educ) and the annual income (income, in thousands of US$) for 16,481 individuals. We restrict the sample to individuals with an education level between 12 (high school degree) and 15 years (bachelor’s degree).
adoptees <- adoptees %>%
filter(educ >= 12, educ <= 15)
and we compute the mean income for the four values of education:
adoptees %>% group_by(educ) %>%
summarise(income = mean(income))
# A tibble: 4 × 2
educ income
<dbl> <dbl>
1 12 58.9
2 13 68.8
3 14 70.5
4 15 78.4
The scatterplot is presented in Figure 1.2:
This time, we have four distinct values of the covariate and we can estimate four conditional sample means, but there are only two parameters to estimate in the simple linear model. Therefore, we can’t estimate \(\alpha\) and \(\beta\) directly from the conditional means, except in the improbable case where the four conditional means lie on a straight line, which would mean that, for one additional year of education, the average wage increases by exactly the same amount. We can see in Figure 1.2 that this is not the case: the increase in income is 9.9 for a 13th year of education, 1.7 for a 14th year and 8 for a 15th year. We therefore need a formal estimation method which enables us to obtain values of \(\hat{\alpha}\) and \(\hat{\beta}\). The remainder of this chapter is devoted to the presentation of this estimator, which is called the ordinary least squares (OLS) estimator.
Linear model, conditional expectation and covariance
It is important at this point to understand the relation between the conditional expectation and the covariance. The covariance between \(x\) and \(y\) is, denoting \(\mu_x = \mbox{E}(x)\) and \(\mu_y = \mbox{E}(y)\):
\[ \sigma_{xy} = \mbox{cov}(x, y) = \mbox{E}\left[(x - \mu_x)(y - \mu_y)\right] = \mbox{E}(xy) - \mu_x\mu_y \] It is the expectation of the product of the two variables in deviations from their respective expectations, or, equivalently, the expectation of the product minus the product of the expectations. The expectation is taken over the two variables \(x\) and \(y\), and the law of iterated expectations states that it can be written as:
\[ \mbox{cov}(x, y) = \mbox{E}_x\left[\mbox{E}_y\left((y - \mu_y)(x - \mu_x)\mid x\right)\right] = \mbox{E}_x\left[(x - \mu_x)\mbox{E}_y(y - \mu_y\mid x)\right] \]
Therefore, the covariance between \(x\) and \(y\) is also the covariance between \(x\) and the conditional expectation of \(y\) (which is a function of \(x\)).
As an example, consider the case where \(y\) is the hourly wage and \(x\) the education level, which takes the values 0 (high school), 3 (bachelor) and 5 (master), the frequencies of the three levels of education being 0.4, 0.3 and 0.3. Let us denote \(h\), \(b\) and \(m\) these three education levels and assume that \(\mbox{E}(y\mid x) = 8 + 0.5 x\), which means that one more year of education increases the expected wage by $0.5. The expected level of education is \(\mu_x = 0.4 \times 0 + 0.3 \times 3 + 0.3 \times 5 = 2.4\). The expected wages for the three levels of education are \(\mu_{yh} = 8\), \(\mu_{yb} = 9.5\) and \(\mu_{ym} = 10.5\). The expected wage is then \(\mu_y = 0.4 \times 8 + 0.3 \times 9.5 + 0.3 \times 10.5 = 9.2\). We first compute the first term of the covariance between wage and education, for \(x = x_h = 0\):
\[ \begin{array}{rcl} \mbox{E}\left[ (x - \mu_x)(y - \mu_y)\mid x = x_h\right] &=& \mbox{E}\left[ (x_h - \mu_x)(y - \mu_y)\mid x = x_h\right] \\ &=& (x_h - \mu_x) \mbox{E}\left[ (y - \mu_y)\mid x = x_h\right] \\ &=& (x_h - \mu_x)(\mbox{E}(y\mid x = x_h) - \mu_y)\\ &=& (0 - 2.4) \times (8 - 9.2) = 2.88 \end{array} \tag{1.3}\]
Similarly, we get \((3 - 2.4) \times (9.5 - 9.2) = 0.18\) for the bachelor level and \((5 - 2.4)\times (10.5 - 9.2) = 3.38\) for the master level. Finally, the covariance is obtained as the weighted average of these three terms:
\[ \sigma_{xy} = 0.4 \times 2.88 + 0.3 \times 0.18 + 0.3 \times 3.38 = 2.22 \]
Now consider the case where the conditional expectation of the hourly wage is the same for the three education levels: \(\mbox{E}(y \mid x = x_h) = \mbox{E}(y \mid x = x_b) = \mbox{E}(y \mid x = x_m)\). These three conditional expectations are then equal to the unconditional expectation \(\mu_y\), Equation 1.3 is clearly 0 and so is the covariance. Therefore, a constant conditional expectation of \(y\) given \(x\) implies that the covariance is 0. Note that the converse is not true. For example, if \(\mu_{yh} = 8\), \(\mu_{yb} = 13.2\) and \(\mu_{ym} = 6.8\), the relation between the conditional mean of the wage and the level of education is no longer linear, and it is not monotonic but inverse U-shaped (the wage first increases with education and then decreases). The unconditional mean is still \(\mu_y = 0.4 \times 8 + 0.3 \times 13.2 + 0.3 \times 6.8 = 9.2\), but the covariance is then:
\[ \sigma_{xy} = 0.4 \times (0-2.4) \times (8 - 9.2) + 0.3 \times (3-2.4) \times (13.2 - 9.2) + 0.3 \times (5 - 2.4) \times (6.8 - 9.2) = 0 \]
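Both computations can be checked with a few lines of R (a minimal sketch; the vectors below simply restate the frequencies and conditional expectations of the example):
f <- c(0.4, 0.3, 0.3)               # frequencies of the three education levels
xlev <- c(0, 3, 5)                  # years of education
mu_x <- sum(f * xlev)               # 2.4
ey_lin <- 8 + 0.5 * xlev            # linear case: 8, 9.5, 10.5
ey_hump <- c(8, 13.2, 6.8)          # inverse U-shaped case
mu_y <- sum(f * ey_lin)             # 9.2 (the same in both cases)
sum(f * (xlev - mu_x) * (ey_lin - mu_y))   # 2.22
sum(f * (xlev - mu_x) * (ey_hump - mu_y))  # 0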
Therefore, a 0 covariance doesn’t imply that the conditional mean is constant. However, in the special case where the conditional mean is a linear function of the conditioning variable, the two properties (zero covariance and constant conditional mean) are equivalent.
Consider now the sample counterpart of this theoretical covariance. Consider a sample of 10 individuals for which the levels of education and the hourly wages are:
| education | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 5 | 5 | 5 |
|-----------|---|---|---|---|---|-----|----|----|------|----|
| wage | 8 | 7 | 8 | 9 | 9 | 9.5 | 10 | 10 | 10.5 | 11 |
The sample means of \(x\) and \(y\) are \(\bar{x} = 2.4\) and \(\bar{y} = 9.2\). The conditional means of \(y\) for a given value of \(x\) in the sample are simply the means for the three subsamples defined by an education level:
- \(x_h=0\): \(\bar{y}_h = \frac{8+7+8+9}{4} = 8\),
- \(x_b=3\): \(\bar{y}_b = \frac{9+9.5+10}{3} = 9.5\),
- \(x_m=5\): \(\bar{y}_m = \frac{10 + 10.5 + 11}{3} = 10.5\),
and the covariance is obtained as the mean of the products of the two variables in deviations from their means, or equivalently as the difference between the mean of the products and the product of the means:1
\[ \hat{\sigma}_{xy} = \frac{\sum_{n=1}^N (x_n - \bar{x}) (y_n - \bar{y})}{N}=\frac{\sum_{n=1}^N x_n y_n}{N}-\bar{x}\bar{y} \]
The sum of the products is \(\sum_n x_n y_n = 243\) and the covariance is then:
\[ \hat{\sigma}_{xy}=\frac{243}{10} - 2.4 \times 9.2 = 2.22 \] We can also consider the three subsamples defined by education levels, denoting \(n=1, ..., H\), \(n=H+1, ..., H+B\) and \(n=H+B+1, ..., N\) (with \(H=4\), \(B=3\), \(M = N - H - B = 3\) and \(N=10\)) the observations for, respectively, a high-school, a bachelor and a master level. We then have:
\[ \hat{\sigma}_{xy} = \frac{\sum_{n=1}^H (x_n - \bar{x})(y_n - \bar{y}) + \sum_{n=H+1}^{H+B} (x_n - \bar{x})(y_n - \bar{y}) + \sum_{n=H+B+1}^N (x_n - \bar{x})(y_n - \bar{y})}{N} \] For example, for \(n=1, ..., H\), the value of \(x\) is the same: \(x_n=x_h=0\). Therefore, we can write \(\sum_{n=1}^H (x_n - \bar{x})(y_n -\bar{y}) = (x_h - \bar{x}) \sum_{n=1}^H (y_n -\bar{y})\) and more generally:
\[ \hat{\sigma}_{xy} = \frac{(x_h - \bar{x})\sum_{n=1}^H (y_n - \bar{y}) + (x_b - \bar{x})\sum_{n=H+1}^{H+B} (y_n - \bar{y}) + (x_m - \bar{x})\sum_{n=H+B+1}^N (y_n - \bar{y})}{N} \]
Moreover \(\frac{\sum_{n=1}^H (y_n - \bar{y})}{H}=(\bar{y}_h-\bar{y})\) and then:
\[ \hat{\sigma}_{xy} = \frac{(x_h - \bar{x})H(\bar{y}_h-\bar{y})+ (x_b - \bar{x})B(\bar{y}_b -\bar{y})+ (x_m - \bar{x})M(\bar{y}_m - \bar{y})}{N} \]
Denoting \(f_h = \frac{H}{N}\), \(f_b = \frac{B}{N}\) and \(f_m=\frac{M}{N}\) the empirical frequencies of the three education levels in the sample:
\[ \hat{\sigma}_{xy} = f_h(x_h - \bar{x})(\bar{y}_h-\bar{y})+ f_b(x_b - \bar{x})(\bar{y}_b -\bar{y})+ f_m(x_m - \bar{x})(\bar{y}_m - \bar{y}) \]
Finally, denoting \(k=h, b, m\) the education levels:
\[ \hat{\sigma}_{xy}=\hat{\sigma}_{x\bar{y}_x}= \sum_k f_k (x_k- \bar{x})(\bar{y}_k-\bar{y}) \]
Therefore, the covariance between \(x\) and \(y\) is also the covariance between \(x\) and the mean values of \(y\) for given values of \(x\). We have here: \(\bar{y}_h = 8\), \(\bar{y}_b = 9.5\) and \(\bar{y}_m = 10.5\). The covariance is then:
\[ \hat{\sigma}_{xy} = 0.4 \times (0 - 2.4) \times (8 - 9.2) + 0.3 \times (3 - 2.4) \times (9.5 - 9.2) + 0.3 \times (5 - 2.4) \times (10.5 - 9.2) = 2.22 \]
Consider now the case where the mean wage is the same for every education level: \(\bar{y}_h=\bar{y}_b=\bar{y}_m=\bar{y}=9.2\). In this case, the covariance is obviously 0 because \(\bar{y}_k-\bar{y}=0\;\forall k\).
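The sample covariance can be checked the same way, either directly or through the conditional means (a minimal sketch; the two vectors reproduce the small table above):
x10 <- c(0, 0, 0, 0, 3, 3, 3, 5, 5, 5)
y10 <- c(8, 7, 8, 9, 9, 9.5, 10, 10, 10.5, 11)
mean(x10 * y10) - mean(x10) * mean(y10)      # 2.22
ybar_k <- tapply(y10, x10, mean)             # conditional means: 8, 9.5, 10.5
f_k <- as.numeric(table(x10)) / length(x10)  # frequencies: 0.4, 0.3, 0.3
sum(f_k * (sort(unique(x10)) - mean(x10)) * (ybar_k - mean(y10)))  # 2.22 again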
1.2 Model and data set
We’ll consider in this chapter the question of mode shares for inter-urban transportation. More precisely, considering that a trip can be made using one of two transport modes (air and rail), how can we model the market shares of both modes? We’ll use in this section a popular model in transportation economics, the price-time model. The price_time data set contains aggregate data about rail and air transportation between Paris and 13 French towns in 1995; it is reproduced from Bonnel (2004), pp. 364-366.
price_time %>% print(n = 3)
# A tibble: 13 × 7
town trafic_rail trafic_air price_rail price_air time_rail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bordeaux 2005 1400 48.3 82.6 242
2 Brest 471 428 51.6 98.6 308
3 Clermont 429 243 32.7 89.4 266
# ℹ 10 more rows
# ℹ 1 more variable: time_air <dbl>
For the sake of simplicity, we’ll use shorter names for the variables:
prtime <- price_time %>%
set_names(c("town", "qr", "qa", "pr", "pa", "tr", "ta"))
Variables are prices (pr and pa) in euros, transport times (tr and ta) in minutes and thousands of trips (qr and qa) for the two modes (r for rail and a for air). We first compute the market share of rail:
prtime <- mutate(prtime, sr = qr / (qr + qa))
prtime %>% pull(sr) %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.136 0.339 0.524 0.555 0.868 0.943
Rail’s market share exhibits huge variations in the sample, ranging from 14% to 94%. For an individual, the relevant cost of a trip is the generalized cost, which is the sum of the monetary cost and the value of the travel time. Denoting \(h^i\) the time value of individual \(i\), in euros per hour, the generalized costs for the two modes are:
\[ \left\{ \begin{array}{rcl} c_{a} ^ i &=& p_{a} + h ^ i t_{a} / 60\\ c_{r} ^ i &=& p_{r} + h ^ i t_{r} / 60\\ \end{array} \right. \tag{1.4}\]
The plane is typically faster and more expensive than the train, which means that, in the time-value / generalized-cost plane, the generalized cost of rail is represented by a line with a lower intercept (the rail fare is lower) and a higher slope (the travel time is longer) than the one corresponding to air. Generalized costs for both modes and for the two towns of Bordeaux and Nice are presented in Figure 1.3.
Every individual will choose the mode with the lowest generalized cost. For example, \(i\) will choose the train if \(c_r^i < c_a^i\). This choice depends on the individual value of time: an individual with a high value of time will choose the plane, whereas an individual with a lower value of time will choose the train. For given values of prices and travel times, one can compute the value of travel time \(h^*\) which equates the generalized costs of the two modes in Equation 1.4:
\[ h^* = 60\frac{p_a - p_r}{(t_r - t_a)} \]
For Bordeaux and Nice, these time values are respectively 26.7 and 9 euros per hour. Nice is actually very far from Paris and only people with a very low value of time would spend 7.5 hours in the train instead of taking a plane. We now compute this threshold value for every city:
prtime <- mutate(prtime, h = (pa - pr) / ( (tr - ta) / 60) )
prtime %>% pull(h) %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 15.9 20.4 59.0 67.2 314.3
There are huge variations in the threshold value of time, which ranges from 9 (Nice) to 314 (Nantes) euros per hour. Before considering a theoretical model that links the market share of rail to the threshold value of time, let’s take a first look at this relationship using the scatterplot in Figure 1.4.
The relationship between the threshold value of time and rail’s market share seems approximately linear, except for cities where the market share of train is very high (more than 75%). For now, we’ll remove these four cities from the sample and plot on Figure 1.5 the scatterplot for this restricted sample.2
prtime <- filter(prtime, sr < 0.75)
Now, we consider the distribution of the values of time. If \(h\) follows a given distribution between \(a\) and \(b\), train’s market share is the share of the population for which the value of time is between \(a\) and \(h^*\) (and plane’s market share is the share of the population for which the value of time is between \(h^*\) and \(b\)). The simplest probability distribution is the uniform distribution, which is defined by a constant density equal to \(\frac{1}{b-a}\) between \(a\) and \(b\). It is represented in Figure 1.6. The area of the rectangle of width \([a,b]\) and height \([0, \frac{1}{b-a}]\) is 100%, because the whole population has a time value between \(a\) and \(b\). This rectangle has two components:
- a first rectangle of width \([a, h ^ *]\), which includes people whose time value is below \(h^*\) and who therefore take the train,
- a second rectangle of width \([h ^*, b]\), which includes people whose time value is higher than \(h^*\) and who therefore take the plane.
Stated differently: \(s_r = \frac{h^* - a}{b - a} = -\frac{a}{b - a} + \frac{1}{b-a} h^*\), and this model therefore predicts a linear relationship between \(h^*\) and \(s_r\), the intercept being \(-\frac{a}{b - a}\) and the slope \(\frac{1}{b - a}\). Of course, rail’s market share depends on other variables than the threshold value of time, so that the linear relationship concerns the conditional expectation of rail’s market share. With \(y=s_r\) and \(x=h^*\), we therefore have a linear model of the form \(\mbox{E}(y | x) = \alpha + \beta x\). Moreover, the two parameters to be estimated, \(\alpha\) and \(\beta\), are functions of the structural parameters of the model, \(a\) and \(b\), which are the minimum and the maximum of the value of time.
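To fix ideas, here is a small sketch of the mapping between the structural parameters \((a, b)\) and the regression parameters \((\alpha, \beta)\), using purely hypothetical values of \(a\) and \(b\):
a0 <- 1; b0 <- 40                    # hypothetical minimum and maximum value of time
alpha0 <- - a0 / (b0 - a0); beta0 <- 1 / (b0 - a0)
c(alpha = alpha0, beta = beta0)
c(a = - alpha0 / beta0, b = 1 / beta0 - alpha0 / beta0)   # recovers a0 and b0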
1.3 Computation of the OLS estimator
The model we seek to estimate is: \(\mbox{E}(y_n \mid x_n) = \alpha+\beta x_n\). The difference between the observed value of \(y\) and its conditional expectation is called the error for observation \(n\):
\[ y_n - \mbox{E}(y_n \mid x_n) = \epsilon_n \]
The linear regression model can therefore be rewritten as:
\[ y_n = \mbox{E}(y_n \mid x_n) + \epsilon_n = \alpha + \beta x_n + \epsilon_n \]
\(\epsilon_n\) is the error for observation \(n\) when the values of the unknown parameters \((\alpha, \beta)\) are set to their true values \((\alpha_o, \beta_o)\). For given values of \((\alpha, \beta)\) obtained using an estimation method, the corresponding difference \(y_n - \alpha - \beta x_n\) will be called the residual for observation \(n\). The residual for an observation is therefore the vertical distance between the point for observation \(n\) and the regression line.
We seek to draw a straight line that is as close as possible to all the points of our sample, as in Figure 1.5. In the simple linear regression model, the distance between a point and the line is defined by the vertical distance, which is the residual for this observation. For the whole sample, we need to aggregate these individual measures of distance. Simply summing them is not an option: as there are positive and negative residuals, the sum may be very close to zero even if the individual residuals are very large in absolute value. One solution would be to use the sum of the absolute values of the residuals,3 but with the OLS estimator, we consider the sum of the squares of the residuals (also called the residual sum of squares, RSS). Taking squares, like taking absolute values, removes the sign of the individual residuals, and it results in an estimator which has nice mathematical and statistical properties. We’ll therefore consider a function \(f\) which depends on the values of the response and the covariate in the sample (two vectors \(x\) and \(y\) of length \(N\)) and on two unknown parameters \(\alpha\) and \(\beta\), respectively the intercept and the slope of the regression line:
\[ f(\alpha, \beta |x, y)=\sum_{n = 1} ^ N (y_n - \alpha - \beta x_n) ^ 2 \]
Note that we write \(f\) as a function of the two unknown parameters conditional on the values of \(x\) and \(y\) for a given sample. First-order conditions for the minimization of \(f\) are:
\[ \left\{ \begin{array}{rcl} \displaystyle \frac{\partial f}{\partial \alpha} &=& -2 \sum_{n = 1} ^ N \left(y_n - \alpha - \beta x_n\right) = 0 \\ \displaystyle \frac{\partial f}{\partial \beta} &=& -2\sum_{n=1}^N x_n\left(y_n-\alpha-\beta x_n\right)=0 \end{array} \right. \tag{1.5}\]
Or, dividing by \(-2\):
\[ \sum_{n = 1} ^ N\left(y_n - \alpha -\beta x_n\right) = \sum_{n = 1} ^ N \epsilon_n = 0 \tag{1.6}\]
\[ \sum_{n = 1} ^ N x_n\left(y_n - \alpha - \beta x_n\right) = \sum_{n = 1} ^ N x_n \epsilon_n = 0 \tag{1.7}\]
Equation 1.6 indicates that the sum (or the mean) of the residuals in the sample is 0. Dividing this expression by \(N\) also implies that, denoting \(\bar{y}\) and \(\bar{x}\) the sample means of the response and of the covariate:
\[ \bar{y} = \alpha + \beta \bar{x}, \tag{1.8}\]
which means that the point of sample means \((\bar{x}, \bar{y})\) is on the regression line. Denoting \(\hat{\epsilon}_n\) the residuals of the OLS estimator, Equation 1.7 states that \(\sum_n x_n \hat{\epsilon}_n / N=0\), i.e., that the average cross-product of the covariate and the residuals is 0. But, as the sample mean of the residuals \(\bar{\hat{\epsilon}}\) is 0, this expression is also the covariance between the covariate and the residuals:
\[ \hat{\sigma}_{x\hat{\epsilon}} = \frac{\sum_{n = 1} ^ N (x_n - \bar{x})(\hat{\epsilon}_n - \bar{\hat{\epsilon}})}{N} = \frac{\sum_{n = 1} ^ N x_n \hat{\epsilon}_n}{N} - \bar{x}\bar{\hat{\epsilon}}= \frac{\sum_{n = 1} ^ N x_n \hat{\epsilon}_n}{N}= 0 \]
which means that the regression line is such that there is no correlation between the covariate and the residuals in the sample. Subtracting \(x_n(\bar{y} - \alpha - \beta \bar{x})\), which is 0 (see Equation 1.8), from each term of Equation 1.7, one gets:
\[ \sum_{n = 1}^N x_n\left[\left(y_n - \bar{y}\right)-\beta \left(x_n - \bar{x}\right)\right] = 0 \]
Moreover, \(\sum_{n = 1} ^ N \bar{x}\left[\left(y_n - \bar{y}\right) - \beta\left(x_n - \bar{x}\right)\right] = 0\) and so:
\[ \sum_{n=1} ^ N \left(x_n - \bar{x}\right)\left[\left(y_n - \bar{y}\right) - \beta\left(x_n - \bar{x}\right)\right] = 0 \tag{1.9}\]
Solving for \(\beta\), we finally get the estimator of the slope:
\[ \hat{\beta} = \frac{\sum_{n = 1} ^ N (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n = 1} ^ N (x_n - \bar{x}) ^ 2} = \frac{S_{xy}}{S_{xx}}= \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x ^ 2}= \hat{\rho}_{xy} \frac{\hat{\sigma}_y}{\hat{\sigma}_x} \tag{1.10}\]
Equation 1.10 gives three formulations for this estimator:
- the first indicates that it is the ratio of the covariation of \(x\) and \(y\): \(S_{xy}=\sum_{n = 1} ^ N (x_n - \bar{x})(y_n - \bar{y})\) and the variation of \(x\): \(S_{xx}= \sum_{n = 1} ^ N (x_n - \bar{x}) ^ 2\),
- the second is obtained by dividing both sides of the ratio by the sample size, so that the estimator is now the ratio of the sample covariance between \(x\) and \(y\) and the sample variance of \(x\),
- the third is obtained by introducing the coefficient of correlation between \(x\) and \(y\): \(\hat{\rho}_{xy} =\frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x\hat{\sigma}_y}\), so that the estimator is also expressed as the product of the coefficient of correlation and the ratio of the standard deviations.
This last formulation is particularly intuitive: \(\hat{\rho}_{xy}\) is a pure measure of the correlation between the covariate and the response. This number has no unit and lies in the \([-1, +1]\) interval, a value of \(-1\) (\(+1\)) indicating a perfect negative (positive) correlation and a value of 0 no correlation. The ratio of the standard deviations gives the relevant unit to the slope, which is the unit of \(y\) divided by the unit of \(x\). With this estimator of the slope in hand, we easily get the estimator of the intercept using Equation 1.8:
\[ \hat{\alpha}=\bar{y}-\hat{\beta} \bar{x} \tag{1.11}\]
Consider now that, prior to the estimation, we subtracted from the covariate and the response their sample means. We would then use \(\tilde{y}_n = y_n - \bar{y}\) and \(\tilde{x}_n = x_n - \bar{x}\) as the response and the covariate. In this case, as the means of these two transformed variables are zero, the intercept is 0 (from Equation 1.11) and the slope is simply \(\sum \tilde{y}_n \tilde{x}_n / \sum \tilde{x}_n ^ 2\).
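With the price-time data, this can be checked using the dplyr verbs already introduced (a minimal sketch; the result should equal the OLS slope computed in Section 1.5):
prtime %>%
  mutate(xt = h - mean(h), yt = sr - mean(sr)) %>%
  summarise(slope = sum(xt * yt) / sum(xt ^ 2))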
Once the parameters of the regression line have been computed, one can define the fitted value for observation \(n\) as the value returned by the regression line for \(x_n\), which is:
\[ \hat{y}_n = \hat{\alpha} + \hat{\beta} x_n \]
By definition, we have for the fitted model: \(y_n = \hat{y}_n + \hat{\epsilon}_n\). Note that, for the “true” model, we have \(y_n = \mbox{E}(y | x = x_n) + \epsilon_n\), so that the residuals are an estimation of the errors and the fitted values \(\hat{y}_n\) are an estimation of the conditional expectations of \(y\). Moreover, denoting \(\bar{\hat{y}}\) the sample mean of the fitted values:
\[ \frac{\sum_n (\hat{y}_n - \bar{\hat{y}})(\hat{\epsilon}_n - \bar{\hat{\epsilon}})}{N}= \frac{\sum_n \hat{y}_n \hat{\epsilon}_n}{N} - \bar{\hat{y}}\bar{\hat{\epsilon}} = \frac{\sum_n \hat{y}_n \hat{\epsilon}_n}{N} = \frac{\sum_n (\hat{\alpha} + \hat{\beta} x_n) \hat{\epsilon}_n}{N} = 0 \]
Therefore, there is no correlation between \(\hat{y}\) and \(\hat{\epsilon}\), which is clear as \(\hat{y}\) is a linear function of \(x\) and \(x\) is uncorrelated with \(\hat{\epsilon}\). \((\hat{\alpha}, \hat{\beta})\) is an optimum of the objective function. To check that this optimum is a minimum, we have to compute the second derivatives. From Equation 1.5, we get: \(\frac{\partial^2 f}{\partial \alpha ^ 2}= 2N\), \(\frac{\partial^2 f}{\partial \beta ^ 2}= 2\sum_n x_n ^ 2\) and \(\frac{\partial^2 f}{\partial \alpha \partial \beta}= 2\sum_n x_n\). We need, for a minimum, positive direct second derivatives, which is obviously the case, and also a positive determinant of the matrix of second derivatives:
\[ D = \frac{\partial^2 f}{\partial \alpha ^ 2}\frac{\partial^2 f}{\partial \beta ^ 2} - \left(\frac{\partial^2 f}{\partial \alpha \partial \beta}\right)^2 = 4 N \sum_n x_n ^ 2 - 4 \left(\sum_n x_n\right) ^ 2 = 4N^2\left(\frac{\sum_n x_n ^ 2}{N} - \bar{x}^2\right) > 0 \]
which is the case as the term in brackets is the variance of \(x\) and is therefore positive.
1.4 Geometry of least squares, variance decomposition and coefficient of determination
The OLS estimator relies on variances and covariances of several observable (\(y\) and \(x\)) and computed (\(\hat{y}\), \(\hat{\epsilon}\)) variables. Its properties can be nicely illustrated using vector algebra, each variable being represented by a vector, and by plotting these vectors.4
Vectors, variance and covariance
Every variable \(z\) used in a regression is a vector of \(\mathbb{R}^N\), i.e., a set of \(N\) real values: \(z^\top = (z_1, z_2, \ldots, z_N)\). The length (or norm) of the vector is \(\|z\| = \sqrt{\sum_{n=1}^N z_n ^ 2}\). Remember that the OLS estimator can always be computed with data measured in deviations from their sample means. Then, \(\|z\| ^ 2 / N\) is the variance of the variable, or, equivalently, the norm of the vector is \(\sqrt{N}\) times the standard deviation of the corresponding variable. The inner (or scalar) product of two vectors is denoted by \(z ^ \top w = w ^ \top z = \sum_{n=1} ^ N z_n w_n\) (note that the inner product is commutative). For variables expressed in deviations from their respective means, it is, up to the \(1/N\) factor, the covariance between the two variables. Denoting \(\theta\) the angle formed by the two vectors, we also have:5 \(z ^ \top w = \cos \theta \|z\| \|w\|\).
Consider as an example: \(x = ( 4 , 3 )\), \(z = ( 4.5 , 6 )\) and \(w = ( -6 , 4.5 )\). The three vectors are plotted in Figure 1.7. The norm of \(x\) is \(\|x\|=\sqrt{ 4 ^ 2 + 3 ^ 2} = 5\). Similarly, \(\|z\| = 7.5\) and \(\|w\| = 7.5\). The cosine of the angle formed by \(x\) with the horizontal axis is \(\cos \theta_x = 4 / 5 = 0.8\) and the one formed by \(z\) is \(\cos \theta_z = 4.5 / 7.5 = 0.6\). The angle formed by \(x\) and \(z\) is therefore \(\theta = \arccos 0.6 - \arccos 0.8 = 0.284\), with \(\cos 0.284 = 0.96\). We can then check that \(z ^ \top x = 4 \times 4.5 + 3 \times 6 = 36\), which is equal to \(\cos \theta \|z\| \|x\| = 0.96 \times 7.5 \times 5\). As the absolute value of \(\cos \theta\) is necessarily lower than or equal to 1, the inner product of two vectors is lower than the product of the norms of the two vectors, and \(\cos \theta = \frac{x ^ \top z}{\|x\| \|z\|}\) is the ratio of the covariance between \(x\) and \(z\) to the product of their standard deviations, which is the coefficient of correlation between the two underlying variables \(x\) and \(z\). Consider now \(z\) and \(w\). Their inner product is \(z ^ \top w = 4.5 \times -6 + 6 \times 4.5 = 0\). This is because \(z\) and \(w\) are two orthogonal vectors, which means that the two underlying variables are uncorrelated.
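These computations are easy to reproduce in R (a minimal sketch):
x <- c(4, 3); z <- c(4.5, 6); w <- c(-6, 4.5)
sqrt(sum(x ^ 2))                                    # 5, the norm of x
sum(x * z)                                          # 36, the inner product of x and z
sum(x * z) / (sqrt(sum(x ^ 2)) * sqrt(sum(z ^ 2)))  # 0.96, the cosine of the angle
sum(z * w)                                          # 0: z and w are orthogonal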
Geometry of least squares
The geometry of the simple linear regression model is represented in Figure 1.8. With \(N = 2\), \(x\) and \(y\) are two vectors in a plane. For the “true” model, the \(y\) vector is the sum of two vectors: \(\beta x\) (which is the conditional expectation of \(y\)) and \(\epsilon\), the vector of errors. For the estimated model, \(y\) is the sum of the fitted values \(\hat{y} = \hat{\beta} x\) and the residuals \(\hat{\epsilon}\). Using the OLS estimator, we minimize the sum of squares of the residuals, i.e., the norm of the \(\hat{\epsilon}\) vector. Obviously, this implies that \(\hat{\epsilon}\) should be orthogonal to \(\hat{y}\) and therefore to \(x\), which implies that the residuals are uncorrelated with the fitted values (and with the covariate) in the sample. Note also that, except in the unlikely case where \(\hat{\beta} = \beta\), \(\|\hat{\epsilon}\| < \|\epsilon\|\), which means that the residuals have a smaller variance than the errors. Note finally that what determines \(\hat{y}\) and \(\hat{\epsilon}\) is not \(x\) per se, but the subspace defined by it, in our case the horizontal straight line. For example, consider the regression of \(y\) on \(z = 0.5 x\); we would then obtain exactly the same values for \(\hat{y}\) and \(\hat{\epsilon}\), the only difference being that the estimated slope would be twice as large.
Variance decomposition and the \(R^2\)
For one observation \(n\), we have:
\[ y_n - \bar{y}=\left(y_n - \hat{y}_n\right)+\left(\hat{y}_n - \bar{y}\right) \tag{1.12}\]
The difference between \(y\) for individual \(n\) and the sample mean is therefore the sum of:
- a residual variation: \(\left(y_n - \hat{y}_n\right) = \hat{\epsilon}_n\),
- an explained variation: \(\hat{y}_n - \bar{y}=\hat{\beta}(x_n - \bar{x})\).
Taking the square of Equation 1.12 and summing for all \(n\), we get:
\[ \begin{array}{rcl} \sum_{n = 1} ^ N(y_n - \bar{y})^2 & = & \sum_{n = 1}^ N \left(\hat{\epsilon}_n + \hat{\beta}(x_n - \bar{x})\right) ^ 2 \\ & = & \sum_{n = 1} ^ N \hat{\epsilon}_n ^ 2 + \hat{\beta} ^ 2 \sum_{n = 1} ^ N (x_n - \bar{x}) ^ 2 \\ & + & 2 \hat{\beta}\sum_{n = 1} ^ N \hat{\epsilon}_n x_n - 2\hat{\beta}\bar{x}\sum_{n = 1} ^ N \hat{\epsilon}_n \end{array} \]
But \(\sum_{n = 1} ^ N \hat{\epsilon}_n x_n = 0\) (Equation 1.7) and \(\sum_{n = 1} ^ N \hat{\epsilon}_n = 0\) (Equation 1.6), so that:
\[ \sum_{n = 1} ^ N (y_n - \bar{y}) ^ 2 = \hat{\beta} ^ 2 \sum_{n = 1} ^ N (x_n - \bar{x}) ^ 2+\sum_{n = 1} ^ N \hat{\epsilon}_n ^ 2 \tag{1.13}\]
This equation indicates that the total sum of squares (TSS) of the response equals the sum of the explained sum of squares (ESS) and the residual sum of squares (RSS). This latter term is also called the deviance, and it is the objective function for the OLS estimator. Dividing by \(N\), we also have on the left-hand side the variance of \(y\) and on the right-hand side the sum of the variances of \(\hat{y}\) and \(\hat{\epsilon}\). This is the formula of the variance decomposition of the response. It can be easily understood using Figure 1.8. It is clear from this figure that \(y = \hat{y} + \hat{\epsilon}\) and that \(\hat{y}\) and \(\hat{\epsilon}\) are orthogonal. Therefore, applying the Pythagorean theorem, we have \(\|y\|^2 = \|\hat{y}\|^2 + \|\hat{\epsilon}\|^2\). Up to the \(1/N\) factor, if the response is measured in deviation from its sample mean, we have on the left the total variance of \(y\) and on the right the sum of the variances of \(\hat{y}\) and \(\hat{\epsilon}\). Note that this is a unique feature of the ordinary least squares estimator. Any other estimator will generally result in a vector of residuals which won’t be orthogonal to \(\hat{y}\) (or to \(x\)), and therefore Equation 1.13 won’t apply.
The coefficient of determination, denoted by \(R^2\), measures the share of the variance of the response which is explained by the model, or one minus the share of the residual variation. We then have:
\[ R^2=\frac{\hat{\beta}^2\sum_{n=1}^N(x_n-\bar{x})^2}{\sum_{n=1}^N (y_n-\bar{y})^2} = 1 - \frac{\sum_{n=1}^N\hat{\epsilon}_n^2}{\sum_{n=1}^N (y_n-\bar{y})^2} \tag{1.14}\]
Using Equation 1.10, we finally get:
\[ R^2=\left[\frac{\sum_{n=1}^N(x_n-\bar{x})(y_n-\bar{y})}{\sum_{n=1}^N (x_n-\bar{x})^2}\right]^2 \frac{\sum_{n=1}^N(x_n-\bar{x})^2}{\sum_{n=1}^N (y_n-\bar{y})^2} = \left[\frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x ^ 2}\right] ^ 2 \left[\frac{\hat{\sigma}_x ^ 2}{\hat{\sigma}_y ^ 2}\right] = \frac{\hat{\sigma}_{xy} ^ 2}{\hat{\sigma}_x ^ 2\hat{\sigma}_y ^ 2} =\hat{\rho}_{xy} ^ 2 \]
\(R^2\) is therefore simply the square of the coefficient of correlation between \(x\) and \(y\). We have seen previously that, denoting \(\theta\) the angle formed by two vectors, \(\cos \theta\) is the coefficient of correlation between the two underlying variables (if they are measured in deviations from their means). Therefore, in Figure 1.8, \(R^2\) is represented by the square of the cosine of the angle formed by the two vectors \(\hat{y}\) and \(y\). As this angle tends to 0 (the two vectors point almost in the same direction), \(R^2\) tends to 1. On the contrary, if this angle tends to \(\pi/2\), the two vectors become almost orthogonal, and \(R^2\) tends to 0.
1.5 Computation with R
The slope of the regression line can easily be computed “by hand”, using any of the formulas in Equation 1.10 and the dplyr::summarise function. We first compute the variations of \(x\) and \(y\), the covariation of \(x\) and \(y\), the two standard deviations and the coefficient of correlation between \(x\) and \(y\).
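The code building the stats tibble is not reproduced here; a possible version, inferred from the column names used below (xb, yb, Sxx, Sxy, sx, sy, sxy and rxy), is:
stats <- prtime %>% summarise(
  xb  = mean(h),  yb  = mean(sr),
  Sxx = sum((h - xb) ^ 2), Syy = sum((sr - yb) ^ 2),
  Sxy = sum((h - xb) * (sr - yb)),
  sx  = sqrt(Sxx / n()), sy = sqrt(Syy / n()), sxy = Sxy / n(),
  rxy = sxy / (sx * sy))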
We can then compute the estimator of the slope using any of the three formulas:
hbeta <- stats$Sxy / stats$Sxx
hbeta
## [1] 0.02393
stats$sxy / stats$sx ^ 2
## [1] 0.02393
stats$rxy * stats$sy / stats$sx
## [1] 0.02393
The estimation of the intercept is obtained using Equation 1.11:
halpha <- stats$yb - hbeta * stats$xb
halpha
## [1] -0.01548
Much more simply, the OLS estimator can be obtained using the lm function (for linear model). It is a very important function in R, not only because it implements efficiently the most important estimator used in econometrics, but also because R functions that implement other estimators often mimic the features of the lm function. Therefore, once one is at ease with the lm function, using other estimation functions in R will be straightforward. lm has many arguments, but the first two are fundamental and almost mandatory:6
- formula is a symbolic description of the model to be estimated,
- data is a data frame that contains the variables used in the formula.
Here, our formula is sr ~ h, which means sr as a function of h. The data frame is prtime. The result of the lm function may be directly printed (the result is then lost), or saved in an object, which can later be printed or manipulated:
lm(sr ~ h, prtime)
Call:
lm(formula = sr ~ h, data = prtime)
Coefficients:
(Intercept) h
-0.0155 0.0239
pxt <- lm(sr ~ h, prtime)
Writing pxt directly is like writing print(pxt); the side effect is to print a short description of the results, namely a reminder of the function call and the names and values of the fitted coefficients. lm returns an object of class lm, which is a list of 12 elements; their names can be retrieved using the names function:
names(pxt)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
An element of this list can be extracted using the $ operator. For example:
pxt$coefficients
## (Intercept) h
## -0.01548 0.02393
returns the named vector of coefficients. pxt$residuals and pxt$fitted.values return two vectors of length \(N\) containing the residuals and the fitted values. However, it is not advisable to use the $ operator to retrieve the elements of a fitted model. Specific functions, called extractors, should be used instead. For example, to retrieve the coefficients, the residuals and the fitted values as previously, we would use:
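The corresponding commands are not reproduced here; a minimal sketch using these standard extractors is:
coef(pxt)                    # the named vector of coefficients
residuals(pxt) %>% head(3)   # the first residuals
fitted(pxt) %>% head(3)      # the first fitted values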
There are other functions that extract important information about the model, such as the number of observations (nobs) and the residual sum of squares (deviance):
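Both are standard extractors for lm objects; a minimal sketch:
nobs(pxt)       # number of observations used in the regression
deviance(pxt)   # residual sum of squares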
The results of the estimation are presented in Figure 1.9; we’ve added to the scatterplot:
- the sample mean, indicated by a large circle,
- the regression line,
- the residuals, represented by arrows: upward arrows represent positive residuals (e.g., Brest and Toulon), and downward arrows represent negative residuals (e.g., Strasbourg and Nice).
Individual coefficients can be extracted using the [ operator. As coef returns a named vector, one can indicate either the position or the name of the coefficient to be extracted:
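The extraction itself is not shown; a sketch consistent with the int and slope objects used below is:
int   <- coef(pxt)[1] %>% unname     # extraction by position
slope <- coef(pxt)["h"] %>% unname   # extraction by name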
Note that we use the unname function in order to remove the name of the extracted coefficient. Once the intercept (int) and the slope (slope) are extracted, the structural parameters can be retrieved, as the intercept is \(\alpha = -\frac{a}{b-a}\) and the slope \(\beta =\frac{1}{b-a}\). Therefore, \(a = -\frac{\alpha}{\beta}\) and \(b = \frac{1}{\beta} + a\). We finally get:
ahat = - int / slope
bhat = 1 / slope + ahat
c(ahat, bhat)
## [1] 0.647 42.431
Time values therefore lie between 0.65 and 42.43 euros per hour. The mean (and median) time value is the mean of the two extreme values, which is 21.54 euros per hour.
1.6 Data generating process and simulations
Inferential statistics rely on the notions of population and sample. The population is a large and exhaustive set of observations, and a sample is a small subset of observations drawn from this population. These notions are relevant in medical sciences and often in economic studies. For example, the first data set we used concerned smoking habits and birth weight. The population of interest is all American pregnant women in 1988, and a sample of 1388 of them was drawn from this population. The second data set concerned the relation between education and wage. The population was the American labor force in 1992, and a sample of 16,481 workers was drawn from this population. On the contrary, our third data set doesn’t fit with these notions of population and sample. The sample consists of major towns in France that are connected on a regular basis by rail and air to Paris. It contains 13 observations, which are not 13 observations randomly drawn from a large set of cities, but more or less all the relevant cities.
Data generating process
An interesting alternative is the notion of data generating process (DGP). It describes how the data are assumed to have been generated. We assume in the linear regression model that:
- \(\mbox{E}(y|x = x_n) = \alpha + \beta x_n\): the expected value of \(y\) for a given value of \(x\) is a linear function of \(x\),
- \(y_n = \mbox{E}(y|x = x_n) + \epsilon_n\): the observed value of \(y_n\) is obtained by adding to the conditional expectation of \(y\) for \(x = x_n\) a random variable \(\epsilon\) called the error.
From Equation 1.10, we can write the estimator of the slope as:
\[ \hat{\beta} = \frac{\sum_{n = 1} ^ N (x_{n} - \bar{x})(y_{n} - \bar{y})} {\sum_{n = 1} ^ N (x_{n} - \bar{x})^ 2} = \sum_{n = 1} ^ N \frac{(x_{n} - \bar{x})}{\sum_{n = 1} ^ N (x_{n} - \bar{x}) ^ 2} y_{n} = \sum_{n} c_{n} y_{n} \tag{1.15}\]
with \(c_{n} = \frac{x_{n} - \bar{x}}{\sum_{n = 1} ^ N (x_{n} - \bar{x}) ^ 2}\). The OLS estimator is therefore a linear estimator, i.e., a linear combination of the values of \(y\). The coefficients of this linear combination \(c_n\) are such that \(\sum_{n = 1} ^ N c_{n} = 0\) and that:
\[ \sum_{n = 1} ^ N c_{n} ^ 2 = \frac{\sum_{n = 1} ^ N (x_{n} - \bar{x}) ^ 2} {\left(\sum_{n = 1} ^ N (x_{n} - \bar{x}) ^ 2\right) ^ 2} = \frac{1}{\sum_{n = 1} ^ N (x_{n} - \bar{x}) ^ 2} = \frac{1}{N\hat{\sigma}_x^2} \]
Replacing \(y_{n}\) by \(\alpha + \beta x_{n} + \epsilon_{n}\), we then express \(\hat{\beta}\) as a function of \(\epsilon_n\):
\[ \hat{\beta}=\sum_{n = 1} ^ N c_{n} (\alpha + \beta x_{n} + \epsilon_{n}) = \alpha \sum_{n = 1} ^ N c_{n}+\beta\sum_{n = 1} ^ N \frac{x_{n}(x_{n} - \bar{x})}{\sum_{n = 1} ^ N (x_{n} - \bar{x}) ^ 2}+\sum_{n = 1} ^ N c_{n} \epsilon_{n} \]
As \(\sum_{n = 1} ^ N x_{n}(x_{n} - \bar{x}) = \sum_{n = 1} ^ N (x_{n} - \bar{x}) ^ 2\) and \(\sum_{n = 1} ^ N c_{n} = 0\), we finally get:
\[ \hat{\beta} = \beta + \sum_{n = 1} ^ N c_{n} \epsilon_{n} \tag{1.16}\]
The deviation of the estimator of the slope of the OLS regression line \(\hat{\beta}\) from the true value \(\beta\) is therefore a linear combination of the \(N\) errors. Consider our sample used to estimate the price-time model. From a DGP perspective, this sample has been generated using the formula \(y=\alpha + \beta x + \epsilon\). Assuming that the “true” values of \(\alpha\) and \(\beta\) are \(-0.2\) and \(0.032\), we can then compute the vector of errors for our sample (\(\epsilon = y - \alpha - \beta x\)):
alpha <- - 0.2
beta <- 0.032
y <- prtime %>% pull(sr)
x <- prtime %>% pull(h)
eps = y - alpha - beta * x
We then compute the OLS estimator, and we retrieve \(\hat{\epsilon}\) and \(\hat{y}\).
z <- lm(y ~ x)
hat_eps <- z %>% residuals
hat_y <- z %>% fitted
The observed data set, in the DGP perspective, is represented in Figure 1.10.
The “true” model is represented by the solid line. The errors are represented by the solid arrows (positive errors for upward arrows, negative errors for downward arrows). Each value of \(y_n\) is the sum of the conditional expectation of \(y\) for the value of \(x\), \(E(y|x = x_n) = \alpha + \beta x_n\) (the value returned by the solid line for the given value of \(x\)), and the error \(\epsilon_n\), represented by a solid arrow. For our specific sample, we have a specific vector of \(\epsilon_n\), which means a specific set of points and a specific regression line, the dashed line on the figure. Each value of \(y_n\) is then also the sum of the fitted value (the one returned by the regression line for \(x=x_n\)) and the residual, represented by a dashed arrow.
We can check that the residuals sum to 0 and are uncorrelated with the covariate and the fitted values:
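The original check is not reproduced; one possible version, using the vectors computed above, is:
c(sum(hat_eps), cov(hat_eps, x), cov(hat_eps, hat_y))   # all three are (numerically) zero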
which is not the case for the errors:
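A possible version of this check:
c(sum(eps), cov(eps, x))   # not zero in general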
We can also check that the residuals are “smaller” than the errors by computing their standard deviations:
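A possible version of this comparison:
c(sd_errors = sd(eps), sd_residuals = sd(hat_eps))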
Random numbers and simulations
Now consider another fictitious sample for the same values of \(x\), i.e., consider that the threshold values of time are the same, but that the other factors that influence rail’s share (the \(\epsilon\)) are different. To generate such a fictitious sample, we need to generate the \(\epsilon\) vector, using a function that generates random numbers.7 In R, these functions have a name composed of the letter r (for random) and the abbreviated name of the statistical distribution: for example, runif draws numbers from a uniform distribution (by default within a \(0-1\) range) and rnorm draws numbers from a normal distribution (by default with zero expectation and unit standard deviation). These functions have a mandatory argument, which is the number of draws. For example, to get five numbers drawn from a standard normal distribution:
rnorm(5)
## [1] 0.01875 -0.18425 -1.37133 -0.59917 0.29455
Using the same command once more, we get a completely different sequence:
rnorm(5)
## [1] 0.3898 -1.2081 -0.3637 -1.6267 -0.2565
As stated previously, the rnorm function doesn’t draw truly random numbers but computes a sequence of numbers that looks like a random sequence. Imagine that rnorm actually computes a sequence of thousands of numbers; what is obtained using rnorm(5) is 5 consecutive numbers in this sequence, for example from the 5107th to the 5111th. The position of the first element is called the seed, and it can be set to an integer using the set.seed function. Using the same seed when starting a simulation, we then get exactly the same pseudo-random numbers each time and therefore the same results:
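The corresponding illustration is not reproduced; a sketch would be (the seed value is an arbitrary choice):
set.seed(2022)
rnorm(5)
set.seed(2022)
rnorm(5)   # the same five numbers as above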
The DGP is completely described by specifying the distribution of \(\epsilon\). We’ll consider here a normal distribution with mean 0 and standard deviation \(\sigma_\epsilon = 0.08\). A pseudo-random sample can then be constructed as follows:
N <- prtime %>% nrow
x <- prtime %>% pull(h)
alpha <- - 0.2 ; beta <- 0.032 ; seps <- 0.08
eps <- rnorm(N, sd = seps)
y <- alpha + beta * x + eps
asmpl <- tibble(y, x)
which gives the following OLS estimates:
lm(y ~ x, asmpl) %>% coef
## (Intercept) x
## -0.25218 0.03841
The notion of DGP enables us to perform simulations, which have two purposes:
- the first (and the most important in practice) is that, in some situations, it is impossible to get analytical results for the properties of an estimator, and those properties can in this case be obtained using simulations,
- the second is pedagogical, as theoretical results can be confirmed and illustrated using simulations.
A simulation consists of creating a large number of samples, computing some interesting numbers for every sample and calculating some relevant statistics for these numbers. For example, we’ll compute the slope of the OLS estimate for every sample, and we’ll calculate the mean and the standard deviation for this slope. A minimal simulation is presented in Figure 1.11, where we present four different samples obtained for four different sets of errors.
For every sample, there is a specific set of errors (arrows) and therefore specific data points and regression lines. The estimated intercepts range from \(-0.212\) to \(-0.015\) and the slopes from \(0.024\) to \(0.032\). The important point is that \(\beta\) is, in practice, an unknown fixed parameter. \(\hat{\beta}\) depends on the \(N\) observations of the sample and therefore the \(N\) values of the errors. Each sample is characterized by a specific vector of errors and therefore by a different value of the estimator.
More generally, we denote \(R\) the number of replications, and we construct a data frame with \(R \times N\) lines. The columns are the \(N\) values of \(x\) (fixed in this simulation), the vector \(\epsilon\) drawn from a normal distribution with a standard deviation equal to 0.08, and \(y = \alpha + \beta x + \epsilon\), with \(\alpha = -0.2\) and \(\beta = 0.032\) as previously. Finally, we add a variable id, a vector containing the integers from 1 to \(R\), each repeated \(N\) times, that will be used to identify the samples.
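The construction of this data frame is not shown; a possible version, consistent with the objects defined above (N, x, alpha, beta and seps) and with the 1,000 replications used below, is:
R <- 1000
datas <- tibble(id  = rep(1:R, each = N),
                x   = rep(x, R),
                eps = rnorm(R * N, sd = seps),
                y   = alpha + beta * x + eps)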
Running regressions on every sample requires using lm(y ~ x) not on the whole data set, but on every subset defined by a value of id. The lm function has a subset argument, which must be a logical expression used to select a subset of the data frame. For example, to get the fitted model for the first sample, the logical expression is id == 1 and one can use:
lm(y ~ x, data = datas, subset = id == 1)
The slope can also be calculated using Equation 1.15, which shows that the OLS estimator is a linear combination of the values of the response:
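For instance, for the first sample (a sketch using the weights \(c_n\) of Equation 1.15):
datas %>% filter(id == 1) %>%
  mutate(cn = (x - mean(x)) / sum((x - mean(x)) ^ 2)) %>%
  summarise(slope = sum(cn * y))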
To get values of the estimator for every sample, we have to loop this command over id from 1 to R. A good practice in R is to avoid explicit loops whenever possible. The group_by / summarise couple of functions from the dplyr package can be used instead.
results <- datas %>%
group_by(id) %>%
mutate(cn = (x - mean(x)) / sum((x - mean(x)) ^ 2)) %>%
summarise(slope = sum(cn * y), intercept = mean(y) - slope * mean(x))
results
# A tibble: 1,000 × 3
id slope intercept
<int> <dbl> <dbl>
1 1 0.0329 -0.166
2 2 0.0309 -0.181
3 3 0.0348 -0.213
4 4 0.0267 -0.129
5 5 0.0270 -0.127
# ℹ 995 more rows
The result is now a tibble with R lines, as the computation of the OLS estimator is performed for every sample. As we now have a large number of values of \(\hat{\beta}\), we can analyze its statistical properties, for example its mean and its standard deviation over the 1000 random samples.
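The command producing the summary below is not reproduced; it is presumably of the form:
results %>% summarise(mean = mean(slope), sd = sd(slope))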
# A tibble: 1 × 2
mean sd
<dbl> <dbl>
1 0.0320 0.00437
We’ll see in Chapter 2 that the expected value of \(\hat{\beta}\) is \(\beta=0.032\) (we’ll say that the OLS estimator is unbiased) and that its standard deviation is \(\sigma_\epsilon / (\sigma_x \sqrt{N}) = 0.0041\). The mean and standard deviation of our 1000 values of \(\hat{\beta}\) are estimates of these two theoretical values. We can see that the mean of \(\hat{\beta}\) is actually equal to 0.032 and that the standard deviation equals 0.00437, which is close to the theoretical value.
A more general method to perform simulations is to use list-columns of data frames. To conduct such an analysis step by step, we start with a very small example, with \(R = 2\) and \(N = 3\).
R <- 2
N <- 3
small_data <- tibble(id = rep(1:R, each = N),
x = rep(x[1:N], R),
eps = rnorm(R * N, sd = seps),
y = alpha + beta * x + eps)
small_data
# A tibble: 6 × 4
id x eps y
<int> <dbl> <dbl> <dbl>
1 1 26.7 -0.0624 0.593
2 1 20.5 0.0243 0.479
3 1 27.0 0.111 0.775
4 2 26.7 -0.0124 0.643
5 2 20.5 -0.0356 0.419
# ℹ 1 more row
Starting with small_data, we use tidyr::nest to nest the data frame by the id variable (.by = id); the name of the resulting list column is data by default, but it can be customized using the .key argument:
datalst <- small_data %>% nest(.by = id, .key = "smpl")
datalst
# A tibble: 2 × 2
id smpl
<int> <list>
1 1 <tibble [3 × 3]>
2 2 <tibble [3 × 3]>
Each line now corresponds to a value of id, and smpl is a list column that contains a tibble:
datalst %>% pull(smpl)
[[1]]
# A tibble: 3 × 3
x eps y
<dbl> <dbl> <dbl>
1 26.7 -0.0624 0.593
2 20.5 0.0243 0.479
3 27.0 0.111 0.775
[[2]]
# A tibble: 3 × 3
x eps y
<dbl> <dbl> <dbl>
1 26.7 -0.0124 0.643
2 20.5 -0.0356 0.419
3 27.0 0.141 0.805
We now write a function that takes as argument a data frame containing the \(x\) and \(y\) columns and returns a one-line tibble that contains the fitted intercept and slope:
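The definition of ols is not reproduced here; a possible implementation, consistent with the output shown below, is:
ols <- function(d) {
  fit <- lm(y ~ x, data = d)
  tibble(slope = coef(fit)[["x"]], intercept = coef(fit)[["(Intercept)"]])
}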
This function can be applied for a given sample:
ols(small_data[1:N, ])
# A tibble: 1 × 2
slope intercept
<dbl> <dbl>
1 0.0329 -0.198
But, as the smpl column of datalst is a list, the ols function can be applied to all the elements of the list, which can be done using the purrr::map function, which takes as arguments a list and a function:
datalst %>% pull(smpl) %>% map(ols)
[[1]]
# A tibble: 1 × 2
slope intercept
<dbl> <dbl>
1 0.0329 -0.198
[[2]]
# A tibble: 1 × 2
slope intercept
<dbl> <dbl>
1 0.0483 -0.573
and returns a list. But this can be done more easily using mutate or transmute, so that the result is a tibble:
results <- datalst %>% transmute(smpl = map(smpl, ols))
results
# A tibble: 2 × 1
smpl
<list>
1 <tibble [1 × 2]>
2 <tibble [1 × 2]>
Finally, tidyr::unnest expands the list column containing tibbles into rows and columns.
results %>% unnest(cols = smpl)
# A tibble: 2 × 2
slope intercept
<dbl> <dbl>
1 0.0329 -0.198
2 0.0483 -0.573
which returns a standard tibble with one row for each draw and one column for each computed statistic. Going back to the full example with \(R = 1000\), we get:
datas %>% nest(.by = id, .key = "smpl") %>%
transmute(smpl = map(smpl, ols)) %>%
unnest(cols = smpl) %>%
summarise(mean = mean(slope), sd = sd(slope))
# A tibble: 1 × 2
mean sd
<dbl> <dbl>
1 0.0320 0.00437
Note that we use a biased estimator of the covariance as the denominator is \(N\) and not \(N-1\). The difference is negligible if \(N\) is large.↩︎
This is obviously a very crude method to remove the non-linearity. Real price-time models use a more relevant solution, i.e., a transformation of the two variables in order to deal with the non-linearity.↩︎
This estimator is called the least absolute deviations estimator.↩︎
For a detailed presentation of the geometry of least squares, see Davidson and MacKinnon (2004), chapter 2.↩︎
Actually, the second argument data is not mandatory because estimation can be performed without using a data frame.↩︎
More precisely, a function that generates a sequence of numbers that looks like random numbers.↩︎