Microeconometrics with R

Author

Yves Croissant

Published

June 4, 2024

Preface

This book is about doing microeconometrics with R. Microeconometrics is defined by Cameron and Trivedi (2005) as “the analysis of individual-level data on the economic behavior of individuals or ﬁrms using regression methods applied to cross-section and panel data”. We’ll use in this book a broader definition of microeconometrics, as some of the empirical examples use countries or regions as a unit of observation. But, nevertheless, the underlying model will have microeconomic foundations, as for example the Solow growth model that is treated at length in Chapter 3 and Chapter 9 and the Kuzets curve presented in Chapter 4. A negative delimitation of what is in this book is that it doesn’t contain any analyses of time series.

I used throughout the book some manuals. My main reference was Cameron and Trivedi (2005), but I also used Greene (2018) (especially the online mathematical and statistical appendices that I found particularly useful) and Wooldridge (2010). At an introductory level, I also used Stock and Watson (2015) and Maddala and Lahiri (2009). For some points (geometry of least squares, asymptotic theory, data generating process, computational considerations), I was also inspired by Davidson and MacKinnon (1993, 2004). On more specific subjects, I used also some specialized textbook and surveys: Maddala (1983), Amemiya (1981, 1984) for Chapter 10 and Chapter 11 (binomial and censored models), Cameron and Trivedi (2013) for Chapter 12 (count data), Lancaster (1990) and Kiefer (1988) for Chapter 13 (duration models) and Train (2009) for Chapter 14 (discrete choice models).

R and RStudio

R appeared in the late nineties as a clone of S which rapidly became a GNU project. It became increasingly popular among statisticians, especially in fields where S+ was widely used. Its large adoption by econometricians is more recent, especially because of the use by econometricians of other softwares like SAS, GAUSS and especially nowadays Stata, and the lack of some popular estimators and tests for econometrics in R. Since then, numerous excellent packages have since been developed that make the adoption of R as the primary software for an econometrician a relevant choice. Moreover, R has some characteristics that make R code concise and clear, compared to other softwares.

RStudio (now Posit), founded in 2009 by Joseph J. Allaire, developed an integrated development environment called RStudio and a set of R packages that changed deeply the way a lot of R users (including myself) use R. Among them, quarto was used to build this book and the set of packages called the tidyverse is used extensively for all the basic tasks of an applied statistician which are, citing Wickham, Cetinkaya-Rundel, and Grolemund (2023), importing, tidying, transforming and visualizing data sets. Packages of the tidyverse, respectively readr, tidyr, dplyr and ggplot2 enable to perform these tasks using a set of efficient, intuitive and consistent functions.

This book requires basic knowledge of R and the tidyverse, at the level of the first part of Wickham, Cetinkaya-Rundel, and Grolemund (2023), which is titled “the whole game” and is an excellent introduction for readers of this book that are not R users. Moreover, there is a free online version of this book that can be found at https://r4ds.hadley.nz/.

R packages

R has a modular structure. A particular analysis with R requires the use of functions that are part of different packages. To use these functions, the user can either “attach” the package (using the library function) or prefix the name of the function with the name of the package. For example, to use the waldtest function of the lmtest package, one can either use:

library(lmtest)
waldtest(---)

or:

lmtest::waldtest(---)

A relevant strategy is to attach only the packages that are often used in the analysis.

Different kinds of packages should be considered:

the base packages contain the core of R. They are included in any distribution of R and they are automatically attached, which means that the user can have access directly to the functions they contain. The two most important are base and stats, which contain respectively the basic general and statistical functions of R.
the R distribution also contains 15 recommended packages, which are not attached. Among them, we’ll use in this book survival and MASS.
contributed R packages. They are developed by R users and are hosted by the Comprehensive R Archive Network (CRAN). They can be very easily installed within the RStudio IDE (using the Packages tab on the right bottom panel of RStudio). Nowadays, there are more than 20,000 of them.

Twenty years ago, using R for econometrics analysis required a lot of programming because a lot of core methods of econometrics were not available neither in the core of R, nor in contributed packages. Nowadays, most of the basic methods are available in contributed packages, and the book uses extensively these packages. Some missing methods are implemented in the companion package of the book called micsr.

Data sets

This book uses a lot of data sets to illustrate the use of the techniques described in the book. A particular care was given to use original and interesting data sets. Our main source to get data was the Journal of Applied Econometrics data archive maintained by James G. MacKinnon since 1994 and available at http://qed.econ.queensu.ca/jae/¹ and the website of the American Economic Association https://www.aeaweb.org which provides data sets and programs used in articles published in AEA’s journal, in particular the American Economic Review and the four American Economic Journals (we particularly used Applied Economics). We also wrote to a lot of authors to ask them whether the data used in their publications was still available and if they could share it with us. Many thanks for those who provided data sets that are now included in the micsr and the micsr.data packages: John Mullahy (birthwt, cigmales), Mark Ottoni Wilhelm (charitable), Lawrence Katz (recall), Joseph Terza (trips), Franck Koppelman (toronto_montreal), Christopher Skeels and David Drukker (moffitt), Winfried Pohlmeier (doctor_ger) and Rainer Winkelman (jobmob).²

The micsr package

The book comes with a companion package called micsr. It is available on CRAN and is therefore very easy to install. As indicated previously, the book uses extensively the tidyverse package and it is therefore recommended to install tidyverse before trying to replicate the R code contained in the book. micsr contains a small subset of the data sets included in the book and some specific methods of estimations and testing procedures for microeconometrics. Data sets used in this book not included in the micsr package are available in the micsr.data package which is available from github. To install it, you should first install the pak package and then simply write in the console:

pak::pkg_install("ycroissant/micsr.data")

Throughout the book, we assume that the reader that is willing to replicate the examples had previously attached the tidyverse, micsr and micsr.data packages, using:

library(tidyverse)
library(micsr)
library(micsr.data)

micsr provides specific estimation and testing methods that will be presented in the core of the book. It also provides some general purpose functions. Among them, the gaze function will be extensively used throughout the book.

The output of R calls is often quite long and takes two much place in a book. For this reason, micsr contains a generic function called gaze with methods for a lot of R objects. Consider for example a t-test:

t.test(extra ~ 1, data = sleep)


    One Sample t-test

data:  extra
t = 3.413, df = 19, p-value = 0.002918
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.5955845 2.4844155
sample estimates:
mean of x 
     1.54

The output is 12 lines long, including three blank lines. gaze writes the main part of the results in just one line:

t.test(extra ~ 1, data = sleep) %>% gaze
## t = 3.413, df: 19, pval = 0.003

Amemiya, Takeshi. 1981. “Qualitative Response Models: A Survey.” Journal of Economic Literature 19 (4): 1483–1536. https://ideas.repec.org/a/aea/jeclit/v19y1981i4p1483-1536.html.

———. 1984. “Tobit Models: A Survey.” Journal of Econometrics 24 (1): 3–61. https://doi.org/10.1016/0304-4076(84)90074-5.

Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics. Cambridge University Press. https://EconPapers.repec.org/RePEc:cup:cbooks:9780521848053.

———. 2013. Regression Analysis of Count Data. 2nd ed. 53. Cambridge: Cambridge University Press.

Davidson, Russell, and James G. MacKinnon. 1993. Estimation and Inference in Econometrics. New-York: Oxford University Press.

———. 2004. Econometric Theory and Methods. Oxford University Press.

Greene, William H. 2018. Econometrics Analysis. 8th ed. Pearson.

Kiefer, Nicholas M. 1988. “Economic Duration Data and Hazard Functions.” Journal of Economic Literature 26 (2): 646–79. http://www.jstor.org/stable/2726365.

Lancaster, Tony. 1990. In The Econometric Analysis of Transition Data. Econometric Society Monographs. Cambridge University Press.

Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in Econometrics. Econometric Society Monographs. Cambridge University Press. https://doi.org/10.1017/CBO9780511810176.

Maddala, G. S., and Kajal Lahiri. 2009. Introduction to Econometrics. 4th ed. Wiley.

Stock, James H., and Mark W. Watson. 2015. Introduction to Econometrics. Pearson.

Train, Kenneth. 2009. Discrete Choice Methods with Simulation. Cambridge University Press. https://EconPapers.repec.org/RePEc:cup:cbooks:9780521766555.

Wickham, Hadley, Mine Cetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly.

Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. Vol. 1. MIT Press Books 0262232588. The MIT Press. https://ideas.repec.org/b/mtp/titles/0262232588.html.

This website is still available but is superseded since 2022 by a new website at https://journaldata.zbw.eu/journals/jae.↩︎
Franck Vella, Adrian Pagan, Jerry Hausman, Claudia Goldin, Wynand van de Ven, Andrew Jones, Richard J. Smith, Pramila Krishnan, Orley C. Ashenfelter, John Ham, James Adams, John McDonald, Laurence Kotlikoff, Mark Rosenzweig, James Dunvely and Kathryn Graddy kindly answered me that they were unable to share the data I required (most of the time because it was no longer available to them).↩︎

# Preface {.unnumbered} This book is about doing microeconometrics with R. Microeconometrics is defined by @CAME:TRIV:05\index[author]{Cameron}\index[author]{Trivedi} as "the analysis of individual-level data on the economic behavior of individuals or ﬁrms using regression methods applied to cross-section and panel data". We'll use in this book a broader definition of microeconometrics, as some of the empirical examples use countries or regions as a unit of observation. But, nevertheless, the underlying model will have microeconomic foundations, as for example the Solow growth model that is treated at length in @sec-mult_reg_chapter and @sec-spatial_econometrics and the Kuzets curve presented in @sec-interpretation_chapter. A negative delimitation of what is in this book is that it doesn't contain any analyses of time series. I used throughout the book some manuals. My main reference was @CAME:TRIV:05, but I also used @GREE:18 (especially the online mathematical and statistical appendices that I found particularly useful) and @WOOL:10. At an introductory level, I also used @STOC:WATS:15 and @MADD:LAHI:09. For some points (geometry of least squares, asymptotic theory, data generating process, computational considerations), I was also inspired by Davidson and MacKinnon [-@DAVI:MACK:93; -@DAVI:MACK:04]\index[author]{Davidson}\index[author]{MacKinnon}. On more specific subjects, I used also some specialized textbook and surveys: @MADD:83\index[author]{Maddala}, Amemiya [-@AMEM:81; -@AMEM:84]\index[author]{Amemiya} for @sec-binomial and @sec-tobit (binomial and censored models), @CAME:TRIV:13\index[author]{Cameron}\index[author]{Trivedi} for @sec-count (count data), @LANC:90\index[author]{Lancaster} and @KIEF:88\index[author]{Kiefer} for @sec-duration (duration models) and @TRAI:09\index[author]{Train} for @sec-rum (discrete choice models). ## R and RStudio {-} **R** appeared in the late nineties as a clone of **S** which rapidly became a **GNU** project. It became increasingly popular among statisticians, especially in fields where **S+** was widely used. Its large adoption by econometricians is more recent, especially because of the use by econometricians of other softwares like **SAS**, **GAUSS** and especially nowadays **Stata**, and the lack of some popular estimators and tests for econometrics in **R**. Since then, numerous excellent packages have since been developed that make the adoption of **R** as the primary software for an econometrician a relevant choice. Moreover, **R** has some characteristics that make **R** code concise and clear, compared to other softwares. **RStudio** (now **Posit**), founded in 2009 by Joseph J. Allaire\index[author]{Allaire}, developed an integrated development environment called **RStudio** and a set of **R** packages that changed deeply the way a lot of **R** users (including myself) use **R**. Among them, **quarto** was used to build this book and the set of packages called the **tidyverse** is used extensively for all the basic tasks of an applied statistician which are, citing @WICK:CETI:GROL:23\index[author]{Wickham}\index[author]{Cetinkaya-Rundel}\index[author]{Grolemund}, importing, tidying, transforming and visualizing data sets. Packages of the **tidyverse**, respectively **readr**, **tidyr**, **dplyr** and **ggplot2** enable to perform these tasks using a set of efficient, intuitive and consistent functions. This book requires basic knowledge of **R** and the **tidyverse**, at the level of the first part of @WICK:CETI:GROL:23, which is titled "the whole game" and is an excellent introduction for readers of this book that are not **R** users. Moreover, there is a free online version of this book that can be found at <https://r4ds.hadley.nz/>. ## R packages **R** has a modular structure. A particular analysis with **R** requires the use of functions that are part of different **packages**. To use these functions, the user can either "attach" the package (using the `library` function) or prefix the name of the function with the name of the package. For example, to use the `waldtest` function of the **lmtest** package, one can either use: ```{r} #| eval: false library(lmtest) waldtest(---) ``` or: ```{r} #| eval: false lmtest::waldtest(---) ``` A relevant strategy is to attach only the packages that are often used in the analysis. Different kinds of packages should be considered: - the **base** packages contain the core of **R**. They are included in any distribution of **R** and they are automatically attached, which means that the user can have access directly to the functions they contain. The two most important are **base** and **stats**, which contain respectively the basic general and statistical functions of **R**. - the **R** distribution also contains 15 **recommended** packages, which are not attached. Among them, we'll use in this book **survival** and **MASS**. - **contributed** R packages. They are developed by **R** users and are hosted by the Comprehensive R Archive Network (**CRAN**). They can be very easily installed within the RStudio IDE (using the `Packages` tab on the right bottom panel of RStudio). Nowadays, there are more than 20,000 of them. Twenty years ago, using **R** for econometrics analysis required a lot of programming because a lot of core methods of econometrics were not available neither in the core of **R**, nor in contributed packages. Nowadays, most of the basic methods are available in contributed packages, and the book uses extensively these packages. Some missing methods are implemented in the companion package of the book called **micsr**. ## Data sets {-} This book uses a lot of data sets to illustrate the use of the techniques described in the book. A particular care was given to use original and interesting data sets. Our main source to get data was the Journal of Applied Econometrics data archive maintained by James G. MacKinnon\index[author]{MacKinnon} since 1994 and available at <http://qed.econ.queensu.ca/jae/>^[This website is still available but is superseded since 2022 by a new website at <https://journaldata.zbw.eu/journals/jae>.] and the website of the American Economic Association <https://www.aeaweb.org> which provides data sets and programs used in articles published in AEA's journal, in particular the American Economic Review and the four American Economic Journals (we particularly used Applied Economics). We also wrote to a lot of authors to ask them whether the data used in their publications was still available and if they could share it with us. Many thanks for those who provided data sets that are now included in the **micsr** and the **micsr.data** packages: John Mullahy\index[author]{Mullahy} (`birthwt`, `cigmales`), Mark Ottoni Wilhelm\index[author]{Wilhelm} (`charitable`), Lawrence Katz\index[author]{Katz} (`recall`), Joseph Terza\index[author]{Terza} (`trips`), Franck Koppelman\index[author]{Koppelman} (`toronto_montreal`), Christopher Skeels\index[author]{Skeels} and David Drukker\index[author]{Moffitt} (`moffitt`), Winfried Pohlmeier\index[author]{Pohlmeier} (`doctor_ger`) and Rainer Winkelman\index[author]{Winkelman} (`jobmob`).^[Franck Vella\index[author]{Vella}, Adrian Pagan\index[author]{Pagan}, Jerry Hausman\index[author]{Hausman}, Claudia Goldin\index[author]{Goldin}, Wynand van de Ven\index[author]{van de Ven}, Andrew Jones\index[author]{Jones}, Richard J. Smith\index[author]{Smith}, Pramila Krishnan\index[author]{Krishnan}, Orley C. Ashenfelter\index[author]{Ashenfelter}, John Ham\index[author]{Ham}, James Adams\index[author]{Adams}, John McDonald\index[author]{McDonald}, Laurence Kotlikoff\index[author]{Kotlikoff}, Mark Rosenzweig\index[author]{Rosenzweig}, James Dunvely\index[author]{Dunvely} and Kathryn Graddy\index[author]{Graddy} kindly answered me that they were unable to share the data I required (most of the time because it was no longer available to them).] ## The micsr package {-} The book comes with a companion package called **micsr**. It is available on **CRAN** and is therefore very easy to install. As indicated previously, the book uses extensively the **tidyverse** package and it is therefore recommended to install **tidyverse** before trying to replicate the **R** code contained in the book. **micsr** contains a small subset of the data sets included in the book and some specific methods of estimations and testing procedures for microeconometrics. Data sets used in this book not included in the **micsr** package are available in the **micsr.data** package which is available from **github**. To install it, you should first install the **pak** package and then simply write in the console: ```{r} #| eval: false pak::pkg_install("ycroissant/micsr.data") ``` Throughout the book, we assume that the reader that is willing to replicate the examples had previously attached the **tidyverse**, **micsr** and **micsr.data** packages, using: ```{r} library(tidyverse) library(micsr) library(micsr.data) ``` \newpage **micsr** provides specific estimation and testing methods that will be presented in the core of the book. It also provides some general purpose functions. Among them, the `gaze` function will be extensively used throughout the book. The output of **R** calls is often quite long and takes two much place in a book. For this reason, **micsr** contains a generic function called `gaze` with methods for a lot of **R** objects. Consider for example a t-test: ```{r} t.test(extra ~ 1, data = sleep) ``` The output is 12 lines long, including three blank lines. `gaze` writes the main part of the results in just one line: ```{r} #| collapse: true t.test(extra ~ 1, data = sleep) %>% gaze ```                                    \mainmatter