9 Spatial econometrics

Spatial econometrics is a very dynamic field in modern econometrics. Geolocated data are now frequently available, even when the data doesn’t concern geographic entities like countries, regions or towns, but households or firms. From their geographic coordinates, one is able to define for every observation a set of neighbors, and the interactions between neighbors can then be taken into account. There are two very different types of geographical data: vectors and rasters. A vector is a point or a set of points that define a line. A raster is a grid that contains the value of one variable (for example, a numeric indicating the elevation or a factor indicating land use). Relatively recently, two R packages have emerged that provide plenty of function that enables to deal easily with vectors and rasters. These are respectively sf (simple feature) and terra.¹ In this chapter, we’ll consider only vectors.

The first two sections are devoted to simple features; Section 9.1 presents the structure of simple features objects and Section 9.1.2, using the example of a spatial RD design, illustrates how to deal with simple features in a statistical analysis. Section 9.3 deals with the detection and the measurement of spatial correlation. Finally, Section 9.4 presents some popular spatial models, namely the spatial error and the spatial autoregressive models.

9.1 Simple features

In a spatial statistical analysis, the first task is to get geographical information about the observations. This is usually done by importing external files in R, for example shapefiles. This task is performed by the sf library and the result is an sf object.

Structure of a simple feature

To understand what a simple feature is, it is best to construct a small one “by hand”. The geographical representation of an observation is an sfg for simple feature geometry. It can be a point, a set of points that form a line or a polygon if the line ends where it started. We first load the sf package:

library(sf)

Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE

The message indicates that sf uses three important external libraries:

GEOS to compute topological operations,
GDAL to read external files with a large variety of formats,
PROJ to transform the coordinates in a specific CRS (coordinate reference system).

We consider in this example the three main cities in France. For Paris, the latitude² is $48.87$ and the longitude³ is $2.33$. Most of sf’s functions start with st_. The st_point function can be used to construct a sfg (simple feature geometry) for Paris, the argument being a numeric vector containing the longitude and the latitude.

paris <- st_point(c(2.33, 48.87))

We then perform the same operation for Lyon and Marseilles:

lyon <- st_point(c(4.85, 45.76))
marseilles <- st_point(c(5.37, 43.30))

We then construct a sfc (simple feature column) which is a set of sfgs, using the st_sfc function. The different elements are entered as unnamed arguments of the function and a crs argument can be used to specify the CRS. Three formats exist to describe the CRS: proj4string (a character string), EPSG (an integer), and WKT, for well known text (a list). We can use here either "+proj=longlat +datum=WGS84 +no_defs" or 4326. The datum is the representation of the earth from which the latitudes and the longitudes are computed.

cities_coords <- st_sfc(paris, lyon, marseilles, crs = 4326)
cities_coords

Geometry set for 3 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.33 ymin: 43.3 xmax: 5.37 ymax: 48.87
Geodetic CRS:  WGS 84

Printing the sfc gives interesting information, as the bounding box (the coordinates of the rectangle that contains the data) and the CRS. Finally, we can construct a sf object using st_sf by binding a data frame containing information about the cities and the sfc.

cities_data <- tibble(pop = c(2.161, 0.513, 0.861),
                      area = c(105.4, 47.87, 240.6))
cities <- st_sf(cities_data, cities_coords)
cities

Simple feature collection with 3 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.33 ymin: 43.3 xmax: 5.37 ymax: 48.87
Geodetic CRS:  WGS 84
    pop   area      cities_coords
1 2.161 105.40 POINT (2.33 48.87)
2 0.513  47.87 POINT (4.85 45.76)
3 0.861 240.60  POINT (5.37 43.3)

A sf is just a data frame with a specific column that contains the coordinates of the observations. This column (called the geometry column) can be extracted using the st_geometry function:

cities %>% st_geometry

and is “sticky”, which means that if some columns are selected, the geometry column is always returned along with the selected columns, i.e.:

cities %>% select(pop)

returns pop and cities_coords, even if the latter colon was not selected. We can then plot our three cities, along with a map of France, which is obtained using the geodata::gadm function.⁴

france <- geodata::gadm("FR", 1, ".")

france is not an object of class sf,⁵ so we first coerce it to a sf using the st_as_sf function, and then we extract the geometry and the series called NAME_1 which contains the name of the regions:

france <- france %>% st_as_sf %>%
    select(region = NAME_1)

Imported vector data are often large:

france %>% object.size %>% format("MB")
## [1] "3.6 Mb"

and they can be simplified using the rmapshaper::ms_simplify function, with a keep argument which indicates the degree of simplification ($0.01$ means that we seek to obtain a sf 100 times lighter than the initial one).

france <- france %>%
    rmapshaper::ms_simplify(keep = 0.01)
france %>% object.size %>% format("MB", digits = 2)
## [1] "0.07 Mb"

sf provides a plot method to get quickly a map of the data. Note that a thematic map is plotted for all the series of the sf, and it is therefore recommended to select first a unique series. If one is only interested in the vectors, they can be extracted before plotting using st_geometry:

france %>% st_geometry %>% plot

For more advanced maps, several specialized packages are available, in particular tmap (Tennekes 2018) and mapsf (Giraud 2023). In this chapter, we’ll only use ggplot, which provides a geom_sf function. This geom is very special compared to other geoms, as the kind of geom that is plotted depends on the geometry column of the sf: if cities is provided as the data argument, points will be plotted; but if france is provided, polygons will be drawn. In the following code, we start with france as the data argument of ggplot and then the call to geom_sf results in the drawing of the administrative borders of French regions. Then we use a second time geom_sf with this time cities as data argument, so that points are drawn for the three cities. We use here aes(size = pop) so that the size of the points is related to the population of the cities. The result is presented in Figure 9.1.

france %>% ggplot() + geom_sf() +
    geom_sf(data = cities, aes(size = pop))

Figure 9.1: Map of France with its three main cities

sf provides several functions to compute values of interest. For example, to get the distance from Paris to Lyon and Marseilles, we would use:

st_distance(cities[1, ], cities[2:3, ])
## Units: [m]
##        [,1]   [,2]
## [1,] 394508 662105

st_distance always returns a matrix, the number of elements of the first (second) argument being the number of rows (columns) of the matrix. If only one argument is supplied, the result is a square matrix with 0 on the diagonal. Note that the numbers have a unit of measurement, which is here meters. To define a unit and to convert from one unit to another, the units package provides the set_units function. Consider the following example: we first provide a numeric ($42.195$), we define its unit as kilometers, and then we convert it to miles:

library(units)
d <- 42.195
dkm <- d %>% set_units(km)
dkm
## 42.2 [km]
dkm %>% set_units(miles)
## 26.22 [miles]

Even when the conversion is simple, it is advisable to use set_units instead of applying the conversion “by hand”:

dm <- dkm %>% set_units(m)
dm
## 42195 [m]
dkm * 1000
## 42195 [km]

The numerical values are the same, but in the second case the unit hasn’t changed and is still kilometers. st_area computes the area of a polygon:

st_area(france) %>% units::set_units(km ^ 2)
## Units: [km^2]
##  [1] 71023 47924 27507 39465  8705 57530 31917 12030 30044 84593
## [11] 73135 32267 31837

Computation on sf objects

The regression discontinuity framework can be adapted to consider geographic discontinuities. Some entities are considered on both sides of a border and the forcing variable is then the distance to the border.⁶ The us_counties data set contains the borders of US counties:

us_counties %>% print(n = 3)

Simple feature collection with 3141 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -19940000 ymin: 2147000 xmax: 20010000 ymax: 11520000
Projected CRS: WGS 84 / Pseudo-Mercator
# A tibble: 3,141 × 5
  fips  gid       state   county                            geometry
* <chr> <chr>     <chr>   <chr>                   <MULTIPOLYGON [m]>
1 01001 USA.1.1_1 Alabama Autauga (((-9675520 3850831, -9652842 385…
2 01003 USA.1.2_1 Alabama Baldwin (((-9798375 3599675, -9797718 360…
3 01005 USA.1.3_1 Alabama Barbour (((-9508473 3713483, -9545463 371…
# ℹ 3,138 more rows

Kumar (2018) investigates the effect of a legal restriction on home equity extraction which is specific to Texas on mortgage defaults. Kumar measures mortgage default rates in Texas and in the bordering states and compares mortgage default rates for counties which are closed to the Texas border (on both sides). Let’s first plot the map of the counties, using ggplot and geom_sf (see Figure 9.2):

us_counties %>%
    filter(! state %in% c("Alaska", "Hawaii")) %>%
    ggplot() + geom_sf()

The polygons can be merged using the group_by / summarise functions of dplyr. If the sf is grouped by states, the statistics computed by the summarise function is performed at the state level and the geometry now contains the coordinates of the states. With a void call to summarise, we get only this new geometry, and we can use it to plot a map of states (see Figure 9.3):

states <- us_counties %>%
    filter(! state %in% c("Alaska", "Hawaii")) %>%
    group_by(state) %>%
    summarise()
states %>% ggplot() + geom_sf()

To get the border states of Texas, we use spatial indexation, i.e., we use the [ operator. The first argument is used to select rows; it is normally a vector of integers (the positions of the lines to extract), a vector of characters (the names of the lines to extract) or a logical vector. Here we can index an sf by another sf, and the result is (by default) a new sf containing the elements of the first one which has common points with the second one:

border_states <- states[filter(states, state == "Texas"), ]
border_states

Simple feature collection with 5 features and 1 field
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: -12140000 ymin: 2979000 xmax: -9918000 ymax: 4439000
Projected CRS: WGS 84 / Pseudo-Mercator
# A tibble: 5 × 2
  state                                                     geometry
  <chr>                                               <GEOMETRY [m]>
1 Arkansas   POLYGON ((-10407328 3897869, -10410758 3897882, -10443…
2 Louisiana  MULTIPOLYGON (((-10253638 3470493, -10257225 3475979, …
3 New Mexico POLYGON ((-11868806 3763464, -11868752 3752373, -11870…
4 Oklahoma   POLYGON ((-10942378 4046699, -10950200 4049735, -10956…
5 Texas      MULTIPOLYGON (((-11360369 3475920, -11367496 3476824, …

We then compute the Texas border. It is defined by the common points of the borders of Texas and its border states. This is obtained using the st_intersection function which returns four lines (one for each border state), and the border is then obtained by merging these four lines using the st_union function. The result is presented in Figure 9.4.

texas <- filter(border_states, state == "Texas") %>% st_geometry
border <- filter(border_states, state != "Texas") %>% st_geometry
border <- st_intersection(texas, border) %>%  st_union
border_states %>% ggplot() +
    geom_sf() +
    geom_sf(data = border, linewidth = 1) +
    geom_sf_label(aes(label = state))

Figure 9.4: Border states and Texas borders

Note the use of geom_sf_label function which, is a specialized version of geom_label that is suitable for writing labels on a map. We then select the counties that belong to any of these states and we compute the distance to the Texas border. It is defined as the distance between the centroid of the county and the closest point of the border. The centroids are obtained using the st_centroid function. We’ve already used st_distance to comp ute the distance between two points. It can also be used to compute the (shortest) distance between a point and a line:

border_counties <- us_counties %>%
    filter(state %in% pull(border_states, state))
centroids <- border_counties %>% st_geometry %>% st_centroid
dists <- st_distance(centroids, border)[, 1] %>% set_units(miles)
border_counties <- border_counties %>% add_column(dists)
head(dists)
## Units: [miles]
## [1] 194.7 157.4 259.2 212.0 241.6 130.3

st_distance returns a matrix of distance, each column corresponding to a line of the second argument. As here there is only one line, we convert this matrix to a vector by taking its first column. The unit of the returned values is meters; we convert it to miles, like in the original article and we add this column to the sf. As in the original article, we fill with different colors in Figure 9.5 counties that are less than 25, 50, 75 and 100 miles from the border:

border_counties %>%
    mutate(dist_class = cut(dists, c(0, 25, 50, 75, 100))) %>% 
  ggplot() + 
  geom_sf(aes(fill = dist_class)) +
  scale_fill_grey(na.translate = FALSE) + 
  geom_sf(data = border)

Figure 9.5: Counties close to the Texan border

Finally, we select the relevant series from the data set of the paper, called mortgage_default, and we merge it with border_counties:

mortdef_sf <- border_counties %>% 
  right_join(mortgage_defaults, by = "fips") %>% 
  mutate(dists = ifelse(state == "Texas", dists, - dists))

Note the use of the quite unusual dplyr::right_join function. By using right_join with border_counties as the first argument and mortgage_defaults as the second argument, we get an object of the class of the first argument (therefore a sf) and we get all the lines of the mortgage_defaults tibble and only those of us_counties that match. We use the convention that the distance is positive for Texan counties and negative for neighboring states. Finally, we plot the discontinuity in Figure 9.6.

mortdef_sf %>%
    filter(abs(dists) < 50) %>% 
    mutate(state = ifelse(state == "Texas", "Texas", "other")) %>% 
    ggplot(aes(dists, default)) +
    micsr::geom_binmeans(aes(size = after_stat(n)), shape = 21) +
    geom_smooth(aes(linetype = state, weight = loans), 
                method = "lm", se = FALSE, color = "black")

Figure 9.6: Discontinuity of the mortgage defaults data set

The intercept is a bit more than 7% outside the border and about 4.5% inside the border, so that the effect of the Texas specific regulation seems significant (a reduction of about 2.5% of the mortgage default rate).

9.2 Two examples

To illustrate the techniques presented in the subsequent sections, we’ll use two data sets. The first one, from Wheeler (2003), is called agglo_growth and deals with the topic of economies and diseconomies of agglomeration. The second one, called sp_growth is from Ertur and Koch (2007) and is used to fit an extension of Solow’s growth model.

Agglomeration economies and diseconomies

The agglo_growth contains data about US counties in 1990, identified by their fips code.

agglo_growth

# A tibble: 3,106 × 14
  fips  emp_gr pop_gr    emp    pop college manuf  unemp income
  <chr>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl>  <dbl>  <dbl>
1 01001  0.203 0.0591 12591. 32259.  0.121  0.248 0.0900  5774.
2 01003  0.367 0.224  29807. 78556.  0.121  0.206 0.0686  5960.
3 01005  0.177 0.0264  8642. 24756.  0.0923 0.344 0.0874  4544.
4 01007  0.224 0.0528  5377. 15723.  0.0490 0.424 0.158   4859.
5 01009  0.248 0.0737 13713. 36459.  0.0529 0.287 0.0892  5213.
# ℹ 3,101 more rows
# ℹ 5 more variables: educ_sh <dbl>, hw_sh <dbl>, pol_sh <dbl>,
#   notwhite <dbl>, type <fct>

emp_gr and pop_gr are the employment and the population growth in each county between 1980 and 1990 and emp and pop the level of employment and of the population in 1980. Wheeler (2003) investigates the existence of:

agglomeration economies, which implies that the growth of a given territory will be positively correlated with its size,
agglomeration diseconomies for large cities that experience congestion, crime, pollution, etc.

The hypothesis is therefore that the relation between size and growth should be inverted U shaped, which means that, for small territories, the agglomeration economies effect is dominant; as for large territories, the agglomeration diseconomies becomes dominant. We’ll reproduce some of the results using the counties of Louisiana. We first join the tibble to the sf called us_counties which contains the geometries of the counties that we’ve already used in the previous section:

louisiana <- us_counties %>%
    right_join(agglo_growth, by = "fips") %>%
    filter(state == "Louisiana")

We first plot a thematic map (Figure 9.7), with colors for counties related to the growth of the population (pop_gr). It is recommended to create first a categorical variable using base::cut, because it is easier to visualize a discrete palette of colors than a continuous one. We use scale_fill_brewer to use one of the palettes provided by the RColorBrewer package, which provides sequential, diverging and qualitative palettes. The first two are suitable to represent the values of numerical variables. Sequential palettes use a color, light for low values of the variable and dark for high values. Divergent palettes use two colors, with light colors for middle range values and dark colors for low and high values. All the available palettes can be displayed using RColorBrewer::display.brewer.all(). We use here the Oranges sequential palette.

louisiana %>%
    mutate(pop_gr = cut(pop_gr, (-3:3) / 10)) %>% 
    ggplot() + geom_sf(aes(fill = pop_gr)) +
    scale_fill_brewer(palette = "Oranges")

Figure 9.7: Population growth and initial population in Louisiana

The inverted U shaped hypothesis between the log of the initial population and the growth rate can be tested by regressing the growth rate on the log population and its square. The later coefficient should then be negative.

mod1 <- lm(pop_gr ~ poly(log(pop), 2, raw = TRUE), louisiana)
mod1 %>% gaze

                               Estimate Std. Error t value Pr(>|t|)
poly(log(pop), 2, raw = TRUE)1  0.68298    0.19333    3.53  0.00079
poly(log(pop), 2, raw = TRUE)2 -0.02974    0.00883   -3.37  0.00132

The coefficient of the square term is negative and significant and we get a maximum growth for the fitted model for a value of log(pop) equal to $-0.683 / (-0.0297 \times 2) = 11.4819$, i.e., a population of about 97 thousand inhabitants. This result is illustrated in Figure 9.8, a scatterplot representing the relationship between the logarithm of the initial population and its growth rate and a fitting line with a second degree polynomial:

louisiana %>% ggplot(aes(pop, pop_gr)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), 
                se = FALSE, color = "black") + 
  scale_x_continuous(trans = "log10")

Figure 9.8: Relation between initial population and population growth in Louisiana

Solow model

The second example is from Ertur and Koch (2007) who estimated a growth model for the countries of the world taking spatial correlation into account. The sp_solow data set is provided by the necountries package:

data("sp_solow", package = "necountries")
sp_solow

# A tibble: 91 × 6
  name      code   gdp60  gdp95 saving labgwth
  <chr>     <chr>  <dbl>  <dbl>  <dbl>   <dbl>
1 Angola    AGO    5136.  2629. 0.0736 0.0233 
2 Argentina ARG   18733. 24738. 0.178  0.0165 
3 Australia AUS   26480. 45331. 0.247  0.0210 
4 Austria   AUT   15283. 45023. 0.261  0.00291
5 Burundi   BDI     889.  1339. 0.0521 0.0168 
# ℹ 86 more rows

There are 91 countries in the sp_solow data set, the series being the gdp in 1960 and 1995, the saving rate (saving) and the growth of the labor force (labgwth). As in Section 3.1.2, we denote i as the saving rate and v as the sum of the growth rate of the labor force and 0.05.⁷ We also compute the annual growth rate (growth) as the difference of the logs of the GDP for 1995 and 1960 divided by 35.

sp_solow <- sp_solow %>%
  rename(i = saving) %>% 
  mutate(v = labgwth + 0.05,
         growth = (log(gdp95) - log(gdp60)) / 35)

We then need to join this data frame to a sf containing the administrative boundaries of the countries and the coordinate of their capital. Several packages enable to get an sf of the world. For example, the spData package has a world sf which is obtained from Natural Earth,⁸ geodata has a world function that enables to download from the gadm website⁹ an object of class SpatVector (than can be easily converted to a sf, as shown in Section 9.1.1). We use here the convenient necountries package that also uses Natural Earth and provides a countries function with different arguments to select a subset of countries of the world. By default, countries returns 199 lines, the 193 United Nations’ recognized countries, the two observer countries (Palestine and Vatican) and four not or not fully recognized countries (Kosovo, Somaliland, Northern Cyprus and Taiwan). Each line includes the geometry of the “main” part of the countries. Some countries have “parts” or “dependencies” that can be included using part = TRUE and dependency = TRUE. A “part” is an area which has the same political status as the rest of the country, but is far from the main part of the country. Examples of parts include Alaska and Hawaii for the United States, Martinique and Guadeloupe for France and Canaries for Spain. A “dependency” is an area with a specific political status. Examples of dependencies are Greenland for Denmark, New Caledonia for France and Gibraltar for the United Kingdom.

sp_solow has columns that contain the names and the iso-codes of the countries (respectively name and code). Any of them can be used to join sp_solow with the countries’ object, but it is much safer to use code, as it avoids the problem of small differences in countries’ names. A lot of countries of the world are not present in sp_solow (especially most of the communist countries). We check whether all the countries of sp_solow are present in the countries object, with the check_join function; the by argument indicates the series in sp_solow that identifies the country:

library(necountries)
countries() %>% check_join(sp_solow, by = "code")
## 
## Countries in the external tibble not in the countries' sf:
##  HKG, ZAR

The two problems are that the D.R. of Congo (iso3 code COD) used to be called Zaire (iso3 code ZAR) and that Hong Kong, which is a part of China for the necountries package, was considered as a sovereign country in Ertur and Koch (2007) study. We then correct the code for the D.R. of Congo, we add Hong Kong using the include argument and we remove Antarctica with the exclude argument:

sp_solow <- sp_solow %>% 
  mutate(code = ifelse(code == "ZAR", "COD", code))
sps <- countries(include = "Hong Kong", exclude = "Antarctica") %>%
    select(iso3, country, point) %>% 
    left_join(sp_solow, by = "code") %>% select(- name)
sps

Simple feature collection with 200 features and 8 fields
Active geometry column: polygon
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: -180 ymin: -55.67 xmax: 180 ymax: 83.12
Geodetic CRS:  WGS 84
# A tibble: 200 × 10
  iso3  country           point                              polygon
  <chr> <chr>       <POINT [°]>                       <GEOMETRY [°]>
1 AFG   Afghani…  (69.18 34.52) POLYGON ((74.54 37.02, 74.37 37.15,…
2 ALB   Albania   (19.82 41.33) POLYGON ((20.57 41.87, 20.6 41.95, …
3 DZA   Algeria   (3.049 36.77) POLYGON ((-4.822 25, -4.166 24.57, …
4 AND   Andorra   (1.527 42.51) POLYGON ((1.707 42.5, 1.722 42.61, …
5 AGO   Angola   (13.23 -8.836) MULTIPOLYGON (((13.07 -4.635, 12.9 …
# ℹ 195 more rows
# ℹ 6 more variables: gdp60 <dbl>, gdp95 <dbl>, i <dbl>,
#   labgwth <dbl>, v <dbl>, growth <dbl>

Note that the resulting sf has two geometry columns, polygons for the borders of the countries (polygon) and points for the capitals (point). Note also that the active geometry column is polygon. We then draw in Figure 9.9 a world map with the color of the countries related to the annual growth during the 1960-95 period and a point with a size related to the initial (1960) GDP. necountries provides a plot method which draws a basic map using ggplot2. A basic thematic map can be drawn using the fill argument to fill the areas of the countries and the centroid or the capital arguments to draw a point at the position of the centroid or the capital of each country. Any ggplot2 functions can be used to customize the graphic:

sps %>% plot(fill = "growth", centroid = "gdp60", palette = "Blues") +
  scale_size_continuous(range = c(0.5, 3)) + 
  theme(legend.position = "bottom") + 
  guides(size = guide_legend(nrow = 3, reverse = TRUE),
         fill = guide_legend(nrow = 3, byrow = TRUE)) + 
  labs(fill = "Growth rate (1960-95)", size = "GDP in 1960")

Figure 9.9: Growth and initial GDP, 1960-1995

The basic Solow model, which was given in Equation 3.3, is then estimated:

lm(log(gdp95) ~ log(i) + log(v), sp_solow) %>% gaze

       Estimate Std. Error t value Pr(>|t|)
log(i)    1.276      0.125   10.19   <2e-16
log(v)   -2.709      0.642   -4.22    6e-05

Remember that the structural model implies that $\beta_s = - \beta_v = \kappa / (1 - \kappa)$. This hypothesis can be tested using a reparametrization of the model:

lm(log(gdp95) ~ log(i / v) + log(v), sp_solow) %>% gaze
##          Estimate Std. Error t value Pr(>|t|)
## log(i/v)    1.276      0.125    10.2   <2e-16
## log(v)     -1.433      0.681    -2.1    0.038

for which the hypothesis is that the coefficient of log(v) equals 0. This hypothesis is rejected at the 5% level (but not at the 1% level). Imposing this hypothesis, we get:

lm(log(gdp95) ~ log(i / v), sp_solow) %>% gaze
##          Estimate Std. Error t value Pr(>|t|)
## log(i/v)    1.379      0.117    11.8   <2e-16

which implies a value of $\kappa$ (the share of the capital in the GDP) equal to $\hat{\beta}_i / (1 + \hat{\beta}_i) = 0.58$ which is implausibly high.

9.3 Spatial correlation

Spatial correlation occurs when one can define a distance relationship between observations. The notion of distance is broad, but we’ll consider in this section only geographical distance. For each observation, one can in this case define a set of neighbors. If the geographical representation of an observation is a polygon (which is the relevant choice for countries or regions), two observations for example can be said to be neighbors if they have a common border. If the geographical representation of an observation is a point (for example for a city), two observations are neighbors if the distance between them is less than, say, 100 kilometers. Once the set of neighbors have been defined for every observations, weights can be computed. The weights can be equal or may depend on the distance between the observation and its neighbors. These two operations of defining the set of neighbors and their weights are performed by the spdep package (Pebesma and Bivand 2023).

Contiguity and weights

To get the matrix of contiguity for the counties of Louisiana, we use the spdep::poly2nb function:

library(spdep)
nb_louis <- poly2nb(louisiana)
nb_louis

Neighbour list object:
Number of regions: 64 
Number of nonzero links: 322 
Percentage nonzero weights: 7.861 
Average number of links: 5.031

The print method indicates the number of contiguity links (322). Instead of storing these links in a full matrix (with 1 for contiguous counties and 0 otherwise) which would have $64^2 = 4096$ cells with only 322 of them with a value of 1 (about 7.9%), an nb object is returned. It is a list of 64 vectors which contains, for each observation, the positions of its neighbors. For example:

nb_louis %>% head(3)

[[1]]
[1] 16 23 40 48 56

[[2]]
[1]  6 16 33 41 48

[[3]]
[1]  4 20 26 55 62 64

The first two counties have five neighbors and the third one has six. A summary method provides more details. poly2nb has a queen argument which is TRUE by default. Queen contiguity means that two polygons which have only one common point are neighbors. Setting queen to FALSE implies that rook continuity is used. In this case, only counties that have a common border are neighbors.

nb_louis_rook <- poly2nb(louisiana, queen = FALSE)

Printing nb_nc_rook, one can check that the number of contiguity links (318) is slightly lower than previously (322). nb objects can’t be plotted as is with ggplot2. We provide a convenient st_nb function which performs this task, as it returns a sf object (see Figure 9.10):

louisiana %>%
    ggplot() +
    geom_sf(fill = NA) +
    geom_sf(data = st_nb(louisiana), linetype = "dotted") +
    geom_sf(data = st_nb(louisiana, queen = FALSE))

Figure 9.10: Queen and rook links for the counties of Louisiana

We first plot queen contiguity links with dotted lines and then rook contiguity with plain lines, so that the specific queen contiguity links appear as dotted segments. A matrix of weights is obtained by using the nb2listw function:

W_louis <- nb_louis %>% nb2listw

A listw object is returned, which contains two lists: the first (neighbours) is the same as the one of the nb object. The second contains weights:

W_louis$weights %>% head(3)

[[1]]
[1] 0.2 0.2 0.2 0.2 0.2

[[2]]
[1] 0.2 0.2 0.2 0.2 0.2

[[3]]
[1] 0.1667 0.1667 0.1667 0.1667 0.1667 0.1667

We can see that weights of neighbors for a given observation are all equal (1/5 for the first two observations which have five neighbors and $1/6$ for the third one which has six neighbors) and sum to one. Therefore, premultiplying a vector $(y - \bar{y})$ by $W$ results in a vector with typical value $\sum_m w_{nm} y_m - \bar{y} = \tilde{y}_n - \bar{y}$. nb and a listw objects can be coerced to matrices using the nb2mat and listw2mat functions.

W_louis %>% listw2mat

Instead of defining the neighborhood as common borders, one can consider distance between points, for example the capitals or the centroids of the countries using the sp_solow data set. Remember that sps contains two sf, polygon and point and that the active geometry is polygon. We first specify point as being the active geometry, using the st_set_geometry function, and we remove all countries for which the data are not available:

sp_solow2 <- sps %>% na.omit %>% st_set_geometry("point")

Then, we use the dnearneigh function to compute the set of neighbors for each country: the first argument is the sf and the next two arguments, called d1 and d2 are mandatory and should be set to the minimum and the maximum distances that should be used to define the neighbors. Note that this distance should be indicated in kilometers if the CRS is geographical, which is the case here.

d <- dnearneigh(sp_solow2, 0, 1000)
d

Neighbour list object:
Number of regions: 91 
Number of nonzero links: 238 
Percentage nonzero weights: 2.874 
Average number of links: 2.615 
18 regions with no links:
3 8 10 17 27 41 44 49 50 55 62 63 64 65 72 74 79 91
29 disjoint connected subgraphs

With a distance of 1000 km, there are 238 links and an average of 2.62 neighbors per country. Note that 18 countries have no neighbors. We then compute the weights, using nb2listwdist. The first two argument are a nb and a sf objects. The type argument indicates how the weights should be computed. Denoting $d_{nm}$ the distance between the two unit $n$ and $m$, with type = "idw", we get $w_{nm} = d_{nm} ^ {- \alpha}$. With type = "exp", the weights are $w_{nm} = \mbox{exp}(- \alpha d_{nm})$. The alpha argument controls the value of $\alpha$ which is 1 by default. For example, if type = "idw" and alpha = 2, the weights are the inverse of the square of the distance. Once the weights have been computed, they can be normalized in different ways, using the style argument. If type = "raw" (the default), no normalization is performed. A usual choice is "W" where the weights of a unit are normalized so that they sum to 1:

w <- nb2listwdist(d, sp_solow2, type = "idw", alpha = 1, 
                  style = "W", zero.policy = TRUE)

Note the use of zero.policy: if TRUE, the matrix of weights is computed even if, as it is the case here, some countries have no neighbors.

Tests for spatial correlation

The spatial correlation hypothesis can be tested either for a specific variable or using the residuals of a linear regression. Two main tests have been proposed. The first is Moran’s $I$ test: denoting $W$ as the weights matrix where the weights sum to one for a given line, the statistic is defined as:

\[ I = \frac{(y - \bar{y})^\top W (y - \bar{y})}{(y - \bar{y})^\top (y - \bar{y})} = \frac{\sum_n (y_n - \bar{y})(\tilde{y}_n - \bar{y})}{\sum_n (y_n - \bar{y})^2} \]

which is simply the ratio of the covariance between $y$ and $\tilde{y}$ and the variance of $y$. It therefore can be obtained as the coefficient of $y$ in a linear regression of $\tilde{y}$ on $y$. If the variance of $\tilde{y}$ and $y$ were equal, it would also be the coefficient of correlation between the value of $y$ for a unit and its neighbors. The Moran test is computed using the spdep::moran.test function which takes as argument a series and a listw object:

moran.test(louisiana$pop_gr, W_louis) %>% gaze
## Moran I = 0.306 (-0.016, 0.078), z = 4.147, pval = 0.000

The value of the statistic is $0.306$. Under the hypothesis of no correlation, the expected value of this statistic is $- 1 / (N-1) = -0.016$ and the standard deviation is $0.078$.¹⁰ The standardized statistic ($4.147$) is asymptotically normal and the hypothesis of no spatial correlation is rejected. An alternative to Moran’s $I$ test is Geary’s $C$ test, which is defined as:

\[ C = \frac{N - 1}{2N}\frac{\sum_n \sum_m w_{nm} (y_n - y_m) ^ 2}{\sum_n (y_n - \bar{y})^ 2} \]

Introducing deviations from the mean in the previous expression and developing, we get:

\[ C = \frac{1}{2}\frac{N - 1}{N} \left(1 - 2 I + \frac{\sum_n (y_m- \bar{y}) ^ 2 \sum_m w_{mn}}{\sum_n (y_n - \bar{y}) ^ 2}\right) \]

If the weight matrix were symmetric, as $\sum_n w_{nm} = 1$, we also would have $\sum_m w_{mn} = 1$, so that the last term is one and $C = \frac{N - 1}{N}(1 - I)$. Therefore, we can expect $C$ to be close to $(1 - I)$. Geary’s test is implemented in the spdep::geary.test function

geary.test(louisiana$pop_gr, W_louis) %>% gaze
## Geary C = 0.681 (1.000, 0.083), z = 3.824, pval = 0.000

These tests are unconditional tests of spatial correlation. The same tests can be performed on the residuals of linear models.

model_louisiana <- lm(pop_gr~poly(log(pop),2,raw =TRUE), louisiana)
lm.morantest(model_louisiana, W_louis) %>% gaze
## Moran I = 0.136 (-0.020, 0.077), z = 2.026, pval = 0.021

The hypothesis of no spatial correlation is still rejected at the 5% level, but the p-value is much higher than the one associated with the unconditional test.

Ertur and Koch (2007) computed the Moran test for the residuals of the OLS estimation of the standard Solow’s growth model. The set of neighbors is obtained with setting an infinite maximum distance, so that all countries are neighbors to each other. Two matrices of weights are computed: one with $w_{nm} = d_{nm}^{-2}$ (W1) and the other with $w_{nm} = \mbox{exp}(-2 d_{nm})$ (W2):

lm_solow <- lm(log(gdp95) ~ log(i) + log(v), sp_solow2)
d <- dnearneigh(sp_solow2, 0, Inf)
W1 <- nb2listwdist(d, sp_solow2, type = "idw", alpha = 2,
                   style = "W", zero.policy = TRUE)
W2 <- nb2listwdist(d, sp_solow2, type = "exp", alpha = 2,
                   style = "W", zero.policy = TRUE)
lm.morantest(lm_solow, W1) %>% gaze
## Moran I = 0.431 (-0.020, 0.065), z = 6.927, pval = 0.000
lm.morantest(lm_solow, W2) %>% gaze
## Moran I = 0.560 (-0.022, 0.125), z = 4.675, pval = 0.000

Whatever the weighting matrix, the hypothesis of no spatial correlation is rejected.

Local spatial correlation

A first glance of local spatial correlation can be obtained using Moran’s plot (see Figure 9.11), implemented in the spdep::moran.plot function, where $y_n - \bar{y}$ is on the $x$ axis and $\tilde{y}_n - \bar{y}$ on the $y$ axis. Therefore, both variables have zero mean and the intercept of the fitting line is therefore 0, the slope being Moran’s $I$ statistic:

moran.plot(louisiana$pop_gr, W_louis)

Each observation is situated in one of the four quarters of the plane. Observations in the upper-right quarter are “high-high” observations, which means that the value of $y$ for observation $n$ and its neighbors are higher than the sample mean. Similarly, the lower-left quarter contains “low-low” observations, i.e., observations for which own values of $y$ and values of its neighbors are lower than the sample mean. The upper-left and the lower-right quarters contain respectively the “low-high” and the “high-low” observations. In case of no spatial correlation, the points should be randomly disposed around the origin and the slope of the regression line should be 0. On the contrary, in case of positive spatial correlation, a majority of points should be of the “low-low” or “high-high” category and the regression line should have a positive slope.

Anselin (1995) proposed local versions of Moran’s $I$ and Geary’s $C$ statistics. We then have for each observation $I_n$ and $G_n$ so that $\sum_n I_n$ and $\sum_n G_n$ are respectively proportional to the global Moran and Geary statistics. Local Moran’s statistics are defined as:

\[ I_n = (y_n - \bar{y}) \sum_m w_{nm} (y_m - \bar{y}) = (y_n - \bar{y})(\tilde{y}_n - \bar{y}) \]

and their sum is $(y_n - \bar{y})(\tilde{y}_n - \bar{y})$, which is the numerator of Moran’s $I$ statistic. Therefore, we have:

\[ I = \frac{\sum_n I_n}{\sum_n (y_n - \bar{y}) ^ 2} \]

Local Moran’s statistics are obtained using the spdep::localmoran function:

locmor <- localmoran(louisiana$pop_gr, W_louis)

which returns a matrix, with a line for every observation and column containing the local Moran values, their expectation, variance, the standardized statistic and the p-value. It’s easier to coerce this matrix to a tibble in order to extract extreme values of the statistic:¹¹

locmor %>% as_tibble %>% rename(pval = 5) %>% filter(pval < 0.01)

# A tibble: 11 × 5
  Ii         E.Ii       Var.Ii     Z.Ii       pval      
  <localmrn> <localmrn> <localmrn> <localmrn> <localmrn>
1  1.17852   -1.603e-02 0.154640    3.038     0.002384  
2  0.93029   -7.606e-03 0.090380    3.120     0.001810  
3 -0.03674   -2.119e-05 0.000175   -2.776     0.005510  
4  2.31977   -6.666e-02 0.744962    2.765     0.005694  
5  3.00564   -8.712e-02 1.210852    2.811     0.004945  
# ℹ 6 more rows

There are therefore 11 out of 64 (17%) observations for which the p-value is lower than 1%, which confirms the presence of spatial correlation. The local Moran statistic can also be represented in a map (Figure 9.12), in order to identify the “hot spots”:¹²

z_locmor <- locmor %>% as_tibble %>% pull(Ii) %>% as.numeric
z_locmor <- z_locmor %>% 
  cut(c(-1, - 0.5, 0, 0.5, 1, 1.5, 2, 2.5, Inf))
louisiana %>% add_column(z = z_locmor) %>% 
  mutate(z = fct_rev(z)) %>% 
  ggplot + geom_sf(aes(fill = z)) + 
  scale_fill_grey()

Figure 9.12: Map of local Moran statistic

Two hot spots are then identified, in the north-west and the south-west of the state.

9.4 Spatial econometrics

Spatial models and tests

We consider here linear gaussian models that are extended in order to integrate the spatial features of the sample, using the weighting matrix described in the previous section. Two main models can be considered. The first one is the spatial error model (SEM) which can be written in matrix form as:

\[ y = \alpha \iota + X \beta + \epsilon \; \mbox{with} \; \epsilon = \rho_\epsilon W \epsilon + \nu \tag{9.1}\]

Therefore, the error for observation $n$ is linearly related to the errors of its neighbors $\tilde{\epsilon}_n$. OLS estimation gives unbiased and consistent estimators but, as always with non-spherical errors, it is inefficient and the estimated covariance matrix of the coefficients based on the simple formula ($\sigma ^ 2 (\tilde{X}^\top \tilde{X}) ^ {-1}$) is biased.

The second model is called the spatial autoregressive model (SAR).¹³. It extends the basic gaussian linear model by adding as a regressor the mean value of the response for the neighbor units. For one observation, the model writes $y_n =\alpha + \beta ^ \top x_n + \rho_y \tilde{y}_n + \epsilon_n$ or, in matrix form:

\[ y = \alpha j + X \beta + \rho_y W y + \epsilon = Z \gamma + \rho_y W y + \epsilon \] The reduced form of the model is:

\[ y = (I - \rho_y W) ^ {-1} Z \gamma + (I - \rho_y W) ^ {-1}\epsilon \tag{9.2}\]

Therefore, the values of $y$ depend on all the values of $\epsilon$, and $\tilde{y}_{n}$ is therefore correlated with $\epsilon_n$. The OLS estimator is therefore biased and inconsistent. Moreover, in Equation 9.2, the spatial dependence in the parameter $\rho_y$ feeds back (R. Bivand, Millo, and Piras 2021, 6), which is not the case for Equation 9.1. This means that for the SEM model, the marginal effect of a covariate is the corresponding coefficient, but this is not the case for the SAR model. Moreover, the matrix $(I - \rho_y W) ^ {-1}$ can be written as an infinite series:

\[ (I - \rho_y W) ^ {-1} = I + \rho_y W + \rho_y^2 W^2 + \rho_y^3 W^3 + \ldots \] Consider the case of three observations, France (1), Italy (2) and Spain (3). The matrix of contiguity is:

\[ W = \left(\begin{array}{rcl} 0 & 0.5 & 0.5 \\ 1 & 0 & 0 \\ 1 & 0 & 0\end{array}\right) \] France has two neighbors (Italy and Spain) and Italy and Spain only one (France). The weights are such that they sum to 1 for each line. Consider a variation of the unique covariate in Spain ($\Delta x_3$) and denote $\beta$ the corresponding coefficient. The direct effect on the response for the three countries is obviously $\Delta y_0 ^ \top = (0, 0, \beta \Delta x_3)$. This increase of $y$ in Spain implies an increase of $y$ in the neighboring country, France. Therefore, $\Delta y_1 ^ \top = (0.5 \rho_y \beta \Delta x_3, 0, 0)$. This increase of $y$ in France implies an increase of $y$ in the neighboring countries, Italy and Spain: $\Delta y_2 ^ \top = (0, 0.5 \rho_y ^ 2 \beta \Delta x_3, 0.5 \rho_y ^ 2 \beta \Delta x_3)$. This increase of $y$ in Italy and Spain implies an increase of $y$ in the neighboring country of Italy and Spain, which is France: $\Delta y_3 ^ \top = (0.5\rho_y ^ 3 \beta \Delta x_3, 0, 0)$, etc. To take a numerical example, consider $\beta \Delta x_3 = 1$ and $\rho_y = 0.3$. Then, $\Delta y_0 ^ \top = (0, 0, 1)$, $\Delta y_1 ^ \top = (0.15, 0, 0)$, $\Delta y_2 ^ \top = (0, 0.045, 0.045)$ and $\Delta y_3 ^ \top = (0.0135, 0, 0)$. In total, we have $\sum_{i = 0} ^ 3 \Delta y_i ^ \top = (0.177, 0.045, 1.045)$. These four effects and their sum is computed below:

W <- matrix(c(0, 1, 1, 0.5, 0, 0, 0.5, 0, 0), nrow = 3)
bdx <- c(0, 0, 1)
rho <- 0.3
dy0 <- diag(3) %*% bdx
dy1 <- rho * W %*% bdx
dy2 <- rho ^ 2 * W %*% W %*% bdx
dy3 <- rho ^ 3 * W %*% W %*% W %*% bdx
dy_03 <- cbind(dy0, dy1, dy2, dy3)
dy_03

     [,1] [,2]  [,3]   [,4]
[1,]    0 0.15 0.000 0.0135
[2,]    0 0.00 0.045 0.0000
[3,]    1 0.00 0.045 0.0000

apply(dy_03, 1, sum)

[1] 0.1635 0.0450 1.0450

The total effect of $\Delta x_3$ is obtained using the formula: $\Delta y = (I - \rho_y W) ^ {-1} \Delta x$:

solve(diag(3) - rho * W) %*% bdx

        [,1]
[1,] 0.16484
[2,] 0.04945
[3,] 1.04945

which is very close to the sum of the direct effect and the first three indirect effects computed previously.

Spatial models are usually estimated by maximum likelihood, by assuming a multivariate normal distribution for the iid errors. For the SEM model, the idiosyncratic errors can be written as a function of the response:

\[ \nu = (I - \rho_\epsilon W) y - (I - \rho_\epsilon W) Z\gamma \tag{9.3}\]

so that the Jacobian of the transformation is

\[\left| \frac{\partial \nu}{\partial y} \right| = \left| I - \rho_\epsilon W \right|\].

the likelihood is similar to the one of the linear gaussian model except that an extra term, which is the log of the Jacobian, should be added:

\[ -N/2 (\ln 2\pi + \ln \sigma ^ 2) + \ln \left| I - \rho_\epsilon W \right| - \frac{1}{\sigma ^ 2} \nu^\top \nu \] where $\nu$ is given by Equation 9.3. For the SAR model, Equation 9.2 indicates that the Jacobian is the same, so that the log-likelihood is:

\[ -N/2 (\ln 2\pi + \ln \sigma ^ 2) + \ln \left| I - \rho_y W \right| - \frac{1}{\sigma ^ 2} \epsilon^\top \epsilon ^ 2 \]

with $\epsilon = y - Z \gamma - \rho_y Wy$.

These two models can be augmented by spatial lags of the covariates, they are in this case called Durbin’s models. There is an inserting relationship between Durbin’s SAR model and the SEM model. The former can be written:

\[ y = \rho_y W y + Z\gamma + W Z\theta + \epsilon \]

with $\epsilon$ iid errors. The common factor hypothesis is that $\theta = - \rho_y \gamma$. In this case, we have:

\[ (I - \rho_y W) y = (I - \rho_y W) Z\gamma + \epsilon \] Or finally:

\[ y = Z\gamma + (I - \rho_y W) ^{-1} \epsilon \] which is the SEM model, as denoting $\eta = (I -\rho_y W) ^ {-1} \epsilon$ the errors of this model, we have $y = Z\gamma + \eta$, with $\eta = \rho_y W \eta + \epsilon$.

Once it has been established that there is some spatial dependence, one has to choose the “right” specification. Several tests have been proposed, based on the three test principles. Lagrange multiplier tests proposed by Anselin et al. (1996) are popular, as they only require the OLS estimation. Different flavors of tests have been proposed, testing $\rho_y = 0$, $\rho_\epsilon = 0$ or $\rho_y = \rho_\epsilon = 0$. The basic version of, for example the first test ($\mbox{H}_{0}: \rho_y = 0$) has, as a maintained hypothesis that $\rho_\epsilon = 0$. A robust version of the test has been proposed which doesn’t impose this maintained hypothesis. The common factor hypothesis can easily be tested using a likelihood ratio test. The Wald statistic is less easy to compute, as a set of non-linear hypotheses $\theta = - \rho_y \beta$ should be tested.

Application to the growth model

To overcome the irrelevant empirical results of the standard Solow model, Ertur and Koch (2007) consider an endogenous growth model with spillovers. The production function is as usual a Cobb-Douglas:

\[ Y_n(t) = A_n(t) K_n ^ \kappa(t) L_n ^ {1 -\kappa}(t) \] $\kappa$ being the elasticity of the production with the capital and also the share of the profits in the national product. The level of technology is given by:

\[ A_n(t) = \Omega(0) e ^ {\mu t} k_n ^ \phi(t) \prod_{m \neq n} ^ N A_n^{\gamma w_{nm}}(t) \]

where $k_n(t) = K_n(t) / L_n(t)$ is the physical capital per worker, $\Omega(0)$ is the initial level of the technology and $\mu$ is a constant rate of technological growth, as in the Solow model. The next term takes into account the spillover effect of domestic investment, following Romer (1986), and the strength of this spillover effect is measured by $\phi$. The last term is specific to the model of Ertur and Koch (2007), for which the spillovers are not restricted to domestic investment, but concerns also the technology of neighboring countries. The effect of the technology of country $m$ on the technology of country $n$ is the product of a constant parameter $\gamma$ and a specific weight, $w_{nm}$, which is a decreasing function of the distance between both countries. Ertur and Koch (2007) showed (equation 10, page 1038) that, at the steady state, the output per capita is:

\[ \begin{array}{rcl} \ln y_n ^ * &=& \frac{1}{1 - \kappa - \phi} \ln \Omega + \frac{\kappa + \phi}{1 - \kappa - \phi} \ln i_n - \frac{\kappa + \phi}{1 - \kappa - \phi} \ln v_n \\ &-& \frac{\gamma\kappa}{1 - \kappa - \phi} \sum_{m\neq n}^ N w_{nm} \ln i_m + \frac{\gamma\kappa}{1 - \kappa - \phi} \sum_{m\neq n}^ N w_{nm} \ln v_n \\ &+& \frac{\gamma(1 - \kappa)}{1 - \kappa - \phi} \sum_{m\neq n}^ N w_{nm} \ln y_m ^ * \end{array} \]

In matrix form, denoting $y = \ln y^*$, $X = (\ln i \ln v)$, $\beta = \left(\begin{array}{c} (\kappa + \phi) / (1 - \kappa - \phi)\\ - (\kappa + \phi) / (1 - \kappa - \phi)\end{array}\right)$, $\theta = \left(\begin{array}{c} \gamma \kappa / (1 - \kappa - \phi)\\ - \gamma \kappa / (1 - \kappa - \phi)\end{array}\right)$ and $\rho_y = \gamma (1 - \kappa) / (1 - \kappa - \phi)$, the model can be written as:

\[ y = \alpha i + X \beta + WX \theta + \rho_y Wy + \epsilon \] if $\gamma = 0$, the model reduces to the model of Romer (1986) and the last two terms disappear. Note that in this case $\kappa$ and $\phi$ are not identified, but only their sum. Moreover, if $\phi$ is also 0, we get the Solow model. Whether $\phi$ equals zero or not, the resulting model is a standard linear model already estimated in Section 9.2.2:

ols_solow <- lm(log(gdp95) ~ log(i) + log(v), sp_solow2)

Moreover, imposing the theoretical constraint of the growth model, we must have the sum of the two coefficients of log(i) and log(v) equal to 0. Therefore, the constrained model can be estimated using the difference of the logarithms of the two covariates as the unique regressor. For convenience, we call the ratio of the two covariates i and we store it in a new data frame called sp_solow3. The constrained OLS model can then be estimated:

sp_solow3 <- sp_solow2 %>% mutate(i = i / v)
ols_solowc <- lm(log(gdp95) ~ log(i), sp_solow3)

The hypothesis of $\gamma = 0$ can be tested using Lagrange multipliers tests, testing either the alternative that there is a spatial lag or correlated errors. spdep::lm.LMtests provides different flavors of these tests. Its two main arguments are a lm model and a matrix of spatial weights. The test argument can be either "LMerr" and “LMlag” for the standard tests of uncorrelated errors and absence of spatial lag, "RLMerr" and "RLMlag" for the robust versions of these two tests and "SARMA", suitable when the alternative hypothesis is the presence of both correlation of the errors and a spatial lag. test = "all" enables to perform the five tests:

lm.LMtests(ols_solow, W1, test = "all") %>% gaze
## RSerr   : 39.874, df = 1, p-value =  0.000
## RSlag   : 51.467, df = 1, p-value =  0.000
## adjRSerr:  3.004, df = 1, p-value =  0.083
## adjRSlag: 14.596, df = 1, p-value =  0.000
## SARMA   : 54.471, df = 2, p-value =  0.000

All the tests reject the hypothesis of uncorrelated errors or/and spatial lag, except the robust test of uncorrelated errors. Therefore, these tests seem to indicate that the correct model is a spatial lag model. We then estimate the three spatial models using the spatialreg package (Pebesma and Bivand 2023) which provides three functions: lagsarlm for SAR models, sacsarlm for SEM model and errorsarlm for SAR-SEM models. The three first arguments are formula, data and listw, which should be a listw object containing the weights. The Durbin argument enables to estimate Durbin’s versions of the model; this can be done either by setting it to TRUE (a spatial lag is then added for all the covariates) or a formula indicating the subset of covariates for which spatial lags have to be added. We estimate also, as for the OLS estimator, the variant of the model that imposes the theoretical restrictions on the coefficients. The results are presented in Table 9.1.

library(spatialreg)
m1 <- lagsarlm(log(gdp95) ~ log(i) + log(v), 
               sp_solow2, W1, Durbin = TRUE)
m2 <- errorsarlm(log(gdp95) ~ log(i) + log(v), 
                 sp_solow2, W1, Durbin = FALSE)
m3 <- sacsarlm(log(gdp95) ~ log(i) + log(v), 
               sp_solow2, W1, Durbin = TRUE)
m1c <- update(m1, . ~ . - log(v), data = sp_solow3)
m2c <- update(m2, . ~ . - log(v), data = sp_solow3)
m3c <- update(m3, . ~ . - log(v), data = sp_solow3)

We can first test the theoretical restrictions using either a likelihood ratio or a Wald test:

library(lmtest)
library(car)
linearHypothesis(ols_solow, "log(i) + log(v) = 0", 
                 test = "Chisq") %>% gaze
## Chisq = 4.427, df: 1, pval = 0.035
lrtest(ols_solow, ols_solowc) %>% gaze
## Chisq = 4.467, df: 1, pval = 0.035
linearHypothesis(m1, c("log(i) + log(v) = 0", 
                       "lag.log(i) + lag.log(v) = 0")) %>% gaze
## Chisq = 2.382, df: 2, pval = 0.304
lrtest(m1, m1c) %>% gaze
## Chisq = 2.377, df: 2, pval = 0.305

As seen previously, the theoretical restrictions are rejected at the 5% level using a Wald test for the Solow model. This is not the case for the spatial lag model. Note also that the Wald and the likelihood ratio tests give very similar results. An interesting feature of the model of Ertur and Koch (2007) is that their specification enables the identification of all the structural parameters of the model, $\kappa$, $\phi$ and $\gamma$. If $\phi = 0$, $\theta = \gamma \beta$ and $\rho_y = \gamma$, so that: $y = \alpha j+ (I - \gamma W)X\beta + \gamma y + \epsilon$, the reduced form of the model is, in matrix form:

\[ y = \alpha j + X \beta + (I - \gamma W) ^ {-1} \epsilon \] which is a SEM model without Durbin terms. Therefore, the hypothesis that $\phi = 0$ is in the context of this model the hypothesis of common factors and can easily be tested using a likelihood ratio test:

lrtest(m1c, m2c) %>% gaze
## Chisq = 7.033, df: 1, pval = 0.008

The hypothesis of common factor is rejected at the 1% level. Our preferred specification is therefore a spatial lag model imposing the theoretical restrictions. The structural parameters of the model ($s$) can be computed using the fitted reduced form parameters ($r$). The relation between $r$ and $s$ is:

Table 9.1: Results of the growth models
	OLS	SAR	SEM	SARMA	OLSc	SARc	SEMc
$\ln i$	1.276	0.837	0.843	0.876	1.379	0.863	0.863
	(0.125)	(0.099)	(0.099)	(0.102)	(0.117)	(0.099)	(0.099)
$\ln v$	−2.709	−1.636	−1.848	−1.737
	(0.642)	(0.555)	(0.550)	(0.562)
$W \ln i$		−0.338		−0.325		−0.278
		(0.182)		(0.392)		(0.179)
$W\ln v$		0.566		0.076
		(0.851)		(1.398)
$\rho_y$		0.746		0.633		0.740
		(0.070)		(0.338)		(0.070)
$\rho_\epsilon$			0.823	0.317			0.835
			(0.053)	(0.509)			(0.050)
Num.Obs.	91	91	91	91	91	91	91
Log.Lik.	−102.177	−73.451	−76.569	−72.912	−104.410	−74.640	−78.156
F	74.025				138.297

\[ r ^ \top = (\beta, \theta, \rho) = \left(\frac{\kappa + \phi}{1 - \kappa - \phi}, - \gamma \frac{\kappa}{1 - \kappa - \phi}, \gamma \frac{1 - \kappa}{1 - \kappa - \phi}\right) \] Inversing this relation, we get the structural parameters $s$ as a function of the reduced form parameters:

\[ s ^ \top = (\kappa, \phi, \gamma) = \left(\frac{\theta}{\theta - \rho}, \frac{\beta}{1 + \beta} - \frac{\theta}{\theta - \rho}, \frac{\rho - \theta}{1 + \beta}\right) \]

For our model, we get:

beta <- unname(coef(m1c)["log(i)"])
theta <- unname(coef(m1c)["lag.log(i)"])
rho <- unname(coef(m1c)["rho"])
kappa <- theta / (theta - rho)
phi <- beta / (1 + beta) - theta / (theta - rho)
gamma <- (rho - theta) / (1 + beta)
s <- c(kappa = kappa, phi = phi, gamma = gamma)
s
##  kappa    phi  gamma 
## 0.2729 0.1902 0.5463

To apply the delta method, we compute the matrix of derivatives of $s$ with respect to $r$:

\[ \Gamma(r) = \frac{\partial s}{\partial r ^ \top}(r)= \left( \begin{array}{rcl} 0 & - \frac{\rho}{(\theta - \rho) ^ 2} & \frac{\theta}{(\theta - \rho) ^ 2} \\ \frac{1}{(1 + \beta) ^ 2} & \frac{\rho}{(\theta - \rho) ^ 2} & - \frac{\theta}{(\theta - \rho) ^ 2} \\ - \frac{\theta - \rho}{(1 + \beta) ^ 2} & - \frac{1}{1 + \beta} & \frac{1}{1 + \beta} \end{array} \right) \] and the asymptotic variance of $s$ is then:

\[ \hat{V}(\hat{r}) = \Gamma(\hat{r}) \hat{V}(\hat{s})\Gamma(\hat{r}) ^ \top \]

Vr <- vcov(m1c)[c("log(i)", "lag.log(i)", "rho"), 
                c("log(i)", "lag.log(i)", "rho")]
G <- c(0, 
       - rho / (theta - rho) ^ 2, theta / (theta - rho) ^ 2,
       1 / (1 + beta) ^ 2, rho / (theta - rho) ^ 2, 
       - theta / (theta - rho) ^ 2,
       (theta - rho) / (1 + beta) ^ 2, 
       - 1 / (1 + beta), 1 / (1 + beta))
G <- matrix(G, nrow = 3, byrow = TRUE)
V <- G %*% Vr %*% t(G)

We finally compute the table of the structural parameters:

stdev <- V %>% diag %>% sqrt
z <- s / stdev
p <- pnorm(abs(z), lower.tail = FALSE) * 2
struct_par <- cbind(Estimate = s, "Std. Error" = stdev, 
                    "z value" = z, "Pr(>|z|" = p)
rownames(struct_par) <- names(s)
struct_par

      Estimate Std. Error z value   Pr(>|z|
kappa   0.2729     0.1182   2.309 2.095e-02
phi     0.1902     0.1071   1.775 7.587e-02
gamma   0.5463     0.1159   4.712 2.453e-06

The value of $\kappa$ is compatible with the observable share of the profits in the national product. $\phi$ is significant only at the 10% level. Finally, $\gamma$ is highly significant, which is a confirmation of the relevance of the model of Ertur and Koch (2007).

Anselin, Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2): 93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x.

Anselin, Luc, Anil K. Bera, Raymond Florax, and Mann J. Yoon. 1996. “Simple Diagnostic Tests for Spatial Dependence.” Regional Science and Urban Economics 26 (1): 77–104. https://doi.org/10.1016/0166-0462(95)02111-6.

Bivand, Roger S., and David W. S. Wong. 2018. “Comparing Implementations of Global and Local Indicators of Spatial Association.” Test 27: 716–48.

Bivand, Roger, Giovanni Millo, and Gianfranco Piras. 2021. “A Review of Software for Spatial Econometrics in r.” Mathematics 9 (11). https://doi.org/10.3390/math9111276.

Ertur, Cem, and Wilfried Koch. 2007. “Growth, Technological Interdependence and Spatial Externalities: Theory and Evidence.” Journal of Applied Econometrics 22 (6): 1033–62. http://www.jstor.org/stable/25146563.

Giraud, Timothée. 2023. Mapsf: Thematic Cartography. https://CRAN.R-project.org/package=mapsf.

Hijmans, Robert J. 2023. Terra: Spatial Data Analysis. https://CRAN.R-project.org/package=terra.

Kumar, Anil. 2018. “Do Restrictions on Home Equity Extraction Contribute to Lower Mortgage Defaults? Evidence from a Policy Discontinuity at the Texas Border.” American Economic Journal: Economic Policy 10 (1): 268–97. https://www.jstor.org/stable/26529015.

Pebesma, Edzer, and Roger S. Bivand. 2023. Spatial Data Science with Applications in R. Chapman & Hall. https://r-spatial.org/book/.

Romer, Paul M. 1986. “Increasing Returns and Long-Run Growth.” Journal of Political Economy 94 (5): 1002–37. http://www.jstor.org/stable/1833190.

Tennekes, Martijn. 2018. “tmap: Thematic Maps in R.” Journal of Statistical Software 84 (6): 1–39. https://doi.org/10.18637/jss.v084.i06.

Wheeler, Christopher H. 2003. “Evidence on Agglomeration Economies, Diseconomies, and Growth.” Journal of Applied Econometrics 18 (1): 79–104. http://www.jstor.org/stable/30035189.

sf (Pebesma and Bivand 2023) and terra (Hijmans 2023) respectively supersede the sp and the raster packages.↩︎
The latitude is a coordinate that specifies the north/south position of a point on earth. It is an angular value equal to 0 at the equator and to -/+ 90 at the pole. The value is positive in the northern hemisphere and negative in the southern hemisphere.↩︎
The longitude is a coordinate that specifies the east/west position of a point on earth. Meridians are lines that connect the two poles, and the longitude is the angular value of the position of a point to the reference (Greenwich) meridian.↩︎
This function is an interface to the http://www.gadm.org site which provides the boundaries of all countries in the world with different administrative level. The second argument set to 1 means that we want the boundaries of French regions, and the third argument set to "." means that the file is stored in the current directory.↩︎
It is actually an object of class SpatVector, used by the terra package.↩︎
With the convention that the sign of the distance is different on both sides of the border.↩︎
0.05 being the sum of the rate of depreciation and the rate of technical progress, assumed to be the same for all countries.↩︎
<https://www.naturalearthdata.com/>↩︎
<https://gadm.org/>↩︎
The expression of the variance of the Moran statistic can be found in R. S. Bivand and Wong (2018).↩︎
Note that we also rename for convenience the fifth column which has a non-standard name (Pr(z != E(Ii))) to pval.↩︎
Note that the orders of the levels are reversed using forcats::fct_rev, so that the high values are represented in dark gray.↩︎
This model is also known as the spatial lag model.↩︎

# Spatial econometrics {#sec-spatial_econometrics} ```{r} #| echo: false source("../_commonR.R") st_nb <- function(x, queen = TRUE){ .crs <- st_crs(x) .nb <- poly2nb(x, queen = queen) as(nb2lines(.nb, coords = st_centroid(st_geometry(x))), "sf") } ``` Spatial econometrics is a very dynamic field in modern econometrics. Geolocated data are now frequently available, even when the data doesn't concern geographic entities like countries, regions or towns, but households or firms. From their geographic coordinates, one is able to define for every observation a set of neighbors, and the interactions between neighbors can then be taken into account. There are two very different types of geographical data: vectors and rasters. A vector is a point or a set of points that define a line. A raster is a grid that contains the value of one variable (for example, a numeric indicating the elevation or a factor indicating land use). Relatively recently, two **R** packages have emerged that provide plenty of function that enables to deal easily with vectors and rasters. These are respectively **sf** (**simple feature**) and **terra**.^[**sf** [@PEBE:BIVA:23]\index[author]{Pebesma}\index[author]{Bivand} and **terra** [@HIJM:23] respectively supersede the **sp** and the **raster** packages.] In this chapter, we'll consider only vectors. The first two sections are devoted to simple features; @sec-simple_features presents the structure of simple features objects and @sec-sf_computation, using the example of a spatial RD design, illustrates how to deal with simple features in a statistical analysis. @sec-spatial_correlation deals with the detection and the measurement of spatial correlation. Finally, @sec-spatial_models presents some popular spatial models, namely the spatial error and the spatial autoregressive models. ## Simple features {#sec-simple_features} In a spatial statistical analysis, the first task is to get geographical information about the observations. This is usually done by importing external files in **R**, for example shapefiles. This task is performed by the **sf** library and the result is an `sf` object. ### Structure of a simple feature {#sec-struct_sf} To understand what a simple feature is, it is best to construct a small one "by hand". The geographical representation of an observation is an `sfg` for simple feature geometry. It can be a point, a set of points that form a line or a polygon if the line ends where it started. We first load the **sf** package: ```{r } #| message: true library(sf) ``` The message indicates that **sf** uses three important external libraries: - **GEOS** to compute topological operations, - **GDAL** to read external files with a large variety of formats, - **PROJ** to transform the coordinates in a specific **CRS** (coordinate reference system). We consider in this example the three main cities in France. For Paris, the latitude^[The latitude is a coordinate that specifies the north/south position of a point on earth. It is an angular value equal to 0 at the equator and to -/+ 90 at the pole. The value is positive in the northern hemisphere and negative in the southern hemisphere.] is $48.87$ and the longitude^[The longitude is a coordinate that specifies the east/west position of a point on earth. Meridians are lines that connect the two poles, and the longitude is the angular value of the position of a point to the reference (Greenwich) meridian.] is $2.33$. Most of **sf**'s functions start with `st_`. The `st_point` function can be used to construct a `sfg` (simple feature geometry) for Paris, the argument being a numeric vector containing the longitude and the latitude. \idxfun{st\_point}{sf} ```{r } #| results: true paris <- st_point(c(2.33, 48.87)) ``` We then perform the same operation for Lyon and Marseilles: \idxfun{st\_point}{sf} ```{r } lyon <- st_point(c(4.85, 45.76)) marseilles <- st_point(c(5.37, 43.30)) ``` We then construct a `sfc` (simple feature column) which is a set of `sfg`s, using the `st_sfc` function. The different elements are entered as unnamed arguments of the function and a `crs` argument can be used to specify the **CRS**. Three formats exist to describe the **CRS**: **proj4string** (a character string), **EPSG** (an integer), and **WKT**, for well known text (a list). We can use here either `"+proj=longlat +datum=WGS84 +no_defs"` or `4326`. The datum is the representation of the earth from which the latitudes and the longitudes are computed. \idxfun{st\_sfc}{sf} ```{r } cities_coords <- st_sfc(paris, lyon, marseilles, crs = 4326) cities_coords ``` Printing the `sfc` gives interesting information, as the bounding box (the coordinates of the rectangle that contains the data) and the **CRS**. Finally, we can construct a `sf` object using `st_sf` by binding a data frame containing information about the cities and the `sfc`. \idxfun{tibble}{tibble}\idxfun{st\_sf}{sf} ```{r } cities_data <- tibble(pop = c(2.161, 0.513, 0.861), area = c(105.4, 47.87, 240.6)) cities <- st_sf(cities_data, cities_coords) cities ``` A `sf` is just a data frame with a specific column that contains the coordinates of the observations. This column (called the geometry column) can be extracted using the `st_geometry` function: \idxfun{st\_geometry}{sf} ```{r } #| results: "hide" #| echo: true cities %>% st_geometry ``` and is "sticky", which means that if some columns are selected, the geometry column is always returned along with the selected columns, i.e.: ```{r } #| results: false cities %>% select(pop) ``` returns `pop` **and** `cities_coords`, even if the latter colon was not selected. We can then plot our three cities, along with a map of France, which is obtained using the `geodata::gadm` function.^[This function is an interface to the <http://www.gadm.org> site which provides the boundaries of all countries in the world with different administrative level. The second argument set to 1 means that we want the boundaries of French regions, and the third argument set to `"."` means that the file is stored in the current directory.] \idxfun{gadm}{geodata} ```{r } france <- geodata::gadm("FR", 1, ".") ``` `france` is not an object of class `sf`,^[It is actually an object of class `SpatVector`, used by the **terra** package.] so we first coerce it to a `sf` using the `st_as_sf` function, and then we extract the geometry and the series called `NAME_1` which contains the name of the regions: \idxfun{st\_as\_sf}{sf}\idxfun{select}{dplyr} ```{r } france <- france %>% st_as_sf %>% select(region = NAME_1) ``` Imported vector data are often large: \idxfun{object.size}{utils}\idxfun{format}{base} ```{r } #| collapse: true france %>% object.size %>% format("MB") ``` and they can be simplified using the `rmapshaper::ms_simplify` function, with a `keep` argument which indicates the degree of simplification ($0.01$ means that we seek to obtain a `sf` 100 times lighter than the initial one). \idxfun{ms\_simplify}{rmapshaper}\idxfun{object.size}{utils}\idxfun{format}{base} ```{r } #| collapse: true france <- france %>% rmapshaper::ms_simplify(keep = 0.01) france %>% object.size %>% format("MB", digits = 2) ``` **sf** provides a `plot` method to get quickly a map of the data. Note that a thematic map is plotted for all the series of the `sf`, and it is therefore recommended to select first a unique series. If one is only interested in the vectors, they can be extracted before plotting using `st_geometry`: \idxfun{st\_geometry}{sf} ```{r} #| results: hide #| eval: false france %>% st_geometry %>% plot ``` For more advanced maps, several specialized packages are available, in particular **tmap** [@TENN:18]\index[author]{Tennekes} and **mapsf** [@GIRA:23]\index[author]{Giraud}. In this chapter, we'll only use **ggplot**, which provides a `geom_sf` function. This geom is very special compared to other geoms, as the kind of geom that is plotted depends on the geometry column of the `sf`: if `cities` is provided as the data argument, points will be plotted; but if `france` is provided, polygons will be drawn. In the following code, we start with `france` as the data argument of `ggplot` and then the call to `geom_sf` results in the drawing of the administrative borders of French regions. Then we use a second time `geom_sf` with this time `cities` as data argument, so that points are drawn for the three cities. We use here `aes(size = pop)` so that the size of the points is related to the population of the cities. The result is presented in @fig-map_france. \idxfun{ggplot}{ggplot2}\idxfun{aes}{ggplot2}\idxfun{geom\_sf}{ggplot2} ```{r } #| fig-cap: "Map of France with its three main cities" #| label: fig-map_france france %>% ggplot() + geom_sf() + geom_sf(data = cities, aes(size = pop)) ``` **sf** provides several functions to compute values of interest. For example, to get the distance from Paris to Lyon and Marseilles, we would use: \idxfun{st\_distance}{sf} ```{r } #| collapse: true st_distance(cities[1, ], cities[2:3, ]) ``` `st_distance` always returns a matrix, the number of elements of the first (second) argument being the number of rows (columns) of the matrix. If only one argument is supplied, the result is a square matrix with 0 on the diagonal. Note that the numbers have a unit of measurement, which is here meters. To define a unit and to convert from one unit to another, the **units** package provides the `set_units` function. Consider the following example: we first provide a numeric ($42.195$), we define its unit as kilometers, and then we convert it to miles: \idxfun{set\_units}{units} ```{r} #| collapse: true library(units) d <- 42.195 dkm <- d %>% set_units(km) dkm dkm %>% set_units(miles) ``` Even when the conversion is simple, it is advisable to use `set_units` instead of applying the conversion "by hand": \idxfun{set\_units}{units} ```{r} #| collapse: true dm <- dkm %>% set_units(m) dm dkm * 1000 ``` The numerical values are the same, but in the second case the unit hasn't changed and is still kilometers. `st_area` computes the area of a polygon: \idxfun{set\_units}{units}\idxfun{st\_area}{sf} ```{r } #| collapse: true st_area(france) %>% units::set_units(km ^ 2) ``` ### Computation on sf objects {#sec-sf_computation} \index[general]{regression discontinuity!spatial|(} The regression discontinuity framework can be adapted to consider geographic discontinuities. Some entities are considered on both sides of a border and the forcing variable is then the distance to the border.^[With the convention that the sign of the distance is different on both sides of the border.] The `us_counties` data set contains the borders of US counties: ```{r } #| message: false us_counties %>% print(n = 3) ``` @KUMA:18\index[author]{Kumar} investigates the effect of a legal restriction on home equity extraction which is specific to Texas on mortgage defaults. Kumar measures mortgage default rates in Texas and in the bordering states and compares mortgage default rates for counties which are closed to the Texas border (on both sides). Let's first plot the map of the counties, using `ggplot` and `geom_sf` (see @fig-counties): \idxfun{filter}{dplyr}\idxfun{ggplot}{ggplot2}\idxfun{geom\_sf}{ggplot2} ```{r } #| label: fig-counties #| fig-cap: Map of American counties #| warning: false us_counties %>% filter(! state %in% c("Alaska", "Hawaii")) %>% ggplot() + geom_sf() ``` The polygons can be merged using the `group_by` / `summarise` functions of **dplyr**. If the `sf` is grouped by states, the statistics computed by the `summarise` function is performed at the state level and the `geometry` now contains the coordinates of the states. With a void call to `summarise`, we get only this new `geometry`, and we can use it to plot a map of states (see @fig-states): \idxfun{filter}{dplyr}\idxfun{group\_by}{dplyr}\idxfun{summarise}{dplyr}\idxfun{ggplot}{ggplot2}\idxfun{geom\_sf}{ggplot2} ```{r } #| label: fig-states #| fig-cap: Map of American states #| warning: false states <- us_counties %>% filter(! state %in% c("Alaska", "Hawaii")) %>% group_by(state) %>% summarise() states %>% ggplot() + geom_sf() ``` To get the border states of Texas, we use spatial indexation, i.e., we use the `[` operator. The first argument is used to select rows; it is normally a vector of integers (the positions of the lines to extract), a vector of characters (the names of the lines to extract) or a logical vector. Here we can index an `sf` by another `sf`, and the result is (by default) a new `sf` containing the elements of the first one which has common points with the second one: \idxfun{filter}{dplyr} ```{r } border_states <- states[filter(states, state == "Texas"), ] border_states ``` We then compute the Texas border. It is defined by the common points of the borders of Texas and its border states. This is obtained using the `st_intersection` function which returns four lines (one for each border state), and the border is then obtained by merging these four lines using the `st_union` function. The result is presented in @fig-border. \idxfun{filter}{dplyr}\idxfun{st\_geometry}{sf}\idxfun{st\_intersection}{sf}\idxfun{st\_union}{sf} \idxfun{ggplot}{ggplot2}\idxfun{geom\_sf}{ggplot2}\idxfun{geom\_sf\_label}{ggplot2}\idxfun{aes}{ggplot2} ```{r } #| fig-cap: Border states and Texas borders #| label: fig-border #| warning: false texas <- filter(border_states, state == "Texas") %>% st_geometry border <- filter(border_states, state != "Texas") %>% st_geometry border <- st_intersection(texas, border) %>% st_union border_states %>% ggplot() + geom_sf() + geom_sf(data = border, linewidth = 1) + geom_sf_label(aes(label = state)) ``` Note the use of `geom_sf_label` function which, is a specialized version of `geom_label` that is suitable for writing labels on a map. We then select the counties that belong to any of these states and we compute the distance to the Texas border. It is defined as the distance between the centroid of the county and the closest point of the border. The centroids are obtained using the `st_centroid` function. We've already used `st_distance` to comp ute the distance between two points. It can also be used to compute the (shortest) distance between a point and a line: \idxfun{filter}{dplyr}\idxfun{pull}{dplyr}\idxfun{st\_geometry}{sf}\idxfun{st\_centroid}{sf}\idxfun{st\_distance}{sf}\idxfun{set\_units}{units}\idxfun{add\_column}{tibble}\idxfun{head}{utils} ```{r } #| collapse: true border_counties <- us_counties %>% filter(state %in% pull(border_states, state)) centroids <- border_counties %>% st_geometry %>% st_centroid dists <- st_distance(centroids, border)[, 1] %>% set_units(miles) border_counties <- border_counties %>% add_column(dists) head(dists) ``` `st_distance` returns a matrix of distance, each column corresponding to a line of the second argument. As here there is only one line, we convert this matrix to a vector by taking its first column. The unit of the returned values is meters; we convert it to miles, like in the original article and we add this column to the sf. As in the original article, we fill with different colors in @fig-close_counties_dist counties that are less than 25, 50, 75 and 100 miles from the border: \idxfun{mutate}{dplyr}\idxfun{ggplot}{ggplot2}\idxfun{geom\_sf}{ggplot2}\idxfun{scale\_fill\_grey}{ggplot2}\idxfun{geom\_sf}{ggplot2}\idxfun{aes}{ggplot2} ```{r } #| label: fig-close_counties_dist #| fig-cap: Counties close to the Texan border #| warning: false border_counties %>% mutate(dist_class = cut(dists, c(0, 25, 50, 75, 100))) %>% ggplot() + geom_sf(aes(fill = dist_class)) + scale_fill_grey(na.translate = FALSE) + geom_sf(data = border) ``` Finally, we select the relevant series from the data set of the paper, called `mortgage_default`, and we merge it with `border_counties`: \idxfun{full\_join}{dplyr}\idxfun{mutate}{dplyr}\idxfun{st\_sf}{sf} ```{r} #| include: false #| eval: false mortdef_sf <- mortgage_defaults %>% full_join(border_counties, by = c("fips")) %>% st_sf(agr = "geometry") %>% mutate(dists = ifelse(state == "Texas", dists, - dists)) ``` \idxfun{right\_join}{dplyr}\idxfun{mutate}{dplyr}\idxfun{ifelse}{base} ```{r } mortdef_sf <- border_counties %>% right_join(mortgage_defaults, by = "fips") %>% mutate(dists = ifelse(state == "Texas", dists, - dists)) ``` Note the use of the quite unusual `dplyr::right_join` function. By using `right_join` with `border_counties` as the first argument and `mortgage_defaults` as the second argument, we get an object of the class of the first argument (therefore a `sf`) and we get all the lines of the `mortgage_defaults` tibble and only those of `us_counties` that match. We use the convention that the distance is positive for Texan counties and negative for neighboring states. Finally, we plot the discontinuity in @fig-disc_texas. \idxfun{filter}{dplyr}\idxfun{mutate}{dplyr}\idxfun{ggplot}{ggplot2}\idxfun{geom\_binmeans}{micsr} \idxfun{aes}{ggplot2}\idxfun{geom\_smooth}{ggplot2} ```{r } #| warning: false #| fig-cap: "Discontinuity of the mortgage defaults data set" #| label: fig-disc_texas mortdef_sf %>% filter(abs(dists) < 50) %>% mutate(state = ifelse(state == "Texas", "Texas", "other")) %>% ggplot(aes(dists, default)) + micsr::geom_binmeans(aes(size = after_stat(n)), shape = 21) + geom_smooth(aes(linetype = state, weight = loans), method = "lm", se = FALSE, color = "black") ``` The intercept is a bit more than 7% outside the border and about 4.5% inside the border, so that the effect of the Texas specific regulation seems significant (a reduction of about 2.5% of the mortgage default rate). \index[general]{regression discontinuity!spatial|)} ## Two examples To illustrate the techniques presented in the subsequent sections, we'll use two data sets. The first one, from @WHEE:03\index[author]{Wheeler}, is called `agglo_growth` and deals with the topic of economies and diseconomies of agglomeration. The second one, called `sp_growth` is from @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch} and is used to fit an extension of Solow's growth model. ### Agglomeration economies and diseconomies The `agglo_growth` contains data about US counties in 1990, identified by their fips code. \idxdata[(]{agglo\_growth}{micsr.data} ```{r } agglo_growth ``` `emp_gr` and `pop_gr` are the employment and the population growth in each county between 1980 and 1990 and `emp` and `pop` the level of employment and of the population in 1980. @WHEE:03\index[author]{Wheeler} investigates the existence of: - agglomeration economies, which implies that the growth of a given territory will be positively correlated with its size, - agglomeration diseconomies for large cities that experience congestion, crime, pollution, etc. The hypothesis is therefore that the relation between size and growth should be inverted U shaped, which means that, for small territories, the agglomeration economies effect is dominant; as for large territories, the agglomeration diseconomies becomes dominant. We'll reproduce some of the results using the counties of Louisiana. We first join the tibble to the `sf` called `us_counties` which contains the geometries of the counties that we've already used in the previous section: \idxfun{right\_join}{dplyr}\idxfun{filter}{dplyr} ```{r } louisiana <- us_counties %>% right_join(agglo_growth, by = "fips") %>% filter(state == "Louisiana") ``` We first plot a thematic map (@fig-map_louisiana), with colors for counties related to the growth of the population (`pop_gr`). It is recommended to create first a categorical variable using `base::cut`, because it is easier to visualize a discrete palette of colors than a continuous one. We use `scale_fill_brewer` to use one of the palettes provided by the **RColorBrewer** package, which provides sequential, diverging and qualitative palettes. The first two are suitable to represent the values of numerical variables. Sequential palettes use a color, light for low values of the variable and dark for high values. Divergent palettes use two colors, with light colors for middle range values and dark colors for low and high values. All the available palettes can be displayed using `RColorBrewer::display.brewer.all()`. We use here the `Oranges` sequential palette. \idxfun{mutate}{dplyr}\idxfun{cut}{base}\idxfun{ggplot}{ggplot2}\idxfun{geom\_sf}{ggplot2}\idxfun{scale\_fill\_brewer}{ggplot2} ```{r} #| eval: false louisiana %>% mutate(pop_gr = cut(pop_gr, (-3:3) / 10)) %>% ggplot() + geom_sf(aes(fill = pop_gr)) + scale_fill_brewer(palette = "Oranges") ``` ```{r } #| label: fig-map_louisiana #| fig-cap: "Population growth and initial population in Louisiana" #| echo: false louisiana %>% mutate(pop_gr = cut(pop_gr, (-3:3) / 10)) %>% ggplot() + geom_sf(aes(fill = pop_gr)) + scale_fill_grey() ``` The inverted U shaped hypothesis between the log of the initial population and the growth rate can be tested by regressing the growth rate on the log population and its square. The later coefficient should then be negative. \idxfun{lm}{stats}\idxfun{poly}{stats}\idxfun{gaze}{micsr} ```{r } mod1 <- lm(pop_gr ~ poly(log(pop), 2, raw = TRUE), louisiana) mod1 %>% gaze ``` \idxfun{coef}{stats} ```{r} #| echo: false b1 <- coef(mod1)[2] b2 <- coef(mod1)[3] lpops <- - b1 / 2 / b2 pops <- exp(lpops) ``` The coefficient of the square term is negative and significant and we get a maximum growth for the fitted model for a value of `log(pop)` equal to $-`r round(b1, 4)` / (`r round(b2, 4)` \times 2) = `r round(lpops, 4)`$, i.e., a population of about `r round(pops / 1E03)` thousand inhabitants. This result is illustrated in @fig-iushaped_louisiana, a scatterplot representing the relationship between the logarithm of the initial population and its growth rate and a fitting line with a second degree polynomial: \idxfun{ggplot}{ggplot2}\idxfun{aes}{ggplot2}\idxfun{geom\_point}{ggplot2}\idxfun{geom\_smooth}{ggplot2}\idxfun{poly}{stats}\idxfun{scale\_x\_continuous}{ggplot2} ```{r } #| label: fig-iushaped_louisiana #| fig-cap: "Relation between initial population and population growth in Louisiana" louisiana %>% ggplot(aes(pop, pop_gr)) + geom_point() + geom_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), se = FALSE, color = "black") + scale_x_continuous(trans = "log10") ``` \idxdata[)]{agglo\_growth}{micsr.data} ### Solow model {#sec-solow_model} The second example is from @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch} who estimated a growth model for the countries of the world taking spatial correlation into account. The `sp_solow` data set is provided by the **necountries** package: \idxdata[(]{sp\_solow}{necountries} \idxfun{data}{utils} ```{r} data("sp_solow", package = "necountries") sp_solow ``` There are 91 countries in the `sp_solow` data set, the series being the gdp in 1960 and 1995, the saving rate (`saving`) and the growth of the labor force (`labgwth`). As in @sec-data_solow_mrw, we denote `i` as the saving rate and `v` as the sum of the growth rate of the labor force and 0.05.^[0.05 being the sum of the rate of depreciation and the rate of technical progress, assumed to be the same for all countries.] We also compute the annual growth rate (`growth`) as the difference of the logs of the GDP for 1995 and 1960 divided by 35. \idxfun{rename}{dplyr}\idxfun{mutate}{dplyr} ```{r } sp_solow <- sp_solow %>% rename(i = saving) %>% mutate(v = labgwth + 0.05, growth = (log(gdp95) - log(gdp60)) / 35) ``` We then need to join this data frame to a sf containing the administrative boundaries of the countries and the coordinate of their capital. Several packages enable to get an sf of the world. For example, the **spData** package has a `world` sf which is obtained from Natural Earth,^[<<https://www.naturalearthdata.com/>>] **geodata** has a `world` function that enables to download from the gadm website^[<<https://gadm.org/>>] an object of class `SpatVector` (than can be easily converted to a `sf`, as shown in @sec-struct_sf). We use here the convenient **necountries** package that also uses Natural Earth and provides a `countries` function with different arguments to select a subset of countries of the world. By default, `countries` returns 199 lines, the 193 United Nations' recognized countries, the two observer countries (Palestine and Vatican) and four not or not fully recognized countries (Kosovo, Somaliland, Northern Cyprus and Taiwan). Each line includes the geometry of the "main" part of the countries. Some countries have "parts" or "dependencies" that can be included using `part = TRUE` and `dependency = TRUE`. A "part" is an area which has the same political status as the rest of the country, but is far from the main part of the country. Examples of parts include Alaska and Hawaii for the United States, Martinique and Guadeloupe for France and Canaries for Spain. A "dependency" is an area with a specific political status. Examples of dependencies are Greenland for Denmark, New Caledonia for France and Gibraltar for the United Kingdom. `sp_solow` has columns that contain the names and the iso-codes of the countries (respectively `name` and `code`). Any of them can be used to join `sp_solow` with the `countries`' object, but it is much safer to use `code`, as it avoids the problem of small differences in countries' names. A lot of countries of the world are not present in `sp_solow` (especially most of the communist countries). We check whether all the countries of `sp_solow` are present in the `countries` object, with the `check_join` function; the `by` argument indicates the series in `sp_solow` that identifies the country: \idxfun{check\_join}{necountries}\idxfun{countries}{necountries} ```{r } #| collapse: true library(necountries) countries() %>% check_join(sp_solow, by = "code") ``` The two problems are that the D.R. of Congo (`iso3` code `COD`) used to be called Zaire (`iso3` code `ZAR`) and that Hong Kong, which is a part of China for the **necountries** package, was considered as a sovereign country in @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch} study. We then correct the code for the D.R. of Congo, we add Hong Kong using the `include` argument and we remove `Antarctica` with the exclude argument: \idxfun{mutate}{dplyr}\idxfun{ifelse}{base}\idxfun{countries}{necountries}\idxfun{select}{dplyr}\idxfun{left\_join}{dplyr} ```{r eval = TRUE} sp_solow <- sp_solow %>% mutate(code = ifelse(code == "ZAR", "COD", code)) sps <- countries(include = "Hong Kong", exclude = "Antarctica") %>% select(iso3, country, point) %>% left_join(sp_solow, by = "code") %>% select(- name) sps ``` Note that the resulting `sf` has two geometry columns, polygons for the borders of the countries (`polygon`) and points for the capitals (`point`). Note also that the active geometry column is `polygon`. We then draw in @fig-growth_gdp a world map with the color of the countries related to the annual growth during the 1960-95 period and a point with a size related to the initial (1960) GDP. **necountries** provides a `plot` method which draws a basic map using **ggplot2**. A basic thematic map can be drawn using the `fill` argument to fill the areas of the countries and the `centroid` or the `capital` arguments to draw a point at the position of the centroid or the capital of each country. Any **ggplot2** functions can be used to customize the graphic: \idxfun{scale\_size\_continuous}{ggplot2}\idxfun{theme}{ggplot2}\idxfun{guides}{ggplot2}\idxfun{labs}{ggplot2}\idxfun{guide\_legend}{ggplot2} ```{r} #| eval: false sps %>% plot(fill = "growth", centroid = "gdp60", palette = "Blues") + scale_size_continuous(range = c(0.5, 3)) + theme(legend.position = "bottom") + guides(size = guide_legend(nrow = 3, reverse = TRUE), fill = guide_legend(nrow = 3, byrow = TRUE)) + labs(fill = "Growth rate (1960-95)", size = "GDP in 1960") ``` ```{r eval = TRUE} #| label: fig-growth_gdp #| fig-cap: "Growth and initial GDP, 1960-1995" #| out-width: "100%" #| fig-width: 7 #| echo: false sps %>% plot(fill = "growth", centroid = "gdp60", bw = TRUE) + scale_size_continuous(range = c(0.5, 3)) + theme(legend.position = "bottom") + guides(size = guide_legend(nrow = 3, reverse = TRUE), fill = guide_legend(nrow = 3, byrow = TRUE)) + labs(fill = "Growth rate (1960-95)", size = "GDP in 1960") ``` The basic Solow model, which was given in @eq-solow_equation, is then estimated: \idxfun{lm}{stats}\idxfun{gaze}{micsr} ```{r } lm(log(gdp95) ~ log(i) + log(v), sp_solow) %>% gaze ``` Remember that the structural model implies that $\beta_s = - \beta_v = \kappa / (1 - \kappa)$. This hypothesis can be tested using a reparametrization of the model:\index[general]{reparametrization!Solow model} \idxfun{lm}{stats}\idxfun{gaze}{micsr} ```{r} #| collapse: true lm(log(gdp95) ~ log(i / v) + log(v), sp_solow) %>% gaze ``` for which the hypothesis is that the coefficient of `log(v)` equals 0. This hypothesis is rejected at the 5% level (but not at the 1% level). Imposing this hypothesis, we get: \idxfun{lm}{stats}\idxfun{coef}{stats}\idxfun{unname}{base} ```{r} #| echo: false z <- lm(log(gdp95) ~ log(i / v), sp_solow) b <- unname(coef(z)[2]) kappa <- b / (1 + b) ``` \idxfun{lm}{stats}\idxfun{gaze}{micsr} ```{r} #| collapse: true lm(log(gdp95) ~ log(i / v), sp_solow) %>% gaze ``` which implies a value of $\kappa$ (the share of the capital in the GDP) equal to $\hat{\beta}_i / (1 + \hat{\beta}_i) = `r round(kappa, 2)`$ which is implausibly high. \idxdata[]{sp\_solow}{necountries} ## Spatial correlation {#sec-spatial_correlation} Spatial correlation occurs when one can define a distance relationship between observations. The notion of distance is broad, but we'll consider in this section only geographical distance. For each observation, one can in this case define a set of neighbors. If the geographical representation of an observation is a polygon (which is the relevant choice for countries or regions), two observations for example can be said to be neighbors if they have a common border. If the geographical representation of an observation is a point (for example for a city), two observations are neighbors if the distance between them is less than, say, 100 kilometers. Once the set of neighbors have been defined for every observations, weights can be computed. The weights can be equal or may depend on the distance between the observation and its neighbors. These two operations of defining the set of neighbors and their weights are performed by the **spdep** package [@PEBE:BIVA:23]\index[author]{Pebesma}\index[author]{Bivand}. ### Contiguity and weights \index[general]{contiguity matrix|(} To get the matrix of contiguity for the counties of Louisiana, we use the `spdep::poly2nb` function: \idxdata[(]{agglo\_growth}{micsr.data} \idxfun{poly2nb}{spdep} ```{r } #| message: false library(spdep) nb_louis <- poly2nb(louisiana) nb_louis ``` \idxfun{sapply}{base}\idxfun{length}{base} ```{r} #| echo: false nlinks <- sapply(nb_louis, length) %>% sum ncount <- length(nb_louis) ``` The print method indicates the number of contiguity links (`r nlinks`). Instead of storing these links in a full matrix (with 1 for contiguous counties and 0 otherwise) which would have $`r ncount`^2 = `r ncount^2`$ cells with only `r nlinks` of them with a value of 1 (about `r round(nlinks / ncount ^ 2 * 100, 1)`%), an `nb` object is returned. It is a list of `r ncount` vectors which contains, for each observation, the positions of its neighbors. For example: \idxfun{head}{utils} ```{r } nb_louis %>% head(3) ``` The first two counties have five neighbors and the third one has six. A `summary` method provides more details. `poly2nb` has a `queen` argument which is `TRUE` by default. Queen contiguity means that two polygons which have only one common point are neighbors. Setting `queen` to `FALSE` implies that rook continuity is used. In this case, only counties that have a common border are neighbors. \idxfun{poly2nb}{spdep} ```{r } nb_louis_rook <- poly2nb(louisiana, queen = FALSE) ``` \idxfun{sapply}{base} ```{r} #| echo: false nlinks_rook <- sapply(nb_louis_rook, length) %>% sum ``` Printing `nb_nc_rook`, one can check that the number of contiguity links (`r nlinks_rook`) is slightly lower than previously (`r nlinks`). `nb` objects can't be plotted as is with **ggplot2**. We provide a convenient `st_nb` function which performs this task, as it returns a `sf` object (see @fig-links): \idxfun{ggplot}{ggplot2}\idxfun{st\_nb}{micsr}\idxfun{geom\_sf}{ggplot2} ```{r } #| label: fig-links #| fig-cap: "Queen and rook links for the counties of Louisiana" louisiana %>% ggplot() + geom_sf(fill = NA) + geom_sf(data = st_nb(louisiana), linetype = "dotted") + geom_sf(data = st_nb(louisiana, queen = FALSE)) ``` We first plot queen contiguity links with dotted lines and then rook contiguity with plain lines, so that the specific queen contiguity links appear as dotted segments. A matrix of weights is obtained by using the `nb2listw` function: ```{r } W_louis <- nb_louis %>% nb2listw ``` A `listw` object is returned, which contains two lists: the first (`neighbours`) is the same as the one of the `nb` object. The second contains weights: \idxfun{head}{utils} ```{r } W_louis$weights %>% head(3) ``` We can see that weights of neighbors for a given observation are all equal (1/5 for the first two observations which have five neighbors and $1/6$ for the third one which has six neighbors) and sum to one. Therefore, premultiplying a vector $(y - \bar{y})$ by $W$ results in a vector with typical value $\sum_m w_{nm} y_m - \bar{y} = \tilde{y}_n - \bar{y}$. `nb` and a `listw` objects can be coerced to matrices using the `nb2mat` and `listw2mat` functions. ```{r} #| results: hide W_louis %>% listw2mat ``` \idxdata[)]{agglo\_growth}{micsr.data} Instead of defining the neighborhood as common borders, one can consider distance between points, for example the capitals or the centroids of the countries using the `sp_solow` data set. Remember that `sps` contains two `sf`, `polygon` and `point` and that the active geometry is `polygon`. We first specify `point` as being the active geometry, using the `st_set_geometry` function, and we remove all countries for which the data are not available: \idxdata[(]{sp\_solow}{necountries} \idxfun{na.omit}{stats}\idxfun{st\_set\_geometry}{sf} ```{r } sp_solow2 <- sps %>% na.omit %>% st_set_geometry("point") ``` Then, we use the `dnearneigh` function to compute the set of neighbors for each country: the first argument is the `sf` and the next two arguments, called `d1` and `d2` are mandatory and should be set to the minimum and the maximum distances that should be used to define the neighbors. Note that this distance should be indicated in kilometers if the CRS is geographical, which is the case here. \idxfun{spdep}{dnearneigh} ```{r} d <- dnearneigh(sp_solow2, 0, 1000) d ``` \idxfun{sapply}{base}\idxfun{ifelse}{base}\idxfun{length}{base} ```{r} #| echo: false nlinks <- sum(sapply(d, function(x) ifelse(length(x) == 1 && x == 0, 0, length(x)))) ncount <- length(d) ``` With a distance of 1000 km, there are `r nlinks` links and an average of `r round(nlinks / ncount, 2)` neighbors per country. Note that 18 countries have no neighbors. We then compute the weights, using `nb2listwdist`. The first two argument are a `nb` and a `sf` objects. The `type` argument indicates how the weights should be computed. Denoting $d_{nm}$ the distance between the two unit $n$ and $m$, with `type = "idw"`, we get $w_{nm} = d_{nm} ^ {- \alpha}$. With `type = "exp"`, the weights are $w_{nm} = \mbox{exp}(- \alpha d_{nm})$. The `alpha` argument controls the value of $\alpha$ which is 1 by default. For example, if `type = "idw"` and `alpha = 2`, the weights are the inverse of the square of the distance. Once the weights have been computed, they can be normalized in different ways, using the `style` argument. If `type = "raw"` (the default), no normalization is performed. A usual choice is `"W"` where the weights of a unit are normalized so that they sum to 1: \idxfun{nb2listwdist}{spdep} ```{r} w <- nb2listwdist(d, sp_solow2, type = "idw", alpha = 1, style = "W", zero.policy = TRUE) ``` \idxdata[)]{sp\_solow}{necountries} Note the use of `zero.policy`: if `TRUE`, the matrix of weights is computed even if, as it is the case here, some countries have no neighbors. \index[general]{contiguity matrix|)} ### Tests for spatial correlation \index[general]{spatial correlation!tests|(} The spatial correlation hypothesis can be tested either for a specific variable or using the residuals of a linear regression. Two main tests have been proposed. The first is Moran's $I$ test: denoting $W$ as the weights matrix where the weights sum to one for a given line, the statistic is defined as: $$ I = \frac{(y - \bar{y})^\top W (y - \bar{y})}{(y - \bar{y})^\top (y - \bar{y})} = \frac{\sum_n (y_n - \bar{y})(\tilde{y}_n - \bar{y})}{\sum_n (y_n - \bar{y})^2} $$ which is simply the ratio of the covariance between $y$ and $\tilde{y}$ and the variance of $y$. It therefore can be obtained as the coefficient of $y$ in a linear regression of $\tilde{y}$ on $y$. If the variance of $\tilde{y}$ and $y$ were equal, it would also be the coefficient of correlation between the value of $y$ for a unit and its neighbors. The Moran test is computed using the `spdep::moran.test` function which takes as argument a series and a `listw` object: \idxdata[(]{agglo\_growth}{micsr.data} \idxfun{moran.test}{spdep}\idxfun{gaze}{micsr} \index[general]{Moran test|(} ```{r } #| collapse: true moran.test(louisiana$pop_gr, W_louis) %>% gaze ``` \idxfun{moran.test}{spdep}\idxfun{unname}{base} ```{r } #| include: false mor <- moran.test(louisiana$pop_gr, W_louis) stat_mor <- mor$estimate[1] %>% unname mu_mor <- mor$estimate[2] %>% unname sd_mor <- mor$estimate[3] %>% unname %>% sqrt ``` ```{r include = FALSE} Y <- louisiana$pop_gr N <- length(Y) Wmat <- W_louis %>% listw2mat DX <- outer(Y, Y, "-") ^ 2 (sum(Wmat * DX) / N) / (sum((Y - mean(Y)) ^ 2) / (N - 1)) / 2 geary.test(louisiana$pop_gr, W_louis) %>% .[["estimate"]] %>% unname %>% .[1] mrn <- moran.test(louisiana$pop_gr, W_louis) %>% .[["estimate"]] %>% unname %>% .[1] (N - 1) / N / 2 * (1 - 2 * mrn + sum( (Y - mean(Y)) ^ 2 * apply(Wmat, 2, sum)) / sum( (Y - mean(Y)) ^ 2)) ``` The value of the statistic is $`r round(stat_mor, 3)`$. Under the hypothesis of no correlation, the expected value of this statistic is $- 1 / (N-1) = `r round(mu_mor, 3)`$ and the standard deviation is $`r round(sd_mor, 3)`$.^[The expression of the variance of the Moran statistic can be found in @BIVA:WONG:18\index[author]{Bivand}\index[author]{Wong}.] The standardized statistic ($`r round((stat_mor - mu_mor) / sd_mor, 3)`$) is asymptotically normal and the hypothesis of no spatial correlation is rejected. \index[general]{Moran test|)}\index[general]{Geary test|(} An alternative to Moran's $I$ test is Geary's $C$ test, which is defined as: $$ C = \frac{N - 1}{2N}\frac{\sum_n \sum_m w_{nm} (y_n - y_m) ^ 2}{\sum_n (y_n - \bar{y})^ 2} $$ Introducing deviations from the mean in the previous expression and developing, we get: $$ C = \frac{1}{2}\frac{N - 1}{N} \left(1 - 2 I + \frac{\sum_n (y_m- \bar{y}) ^ 2 \sum_m w_{mn}}{\sum_n (y_n - \bar{y}) ^ 2}\right) $$ If the weight matrix were symmetric, as $\sum_n w_{nm} = 1$, we also would have $\sum_m w_{mn} = 1$, so that the last term is one and $C = \frac{N - 1}{N}(1 - I)$. Therefore, we can expect $C$ to be close to $(1 - I)$. Geary's test is implemented in the `spdep::geary.test` function \idxfun{geary.test}{spdep}\idxfun{gaze}{micsr} ```{r } #| collapse: true geary.test(louisiana$pop_gr, W_louis) %>% gaze ``` \index[general]{Geary test|)}\index[general]{Moran test|(} These tests are unconditional tests of spatial correlation. The same tests can be performed on the residuals of linear models. \idxfun{lm.morantest}{spdep}\idxfun{gaze}{micsr}\idxfun{lm}{stats}\idxfun{poly}{stats} ```{r } #| collapse: true model_louisiana <- lm(pop_gr~poly(log(pop),2,raw =TRUE), louisiana) lm.morantest(model_louisiana, W_louis) %>% gaze ``` \idxdata[)]{agglo\_growth}{micsr.data} The hypothesis of no spatial correlation is still rejected at the 5% level, but the p-value is much higher than the one associated with the unconditional test. @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch} computed the Moran test for the residuals of the OLS estimation of the standard Solow's growth model. The set of neighbors is obtained with setting an infinite maximum distance, so that all countries are neighbors to each other. Two matrices of weights are computed: one with $w_{nm} = d_{nm}^{-2}$ (`W1`) and the other with $w_{nm} = \mbox{exp}(-2 d_{nm})$ (`W2`): \idxdata[(]{sp\_solow}{necountries} \idxfun{lm}{stats}\idxfun{dnearneigh}{spdep}\idxfun{nb2listwdist}{spdep}\idxfun{lm.morantest}{spdep}\idxfun{gaze}{micsr} ```{r} #| collapse: true lm_solow <- lm(log(gdp95) ~ log(i) + log(v), sp_solow2) d <- dnearneigh(sp_solow2, 0, Inf) W1 <- nb2listwdist(d, sp_solow2, type = "idw", alpha = 2, style = "W", zero.policy = TRUE) W2 <- nb2listwdist(d, sp_solow2, type = "exp", alpha = 2, style = "W", zero.policy = TRUE) lm.morantest(lm_solow, W1) %>% gaze lm.morantest(lm_solow, W2) %>% gaze ``` Whatever the weighting matrix, the hypothesis of no spatial correlation is rejected. \idxfun{localmoran}{spdep}\idxfun{resid}{stats}\idxfun{as\_tibble}{tibble}\idxfun{pull}{dplyr}\idxfun{as.numeric}{base}\idxfun{cut}{base}\idxfun{st\_set\_geometry}{sf}\idxfun{add\_column}{tibble}\idxfun{ggplot}{ggplot2}\idxfun{geom\_sf}{ggplot2}\idxfun{aes}{ggplot2}\idxfun{scale\_fill\_brewer}{ggplot2} ```{r} #| include: false #| eval: false locmor <- localmoran(resid(lm_solow), W1) z_locmor <- locmor %>% as_tibble %>% pull(Ii) %>% as.numeric z_locmor <- z_locmor %>% cut(c(-1, - 0.5, 0, 0.5, 1, 1.5, 2, 2.5, Inf)) sp_solow2 %>% st_set_geometry("polygon") %>% add_column(z = z_locmor) %>% ggplot + geom_sf(aes(fill = z)) + scale_fill_brewer(palette = "Blues") ``` \idxdata[)]{sp\_solow}{necountries} \index[general]{Moran test|)} ### Local spatial correlation A first glance of local spatial correlation can be obtained using Moran's plot (see @fig-moran_plot), implemented in the `spdep::moran.plot` function, where $y_n - \bar{y}$ is on the $x$ axis and $\tilde{y}_n - \bar{y}$ on the $y$ axis. Therefore, both variables have zero mean and the intercept of the fitting line is therefore 0, the slope being Moran's $I$ statistic: \idxfun{moran.plot}{spdep} ```{r } #| fig-cap: "Moran plot" #| label: fig-moran_plot moran.plot(louisiana$pop_gr, W_louis) ``` Each observation is situated in one of the four quarters of the plane. Observations in the upper-right quarter are *"high-high"* observations, which means that the value of $y$ for observation $n$ and its neighbors are higher than the sample mean. Similarly, the lower-left quarter contains *"low-low"* observations, i.e., observations for which own values of $y$ and values of its neighbors are lower than the sample mean. The upper-left and the lower-right quarters contain respectively the *"low-high"* and the *"high-low"* observations. In case of no spatial correlation, the points should be randomly disposed around the origin and the slope of the regression line should be 0. On the contrary, in case of positive spatial correlation, a majority of points should be of the *"low-low"* or *"high-high"* category and the regression line should have a positive slope. \index[general]{Moran test!local|(} \index[general]{Geary test!local|(} @ANSE:95\index[author]{Anselin} proposed local versions of Moran's $I$ and Geary's $C$ statistics. We then have for each observation $I_n$ and $G_n$ so that $\sum_n I_n$ and $\sum_n G_n$ are respectively proportional to the global Moran and Geary statistics. Local Moran's statistics are defined as: $$ I_n = (y_n - \bar{y}) \sum_m w_{nm} (y_m - \bar{y}) = (y_n - \bar{y})(\tilde{y}_n - \bar{y}) $$ and their sum is $(y_n - \bar{y})(\tilde{y}_n - \bar{y})$, which is the numerator of Moran's $I$ statistic. Therefore, we have: $$ I = \frac{\sum_n I_n}{\sum_n (y_n - \bar{y}) ^ 2} $$ Local Moran's statistics are obtained using the `spdep::localmoran` function: \idxdata[(]{agglo\_growth}{micsr.data} \idxfun{localmoran}{spdep} ```{r } locmor <- localmoran(louisiana$pop_gr, W_louis) ``` which returns a matrix, with a line for every observation and column containing the local Moran values, their expectation, variance, the standardized statistic and the p-value. It's easier to coerce this matrix to a tibble in order to extract extreme values of the statistic:^[Note that we also rename for convenience the fifth column which has a non-standard name (`Pr(z != E(Ii))`) to `pval`.] \idxfun{as\_tibble}{tibble}\idxfun{rename}{dplyr}\idxfun{filter}{dplyr} ```{r } locmor %>% as_tibble %>% rename(pval = 5) %>% filter(pval < 0.01) ``` There are therefore 11 out of 64 (17%) observations for which the p-value is lower than 1%, which confirms the presence of spatial correlation. The local Moran statistic can also be represented in a map (@fig-map_local_moran), in order to identify the "hot spots":^[Note that the orders of the levels are reversed using `forcats::fct_rev`, so that the high values are represented in dark gray.] \idxfun{as\_tibble}{tibble}\idxfun{pull}{dplyr}\idxfun{as.numeric}{base}\idxfun{cut}{base}\idxfun{add\_column}{tibble}\idxfun{mutate}{dplyr}\idxfun{fct\_rev}{forcats}\idxfun{ggplot}{ggplot2}\idxfun{geom\_sf}{ggplot2}\idxfun{scale\_fill\_grey}{ggplot2} ```{r } #| fig-cap: "Map of local Moran statistic" #| label: fig-map_local_moran z_locmor <- locmor %>% as_tibble %>% pull(Ii) %>% as.numeric z_locmor <- z_locmor %>% cut(c(-1, - 0.5, 0, 0.5, 1, 1.5, 2, 2.5, Inf)) louisiana %>% add_column(z = z_locmor) %>% mutate(z = fct_rev(z)) %>% ggplot + geom_sf(aes(fill = z)) + scale_fill_grey() ``` Two hot spots are then identified, in the north-west and the south-west of the state. ```{r include = FALSE, eval = FALSE} crossproduct <- function(x, y){ J <- length(x$weights) sapply(1:J, function(j) sum(y[x$neighbours[[j]]] * x$weights[[j]])) y_yb <- crime_nc$crime_rate - mean(crime_nc$crime_rate) as.numeric(crossprod(crossproduct(W, y_yb), y_yb)) / sum(y_yb ^ 2) N <- length(y_yb) S0 <- N Wm <- W %>% listw2mat S1 <- sum((Wm + (t(Wm))) ^ 2) / 2 S2 <- sum( (apply(Wm, 1, sum) + apply(Wm, 2, sum)) ^ 2) S3 <- (sum(y_yb ^ 4) / N) / (sum(y_yb ^ 2) / N) ^ 2 S4 <- (N ^ 2 - 3 * N + 3) * S1 - N * S2 + 3 * S0 ^ 2 S5 <- (N ^ 2 - N) * S1 - 2 * N * S2 + 6 * S0 ^ 2 V <- (N * S4 - S3 * S5) / ( (N - 1) * (N - 2) * (N - 3) * S0 ^ 2) - 1 / (N - 1) ^ 2 ``` \index[general]{Moran test!local|)} \index[general]{Geary test!local|)} \idxdata[)]{agglo\_growth}{micsr.data} \index[general]{spatial correlation!tests|)} ## Spatial econometrics {#sec-spatial_models} ### Spatial models and tests We consider here linear gaussian models that are extended in order to integrate the spatial features of the sample, using the weighting matrix described in the previous section. Two main models can be considered. The first one is the **spatial error model** (**SEM**) which can be written in matrix form as: \index[general]{spatial error model|(} $$ y = \alpha \iota + X \beta + \epsilon \; \mbox{with} \; \epsilon = \rho_\epsilon W \epsilon + \nu $$ {#eq-sem} Therefore, the error for observation $n$ is linearly related to the errors of its neighbors $\tilde{\epsilon}_n$. OLS estimation gives unbiased and consistent estimators but, as always with non-spherical errors, it is inefficient and the estimated covariance matrix of the coefficients based on the simple formula ($\sigma ^ 2 (\tilde{X}^\top \tilde{X}) ^ {-1}$) is biased. \index[general]{spatial error model|)} \index[general]{spatial autoregressive model|(} The second model is called the **spatial autoregressive model** (**SAR**).^[This model is also known as the spatial lag model.]. It extends the basic gaussian linear model by adding as a regressor the mean value of the response for the neighbor units. For one observation, the model writes $y_n =\alpha + \beta ^ \top x_n + \rho_y \tilde{y}_n + \epsilon_n$ or, in matrix form: $$ y = \alpha j + X \beta + \rho_y W y + \epsilon = Z \gamma + \rho_y W y + \epsilon $$ The reduced form of the model is: $$ y = (I - \rho_y W) ^ {-1} Z \gamma + (I - \rho_y W) ^ {-1}\epsilon $$ {#eq-sar} Therefore, the values of $y$ depend on all the values of $\epsilon$, and $\tilde{y}_{n}$ is therefore correlated with $\epsilon_n$. The OLS estimator is therefore biased and inconsistent. Moreover, in @eq-sar, the spatial dependence in the parameter $\rho_y$ feeds back [@BIVA:MILL:PIRA:21, page 6]\index[author]{Bivand}\index[author]{Millo}\index[author]{Piras}, which is not the case for @eq-sem. This means that for the SEM model, the marginal effect of a covariate is the corresponding coefficient, but this is not the case for the SAR model. Moreover, the matrix $(I - \rho_y W) ^ {-1}$ can be written as an infinite series: $$ (I - \rho_y W) ^ {-1} = I + \rho_y W + \rho_y^2 W^2 + \rho_y^3 W^3 + \ldots $$ Consider the case of three observations, France (1), Italy (2) and Spain (3). The matrix of contiguity is: $$ W = \left(\begin{array}{rcl} 0 & 0.5 & 0.5 \\ 1 & 0 & 0 \\ 1 & 0 & 0\end{array}\right) $$ France has two neighbors (Italy and Spain) and Italy and Spain only one (France). The weights are such that they sum to 1 for each line. Consider a variation of the unique covariate in Spain ($\Delta x_3$) and denote $\beta$ the corresponding coefficient. The direct effect on the response for the three countries is obviously $\Delta y_0 ^ \top = (0, 0, \beta \Delta x_3)$. This increase of $y$ in Spain implies an increase of $y$ in the neighboring country, France. Therefore, $\Delta y_1 ^ \top = (0.5 \rho_y \beta \Delta x_3, 0, 0)$. This increase of $y$ in France implies an increase of $y$ in the neighboring countries, Italy and Spain: $\Delta y_2 ^ \top = (0, 0.5 \rho_y ^ 2 \beta \Delta x_3, 0.5 \rho_y ^ 2 \beta \Delta x_3)$. This increase of $y$ in Italy and Spain implies an increase of $y$ in the neighboring country of Italy and Spain, which is France: $\Delta y_3 ^ \top = (0.5\rho_y ^ 3 \beta \Delta x_3, 0, 0)$, etc. To take a numerical example, consider $\beta \Delta x_3 = 1$ and $\rho_y = 0.3$. Then, $\Delta y_0 ^ \top = (0, 0, 1)$, $\Delta y_1 ^ \top = (0.15, 0, 0)$, $\Delta y_2 ^ \top = (0, 0.045, 0.045)$ and $\Delta y_3 ^ \top = (0.0135, 0, 0)$. In total, we have $\sum_{i = 0} ^ 3 \Delta y_i ^ \top = (0.177, 0.045, 1.045)$. These four effects and their sum is computed below: \idxfun{matrix}{base}\idxfun{diag}{base}\idxfun{cbind}{base}\idxfun{apply}{base} ```{r} W <- matrix(c(0, 1, 1, 0.5, 0, 0, 0.5, 0, 0), nrow = 3) bdx <- c(0, 0, 1) rho <- 0.3 dy0 <- diag(3) %*% bdx dy1 <- rho * W %*% bdx dy2 <- rho ^ 2 * W %*% W %*% bdx dy3 <- rho ^ 3 * W %*% W %*% W %*% bdx dy_03 <- cbind(dy0, dy1, dy2, dy3) dy_03 apply(dy_03, 1, sum) ``` The total effect of $\Delta x_3$ is obtained using the formula: $\Delta y = (I - \rho_y W) ^ {-1} \Delta x$: \idxfun{solve}{base}\idxfun{diag}{base} ```{r} solve(diag(3) - rho * W) %*% bdx ``` which is very close to the sum of the direct effect and the first three indirect effects computed previously. \index[general]{spatial autoregressive model|)} \index[general]{spatial error model|(} Spatial models are usually estimated by maximum likelihood, by assuming a multivariate normal distribution for the iid errors. For the SEM model, the idiosyncratic errors can be written as a function of the response: $$ \nu = (I - \rho_\epsilon W) y - (I - \rho_\epsilon W) Z\gamma $$ {#eq-etasem} so that the Jacobian of the transformation is $$\left| \frac{\partial \nu}{\partial y} \right| = \left| I - \rho_\epsilon W \right|$$. the likelihood is similar to the one of the linear gaussian model except that an extra term, which is the log of the Jacobian, should be added: $$ -N/2 (\ln 2\pi + \ln \sigma ^ 2) + \ln \left| I - \rho_\epsilon W \right| - \frac{1}{\sigma ^ 2} \nu^\top \nu $$ where $\nu$ is given by @eq-etasem. \index[general]{spatial error model|)} \index[general]{spatial autoregressive model|(} For the **SAR** model, @eq-sar indicates that the Jacobian is the same, so that the log-likelihood is: $$ -N/2 (\ln 2\pi + \ln \sigma ^ 2) + \ln \left| I - \rho_y W \right| - \frac{1}{\sigma ^ 2} \epsilon^\top \epsilon ^ 2 $$ with $\epsilon = y - Z \gamma - \rho_y Wy$.    \index[general]{spatial autoregressive model|)} These two models can be augmented by spatial lags of the covariates, they are in this case called Durbin's models. There is an inserting relationship between Durbin's SAR model and the SEM model. The former can be written: \index[general]{Durbin's model|(} \index[general]{spatial autoregressive model!Durbin|(} \index[general]{spatial error model!Durbin|(} $$ y = \rho_y W y + Z\gamma + W Z\theta + \epsilon $$ with $\epsilon$ iid errors. The **common factor** hypothesis is that $\theta = - \rho_y \gamma$. In this case, we have: \index[general]{common factor hypothesis} $$ (I - \rho_y W) y = (I - \rho_y W) Z\gamma + \epsilon $$ Or finally: $$ y = Z\gamma + (I - \rho_y W) ^{-1} \epsilon $$ which is the SEM model, as denoting $\eta = (I -\rho_y W) ^ {-1} \epsilon$ the errors of this model, we have $y = Z\gamma + \eta$, with $\eta = \rho_y W \eta + \epsilon$. Once it has been established that there is some spatial dependence, one has to choose the "right" specification. Several tests have been proposed, based on the three test principles. Lagrange multiplier tests proposed by @ANSE:BERA:FLOR:YOON:96\index[author]{Anselin}\index[author]{Bera}\index[author]{Florax}\index[author]{Yoon} are popular, as they only require the OLS estimation. Different flavors of tests have been proposed, testing $\rho_y = 0$, $\rho_\epsilon = 0$ or $\rho_y = \rho_\epsilon = 0$. The basic version of, for example the first test ($\mbox{H}_{0}: \rho_y = 0$) has, as a maintained hypothesis that $\rho_\epsilon = 0$. A robust version of the test has been proposed which doesn't impose this maintained hypothesis. The common factor hypothesis can easily be tested using a likelihood ratio test. The Wald statistic is less easy to compute, as a set of non-linear hypotheses $\theta = - \rho_y \beta$ should be tested. \index[general]{Durbin's model|)} \index[general]{spatial autoregressive model!Durbin|)} \index[general]{spatial error model!Durbin|)} ### Application to the growth model To overcome the irrelevant empirical results of the standard Solow model, @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch} consider an endogenous growth model with spillovers. The production function is as usual a Cobb-Douglas: $$ Y_n(t) = A_n(t) K_n ^ \kappa(t) L_n ^ {1 -\kappa}(t) $$ $\kappa$ being the elasticity of the production with the capital and also the share of the profits in the national product. The level of technology is given by: $$ A_n(t) = \Omega(0) e ^ {\mu t} k_n ^ \phi(t) \prod_{m \neq n} ^ N A_n^{\gamma w_{nm}}(t) $$ where $k_n(t) = K_n(t) / L_n(t)$ is the physical capital per worker, $\Omega(0)$ is the initial level of the technology and $\mu$ is a constant rate of technological growth, as in the Solow model. The next term takes into account the spillover effect of domestic investment, following @ROME:86\index[author]{Romer}, and the strength of this spillover effect is measured by $\phi$. The last term is specific to the model of @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch}, for which the spillovers are not restricted to domestic investment, but concerns also the technology of neighboring countries. The effect of the technology of country $m$ on the technology of country $n$ is the product of a constant parameter $\gamma$ and a specific weight, $w_{nm}$, which is a decreasing function of the distance between both countries. @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch} showed (equation 10, page 1038) that, at the steady state, the output per capita is: $$ \begin{array}{rcl} \ln y_n ^ * &=& \frac{1}{1 - \kappa - \phi} \ln \Omega + \frac{\kappa + \phi}{1 - \kappa - \phi} \ln i_n - \frac{\kappa + \phi}{1 - \kappa - \phi} \ln v_n \\ &-& \frac{\gamma\kappa}{1 - \kappa - \phi} \sum_{m\neq n}^ N w_{nm} \ln i_m + \frac{\gamma\kappa}{1 - \kappa - \phi} \sum_{m\neq n}^ N w_{nm} \ln v_n \\ &+& \frac{\gamma(1 - \kappa)}{1 - \kappa - \phi} \sum_{m\neq n}^ N w_{nm} \ln y_m ^ * \end{array} $$ In matrix form, denoting $y = \ln y^*$, $X = (\ln i \ln v)$, $\beta = \left(\begin{array}{c} (\kappa + \phi) / (1 - \kappa - \phi)\\ - (\kappa + \phi) / (1 - \kappa - \phi)\end{array}\right)$, $\theta = \left(\begin{array}{c} \gamma \kappa / (1 - \kappa - \phi)\\ - \gamma \kappa / (1 - \kappa - \phi)\end{array}\right)$ and $\rho_y = \gamma (1 - \kappa) / (1 - \kappa - \phi)$, the model can be written as: $$ y = \alpha i + X \beta + WX \theta + \rho_y Wy + \epsilon $$ \idxdata[(]{sp\_solow}{necountries} if $\gamma = 0$, the model reduces to the model of @ROME:86\index[author]{Romer} and the last two terms disappear. Note that in this case $\kappa$ and $\phi$ are not identified, but only their sum. Moreover, if $\phi$ is also 0, we get the Solow model. Whether $\phi$ equals zero or not, the resulting model is a standard linear model already estimated in @sec-solow_model: \idxfun{lm}{stats} ```{r} ols_solow <- lm(log(gdp95) ~ log(i) + log(v), sp_solow2) ``` Moreover, imposing the theoretical constraint of the growth model, we must have the sum of the two coefficients of `log(i)` and `log(v)` equal to 0. Therefore, the constrained model can be estimated using the difference of the logarithms of the two covariates as the unique regressor. For convenience, we call the ratio of the two covariates `i` and we store it in a new data frame called `sp_solow3`. The constrained OLS model can then be estimated: \idxfun{mutate}{dplyr}\idxfun{lm}{stats} ```{r} sp_solow3 <- sp_solow2 %>% mutate(i = i / v) ols_solowc <- lm(log(gdp95) ~ log(i), sp_solow3) ``` The hypothesis of $\gamma = 0$ can be tested using Lagrange multipliers tests, testing either the alternative that there is a spatial lag or correlated errors. `spdep::lm.LMtests` provides different flavors of these tests. Its two main arguments are a `lm` model and a matrix of spatial weights. The `test` argument can be either `"LMerr"` and "`LMlag`" for the standard tests of uncorrelated errors and absence of spatial lag, `"RLMerr"` and `"RLMlag"` for the robust versions of these two tests and `"SARMA"`, suitable when the alternative hypothesis is the presence of both correlation of the errors and a spatial lag. `test = "all"` enables to perform the five tests: \idxfun{lm.LMtests}{spdep}\idxfun{gaze}{micsr} ```{r} #| collapse: true lm.LMtests(ols_solow, W1, test = "all") %>% gaze ``` All the tests reject the hypothesis of uncorrelated errors or/and spatial lag, except the robust test of uncorrelated errors. Therefore, these tests seem to indicate that the correct model is a spatial lag model. We then estimate the three spatial models using the **spatialreg** package [@PEBE:BIVA:23]\index[author]{Pebesma}\index[author]{Bivand} which provides three functions: `lagsarlm` for SAR models, `sacsarlm` for SEM model and `errorsarlm` for SAR-SEM models. The three first arguments are `formula`, `data` and `listw`, which should be a `listw` object containing the weights. The `Durbin` argument enables to estimate Durbin's versions of the model; this can be done either by setting it to `TRUE` (a spatial lag is then added for all the covariates) or a formula indicating the subset of covariates for which spatial lags have to be added. We estimate also, as for the OLS estimator, the variant of the model that imposes the theoretical restrictions on the coefficients. The results are presented in @tbl-results_ertu_koch. \idxfun{lagsarlm}{spatialreg}\idxfun{errorsarlm}{spatialreg}\idxfun{sacsarlm}{spatialreg}\idxfun{update}{stats} ```{r} library(spatialreg) m1 <- lagsarlm(log(gdp95) ~ log(i) + log(v), sp_solow2, W1, Durbin = TRUE) m2 <- errorsarlm(log(gdp95) ~ log(i) + log(v), sp_solow2, W1, Durbin = FALSE) m3 <- sacsarlm(log(gdp95) ~ log(i) + log(v), sp_solow2, W1, Durbin = TRUE) m1c <- update(m1, . ~ . - log(v), data = sp_solow3) m2c <- update(m2, . ~ . - log(v), data = sp_solow3) m3c <- update(m3, . ~ . - log(v), data = sp_solow3) ``` We can first test the theoretical restrictions using either a likelihood ratio or a Wald test: \idxfun{linearHypothesis}{car}\idxfun{lrtest}{lmtest}\idxfun{gaze}{micsr} ```{r} #| collapse: true library(lmtest) library(car) linearHypothesis(ols_solow, "log(i) + log(v) = 0", test = "Chisq") %>% gaze lrtest(ols_solow, ols_solowc) %>% gaze linearHypothesis(m1, c("log(i) + log(v) = 0", "lag.log(i) + lag.log(v) = 0")) %>% gaze lrtest(m1, m1c) %>% gaze ``` As seen previously, the theoretical restrictions are rejected at the 5% level using a Wald test for the Solow model. This is not the case for the spatial lag model. Note also that the Wald and the likelihood ratio tests give very similar results. An interesting feature of the model of @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch} is that their specification enables the identification of all the structural parameters of the model, $\kappa$, $\phi$ and $\gamma$. If $\phi = 0$, $\theta = \gamma \beta$ and $\rho_y = \gamma$, so that: $y = \alpha j+ (I - \gamma W)X\beta + \gamma y + \epsilon$, the reduced form of the model is, in matrix form: $$ y = \alpha j + X \beta + (I - \gamma W) ^ {-1} \epsilon $$ which is a SEM model without Durbin terms. Therefore, the hypothesis that $\phi = 0$ is in the context of this model the hypothesis of common factors and can easily be tested using a likelihood ratio test: \idxfun{lrtest}{lmtest}\idxfun{gaze}{micsr} ```{r} #| collapse: true lrtest(m1c, m2c) %>% gaze ``` The hypothesis of common factor is rejected at the 1% level. Our preferred specification is therefore a spatial lag model imposing the theoretical restrictions. The structural parameters of the model ($s$) can be computed using the fitted reduced form parameters ($r$). The relation between $r$ and $s$ is: ```{r} #| echo: false #| label: tbl-results_ertu_koch #| tbl-cap: "Results of the growth models" library(modelsummary) options(modelsummary_get = "broom") modelsummary(list(OLS = ols_solow, SAR = m1, SEM = m2, SARMA = m3, OLSc = ols_solowc, SARc = m1c, SEMc = m2c), coef_omit = "Intercept", coef_map = c("log(i)" = "$\\ln i$", "log(v)" = "$\\ln v$", "lag.log(i)" = "$W \\ln i$", "lag.log(v)" = "$W\\ln v$", rho = "$\\rho_y$", lambda = "$\\rho_\\epsilon$"), gof_omit = "R2|AIC|BIC", escape = FALSE, output = "kableExtra") ``` $$ r ^ \top = (\beta, \theta, \rho) = \left(\frac{\kappa + \phi}{1 - \kappa - \phi}, - \gamma \frac{\kappa}{1 - \kappa - \phi}, \gamma \frac{1 - \kappa}{1 - \kappa - \phi}\right) $$ Inversing this relation, we get the structural parameters $s$ as a function of the reduced form parameters: $$ s ^ \top = (\kappa, \phi, \gamma) = \left(\frac{\theta}{\theta - \rho}, \frac{\beta}{1 + \beta} - \frac{\theta}{\theta - \rho}, \frac{\rho - \theta}{1 + \beta}\right) $$ For our model, we get: \idxfun{unname}{base}\idxfun{coef}{stats} ```{r} #| collapse: true beta <- unname(coef(m1c)["log(i)"]) theta <- unname(coef(m1c)["lag.log(i)"]) rho <- unname(coef(m1c)["rho"]) kappa <- theta / (theta - rho) phi <- beta / (1 + beta) - theta / (theta - rho) gamma <- (rho - theta) / (1 + beta) s <- c(kappa = kappa, phi = phi, gamma = gamma) s ``` \index[general]{delta method|(} To apply the delta method, we compute the matrix of derivatives of $s$ with respect to $r$: $$ \Gamma(r) = \frac{\partial s}{\partial r ^ \top}(r)= \left( \begin{array}{rcl} 0 & - \frac{\rho}{(\theta - \rho) ^ 2} & \frac{\theta}{(\theta - \rho) ^ 2} \\ \frac{1}{(1 + \beta) ^ 2} & \frac{\rho}{(\theta - \rho) ^ 2} & - \frac{\theta}{(\theta - \rho) ^ 2} \\ - \frac{\theta - \rho}{(1 + \beta) ^ 2} & - \frac{1}{1 + \beta} & \frac{1}{1 + \beta} \end{array} \right) $$ and the asymptotic variance of $s$ is then: $$ \hat{V}(\hat{r}) = \Gamma(\hat{r}) \hat{V}(\hat{s})\Gamma(\hat{r}) ^ \top $$ \idxfun{vcov}{stats}\idxfun{matrix}{base} ```{r} Vr <- vcov(m1c)[c("log(i)", "lag.log(i)", "rho"), c("log(i)", "lag.log(i)", "rho")] G <- c(0, - rho / (theta - rho) ^ 2, theta / (theta - rho) ^ 2, 1 / (1 + beta) ^ 2, rho / (theta - rho) ^ 2, - theta / (theta - rho) ^ 2, (theta - rho) / (1 + beta) ^ 2, - 1 / (1 + beta), 1 / (1 + beta)) G <- matrix(G, nrow = 3, byrow = TRUE) V <- G %*% Vr %*% t(G) ``` We finally compute the table of the structural parameters: \idxfun{diag}{base}\idxfun{pnorm}{stats}\idxfun{cbind}{base} ```{r} stdev <- V %>% diag %>% sqrt z <- s / stdev p <- pnorm(abs(z), lower.tail = FALSE) * 2 struct_par <- cbind(Estimate = s, "Std. Error" = stdev, "z value" = z, "Pr(>|z|" = p) rownames(struct_par) <- names(s) struct_par ``` \idxdata[)]{sp\_solow}{necountries} \index[general]{delta method|)} The value of $\kappa$ is compatible with the observable share of the profits in the national product. $\phi$ is significant only at the 10% level. Finally, $\gamma$ is highly significant, which is a confirmation of the relevance of the model of @ERTU:KOCH:07\index[author]{Ertur}\index[author]{Koch}.