14 Discrete choice models
Consider the case where the response is the choice of an alternative among a set of mutually exclusive alternatives. This choice can be modeled in a utility maximization framework, which means that we hypothesize that the individual chooses the alternative which corresponds to the maximum level of utility.1 Namely, denoting \(j = 1 \ldots J\) the set of alternatives, we'll denote \(U_{nj}\) the level of utility of individual \(n\) if he chooses alternative \(j\); the \(j\)th alternative will be chosen if \(U_{nj} > U_{nk}\; \forall k \neq j\). The level of utility depends on some observable covariates and on some other unobservables whose effect will be summarized in a variable \(\epsilon_{nj}\) that will be considered, from the researcher’s point of view, as the realization of a random variable. Two key questions for these models are how the covariates enter the utility function and what is the assumed distribution of the error. Section 14.1 will deal with the first point, namely how to deal with covariates in discrete choice models. Section 14.2 will present the landmark model in this field, the multinomial logit model. Section 14.3 will go beyond the hypothesis of iid errors to present two extensions of the basic model, the nested and the heteroscedastic logit models. Section 14.4 will introduce the rich field of random parameters logit models. Finally, Section 14.5 will be devoted to the multinomial probit model.2
Throughout this chapter, we’ll use the mlogit package which is devoted to the estimation of discrete choice models:
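A minimal setup sketch (the exact packages loaded in the original are not shown; mlogit attaches the dfidx package it depends on, and dplyr is assumed for the %>% pipelines used below):

library(mlogit)   # discrete choice models; also attaches dfidx
library(dplyr)    # assumed: used below for %>% and mutate()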
14.1 Data management and model description
Data management
Data sets used for discrete choice model estimation concern some individuals, who make one choice, or a sequence of choices, of one alternative among a set of mutually exclusive alternatives. The determinants of these choices are covariates that can depend on the alternative and the choice situation, only on the alternative or only on the choice situation. Data sets can have two different shapes: a wide shape (one row for each choice situation) or a long shape (one row for each alternative and, therefore, as many rows as there are alternatives for each choice situation). mlogit deals with both formats. It depends on the dfidx package, whose dfidx function takes as first argument a data.frame and returns a dfidx object, which is a data.frame in “long” format with a special data frame column which contains the indexes. The second argument is called idx. In its simple use, it should be a list (or a vector) of two characters containing the choice situation and the alternative indexes.
Wide format
dutch_railways3 is an example of a wide data set:
# A tibble: 2,929 × 11
id choiceid choice price_A price_B time_A time_B change_A
<int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1 A 10.9 18.2 2.5 2.5 0
2 1 2 A 10.9 14.5 2.5 2.17 0
3 1 3 A 10.9 18.2 1.92 1.92 0
4 1 4 B 18.2 14.5 2.17 2.5 0
# ℹ 2,925 more rows
# ℹ 3 more variables: change_B <int>, comfort_A <int>,
# comfort_B <int>
This data set contains data about a stated preference survey in the Netherlands in 1987. Each individual has responded to several (up to 16) scenarios. For every scenario, two train trips are proposed to the user, with different combinations of four attributes: price (the price in euros), time (travel time in hours), change (the number of changes) and comfort (the class of comfort, 0, 1 or 2, 0 being the most comfortable class). This “wide” format is suitable to store choice situation (or individual-specific) variables because, in this case, they are stored only once in the data. It is cumbersome for alternative-specific variables because there are as many columns for such variables as there are alternatives.
For such a wide data set, the shape argument of dfidx is mandatory, as its default value is "long". The alternative-specific variables are indicated with the varying argument, which is a numeric vector that indicates their position in the data frame. This argument is then passed to stats::reshape, which coerces the original data.frame to “long” format. Further arguments may be passed to reshape. For example, as the names of the variables are of the form price_A, one must add sep = "_" (the default value being "."). The choice argument is also mandatory because the response has to be transformed into a logical value in the long format. In “wide” format, there is no alternative index. The choice situation index is not mandatory, as there is one line for each choice situation. In this data set, there is a choice situation index called choiceid, and it is nested in the individual index called id. To take the panel dimension into account, idx is a list of length 1 (the choice situation) containing a vector of length 2 with choiceid and id. The idnames argument is used to give a relevant name to the second index, the NA in first position indicating that the name of the first index is unchanged.
Note the use of the opposite argument for the four covariates: we expect negative coefficients for all of them, so taking the opposite of the covariates leads to expected positive coefficients.
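A sketch of the call producing the object (called Tr) printed below; it mirrors the specification shown again in Section 14.4, with shape = "wide" made explicit and NA used as the first element of idnames, as described above:

Tr <- dfidx(dutch_railways, shape = "wide", choice = "choice",
            varying = 4:11, sep = "_",
            opposite = c("price", "comfort", "time", "change"),
            idx = list(c("choiceid", "id")), idnames = c(NA, "alt"))
Tr %>% print(n = 5)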
# A tibble: 5,858 × 6
# Index: 2929 (choiceid) x 2 (alt)
# Balanced: yes
# Nesting: choiceid (id)
idx choice price time change comfort
<idx> <fct> <dbl> <dbl> <int> <int>
1 1:A A -10.9 -2.5 0 -1
2 1:B A -18.2 -2.5 0 -1
3 2:A A -10.9 -2.5 0 -1
4 2:B A -14.5 -2.17 0 -1
5 3:A A -10.9 -1.92 0 -1
# ℹ 5,853 more rows
An idx column is added to the data, which contains the three relevant indexes: choiceid is the choice situation index, alt the alternative index and id the individual index. This column can be extracted using the idx function:
idx(Tr)
Long format
toronto_montreal4 is an example of a data set in long format. It presents the choice of individuals for a transport mode for the Toronto-Montreal corridor in 1989:
# A tibble: 15,520 × 11
case alt choice dist cost ivt ovt freq income urban
<int> <fct> <int> <int> <dbl> <int> <int> <int> <int> <int>
1 1 train 0 83 28.2 50 66 4 45 0
2 1 car 1 83 15.8 61 0 0 45 0
3 2 train 0 83 28.2 50 66 4 25 0
4 2 car 1 83 15.8 61 0 0 25 0
5 3 train 0 83 28.2 50 66 4 70 0
# ℹ 15,515 more rows
# ℹ 1 more variable: noalt <int>
There are four transport modes (air, train, bus and car) and most of the variables are alternative-specific (cost for monetary cost, ivt for in-vehicle time, ovt for out-of-vehicle time, freq for frequency). The only choice situation-specific variables are dist (distance of the trip), income (household income), urban (a dummy for trips which have a large city at the origin or the destination) and noalt (the number of available alternatives). The advantage of this shape is that there are far fewer columns than in the wide format, the caveat being that the values of dist, income and urban are repeated up to four times. For data in “long” format, the shape and the choice arguments are no longer mandatory. To replicate published results later in the text, we’ll use only a subset of the choice situations, namely those for which the four alternatives are available. This can be done using the subset function with the subset argument set to noalt == 4 while estimating the model. This can also be done within dfidx, using the subset argument.
The information about the structure of the data can be explicitly indicated using the choice situation and alternative indexes (respectively case and alt in this data set) or, in part, guessed by the dfidx function. Here, after subsetting, we have 2779 choice situations with 4 alternatives, and the rows are ordered first by choice situation and then by alternative (train, air, bus and car, in this order). The first way to read this data frame correctly is to completely ignore the two index variables. In this case, the only supplementary argument to provide is the alt.levels argument, which is a character vector that contains the names of the alternatives in their order of appearance:
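A sketch of such a call (the alternative names and their order are taken from the description above):

MC <- dfidx(toronto_montreal, subset = noalt == 4,
            alt.levels = c("train", "air", "bus", "car"))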
Note that this can only be used if the data set is “balanced”, which means that the same set of alternatives is available for all choice situations. It is also possible to provide, through the idx argument, the name of the variable that contains the alternatives, the name of the variable that contains the information about the choice situations, or both:
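Sketches of these three variants (the exact form of the idx argument in each elided call is an assumption based on the dfidx documentation):

MC <- dfidx(toronto_montreal, subset = noalt == 4, idx = list(NA, "alt"))
MC <- dfidx(toronto_montreal, subset = noalt == 4, idx = "case",
            alt.levels = c("train", "air", "bus", "car"))
MC <- dfidx(toronto_montreal, subset = noalt == 4, idx = c("case", "alt"))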
More simply, as the two indexes are stored in the first two columns of the original data frame, the idx argument can be unset:
MC <- dfidx(toronto_montreal, subset = noalt == 4)
and the indexes can be kept as standalone series if the drop.index argument is set to FALSE:
MC <- dfidx(toronto_montreal, subset = noalt == 4,
idx = c("case", "alt"), drop.index = FALSE)
MC %>% print(n = 5)
# A tibble: 11,116 × 12
# Index: 2779 (case) x 4 (alt)
# Balanced: yes
idx case alt choice dist cost ivt ovt freq income
<idx> <int> <fct> <int> <int> <dbl> <int> <int> <int> <int>
1 109:train 109 train 0 377 58.2 215 74 4 45
2 109:air 109 air 1 377 143. 56 85 9 45
3 109:bus 109 bus 0 377 27.5 301 63 8 45
4 109:car 109 car 0 377 71.6 262 0 0 45
5 110:train 110 train 0 377 58.2 215 74 4 70
# ℹ 11,111 more rows
# ℹ 2 more variables: urban <int>, noalt <int>
Model description
Standard formulas are not very practical to describe random utility models, as these models may use different sets of covariates. Actually, working with random utility models, one has to consider at most three sets of covariates:
- alternative- and choice situation-specific covariates \(x_{nj}\) with generic coefficients \(\beta\) and alternative-specific covariates \(t_j\) with a generic coefficient \(\nu\),
- choice situation-specific covariates \(z_n\) with alternative-specific coefficients \(\gamma_j\),
- alternative- and choice situation-specific covariates \(w_{nj}\) with alternative-specific coefficients \(\delta_j\).
The covariates enter the observable part of the utility which can be written, for alternative \(j\):
\[ V_{nj}=\alpha_j + \beta x_{nj} + \nu t_j + \gamma_j z_n + \delta_j w_{nj} \]
As the absolute value of utility is irrelevant, only utility differences are useful to model the choice of one alternative. For two alternatives \(j\) and \(l\), we obtain:
\[ V_{nj}-V_{nl}=(\alpha_j-\alpha_l) + \beta (x_{nj}-x_{nl}) + \nu(t_j - t_l) + (\gamma_j-\gamma_l) z_n + (\delta_j w_{nj} - \delta_l w_{nl}) \]
It is clear from the previous expression that coefficients of choice situation-specific variables (the intercept being one of those) should be alternative-specific; otherwise they would disappear in the differentiation. Moreover, only differences of these coefficients are relevant and can be identified. For example, with three alternatives 1, 2 and 3, the three coefficients \(\gamma_1, \gamma_2, \gamma_3\) associated with a choice situation-specific variable cannot be identified, but only two linear combinations. Therefore, one has to make a choice of normalization, and the simplest one is just to set \(\gamma_1 = 0\).
Coefficients for alternative and choice situation-specific variables may (or may not) be alternative-specific. For example, transport time is alternative-specific, but 10 minutes in public transport may not have the same impact on utility as 10 minutes in a car. In this case, alternative-specific coefficients are relevant. Monetary cost is also alternative-specific, but in this case, one can consider that $1 is $1 whether it is spent on the use of a car or on public transport. In this case, a generic coefficient is relevant. The treatment of alternative-specific variables doesn’t differ much from that of alternative and choice situation-specific variables with a generic coefficient. However, if some of these variables are introduced, the \(\nu\) parameter can only be estimated in a model without intercepts to avoid perfect multicollinearity.
A logit model with only choice situation-specific variables is sometimes called a multinomial logit model, one with only alternative-specific variables, a conditional logit model, and one with both kinds of variables, a mixed logit model. This is seriously misleading: the conditional logit model is also a logit model for longitudinal data in the statistical literature, and mixed logit is one of the names of a logit model with random parameters. Therefore, in what follows, we’ll use the name multinomial logit model for the model we’ve just described, whatever the nature of the explanatory variables used.
The mlogit package provides objects of class mFormula, which are built upon Formula objects provided by the Formula package. To illustrate the use of mFormula objects, we use again the toronto_montreal data set and consider three sets of covariates that will be indicated in a three-part formula, which refers to the three items at the beginning of this section:
- cost (monetary cost) is an alternative-specific covariate with a generic coefficient (part 1),
- income and urban are choice situation-specific covariates (part 2),
- ivt (in-vehicle travel time) is alternative-specific and alternative-specific coefficients are expected (part 3).
Some parts of the formula may be omitted when there is no ambiguity. For example, the following sets of formulas are identical:
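For instance, assuming the usual convention that an omitted second part defaults to an intercept-only part and an omitted third part is empty, the following pairs should describe the same models (a sketch, not the elided example itself):

f1a <- Formula(choice ~ cost | income)
f1b <- Formula(choice ~ cost | income | 0)
f2a <- Formula(choice ~ cost)
f2b <- Formula(choice ~ cost | 1 | 0)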
By default, an intercept is added to the model; it can be removed by using + 0 or - 1 in the second part.
A model.frame method is provided for dfidx objects. It differs from the formula method by the fact that the returned object is an object of class dfidx and not an ordinary data frame, which means that the information about the structure of the data is not lost. Defining a specific model.frame method for dfidx objects implies that the first argument of the function should be a dfidx object, which results in an unusual order of the arguments in the function (the data first, and then the formula). Moreover, as the model matrix for random utility models has specific features, we add a supplementary argument called pkg to the dfidx function so that the returned object has a specific class (and inherits the dfidx class):
MC <- dfidx(toronto_montreal, subset = noalt == 4, pkg = "mlogit")
class(MC)
## [1] "dfidx_mlogit" "dfidx" "tbl_df" "tbl"
## [5] "data.frame"
f <- Formula(choice ~ cost | income | ivt)
mf <- model.frame(MC, f)
class(mf)
## [1] "dfidx_mlogit" "dfidx" "tbl_df" "tbl"
## [5] "data.frame"
Using mf as the argument of model.matrix enables the construction of the relevant model matrix for random utility models, as a specific model.matrix method for dfidx_mlogit objects is provided.
head(model.matrix(mf), 4)
(Intercept):air (Intercept):bus (Intercept):car cost income:air
1 0 0 0 58.25 0
2 1 0 0 142.80 45
3 0 1 0 27.52 0
4 0 0 1 71.63 0
income:bus income:car ivt:train ivt:air ivt:bus ivt:car
1 0 0 215 0 0 0
2 0 0 0 56 0 0
3 45 0 0 0 301 0
4 0 45 0 0 0 262
The model matrix contains \(J-1\) columns for every choice situation-specific variable (income and the intercept), which means that the coefficient associated with the first alternative (train) is set to 0. It contains only one column for cost because we want a generic coefficient for this variable. It contains \(J\) columns for ivt, because it is an alternative-specific variable for which we want alternative-specific coefficients.
14.2 Random utility model and multinomial logit model
Random utility model
The utility for alternative \(l\) is written as: \(U_l=V_l+\epsilon_l\) where \(V_l\) is a function of some observable covariates and unknown parameters to be estimated, and \(\epsilon_l\) is a random deviate which contains all the unobserved determinants of the utility. Alternative \(l\) is therefore chosen if \(\epsilon_j < (V_l-V_j)+\epsilon_l \;\forall\;j\neq l\) and the probability of choosing this alternative is then:
\[ \mbox{P}(\epsilon_1 < V_l-V_1+\epsilon_l, \epsilon_2 < V_l-V_2+\epsilon_l, ..., \epsilon_J < V_l-V_J+\epsilon_l). \]
Denoting \(F_{-l}\) the joint cumulative distribution function of all the \(\epsilon\)s except \(\epsilon_l\), this probability is:
\[ (\mbox{P}_l \mid \epsilon_l)= F_{-l}(V_l-V_1+\epsilon_l, ..., V_l-V_J+\epsilon_l). \]
Note that this probability is conditional on the value of \(\epsilon_l\). The unconditional probability (which depends only on \(\beta\) and on the value of the observed covariates) is obtained by integrating out the conditional probability using the marginal density of \(\epsilon_l\), denoted by \(f_l\):
\[ \mbox{P}_l=\int F_{-l}(V_l-V_1+\epsilon_l, ..., V_l-V_J+\epsilon_l)f_l(\epsilon_l) d\epsilon_l. \]
The conditional probability is an integral of dimension \(J-1\), and the computation of the unconditional probability adds one more dimension of integration.
Distribution of the error terms
The multinomial logit model (McFadden 1974) is a special case of the model developed in the previous section. It is based on three hypotheses. The first hypothesis is the independence of the errors. In this case, the univariate distribution of the errors can be used, which leads to the following conditional and unconditional probabilities:
\[ (\mbox{P}_l \mid \epsilon_l)=\prod_{j\neq l}F_j(V_l-V_j+\epsilon_l) \mbox{ and } \mbox{P}_l =\int \prod_{j\neq l}F_j(V_l-V_j+\epsilon_l) \; f_l(\epsilon_l) \;d\epsilon_l, \]
which means that the conditional probability is the product of \(J-1\) univariate cumulative distribution functions, and the evaluation of only a one-dimensional integral is required to compute the unconditional probability. The second hypothesis is that each \(\epsilon\) follows a Gumbel (maximum) distribution, whose density and distribution functions are respectively:
\[ f(z)=\frac{1}{\theta}e^{-\frac{z-\mu}{\theta}} e^{-e^{-\frac{z-\mu}{\theta}}} \mbox{ and } F(z)=\int_{-\infty}^{z} f(t) dt=e^{-e^{-\frac{z-\mu}{\theta}}}, \]
where \(\mu\) is the location parameter and \(\theta\) the scale parameter. The first two moments of the Gumbel distribution are \(\mbox{E}(z)=\mu+\theta \gamma\), where \(\gamma\) is the Euler-Mascheroni constant (\(\approx 0.57721\)) and \(\mbox{V}(z)=\frac{\pi^2}{6}\theta^2\). The mean of \(\epsilon_j\) is not identified if \(V_j\) contains an intercept. We can then, without loss of generality, suppose that \(\mu_j=0, \; \forall j\). Moreover, the overall scale of utility is not identified. Therefore, only \(J-1\) scale parameters may be identified, and a natural choice of normalization is to impose that one of the \(\theta_j\) is equal to 1. The last hypothesis is that the errors are identically distributed. As the location parameter is not identified for any error term, this hypothesis is essentially a homoskedasticity hypothesis, which means that the scale parameter of the Gumbel distribution is the same for all the alternatives. As one of them has been previously set to 1, we can therefore suppose that, without loss of generality, \(\theta_j = 1, \;\forall j \in 1... J\). The conditional and unconditional probabilities then further simplify to:
\[ (\mbox{P}_l \mid \epsilon_l) =\prod_{j\neq l}e^{-e^{-(V_l-V_j+\epsilon_l)}} \mbox{ and } \mbox{P}_l =\int_{-\infty}^{+\infty}\prod_{j\neq l}e^{-e^{-(V_l-V_j+t)}}e^{-t}e^{-e^{-t}}dt. \]
The probabilities then have very simple closed forms, which correspond to the logit transformation of the deterministic part of the utility.5
\[ P_l=\frac{e^{V_l}}{\sum_{j=1}^J e^{V_j}}. \]
IIA property
If we consider the probabilities of choice for two alternatives \(l\) and \(m\), we have \(P_l=e^{V_l}/\sum_j e^{V_j}\) and \(P_m=e^{V_m}/\sum_j e^{V_j}\). The ratio of these two probabilities is:
\[ \frac{P_l}{P_m}=\frac{e^{V_l}}{e^{V_m}}=e^{V_l-V_m}. \]
This probability ratio for the two alternatives depends only on the characteristics of these two alternatives and not on those of other alternatives. This is called the independence of irrelevant alternatives (IIA) property. IIA relies on the hypothesis that the errors are identical and independent. It is not a problem in itself and may even be considered as a useful feature for a well-specified model. However, this hypothesis may be in practice violated, especially if some important variables are omitted.
Interpretation
Marginal effects
The marginal effects are the derivatives of the probabilities with respect to the covariates, which can be choice situation-specific (\(z_n\)) or alternative-specific (\(x_{nj}\)):
\[ \begin{array}{rcl} \displaystyle \frac{\partial P_{nl}}{\partial z_{n}}&=&P_{nl}\left(\beta_l-\sum_j P_{nj}\beta_j\right) \\ \displaystyle \frac{\partial P_{nl}}{\partial x_{nl}}&=&\gamma P_{nl}(1-P_{nl})\\ \displaystyle \frac{\partial P_{nl}}{\partial x_{nk}}&=&-\gamma P_{nl}P_{nk}. \end{array} \]
For a choice situation-specific variable, the sign of the marginal effect is not necessarily the sign of the coefficient. Actually, the sign of the marginal effect is given by \(\left(\beta_l-\sum_j P_{nj}\beta_j\right)\), which is positive if the coefficient for alternative \(l\) is greater than a weighted average of the coefficients for all the alternatives, the weights being the probabilities of choosing the alternatives. In this case, the sign of the marginal effect can be established with no ambiguity only for the alternatives with the lowest and the greatest coefficients.
For an alternative-specific variable, the sign of the coefficient can be directly interpreted. The marginal effect is obtained by multiplying the coefficient by the product of two probabilities which is at most 0.25. The rule of thumb is therefore to divide the coefficient by 4 in order to have an upper bound of the marginal effect.
Note that the last equation can be rewritten: \(\frac{\mbox{d} P_{nl} / P_{nl}}{\mbox{d}x_{nk}} = -\gamma P_{nk}\). Therefore, when a characteristic of alternative \(k\) changes, the relative changes of the probabilities for every alternative except \(k\) are the same, which is a consequence of the IIA property.
Marginal rates of substitution
Coefficients are marginal utilities, which cannot be interpreted directly. However, ratios of coefficients are marginal rates of substitution. For example, if the observable part of utility is \(V=\beta_0 +\beta_1 x_1 +\beta_2 x_2 + \beta_3 x_3\), joint variations of \(x_1\) and \(x_2\) which ensure the same level of utility are such that \(dV=\beta_1 dx_1+\beta_2 dx_2=0\), so that:
\[ - \frac{dx_2}{dx_1}\mid_{dV = 0} = \frac{\beta_1}{\beta_2}. \]
For example, if \(x_2\) is transport cost (in dollars), \(x_1\) transport time (in hours), \(\beta_1 = 1.5\) and \(\beta_2=0.05\), \(\frac{\beta_1}{\beta_2}=30\) is the marginal rate of substitution of time in terms of dollars, and the value of 30 means that, to reduce travel time by 1 hour, the individual is willing to pay at most $30 more. Stated more simply, the value of time is $30 per hour.
Consumer surplus
Consumer’s surplus has a very simple expression for multinomial logit models, which was first derived by Small and Rosen (1981). The level of utility attained by an individual is \(U_j=V_j+\epsilon_j\), \(j\) being the chosen alternative. The expected utility, from the researcher’s point of view, is then \(\mbox{E}(\max_j U_j)\), where the expectation is taken over the values of all the error terms. Its expression is simply, up to an additive unknown constant, the log of the denominator of the logit probabilities, often called the “log-sum”:
\[ \mbox{E}(U)=\ln \sum_{j=1}^Je^{V_j}+C. \]
If the marginal utility of income (\(\alpha\)) is known and constant, the expected surplus is simply \(\frac{\mbox{E}(U)}{\alpha}\).
Application
Random utility models are fitted using the mlogit function. Basically, only two arguments are mandatory, formula and data, if a dfidx object (and not an ordinary data.frame) is provided. We use the toronto_montreal data set, which was already coerced to a dfidx object (called MC) in the previous section. The same model can then be estimated using this dfidx object as the data argument:
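A sketch of such a call, using the three-part formula of the previous section (the exact formula used at this point in the original is not shown):

ml.MC <- mlogit(choice ~ cost | income | ivt, MC)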
or a data.frame. In this latter case, further arguments that will be passed to dfidx should be indicated:
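For instance (a sketch; the supplementary arguments are those that would be passed on to dfidx):

ml.MC <- mlogit(choice ~ cost | income | ivt, toronto_montreal,
                subset = noalt == 4, idx = c("case", "alt"))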
mlogit provides two further useful arguments:
- reflevel indicates which alternative is the “reference” alternative, i.e., the one for which the coefficients of choice situation-specific covariates are set to 0,
- alt.subset indicates a subset of alternatives on which the estimation has to be performed; in this case, only the lines that correspond to the selected alternatives are used, and all the choice situations where unselected alternatives have been chosen are removed.
We estimate the model on the subset of three alternatives (we exclude bus, whose market share is negligible in our sample) and we set car as the reference alternative. Moreover, we use a total transport time variable computed as the sum of the in-vehicle and out-of-vehicle time variables.
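A sketch of the elided preparation and estimation step, matching the call reported in the summary output below (constructing time as ivt + ovt is an assumption based on the description above):

MC <- MC %>% mutate(time = ivt + ovt)
ml.MC1 <- mlogit(choice ~ cost + freq | income | time, MC,
                 alt.subset = c("car", "train", "air"),
                 reflevel = "car")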
The main results of the model are computed and displayed using the summary method:
summary(ml.MC1)
Call:
mlogit(formula = choice ~ cost + freq | income | time, data = MC,
alt.subset = c("car", "train", "air"), reflevel = "car",
method = "nr")
Frequencies of alternatives:choice
car train air
0.458 0.167 0.375
nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 6.94E-06
successive function values within tolerance limits
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):train -0.970344 0.265131 -3.66 0.00025 ***
(Intercept):air -1.898566 0.684143 -2.78 0.00552 **
cost -0.028497 0.006559 -4.34 1.4e-05 ***
freq 0.074029 0.004733 15.64 < 2e-16 ***
income:train -0.006469 0.003104 -2.08 0.03713 *
income:air 0.028246 0.003654 7.73 1.1e-14 ***
time:car -0.014024 0.001380 -10.16 < 2e-16 ***
time:train -0.010969 0.000818 -13.40 < 2e-16 ***
time:air -0.017551 0.003992 -4.40 1.1e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -1950
McFadden R^2: 0.312
Likelihood ratio test : chisq = 1770 (p.value = <2e-16)
The frequencies of the different alternatives in the sample are first indicated. Next, some information about the optimization is displayed: the Newton-Raphson method (with analytical gradient and hessian) is used, as it is the most efficient method for this simple model, for which the log-likelihood function is globally concave. Note that very few iterations and little computing time are required to estimate this model. Then the usual table of coefficients is displayed, followed by some goodness-of-fit measures: the value of the log-likelihood function, which is compared to the value when only intercepts are introduced, which leads to the computation of the McFadden \(R^2\) and to the likelihood ratio test. The fitted method can be used either to obtain the probability of actual choices (type = "outcome") or the probabilities for all the alternatives (type = "probabilities").
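The two blocks of output below presumably come from calls of this kind (a sketch):

fitted(ml.MC1, type = "outcome") %>% head
fitted(ml.MC1, type = "probabilities") %>% head(4)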
109 110 111 112 113 114
0.1909 0.3400 0.1471 0.3400 0.3400 0.2440
car train air
109 0.4206 0.3884 0.1909
110 0.3696 0.2904 0.3400
111 0.4297 0.4233 0.1471
112 0.3696 0.2904 0.3400
Note that the log-likelihood is the sum of the logs of the fitted outcome probabilities and that, as the model contains intercepts, the average fitted probabilities for every alternative equal the market shares of the alternatives in the sample.
Predictions can be made using the predict method. If no data is provided, predictions are made for the sample mean values of the covariates.
predict(ml.MC1)
## car train air
## 0.5066 0.2117 0.2817
Assume, for example, that we wish to predict the effect of a reduction of train transport time of 20%. We first create a new data.frame simply by multiplying train transport time by 0.8 and then use the predict method with this new data.frame.
NMC <- MC
NMC <- NMC %>% mutate(time = ifelse(idx$alt == "train", 0.8 * time, time))
Oprob <- fitted(ml.MC1, type = "probabilities")
Nprob <- predict(ml.MC1, newdata = NMC)
rbind(old = apply(Oprob, 2, mean), new = apply(Nprob, 2, mean))
car train air
old 0.4576 0.1672 0.3752
new 0.4045 0.2636 0.3319
If, for the first individuals in the sample, we compute the ratio of the probabilities of the air and the car mode, we obtain:
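A sketch of this computation, using the Oprob and Nprob probability matrices created above (the selection of the first three choice situations is illustrative):

Oprob[1:3, "air"] / Oprob[1:3, "car"]
Nprob[1:3, "air"] / Nprob[1:3, "car"]

The ratio is the same before and after the change,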
which is an illustration of the IIA property. If train time changes, it changes the probabilities of choosing air and car, but not their ratio. We next compute the surplus for individuals of the sample induced by train time reduction. This requires the computation of the log-sum term (also called inclusive value or inclusive utility) for every choice situation, which is:
\[ \mbox{iv}_n = \ln \sum_{j = 1} ^ J e^{\beta^\top x_{nj}}. \]
For this purpose, we use the logsum function, which works on a vector of coefficients and a model matrix. The basic use of logsum consists of providing as its unique argument (called coef) a mlogit object. In this case, the model.matrix and the coef are extracted from the same model:
ivbefore <- logsum(ml.MC1)
To compute the log-sum after train time reduction, we must provide a model matrix which is not the one corresponding to the fitted model. This can be done using the X argument, which is a matrix or an object from which a model matrix can be extracted. This can also be done by filling the data argument (a data frame or an object from which a data frame can be extracted using model.frame), and optionally the formula argument (a formula or an object for which the formula method can be applied). If no formula is provided, but if data is a dfidx object, the formula is extracted from it.
ivafter <- logsum(ml.MC1, data = NMC)
Surplus variation is then computed as the difference of the log-sums divided by the opposite of the cost coefficient which can be interpreted as the marginal utility of income:
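A sketch of the elided computation (the surplus is expressed in the same monetary units as cost):

surplus <- (ivafter - ivbefore) / (- coef(ml.MC1)["cost"])
summary(surplus)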
Consumer surplus variations range from 0.6 to 31 Canadian dollars, with a median value of about $4. Marginal effects are computed using the effects method. By default, they are computed at the sample mean, but a data argument can be provided. The variation of the probability and of the covariate can be either absolute or relative. This is indicated with the type argument, which is a combination of two characters, "a" (for absolute) and "r" (for relative). For example, type = "ar" means that what is measured is an absolute variation of the probability for a relative variation of the covariate.
effects(ml.MC1, covariate = "income", type = "ar")
## car train air
## -0.1822 -0.1509 0.3331
The results indicate that, for a 100% increase in income, the probability of choosing air increases by 33 percentage points, while the probabilities of choosing car and train decrease by 18 and 15 percentage points.
For an alternative specific covariate, a matrix of marginal effects is displayed.
effects(ml.MC1, covariate = "cost", type = "rr")
car train air
car -0.9131 0.9377 0.9377
train 0.3358 -1.2505 0.3358
air 1.2317 1.2317 -3.1410
The cell in the \(l^{\mbox{th}}\) row and the \(c^{\mbox{th}}\) column indicates the change of the probability of choosing alternative \(c\) when the cost of alternative \(l\) changes. As type = "rr", elasticities are computed. For example, a 10% change of train cost increases the probabilities of choosing car and air by 3.36%. Note that the relative changes of the probabilities of choosing one of these two modes are equal, which is a consequence of the IIA property. Finally, in order to compute travel time valuation, we divide the coefficients of travel times (in minutes) by the coefficient of monetary cost (in dollars):
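A sketch of this computation (coefficient names as they appear in the summary above; the factor 60 converts values per minute into values per hour):

coef(ml.MC1)[c("time:car", "time:train", "time:air")] /
    coef(ml.MC1)["cost"] * 60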
The value of travel time ranges from 23 Canadian dollars per hour for the train to 37 for the plane.
14.3 Logit models relaxing the iid hypothesis
In the previous section, we assumed that the error terms were iid (identically and independently distributed), i.e., uncorrelated and homoskedastic. Extensions of the basic multinomial logit model have been proposed by relaxing one of these two hypotheses while maintaining the hypothesis of a Gumbel distribution.
Heteroskedastic logit model
The heteroskedastic logit model was proposed by Bhat (1995). The probability that \(U_l>U_j\) is:
\[ P(\epsilon_j<V_l-V_j+\epsilon_l)=e^{-e^{-\frac{(V_l-V_j+\epsilon_l)}{\theta_j}}}, \]
which implies the following conditional and unconditional probabilities:
\[ (P_l \mid \epsilon_l) =\prod_{j\neq l}e^{-e^{-\frac{(V_l-V_j+\epsilon_l)}{\theta_j}}}, \]
\[ \begin{array}{rcl} P_l&=&\displaystyle\int_{-\infty}^{+\infty} \prod_{j\neq l} \left(e^{-e^{-\frac{(V_l-V_j+t)}{\theta_j}}}\right)\frac{1}{\theta_l}e^{-\frac{t}{\theta_l}}e^{-e^{-\frac{t}{\theta_l}}} dt\\ &=& \displaystyle \int_{0}^{+\infty}\left(e^{-\sum_{j \neq l}e^{-\frac{V_l-V_j-\theta_l \ln t}{\theta_j}}}\right)e^{-t}dt. \end{array} \tag{14.1}\]
There is no closed form for this integral, but it can be efficiently computed using a Gauss quadrature method, and more precisely the Gauss-Laguerre quadrature method. Bhat (1995) estimated the heteroskedastic logit model on the toronto_montreal data set. Using mlogit, the heteroskedastic logit model is obtained by setting the heterosc argument to TRUE:
ml.MC <- mlogit(choice ~ freq + cost + ivt + ovt |
urban + income, MC, reflevel = 'car',
alt.subset = c("car", "train", "air"))
hl.MC <- mlogit(choice ~ freq + cost + ivt + ovt |
urban + income, MC, reflevel = 'car',
alt.subset = c("car", "train", "air"),
heterosc = TRUE)
hl.MC %>% gaze(coef = 11:12)
Estimate Std. Error z-value Pr(>|z|)
sp.train 1.237 0.110 11.20 < 2e-16
sp.air 0.540 0.112 4.83 1.4e-06
Two supplementary coefficients (sp.train and sp.air) are estimated (the \(\theta_j\) in Equation 14.1), the third one, for the reference modality, being set to 1. The variances of the error terms of train and air are respectively higher and lower than the variance of the error term of car (set to 1). Note that the z-values and p-values of the output are not particularly meaningful, as the hypothesis that the coefficient is zero (and not 1) is tested. The homoskedasticity hypothesis can be tested using any of the three tests. For the likelihood ratio and the Wald tests, one can use only the fitted heteroskedastic model as argument. In this case, it is guessed that the hypothesis that the user wants to test is the homoskedasticity hypothesis:
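One possibility (a sketch, comparing the two fitted models directly; gaze is the compact-display helper used throughout this chapter):

lrtest(hl.MC, ml.MC) %>% gaze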
or, more simply:
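A sketch of the single-argument calls, assuming that, as described above, the lrtest and waldtest methods for mlogit objects guess the homoskedastic null when only the heteroskedastic model is supplied:

lrtest(hl.MC) %>% gaze
waldtest(hl.MC) %>% gaze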
The Wald test can also be computed using the linearHypothesis function from the car package:
car::linearHypothesis(hl.MC, c('sp.air = 1', 'sp.train = 1')) %>% gaze
## Chisq = 25.196, df: 2, pval = 0.000
For the score test, we provide the constrained model as argument, which is the standard multinomial logit model, and a supplementary argument which defines the unconstrained model, in this case heterosc = TRUE:
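A sketch of this call, following the scoretest pattern used elsewhere in this chapter:

scoretest(ml.MC, heterosc = TRUE) %>% gaze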
The homoskedasticity hypothesis is therefore strongly rejected using the Wald test, but only at the 1% and 5% levels for, respectively, the score and the likelihood ratio tests.
Nested logit model
The nested logit model was first proposed by McFadden (1978). It is a generalization of the multinomial logit model that is based on the idea that some alternatives may be joined in several groups (called nests). The error terms may then present some correlation in the same nest, whereas error terms of different nests are still uncorrelated. Denoting \(m=1... M\) the nests and \(B_m\) the set of alternatives belonging to nest \(m\), the cumulative distribution of the errors is:
\[ \mbox{exp}\left(-\sum_{m=1}^M \left( \sum_{j \in B_m} e^{-\epsilon_j/\lambda_m}\right)^{\lambda_m}\right). \]
The marginal distributions of the \(\epsilon\)s are still univariate extreme values, but there is now some correlation within nests. \(1-\lambda_m\) is a measure of the correlation, i.e., \(\lambda_m = 1\) implies no correlation. In the special case where \(\lambda_m=1\; \forall m\), the errors are iid Gumbel errors and the nested logit model reduces to the multinomial logit model. It can then be shown that the probability of choosing alternative \(j\) that belongs to nest \(l\) is:
\[ P_j = \frac{e^{V_j/\lambda_l}\left(\sum_{k \in B_l} e^{V_k/\lambda_l}\right)^{\lambda_l-1}} {\sum_{m=1}^M\left(\sum_{k \in B_m} e^{V_k/\lambda_m}\right)^{\lambda_m}}, \]
and that this model is a random utility model if all the \(\lambda\) parameters are in the \(0-1\) interval.6 Let us now write the deterministic part of the utility of alternative \(j\) as the sum of two terms: the first (\(Z_j\)) being specific to the alternative and the second (\(W_l\)) to the nest it belongs to:
\[V_j=Z_j+W_l.\]
We can then rewrite the probabilities as follows:
\[ \begin{array}{rcl} P_j&=&\frac{e^{(Z_j+W_l)/\lambda_l}}{\sum_{k \in B_l} e^{(Z_k+W_l)/\lambda_l}}\times \frac{\left(\sum_{k \in B_l} e^{(Z_k+W_l)/\lambda_l}\right)^{\lambda_l}} {\sum_{m=1}^M\left(\sum_{k \in B_m} e^{(Z_k+W_m)/\lambda_m}\right)^{\lambda_m}}\\ &=&\frac{e^{Z_j/\lambda_l}}{\sum_{k \in B_l} e^{Z_k/\lambda_l}}\times \frac{\left(e^{W_l/\lambda_l}\sum_{k \in B_l} e^{Z_k/\lambda_l}\right)^{\lambda_l}} {\sum_{m=1}^M\left(e^{W_m/\lambda_m}\sum_{k \in B_m} e^{Z_k/\lambda_m}\right)^{\lambda_m}}. \end{array} \]
Then denote \(I_l=\ln \sum_{k \in B_l} e^{Z_k/\lambda_l}\), which is often called the log-sum, the inclusive value or the inclusive utility.7 We can then write the probability of choosing alternative \(j\) as:
\[ P_j=\frac{e^{Z_j/\lambda_l}}{\sum_{k \in B_l} e^{Z_k/\lambda_l}}\times \frac{e^{W_l+\lambda_l I_l}}{\sum_{m=1}^Me^{W_m+\lambda_m I_m}}. \]
The first term \(\mbox{P}_{j\mid l}\) is the conditional probability of choosing alternative \(j\) if nest \(l\) is chosen. It is often referred to as the lower model. The second term \(\mbox{P}_l\) is the marginal probability of choosing nest \(l\) and is referred to as the upper model. \(W_l+\lambda_l I_l\) can be interpreted as the expected utility of choosing the best alternative in \(l\), \(W_l\) being the expected utility of choosing an alternative in this nest (whatever this alternative is) and \(\lambda_l I_l\) being the expected extra utility gained by being able to choose the best alternative in the nest. The inclusive values link the two models. It is then straightforward to show that IIA applies within nests, but not for two alternatives in different nests.
A consistent but inefficient way of estimating the nested logit model is to estimate separately its two components. The coefficients of the lower model are first estimated, which enables the computation of the inclusive values \(I_l\). The coefficients of the upper model are then estimated, using \(I_l\) as covariates. Maximizing directly the likelihood function of the nested model leads to a more efficient estimator.
To illustrate the estimation of the nested logit model, we use the telephone data set, used by Train, McFadden, and Ben-Akiva (1987) and Walker, Ben-Akiva, and Bolduc (2007).
# A tibble: 2,170 × 4
choice service household cost
<lgl> <fct> <int> <dbl>
1 FALSE budget 1 1.76
2 FALSE extended 1 13.8
3 FALSE local 1 2.55
4 FALSE metro 1 3.15
5 TRUE standard 1 1.75
# ℹ 2,165 more rows
A total of 428 households were surveyed in 1984 about their choice of a local telephone service, which typically involves the choice between a flat service (a fixed monthly charge for unlimited calls within a specified geographical area) and a measured service (a reduced fixed monthly charge for a limited number of calls, plus usage charges for additional calls). Households had the choice between five services:
- budget measured (budget): no fixed monthly charge; usage charges apply to each call made,
- standard measured (standard): a fixed monthly charge covers up to a specified dollar amount (greater than the fixed charge) of local calling, after which usage charges apply to each call made,
- local flat (local): a greater monthly charge that may depend upon residential location; unlimited free calling within a local calling area; usage charges apply to calls made outside the local calling area,
- extended area flat (extended): a further increase in the fixed monthly charge to permit unlimited free calling within an extended area,
- metro area flat (metro): the greatest fixed monthly charge that permits unlimited free calling within the entire metropolitan area.
The first two services are measured, and the last three are flat services. There is therefore an obvious nesting structure for this example. We first estimate the multinomial logit model, with the log of cost as the unique covariate:
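A sketch of the elided call (the indexes and formula mirror the nested specification below; per the sentence above, the cost variable is assumed to already contain the log of the monthly cost):

ml_tel <- mlogit(choice ~ cost, telephone,
                 idx = c("household", "service"))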
We then update this model in order to introduce nests, using the nests argument. It is a list of character vectors containing the alternatives of each nest. It is advisable to use a named list (we use here "measured" and "flat" as the names of the nests):
nl_tel <- mlogit(choice ~ cost, telephone,
idx = c("household", "service"),
nests = list(measured = c("budget", "standard"),
flat = c("local", "metro", "extended")))
coef(nl_tel)
## (Intercept):extended (Intercept):local (Intercept):metro
## 1.2255 1.2716 1.7837
## (Intercept):standard cost iv:measured
## 0.3782 -1.4900 0.4848
## iv:flat
## 0.4362
Two supplementary coefficients are estimated, iv:measured and iv:flat. The two values are in the 0-1 interval and close to each other. The un.nest.el argument makes it possible to estimate a unique supplementary coefficient for the two nests:
nl_tel_u <- update(nl_tel, un.nest.el = TRUE)
We then have three nested models and the hypothesis of a unique parameter can be tested using any of the three tests:
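For instance, the likelihood ratio and Wald versions (a sketch; the Wald test is written here with car::linearHypothesis, as done elsewhere in this chapter):

lrtest(nl_tel_u, nl_tel) %>% gaze
car::linearHypothesis(nl_tel, "iv:measured = iv:flat") %>% gaze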
The three tests conclude that a unique parameter can be estimated. Then, we can test whether this parameter is 1, in which case the nested logit model reduces to the multinomial logit model:
scoretest(ml_tel,
nests = list(measured = c("budget", "standard"),
flat = c("local", "metro", "extended"))) %>% gaze
## chisq = 5.685, df: 2, pval = 0.058
lrtest(nl_tel_u, ml_tel) %>% gaze
## Chisq = 17.177, df: 1, pval = 0.000
waldtest(nl_tel_u, nests = NULL) %>% gaze
## chisq = 25.541, df: 1, pval = 0.000
car::linearHypothesis(nl_tel_u, "iv = 1") %>% gaze
## Chisq = 25.541, df: 1, pval = 0.000
Based on the Wald and the likelihood ratio tests, the preferred specification is the nested logit model (but note that the p-value for the score test is slightly higher than 5%).
14.4 Random parameters (or mixed) logit model
Derivation of the model
A mixed logit model (or random parameters logit model) is a logit model whose parameters are assumed to vary from one individual to another. It is therefore a model that takes the heterogeneity of the population into account. For the standard logit model, the probability that individual \(n\) chooses alternative \(j\) is:
\[ P_{nl}=\frac{e^{\beta'x_{nl}}}{\sum_j e^{\beta'x_{nj}}}. \]
Suppose now that the coefficients are individual-specific. The probabilities are then:
\[ P_{nl}=\frac{e^{\beta_n'x_{nl}}}{\sum_j e^{\beta_n'x_{nj}}}. \]
A first approach consists of estimating the parameters for every individual. However, these parameters are identified and can be consistently estimated only if a large number of choice situations per individual is available, which is rarely the case. A more appealing approach consists of considering \(\beta_n\) as random draws from a distribution whose parameters are estimated, which leads to the mixed logit model. The probability that individual \(n\) will choose alternative \(l\), for a given value of \(\beta_n\), is:
\[ P_{nl} \mid \beta_n =\frac{e^{\beta_n'x_{nl}}}{\sum_j e^{\beta_n'x_{nj}}}. \]
To get the unconditional probability, we have to integrate out this conditional probability, using the density function of \(\beta\). Suppose that \(V_{nl}=\beta_n x_{nl}\), i.e., that there is only one individual-specific coefficient and that the density of \(\beta_n\) is \(f(\beta,\theta)\), \(\theta\) being the vector of the parameters of the distribution of \(\beta\). The unconditional probability is then:
\[ P_{nl}= \mbox{E}(P_{nl} \mid \beta_n) = \int_{\beta}(P_{nl} \mid \beta)f(\beta,\theta)d\beta = \int_{\beta}\frac{e^{\beta^\top x_{nl}}}{\sum_j e^{\beta^\top x_{nj}}}f(\beta,\theta)d\beta, \]
which is a one-dimensional integral that can be efficiently estimated by quadrature methods. If \(V_{nl}=\beta_n^{\top} x_{nl}\) where \(\beta_n\) is a vector of length \(K\) and \(f(\beta,\theta)\) is the joint density of the \(K\) individual-specific coefficients, the unconditional probability is:
\[ P_{nl}= \mbox{E}(P_{nl} \mid \beta_n) = \int_{\beta_1}\int_{\beta_2}...\int_{\beta_K}(P_{nl} \mid \beta)f(\beta,\theta)d\beta_1d\beta_2... d\beta_K. \]
This is a \(K\)-dimensional integral which cannot easily be estimated by quadrature methods. The only practical method is then to use simulations. More precisely, \(R\) draws of the parameters are taken from the distribution of \(\beta\), the probability is computed for every draw, and the unconditional probability, which is the expected value of the conditional probabilities, is estimated by the average of the \(R\) probabilities.
The expected value of a random coefficient (\(\mbox{E}(\beta)\)) is simply estimated by the mean of the \(R\) draws on its distribution: \(\bar{\beta}=\frac{1}{R}\sum_{r=1}^R \beta_r\). Individual parameters are obtained by first computing the probabilities of the observed choice of \(n\) for every value of \(\beta_r\):
\[ P_{nr}=\frac{\sum_j y_{nj} e^{\beta_r^{'}x_{nj}}}{\sum_j e^{\beta_r^{'}x_{nj}}}, \]
where \(y_{nj}\) is a dummy equal to 1 if \(n\) has chosen alternative \(j\). The expected value of the parameter for an individual is then estimated by using these probabilities to weight the \(R\) \(\beta\) values:
\[ \hat{\beta}_n = \frac{\sum_r P_{nr} \beta_r}{\sum_r P_{nr}}. \]
If there are repeated observations for the same individuals, the longitudinal dimension of the data can be taken into account in the mixed logit model, assuming that the random parameters of individual \(n\) are the same for all their choice situations. Denoting \(y_{ntl}\) a dummy equal to 1 if \(n\) chooses alternative \(l\) for the \(t\)th choice situation, the probability of the observed choice is:
\[ P_{nt}\mid \beta_n= \frac{\sum_j y_{ntj}e^{\beta_n ^\top x_{ntj}}}{\sum_j e^{\beta_n ^\top x_{ntj}}}. \]
The joint probability for the \(T\) observations of individual \(n\) is then:
\[ P_{n}\mid \beta_n=\prod_t \frac{\sum_j y_{ntj}e^{\beta_n ^\top x_{ntj}}}{\sum_j e^{\beta_n ^\top x_{ntj}}}. \]
Application
The random parameters logit model is estimated by providing a rpar argument to mlogit. This argument is a named vector, the names being those of the random coefficients and the values the acronyms of their distributions. Currently, the normal ("n"), log-normal ("ln"), zero-censored normal ("cn"), uniform ("u") and triangular ("t") distributions are available. For these distributions, two parameters are estimated: for normal-related distributions, the mean and the standard deviation of the underlying normal distribution; for the uniform and triangular distributions, the mean and the half-range of the distribution. For these last two distributions, zero-bounded variants are also provided ("zbt" and "zbu"). These two distributions are defined by only one parameter (the mean) and their support ranges from 0 to twice the mean.
Several considerations may lead to the choice of a specific distribution:
- if correlated coefficients are required, the natural choice is a (transformed) normal distribution, "n", "ln", "tn" or "cn",
- it’s often the case that one wants to impose that the distribution of a random parameter takes only positive or negative values. For example, the price coefficient should be negative for every individual. In this case, "zbt" and "zbu" can be used. The use of "ln" and "cn" can also be relevant but, in this case, if only negative values are expected, one should consider the distribution of the opposite of the random price coefficient. This can easily be done using the opposite argument of dfidx,8
- the use of unbounded distributions often leads to implausible values of some statistics of the random parameters, especially the mean. This is particularly the case for the log-normal distribution, which has a heavy right tail. In this case, bounded distributions like the uniform and the triangular can be used.
Three further arguments of mlogit are relevant here: R is the number of draws, halton indicates whether Halton draws (see Train 2009, chapter 9) should be used (NA and NULL indicate respectively that default Halton draws are used and that pseudo-random numbers are used), and panel is a boolean which indicates whether the panel data version of the log-likelihood should be used.
Correlations between random parameters can be introduced only for random parameters with normal-related distributions, using the correlation argument. If TRUE, all the normal-related random parameters are correlated. The correlation argument can also be a character vector indicating a subset of the random parameters that are assumed to be correlated.
We use the dutch_railways data set, previously coerced to a dfidx object called Tr. We first estimate the multinomial logit model: both alternatives being virtual train trips, it is relevant to use only generic coefficients and to remove the intercept:
Tr <- dfidx(dutch_railways, choice = "choice", varying = 4:11, sep = "_",
opposite = c("price", "comfort", "time", "change"),
idx = list(c("choiceid", "id")), idnames = c("chid", "alt"))
Train.ml <- mlogit(choice ~ price + time + change + comfort | - 1, Tr)
Train.ml %>% gaze
Estimate Std. Error z-value Pr(>|z|)
price 0.3271 0.0165 19.85 < 2e-16
time 1.7206 0.1604 10.73 < 2e-16
change 0.3263 0.0595 5.49 4.1e-08
comfort 0.9457 0.0649 14.56 < 2e-16
All the coefficients are highly significant and have the predicted positive sign (remember that an increase in the variable comfort implies using a less comfortable class). The coefficients can’t be directly interpreted, but dividing them by the price coefficient, we get monetary values:
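A sketch of this computation (coefficient names as in the output above):

coef(Train.ml)[c("time", "change", "comfort")] / coef(Train.ml)["price"]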
We obtain the value of 5.3 euros for an hour of traveling, 1 euro for a change and 2.9 euros to travel in a more comfortable class.9 We then estimate a model with three random parameters, time, change and comfort. We first estimate the uncorrelated mixed logit model:
Train.mxlu <- mlogit(choice ~ price + time + change + comfort | - 1, Tr,
rpar = c(time = "n", change = "n", comfort = "n"),
R = 100, panel = TRUE, correlation = FALSE,
halton = NA, method = "bhhh")
names(coef(Train.mxlu))
## [1] "price" "time" "change" "comfort" "sd.time"
## [6] "sd.change" "sd.comfort"
Compared to the multinomial logit model, there are now three more coefficients, which are the standard deviations of the distributions of the three random parameters. The correlated model is obtained by setting the correlation argument to TRUE:
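A sketch of the elided estimation (updating the uncorrelated model), whose coefficient names are listed below:

Train.mxlc <- update(Train.mxlu, correlation = TRUE)
names(coef(Train.mxlc))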
[1] "price" "time"
[3] "change" "comfort"
[5] "chol.time:time" "chol.time:change"
[7] "chol.change:change" "chol.time:comfort"
[9] "chol.change:comfort" "chol.comfort:comfort"
There are now six supplementary parameters, which are the elements of the Choleski decomposition of the covariance matrix of the three random parameters. These six parameters are therefore the elements of the following matrix:
\[ C= \left( \begin{array}{ccc} c_{11} & c_{12} & c_{13} \\ 0 & c_{22} & c_{23} \\ 0 & 0 & c_{33} \end{array} \right) \]
such that:
\[ C^{\top}C= \left( \begin{array}{ccc} c_{11}^2 & c_{11} c_{12} & c_{11}c_{13} \\ c_{11}c_{12} & c_{12}^2 + c_{22}^2 & c_{12}c_{13}+c_{22}c_{23} \\ c_{11}c_{13} & c_{12}c_{13} + c_{22}c_{23} & c_{13}^2 + c_{23}^2 + c_{33}^2 \end{array} \right) = \left( \begin{array}{ccc} \sigma_{1}^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{2}^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{3}^2 \end{array} \right) \]
where \(\sigma_k^2\) and \(\sigma_{kl}\) are respectively the variance of the random parameter \(k\) and the covariance between two random parameters \(k\) and \(l\). Therefore, the first estimated parameter can be simply interpreted as the standard deviation of the first random parameter, but the five others can’t be interpreted easily. Random parameters may be extracted using the function rpar, which takes as first argument a mlogit object and as second argument par the parameter(s) to be extracted. This function returns a rpar object, and a summary method is provided to describe it:
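A sketch of such an extraction for the time parameter (the object name is illustrative):

marg.ut.time <- rpar(Train.mxlc, "time")
summary(marg.ut.time)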
The estimated random parameter is in the “preference space”, which means that it is the marginal utility of time. Parameters in the “willingness to pay” (WTP) space are easier to interpret. They can be estimated directly (a feature not supported by mlogit) or can be obtained from the marginal utility by dividing it by the coefficient of a covariate expressed in monetary value (a price for example), taken as a non-random parameter. The ratio can then be interpreted as a monetary value (or willingness to pay). To obtain the distribution of the random parameters in the WTP space, one can use the norm argument of rpar:
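For instance (a sketch, normalizing by the price coefficient):

wtp.time <- rpar(Train.mxlc, "time", norm = "price")
summary(wtp.time)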
The median value (and the mean value, as the distribution is symmetric) of transport time is about 33 euros. Several methods/functions are provided to extract the individual statistics (mean, med and stdev, respectively for the mean, median and standard deviation):
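For instance (a sketch, applied to the WTP-space time parameter defined above):

mean(rpar(Train.mxlc, "time", norm = "price"))
med(rpar(Train.mxlc, "time", norm = "price"))
stdev(rpar(Train.mxlc, "time", norm = "price"))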
In the case of correlated random parameters, as the estimated parameters can’t be directly interpreted, a vcov method for mlogit objects is provided. It has a what argument whose default value is coefficient; in this case the usual covariance matrix of the coefficients is returned. If what = "rpar", the covariance matrix of the correlated random parameters is returned if type = "cov" (the default) and the correlation matrix (with standard deviations on the diagonal) is returned if type = "cor". The object is of class vcov.mlogit, and a summary method for this object is provided which computes, using the delta method, the standard errors of the parameters of the covariance or correlation matrix.
vcov(Train.mxlc, what = "rpar")
time change comfort
time 28.6460 -0.2788 5.558
change -0.2788 3.1047 1.232
comfort 5.5579 1.2325 7.896
vcov(Train.mxlc, what = "rpar", type = "cor")
time change comfort
time 5.35220 -0.02956 0.3696
change -0.02956 1.76203 0.2489
comfort 0.36956 0.24893 2.8099
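The table below presumably comes from applying the summary method just described (a sketch):

summary(vcov(Train.mxlc, what = "rpar", type = "cor"))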
Estimate Std. Error z-value Pr(>|z|)
sd.time 5.3522 0.3811 14.04 <2e-16 ***
sd.change 1.7620 0.1446 12.19 <2e-16 ***
sd.comfort 2.8099 0.1783 15.76 <2e-16 ***
cor.time:change -0.0296 0.2324 -0.13 0.8988
cor.time:comfort 0.3696 0.1141 3.24 0.0012 **
cor.change:comfort 0.2489 0.1103 2.26 0.0240 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The correlation can be restricted to a subset of random parameters by filling the correlation argument with a character vector indicating the corresponding covariates:
Train.mxlc2 <- update(Train.mxlc, correlation = c("time", "comfort"))
vcov(Train.mxlc2, what = "rpar", type = "cor")
time comfort
time 5.5726 0.3909
comfort 0.3909 3.0631
The presence of random coefficients and their correlation can be investigated using any of the three tests. Actually, three nested models can be considered, a model with no random effects, a model with random but uncorrelated effects and a model with random and correlated effects. We first present the three tests of no correlated random effects:
lrtest(Train.mxlc, Train.ml) %>% gaze
## Chisq = 388.057, df: 6, pval = 0.000
waldtest(Train.mxlc) %>% gaze
## chisq = 288.287, df: 6, pval = 0.000
car::linearHypothesis(Train.mxlc,
c("chol.time:time = 0", "chol.time:change = 0",
"chol.time:comfort = 0", "chol.change:change = 0",
"chol.change:comfort = 0",
"chol.comfort:comfort = 0")) %>% gaze
## Chisq = 288.287, df: 6, pval = 0.000
scoretest(Train.ml, rpar = c(time = "n", change = "n", comfort = "n"),
R = 100, correlation = TRUE, halton = NA,
panel = TRUE) %>% gaze
## chisq = 208.765, df: 6, pval = 0.000
The hypothesis of no correlated random parameters is strongly rejected. We then present the three tests of no correlation, the existence of random parameters being maintained.
lrtest(Train.mxlc, Train.mxlu) %>% gaze
## Chisq = 42.621, df: 3, pval = 0.000
car::linearHypothesis(Train.mxlc,
c("chol.time:change = 0","chol.time:comfort = 0",
"chol.change:comfort = 0")) %>% gaze
## Chisq = 103.195, df: 3, pval = 0.000
waldtest(Train.mxlc, correlation = FALSE) %>% gaze
## chisq = 103.195, df: 3, pval = 0.000
scoretest(Train.mxlu, correlation = TRUE) %>% gaze
## chisq = 10.483, df: 3, pval = 0.015
The hypothesis of no correlation is strongly rejected with the Wald and the likelihood ratio tests, but only at the 5% level with the score test.
14.5 Multinomial probit
The model
The multinomial probit is obtained with the same modeling framework that we used while presenting the random utility model. The utility of an alternative is still the sum of two components: \(U_j = V_j + \epsilon_j\), but the joint distribution of the error terms is now a multivariate normal with mean 0 and a covariance matrix denoted by \(\Omega\).10 Alternative \(l\) is chosen if: \[ \left\{ \begin{array}{rcl} U_1-U_l&=&(V_1-V_l)+(\epsilon_1-\epsilon_l)<0\\ U_2-U_l&=&(V_2-V_l)+(\epsilon_2-\epsilon_l)<0\\ & \vdots & \\ U_J-U_l&=&(V_J-V_l)+(\epsilon_J-\epsilon_l)<0\\ \end{array} \right. \]
which implies, denoting \(V^l_j=V_j-V_l\):
\[ \left\{ \begin{array}{rclrcl} \epsilon^l_1 &=& (\epsilon_1-\epsilon_l) &<& - V^l_1\\ \epsilon^l_2 &=& (\epsilon_2-\epsilon_l) &<& - V^l_2\\ &\vdots & & \vdots & \\ \epsilon^l_J &=& (\epsilon_J-\epsilon_l) &<& - V^l_J\\ \end{array} \right. \]
The initial vector of errors \(\epsilon\) is transformed using the following transformation: \(\epsilon^l = M^l \epsilon\), where the transformation matrix \(M^l\) is a \((J-1) \times J\) matrix obtained by inserting a column of \(-1\) at the \(l^{\mbox{th}}\) position of an identity matrix. For example, if \(J = 4\) and \(l = 3\):
\[ M^3 = \left( \begin{array}{cccc} 1 & 0 & -1 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & -1 & 1 \\ \end{array} \right) \]
The covariance matrix of the error differences is:
\[ \mbox{V}\left(\epsilon^l\right)=\mbox{V}\left(M^l\epsilon\right) = M^l\mbox{V}\left(\epsilon\right){M^l}^{\top} = M^l\Omega{M^l}^{\top} \tag{14.2}\]
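As a quick illustration (a minimal sketch, not part of the mlogit code; the make_M helper is purely illustrative), the transformation matrix \(M^l\) can be built in R and used to obtain the covariance matrix of the error differences from an arbitrary \(\Omega\):
make_M <- function(J, l) {
  # keep the rows j != l of the J x J identity matrix and put -1 in column l,
  # so that each row of M computes epsilon_j - epsilon_l
  M <- diag(J)[-l, , drop = FALSE]
  M[, l] <- -1
  M
}
make_M(4, 3)
# covariance of the error differences (Equation 14.2) for an arbitrary Omega
Omega <- crossprod(matrix(rnorm(16), 4))
make_M(4, 3) %*% Omega %*% t(make_M(4, 3))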
The probability of choosing \(l\) is then:
\[ P_l =\mbox{P}(\epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l \;\&\; ... \; \epsilon^l_J<-V_J^l) \tag{14.3}\]
Under the hypothesis of a normal distribution, this can be written as:
\[ P_l = \int_{-\infty}^{-V_1^l}\int_{-\infty}^{-V_2^l}...\int_{-\infty}^{-V_J^l}\phi(\epsilon^l)\, d\epsilon^l_1 d\epsilon^l_2... d\epsilon^l_J \]
with:
\[ \phi\left(\epsilon^l\right)=\frac{1}{(2\pi)^{(J-1)/2}\mid\Omega^l\mid^{1/2}} e^{-\frac{1}{2}{\epsilon^l}^{\top}{\Omega^l}^{-1}\epsilon^l} \]
Two problems arise with this model:
- the identified parameters are the elements of \(\Omega^l\) and not of \(\Omega\); we must therefore carefully investigate the meaning of these elements,
- the probability is a \((J-1)\)-dimensional integral, which must be computed numerically. The relevant strategy in this context is to use simulations.
Identification
The meaningful parameters are those of the covariance matrix of the error \(\Omega\). For example, with \(J = 3\):
\[ \Omega = \left( \begin{array}{ccc} \sigma_1 ^ 2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \\ \end{array} \right) \] Computing Equation 14.2 for \(l=1\), we get:
\[ \Omega^1 = M^1 \Omega {M^1}^{\top}= \left( \begin{array}{cc} \sigma_1^2+\sigma_2^2-2\sigma_{12} & \sigma_1^2 + \sigma_{23} - \sigma_{12} -\sigma_{13} \\ \sigma_1^2+\sigma_{23}- \sigma_{12} - \sigma_{13} & \sigma_1^2 + \sigma_3^2 - 2 \sigma_{13} \\ \end{array} \right) \]
The overall scale of utility being unidentified, one has to impose the value of one of the variances; for example, the first one is set to 1. We then have:
\[ \Omega^1 = \left( \begin{array}{cc} 1 & \frac{\sigma_1^2+ \sigma_{23} - \sigma_{12} -\sigma_{13}}{\sigma_1^2+\sigma_2^2-2\sigma_{12}} \\ \frac{\sigma_1^2+\sigma_{23}- \sigma_{12} - \sigma_{13}}{\sigma_1^2+\sigma_2^2-2\sigma_{12}} & \frac{\sigma_1^2 + \sigma_3^2 - 2 \sigma_{13}}{\sigma_1^2+\sigma_2^2-2\sigma_{12}} \\ \end{array} \right) \]
Therefore, out of the six structural parameters of the covariance matrix, only the three elements of \(\Omega^1\) can be recovered, and one of them is further normalized to 1, so that only two parameters are actually identified. Moreover, it’s almost impossible to interpret these parameters. More generally, with \(J\) alternatives, the number of parameters of the covariance matrix is \((J+1)\times J/2\) and the number of identified parameters is \(J\times(J-1)/2-1\).
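A small numerical check, reusing the make_M helper sketched above with an arbitrary covariance matrix, illustrates why so few parameters are identified: adding the same constant to every element of \(\Omega\) leaves \(\Omega^1\) unchanged, and rescaling \(\Omega\) leaves the normalized \(\Omega^1\) unchanged.
Omega <- matrix(c(1,   0.5, 0.2,
                  0.5, 2,   0.3,
                  0.2, 0.3, 1.5), nrow = 3)
M1 <- make_M(3, 1)
O1 <- M1 %*% Omega %*% t(M1)
# shifting every element of Omega by the same constant does not change Omega^1
all.equal(O1, M1 %*% (Omega + 0.7) %*% t(M1))
# rescaling Omega only rescales Omega^1: the normalized matrices are identical
O1s <- M1 %*% (4 * Omega) %*% t(M1)
all.equal(O1 / O1[1, 1], O1s / O1s[1, 1])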
Simulations
Let \(C^l\) be the Choleski decomposition of the covariance matrix of the error differences:
\[ \Omega^l= {C^l}^{\top}C^l \]
This matrix is an upper triangular matrix of dimension \((J-1)\):
\[ C^l= \left( \begin{array}{ccccc} l_{11} & l_{12} & l_{13} &... & l_{1(J-1)} \\ 0 & l_{22} & l_{23} & ... & l_{2(J-1)} \\ 0 & 0 & l_{33} & ... & l_{3(J-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & ... & l_{(J-1)(J-1)} \\ \end{array} \right) \]
Let \(\nu\) be a vector of standard normal deviates: \(\nu \sim \mathcal{N}(0, I)\).
We then have:
\[ \mbox{V}\left({C^l}^\top\nu\right)={C^l}^\top V(\nu){C^l}={C^l}^{\top}IC^l=\Omega^l \]
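In R, the chol function returns precisely this upper triangular factor, so the property above can be checked directly (a minimal sketch with an arbitrary matrix):
O1 <- matrix(c(2, 0.5, 0.5, 1.5), nrow = 2)   # an arbitrary covariance of error differences
C1 <- chol(O1)                                # upper triangular, with t(C1) %*% C1 = O1
all.equal(t(C1) %*% C1, O1)
# draws t(C1) %*% nu with nu ~ N(0, I) have (approximately) covariance O1
nu <- matrix(rnorm(2 * 1e4), nrow = 2)
round(cov(t(t(C1) %*% nu)), 2)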
Therefore, if we draw a vector of standard normal deviates \(\nu\) and apply this transformation to it, we get a realization of \(\epsilon^l\). The probability of choosing \(l\) given by Equation 14.3 can be written as a product of marginal and conditional probabilities:
\[ \begin{array}{rcl} P_l &=& \mbox{P}(\epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l \;\&\; ... \;\&\; \epsilon^l_J<-V_J^l)\\ &=& \mbox{P}(\epsilon^l_1<- V_1^l)\\ &\times&\mbox{P}(\epsilon^l_2<-V_2^l \mid \epsilon^l_1<-V_1^l) \\ &\times&\mbox{P}(\epsilon^l_3<-V_3^l \mid \epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l) \\ & \vdots & \\ &\times&\mbox{P}(\epsilon^l_J<-V_J^l \mid \epsilon^l_1<-V_1^l \;\&\; ... \;\&\; \epsilon^l_{J-1}<-V_{J-1}^l) \\ \end{array} \]
The vector of error difference deviates is:
\[ \left( \begin{array}{c} \epsilon^l_1 \\ \epsilon^l_2 \\ \epsilon^l_3 \\ \vdots \\ \epsilon^l_J \end{array} \right) = {C^l} ^ \top \nu = \left( \begin{array}{ccccc} l_{11} & 0 & 0 &... & 0 \\ l_{12} & l_{22} & 0 & ... & 0 \\ l_{13} & l_{23} & l_{33} & ... & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{1(J-1)} & l_{2(J-1)} & l_{3(J-1)} & ... & l_{(J-1)(J-1)} \\ \end{array} \right) \times \left( \begin{array}{c} \nu_1 \\ \nu_2 \\ \nu_3 \\ \vdots \\ \nu_{J-1} \end{array} \right) \]
\[ \left( \begin{array}{c} \epsilon^l_1 \\ \epsilon^l_2 \\ \epsilon^l_3 \\ \vdots \\ \epsilon^l_J \end{array} \right) = \left( \begin{array}{l} l_{11}\nu_1 \\ l_{12}\nu_1+l_{22}\nu_2 \\ l_{13}\nu_1+l_{23}\nu_2 + l_{33}\nu_3\\ \vdots \\ l_{1(J-1)}\nu_1+l_{2(J-1)}\nu_2+...+l_{(J-1)(J-1)}\nu_{J-1} \end{array} \right) \]
Let’s now investigate the marginal and conditional probabilities:
- the first is simply the marginal probability that a standard normal deviate is lower than \(-\frac{V_1^l}{l_{11}}\); therefore we have: \(\mbox{P}(\epsilon^l_1<-V_1^l) = \Phi\left(-\frac{V_1^l}{l_{11}}\right)\)
- the second is, for a given value of \(\nu_1\), equal to \(\Phi\left(-\frac{V^l_2+l_{12}\nu_1}{l_{22}}\right)\). We then have to compute the mean of this expression over all values of \(\nu_1\) lower than \(-\frac{V^l_1}{l_{11}}\). We then have, denoting \(\bar{\phi}_1\) the corresponding truncated normal density: \[\mbox{P}(\epsilon^l_2<-V_2^l \mid \epsilon^l_1<-V_1^l)=\int_{-\infty}^{-\frac{V^l_1}{l_{11}}}\Phi\left(-\frac{V^l_2+l_{12}\nu_1}{l_{22}}\right) \bar{\phi}_1(\nu_1)d\nu_1\]
- the third is, for given values of \(\nu_1\) and \(\nu_2\), equal to \(\Phi\left(-\frac{V^l_3+l_{13}\nu_1+l_{23}\nu_2}{l_{33}}\right)\). We then have: \[\mbox{P}(\epsilon^l_3<-V_3^l \mid \epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l)=\int_{-\infty}^{-\frac{V^l_1}{l_{11}}}\int_{-\infty}^{-\frac{V^l_2+l_{12}\nu_1}{l_{22}}} \Phi\left(-\frac{V^l_3+l_{13}\nu_1+l_{23}\nu_2}{l_{33}}\right)\bar{\phi}_1(\nu_1)\bar{\phi}_2(\nu_2)d\nu_1d\nu_2\]
- and so on.
These probabilities can easily be simulated by drawing numbers from a truncated normal distribution. This so-called GHK algorithm (for Geweke, Hajivassiliou and Keane; see for example Geweke, Keane, and Runkle 1994) can be described as follows:
- compute \(\Phi\left(-\frac{V_1^l}{l_{11}}\right)\),
- draw a number called \(\nu_1^r\) from a standard normal distribution upper-truncated at \(-\frac{V_1^l}{l_{11}}\) and compute \(\Phi\left(-\frac{V^l_2+l_{12}\nu_1^r}{l_{22}}\right)\),
- draw a number called \(\nu_2^r\) from a standard normal distribution upper-truncated at \(-\frac{V^l_2+l_{12}\nu_1^r}{l_{22}}\) and compute \(\Phi\left(-\frac{V^l_3+l_{13}\nu_1^r+l_{23}\nu_2^r}{l_{33}}\right)\),
- \(\ldots\) continue in this way until \(\nu_{J-2}^r\) has been drawn, and compute the last term \(\Phi\left(-\frac{V^l_{J-1}+l_{1(J-1)}\nu_1^r+\ldots+l_{(J-2)(J-1)}\nu_{J-2}^r}{l_{(J-1)(J-1)}}\right)\),
- multiply all these probabilities and get a realization of the probability of choosing \(l\) called \(P^r_l\),
- repeat all these steps many times and average all these probabilities; this average is an estimation of the probability: \(\bar{P}_l = \sum_{r=1}^R P^r_l/R\).
Several points should be noted concerning this algorithm:
- the utility differences should be computed with respect to the chosen alternative for each individual,
- the Choleski decompositions used for the different individuals should all derive from the same covariance matrix of the errors. One method to attain this goal is to start from a given difference, e.g., the differences with respect to the first alternative. The vector of error differences is then \(\epsilon^1\) and its covariance matrix is \(\Omega^1={C^1}^{\top}C^1\). To apply a difference with respect to another alternative \(l\), we construct a matrix called \(S^l\), which is obtained by taking a \(J-2\) identity matrix, adding a first row of 0 and inserting a column of \(-1\) at the \((l-1)\)th position. For example, with four alternatives and \(l=3\), we have:
\[S^3= \left( \begin{array}{ccc} 0 & -1 & 0 \\ 1 & -1 & 0 \\ 0 & -1 & 1 \\ \end{array} \right) \] The Choleski factor of the covariance matrix for alternative \(l\) is then obtained from: \[ \Omega^l = S^l \Omega^1 {S^l}^{\top}={C^l}^{\top}C^l \]
- to compute draws from a normal distribution upper-truncated at \(a\), the following trick is used: take a draw \(\mu\) from a uniform distribution (between 0 and 1); then \(\nu = \Phi^{-1}\left(\mu \Phi(a)\right)\) is a draw from a standard normal distribution upper-truncated at \(a\).
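As an illustration only (not the package’s code), a minimal R sketch of this simulator could look as follows, assuming that V contains the utility differences \(V^l_j\) relative to the chosen alternative and L the lower triangular factor \({C^l}^{\top}\) (the function name ghk and its arguments are purely illustrative):
ghk <- function(V, L, R = 100) {
  J1 <- length(V)                    # number of error differences (J - 1)
  P <- numeric(R)
  for (r in seq_len(R)) {
    nu <- numeric(J1)
    prob <- 1
    for (i in seq_len(J1)) {
      # upper bound for the i-th standardized deviate, given the previous draws
      bound <- - (V[i] + sum(L[i, seq_len(i - 1)] * nu[seq_len(i - 1)])) / L[i, i]
      prob <- prob * pnorm(bound)
      # draw nu_i from a standard normal upper-truncated at 'bound',
      # using the trick described above
      if (i < J1) nu[i] <- qnorm(runif(1) * pnorm(bound))
    }
    P[r] <- prob
  }
  mean(P)                            # simulated probability of the chosen alternative
}
With \(J = 2\) there is a single error difference and the function simply returns pnorm(-V[1] / L[1, 1]).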
Application
In Section 14.2.5, we estimated a multinomial logit for mode choice in the Toronto-Montreal corridor, using cost and freq as alternative-specific covariates with a generic coefficient, income as a choice situation-specific covariate and time as an alternative-specific covariate with a specific coefficient. We now fit the same model as a probit, which is simply done by setting the probit argument to TRUE.
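A sketch of the corresponding call is given below; the name of the response (choice) and of the estimation sample (MC) are assumptions, as only the covariate names are given above:
# a sketch only: the formula follows the description above; the names of the
# response (choice) and of the data object (MC) are assumptions
pbt <- mlogit(choice ~ cost + freq | income | time, data = MC, probit = TRUE)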
As previously, we want to predict the effect of a 20% reduction in train fares:
Oprob <- fitted(pbt, type = "probabilities")
Nprob <- predict(pbt, newdata = NMC)
rbind(old = apply(Oprob, 2, mean), new = apply(Nprob, 2, mean))
car train air
old 0.4606 0.1676 0.3721
new 0.4194 0.2457 0.3353
With this model, the IIA property no longer holds, as the ratio of the probabilities of choosing air and car is not the same before and after the change in train fares.
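For instance, using the average fitted and predicted probabilities computed above:
old_p <- apply(Oprob, 2, mean)
new_p <- apply(Nprob, 2, mean)
# air / car probability ratio: about 0.81 before vs 0.80 after the fare reduction
c(old = unname(old_p["air"] / old_p["car"]),
  new = unname(new_p["air"] / new_p["car"]))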
We’ve already encountered this model in the special case where the response is binomial in Section 10.2.2.↩︎
Used by Ben-Akiva, Bolduc, and Bradley (1993) and Meijer and Rouwendal (2006).↩︎
Used in particular by Forinash and Koppelman (1993), Bhat (1995), Koppelman and Wen (1998) and Koppelman and Wen (2000).↩︎
A slightly different version of the nested logit model (Daly 1987) is often used, but is not compatible with the random utility maximization hypothesis. Its difference from the previous expression is that the deterministic parts of the utility for each alternative are not divided by the nest elasticity. The differences between the two versions have been discussed in Koppelman and Wen (1998), Heiss (2002) and Hensher and Greene (2002).↩︎
We’ve already encountered this expression in Section 14.2.4.3.↩︎
See Section 14.1.1.1.↩︎
Remember that the survey took place in 1987. The values should be multiplied by 1.73 to express them in 2024 euros.↩︎