14 Discrete choice models
Consider the case where the response is the choice of an alternative among a set of mutually exclusive alternatives. This choice can be modeled in a utility maximization framework, which means that we hypothesize that the individual chooses the alternative which corresponds to the maximum level of utility.1 Namely, denoting \(j = 1 \ldots J\) the set of alternatives, we'll denote \(U_{nj}\) the level of utility of individual \(n\) if he chooses alternative \(j\); the \(j\)th alternative will be chosen if \(U_{nj} > U_{nk}\; \forall k \neq j\). The level of utility depends on some observable covariates and on some other unobservables whose effect will be summarized in a variable \(\epsilon_{nj}\) that will be considered, from the researcher’s point of view, as the realization of a random variable. Two key questions for these models are how the covariates enter the utility function and what is the assumed distribution of the error. Section 14.1 will deal with the first point, namely how to deal with covariates in discrete choice models. Section 14.2 will present the landmark model in this field, the multinomial logit model. Section 14.3 will go beyond the hypothesis of iid errors to present two extensions of the basic model, the nested and the heteroscedastic logit models. Section 14.4 will introduce the rich field of random parameters logit models. Finally, Section 14.5 will be devoted to the multinomial probit model.2
Throughout this chapter, we’ll use the mlogit package which is devoted to the estimation of discrete choice models:
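A minimal setup sketch (the exact packages loaded in the original are not shown; mlogit attaches the dfidx package it depends on, and dplyr is assumed for the %>% pipelines used below):

library(mlogit)   # discrete choice models; also attaches dfidx
library(dplyr)    # assumed: used below for %>% and mutate()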
14.1 Data management and model description
Data management
Data sets used for discrete choice model estimation concern some individuals, who make one choice, or a sequence of choices, of one alternative among a set of mutually exclusive alternatives. The determinants of these choices are covariates that can depend on the alternative and the choice situation, only on the alternative or only on the choice situation. Data sets can have two different shapes: a wide shape (one row for each choice situation) or a long shape (one row for each alternative and, therefore, as many rows as there are alternatives for each choice situation). mlogit deals with both formats. It depends on the dfidx package, whose dfidx function takes as first argument a data.frame and returns a dfidx object, which is a data.frame in “long” format with a special data frame column which contains the indexes. The second argument is called idx. In its simple use, it should be a list (or a vector) of two characters containing the choice situation and the alternative indexes.
Wide format
dutch_railways3 is an example of a wide data set:
# A tibble: 2,929 × 11
id choiceid choice price_A price_B time_A time_B change_A
<int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1 A 10.9 18.2 2.5 2.5 0
2 1 2 A 10.9 14.5 2.5 2.17 0
3 1 3 A 10.9 18.2 1.92 1.92 0
4 1 4 B 18.2 14.5 2.17 2.5 0
# ℹ 2,925 more rows
# ℹ 3 more variables: change_B <int>, comfort_A <int>,
# comfort_B <int>
This data set contains data about a stated preference survey in the Netherlands in 1987. Each individual has responded to several (up to 16) scenarios. For every scenario, two train trips are proposed to the user, with different combinations of four attributes: price (the price in euros), time (travel time in hours), change (the number of changes) and comfort (the class of comfort, 0, 1 or 2, 0 being the most comfortable class). This “wide” format is suitable to store choice situation (or individual-specific) variables because, in this case, they are stored only once in the data. It is cumbersome for alternative-specific variables because there are as many columns for such variables as there are alternatives.
For such a wide data set, the shape argument of dfidx is mandatory, as its default value is "long". The alternative-specific variables are indicated with the varying argument, which is a numeric vector that indicates their position in the data frame. This argument is then passed to stats::reshape, which coerces the original data.frame to “long” format. Further arguments may be passed to reshape. For example, as the names of the variables are of the form price_A, one must add sep = "_" (the default value being "."). The choice argument is also mandatory because the response has to be transformed into a logical value in the long format. In “wide” format, there is no alternative index. The choice situation index is not mandatory, as there is one line for each choice situation. In this data set, there is a choice situation index called choiceid, and it is nested in the individual index called id. To take the panel dimension into account, idx is a list of length 1 (the choice situation) containing a vector of length 2 with choiceid and id. The idnames argument is used to give a relevant name to the second index, the NA in first position indicating that the name of the first index is unchanged.
Note the use of the opposite argument for the four covariates: we expect negative coefficients for all of them, so taking the opposite of the covariates leads to expected positive coefficients.
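A sketch of the call producing the object (called Tr) printed below; it mirrors the specification shown again in Section 14.4, with shape = "wide" made explicit and NA used as the first element of idnames, as described above:

Tr <- dfidx(dutch_railways, shape = "wide", choice = "choice",
            varying = 4:11, sep = "_",
            opposite = c("price", "comfort", "time", "change"),
            idx = list(c("choiceid", "id")), idnames = c(NA, "alt"))
Tr %>% print(n = 5)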
# A tibble: 5,858 × 6
# Index: 2929 (choiceid) x 2 (alt)
# Balanced: yes
# Nesting: choiceid (id)
idx choice price time change comfort
<idx> <fct> <dbl> <dbl> <int> <int>
1 1:A A -10.9 -2.5 0 -1
2 1:B A -18.2 -2.5 0 -1
3 2:A A -10.9 -2.5 0 -1
4 2:B A -14.5 -2.17 0 -1
5 3:A A -10.9 -1.92 0 -1
# ℹ 5,853 more rows
An idx column is added to the data, which contains the three relevant indexes: choiceid is the choice situation index, alt the alternative index and id the individual index. This column can be extracted using the idx function:
idx(Tr)
Long format
toronto_montreal4 is an example of a data set in long format. It presents the choice of individuals for a transport mode for the Toronto-Montreal corridor in 1989:
# A tibble: 15,520 × 11
case alt choice dist cost ivt ovt freq income urban
<int> <fct> <int> <int> <dbl> <int> <int> <int> <int> <int>
1 1 train 0 83 28.2 50 66 4 45 0
2 1 car 1 83 15.8 61 0 0 45 0
3 2 train 0 83 28.2 50 66 4 25 0
4 2 car 1 83 15.8 61 0 0 25 0
5 3 train 0 83 28.2 50 66 4 70 0
# ℹ 15,515 more rows
# ℹ 1 more variable: noalt <int>
There are four transport modes (air, train, bus and car) and most of the variables are alternative-specific (cost for monetary cost, ivt for in-vehicle time, ovt for out-of-vehicle time, freq for frequency). The only choice situation-specific variables are dist (distance of the trip), income (household income), urban (a dummy for trips which have a large city at the origin or the destination) and noalt (the number of available alternatives). The advantage of this shape is that there are far fewer columns than in the wide format, the caveat being that the values of dist, income and urban are repeated up to four times. For data in “long” format, the shape and the choice arguments are no longer mandatory. To replicate published results later in the text, we’ll use only a subset of the choice situations, namely those for which the four alternatives are available. This can be done using the subset function with the subset argument set to noalt == 4 while estimating the model. This can also be done within dfidx, using the subset argument.
The information about the structure of the data can be explicitly indicated using the choice situation and alternative indexes (respectively case and alt in this data set) or, in part, guessed by the dfidx function. Here, after subsetting, we have 2779 choice situations with 4 alternatives, and the rows are ordered first by choice situation and then by alternative (train, air, bus and car, in this order). The first way to read this data frame correctly is to completely ignore the two index variables. In this case, the only supplementary argument to provide is the alt.levels argument, which is a character vector that contains the names of the alternatives in their order of appearance:
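A sketch of such a call (the alternative names and their order are taken from the description above):

MC <- dfidx(toronto_montreal, subset = noalt == 4,
            alt.levels = c("train", "air", "bus", "car"))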
Note that this can only be used if the data set is “balanced”, which means that the same set of alternatives is available for all choice situations. It is also possible to provide, through the idx argument, the name of the variable that contains the alternatives, the name of the variable that contains the information about the choice situations, or both:
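Sketches of these three variants (the exact form of the idx argument in each elided call is an assumption based on the dfidx documentation):

MC <- dfidx(toronto_montreal, subset = noalt == 4, idx = list(NA, "alt"))
MC <- dfidx(toronto_montreal, subset = noalt == 4, idx = "case",
            alt.levels = c("train", "air", "bus", "car"))
MC <- dfidx(toronto_montreal, subset = noalt == 4, idx = c("case", "alt"))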
More simply, as the two indexes are stored in the first two columns of the original data frame, the idx argument can be unset:
MC <- dfidx(toronto_montreal, subset = noalt == 4)
and the indexes can be kept as standalone series if the drop.index argument is set to FALSE:
MC <- dfidx(toronto_montreal, subset = noalt == 4,
idx = c("case", "alt"), drop.index = FALSE)
MC %>% print(n = 5)
# A tibble: 11,116 × 12
# Index: 2779 (case) x 4 (alt)
# Balanced: yes
idx case alt choice dist cost ivt ovt freq income
<idx> <int> <fct> <int> <int> <dbl> <int> <int> <int> <int>
1 109:train 109 train 0 377 58.2 215 74 4 45
2 109:air 109 air 1 377 143. 56 85 9 45
3 109:bus 109 bus 0 377 27.5 301 63 8 45
4 109:car 109 car 0 377 71.6 262 0 0 45
5 110:train 110 train 0 377 58.2 215 74 4 70
# ℹ 11,111 more rows
# ℹ 2 more variables: urban <int>, noalt <int>
Model description
Standard formulas are not very practical to describe random utility models, as these models may use different sets of covariates. Actually, working with random utility models, one has to consider at most three sets of covariates:
- alternative- and choice situation-specific covariates \(x_{nj}\) with generic coefficients \(\beta\) and alternative-specific covariates \(t_j\) with a generic coefficient \(\nu\),
- choice situation-specific covariates \(z_n\) with alternative-specific coefficients \(\gamma_j\),
- alternative- and choice situation-specific covariates \(w_{nj}\) with alternative-specific coefficients \(\delta_j\).
The covariates enter the observable part of the utility which can be written, for alternative \(j\):
\[ V_{nj}=\alpha_j + \beta x_{nj} + \nu t_j + \gamma_j z_n + \delta_j w_{nj} \]
As the absolute value of utility is irrelevant, only utility differences are useful to model the choice of one alternative. For two alternatives \(j\) and \(l\), we obtain:
\[ V_{nj}-V_{nl}=(\alpha_j-\alpha_l) + \beta (x_{nj}-x_{nl}) + \nu(t_j - t_l) + (\gamma_j-\gamma_l) z_n + (\delta_j w_{nj} - \delta_l w_{nl}) \]
It is clear from the previous expression that coefficients of choice situation-specific variables (the intercept being one of those) should be alternative-specific; otherwise they would disappear in the differentiation. Moreover, only differences of these coefficients are relevant and can be identified. For example, with three alternatives 1, 2 and 3, the three coefficients \(\gamma_1, \gamma_2, \gamma_3\) associated with a choice situation-specific variable cannot be identified, but only two linear combinations. Therefore, one has to make a choice of normalization, and the simplest one is just to set \(\gamma_1 = 0\).
Coefficients for alternative and choice situation-specific variables may (or may not) be alternative-specific. For example, transport time is alternative-specific, but 10 minutes in public transport may not have the same impact on utility as 10 minutes in a car. In this case, alternative-specific coefficients are relevant. Monetary cost is also alternative-specific, but in this case, one can consider that $1 is $1 whether it is spent on the use of a car or on public transport. In this case, a generic coefficient is relevant. The treatment of alternative-specific variables doesn’t differ much from that of alternative and choice situation-specific variables with a generic coefficient. However, if some of these variables are introduced, the \(\nu\) parameter can only be estimated in a model without intercepts to avoid perfect multicollinearity.
A logit model with only choice situation-specific variables is sometimes called a multinomial logit model, one with only alternative-specific variables, a conditional logit model, and one with both kinds of variables, a mixed logit model. This is seriously misleading: the conditional logit model is also a logit model for longitudinal data in the statistical literature, and mixed logit is one of the names of a logit model with random parameters. Therefore, in what follows, we’ll use the name multinomial logit model for the model we’ve just described, whatever the nature of the explanatory variables used.
The mlogit package provides objects of class mFormula, which are built upon Formula objects provided by the Formula package. To illustrate the use of mFormula objects, we use again the toronto_montreal data set and consider three sets of covariates that will be indicated in a three-part formula, which refers to the three items at the beginning of this section:
- cost (monetary cost) is an alternative-specific covariate with a generic coefficient (part 1),
- income and urban are choice situation-specific covariates (part 2),
- ivt (in-vehicle travel time) is alternative-specific and alternative-specific coefficients are expected (part 3).
Some parts of the formula may be omitted when there is no ambiguity. For example, the following sets of formulas are identical:
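For instance, assuming the usual convention that an omitted second part defaults to an intercept-only part and an omitted third part is empty, the following pairs should describe the same models (a sketch, not the elided example itself):

f1a <- Formula(choice ~ cost | income)
f1b <- Formula(choice ~ cost | income | 0)
f2a <- Formula(choice ~ cost)
f2b <- Formula(choice ~ cost | 1 | 0)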
By default, an intercept is added to the model; it can be removed by using + 0 or - 1 in the second part.
A model.frame method is provided for dfidx objects. It differs from the formula method by the fact that the returned object is an object of class dfidx and not an ordinary data frame, which means that the information about the structure of the data is not lost. Defining a specific model.frame method for dfidx objects implies that the first argument of the function should be a dfidx object, which results in an unusual order of the arguments in the function (the data first, and then the formula). Moreover, as the model matrix for random utility models has specific features, we add a supplementary argument called pkg to the dfidx function so that the returned object has a specific class (and inherits the dfidx class):
MC <- dfidx(toronto_montreal, subset = noalt == 4, pkg = "mlogit")
class(MC)
## [1] "dfidx_mlogit" "dfidx" "tbl_df" "tbl"
## [5] "data.frame"
f <- Formula(choice ~ cost | income | ivt)
mf <- model.frame(MC, f)
class(mf)
## [1] "dfidx_mlogit" "dfidx" "tbl_df" "tbl"
## [5] "data.frame"
Using mf as the argument of model.matrix enables the construction of the relevant model matrix for random utility models, as a specific model.matrix method for dfidx_mlogit objects is provided.
head(model.matrix(mf), 4)
(Intercept):air (Intercept):bus (Intercept):car cost income:air
1 0 0 0 58.25 0
2 1 0 0 142.80 45
3 0 1 0 27.52 0
4 0 0 1 71.63 0
income:bus income:car ivt:train ivt:air ivt:bus ivt:car
1 0 0 215 0 0 0
2 0 0 0 56 0 0
3 45 0 0 0 301 0
4 0 45 0 0 0 262
The model matrix contains \(J-1\) columns for every choice situation-specific variable (income and the intercept), which means that the coefficient associated with the first alternative (train) is set to 0. It contains only one column for cost because we want a generic coefficient for this variable. It contains \(J\) columns for ivt, because it is an alternative-specific variable for which we want alternative-specific coefficients.
14.2 Random utility model and multinomial logit model
Random utility model
The utility for alternative \(l\) is written as: \(U_l=V_l+\epsilon_l\) where \(V_l\) is a function of some observable covariates and unknown parameters to be estimated, and \(\epsilon_l\) is a random deviate which contains all the unobserved determinants of the utility. Alternative \(l\) is therefore chosen if \(\epsilon_j < (V_l-V_j)+\epsilon_l \;\forall\;j\neq l\) and the probability of choosing this alternative is then:
\[ \mbox{P}(\epsilon_1 < V_l-V_1+\epsilon_l, \epsilon_2 < V_l-V_2+\epsilon_l, ..., \epsilon_J < V_l-V_J+\epsilon_l). \]
Denoting \(F_{-l}\) the joint cumulative distribution function of all the \(\epsilon\)s except \(\epsilon_l\), this probability is:
\[ (\mbox{P}_l \mid \epsilon_l)= F_{-l}(V_l-V_1+\epsilon_l, ..., V_l-V_J+\epsilon_l). \]
Note that this probability is conditional on the value of \(\epsilon_l\). The unconditional probability (which depends only on \(\beta\) and on the value of the observed covariates) is obtained by integrating out the conditional probability using the marginal density of \(\epsilon_l\), denoted by \(f_l\):
\[ \mbox{P}_l=\int F_{-l}(V_l-V_1+\epsilon_l, ..., V_l-V_J+\epsilon_l)f_l(\epsilon_l) d\epsilon_l. \]
The conditional probability is an integral of dimension \(J-1\), and the computation of the unconditional probability adds one more dimension of integration.
Distribution of the error terms
The multinomial logit model (McFadden 1974) is a special case of the model developed in the previous section. It is based on three hypotheses. The first hypothesis is the independence of the errors. In this case, the univariate distribution of the errors can be used, which leads to the following conditional and unconditional probabilities:
\[ (\mbox{P}_l \mid \epsilon_l)=\prod_{j\neq l}F_j(V_l-V_j+\epsilon_l) \mbox{ and } \mbox{P}_l =\int \prod_{j\neq l}F_j(V_l-V_j+\epsilon_l) \; f_l(\epsilon_l) \;d\epsilon_l, \]
which means that the conditional probability is the product of \(J-1\) univariate cumulative distribution functions, and the evaluation of only a one-dimensional integral is required to compute the unconditional probability. The second hypothesis is that each \(\epsilon\) follows a Gumbel (maximum) distribution, whose density and distribution functions are respectively:
\[ f(z)=\frac{1}{\theta}e^{-\frac{z-\mu}{\theta}} e^{-e^{-\frac{z-\mu}{\theta}}} \mbox{ and } F(z)=\int_{-\infty}^{z} f(t) dt=e^{-e^{-\frac{z-\mu}{\theta}}}, \]
where \(\mu\) is the location parameter and \(\theta\) the scale parameter. The first two moments of the Gumbel distribution are \(\mbox{E}(z)=\mu+\theta \gamma\), where \(\gamma\) is the Euler-Mascheroni constant (\(\approx 0.57721\)) and \(\mbox{V}(z)=\frac{\pi^2}{6}\theta^2\). The mean of \(\epsilon_j\) is not identified if \(V_j\) contains an intercept. We can then, without loss of generality, suppose that \(\mu_j=0, \; \forall j\). Moreover, the overall scale of utility is not identified. Therefore, only \(J-1\) scale parameters may be identified, and a natural choice of normalization is to impose that one of the \(\theta_j\) is equal to 1. The last hypothesis is that the errors are identically distributed. As the location parameter is not identified for any error term, this hypothesis is essentially a homoskedasticity hypothesis, which means that the scale parameter of the Gumbel distribution is the same for all the alternatives. As one of them has been previously set to 1, we can therefore suppose that, without loss of generality, \(\theta_j = 1, \;\forall j \in 1... J\). The conditional and unconditional probabilities then further simplify to:
\[ (\mbox{P}_l \mid \epsilon_l) =\prod_{j\neq l}e^{-e^{-(V_l-V_j+\epsilon_l)}} \mbox{ and } \mbox{P}_l =\int_{-\infty}^{+\infty}\prod_{j\neq l}e^{-e^{-(V_l-V_j+t)}}e^{-t}e^{-e^{-t}}dt. \]
The probabilities then have very simple closed forms, which correspond to the logit transformation of the deterministic part of the utility.5
\[ P_l=\frac{e^{V_l}}{\sum_{j=1}^J e^{V_j}}. \]
IIA property
If we consider the probabilities of choice for two alternatives \(l\) and \(m\), we have \(P_l=e^{V_l}/\sum_j e^{V_j}\) and \(P_m=e^{V_m}/\sum_j e^{V_j}\). The ratio of these two probabilities is:
\[ \frac{P_l}{P_m}=\frac{e^{V_l}}{e^{V_m}}=e^{V_l-V_m}. \]
This probability ratio for the two alternatives depends only on the characteristics of these two alternatives and not on those of other alternatives. This is called the independence of irrelevant alternatives (IIA) property. IIA relies on the hypothesis that the errors are identical and independent. It is not a problem in itself and may even be considered as a useful feature for a well-specified model. However, this hypothesis may be in practice violated, especially if some important variables are omitted.
Interpretation
Marginal effects
The marginal effects are the derivatives of the probabilities with respect to the covariates, which can be choice situation-specific (\(z_n\)) or alternative-specific (\(x_{nj}\)):
\[ \begin{array}{rcl} \displaystyle \frac{\partial P_{nl}}{\partial z_{n}}&=&P_{nl}\left(\beta_l-\sum_j P_{nj}\beta_j\right) \\ \displaystyle \frac{\partial P_{nl}}{\partial x_{nl}}&=&\gamma P_{nl}(1-P_{nl})\\ \displaystyle \frac{\partial P_{nl}}{\partial x_{nk}}&=&-\gamma P_{nl}P_{nk}. \end{array} \]
For a choice situation-specific variable, the sign of the marginal effect is not necessarily the sign of the coefficient. Actually, the sign of the marginal effect is given by \(\left(\beta_l-\sum_j P_{nj}\beta_j\right)\), which is positive if the coefficient for alternative \(l\) is greater than a weighted average of the coefficients for all the alternatives, the weights being the probabilities of choosing the alternatives. In this case, the sign of the marginal effect can be established with no ambiguity only for the alternatives with the lowest and the greatest coefficients.
For an alternative-specific variable, the sign of the coefficient can be directly interpreted. The marginal effect is obtained by multiplying the coefficient by the product of two probabilities which is at most 0.25. The rule of thumb is therefore to divide the coefficient by 4 in order to have an upper bound of the marginal effect.
Note that the last equation can be rewritten: \(\frac{\mbox{d} P_{nl} / P_{nl}}{\mbox{d}x_{nk}} = -\gamma P_{nk}\). Therefore, when a characteristic of alternative \(k\) changes, the relative changes of the probabilities for every alternative except \(k\) are the same, which is a consequence of the IIA property.
Marginal rates of substitution
Coefficients are marginal utilities, which cannot be interpreted directly. However, ratios of coefficients are marginal rates of substitution. For example, if the observable part of utility is \(V=\beta_0 +\beta_1 x_1 +\beta_2 x_2 + \beta_3 x_3\), joint variations of \(x_1\) and \(x_2\) which ensure the same level of utility are such that \(dV=\beta_1 dx_1+\beta_2 dx_2=0\), so that:
\[ - \frac{dx_2}{dx_1}\mid_{dV = 0} = \frac{\beta_1}{\beta_2}. \]
For example, if \(x_2\) is transport cost (in dollars), \(x_1\) transport time (in hours), \(\beta_1 = 1.5\) and \(\beta_2=0.05\), \(\frac{\beta_1}{\beta_2}=30\) is the marginal rate of substitution of time in terms of dollars, and the value of 30 means that, to reduce travel time by 1 hour, the individual is willing to pay at most $30 more. Stated more simply, the value of time is $30 per hour.
Consumer surplus
Consumer’s surplus has a very simple expression for multinomial logit models, which was first derived by Small and Rosen (1981). The level of utility attained by an individual is \(U_j=V_j+\epsilon_j\), \(j\) being the chosen alternative. The expected utility, from the researcher’s point of view, is then \(\mbox{E}(\max_j U_j)\), where the expectation is taken over the values of all the error terms. Its expression is simply, up to an additive unknown constant, the log of the denominator of the logit probabilities, often called the “log-sum”:
\[ \mbox{E}(U)=\ln \sum_{j=1}^Je^{V_j}+C. \]
If the marginal utility of income (\(\alpha\)) is known and constant, the expected surplus is simply \(\frac{\mbox{E}(U)}{\alpha}\).
Application
Random utility models are fitted using the mlogit function. Basically, only two arguments are mandatory, formula and data, if a dfidx object (and not an ordinary data.frame) is provided. We use the toronto_montreal data set, which was already coerced to a dfidx object (called MC) in the previous section. The same model can then be estimated using this dfidx object as the data argument:
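A sketch of such a call, using the three-part formula of the previous section (the exact formula used at this point in the original is not shown):

ml.MC <- mlogit(choice ~ cost | income | ivt, MC)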
or a data.frame. In this latter case, further arguments that will be passed to dfidx should be indicated:
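For instance (a sketch; the supplementary arguments are those that would be passed on to dfidx):

ml.MC <- mlogit(choice ~ cost | income | ivt, toronto_montreal,
                subset = noalt == 4, idx = c("case", "alt"))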
mlogit provides two further useful arguments:
- reflevel indicates which alternative is the “reference” alternative, i.e., the one for which the coefficients of choice situation-specific covariates are set to 0,
- alt.subset indicates a subset of alternatives on which the estimation has to be performed; in this case, only the lines that correspond to the selected alternatives are used, and all the choice situations where unselected alternatives have been chosen are removed.
We estimate the model on the subset of three alternatives (we exclude bus, whose market share is negligible in our sample) and we set car as the reference alternative. Moreover, we use a total transport time variable computed as the sum of the in-vehicle and out-of-vehicle time variables.
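A sketch of the elided preparation and estimation step, matching the call reported in the summary output below (constructing time as ivt + ovt is an assumption based on the description above):

MC <- MC %>% mutate(time = ivt + ovt)
ml.MC1 <- mlogit(choice ~ cost + freq | income | time, MC,
                 alt.subset = c("car", "train", "air"),
                 reflevel = "car")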
The main results of the model are computed and displayed using the summary method:
summary(ml.MC1)
Call:
mlogit(formula = choice ~ cost + freq | income | time, data = MC,
alt.subset = c("car", "train", "air"), reflevel = "car",
method = "nr")
Frequencies of alternatives:choice
car train air
0.458 0.167 0.375
nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 6.94E-06
successive function values within tolerance limits
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):train -0.970344 0.265131 -3.66 0.00025 ***
(Intercept):air -1.898566 0.684143 -2.78 0.00552 **
cost -0.028497 0.006559 -4.34 1.4e-05 ***
freq 0.074029 0.004733 15.64 < 2e-16 ***
income:train -0.006469 0.003104 -2.08 0.03713 *
income:air 0.028246 0.003654 7.73 1.1e-14 ***
time:car -0.014024 0.001380 -10.16 < 2e-16 ***
time:train -0.010969 0.000818 -13.40 < 2e-16 ***
time:air -0.017551 0.003992 -4.40 1.1e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -1950
McFadden R^2: 0.312
Likelihood ratio test : chisq = 1770 (p.value = <2e-16)
The frequencies of the different alternatives in the sample are first indicated. Next, some information about the optimization is displayed: the Newton-Raphson method (with analytical gradient and hessian) is used, as it is the most efficient method for this simple model, for which the log-likelihood function is globally concave. Note that very few iterations and little computing time are required to estimate this model. Then the usual table of coefficients is displayed, followed by some goodness-of-fit measures: the value of the log-likelihood function, which is compared to the value when only intercepts are introduced, which leads to the computation of the McFadden \(R^2\) and to the likelihood ratio test. The fitted method can be used either to obtain the probability of actual choices (type = "outcome") or the probabilities for all the alternatives (type = "probabilities").
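The two blocks of output below presumably come from calls of this kind (a sketch):

fitted(ml.MC1, type = "outcome") %>% head
fitted(ml.MC1, type = "probabilities") %>% head(4)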
109 110 111 112 113 114
0.1909 0.3400 0.1471 0.3400 0.3400 0.2440
car train air
109 0.4206 0.3884 0.1909
110 0.3696 0.2904 0.3400
111 0.4297 0.4233 0.1471
112 0.3696 0.2904 0.3400
Note that the log-likelihood is the sum of the logs of the fitted outcome probabilities and that, as the model contains intercepts, the average fitted probabilities for every alternative equal the market shares of the alternatives in the sample.
Predictions can be made using the predict method. If no data is provided, predictions are made for the sample mean values of the covariates.
predict(ml.MC1)
## car train air
## 0.5066 0.2117 0.2817
Assume, for example, that we wish to predict the effect of a reduction of train transport time of 20%. We first create a new data.frame simply by multiplying train transport time by 0.8 and then use the predict method with this new data.frame.
NMC <- MC
NMC <- NMC %>% mutate(time = ifelse(idx$alt == "train", 0.8 * time, time))
Oprob <- fitted(ml.MC1, type = "probabilities")
Nprob <- predict(ml.MC1, newdata = NMC)
rbind(old = apply(Oprob, 2, mean), new = apply(Nprob, 2, mean))
car train air
old 0.4576 0.1672 0.3752
new 0.4045 0.2636 0.3319
If, for the first individuals in the sample, we compute the ratio of the probabilities of the air and the car mode, we obtain:
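A sketch of this computation, using the Oprob and Nprob probability matrices created above (the selection of the first three choice situations is illustrative):

Oprob[1:3, "air"] / Oprob[1:3, "car"]
Nprob[1:3, "air"] / Nprob[1:3, "car"]

The ratio is the same before and after the change,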
which is an illustration of the IIA property. If train time changes, it changes the probabilities of choosing air and car, but not their ratio. We next compute the surplus for individuals of the sample induced by train time reduction. This requires the computation of the log-sum term (also called inclusive value or inclusive utility) for every choice situation, which is:
\[ \mbox{iv}_n = \ln \sum_{j = 1} ^ J e^{\beta^\top x_{nj}}. \]
For this purpose, we use the logsum function, which works on a vector of coefficients and a model matrix. The basic use of logsum consists of providing as its unique argument (called coef) a mlogit object. In this case, the model.matrix and the coef are extracted from the same model:
ivbefore <- logsum(ml.MC1)
To compute the log-sum after train time reduction, we must provide a model matrix which is not the one corresponding to the fitted model. This can be done using the X argument, which is a matrix or an object from which a model matrix can be extracted. This can also be done by filling the data argument (a data frame or an object from which a data frame can be extracted using model.frame), and optionally the formula argument (a formula or an object for which the formula method can be applied). If no formula is provided, but if data is a dfidx object, the formula is extracted from it.
ivafter <- logsum(ml.MC1, data = NMC)
Surplus variation is then computed as the difference of the log-sums divided by the opposite of the cost coefficient which can be interpreted as the marginal utility of income:
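A sketch of the elided computation (the surplus is expressed in the same monetary units as cost):

surplus <- (ivafter - ivbefore) / (- coef(ml.MC1)["cost"])
summary(surplus)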
Consumer surplus variations range from 0.6 to 31 Canadian dollars, with a median value of about $4. Marginal effects are computed using the effects method. By default, they are computed at the sample mean, but a data argument can be provided. The variation of the probability and of the covariate can be either absolute or relative. This is indicated with the type argument, which is a combination of two characters, "a" (for absolute) and "r" (for relative). For example, type = "ar" means that what is measured is an absolute variation of the probability for a relative variation of the covariate.
effects(ml.MC1, covariate = "income", type = "ar")
## car train air
## -0.1822 -0.1509 0.3331
The results indicate that, for a 100% increase in income, the probability of choosing air increases by 33 percentage points, while the probabilities of choosing car and train decrease by 18 and 15 percentage points.
For an alternative specific covariate, a matrix of marginal effects is displayed.
effects(ml.MC1, covariate = "cost", type = "rr")
car train air
car -0.9131 0.9377 0.9377
train 0.3358 -1.2505 0.3358
air 1.2317 1.2317 -3.1410
The cell in the \(l^{\mbox{th}}\) row and the \(c^{\mbox{th}}\) column indicates the change of the probability of choosing alternative \(c\) when the cost of alternative \(l\) changes. As type = "rr", elasticities are computed. For example, a 10% change of train cost increases the probabilities of choosing car and air by 3.36%. Note that the relative changes of the probabilities of choosing one of these two modes are equal, which is a consequence of the IIA property. Finally, in order to compute travel time valuation, we divide the coefficients of travel times (in minutes) by the coefficient of monetary cost (in dollars):
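A sketch of this computation (coefficient names as they appear in the summary above; the factor 60 converts values per minute into values per hour):

coef(ml.MC1)[c("time:car", "time:train", "time:air")] /
    coef(ml.MC1)["cost"] * 60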
The value of travel time ranges from 23 Canadian dollars per hour for the train to 37 for the plane.
14.3 Logit models relaxing the iid hypothesis
In the previous section, we assumed that the error terms were iid (identically and independently distributed), i.e., uncorrelated and homoskedastic. Extensions of the basic multinomial logit model have been proposed by relaxing one of these two hypotheses while maintaining the hypothesis of a Gumbel distribution.
Heteroskedastic logit model
The heteroskedastic logit model was proposed by Bhat (1995). The probability that \(U_l>U_j\) is:
\[ P(\epsilon_j<V_l-V_j+\epsilon_l)=e^{-e^{-\frac{(V_l-V_j+\epsilon_l)}{\theta_j}}}, \]
which implies the following conditional and unconditional probabilities:
\[ (P_l \mid \epsilon_l) =\prod_{j\neq l}e^{-e^{-\frac{(V_l-V_j+\epsilon_l)}{\theta_j}}}, \]
\[ \begin{array}{rcl} P_l&=&\displaystyle\int_{-\infty}^{+\infty} \prod_{j\neq l} \left(e^{-e^{-\frac{(V_l-V_j+t)}{\theta_j}}}\right)\frac{1}{\theta_l}e^{-\frac{t}{\theta_l}}e^{-e^{-\frac{t}{\theta_l}}} dt\\ &=& \displaystyle \int_{0}^{+\infty}\left(e^{-\sum_{j \neq l}e^{-\frac{V_l-V_j-\theta_l \ln t}{\theta_j}}}\right)e^{-t}dt. \end{array} \tag{14.1}\]
There is no closed form for this integral, but it can be efficiently computed using a Gauss quadrature method, and more precisely the Gauss-Laguerre quadrature method. Bhat (1995) estimated the heteroskedastic logit model on the toronto_montreal data set. Using mlogit, the heteroskedastic logit model is obtained by setting the heterosc argument to TRUE:
ml.MC <- mlogit(choice ~ freq + cost + ivt + ovt |
urban + income, MC, reflevel = 'car',
alt.subset = c("car", "train", "air"))
hl.MC <- mlogit(choice ~ freq + cost + ivt + ovt |
urban + income, MC, reflevel = 'car',
alt.subset = c("car", "train", "air"),
heterosc = TRUE)
hl.MC %>% gaze(coef = 11:12)
Estimate Std. Error z-value Pr(>|z|)
sp.train 1.237 0.110 11.20 < 2e-16
sp.air 0.540 0.112 4.83 1.4e-06
Two supplementary coefficients (sp.train and sp.air) are estimated (the \(\theta_j\) in Equation 14.1), the third one, for the reference modality, being set to 1. The variances of the error terms of train and air are respectively higher and lower than the variance of the error term of car (set to 1). Note that the z-values and p-values of the output are not particularly meaningful, as the hypothesis that the coefficient is zero (and not 1) is tested. The homoskedasticity hypothesis can be tested using any of the three tests. For the likelihood ratio and the Wald tests, one can use only the fitted heteroskedastic model as argument. In this case, it is guessed that the hypothesis that the user wants to test is the homoskedasticity hypothesis:
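One possibility (a sketch, comparing the two fitted models directly; gaze is the compact-display helper used throughout this chapter):

lrtest(hl.MC, ml.MC) %>% gaze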
or, more simply:
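A sketch of the single-argument calls, assuming that, as described above, the lrtest and waldtest methods for mlogit objects guess the homoskedastic null when only the heteroskedastic model is supplied:

lrtest(hl.MC) %>% gaze
waldtest(hl.MC) %>% gaze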
The Wald test can also be computed using the linearHypothesis function from the car package:
car::linearHypothesis(hl.MC, c('sp.air = 1', 'sp.train = 1')) %>% gaze
## Chisq = 25.196, df: 2, pval = 0.000
For the score test, we provide the constrained model as argument, which is the standard multinomial logit model, and a supplementary argument which defines the unconstrained model, in this case heterosc = TRUE:
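A sketch of this call, following the scoretest pattern used elsewhere in this chapter:

scoretest(ml.MC, heterosc = TRUE) %>% gaze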
The homoskedasticity hypothesis is therefore strongly rejected using the Wald test, but only at the 1% and 5% levels for, respectively, the score and the likelihood ratio tests.
Nested logit model
The nested logit model was first proposed by McFadden (1978). It is a generalization of the multinomial logit model that is based on the idea that some alternatives may be joined in several groups (called nests). The error terms may then present some correlation in the same nest, whereas error terms of different nests are still uncorrelated. Denoting \(m=1... M\) the nests and \(B_m\) the set of alternatives belonging to nest \(m\), the cumulative distribution of the errors is:
\[ \mbox{exp}\left(-\sum_{m=1}^M \left( \sum_{j \in B_m} e^{-\epsilon_j/\lambda_m}\right)^{\lambda_m}\right). \]
The marginal distributions of the \(\epsilon\)s are still univariate extreme values, but there is now some correlation within nests. \(1-\lambda_m\) is a measure of the correlation, i.e., \(\lambda_m = 1\) implies no correlation. In the special case where \(\lambda_m=1\; \forall m\), the errors are iid Gumbel errors and the nested logit model reduces to the multinomial logit model. It can then be shown that the probability of choosing alternative \(j\) that belongs to nest \(l\) is:
\[ P_j = \frac{e^{V_j/\lambda_l}\left(\sum_{k \in B_l} e^{V_k/\lambda_l}\right)^{\lambda_l-1}} {\sum_{m=1}^M\left(\sum_{k \in B_m} e^{V_k/\lambda_m}\right)^{\lambda_m}}, \]
and that this model is a random utility model if all the \(\lambda\) parameters are in the \(0-1\) interval.6 Let us now write the deterministic part of the utility of alternative \(j\) as the sum of two terms: the first (\(Z_j\)) being specific to the alternative and the second (\(W_l\)) to the nest it belongs to:
\[V_j=Z_j+W_l.\]
We can then rewrite the probabilities as follows:
\[ \begin{array}{rcl} P_j&=&\frac{e^{(Z_j+W_l)/\lambda_l}}{\sum_{k \in B_l} e^{(Z_k+W_l)/\lambda_l}}\times \frac{\left(\sum_{k \in B_l} e^{(Z_k+W_l)/\lambda_l}\right)^{\lambda_l}} {\sum_{m=1}^M\left(\sum_{k \in B_m} e^{(Z_k+W_m)/\lambda_m}\right)^{\lambda_m}}\\ &=&\frac{e^{Z_j/\lambda_l}}{\sum_{k \in B_l} e^{Z_k/\lambda_l}}\times \frac{\left(e^{W_l/\lambda_l}\sum_{k \in B_l} e^{Z_k/\lambda_l}\right)^{\lambda_l}} {\sum_{m=1}^M\left(e^{W_m/\lambda_m}\sum_{k \in B_m} e^{Z_k/\lambda_m}\right)^{\lambda_m}}. \end{array} \]
Then denote \(I_l=\ln \sum_{k \in B_l} e^{Z_k/\lambda_l}\), which is often called the log-sum, the inclusive value or the inclusive utility.7 We can then write the probability of choosing alternative \(j\) as:
\[ P_j=\frac{e^{Z_j/\lambda_l}}{\sum_{k \in B_l} e^{Z_k/\lambda_l}}\times \frac{e^{W_l+\lambda_l I_l}}{\sum_{m=1}^Me^{W_m+\lambda_m I_m}}. \]
The first term \(\mbox{P}_{j\mid l}\) is the conditional probability of choosing alternative \(j\) if nest \(l\) is chosen. It is often referred to as the lower model. The second term \(\mbox{P}_l\) is the marginal probability of choosing nest \(l\) and is referred to as the upper model. \(W_l+\lambda_l I_l\) can be interpreted as the expected utility of choosing the best alternative in \(l\), \(W_l\) being the expected utility of choosing an alternative in this nest (whatever this alternative is) and \(\lambda_l I_l\) being the expected extra utility gained by being able to choose the best alternative in the nest. The inclusive values link the two models. It is then straightforward to show that IIA applies within nests, but not for two alternatives in different nests.
A consistent but inefficient way of estimating the nested logit model is to estimate separately its two components. The coefficients of the lower model are first estimated, which enables the computation of the inclusive values \(I_l\). The coefficients of the upper model are then estimated, using \(I_l\) as covariates. Maximizing directly the likelihood function of the nested model leads to a more efficient estimator.
To illustrate the estimation of the nested logit model, we use the telephone data set, used by Train, McFadden, and Ben-Akiva (1987) and Walker, Ben-Akiva, and Bolduc (2007).
# A tibble: 2,170 × 4
choice service household cost
<lgl> <fct> <int> <dbl>
1 FALSE budget 1 1.76
2 FALSE extended 1 13.8
3 FALSE local 1 2.55
4 FALSE metro 1 3.15
5 TRUE standard 1 1.75
# ℹ 2,165 more rows
A total of 428 households were surveyed in 1984 about their choice of a local telephone service, which typically involves the choice between a flat service (a fixed monthly charge for unlimited calls within a specified geographical area) and a measured service (a reduced fixed monthly charge for a limited number of calls, plus usage charges for additional calls). Households had the choice between five services:
- budget measured (budget): no fixed monthly charge; usage charges apply to each call made,
- standard measured (standard): a fixed monthly charge covers up to a specified dollar amount (greater than the fixed charge) of local calling, after which usage charges apply to each call made,
- local flat (local): a greater monthly charge that may depend upon residential location; unlimited free calling within a local calling area; usage charges apply to calls made outside the local calling area,
- extended area flat (extended): a further increase in the fixed monthly charge to permit unlimited free calling within an extended area,
- metro area flat (metro): the greatest fixed monthly charge that permits unlimited free calling within the entire metropolitan area.
The first two services are measured, and the last three are flat services. There is therefore an obvious nesting structure for this example. We first estimate the multinomial logit model, with the log of cost as the unique covariate:
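A sketch of the elided call (the indexes and formula mirror the nested specification below; per the sentence above, the cost variable is assumed to already contain the log of the monthly cost):

ml_tel <- mlogit(choice ~ cost, telephone,
                 idx = c("household", "service"))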
We then update this model in order to introduce nests, using the nests argument. It is a list of character vectors containing the alternatives of each nest. It is advisable to use a named list (we use here "measured" and "flat" as the names of the nests):
nl_tel <- mlogit(choice ~ cost, telephone,
idx = c("household", "service"),
nests = list(measured = c("budget", "standard"),
flat = c("local", "metro", "extended")))
coef(nl_tel)
## (Intercept):extended (Intercept):local (Intercept):metro
## 1.2255 1.2716 1.7837
## (Intercept):standard cost iv:measured
## 0.3782 -1.4900 0.4848
## iv:flat
## 0.4362
Two supplementary coefficients are estimated, iv:measured and iv:flat. The two values are in the 0-1 interval and close to each other. The un.nest.el argument makes it possible to estimate a unique supplementary coefficient for the two nests:
nl_tel_u <- update(nl_tel, un.nest.el = TRUE)
We then have three nested models and the hypothesis of a unique parameter can be tested using any of the three tests:
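For instance, the likelihood ratio and Wald versions (a sketch; the Wald test is written here with car::linearHypothesis, as done elsewhere in this chapter):

lrtest(nl_tel_u, nl_tel) %>% gaze
car::linearHypothesis(nl_tel, "iv:measured = iv:flat") %>% gaze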
The three tests conclude that a unique parameter can be estimated. Then, we can test whether this parameter is 1, in which case the nested logit model reduces to the multinomial logit model:
scoretest(ml_tel,
nests = list(measured = c("budget", "standard"),
flat = c("local", "metro", "extended"))) %>% gaze
## chisq = 5.685, df: 2, pval = 0.058
lrtest(nl_tel_u, ml_tel) %>% gaze
## Chisq = 17.177, df: 1, pval = 0.000
waldtest(nl_tel_u, nests = NULL) %>% gaze
## chisq = 25.541, df: 1, pval = 0.000
car::linearHypothesis(nl_tel_u, "iv = 1") %>% gaze
## Chisq = 25.541, df: 1, pval = 0.000
Based on the Wald and the likelihood ratio tests, the preferred specification is the nested logit model (but note that the p-value for the score test is slightly higher than 5%).
14.4 Random parameters (or mixed) logit model
Derivation of the model
A mixed logit model (or random parameters logit model) is a logit model whose parameters are assumed to vary from one individual to another. It is therefore a model that takes the heterogeneity of the population into account. For the standard logit model, the probability that individual \(n\) chooses alternative \(j\) is:
\[ P_{nl}=\frac{e^{\beta'x_{nl}}}{\sum_j e^{\beta'x_{nj}}}. \]
Suppose now that the coefficients are individual-specific. The probabilities are then:
\[ P_{nl}=\frac{e^{\beta_n'x_{nl}}}{\sum_j e^{\beta_n'x_{nj}}}. \]
A first approach consists of estimating the parameters for every individual. However, these parameters are identified and can be consistently estimated only if a large number of choice situations per individual is available, which is rarely the case. A more appealing approach consists of considering \(\beta_n\) as random draws from a distribution whose parameters are estimated, which leads to the mixed logit model. The probability that individual \(n\) will choose alternative \(l\), for a given value of \(\beta_n\), is:
\[ P_{nl} \mid \beta_n =\frac{e^{\beta_n'x_{nl}}}{\sum_j e^{\beta_n'x_{nj}}}. \]
To get the unconditional probability, we have to integrate out this conditional probability, using the density function of \(\beta\). Suppose that \(V_{nl}=\beta_n x_{nl}\), i.e., that there is only one individual-specific coefficient and that the density of \(\beta_n\) is \(f(\beta,\theta)\), \(\theta\) being the vector of the parameters of the distribution of \(\beta\). The unconditional probability is then:
\[ P_{nl}= \mbox{E}(P_{nl} \mid \beta_n) = \int_{\beta}(P_{nl} \mid \beta)f(\beta,\theta)d\beta = \int_{\beta}\frac{e^{\beta^\top x_{nl}}}{\sum_j e^{\beta^\top x_{nj}}}f(\beta,\theta)d\beta, \]
which is a one-dimensional integral that can be efficiently estimated by quadrature methods. If \(V_{nl}=\beta_n^{\top} x_{nl}\) where \(\beta_n\) is a vector of length \(K\) and \(f(\beta,\theta)\) is the joint density of the \(K\) individual-specific coefficients, the unconditional probability is:
\[ P_{nl}= \mbox{E}(P_{nl} \mid \beta_n) = \int_{\beta_1}\int_{\beta_2}...\int_{\beta_K}(P_{nl} \mid \beta)f(\beta,\theta)d\beta_1d\beta_2... d\beta_K. \]
This is a \(K\)-dimensional integral which cannot easily be estimated by quadrature methods. The only practical method is then to use simulations. More precisely, \(R\) draws of the parameters are taken from the distribution of \(\beta\), the probability is computed for every draw, and the unconditional probability, which is the expected value of the conditional probabilities, is estimated by the average of the \(R\) probabilities.
The expected value of a random coefficient (\(\mbox{E}(\beta)\)) is simply estimated by the mean of the \(R\) draws on its distribution: \(\bar{\beta}=\frac{1}{R}\sum_{r=1}^R \beta_r\). Individual parameters are obtained by first computing the probabilities of the observed choice of \(n\) for every value of \(\beta_r\):
\[ P_{nr}=\frac{\sum_j y_{nj} e^{\beta_r^{'}x_{nj}}}{\sum_j e^{\beta_r^{'}x_{nj}}}, \]
where \(y_{nj}\) is a dummy equal to 1 if \(n\) has chosen alternative \(j\). The expected value of the parameter for an individual is then estimated by using these probabilities to weight the \(R\) \(\beta\) values:
\[ \hat{\beta}_n = \frac{\sum_r P_{nr} \beta_r}{\sum_r P_{nr}}. \]
If there are repeated observations for the same individuals, the longitudinal dimension of the data can be taken into account in the mixed logit model, assuming that the random parameters of individual \(n\) are the same for all their choice situations. Denoting \(y_{ntl}\) a dummy equal to 1 if \(n\) chooses alternative \(l\) for the \(t\)th choice situation, the probability of the observed choice is:
\[ P_{nt}\mid \beta_n= \frac{\sum_j y_{ntj}e^{\beta_n ^\top x_{ntj}}}{\sum_j e^{\beta_n ^\top x_{ntj}}}. \]
The joint probability for the \(T\) observations of individual \(n\) is then:
\[ P_{n}\mid \beta_n=\prod_t \frac{\sum_j y_{ntj}e^{\beta_n ^\top x_{ntj}}}{\sum_j e^{\beta_n ^\top x_{ntj}}}. \]
Application
The random parameters logit model is estimated by providing a rpar argument to mlogit. This argument is a named vector, the names being those of the random coefficients and the values the acronyms of their distributions. Currently, the normal ("n"), log-normal ("ln"), zero-censored normal ("cn"), uniform ("u") and triangular ("t") distributions are available. For these distributions, two parameters are estimated: for normal-related distributions, the mean and the standard deviation of the underlying normal distribution; for the uniform and triangular distributions, the mean and the half-range of the distribution. For these last two distributions, zero-bounded variants are also provided ("zbt" and "zbu"). These two distributions are defined by only one parameter (the mean) and their support ranges from 0 to twice the mean.
Several considerations may lead to the choice of a specific distribution:
- if correlated coefficients are required, the natural choice is a (transformed) normal distribution, "n", "ln", "tn" or "cn",
- it’s often the case that one wants to impose that the distribution of a random parameter takes only positive or negative values. For example, the price coefficient should be negative for every individual. In this case, "zbt" and "zbu" can be used. The use of "ln" and "cn" can also be relevant but, in this case, if only negative values are expected, one should consider the distribution of the opposite of the random price coefficient. This can easily be done using the opposite argument of dfidx,8
- the use of unbounded distributions often leads to implausible values of some statistics of the random parameters, especially the mean. This is particularly the case for the log-normal distribution, which has a heavy right tail. In this case, bounded distributions like the uniform and the triangular can be used.
Three further arguments of mlogit are relevant here: R is the number of draws, halton indicates whether Halton draws (see Train 2009, chapter 9) should be used (NA and NULL indicate respectively that default Halton draws are used and that pseudo-random numbers are used), and panel is a boolean which indicates whether the panel data version of the log-likelihood should be used.
Correlations between random parameters can be introduced only for random parameters with normal-related distributions, using the correlation argument. If TRUE, all the normal-related random parameters are correlated. The correlation argument can also be a character vector indicating a subset of the random parameters that are assumed to be correlated.
We use the dutch_railways data set, previously coerced to a dfidx object called Tr. We first estimate the multinomial logit model: both alternatives being virtual train trips, it is relevant to use only generic coefficients and to remove the intercept:
Tr <- dfidx(dutch_railways, choice = "choice", varying = 4:11, sep = "_",
opposite = c("price", "comfort", "time", "change"),
idx = list(c("choiceid", "id")), idnames = c("chid", "alt"))
Train.ml <- mlogit(choice ~ price + time + change + comfort | - 1, Tr)
Train.ml %>% gaze
Estimate Std. Error z-value Pr(>|z|)
price 0.3271 0.0165 19.85 < 2e-16
time 1.7206 0.1604 10.73 < 2e-16
change 0.3263 0.0595 5.49 4.1e-08
comfort 0.9457 0.0649 14.56 < 2e-16
All the coefficients are highly significant and have the predicted positive sign (remember that an increase in the variable comfort implies using a less comfortable class). The coefficients can’t be directly interpreted, but dividing them by the price coefficient, we get monetary values:
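A sketch of this computation (coefficient names as in the output above):

coef(Train.ml)[c("time", "change", "comfort")] / coef(Train.ml)["price"]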
We obtain the value of 5.3 euros for an hour of traveling, 1 euro for a change and 2.9 euros to travel in a more comfortable class.9 We then estimate a model with three random parameters, time, change and comfort. We first estimate the uncorrelated mixed logit model:
Train.mxlu <- mlogit(choice ~ price + time + change + comfort | - 1, Tr,
rpar = c(time = "n", change = "n", comfort = "n"),
R = 100, panel = TRUE, correlation = FALSE,
halton = NA, method = "bhhh")
names(coef(Train.mxlu))
## [1] "price" "time" "change" "comfort" "sd.time"
## [6] "sd.change" "sd.comfort"
Compared to the multinomial logit model, there are now three more coefficients, which are the standard deviations of the distributions of the three random parameters. The correlated model is obtained by setting the correlation argument to TRUE:
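A sketch of the elided estimation (updating the uncorrelated model), whose coefficient names are listed below:

Train.mxlc <- update(Train.mxlu, correlation = TRUE)
names(coef(Train.mxlc))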
[1] "price" "time"
[3] "change" "comfort"
[5] "chol.time:time" "chol.time:change"
[7] "chol.change:change" "chol.time:comfort"
[9] "chol.change:comfort" "chol.comfort:comfort"
There are now six supplementary parameters, which are the elements of the Choleski decomposition of the covariance matrix of the three random parameters. These six parameters are therefore the elements of the following matrix:
\[ C= \left( \begin{array}{ccc} c_{11} & c_{12} & c_{13} \\ 0 & c_{22} & c_{23} \\ 0 & 0 & c_{33} \end{array} \right) \]
such that:
\[ C^{\top}C= \left( \begin{array}{ccc} c_{11}^2 & c_{11} c_{12} & c_{11}c_{13} \\ c_{11}c_{12} & c_{12}^2 + c_{22}^2 & c_{12}c_{13}+c_{22}c_{23} \\ c_{11}c_{13} & c_{12}c_{13} + c_{22}c_{23} & c_{13}^2 + c_{23}^2 + c_{33}^2 \end{array} \right) = \left( \begin{array}{ccc} \sigma_{1}^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{2}^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{3}^2 \end{array} \right) \]
where \(\sigma_k^2\) and \(\sigma_{kl}\) are respectively the variance of the random parameter \(k\) and the covariance between two random parameters \(k\) and \(l\). Therefore, the first estimated parameter can be simply interpreted as the standard deviation of the first random parameter, but the five others can’t be interpreted easily. Random parameters may be extracted using the function rpar, which takes as first argument a mlogit object and as second argument par the parameter(s) to be extracted. This function returns a rpar object, and a summary method is provided to describe it:
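A sketch of such an extraction for the time parameter (the object name is illustrative):

marg.ut.time <- rpar(Train.mxlc, "time")
summary(marg.ut.time)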
The estimated random parameter is in the “preference space”, which means that it is the marginal utility of time. Parameters in the “willingness to pay” (WTP) space are easier to interpret. They can be estimated directly (a feature not supported by mlogit) or can be obtained from the marginal utility by dividing it by the coefficient of a covariate expressed in monetary value (a price for example), taken as a non-random parameter. The ratio can then be interpreted as a monetary value (or willingness to pay). To obtain the distribution of the random parameters in the WTP space, one can use the norm argument of rpar:
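For instance (a sketch, normalizing by the price coefficient):

wtp.time <- rpar(Train.mxlc, "time", norm = "price")
summary(wtp.time)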
The median value (and the mean value, as the distribution is symmetric) of transport time is about 33 euros. Several methods/functions are provided to extract the individual statistics (mean, med and stdev, respectively for the mean, median and standard deviation):
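For instance (a sketch, applied to the WTP-space time parameter defined above):

mean(rpar(Train.mxlc, "time", norm = "price"))
med(rpar(Train.mxlc, "time", norm = "price"))
stdev(rpar(Train.mxlc, "time", norm = "price"))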
In the case of correlated random parameters, as the estimated parameters can’t be directly interpreted, a vcov method for mlogit objects is provided. It has a what argument whose default value is coefficient; in this case the usual covariance matrix of the coefficients is returned. If what = "rpar", the covariance matrix of the correlated random parameters is returned if type = "cov" (the default) and the correlation matrix (with standard deviations on the diagonal) is returned if type = "cor". The object is of class vcov.mlogit, and a summary method for this object is provided which computes, using the delta method, the standard errors of the parameters of the covariance or correlation matrix.
vcov(Train.mxlc, what = "rpar")
time change comfort
time 28.6460 -0.2788 5.558
change -0.2788 3.1047 1.232
comfort 5.5579 1.2325 7.896
vcov(Train.mxlc, what = "rpar", type = "cor")
time change comfort
time 5.35220 -0.02956 0.3696
change -0.02956 1.76203 0.2489
comfort 0.36956 0.24893 2.8099
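The table below presumably comes from applying the summary method just described (a sketch):

summary(vcov(Train.mxlc, what = "rpar", type = "cor"))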
Estimate Std. Error z-value Pr(>|z|)
sd.time 5.3522 0.3811 14.04 <2e-16 ***
sd.change 1.7620 0.1446 12.19 <2e-16 ***
sd.comfort 2.8099 0.1783 15.76 <2e-16 ***
cor.time:change -0.0296 0.2324 -0.13 0.8988
cor.time:comfort 0.3696 0.1141 3.24 0.0012 **
cor.change:comfort 0.2489 0.1103 2.26 0.0240 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The correlation can be restricted to a subset of random parameters by filling the correlation argument with a character vector indicating the corresponding covariates:
Train.mxlc2 <- update(Train.mxlc, correlation = c("time", "comfort"))
vcov(Train.mxlc2, what = "rpar", type = "cor")
time comfort
time 5.5726 0.3909
comfort 0.3909 3.0631
The presence of random coefficients and their correlation can be investigated using any of the three tests. Actually, three nested models can be considered, a model with no random effects, a model with random but uncorrelated effects and a model with random and correlated effects. We first present the three tests of no correlated random effects:
lrtest(Train.mxlc, Train.ml) %>% gaze
## Chisq = 388.057, df: 6, pval = 0.000
waldtest(Train.mxlc) %>% gaze
## chisq = 288.287, df: 6, pval = 0.000
car::linearHypothesis(Train.mxlc,
c("chol.time:time = 0", "chol.time:change = 0",
"chol.time:comfort = 0", "chol.change:change = 0",
"chol.change:comfort = 0",
"chol.comfort:comfort = 0")) %>% gaze
## Chisq = 288.287, df: 6, pval = 0.000
scoretest(Train.ml, rpar = c(time = "n", change = "n", comfort = "n"),
R = 100, correlation = TRUE, halton = NA,
panel = TRUE) %>% gaze
## chisq = 208.765, df: 6, pval = 0.000
The hypothesis of no correlated random parameters is strongly rejected. We then present the three tests of no correlation, the existence of random parameters being maintained.
lrtest(Train.mxlc, Train.mxlu) %>% gaze
## Chisq = 42.621, df: 3, pval = 0.000
car::linearHypothesis(Train.mxlc,
c("chol.time:change = 0","chol.time:comfort = 0",
"chol.change:comfort = 0")) %>% gaze
## Chisq = 103.195, df: 3, pval = 0.000
waldtest(Train.mxlc, correlation = FALSE) %>% gaze
## chisq = 103.195, df: 3, pval = 0.000
scoretest(Train.mxlu, correlation = TRUE) %>% gaze
## chisq = 10.483, df: 3, pval = 0.015
The hypothesis of no correlation is strongly rejected with the Wald and the likelihood ratio tests, but only at the 5% level with the score test.
14.5 Multinomial probit
The model
The multinomial probit is obtained with the same modeling framework that we used while presenting the random utility model. The utility of an alternative is still the sum of two components: \(U_j = V_j + \epsilon_j\), but the joint distribution of the error terms is now a multivariate normal with mean 0 and a covariance matrix denoted by \(\Omega\).10 Alternative \(l\) is chosen if: \[ \left\{ \begin{array}{rcl} U_1-U_l&=&(V_1-V_l)+(\epsilon_1-\epsilon_l)<0\\ U_2-U_l&=&(V_2-V_l)+(\epsilon_2-\epsilon_l)<0\\ & \vdots & \\ U_J-U_l&=&(V_J-V_l)+(\epsilon_J-\epsilon_l)<0\\ \end{array} \right. \]
which implies, denoting \(V^l_j=V_j-V_l\):
\[ \left\{ \begin{array}{rclrcl} \epsilon^l_1 &=& (\epsilon_1-\epsilon_l) &<& - V^l_1\\ \epsilon^l_2 &=& (\epsilon_2-\epsilon_l) &<& - V^l_2\\ &\vdots & & \vdots & \\ \epsilon^l_J &=& (\epsilon_J-\epsilon_l) &<& - V^l_J\\ \end{array} \right. \]
The initial vector of errors \(\epsilon\) is transformed using the following transformation: \(\epsilon^l = M^l \epsilon\), where the transformation matrix \(M^l\) is a \((J-1) \times J\) matrix obtained by inserting a column of \(-1\) at the \(l^{\mbox{th}}\) position of an identity matrix. For example, if \(J = 4\) and \(l = 3\):
\[ M^3 = \left( \begin{array}{cccc} 1 & 0 & -1 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & -1 & 1 \\ \end{array} \right) \]
The covariance matrix of the error differences is:
\[ \mbox{V}\left(\epsilon^l\right)=\mbox{V}\left(M^l\epsilon\right) = M^l\mbox{V}\left(\epsilon\right){M^l}^{\top} = M^l\Omega{M^l}^{\top} \tag{14.2}\]
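As a quick illustration (a minimal sketch, not part of the mlogit code; the make_M helper is purely illustrative), the transformation matrix \(M^l\) can be built in R and used to obtain the covariance matrix of the error differences from an arbitrary \(\Omega\):
make_M <- function(J, l) {
  # keep the rows j != l of the J x J identity matrix and put -1 in column l,
  # so that each row of M computes epsilon_j - epsilon_l
  M <- diag(J)[-l, , drop = FALSE]
  M[, l] <- -1
  M
}
make_M(4, 3)
# covariance of the error differences (Equation 14.2) for an arbitrary Omega
Omega <- crossprod(matrix(rnorm(16), 4))
make_M(4, 3) %*% Omega %*% t(make_M(4, 3))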
The probability of choosing \(l\) is then:
\[ P_l =\mbox{P}(\epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l \;\&\; ... \; \epsilon^l_J<-V_J^l) \tag{14.3}\]
Under the hypothesis of a normal distribution, this can be written as:
\[ P_l = \int_{-\infty}^{-V_1^l}\int_{-\infty}^{-V_2^l}...\int_{-\infty}^{-V_J^l}\phi(\epsilon^l)\, d\epsilon^l_1 d\epsilon^l_2... d\epsilon^l_J \]
with:
\[ \phi\left(\epsilon^l\right)=\frac{1}{(2\pi)^{(J-1)/2}\mid\Omega^l\mid^{1/2}} e^{-\frac{1}{2}{\epsilon^l}^{\top}{\Omega^l}^{-1}\epsilon^l} \]
Two problems arise with this model:
- the identified parameters are the elements of \(\Omega^l\) and not of \(\Omega\); we must therefore carefully investigate the meaning of these elements,
- the probability is a \((J-1)\)-dimensional integral, which must be computed numerically. The relevant strategy in this context is to use simulations.
Identification
The meaningful parameters are those of the covariance matrix of the error \(\Omega\). For example, with \(J = 3\):
\[ \Omega = \left( \begin{array}{ccc} \sigma_1 ^ 2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \\ \end{array} \right) \] Computing Equation 14.2 for \(l=1\), we get:
\[ \Omega^1 = M^1 \Omega {M^1}^{\top}= \left( \begin{array}{cc} \sigma_1^2+\sigma_2^2-2\sigma_{12} & \sigma_1^2 + \sigma_{23} - \sigma_{12} -\sigma_{13} \\ \sigma_1^2+\sigma_{23}- \sigma_{12} - \sigma_{13} & \sigma_1^2 + \sigma_3^2 - 2 \sigma_{13} \\ \end{array} \right) \]
The overall scale of utility being unidentified, one has to impose the value of one of the variances; for example, the first one is set to 1. We then have:
\[ \Omega^1 = \left( \begin{array}{cc} 1 & \frac{\sigma_1^2+ \sigma_{23} - \sigma_{12} -\sigma_{13}}{\sigma_1^2+\sigma_2^2-2\sigma_{12}} \\ \frac{\sigma_1^2+\sigma_{23}- \sigma_{12} - \sigma_{13}}{\sigma_1^2+\sigma_2^2-2\sigma_{12}} & \frac{\sigma_1^2 + \sigma_3^2 - 2 \sigma_{13}}{\sigma_1^2+\sigma_2^2-2\sigma_{12}} \\ \end{array} \right) \]
Therefore, out of the six structural parameters of the covariance matrix, only the three elements of \(\Omega^1\) can be recovered, and one of them is further normalized to 1, so that only two parameters are actually identified. Moreover, it’s almost impossible to interpret these parameters. More generally, with \(J\) alternatives, the number of parameters of the covariance matrix is \((J+1)\times J/2\) and the number of identified parameters is \(J\times(J-1)/2-1\).
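A small numerical check, reusing the make_M helper sketched above with an arbitrary covariance matrix, illustrates why so few parameters are identified: adding the same constant to every element of \(\Omega\) leaves \(\Omega^1\) unchanged, and rescaling \(\Omega\) leaves the normalized \(\Omega^1\) unchanged.
Omega <- matrix(c(1,   0.5, 0.2,
                  0.5, 2,   0.3,
                  0.2, 0.3, 1.5), nrow = 3)
M1 <- make_M(3, 1)
O1 <- M1 %*% Omega %*% t(M1)
# shifting every element of Omega by the same constant does not change Omega^1
all.equal(O1, M1 %*% (Omega + 0.7) %*% t(M1))
# rescaling Omega only rescales Omega^1: the normalized matrices are identical
O1s <- M1 %*% (4 * Omega) %*% t(M1)
all.equal(O1 / O1[1, 1], O1s / O1s[1, 1])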
Simulations
Let \(C^l\) be the Choleski decomposition of the covariance matrix of the error differences:
\[ \Omega^l= {C^l}^{\top}C^l \]
This matrix is an upper triangular matrix of dimension \((J-1)\):
\[ C^l= \left( \begin{array}{ccccc} l_{11} & l_{12} & l_{13} &... & l_{1(J-1)} \\ 0 & l_{22} & l_{23} & ... & l_{2(J-1)} \\ 0 & 0 & l_{33} & ... & l_{3(J-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & ... & l_{(J-1)(J-1)} \\ \end{array} \right) \]
Let \(\nu\) be a vector of standard normal deviates: \(\nu \sim \mathcal{N}(0, I)\).
We then have:
\[ \mbox{V}\left({C^l}^\top\nu\right)={C^l}^\top V(\nu){C^l}={C^l}^{\top}IC^l=\Omega^l \]
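In R, the chol function returns precisely this upper triangular factor, so the property above can be checked directly (a minimal sketch with an arbitrary matrix):
O1 <- matrix(c(2, 0.5, 0.5, 1.5), nrow = 2)   # an arbitrary covariance of error differences
C1 <- chol(O1)                                # upper triangular, with t(C1) %*% C1 = O1
all.equal(t(C1) %*% C1, O1)
# draws t(C1) %*% nu with nu ~ N(0, I) have (approximately) covariance O1
nu <- matrix(rnorm(2 * 1e4), nrow = 2)
round(cov(t(t(C1) %*% nu)), 2)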
Therefore, if we draw a vector of standard normal deviates \(\nu\) and apply this transformation to it, we get a realization of \(\epsilon^l\). The probability of choosing \(l\) given by Equation 14.3 can be written as a product of marginal and conditional probabilities:
\[ \begin{array}{rcl} P_l &=& \mbox{P}(\epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l \;\&\; ... \;\&\; \epsilon^l_J<-V_J^l)\\ &=& \mbox{P}(\epsilon^l_1<- V_1^l)\\ &\times&\mbox{P}(\epsilon^l_2<-V_2^l \mid \epsilon^l_1<-V_1^l) \\ &\times&\mbox{P}(\epsilon^l_3<-V_3^l \mid \epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l) \\ & \vdots & \\ &\times&\mbox{P}(\epsilon^l_J<-V_J^l \mid \epsilon^l_1<-V_1^l \;\&\; ... \;\&\; \epsilon^l_{J-1}<-V_{J-1}^l) \\ \end{array} \]
The vector of error difference deviates is:
\[ \left( \begin{array}{c} \epsilon^l_1 \\ \epsilon^l_2 \\ \epsilon^l_3 \\ \vdots \\ \epsilon^l_J \end{array} \right) = {C^l} ^ \top \nu = \left( \begin{array}{ccccc} l_{11} & 0 & 0 &... & 0 \\ l_{12} & l_{22} & 0 & ... & 0 \\ l_{13} & l_{23} & l_{33} & ... & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{1(J-1)} & l_{2(J-1)} & l_{3(J-1)} & ... & l_{(J-1)(J-1)} \\ \end{array} \right) \times \left( \begin{array}{c} \nu_1 \\ \nu_2 \\ \nu_3 \\ \vdots \\ \nu_{J-1} \end{array} \right) \]
\[ \left( \begin{array}{c} \epsilon^l_1 \\ \epsilon^l_2 \\ \epsilon^l_3 \\ \vdots \\ \epsilon^l_J \end{array} \right) = \left( \begin{array}{l} l_{11}\nu_1 \\ l_{12}\nu_1+l_{22}\nu_2 \\ l_{13}\nu_1+l_{23}\nu_2 + l_{33}\nu_3\\ \vdots \\ l_{1(J-1)}\nu_1+l_{2(J-1)}\nu_2+...+l_{(J-1)(J-1)}\nu_{J-1} \end{array} \right) \]
Let’s now investigate the marginal and conditional probabilities:
- the first is simply the marginal probability that a standard normal deviate is lower than \(-\frac{V_1^l}{l_{11}}\); therefore we have: \(\mbox{P}(\epsilon^l_1<-V_1^l) = \Phi\left(-\frac{V_1^l}{l_{11}}\right)\)
- the second is, for a given value of \(\nu_1\), equal to \(\Phi\left(-\frac{V^l_2+l_{12}\nu_1}{l_{22}}\right)\). We then have to compute the mean of this expression over all values of \(\nu_1\) lower than \(-\frac{V^l_1}{l_{11}}\). We then have, denoting \(\bar{\phi}_1\) the corresponding truncated normal density: \[\mbox{P}(\epsilon^l_2<-V_2^l \mid \epsilon^l_1<-V_1^l)=\int_{-\infty}^{-\frac{V^l_1}{l_{11}}}\Phi\left(-\frac{V^l_2+l_{12}\nu_1}{l_{22}}\right) \bar{\phi}_1(\nu_1)d\nu_1\]
- the third is, for given values of \(\nu_1\) and \(\nu_2\), equal to \(\Phi\left(-\frac{V^l_3+l_{13}\nu_1+l_{23}\nu_2}{l_{33}}\right)\). We then have: \[\mbox{P}(\epsilon^l_3<-V_3^l \mid \epsilon^l_1<-V_1^l \;\&\; \epsilon^l_2<-V_2^l)=\int_{-\infty}^{-\frac{V^l_1}{l_{11}}}\int_{-\infty}^{-\frac{V^l_2+l_{12}\nu_1}{l_{22}}} \Phi\left(-\frac{V^l_3+l_{13}\nu_1+l_{23}\nu_2}{l_{33}}\right)\bar{\phi}_1(\nu_1)\bar{\phi}_2(\nu_2)d\nu_1d\nu_2\]
- and so on.
These probabilities can easily be simulated by drawing numbers from a truncated normal distribution. This so-called GHK algorithm (for Geweke, Hajivassiliou and Keane; see for example Geweke, Keane, and Runkle 1994) can be described as follows:
- compute \(\Phi\left(-\frac{V_1^l}{l_{11}}\right)\),
- draw a number called \(\nu_1^r\) from a standard normal distribution upper-truncated at \(-\frac{V_1^l}{l_{11}}\) and compute \(\Phi\left(-\frac{V^l_2+l_{12}\nu_1^r}{l_{22}}\right)\),
- draw a number called \(\nu_2^r\) from a standard normal distribution upper-truncated at \(-\frac{V^l_2+l_{12}\nu_1^r}{l_{22}}\) and compute \(\Phi\left(-\frac{V^l_3+l_{13}\nu_1^r+l_{23}\nu_2^r}{l_{33}}\right)\),
- \(\ldots\) continue in this way until \(\nu_{J-2}^r\) has been drawn, and compute the last term \(\Phi\left(-\frac{V^l_{J-1}+l_{1(J-1)}\nu_1^r+\ldots+l_{(J-2)(J-1)}\nu_{J-2}^r}{l_{(J-1)(J-1)}}\right)\),
- multiply all these probabilities and get a realization of the probability of choosing \(l\) called \(P^r_l\),
- repeat all these steps many times and average all these probabilities; this average is an estimation of the probability: \(\bar{P}_l = \sum_{r=1}^R P^r_l/R\).
Several points should be noted concerning this algorithm:
- the utility differences should be computed with respect to the chosen alternative for each individual,
- the Choleski decompositions used for the different individuals should all derive from the same covariance matrix of the errors. One method to attain this goal is to start from a given difference, e.g., the differences with respect to the first alternative. The vector of error differences is then \(\epsilon^1\) and its covariance matrix is \(\Omega^1={C^1}^{\top}C^1\). To apply a difference with respect to another alternative \(l\), we construct a matrix called \(S^l\), which is obtained by taking a \(J-2\) identity matrix, adding a first row of 0 and inserting a column of \(-1\) at the \((l-1)\)th position. For example, with four alternatives and \(l=3\), we have:
\[S^3= \left( \begin{array}{ccc} 0 & -1 & 0 \\ 1 & -1 & 0 \\ 0 & -1 & 1 \\ \end{array} \right) \] The Choleski factor of the covariance matrix for alternative \(l\) is then obtained from: \[ \Omega^l = S^l \Omega^1 {S^l}^{\top}={C^l}^{\top}C^l \]
- to compute draws from a normal distribution upper-truncated at \(a\), the following trick is used: take a draw \(\mu\) from a uniform distribution (between 0 and 1); then \(\nu = \Phi^{-1}\left(\mu \Phi(a)\right)\) is a draw from a standard normal distribution upper-truncated at \(a\).
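As an illustration only (not the package’s code), a minimal R sketch of this simulator could look as follows, assuming that V contains the utility differences \(V^l_j\) relative to the chosen alternative and L the lower triangular factor \({C^l}^{\top}\) (the function name ghk and its arguments are purely illustrative):
ghk <- function(V, L, R = 100) {
  J1 <- length(V)                    # number of error differences (J - 1)
  P <- numeric(R)
  for (r in seq_len(R)) {
    nu <- numeric(J1)
    prob <- 1
    for (i in seq_len(J1)) {
      # upper bound for the i-th standardized deviate, given the previous draws
      bound <- - (V[i] + sum(L[i, seq_len(i - 1)] * nu[seq_len(i - 1)])) / L[i, i]
      prob <- prob * pnorm(bound)
      # draw nu_i from a standard normal upper-truncated at 'bound',
      # using the trick described above
      if (i < J1) nu[i] <- qnorm(runif(1) * pnorm(bound))
    }
    P[r] <- prob
  }
  mean(P)                            # simulated probability of the chosen alternative
}
With \(J = 2\) there is a single error difference and the function simply returns pnorm(-V[1] / L[1, 1]).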
Application
In Section 14.2.5, we estimated a multinomial logit for mode choice in the Toronto-Montreal corridor, using cost and freq as alternative-specific covariates with a generic coefficient, income as a choice situation-specific covariate and time as an alternative-specific covariate with a specific coefficient. We now fit the same model as a probit, which is simply done by setting the probit argument to TRUE.
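A sketch of the corresponding call is given below; the name of the response (choice) and of the estimation sample (MC) are assumptions, as only the covariate names are given above:
# a sketch only: the formula follows the description above; the names of the
# response (choice) and of the data object (MC) are assumptions
pbt <- mlogit(choice ~ cost + freq | income | time, data = MC, probit = TRUE)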
As previously, we want to predict the effect of a 20% reduction in train fares:
Oprob <- fitted(pbt, type = "probabilities")
Nprob <- predict(pbt, newdata = NMC)
rbind(old = apply(Oprob, 2, mean), new = apply(Nprob, 2, mean))
car train air
old 0.4606 0.1676 0.3721
new 0.4194 0.2457 0.3353
With this model, the IIA property no longer holds, as the ratio of the probabilities of choosing air and car is not the same before and after the change in train fares.
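For instance, using the average fitted and predicted probabilities computed above:
old_p <- apply(Oprob, 2, mean)
new_p <- apply(Nprob, 2, mean)
# air / car probability ratio: about 0.81 before vs 0.80 after the fare reduction
c(old = unname(old_p["air"] / old_p["car"]),
  new = unname(new_p["air"] / new_p["car"]))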
We’ve already encountered this model in the special case where the response is binomial in Section 10.2.2.↩︎
Used by Ben-Akiva, Bolduc, and Bradley (1993) and Meijer and Rouwendal (2006).↩︎
Used in particular by Forinash and Koppelman (1993), Bhat (1995), Koppelman and Wen (1998) and Koppelman and Wen (2000).↩︎
A slightly different version of the nested logit model (Daly 1987) is often used, but is not compatible with the random utility maximization hypothesis. Its difference from the previous expression is that the deterministic parts of the utility for each alternative are not divided by the nest elasticity. The differences between the two versions have been discussed in Koppelman and Wen (1998), Heiss (2002) and Hensher and Greene (2002).↩︎
We’ve already encountered this expression in Section 14.2.4.3.↩︎
See Section 14.1.1.1.↩︎
Remember that the survey took place in 1987. The values should be multiplied by 1.73 to express them in 2024 euros.↩︎