Jose Pinheiro gave a web seminar on the S+ CorrelatedData Library. An archive of this
presentation is at
I have reported on other S+ web seminars in the past.
The CorrelatedData library extends the Generalized Linear Model (GLM) to single-level and
multi-level grouped data.
The classic linear regression model assumes that the outcome variable is a linear function
of the predictor variables plus an error term with constant variance. The GLM extends the
linear model in two different ways. First, it allows for a link function so that the mean of
the outcome can be linear in the predictors on a different scale (such as a log scale). It also
allows for a variance function so that the error term has a non-constant variance that changes
as the mean changes.
This allows, for example, for better modeling of count data because groups with higher
average counts also tend to have greater amounts of variability.
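In symbols, the two extensions can be written as follows. This is the standard textbook form of a GLM rather than notation taken from the seminar; the symbols g (link function), V (variance function), and phi (dispersion parameter) are introduced here for illustration.

```latex
g(\mu_i) = \mathbf{x}_i^{\top} \boldsymbol{\beta},
\qquad \mu_i = E(Y_i),
\qquad \operatorname{Var}(Y_i) = \phi \, V(\mu_i)
```

For count data modeled with a Poisson family, g is the log function, V(mu) = mu, and phi is fixed at 1, so the variance rises in step with the mean.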
In S-plus, you use the glm() function to fit a GLM. There are three important
arguments in glm():
-
formula is a linear formula relating the predictor variables to the outcome.
-
family specifies a combination of link function and variance function that works well
for a particular distribution. For example, the Poisson family uses a log link and a
variance function that is proportional to the mean.
-
data specifies the data frame that includes the predictor variables and the outcome.
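What these three pieces do can be illustrated outside S-plus. The following is a minimal sketch in plain Python, not the glm() implementation: it assumes a single predictor and a Poisson family with a log link, and it uses iteratively reweighted least squares, a standard way of fitting a GLM. The function name fit_poisson_glm and the dose/count data are hypothetical.

```python
import math

def fit_poisson_glm(x, y, iterations=25):
    """Fit log(mu) = b0 + b1 * x by iteratively reweighted least squares."""
    # Start from the intercept-only fit: log of the mean count, zero slope.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]      # linear predictor (the "formula")
        mu = [math.exp(e) for e in eta]       # inverse of the log link
        # Working response and weights for a log link with V(mu) = mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on x (closed form for one predictor).
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxz = sum(wi * (xi - xbar) * (zi - zbar)
                  for wi, xi, zi in zip(w, x, z))
        b1 = sxz / sxx
        b0 = zbar - b1 * xbar
    return b0, b1

# Hypothetical data: a count outcome recorded at six dose levels.
dose = [0, 1, 2, 3, 4, 5]
count = [1, 3, 4, 9, 14, 27]
b0, b1 = fit_poisson_glm(dose, count)
print("intercept:", b0, "slope:", b1)
```

The link and the variance function enter only through the working response z and the weights w, which is why changing the family changes the fit without changing the formula or the data.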
Dr. Pinheiro showed two examples of GLMs where the data did not seem to fit very
well. In both models, the dispersion parameter, which can be estimated as the residual deviance
divided by the residual degrees of freedom, was much larger than 1. This implies that there was
substantial center to center variation in the first study (a multi-center trial) and
substantial patient to patient variation in the second study (a longitudinal
trial). The residual plots were also troublesome because some centers/patients had entirely
negative residuals and others had entirely positive residuals.
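The dispersion estimate described above is simple to compute by hand. Here is a small sketch in plain Python, assuming a Poisson model; the function names and the counts are hypothetical and are not the seminar's data.

```python
import math

def poisson_deviance(y, mu):
    """Residual deviance of a Poisson fit: 2 * sum(y*log(y/mu) - (y - mu))."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # The y*log(y/mu) term is taken as 0 when y == 0.
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total

def dispersion(y, mu, n_params):
    """Residual deviance divided by the residual degrees of freedom."""
    return poisson_deviance(y, mu) / (len(y) - n_params)

# Hypothetical counts from two centers, fitted with one common mean
# (a model that ignores the center effect).
counts = [2, 3, 1, 2, 14, 16, 12, 15]
common_mean = sum(counts) / len(counts)
phi = dispersion(counts, [common_mean] * len(counts), n_params=1)
print("estimated dispersion:", phi)
```

With these made-up counts the estimate comes out well above 1, because one center runs consistently low and the other consistently high, the same pattern as the one-sided residuals described above.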
The Generalized Linear Mixed Model (GLMM) is an extension of GLM that allows for better
modeling of center to center variation or patient to patient variation.
GLMM can be thought of as a compromise between
-
the GLM model across all patients or centers that ignores the center or patient effects and
-
a separate GLM model for each patient or center that fits each center or patient exactly.
The latter approach can be very inefficient, since you lose a degree of freedom for each
patient or center in your analysis. The GLMM can be thought of as having each patient or
center borrow strength from the other patients or centers in the model.
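Borrowing strength is easiest to see in the simplest special case. The sketch below assumes an identity link, a single random intercept per center, and variance components that are already known. None of that comes from the seminar, and a real GLMM fit would estimate the variances. The function name partial_pooling and the data are hypothetical.

```python
def partial_pooling(groups, sigma2, tau2):
    """Shrink each group mean toward the overall mean.

    sigma2 is the within-group variance and tau2 the between-group variance;
    both are treated as known here, which a real fit would have to estimate.
    """
    all_values = [v for values in groups.values() for v in values]
    overall = sum(all_values) / len(all_values)
    estimates = {}
    for name, values in groups.items():
        n = len(values)
        group_mean = sum(values) / n
        # Weight on the group's own mean grows with n and with tau2.
        weight = tau2 / (tau2 + sigma2 / n)
        estimates[name] = weight * group_mean + (1 - weight) * overall
    return estimates

# Hypothetical centers: a small center and a large one.
centers = {
    "A": [12.0, 16.0],
    "B": [8.0, 9.0, 10.0, 11.0, 9.0, 7.0, 10.0, 8.0],
}
shrunk = partial_pooling(centers, sigma2=4.0, tau2=1.0)
print(shrunk)
```

Setting tau2 to 0 reproduces the pooled model that ignores centers, and a very large tau2 approaches a separate estimate per center, so the two extremes in the list above are the end points of the same weighting scheme. The small center A is pulled further toward the overall mean than the large center B.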
GLMM is also an extension of the linear mixed effects (LME) model. It allows for mixed
effects like LME, but provides the flexibility of a link function and variance function.
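The relationship among the three models can be written in one line. This is the usual textbook form, not notation from the seminar; the symbols b_i (random effects for group i), z_ij (their design vector), and Psi (their covariance matrix) are introduced here for illustration.

```latex
g(\mu_{ij}) = \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + \mathbf{z}_{ij}^{\top} \mathbf{b}_i,
\qquad \mathbf{b}_i \sim N(\mathbf{0}, \boldsymbol{\Psi}),
\qquad \mu_{ij} = E(Y_{ij} \mid \mathbf{b}_i),
\qquad \operatorname{Var}(Y_{ij} \mid \mathbf{b}_i) = \phi \, V(\mu_{ij})
```

Dropping the random effects b_i gives back the GLM, while an identity link with a constant variance function gives the LME model.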
The algorithms for the GLMM model are complex because a maximum likelihood approach would
take too much time, even with today's superfast computers. Alternative approaches include PQL
(Penalized Quasi Likelihood) and MQL (Marginal Quasi Likelihood). There are restricted
versions of these algorithms, REPQL and REMQL. For specific families (the binomial and
Poisson families), there are additional fitting algorithms, the Laplacian algorithm and
adaptive Gaussian quadrature. These approaches may avoid some of the problems of PQL and MQL,
which have been shown in certain circumstances to produce biased estimates.
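The difficulty is that the likelihood of a GLMM contains an integral over the random effects that generally has no closed form. The sketch below shows the idea behind quadrature for one cluster of binary outcomes with a random intercept. It is a deliberately simplified, non-adaptive version with a fixed three-point Gauss-Hermite rule. The adaptive variant recenters and rescales the nodes for each cluster, and nothing here is taken from the library's code. All function names are hypothetical.

```python
import math

# Three-point Gauss-Hermite rule: nodes and weights for the weight
# function exp(-x^2).
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (
    math.sqrt(math.pi) / 6,
    2 * math.sqrt(math.pi) / 3,
    math.sqrt(math.pi) / 6,
)

def normal_expectation(f, sigma):
    """Approximate E[f(b)] for b ~ Normal(0, sigma^2) by quadrature."""
    # Substituting b = sqrt(2) * sigma * x turns the normal density
    # into the exp(-x^2) weight that the rule is built for.
    total = sum(w * f(math.sqrt(2) * sigma * x)
                for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / math.sqrt(math.pi)

def cluster_likelihood(y, intercept, sigma):
    """Marginal likelihood of binary outcomes y sharing one random intercept."""
    def conditional(b):
        p = 1 / (1 + math.exp(-(intercept + b)))   # inverse logit link
        return math.prod(p if yi else 1 - p for yi in y)
    return normal_expectation(conditional, sigma)

# Hypothetical cluster: three binary outcomes from one patient.
print(cluster_likelihood([1, 0, 1], intercept=0.2, sigma=1.0))
```

A full fit would multiply such terms over all clusters and maximize over the intercept and sigma. More nodes give a better approximation at a higher cost, which is one reason these methods are slower than PQL and MQL.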
To fit a GLMM in S-plus, you use the glme() function. Its fixed, data, and family arguments
correspond to the formula, data, and family arguments of glm(). A new argument, random,
specifies the random effects. Typically this would incorporate variation due to centers in a
multi-center trial or variation due to patients in a longitudinal study.
The glme() function produces a glme object with information that you can print or plot.
You can, for example, easily produce residual plots from the glme() object or test for the
normality of the random effects.
You can generalize a GLMM to allow for multiple random effects, such as effects due to
different countries, different centers within each country, and different patients within each
center. This is an example of nested random effects. The PQL and MQL algorithms work very
efficiently with nested random effects and take advantage of the special structure that the
nesting produces. You can also look at random effects that are crossed with each other rather
than nested, but the algorithms here are not as efficient.
I asked if Dr. Pinheiro could explain the difference between Generalized Estimating
Equations (GEE), which use a marginal effects approach, and GLMM, which uses a random effects
approach. He said that the GEE approach goes straight to the correlation structure of the
data, while GLMM produces a correlation structure implicitly.
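The simplest case shows what "implicitly" means. Assume an identity link and one random intercept per patient, a special case chosen here for illustration; the symbols sigma_b^2 (between-patient variance) and sigma^2 (residual variance) are not from the seminar.

```latex
y_{ij} = \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + b_i + \varepsilon_{ij},
\qquad b_i \sim N(0, \sigma_b^2),
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)

\operatorname{Corr}(y_{ij}, y_{ik}) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma^2},
\qquad j \neq k
```

The random intercept never mentions correlation, yet it forces every pair of observations on the same patient to share this one correlation. A GEE analysis would instead state such an exchangeable correlation directly as its working correlation structure.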
07/08/2008.