12 What are some of the problems with stepwise regression?

All of this material is quoted from various e-mails that appeared on STAT-L/SCI.STAT.CONSULT in 1996. Thanks go to Ira Bernstein, Ronan Conroy, and Frank Harrell for their detailed explanations and to Richard Ulrich, who originally compiled these comments. I have done some very minor editing (mostly adding and changing line breaks) but have tried to avoid any substantive changes to these well-written explanations.

Frank Harrell's comments:

Here are SOME of the problems with stepwise variable selection.

1. It yields R-squared values that are badly biased high.

2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.

3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Statistics in Medicine, 1989).

4. It yields P-values that do not have the proper meaning, and the proper correction for them is a very difficult problem.

5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).

6. It has severe problems in the presence of collinearity.

7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses.

8. Increasing the sample size doesn't help very much (see Derksen and Keselman, 1992).

9. It allows us to not think about the problem.

10. It uses a lot of paper.

Note that 'all possible subsets' regression does not solve any of these problems.
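
Point 1 (and the remark about candidate variables in the Derksen and Keselman reference below) can be illustrated with a small simulation. What follows is a minimal sketch and not part of the original e-mails: it assumes NumPy is available, uses a simple greedy forward selection written for this illustration (not the stepwise routine of SAS, SPSS, or any other package), and the function names, sample size, and number of candidates are arbitrary choices. The response is generated independently of every predictor, so any R-squared above what five pre-specified noise predictors would give is produced by the selection itself.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss

def forward_select(X, y, n_keep):
    """Greedy forward selection: at each step add the column that raises R-squared most."""
    chosen = []
    for _ in range(n_keep):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(remaining, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen

def simulate(n=50, n_candidates=30, n_keep=5, n_reps=200, seed=0):
    """Mean R-squared for selected vs. pre-specified predictors when y is pure noise."""
    rng = np.random.default_rng(seed)
    selected, prespecified = [], []
    for _ in range(n_reps):
        X = rng.standard_normal((n, n_candidates))
        y = rng.standard_normal(n)  # y is independent of every column of X
        cols = forward_select(X, y, n_keep)
        selected.append(r_squared(X[:, cols], y))
        # Comparison: the first n_keep columns, fixed before looking at y.
        prespecified.append(r_squared(X[:, :n_keep], y))
    return float(np.mean(selected)), float(np.mean(prespecified))

if __name__ == "__main__":
    sel, pre = simulate()
    print(f"mean R^2, 5 of 30 noise predictors chosen by forward selection: {sel:.3f}")
    print(f"mean R^2, 5 noise predictors fixed in advance:                  {pre:.3f}")
```

Both models contain five predictors with no true relationship to the response; the only difference is whether the data were used to pick them. Rerunning with a larger n_candidates shows how the gap depends on the number of candidates searched rather than on the number of variables retained.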

References

@article{alt89,author = "Altman, D. G. and Andersen, P. K.",journal = "Statistics in Medicine",pages = "771-783",title = "Bootstrap investigation of the stability of a {C}ox regression model",volume = "8",year = "1989",annote = "Shows that stepwise methods yield confidence limits that are far too narrow."}

@article{der92bac,author = {Derksen, S. and Keselman, H. J.},journal = {British Journal of Mathematical and Statistical Psychology},pages = {265-282},title = {Backward, forward and stepwise automated subset selection algorithms: {F}requency of obtaining authentic and noise variables},volume = {45},year = {1992},annote = {variable selection. Conclusions: "The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model. The number of candidate predictor variables affected the number of noise variables that gained entry to the model. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model. The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model."}}

@article{roe91pre,author = {Roecker, Ellen B.},journal = {Technometrics},pages = {459-468},title = {Prediction error and its estimation for subset--selected models},volume = {33},year = {1991},annote = {Shows that all-possible regression can yield models that are "too small".}}

@article{man70why,author = {Mantel, Nathan},journal = {Technometrics},pages = {621-625},title = {Why stepdown procedures in variable selection},volume = {12},year = {1970},annote = {variable selection; collinearity}}

@article{hur90,author = "Hurvich, C. M. and Tsai, C. L.",journal = "The American Statistician",pages = "214-217",title = "The impact of model selection on inference in linear regression",volume = "44",year = "1990"}

@article{cop83reg,author = {Copas, J. B.},journal = "Journal of the Royal Statistical Society B",pages = {311-354},title = {Regression, prediction and shrinkage (with discussion)},volume = {45},year = {1983},annote = {shrinkage; validation; logistic model. Shows why the number of CANDIDATE variables and not the number in the final model is the number of d.f. to consider.}}

@article{tib96reg,author = {Tibshirani, Robert},journal = "Journal of the Royal Statistical Society B",pages = {267-288},title = {Regression shrinkage and selection via the lasso},volume = {58},year = {1996},annote = {shrinkage; variable selection; penalized MLE; ridge regression}}

Ira Bernstein's comments:

I think that there are two distinct questions here: (a) _when_ is stepwise selection appropriate and (b) _why_ is it so popular.

Since I have seen some variation in usage of the term "stepwise", I define it as any of a number of _data_ driven variable selection schemes used in regression and discriminant analysis, among other applications. Some, inappropriately IMHO (since there is no official body to define "appropriate"), use it to describe what I would call hierarchical (_hypothesis_ driven) selection. Like I would assume many, I would discourage stepwise selection and encourage hierarchical selection. I, of course, assume the researcher does not "cheat" by defining his/her "hierarchy" given the data but does so by considering alternatives in advance of analysis and, preferably, replicates the study (dream on).

I would probably only argue slightly with "never" as an answer to the use of stepwise selection since I don't know what knowledge we would lose if all papers using stepwise regression were to vanish from journals at the same time programs providing their use were to become terminally virus-laden. However, I have been in situations that looked like "I have good reason to look at variables A, B, and C; then look at D, and E, but I have no basis to favor F over G or vice versa past that point." Older versions of SPSS (I haven't used newer versions since switching to SAS a decade ago) allowed this mixture, and I would personally not object to it as long as the strategy were defined in advance and made clear to readers.

As to part (b), I think that there are two groups that are inclined to favor its usage. One consists of individuals with little formal training in data analysis who confuse knowledge of data analysis with knowledge of the syntax of SAS, SPSS, etc. They seem to figure that "if it's there in a program, it's gotta be good and better than actually thinking about what my data might look like". They are fairly easy to spot and to condemn in a right-thinking group of well-trained data analysts (like ourselves). However, there is also a second group who are often well trained (and may be here in this group ready to flame me). They believe in statistics uber alles--given any properly obtained data base, a suitable computer program can objectively make substantive inferences without active consideration of the underlying hypotheses. If stepwise selection is the parent of this line of blind data analysis, then automatic variable respecification in confirmatory factor analysis is the child.

Ronan Conroy's comments:

I am struck by the fact that Judd and McClelland in their excellent book "Data Analysis: A Model Comparison Approach" (Harcourt Brace Jovanovich, ISBN 0-15-516765-0) devote less than 2 pages to stepwise methods. What they do say, however, is worth repeating:

1. Stepwise methods will not necessarily produce the best model if there are redundant predictors (common problem).

2. All-possible-subset methods produce the best model for each possible number of terms, but larger models need not necessarily be subsets of smaller ones, causing serious conceptual problems about the underlying logic of the investigation.

3. Models identified by stepwise methods have an inflated risk of capitalising on chance features of the data. They frequently fail when applied to new datasets. They are rarely tested in this way.

4. Since the interpretation of coefficients in a model depends on the other terms included, "it seems unwise," to quote J and McC, "to let an automatic algorithm determine the questions we do and do not ask about our data". RC adds that abusers of stepwise methods frequently would rather not think about their data, for reasons that are funny to describe over a second Guinness.

5. I quote this last point directly, as it is sane and succinct: "It is our experience and strong belief that better models and a better understanding of one's data result from focussed data analysis, guided by substantive theory." (p 204)

They end with a quote from Henderson and Velleman's paper "Building multiple regression models interactively" (Biometrics 1981;37:391-411): "The data analyst knows more than the computer", and add "failure to use that knowledge produces inadequate data analysis."

Personally, I would no more let an automatic routine select my model than I would let some best-fit procedure pack my suitcase.
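
Point 3 in the list above (models that capitalise on chance and then fail on new datasets) can be checked directly by refitting nothing and simply scoring the selected model on fresh data. The following is a minimal sketch under stated assumptions, not anything taken from Judd and McClelland or from the e-mails: it assumes NumPy, a simple greedy forward selection written for illustration, one genuinely predictive variable among 25 candidates, and arbitrary sample sizes and function names.

```python
import numpy as np

def fit(X, y):
    """OLS coefficients (intercept first) for y regressed on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def r2(beta, X, y):
    """R-squared obtained when the fixed coefficients beta are applied to (X, y)."""
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, n_keep):
    """Greedy forward selection on in-sample R-squared."""
    chosen = []
    for _ in range(n_keep):
        rest = [j for j in range(X.shape[1]) if j not in chosen]

        def score(j):
            cols = chosen + [j]
            return r2(fit(X[:, cols], y), X[:, cols], y)

        chosen.append(max(rest, key=score))
    return chosen

def optimism(n=60, n_candidates=25, n_keep=5, n_reps=200, seed=0):
    """Mean R-squared of the selected model on its own data and on a fresh sample."""
    rng = np.random.default_rng(seed)
    own, fresh = [], []
    for _ in range(n_reps):
        X = rng.standard_normal((2 * n, n_candidates))
        y = 0.5 * X[:, 0] + rng.standard_normal(2 * n)  # only column 0 matters
        Xa, ya, Xb, yb = X[:n], y[:n], X[n:], y[n:]
        cols = forward_select(Xa, ya, n_keep)   # selection uses the first half only
        beta = fit(Xa[:, cols], ya)             # coefficients from the first half only
        own.append(r2(beta, Xa[:, cols], ya))
        fresh.append(r2(beta, Xb[:, cols], yb))  # same model, new observations
    return float(np.mean(own)), float(np.mean(fresh))

if __name__ == "__main__":
    own, fresh = optimism()
    print(f"mean R^2 on the data used for selection: {own:.3f}")
    print(f"mean R^2 of the same model on new data:  {fresh:.3f}")
```

The second half of each simulated dataset plays the role of the "new dataset" that, as the point notes, selected models are rarely tested on.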
