Ratio of observations to independent variables (November 17, 2004). [Incomplete]
A widely quoted rule of thumb is that you need 10 or 15 observations per independent variable in a regression model. The original source of this rule is difficult to find. I commented briefly on this in an earlier weblog entry, but here is a fuller explanation.
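As a quick illustration of the arithmetic behind the rule, here is a minimal Python sketch. The function name `max_predictors` and the sample size of 120 are hypothetical, chosen only for this example; the rule itself is a rough guide, not a guarantee.

```python
def max_predictors(n_observations, per_variable):
    """Largest number of independent variables the rule of thumb allows."""
    if per_variable <= 0:
        raise ValueError("per_variable must be positive")
    return n_observations // per_variable

n = 120  # hypothetical sample size, not taken from any real study

# The rule is quoted with either 10 or 15 observations per variable.
strict = max_predictors(n, 15)
lenient = max_predictors(n, 10)
print(f"With {n} observations: between {strict} and {lenient} independent variables")
```

With 120 observations, the rule would suggest somewhere between 8 and 12 independent variables, depending on which version you quote.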
When you build a regression model using a stepwise variable selection process (or something similar to stepwise selection), there is substantial reason for caution. Stepwise selection tends to choose regression models that do not replicate well in new data. I summarized some arguments against stepwise variable selection as part of the STAT-L FAQ.
Frank Harrell did some empirical investigation of stepwise variable selection in the logistic regression model and the Cox proportional hazards regression model. For these models, it is not the number of observations that is important, but the number of events. Suppose you study thousands of patients in two equally sized groups and find that four die in the control group, but only two die in the treatment group. That represents a halving of the mortality rate, yet no one would trust those results. Your sample size is effectively those six deaths rather than the thousands of patients being studied.
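To see how little information those six deaths carry, here is a minimal Python sketch that applies Fisher's exact test to the example. The group sizes of 1,000 patients each are an assumption (the example only says "thousands"), the function name `fisher_two_sided` is made up for illustration, and the test itself is my choice of a simple check, not something taken from Harrell's work.

```python
from math import comb

def fisher_two_sided(deaths_a, n_a, deaths_b, n_b):
    """Two-sided Fisher's exact test p-value for a 2x2 table of deaths by group."""
    total_deaths = deaths_a + deaths_b
    denom = comb(n_a + n_b, total_deaths)

    def prob(k):
        # Hypergeometric probability that group A has exactly k of the deaths.
        return comb(n_a, k) * comb(n_b, total_deaths - k) / denom

    lowest = max(0, total_deaths - n_b)
    highest = min(total_deaths, n_a)
    observed = prob(deaths_a)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    p = sum(prob(k) for k in range(lowest, highest + 1)
            if prob(k) <= observed * (1 + 1e-9))
    return min(p, 1.0)

# Assumed group sizes: 1,000 patients per arm (hypothetical).
n_control, n_treatment = 1000, 1000
deaths_control, deaths_treatment = 4, 2

relative_risk = (deaths_treatment / n_treatment) / (deaths_control / n_control)
p_value = fisher_two_sided(deaths_control, n_control, deaths_treatment, n_treatment)

print(f"Relative risk: {relative_risk:.2f}")
print(f"Two-sided p-value: {p_value:.3f}")
```

Under these assumed group sizes, the relative risk is 0.5 but the p-value is about 0.69, so a split of four deaths versus two is entirely consistent with chance. Adding more patients without adding more deaths would barely change that conclusion.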
This webpage was written by Steve Simon on 2004-11-17, edited by Steve Simon, and was last modified on 2008-07-08. This page needs minor revisions. Category: Ask Professor Mean, Category: Sample size justification
