Category: Modeling issues. These pages discuss issues about statistical models which are relevant across a broad class of models. These pages may mention a specific model like logistic regression to provide context, but the ideas generalize easily to other models. Articles are arranged by date with the most recent entries at the top. You can find the theme and closely related categories and other resources at the bottom of this page.
Stats: Presenting unadjusted and adjusted estimates side by side (March 24, 2008). Someone on the Medstats discussion group asked about reporting the analysis of a model without adjustment for covariates along with the analysis adjusted for covariates. What is the purpose of reporting the unadjusted analysis?
Stats: Assessing the assumption of an exponential distribution (February 25, 2008). The following 41 observations: 8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2, 12, 1, 5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5, represent waiting times that we suspect follow an exponential distribution. There are several ways to examine this belief, and the simplest way is to draw a Q-Q plot for the exponential distribution.
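A minimal sketch of how the points for an exponential Q-Q plot could be computed from these 41 waiting times. It assumes plotting positions of (i - 0.5)/n, which is one common convention and not necessarily the one used in the article, and the function name exponential_qq is hypothetical. For an exponential distribution with mean m, the quantile at probability p is -m * ln(1 - p), so if the assumption is reasonable the sorted observations plotted against the unit-mean quantiles should fall near a straight line through the origin with slope close to the sample mean.

```python
import math

# Waiting times quoted in the entry above (41 observations).
waits = [8, 2, 26, 29, 1, 2, 11, 8, 0, 5, 10, 1, 4, 9, 12, 3, 6, 5, 2, 12, 1,
         5, 3, 5, 7, 0, 2, 8, 3, 3, 1, 0, 4, 8, 1, 8, 12, 0, 6, 1, 5]


def exponential_qq(data):
    """Return (theoretical, observed) pairs for an exponential Q-Q plot.

    Theoretical quantiles are for an exponential distribution with mean 1.
    The plotting position (i - 0.5) / n is an assumed convention.
    """
    n = len(data)
    observed = sorted(data)
    theoretical = [-math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


pairs = exponential_qq(waits)
mean_wait = sum(waits) / len(waits)

# Compare each observation with the value the exponential model would predict.
for q, x in pairs:
    print(f"unit-mean quantile {q:5.2f}   expected {q * mean_wait:6.2f}   observed {x:3d}")
```

Plotting the second column against the third (with any plotting tool) gives the Q-Q plot; points that drift away from the 45-degree line, especially in the upper tail, suggest a departure from the exponential distribution.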
Stats: When should you use a log transformation? (December 28, 2007). Dear Professor Mean, How do I know whether it is appropriate to use a log transformation for my data?
Stats: The order of entering interactions into a model (September 20, 2007). Dear Professor Mean, I like your titanic example! But shouldn't you enter the interaction term on a second step following entry of the main effects on the first step? If you enter the terms all at the same time, the interaction term will compete for variance with the two main effects on which it depends.
Stats: Are we assuming a normal sample or a normal population? (August 30, 2007). Dear Professor Mean, I'm fitting an ANOVA model to a sample of 25 observations, and the data is skewed. I'm quite worried about this, but my husband reassures me that this is not a problem. He says that the assumption is that the population is normal, not the sample. Should I listen to him?
Stats: How good is my prediction? (August 13, 2007). Dear Professor Mean, I have two time series of data, one actual and one predicted. Since I'm quite new to statistical methods, I would like to know what methods are used to evaluate the difference between the two time series. I would like to be able to say something like "the predicted values were 70% accurate."
Stats: Frank Harrell's Philosophy of Biostatistics (October 10, 2006). There are a lot of people in the world who are a lot smarter than I am and it is always a humbling experience when I recognize how little I really know. Frank Harrell, chair of the Department of Biostatistics at Vanderbilt University, is one of those people.
Stats: Slash and burn models (June 26, 2006). I received an email question about developing a logistic regression model with some interaction terms. One of the interaction terms was statistically significant but one or both of the main effects associated with the interaction was not. So is it okay, I was asked, to include the interaction in the final model but not the non-significant main effects? First, I need to comment on the "slash and burn" model building practice that this person is using. A recent posting to the MedStats email discussion group outlines problems with this approach (although it does not use the term "slash and burn"). The person who adopts a "slash and burn" approach to models has a parsimonious intent. He/she wants to use as few degrees of freedom as possible in the final statistical model and one way to do this is to strip out anything that has a non-significant p-value. The ideal in the "slash and burn" world is a model where every single p-value is smaller than 0.05.
Stats: Multicollinearity is not a violation of assumptions (January 20, 2006). A colleague from my days at the National Institute for Occupational Safety and Health emailed me a question. Apparently, one of the co-authors of a paper he is writing is in a bit of a panic because the linear regression model that they are using has multicollinearity. She calls this a violation of assumptions and wonders if she should look at certain transformations that are difficult to interpret but which remove much of the multicollinearity. To me this seems like jumping from the frying pan into the fire.
Stats: I abhor Lilliefor and other tests of normality (April 14, 2005). Someone asked me about a log transformation for their data. It seemed like a good idea, based on my general comments on the log transformation, but the test of normality (the Lilliefors test) still rejected the hypothesis of normality even after the log transformation. In general, I dislike the Lilliefors test (and other tests of normality like the Shapiro-Wilk test).
Stats: Discrepancy between univariate and multivariate models (November 12, 2004). Someone asked me about an analysis that showed certain factors were predictive of a health outcome when considered individually. When these factors were included in a multivariate model that included other factors, they were no longer statistically significant. This is worth investigating further but perhaps you need to live with a bit of ambiguity in the data.
Stats: What is the best statistical model? (September 17, 2004). Someone asked me by email about the advantages and disadvantages of various statistical models (multinomial logistic regression, ordinal logistic regression, and structural equation models). This is a somewhat difficult question to answer by email, but as a general rule, I think that people worry too much about the particular model that they choose.
Stats: Central Limit Theorem (March 9, 2004). Dear Professor Mean, How does the central limit theorem affect the statistical tests that I might use for my data?
Stats: What does "overfitting" mean? (July 24, 2003). Dear Professor Mean, I am conducting binary logistic regression analyses with a sample size of 80 of which 20 have the outcome of interest (e.g. are "very successful" versus somewhat/not very successful). I have thirty possible independent variables which I examined in a univariate logistic regression with the dependent variable. Of these thirty, a handful look like they might have a relationship with the dependent variable. Now I want to include these variables in a stepwise logistic regression model, but I am worried about overfitting the data. I have heard that there should be about 10 cases with the outcome of interest per independent variable to avoid overfitting. What exactly does overfitting mean?
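The rule of thumb quoted in this letter can be turned into a short calculation. The sketch below is only an illustration of the arithmetic, not the article's own analysis; the function name max_predictors and its default of 10 events per variable are assumptions taken from the wording of the letter.

```python
def max_predictors(events, events_per_variable=10):
    """Rough ceiling on the number of independent variables under the
    'about 10 cases with the outcome per variable' rule of thumb."""
    return events // events_per_variable


# Figures from the letter: 80 subjects, 20 of whom have the outcome of interest.
sample_size = 80
events = 20

limit = max_predictors(events)
print(f"{events} events out of {sample_size} subjects supports about {limit} variables")
```

With 20 events, the rule suggests roughly two independent variables, which is well below both the thirty candidates that were screened and the smaller set carried forward to the stepwise model.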
Stats: Log transformation (October 11, 2002). Dear Professor Mean, I have some data that I need help analyzing. One suggestion is that I use a log transformation. Why would I want to do this? -- Stumped Susan
Stats: Checking the assumption of normality (September 11, 2002). Dear Professor Mean, I have some data that don't seem to meet the assumption of normality. What should I do? -- Anxious Abby
Stats: What is collinearity? (January 27, 2000). Dear Professor Mean, Could you describe the term collinearity for me? I understand that it has to do with variables which are not totally independent, but that is all I know!
Stats: Best fitting curve (January 26, 2000). Dear Professor Mean: I have a graph of the trend for the mean frequency of injuries among children from 1 to 11 years of age. The shape of the curve suggests a nonlinear relationship between the age and the frequency of injuries. Is there some software that would provide the best fitting curve for this data from among a large family of nonlinear curves?
Theme and closely related categories:
- Theme: Data analysis
- Category: Covariate adjustment
- Category: Linear regression
- Category: Logistic regression
- Category: Unusual data
- Negative Consequences of Dichotomizing Continuous Predictor Variables. Description: This Java applet shows graphically how creating a median split for a predictor variable leads to loss of precision and power.
- Simpson's Paradox, Lord's Paradox, and Suppression Effects are the same phenomenon - the reversal paradox. Description: This article provides a nice overview of how associations between two variables can be modified by a third variable.
This webpage was written by Steve Simon on 2007-08-13, edited by Steve Simon, and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.
