I was asked to give a talk to the medical residents with the title
"Statistics for Boards". Many health care professionals need to take boards
or other certifying examinations during their training and afterwards to
certify or re-certify their skill in an area. These boards often ask some
basic statistics questions. A common theme appears to be: what statistic
should I use in what situation? The answer often depends on the type of the
predictor variable and the type of the outcome variable. Either variable
could be
- binary (two possible values),
- categorical (more than two possible values, but still a small
number), or
- continuous (a large number of values, potentially any number
inside a particular range).
In theory, the categorical type also includes binary variables, but I am
deliberately separating the two.
Some people divide continuous variables into those that are normally
distributed (their histogram follows a bell shaped curve) and those that are
non-normal. I dislike such distinctions for a variety of reasons, but they
don't ask me to write these exams. Normality is never an issue for the
predictor variable, only for the outcome variable.
There are often variables that are difficult to place in this
classification scheme, but don't worry about these. The goal of the boards is
not to trip people up with technical distinctions, but rather to see if you
understand some fundamental distinctions among various statistical analysis
methods.
Here are some examples of binary variables:
- exposure status (exposed or unexposed),
- sex (male or female), and
- drug (active or placebo)
Here are some examples of categorical variables:
- cancer stage (Stage I, II, III, or IV),
- race/ethnicity (white/black/Hispanic/other), and
- Likert scale (strongly disagree, disagree, neutral, agree, strongly
agree).
Here are some examples of continuous variables:
- body mass index (any value between 15 and 50 is possible),
- patient's age in years (any value between 1 and 99), and
- length of stay (any value between 1 day and 1 year).
Here's a simple description of the statistical methods that are typically
applied. I want to provide some of the 'buzzwords' that you are likely to
encounter without providing an in-depth discussion of any particular method.
These questions are usually multiple choice. I'll list the most commonly
cited answer first, but include some variants that you might encounter.
Binary predictor and binary outcome. Chi-square test (also written
Chisquare, Chi Squared, etc.). For small sample sizes some people will
recommend a continuity correction or the use of Fisher's Exact Test. In
theory, you can use logistic regression here, but most exams will not be
looking for or mentioning this option. It's possible that the exam writers
are looking for an odds ratio here or a relative risk. Don't suggest a
relative risk if the data comes from a case-control design.
Possible question: A study is examining demographic factors such as
employment status (full/part time work vs. unemployed) and educational level
(high school diploma or better versus no high school degree) to see if they
are associated with intestinal parasites (present or absent). What
statistical test would you use?
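To make the buzzwords above a little more concrete, here is a minimal Python sketch that computes the Pearson chi-square statistic (without the continuity correction), the odds ratio, and the relative risk from a two by two table. The function name `two_by_two_summary` and the counts are invented for illustration; they do not come from any real study.

```python
def two_by_two_summary(a, b, c, d):
    """Summarize a 2x2 table laid out as

                     outcome present   outcome absent
        exposed             a                b
        unexposed           c                d
    """
    n = a + b + c + d
    # Pearson chi-square statistic, 1 degree of freedom, no continuity correction.
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Odds ratio: odds of the outcome among the exposed over odds among the unexposed.
    odds_ratio = (a * d) / (b * c)
    # Relative risk: only meaningful when the rows are sampled as cohorts,
    # not when the data comes from a case-control design.
    relative_risk = (a / (a + b)) / (c / (c + d))
    return chi_square, odds_ratio, relative_risk


# Hypothetical counts: parasites present in 20 of 50 unemployed subjects
# and in 10 of 50 employed subjects.
chi_square, odds_ratio, relative_risk = two_by_two_summary(20, 30, 10, 40)
print(round(chi_square, 2), round(odds_ratio, 2), round(relative_risk, 2))
```

With these made-up counts the chi-square statistic is larger than 3.84, the 5% critical value for one degree of freedom, so the association would be declared statistically significant at the usual level.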
Categorical predictor and binary outcome. Chi-square test again.
Fisher's Exact Test will not be an option. Technically, an extension is
available, but ignore that. Logistic regression is also a possibility.
Possible question: Dental students were asked what influences were very
important in helping them choose a career in Dentistry such as "regular
working hours". Influences rated as very important were coded as 1 and
influences rated only important or lower were coded as 0. What statistic
would you use to examine the association between influence and the
race/ethnicity of the respondent?
Continuous predictor and binary outcome. Logistic regression. There
are no other serious competitors here.
Possible question: A group of 110 elderly patients were followed over a
two-year span to estimate the prevalence of falls and how it might be
predicted by the patient's age. What statistical model would be appropriate
here?
Binary predictor and categorical outcome. Chi-square test again. This
type of question is less likely to appear. For certain categorical outcomes
that represent ranks or ordinal variables, consider the responses under
binary predictor and continuous but non-normal outcome.
Binary predictor and continuous outcome. T-test. If the data is
unmatched, then specify a two sample t-test or an independent samples t-test.
If the data is matched, then specify a paired t-test.
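The difference between the unmatched and the matched case can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a replacement for a statistics package: the function names and the data are made up, the two-sample version assumes equal variances in the two groups, and only the test statistic and degrees of freedom are computed (no p-value).

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2


def paired_t(before, after):
    """Paired t statistic: a one-sample t-test on the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / sqrt(variance(diffs) / n)
    return t, n - 1


# Hypothetical unmatched data: two independent groups of five patients.
print(two_sample_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))

# Hypothetical matched data: the same four patients measured twice.
print(paired_t([10, 12, 14, 16], [11, 14, 15, 18]))
```

The paired version works on the differences within each pair, which is why the two lists must have the same length and the same patient order.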
Binary predictor and continuous but non-normal outcome.
Mann-Whitney-Wilcoxon test. The name of this test has several permutations
that put the names in a different order or that ignore the contribution of
Dr. Wilcoxon. If the data is matched, then specify a Wilcoxon signed ranks
test.
Categorical predictor and binary outcome. Chi-square test. Logistic
regression is also a solid choice here.
Categorical predictor and categorical outcome. Chi-square test. Some
people will use the term contingency table analysis. In some situations, a
specialized logistic regression model might work (ordinal logistic
regression, multinomial logistic regression) but these choices are too
technical to be on a board exam.
Categorical predictor and continuous outcome. Analysis of variance
(ANOVA).
Categorical predictor and continuous but non-normal outcome. Kruskal-Wallis
test. Sometimes this is called rank ANOVA or nonparametric ANOVA.
Continuous predictor and binary outcome. Logistic regression.
Continuous predictor and categorical outcome. This scenario is
certainly possible, but will almost never be used in a board exam. For the
record, you need to use a specialized logistic regression model like ordinal
logistic or multinomial logistic regression.
Continuous predictor and continuous outcome. Linear regression is
your best choice here. A correlation coefficient (Pearson correlation,
product moment correlation) might also be a possibility.
Continuous predictor and continuous but non-normal outcome. Spearman
correlation coefficient. Another good choice is Kendall's correlation
coefficient.
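The whole list above boils down to a lookup on the pair (predictor type, outcome type). Here is one way to write that lookup in Python as a study aid. The dictionary name `TEST_FOR`, the function `suggest_test`, and the label "non-normal" (shorthand for "continuous but non-normal") are my own choices for illustration, and only the first-listed answer for each combination is stored.

```python
# Keys are (predictor type, outcome type); values are the first-listed
# answer for that combination in the list above.
TEST_FOR = {
    ("binary", "binary"): "chi-square test",
    ("binary", "categorical"): "chi-square test",
    ("binary", "continuous"): "t-test",
    ("binary", "non-normal"): "Mann-Whitney-Wilcoxon test",
    ("categorical", "binary"): "chi-square test",
    ("categorical", "categorical"): "chi-square test",
    ("categorical", "continuous"): "analysis of variance (ANOVA)",
    ("categorical", "non-normal"): "Kruskal-Wallis test",
    ("continuous", "binary"): "logistic regression",
    ("continuous", "categorical"): "ordinal or multinomial logistic regression",
    ("continuous", "continuous"): "linear regression",
    ("continuous", "non-normal"): "Spearman correlation coefficient",
}


def suggest_test(predictor, outcome):
    """Return the most commonly cited method for a predictor/outcome pair."""
    return TEST_FOR[(predictor, outcome)]


print(suggest_test("continuous", "binary"))
print(suggest_test("categorical", "continuous"))
```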
Other areas of statistics that a board exam might cover are:
- definitions of sensitivity and specificity,
- interpretation of confidence intervals and p-values,
- different epidemiological designs (e.g., case-control design, cohort
design).
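For the first item in that list, the standard definitions are short enough to show as a sketch. Sensitivity is the proportion of people with the disease who test positive, and specificity is the proportion of people without the disease who test negative. The function name and the counts below are hypothetical.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from the four cells of a diagnostic 2x2 table."""
    # Sensitivity: among the diseased, the fraction who test positive.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: among the healthy, the fraction who test negative.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical counts: 50 diseased patients (45 test positive) and
# 100 healthy patients (80 test negative).
sens, spec = sensitivity_specificity(true_pos=45, false_neg=5, true_neg=80, false_pos=20)
print(sens, spec)
```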