A while back, the IRB asked me to look into a randomized study where the
interim report indicated a huge disparity between the two treatment arms. One arm
of the study had almost all good outcomes and the other arm had almost all
bad outcomes or at best no improvement. The sample size, though, was only 20
patients, and the protocol had no formal rule for stopping the study early.
Even without such a rule, a careful analysis of the data revealed that there
was little justification for continuing randomization when one arm of the
study was clearly inferior.
The principal investigator and the IRB both agreed, so we stopped the study,
wrote up the results, and published them. The PI wrote back to me a few months
later with the following comments (loosely paraphrased to simplify the
discussion and to preserve confidentiality).
I have now presented this paper to two separate groups of docs
here at CMH and once nationally and I have heard a consistent theme
of questioning regarding the discontinuation of the study. The most notable
commentary came from a doctor who is also chair of a Data Safety and
Monitoring Board. He was surprised that we didn't enroll at least a third
of the targeted 144 patients and questioned why we even looked upon
reaching the initial 20 patients.
He had a hard time arguing the statistics but asked how we knew that the
next 20 patients wouldn't have shown the opposite results. I think
most people are having trouble accepting the findings based on the final
enrollment of only 20 patients, the fact that we didn't control for an
important confounding variable, and the fact that the [inferior arm] in
this study is a pretty deeply rooted therapy modality in many medical
settings. It's also unfortunate that [an important covariate] was worse for
the [inferior arm] than for the [superior arm] which may have contributed
to us not seeing much improvement in that group.
The paper has now been accepted for an oral presentation at another
major medical meeting. I would like to talk over some of the above
noted concerns as I try to figure out the next step for this project and as
I make sure that I have the best answers in hand for rebuttal of the
questions that are being raised.
Here's a summary of what I replied by email (again with some paraphrasing).
People will always be skeptical, so there's only so much you can do.
Here's how I would argue this.
The comment "how do we know that the next 20 people wouldn't show the
opposite result?" is more than a bit silly. If you flipped a coin 20 times
and it came up heads 20 times, you'd suspect that the coin was loaded. But
what they are saying is "why don't you flip it 20 more times, because you
don't know, maybe it will come up tails the next 20 times." We have to
assume that our universe shows some level of order and consistency. If we
jump off a cliff 20 times and each time we break a leg, how do we know that
we won't land softly and safely the next 20 times? Furthermore, that
comment is one that can apply to any sample size. If we get a certain
result with 200 patients, how do we know that the next 200 patients won't
show the opposite results? If we get a certain result with 2,000 patients,
how do we know that the next 2,000 patients won't show the opposite
results?
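
To put a number on the coin analogy, here is a minimal Python sketch of the chance that a fair coin produces an unbroken run of heads. It assumes independent flips with probability 1/2 each; the function name prob_all_heads is made up for this illustration.

```python
from fractions import Fraction


def prob_all_heads(n_flips: int) -> Fraction:
    """Exact probability that a fair coin shows heads on every one of n_flips.

    Assumes independent flips, each with probability 1/2 of heads.
    """
    return Fraction(1, 2) ** n_flips


for n in (10, 20):
    p = prob_all_heads(n)
    # Show both the exact fraction and a decimal approximation.
    print(f"{n} straight heads: {p} (about {float(p):.2e})")
```

The run of 10 heads comes out at roughly one chance in a thousand, and the run of 20 at roughly one chance in a million. The skeptic's question about "the next 20" does not change either number.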
Why stop at 20 rather than at a third or half of the patients? You can
argue that this was driven by the IRB concerns, or that you thought a
yearly review was appropriate. A big weakness here is that you did not
specify the criteria for stopping early in the protocol itself. That would
have been nice, but you were seeing an all-or-nothing phenomenon where the
worst patient in one arm is still better off than any patient in the other
arm. You were just uncomfortable continuing the study in the face of such
an extreme finding.
An all-or-nothing finding starts to become convincing when the total
sample size is 10 or so. Go back to the coin analogy. Who among even the
most skeptical of your colleagues would believe that a coin was fair after
seeing 10 straight heads come up?
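
The same reasoning can be sketched for a two-arm trial. Under the null hypothesis of no treatment difference, randomization makes every split of the patients into the two arms equally likely, so the chance that one arm ends up holding all of the best outcomes is one over the number of possible splits. The Python below is an illustration under stated assumptions, not the analysis actually run for this study: it assumes no tied outcomes, the arm sizes of 5 and 5 or 10 and 10 are hypothetical (the actual split is not given here), and the function name complete_separation_p is invented for this example.

```python
from math import comb


def complete_separation_p(n_a: int, n_b: int, two_sided: bool = True) -> float:
    """Exact permutation p-value when the two arms do not overlap at all.

    Under the null hypothesis, each of the comb(n_a + n_b, n_a) ways of
    assigning patients to arm A is equally likely. Only one assignment puts
    all of the best outcomes in arm A (and one puts all of the worst there).
    Assumes no ties between outcomes.
    """
    arrangements = comb(n_a + n_b, n_a)
    extreme = 2 if two_sided else 1
    return min(1.0, extreme / arrangements)


# Hypothetical equal arm sizes, chosen only for illustration.
for per_arm in (5, 10):
    p = complete_separation_p(per_arm, per_arm)
    print(f"{per_arm} per arm, complete separation: two-sided p = {p:.2g}")
```

With 5 patients per arm the two-sided p-value is 2/252, or about 0.008, and with 10 per arm it is 2/184,756, or about 0.00001. Unequal arms would give somewhat larger values, which is why the arm sizes are flagged as an assumption.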
Lack of control for and imbalance in [an important covariate] is
indeed a problem. I think the magnitude of the difference seen here is so
extreme that it is unlikely to be caused by [this covariate]. But that has
to be a qualitative argument, because we don't have the proper data to test
this formally.
When the first research on smoking and cancer came out, it was based
on imperfect data, but the magnitude of the effect was so large that only
a fool (or a tobacco company lawyer) would argue that this was caused by
the imperfections in the data.
The fact that the [inferior arm] is pretty deeply rooted is not a
serious argument. Hormone replacement therapy for post-menopausal women was
also a deeply rooted practice, and it did not hold up well once it was tested
in large randomized trials.
It would also help for you to review a good article that has argued
that failure to stop some of these trials early has led to serious ethical
lapses.
Safeguarding patients in clinical trials with high mortality rates.
Bradley D. Freeman, Robert L. Danner, Steven M. Banks, Charles Natanson. Am
J Respir Crit Care Med 2001: 164(2); 190-192.
[Full text]
http://ajrccm.atsjournals.org/cgi/content/full/164/2/190
[PDF]
http://ajrccm.atsjournals.org/cgi/reprint/164/2/190.pdf
This is not to say that the people raising these concerns are wrong. There
is always some level of ambiguity in research and different people will draw
differing conclusions from the same data. If people are really upset, they
always have the option of replicating this research at their site.