Statistical Evidence. Chapter 4. What do the other witnesses say? Corroborating Evidence.
[This is the first draft for Chapter 4 of "Statistical Evidence"]
4.0 Introduction
In a criminal trial, the prosecutor will sometimes try to demonstrate that the defendant had:
- the means to commit the crime;
- the motive to commit the crime; and
- the opportunity to commit the crime.
Not all three elements are necessary for a conviction--many people are convicted without any showing of motive, for example. But when the prosecution can identify a motive, its case becomes that much more convincing.
This analogy also holds for research studies. Some studies are so well done that their evidence alone would be enough to convince you. Other studies, however, provide only weak evidence. But when this evidence is combined with other information, the evidence can become quite strong.
Sir Austin Bradford Hill outlined a series of tests that you could use to evaluate whether an association between an environmental factor and disease was credible (Hill 1965). These criteria are not perfect. In particular, no one criterion by itself will establish the credibility of a research study if it is present, and no one criterion will destroy the credibility of a study if it is absent. You should look at the aggregate impact of these factors. When most of them are present, they add to the credibility of a study. When they are absent, they weaken the credibility of the study. As Sir Austin Bradford Hill himself notes:
"All scientific work is incomplete - whether it be observational or experimental. All scientific work is liable to be upset or modified by advancing knowledge. This does not confer upon us a freedom to ignore the knowledge we already have, or to postpone the action that it appears to demand at a given time. Who knows, asked Robert Browning, but that the world may end to-night? True, but on available evidence most of us make ready to commute on the 8:30 next day."
Case Study: A Drug Treatment That Only Works in Black Patients
There has been a lot of published research showing that heart disease is different and more deadly among black patients. Some possible explanations of these differences involve the renin-angiotensin system and the bioavailability of nitric oxide. In a study that seemed to show no overall differences in efficacy for a drug treatment, hydralazine plus isosorbide dinitrate, for treating congestive heart failure, there was nevertheless the suggestion that this treatment might be effective when analysis was restricted to just the black patients in the study. This study, however, was not designed to look for race-specific effects, so the results had to be treated as preliminary. The authors of one review state that "prospective trials involving large numbers of black patients are needed to further clarify their response to therapy" (Carson 1999). With this justification, a new randomized trial, recruiting just black patients, was begun. This study did indeed show that the two drugs were effective among these black patients (Taylor 2004), and became one of the first examples of a therapy recommended solely for a specific racial subgroup.
The concept of using race or ethnicity in medical decisions is controversial, because of the potential for misuse and abuse of this information (Bhopal 1997). There is also debate about whether there is enough genetic variation among different racial and ethnic groups to justify treating them as distinct groups. The authors of the second study skirt this issue by using the phrase "patients who self-identify as black".
The important lesson, though, is that no study should be examined in isolation. You should always be looking for corroborating evidence. The subgroup finding in Carson 1999 was indeed a weak form of evidence, but it was supported by several mechanistic explanations described above. When these results were replicated in an independent study, the evidence in favor of this controversial treatment became overwhelmingly persuasive.
What do the other witnesses say? What to look for.
Additional details, both within the research study itself and outside the research study, can provide support for an otherwise weak form of evidence.
Is there a strong association? A treatment that has a large impact is unlikely to be undone by small flaws in the research.
Is there a dose response pattern? A treatment that shows stronger effects when given in stronger doses adds credibility to a study because it reduces the credibility of certain biasing factors.
Is the association consistent? A result that is replicated across diverse populations using diverse research designs adds credibility because it is unlikely that a particular flaw in the research could affect all these studies in the same way.
Is the association specific? A treatment that cures "everything" lacks specificity. You should mistrust such a treatment because its apparent benefit is likely to be caused by a global difference in the health of the treated and untreated patients. In contrast, a treatment that cures one particular condition, but not others, would rule out such a global difference.
Is the association biologically plausible? A treatment that has no sound biological basis has to pass a higher threshold of evidence than a treatment that has a plausible biological mechanism.
Is there a conflict of interest? Research that is untainted by commercial temptations is more credible because the researchers have no financial incentive to skew the research results.
Is there any evidence of fraud? Research that is carefully reviewed reduces the chances of deliberate falsification of the data.
4.1 Is there a strong association?
No research is perfect, and there is always the possibility that some unaccounted-for factor, rather than the treatment being studied, caused the results seen in the research. This is less likely to be the case, however, when there is a strong association, in other words when the treatment has a large effect on the outcome. By contrast, a weak association, one where the treatment has only a small effect on the outcome, is less persuasive because any small bias or problem with the research could swamp the effect.
Perhaps the best example of a strong association is the link between cigarette smoking and lung cancer. The studies that established this link in the 1950s and 1960s were not perfect studies. They did not use randomization, because it would be unethical. They often relied on retrospective data, because of the long latency period between exposure and the development of cancer. They did not have a perfect control group for many of these studies, and there were a lot of potential confounding variables that had to be accounted for. Nevertheless, these studies showed a large effect, typically a tenfold or greater risk of lung cancer when smokers were compared to non-smokers. So while these studies did have numerous flaws and biases, it would be very difficult to find a factor independent of cigarette smoking that had a tenfold or greater effect on lung cancer, and it would take a bias or flaw that severe to produce such a lopsided finding.
What is a strong association/large effect? There is no magic number. A commonly quoted rule of thumb is that an odds ratio or relative risk of two or greater represents a large effect. Any treatment that can double the chances of a cure or cut the risk of side effects in half is considered a strong association that is unlikely to be due to small biases in the research. Ratios less than two are less credible because they could easily be caused by small biases or flaws in the research.
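To make the rule of thumb concrete, here is a minimal sketch of how a relative risk and an odds ratio are computed from a two-by-two table. The function names and the counts are hypothetical and chosen only for illustration; they do not come from any study discussed in this chapter.

```python
def relative_risk(a, b, c, d):
    """Risk of the outcome in the treated group divided by risk in the control group.

    a = treated with outcome,  b = treated without outcome
    c = control with outcome,  d = control without outcome
    """
    risk_treated = a / (a + b)
    risk_control = c / (c + d)
    return risk_treated / risk_control


def odds_ratio(a, b, c, d):
    """Odds of the outcome in the treated group divided by odds in the control group."""
    return (a * d) / (b * c)


# Hypothetical counts: 40 of 100 treated patients are cured,
# compared with 20 of 100 control patients.
rr = relative_risk(40, 60, 20, 80)
odds = odds_ratio(40, 60, 20, 80)
print(f"relative risk = {rr:.2f}, odds ratio = {odds:.2f}")
```

In this made-up table the treatment doubles the chance of a cure, so the relative risk sits right at the rule-of-thumb boundary. The odds ratio is larger than the relative risk here because the outcome is common; the two measures are close only when the outcome is rare.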
The problem with this rule of thumb arises when it is taken too literally. Rothmann points out correctly that "a strong association serves only to rule out hypotheses that the association is entirely due to one weak unmeasured confounder or other source of experimenter bias" (Rothmann 1998). It is a mistake to blindly trust any odds ratio greater than two. Some research studies have major flaws that could artefactually produce odds ratios of two or larger. It is also a mistake to totally disregard any odds ratio less than two. Some research studies are so well conducted that even a small odds ratio is credible.
4.2 Is there a dose response pattern?
If a treatment or exposure is given in varying doses, and increasing doses lead to increasing effects on the outcome, then you have a dose response pattern. Having such a pattern generally adds to the credibility of the study. The reason for this is that many (but not all) biases and flaws in a research study would affect all doses of a treatment equally, and it is only a few flaws and biases that would produce an artefactual dose response pattern. For example, if a drug is effective only because the control subjects were poorly chosen, then the difference in the outcome should be the same for all levels of the treatment. This sort of bias could not produce a dose response pattern and could not serve as a credible alternative.
Some biases and flaws, though, could still produce a dose response pattern. Suppose you are looking at groups of patients who have good, better, and best levels of exercise. There may be a third factor, such as nutrition, where those patients in the good exercise group typically have good nutrition, but those in the best exercise group have typically the best nutrition. Then this nutrition factor might produce a dose response pattern of bias which the naive researcher might mistake for an effect of exercise itself.
Not all treatments can or should be expected to produce a dose response pattern. Sometimes medicines have a threshold effect: any dose below a certain point is completely ineffective, and any dose above that produces a roughly comparable effect. Similarly, some exposures are perfectly safe up to a certain point, but beyond that point they are uniformly fatal.
Some exposures are actually protective up to a certain point and then harmful beyond that point. The technical term for this is hormesis, and the best example, though the final word has not been written yet, is in the consumption of wine. The evidence to date suggests that people who consume a small amount of wine daily have a lower risk of heart attacks and therefore a better overall mortality profile than those who consume no wine or those who consume a lot of wine.
Rothmann is also critical of this criterion and points out that birth order shows a dose-response relationship with Down's syndrome, with first-born children being less likely to have this condition. This relationship, however, is just a reflection of the fact that the age of the parents is positively associated with Down's syndrome. Older parents are more likely to have children with Down's syndrome, and later-born children tend to have older parents, so birth order and age of the parents show a strong positive correlation.
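The way a third factor can manufacture a dose response pattern can be sketched with a small stratified table. All of the counts below are hypothetical, invented so that the arithmetic is easy to follow; they are not real Down's syndrome rates. Within each age group the risk is identical for every birth order, yet the crude (unstratified) risk climbs steadily with birth order because later births come more often from older parents.

```python
# (cases, births) by birth order and parental age group. Hypothetical numbers.
data = {
    1: {"younger": (40, 80_000), "older": (100, 20_000)},
    2: {"younger": (30, 60_000), "older": (200, 40_000)},
    3: {"younger": (20, 40_000), "older": (300, 60_000)},
}


def risk_per_10k(cases, births):
    """Risk expressed as cases per 10,000 births."""
    return 10_000 * cases / births


def crude_risk(order):
    """Risk for a birth order, ignoring parental age."""
    cases = sum(c for c, _ in data[order].values())
    births = sum(n for _, n in data[order].values())
    return risk_per_10k(cases, births)


def stratum_risk(order, age_group):
    """Risk for a birth order within a single parental age group."""
    return risk_per_10k(*data[order][age_group])


for order in data:
    print(
        f"birth order {order}: crude {crude_risk(order):.0f}, "
        f"younger {stratum_risk(order, 'younger'):.0f}, "
        f"older {stratum_risk(order, 'older'):.0f} per 10,000"
    )
```

In this constructed example the crude risks rise with birth order, which looks like a dose response pattern, while the risks within each age group are flat. Stratifying on the suspected confounder is one simple way to check whether an apparent dose response pattern survives.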
4.3 Is the association consistent?
The most common request in research is "I won't believe it until I see it replicated." It's one of the first things that I would look for if the evidence in a particular study is weak.
The link between cigarette smoking and lung cancer provides the best example of the value of replication. As noted above, any single study of smoking had potential flaws. So when the first study appeared, skeptics could produce a reason (call it A) that might explain away the results of the study. A second and different study would appear, and skeptics could find a different reason (call it B) that might explain away the results of that study as well. And for the third study, they offered up C and for the fourth study, they offered up D. Eventually, though, this series of claims, A, B, C, D,... became less credible than the hypothesis that smoking causes cancer, because a series of counterarguments also needs to be skeptically evaluated. It was the strength of a wide range of studies, not any single study, that produced convincing evidence of a link between lung cancer and smoking.
You have to be careful to look at the type of replication. Mindless replication that just repeats the same experiment over and over again will just end up producing the exact same biases. In the real world, different researchers try different approaches. Although there is not always an explicit plan, a series of replications will often be varied enough so that a confounding factor that might be present in one study is unlikely to be present in all studies.
In fact, researchers do often have an explicit plan to replicate in such a way that any biasing factor in one study will be eliminated in another study (Rosenbaum 2001).
4.4 Is the association specific?
A new therapy that makes narrow claims about the outcomes that it can influence provides greater credibility for a research study. In contrast, a new therapy that seems to influence a wide range of health outcomes is less persuasive. Something that cures everything should make you suspicious that perhaps the groups being compared are not apples and apples. It may mean instead that the research design ended up implicitly selecting healthier patients in the treatment group and sicker patients in the control group.
A good example of a therapy that makes overly broad and non-specific claims is craniosacral therapy. This is an alternative medicine practice that involves
"listening with the fingers" to the body's subtle rhythms and any patterns of inertia and congestions. The emphasis of treatment is to encourage and enhance the body's own self-healing and self-regulating capabilities, even in the most acute resistance and pathologies. -- Kern, Michael; "What is Craniosacral Therapy?"
Practitioners of craniosacral therapy list a wide and nonspecific array of conditions and symptoms that they claim it helps.
Impingement of cranial nerves or spinal nerves, left-right imbalances, head injuries, confusion, feelings of compression or pressure, anxiety, depression, circulatory disorders, organ dysfunctions, learning difficulties, neuro-endocrine problems, TMJ and dental problems, and trauma of all kinds — birth, falls, accidents and other injuries, physical, sexual or emotional abuse, PTSD, loss/grief, surgery, anesthesia — all are good indicators that a visit to your craniosacral therapist will be helpful. -- www.craniosacraltherapy.org/FAQ.htm
Some conditions that commonly respond well to treatment include: Autism, Central nervous system disorders, Chronic back pain, Migraine headaches, Neurovascular disorders, Immune disorders, Post-traumatic stress disorder, Fibromyalgia, and Learning disabilities. -- fitnessandmassage.com/CST.html
Specificity is one of the criteria that Sir Austin Bradford Hill used to identify causal relationships. An exposure that affects a single disease provides more credible evidence than an exposure that affects a broad range of diseases. Applying this to claims about therapy, the conclusion would be that a therapy that claims to cure everything probably cures nothing. Stated less extremely, you should use greater caution and demand a greater level of evidence for any therapy that makes overly broad claims of efficacy.
A savvy researcher can exploit specificity to strengthen the credibility of a study's findings. Suppose an epidemiologist is examining the effects of a toxic exposure, such as carbon monoxide. You can't randomly assign patients in such a study because of ethical constraints, so instead you choose an observational study where one of the groups, such as toll booth operators, is exposed by the nature of their job to excessive amounts of carbon monoxide.
When the epidemiologist compares these workers to a control group, the questions would normally cover symptoms such as shortness of breath, dizziness, nausea, and headaches, which are associated with carbon monoxide exposure. But the epidemiologist will often also ask about symptoms that are unrelated to this exposure, such as watery eyes, itchy skin, and sneezing. If the exposed group rated all the symptoms higher than the controls, then you would suspect that toll booth workers just like to complain more about problems in general. But when they report higher levels only for those symptoms specific to carbon monoxide poisoning, you have greater confidence because you have eliminated a possible alternate explanation for these findings.
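The comparison described above can be sketched as a table of rate ratios, one per symptom. The group sizes, symptom counts, and function name below are all hypothetical; the sketch only illustrates the pattern to look for, not results from an actual study.

```python
N_EXPOSED = 200  # hypothetical number of toll booth operators surveyed
N_CONTROL = 200  # hypothetical number of control workers surveyed

# symptom: (exposed reporting it, controls reporting it, related to carbon monoxide?)
symptoms = {
    "shortness of breath": (24, 12, True),
    "dizziness": (45, 15, True),
    "nausea": (30, 20, True),
    "headache": (60, 30, True),
    "watery eyes": (20, 20, False),
    "itchy skin": (18, 20, False),
    "sneezing": (33, 30, False),
}


def rate_ratio(exposed_cases, control_cases):
    """Symptom rate among the exposed divided by the rate among controls."""
    return (exposed_cases * N_CONTROL) / (control_cases * N_EXPOSED)


for name, (exposed_cases, control_cases, related) in symptoms.items():
    label = "related" if related else "unrelated"
    print(f"{name:20s} {label:10s} ratio = {rate_ratio(exposed_cases, control_cases):.2f}")
```

In these made-up data the ratios for the carbon monoxide symptoms are all 1.5 or higher, while the ratios for the unrelated symptoms stay close to 1. That is the specific pattern that argues against a general tendency to complain. Ratios well above 1 across the board would point the other way.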
Specificity is, by itself, not a perfect indicator of causality. Certain exposures, such as cigarette smoking, cause a very broad and non-specific set of diseases. Certain drugs, such as aspirin, are effective for a wide range of illnesses.
You should not discount the possible benefits of a therapy just because it is nonspecific. Just be cautious: the broader the claims, the more caution you should use.
4.5 Is the association biologically plausible?
"Extraordinary claims require extraordinary proof." This is the mantra of skeptical thinkers and it proves useful for evaluating claims that fall outside the mainstream of science. This is part of the network of corroborating evidence that we demand as we review research claims in medical journal articles.
One aspect of a claim that makes it extraordinary is that there is no plausible mechanism that would explain how the therapy works. Therapies without such a mechanism would be subjected to a higher standard of proof. Don't reject a therapy automatically, though, just because no known mechanism exists. Many successful medical interventions were adopted before a mechanism was discovered that explained how and why that intervention worked.
Another problem is that not everyone agrees on when a plausible mechanism exists. Proponents of homeopathy argue that their approach works because water has "memory" that retains the effect of medicines that otherwise would be diluted out. They point to a series of experiments published in Nature magazine (Davenas 1988).
These experiments represent proof of mechanism, according to proponents of homeopathy. Critics, however, argued that this was an aberrant finding, or perhaps even fraud.
The editors of Nature acknowledged this unusual finding and then demanded a supervised replication of the findings. This is just common sense, according to the critics of homeopathy. The French scientists, however, cried foul. No other results reported in Nature ever needed supervised replication.
The supervised replication failed, of course, but then the proponents of homeopathy claimed that the replication was flawed and not the original finding.
So does a proven mechanism for homeopathy exist? Well, it depends who you talk to. The concept of a plausible mechanism, just like the concept of extraordinary claims, is subject to personal biases and interpretation. Beware of mechanistic claims that allude to obscure theories in mathematics or physics. Any type of medicine, for example, that relies on quantum physics to justify its existence is probably suspect.
Robert Park dissects some of the mechanistic claims of homeopathy. One of the many mechanistic explanations suggested among homeopaths is chaos theory. One of the central tenets of chaos theory is that small changes in a nonlinear system can lead to large changes in the outcome. The analogy that is commonly used is that the flapping of a butterfly's wings in China could cause a hurricane in Florida six months later.
The notion of chaos theory was of immediate interest to homeopaths, who routinely use extremely dilute solutions in their practice. If a butterfly's wings can cause a hurricane, then what is to stop a very dilute solution of medicine from curing a patient? This is a rather bizarre claim, though, because chaos theory shows the basic unpredictability of nonlinear processes, which is in direct opposition to the claim that homeopathy provides consistently good health outcomes. As Robert Park puts it,
Thus, while the flapping of a butterfly's wings might conceivably trigger a hurricane, killing butterflies is unlikely to reduce the incidence of hurricanes. As for homeopathic remedies that exceed the dilution limit, a better analogy might be to the flapping of a caterpillar's wings. -- www.csicop.org/si/9709/park.html
Another attempt to build a plausible mechanism also falters on the perception of what is plausible. A publication in the 2001 Christmas holiday issue of the British Medical Journal (Liebovici 2001) studied the effects of remote intercessory retrospective prayer. This researcher collected medical records of patients with severe blood infections. These were retrospective records, and the patients were hospitalized anywhere from four to ten years earlier. These records were randomly divided into two groups and then one group of records was prayed for. The mechanism offered for how prayer could influence outcomes that had already occurred reads:
Models of space and time permitting bidirectional interactions between present and past exist. A current image of the topology of the space-time continuum includes wormholes that link remote regions, when space-time is pinched or folded. Some physicists hypothesise that Calabi-Yau space might allow bidirectional interactions between past and future. These possibilities cannot be dismissed.
This is an example where a search for a plausible mechanism ignores other choices which are far simpler.
Sometimes trying to find out what is biologically plausible is like navigating a minefield.
Nonaspirin nonsteroidal anti-inflammatory drugs (NSAIDs) include those that inhibit both the cyclooxygenase-1 (COX-1) isoenzyme and the COX-2 isoenzyme (nonselective NSAIDs) and those that are more selective for the COX-2 isoenzyme (COX-2 selective inhibitors, herein called COX-2 inhibitors). Nonselective NSAIDs may reduce the risk for myocardial infarction (MI) by inhibiting platelet aggregation (1–3). On the other hand, studies have postulated that COX-2 inhibitors increase the risk for atherothrombotic events because they inhibit prostacyclin, which may increase thrombotic tendencies and vascular injury without the beneficial effect of platelet inhibition derived from COX-1 inhibition (4). However, COX-2 inhibitors may also reduce cardiovascular risk by inhibiting vascular inflammation, improving endothelial dysfunction, and enhancing coronary plaque stability (5–8). These effects may differ among COX-2 inhibitors. Along with potential differences in blood pressure effects (9), recent evidence suggests that celecoxib and rofecoxib may differ in their effects on endothelial dysfunction and oxidative stress (8).
This is difficult to read, but as I understand it, there is a plausible mechanism for just about any finding that you can think of with respect to COX-2 inhibitors and heart disease.
4.6 Is there a conflict of interest?
Sir Austin Bradford Hill did not mention commercial biases back in 1965, but these have, sad to say, become an important consideration in evaluating today's research.
When a potential conflict of interest is brought to your attention, you need to approach the research cautiously, and you should rightly demand extra evidence. Don't turn into a statistical nihilist, though, and disregard any research with a potential conflict of interest.
Do commercial ties influence research findings? There are many documented cases where money does alter the research. Perhaps the best understood conflict of interest involves the tobacco companies. Financial support from tobacco companies has a large and quantifiable impact on the findings of a study. Articles on passive smoking written by authors affiliated with the tobacco industry were far more likely to conclude that passive smoking was not harmful (Barnes 1998). A review of studies on the economic effects of laws restricting smoking (Scollo 2003) showed that tobacco affiliations were associated with greater use of subjective outcomes, a lower rate of peer review, and a greater tendency to report negative economic impacts.
Support from or commercial ties with pharmaceutical companies can also be troublesome. At least thirty studies have examined whether authors with commercial ties come up with more favorable conclusions about the drugs they are studying. A review of these studies (Lexchin 2003) showed that industry-financed studies were four times more likely to reach conclusions favorable to the company's product. The authors offered five possible explanations:
- drug companies might preferentially support and test only those drugs that have especially good prospects;
- the drug company sponsored trials could be of poorer quality and therefore more likely to draw contradictory conclusions;
- researchers might deliberately choose the "wrong" dose of the standard drug offered in the control group, leading to a higher rate of efficacy for the new drug, fewer side effects noted for the new drug, or both;
- drug companies might preferentially publish only the studies that support the use of the new drug; and
- drug companies might deliberately target symposiums, since the lack of peer review might allow them to make stronger statements about their drugs than the data itself would support.
Another problem is that authors rarely disclose possible conflicts. Hussain, in a 2001 BMJ article, calculated the rate of disclosure at 1.4% (52 out of 3,642), a number that is far too low to be credible. If authors fail to report potential conflicts of interest, it may be out of the stubborn belief that commercial ties only influence other people (Boyd 2003). This is not unlike the belief that physicians are unaffected by the small gifts that drug companies often use as marketing tools (Katz 2003).
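The disclosure rate quoted above is simple arithmetic, and a rough interval around it shows that the sample is large enough to pin the rate down tightly. The sketch below recomputes the proportion from the two counts in the text and adds a 95% Wilson score interval. The interval is an addition for illustration and is not reported in the source, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


disclosed, articles = 52, 3642  # counts quoted in the text
rate = disclosed / articles
low, high = wilson_interval(disclosed, articles)
print(f"disclosure rate = {rate:.1%}, 95% interval roughly {low:.1%} to {high:.1%}")
```

Even the upper end of this interval is around two percent, so the low figure is unlikely to be a quirk of sampling. The open question is how often conflicts actually exist, which these counts alone cannot answer.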
Charges of financial conflict of interest are sometimes a "red herring" that is intended to distract from a discussion of the merits of the research. Stephen Senn tells an interesting story about himself (Senn 2001) where such a charge was leveled. Stephen Senn is a famous statistician with over 190 publications. Because of his stellar reputation, he is widely sought out as a statistical consultant to the pharmaceutical industry. In a discussion with an academic researcher, though, Dr. Senn was informed that "source of employment" meant that his recommendations about the proper analysis of crossover trials were worthless. It didn't matter that Dr. Senn had written the definitive textbook on that very subject, Cross-over Trials in Clinical Research.
Another "red herring" is the claim of a financial conflict of interest involving the James Randi prize.
The idea of letting a former illusionist with a substantial financial stake in a negative result supervise a "double-blind" experiment is perhaps questionable. -- www.weirdtech.com/sci/expe.html
So how should you approach a research article where the authors have declared a conflict of interest? You should be cautious, but not cynical. If the research is objective, well documented, and subject to external review, then you should not let financial conflict of interest exert a veto power over the findings. On the other hand, an editorial article written by an author with commercial ties to a product being discussed in the editorial is very troublesome (Angel 1996).
Is there an explicit assurance from the author that the industry support still allowed the author to independently assess the data and to publish the results without first getting approval from the sponsor? A reasonable review period by the sponsor is acceptable as long as the final decision to publish rests with the author and not the sponsor. A 2001 revision to the statement on publication ethics from the International Committee of Medical Journal Editors highlights how important this assurance is.
Keep in mind that sometimes the conflict is not in the author of the study, but in the reader of the study. Lerner, in a 2002 CMAJ article, highlights the controversy over breast self-exam. Because the use of this exam is empowering, because it promotes self-care, and because of a general belief in individual testimonials, many readers have reacted angrily to reports that use of a breast self-exam has no detectable impact on mortality.
Does having a commercial interest in the results of a drug trial cause a problem for the people running the trial? If it does, then much of the research that we rely on could be flawed. A recent article in the British Medical Journal raises some serious concerns (Jurendini 2004).
4.7 Is there any evidence of fraud?
Another consideration not covered by Sir Austin Bradford Hill is fraud. This is another sad development in research that you need to be concerned about. Research fraud is
the intentional fabrication or falsification of data or results, plagiarism, or other similarly deceptive practices that seriously deviate from those that are commonly accepted within the scientific community for proposing (e.g., in grant proposals), conducting, or reporting research. It does not include honest error or honest differences in interpretations or judgments of data.
-- www.jhsph.edu/ora/fraud.html
It is almost impossible for the average reader to detect fraud in a journal publication.
Journals can protect against fraudulent (and inadvertent) changes in the research protocol by insisting that researchers present the original research protocol to the peer reviewers along with the paper itself (Hawkey 2001). The peer reviewers could then look for any deviations from the original plan that might indicate an attempt to deceive or mislead the readers.
Example: In 2000, the Journal of the American Medical Association (JAMA) published a study of celecoxib (Silverstein 2000) that showed that it had fewer side effects than competing drugs. The rates of stomach and intestinal ulcers after six months were far lower than those for two competing drugs. M. Michael Wolfe wrote a strongly supportive editorial in that same issue of JAMA (Lichtenstein 2000). Later, Dr. Wolfe discovered the same study, as reported to the U.S. Food and Drug Administration (FDA). The drug company's report to the FDA showed that the original plan for the study was to study side effects for a full year. Almost all of the side effects found during the second six months were among patients taking celecoxib, and when you combined the second six months of data with the first, most of the advantage for celecoxib disappeared. The authors of the JAMA study argued that the high dropout rate in the second half of the study made the rates based on a full year of data unreliable, but even if this were so, the authors still had an obligation to present the full year data to allow readers to make up their own minds.
Counterpoint: biological plausibility is not all it is cracked up to be.
###Fix this. Add material to this section.###
On your own
###Fix this. Add material to this section.###
This webpage was written by Steve Simon on 2004-07-07, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu. Category: Statistical evidence