Children's Mercy Hospital

Mountain or Molehill?

This is a first draft for Chapter 3 of my book, "Statistical Evidence."

3.0 Introduction

Do the research results add up to something important or are the results trivial? For the results to be important, the study needs to have a narrow focus, it has to measure the right outcomes, and the change in the outcome has to be large from a clinical perspective.

3.1 Did they measure the right thing?
3.2 Did they measure it well?
3.3 Were the changes clinically important?

3.1 Did they measure the right thing (suitable outcomes)?

There's a well known story about a man who was fumbling about in the middle of the street on a very dark night. A passerby stopped and asked what was going on. The man replied, "I dropped my keys and I can't find them." So the passerby agreed to help look for the lost keys. After a half hour, the passerby got frustrated and asked the man if he remembered exactly where he was standing when he dropped the keys. "Over in the alley there," came the response. The passerby looked with surprise and exasperation at the man. "Over in the alley? Then why are you looking out here in the middle of the street?" The man replied, "Because the light is better here."

3.1.1 Surrogate Measures

Patients are generally interested in one of four things: mortality (will I die?), morbidity (will I go blind?), symptoms (will I throw up?), or quality of life (will I be able to walk up a flight of steps without getting winded?). They don't care about the concentration of homocysteine in their blood, or what their CD4 cell count is, unless those values relate to something that is important to them.

Good research, then, should measure something that is important to the patient. There is an acronym for this, POEM, which stands for Patient Oriented Evidence that Matters (www.infopoems.com). Every research study should directly measure an outcome that matters to the patient. Direct measurements, though, are often difficult to obtain. So sometimes researchers will examine intermediate measures that are faster and easier to assess, but which may or may not be predictive of more important endpoints. These intermediate measures are called surrogate measures.

Some examples of surrogate measures are forced expiratory volume and premature ventricular contractions. These measures are not important to a patient in themselves, but only in their ability to predict events like asthma difficulties or recurrence of heart attacks.

Improvement in forced expiratory volume may not translate into a reduction in asthma attacks. A reduction in abnormal ventricular depolarization may not translate into a reduction in the recurrence of heart attacks.

You have to show a strong correlation between the surrogate measure and the patient-oriented outcome. If there is only a weak correlation, then establishing a large effect on the surrogate measure will not translate into a large effect on the patient-oriented outcome.

You also need to establish that changes in the surrogate measure lead to changes in the outcome of interest. The surrogate measure might be strongly correlated with the patient-oriented outcome but only because both are related to a third factor. That third factor might end up being the measure that you need to change, not the surrogate measure.

Example: A study that showed an association between duration of breast feeding and brachial artery distensibility at 20 to 28 years of age (Leeson 2001) recognized that brachial artery distensibility is a surrogate outcome. Distensibility is a measure of stiffness, and could be considered a marker for cardiovascular disease in mid and later life. Such a link is tenuous, and the authors themselves, as well as an accompanying editorial (Booth 2001), admit that this does not establish a cause and effect relationship between breast feeding and heart disease.

Example: A study of chemotherapy for colorectal cancer (Buyse 2000) noted that tumor response was often used to assess the value of new treatments, but there was an uncertain connection between tumor response and mortality. The authors demonstrated through a meta-analysis that there was a link between tumor response and survival, but this link was weak. A 50% improvement in tumor response would only lead to a 6% change in the odds of death.

In contrast, a study of cholesterol lowering drugs (Law 2003) showed a significant decrease in LDL cholesterol and tied that lowering to a decreased risk of heart attacks and strokes. A 1.8 mmol/l change, for example, was achieved and could be linked to a 61% reduction in the risk of ischemic heart disease and a 17% reduction in the risk of stroke.

You also need to assure yourself that the measure is sensitive to changes associated with improvement in health. There is a wide range of measures of pulmonary function, for example, and some are more responsive than others to changes in health (de Torres 2002).

3.1.2 Short term changes in outcome

Perhaps it is just human nature, but we are all impatient and we want to focus on the short term and the immediate. That's true for researchers also. They want to do the research, publish it, and move on as quickly as possible. Using a short term outcome measure facilitates this way of life. I'm sure that budgetary constraints have something to do with this as well.

The problem with the focus on short term outcomes is that it is usually easier to get a short term change, but that's not what is really important from a clinical perspective. It's easy, for example, to get a smoker to quit smoking for a day, or maybe even a week. But most interventions that try to help people quit smoking don't work as well for keeping people off cigarettes for three months or for two years. Pretty much any diet works well in the first week or so. People will lose a few pounds right away. But can people continue to lose weight and maintain that weight loss for a full year? That's a much harsher but much more realistic test of the value of a diet.

Example: A study of a youth tobacco education program (Mahoney 2002) looked at immediate recall and recall four months later of the knowledge and attitudes that this program was trying to reinforce. Although most concepts were retained for the short term, only two: "recognition that smokers have yellow teeth and fingers" and "smoking one pack of cigarettes a day costs several hundred dollars per year" were retained at the four month evaluation.

3.1.3 Multiple outcome measures

The presence of a narrowly drawn research plan developed prior to the start of data collection adds a great deal to the credibility of a study. In contrast, a scattershot approach will dilute the credibility of the research. There is a saying in Statistics circles, "If you torture your data long enough, it will confess to something."

Example: A study of the relationship between childhood cancer and diet (Sarasua 1994) examined five different types of meat consumption (ham/bacon/sausage, hot dogs, hamburgers, lunch meats, and charcoal broiled foods), two different types of cancer (acute lymphocytic leukemia and brain tumor), and considered diet both of the child and of the mother during pregnancy. This led to 20 different combinations of these factors. In addition, the authors provided additional discussion using a different definition of high and low consumption. High consumption of hot dogs, for example, was defined as one or more hot dogs per week, but later results defining high consumption as two or more hot dogs were described.

A good research study has limited objectives that are specified in advance. There is solid empirical evidence that specifying a hypothesis prior to data collection reduces the chances of a false positive finding by a factor of three (Swaen 2001). Failure to limit the scope of a study leads to problems with multiple testing.
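A rough calculation shows why multiple testing is a problem. The sketch below assumes that every test is run at a significance level of 0.05, that every null hypothesis is true, and that the tests are independent. The independence assumption is a simplification and almost certainly does not hold for the 20 combinations in the childhood cancer example, so treat the numbers as an illustration rather than as a property of that study. The function name is made up for this example.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests.

    Assumes (for illustration only) that every null hypothesis is true
    and that the tests are independent of one another.
    """
    # The chance that a single test avoids a false positive is (1 - alpha),
    # so the chance that all of them avoid one is (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 20):
    print(f"{k:2d} tests: {familywise_error_rate(k):.2f}")
```

Under these assumptions, 20 tests give you roughly a 64% chance of at least one false positive finding, compared to 5% for a single test.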

There are good reasons to look at multiple outcomes when you are trying to explore a new area. The results of this exploratory analysis would then provide justification and focus to a second study that would replicate the results. Looking at multiple outcomes is also fine if there are several distinct dimensions, like efficacy and side effects, that need to be evaluated. But looking at multiple outcome measures just because you can leads to a "fishing expedition," a study that looks at a large number of exposures or a large number of outcomes without any effort to prioritize.

3.1.4 Subgroup comparisons

Examining a large number of subgroups will dilute the credibility of a study. Maybe a drug is ineffective overall, but could you please check to see if it is effective in women? In patients with the most severe conditions? In patients younger than 30? In patients who smoke cigars? In patients who have a college education? In patients who live with a dog or cat? In patients who get a moderate amount of exercise?

Example: A light-hearted study on astrology (Pollex 2001) shows the problem with subgroup analysis. They established a statistically significant association between certain astrological signs and winning the Nobel prize (Geminis were more likely, Leos were less likely). The authors conclude that "foraging through databases using contrived study designs in the absence of biological mechanistic data sometimes yields spurious results."

Subgroup comparisons suffer from three problems. First, the subgroup comparison is usually a non-randomized comparison. Second, the subgroup comparison has less precision because the sample size is smaller. Third, the sample size in a study could be swamped by the number of subgroups that could potentially be examined.

If you find a subgroup that behaves differently, then you need to ask yourself a few questions. Is this a subgroup that I would have studied a priori if I had been more careful during the planning stage? Is there a plausible mechanism to explain why this subgroup behaves differently? Are there other studies that have similar findings for this subgroup?

There are some technical issues with subgroup comparisons. You wouldn't want to declare that a therapy is effective for one subgroup if the p-value for that subgroup was 0.043 and the p-value for the others was 0.062. The analysis of subgroups should be done as a formal test of interaction.
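Here is a minimal sketch of what a formal test of interaction looks like in the simplest case: two independent subgroups, each with a treatment effect estimate and a standard error, compared with a z-test on the difference. The estimates below are hypothetical, chosen only so that the subgroup p-values land near 0.043 and 0.062, and the function names are invented for this illustration. A real analysis would usually fit a model with an interaction term instead.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def interaction_test(est_a: float, se_a: float, est_b: float, se_b: float):
    """Compare two independent subgroup estimates (hypothetical helper).

    Returns the difference in estimates, its z statistic, and the
    two-sided p-value for the hypothesis that the subgroups do not differ.
    """
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # assumes independent subgroups
    z = diff / se_diff
    return diff, z, two_sided_p(z)


# Hypothetical subgroup results: effect estimate and standard error.
est_a, se_a = 2.024, 1.0  # subgroup p-value near 0.043
est_b, se_b = 1.866, 1.0  # subgroup p-value near 0.062

print(f"subgroup A p = {two_sided_p(est_a / se_a):.3f}")
print(f"subgroup B p = {two_sided_p(est_b / se_b):.3f}")
diff, z, p = interaction_test(est_a, se_a, est_b, se_b)
print(f"interaction: difference = {diff:.3f}, z = {z:.2f}, p = {p:.2f}")
```

With these made-up numbers, one subgroup falls just under 0.05 and the other just over, yet the interaction test finds no evidence at all that the two subgroups respond differently.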

3.1.5 Measuring the right outcome--what to look for

When you are looking at the outcome measured in a study, ask yourself the following questions:

  • Is the outcome evaluating only short term changes?
  • Is the outcome related to an event that patients care about?
  • Is the research diluted by looking at multiple outcomes or multiple subgroups?

References on Suitable Outcomes

3.2 Did they measure the outcome well (measurement quality)?

Quality measurements are important for all variables, but they are especially important for the outcome measure. There are several types of measurements that provide weaker evidence. Be cautious about measurements that are retrospective, unblinded, unvalidated, or unreliable.

3.2.1 Retrospective Measurements

Retrospective measurements have less credibility than measurements taken prospectively.

Retrospective data are data that are collected by looking backwards in time. We obtain these data by asking subjects to recall events that occurred earlier in their lives. We also get retrospective data when we review medical records, birth certificates, death certificates, or other sources of historical data. In contrast, data collected during the course of the study are known as prospective data.

Retrospective data are often inexpensive to collect, but you should be concerned about their accuracy. The ability of a subject to recall information is sometimes affected by which group they are in.

Women who have experienced miscarriages, for example, are more likely to search for and remember events that they feel might "explain" their miscarriage, much more so than a group of comparable control subjects. This differential level of reporting is known as recall bias.

In addition, historical data are often incomplete and it is sometimes difficult to verify their accuracy. Therefore, retrospective data are considered less authoritative than prospective data.

Sometimes, though, you can establish credibility for retrospective measures. A review of research on smoking illustrates this well (Gail 1996). The author recalls a 1950 study that looked at the smoking habits of lung cancer patients and controls. The authors were concerned about the retrospective assessment of smoking among patients in both groups. Would patients with lung cancer exaggerate the amount of smoking? Would the interviewers press harder for information about smoking among the cancer patients? While it would be impossible to totally rule out recall bias, the authors did examine a third group, patients who were diagnosed with lung cancer but who later found out that they suffered from a different disease (false cases). If recall bias were the sole explanation of the difference in reported smoking, then the group of false cases should have reported a level of smoking similar to that of the lung cancer patients. Instead they reported a lower level of smoking. This helped to rule out the possibility that recall bias alone accounted for the higher reported smoking levels in the lung cancer patients.

Another difficulty with retrospective data is that you may not be able to identify which was the cause and which was the effect. Causes have to occur before and effects have to occur after, but when you examine causes and effects retrospectively, you may end up losing information about timing.

There's an old joke about a statistician who was examining the fire department records, including information about how much damage the fire caused, and how many fire engines responded to the blaze. The statistician noticed a strong relationship between the two variables and concluded that the more fire engines you send, the more damage they cause.

Example: The British Medical Journal highlighted a research study where speech patterns were recorded in two groups of surgeons. The first group had two or more malpractice claims filed against them and the second group had none. There was a large difference between the two groups, with the first group having a dominant tone with less concern for the patient. The news report of this research suggested that "dominance coupled with a lack of anxiety in the voice may imply surgeon indifference and lead a patient to launch a malpractice suit when poor outcomes occur." -- bmj.com/cgi/content/full/325/7359/297/a

One reader, however, pointed out that perhaps: "being sued is a brutalizing and demoralizing experience and that this experience fundamentally changes the attitude of doctors towards their patients." -- bmj.com/cgi/eletters/325/7359/297/a#24658

Retrospective studies can also use data from charts and these charts often have incomplete or ambiguous information (Horwitz 1984). A big advantage of prospective studies is that the researchers know what data they want to collect. They don't always get what they want, even in a prospective study, but the chances of getting complete and accurate data are much better.

3.2.2 Unblinded measurements

In an experimental study, it is desirable (but not always possible) to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patient. This is known as "blinding" or "masking." Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study.

There is always some individual who knows which patients get which treatments, such as the pharmacy that prepares the pills and placebos. This is perfectly fine as long as these individuals do not interact with the patients or evaluate the patients.

There is a bit of ambiguity with respect to who is blinded (Devereaux 2001). For example, a survey of 25 textbooks produced nine different definitions of "double blind." Therefore, you should avoid using these terms and focus instead on which individuals are blinded. If you are evaluating an article, look for evidence of blinding for the following groups:

  • the patients themselves,
  • clinicians who have substantial interactions with the patients,
  • anyone who assesses outcomes in these patients, or
  • anyone who collects data from these patients.

If only some of the above are unaware of the treatment, then the study is partially blinded.

Blinding prevents the placebo effect from distorting the research results. The placebo effect is a product of "belief, expectancy, cognitive reinterpretation, and diversion of attention" that can lead to psychological and sometimes physiological improvements in situations where the treatment is known to have no effect, such as sugar pills (Beyerstein 1997).

There are three specific situations where the placebo effect is of particular concern: when enthusiasm by the patient or the doctor for the new procedure is strong, when outcomes are based on the patient's self-assessment (e.g. quality of life studies), and when the treatment is primarily for symptoms (Johnson 1997). The placebo effect is less critical for objective outcomes like survival.

A recent study showed that the placebo effect might be overstated in some contexts (Hrobjartsson 2001). Some of the effects attributed to the placebo are perhaps caused instead by statistical artefacts like regression to the mean or by the tendency of some conditions to resolve spontaneously.
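Regression to the mean is easier to see in a small simulation than in words. The sketch below uses entirely simulated data, not data from any study cited here. Each simulated patient has a stable underlying symptom level, and each measurement adds independent random noise. Nobody receives any treatment. Patients are "enrolled" only if their baseline score is above a cutoff, which mimics a study that recruits people when their symptoms are at their worst. All names and settings are assumptions made for this illustration.

```python
import random
from statistics import mean


def simulate_regression_to_mean(n: int = 10_000, cutoff: float = 1.0, seed: int = 0):
    """Average baseline and follow-up scores among patients enrolled
    because their baseline score exceeded `cutoff` (no treatment given)."""
    rng = random.Random(seed)
    # Stable underlying symptom level for each simulated patient.
    true_level = [rng.gauss(0, 1) for _ in range(n)]
    # Two noisy measurements of that same underlying level.
    baseline = [t + rng.gauss(0, 1) for t in true_level]
    followup = [t + rng.gauss(0, 1) for t in true_level]
    # Enroll only the patients who looked bad at baseline.
    enrolled = [i for i in range(n) if baseline[i] > cutoff]
    return (mean(baseline[i] for i in enrolled),
            mean(followup[i] for i in enrolled))


baseline_avg, followup_avg = simulate_regression_to_mean()
print(f"enrolled patients, baseline average:  {baseline_avg:.2f}")
print(f"enrolled patients, follow-up average: {followup_avg:.2f}")
```

The follow-up average is lower than the baseline average even though nothing was done to these simulated patients. An unwary reader could mistake that drop for a placebo effect.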

Even without a placebo effect, blinding would still be important to ensure uniform rates of compliance. You want to avoid a situation where a patient thinks "I'm in the placebo arm, so it's not really important whether I show up for my follow-up evaluation."

The value of blinding also extends to the research team, and should include anyone who interacts with the patients. In a clinical trial of treatments for multiple sclerosis, a pair of neurologists assessed the outcome of each patient (Noseworthy 1994). One neurologist was blinded to the treatment status and one was unblinded. The unblinded neurologist gave substantially lower ratings to patients in the placebo group, which would have led to falsely concluding that one of the treatments was effective.

Researchers can also influence the outcome through their attitudes and through their differential use of other medications (Schulz 2002).

Unfortunately, there are many situations where blinding is impossible. For example, if you are comparing oral versus rectal administration of a drug, that's pretty hard to conceal from the patient. In general, observational studies cannot be blinded, because the patient and/or their doctor selects the treatment group.

Surgical procedures are often difficult to completely blind. Nevertheless, you can take some partial steps at blinding that prevent some of the biases from creeping in (Johnson 1997). If two surgical procedures use different types of incisions, identical blood or iodine stained opaque dressings could be used to keep the patients unaware of which operation was performed. Also, although the surgeon cannot be blinded to the difference in surgery, those who evaluate the health of the patient after surgery could be kept unaware of the particular operation, so as to ensure that their evaluation of the patient is unbiased.

Even though the placebo may look the same, sometimes the doctor may infer which group a patient belongs to, perhaps through noting a characteristic set of side effects. If you are worried about this, ask the doctors to try to identify which treatment group they believe each patient belonged to. If the percentage of correct guesses is significantly larger than 50%, then the allocation scheme was not sufficiently blinded.
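One way to decide whether the percentage of correct guesses is "significantly larger than 50%" is a one-sided exact binomial test. The sketch below assumes a two-arm trial with equal allocation, so that a doctor guessing at random would be right half the time. The counts are hypothetical and the function name is invented for this example.

```python
from math import comb


def binomial_upper_tail(correct: int, total: int, chance: float = 0.5) -> float:
    """Probability of at least `correct` right guesses out of `total`
    if every guess is right with probability `chance` (pure guessing)."""
    return sum(
        comb(total, k) * chance ** k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical results: doctors guess the treatment group for 100 patients.
for correct in (55, 60):
    p_value = binomial_upper_tail(correct, 100)
    print(f"{correct} of 100 correct: one-sided p = {p_value:.3f}")
```

With these hypothetical counts, 55 correct guesses out of 100 is well within what chance alone would produce, while 60 out of 100 gives a p-value below 0.05 and would raise doubts about the blinding.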

Blinding is just one of many factors that combine to indicate a study's rigor and quality. Although unblinded studies are considered less authoritative than blinded studies, you should not use blinding by itself as a surrogate marker for the quality of the research (Schulz 2002). For example, Rupert Sheldrake conducted a survey of various journals and showed that blinding was used in 85% of all parapsychology research. But it would be a mistake to claim, as Dr. Sheldrake does, that "Parapsychologists ... have been constantly subjected to intense scrutiny by skeptics, and this has made them more rigorous." http://www.parascope.com/en/articles/blindScience.htm

Two groups of researchers have compared studies with and without blinding. These authors found that studies without blinding show an average bias of 11-17% (Schulz 1996; Colditz 1989). In other words, when an unblinded study was compared to a blinded study, the former study tended to estimate a treatment effect that was (on average) 11% to 17% higher than the latter.

Additional evidence of this problem appears in a meta-analysis of the association between intermittent sunlight exposure and melanoma (Nelemans 1995). When nine studies without blinding were combined, they showed an odds ratio of 1.84, which was statistically significant (95% confidence interval 1.52 to 2.25). When the seven studies with blinding were combined, they showed a much smaller odds ratio (1.17, 95% confidence interval 0.98 to 1.39), which was not statistically significant. This is further evidence that unblinded studies are more likely to show statistical significance than blinded studies.

3.2.3 Self report measurements

Self report measurements, when the patients evaluate themselves, raise some special concerns. The degree to which patients report problems, for example, is associated with their level of education, as more educated patients are better able to describe their illnesses (Sen 2002).

You can only get certain measurements, such as pain, through self report. Other measures, like quality of life, are best obtained directly from the patient (Moinpour 2000).

In a study of stress (Macleod 2002), there was a relationship between high levels of stress and increased rates for self reported angina. There was no relationship, however, with more objective measures of heart disease. The apparent relationship with self reported angina might be a tendency for some patients to over report negative events (both psychological and medical) and for other patients to under report negative events.

A criticism of self report measures has to acknowledge that patients' perceptions of disease are an important dimension of health. Appropriate medical treatment should not ignore the patient's perceptions, because health cannot be entirely reduced to objective numerical measures.

[Elaborate on this and add more examples.]

3.2.4 Measurements without established validity

Validity is a term that every discipline has a different definition for. In very simple and general terms, validity means that an outcome is measuring what you think it is measuring. There are several ways to measure validity, but most of these involve comparison to an external standard.

The classic example of a measurement without established validity is the Rorschach Ink Blot test. Patients would be asked to interpret geometric figures that were essentially random and featureless forms. The interpretation given by the patient would reveal to a trained psychologist many insights into the patient's personality.

The inkblot test is difficult to evaluate under objective conditions, but when careful evaluations have been done, they have shown that this test has very limited ability to diagnose personality traits. It does have some ability to distinguish schizophrenic patients, but most of the other uses of this test have been discredited.

The subjective nature of the interpretations made it difficult to verify the accuracy of the predictions. Much like the predictions of palm readers and astrologers, the interpretations were so general as to apply to just about anybody.

Contrast this with the visual analog scale assessment of pain. To validate this measure, researchers examined how patients rated their pain before an operation and afterwards. They examined ratings before administration of analgesics and afterwards. When the scale showed changes under these conditions, it established the validity of the scale.

You should be cautious about self reported measurements. For some measures, especially for pain, self-report is the only practical way to assess an outcome. Quality of life measurements also have to be self reported. But asking a patient to assess whether they have a certain medical condition can be dangerous.

Example: A study of concussions (Piland 2003) used a 16 item self-reported scale and validated it by comparing it to composite balance and neuropsychological measures.

Be cautious about results that explain the role of race/ethnicity data in predicting a medical outcome (Walsh 2003). Quite often, race/ethnicity is not directly related to the outcome, but rather it is socioeconomic markers that are directly related.

3.2.5 Measurements without established reliability

Reliability means different things in different fields, but the general concept is that a reliable measurement is one that would stay about the same if it were repeated under similar circumstances. Depending on the context, you would establish reliability differently. For example, one way to establish reliability is to have two people make independent assessments and show a good level of agreement. If you are measuring something that is stable over time, then you could take two measurements on different days or weeks and see how well they agree.

Be especially careful about measurements that have some level of subjectivity. If there is no establishment of reliability for these measures, then you have no assurance that the research is repeatable.
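When two people make independent assessments, one common way to summarize their agreement is Cohen's kappa, which adjusts the raw percentage of agreement for the agreement you would expect by chance. Kappa is not named in the text above, so take it as one possible choice rather than the required statistic. The ratings below are hypothetical (1 means the condition was judged present, 0 means absent).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Chance-corrected agreement between two raters on the same subjects."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Proportion of subjects where the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if the raters were independent, given how often
    # each rater used each category.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of ten patients by two independent assessors.
assessor_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
assessor_2 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(f"kappa = {cohens_kappa(assessor_1, assessor_2):.2f}")
```

In this made-up example the assessors agree on 8 of 10 patients, but half of that agreement would be expected by chance, so kappa is about 0.6 rather than 0.8.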

[Add two good examples.]

3.2.6 Post Hoc Changes

No research plan is perfect, and you should expect minor deviations from the plan in just about any research study. Major deviations, however, from the protocol can reduce the credibility of a study. Some examples of deviations from the plan include:

  • Investigating end-points other than those originally specified
  • Developing new exclusion criteria after the study has started
  • Stopping the study unexpectedly or extending it beyond the planned sample size

You need to ask yourself if the authors deviated from the protocol in a conscious or subconscious effort to manipulate the results. Did the authors add other end-points in order to salvage a largely negative study? Were new exclusion criteria targeted to keep "troublesome" subjects out? It is impossible, of course, to discern the motives of the researchers. Nevertheless, for any deviation or modification to the protocol, you can ask whether this change would have made sense to include in the protocol if it had been thought of before data collection began.

Changes to the planned end of the study, either stopping the study early, or extending it beyond the planned sample size, can raise some serious problems (Ludbrook 2003). There are several reasons that you might want to stop a study early:

  • early evidence that one of the therapies is much better than the other (efficacy),
  • early evidence that continuing the study would be unlikely to yield a significant result (futility),
  • early evidence that one of the therapies is too dangerous (safety), and/or
  • finishing the study would end up being far more expensive or time consuming than the original plan (economics).

[Insert a good example here.]

In order to maintain credibility, a study should have rules for stopping early that were specified prior to the start of data collection. Pre-determined rules are especially important when a study ends early for efficacy. If a study ends early for economic reasons, and the result is not statistically significant, you need some assurance that the truncated sample size still provided a reasonable level of precision. In this situation, the width of the confidence intervals would indicate clearly if the sample size was still adequate.
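To see how much precision a truncated study gives up, you can compare the width of the confidence interval at the planned sample size with the width at the sample size actually achieved. The sketch below uses the simple normal approximation for a single proportion. The event rate and both sample sizes are hypothetical, and the approximation itself is an assumption that works poorly for small samples or rates near 0 or 1.

```python
import math


def ci_half_width(proportion: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a
    proportion, using the normal (Wald) approximation."""
    return z * math.sqrt(proportion * (1.0 - proportion) / n)


# Hypothetical study: 30% event rate, planned for 400 patients,
# stopped for economic reasons after 150.
rate = 0.30
for label, n in (("planned", 400), ("truncated", 150)):
    print(f"{label:9s} n = {n:3d}: 95% CI is {rate:.2f} +/- {ci_half_width(rate, n):.3f}")
```

With these made-up numbers the interval goes from roughly plus or minus 4.5 percentage points to roughly plus or minus 7.3 percentage points. Whether that is still narrow enough depends on how large a difference would matter clinically.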

Extending a study beyond the original end date can also be problematic. Extensions for economic reasons (the budget went further than expected or an extra funding source appeared) are probably not a serious problem, but be very careful if the study gets extended because of a failure to achieve statistical significance at the planned sample size. The provisions for such an extension must be specified prior to the start of data collection.

Detecting a deliberate and fraudulent change in a research study is extremely difficult for anyone, but especially difficult for the reader. A thorough peer review provides a limited level of protection from fraud. Another suggested remedy is a proposed requirement that journals should see the original protocols for research studies as part of the peer review process (Hawkey 2001). Sometimes a careful review of the numbers in a study can highlight the possibility of fraud. If a study used randomization, for example, watch out if there is an unexpected and unexplained deviation from a 50-50 split between treatment and control. Replication of research findings is also a good protection against fraud.

Example: An interesting deviation from the research plan occurs in a randomized double blind controlled trial of selenium supplements (Clark 1996). The study was initiated in 1983 with basal cell and squamous cell carcinomas of the skin as the primary end points. The researchers also looked for signs of selenium toxicity. In 1990, funding was obtained to look at additional secondary end points (total mortality, cancer mortality, and incidence of lung, colorectal, and prostate cancers). While it was relatively easy to add extra endpoints in the middle of the study, the authors acknowledged that this represented a deviation from the protocol. Another deviation from the protocol occurred when the study was terminated early (January 1996). No statistically significant changes were found in the primary endpoints, nor was any evidence of selenium toxicity found. Among the secondary endpoints, however, the authors found statistically significant declines in total cancer mortality and lung cancer mortality. The authors also found statistically significant declines in the incidence of prostate cancer, colorectal cancer, lung cancer and total carcinomas. There was also a decline in overall mortality, though it did not achieve statistical significance. There were no significant changes in the incidence of nine other types of cancer, including breast cancer, bladder cancer, and leukemia. Because the significant results occurred in areas that were not originally planned for study, the authors acknowledge that any results have to be considered preliminary. Furthermore, it is unclear what impact the early termination of the study had on the statistics. Early termination of a study can cause serious biases, unless specific rules for early termination are established at the start of the study.

3.2.7 Measuring the outcome well--what to look for

When you are looking at how the outcome was measured, ask yourself the following questions:

  • Was the outcome dependent on the memory of the patients?
  • Did the outcome have established validity and reliability?
  • Were there post hoc changes in the protocol?

References on Measurement Quality

3.3 Were the changes clinically important?

[Add material to this section]

Examples

Absolute Risk

Particularizing

References on clinical importance

3.4 Summary - Mountain or molehill?

Look carefully at how the researchers measured the outcome in their study.

Did they measure the right thing? You would like to see an outcome of direct interest to your patients.

Did they measure it well? You want an outcome that is valid and reliable and not subject to changes after the start of data collection.

Were the changes clinically important? You want a change that is large enough to have a practical impact in a clinical setting.

This webpage was written by Steve Simon on (unknown date), edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Statistical evidence