Statistical Evidence. Chapter 3. Mountain or Molehill? The Clinical Importance of the Results
[This is the first draft of Chapter 3 of "Statistical Evidence."]
3.0 Introduction
Do the research results add up to something important or are the results trivial? For the results to be important, the study needs to have a narrow focus, it has to measure the right outcomes, and the change in the outcome has to be large from a clinical perspective.
3.1 Did they measure the right thing?
3.2 Did they measure it well?
3.3 Were the changes clinically important?
Case Study: Side Effects of Vaccination
A pair of articles on vaccination that appeared next to each other in a 1999 issue of BMJ (Karvonen 1999; Henderson 1999) offer an interesting contrast in reporting styles. I commented about this on the BMJ webpages (Simon 1999).
Both studies used a cohort design to examine side effects of vaccination. In the first article, the authors compared the rate of Type I diabetes among children vaccinated at the age of three months to children vaccinated at the age of 24 months. They reported the relative risk as 1.06 (p=0.54). In the second study, the authors compared the risk of intermittent wheezing between vaccinated and non-vaccinated children. They reported the relative risk as 1.06 (95% CI: 0.81 to 1.37).
Both studies are negative, but the second study tells you something extra. In that study, you know that even after allowing for sampling error, there is no justification for believing that the risk of side effects could be increased by 50%. You know this because the relative risk of 1.5 lies outside the confidence interval. With the first study, you are left wondering. That looks like a small relative risk, but is it possible that sampling error would allow for a 50% or 100% increase in risk? You'd have to calculate the confidence interval for yourself to be sure.
Since you've been such a good reader, I'll save you the trouble. The 95% confidence interval for the relative risk in the first paper is 0.88 to 1.28. So you can rule out a large change in risk.
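If you ever need to do this calculation yourself, one approach is to work backwards from the relative risk and the p-value on the log scale. The sketch below assumes that the reported p-value is two-sided and came from a normal-approximation test on the log relative risk, which may not be exactly what the authors used. The function name is my own.

```python
from math import exp, log
from statistics import NormalDist

def ci_from_rr_and_p(rr, p_value, level=0.95):
    """Approximate confidence interval for a relative risk, given only
    the relative risk and a two-sided p-value.

    Assumption: the p-value came from a normal-approximation (Wald)
    test on log(RR). Not defined for rr == 1 or p_value == 1.
    """
    std_normal = NormalDist()
    # z statistic implied by the two-sided p-value
    z_observed = std_normal.inv_cdf(1 - p_value / 2)
    # standard error of log(RR) implied by that z statistic
    se_log_rr = abs(log(rr)) / z_observed
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(rr) - z_crit * se_log_rr)
    upper = exp(log(rr) + z_crit * se_log_rr)
    return lower, upper

# Figures reported in the first paper: RR = 1.06, p = 0.54
lower, upper = ci_from_rr_and_p(1.06, 0.54)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

With the figures from the first paper, this back-calculation agrees with the interval quoted above to two decimal places.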
Unfortunately, neither paper reported a measure of absolute risk. With a bit of effort, you can calculate these values yourself. In the first paper, the number needed to harm for Type I diabetes is 4,500 (95% CI: 1,100 to infinity). In the second paper, the number needed to harm for intermittent wheezing is 109 (95% CI: 37 to infinity).
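The number needed to harm is the reciprocal of the absolute increase in risk, so the calculation is short once you have the event counts. The sketch below uses made-up counts (60 events in 1,000 exposed children versus 50 in 1,000 unexposed), not the data from either paper, and a simple normal-approximation interval for the risk difference. Following the convention in the paragraph above, it reports the upper limit as infinity whenever the interval for the risk difference includes zero.

```python
from math import inf, sqrt
from statistics import NormalDist

def number_needed_to_harm(events_exposed, n_exposed,
                          events_control, n_control, level=0.95):
    """NNH with an approximate confidence interval.

    Assumes the observed risk is higher in the exposed group.
    """
    risk_exposed = events_exposed / n_exposed
    risk_control = events_control / n_control
    risk_diff = risk_exposed - risk_control
    # Wald standard error of the difference between two proportions
    se = sqrt(risk_exposed * (1 - risk_exposed) / n_exposed
              + risk_control * (1 - risk_control) / n_control)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    rd_low = risk_diff - z_crit * se
    rd_high = risk_diff + z_crit * se
    nnh = 1 / risk_diff
    # The largest plausible risk increase gives the smallest NNH
    nnh_low = 1 / rd_high
    # If "no increase at all" is plausible, the NNH has no finite upper limit
    nnh_high = 1 / rd_low if rd_low > 0 else inf
    return nnh, nnh_low, nnh_high

# Hypothetical counts, for illustration only
nnh, nnh_low, nnh_high = number_needed_to_harm(60, 1000, 50, 1000)
print(f"NNH = {nnh:.0f} (95% CI: {nnh_low:.0f} to {nnh_high})")
```

An upper limit of infinity is another way of saying that the data cannot rule out the possibility that the exposure causes no extra harm at all.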
Why is this important? Because you need to know what the best course of action is with respect to these vaccinations. If there is a large risk that outweighs the benefit of the vaccination, you should stop vaccinating your patients. Even if the risk does not outweigh the benefit, if it is large enough, you should warn people about the side effect.
Notice that I did not define "large" here. How much of an increase in side effect risk is large? It's an easier question to ask than to answer, and in the case of vaccines, the answer is especially difficult. What disease is the vaccine trying to prevent? How much more prevalent would that disease become if people stopped using the vaccine? Is the disease life threatening? How serious is the side effect?
These are complicated questions, but they are questions that you have to ask if you want to assess whether the research findings add up to a mountain or if they are just an unimportant molehill.
Mountain or Molehill? What to look for.
Make sure that any research study measures something of practical importance.
Did they measure the right thing? Researchers should focus on outcomes of interest to the patient and long term rather than short term outcomes. Examining multiple outcome measures or multiple subgroups will dilute the quality and strength of the evidence.
Did they measure it well? Certain types of measurements have a lower strength of evidence. Be cautious about measurements that are retrospective because memory is imperfect. Unblinded measurements can allow your patients' expectations to influence the outcome. Don't trust unvalidated/unreliable measurements or post hoc changes in the protocol.
Were the changes clinically important? With a large enough sample size, a difference between two groups that is statistically significant might represent a change so small as to be clinically trivial. Specify a clinically important change for a study by asking how much of a change would be needed to convince you to adopt a new treatment or therapy. For negative trials, look for a precise confidence interval or a justification of the sample size that was conducted prior to data collection.
3.1 Did they measure the right thing?
There's a well-known story about a man who was fumbling about in the middle of the street on a very dark night. A passerby stopped and asked what was going on. The man replied, "I dropped my keys and I can't find them." So the passerby agreed to help look for the lost keys. After a half hour, the passerby got frustrated and asked the man if he remembered exactly where he was standing when he dropped the keys. "Over in the alley there," came the response. The passerby looked with surprise and exasperation at the man. "Over in the alley? Then why are you looking out here in the middle of the street?" The man replied, "Because the light is better here."
Surrogate Measures
Patients are generally interested in one of four things. Mortality (will I die?), morbidity (will I go blind?), symptoms (will I throw up?), or quality of life (will I be able to walk up a flight of steps without getting winded?). They don't care about concentration of homocysteine in their blood, or what their CD4 cell count is, unless those values relate to something that is important to them.
Good research, then, should measure something that is important to the patient. There is an acronym for this, POEM, which stands for Patient Oriented Evidence that Matters (www.infopoems.com). Every research study should directly measure an outcome that matters to the patient. Direct measurements, though, are often difficult to obtain. So sometimes researchers will examine intermediate measures that are faster and easier to assess, but which may or may not be predictive of more important endpoints. These intermediate measures are called surrogate measures.
Some examples of surrogate measures are forced expiratory volume and premature ventricular contractions. These measures are not important to a patient in themselves, but only in their ability to predict events like asthma difficulties or recurrence of heart attacks.
Improvement in forced expiratory volume may not translate into a reduction in asthma attacks. A reduction in abnormal ventricular depolarization may not translate into a reduction in the recurrence of heart attacks.
You have to show a strong correlation between the surrogate measure and the patient-oriented outcome. If there is only a weak correlation, then establishing a large effect on the surrogate measure will not translate into a large effect on the patient-oriented outcome.
You also need to establish that changes in the surrogate measure lead to changes in the outcome of interest. The surrogate measure might be strongly correlated with the patient-oriented outcome but only because both are related to a third factor. That third factor might end up being the measure that you need to change, not the surrogate measure.
You also need to assure yourself that the measure is sensitive to changes associated with improvement in health. There is a wide range of measures of pulmonary function, for example, and some are more responsive than others to changes in health (de Torres 2002).
Example: A study that showed an association between duration of breast feeding and brachial artery distensibility at 20 to 28 years of age (Leeson 2001) recognized that brachial artery distensibility is a surrogate outcome. Distensibility is a measure of stiffness, and could be considered a marker for cardiovascular disease in mid and later life. Such a link is tenuous, and the authors themselves, as well as an accompanying editorial (Booth 2001), admit that this does not establish a cause and effect relationship between breast feeding and heart disease.
Example: A study of chemotherapy for colorectal cancer (Buyse 2000) noted that tumor response was often used to assess the value of new treatments, but there was an uncertain connection between tumor response and mortality. The authors demonstrated through a meta-analysis that there was a link between tumor response and survival, but this link was weak. A 50% improvement in tumor response would only lead to a 6% change in the odds of death.
Example: A study of cholesterol lowering drugs (Law 2003) showed a significant decrease in LDL cholesterol and, in contrast to the previous example, tied that lowering to a decreased risk of heart attacks and strokes. A 1.8 mmol/l change, for example, was achieved and could be linked to a 61% reduction in the risk of ischemic heart disease and a 17% reduction in the risk of stroke.
Short Term Changes In Outcome
Perhaps it is just human nature, but we are all impatient and we want to focus on the short term and the immediate. That's true for researchers also. They want to do the research, publish it, and move on as quickly as possible. Using a short term outcome measure facilitates this way of life. I'm sure that budgetary constraints have something to do with this as well.
The problem with the focus on short term outcomes is that it is usually easier to get a short term change, but that's not what is really important from a clinical perspective. It's easy, for example, to get a smoker to quit smoking for a day, or maybe even a week. But most interventions that try to help people quit smoking don't work as well for keeping people off cigarettes for three months or for two years. Pretty much any diet works well in the first week or so. People will lose a few pounds right away. But can people continue to lose weight and maintain that weight loss for a full year? That's a much harsher but much more realistic test of the value of a diet.
Example: A study of a youth tobacco education program (Mahoney 2002) looked at immediate recall and recall four months later of the knowledge and attitudes that this program was trying to reinforce. Although most concepts were retained for the short term, only two: "recognition that smokers have yellow teeth and fingers" and "smoking one pack of cigarettes a day costs several hundred dollars per year" were retained at the four month evaluation.
Multiple outcome measures
The presence of a narrowly drawn research plan developed prior to the start of data collection adds a great deal to the credibility of a study. In contrast, a scattershot approach will dilute the credibility of the research. There is a saying in Statistics circles, "If you torture your data long enough, it will confess to something."
Example: A study of the relationship between childhood cancer and diet (Sarasua 1994) examined five different types of meat consumption (ham/bacon/sausage, hot dogs, hamburgers, lunch meats, and charcoal broiled foods), two different types of cancer (acute lymphocytic leukemia and brain tumor), and considered diet both of the child and of the mother during pregnancy. This led to 20 different combinations of these factors. In addition, the authors provided additional discussion using a different definition of high and low consumption. High consumption of hot dogs, for example, was defined as one or more hot dogs per week, but later results defining high consumption as two or more hot dogs were described.
A good research study has limited objectives that are specified in advance. There is solid empirical evidence that specifying a hypothesis prior to data collection reduced the chances of a false positive finding by a factor of three (Swaen 2001). Failure to limit the scope of a study leads to problems with multiple testing.
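A quick calculation shows how fast the multiple testing problem grows. The sketch below assumes that every test is run at the conventional 0.05 level, that the tests are independent, and that there is no real effect anywhere. Real comparisons, such as the 20 combinations in the diet study, are usually correlated, so treat the result as a rough guide. The function names are mine.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests when no real effects exist."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall false
    positive rate at or below alpha (Bonferroni adjustment)."""
    return alpha / n_tests

for n_tests in (1, 5, 20):
    chance = chance_of_any_false_positive(n_tests)
    print(f"{n_tests:2d} tests: {chance:.0%} chance of at least one false positive")
```

Under these assumptions, 20 tests give roughly a 64% chance of at least one spurious "significant" finding. The Bonferroni adjustment is one simple and conservative remedy, but it is not the only one.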
There are good reasons to look at multiple outcomes when you are trying to explore a new area. The results of this exploratory analysis would then provide justification and focus to a second study that would replicate the results. Looking at multiple outcomes is also fine if there are several distinct dimensions, like efficacy and side effects, that need to be evaluated. But looking at multiple outcome measures just because you can leads to a "fishing expedition," a study that looks at a large number of exposures or a large number of outcomes without any effort to prioritize.
Consider a hypothetical example where a drug company is comparing their new pain relief drug to another company's drug. When they design the study, they look at pain levels every hour for five hours after the patient takes the drug. The multiple time points give the drug company extra chances to declare success.
If the new drug shows a greater degree of relief earlier on in time but a comparable amount of relief later, then they can claim that their product is faster acting.
If the new drug shows a comparable degree of relief earlier, but a greater degree of relief later, then they can claim that their product is longer lasting.
Subgroup comparisons
Examining a large number of subgroups will dilute the credibility of a study. Maybe a drug is ineffective overall, but could you please check to see if it is effective in women? In patients with the most severe conditions? In patients younger than 30? In patients who smoke cigars? In patients who have a college education? In patients who live with a dog or cat? In patients who get a moderate amount of exercise?
Example: A light-hearted study on astrology (Pollex 2001) shows the problem with subgroup analysis. They established a statistically significant association between certain astrological signs and winning the Nobel prize (Geminis were more likely, Leos were less likely). The authors conclude that "foraging through databases using contrived study designs in the absence of biological mechanistic data sometimes yields spurious results."
Subgroup comparisons suffer from three problems. First, the subgroup comparison is usually a non-randomized comparison. Second, the subgroup comparison has less precision because the sample size is smaller. Third, the number of subgroups that could be examined is so large that it can swamp the sample size of the study.
If you find a subgroup that behaves differently, then you need to ask yourself a few questions. Is this a subgroup that I would have studied a priori if I had been more careful during the planning stage? Is there a plausible mechanism to explain why this subgroup behaves differently? Are there other studies that have similar findings for this subgroup?
There are some technical issues with subgroup comparisons. You wouldn't want to declare that a therapy is effective for one subgroup if the p-value for that subgroup was 0.043 and the p-value for the others was 0.062. The analysis of subgroups should be done as a formal test of interaction.
Measuring the right outcome--what to look for
When you are looking at the outcome measured in a study, ask yourself the following questions:
- Is the outcome evaluating only short term changes?
- Is the outcome related to an event that patients care about?
- Is the research diluted by looking at multiple outcomes or multiple subgroups?
3.2 Did they measure the outcome well?
Quality measurements are important for all variables, but they are especially important for the outcome measure. There are several types of measurements that provide weaker evidence. Be cautious about measurements that are retrospective, unblinded, unvalidated, or unreliable.
Retrospective Measurements
Retrospective measurements have less credibility than measurements taken prospectively.
Retrospective data are data that are collected by looking backwards in time. We obtain these data by asking subjects to recall events that occurred earlier in their lives. We also get retrospective data when we review medical records, birth certificates, death certificates, or other sources of historical data. In contrast, data collected during the course of the study are known as prospective data.
Retrospective data are often inexpensive to collect, but you should be concerned about their accuracy. The ability of a subject to recall information is sometimes affected by which group they are in.
Women who have experienced miscarriages, for example, are more likely to search for and remember events that they feel might "explain" their miscarriage, much more so than a group of comparable control subjects. This differential level of reporting is known as recall bias.
In addition, historical data are often incomplete and it is sometimes difficult to verify their accuracy. Therefore, retrospective data are considered less authoritative than prospective data.
Sometimes, though, you can establish credibility for retrospective measures. A review of research on smoking illustrates this well (Gail 1996). The author recalls a 1950 study that looked at the smoking habits of lung cancer patients and controls. The authors were concerned about the retrospective assessment of smoking among patients in both groups. Would patients with lung cancer exaggerate the amount of smoking? Would the interviewers press harder for information about smoking among the cancer patients? While it would be impossible to totally rule out recall bias, the authors did examine a third group, patients who were diagnosed with lung cancer and who later found out that they suffered from a different disease (false cases). If recall bias were the sole explanation of the difference in reported smoking, then the group of false cases should have reported a level of smoking similar to that of the lung cancer patients. Instead they reported a lower level of smoking. This helped to rule out the possibility that recall bias alone accounted for the higher reported smoking levels in the lung cancer patients.
Another difficulty with retrospective data is that you may not be able to identify which was the cause and which was the effect. Causes have to occur before and effects have to occur after, but when you examine causes and effects retrospectively, you may end up losing information about timing.
There's an old joke about a statistician who was examining the fire department records, including information about how much damage the fire caused, and how many fire engines responded to the blaze. The statistician noticed a strong relationship between the two variables and concluded that the more fire engines you send, the more damage they cause.
Example: The British Medical Journal highlighted a research study where speech patterns were recorded in two groups of surgeons. The first group had two or more malpractice claims filed against them and the second group had none. There was a large difference between the two groups, with the first group having a dominant tone with less concern for the patient. The news report of this research suggested that: "dominance coupled with a lack of anxiety in the voice may imply surgeon indifference and lead a patient to launch a malpractice suit when poor outcomes occur." -- bmj.com/cgi/content/full/325/7359/297/a
One reader, however, pointed out that perhaps: "being sued is a brutalizing and demoralizing experience and that this experience fundamentally changes the attitude of doctors towards their patients." -- bmj.com/cgi/eletters/325/7359/297/a#24658
Retrospective studies can also use data from charts and these charts often have incomplete or ambiguous information (Horwitz 1984). A big advantage of prospective studies is that the researchers know what data they want to collect. They don't always get what they want, even in a prospective study, but the chances of getting complete and accurate data are much better.
Unblinded measurements
In an experimental study, it is desirable (but not always possible) to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patient. This is known as "blinding" or "masking." Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study.
There is always some individual who knows which patients get which treatments, such as the pharmacy that prepares the pills and placebos. This is perfectly fine as long as these individuals do not interact with the patients or evaluate the patients.
There is a bit of ambiguity with respect to who is blinded (Devereaux 2001). For example, a survey of 25 textbooks produced nine different definitions of "double blind." Therefore, you should avoid using these terms and focus instead on which individuals are blinded. If you are evaluating an article, look for evidence of blinding for the following groups:
- the patients themselves,
- clinicians who have substantial interactions with the patients,
- anyone who assesses outcomes in these patients, or
- anyone who collects data from these patients.
If only some of the above are unaware of the treatment, then the study is partially blinded.
Blinding prevents the placebo effect from distorting the research results. The placebo effect is a product of "belief, expectancy, cognitive reinterpretation, and diversion of attention" that can lead to psychological and sometimes physiological improvements in situations where the treatment is known to have no effect, such as sugar pills (Beyerstein 1997).
There are three specific situations where the placebo effect is of particular concern: when enthusiasm by the patient or the doctor for the new procedure is strong, when outcomes are based on the patient's self-assessment (e.g. quality of life studies), and when the treatment is primarily for symptoms (Johnson 1997). The placebo effect is less critical for objective outcomes like survival.
Even without a placebo effect, blinding would still be important to ensure uniform rates of compliance. You want to avoid a situation where a patient thinks "I'm in the placebo arm, so it's not really important whether I show up for my follow-up evaluation."
The value of blinding also extends to the research team, and should include anyone who interacts with the patients. In a clinical trial of treatments for multiple sclerosis, a pair of neurologists assessed the outcome of each patient (Noseworthy 1994). One neurologist was blinded to the treatment status and one was unblinded. The unblinded neurologist gave substantially lower ratings to patients in the placebo group, which would have led to falsely concluding that one of the treatments was effective.
Researchers can also influence the outcome through their attitudes and through their differential use of other medications (Schulz 2002).
Surgical procedures are often difficult to completely blind. Nevertheless, you can take some partial steps at blinding that prevent some of the biases from creeping in (Johnson 1997). If two surgical procedures use different types of incisions, identical blood or iodine stained opaque dressings could be used to keep the patients unaware of which operation was performed. Also, although the surgeon cannot be blinded to the difference in surgery, those who evaluate the health of the patient after surgery could be kept unaware of the particular operation, so as to ensure that their evaluation of the patient is unbiased.
Even though the placebo may look the same, sometimes the doctor may infer which group a patient belongs to, perhaps through noting a characteristic set of side effects. If you are worried about this, ask the doctors to try to identify which treatment group they believe each patient belonged to. If the percentage of correct guesses is significantly larger than 50%, then the allocation scheme was not sufficiently blinded.
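Checking whether the guesses beat 50% is a small binomial calculation. The sketch below uses invented numbers (70 correct guesses out of 100 patients) and assumes two equally sized treatment groups, so that pure guessing would be right half the time. It computes an exact one-sided p-value; the function name is my own.

```python
from math import comb

def p_value_better_than_chance(correct, total):
    """Exact one-sided binomial p-value: the probability of getting at
    least `correct` guesses right out of `total` when each guess has a
    50% chance of being right."""
    favourable = sum(comb(total, k) for k in range(correct, total + 1))
    return favourable / 2 ** total

# Hypothetical result: doctors guessed the treatment group correctly
# for 70 of 100 patients.
p_value = p_value_better_than_chance(70, 100)
print(f"p = {p_value:.5f}")
```

A very small p-value here would suggest that something, perhaps a recognizable pattern of side effects, is giving the allocation away.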
Two researchers have examined studies with and without blinding. These authors found that studies without blinding show an average bias of 11-17% (Schulz 1996; Colditz 1989). In other words, when an unblinded study was compared to a blinded study, the former study tended to estimate a treatment effect that was (on average) 11% to 17% higher than the latter.
Additional evidence of this problem appears in a meta-analysis of the association between intermittent sunlight exposure and melanoma (Nelemans 1995). When nine studies without blinding were combined, they showed an odds ratio of 1.84, which was statistically significant (95% confidence interval 1.52 to 2.25). When the seven studies with blinding were combined, they showed a much smaller odds ratio (1.17, 95% confidence interval 0.98 to 1.39), which was not statistically significant. This is further evidence that unblinded studies are more likely to show statistical significance than blinded studies.
Self report measurements
Self report measurements, when the patients evaluate themselves, raise some special concerns. The degree to which patients report problems, for example, is associated with their level of education, as more educated patients are better able to describe their illnesses (Sen 2002).
You can only get certain measurements, such as pain, through self report. Other measures, like quality of life, are best obtained directly from the patient (Moinpour 2000).
A criticism of self report measures has to acknowledge that patients' perceptions of disease are an important dimension of health. Appropriate medical treatment should not ignore the patient's perceptions, because health cannot be entirely reduced to objective numerical measures.
Example: A comparison of self report versus hospital records of resource utilization (Kennedy 2002) showed substantial disagreement between the two measures, with individuals reporting substantially more use of physiotherapy than the hospital records would indicate.
Example: In a study of stress (Macleod 2002), there was a relationship between high levels of stress and increased rates for self reported angina. There was no relationship, however, with more objective measures of heart disease. The apparent relationship with self reported angina might be a tendency for some patients to over report negative events (both psychological and medical) and for other patients to under report negative events.
Measurements without established validity
Validity is a term that every discipline has a different definition for. In very simple and general terms, validity means that an outcome is measuring what you think it is measuring. There are several ways to measure validity, but most of these involve comparison to an external standard.
The classic example of a measurement without established validity is the Rorschach Ink Blot test. Patients would be asked to interpret geometric figures that were essentially random and featureless forms. The interpretation given by the patient would reveal to a trained psychologist many insights into the patient's personality.
The inkblot test is difficult to evaluate under objective conditions, but when careful evaluations have been done, they have shown that this test has very limited ability to diagnose personality traits. It does have some ability to distinguish schizophrenic patients, but most of the other uses of this test have been discredited.
The subjective nature of the interpretations made it difficult to verify the accuracy of the predictions. Much like the predictions of palm readers and astrologers, the interpretations were so general as to apply to just about anybody.
Contrast this with the visual analog scale assessment of pain. To validate this measure, researchers examined how patients rated their pain before an operation and afterwards. They examined ratings before administration of analgesics and afterwards. When the scale showed changes under these conditions, it established the validity of the scale.
You should be cautious about self reported measurements. For some measures, especially for pain, self-report is the only practical way to assess an outcome. Quality of life measurements also have to be self reported. But asking a patient to assess whether they have a certain medical condition can be dangerous.
Be cautious about results that explain the role of race/ethnicity data in predicting a medical outcome (Walsh 2003). Quite often, race/ethnicity is not directly related to the outcome, but rather it is socioeconomic markers that are directly related.
Example: A study of concussions (Piland 2003) used a 16 item self-reported scale and validated it by comparing it to composite balance and neuropsychological measures.
Example: In a validation study of motion palpation (Humphreys 2004), twenty chiropractic students were asked to identify the most hypomobile segment of the spine in patients with fused vertebrae. If the students failed to consistently identify the correct location in the extreme situation of a fused spine, then their ability to validly diagnose more subtle spinal motion problems would be called into question. The students showed good levels of agreement with the location of the fused spine.
Example: In a study of methods to assess urine specific gravity (Steumpfle 2003), hydrometry and reagent strips showed consistent disagreements with refractometer measurements, and these methods could not be recommended for determining urine specific gravity measures during weight certification of collegiate wrestlers.
Measurements without established reliability
Reliability means different things in different fields, but the general concept is that a reliable measurement is one that would stay about the same if it were repeated under similar circumstances. Depending on the context, you would establish reliability differently. For example, one way to establish reliability is to have two people make independent assessments and show a good level of agreement. If you are measuring something that is stable over time, then you could take two measurements on different days or weeks and see how well they agree.
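When two people make independent yes/no assessments, one common way to summarize their agreement is Cohen's kappa, which corrects the raw percentage agreement for the agreement you would expect by chance alone. Kappa is my choice of statistic for this sketch, not one named in the text, and the ratings are invented.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who assessed the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of subjects on which the two raters agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    # Undefined (division by zero) if chance agreement is already 100%
    return (observed - expected) / (1 - expected)

# Hypothetical data: 40 patients, raters agree on 30 of them
rater_a = ["yes"] * 15 + ["no"] * 15 + ["yes"] * 5 + ["no"] * 5
rater_b = ["yes"] * 15 + ["no"] * 15 + ["no"] * 5 + ["yes"] * 5
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 75% of patients, but half of that agreement could be expected by chance, so kappa is only 0.5. Raw agreement on its own can flatter a measurement.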
Be especially careful about measurements that have some level of subjectivity. If there is no establishment of reliability for these measures, then you have no assurance that the research is repeatable.
Wallace Sampson criticizes a study of homeopathic treatment for diarrhea (Jacobs 1994) because the outcome measures were all subjective measurements. The number of bowel movements, for example, as well as the smell and appearance of the feces, are open to interpretation. One could imagine first-time parents overreacting to small changes and being more likely to report that their child has diarrhea.
Post Hoc Changes
No research plan is perfect, and you should expect minor deviations from the plan in just about any research study. Major deviations from the protocol, however, can reduce the credibility of a study. Some examples of deviations from the plan include:
- Investigating end-points other than those originally specified
- Developing new exclusion criteria after the study has started
- Stopping the study unexpectedly or extending it beyond the planned sample size
You need to ask yourself if the authors deviated from the protocol in a conscious or subconscious effort to manipulate the results. Did the authors add other end-points in order to salvage a largely negative study? Were new exclusion criteria targeted to keep "troublesome" subjects out? It is impossible, of course, to discern the motives of the researchers. Nevertheless, for any deviation or modification to the protocol, you can ask whether this change would have made sense to include in the protocol if it had been thought of before data collection began.
Changes to the planned end of the study, either stopping the study early, or extending it beyond the planned sample size, can raise some serious problems (Ludbrook 2003). There are several reasons that you might want to stop a study early:
- early evidence that one of the therapies is much better than the other (efficacy),
- early evidence that continuing the study would be unlikely to yield a significant result (futility),
- early evidence that one of the therapies is too dangerous (safety), and/or
- finishing the study would end up being far more expensive or time consuming than the original plan (economics).
Example: A study of fascial interposition during vasectomy (Sokal 2004) planned for an interim analysis halfway through the study. At that evaluation, patients randomized to receive fascial interposition had a much shorter time to azoospermia and half the failure rate of the control group. These differences were so large that the study was halted early.
Example: A study of lung reduction surgery for patients with emphysema (The National Emphysema Treatment Trial Research Group 2001) ended the study early for a subgroup of patients who have a low FEV1 and either homogeneous emphysema or a very low carbon monoxide diffusing capacity. In these patients, surgery had a 30-day mortality of 16% compared to 0% in the non-surgical intervention group.
In order to maintain credibility, a study should have rules for stopping early that were specified prior to the start of data collection. Pre-determined rules are especially important when a study ends early for efficacy. If a study ends early for economic reasons, and the result is not statistically significant, you need some assurance that the truncated sample size still provided a reasonable level of precision. In this situation, the width of the confidence intervals would indicate clearly if the sample size was still adequate.
Extending a study beyond the original end date can also be problematic. Extensions for economic reasons (the budget went further than expected or an extra funding source appeared) are probably not a serious problem, but be very careful if the study gets extended because of a failure to achieve statistical significance at the planned sample size. The provisions for such an extension must be specified prior to the start of data collection.
Detecting a deliberate and fraudulent change in a research study is extremely difficult for anyone, but especially difficult for the reader. A thorough peer review provides a limited level of protection from fraud. Another suggested remedy is a proposed requirement that journals should see the original protocols for research studies as part of the peer review process (Hawkey 2001). Sometimes a careful review of the numbers in a study can highlight the possibility of fraud. If a study used randomization, for example, watch out if there is an unexpected and unexplained deviation from a 50-50 split between treatment and control. Replication of research findings is also a good protection against fraud.
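The check for a lopsided split between treatment and control can be sketched in a few lines of Python. This is only an illustration, not a fraud detector: it assumes simple (unrestricted) 1:1 randomization, where each patient is in effect assigned by a fair coin flip, and the group sizes of 120 and 80 are hypothetical. Studies that randomize in blocks will be balanced by design, so the calculation does not apply to them.

```python
from math import comb

def allocation_imbalance_p_value(n_treatment: int, n_control: int) -> float:
    """Exact two-sided binomial p-value for a split under fair 1:1 allocation."""
    n = n_treatment + n_control
    k = min(n_treatment, n_control)
    # Probability that the smaller group is this small or smaller.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # A fair coin is symmetric, so double the tail (and cap the result at 1).
    return min(1.0, 2 * lower_tail)

# Hypothetical trial: 120 patients in treatment, 80 in control.
p = allocation_imbalance_p_value(120, 80)
print(f"two-sided p-value for a 120/80 split: {p:.4f}")
```

A small p-value does not prove that anything improper happened. It only tells you that the split would be unusual under simple randomization, which is a reason to look for an explanation in the paper.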
Example: An interesting deviation from the research plan occurs in a randomized, double-blind, controlled trial of selenium supplements (Clark 1996). The study was initiated in 1983 with basal skin carcinoma and squamous skin carcinoma as the primary end points. The researchers also looked for signs of selenium toxicity. In 1990, funding was obtained to look at additional secondary end points (total mortality, cancer mortality, and incidence of lung, colorectal, and prostate cancers). While it was relatively easy to add extra endpoints in the middle of the study, the authors acknowledged that this represented a deviation from the protocol. Another deviation from the protocol occurred when the study was terminated early (January 1996). No statistically significant changes were found in the primary endpoints, nor was any evidence of selenium toxicity found. Among the secondary endpoints, however, the authors found statistically significant declines in total cancer mortality and lung cancer mortality. The authors also found statistically significant declines in the incidence of prostate cancer, colorectal cancer, lung cancer and total carcinomas. There was also a decline in overall mortality, though it did not achieve statistical significance. There were no significant changes in the incidence of nine other types of cancer, including breast cancer, bladder cancer, and leukemia. Because the significant results occurred in areas that were not originally planned for study, the authors acknowledge that any results have to be considered preliminary. Furthermore, it is unclear what impact the early termination of the study had on the statistics. Early termination of a study can cause serious biases, unless specific rules for early termination are established at the start of the study.
Measuring the outcome well--what to look for
When you are looking at how the outcome was measured, ask yourself the following questions:
- Was the outcome dependent on the memory of the patients?
- Did the outcome have established validity and reliability?
- Were there post hoc changes in the protocol?
3.3 Were the changes clinically important?
Many journal authors have the bad habit of looking just at the p-value of a study and ignoring everything else. It's like there is a switch inside their brain that turns off the moment the p-value is calculated. Statistical significance, as measured by the p-value, is indeed important, but just as important is the clinical significance of the research.
It is difficult for me to talk about clinical importance because I am an outsider when it comes to medicine. I tell a story in my classes about how statisticians may be good with numbers but often have no perspective on their practical or clinical application.* A statistician is driving through the countryside in a beat up old pickup truck. He stops on the road to let a large flock of sheep pass. He calls out to the shepherd from the truck and brags that he can count the number of sheep in the flock to an accuracy of plus or minus five. The shepherd scoffs and offers a bet. If you can count the sheep that accurately, you can take one of the sheep home with you. If you are wrong, I get your pickup truck. The statistician agrees to the bet. After scanning the flock for a few seconds, he says that there are 527 sheep in the flock. The shepherd is dumbfounded. That's amazing, he says. You were only off by one. Come on out and take any sheep you want. So the statistician gets out and claims his prize. Wait, cries the shepherd, I'll bet you double or nothing that I can tell you what your day job is. The statistician thinks this is a safe bet and agrees. The shepherd says, you are obviously a statistician. Now the statistician is dumbfounded. How did you know, he asks. Well, replies the shepherd, put down my sheepdog and I'll explain it to you.
What exactly does clinical importance mean?
The pivotal word here is "clinical." To establish clinical importance, you need to use clinical judgment. I am not a clinician, so I can't exercise clinical judgment. What I can do is get you to ask the right questions about clinical importance.
For a change to be clinically important, it has to be large enough to justify all the added trouble, expense, and inconvenience of changing your clinical practice. You need to assess the size of the benefit relative to the cost of the treatment and the possible harms that might come from side effects.
You should incorporate your patient's values in this calculation, of course. Suppose that a drug has a side effect in that it reduces the fertility potential in the men that take it. For some men, no benefit is large enough if the treatment seriously hampers their ability to father a child. Other men might be indifferent to this side effect, and some might even consider it an added bonus.
David Sackett talks about "particularizing" a research finding. If your patient belongs to a particular subgroup where the disease is more prevalent, or more virulent, or that subgroup is more likely to experience side effects, then you should adjust the research findings to fit the results of that subgroup. The calculations vary from situation to situation, but some good examples of particularizing are Ola 2001 and Glasziou 1998.
There is some data to suggest that doctors and patients do not agree on the balance between benefits of a treatment relative to its costs and possible side effects. For example, when researchers interviewed 72 family physicians and 74 patients with hypertension (McAlister 2000), the patients were less likely to want antihypertensive treatments under conditions where doctors would normally encourage their use.
Not surprisingly, patients may not agree among themselves about clinical importance, nor should they. In a study of patients with atrial fibrillation who might be candidates for warfarin therapy (Howitt 1999), one group of patients felt that warfarin would be worthwhile if their annual risk of stroke was at least 2.4%, while another group demanded a much higher average annual rate (4.1%) before they would adopt warfarin. The former group represented patients who had already adopted warfarin and the latter group represented patients who had refused warfarin treatment. I can't say for sure what level of risk would justify warfarin therapy, of course, but I take some solace in the fact that patients appeared to make choices consistent with their articulated beliefs.
Researchers won't define clinical importance for you
In a perfect world, the researchers would tell you how much of a change is important from a clinical perspective. After all, they are the experts in the area, or they wouldn't be doing the research. Surprisingly, researchers are very reluctant to share this information (Chan 2001). Perhaps they have never thought of the issue in terms of clinical importance before. Perhaps they don't want to impose their values on the readers, or they don't want to commit to a particular viewpoint or perspective. Researchers may be uncomfortable doing this, but they should still offer an opinion. Even if you, the reader, have a different perspective, when the researchers offer up an assessment of what they consider clinically important, it opens up the debate. It gets you thinking along the lines of "is that the sort of difference that I would hope to see, or would I demand to see a larger difference instead?"
Once in a while, you will get a researcher to commit to a discussion of this very question. For example, in a study of an educational intervention intended to reduce the number of prescriptions to a drug that is often prescribed inappropriately (Pimlott 2003), researchers found that primary care physicians randomized to an educational intervention did indeed decrease the number of prescriptions to an inappropriate drug (20.3% before and 19.6% after the intervention) while a control group showed an increase (19.8% before to 20.9% after). Although the change was statistically significant (p=0.036), the researchers admit that the size of the change was so small as to be unimportant from a clinical perspective.
How to establish a level of clinical importance.
Clinical importance represents a value judgment, and the best way to assess values of your patients is to ask them.
Example: In a study comparing two allergy drugs (Hampel 2003), a particular drug was described as being "less drowsy" than the other. What did that really mean? The researchers measured drowsiness on a visual analog scale (VAS). This scale is simply a line that is exactly 10 centimeters long. Patients are asked to mark somewhere on the line how drowsy they feel, with one end of the line representing no drowsiness and the other end representing the maximum possible drowsiness (presumably the maximum drowsiness that you can have and still be awake enough to make a mark on a line). For one drug, the average drowsiness was 3.6 cm at baseline and remained about the same at the end of the study. For the other drug, the average drowsiness declined from 3.6 cm to 3.3 cm. On the basis of a 3 mm shift, the researchers made the claim of less drowsiness. The 3 mm shift was indeed statistically significant, but does such a small shift have any practical value?
I was asked to coauthor an editorial discussing this question (Portnoy 2003). We chose a provocative title "Is 3-mm less drowsiness important?" It turns out that there is no research on this question. The best information that we could come up with was a study that showed how to establish clinical significance for the VAS used in pain measurement (Powell 2001). In that study, children visiting an emergency room were asked to rate their pain on the VAS at 20 minute intervals and also asked to categorize the change from the last time point as either "heaps better," "a bit better," "much the same," "a bit worse," or "heaps worse." The average change in VAS for those patients saying either a bit better or a bit worse was 10 mm.
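The calculation behind this kind of estimate is simple enough to sketch in Python. The data below are hypothetical, made up only to show the arithmetic, and the function name is my own. The idea is to pair each change in the VAS score with the patient's own description of the change, and then average the size of the change among those who reported only a small change.

```python
from statistics import mean

# Hypothetical data: (change in VAS score in mm, patient's own rating).
responses = [
    (12, "a bit better"), (-9, "a bit worse"), (2, "much the same"),
    (31, "heaps better"), (8, "a bit better"), (-11, "a bit worse"),
    (-1, "much the same"), (-27, "heaps worse"),
]

# Ratings that count as a small but noticeable change.
SMALL_CHANGE = {"a bit better", "a bit worse"}

def minimal_important_difference(data):
    """Mean absolute score change among patients reporting a small change."""
    small = [abs(change) for change, rating in data if rating in SMALL_CHANGE]
    return mean(small)

mid_mm = minimal_important_difference(responses)
print(f"minimal important difference: {mid_mm} mm")
```

Changes smaller than this value are ones that a typical patient would probably not notice, which is the sense in which a 3 mm shift looks unimpressive.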
Example: Cancer patients have major problems with fatigue. The only good measure is a self-report, and this can be measured in several different ways:
- Profile of Mood States (POMS), a 65 item scale with a subscale of five items representing fatigue. Each item is rated from 0 to 4.
- Schwartz Cancer Fatigue Scale (SCFS), a 28 item scale with four subscales: physical, emotional, cognitive, and temporal. Each item is rated from 0 to 4.
- General Fatigue Scale (GFS), a ten item scale with no subscales. Each item is rated from 1 to 10.
- A single question "what is your level of fatigue today" with 0 representing "no fatigue" and 10 representing "the greatest possible fatigue."
To establish a minimal level of clinical importance, researchers measured a group of 103 cancer patients before and after initiation of chemotherapy (Schwartz 2002). In addition to getting the four scales, the patients were asked at follow-up whether their fatigue levels had changed and by how much.
If you look at the average change in each scale for those patients who report a small change in fatigue, this represents a minimally important clinical difference. The numbers don't seem to quite match the tables, but the authors suggest that a 5.6 unit shift in POMS, 5.0 for SCFS, 9.7 for GFS, and 2.4 for the single item scale is important. If you divide each of these values by the number of items in the scale, you get values that hover around 1.0 for the first three scales, which is similar to the general recommendation in Guyatt 1998.
Another approach is to get an estimate of the benefits associated with a cure relative to the costs, inconvenience, and other troubles associated with the new treatment. This ratio will provide you with a threshold cure rate that you would demand in order to justify the new treatment. Let's suppose for example, that the benefits of a cure are five times as valuable as the burden imposed by a new treatment. Since the burdens of the treatment are borne by all who adopt the therapy, but the benefits accrue only to that fraction of patients who are actually cured, you should demand that more than one-fifth of your patients achieve a cure in order for the treatment to achieve a level of clinical importance.
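The arithmetic in this argument can be written out as a short Python sketch. The function names are hypothetical, and the benefit and burden are assumed to be expressed in the same (arbitrary) units, so that only their ratio matters.

```python
def threshold_cure_rate(benefit_of_cure: float, burden_of_treatment: float) -> float:
    """Cure rate at which the expected benefit just equals the treatment burden."""
    if benefit_of_cure <= 0:
        raise ValueError("benefit_of_cure must be positive")
    return burden_of_treatment / benefit_of_cure

def is_worthwhile(cure_rate: float, benefit_of_cure: float,
                  burden_of_treatment: float) -> bool:
    # Every patient bears the burden; only the cured fraction gets the benefit.
    return cure_rate * benefit_of_cure > burden_of_treatment

# A cure valued at five times the burden of treatment, as in the text.
threshold = threshold_cure_rate(benefit_of_cure=5, burden_of_treatment=1)
print(f"threshold cure rate: {threshold:.0%}")
print(is_worthwhile(0.30, 5, 1))  # cure rate above the threshold
print(is_worthwhile(0.15, 5, 1))  # cure rate below the threshold
```

The hard part, of course, is not the division but deciding how much a cure is worth relative to the burden, and that is a judgment for you and your patient.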
A more sophisticated argument along the same lines appears in Chapter 7, where the ratio of the number needed to treat to the number needed to harm gives you a perspective on how many side effects must be endured in order to achieve one additional cure.
You can also apply an economic argument to establish clinical importance. For example, you can assess the value of a screening program by the proportion of patients discovered with an otherwise undiagnosed disease. When the proportion is high, the overall cost of screening is spread out over a large number of newly diagnosed patients. A screening program has a clinically trivial impact if the proportion of new cases identified is so small that the cost per diagnosis becomes outrageously expensive.
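Here is a minimal sketch of the screening calculation in Python. The cost of 50 per patient screened and the three yields are hypothetical numbers chosen only to show how quickly the cost per diagnosis grows as the yield shrinks. It ignores follow-up testing and false positives, which a real economic analysis would have to include.

```python
def cost_per_new_diagnosis(cost_per_screen: float, yield_proportion: float) -> float:
    """Screening cost spread over the newly diagnosed patients only."""
    if not 0 < yield_proportion <= 1:
        raise ValueError("yield_proportion must be in (0, 1]")
    return cost_per_screen / yield_proportion

# Hypothetical screening test costing 50 per patient screened.
for yield_proportion in (0.10, 0.01, 0.001):
    cost = cost_per_new_diagnosis(50, yield_proportion)
    print(f"yield {yield_proportion:.1%}: cost per new diagnosis = {cost:,.0f}")
```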
Evaluating negative trials
Establishing a level of clinical importance is especially important for negative trials--trials that fail to achieve statistical significance. You would like some assurance that the trial was negative because a clinically significant change was well outside the range of sampling error. You can look for a confidence interval that is narrow enough to fit entirely inside the range of clinical indifference. You could also look for a justification of the sample size, such as a power calculation.
The problem with a lot of negative trials, though, is that there is too much imprecision in the confidence intervals and no attempt was made prior to the start of the study to justify the sample size. These negative trials are truly uninformative because you can't tell whether the trial is negative because nothing is going on or because the sample was so small that it was effectively impossible to detect important changes.
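The comparison between a confidence interval and the range of clinical indifference can be sketched in Python. The two narrow intervals are the ones from the vaccination case study earlier in this chapter. The range of clinical indifference (a relative risk between 1/1.5 and 1.5) and the wide third interval are assumptions made up for this illustration; you would need clinical judgment to choose a real range.

```python
def classify_negative_trial(ci_low: float, ci_high: float,
                            indiff_low: float, indiff_high: float) -> str:
    """Label a negative trial by where its confidence interval falls."""
    if indiff_low <= ci_low and ci_high <= indiff_high:
        # Every plausible value is clinically unimportant.
        return "informative"
    # The interval still allows a clinically important effect.
    return "uninformative"

# Assumed range of clinical indifference for a relative risk.
INDIFF = (1 / 1.5, 1.5)

trials = {
    "diabetes study (0.88 to 1.28)": (0.88, 1.28),
    "wheezing study (0.81 to 1.37)": (0.81, 1.37),
    "hypothetical small study (0.50 to 2.25)": (0.50, 2.25),
}
for name, (low, high) in trials.items():
    print(name, "->", classify_negative_trial(low, high, *INDIFF))
```

All three intervals include a relative risk of 1, so all three trials are negative, but only the first two let you rule out a large change in risk.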
How often does this happen? More often than you'd like to think. Recall the review of 2,000 schizophrenia trials, where only 3% of the studies had a reasonable sample size.
Evaluating equivalence and non-inferiority trials
Certain studies strive for a "negative" result. These trials, called equivalence trials, try to demonstrate that a new drug or treatment is comparable to a standard drug or treatment. For example, before the U.S. Food and Drug Administration will approve a generic equivalent for a name brand drug, they require that the generic manufacturer show that the rate and extent of absorption for the generic drug is not much greater (usually not more than 125%) or not much less (usually not less than 80%) than for the name brand drug. This is usually easy to show. In some cases, though, this agency will demand a greater degree of evidence by asking that the generic drug manufacturer show equivalence in the therapeutic benefits of the generic drug.
The goal of an equivalence study is not to show that two drugs are identical. That would be impossible to show. Instead, you want to show that the difference between the two drugs is no larger than a specified amount.
You should pay extra close attention to the conduct of the research in an equivalence trial. Researchers who are trying to demonstrate that two drugs are equivalent have a built in incentive to conduct the research haphazardly. The researchers may study patients that were not very sick to begin with, or they may not aggressively work to ensure that patients take their drugs regularly, or they may get a bit sloppy in evaluating the outcome. These problems tend to dilute the differences between the two drugs, making it easier to show that they are equivalent.
There are several approaches that work well when you are trying to show equivalence. The simplest is to compute a confidence interval for the difference between the two drugs and see whether it lies entirely inside the range of clinical indifference. Another effective approach is to conduct two tests. If the first test rejects the hypothesis that Drug A is inferior by a certain margin to Drug B and the second test rejects the hypothesis that Drug B is inferior by the same margin to Drug A, then you have sufficient evidence of equivalence.
You might be tempted to set up a null hypothesis that the two drugs have the same average effects and, when you fail to reject that hypothesis, conclude that the two drugs are equivalent. This approach won't work because you can't be sure that the failure to reject the null hypothesis wasn't due to an insufficient sample size.
A similar type of trial, the non-inferiority trial, attempts to show that a new drug is not worse by a specified amount than the standard drug (Snapinn 2000). You might be interested in non-inferiority when the new drug is cheaper, more readily tolerated, or has fewer side effects than the standard drug. For such a drug, you would readily adopt it over the standard drug unless you knew that the new drug was much less effective. So you set a non-inferiority margin, and try to assure yourself that the new drug is well within the non-inferiority margin.
As with the equivalence trial, small details about how the trial was conducted can dilute the differences between two drugs, making it easier to show non-inferiority.
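The two-test approach can be sketched in Python as a pair of one-sided z-tests. This is only a sketch under stated assumptions: the estimated difference is treated as approximately normal with a known standard error, the function name is my own, and the numbers (a difference of 1.0, a margin of 5.0, and standard errors of 1.5 and 4.0) are hypothetical.

```python
from statistics import NormalDist

def two_one_sided_tests(diff: float, se: float, margin: float,
                        alpha: float = 0.05):
    """Two one-sided z-tests for equivalence within +/- margin."""
    z = NormalDist()
    # Test 1: reject "A is worse than B by the margin or more".
    p_lower = 1 - z.cdf((diff + margin) / se)
    # Test 2: reject "A is better than B by the margin or more".
    p_upper = z.cdf((diff - margin) / se)
    # Equivalence is claimed only if both hypotheses are rejected.
    return p_lower, p_upper, max(p_lower, p_upper) < alpha

# Same observed difference, two different levels of precision.
for se in (1.5, 4.0):
    p_lower, p_upper, equivalent = two_one_sided_tests(diff=1.0, se=se, margin=5.0)
    print(f"se={se}: p_lower={p_lower:.4f}, p_upper={p_upper:.4f}, "
          f"equivalent={equivalent}")
```

In both cases an ordinary test of "no difference" would fail to reject, because the difference of 1.0 is small relative to either standard error. Only the more precise study, though, can claim equivalence, which is the reason the ordinary test is the wrong tool here. Running both one-sided tests at the 5% level gives the same answer as checking whether a 90% confidence interval lies entirely inside the margin.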
Counterpoint: blinding is not all it is cracked up to be.
There's a strong belief that a study has to be blinded in order to be credible. Some meta-analyses will not include unblinded studies in their summaries in the belief that their quality is too poor (see Busse 2002 and Cooper 2003, for example).
Blinding is just one of many factors that combine to indicate a study's rigor and quality. Although unblinded studies are considered less authoritative than blinded studies, you should not use blinding by itself as a surrogate marker for the quality of the research (Schulz 2002). For example, Rupert Sheldrake conducted a survey of various journals and showed that blinding was used in 85% of all parapsychology research. But it would be a mistake to claim, as Dr. Sheldrake does, that "Parapsychologists ... have been constantly subjected to intense scrutiny by skeptics, and this has made them more rigorous" (http://www.parascope.com/en/articles/blindScience.htm).
There are some situations where blinding is impossible. I commented in one article that if one of the treatments in a research study is a bilateral orchiectomy, you can't blind the study. Sooner or later, your patients are going to notice that something is missing.
Blinding is often achieved through the use of a placebo, but sometimes the price you pay with a placebo is too great to tolerate. In a study of Parkinson's disease (Freed 2002), patients in the treatment group received a transplant of nerve cells injected directly into their brains through two holes drilled into their skulls. The control group received a placebo surgery. Holes were drilled into their skulls also, but no cells were injected. This study was met with a storm of criticism. One of the harsher criticisms (Weijer 2002) had the provocative title "I need a placebo like I need a hole in the head."
Besides, a recent study showed that the benefits of blinding through the use of a placebo might be overstated in some contexts (Hrobjartsson 2001). This study compared research studies which had a treatment arm, a placebo arm, and a no treatment arm. The only difference between the placebo and the no treatment arm is that the latter is unblinded. These researchers found that with a few exceptions (most notably studies involving pain assessment), there was not a big difference between the placebo arm and the no treatment arm. So maybe all the fuss about placebos and blinding is overrated. Some of the effects attributed to the placebo are perhaps caused instead by statistical artefacts like regression to the mean or by the tendency of some conditions to resolve spontaneously.
So is blinding really necessary? It is nice to have, but not at the expense of your ethical principles.
On your own
1. Review the following abstracts and identify one or more surrogate outcomes. Specify a patient oriented outcome that might be related to each surrogate outcome.
Effects of disease modifying agents and dietary intervention on insulin resistance and dyslipidemia in inflammatory arthritis: a pilot study. Patrick H Dessein, Barry I Joffe and Anne E Stanwix. Arthritis Res 2002, 4:R12 doi:10.1186/ar597. Abstract Patients with rheumatoid arthritis (RA) experience excess cardiovascular disease (CVD). We investigated the effects of disease-modifying antirheumatic drugs (DMARD) and dietary intervention on CVD risk in inflammatory arthritis. Twenty-two patients (17 women; 15 with RA and seven with spondyloarthropathy) who were insulin resistant (n = 20), as determined by the Homeostasis Model Assessment, and/or were dyslipidemic (n = 11) were identified. During the third month after initiation of DMARD therapy, body weight, C-reactive protein (CRP), insulin resistance, and lipids were re-evaluated. Results are expressed as median (interquartile range). DMARD therapy together with dietary intervention was associated with weight loss of 4 kg (0–6.5 kg), a decrease in CRP of 14% (6–36%; P < 0.006), and a reduction in insulin resistance of 36% (26–61%; P < 0.006). Diet compliers (n = 15) experienced decreases of 10% (0–20%) and 3% (0–9%) in total and low-density lipoprotein cholesterol, respectively, as compared with increases of 9% (6–20%; P < 0.05) and 3% (0–9%; P < 0.05) in diet noncompliers. Patients on methotrexate (n = 14) experienced a reduction in CRP of 27 mg/l (6–83 mg/l), as compared with a decrease of 10 mg/l (3.4–13 mg/l; P = 0.04) in patients not on methotrexate. Improved cardiovascular risk with DMARD therapy includes a reduction in insulin resistance. Methotrexate use in RA may improve CVD risk through a marked suppression of the acute phase response. Dietary intervention prevented the increase in total and low-density lipoprotein cholesterol upon acute phase response suppression.
This is an open source publication. The full free text is available at arthritis-research.com/4/6/R12
Substituting abacavir for hyperlipidemia-associated protease inhibitors in HAART regimens improves fasting lipid profiles, maintains virologic suppression, and simplifies treatment Philip H Keiser, Michael G Sension, Edwin DeJesus, Allan Rodriguez, Jeffrey F Olliffe, Vanessa C Williams, John H Wakeford , Jerry W Snidow, Anne D Shachoy-Clark, Julie W Fleming, Gary E Pakes, Jaime E Hernandez and for the ESS40003 Study Team. BMC Infectious Diseases 2005, 5:2 doi:10.1186/1471-2334-5-2. Background Hyperlipidemia secondary to protease inhibitors (PI) may abate by switching to anti-HIV medications without lipid effects. Method An open-label, randomized pilot study compared changes in fasting lipids and HIV-1 RNA in 104 HIV-infected adults with PI-associated hyperlipidemia (fasting serum total cholesterol >200 mg/dL) who were randomized either to a regimen in which their PI was replaced by abacavir 300 mg twice daily (n = 52) or a regimen in which their PI was continued (n = 52) for 28 weeks. All patients had undetectable viral loads (HIV-1 RNA <50 copies/mL) at baseline and were naïve to abacavir and non-nucleoside reverse transcriptase inhibitors. Results At baseline, the mean total cholesterol was 243 mg/dL, low density lipoprotein (LDL)-cholesterol 149 mg/dL, high density lipoprotein (HDL)-cholesterol 41 mg/dL, and triglycerides 310 mg/dL. Mean CD4+ cell counts were 551 and 531 cells/mm3 in the abacavir-switch and PI-continuation arms, respectively. At week 28, the abacavir-switch arm had significantly greater least square mean reduction from baseline in total cholesterol (-42 vs -10 mg/dL, P < 0.001), LDL-cholesterol (-14 vs +5 mg/dL, P = 0.016), and triglycerides (-134 vs -36 mg/dL, P = 0.019) than the PI-continuation arm, with no differences in HDL-cholesterol (+0.2 vs +1.3 mg/dL, P = 0.583). 
A higher proportion of patients in the abacavir-switch arm had decreases in protocol-defined total cholesterol and triglyceride toxicity grades, whereas a smaller proportion had increases in these toxicity grades. At week 28, an intent-to treat: missing = failure analysis showed that the abacavir-switch and PI-continuation arms did not differ significantly with respect to proportion of patients maintaining HIV-1 RNA <400 or <50 copies/mL or adjusted mean change from baseline in CD4+ cell count. Two possible abacavir-related hypersensitivity reactions were reported. No significant changes in glucose, insulin, insulin resistance, C-peptide, or waist-to-hip ratios were observed in either treatment arm, nor were differences in these parameters noted between treatments. Conclusion In hyperlipidemic, antiretroviral-experienced patients with HIV-1 RNA levels <50 copies/mL and CD4+ cell counts >500 cells/mm3, substituting abacavir for hyperlipidemia-associated PIs in combination antiretroviral regimens improves lipid profiles and maintains virologic suppression over a 28-week period, and it simplifies treatment.
This is an open source publication. The full free text is available at www.biomedcentral.com/1471-2334/5/2.
2. Review the following abstracts and identify the total number of outcome variables. Can you identify one or two outcome measures that should be considered of primary importance?
Effect of reproductive factors on stage, grade and hormone receptor status in early-onset breast cancer. Joan A Largent , Argyrios Ziogas and Hoda Anton-Culver. Breast Cancer Research 2005, 7:R541-R554 doi:10.1186/bcr1198. Introduction Women younger than 35 years who are diagnosed with breast cancer tend to have more advanced stage tumors and poorer prognoses than do older women. Pregnancy is associated with elevated exposure to estrogen, which may influence the progression of breast cancer in young women. The objective of the present study was to examine the relationship between reproductive events and tumor stage, grade, estrogen receptor and progesterone receptor status, and survival in women diagnosed with early-onset breast cancer. Methods In a population-based, case–case study of 254 women diagnosed with invasive breast cancer at age under 35 years, odds ratios (ORs) and 95% confidence intervals (CIs) were estimated using unconditional logistic regression with tumor characteristics as dependent variables and adjusting for age and education. Survival analyses also examined the relationship between reproductive events and overall survival. Results Compared with nulliparous women, women with three or more childbirths were more likely to be diagnosed with nonlocalized tumors (OR = 3.1, 95% CI = 1.3–7.7), and early age (<20 years) at first full-term pregnancy was also associated with a diagnosis of breast cancer that was nonlocalized (OR = 3.0, 95% CI = 1.2–7.4) and of higher grade (OR = 3.2, 95% CI 1.0–9.9). The hazard ratio for death among women with two or more full-term pregnancies, as compared with those with one full-term pregnancy or none, was 2.1 (95% CI = 1.0–4.5), adjusting for stage. Among parous women, those who lactated were at decreased risk for both estrogen receptor and progesterone receptor negative tumors (OR = 0.2, 95% CI = 0.1–0.5, and OR = 0.4, 95% CI = 0.2–0.8, respectively). 
Conclusion The results of the present study suggest that pregnancy and lactation may influence tumor presentation and survival in women with early-onset breast cancer.
This is an open source publication. The full free text is available at breast-cancer-research.com/content/7/4/R541.
Quality of life, functional outcome, and voice handicap index in partial laryngectomy patients for early glottic cancer. Tolga Kandogan and Aylin Sanal. BMC Ear, Nose and Throat Disorders 2005, 5:3 doi:10.1186/1472-6815-5-3.
Background In this study, we aim to gather information about the quality of life issues, functional outcomes and voice problems facing early glottic cancer patients treated with the surgical techniques such as laryngofissure cordectomy, fronto-lateral laryngectomy, or cricohyoidopexi. In particular, consistency of life and voice quality issues with the laryngeal tissue excised during surgery is examined. In addition, the effects of arytenoidectomy to the life and voice quality are also studied.
Methods 29 male patients were enrolled voluntarily in the study. The average age was 53.9 years. Three out of 10 patients with laryngofissure cordectomy also had arytenoidectomy. 11 patients had fronto-lateral laryngectomy with Tucker reconstruction, two of which also had arytenoidectomy. There were eight patients with cricohyoidopexi and bilateral functional neck dissection. Three of these patients also had arytenoidectomy. In bilateral functional neck dissection cases, spinal accessory nerve was preserved and level V of the neck was not dissected. None of the patients had neither radiotherapy nor voice therapy. Cordectomy patients never had a temporary tracheotomy or were connected to a feeding tube. Data was collected for 13 months for the cordectomy group, 14 months for fronto-lateral laryngectomy and cricohyoidopexi groups on average post-operatively. Statistical analysis in this study was carried out using the one-way analysis of variance, and the Post-Hoc group comparisons were made after Bonferroni and Scheffé-procedures. In order to determine the effects of arytenoidectomy, a regression analysis is carried out to see if there are statistical differences in answers given to the survey questions among patients who were arytenoidectomized during their surgeries.
Results There was a statistically significant difference between cordectomy and cricohyoidopexi group in answers to the University of Washington- Quality of Life- Revised survey part 1. (p = 0). A statistically significant difference was also established between cordectomy and fronto-lateral laryngectomy groups, as well as between cordectomy and cricohyoidopexi groups in answers to the University of Washington- Quality of Life- Revised survey part 2. (p = 0.036 and p = 0.009, respectively). Cricohyoidopexi group has given the lowest scores and the cordectomy group has given the highest scores in three survey questions representing the quality of life, performances and new voices. These ranges are also consistent with the laryngeal tissue excised during surgery (cricohyoidopexi > fronto-lateral laryngectomy > cordectomy). There was no statistically significant difference between groups in Performance Status Scale for Head and Neck cancer patients instrument. The difference between the Voice Handicap Index and Voice Handicap Index (functional); Voice Handicap Index (physical) and Voice Handicap Index (emotional) scores in three patient groups was not significant either. All of the patients evaluated that their new voices have similar functional, physical and emotional impact on their life. Decanulation and oral feeding times of cricohyoidopexi and fronto-lateral laryngectomy patients are found to be significantly longer than cordectomy patients. Lastly, the removal of arytenoid does not have any significant adverse effects on the quality of life, the functional outcomes, or the quality of voice.
Conclusion In the present study, all patients with early glottic cancer, treated with different surgical technics reported fairly good quality of life outcomes, functional results and voice qualities. This study also finds that the removal of arytenoid does not have any adverse effects on the quality of life and voice from the patients' point of view.
This is an open source publication. The full free text is available at www.biomedcentral.com/1472-6815/5/3.
3. Review the following abstracts. They describe studies where no blinding was done. How critical is the lack of blinding in these studies? What forms of partial blinding could the researchers have attempted?
Measurement of tracheal temperature is not a reliable index of total respiratory heat loss in mechanically ventilated patients. Critical Care 2001, 5:24-30 doi:10.1186/cc974
Background Minimizing total respiratory heat loss is an important goal during mechanical ventilation. The aim of the present study was to evaluate whether changes in tracheal temperature (a clinical parameter that is easy to measure) are reliable indices of total respiratory heat loss in mechanically ventilated patients.
Method Total respiratory heat loss was measured, with three different methods of inspired gas conditioning, in 10 sedated patients. The study was randomized and of a crossover design. Each patient was ventilated for three consecutive 24-h periods with a heated humidifier (HH), a hydrophobic heat-moisture exchanger (HME) and a hygroscopic HME. Total respiratory heat loss and tracheal temperature were simultaneously obtained in each patient. Measurements were obtained during each 24-h study period after 45 min, and 6 and 24 h.
Results Total respiratory heat loss varied from 51 to 52 cal/min with the HH, from 100 to 108 cal/min with the hydrophobic HME, and from 92 to 102 cal/min with the hygroscopic HME (P < 0.01). Simultaneous measurements of maximal tracheal temperatures revealed no significant differences between the HH (35.7-35.9°C) and either HME (hydrophobic 35.3-35.4°C, hygroscopic 36.2-36.3°C).
Conclusion In intensive care unit (ICU) mechanically ventilated patients, total respiratory heat loss was twice as much with either hydrophobic or hygroscopic HME than with the HH. This suggests that a much greater amount of heat was extracted from the respiratory tract by the HMEs than by the HH. Tracheal temperature, although simple to measure in ICU patients, does not appear to be a reliable estimate of total respiratory heat loss.
This is an open source publication. The full free text is available at ccforum.com/content/5/1/024.
Reexamining age, race, site, and thermometer type as variables affecting temperature measurement in adults – A comparison study. Linda S Smith. BMC Nursing 2003, 2:1 doi:10.1186/1472-6955-2-1
Background As a result of the recent international vigilance regarding disease assessment, accurate measurement of body temperature has become increasingly important. Yet, trusted low-tech, portable mercury glass thermometers are no longer available. Thus, comparing accuracy of mercury-free thermometers with mercury devices is essential. Study purposes were 1) to examine age, race, site as variables affecting temperature measurement in adults, and 2) to compare clinical accuracy of low-tech Galinstan-in-glass device to mercury-in-glass at oral, axillary, groin, and rectal sites in adults.
Methods Setting 176 bed accredited healthcare facility, rural northwest US. Participants Convenience sample (N = 120) of hospitalized persons ≥ 18 years old. Instruments Temperatures (°F) measured at oral, skin (simultaneous), immediately followed by rectal sites with four each mercury-glass (BD) and Galinstan-glass (Geratherm) thermometers; 10 minute dwell times.
Results Participants averaged 61.6 years (SD 17.9), 188 pounds (SD 55.3); 61% female; race: 85% White, 8.3% Native Am., 4.2% Hispanic, 1.7% Asian, 0.8% Black. For both mercury and Galinstan-glass thermometers, within-subject temperature readings were highest rectally; followed by oral, then skin sites. Galinstan assessments demonstrated rectal sites 0.91°F > oral and 1.3°F > skin sites. Devices strongly correlated between and across sites. Site difference scores between devices showed greatest variability at skin sites; least at rectal site. 95% confidence intervals of difference scores by site (°F): oral (0.142 – 0.265), axilla (0.167 – 0.339), groin (0.037 – 0.321), and rectal (-0.111 – 0.111). Race correlated with age, temperature readings each site and device.
Conclusion Temperature readings varied by age, race. Mercury readings correlated with Galinstan thermometer readings at all sites. Site mean differences between devices were considered clinically insignificant. Still considered the gold standard, mercury-glass thermometers may no longer be available worldwide. Therefore, mercury-free, environmentally safe low-tech Galinstan-in-glass may be an appropriate replacement. This is especially important as we face new, internationally transmitted diseases.
This is an open source publication. The full free text is available at www.biomedcentral.com/1472-6955/2/1.
3.4 Summary - Mountain or molehill?
Look carefully at how the researchers measured the outcome in their study.
Did they measure the right thing? You would like to see an outcome of direct interest to your patients.
Did they measure it well? You want an outcome that is valid and reliable and not subject to changes after the start of data collection.
Were the changes clinically important? You want a change that is large enough to have a practical impact in a clinical setting.
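If you want to apply this last check mechanically, a few lines of code are enough. Here is a minimal Python sketch, not a definitive method, that compares a confidence interval for a relative risk against a threshold you consider clinically important. The function name `assess_relative_risk`, the `Verdict` tuple, and the default threshold of 1.5 (a 50% increase in risk, the figure used in the vaccination case study) are my own choices for illustration, and the third interval in the example is hypothetical.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    statistically_significant: bool      # interval excludes a relative risk of 1.0
    rules_out_important_increase: bool   # whole interval lies below the threshold


def assess_relative_risk(lower: float, upper: float,
                         important: float = 1.5) -> Verdict:
    """Compare a confidence interval for a relative risk with a threshold.

    `important` is an assumed clinically important relative risk; the
    default of 1.5 is illustrative, not a universal standard.
    """
    if not 0 < lower <= upper:
        raise ValueError("expected 0 < lower <= upper")
    significant = lower > 1.0 or upper < 1.0
    rules_out = upper < important
    return Verdict(significant, rules_out)


# Intervals from the vaccination case study earlier in this chapter.
wheezing = assess_relative_risk(0.81, 1.37)
diabetes = assess_relative_risk(0.88, 1.28)

# A hypothetical wide interval, invented purely for contrast.
wide = assess_relative_risk(0.70, 1.90)

for label, verdict in [("wheezing", wheezing),
                       ("diabetes", diabetes),
                       ("hypothetical", wide)]:
    print(label, verdict)
```

Both published intervals rule out a 50% increase in risk, while the hypothetical one does not, even though none of the three is statistically significant. The threshold itself is a clinical judgment that you have to supply; the code only does the comparison.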
* Again, I can't take credit for this one. There are various forms of this joke on the web, such as the one at http://www.bordercollierescue.org/breed_advice/WorkingSheepdog.html
This webpage was written by Steve Simon on (unknown date), edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Statistical evidence