Stats #31: How to Read a Medical Journal Article (Obsolete)
The class "How to Read a Medical Journal Article" is obsolete. I no longer offer this class. In its place, I am offering:
- Stats #32: Statistical Evidence. Apples to Oranges. Choice of the control group in research studies;
- Stats #33: Statistical Evidence. Who Was Left Out? Exclusions and dropouts in research studies;
- Stats #34: Statistical Evidence. Mountain or Molehill? Clinical significance in research studies
These three (3) classes expand upon the material and provide more opportunities for practice exercises with real journal articles.
I am keeping this web page up for historical and reference purposes.
This three-hour training class will give you a general introduction to reading medical journal articles. The medical journals are filled with research on new medical therapies. What should you look for in this research? How do you gauge the strength of evidence? When should you change your medical practices? The answers lie not in how the research data was analyzed but in how it was collected. Simple factors like how the research subjects were recruited determine the strength of evidence in a research paper. When you are reading a journal article, just ask yourself five simple questions: Who did the choosing?; Was there a plan?; Who knew what when?; Who was left out?; and How much did things change?
In this presentation, you will learn how to:
- assess the strength of evidence in a journal article;
- identify potential problems with observational studies;
- explain why "blinding" is important;
- describe the problems caused by drop-outs.
This class does not qualify for IRB Education Credits (IRBECs).
Please bring a copy of a research paper with you to class. The paper should compare two or more groups of patients and it should have some direct numerical measurements in it. If you have difficulty finding a good example, I will provide some interesting journal articles for you to use.
Some interesting examples:
A Close Look at Therapeutic Touch. L. Rosa, E. Rosa, L. Sarner, S. Barrett. JAMA 1998: 279(13); 1005-10. [PDF]
Measles, Mumps, and Rubella Vaccination and Bowel Problems or Developmental Regression in Children with Autism: Population Study. B. Taylor, E. Miller, R. Lingam, N. Andrews, A. Simmons, J. Stowe. BMJ 2002: 324(7334); 393-6. [PDF]
Obstetric Care and Proneness of Offspring to Suicide as Adults: Case-Control Study. Bertil Jacobson, Marc Bygdeman. British Medical Journal 1998: 317(7169); 1346-1349. [PDF]
Midline Episiotomy and Anal Incontinence: Retrospective Cohort Study. Lisa B Signorello, Bernard L Harlow, Amy K Chekos, John T Repke. British Medical Journal 2000: 320(7227); 86-90. [PDF]
Postmarketing Surveillance Study of a Non-Chlorofluorocarbon Inhaler According to the Safety Assessment of Marketed Medicines Guidelines. J. G. Ayres, C. D. Frost, W. F. Holmes, D. R. Williams, S. M. Ward. British Medical Journal 1998: 317(7163); 926-30. [PDF]
A Comparison of Active and Simulated Chiropractic Manipulation as Adjunctive Treatment for Childhood Asthma. J. Balon, P. D. Aker, E. R. Crowther, C. Danielson, P. G. Cox, D. O'Shaughnessy, C. Walker, C. H. Goldsmith, E. Duku, M. R. Sears. New England Journal of Medicine 1998: 339(15); 1013-20. [PDF]
Contents
- Overview of the STATS web pages
- Consulting services that I provide
- What's wrong with medical research
- How to read a medical journal. Introduction.
- Who did the choosing?
- Was there a plan?
- Who knew what when?
- Who was left out?
- How much did things change?
- Please fill out an evaluation form
Overview of the STATS web pages (January 21, 2000)
What are the STATS web pages?
The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.
Where can I find STATS?
If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,
which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.
Some of the fun stuff you can find on the STATS web pages.
Ask Professor Mean. For the tough Statistics questions that Dear Abby won't touch.
Planning Your Research Study. Things you need to plan for before you start collecting your data.
Selecting An Appropriate Sample Size. How much data do you really need?
Managing Your Research Data. Everything you want to know before you step to the keyboard.
Steps In a Typical Data Analysis. I have my data on the computer. Now what?
How to Read a Medical Journal Article. Reading a journal is hard work. Here's some help.
Professor Mean's Library. Good books and good web sites about Statistics.
... and even more good stuff!!!
This webpage was written, edited by Linda Foland, and was last modified on 07/08/2008. Category: Website details
For CMH employees only: Statistical Consulting Services.
You can get free statistical consulting if you work for Children's Mercy Hospital. Ashley Sherman provides a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. I also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.
Here are some examples of the services that we have provided:
- setting up your research hypothesis,
- selecting and justifying your sample size,
- writing the statistical methods section for your grant,
- preparing randomization tables for your study,
- reviewing your surveys for content and quality,
- developing a system for entering your data,
- choosing an appropriate statistical model for your data,
- establishing validity and/or reliability for your measurement scales,
- checking for violations of statistical assumptions in your data,
- producing graphs and tables for your research publication, and
- providing references for new and unusual statistical methods.
Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.
This webpage was written on 2003-04-30 and was last modified on 2008-07-08. Category: Professional details
Directions to my new office (April 25, 2008).
I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.
- Take the elevator of the research tower down to the yellow level. Exit the employee parking garage on 23rd Street, walk to Kenwood and cross 23rd Street. Your destination is Building M 3, which is the building closest to 22nd Street. However, the entrance to our building faces Building M 2. It's best to walk into the parking area that is just north of Building M 1 and follow the sidewalk around the west side of Building M 2 in order to get to our building's entrance on its south side. Another route would be to exit the Hospital Hill Center Building on Holmes and then walk ½ block north to 23rd Street, cross 23rd Street, walk west to Kenwood, then north to Building M 3 (address: 2220 Kenwood).
2008-07-14. Category: Professional details
What's Wrong with Medical Research?
This is a draft version of a speech I gave at the May 4, 2000 meeting of Bluejacket Toastmasters.
Introduction
"I can't believe all the junk that gets published in top-notch medical journals!" That's a comment I hear all the time. I work with some prominent researchers at Children's Mercy Hospital and they don't trust a lot of the stuff that is published in the Journal of the American Medical Association, Lancet, the New England Journal of Medicine, and other prominent medical journals.
That's just an opinion, of course. As a statistician, I always prefer cold hard data to opinions. In this case, though, the cold hard data backs up what my colleagues are saying.
Thornley and Adams Overview
One piece of data about medical research is an article by Ben Thornley and Clive Adams that appeared in the British Medical Journal in October of 1998. The title of this article is "Content and Quality of 2000 Controlled Trials in Schizophrenia over 50 Years." Thornley and Adams work for the Cochrane Collaboration Group, a group that provides systematic reviews of the results of medical trials, so they are in a good position to write such an article.
Thornley and Adams actually found over 2,500 studies, but decided to summarize only the first 2,000 they uncovered. Only the first 2,000. I still am very impressed at the amount of work this must have taken.
The research covered fifty years, from 1948 through 1997. It covered a variety of therapies: drug therapies, psychotherapy, policy or care packages, and physical interventions like electroconvulsive therapy.
What did Thornley and Adams find? It wasn't a pretty picture. For the most part, researchers in schizophrenia studied the wrong patients, they didn't study enough of them, they didn't study their patients long enough, and they didn't measure them properly. Let's look at each of these assertions in turn.
They studied the wrong patients.
First, the researchers studied the wrong patients. Only 14% of the studies of schizophrenia are community based, but that's where most patients are treated. Community based studies are hard to design, because you have so much less control than you would with a pool of hospitalized or institutionalized patients. But research in a hospital does not extrapolate well to other settings. It's like the story of a man who lost his wallet at night in a dark alley, but was searching for it out in the street underneath a street lamp. Why was he searching in the street, you might ask? Because the light is better out here.
They didn't study enough patients.
Second, the researchers didn't study enough patients. The precision of the research studies depends on the number of patients studied. Thornley and Adams showed that a typical study of schizophrenia would require a total of 300 patients to achieve adequate precision. What was the average number of patients studied in these 2,000 trials? 65. The researchers needed 300 subjects, but on average got only 65. Only 3% of the researchers met Thornley and Adams' goal of 300 patients.
They did not study these patients long enough.
Third, the researchers did not study these patients long enough. According to Thornley and Adams, a good research trial should last at least six months. More than half of the trials lasted six weeks or less. Clearly, it costs less to run a short trial than a long trial. But short term improvements in any medical study are much less important than long term changes.
They did not measure their patients properly.
Finally, the researchers did not measure their patients properly. Measuring improvement in a schizophrenic patient is indeed difficult, but the researchers blew it badly here. In 2,000 trials, the researchers developed and used 640 measures of these patients. This shows a serious lack of standards in schizophrenia research. The use of 640 different measurements shows there is no consensus on how to measure the severity of schizophrenia or how much a patient might have improved under a certain therapy. This makes it very difficult to compare results across different studies. You might say that the research about schizophrenia is schizophrenic.
Conclusion
The problems that Thornley and Adams raise are not limited to schizophrenia or even psychiatry. There is plenty of objective evidence in many other medical and scientific areas that the reports published in research journals have many of the same problems.
It's hard to do good medical research, especially in an area like psychiatry. But as Thornley and Adams have shown, we could be doing a lot better.
- We study the wrong patients. We grab the easy to get hospitalized and institutionalized patients when we should be doing more studies in the community.
- We don't study enough patients. A good study would require 300 patients, but the average study only had 65.
- We don't study patients long enough. A good study would require six months of monitoring, but more than half of the studies lasted only six weeks.
- And we don't measure things well. There were 640 different ways of assessing how the patients in these studies were doing.
So the next time you hear news reports about the latest research findings, be sure to be a little bit skeptical. Just because it appears in the New England Journal of Medicine doesn't mean that it is true.
This webpage was written on 2000-05-04, edited by Linda Foland, and was last modified on 2008-07-08. This page needs minor revisions. Category: Statistical evidence
How to read a medical journal article (November 2001 version).
"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories." Sherlock Holmes in The Adventure of Wisteria Lodge.
The medical journals are filled with research on new medical therapies. What should you look for in this research? How do you gauge the strength of evidence? When should you change your medical practices?
The answers lie not in how the research data was analyzed but in how it was collected. Simple factors like how the research subjects were recruited determine the strength of evidence in a research paper. When you are reading a journal article, just ask yourself five simple questions: Who did the choosing?; Was there a plan?; Who knew what when?; Who was left out?; and How much did things change?
Important Disclaimer.
This presentation will review several published journal articles. The intent is to gauge how much evidence each article presents in favor of the efficacy of a new therapy. Some articles will provide a greater level of evidence and some will provide a lesser level of evidence. But articles which provide lesser levels of evidence are still valuable and important.
Nothing stated in this presentation about a particular journal article should be construed as a statement about the quality of that article. The very nature of research requires a series of steps from very preliminary and speculative levels of evidence to more definitive levels of evidence.
Furthermore, when I point out limitations in the evidence presented in a journal article, more often than not, the authors of the article delineate these same limitations in their discussion. But in general, you need to be aware of these limitations because not every journal author is going to be open and honest about the limitations of their research.
Here are five questions you should ask yourself when reading a journal article.
1. Who did the choosing?
2. Was there a plan?
3. Who knew what when?
4. Who was left out?
5. How much did things change?
The first five chapters of this presentation will discuss each of these questions in detail. There are two additional chapters.
7. Special guidelines for meta-analysis.
8. A resource list.
Chapter 1: Was there a good comparison group?
Introduction
Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left handed people die at an earlier age than right handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?
When you make such a comparison between an exposure/treatment group and a control group, you want it to be a fair comparison. You want the control group to be identical to the exposure/treatment group in all respects, except for the exposure/treatment in question. You want an apples to apples comparison.
To ensure that the researchers made an apples to apples comparison, ask the following three questions:
1.1 Did the authors use randomization?
1.2 Did the authors use matching?
1.3 Did the authors use statistical adjustments?
Vitamin C and Cancer
Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples to oranges comparison. Cameron and Pauling published an observational study of Vitamin C as a treatment for advanced cancer. For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).
Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital."
Ten years later, the Mayo Clinic conducted a randomized experiment which showed no statistically significant effect of Vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?
The first limitation of the Cameron and Pauling study was that all of their patients received Vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.
But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples versus oranges comparison. It doesn't matter how bad the prognosis was for a patient diagnosed with terminal cancer; it can't be as bad as the prognosis of a patient who has a death certificate.
Surgical trial without controls
There's another story, unfortunately fictional, which also highlights the importance of a good comparison group.
A prominent surgeon came to give a special lecture at the School of Medicine. He expounded about the great advance that he had made in a specific surgical procedure. At the end of the lecture he drew thunderous applause from the audience.
At first it seemed like there would be no questions, but then a young student in the front row raised her hand. "Did you use any controls?" she asked.
The surgeon seemed to be offended by this question. "Controls?" he asked. "Are you suggesting that I should have denied my surgical advance to half of my patients?"
The rest of the audience grew very quiet. But the young woman was not intimidated. "Yes," she said, "that's exactly what I meant."
The surgeon grew even angrier at this, slammed his fist on the podium and shouted "Why that would have condemned half of my patients to certain death!"
There was silence for a few seconds. Then the entire auditorium burst out in laughter when the young woman asked "Which half?"
Covariate imbalance
If you want to judge how effective a new therapy is, you need a comparison group. The comparison group would be a group of subjects who receive either the standard therapy or, in some cases, no therapy (e.g., a placebo comparison).
The ideal comparison group should be similar in all respects to the new therapy group except for the therapy itself. For example, the two groups should have a similar range of ages and weights and should be composed of roughly the same proportions in gender and race/ethnicity. The groups should be evaluated concurrently.
Sometimes the groups are dissimilar on some important characteristics. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.
In a yet to be published research study here at Children's Mercy Hospital, pre-term infants were randomized either to a group that received normal bottle feeding while they were in the hospital or to a nasogastric (ng) tube feeding group. The researchers wanted to see if the latter group of infants, because they had not become habituated to bottle feeding, would be more likely to breastfeed after discharge from the hospital.
The randomization was only partially effective at preventing covariate imbalance. The infants had comparable birth weights, gestational ages, and Apgar scores. There were similar proportions of caesarian section and vaginal births in both groups. But the mothers in the ng tube group were older on average than the mothers in the bottle fed group.
Since older mothers are more likely to breast feed than younger mothers, we had to include mother's age in an analysis of covariance model so that the effect of ng tube feeding could be estimated independent of mother's age.
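As a rough illustration of what such an adjustment does, here is a minimal sketch of an analysis of covariance with a single covariate. The numbers are made up for illustration and are not data from the study described above. The outcome (weeks of breastfeeding) and the function name are assumptions, and the model used in the actual study may have been different. The sketch uses the pooled within-group slope, which gives the same group effect as a least-squares fit with a group indicator and a common slope for the covariate.

```python
from statistics import mean

def ancova_adjusted_difference(x0, y0, x1, y1):
    """Group 1 minus group 0 difference in y, adjusted for covariate x.

    Uses the pooled within-group slope, which matches the least-squares fit of
    y = b0 + b1*group + b2*x (one common slope for both groups).
    """
    def sums(x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return mx, my, sxy, sxx

    mx0, my0, sxy0, sxx0 = sums(x0, y0)
    mx1, my1, sxy1, sxx1 = sums(x1, y1)
    slope = (sxy0 + sxy1) / (sxx0 + sxx1)       # pooled within-group slope
    unadjusted = my1 - my0                       # raw difference in means
    adjusted = unadjusted - slope * (mx1 - mx0)  # remove the part explained by x
    return unadjusted, adjusted, slope

# Made-up data: mothers in the ng tube group are six years older on average.
age_bottle = [20, 22, 24, 26, 28]
weeks_bottle = [12, 13, 14, 15, 16]
age_ng = [26, 28, 30, 32, 34]
weeks_ng = [18, 19, 20, 21, 22]

unadjusted, adjusted, slope = ancova_adjusted_difference(
    age_bottle, weeks_bottle, age_ng, weeks_ng
)
print("unadjusted difference:", unadjusted)
print("slope for mother's age:", slope)
print("adjusted difference:", adjusted)
```

In this made-up example, half of the raw difference between the groups is explained by the age imbalance, so the adjusted difference is smaller than the unadjusted one.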
Beware of situations where the two treatment groups are handled differently. An example of this would be the study of women who use oral contraceptives. These women visit a doctor at least every six months to get their prescriptions renewed. If these women are compared to women who do not use oral contraceptives, then the former group will probably be evaluated by a doctor more frequently. An increase in the prevalence of certain diseases may actually reflect the fact that these diseases are diagnosed earlier because of the more frequent visits.
Similarly, if a certain drug is suspected to have certain side effects, doctors may question more closely those patients who are on that medication, creating a self-fulfilling prophecy.
1.1 Did the authors use randomization?
If the authors of the study decided who would get the new therapy and who would get the standard therapy, we have an experimental design. If the patient did the choosing, if the patient's doctor did the choosing, or if the groups were intact prior to the start of the research, then we have an observational design.
The distinction between experimental and observational designs is critical. The greater control that is available in an experimental design generally leads to better quality results. In particular, an experimental design allows the use of randomization.
Here are some examples of experimental designs and observational studies.
In Adkinson (1997), 121 children with moderate-to-severe asthma were "randomly assigned to receive subcutaneous injections of either a mixture of seven aeroallergen extracts or a placebo." Since the researchers generated the sequence of random assignment, this is an experimental design.
In Bullock (1989), "80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group)." Since the researchers controlled the nature of the acupuncture, this is an experimental design.
In Cardo (1997), 33 health care workers who became seropositive to HIV after percutaneous exposure to HIV-infected blood were compared to 665 health care workers with similar exposure who did not become seropositive. Since the researchers did not control who became seropositive, this is an observational study.
In Hu (1997), 80,082 women between the ages of 34 and 59 years were followed for 14 years to look for instances of non-fatal myocardial infarction or death from coronary heart disease. These women were divided into low, intermediate, and high groups on the basis of their consumption of dietary fat. Since the women themselves controlled their diets, rather than having a diet imposed on them by the researchers, this represents an observational design.
Information from an experimental design is generally considered more authoritative than information from an observational design because the researchers can then use randomization. Randomization provides some level of assurance that the two groups are comparable in every way except for the therapy received.
Randomization requires the use of a random device, such as a coin flip or a table of random numbers. Systematic allocation (i.e., alternating between treatments) is not the same as randomization.
The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each value in the schedule, and then sort the schedule by the random number.
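Here is a minimal sketch of that sort-by-random-number approach. The function name, the group sizes, and the seed are assumptions chosen for illustration. A real study would also need to document and conceal the allocation list.

```python
import random

def randomize_schedule(schedule, seed=None):
    """Put a systematic treatment schedule in random order by sorting on random keys."""
    rng = random.Random(seed)
    # Attach one random number to each slot in the schedule.
    keyed = [(rng.random(), treatment) for treatment in schedule]
    # Sorting by the random number puts the treatments in random order.
    keyed.sort(key=lambda pair: pair[0])
    return [treatment for _, treatment in keyed]

# Systematic (non-random) schedule: 10 subjects on each therapy.
systematic = ["standard"] * 10 + ["new"] * 10
assignment = randomize_schedule(systematic, seed=2001)

for subject_number, treatment in enumerate(assignment, start=1):
    print(subject_number, treatment)
```

Because the schedule is laid out before it is shuffled, the two groups end up exactly the same size, which a simple coin flip for each subject would not guarantee.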
Randomization ensures that, on average, both measurable and unmeasurable factors are balanced across the standard and the new therapy, which supports a fair comparison. It also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.
Randomization is not always possible or practical. When this is the case, we have to rely on observational data to draw any conclusions. But when randomization is possible, its use makes a research study more authoritative.
Although I do not have a bibliographic citation for this example, I heard an amusing story about a study of water toxicants on fish.
This research required that the fish be separated into five tanks, each of which would get a different level of the toxicant. The researchers caught one fifth of the fish and put them in one tank, then an additional one fifth and put them in a second tank, and so forth. The outcome measurements were related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes in the first tank filled and the best outcomes in the last tank filled.
What happened was that the slow-moving, easy-to-catch fish were all allocated to the first tank. The fast-moving, hard-to-catch fish ended up in the last tank. It turned out that the sicker fish were also the slow-moving, easy-to-catch fish; the healthiest fish swam faster and avoided early capture.
A better way to design this experiment would have been to allocate the fish to tanks randomly. This would ensure that each tank got a fair share of the fast-and-healthy and the slow-and-sick fish.
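The fish story can be mimicked with a small simulation. This is only a sketch under simplifying assumptions: the health scores are invented, and the sickest fish are assumed to be caught strictly first. It compares the average health in each tank under catch-order allocation and under random allocation.

```python
import random
from statistics import mean

rng = random.Random(42)
N_FISH, N_TANKS = 100, 5
per_tank = N_FISH // N_TANKS

# Hypothetical health scores (0 = very sick, 1 = very healthy).
health = [rng.random() for _ in range(N_FISH)]

def split_into_tanks(fish):
    """Fill the tanks in order, per_tank fish at a time."""
    return [fish[i * per_tank:(i + 1) * per_tank] for i in range(N_TANKS)]

# Catch-order allocation: assume the sickest fish are slowest and are caught first.
catch_tanks = split_into_tanks(sorted(health))

# Random allocation: shuffle the fish before filling the tanks.
shuffled = health[:]
rng.shuffle(shuffled)
random_tanks = split_into_tanks(shuffled)

catch_means = [mean(tank) for tank in catch_tanks]
random_means = [mean(tank) for tank in random_tanks]

print("catch order:", [round(m, 2) for m in catch_means])
print("randomized: ", [round(m, 2) for m in random_means])
```

Under catch-order allocation the average health rises steadily from the first tank to the last, so any dose effect is tangled up with tank order. Under random allocation the tank averages stay much closer together.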
Studies without randomization often require either matching or statistical adjustments. While both matching and adjustments can help to some extent with covariate imbalance, these approaches do not work as well as randomization. In particular, some of the covariate imbalance may be due to factors that are difficult to measure. For example, patients may differ
Nevertheless, much can be learned from non-randomized studies. Almost everything we know about the risks of cigarette smoking came from observational designs (Gail 1996).
An editorial in the Journal of the American Medical Association (Sherwin 1997) tries to make sense of recent studies of the effect of dietary fat on obesity, heart disease, and stroke. After reviewing the results of numerous studies, the editorial comments:
"At present, most of this evidence in humans is observational and, consequently, an imperfect basis for causal inference. Large scale experimental studies that would provide more compelling data (such as the Women's Health Initiative) cost hundreds of millions of dollars and take decades to complete. Each study can only address the effects of a single nutritional change. Thus, it is still necessary to base advice to patients on dietary information that is less than certain and complete."
Randomized studies do have some weaknesses. These studies typically rely on the use of volunteers in a narrowly defined research setting. Such situations may not be reflective of how a typical patient behaves in a typical health care setting (Sackett 1997). In this particular aspect, a carefully planned observational design may provide a more relevant comparison.
Another problem with randomized designs is the limit to their size and scope. These limits may make it difficult to detect rare but important side effects. An observational approach like post marketing surveillance is more likely to be successful in these situations.
Studies of the potential harm caused by environmental exposures (such as lead based paint, second hand tobacco smoke, or electro-magnetic fields) are often impossible to randomize because of logistical and ethical issues.
These exceptions, however, do not diminish the value of experimental designs. In situations where observational and experimental studies can both be conducted, most researchers will give greater weight to the evidence in an experimental study.
1.2 Did the authors use matching?
Matching is the systematic selection, for every subject in the treatment/exposure group, of a control subject with similar characteristics. For example, in a study of fetal exposure to cocaine, you might select infants born to mothers who abused cocaine during pregnancy. For every such infant, you would select an infant who was unexposed to cocaine in utero but who had the same sex, race, and socio-economic status.
Matching will prevent covariate imbalance for those variables used in matching. It will also reduce covariate imbalance for any variables closely related to the matching variables. It will not, however, protect against all covariate imbalance, especially for those covariates that are difficult to measure.
Matching often presents difficult logistical issues, because a matching control subject may not always be available. The logistics are especially difficult when there are several matching variables and when the pool of control subjects that you can draw from is not substantially larger than the pool of treatment/exposed subjects.
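A minimal sketch of one-to-one exact matching appears below. The subject records, the coded category values, and the function name are all hypothetical. It also shows the logistical problem just described: when the control pool is small, some subjects go unmatched.

```python
def match_controls(exposed, pool, keys=("sex", "race", "ses")):
    """Pair each exposed subject with one unused control that agrees on all keys."""
    available = list(pool)
    pairs, unmatched = [], []
    for subject in exposed:
        for control in available:
            if all(subject[k] == control[k] for k in keys):
                pairs.append((subject["id"], control["id"]))
                available.remove(control)  # each control is used at most once
                break
        else:
            # No remaining control agrees on every matching variable.
            unmatched.append(subject["id"])
    return pairs, unmatched

# Hypothetical records; race and socio-economic status are shown as arbitrary codes.
exposed = [
    {"id": "E1", "sex": "F", "race": "A", "ses": "low"},
    {"id": "E2", "sex": "M", "race": "B", "ses": "low"},
    {"id": "E3", "sex": "F", "race": "A", "ses": "low"},
]
pool = [
    {"id": "C1", "sex": "M", "race": "B", "ses": "low"},
    {"id": "C2", "sex": "F", "race": "A", "ses": "low"},
    {"id": "C3", "sex": "F", "race": "A", "ses": "high"},
]

pairs, unmatched = match_controls(exposed, pool)
print(pairs)      # [('E1', 'C2'), ('E2', 'C1')]
print(unmatched)  # ['E3']
```

Subject E3 goes unmatched because the only control with the same sex, race, and socio-economic status was already used for E1.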
Matching is usually reserved for those variables that are known to be highly predictive of the outcome measure. In a cancer study, for example, matching is usually done on smoking. Many neonatology studies will match on gestational age.
Matching in a randomized design
In some randomized studies, matching will be used as well. Partly, this is a recognition that randomization will not totally remove covariate imbalance, just like a flip of 100 coins will not always result in exactly 50 heads and 50 tails.
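The coin flip analogy is easy to check with the binomial distribution. The short calculation below assumes a fair coin; the choice of the 45 to 55 range is mine, for illustration.

```python
from math import comb

n = 100
# Probability of exactly 50 heads in 100 flips of a fair coin.
p_exactly_50 = comb(n, 50) / 2**n
# Probability that the number of heads lands anywhere from 45 to 55.
p_45_to_55 = sum(comb(n, k) for k in range(45, 56)) / 2**n

print(f"P(exactly 50 heads) = {p_exactly_50:.4f}")
print(f"P(45 to 55 heads)   = {p_45_to_55:.4f}")
```

A perfect 50/50 split turns up only about 8% of the time, so some imbalance after randomization is the rule rather than the exception.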
More importantly, however, matching in a randomized study will provide extra precision. Matching creates pairs of subjects who will have greater homogeneity and therefore less variability.
The crossover design
The crossover design represents a special type of matching. In a crossover design, a subject is randomly assigned to a specific treatment order. Some subjects will receive the standard therapy first, followed by the new therapy (AB). Others will receive the new therapy first, followed by the standard therapy (BA).
Since the same subject receives both treatments, there is no possibility of covariate imbalance.
When therapies are applied in sequence, timing effects are of great concern. Are the therapies set far enough apart so that the effect of one therapy is unlikely to carry over into the other therapy? For example, if the two therapies represent different drugs, did the researchers allow enough time so that one drug was fully eliminated from the body before they administered the second drug?
Learning and fatigue effects are also potential problems in a crossover design.
Special problems arise when each subject receives the standard therapy first and then the new therapy (or vice versa). Many factors other than the change in therapy can cause a shift in the health of patients over time. Unless the researchers can point to other evidence that shows stability of the condition over time, information from this type of study is worthless.
Sometimes difficult circumstances (such as a general failure to respond to the standard therapy) will force the use of this type of design. Further discussion of lack of randomization or other issues with crossover designs can be found in Louis (1992).
Concurrent controls versus historical controls.
Sometimes researchers will assign all of the research subjects to the new therapy. The outcomes of these subjects are compared to historical records representing the standard therapy. This type of study is sometimes called a historical controls study. The very nature of a historical controls study guarantees that there will be a major discrepancy in timing. Thus, you have to consider any factors that have changed over time that might be related to the outcome. To what extent might these factors affect the outcome differentially?
1.3 Did the authors use statistical adjustments?
Statistical adjustments represent one way of correcting for covariate imbalance. There are several ways to make statistical adjustments.
First, there are direct adjustments, such as using a per capita rate. This adjustment occurs most frequently when the outcome measure is some type of count, such as the number of infections, number of medication errors, or number of traffic deaths. This adjustment simply divides the outcome measure by some variable that measures the volume of activity in the process that produced the count. It might represent the number of patients (or patient days) at risk in a study of infections. It might represent the number of medications dispensed in a study of medication errors. It might represent the number of people (or the total number of passenger miles) in a county for a study of county-wide traffic deaths.
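As a rough illustration of a direct adjustment, the sketch below converts raw infection counts into rates per 1,000 patient days. The unit names and all of the counts are hypothetical, invented only to show the arithmetic.

```python
def rate_per_1000(count, exposure):
    """Return events per 1,000 units of exposure (for example, patient days)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return 1000 * count / exposure

# Hypothetical data: (number of infections, patient days at risk).
units = {
    "Unit A": (12, 4000),
    "Unit B": (20, 10000),
}

rates = {name: rate_per_1000(count, days) for name, (count, days) in units.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} infections per 1,000 patient days")
```

In this made-up example, Unit B has more infections in total but a lower rate once the volume of activity is taken into account.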
Second, there are regression adjustments. [Discuss]
Third, there are weighting adjustments. [Discuss]
[Discuss the problems of non-overlap]
[Discuss the imperfect nature of these adjustments, especially when the adjustment variable is imperfectly measured.]
[Re-iterate the problem with difficult to measure covariates.]
Summary - Who did the choosing?
1.1 Did the authors use randomization? Randomization ensures balance, on average, between the two therapy groups with respect to both measurable and unmeasurable factors.
1.2 Did the authors use matching? [Discuss]
1.3 Did the authors use statistical adjustments? [Discuss]
Chapter 2: Was there a plan?
Introduction
The presence of a plan developed before data collection and analysis adds to the quality of a publication.
2.1 Did the research have a narrow focus?
2.2 Did the authors deviate from the plan?
Meat consumption and childhood cancer
Studies of the effects of diet on health often have difficulties with multiple endpoints. An example is a 1994 study of the effect of cured and broiled meat consumption on childhood cancer.
This study examined two types of cancer (acute lymphocytic leukemia and brain tumor). The authors examined five types of meat consumption (ham/bacon/sausage, hot dogs, hamburgers, lunch meats, and charcoal broiled foods). Finally, the authors looked at food consumption both of the child and of the mother during pregnancy.
In the analysis, the researchers used a cut-off to compare low meat consumption to high meat consumption. For example, they compared one or more hamburgers consumed per week to less than one per week. In the text, however, they went further and discussed results with a different cut-off: children who ate two or more hamburgers per week compared to children who ate one or fewer per week.
This study came under a lot of criticism for its scattershot approach to investigation, though it also had its share of defenders. There's a saying in statistics: "If you torture your data long enough, it will confess to something." When a research study has a plan with a limited number of precisely defined hypotheses, the results are more persuasive. When the research has no pre-planned hypotheses, then the results should be considered preliminary and exploratory in nature.
2.1 Did the research have a narrow focus?
A good research study has limited objectives that are specified in advance. Failure to limit the scope of a study leads to problems with multiple testing.
When a large number of comparisons are being made, the study is considered a fishing expedition. There is a saying in statistics circles: "If you torture your data long enough, it will confess to something."
When is multiple testing likely to occur?
Multiple testing often occurs when a researcher examines a large number of subgroups or a large number of endpoints (Howel 1994). Multiple testing problems also occur when a study examines multiple side effects.
When multiple tests are done simultaneously within a paper, there is an increase in the overall Type I error. If 100 tests were performed at alpha=.05, you would expect about 5 of those tests to be significant, even if there was nothing at all going on. There are statistical adjustments for multiple comparisons, but these are controversial. Significant results from a large number of unplanned comparisons are useful mostly just for setting future research priorities.
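The arithmetic behind this can be sketched in a few lines. The sketch assumes that the tests are independent and that every null hypothesis is true, which is a simplification; the function names are my own.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of significant results when nothing is going on."""
    return n_tests * alpha

def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20, 100):
    print(
        f"{n:3d} tests: expect {expected_false_positives(n):.2f} false positives, "
        f"chance of at least one = {familywise_error(n):.3f}"
    )
```

Under these assumptions, the chance of at least one false positive among 100 tests is above 99%.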
Optimal cut points and the problem with multiple comparisons.
Researchers will often simplify analysis of a continuous outcome measure by dividing that measure into two or more distinct groups on the basis of cut points. For example, a researcher might categorize subjects as having high or low blood pressure depending on whether they are above or below a certain value.
An abuse of this approach, called the minimum p-value approach, was noted by Altman (1994). Researchers would examine a variety of cut points and select the one that yielded the most favorable statistics.
For example, some researchers have chosen the cut point from among a large number of possible cut points so as to make the difference in survival times between those patients above the cut point and those patients below the cut point as large as possible.
Examining multiple cut points inflates the chance of drawing a false conclusion (Type I error) from the traditional 5% value to a value as large as 40%.
There are several objective ways to select a cut point. Perhaps the best way is to select the cut point prior to looking at the data. This would involve the use of medical judgment.
After the data has been collected, there are some neutral ways of selecting a cut point. The simplest is a median split. If you wanted to create a median split for blood pressure, you would combine the blood pressure data from both groups, and select a value so that half of the blood pressures are larger and half are smaller.
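A median split is simple enough to sketch directly. The blood pressure values below are hypothetical; the point is only that the cut point is computed from the combined data, without reference to group membership or outcome.

```python
from statistics import median

# Hypothetical systolic blood pressures (mm Hg) for two treatment groups.
group_a = [118, 124, 131, 140, 152]
group_b = [110, 121, 129, 135, 147]

# Combine both groups before computing the cut point, so the split
# does not depend on which group a subject belongs to.
cut_point = median(group_a + group_b)

def classify(value, cut):
    """Label a value as 'high' if it is above the cut point, else 'low'."""
    return "high" if value > cut else "low"

labels_a = [classify(v, cut_point) for v in group_a]
labels_b = [classify(v, cut_point) for v in group_b]

print("Cut point:", cut_point)
print("Group A:", labels_a)
print("Group B:", labels_b)
```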
Subgroup analysis
Subgroup comparisons are a special case of multiple testing. Rather than looking at multiple endpoints, a subgroup analysis compares a single endpoint across several different subgroups within the data.
Subgroup comparisons suffer from three problems. First, the subgroup comparison is usually a non-randomized comparison. Second, the subgroup comparison has less precision because the sample size is smaller. Third, the number of subgroups that could potentially be examined can swamp the sample size in a study.
If you find a subgroup that behaves differently, then you need to ask yourself a few questions. Is this a subgroup that I would have studied a priori if I had been more careful during the planning stage? Is there a plausible mechanism to explain why this subgroup behaves differently? Are there other studies that have similar findings for this subgroup?
There are some technical issues with subgroup comparisons. You wouldn't want to declare that a therapy is effective in one subgroup but not in the others simply because the p-value for that subgroup was 0.043 and the p-value for the others was 0.062. The analysis of subgroups should be done as a formal test of interaction.
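One simple form of interaction test compares the two subgroup estimates directly, using a z-test on their difference. The sketch below assumes the two subgroups are independent and the estimates are approximately normal. The estimates and standard errors are hypothetical, chosen so that the separate p-values come out near 0.043 and 0.062.

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return erfc(abs(z) / sqrt(2))

def interaction_test(est1, se1, est2, se2):
    """z-test for the difference between two independent estimates."""
    z = (est1 - est2) / sqrt(se1**2 + se2**2)
    return z, two_sided_p(z)

# Hypothetical treatment effects (estimate, standard error) in two subgroups.
subgroup_1 = (2.024, 1.0)
subgroup_2 = (1.866, 1.0)

p1 = two_sided_p(subgroup_1[0] / subgroup_1[1])
p2 = two_sided_p(subgroup_2[0] / subgroup_2[1])
z_int, p_int = interaction_test(*subgroup_1, *subgroup_2)

print(f"Subgroup 1 p-value:  {p1:.3f}")
print(f"Subgroup 2 p-value:  {p2:.3f}")
print(f"Interaction p-value: {p_int:.3f}")
```

Although one subgroup falls just below 0.05 and the other just above it, the interaction test shows no evidence that the two subgroups actually differ from each other.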
A recent publication in the International Journal of Epidemiology provides empirical evidence that post hoc analyses are more likely to lead to false positive findings.
False positive outcomes and design characteristics in occupational cancer epidemiology studies. Gerard GMH Swaen, Olga Teggeler and Ludovic GPM van Amelsvoort. International Journal of Epidemiology 2001;30:948-954. http://ije.oupjournals.org/cgi/content/abstract/30/5/948
2.2 Did the authors deviate from the plan?
Not all research is predictable, so deviations from a pre-designed plan are sometimes necessary. Nevertheless, be cautious about any major deviation from the original research protocol. Some examples of deviations from the plan include:
Investigating end-points other than those originally specified.
Developing new exclusion criteria after the study has started.
You need to ask yourself if the authors deviated from the protocol in a conscious or subconscious effort to manipulate the results. Did the authors add other end-points in order to salvage a largely negative study? Were new exclusion criteria targeted to keep "troublesome" subjects out? It is impossible, of course, to discern the motives of the researchers. Nevertheless, for any deviation or modification to the protocol, you can ask whether this change would have made sense to include in the protocol if it had been thought of before data collection began.
An example of a deviation from the research plan.
An interesting deviation from the research plan occurs in a randomized, double-blind, controlled trial of selenium supplements (Clark 1996). The study was initiated in 1983 with basal cell carcinoma and squamous cell carcinoma of the skin as the primary end points. The researchers also looked for signs of selenium toxicity.
In 1990, funding was obtained to look at additional secondary end points (total mortality, cancer mortality, and incidence of lung, colorectal, and prostate cancers). While it was relatively easy to add extra endpoints in the middle of the study, the authors acknowledged that this represented a deviation from the protocol.
Another deviation from the protocol occurred when the study was terminated early (January 1996). No statistically significant changes were found in the primary endpoints, nor was any evidence of selenium toxicity found.
Among the secondary endpoints, however, the authors found statistically significant declines in total cancer mortality and lung cancer mortality. The authors also found statistically significant declines in the incidence of prostate cancer, colorectal cancer, lung cancer and total carcinomas. There was also a decline in overall mortality, though it did not achieve statistical significance.
There were no significant changes in the incidence of nine other types of cancer, including breast cancer, bladder cancer, and leukemia.
Because the significant results occurred in areas that were not originally planned for study, the authors acknowledge that any results have to be considered preliminary. Furthermore, it is unclear what impact the early termination of the study had on the statistics. Early termination of a study can cause serious biases, unless specific rules for early termination are established at the start of the study.
Premature discontinuation of clinical trial for reasons not related to efficacy, safety, or feasibility. Commentary: Early discontinuation violates Helsinki principles. Michel Lièvre, Joël Ménard, Eric Bruckert, Joël Cogneau, François Delahaye, Philippe Giral, Eran Leitersdorf, Gérald Luc, Luis Masana, Philippe Moulin, Philippe Passa, Denis Pouchain, Gérard Siest, and K Boyd. BMJ 2001; 322: 603-606.
Responsibilities of sponsors are limited in premature discontinuation of trials. Richard Ashcroft. BMJ 2001; 323: 53.
Did the authors discard outliers?
You should be skeptical of any study that removes outliers. Inappropriate removal of outliers can seriously bias the study results.
Sometimes the outliers are more interesting than the bulk of the data themselves. You may gain more insight by trying to uncover the cause of an outlying observation than you would by examining the relatively small effects that occur with the rest of the data.
It is generally a bad idea to remove data points on the basis of their data values alone. If an investigation of an outlier leads to a discovery of a typing error or the inclusion of a subject who did not meet the pre-specified inclusion criteria, then correction or removal of the outlier is appropriate.
If there is no such justification, then the best solution is to leave the outlier alone. Another alternative is reporting data analysis results both with and without the outlier.
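Reporting results both ways takes very little effort. The sketch below uses hypothetical lengths of stay, with the last value standing in for a suspected outlier, and summarizes the data with and without it.

```python
from statistics import mean, median

# Hypothetical lengths of stay in days; the last value is a suspected outlier.
values = [4.1, 3.9, 4.3, 4.0, 4.2, 9.8]
without_outlier = values[:-1]

summary = {
    "with outlier": (mean(values), median(values)),
    "without outlier": (mean(without_outlier), median(without_outlier)),
}

for label, (avg, med) in summary.items():
    print(f"{label:>16}: mean = {avg:.2f}, median = {med:.2f}")
```

If the conclusions are the same either way, the outlier is not driving the result. If they differ, the reader deserves to see both versions.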
An example of inappropriate outlier deletion.
The NASA web site has an interesting example of outlier deletion. Researchers in the 1980s first published information about the hole in the ozone layer above Antarctica. These researchers were nervous because the results from the British Antarctic Survey did not match results from earlier years taken by an American satellite. The authors discovered, however, that the American satellite had a computer filter built in that automatically removed any large sudden changes in ozone concentration, which it treated as instrument errors. When this filter was removed, the authors were able to trace the development of the ozone hole all the way back to 1976.
Further details about the history of the ozone hole can be found at the NASA web site.
[library/pages/sparling.htm]
Summary - Was there a plan?
The presence of a plan developed before data collection and analysis adds to the quality of a publication.
2.1 Did the research have a narrow focus? A large number of comparisons limits the amount of evidence that you can place on any single conclusion. Results from a limited number of planned comparisons are considered more authoritative.
2.2 Did the authors deviate from the plan? While minor deviations are expected, be cautious about major deviations from the research plan, such as developing new exclusion criteria during the course of the study. In particular, removing outliers without a sound scientific reason is dangerous.
Additional resources
Randomised controlled trial of cardiotocography versus Doppler auscultation of fetal heart at admission in labour in low risk obstetric population. Commentary: changes between protocol and manuscript should be declared at submission. Commentary: research governance must focus on research training. Commentary: Approach to power calculations has to be realistic.
Gary Mires, Fiona Williams, Peter Howie, Sandy Goldbeck-Wood, Gordon D Murray, and Britt-Ingjerd Nesheim
BMJ 2001; 322: 1457-1462. This reference talks about a change in power calculations after the data was collected.
Chapter 3: Who knew what when?
Introduction
Knowledge of group membership, either before or during data collection, can bias the study.
When you are trying to figure out who knew what when, ask the following two questions:
3.1 During the study, did the patients know which group they were in?
3.2 At the start of the study, did the patients know which group they were going to be placed in?
Acupuncture
Acupuncture is an example of a therapy that is difficult to blind. One study of the effect of acupuncture on the prevention of recidivism among alcohol and other drug abusers used a placebo acupuncture that placed needles 5 mm away from the designated acupuncture point. Because of the nature of acupuncture, the acupuncturists were aware of which patients were which, making this a single blind study.
A critique of this study pointed out that there were significant interactions between the acupuncturists and the patients, with opportunities for indirect suggestion and nonverbal communication to occur. One indication that subjects became aware of who was in which group was the fact that there was a far greater tendency for control subjects to drop out of the study.
3.1 During the study, did the patients know which group they were in?
In an experimental study, it is desirable (but not always possible) to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patient. This is known as "blinding." Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study.
Unfortunately, there are many situations where blinding is impossible. For example, if you are comparing oral versus rectal administration of a drug, that's pretty hard to conceal from the patient. In general, observational studies cannot be blinded, because the patient and/or their doctor selects the treatment group.
Unblinded studies are still useful, but they are considered less authoritative than blinded studies.
The placebo effect.
Positive effects of a treatment are sometimes due to a placebo effect. The placebo effect is a product of "belief, expectancy, cognitive reinterpretation, and diversion of attention" that can lead to psychological and sometimes physiological improvements in situations where the treatment is known to have no effect, such as sugar pills (Beyerstein 1997).
Johnson (1997) lists three specific situations where the placebo effect is of particular concern: when enthusiasm by the patient or the doctor for the new procedure is strong, when outcomes are based on the patient's self-assessment (e.g. quality of life studies), and when the treatment is primarily for symptoms. The placebo effect is less critical for objective outcomes like survival.
Blinding in surgical trials.
Surgical procedures are often difficult to completely blind. Nevertheless, Johnson (1997) suggests some partial steps at blinding that prevent some of the biases from creeping in.
If two surgical procedures use different types of incisions, identical blood or iodine stained opaque dressings could be used to keep the patients unaware of which operation was performed.
Also, although the surgeon cannot be blinded to the difference in surgery, those who evaluate the health of the patient after surgery could be kept unaware of the particular operation, so as to ensure that their evaluation of the patient is unbiased.
Partial blinding in an observational study.
As noted earlier, it is impossible to completely blind an observational study. Gail (1996), however, describes an observational study where some level of blinding was achieved.
In a study of the relationship between smoking and cancer, the people asking questions about smoking and other risk factors were unaware of whether they were interviewing lung cancer patients or controls. Thus, the interviewers could not subconsciously probe harder for smoking information among the lung cancer patients.
The problem with studies without blinding.
Two groups of researchers have examined studies with and without blinding. These authors found that studies without blinding show an average bias of 11-17% (Schulz 1996; Colditz 1989). In other words, when unblinded studies were compared to blinded studies, the unblinded studies tended to estimate a treatment effect that was (on average) 11% to 17% higher.
Additional evidence of this problem appears in a meta-analysis of the effect of intermittent sunlight exposure and melanoma (Nelemans 1995). When nine studies without blinding were combined, they showed an odds ratio of 1.84, which was statistically significant (95% confidence interval 1.52 to 2.25). When the seven studies with blinding were combined, they showed a much smaller odds ratio (1.17, 95% confidence interval 0.98 to 1.39), which was not statistically significant. This is further evidence that unblinded studies are more likely to show statistical significance than blinded studies.
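You can read statistical significance straight off these confidence intervals: an odds ratio is significant at the 5% level when its 95% interval excludes 1. The sketch below applies that check to the two pooled results quoted above. The approximate z statistic is my own back-calculation, which assumes the intervals are symmetric on the log scale; it is not a figure reported in the meta-analysis.

```python
from math import log

def ci_excludes_null(lower, upper, null=1.0):
    """True if the confidence interval does not contain the null value."""
    return not (lower <= null <= upper)

def approx_z(odds_ratio, lower, upper, z_crit=1.96):
    """Approximate z statistic, assuming a CI symmetric on the log scale."""
    se = (log(upper) - log(lower)) / (2 * z_crit)
    return log(odds_ratio) / se

# (odds ratio, lower 95% limit, upper 95% limit) as quoted in the text.
studies = {
    "unblinded": (1.84, 1.52, 2.25),
    "blinded": (1.17, 0.98, 1.39),
}

results = {
    name: (ci_excludes_null(lo, hi), approx_z(or_, lo, hi))
    for name, (or_, lo, hi) in studies.items()
}

for name, (significant, z) in results.items():
    print(f"{name:>9}: significant = {significant}, approximate z = {z:.2f}")
```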
Problems with keeping a treatment blinded.
Even though the placebo may look the same, sometimes the doctor may infer which group a patient belongs to, perhaps through noting a characteristic set of side effects. In an anonymous survey, more than half of the doctors participating in research studies admitted to breaking a blinded allocation (Schulz 1996). If you are worried about this, ask the doctors to try to identify which treatment group they believe each patient belonged to. If the percentage of correct guesses is significantly larger than 50%, then the allocation scheme was not sufficiently blinded.
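A check of this kind can be sketched with an exact one-sided binomial test. The counts below are hypothetical, and the 50% benchmark assumes two treatment groups of equal size, so that a doctor guessing at random would be right about half the time.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical: doctors guessed the treatment group correctly
# for 70 of 100 patients.
correct_guesses = 70
patients = 100
p_value = binomial_upper_tail(correct_guesses, patients)

print(f"Proportion correct: {correct_guesses / patients:.0%}")
print(f"One-sided p-value against 50% guessing: {p_value:.6f}")
```

A small p-value here suggests that the doctors were doing better than chance, which casts doubt on how well the blinding held up.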
3.2 At the start of the study, did the patients know which group they were going to be in?
The randomization list should be blinded to those involved with recruiting subjects.
It is always possible to blind the randomization list, even when the treatment itself cannot be blinded. Check all the exclusion criteria and, if the subject qualifies, open a sealed envelope that identifies which group the patient belongs to. So, for example, it is impossible to use blinding when comparing a surgical to a non-surgical technique, but the selection of who gets the surgical technique could be hidden from both the patient and the surgeon until after all the selection and inclusion criteria are applied.
Knowledge of treatment order allows the doctors recruiting patients to consciously or unconsciously influence the composition of the groups. They can do this by applying exclusion criteria differentially or by delaying entry of a certain healthier (or unhealthier) subject so he/she gets into the "desirable" group. Unblinded allocation schemes show an average bias of 30-40% (Schulz 1996).
Problems with systematic allocation.
Systematic allocations can also cause biases. For convenience, some researchers will allocate in a systematic (non-random) fashion, such as alternating regularly between the two treatments. This is a bad idea. Patients may arrive in a systematic order. Systematic allocations allow the doctors to guess which group the next patient is going to be allocated to. Systematic assignment causes an average bias of 15% (Colditz 1989).
Summary - Who knew what when?
Knowledge of group membership, either before or during data collection, can bias the study.
3.1 During the study, did the patients know which group they were in? While this is not always possible, it is preferred to use a blinded approach to remove the possibility of the placebo effect.
3.2 At the start of the study, did the patients know which group they were going to be in? Even when blinding is impossible, you can always hide the randomization plan through the use of sealed envelopes. This will ensure that health professionals do not consciously or subconsciously influence group membership through the differential application of entry criteria.
3.3 Did the authors rely on retrospective data? Retrospective data are more likely to suffer from inaccuracy, incompleteness and bias.
Research into complementary and alternative medicine: problems and potential. Richard L Nahin and Stephen E Straus. BMJ 2001; 322: 161-164.
Chapter 4: Who was left out?
Introduction
Research studies often have a narrow focus, but sometimes it can be too narrow. When too many patients are left out, those who remain may not be representative of the types of patients you will encounter.
When you are trying to figure out who was left out and what impact this has, ask the following two questions:
4.1 Who was excluded at the start of the study?
4.2 Who dropped out during the study?
Nicotine patches
The Journal of Pediatrics published a study of adolescent smokers in 1996. The researchers recruited 22 volunteers from five public high schools in the Rochester, MN area for participation in a smoking cessation program involving behavioral counseling, group therapy, and nicotine patches. Researchers measured the number of cigarettes smoked, side effects, and blood levels of nicotine.
The purpose of the research was to evaluate "the safety, tolerance, and efficacy of 22 mg/d nicotine patch therapy in smokers younger than 18 years who were trying to stop smoking." The authors also listed a secondary goal, "to compare blood cotinine levels, nicotine withdrawal scores, and adverse experiences with those of adults obtained in previous patch studies." Cotinine is a metabolite of nicotine and provides a useful objective measure of cigarette smoking. It also allowed the authors to examine whether nicotine toxicity was an issue.
This study did not include major segments of the teenage smoking population. The study included only white subjects because there were too few minority students in the Rochester area. Subjects had to get parental permission, excluding smokers who wished to keep their habit secret from their parents. Subjects were also volunteers, and thus could be considered more motivated to quit than the typical teenage smoker.
The study also had a serious drop-out rate. Of the presumably thousands of teenage smokers in the Rochester, Minnesota area, only 71 volunteers responded to the initial call for subjects. Of the 71 volunteers, 55% met inclusion criteria. Of the remaining 39, 44% declined to attend the initial meeting. Of the remaining 22, 14% were non-compliant. Of the remaining 18, 39% failed to respond to the one year survey. Only 11 completed the entire study (50% of those who started the study; 28% of those meeting inclusion criteria; 15% of the initial volunteers).
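When a paper reports a chain of exclusions like this one, it helps to lay out the counts and work out what share of each earlier stage actually finished. The sketch below uses the counts quoted above; the stage labels are my own shorthand.

```python
# Number of subjects remaining at each stage, as quoted in the text.
stages = [
    ("initial volunteers", 71),
    ("met inclusion criteria", 39),
    ("started the study", 22),
    ("compliant", 18),
    ("completed the study", 11),
]

completed = stages[-1][1]

# Completers as a rounded percentage of each earlier stage.
shares = {name: round(100 * completed / n) for name, n in stages[:-1]}

for name, n in stages[:-1]:
    print(f"{completed} of {n} {name}: {shares[name]}% completed")
```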
This study had a serious problem with who was left out. The large number of subjects who did not get into the study or who did not complete the study makes it hard to generalize the findings of this research.
4.1 Who was excluded at the start of the study?
Researchers, trying to minimize variation, will use exclusion criteria to create more homogeneous groups. While minimizing variability is good, too much homogeneity can backfire. It's difficult to extrapolate results from a very tightly controlled and homogeneous clinical trial to the variation of patients seen in your practice. Ask yourself the question "How similar are my patients?"
For the study to be useful to us, we want the research subjects to be as similar as possible to the patients we see. Watch out for exclusion criteria that leave out large groups of patients. Also be aware that too many research studies exclude women unnecessarily.
Ask yourself whether the geographic location or the type of health care setting places restrictions on the type of patients seen. Tertiary care centers only see patients that are extremely ill. A study of Midwest hospitals will not have a representative number of Hispanic patients compared to the Southwest.
Volunteer bias
Quite often, the only patients we are able to study are those who volunteer to help out. The use of volunteers, however, may exclude important segments of the patient population.
Volunteers may differ from the normal population on several critical factors. Volunteers for a study involving cash payments may come more often from economically challenged environments. If a free health check-up is included, volunteers may come more often from people worried about their health status. Volunteers for lengthy studies are less likely to be employed.
Recruiting controls is especially troublesome in a study that involves a painful procedure. Gustavsson (1997) documents volunteer bias in a study of lumbar puncture to obtain cerebrospinal fluid.
In this study, subjects were asked to submit to a lumbar puncture in order to "examine the associations between personality traits and biochemical variables." Of the 87 subjects, 48 declined to participate. The authors were fortunate enough to have measures of personality on both those who participated in the study and those who did not participate.
Those who participated had scores roughly a half standard deviation higher on impulsiveness. They did not differ on other personality traits such as socialization and detachment.
The large difference in the impulsiveness measurement would obviously cloud any attempt to correlate personality traits and biochemical measurements in spinal fluids among those who volunteered.
Volunteers in survey study.
An aspect of volunteering can occur in survey studies. People who volunteer to return a questionnaire are frequently quite different from those who refuse to fill out the survey. In particular, the non-responders tend to be more apathetic. Return rates for surveys vary by the type of survey, but if fewer than half of the subjects returned the survey, any results are of very limited value. Again, look for efforts to minimize non-response and/or efforts to characterize the demographics of non-responders.
Problems with volunteers are especially troublesome in surveys using 900 numbers and web-based surveys.
In 1976, Shere Hite published a study on female sexual attitudes that represented the responses of 3,019 surveys. While that sounds impressive, it was a small fraction of the 100,000 surveys that were sent out.
One can speculate on the characteristics of those who failed to respond, but it is a pretty good bet that many of them felt uncomfortable discussing aspects of their sex lives in a survey format. It's obvious that this tendency alone would tend to affect many of the responses in the survey.
What to look for in studies using volunteers.
Examine the incentives and disincentives for participation. Are any incentives or disincentives related to important prognostic factors?
Were the researchers able to characterize various aspects of those who did not volunteer? How similar were the volunteers and non-volunteers?
Do people volunteer themselves into specific treatment groups? If so, we have an observational study.
Some studies involve the use of volunteers who are subsequently randomized into two groups. In this case, some problems will diminish. Comparison between the two groups will be unbiased, but it may be difficult to generalize to a non-volunteer population.
4.2 Who dropped out during the study?
It is inevitable that some patients will drop out during the study. If the number is more than a few, this is a cause for concern. Dropouts often have a different prognosis than those who stay. Ignoring the dropouts will often paint a rosier picture of the outcome. Was there any effort (financial inducement, follow-up reminders) made to minimize dropouts? Were the authors able to characterize the demographics of the dropouts?
Were non-compliant patients excluded? Non-compliance is often associated with poor prognosis. Excluding these patients may also paint a rosier picture of the outcome. Patients should be analyzed in the groups they were randomized to. This is known as "intention to treat" analysis.
Consider a new surgical therapy which is being compared to a standard non-surgical therapy. Some patients randomized to the surgical therapy might die prior to receiving the therapy. This is the most extreme form of non-compliance. These patients should still be analyzed as part of the surgical therapy group. Otherwise the rapidly dying patients will be excluded from the treatment group, but not from the control group, leading to serious bias.
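The surgical example can be made concrete with some made-up numbers. Everything in the sketch below is hypothetical; it simply shows how dropping the patients who died before surgery can reverse the apparent direction of the comparison.

```python
def mortality(deaths, patients):
    """Proportion of patients who died."""
    return deaths / patients

# Hypothetical trial, invented for illustration only.
surgical_randomized = 100
surgical_died_before_surgery = 10
surgical_died_after_surgery = 18
control_randomized = 100
control_died = 25

# Intention to treat: everyone is analyzed in the group they were randomized to.
itt_surgical = mortality(
    surgical_died_before_surgery + surgical_died_after_surgery,
    surgical_randomized,
)

# Excluding the early deaths removes rapidly dying patients from one group only.
excluded_surgical = mortality(
    surgical_died_after_surgery,
    surgical_randomized - surgical_died_before_surgery,
)

control = mortality(control_died, control_randomized)

print(f"Control mortality:                     {control:.0%}")
print(f"Surgical mortality, intention to treat: {itt_surgical:.0%}")
print(f"Surgical mortality, early deaths dropped: {excluded_surgical:.0%}")
```

With these numbers, surgery looks better than the control when early deaths are dropped, but worse under an intention to treat analysis.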
Intention-to-treat principle. Victor M. Montori, Gordon H. Guyatt. CMAJ 2001;165(10):1339-41. http://www.cma.ca/cmaj/vol-165/issue-10/1339.asp
Summary - Who was left out?
Exclusion of subjects can make the study biased or less generalizable.
4.1 Who was excluded at the start of the study? Excessively strict entry criteria in a research study can make it difficult to extrapolate to the types of patients that you normally see.
4.2 Who dropped out during the study? A large number of drop-outs during the course of a research study can bias the final conclusions.
Chapter 5: How much did things change?
Introduction
It's not enough just to assess statistical significance in a study. You also need to make sure that the difference has a practical impact, that it represents a clinically relevant outcome, and that there were a sufficient number of patients to provide reasonable precision.
When you are looking at how much things changed, ask yourself the following questions:
5.1 Did the authors measure the right thing?
5.2 Was the change clinically significant?
5.3 Were there enough subjects?
Non-steroidal anti-inflammatory drugs
A 1987 study of non-steroidal anti-inflammatory drugs (NSAIDs) showed that patients who took these drugs were 50% more likely to develop upper gastrointestinal (UGI) bleeding. This increase was statistically significant at alpha=.05. UGI bleeding, however, was rare in both groups: only 1 case per thousand person years in the controls and 1.5 in the NSAID group. If you see 100 patients a year, you would have to wait two decades, more or less, in order to see one excess event of bleeding, on average.
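The "two decades" figure comes from a short calculation, sketched below with the rates quoted above. It assumes, for simplicity, that all 100 patients take NSAIDs and that each contributes one full person year of follow-up per year.

```python
# Rates of UGI bleeding per person year, as quoted in the text.
control_rate = 1.0 / 1000
nsaid_rate = 1.5 / 1000

relative_risk = nsaid_rate / control_rate
excess_per_person_year = nsaid_rate - control_rate

# Simplifying assumption: 100 patients, each followed for a full year.
patients_per_year = 100
excess_events_per_year = excess_per_person_year * patients_per_year
years_per_excess_event = 1 / excess_events_per_year

print(f"Relative risk: {relative_risk:.1f}")
print(f"Excess bleeds per 1,000 person years: {1000 * excess_per_person_year:.1f}")
print(f"Years of practice per excess bleed: {years_per_excess_event:.0f}")
```

The relative risk of 1.5 sounds large, but the absolute difference is only half a case per thousand person years.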
In this article, the authors were up front about the very small increase in risk. Most authors, however, are so relieved to achieve statistical significance that they forget to consider whether the size of the difference will improve clinical practice.
This is summarized well in the following Gertrude Stein quote: "For a difference to be a difference it has to make a difference."
5.1 Did the authors measure the right thing?
There is a tendency to focus on intermediate measures that are easy to assess, but which may or may not be predictive of more important endpoints. Improvement in forced expiratory volume may not translate into a reduction in asthma attacks. A reduction in abnormal ventricular depolarization may not translate into a reduction in the recurrence of heart attacks. If an intermediate endpoint is used, ask yourself whether there is an adequate link between this endpoint and something that is relevant to your patients.
Be careful that you don't focus solely on the outcomes mentioned in the abstract. There is a tendency to report only in the abstract the outcome measures that were statistically significant, rather than the outcome measures most of interest to health care professionals.
Also, always consider whether the researchers examined side effects adequately.
Measurement error
Measurement error is simply the inability to measure an important variable accurately. Measurement error in the outcome variable does not ordinarily cause bias, but measurement error in factors that can predict the outcome is of serious concern.
There are several ways to assess dietary fat intake. The most accurate (and also the most costly) way is through the use of prospectively recorded food diaries.
Sometimes the cost limitations or the retrospective nature of a research study will require a less accurate assessment of dietary fat, such as through an interview. Shapiro (1997) points out that estimation of dietary fat using interviews tends to correlate poorly with estimation using prospective diaries. This would cast doubt, for example, on retrospective studies that tried to associate dietary fat intake with the risk of breast cancer.
Unvalidated measures
[Discuss]
Short term measures
[Discuss]
Retrospective data
Retrospective data are data collected by looking backwards in time. We obtain this data by asking subjects to recall events that occurred earlier in their lives. We also get retrospective data when we review medical records, birth certificates, death certificates, or other sources of historical data. In contrast, data collected during the course of the study is known as prospective data.
Retrospective data are often inexpensive to collect, but you should be concerned about their accuracy. The ability of a subject to recall information is sometimes affected by which group they are in.
Women who have experienced miscarriages, for example, are more likely to search for and remember events that they feel might "explain" their miscarriage, much more so than a group of comparable control subjects. This differential level of reporting is known as recall bias.
In addition, historical data are often incomplete and it is sometimes difficult to verify their accuracy. Therefore, retrospective data are considered less authoritative than prospective data.
An example of recall bias.
An interesting review of the research process that helped establish that smoking causes lung cancer can be found in Gail (1996). One aspect of the research process was addressing the issue of recall bias.
Doll (1950) studied the association between tobacco smoking and cancer. They selected 709 patients with lung cancer and an equal number of matched controls. The authors were concerned about the retrospective assessment of smoking among patients in both groups. Would patients with lung cancer exaggerate the amount of smoking? Would the interviewers press harder for information about smoking among the cancer patients?
While it would be impossible to totally rule out recall bias, the authors did examine a third group: patients who were diagnosed with lung cancer but who later found out that they suffered from a different disease (false cases). If recall bias were the sole explanation of the difference in reported smoking, then the group of false cases should have reported a level of smoking similar to that of the lung cancer patients. Instead they reported a lower level of smoking. This helped to rule out the possibility that recall bias alone accounted for the higher reported smoking levels in the lung cancer patients.
5.2 Was the change clinically significant?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science." William Thomson Kelvin (Lord Kelvin)
Knowing that a new therapy is better is not enough information. You need to quantify how much better the new therapy is. In this respect, confidence intervals are better than p-values. A p-value tells you whether the new therapy is better. A confidence interval tells you whether the new therapy is better and by how much. A confidence interval allows you to balance the size of the improvement against the possibility of greater cost or more side effects. Many journals now require confidence intervals instead of p-values.
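To see what a confidence interval adds to a p-value, here is a minimal sketch that computes both for a difference in two proportions. It uses a simple normal (Wald) approximation, which is only one of several possible methods, and the counts are hypothetical numbers chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Wald confidence interval and two-sided p-value for p_a - p_b.

    A normal-approximation sketch; it is not reliable for very small
    counts or proportions near 0 or 1.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, diff - z * se, diff + z * se, p_value


# Hypothetical data: 40 of 100 failures on standard therapy, 25 of 100 on new therapy.
diff, lower, upper, p_value = risk_difference_ci(40, 100, 25, 100)

print(f"p-value: {p_value:.3f}")
print(f"reduction in failure rate: {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The p-value only says that the difference is unlikely to be zero. The interval shows that the true reduction could plausibly be anywhere from quite small to quite large, which is what you need to weigh against cost and side effects.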
Statistical methods are sometimes able to detect differences that are so small as to be meaningless from any practical perspective. This is known as statistical significance without clinical significance. Always put the numbers into the perspective of your practice. Try to estimate how many of the patients you see within a year are likely to do better under the new therapy.
5.3 Were there enough subjects?
Every research study, especially negative studies, should justify the sample size chosen. It is unethical to perform research on humans or animals without first demonstrating that the sample size you have chosen is appropriate.
Justification of sample size is particularly important for a negative study (one where no difference between the standard and new therapies was found) and in studies assessing the equivalence of two therapies.
How can you tell if the sample size is too small?
Ideally, the authors should provide justification of the sample size in the paper itself. The justification is considered better if it is made a priori (prior to the start of the data collection). If no justification of sample size (e.g., power calculations) is given, examine the width of the confidence intervals. Very wide intervals indicate an inadequate sample size.
There are many examples of studies with inadequate sample sizes.
A revealing study of inadequate sample size appears in Freiman 1992. In a series of 71 publications appearing between 1960 and 1977, the outcome was either percent mortality, percent complications, or a similar outcome that could be measured as a percentage. The authors examined power, the ability of the study to detect either a moderate improvement (25% relative reduction in the outcome) or a large improvement (50% relative reduction in the outcome). For example, if a study showed a 40% mortality in the controls, then a 30% mortality rate in the treated group would be considered a moderate improvement and a 20% mortality rate would be considered a large improvement.
The results of the Freiman study were very disappointing.
Of the 71 papers, 57 had greater than a 50% chance for missing a moderate improvement and 31 had a 50% or greater chance for missing a large improvement.
One wonders why anyone would undertake a study when there is such a high probability for failure. You should never initiate a study unless you know that the chance of missing a reasonable improvement is less than 20%.
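The example above (40% mortality in the controls versus 30% or 20% in the treated group) can be turned into a rough power calculation. This is a sketch using a standard normal approximation for comparing two proportions, not the method used in the Freiman paper; the group size of 50 is a hypothetical choice.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    Normal approximation with equal group sizes; the negligible chance of
    rejecting in the wrong direction is ignored.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treated) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p_control * (1 - p_control) + p_treated * (1 - p_treated)) / n_per_group)
    return nd.cdf((abs(p_control - p_treated) - z_alpha * se_null) / se_alt)


def n_for_power(p_control, p_treated, target=0.80, alpha=0.05):
    """Smallest per-group sample size whose approximate power reaches the target."""
    n = 2
    while power_two_proportions(p_control, p_treated, n, alpha) < target:
        n += 1
    return n


# Hypothetical study with 50 patients per group.
moderate = power_two_proportions(0.40, 0.30, 50)   # 25% relative reduction
large = power_two_proportions(0.40, 0.20, 50)      # 50% relative reduction
needed = n_for_power(0.40, 0.30)                   # per-group size for 80% power

print(f"power for moderate improvement: {moderate:.2f}")
print(f"power for large improvement: {large:.2f}")
print(f"patients per group for 80% power (moderate): {needed}")
```

A chance of missing the improvement below 20% corresponds to power of at least 80%, which is the default target in `n_for_power`.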
Special issues in a study of equivalency.
Some studies attempt to show not that a new therapy is superior to the standard therapy, but that it is equivalent. Showing equivalence requires a very careful assessment of sample size.
An example of an equivalence study is when a drug company tests a generic drug and wishes to show equivalence with the (presumably more expensive) brand name drug.
If we applied the traditional testing approach, the company would have a strong disincentive to design the study with an adequate sample size. A small sample size is more likely to show equivalency under the traditional testing framework.
There are several modifications to the traditional testing framework for equivalency studies. The simplest approach uses a confidence interval for the ratio of the outcome under the new therapy to the outcome under the standard therapy. If both limits of the confidence interval are reasonably close to 1 (e.g., no less than 0.8 and no more than 1.25) then the two therapies are considered equivalent.
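Here is a minimal sketch of the confidence interval approach for an outcome measured as a proportion. It assumes a log-scale normal approximation for the ratio of two proportions and a 95% confidence level; both are illustrative choices, and the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_ratio_ci(events_new, n_new, events_std, n_std, level=0.95):
    """Ratio of two proportions with a log-scale normal-approximation interval."""
    ratio = (events_new / n_new) / (events_std / n_std)
    se_log = sqrt(1 / events_new - 1 / n_new + 1 / events_std - 1 / n_std)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return ratio, exp(log(ratio) - z * se_log), exp(log(ratio) + z * se_log)


def is_equivalent(lower, upper, low_limit=0.8, high_limit=1.25):
    """Equivalence is claimed only if the whole interval lies inside the limits."""
    return lower >= low_limit and upper <= high_limit


# Hypothetical large study: 160/200 successes on generic, 164/200 on brand name.
_, large_lo, large_hi = risk_ratio_ci(160, 200, 164, 200)
# Hypothetical small study with identical success rates: 16/20 versus 16/20.
_, small_lo, small_hi = risk_ratio_ci(16, 20, 16, 20)

print(f"large study: {large_lo:.2f} to {large_hi:.2f}, equivalent: {is_equivalent(large_lo, large_hi)}")
print(f"small study: {small_lo:.2f} to {small_hi:.2f}, equivalent: {is_equivalent(small_lo, small_hi)}")
```

Notice that the small study cannot demonstrate equivalence even though the two observed rates are identical, because its interval is too wide. Under this framework a small sample size works against the company rather than for it.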
Summary - How much did things change?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
5.1 Was there a quantitative measure of the size of the effect? Look for a confidence interval and compare the size of the effect to what you would expect to see in your practice.
5.2 Could other factors account for this effect? Look for differences in demographics between the two groups and ask if these differences could explain the results of the research.
5.3 Were any important outcomes forgotten? Research results should focus on endpoints that are of interest to your patients.
Additional resources
Rating health information on the Internet: navigating to knowledge or to Babel? Jadad, A. R. and A. Gagliardi (1998). Jama 279(8): 611-4.
CONTEXT: The rapid growth of the Internet has triggered an information revolution of unprecedented magnitude. Despite its obvious benefits, the increase in the availability of information could also result in many potentially harmful effects on both consumers and health professionals who do not use it appropriately. OBJECTIVES: To identify instruments used to rate Web sites providing health information on the Internet, rate criteria used by them, establish the degree of validation of the instruments, and provide future directions for research in this area. DATA SOURCES: MEDLINE (1966-1997), CINHAL (1982-1997), HEALTH (1975-1997), Information Science Abstracts (1966 to September 1995), Library and Information Science Abstracts (1969-1995), and Library Literature (1984-1996); the search engines Lycos, Excite, Open Text, Yahoo, HotBot, Infoseek, and Magellan; Internet discussion lists; meeting proceedings; multiple Web pages; and reference lists. INSTRUMENT SELECTION: Instruments used at least once to rate the quality of Web sites providing health information with their rating criteria available on the Internet. DATA EXTRACTION: The name of the developing organization, Internet address, rating criteria, information on the development of the instrument, number and background of people generating the assessments, and data on the validity and reliability of the measurements. DATA SYNTHESIS: A total of 47 rating instruments were identified. Fourteen provided a description of the criteria used to produce the ratings, and 5 of these provided instructions for their use. None of the instruments identified provided information on the interobserver reliability and construct validity of the measurements. CONCLUSIONS: Many incompletely developed instruments to evaluate health information exist on the Internet. It is unclear, however, whether they should exist in the first place, whether they measure what they claim to measure, or whether they lead to more good than harm.
Chapter 6: Special guidelines for overviews and meta-analyses
Introduction
Meta-analysis is the quantitative pooling of data from two or more studies. When you are examining the results of a meta-analysis, you should ask the following questions:
6.1 Were apples combined with oranges? Heterogeneity among studies may make any pooled estimate meaningless.
6.2 Were all of the apples rotten? The quality of a meta-analysis cannot be any better than the quality of the studies it is summarizing.
6.3 Were some apples left on the tree? An incomplete search of the literature can bias the findings of a meta-analysis.
6.4 Did the pile of apples amount to more than just a hill of beans? Make sure that the meta-analysis quantifies the size of the effect in units that you can understand.
Declining sperm counts
In 1992, the British Medical Journal published a controversial meta-analysis. This study (Carlsen et al 1992) reviewed 61 papers published between 1938 and 1991 and showed that there was a significant decrease in sperm count and in seminal volume over this period of time. For example, a linear regression model on the pooled data provided an estimated average count of 113 million per ml in 1940 and 66 million per ml in 1990.
Several researchers (Fisch and Goluboff 1996, Olsen et al 1995) noted heterogeneity in this meta-analysis, a mixing of apples and oranges. Studies before 1970 were dominated by studies in the United States and particularly studies in New York. Studies after 1970 included many other locations including third world countries. Thus the early studies were United States apples. The later studies were international oranges. There was also substantial variation in collection methods, especially in the extent to which the subjects adhered to a minimum abstinence period.
The original meta-analysis and the criticisms of it highlight both the greatest weakness and the greatest strength of meta-analysis.
Meta-analysis is the quantitative pooling of data from studies with sometimes small and sometimes large disparities. Think of it as a multi-center trial where each center gets to use its own protocol and where some of the centers are left out.
On the other hand, a meta-analysis lays all the cards on the table. Sitting out in the open are all the methods for selecting studies, abstracting information, and combining the findings. Meta-analysis allows objective criticism of these overt methods and even allows replication of the research.
Contrast this to an invited editorial or commentary that provides a subjective summary of a research area. Even when the subjective summary is done well, you cannot effectively replicate the findings. Since a subjective review is a black box, the only way, it seems, to repudiate a subjective summary is to attack the messenger.
Meta-analysis is used in a variety of different areas. Vine et al 1994 used meta-analysis to study the relationship between smoking and sperm concentration. Oehninger et al 2000 assessed the utility of sperm function assays in predicting successful outcomes in IVF. Goldberg et al 1999 compared intrauterine and intracervical insemination with frozen donor sperm. Evers et al 2001 reviewed the effectiveness of varicocelectomy in subfertile men.
6.1 Were apples combined with oranges?
Meta-analyses should not have inclusion criteria that are too broad. Including too many studies can lead to problems with "apples-to-oranges" comparisons. Example: When studying the effect of cholesterol lowering drugs, it makes no sense to combine a study of patients with recent heart attacks with another study of patients with high cholesterol but no previous heart attacks.
There is a lot of variability in how research is conducted. Even in carefully controlled randomized control trials, researchers have tremendous discretion. Sometimes this discretion creates heterogeneity among studies, making it difficult to combine the studies.
Heterogeneity in the composition of the treatment and control groups.
Researchers can differ in the inclusion and exclusion criteria.
Even if these criteria do not differ, there may still be differences in the baseline levels of health in the patients, due to geographical differences in the patient population.
The controls could be selected independently, or they could be matched to the treatment group subjects.
The control subjects could be given no treatment, a placebo, or a standard treatment.
The treatment could differ, such as differences in dose or timing of a drug.
Heterogeneity in the design of the study.
The length of follow-up for the patients could differ.
The proportion of patients who drop out could differ as well as the proposed statistical treatment of these dropouts.
Heterogeneity in the management of the patients and in the outcome.
How comorbid conditions are treated.
How complications are handled.
How much discretion the patient's physician has in controlling patient care.
The outcome measure itself could differ. For example, Abramson (1990) discusses a meta-analysis of hypertension treatment in the elderly. Some of the studies examined cardiovascular deaths and others examined cardiovascular events. Other studies examined cerebrovascular deaths, cerebrovascular events, cardiac deaths, coronary heart disease deaths, and/or total deaths.
How to handle heterogeneity.
Some level of heterogeneity is acceptable. After all, the purpose of research is to generalize results to large groups of patients. Furthermore, demonstrating that a treatment shows consistent results across a variety of conditions strengthens our confidence in that treatment.
Nevertheless, you should be aware of the problems that excessive heterogeneity can cause. Mixing apples and oranges may not be so bad; you get a fruit salad this way. But when heterogeneity becomes too large, you might end up combining not apples and oranges but apples and onions.
Inclusion of very old studies.
Inclusion of very old studies can also be problematic. They could differ from more recent studies because of changes in medical care or in the natural course of the disease.
A meta-analysis of sperm counts was criticized for this reason. The meta-analysis included studies from the 1990's to as far back as the 1940's. Any comparisons of data over five decades would be difficult because of the many changes in laboratory equipment and methods over that time frame.
Sensitivity analysis.
A good approach to heterogeneity is to include a wide range of studies, but then examine the sensitivity of the results by looking at more narrowly drawn subsets of the studies.
The authors can also weight studies by a quality factor and give greater emphasis to randomized studies, which are less likely to have bias. In either case, the question a sensitivity analysis asks is the same: would the results change if we changed the entry criteria?
In general, heterogeneity increases uncertainty, but this uncertainty cannot be reflected in the width of the confidence limits in the meta-analysis results. When there is heterogeneity, the most information may reside not in a single estimate of how effective the treatment is, but in a careful examination of the variation in the treatment under different conditions.
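A sensitivity analysis of the kind described above can be sketched in a few lines. The example pools study results with inverse-variance (fixed-effect) weights, which is one common pooling method and is my assumption here, and then repeats the pooling on a more narrowly drawn subset. The four studies are hypothetical.

```python
from math import sqrt


def pool_fixed_effect(studies):
    """Inverse-variance weighted average with an approximate 95% interval."""
    weights = [1 / s["se"] ** 2 for s in studies]
    estimate = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
    se = sqrt(1 / sum(weights))
    return estimate, estimate - 1.96 * se, estimate + 1.96 * se


# Hypothetical studies: effect is a log relative risk, se is its standard error.
studies = [
    {"name": "A", "effect": -0.30, "se": 0.10, "randomized": True},
    {"name": "B", "effect": -0.25, "se": 0.15, "randomized": True},
    {"name": "C", "effect": -0.60, "se": 0.12, "randomized": False},
    {"name": "D", "effect": -0.55, "se": 0.20, "randomized": False},
]

all_est, all_lo, all_hi = pool_fixed_effect(studies)
randomized_only = [s for s in studies if s["randomized"]]
rct_est, rct_lo, rct_hi = pool_fixed_effect(randomized_only)

print(f"all studies:     {all_est:.2f} ({all_lo:.2f} to {all_hi:.2f})")
print(f"randomized only: {rct_est:.2f} ({rct_lo:.2f} to {rct_hi:.2f})")
```

If the pooled estimate shifts noticeably when the subset changes, as it does in this made-up example, the single overall number deserves less trust than the pattern of variation across subsets.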
6.2 Were all of the apples rotten?
The quality of a meta-analysis is constrained by the quality of articles that are used in a meta-analysis. Meta-analysis cannot correct or compensate for methodologically flawed studies. In fact, meta-analysis may reinforce or amplify the flaws of the original studies.
Observational studies in a meta-analysis.
The use of meta-analysis on observational studies is very controversial. Some experts have argued that the biases inherent in observational studies make a meta-analysis an exercise in mega-silliness. But even those experts who do not take such an extreme viewpoint warn that the current statistical methods for summarizing the results of observational studies may grossly understate the amount of uncertainty in the final result.
Sensitivity analysis may be a useful way of highlighting the uncertainties in a meta-analysis of observational studies. Restricting the meta-analysis to selective subgroups of the data can yield insight into the size and direction of biases in observational studies. For example, the researchers could contrast case-control designs with cohort designs, with the latter expected to show less bias, in general. Or the researchers could compare retrospective studies to prospective studies, where again, the latter is expected to show less bias in general. Another possibility is to compare studies by the degree to which measurement error is expected to cause problems. In general, researchers should try to stratify the observational studies by known sources of bias.
Meta-analyses of randomized trials.
Some meta-analyses restrict their attention to randomized trials because these studies are less likely to have problems with bias. In other words, they wish to avoid mixing bad observational apples with good randomized trial apples. Sometimes further restrictions can be made on the basis of partial or full blinding of results or on the proper accounting of dropouts.
Even for randomized trials, sensitivity analysis may help. Researchers can use "quality scores" to rate individual studies and then see what happens when studies are restricted to those of highest quality only.
Meta-analysis of studies with small sample sizes.
Some experts advocate great caution in the assessment of meta-analyses where all of the trials consist of small sample size studies. The effect of publication bias can be far more pronounced here than in situations where some medium and large size trials are included.
6.3 Were some apples left on the tree?
One of the greatest concerns in a meta-analysis is whether all the relevant studies have been identified. If some studies are missed, this could lead to serious biases.
Publication bias.
Many important studies are never published; these studies are more likely to be negative (Dickersin 1990). This is known as publication bias. The inclusion of unpublished studies, however, is controversial (Cook 1993).
The existence of publication bias and risk factors for its occurrence. Dickersin, K. (1990). Jama 263(10): 1385-9.
Publication bias is the tendency on the parts of investigators, reviewers, and editors to submit or accept manuscripts for publication based on the direction or strength of the study findings. Much of what has been learned about publication bias comes from the social sciences, less from the field of medicine. In medicine, three studies have provided direct evidence for this bias. Prevention of publication bias is important both from the scientific perspective (complete dissemination of knowledge) and from the perspective of those who combine results from a number of similar studies (meta-analysis). If treatment decisions are based on the published literature, then the literature must include all available data that is of acceptable quality. Currently, obtaining information regarding all studies undertaken in a given field is difficult, even impossible. Registration of clinical trials, and perhaps other types of studies, is the direction in which the scientific community should move.
Another aspect of publication bias is that the delay in publication of negative results is likely to be longer than that for positive studies. For example, Stern and Simes 1997 showed that among 130 clinical trials, the median time to publication was 4.7 years among the positive studies and 8.0 years among the negative studies. So a meta-analysis restricted to a certain time window may be more likely to exclude published research that is negative.
Many experts are advocating the registration of trials as a way of avoiding publication bias. If trials are registered prospectively (i.e., prior to data collection and analysis) then they can be included in any appropriate meta-analysis without worry about publication bias.
Duplicate publication.
Duplicate publication is the flip side of the publication bias coin. Studies which are positive are more likely to appear more than once in publication. This is especially problematic for multi-center trials where individual centers may publish results specific to their site. Tramer et al (1997) found 84 studies of the effect of ondansetron on postoperative emesis. Unfortunately, 14 of these studies (17%) were second or even third time publications of the same data set. The duplicate studies had much larger effects and adding the duplicates to the originals produced an overestimation of treatment efficacy of 23%. Tracking down the duplicate publications was quite difficult. More than 90% of the duplicate publications did not cross-reference the other studies. Four pairs of identical trials were published by completely different authors without any common authorship.
The limitations of a Medline search.
While a Medline search is the most convenient way to identify published research, it should not be the only source of publications for a meta-analysis. Medline searches cover only 3,000 of some 13,000 medical journals (Halvorsen 1992). The studies missed by Medline and other databases are more likely to be negative studies.
Furthermore, these databases may fail to index major journals in the third world that can provide important trials. Egger (1997) cites an interesting example of how Medline excludes most Indian journals, even though these journals are published in English and India produces a significant amount of medical research.
Foreign language publications.
Some meta-analyses restrict their attention to English language publications only. While this may seem like a convenience, in some situations, researchers might tend to publish in an English language journal for those trials which are positive, and publish in a (presumably less prestigious) native language journal for those trials which are negative. Interestingly, some studies have shown that the quality of studies published in other languages is comparable to the quality of studies published in English.
How to avoid bias from exclusion of publications.
A search for studies should involve several bibliographic databases, registries for clinical trials, examination of the bibliographies of all articles found, the so-called gray literature (presentation abstracts, dissertations, theses, etc.), and a letter sent out to key researchers calling for unpublished papers.
Consider the search strategy adopted in Evers et al 2001.
Relevant trials were identified in the Cochrane Menstrual Disorders and Subfertility Group's specialised register of controlled trials. A MEDLINE search, using the group's search strategy, was performed for the period 1966-2000. Also, hand searching was performed of 22 specialist journals in the field from their first issue till 2000. Cross references and references from review articles were checked.
Sensitivity analysis is also useful here. If the results from published studies are comparable to the results from unpublished studies, for example, then publication bias is less of a concern. Along the same lines, the authors can estimate the number of undiscovered negative studies that would be required to overturn the results of this meta-analysis.
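One well-known way to estimate the number of undiscovered negative studies is Rosenthal's "fail-safe N". The text above does not name a specific method, so treat this as one illustrative option. It asks how many unpublished studies averaging a z-score of zero would have to exist to pull the combined result below one-sided significance. The z-scores are hypothetical.

```python
from statistics import NormalDist


def fail_safe_n(z_scores, alpha=0.05):
    """Rosenthal's fail-safe N: (sum of z)^2 / z_crit^2 - k, floored at zero.

    z_crit is the one-sided critical value for the chosen alpha.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    k = len(z_scores)
    n = sum(z_scores) ** 2 / z_crit ** 2 - k
    return max(0.0, n)


# Hypothetical z-scores from five published studies.
published_z = [2.1, 1.8, 2.5, 0.9, 1.4]
hidden_needed = fail_safe_n(published_z)

print(f"null studies needed to overturn the result: about {hidden_needed:.0f}")
```

A small fail-safe N relative to the number of published studies means that only a handful of studies sitting in file drawers would be enough to erase the finding.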
Publication bias is also more likely to occur for studies with small sample sizes. If the results of a meta-analysis are stratified by the sample sizes in the studies, a shift away from the null hypothesis in the smaller studies would be a warning flag about the possibility of publication bias. Statistical and graphical methods have been proposed to examine this further (Sterne et al 2001).
Subjectivity
"Blinding," a common tool in other research areas should also be used in meta-analyses. Blinding prevents the differential application of inclusion/exclusion criteria. The people deciding whether a paper meets the inclusion/exclusion criteria should be unaware of the authors of that paper and the journal. They should also include or exclude the paper on the basis of the methods section only; they should not see the results section until later.
There is empirical evidence, however, that blinding does not affect the conclusions of a meta-analysis (Jadad et al 1996, Berlin et al 1997). Furthermore, blinding takes substantial time and energy.
Data should be extracted from papers by multiple sources and their level of agreement should be assessed. Researchers have found disagreements even on such fundamental concepts as whether a study was positive or negative (Glass 1981).
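One common way to assess agreement between two people extracting data is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The text does not prescribe a particular statistic, so this is offered only as a sketch; the two sets of ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters classifying the same items.

    Undefined (division by zero) when both raters use a single identical
    category for every item.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical classifications of ten papers as positive or negative.
reader_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
reader_2 = ["pos", "neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg", "pos"]

kappa = cohens_kappa(reader_1, reader_2)
print(f"raw agreement: 8 of 10 papers, kappa: {kappa:.2f}")
```

The two readers agree on 8 of 10 papers, but because some of that agreement would happen by chance, kappa is noticeably lower than 0.8.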
Like any other research project, an overview or meta-analysis needs a protocol. Unfortunately, many published meta-analyses do not state whether a protocol was used (Sacks 1992). The protocol should specify: the inclusion/exclusion criteria for studies; a detailed description of the process used to identify studies; and the statistical methods used to combine results. Without a protocol, the meta-analysis research is not reproducible.
Authors have been shown to be biased in the articles that they cite in the bibliographies of their research papers (Gotzsche 1987; Ravnskov 1992). This same bias could potentially affect the selection of articles in a meta-analysis.
If the authors do not present objective criteria for the selection of articles in their overview or meta-analysis, then you should be concerned about possible conscious or sub-conscious bias in the selection process.
Researchers should also list all of the articles found in the original search, not just the articles used. This allows others to examine whether the inclusion/exclusion criteria were applied appropriately.
6.4 Did the pile of apples amount to more than just a hill of beans?
It's not enough to know that the overall effect of a therapy is positive. You have to balance the magnitude of the effect versus the added cost and/or the side effects of the new therapy. Unfortunately, most meta-analyses use an effect size (the improvement due to the therapy divided by the standard deviation). The effect size is unitless, allowing the combination of results from studies where slightly different outcomes with slightly different measurement units might have been used.
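The definition of effect size in the paragraph above is easy to write down, and doing so shows why the number is hard to interpret on its own. In this sketch the standard deviations are hypothetical values chosen only to illustrate the point.

```python
def effect_size(mean_new, mean_standard, sd):
    """Improvement due to the therapy divided by the standard deviation."""
    return (mean_new - mean_standard) / sd


def raw_difference(effect, sd):
    """Convert a unitless effect size back into the original measurement units."""
    return effect * sd


# The same effect size of 0.1 means very different things on different scales.
# Hypothetical standard deviations: 500 g for birth weight, 10 mmHg for blood pressure.
birth_weight_gain = raw_difference(0.1, sd=500)     # grams
blood_pressure_drop = raw_difference(0.1, sd=10)    # mmHg

print(f"effect size 0.1 on birth weight: {birth_weight_gain:.0f} g")
print(f"effect size 0.1 on blood pressure: {blood_pressure_drop:.0f} mmHg")
```

To weigh an effect against costs and side effects, you need the difference in real units, which requires knowing the standard deviation that was used in the division.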
Vote counting.
Avoid "vote counting" or the tallying of positive versus negative studies. Vote counts ignore the possibility that some studies are negative solely because of their sample size. Abramson (1990) notes, for example, a meta-analysis of parenteral nutrition in cancer patients undergoing chemotherapy. Although each of the seven randomized control trials in the meta-analysis failed to achieve statistical significance, the pooled results were highly significant.
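The sketch below shows how this can happen. It uses seven hypothetical trials (not the data from the parenteral nutrition meta-analysis) and pools them with inverse-variance weights, which is an assumed pooling method chosen for simplicity.

```python
from math import sqrt

# Hypothetical trials: (estimated effect, standard error).
trials = [
    (0.25, 0.15), (0.18, 0.14), (0.22, 0.16), (0.15, 0.12),
    (0.28, 0.17), (0.20, 0.13), (0.12, 0.11),
]

Z_CRIT = 1.96  # two-sided 5% critical value

# Vote counting: how many trials are individually significant?
votes = sum(1 for effect, se in trials if abs(effect / se) > Z_CRIT)

# Pooling: inverse-variance weighted average of the same trials.
weights = [1 / se ** 2 for _, se in trials]
pooled_effect = sum(w * effect for w, (effect, _) in zip(weights, trials)) / sum(weights)
pooled_se = sqrt(1 / sum(weights))
pooled_z = pooled_effect / pooled_se

print(f"individually significant trials: {votes} of {len(trials)}")
print(f"pooled effect: {pooled_effect:.2f}, z = {pooled_z:.2f}")
```

Every trial points in the same direction but is too small to reach significance alone. A vote count would score this as seven negative studies, while the pooled estimate is clearly different from zero.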
Unitless measures
When you are examining a continuous outcome measure, you should be sure that the results are presented in interpretable units. A measure of effect size does not help you much because it is unitless and impossible to interpret. Consider a store that is offering a sale and announces boldly:
"All prices reduced by 0.8 standard deviations!"
One meta-analysis shows how important it is to express measurements in interpretable units. Lumley et al (2001) studied the effect of smoking cessation programs on the health of the fetus and infant. One of the outcome measures was birth weight, and the study showed that the typical program can improve birth weight by a statistically significant amount. The researchers then quantified the amount: 28g (95% confidence interval 9 to 49).
Keep in mind that this is measuring the effectiveness of the smoking cessation program, and not the effect of smoking cessation directly. Typically, you would have to send about 12 to 16 women to these programs in order to get one extra woman to quit smoking. So the effect seen here reflects, in part, how difficult it is to get people to change their behavior.
Still the small size of the effect is important. If you want to assess the costs and benefits of smoking cessation programs, it helps to know that the impact of the typical smoking cessation program on birth weight is quite small. This provides a useful yardstick for comparison to other prenatal interventions.
Where does meta-analysis sit on the hierarchy of evidence?
[Meta-analysis] possesses certain flaws and limitations that preclude its use as a broad-based methodologic approach for formulating definitive therapeutic recommendations. -- Boden 1992.
Bibliography
Meta-analysis: a review of pros and cons. Abramson J. Public Health Reviews 1990 18(1): 1-47.
Does blinding of readers affect the results of meta-analyses? Jesse A Berlin, on behalf of University of Pennsylvania Meta-analysis Blinding Study Group. Lancet 1997; 350: 185-186.
Evidence for decreasing quality of semen during past 50 years. Carlsen E, Giwercman A, Keiding N, Skakkebaek NE. BMJ 1992; 305(6854): 609-13.
Egger (1997)
Surgery or embolisation for varicocele in subfertile men (Cochrane Review). Evers JL, Collins JA, Vandekerckhove P. Cochrane Database Syst Rev 2001; 1: CD000479.
Geographic variations in sperm counts: a potential cause of bias in studies of semen quality. Fisch H, Goluboff ET. Fertil Steril 1996 May; 65(5): 1044-6.
Meta-analysis in social research. Glass GV, McGaw B, Smith ML. pp.18-20. Newbury Park CA: Sage (1981).
Comparison of intrauterine and intracervical insemination with frozen donor sperm: a meta-analysis. Goldberg JM, Mascha E, Falcone T, Attaran M. Fertil Steril 1999 Nov; 72(5): 792-5.
Reference bias in reports of drug trials. Gotzsche PC. BMJ 1987; 295(6599): 654-6.
Combining Results from Independent Investigations: Meta-analysis in Clinical Research. Halvorsen KT, Burdick E, Colditz GA, Frazier HS, Mosteller F. pp. 413-426, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
Interventions for promoting smoking cessation during pregnancy (Cochrane Review). Lumley J, Oliver S, Waters E. In: The Cochrane Library, 4, 2001. Oxford: Update Software. http://www.update-software.com/abstracts/ab001055.htm
Cigarette smoking and sperm density: a meta-analysis. Vine MF, Margolin BH, Morrison HI, Hulka BS. Fertil Steril 1994 Jan; 61(1): 35-43.
Sperm function assays and their predictive value for fertilization outcome in IVF therapy: a meta-analysis. Oehninger S, Franken DR, Sayed E, Barroso G, Kolm P. Hum Reprod Update 2000 Mar-Apr; 6(2): 160-8.
Have sperm counts been reduced 50 percent in 50 years? A statistical model revisited. Olsen GW, Bodner KM, Ramlow JM, Ross CE, Lipshultz LI. Fertil Steril 1995 Apr; 63(4): 887-93.
Frequency of citation and outcome of cholesterol lowering trials. Ravnskov U. BMJ 1992; 305(6855): 717.
Meta-Analyses of Randomized Control Trials: An Update of the Quality and Methodology. Sacks HS, Berrier J, Reitman D, Pagano D, Chalmers TC. pp. 427-442, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
Publication bias: evidence of delayed publication in a cohort study of clinical research projects. Jerome M Stern and R John Simes. BMJ 1997; 315: 640-645.
Systematic reviews in health care: Investigating and dealing with publication and other biases in meta-analysis. Jonathan A C Sterne, Matthias Egger, and George Davey Smith. BMJ 2001; 323: 101-105.
Meta-analysis of Observational Studies in Epidemiology: A Proposal for Reporting. Donna F. Stroup, PhD, MSc; Jesse A. Berlin, ScD; Sally C. Morton, PhD; Ingram Olkin, PhD; G. David Williamson, PhD; Drummond Rennie, MD; David Moher, MSc; Betsy J. Becker, PhD; Theresa Ann Sipe, PhD; Stephen B. Thacker, MD, MSc; for the Meta-analysis Of Observational Studies in Epidemiology (MOOSE) Group. April 19, 2000. JAMA. 2000;283:2008-2012. Also available at http://www.consort-statement.org/MOOSE.pdf
Impact of covert duplicate publication on meta-analysis: a case study. Martin R Tramèr, D John M Reynolds, R Andrew Moore, and Henry J McQuay. BMJ 1997; 315: 635-640.
Additional resources and materials
The Cochrane Library. www.update-software.com/cochrane/cochrane-frame.html
"The Cochrane Library is an electronic publication designed to supply high quality evidence to inform people providing and receiving care, and those responsible for research, teaching, funding and administration at all levels."
Meta-analysis in clinical trials reporting: has a tool become a weapon? [editorial]. Boden, W. E. (1992). Am J Cardiol 69(6): 681-6.
Systematic reviews in health care: Assessing the quality of controlled clinical trials. Peter Jüni, Douglas G Altman, and Matthias Egger. BMJ 2001; 323: 42-46.
A new system for grading recommendations in evidence based guidelines. Robin Harbour and Juliet Miller. BMJ 2001; 323: 334-336.
Rating the quality of evidence for clinical practice guidelines. Hadorn DC, Baker D, Hodges JS, Hicks N. J Clin Epidemiol 1996 Jul;49(7):749-54.
This article describes the system for rating the quality of medical evidence developed and used during creation of the Agency for Health Care Policy and Research-sponsored heart failure guideline. Previous approaches to rating evidence were not designed for use in the setting of clinical practice guidelines. The present system is based on the tenet that flaws in research design are serious to the extent they threaten the validity of the results of studies. A taxonomy of major and minor flaws based on that tenet was developed for randomized controlled trials and for cohort and medical registry studies. The use of the system is described in the context of two difficult clinical issues considered by the Panel: the role of coronary artery revascularization and the use of metoprolol.
PMID: 8691224
Assessment Criteria http://www.jr2.ox.ac.uk/bandolier/band6/b6-5.html
Evidence-Based Everything http://www.jr2.ox.ac.uk/bandolier/band12/b12-1.html
Web-based resources.
CONSORT: Consolidated Standards of Reporting Trials. (Accessed November 20, 2001) http://www.consort-statement.org/
"The CONSORT statement is an important research tool that takes an evidence-based approach to improve the quality of reports of randomized trials. CONSORT comprises a checklist and flow diagram to help improve the quality of reports of randomized controlled trials. It offers a standard way for researchers to report trials. The checklist includes items, based on evidence, that need to be addressed in the report; the flow diagram provides readers with a clear picture of the progress of all participants in the trial, from the time they are randomized until the end of their involvement. The intent is to make the experimental process more clear, flawed or not, so that users of the data can more appropriately evaluate its validity for their purposes."
McMaster Evidence Based Medicine Site.
http://hiru.hirunet.mcmaster.ca/ebm/ I found many of the references from this source. EBM "promotes the collection, interpretation, and integration of valid, important and applicable patient-reported, clinician-observed, and research-derived evidence. The best available evidence, moderated by patient circumstances and preferences, is applied to improve the quality of clinical judgements and facilitate cost-effective health care."
Skeptical Inquirer.
http://www.csicop.org/ Other examples were found in the pages of Skeptical Inquirer magazine, published by the Committee for the Scientific Investigation of Claims of the Paranormal. This group "encourages the critical investigation of paranormal and fringe-science claims from a responsible, scientific point of view and disseminates factual information about the results of such inquiries to the scientific community and the public. It also promotes science and scientific inquiry, critical thinking, science education, and the use of reason in examining important issues." They also maintain a web site, which has the full text of articles from back issues.
Bibliography
"Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer." Cameron E, Pauling L. Proceedings of the National Academy of Sciences (USA), 73: 3685-3689 (1976).
"A Case-Control Study of HIV Seroconversion in Health Care Workers After Percutaneous Exposure." Cardo DM, Culver DH, Ciesielski CA, Srivastava PU, Marcus R, Abiteboul D, Heptonstall J, Ippolito G, Lot F, McKibben PS, Bell DM. New England Journal of Medicine, 337(21): 1485-1490 (1997).
"The association of nonsteroidal anti-inflammatory drugs with upper gastrointestinal tract bleeding." Carson JL, Strom BL, Soper KA, et al, West SL, Morse ML. Arch Intern Med, 147: 85-8 (1987).
"Effects of Selenium Supplementation for Cancer Prevention in Patients With Carcinoma of the Skin: A Randomized Control Trial" Clark LC, Combs GF Jr, Turnbull BW, Slate EH, Chalker DK, Chow J, Davis LS, Glover RA, Graham GF, Gross EG, Krongrad A, Lesher JL, Park K, Sanders BB Jr, Smith CL, Taylor R. Journal of the American Medical Association, 276(24): 1957-1963 (1996).
"Should unpublished data be included in meta-analyses?" Cook DJ, Guyatt GH, Ryan E, Clifton J, Buckingham L, Willan A, McIlroy W, Oxman AD. Journal of the American Medical Association, 269: 2749-2753 (1993).
"The existence of publication bias and risk factors for its occurrence" Dickersin K. Journal of the American Medical Association, 263: 1385-1389 (1990).
"Smoking and Carcinoma of the Lung: Preliminary Report" Doll R, Hill AB. British Medical Journal, 1: 1451-1455 (1950).
"The Importance of Beta, the Type II Error, and Sample Size in the Design and Interpretation of the Randomized Control Trial" Freiman JA, Chalmers TC, Smith Jr H, Kuebler RR. pp. 357-373 in Medical Uses of Statistics, Second Edition, John C. Bailar III and Frederick Mosteller, editors, NEJM Books, Boston MA (1992).
"Statistics in Action." Gail, MH. Journal of the American Statistical Association, 91: 1-13 (1996).
Meta-analysis in social research. Glass GV, McGaw B, Smith ML. pp.18-20. Newbury Park CA: Sage (1981).
"Reference bias in reports of drug trials." Gotzsche PC. British Medical Journal, 295: 654-656 (1987).
"The healthy control subject in psychiatric research: Impulsiveness and volunteer bias." Gustavsson JP, Asberg M, Schalling D. Acta Psychiatrica Scandinavica, 96: 325-328 (1997).
"Combining Results from Independent Investigations: Meta-analysis in Clinical Research" Halvorsen KT, Burdick E, Colditz GA, Frazier HS, Mosteller F. pp. 413-426, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
"Assessing Cause and Effects from Trials: A Cautionary Note" Howel D, Bhopal R. Controlled Clinical Trials, 15: 331-334 (1994).
"Dietary Fat Intake and the Risk of Coronary Heart Disease in Women." Hu FB, Stampfer MJ, Manson JE, Rimm E, Colditz GA, Rosner BA, Hennekens CH, Willett WC. New England Journal of Medicine, 337(21): 1491-1499 (1997).
"Removing Bias in Surgical Trials." Johnson AG, Dixon JM. British Medical Journal. 314: 916-917 (1997).
"Crossover and Self-Controlled Designs in Clinical Research." Louis TA, Lavori PW, Bailar JC III, Polansky M. pp. 83-103, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
"High-dose vitamin C versus placebo in the treatment of patients with advanced cancer who have had no prior chemotherapy. A randomized double-blind comparison." Moertel C, Fleming T, Creagan E, Rubin J, O'Connell M, Ames M. New England Journal of Medicine, 312: 137-141 (1985).
"An addition to the controversy on sunlight exposure and melanoma risk: a meta-analytical approach." Nelemans PJ, Rampen FHJ, Ruiter DJ, Verbeek ALM. Journal of Clinical Epidemiology, 48: 1331-1342 (1995).
"Secondhand Smoke and Cholesterol in Children" Neufeld EJ, Mietus-Snyder M, Beisner AS, Baker AL, Newburger JW. Circulation, 96: 1403-1407 (1997).
"Cholesterol lowering trials in coronary heart disease: frequency of citation and outcome." Ravnskov U. British Medical Journal, 305: 15-19 (1992).
"Bias in analytical research." Sackett DL. J Chron Dis, 32: 51-63 (1979).
"Meta-Analyses of Randomized Control Trials: An Update of the Quality and Methodology" Sacks HS, Berrier J, Reitman D, Pagano D, Chalmers TC. pp. 427-442, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
"Inconsistencies and Errors in Alternative Medicine Research" Sampson W. Skeptical Inquirer, 21(5): 35-38 (September/October 1997).
"Cured and broiled meat consumption in relation to childhood cancer: Denver, Colorado (United States)" Sarasua S, Savitz DA. Cancer Causes and Control, 5: 141-148 (1994).
"Is Meta-Analysis a Valid Approach to the Evaluation of Small Effects in Observational Studies?" Shapiro S. Journal of Clinical Epidemiology. 50(3): 223-229 (1997).
"Fat Chance. Diet and Ischemic Stroke." Sherwin R, Price TR. Journal of the American Medical Association, 278(24): 2185-2186 (1997).
"Nicotine Patch Therapy in Adolescent Smokers." Smith TA, House Jr RF, Croghan IT, Gauvin TR, Colligan RC, Offord KP, Gomez-Dahl LC, Hurt RD. Pediatrics, 98(4): 659-667 (1996).
This webpage was written on 2001-11-01 and was last modified on 2008-07-08. Category: Statistical evidence
Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.


