Statistical Evidence. Chapter 5. Do the pieces fit together?

[This is the first draft of Chapter 5 of "Statistical Evidence."]

5.0 Introduction

Starting in the 1990's, researchers developed tools to collect and synthesize results of research across multiple studies. This approach, commonly called meta-analysis, involves the quantitative pooling of data from two or more studies. More recently, another term, systematic overview, has come into favor. A systematic overview involves the careful review and identification of all research studies associated with a topic, but it may or may not end up pooling the results of these studies. So meta-analysis represents a subset of all the systematic overviews. I tend to use the older term, meta-analysis, partly because I'm stubborn, but partly because I am interested in the quantitative aspects of this type of research. But most of my comments apply more broadly to systematic overviews.

When you are examining the results of a meta-analysis, you should ask the following questions:

Were apples combined with oranges? Heterogeneity among studies may make any pooled estimate meaningless.

Were some apples left on the tree? An incomplete search of the literature can bias the findings of a meta-analysis.

Were all of the apples rotten? The quality of a meta-analysis cannot be any better than the quality of the studies it is summarizing.

Did the pile of apples amount to more than just a hill of beans? Make sure that the meta-analysis quantifies the size of the effect in units that you can understand.

Case study: Declining sperm counts

In 1992, the British Medical Journal published a controversial meta-analysis. This study (Carlsen 1992) reviewed 61 papers published between 1938 and 1991 and showed that there was a significant decrease in sperm count and in seminal volume over this period of time. For example, a linear regression model on the pooled data provided an estimated average count of 113 million per ml in 1940 and 66 million per ml in 1990.

Several researchers (Olsen 1995; Fisch 1996) noted heterogeneity in this meta-analysis, a mixing of apples and oranges. Studies before 1970 were dominated by studies in the United States and particularly studies in New York. Studies after 1970 included many other locations including third world countries. Thus the early studies were United States apples. The later studies were international oranges. There was also substantial variation in collection methods, especially in the extent to which the subjects adhered to a minimum abstinence period.

The original meta-analysis and the criticisms of it highlight both the greatest weakness and the greatest strength of meta-analysis.

Meta-analysis is the quantitative pooling of data from studies with sometimes small and sometimes large disparities. Think of it as a multi-center trial where each center gets to use its own protocol and where some of the centers are left out.

On the other hand, a meta-analysis lays all the cards on the table. Sitting out in the open are all the methods for selecting studies, abstracting information, and combining the findings. Meta-analysis allows objective criticism of these overt methods and even allows replication of the research.

Contrast this to an invited editorial or commentary that provides a subjective summary of a research area. Even when the subjective summary is done well, you cannot effectively replicate the findings. Since a subjective review is a black box, the only way, it seems, to repudiate a subjective summary is to attack the messenger.

Meta-analysis is used in a variety of different areas. (Vine 1994) used meta-analysis to study the relationship between smoking and sperm concentration. (Oehninger 2000) assessed the utility of sperm function assays in predicting successful outcomes in IVF. (Goldberg 1999) compared intrauterine and intracervical insemination with frozen donor sperm. (Evers 2001) reviewed the effectiveness of varicocelectomy in subfertile men.

5.1 Were apples combined with oranges?

Meta-analyses should not have inclusion criteria that are too broad. Including too many studies can lead to problems with "apples-to-oranges" comparisons. For example, when you are studying the effect of cholesterol lowering drugs, it makes no sense to combine a study of patients with recent heart attacks with another study of patients with high cholesterol but no previous heart attacks.

There is a lot of variability in how research is conducted. Even in carefully controlled randomized controlled trials, researchers have tremendous discretion. Sometimes this discretion creates heterogeneity among studies, making it difficult to combine the studies. Some examples of heterogeneity cited in (Horwitz 1987) include:

Heterogeneity in the composition of the treatment and control groups

Heterogeneity in the design of the study

Heterogeneity in the management of the patients and in the outcome

The outcome measure itself could differ. For example, a review article on meta-analysis (Abramson 1990) cited a meta-analysis of hypertension treatment in the elderly, where some of the studies examined cardiovascular deaths and others examined cardiovascular events. Other studies examined cerebrovascular deaths, cerebrovascular events, cardiac deaths, coronary heart disease deaths, and/or total deaths.

Examples of heterogeneity

In a meta-analysis looking at antiretroviral combination therapy (Jordan 2002), a plot of duration of trial versus the log odds ratio showed that shorter duration trials of zidovudine had substantial evidence of effect (odds ratios much smaller than one) but that the longest duration studies had little or no evidence of effect (odds ratios very close to one).

In a meta-analysis (Gotzsche 1998) looking at dust mite control measures to help asthmatic patients, the studies exhibited heterogeneity across several factors. Six studies examined chemical interventions, thirteen examined physical interventions, and four examined a combination approach. Nine of these trials were crossovers, and in the remaining fourteen, there was a parallel control group. Seven studies had no blinding, three studies had partial blinding, and the remaining thirteen studies used a double blind. In nine studies the average age of the patients was only nine or ten years, but nine other studies had an average age of 30 or more. Eleven studies lasted eight weeks or less and five studies lasted a full year.

How to measure heterogeneity

There is a statistic, Cochran's Q, which provides a numeric measure of heterogeneity. When Q is roughly equal to the number of studies in the meta-analysis, there is little evidence of heterogeneity. When Q is much larger than the number of studies, then you have significant evidence of heterogeneity. There is a similar measure, I-squared, which is based on Cochran's Q (Higgins 2003). I-squared ranges between 0 and 100%, with small values (25% or less) implying that heterogeneity accounts for little or none of the variation between studies. Larger values, like 50% or 75%, imply that heterogeneity is a serious problem.
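To make the arithmetic concrete, here is a minimal Python sketch of Cochran's Q and I-squared. It assumes that each study supplies an effect estimate (such as a log odds ratio) and its standard error, and that the studies are weighted by the inverse of their variances. The numbers are hypothetical and are not taken from any study cited here. Strictly speaking, Q is compared to its degrees of freedom, which is the number of studies minus one.

```python
def cochran_q(effects, std_errors):
    """Return (Q, degrees of freedom, I-squared as a percentage)."""
    # Inverse-variance weights: precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared is the share of Q in excess of its degrees of freedom.
    i_squared = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i_squared


# Hypothetical log odds ratios and standard errors for five studies.
log_odds_ratios = [-0.50, -0.30, -0.45, 0.10, -0.60]
std_errors = [0.20, 0.25, 0.15, 0.30, 0.22]

q, df, i_squared = cochran_q(log_odds_ratios, std_errors)
print(f"Q = {q:.2f} on {df} degrees of freedom, I-squared = {i_squared:.0f}%")
```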

Many researchers prefer not to use any quantitative measure of heterogeneity because these measures do not seem to identify cases where heterogeneity is very large (Gavaghan 2000). Instead, these researchers advocate a qualitative examination of heterogeneity by looking at specific study characteristics.

A forest plot can also provide visual evidence of heterogeneity. A forest plot shows each individual study estimate (represented by a square) and confidence limits (represented as lines extending from the square to the upper and lower limits). The size of the square represents the weight that each study receives. There are many ways in which heterogeneity can manifest itself, but you should be especially watchful for one or two outlying studies or an obvious bimodal pattern in the individual study estimates.
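The quantities behind a forest plot are easy to compute. The sketch below assumes that each study reports a log odds ratio and a standard error and that the weights are inverse-variance weights, which is one common choice but not the only one. The study names and numbers are made up for illustration.

```python
import math


def forest_rows(names, log_odds_ratios, std_errors, z=1.96):
    """Odds ratio, 95% confidence limits, and percent weight per study."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    rows = []
    for name, y, se, w in zip(names, log_odds_ratios, std_errors, weights):
        rows.append({
            "study": name,
            "odds_ratio": math.exp(y),
            # Limits are computed on the log scale, then back-transformed.
            "lower": math.exp(y - z * se),
            "upper": math.exp(y + z * se),
            "weight_pct": 100.0 * w / total,
        })
    return rows


# Hypothetical studies; a real plot would draw a square sized by weight_pct.
rows = forest_rows(["Study A", "Study B", "Study C"],
                   [-0.40, -0.10, -0.55],
                   [0.30, 0.15, 0.45])
for r in rows:
    print(f'{r["study"]:8s} {r["odds_ratio"]:.2f} '
          f'({r["lower"]:.2f} to {r["upper"]:.2f})  weight {r["weight_pct"]:.0f}%')
```

Scanning the printed limits for studies that do not overlap the others is a rough stand-in for looking at the plot itself.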

Example: In a study of contrast-induced nephropathy after intravascular angiography (Bagshaw 2004), the odds ratios for the effectiveness of prophylactic acetylcysteine plus hydration versus hydration alone are displayed in a forest plot, as shown below. Odds ratios less than 1 represent findings in favor of acetylcysteine.

The odds ratios in this plot show a reasonable amount of consistency, which is evidence that there is little or no heterogeneity.

How to handle heterogeneity

Some level of heterogeneity is acceptable. After all, the purpose of research is to generalize results to large groups of patients. Furthermore, demonstrating that a treatment shows consistent results across a variety of conditions strengthens our confidence in that treatment.

Nevertheless, you should be aware of the problems that excessive heterogeneity can cause. Mixing apples and oranges may not be so bad; you get a fruit salad this way. But when heterogeneity becomes too large, you might end up combining not apples and oranges but apples and onions.

Subgroup analysis

When there is substantial heterogeneity, you can examine and compare subgroups of the studies. In a meta-analysis (Geddes 2000) studying atypical antipsychotics, the dose of the comparison drug (haloperidol or an equivalent) varied substantially. Among those studies where the dose of haloperidol was greater than 12 mg/day, atypical antipsychotics showed advantages in efficacy or tolerability. When the dose was less than or equal to 12 mg/day, the atypical antipsychotics showed no advantages in these areas.

Meta-regression

You can try to adjust for heterogeneity in a meta-analysis. This would work very similarly to the adjustment for covariates in a regression model. For example, (Derry 2000) used meta-analysis to see if long term aspirin therapy was associated with problems with gastrointestinal hemorrhage. They identified 24 studies that looked at aspirin as a preventive measure against heart attacks. In each of these studies, the rate of gastrointestinal hemorrhages was recorded for both the aspirin group and the placebo or no treatment group. There was substantial heterogeneity in the dosage of aspirin used in the studies, however, with some studies giving as little as 50 mg/day and some as much as 1500 mg/day.

This was actually good news in a way, because the researchers wanted to see if the risk of gastrointestinal hemorrhage was dependent on the dose of aspirin. A plot of the dose versus the risk showed that there was indeed an increased risk, but this risk seemed to be unrelated to the dosage.
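A meta-regression of this kind can be sketched as a weighted regression of each study's log odds ratio on its dose. The code below is an illustration only: the doses and log odds ratios are invented, not the values from (Derry 2000), and the weights are assumed to be inverse variances under a fixed-effect model.

```python
import math


def meta_regression(x, y, std_errors):
    """Weighted least squares fit of y = intercept + slope * x.

    Weights are 1 / se^2. The slope standard error is valid only under
    the fixed-effect assumption that these variances are known.
    """
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    slope_se = math.sqrt(1.0 / sxx)
    return intercept, slope, slope_se


# Hypothetical data: aspirin dose in mg/day and log odds ratio of hemorrhage.
dose = [50, 75, 300, 500, 1500]
log_or = [0.45, 0.55, 0.50, 0.60, 0.52]
se = [0.20, 0.25, 0.15, 0.30, 0.35]

intercept, slope, slope_se = meta_regression(dose, log_or, se)
print(f"intercept = {intercept:.3f}")
print(f"slope per mg/day = {slope:.5f} (SE {slope_se:.5f})")
```

An intercept above zero with a slope that is small relative to its standard error would match the pattern described above: an increased risk that does not seem to depend on dose.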

Inclusion of very old studies

Inclusion of very old studies can cause problems, but it depends a lot on the topic. Anything in the field of neonatology would have to have a very narrow time window because the field has changed so much so rapidly.

Other areas where the practice of medicine has been much more stable could have wider time windows. I've seen several reviews that have covered half a century of studies.

If you do select a wide time window, be sure to see if your results are similar when you restrict yourself to just the most recent studies.

Ask yourself if there was a sudden change in technology that makes any comparisons before and after that technology an apples-to-oranges comparison. So, for example, a meta-analysis involving AIDS patients should restrict itself to the years following the use of AZT.

Also, ask yourself if researchers in your area tend to discount any research that is more than X years old. If so, then your meta-analysis would lose credibility among those researchers if it included studies older than X.

Sensitivity analysis

A good approach to heterogeneity is to include a wide range of studies, but then examine the sensitivity of the results by looking at more narrowly drawn subsets of the studies.

The authors can also weight studies by a quality factor and give greater emphasis to randomized studies, which are less likely to have bias. They can also perform sensitivity analyses: would the results change if the entry criteria changed?
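A sensitivity analysis of this type amounts to pooling the same studies several times under different entry criteria. The sketch below assumes a simple inverse-variance pooled estimate; the study records, the `randomized` flag, and the 1995 cutoff are all hypothetical.

```python
import math


def pooled_estimate(studies):
    """Inverse-variance pooled effect with 95% confidence limits."""
    weights = [1.0 / s["se"] ** 2 for s in studies]
    est = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return est, est - 1.96 * se, est + 1.96 * se


# Hypothetical studies (effects on the log odds ratio scale).
studies = [
    {"name": "A", "effect": -0.40, "se": 0.15, "randomized": True, "year": 1999},
    {"name": "B", "effect": -0.10, "se": 0.20, "randomized": True, "year": 1990},
    {"name": "C", "effect": -0.70, "se": 0.30, "randomized": False, "year": 1997},
    {"name": "D", "effect": -0.25, "se": 0.18, "randomized": True, "year": 2001},
    {"name": "E", "effect": -0.90, "se": 0.35, "randomized": False, "year": 1985},
]

# Each entry is a more narrowly drawn subset of the full set of studies.
subsets = {
    "all studies": studies,
    "randomized only": [s for s in studies if s["randomized"]],
    "1995 or later": [s for s in studies if s["year"] >= 1995],
}

for label, subset in subsets.items():
    est, lo, hi = pooled_estimate(subset)
    print(f"{label:16s} n={len(subset)}  {est:+.2f} ({lo:+.2f} to {hi:+.2f})")
```

If the pooled estimates stay close to one another across the subsets, the conclusions are not very sensitive to the entry criteria.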

In general, heterogeneity increases uncertainty, but this uncertainty is not always reflected in the width of the confidence limits in the meta-analysis results. When there is heterogeneity, the most information may reside not in a single estimate of how effective the treatment is, but in a careful examination of how the treatment effect varies under different conditions.

5.2 Were some apples left on the tree?

One of the greatest concerns in a meta-analysis is whether all the relevant studies have been identified. If some studies are missed, this could lead to serious biases. In any meta-analysis, you have to draw a line somewhere. Studies that fail to meet your criteria will not be included in the results. But this can lead to serious controversy. In a review of mammography published in the Cochrane Database of Systematic Reviews (Olsen 2001), seven studies were identified, but only two were of sufficient quality to be used. The Cochrane Review of these two studies reached a negative conclusion, but would have reached the opposite conclusion if the other five studies had been added back in (Mayor 2001).

Publication bias

Many important studies are never published; these studies are more likely to be negative (Dickersin 1990). This is known as publication bias. Publication bias is the tendency on the part of investigators, reviewers, and editors to submit or accept manuscripts for publication based on the direction or strength of the study findings. Much of what has been learned about publication bias comes from the social sciences, less from the field of medicine. In medicine, three studies have provided direct evidence for this bias. Prevention of publication bias is important both from the scientific perspective (complete dissemination of knowledge) and from the perspective of those who combine results from a number of similar studies (meta-analysis). If treatment decisions are based on the published literature, then the literature must include all available data that is of acceptable quality. Currently, obtaining information regarding all studies undertaken in a given field is difficult, even impossible. Registration of clinical trials, and perhaps other types of studies, is the direction in which the scientific community should move.

The inclusion of unpublished studies, however, is controversial. At least one reference (Cook 1993) has argued that unpublished studies have failed to meet a basic quality screen, the peer review process. Including studies that have not been peer reviewed will lower the overall quality of the meta-analysis.

Another aspect of publication bias is that the delay in publication of negative results is likely to be longer than that for positive studies. For example, among 130 clinical trials, the median time to publication was 4.7 years among the positive studies and 8.0 years among the negative studies (Stern 1997). So a meta-analysis restricted to a certain time window may be more likely to exclude published research that is negative.

Many meta-analyses select studies listed in the bibliographies of papers found on the initial search. While this does broaden the number of studies included, there is a documented preference among research authors to cite positive studies more often than negative studies (Kjaergard 2002).

Many experts are advocating the registration of trials as a way of avoiding publication bias. If trials are registered prospectively (i.e., prior to data collection and analysis) then they can be included in any appropriate meta-analysis without worry about publication bias.

Duplicate publication

Duplicate publication is the flip side of the publication bias coin. Studies which are positive are more likely to appear more than once in publication. This is especially problematic for multi-center trials where individual centers may publish results specific to their site. In 84 studies of the effect of ondansetron on postoperative emesis, 14 (17%) were second or even third time publications of the same data set (Tramer 1997). The duplicate studies had much larger effects and adding the duplicates to the originals produced an overestimation of treatment efficacy of 23%. Tracking down the duplicate publications was quite difficult. More than 90% of the duplicate publications did not cross-reference the other studies. Four pairs of identical trials were published by completely different authors without any common authorship.

The limitations of a Medline search

While a Medline search is the most convenient way to identify published research, it should not be the only source of publications for a meta-analysis. Medline searches cover only 3,000 of some 13,000 medical journals (Bailar 1992, page 422). The studies missed by Medline and other databases are more likely to be negative studies.

Furthermore, these databases may fail to index major journals in the third world that can provide important trials. For example, Medline excludes most Indian journals, even though these journals are published in English and India produces a significant amount of medical research (Egger 1998).

Foreign language publications

Some meta-analyses restrict their attention to English language publications only. While this may seem like a convenience, in some situations, researchers might tend to publish in an English language journal for those trials which are positive, and publish in a (presumably less prestigious) native language journal for those trials which are negative. Interestingly, some studies have shown that the quality of studies published in other languages is comparable to the quality of studies published in English.

Picking the low hanging fruit

In an informal meta-analysis, you should also worry about the tendency for people to preferentially choose articles that are convenient. For example, there is a natural tendency to rely on articles where the full text is available on the Internet or where the abstract is available for review (Wentz 2002).

How to evaluate publication bias

The most common approach to evaluate publication bias is to use a funnel plot. The funnel plot displays the results of the individual studies (for example, the log odds ratio) on the horizontal axis, and the size of the study (or sometimes the standard error of the study) on the vertical axis. Often a reference line is drawn at the value that represents no effect to visually separate the region where the new treatment is considered more effective from the region where the standard treatment (or placebo) is considered more effective.

If there is no publication bias, then the funnel plot should show symmetry for both small sample sizes and large sample sizes, though you should expect to see less variation as the sample size increases. This leads to a funnel shape. A deviation from a symmetric funnel shape indicates the possibility of publication bias, especially if there are very few studies with a small sample size in the region where the standard treatment appears to be more effective.

The problem with the funnel plot is that there is no standardized way to draw the plot (Tang 2000) and interpretation of the funnel plot is subjective. There are several quantitative measures based on the funnel plot but these may be difficult to interpret.
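One of the quantitative measures based on the funnel plot is a regression commonly known as Egger's test. It regresses each study's standardized effect (the effect divided by its standard error) on its precision (one over the standard error). An intercept far from zero suggests asymmetry. The sketch below computes only the intercept and slope, leaves out the significance test, and uses made-up data in which the smaller studies show larger effects.

```python
def egger_regression(effects, std_errors):
    """Ordinary least squares of (effect / se) on (1 / se).

    Returns (intercept, slope). With a symmetric funnel the intercept
    should be near zero and the slope near the common effect.
    """
    z = [y / se for y, se in zip(effects, std_errors)]
    precision = [1.0 / se for se in std_errors]
    n = len(z)
    x_bar = sum(precision) / n
    z_bar = sum(z) / n
    sxx = sum((x - x_bar) ** 2 for x in precision)
    sxy = sum((x - x_bar) * (zi - z_bar) for x, zi in zip(precision, z))
    slope = sxy / sxx
    intercept = z_bar - slope * x_bar
    return intercept, slope


# Hypothetical log odds ratios; large standard errors mean small studies.
effects = [-0.10, -0.15, -0.30, -0.60, -0.80]
std_errors = [0.05, 0.10, 0.20, 0.35, 0.45]

intercept, slope = egger_regression(effects, std_errors)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

As the text cautions, a number like this is only a starting point. An intercept away from zero can arise for reasons other than publication bias.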

Another common approach is to compare the articles that were easy to find (e.g., in Medline) with those that were hard to find (e.g., results presented at a meeting but never published). If there is no discrepancy between the easy to find and the hard to find articles, then perhaps you can extrapolate and say that there is probably no difference between the easy to find articles and the articles that you never did find.

How to avoid bias from exclusion of publications

A search for studies should involve several bibliographic databases, registries for clinical trials, examination of bibliographies of all articles found, the so-called gray literature (presentation abstracts, dissertations, theses, etc.) and a letter calling for unpublished papers to be sent out to key researchers.

###Fix this. Find an open source journal article for discussion of search strategy.###

Consider the search strategy adopted in a meta-analytic examination of varicocele surgery in subfertile men (Evers 2001).

Relevant trials were identified in the Cochrane Menstrual Disorders and Subfertility Group's specialised register of controlled trials. A MEDLINE search, using the group's search strategy, was performed for the period 1966-2000. Also, hand searching was performed of 22 specialist journals in the field from their first issue till 2000. Cross references and references from review articles were checked.

Subjectivity

"Blinding," a common tool in other research areas, should also be used in meta-analyses. Blinding prevents the differential application of inclusion/exclusion criteria. The people deciding whether a paper meets the inclusion/exclusion criteria should be unaware of the authors of that paper and the journal. They should also include or exclude the paper on the basis of the methods section only; they should not see the results section until later.

There is empirical evidence, however, that blinding does not affect the conclusions of a meta-analysis (Jadad et al 1996; Berlin 1997). Furthermore, blinding takes substantial time and energy.

Data should be extracted from papers by multiple reviewers and their level of agreement should be assessed. Researchers have found disagreements even on such fundamental questions as whether a study was positive or negative (Glass 1981, page 18).
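One common way to assess agreement between two people extracting data, assumed here for illustration, is Cohen's kappa, which compares the observed agreement to the agreement expected by chance. The ratings below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items.

    Undefined (division by zero) if both raters use a single category.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if the two raters classified independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical classifications of eight studies by two reviewers.
reviewer_1 = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos"]
reviewer_2 = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos"]

print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance.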

Like any other research project, an overview or meta-analysis needs a protocol. Unfortunately, many published meta-analyses do not state whether a protocol was used (Bailar 1992, page 431). The protocol should specify: the inclusion/exclusion criteria for studies; a detailed description of the process used to identify studies; and the statistical methods used to combine results. Without a protocol, the meta-analysis research is not reproducible.

Authors have been shown to be biased in the articles that they cite in the bibliographies of their research papers (Gotzsche 1987; Ravnskov 1992a; Ravnskov 1992b). This same bias could potentially affect the selection of articles in a meta-analysis.

If the authors do not present objective criteria for the selection of articles in their overview or meta-analysis, then you should be concerned about possible conscious or sub-conscious bias in the selection process.

Researchers should also list all of the articles found in the original search, not just the articles used. This allows others to examine whether the inclusion/exclusion criteria were applied appropriately.

Ethical issues in publication bias

Selective failure to publish research is an example of research fraud if the intent is to withhold information that might prevent people from getting a complete picture of a particular drug or therapy (Antes 2003). Failure to publish is also an ethical lapse (Mann 2002). The human volunteers for a research study often volunteer out of a desire to help future patients with the same disease or illness that they have. They always suffer some level of inconvenience because of the extra time associated with participation in the research study. They often suffer additional pain, such as from extra blood draws, and sometimes they accept an increase in risk when they volunteer. If you withhold the results of the research from publication, you have abused the good will of those volunteers.

Preventing publication bias

###Fix this. Elaborate on clinical trial registries.###

Detecting and correcting for publication bias

Sensitivity analysis is also useful here. If the results from published studies are comparable to the results from unpublished studies, for example, then publication bias is less of a concern. Along the same lines, the authors can estimate the number of undiscovered negative studies that would be required to overturn the results of this meta-analysis.
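One well-known version of this calculation is Rosenthal's fail-safe N. It assumes that each study contributes a z-score, that the scores are combined by summing them and dividing by the square root of the number of studies, and that the undiscovered studies have an average z-score of zero. Those assumptions are not stated in the text above, and the z-scores below are hypothetical.

```python
from statistics import NormalDist


def fail_safe_n(z_scores, alpha=0.05):
    """Number of unseen null studies needed to make the combined
    result non-significant at the one-tailed level alpha."""
    k = len(z_scores)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    n = sum(z_scores) ** 2 / z_alpha ** 2 - k
    return max(0.0, n)


# Hypothetical z-scores from six published studies.
z_scores = [2.1, 1.4, 2.8, 0.9, 1.7, 2.3]
print(f"fail-safe N = {fail_safe_n(z_scores):.1f}")
```

A fail-safe N that is large relative to the number of published studies suggests that unpublished negative results are unlikely to overturn the finding. A small value is a warning sign.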

Publication bias is also more likely to occur for studies with small sample sizes. If the results of a meta-analysis are stratified by the sample sizes in the studies, a shift away from the null hypothesis in the smaller studies would be a warning flag about the possibility of publication bias. Statistical and graphical methods have been proposed to examine this further, but you should be cautious because sometimes there are other explanations. For example, smaller studies may tend to use less rigorous designs and these designs may be associated with exaggerated effects (Sterne et al 2001).

McManus et al (1998) highlight the importance of consulting experts in the area. They were trying to identify all publications associated with near patient testing, tests where the results are available without sending materials to a lab. The authors used a search of electronic databases, a survey of experts in the area, and hand searching of specific journals. The electronic databases yielded the largest number of publications, 50, but still missed 52 publications found by the other two methods.

Copas and Shi (2000) present a re-analysis of a meta-analysis on lung cancer that adjusts for publication bias, but this adjustment is controversial (Johnson et al 2000).

###Fix this. Elaborate on the ethical issues associated with publication bias.###

5.3 Were all of the apples rotten?

The quality of a meta-analysis is constrained by the quality of articles that are used in a meta-analysis. Meta-analysis cannot correct or compensate for methodologically flawed studies. In fact, meta-analysis may reinforce or amplify the flaws of the original studies.

Observational studies in a meta-analysis

The use of meta-analysis on observational studies is very controversial. Some experts have argued that the biases inherent in observational studies make a meta-analysis an exercise in mega-silliness. But even those experts who do not take such an extreme viewpoint warn that the current statistical methods for summarizing the results of observational studies may grossly understate the amount of uncertainty in the final result (Egger 1998).

Sensitivity analysis may be a useful way of highlighting the uncertainties in a meta-analysis of observational studies. Restricting the meta-analysis to selective subgroups of the data can yield insight into the size and direction of biases in observational studies. For example, the researchers could contrast case-control designs with cohort designs, with the latter expected to show less bias, in general. Or the researchers could compare retrospective studies to prospective studies, where again, the latter is expected to show less bias in general. Another possibility is to compare studies by the degree to which measurement error is expected to cause problems. In general, researchers should try to stratify the observational studies by known sources of bias.

Meta-analyses of randomized trials

Some meta-analyses restrict their attention to randomized trials because these studies are less likely to have problems with bias. In other words, they wish to avoid mixing bad observational apples with good randomized trial apples. Sometimes further restrictions can be made on the basis of partial or full blinding of results or on the proper accounting of dropouts.

(Concato 2000) evaluated clinical topics where there were publications of both randomized controlled trials and observational studies. In this review, the observational studies produced results quite similar to the randomized studies.

Sensitivity analysis

Even for randomized trials, sensitivity analysis may help. Researchers can use "quality scores" to rate individual studies and then see what happens when studies are restricted to those of highest quality only.

For example, (Lucassen 1998) looked at interventions for infant colic. Although substituting soy milk for cow's milk appeared to have an effect, this effect disappeared when only studies of high methodological quality were considered.

Quality Scores

Many times, the reporting of a study will be inadequate, and this will make it impossible to assess the quality of a study. There is indeed empirical evidence that incomplete reporting is associated with poor quality (Schulz 1995). In such a case, a "guilty until proven innocent" approach may make sense (Juni 2001). For example, if the authors fail to mention whether their study was blinded, assume that it was not. You might expect that authors are quick to report strengths of their study, but may (perhaps unconsciously) forget to mention their weaknesses. On the other hand, (Liberati 1986) rated the quality of 63 randomized trials, and found that the quality scores increased by seven points on average on a 100 point scale after talking to the researchers over the telephone. So some small amount of ambiguity may relate to carelessness in reporting rather than quality problems.

Another approach is to look at subgroups of studies of a similar design and see if the results are consistent across subgroups. For example, (Etminan 2003) examined the risk of Alzheimer's disease in users of non-steroidal anti-inflammatory drugs. They identified six cohort studies, which showed a combined relative risk of 0.84 (95% CI 0.54 to 1.05) and three case-control studies which showed a much lower combined relative risk, 0.62 (95% CI 0.45 to 0.82).

Meta-analysis of studies with small sample sizes

Some experts advocate great caution in the assessment of meta-analyses where all of the trials consist of small sample size studies. The effect of publication bias can be far more pronounced here than in situations where some medium and large size trials are included. In addition, smaller studies tend to have greater problems with the methods of randomizing and blinding patients (Kjaergard 2001).

Meta-analysis of Chinese studies

Research published in Chinese medical journals has shown substantial deficits in quality that should make you cautious about any meta-analysis using these studies. For example, a review of Chinese medicinal herbs in the treatment of hepatitis B (Liu 2002) showed inadequate documentation of the randomization method and failure of most studies to conceal the allocation list. Further, a small fraction of these studies showed a degree of imbalance between the treatment and control groups that was well beyond what you would expect by chance.

A review of 2,938 publications in Chinese journals (Tang 1999) also noted many problems.

Although methodological quality has been improving over the years, many problems remain. The method of randomisation was often inappropriately described. Blinding was used in only 15% of trials. Only a few studies had sample sizes of 300 subjects or more. Many trials used as a control another Chinese medicine treatment whose effectiveness had often not been evaluated by randomised controlled trials. Most trials focused on short term or intermediate rather than long term outcomes. Most trials did not report data on compliance and completeness of follow up. Effectiveness was rarely quantitatively expressed and reported. Intention to treat analysis was never mentioned.

This paper also shows evidence of serious publication bias. The authors display a funnel plot for studies of acupuncture for the treatment of stroke, for example, which shows a serious asymmetry, with only three studies out of 49 showing a negative effect, but a large number of studies, especially studies with small sample sizes, lying just above the threshold of no effect.

A review article on acupuncture (Vickers 1998) evaluated articles published in China (as well as Japan, Taiwan, and Russia). In China, 100% of the acupuncture studies showed a positive result. In areas other than acupuncture, the results were similar. In Chinese journals, 99% of the non-acupuncture studies were positive. To form a basis of comparison, only 75% of the studies published in England were positive. Another revealing statistic was that Chinese journals NEVER published a finding to show that the new therapy was less effective than the control group. There were similar problems with publications from Japan, Taiwan, and Russia.

5.4 Did the pile of apples amount to more than just a hill of beans?

It’s not enough to know that the overall effect of a therapy is positive. You have to balance the magnitude of the effect versus the added cost and/or the side effects of the new therapy. Unfortunately, most meta-analyses use an effect size (the improvement due to the therapy divided by the standard deviation). The effect size is unitless, allowing the combination of results from studies where slightly different outcomes with slightly different measurement units might have been used.
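The effect size calculation is simple enough to sketch. The version below assumes a pooled standard deviation across the treatment and control groups, which is one common convention. The text above does not say which standard deviation is used, and the numbers are hypothetical.

```python
import math


def standardized_mean_difference(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


# Hypothetical trial: treatment mean 110, control mean 100, both SDs 20.
d = standardized_mean_difference(110, 20, 50, 100, 20, 50)
print(f"effect size = {d:.2f}")

# The same unitless effect size corresponds to very different raw
# improvements, depending on the spread of the outcome measure.
for sd in (4, 20, 100):
    print(f"SD = {sd:3d}  ->  improvement = {d * sd:.0f} units")
```

The loop shows why the effect size alone is hard to weigh against costs and side effects: you also need to know the standard deviation and the units of the original measurements.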

Vote counting

Avoid "vote counting," the tallying of positive versus negative studies. Vote counts ignore the possibility that some studies are negative solely because of their small sample size. Abramson (1990) notes, for example, a meta-analysis of parenteral nutrition in cancer patients undergoing chemotherapy. Although each of the seven randomized controlled trials in the meta-analysis failed to achieve statistical significance, the pooled results were highly significant.
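A small numerical sketch shows how this can happen. The seven effects and standard errors below are invented for illustration and are not the data from the meta-analysis that Abramson describes. The pooling uses inverse-variance weights and a normal approximation for the p-values.

```python
import math
from statistics import NormalDist


def two_sided_p(effect, se):
    """Two-sided p-value for an effect estimate, normal approximation."""
    z = abs(effect / se)
    return 2 * (1 - NormalDist().cdf(z))


# Seven hypothetical small trials, all pointing the same direction.
effects = [-0.35, -0.25, -0.30, -0.28, -0.33, -0.27, -0.32]
std_errors = [0.20] * 7

study_p = [two_sided_p(y, se) for y, se in zip(effects, std_errors)]
votes_for = sum(p < 0.05 for p in study_p)

# Inverse-variance pooling of the same seven trials.
weights = [1.0 / se ** 2 for se in std_errors]
pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))
pooled_p = two_sided_p(pooled, pooled_se)

print(f"significant studies: {votes_for} of {len(effects)}")
print(f"pooled effect = {pooled:.2f}, p = {pooled_p:.5f}")
```

A vote count would score this as seven negative studies, yet the pooled estimate is far from zero relative to its standard error.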

Unitless measures

When you are examining a continuous outcome measure, you should be sure that the results are presented in interpretable units. A measure of effect size does not help you much because it is unitless and impossible to interpret. Consider a store that is offering a sale and announces boldly

"All prices reduced by 0.8 standard deviations!"

One meta-analysis shows how important it is to express measurements in interpretable units. Lumley et al (2001) studied the effect of smoking cessation programs on the health of the fetus and infant. One of the outcome measures was birth weight, and the study showed that the typical program can improve birth weight by a statistically significant amount. The researchers then quantified the amount: 28 g (95% confidence interval 9 to 49).

Keep in mind that this is measuring the effectiveness of the smoking cessation program, and not the effect of smoking cessation directly. Typically, you would have to send about 12 to 16 women to these programs in order to get one extra woman to quit smoking. So the effect seen here reflects, in part, how difficult it is to get people to change their behavior.

Still, the small size of the effect is important. If you want to assess the costs and benefits of smoking cessation programs, it helps to know that the impact of the typical smoking cessation program on birth weight is quite small. This provides a useful yardstick for comparison to other prenatal interventions.

5.5 Counterpoint: Meta-analysis is not all that it is cracked up to be.

###Fix this. Write a counterpoint with all the arguments against meta-analysis.###

[Meta-analysis] possesses certain flaws and limitations that preclude its use as a broad-based methodologic approach for formulating definitive therapeutic recommendations. -- Boden 1992.

[Expand on this section]

5.6 Summary

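Meta-analysis is the quantitative pooling of results from multiple studies. When you examine one, ask four questions. Were apples combined with oranges? Look for heterogeneity in the patients, the study designs, and the outcomes, and see whether subgroup analysis, meta-regression, or sensitivity analysis was used to examine it. Were some apples left on the tree? Look for a search that goes beyond Medline and beyond English language journals, and for an assessment of publication bias. Were all of the apples rotten? A meta-analysis cannot be any better than the studies it summarizes, so see whether the results hold up when restricted to the studies of highest quality. Did the pile of apples amount to more than just a hill of beans? Make sure that the size of the effect is reported in units that you can understand.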

This webpage was written by Steve Simon, edited by Linda Foland and Steve Simon, and was last modified on 07/14/2008. Send feedback to ssimon at cmh dot edu.