Seminar #67: Meta-Analysis and Diagnostic Tests
Content: Meta-analysis is the quantitative combination of results from multiple research studies. Meta-analysis is a relatively new field in Statistics, and standards for the proper data analysis are still evolving. Meta-analysis of studies of diagnostic tests, in particular, is especially controversial, with many conflicting approaches for computing an overall estimate from the individual sensitivity or specificity values from these studies. In the first half of this talk, I will review the general methods for the quantitative combination of results in a meta-analysis, and work out two examples using R and the meta library. In the second half, I will use data from a meta-analysis of 20 studies of endovaginal ultrasonography for detecting endometrial cancer to illustrate and critically evaluate several competing approaches for quantitatively combining results from diagnostic studies. All the data sets used in this presentation come from journal articles where the full free text is available on the web.
Teaching strategies: Didactic lectures and small group exercises.
Objectives: In this class you will learn how to:
- compute fixed and random effects models using R and the meta library;
- display results of individual diagnostic studies on a Summary Receiver Operating Characteristic plot;
- combine estimates of sensitivity and specificity directly and on a log odds scale; and
- compute the diagnostic odds ratio and graphically evaluate heterogeneity.
Contents
- Abstract
- Where can I find this handout?
- Do the pieces fit together? Meta-analyses and systematic overviews.
- Guidelines for meta-analysis models
- Definition: Sensitivity
- Definition: Specificity
- Meta-analysis for a diagnostic test
Where can you find this handout?
This handout and the handouts that I use for all of my seminars and training classes are a compilation of individual web pages at www.childrensmercy.org/stats. I use the "Include Page" feature of Microsoft FrontPage to combine these into a single page. You can always find the most recent version of this compilation by going to the web address listed at the bottom of this page. Links for the handouts for other seminars and classes appear at www.childrensmercy.org/stats/training.asp.
Why don't I use PowerPoint?
I stopped using PowerPoint for my presentations in the mid 1990's. This was based on Edward Tufte's advice that presenting information in a paper handout is more effective than presenting the information on a projected screen. I found this to be excellent guidance. I enjoy talking when I don't have to wrestle with a laptop computer. I look at my audience more and interact with them better. I elaborate on this in greater detail at www.childrensmercy.org/stats/weblog2004/powerpoint.asp.
Statistical Evidence. Chapter 5. Do the Pieces Fit Together? Systematic Overviews and Meta-analysis.
5.0 Introduction
When there are multiple research studies evaluating a new intervention, you need to find a way to assess the cumulative evidence of these studies. You can do this informally, but medical researchers now use a formal process, known as meta-analysis. Meta-analysis involves the quantitative pooling of data from two or more studies. More recently another term, systematic overview, has come into favor. A systematic overview involves the careful review and identification of all research studies associated with a topic, but it may or may not end up pooling the results of these studies. So meta-analysis represents a subset of all the systematic overviews. I tend to use the older term, meta-analysis, partly because I'm stubborn, but partly because I am interested in the quantitative aspects of this type of research. But most of my comments apply more broadly to systematic overviews.
Case study: Declining sperm counts
In 1992, the British Medical Journal published a controversial meta-analysis. This study (Carlsen 1992) reviewed 61 papers published between 1938 and 1991 and showed that there was a significant decrease in sperm count and in seminal volume over this period of time. For example, a linear regression model on the pooled data provided an estimated average count of 113 million per ml in 1940 and 66 million per ml in 1990.
Several researchers (Olsen 1995; Fisch 1996) noted heterogeneity in this meta-analysis, a mixing of apples and oranges. Studies before 1970 were dominated by studies in the United States and particularly studies in New York. Studies after 1970 included many other locations including third world countries. Thus the early studies were United States apples. The later studies were international oranges. There was also substantial variation in collection methods, especially in the extent to which the subjects adhered to a minimum abstinence period.
The original meta-analysis and the criticisms of it highlight both the greatest weakness and the greatest strength of meta-analysis.
Meta-analysis is the quantitative pooling of data from studies with sometimes small and sometimes large disparities. It doesn't always make sense to pool these studies. Think of it as a multi-center trial where each center gets to use its own protocol and where some of the centers don't bother sending you their data. This is meta-analysis at its worst.
On the other hand, the strength of meta-analysis is that it lays all the cards on the table. Sitting out in the open are all the methods for selecting studies, abstracting information, and combining the findings. Meta-analysis allows objective criticism of these overt methods and even allows replication of the research.
Contrast this to an invited editorial or commentary that provides a subjective summary of a research area. Even when the subjective summary is done well, you cannot effectively replicate the findings. Since a subjective review is a black box, the only way, it seems, to repudiate a subjective summary is to attack the messenger.
Do the pieces fit together? What to look for.
When you are examining the results of a meta-analysis, you should ask the following questions:
Were apples combined with oranges? Heterogeneity among studies may make any pooled estimate meaningless.
Were some apples left on the tree? An incomplete search of the literature can bias the findings of a meta-analysis.
Were all of the apples rotten? The quality of a meta-analysis cannot be any better than the quality of the studies it is summarizing.
Did the pile of apples amount to more than just a hill of beans? Make sure that the meta-analysis quantifies the size of the effect in units that you can understand.
[Remainder of this material deleted out of respect for the publisher's copyright.]
This webpage was written on 2005-05-29 and was last modified on 2008-07-08. Category: Statistical evidence
Stats >> Model >> Meta-analysis (March 18, 2005)
Meta-analysis is the quantitative combination of results from multiple research studies. There are three steps in a typical meta-analysis model.
- Extract individual estimates and standard errors from each study
- Combine these estimates using a fixed or random effects model
- Display the results graphically.
This page uses resources originally developed on my weblog: November 29, 2004, January 12, 2005, February 25, 2005, and March 11, 2005. I also have a web page about the special problems associated with a meta-analysis for a diagnostic test and a non-technical introduction on the practical interpretation of a meta-analysis.
Step 1. Extract individual estimates.
When you look at the individual summaries in a meta-analysis, they will report the results in a variety of ways. You need to extract these results in a common format, and the process depends a lot on the type of outcome being reported.
For a continuous outcome, a commonly reported statistic is the difference between the treatment mean and the control mean divided by the standard deviation in the control group.
$$ d_i = \frac{\bar{X}_{iT} - \bar{X}_{iC}}{S_{iC}} $$
For this equation and all equations below, the subscript iT represents data from the treatment group of the ith study and the subscript iC represents data from the control group of the ith study.
It seems a bit unusual to use the standard deviation just from the control group. The rationale is that if you have two or more treatments in a study compared to control, the denominator never changes when you use just the control group standard deviation.
There are some variations on this formula that use a pooled variance estimate or that adjust for biases due to small sample sizes.
The standard error of the estimate is
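One commonly cited large-sample approximation for a mean difference standardized by the control group standard deviation is sketched below. It is offered as an illustration rather than as the exact formula intended here; the symbols d_i (the standardized difference) and n_iT, n_iC (the sample sizes in the treatment and control groups of the ith study) are notation assumed for this sketch.

```latex
\mathrm{se}(d_i) \approx \sqrt{\frac{n_{iT} + n_{iC}}{n_{iT}\, n_{iC}} + \frac{d_i^{2}}{2\,(n_{iC} - 1)}}
```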
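For a binary outcome, such as mortality, you have several choices. Write x_iT and x_iC for the number of events and n_iT and n_iC for the number of patients in each group, so that the observed proportions are p_iT = x_iT / n_iT and p_iC = x_iC / n_iC. You can compute the risk difference
$$ RD_i = p_{iT} - p_{iC} $$
You can also compute the relative risk, but traditionally, this is transformed to the log scale first.
$$ \log RR_i = \log\left(\frac{p_{iT}}{p_{iC}}\right) $$
You can also compute the odds ratio, and this is almost always transformed to the log scale as well.
$$ \log OR_i = \log\left(\frac{p_{iT} / (1 - p_{iT})}{p_{iC} / (1 - p_{iC})}\right) $$
The standard error of the risk difference is
$$ \sqrt{\frac{p_{iT}(1 - p_{iT})}{n_{iT}} + \frac{p_{iC}(1 - p_{iC})}{n_{iC}}} $$
For the relative risk and the odds ratio, we need to analyze the data on the log scale. The log relative risk has a standard error of
$$ \sqrt{\frac{1}{x_{iT}} - \frac{1}{n_{iT}} + \frac{1}{x_{iC}} - \frac{1}{n_{iC}}} $$
and the log odds ratio has a standard error of
$$ \sqrt{\frac{1}{x_{iT}} + \frac{1}{n_{iT} - x_{iT}} + \frac{1}{x_{iC}} + \frac{1}{n_{iC} - x_{iC}}} $$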
There is no consensus on the best measure among the risk difference, relative risk, or odds ratio. The risk difference has certain advantages in interpretability, but the log odds ratio often has fewer problems with heterogeneity.
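To make the three binary measures concrete, here is a minimal base R sketch for a single study. The counts are hypothetical and chosen only for illustration, and the variable names are not from the original handout. The standard errors use the usual large-sample formulas for a two-by-two table.

```r
# Hypothetical counts for one study (illustration only, not real data).
x.t <- 15; n.t <- 100   # events and patients in the treatment group
x.c <- 25; n.c <- 100   # events and patients in the control group

p.t <- x.t / n.t
p.c <- x.c / n.c

# Risk difference and its standard error
rd    <- p.t - p.c
se.rd <- sqrt(p.t * (1 - p.t) / n.t + p.c * (1 - p.c) / n.c)

# Log relative risk and its standard error
log.rr    <- log(p.t / p.c)
se.log.rr <- sqrt(1 / x.t - 1 / n.t + 1 / x.c - 1 / n.c)

# Log odds ratio and its standard error
log.or    <- log((x.t / (n.t - x.t)) / (x.c / (n.c - x.c)))
se.log.or <- sqrt(1 / x.t + 1 / (n.t - x.t) + 1 / x.c + 1 / (n.c - x.c))

print(round(c(rd = rd, se.rd = se.rd,
              log.rr = log.rr, se.log.rr = se.log.rr,
              log.or = log.or, se.log.or = se.log.or), 4))
```

Each estimate and standard error pair produced this way is in the form that the weighted average in Step 2 expects.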
Step 2. Compute a preliminary estimate of overall effect.
Now that you have all the data together, the first thing you want to do is to combine it. In a perfect world, you would think carefully about your studies and the particular meta-analysis model that you want and whether it makes sense to compute any combined estimate at all. Only after a lot of careful thought would you proceed.
But let's be realistic. You and I are both impatient, so we want to see right away what is going on. So go ahead and compute a simple estimate of combined effect. Don't get emotionally attached to that estimate, because a better choice might be a more complex estimate or possibly no estimate at all.
The simplest combined estimate is a weighted average of the individual study results. Writing T_i for the estimate from the ith study and se_i for its standard error, the weights are inversely proportional to the square of the standard error,
$$ w_i = \frac{1}{se_i^2} $$
which gives greater weight to those studies with smaller standard errors. The weighted average is
$$ \bar{T} = \frac{\sum_{i=1}^{r} w_i T_i}{\sum_{i=1}^{r} w_i} $$
where r is the number of studies in the meta-analysis. This is known as the fixed effects estimate. It is a good starting point for further analysis, but after you have taken a careful look at this estimate and the individual studies that go into producing this estimate, you may decide to use a different estimate or dispense entirely with estimating an overall effect.
The formulas for confidence limits for this estimate are simple enough, but I won't present them here.
Example: A meta-analysis of inhaled steroid use in chronic obstructive pulmonary disease:
- Effects of inhaled corticosteroids on sputum cell counts in stable chronic obstructive pulmonary disease: a systematic review and a meta-analysis. Gan WQ, Man SP, Sin DD. BMC Pulm Med 2005: 5(1); 3.
showed standardized mean differences (smd) for the reduction in Total Cell counts and confidence limits (lcl, ucl) in six studies in Table 3. I retyped that data in SPSS.
I computed the standard error by subtracting the lower confidence limit from the standardized mean difference and then dividing by 1.96. I also computed the inverse of the squared standard error to represent the weight for each study.
The sum of the weights is 35.37 and the sum of smd times the weights is -14.91. Divide the second value by the first to get the overall estimate of -0.42. The fixed effects standard error for the overall estimate is 0.17 and a 95% confidence interval is -0.75 to -0.09.
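As a rough check on this arithmetic, the sketch below repeats the calculation in base R. The estimates and upper confidence limits are copied from the R output printed later in this handout rather than from the SPSS file, so the intermediate sums differ slightly from the values quoted above. The variable names are illustrative.

```r
# Standardized mean differences and upper 95% limits for the six studies,
# taken from the printed output later in this handout.
study <- c("Yildiz", "Confalonieri", "Mirici", "Sugiura", "Culpitt", "Keatings")
smd   <- c(-0.6, -0.4, -1.0, 0.2, -0.3, -0.1)
ucl   <- c(0.3996, 0.3056, -0.2944, 1.1996, 0.5036, 0.7036)

# Standard error from the half-width of the confidence interval
se <- (ucl - smd) / 1.96

# Inverse variance weights and the fixed effects estimate
w         <- 1 / se^2
fixed.est <- sum(w * smd) / sum(w)
fixed.se  <- sqrt(1 / sum(w))
fixed.ci  <- fixed.est + c(-1.96, 1.96) * fixed.se

print(round(c(sum.w = sum(w), estimate = fixed.est, se = fixed.se,
              lower = fixed.ci[1], upper = fixed.ci[2]), 4))
```

The estimate and interval agree with the fixed effects line in the printed output to the precision shown there.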
Another example of a meta-analysis appears in
- Acetylcysteine for prevention of contrast-induced nephropathy after intravascular angiography: a systematic review and meta-analysis. Bagshaw SM, Ghali WA. BMC Med 2004: 2(1); 38.
I re-typed the table of odds ratios and 95% confidence intervals into Microsoft Excel.
To calculate a standard error, you first have to transform the odds ratio and the confidence limits to the log scale. I used base 10 logarithms here, but any other base will also work as long as you use the same base when you transform the results back.
To compute a standard error, take the log(ucl), subtract the log(or) and divide by 1.96. I could have used the log(lcl) instead, but if you look at the original data, some of the lower limits are 0.01 and 0.02. I was worried that there might be a lot of rounding error in those values, since only one significant figure is displayed.
Next, I computed weights and a weighted sum.
The overall estimate of the log odds ratio is -33.317 / 147.115 = -0.226. Take the inverse of the sum of the weights and calculate a square root to get a standard error for this combined estimate (0.082). A 95% confidence interval on the log scale is -0.387 to -0.065. Transforming this back to the original scale of measurement gives you an overall odds ratio of 0.59 and confidence limits of 0.41 to 0.86.
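The claim that the choice of logarithm does not matter can be checked with a short base R sketch. The odds ratios and upper limits below are the two-decimal values that appear in the R output later in this handout; the helper function name is hypothetical. The sketch runs the same fixed effects calculation with base 10 and with natural logarithms and then transforms each result back with the matching base.

```r
# Odds ratios and upper 95% limits for the 14 acetylcysteine studies,
# rounded to two decimals as in the printed output in this handout.
or  <- c(1.23, 0.20, 0.57, 0.11, 1.27, 0.19, 1.37,
         1.30, 0.29, 0.63, 0.11, 1.30, 0.11, 1.14)
ucl <- c(3.89, 1.00, 1.63, 0.54, 3.57, 4.21, 4.32,
         6.21, 0.94, 3.92, 0.97, 6.16, 0.49, 4.83)

# Hypothetical helper: fixed effects estimate on whatever log scale is supplied.
fixed.on.log.scale <- function(logfun) {
  te <- logfun(or)
  se <- (logfun(ucl) - te) / 1.96
  w  <- 1 / se^2
  c(estimate = sum(w * te) / sum(w), se = sqrt(1 / sum(w)))
}

fe.10 <- fixed.on.log.scale(log10)  # base 10, as in the spreadsheet
fe.e  <- fixed.on.log.scale(log)    # natural log, as R expects

# Back-transform each estimate with its own base.
or.from.log10 <- unname(10^fe.10["estimate"])
or.from.ln    <- unname(exp(fe.e["estimate"]))

print(round(c(log10.estimate = unname(fe.10["estimate"]),
              or.from.log10 = or.from.log10,
              or.from.ln = or.from.ln), 4))
```

The two back-transformed odds ratios are identical because changing the base rescales every log estimate and every standard error by the same constant, which leaves the relative weights unchanged.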
Most commonly used statistical software does not include programs for meta-analysis. You can download special user contributed libraries for meta-analysis for Stata and for R.
Here is an example of an R program, plus the output using the meta library.
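# Read the six-study data set (the columns study, Cells.smd, and Cells.se are used below)
f0 <- "X:/webdata/TotalCells.csv"
Cells.dat <- read.csv(f0)
attach(Cells.dat)
library(meta)
# Generic inverse variance meta-analysis of the standardized mean differences
Cells.ma <- metagen(TE=Cells.smd,seTE=Cells.se,studlab=study,sm="SMD")
print(Cells.ma)
# Forest plot that includes the fixed effects estimate
plot(Cells.ma,comb.f=T)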
               SMD              95%-CI %W(fixed) %W(random)
Yildiz        -0.6 [-1.5996;  0.3996]     10.98      10.98
Confalonieri  -0.4 [-1.1056;  0.3056]     22.03      22.03
Mirici        -1.0 [-1.7056; -0.2944]     22.03      22.03
Sugiura        0.2 [-0.7996;  1.1996]     10.98      10.98
Culpitt       -0.3 [-1.1036;  0.5036]     16.99      16.99
Keatings      -0.1 [-0.9036;  0.7036]     16.99      16.99
Number of trials combined: 6
                         SMD             95%-CI       z p.value
Fixed effects model  -0.4203 [-0.7515; -0.0891] -2.4874  0.0129
Random effects model -0.4203 [-0.7515; -0.0891] -2.4874  0.0129
Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.96]; I^2 = 0% [0%; 74.1%]
Test of heterogeneity:
   Q d.f. p.value
 4.9    5  0.4287
Method: Inverse variance method
Notice that there is no difference between the random effects model and the fixed effects model. That is because for this data set, there is no evidence of heterogeneity. The Cochran's Q value is smaller than the degrees of freedom and the estimate of tau-squared is zero.
Here's what the analysis of the Acetylcysteine data would look like using R and the meta library.
f0 <- "X:/webdata/Acetylcysteine1.csv"
acetyl.dat <- read.csv(f0)
attach(acetyl.dat)
log.or <- log(or)
se <- (log(ucl)-log.or)/1.96
acetyl.ma <- metagen(TE=log.or,seTE=se,studlab=study,sm="OR")
print(acetyl.ma)
OR 95%-CI %W(fixed) %W(random)
Allaqaband 1.23 [0.3889; 3.8899] 10.44 9.18
Baker 0.20 [0.0400; 1.0000] 5.34 6.41
Briguori 0.57 [0.1993; 1.6300] 12.54 9.93
Diaz-Sandova 0.11 [0.0224; 0.5400] 5.47 6.50
Durham 1.27 [0.4518; 3.5699] 12.96 10.06
Efrati 0.19 [0.0086; 4.2098] 1.44 2.40
Fung 1.37 [0.4345; 4.3199] 10.50 9.20
Goldenberg 1.30 [0.2721; 6.2098] 5.66 6.64
Kay 0.29 [0.0895; 0.9400] 10.01 9.00
Kefer 0.63 [0.1013; 3.9199] 4.14 5.44
MacNeill 0.11 [0.0125; 0.9700] 2.92 4.24
Oldemeyer 1.30 [0.2744; 6.1598] 5.72 6.68
Shyu 0.11 [0.0247; 0.4900] 6.20 7.01
Vallero 1.14 [0.2691; 4.8299] 6.64 7.29
Number of trials combined: 14
OR 95%-CI z p.value
Fixed effects model 0.5937 [0.4092; 0.8612] -2.7468 0.006
Random effects model 0.5428 [0.3231; 0.9121] -2.3076 0.021
Quantifying heterogeneity:
tau^2 = 0.4187; H = 1.35 [1; 1.84]; I^2 = 44.9% [0%; 70.5%]
Test of heterogeneity:
Q d.f. p.value
23.6 13 0.035
Method: Inverse variance method
One important thing to note is that R expects you to use natural logarithms (base e) rather than base 10 logarithms. When I first did this, I used base 10 logarithms and all the results were too small.
A common way to display the individual study results and a combined estimate of effects is a graph known as a forest plot. An example of a forest plot appears in
- Acetylcysteine for prevention of contrast-induced nephropathy after intravascular angiography: a systematic review and meta-analysis. Bagshaw SM, Ghali WA. BMC Med 2004: 2(1); 38.
and because this is an open-access article, I can reproduce the graph here.

Since BMC Medicine is published with an open access license, I can freely reproduce this image, as long as I cite the source.
I was always confused by the funny squares in a forest plot, so I looked for a description. Here is what the User's Guide for RevMan (software created by the Cochrane Collaboration) says about forest plots:
The graph is a forest plot where the confidence interval (CI) for each study is represented by a horizontal line and the point estimate is represented by a square. The size of the square corresponds to the weight of the study in the meta-analysis. The confidence interval for totals are represented by a diamond shape. The scale used on the graph depends on the statistical method. Dichotomous data (except for risk differences) are displayed on a logarithmic scale. Continuous data and risk differences are displayed on a linear scale. Generic inverse variance data are displayed on either a logarithmic scale or a linear scale depending on the settings in RevMan. -- http://www.cc-ims.net/download/revman/Documentation/User%20guide.pdf (page 36).
Here is an example of a forest plot, as drawn by R and the meta library.
> plot(Cells.ma,comb.f=T)

Another way to display the results of a meta-analysis looks at the cumulative effect over time as additional studies accumulate. At the top of the graph, you display the confidence interval for the estimate from the first study published. Directly below that you display the confidence interval for the combined effect of the first and second studies. Below that is the combined effect of the first, second, and third studies, and so forth. An example of this cumulative display appears in
- Erythropoietin, uncertainty principle and cancer related anaemia. Clark O, Adams JR, Bennett CL, Djulbegovic B. BMC Cancer 2002: 2(1); 23.
which shows a cumulative meta-analysis, that is, the cumulated effects over time of studies of the use of erythropoietin (EPO) to treat cancer related anemia.

Since BMC Cancer is published with an open access license, I can freely reproduce this image, as long as I cite the source.
The outcome variable, the odds ratio for whether a patient requires transfusion, showed a significant benefit for EPO. It also shows that sufficient evidence had already accumulated by 1995 to demonstrate this benefit. If such a meta-analysis had been performed back then, there would have been no need to run the additional trials. These redundant trials are bad because they wasted scarce research dollars on a topic where sufficient information had already been accumulated to answer the research question. They are also bad because half of the patients in these post-1995 trials received no treatment or placebo, even though there was enough evidence at that time to show that this is an inferior option.
Some have suggested that any protocol submitted to an Institutional Review Board (IRB) should include a systematic overview or meta-analysis of the previous research (see Chalmers 1996), rather than just a simple literature review, to prevent future IRBs from making the same mistake as those that approved the post-1995 studies of EPO. In some situations, that is definitely overkill, but it is a suggestion worth serious consideration in other circumstances.
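The cumulative display described above amounts to recomputing the fixed effects estimate each time a study is added. Here is a minimal base R sketch using the six total cell count studies from earlier in this handout. The studies are taken in the order they are listed in the output, which is an assumption made for illustration and is not necessarily publication order.

```r
# Six-study data from the printed output in this handout.
smd <- c(-0.6, -0.4, -1.0, 0.2, -0.3, -0.1)
ucl <- c(0.3996, 0.3056, -0.2944, 1.1996, 0.5036, 0.7036)
se  <- (ucl - smd) / 1.96
w   <- 1 / se^2

# Running fixed effects estimate after 1, 2, ..., 6 studies
cum.est <- cumsum(w * smd) / cumsum(w)
cum.se  <- sqrt(1 / cumsum(w))

cum.tab <- data.frame(studies  = seq_along(smd),
                      estimate = cum.est,
                      lower    = cum.est - 1.96 * cum.se,
                      upper    = cum.est + 1.96 * cum.se)
print(round(cum.tab, 3))
```

Each row corresponds to one line of a cumulative plot. The interval narrows with every added study because the sum of the weights can only grow.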
Step 3. Evaluate the studies for publication bias and heterogeneity.
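After you have an overall estimate, you should compute the amount of variability of each study from the overall estimate. Writing T_i for the estimate from the ith study, se_i for its standard error, and \bar{T} for the fixed effects estimate, you do this by computing a Z-score for each study,
$$ Z_i = \frac{T_i - \bar{T}}{se_i} $$
and then seeing how much all of these Z-scores differ from zero by squaring the Z-scores and adding them up. This gives you a test statistic, Cochran's Q,
$$ Q = \sum_{i=1}^{r} Z_i^2 $$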
An unusually large value for Q implies substantial heterogeneity, because you have more variation among the studies than you would expect just by looking at the individual standard errors. If there is no heterogeneity, then Q should be approximately equal to r-1, which implies that the squared Z-scores are, on average, just slightly less than 1.
Many experts have rejected the use of quantitative measures such as Cochran's Q for assessing heterogeneity and suggest instead that you examine the studies qualitatively and provide a subjective assessment of the degree of heterogeneity among the research studies.
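Another alternative is I-squared (Higgins 2003), a statistic that measures the proportion of inconsistency in individual studies that cannot be explained by chance.
$$ I^2 = \frac{Q - (r - 1)}{Q} \times 100\% $$
where Q is Cochran's Q and r is the number of studies.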
Negative values are not allowed for I-squared. If you compute a negative value, set I-squared to zero instead.
I-squared is bounded above by 100% and values close to 100% represent very high degrees of heterogeneity.
The authors prefer this measure to Cochran's Q. The problem with Cochran's Q, the authors claim, is that it tends to have too little power with a collection of studies with small sample sizes and too much power with a collection of studies with large sample sizes. They suggest that values of I-squared equal to 25%, 50%, and 75% represent low, moderate, and high heterogeneity, respectively.
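As a quick illustration, the sketch below computes I-squared from the two Cochran's Q values printed in the R output earlier in this handout. The function name is hypothetical, and the truncation at zero follows the rule stated above.

```r
# I-squared as a proportion, from Cochran's Q and its degrees of freedom.
# Negative values are replaced by zero.
i.squared <- function(q, df) {
  max(0, (q - df) / q)
}

i2.cells  <- i.squared(4.9, 5)    # total cell count example: Q = 4.9, 5 d.f.
i2.acetyl <- i.squared(23.6, 13)  # acetylcysteine example: Q = 23.6, 13 d.f.

print(round(100 * c(cells = i2.cells, acetylcysteine = i2.acetyl), 1))
```

These match the I^2 values of 0% and 44.9% reported in the printed output.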
The random effects model is an alternative way to combine estimates that explicitly accounts for heterogeneity. In the random effects model, each study statistic is assumed to be composed of
$$ T_i = \theta + \delta_i + \epsilon_i $$
where \theta is the overall effect, \epsilon_i is the sampling error of the ith study, and the second component is a normally distributed random effect
$$ \delta_i \sim N(0, \tau^2) $$
that accounts for the heterogeneity from study to study. A frequent criticism of the random effects meta-analysis is this assumption that the random effects follow a bell shaped curve. There is some suggestion that perhaps heterogeneity manifests itself as a bimodal distribution instead.
You can use the Method of Moments and Cochran's Q statistic to estimate the between study variation:
$$ \hat{\tau}^2 = \frac{Q - (r - 1)}{\sum w_i - \sum w_i^2 / \sum w_i} $$
Notice that the numerator is a measure of how much the Cochran's Q statistic exceeds its degrees of freedom. If you get a negative estimate here, simply replace it with an estimate of zero.
With an estimate of between study variation, you can now compute the random effects estimate as a weighted average, just like the fixed effects estimate, except the weights in the random effects estimate are
$$ w_i^* = \frac{1}{1 / w_i + \hat{\tau}^2} $$
where wi are the weights used in the fixed effects model.
These weights are going to be closer to uniform or equal weighting than the weights in a fixed effects model. If you think about it long enough, this is actually quite intuitive. In a model where the study heterogeneity is large, large enough to dominate the standard errors, you effectively have a random sample of studies each of which is more or less identically distributed:
$$ T_i \sim N(\theta, \tau^2) \quad \text{(approximately)} $$
where T_i is the estimate from the ith study, \theta is the overall effect, and \tau^2 is the between study variance.
In addition to producing weights that are closer to equal weighting, the confidence intervals for a random effects meta-analysis are typically wider than a fixed effects meta-analysis because the estimated study heterogeneity adds an additional source of uncertainty to the confidence interval calculations.
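The method of moments calculation can be sketched in base R for the acetylcysteine example. The inputs are the two-decimal odds ratios and upper limits shown in the printed output earlier in this handout, and the variable names are illustrative. This is a hand calculation under those assumptions, not a description of how the meta library is implemented internally.

```r
# Odds ratios and upper 95% limits for the 14 acetylcysteine studies.
or  <- c(1.23, 0.20, 0.57, 0.11, 1.27, 0.19, 1.37,
         1.30, 0.29, 0.63, 0.11, 1.30, 0.11, 1.14)
ucl <- c(3.89, 1.00, 1.63, 0.54, 3.57, 4.21, 4.32,
         6.21, 0.94, 3.92, 0.97, 6.16, 0.49, 4.83)

te <- log(or)                      # natural log odds ratios
se <- (log(ucl) - te) / 1.96
w  <- 1 / se^2                     # fixed effects weights

fixed.est <- sum(w * te) / sum(w)
q         <- sum(w * (te - fixed.est)^2)   # Cochran's Q
df        <- length(te) - 1

# Method of moments estimate of the between study variance, truncated at zero
tau2 <- max(0, (q - df) / (sum(w) - sum(w^2) / sum(w)))

# Random effects weights, estimate, and standard error
w.star     <- 1 / (1 / w + tau2)
random.est <- sum(w.star * te) / sum(w.star)
random.se  <- sqrt(1 / sum(w.star))

print(round(c(Q = q, tau2 = tau2,
              OR.fixed = exp(fixed.est), OR.random = exp(random.est),
              se.fixed = sqrt(1 / sum(w)), se.random = random.se), 4))
```

The random effects weights are more nearly equal than the fixed effects weights, and the random effects standard error is larger, which is why the random effects confidence interval in the printed output is wider.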
The funnel plot is a graphical exploration of the study results looking for evidence of publication bias. An example of a funnel plot appears in
- Oral rehydration versus intravenous therapy for treating dehydration due to gastroenteritis in children: a meta-analysis of randomised controlled trials. Bellemare S, Hartling L, Wiebe N, Russell K, Craig WR, McConnell D, Klassen TP. BMC Med 2004: 2(1); 11.

Another funnel plot with conical guidelines superimposed appears in
- Association of circulating Chlamydia pneumoniae DNA with cardiovascular disease: a systematic review. Smieja M, Mahony J, Petrich A, Boman J, Chernesky M. BMC Infect Dis 2002: 2(1); 21.

Interestingly enough, most of the meta-analyses published in BioMed Central had the following statement (almost word for word):
Publication bias was not assessed using funnel plots as these tests have been shown to be unhelpful.
These articles then cited the following two references
- Misleading funnel plot for detection of bias in meta-analysis. Tang JL, Liu JL. J Clin Epidemiol 2000: 53(5); 477-84.
- Publication and related bias in meta-analysis: power of statistical tests and prevalence in the literature. Sterne JA, Gavaghan D, Egger M. J Clin Epidemiol 2000: 53(11); 1119-29.
I have not yet read these articles, but I would agree that the funnel plot is often difficult to interpret. There are some numerical summary measures that try to quantify the departure from symmetry in the funnel plot, but these measures may also have problems.
The trim and fill method uses the funnel plot to try to estimate the missing unpublished studies. In this approach, studies that are asymmetrically distributed (that have no matching study on the opposite side of the funnel plot) are removed from the plot. Then the funnel plot is filled in using symmetric pairs from the trimmed study. This produces a funnel plot with extra imputed studies that make the plot symmetric. The trim and fill method is quite controversial and should be considered an exploratory approach. If, for example, you use this method and the overall estimate changes by a trivial amount, then you have indirect evidence that publication bias did not seriously influence your outcome.
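For readers who want to draw a basic funnel plot by hand, here is a minimal base R sketch using the acetylcysteine data from earlier in this handout. It plots each log odds ratio against its standard error, with the most precise studies at the top, and adds conical guidelines at 1.96 standard errors on either side of the fixed effects estimate. It does not implement trim and fill, and the layout choices are assumptions rather than a standard.

```r
# Odds ratios and upper 95% limits for the 14 acetylcysteine studies.
or  <- c(1.23, 0.20, 0.57, 0.11, 1.27, 0.19, 1.37,
         1.30, 0.29, 0.63, 0.11, 1.30, 0.11, 1.14)
ucl <- c(3.89, 1.00, 1.63, 0.54, 3.57, 4.21, 4.32,
         6.21, 0.94, 3.92, 0.97, 6.16, 0.49, 4.83)

te <- log(or)
se <- (log(ucl) - te) / 1.96
w  <- 1 / se^2
fixed.est <- sum(w * te) / sum(w)

# Reverse the y-axis so that small standard errors appear at the top.
plot(te, se, ylim = rev(range(0, se)),
     xlab = "Log odds ratio", ylab = "Standard error")
abline(v = fixed.est, lty = 2)

# Conical guidelines: fixed effects estimate plus or minus 1.96 standard errors
se.grid <- seq(0, max(se), length.out = 50)
lines(fixed.est - 1.96 * se.grid, se.grid, lty = 3)
lines(fixed.est + 1.96 * se.grid, se.grid, lty = 3)

# Flag studies that fall outside the guidelines.
outside <- abs(te - fixed.est) > 1.96 * se
print(sum(outside))
```

In the absence of publication bias and heterogeneity, the points should scatter roughly symmetrically inside the cone.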
Further reading
- Changes in clinical trials mandated by the advent of meta-analysis. Chalmers TC, Lau J. Stat Med 1996: 15(12); 1263-8; discussion 1269-72.
- Asymmetric funnel plots and publication bias in meta-analyses of diagnostic accuracy. Song F, Khan KS, Dinnes J, Sutton AJ. Int J Epidemiol 2002: 31(1); 88-95.
- Bias in meta-analysis detected by a simple, graphical test. Egger M, Davey Smith G, Schneider M, Minder C. British Medical Journal 1997: 315(7109); 629-34.
- Measuring inconsistency in meta-analyses. Higgins JP, Thompson SG, Deeks JJ, Altman DG. British Medical Journal 2003: 327(7414); 557-60.
Page last modified on 09/24/2007.
What is sensitivity?
The sensitivity of a test is the probability that the test is positive when given to a group of patients with the disease. Sensitivity is sometimes abbreviated Sn.
The formula for sensitivity is
Sn = TP / (TP + FN)
where TP and FN are the number of true positive and false negative results, respectively. You can think of sensitivity as 1 - the false negative rate. Notice that the denominator for sensitivity is the number of patients who have the disease. Using conditional probabilities, we can also define sensitivity as
Sn = P [ Test is positive | Patient has the disease ]
The following table summarizes these calculations.
A large sensitivity means that a negative test can rule out the disease. David Sackett coined the acronym "SnNOut" to help us remember this.
Here is an example of a sensitivity calculation.
- In a study of 5,113 subjects checked for gastric cancer by endoscopy (Gut 1999; 44: 693-697), serum pepsinogen concentrations were also measured. A pepsinogen I concentration of less than 70 ng/ml and a ratio of pepsinogen I to pepsinogen II of less than 3 was considered a positive test. There were 13 patients with gastric cancer confirmed by endoscopy. 11 of these patients were positive on the test. The sensitivity is 11/13 = 85%.
This webpage was written on 2005-08-18 and was last modified on 2008-07-08. This page needs minor revisions. Category: Definitions, Category: Diagnostic testing.
What is specificity?
The specificity of a test is the probability that the test will be negative among patients who do not have the disease. Specificity is sometimes abbreviated Sp. The formula for specificity is
Sp = TN / (TN + FP)
where TN and FP are the number of true negative and false positive results, respectively. You can think of specificity as 1 - the false positive rate. Notice that the denominator for specificity is the number of healthy patients. Using conditional probabilities, we can also define specificity as
Sp = P [ Test is negative | Patient is healthy ]
The following table summarizes these calculations.
A large specificity means that a positive test can rule in the disease. David Sackett coined the acronym "SpPIn" to help us remember this.
Here is an example of a specificity calculation.
- In a study of the urine latex agglutination test (AJPH 1998;88(2):285-288), children were tested for H. influenzae using blood, urine, cerebrospinal fluid, or some combination of these. Of all the children tested, 1,352 did not have H. influenzae in any of these fluids. Only 9 of these patients tested positive on the urine latex agglutination test; the remaining 1,343 tested negative. The specificity is 1343 / 1352 = 99.3%.
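The two worked examples can be reproduced in a few lines of R. The counts come from the examples above (the 2 false negatives are the 13 confirmed cancers minus the 11 positive tests), and the variable names are illustrative.

```r
# Gastric cancer example: 11 of 13 patients with cancer tested positive.
tp <- 11
fn <- 13 - 11
sens <- tp / (tp + fn)

# Urine latex agglutination example: 1,343 of 1,352 children without
# H. influenzae tested negative.
tn <- 1343
fp <- 9
spec <- tn / (tn + fp)

print(round(100 * c(sensitivity = sens, specificity = spec), 1))
```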
Stats >> Model >> Meta-analysis >> Diagnostic (no date)
There is no real consensus yet on how to best combine data from several studies of a diagnostic test. I will outline a few approaches that seem to make sense. In addition to this page, I have a general overview on meta-analysis and a non-technical introduction on the practical interpretation of a meta-analysis.
Direct analysis of sensitivity/specificity
The simplest overall estimate of sensitivity (sens) or specificity (spec) is to just combine all the studies in a pot and stir. Just count the number of true positives (tp), false negatives (fn), true negatives (tn) and false positives (fp) in each study. The overall sensitivity would have the sum of the individual true positive values in the numerator and the sum of the individual true positive plus false negative values in the denominator.
This is equivalent to a weighted average of the individual sensitivities where the weight for each individual study is simply the number of true positives plus false negatives. You would calculate an overall estimate of specificity in the same way, using the true negative and false positive values.
The tricky part comes when you try to define a confidence interval for the overall estimate. This confidence interval is effectively a combination of the standard errors that you would assign to each individual study.
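A first attempt might be to define the standard error of an individual study using the classic binomial formula. Writing the standard error in terms of true positive and false negative values, you would get
$$ se_i = \sqrt{\frac{sens_i\,(1 - sens_i)}{tp_i + fn_i}} = \sqrt{\frac{tp_i\, fn_i}{(tp_i + fn_i)^3}} $$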
The problem with this formula for the standard error is that it gives less weight to studies where sensitivity is close to 50% and greater weight to studies where sensitivity is much smaller than 50% or much larger than 50%. Another problem occurs when one or more of the sensitivities is 100%. The standard error using a binomial distribution equals zero for those studies with 100% sensitivity, which seems at first like a good thing. But when one study has standard error of zero, the meta-analysis model will try to give it an infinite weight, which is not at all a good thing.
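One way to avoid some of these problems is to estimate the standard error, not using the individual sensitivities, but the overall sensitivity.
$$ se_i = \sqrt{\frac{sens\,(1 - sens)}{tp_i + fn_i}} $$
where sens, without a subscript, is the overall sensitivity.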
Since the numerator is now the same for every study, you no longer have the problem where studies with sensitivities near 50% get much smaller weights than studies with sensitivities much smaller or much larger than 50%. This approach also avoids the problem when a study has 100% sensitivity.
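It's interesting to note that the overall estimate and the standard error for the overall sensitivity using this particular meta-analysis model with a fixed effects estimate match perfectly with the traditional binomial confidence interval that you might apply. This is easy enough to show because, with sens denoting the overall sensitivity and sens_i the sensitivity in the ith study, the weights are
$$ w_i = \frac{tp_i + fn_i}{sens\,(1 - sens)} $$
which implies that
$$ \frac{\sum w_i\, sens_i}{\sum w_i} = \frac{\sum tp_i}{\sum (tp_i + fn_i)} = sens \quad \text{and} \quad \frac{1}{\sqrt{\sum w_i}} = \sqrt{\frac{sens\,(1 - sens)}{\sum (tp_i + fn_i)}} $$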
For a random effects model, the results are a little more complicated and they do not exactly match the traditional binomial confidence interval formula.
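The fixed effects claim can be checked by hand without the meta library. This is a minimal sketch, assuming the usual inverse variance formulas (weights of 1 / se^2, pooled standard error of 1 / sqrt(sum of weights)); the counts are the tp and fn columns of the Deeks data set listed in the example below.

```r
# True positives and false negatives for the 20 studies in the Deeks example.
tp <- c(81, 16, 8, 4, 15, 12, 1, 18, 18, 112, 14, 7, 57, 7, 6, 18, 2, 1, 37, 4)
fn <- c(5, 0, 0, 0, 2, 3, 0, 0, 3, 2, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0)

n <- tp + fn
sens <- tp / n
sens.overall <- sum(tp) / sum(n)

# Standard error for each study, based on the overall sensitivity.
se <- sqrt(sens.overall * (1 - sens.overall) / n)

# Fixed effects (inverse variance) estimate and its standard error.
w <- 1 / se^2
fixed.est <- sum(w * sens) / sum(w)
fixed.se <- 1 / sqrt(sum(w))

# Traditional binomial estimate and standard error from the pooled counts.
binom.est <- sens.overall
binom.se <- sqrt(sens.overall * (1 - sens.overall) / sum(n))

# Estimate, lower limit, upper limit for each approach.
print(round(fixed.est + c(0, -1.96, 1.96) * fixed.se, 4))
print(round(binom.est + c(0, -1.96, 1.96) * binom.se, 4))
```

Both printed lines should agree with each other and with the fixed effects row that metagen reports for sensitivity in the example below (0.9584, with limits 0.9401 and 0.9767).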
Example: In an article describing systematic reviews of diagnostic and screening tests,
- Systematic reviews in health care: Systematic reviews of evaluations of diagnostic and screening tests. Deeks JJ. British Medical Journal 2001: 323(7305); 157-62.
data from 20 studies of endovaginal ultrasonography for detecting endometrial cancer are presented. I typed the data in as a comma separated file.
study,tp,fn,tn,fp
Abu Hmeidan,81,5,186,273
Auslender,16,0,48,90
Botsis,8,0,14,98
Cacclatore,4,0,30,11
Chan,15,2,15,35
Dorum,12,3,34,51
Goldstein,1,0,16,11
Granberg,18,0,32,125
Hanggi,18,3,13,55
Karlsson (a),112,2,414,601
Karlsson (b),14,1,33,57
Klug,7,1,44,127
Malinova,57,0,26,35
Nasri (a),7,0,14,38
Nasri (b),6,0,24,59
Petrl,18,1,96,35
Taviani,2,0,18,21
Varner,1,1,4,9
Weigel,37,0,91,72
Wolman,4,0,18,32
Here is the R code to read in the data and compute the meta-analysis models.
library(meta)
f0 <- "X:/webdata/EndovaginalUltrasonography.csv"
deeks.example.dat <- read.csv(f0)
attach(deeks.example.dat)
sens <- tp / (tp + fn)
sens.overall <- sum(tp) / sum(tp + fn)
spec <- tn / (tn + fp)
spec.overall <- sum(tn) / sum(tn + fp)
par(mar=c(5.1,4.1,0.1,0.1))
plot(1-spec,sens,xlim=0:1,ylim=0:1)
points(1-spec.overall,sens.overall,pch="+",cex=2)
The last three lines create a graph of the data, which is shown below. The par() function adjusts the margins of the graph to make more effective use of the available space on the screen. The plot() function creates the axes and draws a circle at each individual sens, 1-spec pair. The points() command adds a big plus sign at the overall estimate.

Plotting 1-spec on the x-axis may seem odd, but it gives the graph the same orientation as an ROC curve. In fact, this plot is often called an SROC (Summary Receiver Operating Characteristic) plot.
I experimented with trying to show the confidence limits for each study in the graph itself, by drawing rectangles with the width representing confidence limits for 1-spec and the height representing confidence limits for sens. Unfortunately, this graph was too cluttered to be useful.
The computations for the actual meta-analysis are shown below. The code is a bit cryptic perhaps, but I am using "te" as an abbreviation for "treatment effect" and "se" as an abbreviation for "standard error." The metagen() function has similar notation. The only thing that is a bit confusing perhaps is the sm= portion. The letters "sm" stand for "summary measure." This is a label that metagen uses to make the output look nicer.
te1 <- sens
se1 <- sqrt(sens.overall * (1 - sens.overall) / (tp + fn))
deeks1.ma <- metagen(TE=te1, seTE=se1, studlab=study, sm="Sensitivity")
te2 <- spec
se2 <- sqrt(spec.overall * (1 - spec.overall) / (tn + fp))
deeks2.ma <- metagen(TE=te2, seTE=se2, studlab=study, sm="Specificity")
and here is the output
> deeks1.ma
              Sensitivity            95%-CI %W(fixed) %W(random)
Abu Hmeidan        0.9419  [0.8997; 0.9840]     18.82      10.27
Auslender          1.0000  [0.9022; 1.0978]      3.50       5.62
Botsis             1.0000  [0.8617; 1.1383]      1.75       3.61
Cacclatore         1.0000  [0.8044; 1.1956]      0.88       2.10
Chan               0.8824  [0.7875; 0.9772]      3.72       5.81
Dorum              0.8000  [0.6990; 0.9010]      3.28       5.42
Goldstein          1.0000  [0.6088; 1.3912]      0.22       0.60
Granberg           1.0000  [0.9078; 1.0922]      3.94       5.99
Hanggi             0.8571  [0.7718; 0.9425]      4.60       6.47
Karlsson (a)       0.9825  [0.9458; 1.0191]     24.95      10.77
Karlsson (b)       0.9333  [0.8323; 1.0344]      3.28       5.42
Klug               0.8750  [0.7367; 1.0133]      1.75       3.61
Malinova           1.0000  [0.9482; 1.0518]     12.47       9.37
Nasri (a)          1.0000  [0.8521; 1.1479]      1.53       3.27
Nasri (b)          1.0000  [0.8403; 1.1597]      1.31       2.91
Petrl              0.9474  [0.8576; 1.0371]      4.16       6.16
Taviani            1.0000  [0.7233; 1.2767]      0.44       1.15
Varner             0.5000  [0.2233; 0.7767]      0.44       1.15
Weigel             1.0000  [0.9357; 1.0643]      8.10       8.21
Wolman             1.0000  [0.8044; 1.1956]      0.88       2.10
Number of trials combined: 20
                     Sensitivity            95%-CI         z  p.value
Fixed effects model       0.9584  [0.9401; 0.9767]  102.6404 < 0.0001
Random effects model      0.9481  [0.9171; 0.9792]   59.8249 < 0.0001
Quantifying heterogeneity:
tau^2 = 0.002; H = 1.43 [1.1; 1.85]; I^2 = 51% [18.1%; 70.7%]
Test of heterogeneity:
     Q d.f.  p.value
 38.75   19   0.0048
Method: Inverse variance method
> deeks2.ma
              Specificity            95%-CI %W(fixed) %W(random)
Abu Hmeidan        0.4052  [0.3606; 0.4498]     15.27       5.83
Auslender          0.3478  [0.2665; 0.4292]      4.59       5.46
Botsis             0.1250  [0.0347; 0.2153]      3.73       5.35
Cacclatore         0.7317  [0.5825; 0.8810]      1.36       4.49
Chan               0.3000  [0.1648; 0.4352]      1.66       4.71
Dorum              0.4000  [0.2963; 0.5037]      2.83       5.17
Goldstein          0.5926  [0.4087; 0.7765]      0.90       3.97
Granberg           0.2038  [0.1275; 0.2801]      5.22       5.52
Hanggi             0.1912  [0.0753; 0.3071]      2.26       4.99
Karlsson (a)       0.4079  [0.3779; 0.4379]     33.78       5.93
Karlsson (b)       0.3667  [0.2659; 0.4674]      3.00       5.21
Klug               0.2573  [0.1842; 0.3304]      5.69       5.56
Malinova           0.4262  [0.3039; 0.5486]      2.03       4.90
Nasri (a)          0.2692  [0.1367; 0.4018]      1.73       4.75
Nasri (b)          0.2892  [0.1843; 0.3941]      2.76       5.15
Petrl              0.7328  [0.6493; 0.8163]      4.36       5.43
Taviani            0.4615  [0.3085; 0.6146]      1.30       4.43
Varner             0.3077  [0.0426; 0.5728]      0.43       2.91
Weigel             0.5583  [0.4834; 0.6331]      5.42       5.54
Wolman             0.3600  [0.2248; 0.4952]      1.66       4.71
Number of trials combined: 20
                     Specificity            95%-CI         z  p.value
Fixed effects model       0.3894  [0.3719; 0.4068]   43.7721 < 0.0001
Random effects model      0.3845  [0.3216; 0.4475]   11.9685 < 0.0001
Quantifying heterogeneity:
tau^2 = 0.0172; H = 3.26 [2.77; 3.85]; I^2 = 90.6% [86.9%; 93.2%]
Test of heterogeneity:
      Q d.f.  p.value
 202.17   19 < 0.0001
Method: Inverse variance method
Notice that there is substantial evidence of heterogeneity in both the sensitivity and specificity values.
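The H and I^2 values in the output above follow directly from Q and its degrees of freedom. Here is a minimal sketch, assuming the usual definitions H = sqrt(Q / df) and I^2 = (Q - df) / Q, truncated at zero. The function name heterogeneity is hypothetical and is not part of the meta library.

```r
# Heterogeneity summaries computed from Cochran's Q and its degrees of freedom.
heterogeneity <- function(Q, df) {
  H <- sqrt(Q / df)
  I2 <- max(0, (Q - df) / Q)
  # Q is referred to a chi-square distribution with df degrees of freedom.
  p.value <- pchisq(Q, df, lower.tail = FALSE)
  c(H = H, I2.percent = 100 * I2, p.value = p.value)
}

# Q statistics copied from the sensitivity and specificity output above.
het.sens <- heterogeneity(38.75, 19)
het.spec <- heterogeneity(202.17, 19)

print(round(het.sens, 4))
print(round(het.spec, 4))
```

An I^2 near 51% for sensitivity and near 91% for specificity says that roughly half and nine tenths, respectively, of the observed variation is attributed to between-study differences rather than sampling error.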
Analysis of sensitivity/specificity on the log odds scale
Another approach is to transform the sensitivity/specificity to the log odds scale before entering the data into a meta-analysis model. The log odds transformation is a common transformation for binomial data and serves as the heart of a logistic regression model. The standard error for the log odds sensitivity has a nice simple approximation. To derive this, you have to remember a simple formula about variances of a function.
$$\mathrm{Var}\left(f(X)\right) \approx \left[f'(\mu)\right]^2\, \mathrm{Var}(X), \qquad \mu = E(X)$$
This formula relies on two things you forgot from your days of calculus, how to take a derivative and how to apply a Taylor series expansion.
The details are tedious, but not difficult. When you use this formula on the log odds function, you get the following approximation.
$$se\left(\mathrm{logit}(sens_i)\right) \approx \sqrt{\frac{1}{sens_i\,(1 - sens_i)\,(tp_i + fn_i)}}$$
Compare this to the standard error for sensitivity shown above. The numerator for the standard error has now moved in with its downstairs neighbor, leaving the upstairs empty. For the log odds of sensitivity, this creates the opposite problem from the one for sensitivity itself. Studies with sensitivity close to 50% have greater weight on the log odds scale than studies with sensitivity much smaller or much larger than 50%.
You can simplify this formula further. Note that the denominator of sens_i can cancel out the tp_i + fn_i term right next to it. With a bit more algebra, you can get
$$se\left(\mathrm{logit}(sens_i)\right) \approx \sqrt{\frac{1}{tp_i} + \frac{1}{fn_i}}$$
The log odds transformation also has some problems when the sensitivity is 100%. A simple fix is to add an arbitrary constant (usually 0.5) to both the numerator and denominator. Another approach would be to use the more complex formula listed above, but substitute the overall sensitivity for the individual sensitivity.
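A small numerical check may help here. The sketch below assumes the approximate variance of the log odds is 1 / (sens * (1 - sens) * (tp + fn)), shows that it equals 1/tp + 1/fn, and then shows what the 0.5 adjustment does for a study with no false negatives. The counts are taken from the Deeks data set (Abu Hmeidan, Chan, Dorum, and Auslender); the variable names are illustrative.

```r
# Three studies with at least one false negative.
tp <- c(81, 15, 12)
fn <- c(5, 2, 3)
n <- tp + fn
sens <- tp / n

# Long form of the standard error and its simplified version.
se.long <- sqrt(1 / (sens * (1 - sens) * n))
se.short <- sqrt(1 / tp + 1 / fn)
print(round(cbind(se.long, se.short), 4))

# Auslender has tp = 16 and fn = 0, so the unadjusted standard error is infinite.
se.unadjusted <- sqrt(1 / 16 + 1 / 0)

# Replacing the zero with 0.5 gives a finite, but large, standard error.
fn.adj <- pmax(0, 0.5)
se.adjusted <- sqrt(1 / 16 + 1 / fn.adj)
print(c(unadjusted = se.unadjusted, adjusted = se.adjusted))
```

Note that the problem on this scale is the reverse of the one on the raw scale: a study with 100% sensitivity gets zero weight rather than infinite weight unless you adjust it.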
Example: Let's use the example in Deeks 2001 again. Here is the R code to compute log odds and analyze the data in a meta-analysis model. Note that the pmax function replaces the zeros in fn with 0.5.
logit <- function(p) {log(p)-log(1-p)}
fn.adj <- pmax(fn,0.5)
sens <- tp/(tp+fn.adj)
te3 <- logit(sens)
se3 <- sqrt(1/tp+1/fn.adj)
deeks3.ma <- metagen(TE=te3,seTE=se3,studlab=study,sm="Log Odds Sens")
spec <- tn/(tn+fp)
te4 <- logit(spec)
se4 <- sqrt(1/tn+1/fp)
deeks4.ma <- metagen(TE=te4,seTE=se4,studlab=study,sm="Log Odds Spec")
Here is the output. Using the summary function results in a briefer output because the results of individual studies are not shown.
> summary(deeks3.ma)
Number of trials combined: 20
                     Log Odds Sens            95%-CI        z  p.value
Fixed effects model         2.4775  [2.0562; 2.8987]  11.5269 < 0.0001
Random effects model        2.4761  [2.0318; 2.9204]  10.9228 < 0.0001
Quantifying heterogeneity:
tau^2 = 0.0551; H = 1.03 [1; 1.27]; I^2 = 5.4% [0%; 38.1%]
Test of heterogeneity:
     Q d.f.  p.value
 20.07   19   0.3901
Method: Inverse variance method
> summary(deeks4.ma)
Number of trials combined: 20
                     Log Odds Spec              95%-CI         z  p.value
Fixed effects model        -0.4277  [-0.5036; -0.3518]  -11.0403 < 0.0001
Random effects model       -0.5033  [-0.7668; -0.2399]   -3.7446   0.0002
Quantifying heterogeneity:
tau^2 = 0.292; H = 3.07 [2.58; 3.64]; I^2 = 89.4% [85%; 92.5%]
Test of heterogeneity:
      Q d.f.  p.value
 178.76   19 < 0.0001
Method: Inverse variance method
You need to do a few additional calculations to get sensitivity transformed back to the original measurement scale. You can define a function in R to do this calculation for you. I call it the expit function, which is the inverse of the logit function.
expit <- function(log.odds) {exp(log.odds)/(1+exp(log.odds))}
With this function, you can now take the estimates and confidence limits on the log odds scale and transform them back to the original scale.
> attach(deeks3.ma)
> est.and.cl.fixed <- TE.fixed+c(0,-1.96,1.96)*seTE.fixed
> round(100*expit(est.and.cl.fixed),1)
92.3 88.7 94.8
> est.and.cl.random <- TE.random+c(0,-1.96,1.96)*seTE.random
> round(100*expit(est.and.cl.random),1)
92.2 88.4 94.9
> attach(deeks4.ma)
> est.and.cl.fixed <- TE.fixed+c(0,-1.96,1.96)*seTE.fixed
> round(100*expit(est.and.cl.fixed),1)
39.5 37.7 41.3
> est.and.cl.random <- TE.random+c(0,-1.96,1.96)*seTE.random
> round(100*expit(est.and.cl.random),1)
37.7 31.7 44.0
The estimated sensitivity and 95% confidence limits under the fixed effects model are 92.3% (88.7% to 94.8%). The estimates and limits change only slightly under the random effects model. The estimated specificity and 95% confidence limits under the fixed effects model are 39.5% (37.7% to 41.3%). Under the random effects model, the estimate is a bit lower and the confidence limits are much wider.
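If you only have the printed summary rather than the fitted object, you can back-transform the rounded values directly. The sketch below is one way to do it, using the fixed effects row for log odds sensitivity copied from the output above. It also compares the hand-written expit function with plogis(), the logistic distribution function that ships with base R and computes the same quantity.

```r
# Inverse of the logit function, as defined earlier in this handout.
expit <- function(log.odds) {exp(log.odds)/(1+exp(log.odds))}

# Fixed effects estimate and 95% limits on the log odds scale,
# copied from the summary(deeks3.ma) output.
log.odds <- c(estimate = 2.4775, lower = 2.0562, upper = 2.8987)

# Back-transform to percentages with both functions.
back.expit <- round(100 * expit(log.odds), 1)
back.plogis <- round(100 * plogis(log.odds), 1)

print(back.expit)
print(back.plogis)
```

The results should be very close to the 92.3% (88.7% to 94.8%) reported above; tiny differences are possible because the inputs here are already rounded to four decimals.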
Analysis of the diagnostic odds ratio
A third approach is to compute the diagnostic odds ratio, which compares the odds for sensitivity to the odds for specificity.
$$DOR = \frac{sens / (1 - sens)}{(1 - spec) / spec}$$
Notice how the denominator looks like we accidentally switched things. That was not a mistake. The diagnostic odds ratio is effectively the odds of TPR (the true positive rate or sens) divided by the odds of FPR (the false positive rate or 1-spec).
The first advantage of this approach is that you can use well-known approaches for combining multiple odds ratios. The other advantage is that it analyzes sensitivity and specificity as a pair. Some studies may exhibit heterogeneity in the individual sensitivity or specificity values because one researcher may have tried to maximize sensitivity at the expense of specificity, another may have tried to maximize specificity at the expense of sensitivity, and a third may have tried to balance the two. If there is heterogeneity, then the overall estimates of sensitivity and specificity may be too low.
Although there are no guarantees, the diagnostic odds ratio should exhibit less heterogeneity. The problem with the diagnostic odds ratio is that no one has a very good feel for what it actually represents. One way of thinking about the diagnostic odds ratio is to swap a couple of terms in the fraction.
$$DOR = \frac{sens / (1 - spec)}{(1 - sens) / spec}$$
The numerator is now the likelihood ratio for a positive test and the denominator is the likelihood ratio for a negative test.
So you might interpret the diagnostic odds ratio as the spread between the two likelihood ratios. If, for example, the likelihood ratio for a positive test is 10 and is 0.5 for a negative test, then there is a 20 fold change. Another way of interpreting this is that the post-test odds would be 20 fold higher for a positive test than for a negative test.
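To make the equivalence concrete, here is a minimal sketch that computes the diagnostic odds ratio three ways for a single study. The counts are those of the Abu Hmeidan study from the Deeks data set; the variable names are illustrative.

```r
# Counts for the Abu Hmeidan study.
tp <- 81; fn <- 5; tn <- 186; fp <- 273

sens <- tp / (tp + fn)
spec <- tn / (tn + fp)

# 1. Odds of the true positive rate divided by odds of the false positive rate.
dor.odds <- (sens / (1 - sens)) / ((1 - spec) / spec)

# 2. Ratio of the two likelihood ratios.
lr.pos <- sens / (1 - spec)
lr.neg <- (1 - sens) / spec
dor.lr <- lr.pos / lr.neg

# 3. Cross-product ratio of the two by two table.
dor.counts <- (tp * tn) / (fn * fp)

print(round(c(odds = dor.odds, lr = dor.lr, counts = dor.counts), 3))
```

All three calculations give the same number, so the choice among them is a matter of which interpretation you find most natural.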
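The book on meta-analysis by Sutton et al suggests that you model the heterogeneity in the diagnostic odds ratio using the following regression model
$$D = a + b\,S, \qquad D = \mathrm{logit}(TPR) - \mathrm{logit}(FPR), \qquad S = \mathrm{logit}(TPR) + \mathrm{logit}(FPR)$$
You might recognize D as the log of the diagnostic odds ratio. The variable S is a bit harder to visualize, but you can rewrite it as
$$S = \mathrm{logit}(sens) - \mathrm{logit}(spec)$$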
This represents the tendency of an individual study to skew the test more towards sensitivity or more towards specificity.
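Here is a short sketch of that rewriting, assuming S is defined as logit(TPR) + logit(FPR) as in the R code later in this section. It uses two studies from the Deeks data set (Abu Hmeidan and Petrl); a larger value of S means the study leans more towards sensitivity than towards specificity.

```r
logit <- function(p) {log(p)-log(1-p)}

# Counts for Abu Hmeidan and Petrl.
tp <- c(81, 18); fn <- c(5, 1)
tn <- c(186, 96); fp <- c(273, 35)

sens <- tp / (tp + fn)
spec <- tn / (tn + fp)
tpr <- sens
fpr <- 1 - spec

# S written as a sum of logits of TPR and FPR ...
s.sum <- logit(tpr) + logit(fpr)
# ... and as a difference of logits of sensitivity and specificity.
s.diff <- logit(sens) - logit(spec)

print(round(rbind(s.sum, s.diff), 4))
```

The two rows match because logit(1 - spec) is the negative of logit(spec).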
Here's an example of the problems that can happen when different studies skew more towards sensitivity and others more towards specificity. Imagine a diagnostic test that takes on a range of values. This test follows a bell shaped curve both in the diseased and the healthy populations and the two bell curves are set two standard deviations apart. You could set a cutpoint to maximize specificity, to maximize sensitivity, or something in between.
This series of graphs shows what happens across a range of cutpoints.











When you graph the data on an SROC plot, you get a nice distribution of values. Notice, however, that the average of all these sensitivities and specificities is pushed further away from the upper left hand corner than any of the individual sensitivity/specificity pairs.

By fitting a model to the diagnostic odds ratio, and assessing heterogeneity in that odds ratio, you hope to avoid this obvious underestimate of sensitivity and specificity.
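The underestimate can be reproduced with a few lines of base R. This is a sketch under stated assumptions: healthy values follow a normal distribution with mean 0, diseased values follow a normal distribution with mean 2 (both with standard deviation 1), a value above the cutpoint counts as a positive test, and the grid of cutpoints is an arbitrary choice of mine.

```r
# Assumed grid of cutpoints between the two means.
cutpoints <- seq(0, 2, by = 0.25)

# Sensitivity and specificity at each cutpoint.
sens <- pnorm(cutpoints, mean = 2, sd = 1, lower.tail = FALSE)
spec <- pnorm(cutpoints, mean = 0, sd = 1)

# Naive averages across the cutpoints.
sens.avg <- mean(sens)
spec.avg <- mean(spec)

# Sensitivity the test actually achieves when its specificity equals spec.avg.
cut.at.avg.spec <- qnorm(spec.avg)
sens.on.curve <- pnorm(cut.at.avg.spec, mean = 2, sd = 1, lower.tail = FALSE)

print(round(c(sens.avg = sens.avg, spec.avg = spec.avg,
              sens.on.curve = sens.on.curve), 3))
```

The averaged pair falls below the curve traced by the individual cutpoints: at the averaged specificity, the test is capable of a higher sensitivity than the average suggests.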
When you fit the regression model, you are hoping that the slope term is zero. That tells you that the estimated intercept is a valid estimate across the range of S values.
It's unclear whether to use a weighted regression model or an unweighted regression model for these data.
fn.adj <- pmax(fn,0.5)
tpr <- tp/(tp+fn.adj)
fpr <- fp/(tn+fp)
d <- logit(tpr)-logit(fpr)
s <- logit(tpr)+logit(fpr)
se.d <- sqrt(1/tp+1/fn.adj+1/tn+1/fp)
w <- 1/se.d^2
unweighted.regression <- lm(d~s)
weighted.regression <- lm(d~s,weights=w)
par(mar=c(5.1,4.1,0.6,0.6))
plot(s,d)
abline(unweighted.regression)
abline(weighted.regression,lty=2)
For this data set, it appears that there is a non-zero slope, which makes interpretation of the combined diagnostic odds ratio problematic.
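If you want something more than a visual impression of the slope, you can look at its estimate, standard error, and confidence interval. The sketch below re-enters the Deeks counts so that it runs on its own, and repeats the regression code from above. Treat the p-values as a rough guide only: lm() treats the inverse variance weights as relative rather than as known variances, and there are just 20 studies.

```r
# Counts for the 20 studies in the Deeks example.
tp <- c(81, 16, 8, 4, 15, 12, 1, 18, 18, 112, 14, 7, 57, 7, 6, 18, 2, 1, 37, 4)
fn <- c(5, 0, 0, 0, 2, 3, 0, 0, 3, 2, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0)
tn <- c(186, 48, 14, 30, 15, 34, 16, 32, 13, 414, 33, 44, 26, 14, 24, 96, 18, 4, 91, 18)
fp <- c(273, 90, 98, 11, 35, 51, 11, 125, 55, 601, 57, 127, 35, 38, 59, 35, 21, 9, 72, 32)

logit <- function(p) {log(p)-log(1-p)}

# Same quantities as in the regression code above.
fn.adj <- pmax(fn, 0.5)
tpr <- tp / (tp + fn.adj)
fpr <- fp / (tn + fp)
d <- logit(tpr) - logit(fpr)
s <- logit(tpr) + logit(fpr)
w <- 1 / (1 / tp + 1 / fn.adj + 1 / tn + 1 / fp)

unweighted.regression <- lm(d ~ s)
weighted.regression <- lm(d ~ s, weights = w)

# Estimate, standard error, t statistic and p-value for the slope.
print(coef(summary(unweighted.regression))["s", ])
print(coef(summary(weighted.regression))["s", ])

# 95% confidence interval for the slope in the weighted model.
print(confint(weighted.regression)["s", ])
```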

deeks5.ma <- metagen(TE=d,seTE=se.d,studlab=study,sm="Log Diagnostic Odds Ratio")
> summary(deeks5.ma)
Number of trials combined: 20
                     Log Diagnostic Odds Ratio            95%-CI       z  p.value
Fixed effects model                     1.9772  [1.5400; 2.4145]  8.8633 < 0.0001
Random effects model                    1.9732  [1.3618; 2.5847]  6.3249 < 0.0001
Quantifying heterogeneity:
tau^2 = 0.6555; H = 1.27 [1; 1.67]; I^2 = 38.4% [0%; 64%]
Test of heterogeneity:
     Q d.f.  p.value
 30.87   19   0.0418
Method: Inverse variance method
Additional reading
- Reporting of measures of accuracy in systematic reviews of diagnostic literature. Honest H, Khan KS. BMC Health Serv Res 2002: 2(1); 4.
- Conducting systematic reviews of diagnostic studies: didactic guidelines. Deville WL, Buntinx F, Bouter LM, Montori VM, De Vet HC, Van Der Windt DA, Bezemer P. BMC Med Res Methodol 2002: 2(1); 9.
- Systematic reviews in health care: Systematic reviews of evaluations of diagnostic and screening tests. Deeks JJ. British Medical Journal 2001: 323(7305); 157-62.
Page last modified on 09/24/2007.