Children's Mercy Hospital

Supplemental readings for Stats #21 or Stats #24.

Here is some supplementary material about diagnostic testing that I have published recently in my weblog.

A novel diagnostic test (January 26, 2006). Category: Diagnostic testing

A recently published article on diagnosing cancer got a lot of press. The article

  • Diagnostic Accuracy of Canine Scent Detection in Early- and Late-Stage Lung and Breast Cancers. McCulloch M, Jezierski T, Broffman M, Hubbard A, Kirk Turner, Janecki T. Integrative Cancer Therapies 2006: 5(1); 1-10. [PDF]

noted that canines have an unusually sensitive sense of smell and might be able to diagnose cancer by sniffing breath samples from human patients. This is rather intriguing, since dogs have already been trained to locate explosives, cadavers, drugs, and so forth.

The researchers collected breath samples from 55 patients with lung cancer, 31 patients with breast cancer, and 83 volunteers with no prior cancer history.

Eligible patients were men and women older than 18 years with a very recent biopsy-confirmed conventional diagnosis of lung or breast cancer. We specifically requested that recruitment centers refer patients as soon as possible following definitive diagnosis so that breath sampling would not interfere with or delay planned conventional treatment. As we suspected that chemotherapy treatment would change the exhaled chemicals in cancer patients, we sought patients who had not yet undergone chemotherapy treatment. As we also suspected that patients with more advanced disease, and thus larger tumors, might be exhaling higher concentrations of the chemicals associated with cancer cells and would therefore be more easily identified by the dogs, we sought patients with any stage disease.

The collection of breath samples was quite simple.

For breath sampling, we obtained a cylindrical polypropylene organic vapor sampling tube (Defencetek, Pretoria, South Africa). Each tube is open at either end, is 6 inches long, has an outer diameter of 1 inch, has an inner diameter of 0.75 inches, and has removable end caps. A removable 2-inch-long insert of silicone oil-coated polypropylene “wool” captures volatile organic compounds in exhaled breath as breath passes through the tube. To collect breath samples, we asked donors to exhale 3 to 5 times through the tube. We then fitted the tubes with their end caps and sealed them in ordinary grocery store Ziplock-style bags at room temperature between the time of breath sampling and presentation to the dogs.

Each patient and control contributed multiple breath samples to the study, ranging from 4 to 18 samples per person.

The dogs had to be trained to recognize cancer samples, and in the training sessions, the trainer had to be unblinded to the location of the cancer sample, so they could reward the dogs when they identified the cancer samples correctly. The dogs were trained to indicate a positive result by sitting down by the canister that had the cancer breath sample.

During phase 1 of training, the location of the cancer breath sample was known by both experimenter and trainer (Table 2). One station contained a cancer breath sample, and the remaining 4 stations contained blank sample tubes that had not been used in any breath sampling. To encourage the dogs to seek out the exhaled chemicals associated with cancer, we placed a piece of dog food in the station with the cancer breath sample and covered the container with a piece of paper so the food would not be visible.

The second phase of training still used four blank canisters and food rewards in the cancer breath sample canister.

During phase 2 of training, only the experimenter was aware of the location of the cancer breath sample and apart from encouraging the dog with encouraging phrases such as “go to work,” gave no “sit” or other verbal commands to the dog. Clicker signal by the experimenter and subsequent food reward and praise by the trainer were given only after the dog correctly indicated on the cancer breath sample. When the dog indicated incorrectly on a control, the experimenter would not signal with the clicker and the handler would remain silent, not give the dog any praise reward, and mildly rebuke the dog by saying “no.” Samples used in phases 1 and 2 (contaminated with food scent) were not used again.

The third phase of training was similar to the second, except there were no food rewards in the canister with the cancer breath sample. After the dogs had performed sufficiently well during the training session, they were evaluated in a single blind phase.

During the single-blinded canine scent-testing experiment, using samples previously used in phase 3 of training, the level of challenge to the dogs was increased by placing a cancer breath sample in 1 station and control subject breath samples in the remaining 4 stations. Thus, dogs now had to distinguish cancer patient breath samples from those of healthy controls. Furthermore, the handler was blinded to the location and status of patient and control breath samples. Although the experimenter did know the location and status of patient and control breath samples during the single-blinded experiments, the possibility of the experimenter giving the dogs cues was minimized by positioning the experimenter in an adjacent room, behind an opaque curtain that almost completely covered the doorway between the training and observation rooms.

This was followed by a double blind phase, the phase used to evaluate sensitivity and specificity.

We designed our double-blinded experiment so that each dog would have the opportunity to sniff breath samples from each subject and each control. During the entire double-blinded testing phase, all breath samples sniffed by dogs, for both cases and controls, were from completely different subjects not previously encountered by the dogs during training or single-blinded testing. Furthermore, all of these breath samples used during double-blinded testing, for both cases and controls, contribute to the overall results reported in Table 3. For each trial, we used a random number table to determine the location of the sample being tested in the lineup.

All other methods were identical to the single-blinded testing phase, except that we now (1) placed the target breath sample of interest, whether from patient or control, within the lineup along with 4 other controls and (2) blinded both the experimenters and dog handlers to the status of that target sample in the lineup. Whereas in the single-blinded experiments only the dog handler was blinded to knowledge of the target sample, in the double-blinded experiments, both handler and experimenter were blinded to ensure that neither experimenters nor handlers could be giving any clues to the dogs. Since the experimenters now no longer knew the status of the target breath sample, they did not activate the clicker device after a sitting indication by the dog, and therefore the handler did not reward the dog with any food. After being given the opportunity to sniff and indicate on samples, the dog was simply led out of the room. Only after leaving the training room was the dog acknowledged with the phrase “good work!” During double-blinded testing, each tube was used a median of 20 times (x = 32.35, SD = 24.46; range, 4-99).
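The lineup described in the excerpt above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_lineup and the sample labels are hypothetical, and Python's pseudorandom generator stands in for the random number table that the authors actually used.

```python
import random

def build_lineup(target, controls, rng):
    """Place one target sample at a random station among 4 control samples.

    Hypothetical helper: the paper used a random number table, not software.
    Returns the 5-station lineup and the 0-based position of the target.
    """
    if len(controls) != 4:
        raise ValueError("a lineup needs exactly 4 control samples")
    position = rng.randrange(5)          # station 0..4, each equally likely
    lineup = list(controls)
    lineup.insert(position, target)      # controls keep their relative order
    return lineup, position

# Example with made-up sample labels; the seed only makes the run repeatable.
rng = random.Random(2006)
lineup, position = build_lineup("target", ["c1", "c2", "c3", "c4"], rng)
print(lineup, position)
```

In a double-blind trial the position and the status of the target would be recorded by someone who is not in the room, so neither the handler nor the experimenter can cue the dog.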

Blinding is very important in a trial like this because of the "Clever Hans" effect, which is the ability of animals to pick up subtle and even subconscious nonverbal cues from the people around them.

  • Clever Hans phenomenon. Carroll RT, The Skeptic's Dictionary. Accessed on 2006-01-27.

    [Excerpt] Clever Hans phenomenon: A form of involuntary and unconscious cuing. The term refers to a horse (Kluge Hans, referred to in the literature as "Clever Hans") who responded to questions requiring mathematical calculations by tapping his hoof. If asked by his master, William Von Osten, what is the sum of 3 plus 2, the horse would tap his hoof five times. It appeared the animal was responding to human language and was capable of grasping mathematical concepts. It was 1891 when Von Osten began showing Hans to the public. (Hans could also tell time and name people,* but we will restrict our discussion of his amazing abilities to his mathematical skills.) It was eventually discovered (in 1904) by Oskar Pfungst that the horse was responding to subtle physical cues (ideomotor reaction) or as Ray Hyman puts it "Hans was responding to a simple, involuntary postural adjustment by the questioner, which was his cue to start tapping, and an unconscious, almost imperceptible head movement, which was his cue to stop" (Hyman 1989: 425). skepdic.com/cleverhans.html

In the trials involving lung cancer patients, 708 of the 712 control canisters were properly identified, and 564 of the 574 cancer canisters were identified. In the trials involving breast cancer patients, 260 of the 275 control canisters were properly identified and 110 of the 116 cancer canisters were identified.
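These counts translate directly into crude estimates of sensitivity and specificity. The short sketch below does the division; the dictionary layout and the function name crude_accuracy are my own, and the results are unadjusted proportions that need not match the adjusted estimates reported in the paper.

```python
# Counts quoted above for the double-blind phase.
RESULTS = {
    "lung": {"cancer_correct": 564, "cancer_total": 574,
             "control_correct": 708, "control_total": 712},
    "breast": {"cancer_correct": 110, "cancer_total": 116,
               "control_correct": 260, "control_total": 275},
}

def crude_accuracy(counts):
    """Return (sensitivity, specificity) as simple proportions."""
    sensitivity = counts["cancer_correct"] / counts["cancer_total"]
    specificity = counts["control_correct"] / counts["control_total"]
    return sensitivity, specificity

for site, counts in RESULTS.items():
    sens, spec = crude_accuracy(counts)
    print(f"{site}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```

Because each canister is treated as a separate observation here, these proportions ignore the fact that the same donors and tubes were sniffed many times.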

It is unclear how these results were tabulated. One possible method would be the following: If the dog did not sit down at any canister, and the fifth canister was a control breath sample, that trial was labeled a true negative. If the dog sat down at one of the four control canisters or hesitated, that trial was labeled a false positive or false negative depending on the contents of the fifth canister.

Another interpretation would be that if the dog sat down at any control canister, that was considered a false positive for that canister, and failure to sit down at a control canister was considered a true negative for that canister.

The wording of the paper seems to favor the latter interpretation:

The dogs’ response to each of the 5 samples sniffed was included in our analysis; dogs were allowed the opportunity to visit each sample station and thus could have potentially indicated every one of the samples in a trial, although in our experiments, this never occurred. Dog handlers did not try to prevent dogs from visiting any individual station. Therefore, since each individual sample station was considered as a unit of analysis, the use of 4 control subject breath samples along with a cancer patient sample in each experimental trial would not change sensitivity or specificity.

On the other hand, the number of control samples during the double-blind phase was 987 compared to 690 cancer samples, and it is hard to reconcile these numbers with the fact that at least four control samples were tested in each trial. The ratio of controls to cancers should be at least four to one and probably closer to ten to one.
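The arithmetic behind this comparison can be sketched as follows. The sketch assumes that every trial used exactly four control samples plus one target, and it treats the fraction of trials with a cancer target as a free parameter. Neither detail is confirmed by the paper, and the function name expected_ratio is hypothetical.

```python
# Totals from the double-blind phase (lung plus breast trials).
controls = 712 + 275   # 987 control canisters
cancers = 574 + 116    # 690 cancer canisters
observed_ratio = controls / cancers

def expected_ratio(fraction_cancer_targets, controls_per_lineup=4):
    """Expected controls per cancer sample if each canister counts once.

    Assumes one target per trial; the target is a cancer sample in the
    given fraction of trials and a control sample otherwise.
    """
    p = fraction_cancer_targets
    return (controls_per_lineup + (1 - p)) / p

print(f"observed ratio: {observed_ratio:.2f}")
print(f"all targets cancer: {expected_ratio(1.0):.1f}")
print(f"half of targets cancer: {expected_ratio(0.5):.1f}")
```

Under these assumptions the expected ratio is 4 when every target is a cancer sample and 9 when half of the targets are cancer samples, while the observed ratio is about 1.43.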

Because of the number of tests performed, individual patients were used multiple times in the study and even individual breathing tubes were re-used many times.

During double-blinded testing, each tube was used a median of 20 times (x = 32.35, SD = 24.46; range, 4-99).

To account for this, the researchers used "general estimating equations (GEE) random effects linear regression, with standard errors adjusted for clustering on donor." The researchers re-analyzed the data including only the first dog-donor combination in each trial of the double blind phase, and found comparable results.

The GEE estimates were also adjusted for current smoking status since there was more smoking among the lung cancer volunteers than the control volunteers.

This research used a case-control design to estimate sensitivity and specificity, which is acceptable for a "proof of concept" study, but the authors do discuss the problem of spectrum bias in this research.

However, our specificity may be overestimated because we used only healthy controls (rather than a broad spectrum of subjects that included, for example, those with bronchitis or emphysema as controls for lung cancer or those with fibrocystic breast disease or mastitis as controls for breast cancer). These questions could be better understood by further study in a prospective cohort design that included both cases and controls representing the full spectrum of disease severity seen in the general population.

There are additional limitations to this research which the authors discuss at the end of the article.

I will include this discussion in the Chance Wiki when I get the time.

This webpage was written and was last modified on 07/08/2008.


An error slips through the peer review process (September 19, 2005). Category: Diagnostic testing

A group of residents wanted me to look at an article because they were confused about the calculation of the likelihood ratio. The numbers that they got were quite different from those in the publication. It turns out that they were calculating things correctly, and did not realize that the paper had several serious errors in some of the more fundamental calculations of sensitivity and specificity.

Here is the paper they showed me:

  • A clinical score to reduce unnecessary antibiotic use in patients with sore throat. McIsaac WJ, White D, Tannenbaum D, Low DE. Cmaj 1998: 158(1); 75-83. [Abstract] [PDF]

This paper developed a score to assign to patients who came in complaining of a sore throat to see if they needed to be prescribed antibiotics. The scale was computed using the following formula:

Although scores of -1 and 5 are theoretically possible, no one scored below zero or above 4. The paper suggests the following management strategy:

The results of this score were compared to the physician's subjective evaluation and to a throat swab culture (the gold standard). There are several errors in the calculations of sensitivity and specificity in this paper, but the most obvious one is the claim that:

Among children aged 3 to 14 years, there was no difference between the 2 approaches in the proportion receiving antibiotics or from whom throat swabs were obtained, but significantly more cases of GAS infection would have been identified with the score approach (96.9%) than with usual physician care (70.6%) (p < 0.05). Physician specificity was higher, however (91.7% v. 67.2%) (p < 0.05). Among adults the sensitivity of physician judgement and of the score approach were similar, but both throat swab use (37.3% v. 26.4%) and antibiotic prescription (16.5% v. 3.4%) would have been reduced with the score approach (p < 0.001).

This data is corroborated in Table 3, where the sensitivity for patients aged 3-14 years is reported as 96.9% (31/32) and specificity as 94.3% (413/438). An excerpt from the table is reproduced below.

The residents could not reproduce these numbers because they were looking instead at Table 4, a portion of which is reproduced below.

Can you spot the error in the sensitivity and specificity calculations?
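For readers who want to check the arithmetic themselves, here is a small sketch of the likelihood ratio calculation that the residents were attempting, applied to the Table 3 counts quoted above (31/32 and 413/438). The function name likelihood_ratios is my own, and the formulas are the standard definitions rather than anything specific to the paper.

```python
def likelihood_ratios(true_pos, diseased, true_neg, healthy):
    """Return sensitivity, specificity, LR+ and LR- from raw counts."""
    sensitivity = true_pos / diseased
    specificity = true_neg / healthy
    lr_positive = sensitivity / (1 - specificity)   # LR+ = sens / (1 - spec)
    lr_negative = (1 - sensitivity) / specificity   # LR- = (1 - sens) / spec
    return sensitivity, specificity, lr_positive, lr_negative

# Counts reported in Table 3 for patients aged 3-14 years.
sens, spec, lr_pos, lr_neg = likelihood_ratios(31, 32, 413, 438)
print(f"sensitivity {sens:.3f}, specificity {spec:.3f}")
print(f"LR+ {lr_pos:.2f}, LR- {lr_neg:.3f}")
```

With these counts the positive likelihood ratio is about 17 and the negative likelihood ratio is about 0.03. Repeating the calculation with counts taken from a different table will only agree if the underlying sensitivity and specificity agree.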



The costs of a false positive test (March 1, 2005). Category: Diagnostic testing

The New York Times had an excellent article on newborn screening tests.

  • Panel to Advise Testing Babies for 29 Diseases. Kolata G. The New York Times, February 21, 2005.

Unfortunately, this article is no longer available online. But it discusses a recent push to standardize and expand the screening tests for newborns to include 29 different diseases. It seems like such an obvious thing to do: let's screen for these conditions, because the more we know, the better we are able to care for these children.

Proponents say that the diseases are terrible and that an early diagnosis can be lifesaving. When testing is not done, parents often end up in a medical odyssey to find out what is wrong with their child. By the time the answer is in, it may be too late for treatment to do much good.

Opponents, however, point out that false positive results may present more problems.

But opponents say that for all but about five or six of the conditions, it is not known whether the treatments help or how often a baby will test positive but never show signs of serious disease. There is a danger, they say, of children with mild versions of illnesses being treated needlessly and aggressively for more serious forms and suffering dire health consequences.

The article also offers a historical perspective.

The history of newborn screening, they say, is filled with cautionary tales. "The majority of newborn screening tests have failed," said Dr. Norman Fost, a professor of pediatrics and director of the program in medical ethics at the University of Wisconsin. Over the years, Dr. Fost said, "thousands of normal kids have been killed or gotten brain damage by screening tests and treatments that turned out to be ineffective and very dangerous."

and cites phenylketonuria (PKU) testing as an example. An infant with PKU cannot metabolize phenylalanine, and the buildup of this amino acid can lead to serious neurological damage. The treatment, a diet low in phenylalanine, is very effective, but only if the condition is diagnosed early. The PKU testing done today is very good, but tests performed 45 years ago had problems.

Back then, any infant who tested positive would be put on this special diet. When phenylalanine is withdrawn from the diet of a healthy infant, that infant suffers from even more serious neurological problems and can even die. Many infants who falsely tested positive were put on this diet and their harms outweighed the benefits of PKU screening. As researchers learned more, they were able to refine the test to prevent most false positives, but the damage had already been done.

An additional article about Universal Newborn Hearing Screening (UNHS),

  • The false-positive in universal newborn hearing screening. Clemens CJ, Davis SA, Bailey AR. Pediatrics 2000: 106(1); E7. [Abstract] [Full text] [PDF]

    also discusses the problems with false positives:

    However, support for UNHS is not universal. One of the most concerning issues raised is the high rate of false-positive results. The literature reports false-positive rates between 3% and 8%. This has caused a number of critics to decline to recommend UNHS until the false-positive rate can be decreased and/or there is further knowledge of the emotional effect this false-positive labeling has on families. A number of studies from other newborn-screening tests have shown that false-positive results can engender lasting anxiety and adversely affect the parent-child relationship. In addition, deUzcategiu and Yoshinga-Itano surveyed mothers immediately after their children had failed the newborn hearing screen and found that 20% to 50% of mothers reported feelings, such as anger, confusion, depression, frustration, shock, and sadness. However, it is still unknown how persistent or detrimental these feelings are. pediatrics.aappublications.org/cgi/content/full/106/1/e7

This article ultimately concludes that, with a reduction in the false positive rate, the benefits of UNHS outweigh the costs.

I'm trying to develop a good set of web pages on diagnostic testing, but there is a lot of work that I need to do. I also offer a couple of training classes that discuss diagnostic tests.



Unnecessary diagnostic tests (October 25, 2004).

You would think that you can never have enough information about your health. Barring financial considerations, the more testing the better.

That actually is not true. In some situations, too many diagnostic tests are being run, and it hurts rather than helps the patient. American Medical News has an article about this:

  • Lab tests go under a critical microscope: Experts point out that good tests used badly can lead to bad medicine. Elliott VS. American Medical News, Nov. 1, 2004. www.ama-assn.org/amednews/2004/11/01/hlsd1101.htm

They offer several good examples.

  • Dialysis patients will often show abnormal results immediately post-dialysis, but these values almost always normalize without intervention.
  • A positive herpes test cannot easily distinguish between type 1 and type 2 herpes, but a positive result without drawing such distinctions could result in serious and unnecessary personal difficulties.
  • A slightly abnormal antinuclear antibody test may indicate nothing, but patients who surf the Internet may believe that they have lupus or another serious disease.

The article suggests ordering specific tests rather than an entire panel. Why include tests that you know provide no useful information but which might unduly increase anxiety in your patients?

Gina Kolata wrote an excellent article about the problems with unnecessary tests for the New York Times (Annual Physical Checkup May Be an Empty Ritual, August 12, 2003). Most of the tests done at your annual physical exam have no support in the literature, but patients still expect these tests. She has a marvelous story about a patient with a laundry list of tests that she wanted.

Even doctors who know all about the evidence-based guidelines for preventive medicine say they often compromise in the interest of keeping patients happy. Dr. John K. Min, an internist in Burlington, N.C., tells the story of a 72-year-old patient who came to him for her annual physical, knowing exactly what tests she wanted. She wanted a Pap test, but it would have been useless, Dr. Min said, because she had had a hysterectomy. She wanted a chest X-ray, an electrocardiogram. Not necessary, he told her, because it was unlikely that they would reveal a problem that needed treating before symptoms emerged. She left with just a few tests, including blood pressure and cholesterol. Dr. Min was proud of himself until about a week later, when the local paper published a letter from his patient - about him. "Socialized medicine has arrived," she wrote. Admitting defeat, he called her and offered her the tests she had wanted, on the house. She accepted, Dr. Min said, but after having the full physical exam, she never returned.

I discussed the problems with whole body scans, Pap smears for women without a cervix, and prostate specific antigen tests in earlier weblog entries.

The real reason that people do not appreciate the problems with too many diagnostic tests is that they do not understand that there are costs associated with a false positive finding: the preventable anxiety that a false positive test produces, the cost and risks associated with additional testing, and sometimes the unnecessary medical treatments given for those who are falsely labeled as being sick. When the prevalence of the disease being tested for is low, then this problem is magnified because the false positives greatly outnumber the false negatives.
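A small calculation shows how prevalence drives this imbalance. The numbers below are purely illustrative assumptions (a population of 100,000 and a test with 95% sensitivity and 95% specificity), not figures from any of the articles cited here, and the function name screening_counts is my own.

```python
def screening_counts(population, prevalence, sensitivity, specificity):
    """Expected numbers of true/false positives and negatives."""
    diseased = population * prevalence
    healthy = population - diseased
    true_pos = diseased * sensitivity
    false_neg = diseased - true_pos
    true_neg = healthy * specificity
    false_pos = healthy - true_neg
    return {"TP": true_pos, "FN": false_neg, "TN": true_neg, "FP": false_pos}

# Illustrative assumptions only: same test, two different prevalences.
for prevalence in (0.10, 0.001):
    c = screening_counts(100_000, prevalence, 0.95, 0.95)
    ppv = c["TP"] / (c["TP"] + c["FP"])   # share of positives that are real
    print(f"prevalence {prevalence:.1%}: FP {c['FP']:.0f}, FN {c['FN']:.0f}, "
          f"positive predictive value {ppv:.1%}")
```

With these assumed values, a prevalence of 10% gives about 4,500 false positives against 500 false negatives, while a prevalence of 0.1% gives about 4,995 false positives against only 5 false negatives, and fewer than 2% of the positive results are true positives.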

Category: Diagnostic testing