Stats
Supplemental readings for Stats #21 or Stats #24.
Here is some supplementary material about diagnostic testing that I have published recently in
my weblog.
A novel diagnostic test (January 26, 2006). Category:
Diagnostic testing
A recently published article on diagnosing cancer got a lot of press. The article
-
Diagnostic Accuracy of Canine Scent Detection in Early- and Late-Stage Lung and Breast
Cancers. McCulloch M, Jezierski T, Broffman M, Hubbard A, Turner K, Janecki T.
Integrative Cancer Therapies 2006: 5(1); 1-10.
noted that canines have an unusually sensitive sense of smell and might be able to
diagnose cancer by sniffing breath samples from human patients. This is rather intriguing,
since dogs have already been trained to locate explosives, cadavers, drugs, and so forth.
The researchers collected breath samples from 55 patients with lung cancer, 31 patients
with breast cancer, and 83 volunteers with no prior cancer history.
Eligible patients were men and women older than 18 years with a very recent
biopsy-confirmed conventional diagnosis of lung or breast cancer. We specifically
requested that recruitment centers refer patients as soon as possible following
definitive diagnosis so that breath sampling would not interfere with or delay planned
conventional treatment. As we suspected that chemotherapy treatment would change the
exhaled chemicals in cancer patients, we sought patients who had not yet undergone
chemotherapy treatment. As we also suspected that patients with more advanced disease,
and thus larger tumors, might be exhaling higher concentrations of the chemicals
associated with cancer cells and would therefore be more easily identified by the dogs,
we sought patients with any stage disease.
The collection of breath samples was quite simple.
For breath sampling, we obtained a cylindrical polypropylene organic vapor
sampling tube (Defencetek, Pretoria, South Africa). Each tube is open at either end, is
6 inches long, has an outer diameter of 1 inch, has an inner diameter of 0.75 inches,
and has removable end caps. A removable 2-inch-long insert of silicone oil-coated
polypropylene “wool” captures volatile organic compounds in exhaled breath as breath
passes through the tube. To collect breath samples, we asked donors to exhale 3 to 5
times through the tube. We then fitted the tubes with their end caps and sealed them in
ordinary grocery store Ziplock-style bags at room temperature between the time of breath
sampling and presentation to the dogs.
Each patient and control contributed multiple breath samples to the study, ranging from 4
to 18 samples per person.
The dogs had to be trained to recognize cancer samples, and in the training sessions, the
trainer had to be unblinded to the location of the cancer sample, so they could reward the
dogs when they identified the cancer samples correctly. The dogs were trained to indicate a
positive result by sitting down by the canister that had the cancer breath sample.
During phase 1 of training, the location of the cancer breath sample was known by
both experimenter and trainer (Table 2). One station contained a cancer breath sample,
and the remaining 4 stations contained blank sample tubes that had not been used in any
breath sampling. To encourage the dogs to seek out the exhaled chemicals associated with
cancer, we placed a piece of dog food in the station with the cancer breath sample and
covered the container with a piece of paper so the food would not be visible.
The second phase of training still used four blank canisters and food rewards in the
cancer breath sample canister.
During phase 2 of training, only the experimenter was aware of the location of the
cancer breath sample and apart from encouraging the dog with encouraging phrases such as
“go to work,” gave no “sit” or other verbal commands to the dog. Clicker signal by the
experimenter and subsequent food reward and praise by the trainer were given only after
the dog correctly indicated on the cancer breath sample. When the dog indicated
incorrectly on a control, the experimenter would not signal with the clicker and the
handler would remain silent, not give the dog any praise reward, and mildly rebuke the
dog by saying “no.” Samples used in phases 1 and 2 (contaminated with food scent) were
not used again.
The third phase of training was similar to the second, except there were no food rewards
in the canister with the cancer breath sample. After the dogs had performed sufficiently well
during the training session, they were evaluated in a single blind phase.
During the single-blinded canine scent-testing experiment, using samples
previously used in phase 3 of training, the level of challenge to the dogs was increased
by placing a cancer breath sample in 1 station and control subject breath samples in the
remaining 4 stations. Thus, dogs now had to distinguish cancer patient breath samples
from those of healthy controls. Furthermore, the handler was blinded to the location and
status of patient and control breath samples. Although the experimenter did know the
location and status of patient and control breath samples during the single-blinded
experiments, the possibility of the experimenter giving the dogs cues was minimized by
positioning the experimenter in an adjacent room, behind an opaque curtain that almost
completely covered the doorway between the training and observation rooms.
This was followed by a double blind phase, the phase used to evaluate sensitivity and
specificity.
We designed our double-blinded experiment so that each dog would have the opportunity
to sniff breath samples from each subject and each control. During the entire
double-blinded testing phase, all breath samples sniffed by dogs, for both cases and
controls, were from completely different subjects not previously encountered by the dogs
during training or single-blinded testing. Furthermore, all of these breath samples used
during double-blinded testing, for both cases and controls, contribute to the overall
results reported in Table 3. For each trial, we used a random number table to determine
the location of the sample being tested in the lineup.
All other methods were identical to the single-blinded testing phase, except that
we now (1) placed the target breath sample of interest, whether from patient or control,
within the lineup along with 4 other controls and (2) blinded both the experimenters and
dog handlers to the status of that target sample in the lineup. Whereas in the
single-blinded experiments only the dog handler was blinded to knowledge of the target
sample, in the double-blinded experiments, both handler and experimenter were blinded to
ensure that neither experimenters nor handlers could be giving any clues to the dogs.
Since the experimenters now no longer knew the status of the target breath sample, they
did not activate the clicker device after a sitting indication by the dog, and therefore
the handler did not reward the dog with any food. After being given the opportunity to
sniff and indicate on samples, the dog was simply led out of the room. Only after
leaving the training room was the dog acknowledged with the phrase “good work!” During
double-blinded testing, each tube was used a median of 20 times (x̄ = 32.35, SD = 24.46;
range, 4-99).
Blinding is very important in a trial like this because of the "Clever Hans" effect, which
is the ability of animals to pick up subtle and even subconscious nonverbal cues from the
people around them.
-
Clever Hans phenomenon.
Carroll RT, The Skeptic's Dictionary. Accessed on 2006-01-27.
[Excerpt] Clever Hans phenomenon: A form of involuntary and unconscious cuing.
The term refers to a horse (Kluge Hans, referred to in the literature as "Clever Hans")
who responded to questions requiring mathematical calculations by tapping his hoof. If
asked by his master, William Von Osten, what is the sum of 3 plus 2, the horse would
tap his hoof five times. It appeared the animal was responding to human language and
was capable of grasping mathematical concepts. It was 1891 when Von Osten began showing
Hans to the public. (Hans could also tell time and name people,* but we will restrict
our discussion of his amazing abilities to his mathematical skills.) It was eventually
discovered (in 1904) by Oskar Pfungst that the horse was responding to subtle physical
cues (ideomotor reaction) or as Ray Hyman puts it "Hans was responding to a simple,
involuntary postural adjustment by the questioner, which was his cue to start tapping,
and an unconscious, almost imperceptible head movement, which was his cue to stop"
(Hyman 1989: 425). skepdic.com/cleverhans.html
In the trials involving lung cancer patients, 708 of the 712 control canisters were
properly identified, and 564 of the 574 cancer canisters were identified. In the trials
involving breast cancer patients, 260 of the 275 control canisters were properly identified
and 110 of the 116 cancer canisters were identified.
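As a rough check on these counts, the crude proportions can be computed directly. The sketch below is only illustrative: it treats every canister as an independent observation, which is not how the authors analyzed the data (they adjusted for clustering on donor), so these figures need not match the estimates reported in the paper. The function and variable names are made up for this example.

```python
def proportion(correct, total):
    """Return correct / total as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Counts quoted in the text: (correctly identified, total canisters).
results = {
    "lung": {"cancer": (564, 574), "control": (708, 712)},
    "breast": {"cancer": (110, 116), "control": (260, 275)},
}


def sensitivity_specificity(counts):
    """Crude sensitivity (cancer canisters) and specificity (control canisters)."""
    sens = proportion(*counts["cancer"])
    spec = proportion(*counts["control"])
    return sens, spec


for site, counts in results.items():
    sens, spec = sensitivity_specificity(counts)
    print(f"{site}: sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```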
It is unclear how these results were tabulated. One possible method would be the
following: If the dog did not sit down at any canister, and the fifth canister was a control
breath sample, that trial was labeled a true negative. If the dog sat down at one of the four
control canisters or hesitated, that trial was labeled a false positive or false negative
depending on the contents of the fifth canister.
Another interpretation would be that if the dog sat down at any control canister, that
was considered a false positive for that canister, and failure to sit down at a control
canister was considered a true negative.
The wording of the paper seems to favor the latter interpretation:
The dogs’ response to each of the 5 samples sniffed was included in our analysis;
dogs were allowed the opportunity to visit each sample station and thus could have
potentially indicated every one of the samples in a trial, although in our experiments,
this never occurred. Dog handlers did not try to prevent dogs from visiting any
individual station. Therefore, since each individual sample station was considered as a
unit of analysis, the use of 4 control subject breath samples along with a cancer
patient sample in each experimental trial would not change sensitivity or specificity.
On the other hand, the number of control samples during the double blind phase was 987
compared to 690 cancer samples, and it is hard to reconcile these numbers with the fact that
at least four control samples were tested in each trial. The ratio of controls to cancers
should be at least four to one and probably closer to ten to one.
Because of the number of tests performed, individual patients were used multiple times in
the study and even individual breathing tubes were re-used many times.
During double-blinded testing, each tube was used a median of 20 times (x̄ = 32.35,
SD = 24.46; range, 4-99).
To account for this, the researchers used "general estimating equations (GEE) random
effects linear regression, with standard errors adjusted for clustering on donor." The
researchers re-analyzed the data including only the first dog-donor combination in each trial
of the double blind phase, and found comparable results.
The GEE estimates were also adjusted for current smoking status since there was more
smoking among the lung cancer volunteers than the control volunteers.
This research used a case-control design to estimate sensitivity and specificity, which is
acceptable for a "proof of concept" study, but the authors do discuss the problem of spectrum
bias in this research.
However, our specificity may be overestimated because we used only healthy
controls (rather than a broad spectrum of subjects that included, for example, those
with bronchitis or emphysema as controls for lung cancer or those with fibrocystic
breast disease or mastitis as controls for breast cancer). These questions could be
better understood by further study in a prospective cohort design that included both
cases and controls representing the full spectrum of disease severity seen in the
general population.
There are additional limitations to this research which the authors discuss at the end of
the article.
I will include this discussion in the
Chance Wiki when I
get the time.
07/08/2008.
An error slips through the peer review process (September 19, 2005).
Category: Diagnostic testing
A group of residents wanted me to look at an article because they were confused about the
calculation of the likelihood ratio. The numbers that they got were quite different from
those in the publication. It turns out that they were calculating things correctly, and did
not realize that the paper had several serious errors in some of the more fundamental
calculations of sensitivity and specificity.
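For readers who want to see how a likelihood ratio follows from sensitivity and specificity, here is a minimal sketch. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the paper, and the function name is my own.

```python
def likelihood_ratios(tp, fn, fp, tn):
    """Compute sensitivity, specificity, LR+ and LR- from a 2x2 table.

    tp, fn: diseased patients who test positive / negative
    fp, tn: healthy patients who test positive / negative
    """
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    false_pos_rate = fp / (fp + tn)  # same as 1 - specificity
    false_neg_rate = fn / (tp + fn)  # same as 1 - sensitivity
    # LR+ = sensitivity / (1 - specificity); infinite if there are no false positives.
    lr_pos = sens / false_pos_rate if false_pos_rate > 0 else float("inf")
    # LR- = (1 - sensitivity) / specificity; infinite if there are no true negatives.
    lr_neg = false_neg_rate / spec if spec > 0 else float("inf")
    return sens, spec, lr_pos, lr_neg


# Hypothetical counts, for illustration only.
sens, spec, lr_pos, lr_neg = likelihood_ratios(tp=90, fn=10, fp=20, tn=80)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
```

Because both ratios depend directly on sensitivity and specificity, an error in either one carries straight through to the likelihood ratio.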
Here is the paper they showed me
-
A clinical score to reduce unnecessary antibiotic use in patients with sore throat.
McIsaac WJ, White D, Tannenbaum D, Low DE. CMAJ 1998: 158(1); 75-83.
This paper developed a score to assign to patients who came in complaining of a sore
throat to see if they needed to be prescribed antibiotics. The scale was computed using the
following formula:

Although scores of -1 and 5 are theoretically possible, no one scored below zero or above
4. The paper suggests the following management strategy:

The results of this score were compared to the physician's subjective evaluation and to a
throat swab culture (the gold standard). There are several errors in the calculations of
sensitivity and specificity in this paper, but the most obvious one is the claim that:
Among children aged 3 to 14 years, there was no difference between the 2
approaches in the proportion receiving antibiotics or from whom throat swabs were
obtained, but significantly more cases of GAS infection would have been identified with
the score approach (96.9%) than with usual physician care (70.6%) (p < 0.05). Physician
specificity was higher, however (91.7% v. 67.2%) (p < 0.05). Among adults the
sensitivity of physician judgement and of the score approach were similar, but both
throat swab use (37.3% v. 26.4%) and antibiotic prescription (16.5% v. 3.4%) would have
been reduced with the score approach (p < 0.001).
This data is corroborated in Table 3, where the sensitivity for patients aged 3-14 years
is reported as 96.9% (31/32) and specificity as 94.3% (413/438). An excerpt from the table is
reproduced below.
 
The residents could not reproduce these numbers because they were looking instead at Table
4, a portion of which is reproduced below.
 
Can you spot the error in the sensitivity and specificity calculations?
07/08/2008.
The costs of a false positive test (March 1, 2005). Category:
Diagnostic testing
The New York Times had an excellent article on newborn screening tests.
-
Panel to Advise Testing Babies for 29 Diseases. Kolata G. The New York Times,
February 21, 2005.
Unfortunately, this article is no longer available online. But it discusses a recent push
to standardize and expand the screening tests for newborns to include 29 different diseases.
It seems like such an obvious thing to do: let's screen for these conditions, because the
more we know, the better we are able to care for these children.
Proponents say that the diseases are terrible and that an early diagnosis can be
lifesaving. When testing is not done, parents often end up in a medical odyssey to find
out what is wrong with their child. By the time the answer is in, it may be too late for
treatment to do much good.
Opponents, however, point out that false positive results may present more problems.
But opponents say that for all but about five or six of the conditions, it is not
known whether the treatments help or how often a baby will test positive but never show
signs of serious disease. There is a danger, they say, of children with mild versions of
illnesses being treated needlessly and aggressively for more serious forms and suffering
dire health consequences.
The article also offers a historical perspective.
The history of newborn screening, they say, is filled with cautionary tales.
"The majority of newborn screening tests have failed," said Dr. Norman Fost, a
professor of pediatrics and director of the program in medical ethics at the University
of Wisconsin. Over the years, Dr. Fost said, "thousands of normal kids have been
killed or gotten brain damage by screening tests and treatments that turned out to be
ineffective and very dangerous."
The article cites phenylketonuria (PKU) testing as an example. An infant with PKU cannot metabolize
phenylalanine, and the buildup of this amino acid can lead to serious neurological damage.
The treatment, a diet low in phenylalanine, is very effective, but only if the condition is
diagnosed early. The PKU testing done today is very good, but tests performed 45 years ago
had problems.
Back then, any infant who tested positive would be put on this special diet. When
phenylalanine is withdrawn from the diet of a healthy infant, that infant suffers from even
more serious neurological problems and can even die. Many infants who falsely tested positive
were put on this diet, and the harm to them outweighed the benefits of PKU screening. As
researchers learned more, they were able to refine the test to prevent most false positives,
but the damage had already been done.
An additional article about Universal Newborn Hearing Screening (UNHS),
This article ultimately concludes that, with a reduction in the false positive rate,
the benefits of UNHS outweigh the costs.
I'm trying to develop a good set of web pages on diagnostic
testing, but there is a lot of work that I need to do. I also offer a couple of training
classes that discuss diagnostic tests:
07/08/2008.
Unnecessary diagnostic tests (October 25, 2004).
You would think that you can never have enough information about your health. Barring
financial considerations, the more testing the better.
That actually is not true. In some situations, too many diagnostic tests are being run,
and it hurts rather than helps the patient. American Medical News has an article about this:
Lab tests go under a critical microscope: Experts point out that good tests used badly
can lead to bad medicine. Victoria Stagg Elliott. Nov. 1, 2004.
www.ama-assn.org/amednews/2004/11/01/hlsd1101.htm
They offer several good examples.
-
Dialysis patients will often show abnormal results immediately post-dialysis, but these
values almost always normalize without intervention.
-
A positive herpes test cannot easily distinguish between type 1 and type 2 herpes, but a
positive result without drawing such distinctions could result in serious and unnecessary
personal difficulties.
-
A slightly abnormal antinuclear antibody test may indicate nothing, but patients who surf
the Internet may believe that they have lupus or another serious disease.
The article suggests ordering specific tests rather than an entire panel. Why include
tests that you know provide no useful information but which might unduly increase anxiety in
your patients?
Gina Kolata wrote an excellent article about the problems with unnecessary tests for the
New York Times (Annual Physical Checkup May Be an Empty Ritual, August 12, 2003). Most of the
tests done at your annual physical exam have no support in the literature, but patients still
expect these tests. She has a marvelous story about a patient with a laundry list of tests
that she wanted.
Even doctors who know all about the evidence-based guidelines for preventive
medicine say they often compromise in the interest of keeping patients happy. Dr. John
K. Min, an internist in Burlington, N.C., tells the story of a 72-year-old patient who
came to him for her annual physical, knowing exactly what tests she wanted. She wanted a
Pap test, but it would have been useless, Dr. Min said, because she had had a
hysterectomy. She wanted a chest X-ray, an electrocardiogram. Not necessary, he told
her, because it was unlikely that they would reveal a problem that needed treating
before symptoms emerged. She left with just a few tests, including blood pressure and
cholesterol. Dr. Min was proud of himself until about a week later, when the local paper
published a letter from his patient - about him. "Socialized medicine has arrived," she
wrote. Admitting defeat, he called her and offered her the tests she had wanted, on the
house. She accepted, Dr. Min said, but after having the full physical exam, she never
returned.
I discussed the problems with whole body scans,
pap smears for women without a cervix, and
prostate specific antigen tests in earlier web log entries.
The real reason that people do not appreciate the problems with too many diagnostic tests
is that they do not understand that there are costs associated with a false positive finding:
the preventable anxiety that a false positive test produces, the cost and risks associated
with additional testing, and sometimes the unnecessary medical treatments given for those who
are falsely labeled as being sick. When the prevalence of the disease being tested for is
low, then this problem is magnified because the false positives greatly outnumber the false
negatives.
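The effect of low prevalence is easy to see with a little arithmetic. The sketch below uses a hypothetical test (95% sensitivity, 95% specificity) applied to 100,000 people with a disease prevalence of 0.1%. All of these numbers are assumptions chosen for illustration, not figures from any of the articles above.

```python
def expected_counts(n, prevalence, sensitivity, specificity):
    """Expected 2x2 table when a test is applied to n people."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity  # diseased and test positive
    fn = diseased - tp           # diseased but test negative
    tn = healthy * specificity   # healthy and test negative
    fp = healthy - tn            # healthy but test positive
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


def positive_predictive_value(counts):
    """Fraction of positive tests that come from truly diseased people."""
    return counts["TP"] / (counts["TP"] + counts["FP"])


# Hypothetical inputs, for illustration only.
counts = expected_counts(n=100_000, prevalence=0.001,
                         sensitivity=0.95, specificity=0.95)
ppv = positive_predictive_value(counts)
print({name: round(value) for name, value in counts.items()})
print(f"positive predictive value = {ppv:.3f}")
```

With these assumed inputs there are about 4,995 false positives against about 5 false negatives, and fewer than 2% of the positive results come from people who actually have the disease.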
07/08/2008. Category:
Diagnostic testing