Statistical Evidence. Chapter 1. Apples or oranges? Selection of the control group.

1.0 Introduction

Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left handed people die at an earlier age than right handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?

In each of these situations, you are making a comparison between a control group and a treatment/exposure group. I will use the terms treatment and exposure interchangeably throughout this book, though I will reserve treatment for those conditions which represent an effort to produce a beneficial result and exposure to represent a condition that is potentially harmful. You would call drinking water from a natural spring a treatment, but drinking water from a contaminated well an exposure. The distinction between treatment and exposure is not that critical though, and when I discuss a generic "treatment" in this book, feel free to substitute the word "exposure" and vice versa.

When you make such a comparison between a treatment group and a control group, you want a fair comparison. You want the control group to be identical to the treatment group in all respects, except for the treatment in question. You want an apples to apples comparison.

Covariate imbalance

Sometimes, however, you get an unfair comparison, an apples to oranges comparison. The control group differs on some important characteristics that might influence the outcome measure. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.

Women who take oral contraceptives appear to have a higher risk of cervical cancer. But covariate imbalance might be producing an artificial rise in cancer rates for this group. Women who take oral contraceptives behave, as a group, differently than other women. For example, women who take oral contraceptives have a larger number of pap smears. This is probably because these women visit their doctors more regularly in order to get their prescriptions refilled and therefore have more opportunities to be offered a pap smear. This difference could lead to an increase in the number of detected cancer cases. Perhaps the other women have just as much cancer, but it is more likely to remain undetected.

There are many other variables that influence the development of cervical cancer: age of first intercourse, number of sexual partners, use of condoms, and smoking habits. If women who take oral contraceptives differ in any of these lifestyle factors, then that might also produce a difference in cervical cancer rates.*

Case Study: Vitamin C and Cancer

Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples to oranges comparison. Ewan Cameron and Linus Pauling published an observational study of Vitamin C as a treatment for advanced cancer (Cameron 1976). For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).

Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital."

Ten years later, the Mayo Clinic (Moertel, et al 1985) conducted a randomized experiment which showed no statistically significant effect of Vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?

The first limitation of the Cameron and Pauling study was that all of their patients received Vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.

But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples versus oranges comparison because the initial prognosis was worse in the control group than in the treatment group. As Paul Rosenbaum says so well: one can say with total confidence, without reservation or caveat, that the prognosis of the patient who is already dead is not good. (page 4)

The prognosis of a patient with a diagnosis of terminal cancer is also not good, but at least a few of these patients will be misdiagnosed. The ones in the control group, the ones that entered the study clutching their death certificates, had no misdiagnosis.

Apples or oranges? What to look for.

When the treatment group is apples and the control group is oranges, you can't make a fair comparison. To ensure that the researchers made an apples to apples comparison, ask the following questions:

Did the authors use randomization? In some studies, the researchers control who gets the new therapy and who gets the standard (control) therapy. When the researchers have this level of control, they almost always will randomize the choice. This type of study, a randomized study, is a very effective and very simple way to prevent covariate imbalance.

If randomization was not done, how were the patients selected? Several alternative approaches are available when the researchers have control of treatment assignment, but minimization is the only credible alternative. When researchers do not have control over treatment assignments, you have an observational study. The three major observational studies, cohort designs, case-control designs, and historical controls, all have weaknesses, but may represent the best available approach that is practical and ethical.

Did the authors use matching to prevent covariate imbalance? Matching is a method for selecting subjects that ensures a similar set of patients for the control group. A crossover design represents the ideal form of matching because each subject serves as his or her own control. Stratification ensures that broad demographic groups are equally represented in the treatment and control group.

Did the authors use statistical adjustments to control for covariate imbalance? Covariate adjustment uses statistical methods to try to correct for any existing imbalance. These methods work well, but only on variables that can be measured easily and accurately.

1.1 Randomly selected controls

Randomization is the assignment of treatment groups through the use of a random device, like the flip of a coin or the roll of a die, or numbers randomly generated by a computer. Randomization is not always possible, practical, or ethical. But when you can use randomization, it greatly adds to the credibility of the research study.

Example: In a study of treatments for osteoarthritis of the knee (Teekachunhatean 2004), 200 patients suffering from osteoarthritis of the knee were randomly assigned to receive either DJW (Duhuo Jisheng Wan, a Chinese herbal remedy) and a placebo for diclofenac or diclofenac and a placebo for DJW. Patients were evaluated on a visual analog scale (VAS) score that assessed pain and stiffness, Lequesne's functional index, time for climbing up ten steps, as well as physician's and patients' overall opinions on improvement.

Example: In a study of critical appraisal skills training (Taylor 2004), 145 health professionals were randomly assigned to either receive immediate training in a half-day critical appraisal skills workshop or be placed on a waiting list for a future workshop. These subjects were evaluated on knowledge, attitudes, and behaviors relating to evidence based medicine.

In both studies the researchers decided who got what. This is a hallmark of a randomized design and it only can occur when the patients and/or their doctors have no say in the assignment. This is an incredible gift that patients in a research study offer you. They sacrifice their ability to choose among two or more therapies and instead let that choice be decided by the flip of a coin.

How does randomization help?

Randomization helps ensure that both measurable and immeasurable factors are balanced out across both the standard and the new therapy, assuring a fair comparison. Used correctly, it also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.

There are situations where covariate imbalance can appear, even in a well randomized study (Roberts 1999). Just as you have no guarantee that a flip of 100 coins will yield exactly 50 heads and 50 tails, you have no guarantee that covariate imbalances cannot creep into a randomized study once in a while. This is not just a theoretical concern. One article (Mann 2002) argues that a difference in baseline stroke severity in a randomized trial of tPA produced an incorrect assertion of the effectiveness of this treatment.
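The coin flip analogy can be made concrete with a short calculation. The sketch below is an illustration only; the function name prob_exact_heads is my own and the coin is assumed to be fair. It computes the exact binomial probability of a perfectly even split in 100 flips, which is only about 8%.

```python
from math import comb


def prob_exact_heads(n_flips: int, n_heads: int) -> float:
    """Probability of exactly n_heads heads in n_flips flips of a fair coin."""
    # Number of ways to choose which flips are heads, divided by all outcomes.
    return comb(n_flips, n_heads) / 2 ** n_flips


p_even_split = prob_exact_heads(100, 50)
print(f"P(exactly 50 heads in 100 flips) = {p_even_split:.4f}")
```

A perfectly balanced result is the single most likely outcome, but it is still the exception rather than the rule. Randomization promises balance on average, not balance in every study.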

Randomization relies on the law of large numbers. With small sample sizes, covariate imbalance may still creep in. A study examining the probability of covariate imbalance (Hsu 1989) showed that total sample sizes less than 10 could have a 50% chance or higher of having a categorical covariate with levels twice as large in one group as in the other. This study also showed that total sample sizes of 40 or greater would have very little chance of such a serious imbalance, and a total of 20-40 subjects would be acceptable if there were only one or two important covariates.
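You can get a feel for how sample size affects the risk of imbalance with a small simulation. This is a rough sketch under my own assumptions, not a reproduction of the Hsu analysis: it uses a single binary covariate present in 30% of subjects, two equal-sized groups, and it flags a "serious imbalance" when one group has at least twice as many subjects with the covariate as the other. The function name and all parameter values are hypothetical.

```python
import random


def imbalance_probability(total_n, prevalence=0.3, n_sims=2000, seed=1):
    """Estimate how often simple randomization leaves a binary covariate
    at least twice as common in one group as in the other.

    Assumes total_n is even so the two groups are the same size.
    """
    rng = random.Random(seed)
    half = total_n // 2
    flagged = 0
    for _ in range(n_sims):
        # Each subject has the covariate with probability `prevalence`.
        has_trait = [rng.random() < prevalence for _ in range(total_n)]
        # Random allocation: shuffle, then the first half is the treatment group.
        rng.shuffle(has_trait)
        treated = sum(has_trait[:half])
        control = sum(has_trait[half:])
        low, high = sorted((treated, control))
        if high > low and high >= 2 * low:
            flagged += 1
    return flagged / n_sims


for n in (8, 20, 40, 100):
    print(n, imbalance_probability(n))
```

The exact percentages depend heavily on the assumed prevalence and on how you define a serious imbalance, so treat the output as a qualitative illustration: the estimated risk should fall steadily as the total sample size grows.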

A fishy story about randomization

I was told this story but have no way of verifying its accuracy. It's one of those stories that if it isn't true, it should be. A long, long, time ago, a research group wanted to examine a pollutant to find concentration levels that would kill fish. This research required that 100 fish be separated into five tanks, each of which would get a different level of the pollutant. The researchers caught the first twenty fish and put them in the first tank, then the next twenty fish and put them in a second tank and so forth. The last twenty fish went into the fifth tank. Each fish tank got a different concentration of the pollutant. When the research was done, the mortality was related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes being in the first tank filled and the best outcomes in the last tank filled. What happened was that the slow-moving, easy-to-catch fish (the weakest and most sickly fish) were all allocated to the first tank. The fast-moving, hard-to-catch fish (the strongest and healthiest fish) ended up in the last tank.

Failure to randomize in this study ruined the entire effort. The huge imbalance caused by putting the sickest fish in the first tank and the healthiest fish in the last tank overwhelmed any differences in mortality caused by varying levels of the pollutant.

The mechanics of randomization

Random assignment means that the choice is left to some device that is inherently random and unpredictable. A flip of a coin is one approach, but usually a table of random numbers or a random number generator is more practical. I can't think of anything more boring than flipping a coin 200 times.

The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each value in the schedule, and then sort the schedule by the random number.

Sorting by a random number is effectively the same thing as putting the list in a random order.
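The three steps described above can be sketched in a few lines of code. This is a minimal illustration, assuming two arms of equal size; the function name randomize_schedule, the arm labels, and the seed are my own choices, and a real trial would use whatever validated software its protocol specifies.

```python
import random


def randomize_schedule(n_per_arm, arms=("treatment", "control"), seed=None):
    """Return a randomly ordered allocation list with n_per_arm slots per arm."""
    rng = random.Random(seed)
    # Step 1: lay out the schedule systematically (all of arm 1, then arm 2).
    schedule = [arm for arm in arms for _ in range(n_per_arm)]
    # Step 2: generate a random number for each entry in the schedule.
    keyed = [(rng.random(), arm) for arm in schedule]
    # Step 3: sort the schedule by the random number.
    keyed.sort(key=lambda pair: pair[0])
    return [arm for _, arm in keyed]


allocation = randomize_schedule(n_per_arm=10, seed=2024)
for patient_number, arm in enumerate(allocation, start=1):
    print(patient_number, arm)
```

Because the schedule starts with exactly ten slots per arm, the final list always has equal group sizes. Only the order is random. Fixing the seed makes the list reproducible, which is useful for auditing but is also one reason the list needs to be kept away from the people recruiting patients.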

Concealing the randomization list

Another important aspect of randomization is concealed allocation, which is withholding the randomization list from those involved with recruiting subjects. This concealment occurs until after subjects agree to participate and the recruiter determines that the patient is eligible for the study. Only then is a sealed envelope opened that reveals the treatment status. Concealed allocation can also be done through a special phone number that the doctor calls to discover the treatment status.

Please note that concealing the randomization list is not the same as blinding the study (a topic I discuss later in this book). Certain treatments, such as surgery, cannot be blinded but the allocation list can still be concealed. Consider, for example, a randomized trial comparing laparoscopic surgery to traditional surgery. After the fact, the patient can tell by the size of the scar what type of surgery they received. But the choice as to what type of surgery the patient receives could be made as the patient is being sedated. There is an example of a research study where a sterilized coin was flipped in the operating room to decide which surgery would be used.

If the randomization list is not concealed, doctors have the ability to consciously or unconsciously influence the composition of the groups. They can do this by applying exclusion criteria differentially or by delaying entry of a certain healthier (or unhealthier) subject so he/she gets into the "desirable" group. Unconcealed allocation schemes show an average bias of 30-40% (Schulz 1996).

There are many stories of physicians who have tried and succeeded in recruiting a patient into a preferred group. If the treatment allocation is hidden in sealed envelopes, they can hold it up to a strong light. If the sealed envelopes are not sequentially numbered, they can open several envelopes at once. If the allocation is controlled by a central operator, they can call and ask for the allocation of several patients at once.

When a doctor has an overt preference to enroll a patient into one group over another, it raises ethical issues and perhaps the doctor should not be participating in the trial. You should only participate in a research study if you believe there is genuine uncertainty about whether the new therapy or the standard therapy is better. If not, you have no business participating in a study where some of your patients will be randomized to a treatment that you consider inferior. Unfortunately, some doctors will continue to participate in these trials but will try to skew the enrollment of some or all of the patients towards a favored therapy.

Concealed allocation only makes sense for a truly randomized study. If patients are assigned in an alternating fashion, concealed allocation is like buying a fancy burglar alarm and leaving the front door wide open. As you will see in the next section, alternating assignment is a bad idea, but it is even worse because the doctors will immediately recognize which group the next patient is going to be allocated to. This makes it easy for them to preferentially recruit to a specific treatment if they want to.

Ethical and practical constraints on randomization

There are many situations where randomization is not practical or possible. Sometimes patients have a strong preference for one particular treatment and would not consider the possibility of being randomized into a different treatment. Surgery is one area with strong patient preferences especially for newer approaches like laparoscopic surgery (Lefering 2003).

Randomization is also problematic for interventions that are already known to be effective. While further research would help better define these advantages, you can't ask half of your patients to sacrifice the benefits of the new intervention. A good example of this is breast feeding, which has a whole host of positive effects.** There is still on-going research to identify and better quantify these and other benefits, but almost none of this research is randomized (Kramer 2002 is a notable exception). Some non-randomized studies of the relationship between breastfeeding and intelligence have failed to account for the fact that breastfeeding mothers tend to be better educated and have higher socioeconomic status, and that their babies tend to grow up in an environment that has greater overall levels of stimulation (Jain 2002). Still, it would be unethical to ask a random half of new mothers to sacrifice the benefits of breast feeding. While this sometimes leads to limitations on what you can infer from these studies, that's the price you pay to live in an ethical society.

Randomization also does not work when you are studying noxious agents, like second hand cigarette smoke or noisy workplaces. It would be unethical to deliberately expose people to any of these agents, so we have to collect data on those people who are unavoidably exposed to these things.

Sometimes, the sample sizes required or the duration of the study make it difficult to use randomization. Diseases like cancer that have a long latency period are especially hard to study with a randomized design.

Retrospective studies, studies where the outcome of interest has already occurred and/or you are looking at factors in the past that might have caused this outcome, are also impossible to randomize, unless you have a time machine. See Leibovici 2001 for an amusing exception to this rule, though.

Sometimes, the groups being studied existed prior to the start of the research. Genetic conditions like Down's syndrome cannot be randomly assigned to half of the patients in your study. I like to think of these situations as cases where God does the randomization.

Sometimes researchers just do not want to go to the effort of randomizing. If you assign the treatment or therapy, rather than letting the patients and their doctors choose, you have to expend a lot of energy. Is it worth the effort? It is usually faster and cheaper to use existing non-randomized databases. You get a lot larger sample size for your expenditure. Depending on the situation, that might be enough to counterbalance the advantages of randomization.

A non-randomized study might also be a helpful prelude in the planning of an expensive randomized trial. The non-randomized trial would help you better understand and prepare for the resource requirements and familiarize your staff with the mechanics of treating and evaluating your research subjects.

Randomization: What to look for

If a study is randomized, look for the following features:

1.2 Variations on randomization

There are three variations on randomization where the researchers have control over treatment assignment, but they use something other than a table of random numbers for the assignment. The first approach, minimization, is a credible and reasonable choice, but the other two approaches, alternating assignment and haphazard assignment, do not have much to recommend them.

Minimization.

An alternative, when the researchers have sufficient control, is to allocate the assignments so that at each step, the covariate imbalance is minimized. So if the treatment group has a slight surplus of older patients and the next patient to join the study is also older than average, then that patient would be assigned to the control group so as to reduce the age discrepancy.
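The logic of minimization can be sketched as follows. This is a deliberately simplified illustration under my own assumptions: two arms, categorical factors only, equal weight for every factor, and a purely random tie-break. The function name, the data layout, and the age_group factor are hypothetical, and minimization schemes used in practice often add weights or a random element to each assignment.

```python
import random

ARMS = ("treatment", "control")


def minimization_assign(new_patient, enrolled, factors, rng):
    """Pick the arm that leaves the smallest total imbalance.

    new_patient: dict mapping factor name -> level for the incoming patient.
    enrolled: list of (patient_dict, arm) pairs already in the study.
    factors: names of the covariates to balance.
    """
    imbalance = {}
    for candidate in ARMS:
        total = 0
        for factor in factors:
            level = new_patient[factor]
            # Count enrolled patients who share this level, by arm.
            counts = {arm: 0 for arm in ARMS}
            for patient, arm in enrolled:
                if patient[factor] == level:
                    counts[arm] += 1
            # Hypothetically place the new patient in the candidate arm.
            counts[candidate] += 1
            total += abs(counts["treatment"] - counts["control"])
        imbalance[candidate] = total
    best = min(imbalance.values())
    tied = [arm for arm in ARMS if imbalance[arm] == best]
    return rng.choice(tied)  # random choice only when the arms are tied


# The treatment group already has a surplus of older patients.
enrolled = [({"age_group": "older"}, "treatment")] * 3
enrolled += [({"age_group": "older"}, "control")]
next_arm = minimization_assign(
    {"age_group": "older"}, enrolled, ["age_group"], random.Random(0)
)
print(next_arm)  # control
```

Notice that the assignment depends on who is already enrolled, which is why the allocation cannot simply be printed as a list ahead of time.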

Example: In a study of behavioral counseling (Steptoe 1999), twenty general practices were allocated either to use behavioral counseling based on the stages of change model for all their patients, or to offer no counseling other than their current standard of care. These practices were assigned using minimization to ensure balance on three factors: the degree of underprivileged patients being served, the patient to nurse ratio of the practice, and fund holding status.

Minimization is a good approach if there are one or two covariates which are especially important and which are easily measured at the start of the study. It will perform better than randomization on those factors, although there is no guarantee of covariate balance for other covariates not used in the minimization. Minimization also cannot control for unmeasured covariates.

There is more effort required in setting up a study with minimization. You need a computer to be available at the time and location of the recruitment of each patient because you can't just print a list ahead of time. Another difficulty is that minimization is open to possible abuse because doctors might be able to predict what the next assignment would be.

Alternating assignments.

Another approach used in place of randomization is to alternate the assignment, so that every even-numbered patient is in the treatment group and every odd-numbered patient is in the control group.

Alternate assignment was popular in trials before World War II; it was felt that researchers would not understand and not tolerate randomization (Yoshioka 1998).

Example: In a study of patients with Cystic Fibrosis (Homnick 1999), the first patient was randomly assigned either manual chest physiotherapy, or a flutter device to treat acute pulmonary exacerbation. After the first patient, each additional patient was assigned to the alternate approach.

Example: In a study of patients with penetrating eye injuries (Lakits 1998), patients were assigned alternately to either helical computed tomography or conventional computed tomography. Images were assessed for the ability to detect and accurately localize foreign bodies.

Alternating assignment seems on the surface to be a good approach, but it can sometimes lead to trouble. This is especially true when one patient has a direct or indirect influence on the next patient. You may have seen this level of influence if you grow vegetables in a garden. If you have a row of cabbages, for example, you will often see a pattern of big cabbage, little cabbage, big cabbage, little cabbage, etc. What happens, usually if the cabbages are planted a bit too closely, is that one of the cabbages will grow just a bit faster at first. It will extend into the neighboring cabbage's territory, stealing some of the nutrients and water, and thus growing even faster at the expense of the neighbor. If you assigned a fertilizer to every other cabbage, you would probably see an artificial difference because of the alternating pattern in growth within a row.

This alternating pattern can also occur in medicine. Consider, for example, a study of how much time doctors spend with their patients. If the first patient takes longer than expected, the doctor will probably rush a bit with the second patient in order to keep from falling further behind schedule. On the other hand, if the first patient finishes quickly, then the doctor will feel more relaxed and might tend to take a bit more time with the next patient.

In some situations, alternating assignment would be tolerable, but there is no good reason to prefer this over random assignment. You should be skeptical of this approach because studies with alternating assignment will tend, on average, to overstate the effectiveness of a new therapy by 15% (Colditz 1989).

Haphazard assignment

Another choice that researchers will make is to base assignments on some arbitrary value. Often it is the evenness/oddness of the arbitrary number that determines the treatment assignment. For example, patients born on days which are even numbers would be assigned to the treatment group and those born on odd days would be assigned to the control group. Some months have more odd days than even days (actually my life seems to have more than its fair share of odd days). This is a nitpick, but more importantly, an arbitrary or haphazard number is never going to be as good as a purely random number. The haphazard assignment will always cast a shadow of doubt over the research study. This is a shame, because almost every study with haphazard assignment could have been run as a randomized study with just a little more fuss.
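The nitpick about odd and even days is easy to check. The sketch below is an illustration only; the function name is my own. It uses Python's standard calendar module to count how many days of a given year fall on odd and even days of the month.

```python
import calendar


def odd_even_day_counts(year):
    """Count the days in a year that fall on odd and even days of the month."""
    odd = even = 0
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        # Days 1, 3, 5, ... are odd; days 2, 4, 6, ... are even.
        odd += (days_in_month + 1) // 2
        even += days_in_month // 2
    return odd, even


odd_days, even_days = odd_even_day_counts(2023)
print(odd_days, even_days)  # 186 179
```

In a non-leap year there are 186 odd days and 179 even days, because each of the seven 31-day months has one extra odd day. An assignment rule based on odd and even dates therefore tilts group sizes slightly even before any other source of bias enters.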

Example: In a study of heparinized saline to maintain the patency of patient catheters (Kulkarni 1994), patients admitted on odd-numbered dates received heparinized saline and patients admitted on even-numbered days received normal saline.

Example: In a study of supplemental oxygen for the treatment of stroke (Ronning 1999), patients born on even days were assigned to the supplemental oxygen group and patients born on odd days were assigned to the control group.

Example: In a study of interview methods for measuring risk behavior in injecting drug users (Des Jarlais 1999), patients were assigned either to a face-to-face interview or to audio-computer-assisted self-interviewing, depending on which week it was. The interview approach alternated from week to week. The patients were assessed to see if reporting of HIV risk behaviors differed between the interview methods.

In some situations, haphazard assignment might be tolerable, but there is no good reason to use this approach. The first study mentioned above was excluded from a meta-analysis of heparinized saline (Randolph 1998) because the reviewers felt the quality level was too low.

Variations on randomized studies. What to look for.

When a study was not randomized, look for the following features:

For a study using minimization:

For studies using alternating assignments or haphazard assignments:

1.3 Non-randomized studies

As mentioned earlier, there are many situations where randomization is not ethical, practical, or possible. Sometimes, researchers could not in good conscience assign a dangerous exposure randomly to half of their patients. Sometimes researchers do not have the resources to properly randomize patients. Sometimes patients and/or their physicians will select which therapy they receive. Sometimes the treatment or exposure variable represents a group that existed prior to the start of the research.

In these situations where randomization is not possible, you are looking at an observational study. There are four major flavors of observational studies: cohort studies, case control studies, cross sectional studies, and historical control studies.

The cohort study

In a cohort study, a group of patients has a certain exposure or condition. They are compared to a group of patients without that exposure or condition. Does the exposed cohort differ from the unexposed cohort on an outcome of interest?

Example: In a study of suicide among Swedish men in the Swedish military service conscription register (Gunnell 2005), 987,308 men registered between 1968 and 1994 were divided into nine groups on the basis of four intelligence tests. These men were also linked to a Swedish cause of death register which identified a total of 2,811 suicides among these men. For each of the four intelligence tests, men scoring lower tended to have a higher rate of suicide.

Example: In a study of psychotic symptoms in young people (Henquet 2005), a sample of young adults aged 14-24 years were divided into a group of 320 with admitted use of cannabis and a group of 2,117 who did not admit to cannabis use. Both groups were followed four years later for psychotic symptoms.

Cohort studies are intuitively appealing and selection of a control group is usually not too difficult. You have to be very wary of covariate imbalance, but other observational designs are likely to have even more problems. Don't worry about every possible covariate imbalance. You should look for large imbalances, especially for covariates which are closely related to the outcome variable.

When you are studying a very rare outcome, the sample size may have to be extremely large. As a rough rule of thumb, you need to observe 25 to 50 outcomes in each group in order to have a reasonable level of precision. So when a condition occurs only once in every thousand patients, a cohort study would require tens of thousands of patients.
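The arithmetic behind this rule of thumb is simple enough to sketch. The function below is my own illustration; it assumes the outcome rate is the same "one in N" figure in every group and ignores dropouts and length of follow-up.

```python
def patients_needed_per_group(target_outcomes, one_in_n):
    """Patients per group needed to expect target_outcomes events
    when the outcome occurs once in every one_in_n patients."""
    return target_outcomes * one_in_n


low = patients_needed_per_group(25, 1000)
high = patients_needed_per_group(50, 1000)
print(f"{low:,} to {high:,} patients per group")  # 25,000 to 50,000
```

With two groups, the total comes to somewhere between 50,000 and 100,000 patients, which is why rare outcomes push researchers toward other designs.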

You want to avoid "leaky groups" in a cohort design. If the exposure group includes some unexposed patients and the control group includes some exposed patients, then any effect you are trying to detect will be diluted. Be especially aware of situations where one group is more leaky than the other.

For example, many studies will classify people into various levels of caffeine exposure on the basis of how much coffee they drink. Although coffee is the major source of caffeine for most people, failure to ask about other sources of caffeine consumption can lead to serious errors. A rabid Diet Coke drinker might mistakenly be classified into the low caffeine consumption group (Brown 2001).

Dietary studies will sometimes rely on household food surveys, but these need adjustment for the varying consumption of individual family members. For example, within the same family, males (especially boys aged 11-17 years) will have higher average intakes of calories and nutrients (Nelson 1986).

The case control study

A case control study selects patients on the basis of an outcome, such as development of breast cancer, and compares them to a group of patients without that outcome. Do the cases differ from the controls in some exposures?

Example: In a study of asthma deaths (Anderson 2005), researchers selected 532 patients who died between 1994 and 1998 with asthma mentioned in part I of the death certificate. For each asthma death, a similar asthma admission (without death) was identified at the same hospital, with a similar admission date and a similar age.

Example: In a study of vascular dementia (Chan Carusone 2004), researchers selected 28 patients with vascular dementia who were enrolled in the Geriatric Clinic at Henderson Hospital in Hamilton, Ontario between July 1999 and October 2001. They also selected controls from a list of all caregivers at that clinic, regardless of the diagnosis of their spouse or family member, as long as the caregiver did not have any signs of dementia or stroke. Caregivers were matched by age (within 5 years) and sex. The researchers tested both cases and controls for C. pneumoniae.

A case-control study is very efficient in studying rare diseases. With this design, you round up all of the limited number of cases of the disease and then find a comparable control group. By contrast, a cohort design has to round up far more exposures to ensure that a handful of them will develop the rare disease.

Case-control studies do not perform well when you are evaluating a diagnostic test. They are easy to set up, because you have a group of patients with the disease and you estimate the probability of a positive result for the diagnostic test in this group (sensitivity). You also have a control group and you estimate the probability of a negative result for the diagnostic test in this group (specificity). Unfortunately, the case control design usually has a collection of very obviously diseased patients among the cases and very obviously healthy patients among the controls. This is an example of spectrum bias (Ransohoff 1978), the lack of patients in the ambiguous middle of the spectrum. A study with spectrum bias will often overstate the sensitivity and specificity of a diagnostic test.
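The two quantities defined above are simple proportions, as the following sketch shows. The counts are made up purely for illustration and do not come from any study cited in this chapter.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of diseased patients (cases) who test positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of disease-free patients (controls) who test negative."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts: 100 cases and 100 controls.
sens = sensitivity(true_positives=90, false_negatives=10)
spec = specificity(true_negatives=80, false_positives=20)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Both numbers depend entirely on who ends up in the case and control groups. If the cases are all obviously diseased, fewer of them will be missed by the test, and the sensitivity estimate will look better than it would among the patients a doctor actually needs the test for.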

In a study of the rapid dipstick test for urinary tract infection (Lachs 1992), the sensitivity of the test was very good (92%) when restricted to a sample of patients with obvious signs of infection, but was poor (56%) when patients with more subtle manifestations of the disease were evaluated.

The case-control study is always retrospective because the outcome in a case control study has already occurred. Retrospective studies usually have more problems with data quality because our memory is not always perfect. What's worse is that sometimes the ability to remember is sharply influenced by the outcome being studied. People who experience a tragic event like a miscarriage will have a strong desire to try to understand why this has happened and will search their pasts for risk factors that have been highly publicized in the press (Bryant 1989). They don't make things up, but the problem is that the people in the control group only seem to remember about half the things that have happened in their past. This selective underreporting in the control group is known as recall bias and it can lead to some seriously faulty findings.

If you have "leaky groups" in a case-control design, this can cause problems also. Do some of the disease outcomes get left out of the cases? It might be harder, for example, to identify the less serious examples of disease. Patients with milder forms of Alzheimer's disease may not bother to seek out help. Only when the disease progresses enough to interfere with these patients' ability to live and function independently will you encounter such patients. Watch out also for situations where healthy people or people with the incorrect disease are accidentally classified as cases. You can avoid problems with leaky groups if there is some type of registry that allows the researchers to identify every possible case.

The other major problem with this type of study is that it is so hard to find a good control group. You want to find controls that are identical to the cases in all aspects except for the outcome itself. When there is a roster of all potentially eligible subjects (subjects who would be classified as cases if they developed the disease), then selection of a good quality control group is easy (Wacholder 1995). Most studies would not have such a roster. In this case, the controls are often patients admitted to the hospital for outcomes unrelated to the study. So if cases represent newly diagnosed lung cancer, then the controls might be patients admitted for a bone fracture. Other times, you might ask the case to bring a friend with them or to identify a relative.

Selection of controls in a case-control study is difficult enough, but you also have to worry about the selection of the cases. Do you select incident cases (for example all breast cancer patients newly diagnosed during a given time frame) or prevalent cases (for example, all breast cancer patients who are alive during a given time frame)?

Selecting prevalent cases can lead to a very different answer than selecting incident cases. The probability of finding a case in a given time frame is related to mortality risk. Those patients who have a mild form of disease and survive for a relatively long time have a good chance of being around on the date that you go looking for them. Those patients who die quickly are unlikely to be around on the date that you go looking for them. A hypothetical example (Grimes 2002) involves a study of the relationship between snow shoveling and heart attacks. If such a study were done in a hospital setting, it would miss all the cases that died in their driveways. In general, selection of prevalent cases will lead to the selection of the milder and less rapidly fatal forms of the disease. A more detailed discussion of prevalence and incidence appears in Chapter 6.

Finally, the case-control design just does not sit well with your intuition. You are trying to find factors that cause an outcome, yet you are sampling from the effects, while a cohort design samples from the causes. Don't let this bother you too much, though. The mathematics that justify the case-control design were developed half a century ago (Cornfield 1951) and careful use of the case-control design has helped answer important clinical questions which could not have been answered by other research designs. Case-control designs, for example, established the use of aspirin as a cause of Reye's syndrome (Monto 1999). It's hard to imagine how a randomized trial for Reye's syndrome could have been done, because you would have to tell parents that you suspected, but were not quite sure, that giving an aspirin to a feverish child might lead to some pretty bad outcomes. So would you mind terribly if we recruited your son/daughter to participate in a trial where there is a 50% chance that they will get this possibly harmful substance?

The cross-sectional study

In contrast to the cohort and the case-control design, the cross-sectional study*** selects on the basis of neither exposure nor outcome. With the cross-sectional design, you select a single group of patients and simultaneously assess both their exposure variables and their outcome variables.

Example:  In a study of intimate partner violence (Malcoe 2004), 312 Native American women attending a tribally operated clinic filled out a survey form. The survey included a modified Conflict Tactics Scale to assess whether the women experienced verbal or psychological aggression, or physical or sexual assault. The survey also asked about educational attainment, employment status, receipt of food stamps, and other questions to help determine their socioeconomic status. Since both the outcome (intimate partner violence) and the exposure (socioeconomic status) were determined at the same time, this represents a cross-sectional survey.

Example:  In a study of respiratory problems (Salo 2004), 5,051 seventh grade students in Wuhan, China completed a self-administered questionnaire. These students were classified according to six respiratory outcomes (wheezing with colds, wheezing without colds, bringing up phlegm with colds, bringing up phlegm without colds, coughing with colds, coughing without colds) and two exposure variables (coal burning for cooking and cleaning, and smoking in the home). Students were not randomly assigned to an exposure so this is an observational study. Both the outcome variables and the exposure variables were assessed at a single point in time, so this represents a cross-sectional study.

Since there is no separation in time between assessment of exposure and assessment of outcome, you often cannot determine which came first. This loss of temporality makes it difficult to infer a cause and effect relationship. A hypothetical example of patient height (Mann 2003) describes how a cross-sectional study might notice a negative association between height and age. Could this be because people shrink as they age, or perhaps successive generations of people are taller because of the improvements in nutrition, or perhaps taller people just die earlier? With a cross-sectional study, you cannot easily disentangle these alternate explanations.

Be cautious about leaky groups again. Will the selection process in a cross-sectional study correctly identify exposures and outcomes? In particular, are patients with more serious illnesses easier/harder to capture in the cross-sectional study than patients with milder forms of the illness?

Cross-sectional studies are fast, though, as you don't have to wait around to see what happens to the patients. These studies also allow you to easily explore relationships between multiple exposure variables and/or multiple outcome variables. But unlike the cohort design, which is useful for rare exposures, or the case-control design, which is useful for rare outcomes, the cross-sectional study is only effective if both the exposure and the outcome are relatively common events.

In general, the cross-sectional study is more useful as an exploratory tool, and can lead to the preparation of more definitive research studies with more rigorous designs.

The historical controls study

In a historical controls study, researchers will assign all of the research subjects to the new therapy. The outcomes of these subjects are compared to historical records representing the standard therapy.

Example:  In a study of the rapid parathyroid hormone test (Johnson 2001), 49 patients undergoing parathyroidectomy received the rapid test. These patients were compared to 55 patients undergoing the same procedure before the rapid test was available. This is an observational study because the calendar, not the researchers, determined which test was applied. This particular observational study is a historical controls design because the control group represents patients tested before the availability of the rapid test.

The very nature of a historical controls study guarantees that there will be a major covariate imbalance between the two groups. Thus, you have to consider any factors that have changed over time that might be related to the outcome. To what extent might these factors affect the outcome differentially? For the most part, historical controls are considered one of the weakest forms of evidence. The one exception is when a disease has close to 100% mortality. In that situation, there is no need for a concurrent control group, since any therapy that is remotely effective can readily be detected. Even in this situation, you want to be sure there is a biological basis for the treatment and that the disease group is homogeneous.

Non-randomized studies.  What to look for.

For studies using a cohort design:

For studies using a case-control design:

For studies using a cross-sectional design:

For studies using a historical controls design:

For all studies:

1.4 Preventing covariate imbalance before it occurs

To ensure an apples to apples comparison, researchers will often use matching. Matching is the systematic selection, for every subject in the treatment/exposure group, of a control subject with similar characteristics. For example, in a study of fetal exposure to cocaine, you would select infants born to a mother who abused cocaine during pregnancy for your exposure group. For every such infant, you would select for your control group an infant who was unexposed to cocaine in utero, but who had the same sex, race, and socio-economic status.

Example: In a study of home versus hospital delivery (Ackerman-Liebrich 1996), 489 women who planned to deliver their babies at home were matched with women who planned to deliver at the hospital. Matching was based on age category (5 categories), parity category (3 categories), category of gynecological and obstetric history (24 categories or none), category of medical history (12 categories or none), social class (5 categories), and nationality. Because the matching criteria were so elaborate, they were only able to find a matched hospital delivery for about half of their home deliveries.

Matching will prevent covariate imbalance for those variables used in matching. It will also reduce covariate imbalance for any variables closely related to the matching variables. It will not, however, protect against all covariate imbalance, especially for those covariates that are difficult to measure.

Matching often presents difficult logistical issues, because a matching control subject may not always be available. The logistics are especially difficult when there are several matching variables and when the pool of control subjects that you can draw from is not substantially larger than the pool of treatment/exposed subjects.
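To make the logistics concrete, here is a minimal sketch of exact one-to-one matching in Python. It assumes every matching variable is categorical and that each control can be used only once. The function name, the record layout, and the subjects themselves are all hypothetical, and real studies often use more flexible rules (for example, age within a few years).

```python
def match_controls(exposed, controls, keys):
    """Pair each exposed subject with one unused control that has
    identical values on every matching variable in `keys`."""
    # Group the available controls by their matching profile.
    pool = {}
    for c in controls:
        pool.setdefault(tuple(c[k] for k in keys), []).append(c)

    pairs, unmatched = [], []
    for e in exposed:
        candidates = pool.get(tuple(e[k] for k in keys), [])
        if candidates:
            pairs.append((e, candidates.pop(0)))  # each control is used once
        else:
            unmatched.append(e)  # no suitable control is left in the pool
    return pairs, unmatched


# Hypothetical records, for illustration only.
exposed = [
    {"id": "E1", "sex": "F", "ses": "low"},
    {"id": "E2", "sex": "M", "ses": "high"},
    {"id": "E3", "sex": "F", "ses": "low"},
]
controls = [
    {"id": "C1", "sex": "F", "ses": "low"},
    {"id": "C2", "sex": "M", "ses": "low"},
    {"id": "C3", "sex": "M", "ses": "high"},
]

pairs, unmatched = match_controls(exposed, controls, keys=("sex", "ses"))
print([(e["id"], c["id"]) for e, c in pairs])
print([e["id"] for e in unmatched])
```

In this toy example the third exposed subject goes unmatched, even though the pool holds as many controls as there are exposed subjects. The only control with the right profile has already been used, which is why a small control pool makes matching so difficult.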

Matching is usually reserved for those variables that are known to be highly predictive of the outcome measure. In a cancer study, for example, matching is usually done on smoking. Many neonatology studies will match on gestational age.

Matching in a case control design

When you are selecting patients on the basis of disease and looking back at what exposure might have caused the disease, selection of matching control patients (patients without disease) can sometimes be tricky. You need to find a control that is similar to the case, except for the disease of interest. There are several possibilities, but none of them work perfectly.

Example: In a study of early onset myocardial infarction (Danesh 1999), 1,122 survivors of heart attacks between the ages of 30 and 49 were matched with people of the same age and gender who did not have heart attacks. These controls were recruited from a pool of subjects related to the cases. A second analysis used 510 survivors and their siblings, if the sibling was the same sex and within five years of age. All of the cases and the controls had blood tests to look for Helicobacter pylori infection, which was more commonly found in the cases than the controls.

Example: In a study of patients who leave a pediatric emergency department without being seen (Goldman 2005), patients who left were matched with the next two names on an alphabetical list of patients who visited on the same day and who had the same age (within one year), and the same sex. There was a large pool of controls to draw from, since patients who left comprised only 289 of the 11,087 total visitors.

Matching in a randomized design

In some randomized studies, matching will be used as well. Partly, this is a recognition that randomization will not totally remove covariate imbalance, just as a flip of 100 coins will not always result in exactly 50 heads and 50 tails. More importantly, however, matching in a randomized study will provide extra precision. Matching creates pairs of subjects who will have greater homogeneity and therefore less variability.
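The coin flip analogy is easy to check with the binomial formula. The sketch below (function name hypothetical) computes the chance that 100 fair coin flips give exactly 50 heads, which turns out to be only about 8%. It also computes the chance of landing anywhere from 45 to 55 heads.

```python
from math import comb


def prob_heads(n_flips, n_heads):
    """Probability of exactly n_heads heads in n_flips fair coin flips."""
    return comb(n_flips, n_heads) / 2 ** n_flips


# Chance of a perfectly even split.
p_exact = prob_heads(100, 50)

# Chance of a split that is close to even (45 to 55 heads).
p_close = sum(prob_heads(100, k) for k in range(45, 56))

print(f"exactly 50 heads: {p_exact:.3f}")
print(f"45 to 55 heads:   {p_close:.3f}")
```

Randomization usually gets you close to balance, but rarely to perfect balance.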

Example: In a study of a Mental Health First Aid course (Jorm 2004), sixteen local government areas in rural Australia were matched into pairs based on size, geography, and socioeconomic level. In each pair, one area was assigned to receive immediate training while the other was assigned to a waiting list.

Matching can sometimes backfire

In the tinnitus study mentioned above, although there were 1,121 patients, 143 of them did not have a close match in the data and were excluded from the matched analysis. There was also some attrition in the study, which caused a greater loss in the matched analysis. If one of the patients in a pair dropped out, the other patient's data could not be used in the matched analysis. So the analysis of improvement after 4 weeks included only 414 pairs and the analysis after 14 weeks included only 354 pairs. Although the loss in sample size was probably offset by the added precision from the matching, the authors do acknowledge that this was probably "an unnecessary and disadvantageous complication."

Contrast this, though, with the study of patients who left the ER. These patients represented only 3% (289/11,087) of the total pool of subjects and this made it easy to find not just one, but two matching control patients.

In a case-control design, matching can sometimes remove the very effect you are trying to study. You should avoid matching on a variable that is caused by the exposure or is a similar measure of exposure. If you do, you might "over match" the data and remove the effect of the exposure. In a study examining radiation exposure and the risk of leukemia at a nuclear reprocessing plant (Marsh 2002), there were 37 workers diagnosed with leukemia (cases) and they were each matched to four control workers. Each of the four control workers had to work at the same site, be the same gender, have the same job code, be born within two years of the case, and be hired within two years of the hire date of the case.

Unfortunately, there was a strong trend between hire dates and exposures. Exposures were highest early in the plant's history and declined over time. So both hire date and exposure were measuring the same thing. When the data was matched on hire dates, it artefactually controlled the exposures and pretty much ensured that the average radiation exposure would be the same among both the cases and the controls. This led to an estimate of the effect of radiation exposure that was actually slightly negative and not statistically significant. When the data was re-matched using all the variables except for hire date, the effect of radiation dose was large and positive and approached statistical significance.

Stratification

Stratification is a method similar to matching that tries to achieve covariate balance across broad groups or strata. The selection of subjects in both the treatment group and the control group is constrained to have identical proportions in each stratum. This guarantees covariate balance for the strata themselves and any other factors closely related to the strata.

Example: In a study of medical records (Fine 2003), 54 records were selected from each of ten cardiac surgery centers and were examined for accuracy and completeness. To ensure a good balance, the 54 records at each site were allocated evenly to six different predefined risk strata (nine in each stratum).

Example: In a study of retention of doctors in rural Australia (Humphreys 2002), a random sample of 1,400 doctors was sent a questionnaire. The doctors were selected in strata defined by the size of the town they lived in to keep the proportion in each stratum equivalent to the proportion in the entire population of Australian doctors.

Another use of stratification is to ensure that the sample has numbers in each stratum that are proportional to the numbers in that stratum for the entire population of interest. This helps ensure that the sample is generalizable to the entire population. The second example described above shows how you can do this.

The strata are usually broadly drawn. If there were only a small number of possible patients within each stratum, then the logistics would become too difficult. So, for example, stratification by age will usually involve large intervals such as 21-30 years, 31-40 years, etc.

You cannot stratify on factors that you cannot measure or on information that is not immediately available at the start of the study. And like matching, stratification only works when you have a large pool of subjects to draw from.

Stratification can add precision to a randomized study. A separate randomization list would be drawn up for each stratum. This would ensure that each stratum would have perfect balance between the treatment group and the control group.
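A minimal sketch of this idea in Python appears below. It assumes that you know in advance how many subjects each stratum will contain and that each of those numbers is even. Real trials usually recruit subjects one at a time and use refinements such as permuted blocks. The function name and the stratum sizes are hypothetical.

```python
import random


def stratified_randomization(stratum_sizes, seed=None):
    """Build one randomization list per stratum, each holding equal
    numbers of treatment (T) and control (C) slots in random order."""
    rng = random.Random(seed)
    lists = {}
    for stratum, n in stratum_sizes.items():
        if n % 2:
            raise ValueError("perfect balance needs an even number per stratum")
        slots = ["T"] * (n // 2) + ["C"] * (n // 2)
        rng.shuffle(slots)  # random order within the stratum
        lists[stratum] = slots
    return lists


# Hypothetical age strata and sizes, for illustration only.
lists = stratified_randomization({"21-30 years": 6, "31-40 years": 4}, seed=0)
for stratum, slots in lists.items():
    print(stratum, slots)
```

Because each list is balanced on its own, the treatment and control groups end up with the same number of subjects from every stratum, no matter how the shuffles turn out.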

The crossover design

The crossover design represents a special type of matching. In a crossover design, a subject is randomly assigned to a specific treatment order. Some subjects will receive the standard therapy first, followed by the new therapy (AB). Others will receive the new therapy first, followed by the standard therapy (BA). Since the same subject receives both treatments, there is no possibility of covariate imbalance.

Example: In a study of electronic records (Brown 2003), ten physicians were asked to code patient records with two separate systems: Clinical Terms Version 3 and with the Read Codes 5 byte set. Half of the physicians were randomly assigned to code using Clinical Terms Version 3 first and then later with the Read Codes 5 Byte Set. The other half coded using Read Codes 5 Byte Set first.

When therapies are applied in sequence, timing effects are of great concern. Are the therapies set far enough apart so that the effect of one therapy is unlikely to carry over into the other therapy? For example, if the two therapies represent different drugs, did the researchers allow enough time so that one drug was fully eliminated from the body before they administered the second drug?

The washout period can sometimes cause ethical concerns. If you are treating patients for depression, an extensive amount of time during the washout would leave the patient without any effective treatment and increase the chances of something bad happening, like the patient committing suicide.

Learning effects are also a potential problem in a crossover design. You can't use a crossover design, for example, to test alternative training approaches. Imagine the instructions for this study (now forget everything we just told you; we're going to teach it a different way). I guess that would work for the classes I teach; the only thing my students remember are the jokes.

Also, watch out for the possibility that a subject may get tired or bored. This could lead to the second treatment assigned being worse than the first. Or, if the outcome involves skill, maybe "practice makes perfect" leading to the second treatment assigned being better than the first.

If there are timing effects, randomization is critical. Even with randomization, though, timing effects are a problem because they increase uncertainty by adding an extra source of variation.

Special problems arise when each subject always receives one therapy first and it is always followed by the other therapy. Many factors, other than the change in therapy, can cause a shift in the health of patients over time. If you cannot randomize the order of treatments, you have all the problems of a historical controls study.

Things to look for in a study with matching or stratification

When a study uses matching, look for the following features:

For a study using matching (stratification):

For studies using a cross-over design:

1.5 Statistical adjustments

Statistical adjustments represent one way of correcting for covariate imbalance. While matching and stratification try to prevent covariate imbalance before it occurs, statistical adjustment corrects for the imbalance after the fact.

The best example I can find for covariate adjustment is a non-medical example. You might still enjoy this example, though, if you've ever tried to buy a house. The data comes from the Data and Story Library**** and shows the housing prices of 117 homes in Albuquerque, New Mexico in 1993. The data set also includes variables that might influence the sales price of the home such as the size in square feet, the age in years, and whether the house was custom built.

When you look at the average sales price for regular homes and custom built homes, you see a large discrepancy. Regular homes sell, on average, for 95 thousand dollars, but custom built homes sell for 145 thousand dollars on average, a 50 thousand dollar discrepancy.

But when you draw a graph that shows both the size of the house and the price (see above), you notice that custom built houses (denoted by C on the graph) aren't all that much different from the regular houses (denoted by R). The margins of the graph explain exactly what is happening. On the right hand side, you see a box plot for the prices of regular and custom homes. The plus signs inside each box plot represent the mean prices. When you look at this dimension alone, the prices seem quite different. At the bottom of the graph are box plots for the size of the homes. Uh-oh! It looks like the custom built homes are quite a bit bigger than the regular homes (2,100 versus 1,500 square feet). This is hardly surprising. People who have the money for a custom built home also have the money for a roomier and more spacious house.

So now you have to wonder--are custom builds more expensive, because they are custom builds, or just because they are bigger? This is the sort of confusion you always have to deal with when you encounter covariate imbalance. The solution is to adjust for the differences in house sizes. There is a fairly strong and predictable relationship between size and price. For every extra square foot of space, the average sales price increases by $55. Multiply this by 600 square feet, the discrepancy in sizes between the average custom build and the average regular home. It turns out that of the 50 thousand dollar gap that you observed, 33 thousand can be explained by the difference in average sizes. The remaining 17 thousand dollars is probably real. So a covariate adjustment would reduce the estimated difference in prices by about 2/3.

The trend lines in the plot above show the relationship between size and price, and the gap between the lines represents the difference in price adjusting for size. So, for example, a house with 2,000 square feet would sell for an estimated 137 thousand dollars if it were custom built and around 120 thousand dollars if it were not.
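The arithmetic behind this adjustment fits in a few lines of Python. This is only a back-of-the-envelope sketch using the rounded figures quoted above (a 50 thousand dollar gap, $55 per square foot, and a 600 square foot difference in average size). It is not a regression fit to the actual data, and the function name is hypothetical.

```python
def adjust_difference(raw_gap, slope, covariate_gap):
    """Split a raw difference between two groups into the part explained
    by a covariate imbalance and the part that remains after adjustment."""
    explained = slope * covariate_gap
    remaining = raw_gap - explained
    return explained, remaining


explained, remaining = adjust_difference(
    raw_gap=50_000,     # custom built minus regular, average price in dollars
    slope=55,           # dollars per extra square foot
    covariate_gap=600,  # custom built minus regular, average square feet
)
print(explained, remaining)  # 33000 17000
print(f"share of the gap explained by size: {explained / 50_000:.0%}")
```

A full analysis would estimate the slope and the adjusted gap together in a single regression model, but the logic is the same.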

Example: A study (Smith 1997) examined the relationship between frequency of orgasm and ten year mortality among male residents of Caerphilly, South Wales. The researchers divided the men into low, medium, and high frequency groups. Low frequency meant less than monthly and high frequency meant twice a week or more often. This is a study which would have been impossible to randomize; the men (and presumably their wives) determined which group they belonged to. As you might expect, there were demographic differences in the three groups. Age was significantly associated with frequency of orgasm. Men in the low, medium, and high frequency groups were 54, 52, and 50 years old, on average. The job categories also differed, with the proportion of non-manual labor being 29%, 42%, and 42% among the three groups. For other variables (height, body mass index, systolic blood pressure, cholesterol, existing coronary heart disease, and smoking status), the differences were smaller and less important. The adjustments used a combination of regression approaches and weighting. After adjustment, there was a strong trend in mortality, with men in the low frequency group having an adjusted mortality rate that was twice as high as that of the high frequency group. Both the article itself and a subsequent letter to the editor (Batty 1998) mentioned, however, that additional unmeasured variables could have influenced the outcome.

Avoiding covariate imbalance by looking at a special subgroup

If there is covariate imbalance in the entire sample, perhaps there may be a subgroup where the covariate is balanced. If you can find such a subgroup and it produces results similar to the entire sample, you can have greater confidence in the findings of the entire sample.

Example: In a study of the effect of men's age on time to pregnancy (Hassan 2003), older men tended to have a longer time to pregnancy. These older men, though, also have older wives, on average. This creates an unfair comparison, since the wife's age would probably also influence time to pregnancy. To produce a fairer comparison, the researchers conducted a separate analysis looking at men of all ages who married young wives.

Of course, it is not always possible to find a subgroup without covariate imbalance. And when you do find such a subgroup, the smaller sample size may lead to an unacceptable loss of precision. Furthermore, the subgroup may be somewhat unusual, making it difficult for you to generalize the findings.

Reweighting to restore balance

Another way to restore balance in a study is the use of weights. Suppose the treatment group includes 25 males and 75 females, but in the population we know that there should be a 50/50 split by gender. We could re-weight the data, so that each male has a weighting factor of 2.0 and each female has a weighting factor of 0.67. This artificially inflates the number of males to 50 and deflates the number of females to 50. The control group might have 40 males and 60 females. For this group, we would use weights of 1.25 and 0.83.
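Each weight is simply the number of subjects you want in a category divided by the number you actually have. Here is a small Python sketch of that calculation, using the counts from the paragraph above; the function and variable names are illustrative only.

```python
def balancing_weights(counts, target_shares):
    """Weight for each category: the target count divided by the
    observed count, holding the total sample size fixed."""
    total = sum(counts.values())
    return {
        group: target_shares[group] * total / counts[group]
        for group in counts
    }


target = {"male": 0.5, "female": 0.5}  # the 50/50 split in the population

treatment_weights = balancing_weights({"male": 25, "female": 75}, target)
control_weights = balancing_weights({"male": 40, "female": 60}, target)

print({g: round(w, 2) for g, w in treatment_weights.items()})
print({g: round(w, 2) for g, w in control_weights.items()})
```

Multiplying each count by its weight gives 50 males and 50 females in both groups, so the two groups now have the same gender mix.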

A recent article on educational testing (Wainer 2004) shows how a simple re-weighting of the data can lead to a fairer comparison between two groups. These researchers show data on a state by state basis for the National Assessment of Educational Progress (NAEP). Two states, Nebraska and New Jersey, show interesting results. The average score for Nebraska is 277 and only 271 for New Jersey. But interestingly enough, New Jersey outperforms Nebraska among whites (283 vs 281), blacks (242 vs 236) and other non-whites (260 vs 259).

This odd finding occurs because New Jersey has much different demographics than Nebraska. In New Jersey 66% of the population is white, 15% black, and 19% other non-white. In Nebraska, 87% of the population is white, 5% is black, and 8% is other non-white. It is this differing demographic mix that causes the paradox.

The average score for each state is a weighted average. For Nebraska, the calculation is

281*0.87 +  236*0.05 + 259*0.08 = 277

and for New Jersey, the calculation is

283*0.66 + 242*0.15 + 260*0.19 = 272

Nebraska benefits because a higher weight (0.87) is placed on the race that scored highest in both states. What would happen to Nebraska's and New Jersey's scores if the demographic mix was changed to the overall percentages in the U.S. (69% white, 16% black, and 15% other non-white)?

Here are the re-weighted calculations for Nebraska

281*0.69 + 236*0.16 + 259*0.15 = 271

and New Jersey

283*0.69 + 242*0.16 + 260*0.15 = 273

My numbers don't match perfectly with the original article because of rounding error, but the overall conclusions remain the same. Nebraska does have a higher mean than New Jersey but when you adjust this mean for the racial demographics, New Jersey actually does better.
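If you want to check these four weighted averages yourself, the short Python sketch below reproduces them from the subgroup scores and demographic percentages quoted above. The function and variable names are illustrative only.

```python
def weighted_mean(scores, shares):
    """Average score when each subgroup counts in proportion to its share."""
    return sum(scores[group] * shares[group] for group in scores)


nebraska_scores = {"white": 281, "black": 236, "other": 259}
new_jersey_scores = {"white": 283, "black": 242, "other": 260}

nebraska_mix = {"white": 0.87, "black": 0.05, "other": 0.08}
new_jersey_mix = {"white": 0.66, "black": 0.15, "other": 0.19}
us_mix = {"white": 0.69, "black": 0.16, "other": 0.15}

# Each state weighted by its own demographic mix.
ne_raw = weighted_mean(nebraska_scores, nebraska_mix)
nj_raw = weighted_mean(new_jersey_scores, new_jersey_mix)

# Both states re-weighted to the same (overall U.S.) demographic mix.
ne_adj = weighted_mean(nebraska_scores, us_mix)
nj_adj = weighted_mean(new_jersey_scores, us_mix)

print(f"own mix:  Nebraska {ne_raw:.1f}, New Jersey {nj_raw:.1f}")
print(f"U.S. mix: Nebraska {ne_adj:.1f}, New Jersey {nj_adj:.1f}")
```

With each state's own mix, Nebraska comes out ahead. With a common mix, the ordering reverses, because the only thing left to differ between the two states is the subgroup scores.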

Re-weighting to a common demographic risk is often used to make adjustments between two groups that have sharply differing mixes of age, gender, and/or racial characteristics.

The statistical analysis gets a bit tricky with weights, but nothing that a professional statistician can't handle. Weights can also improve the generalizability of a study. If the overall sample has a skewed demographic, weights can help bring it back in line with the population of interest.

Unmeasured covariates

You can only adjust for those things that you can measure. Unfortunately, there are many things such as a patient's psychological state, presence of co-morbid conditions, and initial severity of the disease that are so difficult to assess that they are often just not measured.

Example: A study of asthma and chronic obstructive pulmonary disease in several different data sources (Hansell 2003), showed inconsistent results for asthma across the data sources. The authors speculate that smoking and social class might influence these results, but these variables were not available in most of the data sets used in this study.

Example: A study of hip fractures (Ray 2002) noted that three previous case-control studies using large databases had suggested that statins were associated with a lower risk of hip fractures among elderly patients. The authors speculated that there may be a "healthy drug user effect" that would bias these findings. By a healthy drug user effect, the authors meant that patients who use preventive measures and comply with them faithfully are likely to be less seriously ill at baseline than patients who don't take preventive measures or are poor compliers. Some of this may be that these patients just have better general self-care habits. In addition, doctors might be more likely to prescribe statins to heavier patients and the extra padding in these patients provides some protection against hip fracture. Measuring self-care habits would be impossible to do in most research settings, but especially in a retrospective study like a case-control design. Patient weights are easier to obtain, but unfortunately, these data were not available in two of the three case-control studies. The authors conducted a cohort study, which had some of the same problems as the case-control studies because it, too, was retrospective and had no data on patient weights. Nevertheless, patients using statins and patients using other lipid lowering drugs both had comparable levels of reduced hip fractures compared to non-users. This indicated that the reduced risk of hip fractures might be an overall effect of the healthy lifestyles of patients who use any preventive medicine rather than the effect of the statins themselves.

Imperfectly measured covariates

Some covariates can be measured, but only crudely. If the covariate itself is difficult to measure accurately, then any attempts to make statistical adjustments will only be partially successful. Your measurement may only capture half of the information in the covariate. The half of the covariate that is unaccounted for will remain behind, leading to an unfair comparison. This is sometimes called residual confounding.

Example: In a study of factors influencing Down syndrome (Chen 1999), smoking had a surprisingly protective effect. This could be explained by the age of the mother. Older mothers smoke less and are also at greater risk for birth of a Down syndrome child. The unadjusted odds ratio for this effect was 0.80 and was borderline statistically significant (95% CI 0.65 to 0.98). A crude adjustment for age used the categories <35 years and >=35 years. With this adjustment, the odds ratio was still small (0.87) and borderline (95% CI 0.71 to 1.07). But when the exact year of age was used to adjust, and race and parity were also included in the adjustment, then there was no association (odds ratio=1.00, 95% CI 0.82 to 1.24). This shows that an imperfect adjustment can produce an incorrect conclusion.

Example: In a study of adverse birth outcomes (Elliott 2001), residents who lived within two kilometers of a landfill site were compared to residents who lived farther away. The authors acknowledged that these landfills were typically located in areas that were already poverty stricken. So perhaps factors associated with poverty, such as poorer nutrition, rather than the landfill itself, might influence the risk of adverse birth outcomes. They tried to account for poverty using the Carstairs index, a measure of deprivation. The authors admit that this is a rather crude adjustment, and perhaps some additional degree of poverty was left unaccounted for. An accompanying editorial (McNamee 2001) pointed out that even a 10% disparity in a risk factor that doubles the chances of an adverse birth outcome could lead to changes that dwarf the effects seen in this particular study.
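
The arithmetic behind the editorial's point can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a calculation taken from McNamee 2001: I read "a 10% disparity" as a difference of ten percentage points in how common the risk factor is, and the baseline prevalence of 20% is a number I made up. The landfill itself is assumed to have no effect at all.

```python
def apparent_relative_risk(prevalence_near, prevalence_far, risk_factor_rr):
    """Relative risk (near versus far) that an unbalanced risk factor
    produces on its own, when living near the landfill has no effect."""
    excess = risk_factor_rr - 1.0
    return (1.0 + prevalence_near * excess) / (1.0 + prevalence_far * excess)


# Hypothetical inputs: the risk factor doubles the risk (rr = 2.0) and is
# found in 30% of residents near the landfill versus 20% farther away.
rr = apparent_relative_risk(prevalence_near=0.30, prevalence_far=0.20,
                            risk_factor_rr=2.0)
print(f"Apparent relative risk from confounding alone: {rr:.3f}")
```

With these inputs the unbalanced risk factor alone produces a relative risk of about 1.08. Any study that reports an effect smaller than this needs a very convincing adjustment for the risk factor before you can take the effect at face value.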

Self-reported covariates are often measured imperfectly, and are especially troublesome if they require the patient to recall events from the distant past.

Smoking is an important covariate for many studies, and it would be better to ask current smokers how much they smoke. For smokers who have quit, you might also like to know how recently they quit. For both groups it might also help to know when they started. But often, the only question asked is a yes/no question like "Do you smoke cigarettes?"

Some covariates like blood cholesterol levels are inherently variable. In an ideal world, these covariates would be measured at a second time and the two measures could be averaged to reduce some of the uncertainty. But this is not always possible or practical.
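
Here is a small sketch in Python of how much a second measurement helps. It assumes the classical measurement error model, where each reading equals the patient's true level plus an independent error with the same variance every time. The variances are hypothetical and chosen so that a single reading captures exactly half of the information, as in the discussion above.

```python
def reliability(true_variance, error_variance, n_measurements=1):
    """Share of the variance in the averaged reading that reflects real
    differences between patients rather than measurement error."""
    return true_variance / (true_variance + error_variance / n_measurements)


# Hypothetical variances: real differences and measurement error are equal.
single = reliability(true_variance=1.0, error_variance=1.0)
average_of_two = reliability(true_variance=1.0, error_variance=1.0,
                             n_measurements=2)
print(f"one reading: {single:.2f}  average of two readings: {average_of_two:.2f}")
```

Averaging two readings raises the share from one half to two thirds under these assumptions. That is an improvement, but some of the covariate is still unaccounted for, so averaging reduces residual confounding without eliminating it.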

Adjusting for variables in the causal pathway

Although adjusting for covariate imbalance is usually a good thing, you can sometimes take it too far. If your treatment influences an intermediate variable and that variable influences the outcome, then the intermediate variable is said to be in the causal pathway.

For example, I was co-author on a research study (Kliethermes 1999) that was trying to improve the rate of breastfeeding in a group of pre-term infants. The intervention was to feed these infants, when the mother was not around, through their nasogastric (ng) tube. This sounds like an icky thing, but remember that the population is pre-term infants, who probably have to have an ng tube anyway. So it would not be too weird to use this tube for feeding. It might mean that the ng tube would have to stay in a bit longer, but if this led to a greater proportion of mothers breastfeeding at three and six months, that would be a worthwhile tradeoff.

In the control group, infants would be fed from a bottle when the mother was not around. For both groups, of course, breastfeeding would be encouraged whenever the mother was with the baby. Keep in mind that these are pre-term babies, so some of them may stay in the hospital for weeks. Since the mothers left the hospital much sooner than their babies, what to do while the mother was not around was very critical.

It turns out that the intervention was very successful. Infants randomized to the ng tube feeding group had higher rates of breastfeeding at discharge, three days post discharge, as well as at three and six months. One possible explanation for this success is that infants who receive too many bottles early in life may have trouble latching onto the mother's nipple.

When the researchers collected the data, they included a variable which measured the number of bottles of formula received during the hospital stay. This variable was zero for most of the infants in the ng tube feeding group, although a handful of infants in this group did incorrectly get a few bottles of formula.

Just on a whim, I decided to adjust for the number of bottles received. I was shocked to find out that the effect of the treatment disappeared when I adjusted for the number of bottles. At first, I panicked, but then I realized that this adjustment, if anything, proved the effectiveness of the intervention. The number of bottles received was directly influenced by the intervention, so the fact that this intermediate variable was more strongly associated with breastfeeding rates than the intervention itself should not come as a surprise.

We did not publish the results of this particular analysis, partly for space limitations and partly because it was difficult to explain properly. But I took it as a lesson to think carefully about covariate adjustment, and not to just toss a variable into the fray without thinking carefully about it first.
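
The logic of the bottle adjustment can be sketched with a toy calculation in Python. None of these numbers come from Kliethermes 1999. The probabilities are hypothetical, and the sketch assumes the extreme case where the intervention works only by keeping infants away from bottles, with bottles collapsed to a yes/no variable.

```python
# Hypothetical chance that an infant receives no bottles, by study group.
P_NO_BOTTLES = {"ng_tube": 0.90, "bottle": 0.05}

# Hypothetical chance of breastfeeding later on. It depends only on the
# bottles received, so the group matters only through the bottles.
P_BREASTFED = {"no_bottles": 0.70, "some_bottles": 0.40}


def breastfeeding_rate(group, bottle_stratum=None):
    """Expected breastfeeding rate in a group, overall or within one
    bottle stratum."""
    if bottle_stratum is not None:
        return P_BREASTFED[bottle_stratum]
    p_none = P_NO_BOTTLES[group]
    return (p_none * P_BREASTFED["no_bottles"]
            + (1 - p_none) * P_BREASTFED["some_bottles"])


# Unadjusted comparison of the two randomized groups.
crude_difference = breastfeeding_rate("ng_tube") - breastfeeding_rate("bottle")

# "Adjusted" comparison: the same two groups compared within each stratum.
adjusted_differences = {
    stratum: breastfeeding_rate("ng_tube", stratum)
    - breastfeeding_rate("bottle", stratum)
    for stratum in P_BREASTFED
}

print(f"unadjusted difference: {crude_difference:.3f}")
print(f"differences within bottle strata: {adjusted_differences}")
```

The unadjusted comparison shows a large benefit for the ng tube group, while the comparison within each bottle stratum shows no difference at all. The intervention did not stop working. The adjustment simply removed the route by which it works.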

Matching and adjustments:  What to look for.

If a study uses covariate adjustments, look for the following things:

Counterpoint: Randomized trials are not all they are cracked up to be.

Can matching and/or statistical adjustments in an observational study provide a comparison as fair and as persuasive as a randomized study? This is an unfair question, because sometimes a randomized study is just not possible. Also, there are so many different types of observational studies that it would be difficult to come up with a good general answer. Still, some people have tried to answer this question.

An empirical study of observational and randomized studies of the same topic (Concato 2000) found that there was a high level of consistency between the two. This contradicted the previously held belief that observational studies tended to overstate the effectiveness of a new treatment. The debate about this finding continues to rage, but perhaps the quality of the design and the sophistication of the adjustments used in observational studies place them on a level comparable to randomized studies. A study on thrombolytic treatment in patients with acute myocardial infarction (Koch 1997) showed that a large non-randomized registry provided data that was comparable to that collected in randomized studies.

Randomized studies have some additional weaknesses. The very process of randomization will create an artificial environment that does not represent how medicine is normally practiced (Sackett 1997). When you go to your doctor for assistance with birth control, you do not expect him/her to randomly assign you to a particular method. And if your doctor said you had a 50% chance of getting a placebo contraceptive, you would probably switch doctors. Because an observational study does not have to cope with the intrusion of the randomization process, it can often study medicine in an environment much closer to reality.

Furthermore, the use of a placebo in a randomized trial creates an artificial situation where patients are more likely to drop out and less likely to report side effects (Rochon 1999).

Another problem with randomized designs is the limit to their size and scope. The logistics of randomization make a randomized study more expensive than a comparable observational study. Thus effects that require a very large sample size to detect (such as rare side effects) or effects that take a long time to manifest themselves (such as the progression of many types of cancer) often cannot be examined in a randomized experiment. An observational approach, like post-marketing surveillance, is more likely to be successful in these situations.
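
To get a feel for how large "very large" is, here is a short Python sketch. It assumes that each patient independently has the same chance of the side effect, and it asks only how many treated patients you need for a 95% chance of seeing the side effect at least once. The event rates are hypothetical. Showing that the side effect is more common on treatment than on control would require even more patients.

```python
import math


def patients_needed(event_rate, chance_of_seeing_one=0.95):
    """Smallest number of patients for which the chance of observing at
    least one event reaches chance_of_seeing_one."""
    return math.ceil(math.log(1.0 - chance_of_seeing_one)
                     / math.log(1.0 - event_rate))


# Hypothetical side effect rates: 1 in 100, 1 in 1,000, and 1 in 10,000.
for rate in (1 / 100, 1 / 1_000, 1 / 10_000):
    print(f"rate {rate:.4f}: {patients_needed(rate):,} patients")
```

The answer grows to roughly three divided by the event rate, so a side effect that strikes one patient in ten thousand needs on the order of thirty thousand treated patients just to show up once.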

All other things being equal, a randomized study provides a higher standard of evidence than an observational study, but rarely are all other things equal.

On your own

1. Review the following abstracts, all from studies where randomization was not done. Speculate on the reason that randomization was not performed.

Body Fatness During Childhood and Adolescence and Incidence of Breast Cancer in Pre-Menopausal Women: A Prospective Cohort Study.  Heather J Baer, Graham A Colditz, Bernard Rosner, Karin B Michels, Janet W Rich-Edwards, David J Hunter and Walter C Willett. Breast Cancer Research 2005, 7:R314-R325 doi:10.1186/bcr998. Introduction:  Body mass index (BMI) during adulthood is inversely related to the incidence of pre-menopausal breast cancer, but the role of body fatness earlier in life is less clear. We examined prospectively the relation between body fatness during childhood and adolescence and the incidence of breast cancer in pre-menopausal women. Methods:  Participants were 109,267 pre-menopausal women in the Nurses' Health Study II who recalled their body fatness at ages 5, 10 and 20 years using a validated 9-level figure drawing. Over 12 years of follow up, 1318 incident cases of breast cancer were identified. Cox proportional hazards regression was used to compute relative risks (RRs) and 95% confidence intervals (CIs) for body fatness at each age and for average childhood (ages 5–10 years) and adolescent (ages 10–20 years) fatness. Results:  Body fatness at each age was inversely associated with pre-menopausal breast cancer incidence; the multivariate RRs were 0.48 (95% CI 0.35–0.55) and 0.57 (95% CI 0.39–0.83) for the most overweight compared with the most lean in childhood and adolescence, respectively (P for trend < 0.0001). The association for childhood body fatness was only slightly attenuated after adjustment for later BMI, with a multivariate RR of 0.52 (95% CI 0.38–0.71) for the most overweight compared with the most lean (P for trend = 0.001). Adjustment for menstrual cycle characteristics had little impact on the association. Conclusion: Greater body fatness during childhood and adolescence is associated with reduced incidence of pre-menopausal breast cancer, independent of adult BMI and menstrual cycle characteristics.

This is an open source publication. The full free text is available at breast-cancer-research.com/content/7/3/R314.

Impact of a Nurses' Protocol-Directed Weaning Procedure on Outcomes in Patients Undergoing Mechanical Ventilation for Longer Than 48 Hours:  A Prospective Cohort Study with a Matched Historical Control Group. Jean-Marie Tonnelier, Gwenaël Prat, Grégoire Le Gal, Christophe Gut-Gobert, Anne Renault, Jean-Michel Boles and Erwan L'Her. Critical Care 2005, 9:R83-R89 doi:10.1186/cc3030. Introduction:  The aim of the study was to determine whether the use of a nurses' protocol-directed weaning procedure, based on the French intensive care society (SRLF) consensus recommendations, was associated with reductions in the duration of mechanical ventilation and intensive care unit (ICU) length of stay in patients requiring more than 48 hours of mechanical ventilation. Methods:  This prospective study was conducted in a university hospital ICU from January 2002 through to February 2003. A total of 104 patients who had been ventilated for more than 48 hours and were weaned from mechanical ventilation using a nurses' protocol-directed procedure (cases) were compared with a 1:1 matched historical control group who underwent conventional physician-directed weaning (between 1999 and 2001). Duration of ventilation and length of ICU stay, rate of unsuccessful extubation and rate of ventilator-associated pneumonia were compared between cases and controls. Results:  The duration of mechanical ventilation (16.6 ± 13 days versus 22.5 ± 21 days; P = 0.02) and ICU length of stay (21.6 ± 14.3 days versus 27.6 ± 21.7 days; P = 0.02) were lower among patients who underwent the nurses' protocol-directed weaning than among control individuals. Ventilator-associated pneumonia, ventilator discontinuation failure rates and ICU mortality were similar between the two groups. Discussion:  Application of the nurses' protocol-directed weaning procedure described here is safe and promotes significant outcome benefits in patients who require more than 48 hours of mechanical ventilation.

This is an open source publication. The full free text is available at ccforum.com/content/9/2/R83.

Extravascular Lung-Water in Patients with Severe Sepsis:  A Prospective Cohort Study. Greg S Martin, Stephanie Eaton, Meredith Mealer and Marc Moss. Critical Care 2005, 9:R74-R82 doi:10.1186/cc3025. Introduction:  Few investigations have prospectively examined extravascular lung water (EVLW) in patients with severe sepsis. We sought to determine whether EVLW may contribute to lung injury in these patients by quantifying the relationship of EVLW to parameters of lung injury, to determine the effects of chronic alcohol abuse on EVLW, and to determine whether EVLW may be a useful tool in the diagnosis of acute respiratory distress syndrome (ARDS). Methods:  The present prospective cohort study was conducted in consecutive patients with severe sepsis from a medical intensive care unit in an urban university teaching hospital. In each patient, transpulmonary thermodilution was used to measure cardiovascular hemodynamics and EVLW for 7 days via an arterial catheter placed within 72 hours of meeting criteria for severe sepsis. Results:  A total of 29 patients were studied. Twenty-five of the 29 patients (86%) were mechanically ventilated, 15 of the 29 patients (52%) developed ARDS, and overall 28-day mortality was 41%. Eight out of 14 patients (57%) with non-ARDS severe sepsis had high EVLW with significantly greater hypoxemia than did those patient with low EVLW (mean arterial oxygen tension/fractional inspired oxygen ratio 230.7 ± 36.1 mmHg versus 341.2 ± 92.8 mmHg; P < 0.001). Four out of 15 patients with severe sepsis with ARDS maintained a low EVLW and had better 28-day survival than did ARDS patients with high EVLW (100% versus 36%; P = 0.03). ARDS patients with a history of chronic alcohol abuse had greater EVLW than did nonalcoholic patients (19.9 ml/kg versus 8.7 ml/kg; P < 0.0001). 
The arterial oxygen tension/fractional inspired oxygen ratio, lung injury score, and chest radiograph scores correlated with EVLW (r2 = 0.27, r2 = 0.18, and r2 = 0.28, respectively; all P < 0.0001). Conclusions:  More than half of the patients with severe sepsis but without ARDS had increased EVLW, possibly representing subclinical lung injury. Chronic alcohol abuse was associated with increased EVLW, whereas lower EVLW was associated with survival. EVLW correlated moderately with the severity of lung injury but did not account for all respiratory derangements. EVLW may improve both risk stratification and management of patients with severe sepsis.

This is an open source publication. The full free text is available at ccforum.com/content/9/2/R74.

Breast Implants Following Mastectomy in Women with Early-Stage Breast Cancer:  Prevalence and Impact on Survival. Gem M Le, Cynthia D O'Malley, Sally L Glaser, Charles F Lynch, Janet L Stanford, Theresa HM Keegan and Dee W West. Breast Cancer Res 2005, 7:R184-R193 doi:10.1186/bcr974. Background: Few studies have examined the effect of breast implants after mastectomy on long-term survival in breast cancer patients, despite growing public health concern over potential long-term adverse health effects. Methods: We analyzed data from the Surveillance, Epidemiology and End Results Breast Implant Surveillance Study conducted in San Francisco–Oakland, in Seattle–Puget Sound, and in Iowa. This population-based, retrospective cohort included women younger than 65 years when diagnosed with early or unstaged first primary breast cancer between 1983 and 1989, treated with mastectomy. The women were followed for a median of 12.4 years (n = 4968). Breast implant usage was validated by medical record review. Cox proportional hazards models were used to estimate hazard rate ratios for survival time until death due to breast cancer or other causes for women with and without breast implants, adjusted for relevant patient and tumor characteristics. Results: Twenty percent of cases received postmastectomy breast implants, with silicone gel-filled implants comprising the most common type. Patients with implants were younger and more likely to have in situ disease than patients not receiving implants. Risks of breast cancer mortality (hazard ratio, 0.54; 95% confidence interval, 0.43–0.67) and nonbreast cancer mortality (hazard ratio, 0.59; 95% confidence interval, 0.41–0.85) were lower in patients with implants than in those patients without implants, following adjustment for age and year of diagnosis, race/ethnicity, stage, tumor grade, histology, and radiation therapy. Implant type did not appear to influence long-term survival. 
Conclusions: In a large, population-representative sample, breast implants following mastectomy do not appear to confer any survival disadvantage following early-stage breast cancer in women younger than 65 years old.

This is an open source publication. The full free text is available at breast-cancer-research.com/content/7/2/R184.

Pregnancy Weight Gain and Breast Cancer Risk. Tarja I Kinnunen, Riitta Luoto, Mika Gissler, Elina Hemminki and Leena Hilakivi-Clarke. BMC Women's Health 2004, 4:7 doi:10.1186/1472-6874-4-7. Background: Elevated pregnancy estrogen levels are associated with increased risk of developing breast cancer in mothers. We studied whether pregnancy weight gain that has been linked to high circulating estrogen levels, affects a mother's breast cancer risk. Methods: Our cohort consisted of women who were pregnant between 1954–1963 in Helsinki, Finland, 2,089 of which were eligible for the study. Pregnancy data were collected from patient records of maternity centers. 123 subsequent breast cancer cases were identified through a record linkage to the Finnish Cancer Registry, and the mean age at diagnosis was 56 years (range 35 – 74). A sample of 979 women (123 cases, 856 controls) from the cohort was linked to the Hospital Inpatient Registry to obtain information on the women's stay in hospitals. Results: Mothers in the upper tertile of pregnancy weight gain (>15 kg) had a 1.62-fold (95% CI 1.03–2.53) higher breast cancer risk than mothers who gained the recommended amount (the middle tertile, mean: 12.9 kg, range 11–15 kg), after adjusting for mother's age at menarche, age at first birth, age at index pregnancy, parity at the index birth, and body mass index (BMI) before the index pregnancy. In a separate nested case-control study (n = 65 cases and 431 controls), adjustment for BMI at the time of breast cancer diagnosis did not modify the findings. Conclusions: Our study suggests that high pregnancy weight gain increases later breast cancer risk, independently from body weight at the time of diagnosis.

This is an open source publication. The full free text is available at www.biomedcentral.com/1472-6874/4/7.

Racial Variations in Processes of Care for Patients with Community-Acquired Pneumonia.  Eric M Mortensen, John Cornell and Jeff Whittle. BMC Health Services Research 2004, 4:20 doi:10.1186/1472-6963-4-20. Background: Patients hospitalized with community acquired pneumonia (CAP) have a substantial risk of death, but there is evidence that adherence to certain processes of care, including antibiotic administration within 8 hours, can decrease this risk. Although national mortality data shows blacks have a substantially increased odds of death due to pneumonia as compared to whites previous studies of short-term mortality have found decreased mortality for blacks. Therefore we examined pneumonia-related processes of care and short-term mortality in a population of patients hospitalized with CAP. Methods: We reviewed the records of all identified Medicare beneficiaries hospitalized for pneumonia between 10/1/1998 and 9/30/1999 at one of 101 Pennsylvania hospitals, and randomly selected 60 patients at each hospital for inclusion. We reviewed the medical records to gather process measures of quality, pneumonia severity and demographics. We used Medicare administrative data to identify 30-day mortality. Because only a small proportion of the study population was black, we included all 240 black patients and randomly selected 720 white patients matched on age and gender. We performed a re-sampling of the white patients 10 times. Results: Males were 43% of the cohort, and the median age was 76 years. After controlling for potential confounders, blacks were less likely to receive antibiotics within 8 hours (odds ratio with 95% confidence interval 0.6, 0.4–0.97), but were as likely as whites to have blood cultures obtained prior to receiving antibiotics (0.7, 0.3–1.5), to have oxygenation assessed within 24 hours of presentation (1.6, 0.9–3.0), and to receive guideline concordant antibiotics (OR 0.9, 0.6–1.7). 
Black patients had a trend towards decreased 30-day mortality (0.4, 0.2 to 1.0). Conclusion: Although blacks were less likely to receive optimal care, our findings are consistent with other studies that suggest better risk-adjusted survival among blacks than among whites. Further study is needed to determine why this is the case.

This is an open source publication. The full free text is available at www.biomedcentral.com/1472-6963/4/20.

Bibliography

Ackermann-Liebrich U, Voegeli T, Gunter-Witt K, Kunz I, Zullig M, Schindler C, Maurer M, Team ZS. Home versus hospital deliveries: follow up study of matched pairs for procedures and outcome. BMJ 1996: 313(7068); 1313-1318. The full free text of this reference is available at http://bmj.com/cgi/content/full/313/7068/1313.

Anderson HR, Ayres JG, Sturdy PM, Bland JM, Butland BK, Peckitt C, Taylor JC, Victor CR. Bronchodilator Treatment and Deaths from Asthma: Case-Control Study. British Medical Journal 2005: 330(7483); 117. The full free text of this reference is available at www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15618231.

Baer HJ, Colditz GA, Rosner B, Michels KB, Rich-Edwards JW, Hunter DJ, Willett WC. Body Fatness During Childhood and Adolescence and Incidence of Breast Cancer in Premenopausal Women: A Prospective Cohort Study. Breast Cancer Research 2005: 7(3); R314-R325. The full free text of this reference is available at breast-cancer-research.com/content/7/3/R314.

Baker SG, Kramer BS, Prorok PC. Comparing Breast Cancer Mortality Rates Before-and-After a Change in Availability of Screening in Different Regions: Extension of the Paired Availability Design. BMC Med Res Methodol 2004: 4(1); 12. The full free text of this reference is available at www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15149551.

Baker SG, Lindeman KS, Kramer BS. The paired availability design for historical controls. BMC Med Res Methodol 2001: 1(1); 9. The full free text of this reference is available at www.biomedcentral.com/1471-2288/1/9.

Batty D. Are sex and death related? Study failed to adjust for an important confounder [letter; comment]. British Medical Journal 1998: 316(7145); 1671; discussion 1672. The full free text of this reference is available at bmj.com/cgi/content/full/316/7145/1671/a.

Brown J, Kreiger N, Darlington GA, Sloan M. Misclassification of exposure: coffee as a surrogate for caffeine intake. American Journal of Epidemiology 2001: 153(8); 815-20.

Brown PJ, Warmington V, Laurence M, Prevost AT. Randomised crossover trial comparing the performance of Clinical Terms Version 3 and Read Codes 5 byte set coding schemes in general practice. BMJ 2003: 326(7399); 1127. The full free text of this reference is available at bmj.com/cgi/content/full/326/7399/1127.

Bryant HE, Visser N, Love EJ. Records, recall loss, and recall bias in pregnancy: a comparison of interview and medical records data of pregnant and postnatal women. American Journal of Public Health 1989: 79(1); 78-80.

Cameron E, Pauling L. Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer. Proc Natl Acad Sci U S A 1976: 73(10); 3685-9. The full free text of this reference is available at www.pubmedcentral.nih.gov/picrender.fcgi?artid=431183&blobtype=pdf.

Chan Carusone S, Smieja M, Molly W, Goldsmith CH, Mahoney J, Chernesky M, Gnarpe J, Standish T, Smith S, Loeb M. Lack of Association Between Vascular Dementia and Chlamydia Pneumoniae Infection: A Case-Control Study. BMC Neurol 2004: 4(1); 15. The full free text of this reference is available at www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15476562.

Chen CL, Gilbert TJ, Daling JR. Maternal smoking and Down syndrome: the confounding effect of maternal age. Am J Epidemiol 1999: 149(5); 442-6.

Colditz G, Miller J, Mosteller F. How study design affects outcomes in comparisons of therapy. I: Medical. Stat Med 1989: 8(4); 441-454.

Concato J, Shah N, Horwitz RI. Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs. The New England Journal of Medicine 2000: 342(25); 1887-1892. The full free text of this reference is available at content.nejm.org/cgi/content/full/342/25/1887.

Danesh J, Youngman L, Clark S, Parish S, Peto R, Collins R. Helicobacter pylori infection and early onset myocardial infarction: case-control and sibling pairs study. BMJ 1999: 319(7218); 1157-62. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/319/7218/1157.

Davey Smith G, Frankel S, Yarnell J. Sex and Death: Are They Related? Findings from the Caerphilly Cohort Study. British Medical Journal 1997: 315(7123); 1641-4. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/315/7123/1641.

Des Jarlais DC, Paone D, Milliken J, Turner CF, Miller H, Gribble J, Shi Q, Hagan H, Friedman SR. Audio-Computer Interviewing to Measure Risk Behaviour for HIV Among Injecting Drug Users: A Quasi-Randomised Trial. Lancet 1999: 353(9165); 1657-61.

Elliott P, Briggs D, Morris S, de Hoogh C, Hurt C, Jensen TK, Maitland I, Richardson S, Wakefield J, Jarup L. Risk of adverse birth outcomes in populations living near landfill sites. BMJ 2001: 323(7309); 363-8. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/323/7309/363.

Fine LG, Keogh BE, Cretin S, Orlando M, Gould MM. How to evaluate and improve the quality and credibility of an outcomes database: validation and feedback study on the UK Cardiac Surgery Experience. BMJ 2003: 326(7379); 25-28. The full free text of this reference is available at bmj.com/cgi/content/full/326/7379/25.

Goldman RD, Macpherson A, Schuh S, Mulligan C, Pirie J. Patients Who Leave the Pediatric Emergency Department Without Being Seen: A Case-Control Study. CMAJ 2005: 172(1); 39-43. The full free text of this reference is available at www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15632403.

Grimes DA, Schulz KF. Bias and causal associations in observational research. Lancet 2002: 359(9302); 248-252. The full free text of this reference is available at www.pebita.ch/downloadSTROBE/Grimes-Lancet-2002-Bias.pdf.

Gunnell D, Magnusson PK, Rasmussen F. Low Intelligence Test Scores in 18 Year old Men and Risk of Suicide: Cohort Study. British Medical Journal 2005: 330(7484); 167. The full free text of this reference is available at www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15615767.

Hansell A, Hollowell J, McNiece R, Nichols T, Strachan D. Validity and interpretation of mortality, health service and survey data on COPD and asthma in England. Eur Respir J 2003: 21(2); 279-86. The full free text of this reference is available at erj.ersjournals.com/cgi/content/full/21/2/279.

Hassan MA, Killick SR. Effect of male age on fertility: evidence for the decline in male fertility with increasing age. Fertil Steril 2003: 79 Suppl 3; 2-9.

Henquet C, Krabbendam L, Spauwen J, Kaplan C, Lieb R, Wittchen HU, van Os J. Prospective Cohort Study of Cannabis Use, Predisposition for Psychosis and Psychotic Symptoms in Young People. British Medical Journal 2005: 330(7481); 11. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/330/7481/11.

Homnick DN, Anderson K, Marks JH. Comparison of the Flutter Device to Standard Chest Physiotherapy in Hospitalized Patients with Cystic Fibrosis: A Pilot Study. Chest 1998: 114(4); 993-7. The full free text of this reference is available at www.chestjournal.org/cgi/reprint/114/4/993.

Hsu LM. Random Sampling, Randomization, and Equivalence of Contrasted Groups in Psychotherapy Outcome Research. Journal of Consulting and Clinical Psychology 1989: 57(1); 131-7.

Humphreys JS, Jones MP, Jones JA, Mara PR. Workforce retention in rural and remote Australia: determining the factors that influence length of practice. Med J Aust 2002: 176(10); 472-6. The full free text of this reference is available at www.mja.com.au/public/issues/176_10_200502/hum10169_fm.html.

Jain A, Concato J, Leventhal JM. How Good Is the Evidence Linking Breastfeeding and Intelligence? Pediatrics 2002: 109(6); 1044-1053. The full free text of this reference is available at pediatrics.aappublications.org/cgi/content/full/109/6/1044.

Johnson LR, Doherty G, Lairmore T, Moley JF, Brunt LM, Koenig J, Scott MG. Evaluation of the performance and clinical impact of a rapid intraoperative parathyroid hormone assay in conjunction with preoperative imaging and concise parathyroidectomy. Clin Chem 2001: 47(5); 919-25. The full free text of this reference is available at www.clinchem.org/cgi/content/full/47/5/919.

Jorm AF, Kitchener BA, O'Kearney R, Dear KB. Mental health first aid training of the public in a rural area: a cluster randomized trial [ISRCTN53887541]. BMC Psychiatry 2004: 4(1); 33. The full free text of this reference is available at www.biomedcentral.com/1471-244X/4/33.

Kinnunen TI, Luoto R, Gissler M, Hemminki E, Hilakivi-Clarke L. Pregnancy Weight Gain and Breast Cancer Risk. BMC Women's Health 2004: 4(1); 7. The full free text of this reference is available at www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15498103.

Kliethermes P, Cross M, Lanese M, Johnson K, Simon S. Transitioning preterm infants with nasogastric tube supplementation: increased likelihood of breastfeeding. J Obstet Gynecol Neonatal Nurs 1999: 28(3); 264-73.

Koch A, Hörmann A, Löwel H, Senges J, Published in the Proceedings of the International Conference on Nonrandomized Comparative Clinical Studies in Heidelberg, April 10 -11,1997. "The 60-Minutes-Myocardial Infarction Project": Comparison with a Registry and a Randomized Clinical Trial. Accessed on 2003-06-30. www.symposion.com/nrccs/koch.htm

Kramer MS, Guo T, Platt RW, Shapiro S, Collet JP, Chalmers B, Hodnett E, Sevkovskaya Z, Dzikovich I, Vanilovich I. Breastfeeding and infant growth: biology or bias? Pediatrics 2002: 110(2 Pt 1); 343-7. The full free text of this reference is available at pediatrics.aappublications.org/cgi/reprint/110/2/343.

Kulkarni M, Elsner C, Ouellet D, Zeldin R. Heparinized saline versus normal saline in maintaining patency of the radial artery catheter. Can J Surg 1994: 37(1); 37-42.

Lachs MS, Nachamkin I, Edelstein PH, Goldman J, Feinstein AR, Schwartz JS. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992: 117(2); 135-40.

Lakits A, Prokesch R, Bankier A, Weninger F, Imhof H. Multiplanar Imaging in the Preoperative Assessment of Metallic Intraocular Foreign Bodies. Helical Computed Tomography Versus Conventional Computed Tomography. Ophthalmology 1998: 105(9); 1679-85.

Le GM, O'Malley CD, Glaser SL, Lynch CF, Stanford JL, Keegan THM, West DW. Breast Implants Following Mastectomy in Women With Early-Stage Breast Cancer: Prevalence and Impact on Survival. Breast Cancer Res 2005: 7(2); R184-R193. The full free text of this reference is available at breast-cancer-research.com/content/7/2/r184.

Lefering R, Neugebauer E, Published in the Proceedings of the International Conference on Nonrandomized Comparative Clinical Studies in Heidelberg, April 10-11, 1997. Problems of Randomized Controlled Trials (RCT) in Surgery. Accessed on 2003-06-30. www.symposion.com/nrccs/lefering.htm

Leibovici L. Effects of remote, retroactive intercessory prayer on outcomes in patients with bloodstream infection: randomised controlled trial. British Medical Journal 2001: 323(7327); 1450-1. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/323/7327/1450.

Malcoe LH, Duran BM, Montgomery JM. Socioeconomic disparities in intimate partner violence against Native American women: a cross-sectional study. BMC Med 2004: 2(1); 20. The full free text of this reference is available at www.biomedcentral.com/1741-7015/2/20.

Mann CJ. Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emerg Med J 2003: 20(1); 54-60. The full free text of this reference is available at emj.bmjjournals.com/cgi/content/full/20/1/54.

Mann J. Truths about the NINDS study: setting the record straight. West J Med 2002: 176(3); 192-194. The full free text of this reference is available at www.ewjm.com/cgi/content/full/176/3/192.

Marsh JL, Hutton JL, Binks K. Removal of Radiation Dose Response Effects: An Example of Over-Matching. British Medical Journal 2002: 325(7359); 327-30. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/325/7359/327.

Martin GS, Eaton S, Mealer M, Moss M. Extravascular Lung Water in Patients with Severe Sepsis: A Prospective Cohort Study. Crit Care 2005: 9(2); R74-R82. The full free text of this reference is available at ccforum.com/content/9/2/r74.

McNamee R, Dolk H. Does exposure to landfill waste harm the fetus? Perhaps, but more evidence is needed. British Medical Journal 2001: 323(7309); 351-2. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/323/7309/351.

Moertel C, Fleming TR, Creagan ET, Rubin J, O'Connell MJ, Ames MM. High-dose vitamin C versus placebo in the treatment of patients with advanced cancer who have had no prior chemotherapy. A randomized double-blind comparison. New England Journal of Medicine 1985: 312(3); 137-141.

Mortensen EM, Cornell J, Whittle J. Racial Variations in Processes of Care for Patients with Community-Acquired Pneumonia. BMC Health Serv Res 2004: 4(20). The full free text of this reference is available at www.biomedcentral.com/1472-6963/4/20.

Nelson M. The distribution of nutrient intake within families. Br J Nutr 1986: 55(2); 267-77.

Randolph AG, Cook DJ, Gonzales CA, Andrew M. Benefit of heparin in peripheral venous and arterial catheters: systematic review and meta-analysis of randomised controlled trials. British Medical Journal 1998: 316(7136); 969-75. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/316/7136/969.

Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978: 299(17); 926-30.

Ray WA, Daugherty JR, Griffin MR. Lipid-lowering agents and the risk of hip fracture in a Medicaid population. Inj Prev 2002: 8(4); 276-9. The full free text of this reference is available at ip.bmjjournals.com/cgi/content/full/8/4/276.

Roberts C, Torgerson D. Understanding Controlled Trials: Baseline imbalance in randomised controlled trials. British Medical Journal 1999: 319(7203); 185. The full free text of this reference is available at bmj.com/cgi/content/full/319/7203/185.

Rochon PA, Binns MA, Litner JA, Litner GM, Fischbach MS, Eisenberg D, Kaptchuk TJ, Stason WB, Chalmers TC. Are randomized control trial outcomes influenced by the inclusion of a placebo group? a systematic review of nonsteroidal antiinflammatory drug trials for arthritis treatment. J Clin Epidemiol 1999: 52(2); 113-22.

Ronning OM, Guldvog B. Should stroke victims routinely receive supplemental oxygen? A quasi-randomized controlled trial. Stroke 1999: 30(10); 2033-7. The full free text of this reference is available at intl-stroke.ahajournals.org/cgi/content/full/30/10/2033.

Sackett DL. Evidence-based medicine and treatment choices. Lancet 1997: 349(9051); 570; discussion 572-3.

Salo PM, Xia J, Johnson CA, Li Y, Kissling GE, Avol EL, Liu C, London SJ. Respiratory symptoms in relation to residential coal burning and environmental tobacco smoke among early adolescents in Wuhan, China: a cross-sectional study. Environ Health 2004: 3(1); 14. The full free text of this reference is available at www.ehjournal.net/content/3/1/14.

Schulz KF. Randomised trials, human nature, and reporting guidelines. Lancet 1996: 348(9027); 596-8.

Steptoe A, Doherty S, Rink E, Kerry S, Kendrick T, Hilton S. Behavioural counselling in general practice for the promotion of healthy behaviour among adults at increased risk of coronary heart disease: randomised trial. British Medical Journal 1999: 319(7215); 943-7; discussion 947-8. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/319/7215/943.

Taylor RS, Reeves BC, Ewings PE, Taylor RJ. Critical Appraisal Skills Training for Health Care Professionals: A Randomized Controlled Trial [ISRCTN46272378]. BMC Med Educ 2004: 4(1); 30. The full free text of this reference is available at www.biomedcentral.com/content/pdf/1472-6920-4-30.pdf.

Teekachunhatean S, Kunanusorn P, Rojanasthien N, Sananpanich K, Pojchamarnwiputh S, Lhieochaiphunt S, Pruksakorn S. Chinese Herbal Recipe Versus Diclofenac in Symptomatic Treatment of Osteoarthritis of the Knee: A Randomized Controlled Trial [ISRCTN70292892]. BMC Complementary and Alternative Medicine 2004: 4(1); 19. The full free text of this reference is available at www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15588333.

Tonnelier J-M, Prat G, Le Gal G, Gut-Gobert C, Renault A, Boles J-M, L'Her E. Impact of a Nurses' Protocol-Directed Weaning Procedure on Outcomes in Patients Undergoing Mechanical Ventilation for Longer than 48 Hours: A Prospective Cohort Study with a Matched Historical Control Group. Critical Care 2005: 9(2); R83-R89. The full free text of this reference is available at ccforum.com/content/9/2/r83.

Wacholder S. Design issues in case-control studies. Stat Methods Med Res 1995: 4(4); 293-309.

Wainer H, Brown LM. Two Statistical Paradoxes in the Interpretation of Group Differences: Illustrated with Medical School Admission and Licensing data. The American Statistician 2004: 58(2); 117-23.

Yoshioka A. Use of randomisation in the Medical Research Council's clinical trial of streptomycin in pulmonary tuberculosis in the 1940s. British Medical Journal 1998: 317(7167); 1220-1223. The full free text of this reference is available at bmj.bmjjournals.com/cgi/content/full/317/7167/1220.

Footnotes

* The possibility that oral contraceptives cause an increase in the risk of cervical cancer is quite complex; a good summary of all the issues involved appears on the web at www.jhuccp.org/pr/a9/a9chap5.shtml.

** A nice summary of these benefits is on the web at www.breastfeeding.com/all_about/all_about_more.html.

*** Many books on research design intentionally contrast cross-sectional and longitudinal designs. I do not mention longitudinal designs explicitly in this section because these designs do not fit into the hierarchy as I have described it. In general, a longitudinal design is a cohort design with evaluation of the outcome at multiple time points. As such, it shares all the strengths and weaknesses of the cohort design. An additional strength of the longitudinal design, though, is that you can often gain considerable power for comparisons within a patient because you have removed between-patient variability from the equation. In this sense it is much like the crossover designs discussed earlier.

**** The Data and Story Library is on the web at lib.stat.cmu.edu/DASL/DataArchive.html and this particular data set is at lib.stat.cmu.edu/DASL/Stories/homeprice.html. There is a lot more going on with this data set than I have discussed here, and if you are the ambitious sort, you should download this data set and try a few additional data analyses yourself.

This webpage was written by Steve Simon on 2004-07-07, edited by Steve Simon, and was last modified on 2008-07-08.