Statistical Evidence. Chapter 6. What do all these numbers mean?
[This is the first draft of Chapter 6 of "Statistical Evidence."]
###Fix this. Insert cartoon: "Entering Hillsville. The population, year founded, and altitude are added together for a total figure."###
I have a fictional story that I tell people. It's about someone who comes to my office and says he has trouble understanding a recently published paper. I look at the title, "In vitro and in vivo assessment of Endothelin as a biomarker of iatrogenically induced alveolar hypoxia in neonates," and say that I understand why he would have trouble with a paper like this. "Yeah," he says in return, "I don't understand what this boxplot is on page 3."
You've already mastered the complex language of medicine, so don't be intimidated by technical statistical terms. I will try to provide some simple explanations of statistical terms like confidence interval and odds ratio, but it's impossible to list all the possible statistical jargon.
When you do come across a statistical term that you are unfamiliar with, don't panic. Here's some general guidance:
- Some of the statistical details are there only for the benefit of those who want to reproduce the research. Most of you recognize that you can safely skim over phrases like "reverse ion phase chromatography," so you can likewise skim over phrases like "bootstrap confidence intervals using bias corrected percentiles (Efron 1982)." When a statistical method is followed by a reference as in the example above, then you can take some solace in the fact that the authors do not expect you to be familiar with this method.
- If a statistical term has several words, focus first on the one word in the term you are most familiar with (most often the noun). You may not know what "reverse ion phase chromatography" is, but you probably have a good general idea about "chromatography." Similarly, with the phrase "bootstrap confidence intervals using bias corrected percentiles" focus on the term "confidence intervals."
You do have to know some statistical terminology, of course. Anyone reading research papers should be familiar with Type I and II errors, odds ratios, survival curves, etc. A basic appreciation of simple statistical methods is enough for nine out of ten papers.
6.1 Samples and populations
A population is a collection of items of interest in research. The population represents a group that you wish to generalize your research to. Populations are often defined in terms of demography, geography, occupation, time, care requirements, diagnosis, or some combination of the above. In most cases, researchers will not explicitly specify a population, but you can usually infer a reasonable population from the context of the research.
A sample is a subset of a population. A random sample is a subset where every item in the population has the same probability of being in the sample. Usually, the size of the sample is much smaller than the size of the population. The primary goal of much research is to use information collected from a sample to try to characterize a certain population. As such, you should pay a lot of attention to how representative the sample is of the population. If there are problems with representativeness, consider redefining your population a bit more narrowly. For example, a sample of 85 teenage smokers who volunteer for a research study for a new smoking cessation program might not be considered representative of the population of all teenage smokers, because the participants selected themselves. The sample might be more representative, however, if we restrict our population to those teenage smokers who want to quit.
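If you want to see what an equal chance of selection looks like in practice, here is a minimal sketch in Python. The population of 1,000 ID numbers, the seed, and the sample size of 85 are all made up for illustration; they do not come from any study.

```python
import random

# HYPOTHETICAL population: ID numbers for 1,000 teenage smokers.
population = list(range(1, 1001))

# Fixing the seed just makes this sketch reproducible.
rng = random.Random(2024)

# Random.sample draws without replacement, and every ID has the same
# chance of ending up in the sample.
sample = rng.sample(population, k=85)

print(len(sample), "IDs drawn; the five smallest are", sorted(sample)[:5])
```

Note that this only addresses how the sample is drawn. It does nothing about volunteers who select themselves, which is the problem in the smoking cessation example.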
Example: In a study of vertebral and non-vertebral fracture (Adachi 2002), the researchers selected a sample of "2009 postmenopausal women 50 years and older who were seen in consultation at our tertiary care, university teaching hospital-affiliated office [for a bone fracture] and who were registered in the Canadian Database of Osteoporosis and Osteopenia (CANDOO) patients." The population that these researchers wished to generalize to would be all postmenopausal women 50 years or older with a bone fracture who live in North America. If you are worried that this would be too difficult to generalize to, you could restrict the population to fractures serious enough to warrant a visit to a tertiary care center.
The impact of incident vertebral and non-vertebral fractures on health related quality of life in postmenopausal women. Adachi JD, Ioannidis G, Olszynski WP, Brown JP, Hanley DA, Sebaldt RJ, Petrie A, Tenenhouse A, Stephenson GF, Papaioannou A, Guyatt GH, Goldsmith CH. BMC Musculoskeletal Disorders 2002, 3:11 (22 April 2002) Background Little empirical research has examined the multiple consequences of osteoporosis on quality of life. Methods Health related quality of life (HRQL) was examined in relationship to incident fractures in 2009 postmenopausal women 50 years and older who were seen in consultation at our tertiary care, university teaching hospital-affiliated office and who were registered in the Canadian Database of Osteoporosis and Osteopenia (CANDOO) patients. Patients were divided into three study groups according to incident fracture status: vertebral fractures, non-vertebral fractures and no fractures. Baseline assessments of anthropometric data, medical history, therapeutic drug use, and prevalent fracture status were obtained from all participants. The disease-targeted mini-Osteoporosis Quality of Life Questionnaire (mini-OQLQ) was used to measure HRQL. Results Multiple regression analyses revealed that subjects who had experienced an incident vertebral fracture had lower HRQL difference scores as compared with non-fractured participants in total score (-0.86; 95% confidence intervals (CI): -1.30, -0.43) and the symptoms (-0.76; 95% CI: -1.23, -0.30), physical functioning (-1.12; 95% CI: -1.57, -0.67), emotional functioning (-1.06; 95% CI: -1.44, -0.68), activities of daily living (-1.47; 95% CI: -1.97, -0.96), and leisure (-0.92; 95% CI: -1.37, -0.47) domains of the mini-OQLQ. 
Patients who experienced an incident non-vertebral fracture had lower HRQL difference scores as compared with non-fractured participants in total score (-0.47; 95% CI: -0.70, -0.25), and the symptoms (-0.25; 95% CI: -0.49, -0.01), physical functioning (-0.39; 95% CI: -0.65, -0.14), emotional functioning (-0.97; 95% CI: -1.20, -0.75) and the activities of daily living (-0.47; 95% CI: -0.73, -0.21) domains. Conclusion Quality of life decreased in patients who sustained incident vertebral and non-vertebral fractures.
In a study of post-myocardial infarction pharmacological management in older patients (Di Cecco 2002), a "comprehensive chart audit was conducted of 142 men and 81 women in an academic primary care practice." The population that these researchers wanted to generalize to was all post-myocardial infarction patients older than 60 years.
Is there a clinically significant gender bias in post-myocardial infarction pharmacological management in the older (>60) population of a primary care practice? Di Cecco R, Patel U, Upshur REG. BMC Family Practice 2002, 3:8 (3 May 2002) Background Differences in the management of coronary artery disease between men and women have been reported in the literature. There are few studies of potential inequalities of treatment that arise from a primary care context. This study investigated the existence of such inequalities in the medical management of post myocardial infarction in older patients. Methods A comprehensive chart audit was conducted of 142 men and 81 women in an academic primary care practice. Variables were extracted on demographic variables, cardiovascular risk factors, medical and non-medical management of myocardial infarction. Results Women were older than men. The groups were comparable in terms of cardiac risk factors. A statistically significant difference (14.6%: 95% CI 0.048 to 28.7 p = 0.047) was found between men and women for the prescription of lipid lowering medications. 25.3% (p = 0.0005, CI 11.45, 39.65) more men than women had undergone angiography, and 14.4 % (p = 0.029, CI 2.2, 26.6) more men than women had undergone coronary artery bypass graft surgery. Conclusion Women are less likely than men to receive lipid-lowering medication which may indicate less aggressive secondary prevention in the primary care setting.
6.2 Type I and II errors
In many studies, you are interested in choosing between two competing hypotheses. Ideally, you specify the two competing hypotheses prior to any data collection. You should also specify a decision rule prior to collecting your data. The decision rule uses information from your sample of data to select one or the other of the two competing hypotheses.
The first hypothesis, often called the null hypothesis or denoted by the symbol H0, is traditionally a hypothesis that represents the status quo. The null hypothesis is usually reserved for claims of no effect, no association, or no relationship. If you are comparing a new drug to a standard drug, the null hypothesis might be that the average effects of the two drugs are equal.
The second hypothesis, often called the alternative hypothesis and denoted by the symbol H1 or Ha, represents a claim involving some type of effect or some type of association or relationship. For the study evaluating a new drug, the alternative hypothesis might be that the average effect of the new drug is different from that of the standard drug (maybe better, maybe worse).
In some situations, your alternative hypothesis may only consider a single direction. For example, if you are comparing a new drug to placebo, the hypothesis that the new drug is worse than placebo is rather uninteresting. It would be effectively no different than if you concluded that the new drug was equivalent to placebo. For these situations, the alternative hypothesis would ignore the possibility of being worse and would restrict itself to the possibility that the new drug is better than placebo.
Hypothesis testing has the danger of oversimplifying the research. Why do you have to choose only between two hypotheses, for example? Why not three or four competing hypotheses? Also, the null hypothesis is perhaps a bit unrealistic. No two drugs are going to have exactly the same level of effectiveness. Wouldn't it be more interesting to look at a null hypothesis that stated that the average effects of the two drugs are close enough to each other that you can feel comfortable using either one? Finally, why do we have to choose? Why can't we just state how much the data changes our degree of belief in the two competing hypotheses?
These types of modifications can be incorporated into hypothesis testing, but too often researchers do not seriously consider modifying the hypotheses but just do things the same old way.
When you are using a decision rule to decide between these two hypotheses, you have to allow for the possibility of error. After all, the decision rule uses information from a sample, which even under the best of circumstances is an imperfect representation of a population. There are actually two types of errors that you can make when choosing between a null and alternative hypothesis.
- A Type I error is rejecting the null hypothesis when the null hypothesis is true.
- A Type II error is accepting the null hypothesis when the null hypothesis is false.
Consider a new drug that we will put on the market if we can show that it is better than a placebo.
- A Type I error would be allowing an ineffective drug onto the market.
- A Type II error would be keeping an effective drug off the market.
Both errors are serious, but you should consider the relative importance of each type of error. If your drug is treating a fatal condition, and there is no other effective drug on the market, then a Type II error is very serious because patients without any other hope are being denied an effective treatment. If your drug is treating a less serious condition and is competing against a wide range of drugs already on the market, then from the patient's perspective, a Type II error is less serious. From your company's perspective, a Type II error is still serious because you are being denied the opportunity to compete in a lucrative marketplace.
Statisticians are unique among all of the professions, because we admit freely that we make errors. We hope that the probability of these errors is small, and in most situations, we can actually estimate these probabilities. Alpha is defined as the probability of making a Type I error, and beta is defined as the probability of making a Type II error. The complementary probabilities also have names. The confidence level is defined as 1 - alpha, and the power is defined as 1 - beta.
For a given sample size, there is a tradeoff between alpha and beta, not unlike the tradeoff between sensitivity and specificity of a diagnostic test. Almost every researcher sets up their decision rule so that alpha, the probability of a Type I error, is 0.05. Very few researchers make an attempt to justify this level, and this is a major shortcoming. What they should do is try to balance alpha and beta according to the costs and severity associated with each type of error. If the cost of a Type I error is trivial and the cost of a Type II error is serious, perhaps the researcher should allow the value of alpha to increase to 0.10 or maybe even higher, so as to ensure that beta, the probability of a Type II error, remains small.
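To get a feel for this tradeoff, here is a rough sketch in Python. It assumes a two-sided z-test comparing the means of two equal-sized groups with a known common standard deviation, which is a simplification of what most studies actually do. The function name beta_two_sample and all of the numbers (50 patients per group, a true difference of 5 units, a standard deviation of 10) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def beta_two_sample(alpha, n_per_group, difference, sd):
    """Approximate Type II error rate for a two-sided z-test of two means.

    Assumes equal group sizes and a known common standard deviation.
    """
    z = NormalDist()
    standard_error = sd * sqrt(2 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = difference / standard_error
    # Power is the chance of landing in either rejection region
    # when the true difference is `difference`.
    power = z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)
    return 1 - power


# HYPOTHETICAL study: 50 per group, true difference 5, standard deviation 10.
for alpha in (0.01, 0.05, 0.10):
    beta = beta_two_sample(alpha, n_per_group=50, difference=5, sd=10)
    print(f"alpha = {alpha:.2f}  beta = {beta:.2f}")
```

With the sample size held fixed, the printed beta shrinks as alpha is allowed to grow.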
The best way to ensure that both alpha and beta are small is to increase your sample size. A larger sample will typically reduce the probabilities of both types of errors.
Beta (and power) are a bit more difficult to calculate than alpha because you have to specify not only that the null hypothesis is false, but by how much. Typically, you would want to make sure that beta was small for clinically important changes, but you wouldn't worry so much about beta for changes that are clinically trivial. In fact, if your research sample size is so large that beta is minuscule even for clinically trivial changes, then perhaps your sample size is too large. Conversely, if beta is large, even for changes that are clinically important, then you should consider increasing your sample size.
There are extensive formulas and programs that do this sort of calculation. This is often called a power calculation, because the probabilities when the null hypothesis is false are usually stated as power rather than beta.
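As an illustration of what such a program does, here is a minimal sketch in Python of one of the simplest cases: the sample size per group for a two-sided comparison of two means, using the normal approximation. The function name n_per_group and the inputs (a clinically important difference of 5 units, a standard deviation of 10) are assumptions for illustration. Real power software handles many more designs and usually uses the t distribution, which gives slightly larger answers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(alpha, power, difference, sd):
    """Approximate sample size per group for a two-sided test of two means.

    Uses the normal approximation with a known common standard deviation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)


# HYPOTHETICAL inputs: alpha 0.05, power 0.80, standard deviation 10.
print(n_per_group(alpha=0.05, power=0.80, difference=5, sd=10))
# Halving the difference you want to detect roughly quadruples the sample size.
print(n_per_group(alpha=0.05, power=0.80, difference=2.5, sd=10))
```

This is why you need to decide what counts as a clinically important change before you compute a sample size. The smaller the change you insist on detecting, the more patients you need.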
There are both ethical and economic considerations at work here. A sample size that is too small wastes time and money, because there is too great a chance of concluding that the new treatment is equivalent to a placebo, even when it is capable of producing clinically important effects. More importantly, it is an abuse of the goodwill of your research volunteers. People volunteer for a research study for a variety of reasons, but one of the most important is that they want to help out future patients who have the same disease. They are hoping to contribute to the advancement of knowledge, but you have placed them in a research study that has little chance of doing so.
Similarly, a sample size that is too large represents a waste of money and resources, and also raises ethical concerns. There are inconveniences, discomforts, and hazards associated with research, and you should not ask more people to endure these hardships than is needed to demonstrate a clinically important change.
6.3 Confidence interval
We statisticians have a habit of hedging our bets. We always insert qualifiers into our reports, warn about all sorts of assumptions, and never admit to anything more extreme than probable. There's a famous saying: "Statistics means never having to say you're certain."
We qualify our statements, of course, because we are always dealing with imperfect information. In particular, we are often asked to make statements about a population (a large group of subjects) using information from a sample (a small, but carefully selected subset of this population). No matter how carefully this sample is selected to be a fair and unbiased representation of the population, relying on information from a sample will always lead to some level of uncertainty.
A confidence interval is a range of values that tries to quantify this uncertainty. Consider it as a range of plausible values. A narrow confidence interval implies high precision; we can specify plausible values to within a tiny range. A wide interval implies poor precision; we can only specify plausible values to a broad and uninformative range.
Consider a recent study of homoeopathic treatment of pain and swelling after oral surgery (Lokken 1995). When examining swelling 3 days after the operation, they showed that homoeopathy led to 1 mm less swelling on average. The 95% confidence interval, however, ranged from -5.5 to 7.5 mm. From what little I know about oral surgery, this appears to be a very wide interval. This interval implies that neither a large improvement due to homoeopathy nor a large decrement could be ruled out.
Generally when a confidence interval is very wide like this one, it is an indication of an inadequate sample size, an issue that the authors mention in the discussion section of this paper.
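The link between sample size and interval width can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a confidence interval for a single mean, a known standard deviation, and the normal approximation. The function name ci_for_mean and the numbers are hypothetical; they are not taken from the oral surgery study.

```python
from math import sqrt
from statistics import NormalDist


def ci_for_mean(estimate, sd, n, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit * sd / sqrt(n)
    return estimate - half_width, estimate + half_width


# HYPOTHETICAL estimate of 1.0 with a standard deviation of 10.
for n in (10, 40, 160):
    low, high = ci_for_mean(estimate=1.0, sd=10.0, n=n)
    print(f"n = {n:3d}  interval = ({low:.1f}, {high:.1f})  width = {high - low:.1f}")
```

Under these assumptions, you need four times as many subjects to cut the width of the interval in half.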
When you see a confidence interval in a published medical report, you should look for two things. First, does the interval contain a value that implies no change or no effect? For example, with a confidence interval for a difference, look to see whether that interval includes zero. With a confidence interval for a ratio, look to see whether that interval contains one.
Here's an example of a confidence interval that contains the null value. The interval shown below implies no statistically significant change.
Here's an example of a confidence interval that excludes the null value. If we assume that larger implies better, then the interval shown below would imply a statistically significant improvement.
Here's a different example of a confidence interval that excludes the null value. The interval shown below implies a statistically significant decline.
You should also see whether the confidence interval lies partly or entirely within a range of clinical indifference. Clinical indifference represents values of such a trivial size that you would not want to change your current practice. For example, you would not recommend a special diet that showed a one year weight loss of only five pounds. You would not order a diagnostic test that had a predictive value of less than 50%.
Clinical indifference is a medical judgement, and not a statistical judgement. It depends on your knowledge of the range of possible treatments, their costs, and their side effects. As a statistician, I can only speculate on what a range of clinical indifference is. I do want to emphasize, however, that if a confidence interval is contained entirely within your range of clinical indifference, then you have clear and convincing evidence to keep doing things the same way (see below).
On the other hand, if part of the confidence interval lies outside the range of clinical indifference, then you should consider the possibility that the sample size is too small (see below).
Some studies have sample sizes that are so large that even trivial differences are declared statistically significant. If your confidence interval excludes the null value but still lies entirely within the range of clinical indifference, then you have a result with statistical significance, but no practical significance (see below).
Finally, if your confidence interval excludes the null value and lies outside the range of clinical indifference, then you have both statistical and practical significance (see below).
Example: In a study of trends in hospital admission for lower respiratory illness (Björ 2003), the annual rate of increase was 3.8% (95% CI, 1.3 to 6.3) in boys under one year of age and 5.0% (95% CI, 2.4 to 7.6) in girls under one year of age. Since both of these confidence intervals exclude the value of 0%, you can conclude that there is a statistically significant increase in admission rates. If you presume that a shift of 0.5% in either direction is clinically trivial, then these confidence intervals both lie entirely outside the range of clinical indifference, so the increase is clinically important as well.
A retrospective population based trend analysis on hospital admissions for lower respiratory illness among Swedish children from 1987 to 2000. Björ O, Bråbäck L. BMC Public Health 2003, 3:22 (11 July 2003) Background Data relating to hospital admissions of very young children for wheezing illness have been conflicting. Our primary aim was to assess whether a previous increase in hospital admissions for lower respiratory illness had continued in young Swedish children. We have included re-admissions in our analyses in order to evaluate the burden of lower respiratory illness in very young children. We have also assessed whether changes in the labelling of symptoms have affected the time trend. Methods A retrospective, population based study was conducted to assess the time trend in admissions and re-admissions for lower respiratory illness. Data were obtained from the Swedish Hospital Discharge Register for all children with a first hospital admission before nine years of age, a total of 109,176 children. The register covers more than 98% of all hospital admissions in Sweden. The coding of diagnoses was based on ICD-9 from 1987 to 1996 and ICD-10 from 1997. Results The first admission rates declined significantly in children with a first admission after two years of age. However, an increasing admission trend was observed in children aged less than one year and 35% of first admissions occurred in this age group. The annual increase was 3.8% (95% CI 1.3 to 6.3) in boys and 5.0% (95% CI 2.4 to 7.6) in girls. A diagnostic shift appeared to occur when ICD-10 was introduced in 1997. The asthma and pneumonia admission rate in children aged less than one year levelled off, whereas the increase in admissions for bronchitis continued. The re-admission rates for asthma decreased and the probability of re-admission was higher in boys.
National drug statistics demonstrated a substantial increase in the delivery of inhaled steroids to all age groups but most prescriptions occurred to children aged one year or more. Conclusion Hospital admissions for lower respiratory illness are still increasing in children aged <1 year. Our findings are in line with other recent studies suggesting a change in the responsiveness to viral infections in very young children, but changes in admission criteria cannot be excluded. An increased use of inhaled steroids may have contributed to decreasing re-admission rates.
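The two checks described earlier in this section (does the interval exclude the null value, and where does it sit relative to the range of clinical indifference) can be sketched in Python. The function name classify_interval and its wording of the four outcomes are my own invention for illustration. The intervals and the 0.5% range come from the admission rate example above.

```python
def classify_interval(lower, upper, indifference, null_value=0.0):
    """Label a confidence interval using the two checks described in the text.

    `indifference` is a (low, high) pair for the range of clinical
    indifference. Choosing it is a medical judgement, not a statistical one.
    """
    low_ind, high_ind = indifference
    excludes_null = lower > null_value or upper < null_value
    entirely_inside = low_ind <= lower and upper <= high_ind
    entirely_outside = lower > high_ind or upper < low_ind

    if entirely_inside and excludes_null:
        return "statistically significant, but not practically significant"
    if entirely_inside:
        return "clear evidence of no clinically important effect"
    if entirely_outside and excludes_null:
        return "statistically and practically significant"
    return "inconclusive; the sample size may be too small"


# Annual increase in admission rates (percent), with a presumed
# range of clinical indifference of 0.5% in either direction.
intervals = {"boys": (1.3, 6.3), "girls": (2.4, 7.6)}
for group, (lower, upper) in intervals.items():
    print(group, "->", classify_interval(lower, upper, indifference=(-0.5, 0.5)))
```

The labels are only as good as the range of clinical indifference you supply, so treat the output as a summary of your own judgement and not as a verdict.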
Example: In a systematic overview of isoflavones or soy phyto-estrogens on serum lipid levels (Yeung 2003), the isoflavones had an insignificant effect on serum total cholesterol showing only a 0.01 mmol/L decline (95% CI, -0.17 to 0.18). The results were equally disappointing for low density lipoprotein (0.00 mmol/L decline, 95% CI, -0.14 to 0.15), high density lipoprotein (0.01 mmol/L decline, 95% CI, -0.05 to 0.06), and triglycerides (0.03 mmol/L decline, 95% CI -0.06 to 0.12). Since all of these confidence intervals include zero, there is no statistically significant change in these levels. Furthermore, these intervals are so narrow that they would easily be included in any reasonable range of clinical indifference. That makes these findings a definitive negative result.
Effects of isoflavones (soy phyto-estrogens) on serum lipids: a meta-analysis of randomized controlled trials. Yeung J, Yu T. Nutrition Journal 2003, 2:15 (19 November 2003) Objectives To determine the effects of isoflavones (soy phyto-estrogens) on serum total cholesterol (TC), low density lipoprotein cholesterol (LDL), high density lipoprotein cholesterol (HDL) and triglyceride (TG). Methods We searched electronic databases and included randomized trials with isoflavones interventions in the forms of tablets, isolated soy protein or soy diets. Review Manager 4.2 was used to calculate the pooled risk differences with fixed effects model. Results Seventeen studies (21 comparisons) with 853 subjects were included in this meta-analysis. Isoflavones tablets had insignificant effects on serum TC, 0.01 mmol/L (95% CI: -0.17 to 0.18, heterogeneity p = 1.0); LDL, 0.00 mmol/L (95% CI: -0.14 to 0.15, heterogeneity p = 0.9); HDL, 0.01 mmol/L (95% CI: -0.05 to 0.06, heterogeneity p = 1.0); and triglyceride, 0.03 mmol/L (95% CI: -0.06 to 0.12, heterogeneity p = 0.9). Isoflavones interventions in the forms of isolated soy protein (ISP), soy diets or soy protein capsule were heterogeneous to combine. Conclusions Isoflavones tablets, isolated or mixtures with up to 150 mg per day, seemed to have no overall statistical and clinical benefits on serum lipids. Isoflavones interventions in the forms of soy proteins may need further investigations to resolve whether synergistic effects are necessary with other soy components.
6.4 P-value
A p-value is a measure of evidence. A small p-value indicates lots of evidence against the null hypothesis. How small is small? The traditional cut-off is 0.05, though researchers will sometimes use a stricter cut-off (e.g., 0.01) or a more liberal cut-off (e.g., 0.10). Unfortunately, most researchers give little thought to the cut-off and reflexively use the traditional 0.05 level. As mentioned above, you should set the cut-off depending on how serious a Type I error is compared to a Type II error.
A small p-value by itself only tells you half the story because it gives you no information about the magnitude of the change seen. Is there a clinically important difference, or is it trivial? A confidence interval complements the p-value well (and some argue that it should even replace the p-value) because it provides information about whether the difference seen in this research is clinically important or clinically trivial.
A large p-value by itself also only tells half the story. There is little or no evidence against the null hypothesis, but that does not always translate into lots of evidence in favor of the null hypothesis. Perhaps your sample size is so small that you do not have much evidence for any particular hypothesis. Again, a confidence interval is more helpful, because a narrow interval (one that fits entirely inside the range of clinical indifference) is strong evidence that nothing important is going on here.
Example: In a study of reviewers of abstracts for a primary care research conference (Montgomery 2002), reviewers rated each abstract on seven categories, with a rating of 1 representing a poor level and 4 representing an excellent level, so the total score ranged from 7 to 28 points. The accepted abstracts had an average rating of 17.4 and the rejected abstracts had an average rating of 14.6. The p-value for comparing the average rating between the two groups was 0.0003, which is very small. This indicates that you should reject the null hypothesis that the average rating is the same for both groups. The p-value, by itself, does not quantify the magnitude of the change, so the authors also included a confidence interval. The 95% confidence interval for the difference in average ratings was 1.3 to 4.1. You can conclude based on the confidence interval that the difference in average scores is greater than 1 unit, even after allowing for sampling error.
Inter-rater agreement in the scoring of abstracts submitted to a primary care research conference. Montgomery AA, Graham A, Evans PH, Fahey T. BMC Health Services Research 2002, 2:8 (26 March 2002) Background Checklists for peer review aim to guide referees when assessing the quality of papers, but little evidence exists on the extent to which referees agree when evaluating the same paper. The aim of this study was to investigate agreement on dimensions of a checklist between two referees when evaluating abstracts submitted for a primary care conference. Methods Anonymised abstracts were scored using a structured assessment comprising seven categories. Between one (poor) and four (excellent) marks were awarded for each category, giving a maximum possible score of 28 marks. Every abstract was assessed independently by two referees and agreement measured using intraclass correlation coefficients. Mean total scores of abstracts accepted and rejected for the meeting were compared using an unpaired t test. Results Of 52 abstracts, agreement between reviewers was greater for three components relating to study design (adjusted intraclass correlation coefficients 0.40 to 0.45) compared to four components relating to more subjective elements such as the importance of the study and likelihood of provoking discussion (0.01 to 0.25). Mean score for accepted abstracts was significantly greater than those that were rejected (17.4 versus 14.6, 95% CI for difference 1.3 to 4.1, p = 0.0003). Conclusions The findings suggest that inclusion of subjective components in a review checklist may result in greater disagreement between reviewers. However in terms of overall quality scores, abstracts accepted for the meeting were rated significantly higher than those that were rejected.
6.5 Odds ratio and relative risk
Both the odds ratio and the relative risk compare the likelihood of an event between two groups. Consider the following data on survival of passengers on the Titanic. There were 462 female passengers: 308 survived and 154 died. There were 851 male passengers: 142 survived and 709 died (see table below).
Sex       Alive   Dead   Total
Female      308    154     462
Male        142    709     851
Total       450    863   1,313
If you saw the movie, Leonardo DiCaprio was one of the 709 male fatalities, and Kate Winslet was one of the 308 female survivors.
Clearly, a male passenger on the Titanic was more likely to die than a female passenger. But how much more likely? You can compute the odds ratio or the relative risk to answer this question.
The odds ratio compares the relative odds of death in each group. For females, the odds were exactly 2 to 1 against dying (154/308=0.5). For males, the odds were almost 5 to 1 in favor of death (709/142=4.993). The odds ratio is 9.986 (4.993/0.5). The odds of death were about ten times greater for males than for females.
The relative risk (sometimes called the risk ratio) compares the probability of death in each group rather than the odds. For females, the probability of death is 33% (154/462=0.3333). For males, the probability is 83% (709/851=0.8331). The relative risk of death is 2.5 (0.8331/0.3333). The probability of death was 2.5 times greater for males than for females.
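If you would like to check this arithmetic yourself, here is a short sketch in Python using the Titanic counts given above. The function name odds_ratio_and_relative_risk is just a label I made up for this illustration.

```python
def odds_ratio_and_relative_risk(events_a, non_events_a, events_b, non_events_b):
    """Compare group A to group B using counts from a two by two table."""
    odds_a = events_a / non_events_a
    odds_b = events_b / non_events_b
    risk_a = events_a / (events_a + non_events_a)
    risk_b = events_b / (events_b + non_events_b)
    return odds_a / odds_b, risk_a / risk_b


# Deaths are the events. Males: 709 died, 142 survived.
# Females: 154 died, 308 survived.
odds_ratio, relative_risk = odds_ratio_and_relative_risk(709, 142, 154, 308)
print(f"odds ratio = {odds_ratio:.3f}")
print(f"relative risk = {relative_risk:.3f}")
```

The two measures start from exactly the same four counts. The only difference is whether you divide deaths by survivors (odds) or deaths by the group total (probability).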
There is quite a difference. Both measurements show that men were more likely to die. But the odds ratio implies that men are much worse off than the relative risk does. Which number is a fairer comparison?
The relative risk measures events in a way that is interpretable and consistent with the way people really think. The odds ratio is a bit trickier, since the only people who seem to understand odds well are people who bet on horse races. The big advantage of the odds ratio is its flexibility. For certain research designs, such as a case-control design, you can compute and interpret an odds ratio easily, but a relative risk would be meaningless. You can also easily adjust an odds ratio for covariates.
Both the odds ratio and the relative risk are measures of relative change. Many researchers believe that measures of relative change paint an incomplete picture of risk. For example, cigarette smoking has a large effect on lung cancer. The figures vary a bit depending on the time frame and how you define smoking, but a reasonable estimate is that patients who smoke are ten times more likely to die from lung cancer than patients who do not smoke. Smoking also has an effect on cardiovascular disease. Patients who smoke are twice as likely to die from a heart attack as patients who do not smoke. This seems to imply that heart attacks are less of a problem than lung cancer, but when you actually tally the number of smokers who die from heart attacks, it ends up being greater than the number who die from lung cancer. That's because lung cancer is a relatively uncommon event among non-smokers, while heart attacks are more frequent. So a doubling of a common risk has more of a public health impact than a ten fold change in a rarer risk.
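Here is a small sketch in Python of that arithmetic. The relative risks of ten and two come from the paragraph above, but the baseline death rates among non-smokers are invented purely to show how the calculation works. They are not real epidemiological figures.

```python
# HYPOTHETICAL deaths per 100,000 non-smokers per year, invented only
# to illustrate the arithmetic. These are not real data.
baseline_per_100k = {"lung cancer": 10, "heart attack": 200}

# Relative risks for smokers, as quoted in the text.
relative_risk = {"lung cancer": 10, "heart attack": 2}

excess_per_100k = {}
for cause, baseline in baseline_per_100k.items():
    smoker_rate = baseline * relative_risk[cause]
    excess_per_100k[cause] = smoker_rate - baseline
    print(f"{cause}: {smoker_rate} per 100,000 smokers, "
          f"{excess_per_100k[cause]} more than among non-smokers")
```

With these made-up baselines, the ten fold relative risk adds fewer deaths than the two fold relative risk, simply because it multiplies a much smaller starting rate.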
In contrast to measures of relative change, which involve computing ratios, researchers are now encouraging the use of measures of absolute change, such as risk difference or the number needed to treat. Absolute change involves the computation of a difference rather than a ratio.
The number needed to treat represents the number of patients you would typically have to treat with a new therapy in order to see one additional success compared to the traditional therapy. A low number, like 3, tells you that you will see a lot of extra successes in a short amount of time if you adopt the new therapy. A high number, like 200, means that you will have to treat a lot of patients with the new therapy before you will even see a handful of extra successes.
You can also compute this quantity for adverse effects, such as side effects. In this case, the quantity is usually called the number needed to harm (NNH). A large number is good, because it means that if you give the new therapy to a large number of patients, you will only encounter a few extra side effects. A small number, of course, means that you will see a lot of extra side effects if you adopt the new therapy.
To compute the NNT or NNH, you need to subtract the rate in the treatment group from the rate in the control group and then invert it (divide the difference into 1).
A recently published article on the flu vaccine showed that among the children who received a placebo, 17.9% later had culture confirmed influenza. In the vaccine group, the rate was only 1.3%. This is a 16.6% absolute difference. When you invert this percentage (1/0.166), you get NNT = 6. This means that for every six kids who get the vaccine, you will see one fewer case of flu on average.
The study also looked at the rate of side effects. In the vaccine group, 1.9% developed a fever. Only 0.8% of the controls developed a fever. This is an absolute difference of 1.1%. When you invert this percentage, you get NNH = 90. This means that for every 90 kids who get the vaccine, you will see one additional fever on average.
Sometimes the ratio between NNT and NNH can prove informative. For this study,
NNH / NNT = 90 / 6 = 15.
This tells you that you should expect to see one additional fever for every fifteen cases of flu prevented.
Although I am not a medical expert, the vaccine looks very promising because you can prevent a lot of flu events and only have to put up with a few additional fevers. In general, it takes medical judgment to assess the trade-offs between the benefits of a treatment and its side effects. The NNT and NNH calculations allow you to assess these trade-offs.
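The NNT and NNH calculations for the flu vaccine example can be sketched in a few lines. The four rates come from the text; the function name number_needed is my own. The code prints unrounded values, which are slightly larger than the whole numbers (6 and 90) reported above.

```python
def number_needed(rate_a, rate_b):
    """Invert the absolute difference between two rates (given as proportions)."""
    return 1.0 / abs(rate_a - rate_b)

# Rates quoted in the text for the flu vaccine study.
nnt = number_needed(0.179, 0.013)   # flu: placebo rate versus vaccine rate
nnh = number_needed(0.019, 0.008)   # fever: vaccine rate versus control rate

print(f"NNT = {nnt:.1f}")
print(f"NNH = {nnh:.1f}")
print(f"NNH / NNT = {nnh / nnt:.1f}")
```

Whether you round these values up, down, or to the nearest whole number is a matter of convention, so do not be surprised if a paper reports a value that is off by one from your own calculation.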
6.6 Correlation
A correlation is a measure of the degree of association between two variables.
The correlation coefficient is always between -1 and +1. The closer the correlation is to +/-1, the closer to a perfect linear relationship. Here is how I tend to interpret correlations.
-1.0 to -0.7 strong negative association.
-0.7 to -0.3 weak negative association.
-0.3 to +0.3 little or no association.
+0.3 to +0.7 weak positive association.
+0.7 to +1.0 strong positive association.
It's not a perfect rule, and I might stretch the limits a bit depending on the particular problem at hand.
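If you want to see how a correlation is computed and then labeled with these rough cutoffs, here is a minimal sketch. The data values are HYPOTHETICAL and exist only to exercise the code. My table does not say which label applies to a value that sits exactly on a boundary (such as 0.3 or 0.7), so the code assigns it to the stronger category; that choice is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Label a correlation using the rough cutoffs listed above."""
    if abs(r) >= 0.7:
        strength = "strong"
    elif abs(r) >= 0.3:
        strength = "weak"
    else:
        return "little or no association"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} association"

# HYPOTHETICAL data, invented only to exercise the functions.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

r = pearson_r(x, y)
print(f"r = {r:.2f}: {describe(r)}")
```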
Here's an example. A data set included in the Data and Story Library (http://lib.stat.cmu.edu/DASL) measures the 1960 crime rates for 47 states along with a variety of demographic factors. The causes of crime are complex and you cannot draw any valid inferences based on the few graphs presented below. Nevertheless, these graphs illustrate the concept of strong and weak correlation. For example, there is a strong relationship between police budgets and crime levels. States with more crime have to spend more on police protection.
There is a weak relationship between education level and crime. States with higher average levels of education do tend to have more crime, but the relationship is more uncertain here.
Finally, there is little or no relationship between unemployment rate and crime.
You should always be cautious about correlations because a large correlation between two variables does not mean that the first variable is the cause of the second. Perhaps it is the second variable that causes the first instead. Someone looking at the first graph might conclude that spending less money on police protection would lead to a lower crime rate. That's similar to the story of the statistician who was reviewing records of a fire department and noticed that the more fire engines you sent to the site of a fire, the more damage they caused.
Another problem with a correlation is that it does not take into account additional factors that might represent the underlying cause of the relationship. For example, a study of life expectancies in 40 different countries (Rossman 1994) noted a strong relationship between life expectancy and the number of television sets per capita. The surprising relationship was that more television sets were associated with longer lives. It turns out that both availability of consumer goods like televisions and a country's life expectancy were related to a third variable, the wealth of that country. Countries that could afford to buy lots of televisions could also afford to buy adequate health care for their people.
Another example of a misleading correlation appears in a study of patients with Parkinson's disease (Cosentino 2005). The researchers noted a positive association between a particular medication, levodopa, and the number of times that the patients visited their doctor over the past year. This association, they noted, could be explained by the fact that patients using levodopa in combination with alternate drugs or using alternate drugs alone tended to be much younger.
6.7 Survival curves
Survival data models help you interpret data that represent the time until an event occurs. In many situations, the event is death, but it can also be another bad event such as cancer relapse or failure of a medical device. These models can also be used for the time to positive events such as pregnancy.
Survival data models also incorporate one of the complexities of "time to event" data, the fact that not all patients experience the event during the time frame of the study. So if we are doing a five year mortality study, we have the problem of those stubborn patients who refuse to die during the study period. Other patients may move out of town halfway through the study and are lost to follow-up. In a study of medical devices, sometimes the device continues to work up to a certain time, but then has to be removed, not because the device failed, but because the patient got healthier and no longer needed the device.
These observations are called censored observations. With censored observations, the actual time of the event is unknown but we do know that it would not be any earlier than the time that the last evaluation or follow-up visit was done. These censored observations provide partial information. They influence our estimates of survival probability up to the last evaluation or follow-up, but do not provide any information about survival probabilities beyond that point. To disregard this information is dangerous and could seriously bias your results.
The following data represents survival time for a group of fruit flies and is a subset of a larger data set found on the Chance web site. There are 25 flies in the sample, so the survival probability decreases by 4% (1/25) every time a fly dies.
You have to make some common sense adjustments for ties in the data (when four flies all die on the 47th day, the survival probability declines by 16% not 4%) but otherwise the probabilities are quite easy to compute. Here's a graph of these probabilities over time.
By tradition and for some rather technical reasons, you should use a stair step pattern rather than a diagonal line to connect adjacent survival probabilities, but this does not seriously change the pattern shown.
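The calculation for the first experiment, including the adjustment for ties, can be sketched as follows. The death days in this example are HYPOTHETICAL; they are not the values from the Chance data set, and I chose them only so that four flies die on day 47, as in the tie described above.

```python
from collections import Counter

def survival_steps(death_days):
    """Return (day, survival probability) pairs when every death is observed."""
    n = len(death_days)
    deaths_per_day = Counter(death_days)
    alive = n
    steps = []
    for day in sorted(deaths_per_day):
        alive -= deaths_per_day[day]      # tied deaths are removed together
        steps.append((day, alive / n))
    return steps

# HYPOTHETICAL death days for 25 flies (not the actual data);
# four flies die on day 47 to mimic the tie described in the text.
death_days = [37, 40, 47, 47, 47, 47, 50, 54, 56, 58, 59, 60, 62, 65, 68,
              70, 72, 75, 78, 81, 84, 87, 90, 93, 96]

for day, prob in survival_steps(death_days):
    print(f"day {day}: {prob:.2f}")
```

With these made-up values, the survival probability falls by 4% at most death days, but by 16% on day 47 when four flies die together.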
Now let's alter the experiment. Suppose that totally by accident, a technician leaves the screen cover open on day 70 and all the flies escape. This includes the poor fly who was going to die on the afternoon of the 70th day anyway. You might be tempted to scrap the whole experiment, but really what you have is pretty complete information on survival of the fruit flies up to their 70th day of life. Here's how you would present the data and estimate the survival probabilities.
We clearly have enough data to make several important statements about survival probability. For example, the median survival time is 62 days because roughly half of the flies had died before this day.
Here is a graph of the survival probabilities of the second experiment. The plus sign on the graph at day 70 is an indication of censored data by the software that drew this graph (SPSS version 13). This graph is identical to the graph in the first experiment up to day 70 after which you can no longer estimate survival probabilities.
By the way, you might be tempted to ignore the ten flies who escaped. But that would seriously bias your results. All of these flies were survivors who lived well beyond the median day of death. If you pretended that they didn't exist, you would seriously underestimate the survival probabilities. The median survival time of the 15 flies who did not escape, for example, is only 54 days, which is much smaller than the actual median.
Let's look at a third experiment, where the screen cover is left open and all but four of the remaining flies escape. It turns out that those four remaining flies who didn't bug out will allow us to still get reasonable estimates of survival probabilities beyond 70 days. Here is the data and the survival probabilities.
What you need to do is to allocate the remaining 40% survival probability evenly among the four remaining flies. These flies become more important, as each death accounts for a 10% decline in survival probability rather than just a 4% decline at earlier dates.
Another way of looking at this is that the six flies who escaped influence the denominator of the survival probabilities up to day 70 and then totally drop out of the calculations for any further survival probabilities. Because the denominator has been reduced, the jumps at each remaining death are much larger.
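Here is a minimal sketch of this calculation, which statisticians call the Kaplan-Meier (or product-limit) estimate. The function is my own illustration, not the output of any particular package, and the death days are HYPOTHETICAL values chosen to mimic the third experiment: 15 deaths before day 70, six flies censored on day 70, and four flies followed until death.

```python
def kaplan_meier(times, observed):
    """Product-limit survival estimate.

    times    -- day of death or of last follow-up for each subject
    observed -- True if the death was seen, False if the subject was censored
    Returns a list of (day, survival probability), one entry per death day.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for day in sorted(set(times)):
        deaths = sum(1 for t, d in zip(times, observed) if t == day and d)
        censored = sum(1 for t, d in zip(times, observed) if t == day and not d)
        if deaths:
            # Multiply by the fraction of those still at risk who survive this day.
            survival *= (at_risk - deaths) / at_risk
            curve.append((day, survival))
        # Deaths and censored subjects both leave the risk set (the denominator).
        at_risk -= deaths + censored
    return curve

# HYPOTHETICAL data mimicking the third experiment (not the actual data).
early_deaths = [37, 40, 47, 47, 47, 47, 50, 54, 56, 58, 59, 60, 62, 65, 68]
late_deaths = [75, 81, 87, 96]
times = early_deaths + [70] * 6 + late_deaths
observed = [True] * 15 + [False] * 6 + [True] * 4

curve = kaplan_meier(times, observed)
for day, prob in curve:
    print(f"day {day}: {prob:.2f}")
```

In this sketch the estimate stands at 40% just before day 70. Once the six escaped flies leave the denominator, each of the four remaining deaths lowers the estimate by 10 percentage points.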
Here is a graph of the survival probability estimates from the third experiment.
If you look at the survival probability estimates in the third experiment, they differ only slightly from the survival probabilities in the original experiment. This works out because the mechanism that caused us to lose information on six of the fruit flies was independent of their ultimate survival.
If the censoring mechanism were somehow related to survival prognosis, then you would have the possibility of serious bias in your estimates. Suppose for example, that only the toughest of flies (those with the most days left in their short lives) would have been able to escape. Then these censored values would not be randomly interspersed among the remaining survival times, but would constitute some of the larger values. But since these larger values would remain unobserved, you would underestimate survival probabilities beyond the 70th day.
This is known as informative censoring, and it happens more often than you might expect. Suppose someone drops out of a cancer mortality study because they are abandoning the drugs being studied in favor of laetrile treatments down in Mexico. Usually, this is a sign that the current drugs are not working well, so a censored observation here might represent a patient with a poorer prognosis. Excluding these patients would lead to an overestimate of survival probabilities.
When you see a survival curve in a research paper, there are two ways to interpret it. First, you can get an estimate of the median (or other percentiles) by projecting horizontally until you intersect with the survival curve and then head down to get your estimate. In the survival curve we have just looked at, you would estimate the median survival as slightly more than 60 days.
You can also estimate probabilities for survival at any given time by projecting up from the time and then moving to the left to estimate the probability. In the example below, you can see that the 80 day survival probability is a little bit less than 25 percent.
6.8 Prevalence and incidence
Prevalence and incidence are two measures of how commonly certain diseases are found in a population. They measure two very different dimensions of the disease process, but the distinction can sometimes be quite subtle.
Incident cases of disease represent all cases of the disease that appear during a specific time interval. An example of an incidence would be the number of breast cancer patients newly diagnosed during the past year. Prevalent cases represent the number of cases alive in the population at a specific time point. An example of a prevalence would be all breast cancer patients who are alive on the first day of the current year.
Incidence involves units of time, such as patient-months. For example, in Smeeth et al 2004, the incidence of autism is reported as increasing from 0.40/10,000 person-years (95% CI 0.30 to 0.54) in 1991 to 2.98/10,000 (95% CI 2.56 to 3.47) in 2001. By contrast, prevalence is simply a count and is usually expressed as a percentage or as the number of cases per 10,000. Here are two examples. Pinborg et al 2004 report that the crude prevalence rates per 1000 of neurological sequelae in twins and singletons after assisted conception and in naturally conceived twins were 8.8, 8.2, and 9.6, and of cerebral palsy 3.2, 2.5, and 4.0, respectively. Mauldin et al 2004 report that rheumatoid arthritis (RA) / juvenile rheumatoid arthritis (JRA) was the most frequent diagnosis given; the prevalence rate for JRA in the Oklahoma City Area was estimated as 53 per 100,000 individuals at risk, while in the Billings Area, the estimated prevalence was nearly twice that, at 115 per 100,000.
Incidence and prevalence can lead to very different answers, because the probability of finding a case at a given point in time is related to mortality risk. Those patients who have a mild form of disease and survive for a relatively long time have a good chance of being around on the date that you go looking for them. Those patients who die quickly are unlikely to be around on the date that you go looking for them.
Let's consider an example with simulated data.
The lines on this graph represent the duration of disease with the left endpoint representing the date that the disease was first diagnosed and the right endpoint representing the date that the patient died. The line segments are ordered from the time of initial diagnosis with patients diagnosed in 1999 and 2000 at the bottom of the graph and patients diagnosed in 2003 and 2004 at the top of the graph.
This graph represents a selection of prevalent cases, and the green lines represent those patients who were alive on January 1, 2002.
This graph represents incident cases, and the green lines represent those patients newly diagnosed with the disease between January 1, 2001 and December 31, 2003.
The prevalent cases include very few patients with short survival time, compared to the incident cases. This becomes more apparent when you reorder the patients by survival time.
In this graph, the patients with the shortest survival times appear at the bottom of the graph and the patients with the longest survival times appear at the top. Notice how rarely the patients with short survival times appear among the prevalent cases.
This graph shows the incident cases with the patients again sorted by survival time. Notice that the incident cases include a fair number of patients with short survival times.
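You can see the same pattern in a small sketch. The patients below are HYPOTHETICAL (they are not the simulated data behind the graphs), and the function names are my own. Dates are written as decimal years to keep the code short, and the two selection rules follow the definitions used above: alive on January 1, 2002 for the prevalent cases, and newly diagnosed from 2001 through 2003 for the incident cases.

```python
# HYPOTHETICAL patients: (date of diagnosis in decimal years, years survived).
patients = [
    (1999.2, 0.3), (1999.5, 4.0), (1999.9, 0.5), (2000.3, 2.5),
    (2000.8, 0.2), (2001.1, 0.4), (2001.4, 3.0), (2001.7, 0.2),
    (2002.2, 0.6), (2002.5, 2.0), (2002.9, 0.2), (2003.3, 1.5),
    (2003.6, 0.4), (2003.9, 0.3),
]

def prevalent(cases, date):
    """Cases diagnosed on or before `date` and still alive on that date."""
    return [(dx, surv) for dx, surv in cases if dx <= date < dx + surv]

def incident(cases, start, end):
    """Cases newly diagnosed in the interval from `start` up to (not including) `end`."""
    return [(dx, surv) for dx, surv in cases if start <= dx < end]

def mean_survival(cases):
    """Average survival time, in years, for a list of cases."""
    return sum(surv for _, surv in cases) / len(cases)

prev = prevalent(patients, 2002.0)          # alive on January 1, 2002
inc = incident(patients, 2001.0, 2004.0)    # diagnosed 2001 through 2003

print(f"prevalent cases: {len(prev)}, mean survival {mean_survival(prev):.1f} years")
print(f"incident cases:  {len(inc)}, mean survival {mean_survival(inc):.1f} years")
```

With these made-up patients, the prevalent cases are all long survivors, while the incident cases include many patients who died within a few months of diagnosis.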
On your own
1. Review the following abstracts. Specify what the sample is and define what you think is a reasonable population that this research is trying to generalize to.
2. Interpret the confidence intervals reported in the same set of abstracts. Specify a range of clinical indifference as best you can and interpret these intervals with respect to that range.
3. Elevated white cell count in acute coronary syndromes: relationship to variants in inflammatory and thrombotic genes. Byrne CE, Fitzgerald A, Cannon CP, Fitzgerald DJ, Shields DC. BMC Medical Genetics 2004, 5:13 (1 June 2004) Background Elevated white blood cell counts (WBC) in acute coronary syndromes (ACS) increase the risk of recurrent events, but it is not known if this is exacerbated by pro-inflammatory factors. We sought to identify whether pro-inflammatory genetic variants contributed to alterations in WBC and C-reactive protein (CRP) in an ACS population. Methods WBC and genotype of interleukin 6 (IL-6 G-174C) and of interleukin-1 receptor antagonist (IL1RN intronic repeat polymorphism) were investigated in 732 Caucasian patients with ACS in the OPUS-TIMI-16 trial. Samples for measurement of WBC and inflammatory factors were taken at baseline, i.e. within 72 hours of an acute myocardial infarction or an unstable angina event. Results An increased white blood cell count (WBC) was associated with an increased C-reactive protein (r = 0.23, p < 0.001) and there was also a positive correlation between levels of β-fibrinogen and C-reactive protein (r = 0.42, p < 0.0001). IL1RN and IL6 genotypes had no significant impact upon WBC. The difference in median WBC between the two homozygote IL6 genotypes was 0.21/mm3 (95% CI = -0.41, 0.77), and -0.03/mm3 (95% CI = -0.55, 0.86) for IL1RN. Moreover, the composite endpoint was not significantly affected by an interaction between WBC and the IL1 (p = 0.61) or IL6 (p = 0.48) genotype. Conclusions Cytokine pro-inflammatory genetic variants do not influence the increased inflammatory profile of ACS patients.
4. Effect of paper quality on the response rate to a postal survey: A randomised controlled trial. [ISRCTN32032031]. Clark TJ, Khan KS, Gupta JK. BMC Medical Research Methodology 2001, 1:12 (17 December 2001) Background Response rates to surveys are declining and this threatens the validity and generalisability of their findings. We wanted to determine whether paper quality influences the response rate to postal surveys. Methods A postal questionnaire was sent to all members of the British Society of Gynaecological Endoscopy (BSGE). Recipients were randomised to receiving the questionnaire printed on standard quality paper or high quality paper. Results The response rate for the recipients of high quality paper was 43/195 (22%) and 57/194 (29%) for standard quality paper (relative rate of response 0.75, 95% CI 0.33-1.05, p = 0.1). Conclusion The use of high quality paper did not increase response rates to a questionnaire survey of gynaecologists affiliated to an endoscopic society.
6. Effect of paracetamol (acetaminophen) and ibuprofen on body temperature in acute ischemic stroke PISA, a phase II double-blind, randomized, placebo-controlled trial [ISRCTN98608690]. Dippel DWJ, van Breda EJ, van der Worp HB, van Gemert HMA, Meijer RJ, Kappelle LJ, Koudstaal PJ, the PISA-investigators. BMC Cardiovascular Disorders 2003, 3:2 (6 February 2003) Background Body temperature is a strong predictor of outcome in acute stroke. In a previous randomized trial we observed that treatment with high-dose acetaminophen (paracetamol) led to a reduction of body temperature in patients with acute ischemic stroke, even when they had no fever. The purpose of the present trial was to study whether this effect of acetaminophen could be reproduced, and whether ibuprofen would have a similar, or even stronger effect. Methods Seventy-five patients with acute ischemic stroke confined to the anterior circulation were randomized to treatment with either 1000 mg acetaminophen, 400 mg ibuprofen, or placebo, given 6 times daily during 5 days. Treatment was started within 24 hours from the onset of symptoms. Body temperatures were measured at 2-hour intervals during the first 24 hours, and at 6-hour intervals thereafter. Results No difference in body temperature at 24 hours was observed between the three treatment groups. However, treatment with high-dose acetaminophen resulted in a 0.3°C larger reduction in body temperature from baseline than placebo treatment (95% CI: 0.0 to 0.6 °C). Acetaminophen had no significant effect on body temperature during the subsequent four days compared to placebo, and ibuprofen had no statistically significant effect on body temperature during the entire study period. Conclusions Treatment with a daily dose of 6000 mg acetaminophen results in a small, but potentially worthwhile decrease in body temperature after acute ischemic stroke, even in normothermic and subfebrile patients. 
Further large randomized clinical trials are needed to study whether early reduction of body temperature leads to improved outcome.
7. Effects of carrying a pregnancy and of method of delivery on urinary incontinence: a prospective cohort study. Eason E, Labrecque M, Marcoux S, Mondor M. BMC Pregnancy and Childbirth 2004, 4:4 (19 February 2004) Background This study was carried out to identify risk factors associated with urinary incontinence in women three months after giving birth. Methods Urinary incontinence before and during pregnancy was assessed at study enrolment early in the third trimester. Incontinence was re-assessed three months postpartum. Logistic regression analysis was used to assess the role of maternal and obstetric factors in causing postpartum urinary incontinence. This prospective cohort study in 949 pregnant women in Quebec, Canada was nested within a randomised controlled trial of prenatal perineal massage. Results Postpartum urinary incontinence was increased with prepregnancy incontinence (adjusted odds ratio [adjOR] 6.44, 95% CI 4.15, 9.98), incontinence beginning during pregnancy (adjOR 1.93, 95% CI 1.32, 2.83), and higher prepregnancy body mass index (adjOR 1.07/unit of BMI, 95% CI 1.03, 1.11). Caesarean section was highly protective (adjOR 0.27, 95% CI 0.14, 0.50). While there was a trend towards increasing incontinence with forceps delivery (adjOR 1.73, 95% CI 0.96, 3.13) this was not statistically significant. The weight of the baby, episiotomy, the length of the second stage of labour, and epidural analgesia were not predictive of urinary incontinence. Nor was prenatal perineal massage, the randomised controlled trial intervention. When the analysis was limited to women having their first vaginal birth, the same risk factors were important, with similar adjusted odds ratios. Conclusions Urinary incontinence during pregnancy is extremely common, affecting over half of pregnant women.
Urinary incontinence beginning during pregnancy roughly doubles the likelihood of urinary incontinence at 3 months postpartum, regardless whether delivery is vaginal or by Caesarean section.
8. Breastfeeding practices in a cohort of inner-city women: the role of contraindications. England L, Brenner R, Bhaskar B, Simons-Morton B, Das A, Revenis M, Mehta N, Clemens J. BMC Public Health 2003, 3:28 (20 August 2003) Background Little is known about the role of breastfeeding contraindications in breastfeeding practices. Our objectives were to 1) identify predictors of breastfeeding initiation and duration among a cohort of predominately low-income, inner-city women, and 2) evaluate the contribution of breastfeeding contraindications to breastfeeding practices. Methods Mother-infant dyads were systematically selected from 3 District of Columbia hospitals between 1995 and 1996. Breastfeeding contraindications and potential predictors of breastfeeding practices were identified through medical record reviews and interviews conducted after delivery (baseline). Interviews were conducted at 3-7 months postpartum and again at 7-12 months postpartum to determine breastfeeding initiation rates and duration. Multivariable logistic regression analysis was used to identify baseline factors associated with initiation of breastfeeding. Cox proportional hazards models were generated to identify baseline factors associated with duration of breastfeeding. Results Of 393 study participants, 201 (51%) initiated breastfeeding. A total of 61 women (16%) had at least one documented contraindication to breastfeeding; 94% of these had a history of HIV infection and/or cocaine use. Of the 332 women with no documented contraindications, 58% initiated breastfeeding, vs. 13% of women with a contraindication. In adjusted analysis, factors most strongly associated with breastfeeding initiation were presence of a contraindication (adjusted odds ratio [AOR], 0.19; 95% confidence interval [CI], 0.08-0.47), and mother foreign-born (AOR, 4.90; 95% CI, 2.38-10.10).
Twenty-five percent of study participants who did not initiate breastfeeding cited concern about passing dangerous things to their infants through breast milk. Factors associated with discontinuation of breastfeeding (all protective) included mother foreign-born (hazard ratio [HR], 0.55; 95% CI 0.39-0.77), increasing maternal age (HR for 5-year increments, 0.80; 95% CI, 0.69-0.92), and infant birth weight ≥ 2500 grams (HR, 0.45; 95% CI, 0.26-0.80). Conclusions Breastfeeding initiation rates and duration were suboptimal in this inner-city population. Many women who did not breastfeed had contraindications and/or were concerned about passing dangerous things to their infants through breast milk. It is important to consider the prevalence of contraindications to breastfeeding when evaluating breastfeeding practices in high-risk communities.
9. Randomised controlled trial of a theoretically grounded tailored intervention to diffuse evidence-based public health practice [ISRCTN23257060]. Forsetlund L, Bradley P, Forsen L, Nordheim L, Jamtvedt G, Bjørndal A. BMC Medical Education 2003, 3:2 (13 March 2003) Background Previous studies have shown that Norwegian public health physicians do not systematically and explicitly use scientific evidence in their practice. They work in an environment that does not encourage the integration of this information in decision-making. In this study we investigate whether a theoretically grounded tailored intervention to diffuse evidence-based public health practice increases the physicians' use of research information. Methods 148 self-selected public health physicians were randomised to an intervention group (n = 73) and a control group (n = 75). The intervention group received a multifaceted intervention while the control group received a letter declaring that they had access to library services. Baseline assessments before the intervention and post-testing immediately at the end of a 1.5-year intervention period were conducted. The intervention was theoretically based and consisted of a workshop in evidence-based public health, a newsletter, access to a specially designed information service, to relevant databases, and to an electronic discussion list. The main outcome measure was behaviour as measured by the use of research in different documents. Results The intervention did not demonstrate any evidence of effects on the objective behaviour outcomes. We found, however, a statistical significant difference between the two groups for both knowledge scores: Mean difference of 0.4 (95% CI: 0.2-0.6) in the score for knowledge about EBM-resources and mean difference of 0.2 (95% CI: 0.0-0.3) in the score for conceptual knowledge of importance for critical appraisal.
There were no statistical significant differences in attitude-, self-efficacy-, decision-to-adopt- or job-satisfaction scales. There were no significant differences in Cochrane library searching after controlling for baseline values and characteristics. Conclusion Though demonstrating effect on knowledge the study failed to provide support for the hypothesis that a theory-based multifaceted intervention targeted at identified barriers will change professional behaviour.
10. Family structure and risk factors for schizophrenia: case-sibling study. Haukka JK, Suvisaari J, Lonnqvist J. BMC Psychiatry 2004, 4:41 (27 November 2004). Background Several family structure-related factors, such as birth order, family size, parental age, and age differences to siblings, have been suggested as risk factors for schizophrenia. We examined how family-structure-related variables modified the risk of schizophrenia in Finnish families with at least one child with schizophrenia born from 1950 to 1976. Methods We used case-sibling design, a variant of the matched case-control design in the analysis. Patients hospitalized for schizophrenia between 1969 and 1996 were identified from the Finnish Hospital Discharge Register, and their families from the Population Register Center. Only families with at least two children (7914 sibships and 21059 individuals) were included in the analysis. Conditional logistic regression with sex, birth cohort, maternal schizophrenia status, and several family-related variables as explanatory variables was used in the case-sibling design. The effect of variables with the same value in each sibship was analyzed using ordinary logistic regression. Results Having a sibling who was less than five years older (OR 1.46, 95% CI 1.29-1.66), or being the firstborn (first born vs. second born 1.62, 1.87-1.4) predicted an elevated risk, but having siblings who were more than ten years older predicted a lower risk (0.66, 0.56-0.79). Conclusions Several family-structure-related variables were identified as risk factors for schizophrenia. The underlying causative mechanisms are likely to be variable.
11. Overweight, obesity, and colorectal cancer screening: Disparity between men and women. Heo M, Allison DB, Fontaine KR. BMC Public Health 2004, 4:53 (8 November 2004) Background To estimate the association between body-mass index (BMI: kg/m2) and colorectal cancer (CRC) screening among US adults aged ≥ 50 years. Methods Population-based data from the 2001 Behavioral Risk Factor Surveillance Survey. Adults (N = 84,284) aged ≥ 50 years were classified by BMI as normal weight (18.5-<25), overweight (25-<30), obesity class I (30-<35), obesity class II (35-<40), and obesity class III (≥ 40). Interval since most recent screening fecal occult blood test (FOBT): (0 = >1 year since last screening vs. 1 = screened within the past year), and screening sigmoidoscopy (SIG): (0 = > 5 years since last screening vs. 1 = within the past 5 years) were the outcomes. Results Results differed between men and women. After adjusting for age, health insurance, race, and smoking, we found that, compared to normal weight men, men in the overweight (odds ratio [OR] 1.25, 95% CI = 1.05-1.51) and obesity class I (OR = 1.21, 95% CI = 1.03-1.75) categories were more likely to have obtained a screening SIG within the previous 5 years, while women in the obesity class I (OR = 0.86, 95% CI = 0.78-0.94) and II (OR = 0.88, 95% CI = 0.79-0.99) categories were less likely to have obtained a screening SIG compared to normal weight women. BMI was not associated with FOBT. Conclusion Weight may be a correlate of CRC screening behavior but in a different way between men and women.
12. A national survey on the patterns of treatment of inflammatory bowel disease in Canada. Hilsden RJ, Verhoef MJ, Best A, Pocobelli G. BMC Gastroenterology 2003, 3:10 (5 June 2003) Background There is a general lack of information on the care of inflammatory bowel disease (IBD) in a broad, geographically diverse, non-clinic population. The purposes of this study were (1) to compare a sample drawn from the membership of a national Crohn's and Colitis Foundation to published clinic-based and population-based IBD samples, (2) to describe current patterns of health care use, and (3) to determine if unexpected variations exist in how and by whom IBD is treated. Methods Mailed survey of 4453 members of the Crohn's and Colitis Foundation of Canada. The questionnaire, in members stated language of preference, included items on demographic and disease characteristics, general health behaviors and current and past IBD treatment. Each member received an initial and one reminder mailing. Results Questionnaires were returned by 1787, 913, and 128 people with Crohn's disease, ulcerative colitis and indeterminate colitis, respectively. At least one operation had been performed on 1159 Crohn's disease patients, with risk increasing with duration of disease. Regional variation in surgical rates in ulcerative colitis patients was identified. 6-Mercaptopurine/Azathioprine was used by 24% of patients with Crohn's disease and 12% of patients with ulcerative colitis (95% CI for the difference: 8.9%-15%). In patients with Crohn's disease, use was not associated with gender, income or region of residence but was associated with age and markers of disease activity. Infliximab was used by 112 respondents (4%), the majority of whom had Crohn's disease. Variations in infliximab use based on region of residence and income were not seen. Sixty-eight percent of respondents indicated that they depended most on a gastroenterologist for their IBD care.
There was significant regional variation in this. However, satisfaction with primary physician did not depend on physician type (for example, gastroenterologist versus general practitioner). Conclusion This study achieved the goal of obtaining a large, geographically diverse sample that is more representative of the general IBD population than a clinic sample would have been. We could find no evidence of significant regional variation in medical treatments due to gender, region of residence or income level. Differences were noted between different age groups, which deserves further attention.
13. Do English and Chinese EQ-5D versions demonstrate measurement equivalence? an exploratory study. Luo N, Chew LH, Fong KY, Koh DR, Ng SC, Yoon KH, Vasoo S, Li SC, Thumboo J. Health and Quality of Life Outcomes 2003, 1:7 (17 April 2003) Background Although multiple language versions of health-related quality of life instruments are often used interchangeably in clinical research, the measurement equivalence of these versions (especially using alphabet vs pictogram-based languages) has rarely been assessed. We therefore investigated the measurement equivalence of English and Chinese versions of the EQ-5D, a widely used utility-based outcome instrument. Methods In a cross-sectional study, either EQ-5D version was administered to consecutive outpatients with rheumatic diseases. Measurement equivalence of EQ-5D item responses and utility and visual analog scale (EQ-VAS) scores between these versions was assessed using multiple regression models (with and without adjusting for potential confounding variables), by comparing the 95% confidence interval (95%CI) of score differences between these versions with pre-defined equivalence margins. An equivalence margin defined a magnitude of score differences (10% and 5% of entire score ranges for item responses and utility/EQ-VAS scores, respectively) which was felt to be clinically unimportant. Results Sixty-six subjects completed the English and 48 subjects the Chinese EQ-5D. The 95%CI of the score differences between these versions overlapped with but did not fall completely within pre-defined equivalence margins for 4 EQ-5D items, utility and EQ-VAS scores. For example, the 95%CI of the adjusted score difference between these EQ-5D versions was -0.14 to +0.03 points for utility scores and -11.6 to +3.3 points for EQ-VAS scores (equivalence margins of -0.05 to +0.05 and -5.0 to +5.0 respectively). Conclusion These data provide promising evidence for the measurement equivalence of English and Chinese EQ-5D versions.
14. Long term benzodiazepine use for insomnia in patients over the age of 60: discordance of patient and physician perceptions. Mah L, Upshur REG. BMC Family Practice 2002, 3:9 (8 May 2002) Background The aim of this study was to determine and compare patients' and physicians' perceptions of benefits and risks of long term benzodiazepine use for insomnia in the elderly. Methods A cross-sectional study (written survey) was conducted in an academic primary care group practice in Toronto, Canada. The participants were 93 patients over 60 years of age using a benzodiazepine for insomnia and 25 physicians comprising sleep specialists, family physicians, and family medicine residents. The main outcome measure was perception of benefit and risk scores calculated from the mean of responses (on a Likert scale of 1 to 5) to various items on the survey. Results The mean perception of benefit score was significantly higher in patients than physicians (3.85 vs. 2.84, p < 0.001, 95% CI 0.69, 1.32). The mean perception of risk score was significantly lower in patients than physicians (2.21 vs. 3.63, p < 0.001, 95% CI 1.07, 1.77). Conclusions There is a significant discordance between older patients and their physicians regarding the perceptions of benefits and risks of using benzodiazepines for insomnia on a long term basis. The challenge is to openly discuss these perceptions in the context of the available evidence to make collaborative and informed decisions.
16. Effect of prize draw incentive on the response rate to a postal survey of obstetricians and gynaecologists: A randomised controlled trial. [ISRCTN32823119] Moses SH, Clark TJ. BMC Health Services Research 2004, 4:14 (28 June 2004) Background Response rates to postal questionnaires are falling and this threatens the external validity of survey findings. We wanted to establish whether the incentive of being entered into a prize draw to win a personal digital assistant (PDA) would increase the response rate for a national survey of consultant obstetricians and gynaecologists. Methods A randomised controlled trial was conducted. This involved sending a postal questionnaire to all Consultant Obstetricians and Gynaecologists in the United Kingdom. Recipients were randomised to receiving a questionnaire offering a prize draw incentive (on response) or no such incentive. Results The response rate for recipients offered the prize incentive was 64% (461/716) and 62% (429/694) in the no incentive group (relative rate of response 1.04, 95% CI 0.96–1.13). Conclusion The offer of a prize draw incentive to win a PDA did not significantly increase response rates to a national questionnaire survey of consultant obstetricians and gynaecologists.
17. Predicting gender differences as latent variables: summed scores, and individual item responses: a methods case study. Pietrobon R, Taylor M, Guller U, Higgins LD, Jacobs DO, Carey T. Health and Quality of Life Outcomes 2004, 2:59 (25 October 2004) Background Modeling latent variables such as physical disability is challenging since its measurement is performed through proxies. This poses significant methodological challenges. The objective of this article is to present three different methods to predict latent variables based on classical summed scores, individual item responses, and latent variable models. Methods This is a review of the literature and data analysis using "layers of information". Data was collected from the North Carolina Back Pain Project, using a modified version of the Roland Questionnaire. Results The three models are compared in relation to their goals and underlying concepts, previous clinical applications, data requirements, statistical theory, and practical applications. Initial linear regression models demonstrated a difference in disability between genders of 1.32 points (95% CI 0.65, 2.00) on a scale from 0–23. Subsequent item analysis found contradictory results across items, with no clear pattern. Finally, IRT models demonstrated three items were demonstrated to present differential item functioning. After these items were removed, the difference between genders was reduced to 0.78 points (95% CI, -0.99, 1.23). These results were shown to be robust with re-sampling methods. Conclusions Purported differences in the levels of a latent variable should be tested using different models to verify whether these differences are real or simply distorted by model assumptions.
19. The outcome of extubation failure in a community hospital intensive care unit: a cohort study. Seymour CW, Martinez A, Christie JD, Fuchs BD. Critical Care 2004, 8:R322-R327 (20 July 2004) Introduction Extubation failure has been associated with poor intensive care unit (ICU) and hospital outcomes in tertiary care medical centers. Given the large proportion of critical care delivered in the community setting, our purpose was to determine the impact of extubation failure on patient outcomes in a community hospital ICU. Methods A retrospective cohort study was performed using data gathered in a 16-bed medical/surgical ICU in a community hospital. During 30 months, all patients with acute respiratory failure admitted to the ICU were included in the source population if they were mechanically ventilated by endotracheal tube for more than 12 hours. Extubation failure was defined as reinstitution of mechanical ventilation within 72 hours (n = 60), and the control cohort included patients who were successfully extubated at 72 hours (n = 93). Results The primary outcome was total ICU length of stay after the initial extubation. Secondary outcomes were total hospital length of stay after the initial extubation, ICU mortality, hospital mortality, and total hospital cost. Patient groups were similar in terms of age, sex, and severity of illness, as assessed using admission Acute Physiology and Chronic Health Evaluation II score (P > 0.05). Both ICU (1.0 versus 10 days; P < 0.01) and hospital length of stay (6.0 versus 17 days; P < 0.01) after initial extubation were significantly longer in reintubated patients. ICU mortality was significantly higher in patients who failed extubation (odds ratio = 12.2, 95% confidence interval [CI] = 1.5–101; P < 0.05), but there was no significant difference in hospital mortality (odds ratio = 2.1, 95% CI = 0.8–5.4; P < 0.15). Total hospital costs (estimated from direct and indirect charges) were significantly increased by a mean of US$33,926 (95% CI = US$22,573–45,280; P < 0.01). Conclusion Extubation failure in a community hospital is univariately associated with prolonged inpatient care and significantly increased cost. Corroborating data from tertiary care centers, these adverse outcomes highlight the importance of accurate predictors of extubation outcome.
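If you want to check a confidence interval from one of these abstracts yourself, here is a minimal sketch in Python. It does two things. First, it recomputes the relative rate of response in abstract 16 from the counts 461/716 and 429/694, using the usual log-scale (Wald) interval for a ratio of two proportions. That choice of method is my assumption; the abstract does not say how the authors computed their interval. Second, it checks whether the intervals quoted in abstract 13 fall completely inside the stated equivalence margins. The function names relative_rate_ci and inside_margin are made up for this illustration.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_rate_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Ratio of two proportions with a log-scale (Wald) confidence interval.

    Assumption: this is a common textbook method; the abstract does not
    state which method the authors actually used.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    ratio = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(ratio) for two independent proportions.
    se_log = sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = exp(log(ratio) - z * se_log)
    upper = exp(log(ratio) + z * se_log)
    return ratio, lower, upper


def inside_margin(ci, margin):
    """True only if the whole interval lies within the equivalence margin."""
    lower, upper = ci
    return margin[0] <= lower and upper <= margin[1]


# Abstract 16: prize draw incentive (461/716) versus no incentive (429/694).
rr, rr_low, rr_high = relative_rate_ci(461, 716, 429, 694)
print(f"Relative rate {rr:.2f} (95% CI {rr_low:.2f} to {rr_high:.2f})")

# Abstract 13: adjusted score differences versus the pre-defined margins.
utility_equivalent = inside_margin((-0.14, 0.03), (-0.05, 0.05))
vas_equivalent = inside_margin((-11.6, 3.3), (-5.0, 5.0))
print("Utility CI inside margin:", utility_equivalent)
print("EQ-VAS CI inside margin:", vas_equivalent)
```

The first line of output should agree with the relative rate and interval reported in abstract 16, an interval that includes 1. Both margin checks come back False, which matches the abstract's own statement that the intervals overlapped with the equivalence margins but did not fall completely within them.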
This webpage was written by Steve Simon on (unknown date), edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Statistical evidence