Children's Mercy Hospital

Stats #13: Computing an Appropriate Sample Size

One of your most critical choices in designing a research study is selecting an appropriate sample size. A sample size that is either too small or too large will be wasteful of resources and will raise ethical concerns.

This class will provide hands-on computer experience using a Microsoft Excel spreadsheet for sample size calculations. Please bring a copy of a research paper with you to class.

In this class, you will learn how to:

  • identify the information you need to produce a power calculation;
  • justify an appropriate sample size for your research; and
  • examine the sensitivity of the sample size to changes in your research design.

This class does not qualify for IRB Education Credits (IRBECs).

Contents

  • Overview of the STATS web pages
  • Consulting services that I provide
  • Type II error
  • Quick sample size calculations
  • Three things you need for a power calculation
  • Sample size calculations for a binary outcome
  • Confidence intervals
  • Negative results
  • Please fill out an evaluation form

Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details


For CMH employees only: Statistical Consulting Services.

You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. We also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.

Here are some examples of the services that we have provided:

  • setting up your research hypothesis,
  • selecting and justifying your sample size,
  • writing the statistical methods section for your grant,
  • preparing randomization tables for your study,
  • reviewing your surveys for content and quality,
  • developing a system for entering your data,
  • choosing an appropriate statistical model for your data,
  • establishing validity and/or reliability for your measurement scales,
  • checking for violations of statistical assumptions in your data,
  • producing graphs and tables for your research publication, and
  • providing references for new and unusual statistical methods.

Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.

How to get in touch with a statistician

If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).

This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Directions to my new office (April 25, 2008).

I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.

  • Take the elevator of the research tower down to the yellow level. Exit the employee parking garage on 23rd Street, walk to Kenwood and cross 23rd Street. Your destination is Building M 3, which is the building closest to 22nd Street. However, the entrance to our building faces Building M 2. It’s best to walk into the parking area that is just north of Building M 1 and follow the sidewalk around the west side of Building M 2 in order to get to our building’s entrance on its south side. Another route would be to exit the Hospital Hill Center Building on Holmes and then walk ½ block north to 23rd Street, cross 23rd Street, walk west to Kenwood, and then north to Building M 3 (address: 2220 Kenwood).

This webpage was written by Steve Simon and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Type II error.

Dear Professor Mean, A journal reviewer criticized the small sample size in my research study and suggested that I mention a Type II error as a possible explanation for my results. I've never heard this term before. What is a Type II error?

In your research, you specified a null hypothesis and an alternative hypothesis. Typically, the null hypothesis corresponds to no change.

When you are using Statistics to decide between these two hypotheses, you have to allow for the possibility of error. Actually, if you are using any other procedure, you should still allow for the possibility of error, but we statisticians are the only ones honest enough to admit this. Here are the two types of errors:

  • A type I error is rejecting the null hypothesis when the null hypothesis is true.
  • A type II error is accepting the null hypothesis when the null hypothesis is false.

The null hypothesis traditionally represents a negative finding (i.e., there is no difference between the treatment and control). You should always remember that it is impossible to prove a negative. Some statisticians will emphasize this fact by using the phrase "fail to reject the null hypothesis" in place of "accept the null hypothesis." The former phrase always strikes me as semantic overkill.

Example

Consider a new drug that we will put on the market if we can show that it is better than a placebo.

  • A type I error would be allowing an ineffective drug onto the market.
  • A type II error would be keeping an effective drug off the market.

Suppose we are comparing two groups of patients, one with a possibly dangerous exposure (e.g., non-ionizing radiation), and the other unexposed.

  • A type I error would be condemning an exposure that actually is safe.
  • A type II error would be absolving an exposure that actually does harm.

Many studies have small sample sizes that make it difficult to reject the null hypothesis even when there is a big change in the data. In these situations, a Type II error might be a possible explanation for the negative study results.

This webpage was written by Steve Simon on 1999-09-03, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Ask Professor Mean, Category: Hypothesis testing


Quick sample size calculations (October 11, 2001) Category: Ask Professor Mean, Category: Sample size justification

Dear Professor Mean, I'm reading a research paper. I suspect that the sample size is way too small. I don't like the findings of the study anyway, so I'm hoping that you will help me discredit this study. Is there a quick sample size calculation that I can use? -- Cynical Chris

Dear Cynical,

That's not a paper that I wrote, is it? I was very tired, and the dog ate my homework, and the Year 2000 bug just struck (my computer procrastinates worse than I do).

The simplest way to see if your sample size is too small is to look at the confidence intervals. Look for a confidence interval so wide that you can drive a truck through it. Look for an interval that contains both clinically large and clinically insignificant values.

What? The paper doesn't have any confidence intervals? That's terrible. Many journal editors require the use of confidence intervals instead of p-values in their publications. Write a letter to the editor and complain.

A second way to assess the sample size is to do a quick power calculation. If your research paper compares two groups (e.g., a treatment group and a control group), then you can use the rule of 50 or the rule of 16. The first rule applies for categorical outcomes; the second applies for continuous outcomes.

Assumptions

These rules presume that you are comparing two groups. Some examples would be comparing a control group to a treatment group, a standard therapy to a new therapy, or an exposed group to an unexposed group.

They also presume that

  • you are performing a two-sided test,
  • the alpha level is .05, and
  • the power is 80%.

Finally, keep in mind that both rules are approximations, which means that you need to consult with a statistician for an official sample size. Still, these quick rules help you get a feel for whether you need dozens versus hundreds versus thousands of research subjects. These quick rules are also helpful when you are reading someone else's research and you want to get a rough idea about what an appropriate sample size might be.

By the way, I have to give credit where credit is due. Both of these rules came from a website with the title "STRUTS: Statistical Rules of Thumb" which is no longer on the web. There is a book out with the same title (see further reading) as well as a new website. Unfortunately, the details about the rule of 50 seem to have disappeared in the update.

The rule of 50

The rule of 50 applies when your outcome measure is a discrete event such as morbidity or mortality. The rule works well if that event is relatively rare. If you want enough subjects to be able to detect a halving of risk from your control group, be sure to collect enough data so that you will have at least 50 events in your control group. Then sample the same number of subjects in your treatment group.

For example, patients using a control medication will have a risk over five years for a heart attack of roughly 8%. You want to try a new drug to see if it can reduce the risk to 4%. You would need a large enough sample in each group to ensure that at least 50 patients in the control group will have a heart attack. It seems a bit morbid to plan for such a thing. Still, it makes sense. If hardly anyone in either group has a heart attack, you will have a hard time deciding whether one medication is better than another.

A control group of 625 subjects would suffice (8% of 625 is 50). With the same number of treated subjects, you would have a total of 1,250 patients in your study. This is just an approximation. The sample size that provides 80% power for detecting a halving of risk is actually 553 per group, not 625.
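
As a rough sketch of the arithmetic above, the Python below computes the rule of 50 sample size and compares it with a more detailed normal-approximation formula for two proportions. The function names are hypothetical, and the choice of detailed formula (pooled variance under the null hypothesis, unpooled variance under the alternative) is an assumption. The text does not say which method produced the figure of 553, though this particular formula happens to agree with it.

```python
import math
from statistics import NormalDist


def rule_of_50(control_risk):
    """Subjects per group so that about 50 events are expected in the control group."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(50 / control_risk, 6))


def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two proportions (two-sided test).

    Assumed formula: normal approximation with a pooled variance under the
    null hypothesis and an unpooled variance under the alternative.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


quick = rule_of_50(0.08)                  # heart attack risk of 8% in the control group
detailed = n_two_proportions(0.08, 0.04)  # halving the risk to 4%
print(quick, detailed)                    # 625 553
```

The quick rule overshoots slightly here, which is the safe direction for a rule of thumb.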

The rule of 16

The rule of 16 applies when your outcome measure is continuous, such as birth weight. For the continuous outcome, you need to define how much of a difference you would consider to be clinically significant, and then compute the ratio of this clinically significant difference to the standard deviation of the outcome measure. This ratio is called the effect size or the standardized effect size. Your sample size per group is 16 divided by the square of the effect size.

For example, you are measuring the duration of breast feeding in a sample of newborn infants. Let's presume that any intervention that can increase the average duration of breast feeding by at least two weeks is considered clinically significant. Furthermore, in this population, you expect the standard deviation of breast feeding duration to be 10 weeks. The effect size is 0.2, and the required sample size per group is 400 (=16/0.04).

Again, this is just an approximation. The sample size that provides 80% power for detecting a two week shift in breast feeding duration is actually 394, not 400.
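
The rule of 16 is easy to check in a few lines of Python. The sketch below uses hypothetical function names and, for comparison, the normal-approximation formula n = 2(z_alpha/2 + z_beta)^2 / (effect size)^2, whose constant 2(1.96 + 0.84)^2 is roughly 15.7, which is where the 16 comes from. The normal approximation gives 393 for the breast feeding example. The 394 quoted above presumably reflects a slightly more exact method, but that is an assumption.

```python
import math
from statistics import NormalDist


def rule_of_16(difference, sd):
    """Approximate subjects per group: 16 divided by the squared effect size."""
    effect_size = difference / sd
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(16 / effect_size ** 2, 6))


def n_normal_approx(difference, sd, alpha=0.05, power=0.80):
    """Per-group sample size from the two-sided normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sd / difference) ** 2)


# Breast feeding example: a 2 week difference, standard deviation of 10 weeks
quick = rule_of_16(2, 10)
approx = n_normal_approx(2, 10)
print(quick, approx)  # 400 393
```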

Disclaimer: I am not a medical expert, so I cannot say with any authority whether two weeks of additional breast feeding is clinically significant. All my knowledge of breast feeding comes from experiences more than forty years ago.

Example of a hypothetical research study (using the rule of 50)

An article by Schwartz et al proposes an interesting scenario for a research study (N Engl J Med. 1998;338:1709-1714). These authors noticed an association between prolonged QT interval and Sudden Infant Death Syndrome. In the discussion of these findings, the authors raise the possibility of screening all newborn infants using electrocardiography and then placing those infants with prolonged QT intervals on a beta blocker. The authors discuss the complexity of the cost benefit issues, which is beyond the scope of this web page. It is interesting, however, to speculate on how to test whether beta blockers would be effective as a therapy to prevent SIDS in those infants with long QT intervals.

The paper provides much interesting data to help you calculate an appropriate sample size for this study. The risk of SIDS in infants with prolonged QT intervals is 1.5%. Suppose that a beta blocker could cut this risk in half (to 0.75%). What sample size would you have to collect in order to have adequate power?

The rule of 50 tells us that we would need 50 SIDS events in the placebo arm of the trial. At a rate of 1.5% that translates into recruiting 3,333 infants with prolonged QT interval for the placebo arm. You would recruit a similar number of infants for the beta-blocker arm of the study.

Not every infant, however, will have a prolonged QT interval. The cutoff used in this paper for a prolonged interval represented the 97.5 percentile. So only 2.5% of the infants screened could qualify to be in the study. In order to recruit 6,666 infants who qualify for the study, you would have to screen 266,640 normal infants.
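
The chain of arithmetic in this example can be laid out as a short Python sketch. The variable names are hypothetical, and the sketch rounds to the nearest infant, as the text does, rather than rounding up.

```python
events_needed = 50             # rule of 50: events in the placebo arm
risk_of_sids = 0.015           # risk among infants with prolonged QT interval
fraction_prolonged_qt = 0.025  # infants above the 97.5 percentile cutoff

# Infants per arm so that about 50 SIDS events are expected on placebo
per_arm = round(events_needed / risk_of_sids)

# Two arms: placebo and beta blocker
total_enrolled = 2 * per_arm

# Only 2.5% of screened infants qualify, so screen 40 times as many
to_screen = round(total_enrolled / fraction_prolonged_qt)

print(per_arm, total_enrolled, to_screen)  # 3333 6666 266640
```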

Disclaimer: I am not a medical expert, so I cannot comment intelligently on cost benefit issues, the amount of improvement that a beta blocker might have, and other related issues. This example should only serve as an illustration of how difficult it would be to prospectively examine a therapy in a group of children where both the adverse event and the qualifying condition for therapy are both rare.

Assessing the sample size in an existing publication (using the rule of 16)

An article by Adkinson et al presents a negative finding on the use of immunotherapy for asthma in allergic children (N Engl J Med. 1997;336:324-331). You may want to examine whether this negative finding is due to an inadequate sample size.

One of the outcome measures is change in medication score, which has a standard deviation of roughly 2.0. Suppose you felt that a difference of one unit on the medication score represented a clinically significant effect. This represents a standardized effect size of 0.5 (=1.0/2.0). Using the rule of 16, you would want 64 (=16/0.25) subjects in each group to have adequate power.

Another outcome measure is change in symptom scores, which has a standard deviation of roughly 0.4. If you believed that a 0.25 unit change in the symptom score was clinically significant, this would represent a standardized effect size of 0.625 (=0.25/0.4). Using the rule of 16, you would need 41 patients in each group.

The actual study had 60 and 61 patients, so the sample size appears adequate (presuming that the differences of 1.0 and 0.25 are reasonable values).
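
One way to look at this, sketched below in Python with hypothetical function names, is to run the rule of 16 in both directions: first to get the sample size needed for each outcome, and then, turning the rule around, to get the smallest difference that a group of 60 could detect with roughly 80% power. The reversal (difference = standard deviation times the square root of 16/n) is simple algebra on the rule and is not a calculation reported in the paper.

```python
import math


def rule_of_16(difference, sd):
    """Approximate subjects per group: 16 divided by the squared effect size."""
    effect_size = difference / sd
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(16 / effect_size ** 2, 6))


def detectable_difference(n_per_group, sd):
    """Rule of 16 turned around: difference detectable with n subjects per group."""
    return sd * math.sqrt(16 / n_per_group)


n_medication = rule_of_16(1.0, 2.0)   # medication score
n_symptoms = rule_of_16(0.25, 0.4)    # symptom score
print(n_medication, n_symptoms)       # 64 41

# With 60 patients per group, the smallest detectable differences are roughly
print(round(detectable_difference(60, 2.0), 2))  # about 1.03 medication score units
print(round(detectable_difference(60, 0.4), 2))  # about 0.21 symptom score units
```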

Disclaimer: I am not a medical expert, so I can only speculate on what a clinically significant difference would be. Also, there are some tricky issues in the paper involving compliance and the quality of care received. My discussion of this example may oversimplify some of the issues, but I hope you still find the example interesting and informative.

Mathematical details

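The rule of 50 starts with a standard formula for the sample size per group when comparing two proportions.

$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \, [p_1(1-p_1) + p_2(1-p_2)]}{(p_1 - p_2)^2}$$

If the event in question is rare, then $1-p_1$ and $1-p_2$ are both close to 1, so

$$n \approx \frac{(z_{\alpha/2} + z_{\beta})^2 \, (p_1 + p_2)}{(p_1 - p_2)^2}$$

Suppose we want to detect a 50% decline from group 1 to group 2. We want the same number of subjects in both groups, we want a two-sided test with an alpha level of 0.05 and a beta level of 0.20. This implies that

$$p_2 = 0.5\,p_1, \qquad z_{\alpha/2} = 1.96, \qquad z_{\beta} = 0.84$$

When we evaluate the sample size formula, it simplifies to

$$n \approx \frac{(1.96 + 0.84)^2 \, (1.5\,p_1)}{(0.5\,p_1)^2} = \frac{47.04}{p_1}$$

which we can rewrite as

$$n\,p_1 \approx 47$$

which we round up to 50. The number of events in the second group, of course, would be 25.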

Summary

Cynical Chris believes that the results of a research study are invalid because the sample size was too small. Professor Mean presents two quick calculations that can help show if this is true. The rule of fifty says that if you are looking at a binary outcome, and you hope that the treatment will cut the chances of the adverse event in half, then you need enough subjects so you will have 50 adverse events in the control group.

The rule of sixteen says that if you want to detect a change in a continuous outcome, you first compute the effect size (the minimum clinically relevant difference divided by the standard deviation). Divide 16 by the effect size squared to get an approximate sample size.

Further reading

  1. Statistical Rules of Thumb. van Belle G (2002) New York, NY: John Wiley & Sons. ISBN: 0471402273.
  2. Statistical Rules of Thumb. Van Belle G. The original STRUTS web site was either at www.nrcse.washington.edu/research/eo-1.html or www.nrcse.washington.edu/nrcse/struts/struts.html but these links no longer work. Accessed on 2005-03-29. www.vanbelle.org

This page was written by Steve Simon and was last modified on 07/14/2008.


Three things you need for a power calculation (November 8, 2001) Category: Ask Professor Mean, Category: Sample size justification

Dear Professor Mean, I want to do research. Is forty subjects enough, or do I need more? Didn't I hear you mention something about three things you need for a power calculation? -- Eager Edward

Dear Eager,

That reminds me of a cute joke. How many research subjects does it take to screw in a light bulb? At least 300 if you want the bulb to have adequate power.

Sorry, I was digressing. Is forty subjects an adequate sample size? That depends on a lot of factors. The basic idea, though, is to select a sample size which ensures that your study has adequate power. Power is the probability that your research study will successfully detect a difference, assuming that the treatment or exposure you are examining actually can cause an important difference. If you don't care whether your experiment is successful or not, then you can use just about any sample size.

Short answer

Power is to a research design like sensitivity is to a diagnostic test. A diagnostic test with good sensitivity is normally able to detect a disease when the disease is present. A research study with good power is normally able to detect a change when your treatment is indeed effective.

The actual calculation of power requires three pieces of information:

  1. your research hypothesis,
  2. the variability of your outcome measure, and
  3. your estimate of the clinically relevant difference.

Calculating power is sometimes difficult and it may require you to go to the time and expense of running a pilot study. But you should NEVER start a research project without knowing what your power is. That would be like using a diagnostic test with unknown sensitivity.
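
To make the idea of power concrete, here is a minimal Python sketch that approximates the power of a two-group comparison. It assumes that the forty subjects are split evenly into two groups of 20, that the test is two-sided with alpha at .05, and that a normal approximation is good enough. The function name is hypothetical. The effect sizes are expressed in standard deviation units.

```python
import math
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for a two-sided comparison of two equal-sized groups.

    Normal approximation; ignores the tiny chance of rejecting the null
    hypothesis in the wrong direction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * math.sqrt(n_per_group / 2) - z_alpha)


# Forty subjects, assumed to be split into two groups of 20
for effect_size in (0.2, 0.5, 0.8):
    power = approx_power(effect_size, 20)
    print(f"effect size {effect_size}: power is roughly {power:.2f}")
```

Under these assumptions, twenty subjects per group falls short of 80% power for all three effect sizes (roughly 9%, 35%, and 72%). Whether forty subjects is enough depends heavily on the size of the difference you need to detect.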

Research hypothesis

A research hypothesis will provide specific information that will determine what type of analysis is needed. A common structure for a research hypothesis is specification of the subject group you are testing, the treatment or exposure that this group will receive, the outcome measure, and the comparison or control group.

Some exploratory studies may not have a research hypothesis, of course, and for those studies you determine an appropriate sample size in a different way (for example, by ensuring that the estimates from this exploratory study have adequate precision).

Variability of your outcome measure

You also need to have an estimate of the variability of your outcome measure. I'm assuming here that your outcome measure is a continuous variable like birth weight or cholesterol level. If you are using a categorical outcome measure like mortality or cancer remission, then you need some estimate of the rate of mortality or remission in your control group.

Your literature review (you did do a literature review before you started this research, I hope), will usually provide you with an estimate of variability. Select a study that is reasonably similar to what you plan to do, and find out what that study reported for the standard deviation for your outcome measure.

Although I prefer a standard deviation, other estimates of variability are also acceptable. If the paper reports a variance, a standard error, a confidence interval, or a coefficient of variation, then there are simple formulas for converting these into standard deviations. If the study provides a range, then you can divide the range by four to get a good approximation for the standard deviation.
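
The conversions mentioned above might look like the following Python sketch. The text does not spell out which formulas it has in mind, so these are the usual textbook versions for a single group mean, and the function names and the numbers in the usage lines are hypothetical. The confidence interval conversion assumes a large-sample 95% interval for a mean.

```python
import math


def sd_from_variance(variance):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(variance)


def sd_from_standard_error(standard_error, n):
    """Standard error of a mean = SD / sqrt(n), so SD = SE * sqrt(n)."""
    return standard_error * math.sqrt(n)


def sd_from_confidence_interval(lower, upper, n, z=1.96):
    """A 95% interval for a mean is about 2 * 1.96 standard errors wide."""
    standard_error = (upper - lower) / (2 * z)
    return standard_error * math.sqrt(n)


def sd_from_cv(cv_percent, mean):
    """Coefficient of variation = 100 * SD / mean."""
    return cv_percent / 100 * mean


def sd_from_range(minimum, maximum):
    """Rule of thumb from the text: divide the range by four."""
    return (maximum - minimum) / 4


# Hypothetical numbers, chosen so that every conversion returns an SD of 2
print(sd_from_variance(4.0))
print(sd_from_standard_error(0.5, 16))
print(sd_from_confidence_interval(9.02, 10.98, 16))
print(sd_from_cv(20, 10))
print(sd_from_range(6, 14))
```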

Many of the people I see have a difficult time providing any estimate of variability. This area hasn't been studied before, so no one knows what the variability will be. But don't give up too easily.

First, keep in mind that you only need a crude estimate of variability. Power calculations are capable of determining if you are "in the right ball park." They are good at specifying your sample size down to an order of magnitude perhaps but not much more than that. In other words, they might tell you whether you need dozens of subjects instead of hundreds of subjects, or possibly whether you need thousands of subjects.

Second, although most research is innovative and therefore unique, this innovation is often in the treatment and not in the outcome measure. So look for studies that used the same outcome measure, even if the treatment is quite different than yours.

Third, try to characterize variability in your control group and we can try to extrapolate what the variability will be in the treatment group. A retrospective chart review, for example, will provide a rough estimate of variability of your outcome measure under the current standard of care.

Fourth, you may have to use a clearly flawed estimate, but a flawed estimate of variability may still be better than no estimate at all. An estimate of variability in adults, for example, may not be an ideal estimate for a pediatric study, but at least it tells you if your study will have adequate power assuming that the variation in a pediatric population is comparable to variation in an adult population. That's still better than having no idea whether your study has adequate power.

If you've tried and you still can't come up with an estimate of variability, then don't despair. A pilot study can provide you with an estimate of variability when all else fails. Usually 20 to 30 subjects produce a reasonably stable estimate of variability. A pilot study is also helpful for finding out how quickly you can recruit subjects. Furthermore, a pilot study will also identify any weaknesses in the logistics of your research. Finally, if the protocol remains substantially unchanged after the pilot study, you can usually include those pilot subjects in the final analysis.

Clinically relevant difference

Wow, that was exhausting! You're not done, though, until you can tell me what a clinically relevant difference would be for your outcome measure. This is a difference that is large enough to be considered important by a practicing clinician.

For just about every type of study, some differences are so small as to be clinically meaningless. From a theoretical viewpoint, perhaps, changes of any size might be interesting. But theory and practice are very different. If a six month diet program produces an average weight loss of three pounds, a fever medicine reduces average temperature by half a degree Fahrenheit, or a smoking cessation program helps an additional two percent to quit, who cares what the theoretical implications might be?

It's not easy but this is something that you have to do for yourself. The clinically relevant difference is determined by medical experts and not by statisticians. Hey, I'm still trying to understand the difference between good and bad cholesterol; I wouldn't even be able to start thinking about how much of a change in cholesterol is considered clinically relevant. You might start by asking yourself "How much of an improvement would I have to see before I would adopt a new treatment?" Also, try talking with some of your colleagues. And look at the size of improvements for other successful treatments.

Still, there are some general guidelines that might help. Try looking at the resolution of your measuring device, thinking in terms of relative changes, or specifying changes with respect to your standard deviation.

Average changes that are smaller than the resolution of your measuring instrument are probably not clinically relevant. For example, Apgar scores can take on any whole number between 0 and 10. Gestational age can only be measured accurately to within a week. In these contexts, it is clear that average changes should probably be greater than one unit in order to achieve relevance.

Still this is not a perfect rule. We can measure weights to within a gram, but changes in birth weight would have to be in the hundreds of grams or more to be meaningful. And while no family can have a fractional number of children, decreasing the average family size by 0.2 children can have a profound effect on society.

It also may help to think in terms of relative changes. If you can change something by 25 percent or 50 percent, that is considered relevant in most contexts. It becomes harder to argue clinical relevance for changes of less than 10 percent. Again, this is not a perfect rule.

Finally, you might find it easier to specify changes with respect to your standard deviation. This type of change is called an effect size. A common classification is that 0.2 standard deviations is considered a small effect size, 0.5 standard deviations is considered a medium effect size, and 0.8 standard deviations is considered a large effect size.

An effect size of 0.2 is small enough that there is no obvious visible separation between the two groups. The difference in average heights between 15 and 16 year old girls is 0.2 standard deviations. An effect size of 0.8 is clearly visible. The difference in average heights between 14 and 18 year old girls is 0.8 standard deviations.

It may be unrealistic to look for changes much smaller than 0.2 standard deviations because the sample sizes become prohibitively large. It may also be unrealistic to expect to see changes much larger than 0.8 standard deviations since this size change does not seem to occur too often in the published literature.
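
To see how quickly the sample size grows as the effect size shrinks, here is a small Python sketch that applies the quick rule of 16 (16 divided by the squared effect size, for a two-sided test with alpha of .05 and 80% power) to a range of effect sizes. The function name is hypothetical, and the effect size of 0.1 is included only to illustrate a value below the "small" benchmark.

```python
import math


def n_per_group(effect_size):
    """Rule of 16: approximate subjects per group for a given effect size."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(16 / effect_size ** 2, 6))


sizes = {d: n_per_group(d) for d in (0.1, 0.2, 0.5, 0.8)}
for effect_size, n in sizes.items():
    print(f"effect size {effect_size}: about {n} subjects per group")
# effect size 0.1: about 1600 subjects per group
# effect size 0.2: about 400 subjects per group
# effect size 0.5: about 64 subjects per group
# effect size 0.8: about 25 subjects per group
```

Halving the effect size quadruples the required sample size, which is why very small effects are so expensive to study.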

Like the other two rules, this rule is also not perfect. In some animal experiments, for example, the similarity in the gene pool can often reduce variation to such an extent that changes of more than a full standard deviation are quite realistic. If you are trying to specify a clinically relevant difference, there is no substitute for a good understanding of the context of your research.

But I can't do it.

A lot of people tell me that they can't do this. They can't provide an estimate of variability or they can't determine what a clinically relevant difference is, even after I explain all of the above suggestions.

But you have to do it.

The CONSORT Guidelines require you to have an a priori justification of sample size for publication. If you don't do this now, you won't be able to publish the data in any journal that uses these guidelines. What's the point of doing the research if you can't publish it?

If your research requires an ethical review (e.g., through an IRB), they will require the same a priori justification. If the research involves animals, the appropriate animal care and use committee will require this justification.

The bottom line is that if you know so little about this avenue of research that you can't even come up with a preliminary estimate of the variability of your outcome variable, then you shouldn't be doing the research. You need instead to:

  • do a more thorough literature review,
  • collect some pilot data, or
  • switch to an outcome measure whose variability is known to some extent.

But do something, because your ability to perform the research and to publish your research depends on your justification of the sample size.

Example

In a study of two different skin barriers for burn patients, we are interested in three outcome measures: pain, healing time, and cost. We will randomly assign half of the patients to one skin barrier and half to the other.

For pediatric patients we usually measure pain with the Oucher, a five point scale that has been validated for children. A review of previous studies using the Oucher has shown that it has a standard deviation of about 1.5 units. We would be interested in seeing how large a sample size is needed to show a change of 1 unit, the smallest individual change attainable on the Oucher. We want to have a power of .80, or equivalently, a probability of a Type II error of .20.

The formulas for sample size vary from problem to problem. The sample size needed for a comparison of two independent groups is

$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \, (\sigma_1^2 + \sigma_2^2)}{D^2}$$

We use the letter "z" to represent a standard normal distribution. Alpha represents the probability of a Type I error (usually .05). Beta represents the probability of a Type II error (we usually want this to be somewhere between .05 and .20). Sigma represents the standard deviation, and this formula allows for the possibility of different standard deviations in group 1 and group 2. Don't forget that the formula requires you to square these standard deviations. Finally, D is the clinically relevant difference. In our example,

$$n = \frac{(1.96 + 0.84)^2 \, (1.5^2 + 1.5^2)}{1^2} = 35.28$$

We round up. So in order to achieve 80% power for detecting a one unit difference in the Oucher score, which has a reported standard deviation of 1.5, we would need to sample 36 patients in each group.
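
The calculation for the Oucher can be sketched in Python as follows. The sketch assumes the usual normal-approximation formula for two independent groups, which matches the description of the symbols above: the squared sum of the two z values, times the sum of the two squared standard deviations, divided by the squared clinically relevant difference. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(sd1, sd2, difference, alpha=0.05, beta=0.20):
    """Per-group sample size for comparing the means of two independent groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(1 - beta)        # about 0.84 for beta = .20
    n = (z_alpha + z_beta) ** 2 * (sd1 ** 2 + sd2 ** 2) / difference ** 2
    return math.ceil(n)  # always round up


# Oucher: standard deviation of 1.5 in both groups, difference of 1 unit
oucher_n = n_per_group(1.5, 1.5, 1.0)
print(oucher_n)  # 36
```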

Healing time is a more difficult endpoint to assess. Medical textbooks cite that the healing time for second degree burns has a range of 4 days (minimum 10, maximum 14). A study of healing times for a glove made from one of the skin barriers showed a healing time range of 6 (minimum 2 and maximum 8 days).

A rule of thumb is that the standard deviation is about one fourth to one sixth the size of the range. So we could have a standard deviation as small as 0.67 or as large as 1.5. An average change of one day in healing time would be considered clinically relevant.

If we use the largest possible estimate of standard deviation, we would get (coincidentally) the exact same sample size of 36 per group. If we used the smallest estimate of the standard deviation, we would need only 7 subjects per group.

For one type of skin barrier, a study of costs showed a range of $4.00 ($5.50 to $9.50). We would like to be able to detect a difference as small as $0.50 in costs.

Using the same rule of thumb, we get an estimate of the standard deviation of either 0.67 or 1.0. Using the smaller estimate of standard deviation, we would need 29 subjects per group. Using the larger estimate, we would need 63 subjects per group.

A sample size of 63 is untenable, so we decide that we can live with a study that could only detect a $1.00 change in costs. For this size difference, we would need 16 subjects per group using the larger standard deviation.
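
A sensitivity check like this one is easy to script. The sketch below is an illustration only: the helper names sd_from_range and n_per_group are invented for this example, and it assumes the same standard deviation in both groups.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sd, diff, alpha=0.05, power=0.80):
    """Per-group sample size when both groups share the same standard deviation."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(z_sum ** 2 * 2 * sd ** 2 / diff ** 2)


def sd_from_range(value_range):
    """Rule of thumb: the standard deviation lies between range/6 and range/4."""
    return value_range / 6, value_range / 4


# Cost endpoint: range of $4.00, candidate differences of $0.50 and $1.00
low_sd, high_sd = sd_from_range(4.00)
for diff in (0.50, 1.00):
    for sd in (low_sd, high_sd):
        print(f"diff=${diff:.2f}  sd={sd:.2f}  n per group={n_per_group(sd, diff)}")
```

With the larger standard deviation this gives 63 per group for a $0.50 difference and 16 per group for a $1.00 difference. Results for the smaller standard deviation sit close to a whole number, so they can move by one subject depending on whether you round the standard deviation to 0.67 before plugging it in.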

In summary, to achieve adequate power for all three endpoints, we would need 36 patients per group. This is larger than we need for the healing time endpoint. It is also larger than what we need for the cost endpoint, unless we wanted to detect a $0.50 change in costs. To detect such a small difference, we need a sample size of 63 subjects per group.

Summary

Eager Edgar wants to know if forty subjects is enough to conduct a research study. Professor Mean explains that it is impossible to determine whether forty is an appropriate sample size without having these three things:

  1. a research hypothesis,
  2. a standard deviation for your outcome measure, and
  3. an estimate of the clinically relevant difference for this outcome measure.

Further reading

Jacob Cohen has an excellent discussion of effect sizes in Chapter 2 of his book, and the example of girls' heights comes directly from this book. Bernard Rosner incorporates a discussion of power and sample size issues into every section on statistical testing. Russ Lenth's PiFace software will provide more accurate power calculations than those presented here (or in Rosner's book), which is especially important when you are estimating power for small sample sizes. The range method for estimating standard deviations gives a more precise rule for converting a range into a standard deviation.

  1. Power and sample size page.
    Russell V. Lenth (Accessed on January 1, 2002).
    http://www.stat.uiowa.edu/~rlenth/Power/
  2. Range method for estimating standard deviation.
    (Accessed on October 2, 2000)
    http://www.uop.edu/cop/psychology/Statistics/range_method.html
  3. Statistical Power Analysis for the Behavioral Sciences, Revised Edition.
    Cohen J.
    New York NY: Academic Press (1977).
    ISBN: 0-12-179060-6.
  4. Fundamentals of Biostatistics, Third Edition.
    Rosner B.
    Belmont CA: Duxbury Press (1990).
    ISBN: 0-534-91973-1.

This page was written by Steve Simon and was last modified on 07/14/2008.


Binary outcome sample size calculations (August 23, 2000) Category: Ask Professor Mean

Dear Professor Mean, I have to calculate a sample size for a binary outcome variable. The research study is on breast feeding failures within 7 to 10 days of birth for mothers who intended to breast feed. The rate of failure overall is expected to be about 12%. What sample size do I need? -- Baffled Bob

Dear Baffled,

Breast feeding failure is an example of a binary outcome measure. There are only two possible values: the mother is successfully breast feeding at 7 to 10 days, or the mother is not successfully breast feeding at 7 to 10 days. Other examples of binary outcomes would be:

  • the patient died (survived) within the first year of study,
  • the patient experienced (did not experience) a specific side effect,
  • the patient had a positive (negative) result on a diagnostic test.

The sample size you need when your outcome is binary is different than when your outcome is continuous. For a continuous outcome, you need to specify the variability of your outcome measure and how much of a change you would consider clinically relevant. For a binary outcome, you still need to specify the clinically relevant change. But you don't need a measure of variability. What you need instead is an estimate in your control group of the probability for one level of your binary outcome. You might also need to specify the distribution of your explanatory (independent) variable.

Example

One of the factors that might influence breast feeding failure is whether the delivery was a vaginal birth or a C-section. Let's assume that roughly 20% of the mothers in the sample had a C-section. Expressing it in a different way, the ratio of vaginal births to C-sections is 4 to 1.

Let's also assume that the rate of breast feeding failure is 15% in the C-section group and 30% in the vaginal birth group. You hypothesize that C-section babies fare better, because the mother stays in the hospital longer. The extra time in the hospital allows greater interaction with lactation consultants.

You wish to use a two sided test at an alpha level of 0.05. You also want the power to be at least 0.80. Under these conditions, you would need a sample size of 435 mothers.
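
This page does not say which formula produced the figure of 435. One calculation that arrives at the same number is the normal approximation for comparing two proportions with unequal group sizes, followed by a continuity correction of the kind described by Fleiss. Treat the choice of formula, the function name, and the argument names below as assumptions made for this sketch.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_two_proportions(p1, p2, ratio, alpha=0.05, power=0.80):
    """Return (n1, n2) with n2 = ratio * n1, using a continuity correction."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    diff = abs(p1 - p2)
    p_bar = (p1 + ratio * p2) / (1 + ratio)  # pooled rate under the null hypothesis
    # Uncorrected size of group 1
    m = (z_alpha * sqrt((ratio + 1) * p_bar * (1 - p_bar))
         + z_beta * sqrt(ratio * p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (ratio * diff ** 2)
    # Continuity correction (assumed; enlarges the uncorrected size)
    m_cc = m / 4 * (1 + sqrt(1 + 2 * (ratio + 1) / (m * ratio * diff))) ** 2
    n1 = ceil(m_cc)
    return n1, ratio * n1


# C-section group (15% failure) is the smaller group; vaginal births (30%) are 4 times as common
n_csection, n_vaginal = n_two_proportions(p1=0.15, p2=0.30, ratio=4)
print(n_csection, n_vaginal, n_csection + n_vaginal)
```

Under these assumptions the sketch gives 87 C-section mothers and 348 vaginal birth mothers, for a total of 435. Dropping the continuity correction would give a smaller total, so it is worth stating in your protocol which version you used.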

[Show some of the formulas and calculations.]

Summary

Baffled Bob wants to know how to calculate a sample size when his outcome variable is binary (has only two possible values). Professor Mean explains that you need to specify the probability of an outcome at two different values of your predictor or independent variable.

Further reading

  1. Binomial Program to Calculate Power or Sample Size. Brent Hostetler, Southwest Oncology Group Statistical Center. Accessed on 2003-05-08. "Two Arm Binomial is a program to calculate either estimates of sample size or power for differences in proportions. The program allows for unequal sample size allocation between the two groups." www.swogstat.org/Stat/Public/binomial/binomial.htm
  2. One sample binomial. Southwest Oncology Group Statistical Center. Accessed on 2003-05-08. "One Arm Binomial program calculates either estimates of sample size or power for one sample binomial problem. The first button calculates approximate power or sample size and critical values (reject if >= critical value). The second button calculates "exact" power and alpha for the given null and alternative proportions and sample size. Note, sample size and null and alternative proportions can be changed before using the second button." www.swogstat.org/Stat/Public/one_binomial.htm
  3. Bayesian sample size determination for estimating binomial parameters from data subject to misclassification. Elham Rahme, Lawrence Joseph, Theresa W. Gyorkos. Accessed on 2003-05-08. "We investigate the sample size problem when a binomial parameter is to be estimated, but some degree of misclassification is possible. The problem is especially challenging when the degree to which misclassification occurs is not exactly known." Published November 29, 1999. www.med.mcgill.ca/epidemiology/Joseph/diagsmp.pdf

This webpage was written by Steve Simon on 2000-08-23, edited by Steve Simon, and was last modified on 2008-07-14. This page needs minor revisions. Category: Ask Professor Mean, Category: Sample size justification


Confidence Intervals.

Dear Professor Mean:  Can you give me a simple explanation of what a confidence interval is?

We statisticians have a habit of hedging our bets. We always insert qualifiers into our reports, warn about all sorts of assumptions, and never admit to anything more extreme than probable. There's a famous saying: "Statistics means never having to say you're certain."

We qualify our statements, of course, because we are always dealing with imperfect information. In particular, we are often asked to make statements about a population (a large group of subjects) using information from a sample (a small, but carefully selected subset of this population). No matter how carefully this sample is selected to be a fair and unbiased representation of the population, relying on information from a sample will always lead to some level of uncertainty.

Short Explanation

A confidence interval is a range of values that tries to quantify this uncertainty. Consider it as a range of plausible values. A narrow confidence interval implies high precision; we can specify plausible values to within a tiny range. A wide interval implies poor precision; we can only specify plausible values to a broad and uninformative range.

Consider a recent study of homoeopathic treatment of pain and swelling after oral surgery (Lokken 1995). When examining swelling 3 days after the operation, they showed that homoeopathy led to 1 mm less swelling on average. The 95% confidence interval, however, ranged from -5.5 to 7.5 mm. From what little I know about oral surgery, this appears to be a very wide interval. This interval implies that neither a large improvement due to homoeopathy nor a large decrement could be ruled out.

Generally when a confidence interval is very wide like this one, it is an indication of an inadequate sample size, an issue that the authors mention in the discussion section of this paper.

How to Interpret a Confidence Interval

When you see a confidence interval in a published medical report, you should look for two things. First, does the interval contain a value that implies no change or no effect? For example, with a confidence interval for a difference look to see whether that interval includes zero. With a confidence interval for a ratio, look to see whether that interval contains one.

Here's an example of a confidence interval that contains the null value. The interval shown below implies no statistically significant change.

Figure 2.1

Here's an example of a confidence interval that excludes the null value. If we assume that larger implies better, then the interval shown below would imply a statistically significant improvement.

Figure 2.2

Here's a different example of a confidence interval that excludes the null value. The interval shown below implies a statistically significant decline.

Figure 2.3

Practical Significance

You should also see whether the confidence interval lies partly or entirely within a range of clinical indifference. Clinical indifference represents values of such a trivial size that you would not want to change your current practice. For example, you would not recommend a special diet that showed a one year weight loss of only five pounds. You would not order a diagnostic test that had a predictive value of less than 50%.

Clinical indifference is a medical judgement, and not a statistical judgement. It depends on your knowledge of the range of possible treatments, their costs, and their side effects. As a statistician, I can only speculate on what a range of clinical indifference is. I do want to emphasize, however, that if a confidence interval is contained entirely within your range of clinical indifference, then you have clear and convincing evidence to keep doing things the same way (see below).

Figure 2.4

On the other hand, if part of the confidence interval lies outside the range of clinical indifference, then you should consider the possibility that the sample size is too small (see below).

Figure 2.5

Some studies have sample sizes that are so large that even trivial differences are declared statistically significant. If your confidence interval excludes the null value but still lies entirely within the range of clinical indifference, then you have a result with statistical significance, but no practical significance (see below).

Figure 2.6

Finally, if your confidence interval excludes the null value and lies outside the range of clinical indifference, then you have both statistical and practical significance (see below).

Figure 2.7
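
The two questions in this section (does the interval contain the null value, and where does it sit relative to the range of clinical indifference) can be combined into a small helper. This is a sketch only: the function name interpret_interval and the numbers in the examples are hypothetical and chosen just to reproduce the four situations described above.

```python
def interpret_interval(lower, upper, null_value, indiff_low, indiff_high):
    """Describe a confidence interval relative to the null value and a range of clinical indifference."""
    significant = not (lower <= null_value <= upper)
    inside = indiff_low <= lower and upper <= indiff_high
    outside = upper < indiff_low or lower > indiff_high
    if inside:
        practical = "entirely within the range of clinical indifference"
    elif outside:
        practical = "entirely outside the range of clinical indifference"
    else:
        practical = "partly outside the range of clinical indifference"
    statistical = "statistically significant" if significant else "not statistically significant"
    return statistical, practical


# Hypothetical setting: differences smaller than 2 units in either direction do not matter clinically
print(interpret_interval(-0.5, 1.0, 0, -2, 2))  # keep doing things the same way
print(interpret_interval(-3.0, 5.0, 0, -2, 2))  # sample size may be too small
print(interpret_interval(0.3, 1.5, 0, -2, 2))   # statistical but not practical significance
print(interpret_interval(2.5, 6.0, 0, -2, 2))   # both statistical and practical significance
```

The range of clinical indifference passed to the function still has to come from medical judgement; the code only does the bookkeeping.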

The Standard Error

In many situations, the width of a confidence interval is proportional to the standard error. The standard error is defined as the variability of a statistical estimate. You can compute a crude confidence interval by taking the estimate plus or minus twice the standard error.

Confidence Interval for a Simple Average

There are lots of different formulas for the confidence interval and the standard error, depending on the context of the problem. The simplest formula appears when you estimate an average from a single sample. In this situation, the standard error would be

$$ \frac{\sigma}{\sqrt{n}} $$

where sigma represents the variability of the original data and n represents the size of the sample. The crude confidence interval would be the sample mean plus or minus two standard errors.

The width of your confidence interval goes down as the sample size goes up, since you are placing a larger value in the denominator. This is a classic and intuitive relationship in statistics: larger sample sizes provide greater precision (that is, narrower confidence intervals).

One way of planning a sample size for your study is to try to make sure your confidence interval has an adequate amount of precision. Although larger sample sizes mean narrower confidence intervals, there is usually a point of diminishing returns. This occurs when further shrinking of the interval is not worth the cost of additional subjects.
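
A short sketch makes the diminishing returns visible. The function name crude_half_width is invented for this example, and the standard deviation of 11 is simply an assumed value for illustration.

```python
from math import sqrt


def crude_half_width(sigma, n):
    """Half-width of the crude interval: two standard errors of the mean."""
    return 2 * sigma / sqrt(n)


sigma = 11.0  # assumed standard deviation, chosen only for illustration
for n in (25, 50, 100, 200, 400, 800):
    print(f"n={n:4d}  half-width={crude_half_width(sigma, n):.2f}")
```

Because n sits under a square root, you have to quadruple the sample size to cut the width of the interval in half. Going from 25 to 100 subjects buys much more precision than going from 400 to 475.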

An often overlooked strategy for gaining precision is by finding a way to shrink sigma, the variability in your original data set. For example, use of calibration and quality control checks in a laboratory can often provide substantially smaller values for sigma.

Confidence Interval for a Difference Between Two Averages

If we were interested in estimating the difference in averages between two independent samples of data, the standard error of the estimated difference would be

$$ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} $$

where the subscripts 1 and 2 indicate whether the values come from the first or the second group. Notice that the standard error and hence the width of the confidence interval goes down as either or both sample sizes go up.

When you are planning a research study comparing two groups, it is often helpful to consider different allocations of samples to the two groups. For example, if your first group is much more variable than the second group, you might be better off trying for a larger sample size in that group, rather than trying to get equal numbers in each group.
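
To compare allocations, you can hold the total sample size fixed and see how the standard error changes. The numbers below are hypothetical (a first group three times as variable as the second, and 100 subjects in total), and the function name se_difference is invented for this sketch.

```python
from math import sqrt


def se_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent averages."""
    return sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


# Hypothetical values: group 1 is three times as variable as group 2,
# and the budget allows 100 subjects in total.
sd1, sd2, total = 3.0, 1.0, 100
for n1 in (25, 50, 75, 90):
    n2 = total - n1
    print(f"n1={n1:2d}  n2={n2:2d}  SE={se_difference(sd1, n1, sd2, n2):.3f}")
```

Among these allocations, the smallest standard error comes from putting 75 subjects in the more variable group and 25 in the other, which is where the sample sizes are in the same ratio as the standard deviations. Pushing the imbalance further (90 versus 10) makes things worse again.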

Confidence Interval for a Proportion

If we compute a proportion, p, from a sample, the standard error of that proportion would be

$$ \sqrt{\frac{p(1-p)}{n}} $$

Just like the previous examples, larger sample sizes lead to smaller standard errors and narrower confidence intervals.

Did you notice in this formula that the width of the confidence interval is related to the estimate itself? A bit of work with calculus will show you that, assuming the sample size stays the same, the widest confidence interval occurs when p=0.5. Both rarer and more frequent events than 50% will produce narrower intervals.
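
For readers who want to see the calculus, here is a sketch of the argument. Hold n fixed and write f(p) for the quantity under the square root (the name f is introduced only for this derivation):

$$ f(p) = \frac{p(1-p)}{n}, \qquad f'(p) = \frac{1-2p}{n}, \qquad f''(p) = -\frac{2}{n} < 0 $$

Setting the derivative to zero gives p = 1/2, and the negative second derivative confirms that this is a maximum. The largest possible standard error for a proportion is therefore

$$ \sqrt{\frac{0.5 \times 0.5}{n}} = \frac{1}{2\sqrt{n}} $$

which is a convenient worst case to use when you have no idea what the proportion will be.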

Confidence Interval for an Odds Ratio

The final example involves computing an odds ratio. We often use the odds ratio to summarize data in a two by two table. The rows of the table might represent disease status (healthy/diseased) and the columns might represent exposure status (exposed/unexposed). In this case, the odds ratio would represent the relative change in the odds of disease between exposed and unexposed patients.

Or possibly the rows might represent treatment status (active drug/placebo) and the columns might represent health outcome (improvement/no improvement). Here, the odds ratio represents the relative change in the odds of improvement between drug and placebo.

If we let the letters a, b, c, and d represent the frequency counts in a two by two table (see below)

              Column 1   Column 2
  Row 1          a          b
  Row 2          c          d

then the odds ratio would be ad/bc. The odds ratio is skewed, so we cannot easily compute a standard error for the odds ratio itself. We can, however, find a standard error for the natural logarithm of the odds ratio. It is simply

$$ \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} $$

We see that as any or all of the counts in the two by two table increase, the confidence interval for the log odds ratio shrinks. Also, it turns out that the smallest count in the two by two table plays the largest role in determining the size of the standard error.

Example of a Confidence Interval For a Mean

In a study of immunotherapy in children with asthma, 61 patients showed an average improvement of 2.5% in peak expiratory flow rate with a standard deviation of 11%. We divide the standard deviation by the square root of 61 to get a standard error of 1.4. A crude confidence interval would be 2.5% plus or minus 2.8%, which equals -0.3% to 5.3%. I'm not an expert on asthma, but if we defined a range of clinical indifference to be an improvement of less than 5%, then this confidence interval lies almost entirely within the range of clinical indifference.

Example of a Confidence Interval for An Odds Ratio

In the same study, the author noted that 15 out of 53 immunotherapy patients showed partial remission on their need for medication. This sample size is smaller because of a small number of dropouts. In the placebo group, 12 out of 57 showed partial remission. The two by two table for these data looks like

                   Partial remission   No partial remission   Total
  Immunotherapy           15                    38              53
  Placebo                 12                    45              57

The odds ratio is 1.5, which shows that the immunotherapy treatment increases the odds of partial remission. The natural log of the odds ratio is 0.4. For this calculation, be sure that you use a natural logarithm and not a base 10 logarithm.

The standard error of the log odds ratio is

$$ \sqrt{\frac{1}{15} + \frac{1}{38} + \frac{1}{12} + \frac{1}{45}} = 0.45 $$

So a crude confidence interval for the log odds ratio is 0.4 plus or minus 0.9, which equals -0.5 to 1.3. We can exponentiate (use the exp button on your scientific calculator) to convert back to the original measurement scale. This gives us a confidence interval of 0.6 to 3.6 for the odds ratio itself. Even though this interval contains 1, we still have to allow for the possibility that the improvement might be as large as two-fold or three-fold.
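
The steps in this example can be strung together in a short script. This is a sketch under the same "plus or minus two standard errors" shortcut used on this page; the function name crude_odds_ratio_interval is invented for the example, and the counts of 38 and 45 come from subtracting the remissions from the group totals of 53 and 57.

```python
from math import exp, log, sqrt


def crude_odds_ratio_interval(a, b, c, d):
    """Odds ratio with a crude interval of plus or minus two standard errors on the log scale."""
    odds_ratio = (a * d) / (b * c)
    log_or = log(odds_ratio)  # natural logarithm, not base 10
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log_or - 2 * se)
    upper = exp(log_or + 2 * se)
    return odds_ratio, log_or, se, lower, upper


# Immunotherapy: 15 with partial remission, 38 without; placebo: 12 with, 45 without
odds_ratio, log_or, se, lower, upper = crude_odds_ratio_interval(15, 38, 12, 45)
print(f"OR={odds_ratio:.2f}  log OR={log_or:.2f}  SE={se:.2f}  interval={lower:.2f} to {upper:.2f}")
```

Carrying full precision through the calculation and rounding only at the end avoids the small discrepancies that creep in when you round each intermediate step by hand.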

Summary

A confidence interval is a range of plausible values that accounts for uncertainty in a statistical estimate. A narrow confidence interval implies high precision; a wide interval implies poor precision.

When you see a confidence interval in a published medical report, you should look for two things.

  1. Does the interval contain a value that implies no change or no effect?
  2. Does the confidence interval lie partly or entirely within a range of clinical indifference?

This webpage was written by Steve Simon on (unknown date), edited by Steve Simon and Linda Foland, and was last modified on 2008-07-14. Category: Confidence intervals, Category: Statistical evidence


Documenting negative results in a research paper (October 11, 2001) Category: Ask Professor Mean, Category: Confidence intervals, Category: Sample size justification

Dear Professor Mean, I have just finished a well-designed research study and my results are negative. I'm worried about publication bias; most journals will only accept papers that show positive results. How do I document the negative findings in a research paper in a way that will convince a journal to accept my paper? -- Apprehensive Arturo

Dear Apprehensive,

Don't worry about publication bias. While it is true that a study with positive results is more likely to get published, it's only a tendency. Besides we really don't know if publication bias is caused by referees and journal editors. Some people suspect that the researchers themselves hold back on publishing negative results: a type of self-censorship.

Also please keep in mind that terms like "negative results" are simplistic, subjective, and ambiguous. There is good evidence, for example, that two people reading the same paper can often come up with different opinions about whether that study is positive or negative.

Still your question is a good one. How do you document that the results from your negative study have credibility? It's a question that should be asked from the other side, as well. If you read a negative study, what should you look for to decide whether the study has credibility?

Short answer

If you have a well-designed negative study and you want to get it published, be sure to stress the aspects of your study that are well designed. Most of these aspects are the same whether the study is positive or negative. For a negative study, though, you should emphasize two things:

  1. confidence intervals for all of your key outcome measures,
  2. justification of your sample size, ideally done prior to data collection.

More details

If you show that your sample size was adequate, either by showing that your study had adequate power, or by showing that your confidence intervals have good precision, then your negative findings will have a lot of credibility.

Power/sample size calculation

You say your study was well-designed. Good! That means that you have a power or sample size calculation. Your power or sample size calculation is best done a priori (prior to the collection of data). If you only calculate power post hoc (after the data are collected), make sure that the effect size used in that calculation is based on what is considered a clinically relevant difference, and is not based on the difference that was observed in your study.

Post hoc power calculations that use the differences observed in the study are useless, because they tell you nothing more than what your p-value already told you. If you have a large p-value, then the post hoc power at the observed difference is always very low. If you have a small p-value, then it is always very high.
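
You can see this redundancy directly. The sketch below assumes a two-sided z-test (an assumption made to keep the arithmetic simple; the function name post_hoc_power is invented here). It converts a p-value back into the observed test statistic and then computes the power you would get if the true effect were exactly the observed effect.

```python
from statistics import NormalDist


def post_hoc_power(p_value, alpha=0.05):
    """Power of a two-sided z-test when the true effect equals the observed effect."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    z_obs = z.inv_cdf(1 - p_value / 2)  # |z| that produces this two-sided p-value
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.80, 0.50, 0.20, 0.05, 0.01, 0.001):
    print(f"p-value={p:<6}  post hoc power={post_hoc_power(p):.2f}")
```

Nothing but the p-value goes into the function, so the post hoc power cannot carry any new information. Under this assumption, a p-value of exactly 0.05 corresponds to a post hoc power of about 50%, and any larger p-value gives less.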

When papers specify power or sample size calculations, their results are sometimes ambiguous. A 1998 paper on the use of various doses of an anti-emetic drug notes that

"Five hundred twenty patients were needed to detect a significant difference with 90% probability"

but they don't define what a significant difference is. When you write your paper, be sure to specify the following:

  1. the outcome variable you are basing the power calculation on,
  2. how much of a change you consider clinically relevant,
  3. any estimated values (standard deviations, baseline rates) that you used in the power calculation,
  4. where you got these estimates, and
  5. the power or the probability that your research design would detect that size difference.

It's also nice to provide a reference for the formula and/or the software you used for your power calculation.

Confidence intervals

A second way to enhance the credibility of negative findings is to present your results using confidence intervals. The width of a confidence interval provides especially valuable information for a negative study. If the interval is so narrow that it excludes any clinically relevant difference, then you have demonstrated a clear lack of effect.

If instead the confidence interval is wide enough to drive a truck through, then you have a lot of uncertainty and ambiguity. The wide confidence intervals show that maybe the negative findings could be real or maybe they could be caused by an inadequate sample size. This is a very unhappy situation, because it means that we will never know for sure why the study was negative.

Since your study was well-designed, all of your confidence intervals will be narrow enough to make definitive statements about the effect or lack of effect of your treatment. Well maybe not all of your confidence intervals, but those for your primary outcome measures will be narrow.

Example

A paper in the New England Journal of Medicine describes

"a double-blind placebo controlled trial of multiple-allergen immunotherapy in 121 allergic children with moderate-to-severe perennial (year-round) asthma."

The authors conclude that

"immunotherapy with injections of allergens for over two years was of no discernible benefit in allergic children with perennial asthma who were receiving appropriate medical treatment."

Let's examine how they justify this negative finding.

Power calculation

In the methods section, the authors state that

"on the basis of response rates from the most comparable previous study, we estimated a priori that a sample of 60 subjects per group would be required for an alpha level of 0.05 and a beta level of 0.8."

I suspect that the authors meant to say either that the power was 0.8 or that the beta level was 0.2. It is clear from the previous context that the authors are referring to the primary outcome variable, a 10 point medication score. They did not, however, define how much of a change in the medication score they considered relevant.

On the positive side, the authors did conduct this power calculation a priori (prior to data collection).

Confidence intervals

The other thing that the authors did which was very helpful was to present confidence intervals for all of their outcome measures. Since they made measurements at baseline and at the conclusion of therapy, the relevant confidence intervals involve change scores (differences between conclusion and baseline).

The immunotherapy group showed a decline of 1.4 units in the medication score, and the placebo group showed a decline of 1.2 units. The immunotherapy group showed a 0.2 unit larger decline than the placebo group. This is a bit confusing, because we are looking at a change in the change score. What this means is that although the immunotherapy group showed a decline, it was only slightly better than what a placebo would provide.

The 95% confidence interval for the change in change scores is -0.48 to 0.92. This tells us that even after accounting for sampling error, the largest improvement due to immunotherapy relative to placebo is still less than one unit.

I'm not an expert on asthma, but this seems like a clinically insignificant change. Generally, it is hard to get excited by changes less than one unit on these 5 or 10 point scales. In surveys about people's attitudes, for example, a one unit change usually implies a subtle change in adjectives (slightly disagree versus moderately disagree).

The medication score is a bit trickier to interpret, however. The authors describe "a score of 0 indicated no medication, 2, two to four doses of albuterol; 6 inhaled beclomethasone or alternate-day methylprednisone; and 10, a high dose of methylprednisone (>1 mg per kilogram per day)." So it appears that a one unit decrease at best implies either a lowering in dosage or moving to a less serious type of medication.

You can see another indication that this is a small change, by looking at the relative size of the change. The baseline medication scores were around 5 for both groups, so the 0.2 unit change tells us that the immunotherapy group got 4% closer (.2/5) to the goal of no medication than the placebo group did. Even a full unit change seems unimpressive since it represents only getting 20% closer.

Finally, it is sometimes useful to compute the effect size, the magnitude of the change divided by the standard deviation of the change. Both groups had a standard deviation of around 2 units, so a 0.2 unit change represents 0.1 standard deviations.

Jacob Cohen defines a small effect size as 0.2 standard deviations, and this is a difference so small that it would be imperceptible. For example, the difference in heights between 14 and 15 year old girls is roughly 0.2 standard deviations. So clearly, the difference that we did see, is very small.

A full unit change in the medication score represents a 0.5 standard deviation difference, which Jacob Cohen describes as a medium effect size. So we see that this sample is large enough that we can clearly rule out the possibility of a medium effect size. Furthermore, the effect size that we did see is smaller than small.
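
Written out, the effect size arithmetic looks like this, using the standard deviation of about 2 units quoted above (the symbol d is introduced here only as shorthand for the effect size):

$$ d = \frac{\text{change}}{\text{standard deviation}}, \qquad \frac{0.2}{2} = 0.1, \qquad \frac{1}{2} = 0.5 $$

Applying the same division to the upper confidence limit of 0.92 units gives

$$ \frac{0.92}{2} = 0.46 $$

which falls just below 0.5, and that is the sense in which a medium effect size can be ruled out.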

You can apply this type of logic to any of the secondary endpoints as well. The researchers looked at how often the asthma patients needed to make medical contacts. For example, while both the placebo and the immunotherapy group had declines in the number of telephone calls in the past 60 days, the decline was larger in the immunotherapy group (0.22 versus 0.05, or a difference of 0.17). The 95% confidence interval is -0.49 to 0.16, showing that at best, immunotherapy led to an average of half a phone call less than placebo. Even the busiest physician would not notice an extra phone call every 120 days. The other measures of medical contacts (office visits, emergency room visits, and hospitalization) also had confidence intervals that are so narrow as to exclude any clinically relevant change.

NNT and NNH calculations

Another secondary endpoint is the proportion of patients showing partial remission (a medication score of 2 or less) and complete remission (a medication score of 0). Although these authors do not present an NNT (Number Needed to Treat) calculation, we can do so ourselves in order to get a feel for whether the differences seen between the immunotherapy and placebo group are clinically relevant.

The authors report that 28% of the patients in the immunotherapy group showed partial remission compared to 21% of the placebo group. This is an absolute change of 7%. This change did not achieve statistical significance. We should still ask ourselves the question: even if this difference did achieve significance, is it enough of a difference to be clinically relevant?

You can compute the NNT by inverting the absolute change. In this case, NNT=14 (=1/.07). This tells you that you have to treat 14 patients with immunotherapy in order to see one additional partial remission on average, compared to placebo. This number is large, telling us that we have to treat a lot of patients to see a small number of successes.

The picture for complete remission is even more pessimistic. The rates of complete remission are 7.5% and 5.8% respectively, showing that the immunotherapy led to an absolute decrease of 1.3%. This difference also fails to achieve statistical significance. The number needed to treat is 77 (=1/.013). You would have to treat 77 patients on average to see one additional complete remission in medication usage.

It's interesting to compare this to the side effects seen with the immunotherapy. Systemic reactions occurred in 34% of the immunotherapy patients and only 7% of the placebo patients. This is a 27% difference, which did achieve statistical significance. Inverting this difference provides us with an estimate of NNH (Number Needed to Harm). Think of the number needed to harm as a measure of how often you will see additional side effects. For this endpoint, NNH=3.7 which tells you that you will see one additional systemic reaction on average for every four patients that we treat with immunotherapy instead of placebo.

The ratios between NNT and NNH calculations are sometimes instructive as well. You will see 3.8 (=14/3.7) systemic reactions for every partial remission and 21 (=77/3.7) systemic reactions for every complete remission.
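
These calculations are simple enough to script. The sketch below takes the absolute differences exactly as quoted in the text (7%, 1.3%, and 27%); the function name number_needed is invented for this example.

```python
def number_needed(absolute_difference):
    """Number needed to treat (or harm): the reciprocal of an absolute risk difference."""
    return 1 / absolute_difference


nnt_partial = number_needed(0.07)    # partial remission: 7% absolute difference
nnt_complete = number_needed(0.013)  # complete remission: 1.3% absolute difference, as quoted
nnh_systemic = number_needed(0.27)   # systemic reactions: 27% absolute difference

print(f"NNT partial={nnt_partial:.1f}  NNT complete={nnt_complete:.1f}  NNH={nnh_systemic:.1f}")
print(f"systemic reactions per partial remission={nnt_partial / nnh_systemic:.1f}")
print(f"systemic reactions per complete remission={nnt_complete / nnh_systemic:.1f}")
```

Because the script does not round the NNT and NNH values before dividing, the ratios can differ slightly in the last digit from the hand calculations of 14/3.7 and 77/3.7.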

As someone without much medical expertise, I hesitate to try to judge the tradeoffs between improvements in partial and complete remission compared to additional risk for systemic reactions. A trained physician, however, can make useful judgements when the data are presented this way.

I should also caution that these NNT and NNH calculations should ideally be accompanied by confidence limits. The calculation of confidence limits, however, is a lot more complex than the calculation of the NNT and NNH values themselves.

Summary

Apprehensive Arturo has just finished a research study that is negative and worries that he won't be able to publish his results. Professor Mean assures him that he can publish his paper as long as he documents that the research was good. This documentation should include a power calculation conducted prior to data collection and/or confidence intervals to summarize the magnitude of his effects.

Further reading

The Lang and Secic book gives some good general advice about how to write up research results.

How to Report Statistics in Medicine.
Lang TA, Secic M.
Philadelphia PA: American College of Physicians (1997).
ISBN: 0-943126-44-4.

This page was written by Steve Simon.  It was last modified on 09/22/2007.


Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.