Children's Mercy Hospital

Stats #05: Using SPSS to Develop a Survival Data Model

Content:  This three-hour training class gives you a general introduction to using SPSS software to fit survival data models. These models compare the amount of time until a certain event (such as death or relapse) occurs. This class is useful for anyone who encounters survival times as part of their research. You will get hands-on computer experience in the CMH computer lab using SPSS software.

Objectives:  In this class, you will learn how to:

  • define censoring;
  • apply the Kaplan-Meier estimate of the survival function;
  • compare survival times using the log-rank test; and
  • evaluate effects on survival using the proportional hazards model.

Teaching strategies:  Didactic lectures and individual computer exercises.

IRB Education Credits:  This class does not qualify for IRB Education Credits (IRBECs).

Outline:

  • Seating in the computer lab
  • Overview of the STATS web pages
  • Stats: Consulting services that I provide
  • Installing SPSS terminal server (draft)
  • Data management for survival data
  • Kaplan Meier
  • Guidelines for survival data models
  • Please fill out an evaluation form

Welcome to this SPSS computer training class! Please be seated in front of any computer which has a monitor turned on. If the monitor is turned off, that means the computer is not working properly today.


Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details


For CMH employees only: Statistical Consulting Services.

You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. We also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.

Here are some examples of the services that we have provided:

  • setting up your research hypothesis,
  • selecting and justifying your sample size,
  • writing the statistical methods section for your grant,
  • preparing randomization tables for your study,
  • reviewing your surveys for content and quality,
  • developing a system for entering your data,
  • choosing an appropriate statistical model for your data,
  • establishing validity and/or reliability for your measurement scales,
  • checking for violations of statistical assumptions in your data,
  • producing graphs and tables for your research publication, and
  • providing references for new and unusual statistical methods.

Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.

How to get in touch with a statistician

If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).

This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Directions to my new office (April 25, 2008).

I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.

  • Take the elevator of the research tower down to the yellow level. Exit the employee parking garage on 23rd Street, walk to Kenwood and cross 23rd Street. Your destination is Building M 3, which is the building closest to 22nd Street. However, the entrance to our building faces Building M 2. It’s best to walk into the parking area that is just north of Building M 1 and follow the sidewalk around the west side of Building M 2 in order to get to our building’s entrance on its south side. Another route would be to exit the Hospital Hill Center Building on Holmes and then walk ½ block north to 23rd Street, cross 23rd Street, walk west to Kenwood, then north to Building M 3 (address: 2220 Kenwood).

This webpage was written by Steve Simon and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Terminal Server (February 3, 2003)

Terminal server is a new and improved approach to using SPSS and SigmaPlot and other programs. You log on to a dedicated computer and run your programs on that computer rather than running SPSS or SigmaPlot through the network.

Terminal server offers several advantages:

  • Because the code runs on a dedicated computer, SPSS and SigmaPlot will load faster and run faster.
  • You will no longer have to worry about upgrading to new versions of SPSS and SigmaPlot. The upgrades will be handled for you.
  • If you have an older computer with compatibility problems, you will encounter fewer difficulties with terminal server.

Listed below are instructions on how to load terminal server on your computer. It is very easy, even for someone who is not a computer nerd. If you prefer to have someone else load terminal server for you, please ask your contact person in Information Systems for help.

We have done some work with the test version of terminal services. You don't need to use the test version, except for special and unusual situations.



Downloading and installing terminal server.

The software to load terminal server is located on an internal web site. Open Internet Explorer and type

http://10.1.20.59/ts/install.exe

in the address bar. You will get a FILE DOWNLOAD dialog box (see below) that will ask you what to do with the file.

It might look slightly different, depending on the version of Internet Explorer that you are using. Click on the OPEN button. If you don't see an OPEN button, click on the RUN button.

If you see a SECURITY dialog box and/or WINZIP dialog box, click on the YES button to continue.

Once the installation is complete, click on the START button and select Programs | Terminal Service Client | Client Connection Manager. Then right-click on the CMH TERMINAL icon. This brings up a popup menu (see below). Select PROPERTIES from the popup menu.

This will open up the PROPERTIES dialog box. Select the CONNECTION OPTIONS tab and click on the FULL SCREEN option button. This will ensure that terminal server will use your full screen rather than just part of your screen.

Click on the OK button to close this dialog box.

For a second time, right click on the CMH terminal icon to bring up the popup menu. Select CREATE SHORTCUT ON THE DESKTOP from the popup menu. If you do not see an option for "Create Shortcut on the Desktop", you can select "Send To" and then "Desktop (create shortcut)".


Can I load terminal server on my laptop? Your laptop needs to be connected to the network using a high speed internet connection or it needs to be attached directly to the hospital network. Follow the same steps described above. This will allow you to use SPSS on your laptop as long as you have a direct network connection or a connection via high speed internet access.

This webpage was written by Steve Simon on 2003-06-06, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.


Logging on to terminal server.

Click on the TRMSERV icon and a TRMSERV - Terminal Services Client window will appear. In the Log On to Windows dialog box (see below), type the same user name and password that you use when you turn on your computer in the morning.

You will now see the desktop of terminal server (see below). This looks very similar to your own desktop, except that it has a different background color and it has the SPSS 11.5 for Windows icon.

You are now connected to terminal server. Double click to open the SPSS folder and then click on the SPSS icon to run SPSS.

How do I exit from terminal server?

At the bottom of the terminal screen is a start button that looks just like the START button on your regular computer. Click on START and select Shut Down from the menu. You will see either a DISCONNECT or LOG OFF option chosen (see below).

Either one works the same. Click on the OK button. Close the Connect to Terminal Server window.



Using files with terminal server.

You cannot use your floppy disk drive or your local hard drive directly from terminal server. Instead, you must save the file on a network drive. The best location is probably your user folder.

You have to tell terminal server the network name of your user folder. Do this once and your computer will remember from that moment forward.

Connect to terminal server and click on the MY COMPUTER icon. This will bring up a folder labeled My Computer (see below).

From the menu, select Tools | Map Network Drive. This will bring up the Map Network Drive dialog box (see below).

You need to assign a drive letter to the location of your network files. It would be best to set the drive letter to V:, but you can use a different letter if you like. Then type in the name of your folder. For me, it would be \\cmhsan08\users\ssimon.

After you have saved to the network, you can copy the file to a floppy disk or a local hard drive.

How do I open files in SPSS terminal server?

You can only open files located on the network. Before you connect to terminal server, copy the file from your floppy disk to a location on the network.

How do I use the training example data sets?

Training example data sets appear on a folder on the desktop as the SPSS Examples folder. Double click on this folder to open it. You can also find this folder on the D drive at D:\SPSS Examples.



Printing from terminal server.

You have to tell terminal server the network name of the printer that you normally use. Do this once and your computer will remember from that moment forward.

Open Internet Explorer and select File | Print from the menu.

Click on the ADD PRINTER icon and follow the instructions. You will get a series of dialog boxes labeled Add Printer Wizard. The instructions are mostly straightforward. After an introductory screen (not shown), you will get a dialog box asking if you are adding a local printer or a network printer. You must choose the Network printer option (see below), since terminal server does not work with local printers.

It helps if you know the exact name of your printer (one of the printers I use is named \\hpprint02\Medrsrch2).

If you know the exact name of your printer, you can type it in the above dialog box. If you are not 100% certain about the name of your printer, check the option anyway and leave the name blank. You will get a list of printers and print servers to browse through (see below). Do not use the Find a printer in the Directory option button, as that does not work well (at least not for me).

Once you have selected your printer, you should decide if this is the default printer, the one that SPSS terminal server will try to use as its first choice.

When you click on the Next button, you will get a dialog box summarizing your choices. If these choices appear reasonable, click on the Finish button. If something appears to be wrong, use the Back button to fix things.



Removing terminal server from your computer.

Click on Start | Settings | Control Panel | Add/Remove Programs. Find Terminal Services Client on the list of programs and click on the Change/Remove button.



Terminal server--What if I get an error message?

First of all, don't panic. Some of these error messages occur because efforts to protect against viruses, trojan horses, and other malicious software also interfere with the normal operations of SPSS and Terminal Server. Here are some of the messages that I have encountered already with a brief explanation of what causes the message and how to work around it.

If you encounter an error message other than the ones described here, please contact me.

Administrator access message. I have never seen this message on my computer, but your computer might pop up a dialog box when you are trying to install terminal server that says something along the lines of you don't have sufficient access or permission or administrator privileges to run SPSS. When IS set up your computer, they added a security layer that protects you against malicious viruses and other computer threats but which also disables your ability to install software on your own. You need to call the help desk and they will temporarily grant you "king/queen for a day" privileges that will enable you to install your own programs for a limited time.

"The client software could not initialize with SPSS Server at ." When loading SPSS, I would get a dialog box that says: "The client software could not initialize with SPSS Server at ." The folks at SPSS told me the solution. "This is the result of either a missing or corrupted file named 'registry.txt' in the SPSS program folder. This problem can be fixed by either reinstalling SPSS or obtaining a new copy of that file from our FTP site and replace it with the one in your SPSS directory. That file is located at ftp://ftp.spss.com/pub/spss/windows. Please locate the one that's specific to your SPSS version." -- SPSS Web Support, personal communication, September 25, 2002.

"This action has been cancelled due to restrictions put on this computer." You should not be getting this error message anymore, but I am keeping it here just in case. This message is actually a paper tiger. What happened a while back is that someone was running terminal server from home and thought it would be fun to download some games to run on terminal server. You know what happened next, of course. Virus attack on terminal server! So our IS folks decided that they had to add some major security restrictions to terminal server. The restrictions interfere with some of the minor bookkeeping activities with SPSS as it starts up. Apparently when SPSS checks for a proper license, it touches a part of terminal server that raises a security flag. But whatever happens on terminal server stays on terminal server. If you click OK on the dialog box, everything in SPSS works just fine.

"Windows cannot access the specified device, path, or file. You may not have the appropriate permissions to access." At CMH, we have created an SPSS group for security reasons. If you are not part of the SPSS group, you cannot access SPSS and you will get an error message along the lines of the above. Call the help desk (5-3454) and ask to be added to the SPSS group. You may need to reboot your computer afterwards.

"You do not have sufficient access to your machine to connect to the selected printer."  This message appears when you are trying to get Terminal Server to recognize and print to your networked printer. This occurs when IS has not installed the appropriate printer drivers on terminal server. Tell me the brand name of your printer, and we can fix it from our end.




This webpage was written by Steve Simon, edited by Linda Foland and Steve Simon, and was last modified on 07/08/08 . Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.


Data management for survival data (August 27, 2002)

Every project is different, of course, but here are some general concepts that may help you manage data.

Survival data will involve calculating the time between the various dates and noting when certain dates are present or absent.

In a study of bone marrow transplants for childhood cancer, we have up to four dates:

  1. Date of bone marrow transplant (always known)
  2. Date of last follow-up (always known)
  3. Date of relapse (sometimes censored)
  4. Date of death (sometimes censored)

The dates of relapse and death are censored because either they did not occur, or they occurred after the date of last follow-up.

The data shown above represents an example of this data. Notice that the relapse and death dates are missing (censored) for most of the subjects. In this particular context, the values are missing, for the most part, because the subject has not yet relapsed or died. It could represent, however, a subject with whom we have lost touch, so that we don't know anything about the relapse date or death date except that if it exists, it has to be later than the date of the last follow-up.

Also notice that at least one subject has relapsed without dying and at least one subject died without having a relapse first.

Notice the formula here:

  • dth_days = (max(dth_date,fu_date)-op_date)/(24*60*60)

We want to account for the possibility that a patient died after their last follow-up, so we take the maximum value. The maximum value function will also handle missing values intelligently: if we have a follow-up date and the death date is missing, then only the follow-up date will be used in the calculation. SPSS stores date/time values as the number of seconds since October 14, 1582, so we adjust the value by dividing by 24 hours/day * 60 minutes/hour * 60 seconds/minute.

The new variable, dth_days, represents the number of days that the patient lived, when the death date is known. When the death date is missing, this variable represents the number of days that we followed up on the patient, which is a lower bound for the number of days the patient lived.

To distinguish between the two situations, we create a variable, dth_code, which equals 1 when the patient is not missing a death date and 0 when the patient is missing a death date. We could have used a different code (e.g., 1 if the patient is missing a death date, and 2 if the patient is not missing a death date).

This is what the data looks like after the transformations.

We need to create similar variables for analysis of progression free survival. In this analysis, we note the time until relapse. If the relapse time is missing, we note the time until the last follow-up instead.

The one key difference in this analysis is that it is possible to have a follow-up after relapse. So the formula for rel_days should be

  • rel_days = (min(rel_date,fu_date)-op_date)/(24*60*60)

The formula for rel_code will be

  • rel_code = ~ missing(rel_date)
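If you want to check these formulas outside of SPSS, here is a minimal sketch in Python. It is not SPSS syntax, and the function name survival_variables and the patient dates are invented for illustration. It assumes a missing (censored) date is passed as None, and it mimics the way the SPSS maximum and minimum functions skip missing values. Python dates subtract directly into days, so the division by 24*60*60 is not needed here.

```python
from datetime import date

def survival_variables(op_date, fu_date, dth_date=None, rel_date=None):
    """Return dth_days, dth_code, rel_days, rel_code for one patient.

    A missing (censored) date is passed as None.
    """
    # Drop missing dates before comparing, as SPSS MAX and MIN do.
    death_end = max(d for d in (dth_date, fu_date) if d is not None)
    relapse_end = min(d for d in (rel_date, fu_date) if d is not None)
    return {
        "dth_days": (death_end - op_date).days,
        "dth_code": 0 if dth_date is None else 1,   # 1 = death date known
        "rel_days": (relapse_end - op_date).days,
        "rel_code": 0 if rel_date is None else 1,   # 1 = relapse date known
    }

# Hypothetical patient: relapsed, but still alive at the last follow-up.
patient = survival_variables(
    op_date=date(2000, 1, 10),
    fu_date=date(2000, 4, 9),
    rel_date=date(2000, 2, 9),
)
print(patient)
```

For this made-up patient, the death time is censored at the last follow-up (dth_code is 0), while the relapse time is an observed event (rel_code is 1).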

These are not the only calculations possible. The time of the operation might be replaced by the time that the tumor was diagnosed or the time when therapy ended. You might also calculate a composite event, such as relapse and/or death within 100 days of the operation.

This webpage was written by Steve Simon on 2008-xx-xx, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Survival analysis


Kaplan Meier (June 27, 2000)

Dear Professor Mean, When I read my medical journals, I keep on coming across terms like "Kaplan-Meier Product Limit estimate" or "Kaplan-Meier survival curve." What do these terms mean and when are they used?

Often we want to measure how long it takes for something to occur. The most common (and the most morbid) example is how long it takes for someone to die. For this outcome, we want to estimate the fraction of patients who survive for at least one month, at least three months, etc. This estimate is known as a survival curve.

The term survival is sometimes misleading, because we can use it for other less severe outcomes like how long until a cancer relapse, or how long until an infection occurs. Sometimes it can even be used for a positive outcome, like how long it takes for a couple to conceive. But for the rest of this example, we'll keep things simple by assuming that the outcome is time until death.

Estimating a survival curve is often complicated by the uncooperative way in which research subjects sometimes behave. For example, some subjects decide to leave a study part of the way through. Others refuse to die before the study ends. We label these uncooperative subjects as censored observations. They survived for at least three months, but then we lost touch with them. Or they survived at least three years, but then we had to terminate the study.

Short explanation

The Kaplan-Meier estimate is a simple way of computing the survival curve in spite of all these troublesome research subjects. At each time point where a death occurs, it involves computing the number of people who survived that time point, divided by the number of people who were still in the study at that time. We multiply these probabilities by any earlier computed probabilities, which is one reason this is called a "product limit estimate."

The Kaplan-Meier survival curve is often illustrated graphically. It looks like a poorly designed staircase, with vertical steps downward at the time of death of each individual subject.

Often we will compare curves for two different groups of subjects. For example, what is the survival pattern for subjects on a standard therapy compared to a newer therapy? We can look for gaps in these curves in a horizontal or vertical direction. A vertical gap means that at a specific time point, one group had a greater fraction of subjects surviving. A horizontal gap means that it took longer for one group to experience a certain fraction of deaths.
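To make the two kinds of gaps concrete, here is a small sketch in Python. The two curves are hypothetical numbers that I made up for illustration, not data from any study, and the function names survival_at and time_to_reach are my own. Each curve is stored as a list of (time, fraction surviving) steps.

```python
def survival_at(curve, t):
    """Fraction surviving at time t for a step curve sorted by time."""
    prob = 1.0  # everyone is alive before the first step
    for step_time, step_prob in curve:
        if step_time > t:
            break
        prob = step_prob
    return prob

def time_to_reach(curve, fraction):
    """First time the curve drops to `fraction` or below (None if never)."""
    for step_time, step_prob in curve:
        if step_prob <= fraction:
            return step_time
    return None

# Hypothetical curves: (months, fraction surviving), invented for illustration.
standard = [(2, 0.8), (5, 0.6), (9, 0.4), (14, 0.2)]
newer = [(4, 0.9), (8, 0.7), (15, 0.5), (20, 0.4)]

# Vertical gap: difference in fraction surviving at 12 months.
vertical_gap = survival_at(newer, 12) - survival_at(standard, 12)

# Horizontal gap: difference in time until half of the subjects have died.
horizontal_gap = time_to_reach(newer, 0.5) - time_to_reach(standard, 0.5)

print(round(vertical_gap, 2), horizontal_gap)
```

The vertical gap is read at a fixed time (12 months here), while the horizontal gap is read at a fixed fraction (50% here, which corresponds to the median survival time).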

More details

To compute a survival curve, you need to note the times of occurrence of events (e.g., failures, deaths):

$$t_1, t_2, \ldots, t_k$$

It is possible for two or more events to occur at the same time, in which case the number of distinct times is less than the number of deaths or failures. You need to place the t's in order from smallest to largest. That is,

$$t_1 < t_2 < \cdots < t_k$$

You also need to define the starting point of the study,

$$t_0 = 0$$

The basic computations for the Kaplan-Meier survival curve rely on the computation of conditional survival probabilities. In particular, writing $T$ for the survival time, the probability

$$P[T > t_i \mid T > t_{i-1}]$$

which can be interpreted as the probability of your surviving to a specific time, given that you survived to the previous time. This probability is easy to calculate if you know the number of deaths or failures at a specific time and if you know the number of patients at risk at that same time.

A more difficult (but more important) probability is the unconditional probability of survival,

$$P[T > t_i]$$

which represents the simple probability of survival to a specific time. You can use a relationship between this unconditional probability and the conditional probability:

$$P[T > t_i] = P[T > t_i \mid T > t_{i-1}] \, P[T > t_{i-1}]$$

At first glance, this does not seem to help, because the right hand side of the equation still includes an unconditional probability. But we can apply this approach again to get

$$P[T > t_i] = P[T > t_i \mid T > t_{i-1}] \, P[T > t_{i-1} \mid T > t_{i-2}] \, P[T > t_{i-2}]$$

and we can continue along these lines to get

$$P[T > t_i] = P[T > t_i \mid T > t_{i-1}] \, P[T > t_{i-1} \mid T > t_{i-2}] \cdots P[T > t_1 \mid T > t_0] \, P[T > t_0]$$

This last probability represents the probability of surviving at the start of the study. Unless we intentionally recruit dead subjects, this probability has to be 1. Therefore, the unconditional probability is equal to the cumulative product of conditional probabilities.

At each time point, you should count

$$d_i = \text{the number of deaths or failures at time } t_i$$

You should also count

$$c_i = \text{the number of observations censored at or after } t_i \text{ but before } t_{i+1}$$

Armed with this information, you can now compute a Kaplan-Meier survival curve. First you need to calculate the number of patients at risk,

$$n_i = n_{i-1} - d_{i-1} - c_{i-1}$$

In other words, the number at risk at any specific time point is just the number at risk at the previous time point, minus the number of deaths/failures and the number of censored observations. For convenience, we define

wpe53.gif (2872 bytes)

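Next you compute the conditional probability of survival from the number at risk, $n_i$, and the number of deaths or failures, $d_i$, at time $t_i$:

$$P[T > t_i \mid T > t_{i-1}] = \frac{n_i - d_i}{n_i}$$

Finally, the unconditional probability of survival is simply the cumulative product of the conditional probabilities:

$$P[T > t_i] = \prod_{j=1}^{i} \frac{n_j - d_j}{n_j}$$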
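The three steps above (number at risk, conditional probability, cumulative product) can be sketched in a few lines of Python. This is only an illustration, not what SPSS does internally. The function name kaplan_meier and the six-subject data set are invented for this example, and I am assuming the usual convention that a subject censored at a given time still counts as at risk for a death at that same time.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier table from one record per subject.

    times:  follow-up time for each subject
    events: 1 = death/failure, 0 = censored
    Returns one row per distinct event time:
    (time, at_risk, deaths, conditional_prob, survival_prob)
    """
    rows = []
    surv = 1.0  # probability of surviving past the start of the study
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # At risk: everyone whose follow-up lasted at least until time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        cond = (at_risk - deaths) / at_risk
        surv *= cond  # cumulative product of conditional probabilities
        rows.append((t, at_risk, deaths, cond, surv))
    return rows

# Hypothetical data, invented for illustration.
times = [2, 3, 3, 5, 7, 8]
events = [1, 0, 1, 1, 0, 1]

rows = kaplan_meier(times, events)
for row in rows:
    print(row)
```

Notice that the censored subjects (at times 3 and 7) never produce a step of their own. They only shrink the number at risk for later time points.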

Example

The following example is from Chadha et al (2000). The authors studied a sample of 36 pediatric patients undergoing acute peritoneal dialysis through Cook Catheters. They wished to examine how long these catheters performed properly. They noted the date of complication (either occlusion, leakage, exit-site infection, or peritonitis).

Half of the subjects had no complications before the catheter was removed. Reasons for removal of the catheter in this group of patients were that the patient recovered (n=4), the patient died (n=9), or the catheter was changed to a different type electively (n=5). If the catheter was removed prior to complications, that represented a censored observation, because they knew that the catheter stayed complication free at least until the time of removal.


Figure 3.1 Failures and censored observations for catheter study.

The table above lists the days at which failures and/or censored observations occurred.


Figure 3.2 Computation of number of patients at risk

To compute a Kaplan-Meier survival curve, you first need to compute the number of catheters at risk on each day. This is just the number of catheters that were not previously censored or failures. These calculations appear in the table shown above.

Figure 3.3 Computation of conditional probability of survival.

Next you need to compute the conditional probability of survival. This is the probability that a catheter will survive at a specific time point, given that it survived (and was not censored) at any previous time point. These calculations appear in the table shown above.


Figure 3.4 Computation of unconditional survival probabilities.

Finally, you need to compute the cumulative product: the product of each conditional probability with all previous conditional probabilities. This provides the estimates of survival probability used in the Kaplan-Meier curve. These calculations appear in the table shown above.


Figure 3.5 Graph of unconditional survival probabilities (Kaplan-Meier curve).

The graph you see above is the Kaplan-Meier curve as computed by SPSS. Select ANALYZE | SURVIVAL | KAPLAN-MEIER from the menu to get this graph.

Figure 3.6 SPSS dialog box for Kaplan-Meier procedure.

The figure above shows the SPSS dialog box. The time until the event (either failure or censoring) goes in the TIME field. In the STATUS field, you should place the variable which indicates whether the event was a failure or a censored observation. Click on the DEFINE EVENT button to tell SPSS what codes you used.

Figure 3.7 SPSS dialog box for defining events.

The figure shown above is the SPSS dialog box where you distinguish between failures and censoring. In this data set, a value of 1 indicates a failure and 0 represents censoring.

Figure 3.8 SPSS dialog box for Kaplan-Meier options.

Also be sure to click on the OPTIONS button in the main dialog box. The figure above shows you the dialog box you see when you click on the OPTIONS button. Be sure that the SURVIVAL PLOTS option is checked.

Reference

Tenckhoff Catheters Prove Superior to Cook Catheters in Pediatric Acute Peritoneal Dialysis.
Chadha V, Warady BA, Blowey DL, Simckes AM, Alon US.
American Journal of Kidney Diseases (2000), 35(6):1111-1116.

Further reading

There are many beginning level books on biostatistics that discuss the Kaplan-Meier curve, such as Woolson's book. You can find a more advanced and detailed approach in Collett's book.

  1. Modelling Survival Data in Medical Research.
    Collett D.
    London England: Chapman and Hall (1994).
    ISBN: 0-412-44890-4.
  2. Statistical Methods for the Analysis of Biomedical Data
    Woolson RF.
    New York NY: John Wiley & Sons, Inc. (1987).
    ISBN: 0-471-80615-3.

Here is some extra material that I need to integrate into the above description.

Survival probabilities involve the estimation of the time to some event. Usually, the event involves death or failure of some sort. Some of the patients may not experience the event, because the study ends before they die, or we lose touch with them partway through the study. For these patients we have partial information: we know that the event occurred (or will occur) sometime after the date of last follow-up. We refer to these patients as censored observations. We don't want to ignore these patients, because they provide some information about survival, but we need to handle them differently.

The first step in a survival data analysis is to estimate survival probabilities for each group. When we know the exact date of death (or failure) for each patient, this computation is trivial. In most situations, however, we will have partial information on some of the patients. We will know that they survived beyond a certain point, but because the study ended before all the patients died, or because we lost touch with some of the patients, or because they withdrew from the study, we do not know the exact date of death. These patients represent censored observations, observations that you have to account for differently than others.

A simple example of censored data involves failure of a device, and not the death of a person. In a study of catheters for peritoneal dialysis, these catheters can fail due to occlusion, leakage, or infection. Some catheters are removed prior to failure, usually either because the patient completed dialysis or the patient died. If the catheter is removed prior to failure, that is considered a censored observation.

       Catheters removed     Catheters
Day    prior to failure      failed
  1            8                 2
  2            2                 2
  3            1                 1
  4            1                 1
  5            5                 3
  6            0                 2
  7            0                 1
 10            0                 2
 12            0                 2
 13            0                 1

If you wanted to estimate the probability that a catheter will survive its first day, that's easy. There were 34 catheters. Two did not survive the first day, and 15 failed on days 2-13. For the other 17 catheters, we do not know when they would have failed, but we do know that they all survived at least one day.

So the probability of surviving the first day is 32/34 = 94%.

But how would we estimate the probability of surviving two days? four days? ten days?

This is tricky, because the censored observations provide information up to the day of censoring, but cannot tell us anything more about surviving beyond that day. What we need to do is compute the number of catheters at risk on each day, meaning the number still in place and able to fail on that day. This count excludes any catheters that failed or were censored on previous days.

       Catheters removed     Catheters     Catheters
Day    prior to failure      failed        at risk
  1            8                 2         34
  2            2                 2         34-8-2 = 24
  3            1                 1         24-2-2 = 20
  4            1                 1         20-1-1 = 18
  5            5                 3         18-1-1 = 16
  6            0                 2         16-5-3 = 8
  7            0                 1         8-2 = 6
 10            0                 2         6-1 = 5
 12            0                 2         5-2 = 3
 13            0                 1         3-2 = 1
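The subtraction in the last column can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not anything SPSS produces; the names counts and number_at_risk are my own, and I have written a 0 where the table leaves the removed column blank.

```python
# Catheter counts from the table: (day, removed prior to failure, failed).
# A blank "removed" cell in the table is entered as 0 here.
counts = [
    (1, 8, 2), (2, 2, 2), (3, 1, 1), (4, 1, 1), (5, 5, 3),
    (6, 0, 2), (7, 0, 1), (10, 0, 2), (12, 0, 2), (13, 0, 1),
]


def number_at_risk(counts):
    """Return {day: number of catheters still at risk on that day}."""
    remaining = sum(removed + failed for _, removed, failed in counts)
    at_risk = {}
    for day, removed, failed in counts:
        at_risk[day] = remaining
        # Catheters that fail or are removed today are not at risk afterwards.
        remaining -= removed + failed
    return at_risk


at_risk = number_at_risk(counts)
for day, n in at_risk.items():
    print(f"day {day:2d}: {n:2d} at risk")
```

Every catheter counts as at risk on the day it fails or is removed, which is why the 8 catheters removed on day 1 still appear in the day 1 total of 34.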

We then need to compute the conditional probability of surviving at each time point given that the catheter survived the previous time point. This conditional probability would be

(number at risk - number of failures)/(number at risk)

       Catheters removed     Catheters     Catheters        Conditional
Day    prior to failure      failed        at risk          probability
  1            8                 2         34               32/34 = 0.94
  2            2                 2         34-8-2 = 24      22/24 = 0.92
  3            1                 1         24-2-2 = 20      19/20 = 0.95
  4            1                 1         20-1-1 = 18      17/18 = 0.94
  5            5                 3         18-1-1 = 16      13/16 = 0.81
  6            0                 2         16-5-3 = 8       6/8 = 0.75
  7            0                 1         8-2 = 6          5/6 = 0.83
 10            0                 2         6-1 = 5          3/5 = 0.60
 12            0                 2         5-2 = 3          1/3 = 0.33
 13            0                 1         3-2 = 1          0/1 = 0.00

Then we compute the cumulative product of these probabilities. This represents the Kaplan-Meier estimate of the survival probability.

       Catheters removed     Catheters     Catheters        Conditional       Cumulative
Day    prior to failure      failed        at risk          probability       product
  1            8                 2         34               32/34 = 0.94      0.94
  2            2                 2         34-8-2 = 24      22/24 = 0.92      0.86
  3            1                 1         24-2-2 = 20      19/20 = 0.95      0.82
  4            1                 1         20-1-1 = 18      17/18 = 0.94      0.77
  5            5                 3         18-1-1 = 16      13/16 = 0.81      0.63
  6            0                 2         16-5-3 = 8       6/8 = 0.75        0.47
  7            0                 1         8-2 = 6          5/6 = 0.83        0.39
 10            0                 2         6-1 = 5          3/5 = 0.60        0.24
 12            0                 2         5-2 = 3          1/3 = 0.33        0.08
 13            0                 1         3-2 = 1          0/1 = 0.00        0.00

Each cumulative product is the previous row's cumulative product times the current row's conditional probability. The products are carried forward from the unrounded fractions and rounded only for display; multiplying the rounded values instead would understate days 5 through 10 by 0.01.
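If you would like to check the hand calculation, here is a minimal Python sketch of the same steps: number at risk, conditional probability, and cumulative product. It is not how SPSS does the work internally; the function name kaplan_meier is my own, and blank cells in the removed column are entered as 0. The sketch keeps exact fractions and rounds only when printing, so the second decimal place can differ by 0.01 from a calculation that multiplies already rounded values.

```python
from fractions import Fraction

# Catheter counts from the table: (day, removed prior to failure, failed).
counts = [
    (1, 8, 2), (2, 2, 2), (3, 1, 1), (4, 1, 1), (5, 5, 3),
    (6, 0, 2), (7, 0, 1), (10, 0, 2), (12, 0, 2), (13, 0, 1),
]


def kaplan_meier(counts):
    """Return rows of (day, at_risk, conditional, cumulative) as exact fractions."""
    remaining = sum(removed + failed for _, removed, failed in counts)
    survival = Fraction(1)
    rows = []
    for day, removed, failed in counts:
        # Probability of surviving this day, given survival up to this day.
        conditional = Fraction(remaining - failed, remaining)
        survival *= conditional
        rows.append((day, remaining, conditional, survival))
        remaining -= removed + failed
    return rows


rows = kaplan_meier(counts)
for day, n, conditional, survival in rows:
    print(f"day {day:2d}  at risk {n:2d}  "
          f"conditional {float(conditional):.2f}  survival {float(survival):.2f}")
```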

Here is a graph of these survival probabilities. 

The plot has a "stair step" pattern, because we don't know the survival probability at fractional days (such as 2.5 days) and at some integer days (such as 9 days). By convention, we estimate the survival probability for these values as equaling the survival probability of the closest value that is still smaller (the 2 day survival probability for 2.5 days, and the 7 day survival probability at 9 days).

Notice that the estimated median survival time (the time at which 50% of the catheters survived) is six days.
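The "stair step" lookup and the median can both be read off the Kaplan-Meier estimates with a short Python sketch. The function names below are hypothetical, and the median rule used here (the first day on which the estimate is at or below 50%) is one common convention that I am assuming.

```python
from bisect import bisect_right

# Catheter counts from the table: (day, removed prior to failure, failed).
counts = [
    (1, 8, 2), (2, 2, 2), (3, 1, 1), (4, 1, 1), (5, 5, 3),
    (6, 0, 2), (7, 0, 1), (10, 0, 2), (12, 0, 2), (13, 0, 1),
]


def km_curve(counts):
    """Return [(day, Kaplan-Meier survival estimate)] for each day in the table."""
    remaining = sum(removed + failed for _, removed, failed in counts)
    survival, curve = 1.0, []
    for day, removed, failed in counts:
        survival *= (remaining - failed) / remaining
        curve.append((day, survival))
        remaining -= removed + failed
    return curve


def survival_at(curve, t):
    """Step-function lookup: use the estimate from the latest day that is <= t."""
    days = [day for day, _ in curve]
    i = bisect_right(days, t)
    return 1.0 if i == 0 else curve[i - 1][1]


def median_survival(curve):
    """First day on which the estimate is at or below 50% (None if never)."""
    for day, survival in curve:
        if survival <= 0.5:
            return day
    return None


curve = km_curve(counts)
print(survival_at(curve, 2.5) == survival_at(curve, 2))  # 2.5 days uses day 2
print(survival_at(curve, 9) == survival_at(curve, 7))    # 9 days uses day 7
print(median_survival(curve))
```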

Tenckhoff Catheters Prove Superior to Cook Catheters in Pediatric Acute Peritoneal Dialysis. Chadha V. American Journal of Kidney Diseases 2000:35(6);1111-1116.

This webpage was written by Steve Simon on 200-06-27 and was last modified on 07/14/2008. Category: Ask Professor Mean, Category: Survival analysis


Steps in a typical survival data analysis (October 11, 2002)

There are three steps in a typical survival analysis.

  1. Know how much data you have
  2. Graph the survival function
  3. Compare the survival times

Know how much data you have

How much data do you have? There are several ways of measuring this. The simplest is to note the number of patients that you have studied.

SPSS supplies a data set, AML Survival, that has data on 23 patients, 11 who received chemotherapy and 12 who did not.

You should also note how long these patients were studied. From the table above, you can see that we had a total of 678 patient-weeks of observation, or an average of about 29 weeks per patient (678/23). More time was observed in the chemotherapy group: 423 patient-weeks, or an average of 38 weeks per patient.

Finally, you should note the number of events that occurred. In this data set, relapse is the event, and we had a total of 18 relapses.

For the most part, it is the number of events, rather than the number of patients or the amount of time followed that determines the precision of your survival data model. You want to see roughly 25 to 50 events per group in order to have a good level of precision. By that standard, this data set is small.
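The three measures of "how much data" can be tallied with a short Python sketch. The times below are the AML maintenance data as they are commonly distributed (weeks to relapse, with 1 = relapse and 0 = censored). That the SPSS AML Survival file holds exactly these values is an assumption on my part, and the names chemo, no_chemo, and summarize are my own.

```python
# ASSUMPTION: these values match the SPSS "AML Survival" file.
# Each pair is (weeks of follow-up, status) with 1 = relapse, 0 = censored.
chemo = [(9, 1), (13, 1), (13, 0), (18, 1), (23, 1), (28, 0),
         (31, 1), (34, 1), (45, 0), (48, 1), (161, 0)]
no_chemo = [(5, 1), (5, 1), (8, 1), (8, 1), (12, 1), (16, 0),
            (23, 1), (27, 1), (30, 1), (33, 1), (43, 1), (45, 1)]


def summarize(group):
    """Return (patients, patient-weeks, events) for a list of (time, status)."""
    patients = len(group)
    weeks = sum(time for time, _ in group)
    events = sum(status for _, status in group)
    return patients, weeks, events


for name, group in [("chemotherapy", chemo),
                    ("no chemotherapy", no_chemo),
                    ("all patients", chemo + no_chemo)]:
    n, weeks, events = summarize(group)
    print(f"{name:16s} n={n:2d}  weeks={weeks:3d}  "
          f"mean={weeks / n:4.1f}  events={events:2d}")
```

The events column is the one to compare against the rough guideline of 25 to 50 events per group.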

Graph the survival function

The Kaplan Meier survival curve gives you a good estimate of the survival probabilities for each group you are studying.

In this graph, the relapse rate appears worse in the no chemotherapy group. Keep in mind, though, that there is a lot of variability in these curves, because the sample size is so small.

Notice that in the group without chemotherapy, the standard error is 0.14 at 12 weeks. Taking the estimate plus or minus about two standard errors, an approximate 95% confidence interval for the 12 week probability of remaining relapse-free would be 0.30 to 0.86.
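Here is a rough Python sketch of where an interval like this comes from. It assumes the standard error is computed with Greenwood's formula and that the no chemotherapy group has the relapse times listed below (the commonly distributed AML maintenance data; that SPSS ships the same values is an assumption). The function name km_with_greenwood is hypothetical.

```python
import math

# ASSUMPTION: no chemotherapy group of the AML data,
# (weeks, status) with 1 = relapse and 0 = censored.
no_chemo = [(5, 1), (5, 1), (8, 1), (8, 1), (12, 1), (16, 0),
            (23, 1), (27, 1), (30, 1), (33, 1), (43, 1), (45, 1)]


def km_with_greenwood(data, t):
    """Kaplan-Meier estimate at time t and its Greenwood standard error."""
    survival, var_sum = 1.0, 0.0
    event_times = sorted({time for time, status in data
                          if status == 1 and time <= t})
    for event_time in event_times:
        n = sum(1 for time, _ in data if time >= event_time)   # at risk
        d = sum(1 for time, status in data
                if time == event_time and status == 1)         # events
        if d == n:
            # The curve drops to zero; Greenwood's formula is undefined here.
            return 0.0, float("nan")
        survival *= (n - d) / n
        var_sum += d / (n * (n - d))
    return survival, survival * math.sqrt(var_sum)


estimate, se = km_with_greenwood(no_chemo, 12)
low, high = estimate - 1.96 * se, estimate + 1.96 * se
print(f"estimate {estimate:.2f}, SE {se:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

This simple plus-or-minus interval can stray outside the range 0 to 1 when the estimate is near either end, which is one reason some programs compute the interval on a transformed scale instead.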

Compare the survival times

A quick and simple comparison of the survival curves would use the mean, median, and/or quartiles. You could also estimate the survival at a certain time point.

The table above shows the mean, median, and quartiles for the no chemotherapy group. The median relapse time is 23 weeks, but the confidence interval extends all the way from 0.6 to 45.4 weeks.

The median survival time for the group with chemotherapy is larger, 31 weeks, but again there is a lot of variability in this estimate.

Sometimes you cannot estimate the median survival time, particularly if the number of events in the group is much less than half of the patients studied.

Suppose we are interested in the probability of remaining relapse-free for half a year (26 weeks). There is no data at 26 weeks, so you round down to the nearest value, which is 23 weeks. For the group with chemotherapy, the table directly above shows the estimated relapse-free probability at 23 weeks to be 0.61 with a standard error of 0.15. For the group without chemotherapy, the estimate is 0.49 with a standard error of 0.15. So the group with chemotherapy appears to have better values, but this difference is dwarfed by the uncertainty in the data.

The simplest formal test that compares two survival curves is the log rank test. In this example, the p-value is borderline, indicating a possible trend, but not quite achieving statistical significance.
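To show what the log rank test is doing, here is a minimal Python sketch of the usual two-group calculation: at each relapse time, compare the observed relapses in one group with the number expected if both groups shared the same risk. The data are the commonly distributed AML maintenance values (assumed, not confirmed, to match the SPSS file), the function name log_rank is my own, and SPSS may differ in small details, so treat the result as approximate.

```python
import math

# ASSUMPTION: AML maintenance data, (weeks, status), 1 = relapse, 0 = censored.
chemo = [(9, 1), (13, 1), (13, 0), (18, 1), (23, 1), (28, 0),
         (31, 1), (34, 1), (45, 0), (48, 1), (161, 0)]
no_chemo = [(5, 1), (5, 1), (8, 1), (8, 1), (12, 1), (16, 0),
            (23, 1), (27, 1), (30, 1), (33, 1), (43, 1), (45, 1)]


def log_rank(group_a, group_b):
    """Two-group log rank test; returns (chi_square, p_value) on 1 df."""
    pooled = group_a + group_b
    observed = expected = variance = 0.0
    for t in sorted({time for time, status in pooled if status == 1}):
        n_a = sum(1 for time, _ in group_a if time >= t)   # at risk, group A
        n = sum(1 for time, _ in pooled if time >= t)      # at risk, overall
        d_a = sum(1 for time, s in group_a if time == t and s == 1)
        d = sum(1 for time, s in pooled if time == t and s == 1)
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi_square = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square with 1 df, via the complementary error function.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


chi_square, p_value = log_rank(chemo, no_chemo)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.3f}")
```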

The Cox regression model is more complex, but it can look at relationships with both continuous and categorical predictors. This model estimates a hazard ratio of 2.5, with confidence limits going from 0.9 to 6.7. The hazard ratio could be interpreted as a relative risk. The risk of relapse is 2.5 times greater in the group without chemotherapy. Although this ratio is large, it does not quite achieve statistical significance due to the small sample size.
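You can get a rough sense of why this ratio falls short of significance from the confidence limits alone. The sketch below assumes the limits were computed on the log scale as the log hazard ratio plus or minus 1.96 standard errors, which is the usual approach but not something the output shown here confirms. Because the inputs are rounded, the p-value is only approximate.

```python
import math

# Values quoted in the text (rounded).
hazard_ratio, lower, upper = 2.5, 0.9, 6.7

# ASSUMPTION: the limits are exp(log HR +/- 1.96 * SE), so the width of the
# interval on the log scale recovers the standard error.
log_hr = math.log(hazard_ratio)
se_log_hr = (math.log(upper) - math.log(lower)) / (2 * 1.96)

z = log_hr / se_log_hr
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

print(f"SE of log HR = {se_log_hr:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```

A confidence interval for a ratio that includes 1.0, as this one does, always goes with a two-sided p-value above 0.05.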

This webpage was written by Steve Simon on 2002-10-11, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Survival analysis


Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.