Survival Analysis in R: A Practical Tutorial

Using Scottish Metastatic Breast Cancer Data

Author

Edinburgh Cancer Informatics Group

Published

December 5, 2025

1 Introduction

This tutorial provides a practical introduction to survival analysis using R, illustrated with a synthetic cohort of patients with metastatic breast cancer from Scottish NHS Health Boards. By the end of this tutorial, you will be able to:

  • Create and interpret survival objects
  • Fit and visualise Kaplan-Meier survival curves
  • Conduct log-rank tests to compare survival between groups
  • Fit Cox proportional hazards regression models
About the Data

The dataset used in this tutorial is entirely synthetic and was generated for educational purposes. It does not contain any real patient data but has been designed to reflect realistic patterns seen in metastatic breast cancer populations.

2 Why Survival Analysis?

Before diving into the methods, it’s worth understanding why we need specialised techniques for time-to-event data.

2.1 The Problem with Standard Methods

Imagine you’re analysing a clinical trial where patients were followed for up to 5 years. At the end of the study, some patients have died, but others are still alive. If you simply calculate the proportion who died, you’re ignoring valuable information — patients who were still alive contributed follow-up time showing they survived at least that long.

This is the problem of censoring. A patient is censored when we know they survived up to a certain point, but we don’t know what happened after that. Common reasons for censoring include: - The study ended while the patient was still alive - The patient was lost to follow-up - The patient withdrew from the study

2.2 What Happens If We Ignore Censoring?

Consider a simple example with 10 patients. Five died during follow-up, three were still alive when the study ended (censored), and two were lost to follow-up (also censored).

If we naively calculate survival as “proportion who didn’t die”, we might include the censored patients in the denominator as if they were followed for the entire study period. This overestimates survival because we’re treating patients with incomplete follow-up as if they definitely survived.

Conversely, if we calculate median survival time only among those who died, we underestimate it because we’re excluding patients who survived longer but were censored.

Survival analysis methods — particularly the Kaplan-Meier estimator — correctly account for censored observations by including each patient’s follow-up time in the analysis up until the point they were censored.

2.3 Key Concepts

Survival function S(t): The probability of surviving beyond time t. It starts at 1 (everyone alive at time 0) and decreases over time.

Hazard: The instantaneous risk of the event occurring at time t, given that the individual has survived up to that point. Think of it as the “danger” at each moment.

Censoring: When we have incomplete information about a patient’s outcome. Most commonly “right-censoring” — we know the patient survived at least until a certain time, but not what happened afterwards.

Proportional hazards: The assumption (used in Cox regression) that the ratio of hazards between groups remains constant over time. This doesn’t mean hazards are constant — they can both increase or decrease — but their ratio stays the same.

2.4 The Non-Informative Censoring Assumption

Most survival methods assume non-informative censoring — that a patient’s censoring time is unrelated to their underlying survival time. If this assumption is violated, survival estimates can be biased.

Examples of potentially informative censoring:

  • Sicker patients drop out of a study because they feel too unwell to attend appointments → survival estimates biased upward (look better than reality)
  • Patients who are doing well stop attending follow-up because they feel they don’t need it → survival estimates biased downward
  • Patients with severe side effects withdraw from a treatment trial → treatment effect estimates biased

When designing or analysing a study, carefully consider whether reasons for censoring might be related to prognosis. If informative censoring is suspected, sensitivity analyses or specialised methods may be needed.

3 Loading the Data

We begin by loading our synthetic Scottish metastatic breast cancer cohort from the Excel file.

Code
# Load the dataset
mbc_data <- read_xlsx("scottish_mbc_cohort.xlsx")

# Preview the data structure
glimpse(mbc_data)
Rows: 850
Columns: 21
$ patient_id           <chr> "MBC0001", "MBC0002", "MBC0003", "MBC0004", "MBC0…
$ health_board         <chr> "NHS Highland", "NHS Lothian", "NHS Tayside", "NH…
$ age_at_diagnosis     <dbl> 58, 80, 45, 72, 79, 87, 79, 74, 51, 74, 74, 42, 5…
$ sex                  <chr> "Female", "Female", "Female", "Female", "Female",…
$ simd_quintile        <dbl> 1, 4, 2, 4, 2, 1, 3, 1, 3, 4, 1, 1, 5, 1, 1, 4, 4…
$ ecog_ps              <dbl> 0, 1, 0, 2, 0, 0, 3, 3, 1, 0, 2, 1, 0, 1, 2, 1, 0…
$ er_status            <chr> "Negative", "Negative", "Negative", "Negative", "…
$ pr_status            <chr> "Negative", "Negative", "Negative", "Negative", "…
$ her2_status          <chr> "Negative", "Negative", "Negative", "Negative", "…
$ n_metastatic_sites   <dbl> 1, 1, 2, 1, 1, 2, 2, 3, 2, 1, 1, 4, 5, 2, 4, 4, 2…
$ bone_mets            <chr> "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "Ye…
$ liver_mets           <chr> "Yes", "No", "No", "No", "No", "No", "No", "Yes",…
$ lung_mets            <chr> "No", "No", "No", "Yes", "Yes", "No", "No", "Yes"…
$ brain_mets           <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", …
$ presentation         <chr> "De novo metastatic", "Recurrent", "Recurrent", "…
$ diagnosis_date       <dttm> 2020-03-01, 2022-11-11, 2021-07-25, 2022-05-21, …
$ first_line_treatment <chr> "Chemotherapy", "Chemotherapy", "Chemotherapy", "…
$ observed_time        <dbl> 40.8, 13.6, 9.2, 0.5, 4.8, 37.3, 13.7, 17.6, 11.1…
$ status               <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1…
$ age_group            <chr> "50-64", "75+", "<50", "65-74", "75+", "75+", "75…
$ hormone_receptor     <chr> "HR-negative", "HR-negative", "HR-negative", "HR-…

The dataset contains 850 patients diagnosed between 2018-01-05 and 2022-12-28 across all 14 Scottish NHS Health Boards.

3.1 Calculating Survival Times from Dates

In practice, clinical data often comes with dates rather than pre-calculated survival times. Here’s how to calculate follow-up time from diagnosis and last contact dates using the lubridate package.

Code
# Example: Creating survival times from dates
date_example <- tibble(
  patient_id = c("P001", "P002", "P003"),
  diagnosis_date = as.Date(c("2020-03-15", "2019-08-22", "2021-01-10")),
  last_contact_date = as.Date(c("2023-06-20", "2022-04-15", "2023-12-31")),
  status = c(1, 1, 0)  # 1 = died, 0 = alive/censored
)

# Calculate follow-up time in months
date_example <- date_example |>
  mutate(
    follow_up_months = as.numeric(
      difftime(last_contact_date, diagnosis_date, units = "days")
    ) / 30.44  # average days per month
  )

date_example |>
  kable(digits = 1)
patient_id diagnosis_date last_contact_date status follow_up_months
P001 2020-03-15 2023-06-20 1 39.2
P002 2019-08-22 2022-04-15 1 31.8
P003 2021-01-10 2023-12-31 0 35.6
Time Units

Choose time units appropriate for your disease. For aggressive cancers like metastatic breast cancer, months are sensible. For slow-growing conditions, years might be better. Whatever you choose, be consistent throughout your analysis.

4 Patient Characteristics

Before conducting survival analysis, it is essential to understand the composition of our cohort. We present descriptive statistics stratified by hormone receptor status.

4.1 Summary Table

Code
# Create summary table
mbc_data |>
  mutate(
    simd_quintile = factor(simd_quintile),
    ecog_ps = factor(ecog_ps),
    status = factor(status, labels = c("Censored", "Died"))
  ) |>
  select(
    hormone_receptor, age_at_diagnosis, age_group, sex, simd_quintile, ecog_ps,
    er_status, pr_status, her2_status,
    n_metastatic_sites, bone_mets, liver_mets, lung_mets, brain_mets,
    presentation, first_line_treatment, observed_time, status
  ) |>
  tbl_summary(
    by = hormone_receptor,
    statistic = list(
      all_continuous() ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(all_continuous() ~ 1),
    missing = "ifany"
  ) |>
  add_overall() |>
  bold_labels()
Table 1: Patient characteristics by hormone receptor status
Characteristic Overall
N = 8501
HR-negative
N = 1921
HR-positive
N = 6581
age_at_diagnosis 61.0 (54.0, 70.0) 61.0 (55.0, 69.0) 61.0 (54.0, 70.0)
age_group


    <50 123 (14%) 24 (13%) 99 (15%)
    50-64 396 (47%) 95 (49%) 301 (46%)
    65-74 210 (25%) 46 (24%) 164 (25%)
    75+ 121 (14%) 27 (14%) 94 (14%)
sex


    Female 842 (99%) 187 (97%) 655 (100%)
    Male 8 (0.9%) 5 (2.6%) 3 (0.5%)
simd_quintile


    1 204 (24%) 45 (23%) 159 (24%)
    2 208 (24%) 44 (23%) 164 (25%)
    3 155 (18%) 38 (20%) 117 (18%)
    4 153 (18%) 33 (17%) 120 (18%)
    5 130 (15%) 32 (17%) 98 (15%)
ecog_ps


    0 248 (29%) 49 (26%) 199 (30%)
    1 348 (41%) 87 (45%) 261 (40%)
    2 192 (23%) 38 (20%) 154 (23%)
    3 62 (7.3%) 18 (9.4%) 44 (6.7%)
er_status


    Negative 225 (26%) 192 (100%) 33 (5.0%)
    Positive 625 (74%) 0 (0%) 625 (95%)
pr_status


    Negative 275 (32%) 192 (100%) 83 (13%)
    Positive 575 (68%) 0 (0%) 575 (87%)
her2_status


    Negative 699 (82%) 164 (85%) 535 (81%)
    Positive 151 (18%) 28 (15%) 123 (19%)
n_metastatic_sites


    1 307 (36%) 66 (34%) 241 (37%)
    2 242 (28%) 64 (33%) 178 (27%)
    3 161 (19%) 37 (19%) 124 (19%)
    4 106 (12%) 18 (9.4%) 88 (13%)
    5 34 (4.0%) 7 (3.6%) 27 (4.1%)
bone_mets 624 (73%) 149 (78%) 475 (72%)
liver_mets 307 (36%) 69 (36%) 238 (36%)
lung_mets 260 (31%) 53 (28%) 207 (31%)
brain_mets 90 (11%) 21 (11%) 69 (10%)
presentation


    De novo metastatic 250 (29%) 62 (32%) 188 (29%)
    Recurrent 600 (71%) 130 (68%) 470 (71%)
first_line_treatment


    Chemo + Targeted 105 (12%) 26 (14%) 79 (12%)
    Chemotherapy 284 (33%) 166 (86%) 118 (18%)
    Endocrine + Targeted 339 (40%) 0 (0%) 339 (52%)
    Endocrine only 122 (14%) 0 (0%) 122 (19%)
observed_time 13.2 (5.6, 23.4) 8.6 (4.4, 17.3) 14.6 (6.5, 25.4)
status


    Censored 147 (17%) 16 (8.3%) 131 (20%)
    Died 703 (83%) 176 (92%) 527 (80%)
1 Median (Q1, Q3); n (%)

4.2 Distribution by Health Board

Code
mbc_data |>
  count(health_board) |>
  mutate(health_board = fct_reorder(health_board, n)) |>
  ggplot(aes(x = n, y = health_board, fill = n)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n), hjust = -0.2, size = 3.5) +
  scale_fill_viridis_c(option = "plasma", end = 0.85) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    x = "Number of patients",
    y = NULL,
    title = "Patient distribution by NHS Health Board"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank()
  )
Figure 1: Patient distribution across Scottish NHS Health Boards

5 Creating a Survival Object

The foundation of survival analysis in R is the Surv() function from the survival package. This function creates a special object that encodes both the time-to-event and the censoring indicator.

5.1 Understanding Censoring

In our dataset:

  • observed_time: Time from diagnosis to death or last follow-up (months)
  • status: Event indicator (1 = death occurred, 0 = censored/alive)

Censoring occurs when we do not observe the event of interest. In this study, patients are censored if they were still alive at the end of the study period.

5.2 Creating the Survival Object

Code
# Create the survival object
surv_obj <- Surv(time = mbc_data$observed_time, 
                 event = mbc_data$status)

# Examine the first 20 observations
head(surv_obj, 20)
 [1] 40.8  13.6+  9.2   0.5   4.8  37.3  13.7+ 17.6  11.1  13.5   9.5   2.3 
[13] 30.8  15.6+ 18.2  20.6   2.6   7.0  16.8  11.9 
Interpreting the Output

In the survival object:

  • Numbers without a + indicate observed events (deaths)
  • Numbers with a + indicate censored observations (patient was alive at that time)

6 Kaplan-Meier Survival Estimation

The Kaplan-Meier (KM) estimator is a non-parametric method for estimating the survival function from observed survival times.

6.1 Fitting the KM Curve

Code
# Fit Kaplan-Meier estimator for the entire cohort
km_fit <- survfit(Surv(observed_time, status) ~ 1, data = mbc_data)

# Summary of the KM fit
summary(km_fit, times = c(6, 12, 24, 36, 48, 60), extend = TRUE)
Call: survfit(formula = Surv(observed_time, status) ~ 1, data = mbc_data)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    6    626     226   0.7341  0.0152       0.7050       0.7644
   12    463     163   0.5424  0.0171       0.5099       0.5769
   24    207     184   0.3031  0.0164       0.2725       0.3371
   36     94      76   0.1790  0.0147       0.1523       0.2104
   48     27      44   0.0824  0.0124       0.0614       0.1106
   60      9       8   0.0515  0.0117       0.0330       0.0804

The output shows survival probabilities at specific time points with 95% confidence intervals.

6.2 Plotting the Overall Survival Curve

Code
ggsurvplot(
  km_fit,
  data = mbc_data,
  conf.int = FALSE,
  risk.table = TRUE,
  risk.table.col = "strata",
  risk.table.height = 0.25,
  xlab = "Time since diagnosis (months)",
  ylab = "Survival probability",
  title = "Overall Survival in Metastatic Breast Cancer",
  ggtheme = theme_minimal(),
  palette = "#2E86AB",
  break.time.by = 12,
  surv.median.line = "hv"
)
Figure 2: Overall survival in the Scottish metastatic breast cancer cohort

6.3 Median Survival Time

Code
# Extract median survival with confidence interval
surv_median <- surv_median(km_fit)

surv_median |>
  kable(
    col.names = c("Strata", "Median", "Lower 95% CI", "Upper 95% CI"),
    digits = 1,
    caption = "Median overall survival (months)"
  )
Median overall survival (months)
Strata Median Lower 95% CI Upper 95% CI
All 13.3 12.3 15.1

The median survival time is the time at which 50% of patients have experienced the event. It’s the preferred measure of “average” survival because survival times are typically skewed — a few long survivors can dramatically inflate the mean.

6.4 Estimating Survival at Specific Time Points

Often we want to report survival probability at clinically meaningful time points, such as 1-year or 2-year survival.

Code
# Get survival estimates at specific times
time_points <- c(12, 24, 36, 48, 60)  # months

surv_summary <- summary(km_fit, times = time_points, extend = TRUE)

# Create a nice table
tibble(
  `Time (months)` = surv_summary$time,
  `N at risk` = surv_summary$n.risk,
  `Survival probability` = surv_summary$surv,
  `95% CI lower` = surv_summary$lower,
  `95% CI upper` = surv_summary$upper
) |>
  mutate(
    `Survival (95% CI)` = sprintf("%.1f%% (%.1f%%-%.1f%%)", 
                                   `Survival probability` * 100,
                                   `95% CI lower` * 100,
                                   `95% CI upper` * 100)
  ) |>
  select(`Time (months)`, `N at risk`, `Survival (95% CI)`) |>
  kable()
Time (months) N at risk Survival (95% CI)
12 463 54.2% (51.0%-57.7%)
24 207 30.3% (27.3%-33.7%)
36 94 17.9% (15.2%-21.0%)
48 27 8.2% (6.1%-11.1%)
60 9 5.2% (3.3%-8.0%)
Don’t Ignore Censoring!

A common mistake is to calculate survival as simply “number who didn’t die / total number”. This ignores censoring and will give you biased estimates. Always use proper survival analysis methods like Kaplan-Meier.

The extend Argument

When using summary() on a survfit object with specific time points, you may encounter an error if a requested time exceeds the maximum follow-up in your data. Setting extend = TRUE allows the function to return the last known survival estimate for times beyond the data — essentially assuming no further events occurred.

7 Survival Curves by Subgroups

One of the most common uses of KM analysis is comparing survival between different patient groups.

7.1 By Hormone Receptor Status

Code
km_hr <- survfit(Surv(observed_time, status) ~ hormone_receptor, 
                 data = mbc_data)

ggsurvplot(
  km_hr,
  data = mbc_data,
  conf.int = FALSE,
  risk.table = TRUE,
  risk.table.height = 0.30,
  xlab = "Time since diagnosis (months)",
  ylab = "Survival probability",
  title = "Overall Survival by Hormone Receptor Status",
  legend.title = "HR Status",
  legend.labs = c("HR-negative", "HR-positive"),
  ggtheme = theme_minimal(),
  palette = c("#E74C3C", "#3498DB"),
  break.time.by = 12
)
Figure 3: Survival by hormone receptor status

7.2 By ECOG Performance Status

Code
km_ecog <- survfit(Surv(observed_time, status) ~ ecog_ps, 
                   data = mbc_data)

ggsurvplot(
  km_ecog,
  data = mbc_data,
  conf.int = FALSE,
  risk.table = TRUE,
  risk.table.height = 0.35,
  xlab = "Time since diagnosis (months)",
  ylab = "Survival probability",
  title = "Overall Survival by ECOG Performance Status",
  legend.title = "ECOG PS",
  ggtheme = theme_minimal(),
  palette = "jco",
  break.time.by = 12
)
Figure 4: Survival by ECOG performance status

7.3 By Number of Metastatic Sites

Code
# Create grouped metastatic sites variable
mbc_data <- mbc_data |>
  mutate(
    mets_group = case_when(
      n_metastatic_sites == 1 ~ "1 site",
      n_metastatic_sites == 2 ~ "2 sites",
      n_metastatic_sites >= 3 ~ "3+ sites"
    ) |> factor(levels = c("1 site", "2 sites", "3+ sites"))
  )

km_mets <- survfit(Surv(observed_time, status) ~ mets_group, 
                   data = mbc_data)

ggsurvplot(
  km_mets,
  data = mbc_data,
  conf.int = FALSE,
  risk.table = TRUE,
  risk.table.height = 0.30,
  xlab = "Time since diagnosis (months)",
  ylab = "Survival probability",
  title = "Overall Survival by Number of Metastatic Sites",
  legend.title = "Metastatic Sites",
  ggtheme = theme_minimal(),
  palette = c("#27AE60", "#F39C12", "#C0392B"),
  break.time.by = 12
)
Figure 5: Survival by number of metastatic sites

8 Hazard Functions

While survival curves show the probability of surviving beyond time t, hazard functions show the instantaneous rate of the event occurring at time t, given survival up to that point.

8.1 Cumulative Hazard Function

The cumulative hazard function, \(H(t)\), represents the accumulated risk up to time t.

Code
ggsurvplot(
  km_hr,
  data = mbc_data,
  fun = "cumhaz",
  conf.int = FALSE,
  xlab = "Time since diagnosis (months)",
  ylab = "Cumulative hazard",
  title = "Cumulative Hazard by Hormone Receptor Status",
  legend.title = "HR Status",
  legend.labs = c("HR-negative", "HR-positive"),
  ggtheme = theme_minimal(),
  palette = c("#E74C3C", "#3498DB"),
  break.time.by = 12
)
Figure 6: Cumulative hazard function by hormone receptor status

8.2 Smoothed Hazard Function

To visualise the instantaneous hazard rate, we can use kernel smoothing methods.

Code
# Using muhaz package for smoothed hazard estimation
library(muhaz)

# Calculate smoothed hazard for each HR group
hr_pos_data <- mbc_data |> filter(hormone_receptor == "HR-positive")
hr_neg_data <- mbc_data |> filter(hormone_receptor == "HR-negative")

haz_pos <- muhaz(hr_pos_data$observed_time, hr_pos_data$status, 
                 min.time = 1, max.time = 60)
haz_neg <- muhaz(hr_neg_data$observed_time, hr_neg_data$status, 
                 min.time = 1, max.time = 60)

# Combine into a data frame for plotting
hazard_df <- bind_rows(
  tibble(
    time = haz_pos$est.grid,
    hazard = haz_pos$haz.est,
    group = "HR-positive"
  ),
  tibble(
    time = haz_neg$est.grid,
    hazard = haz_neg$haz.est,
    group = "HR-negative"
  )
)

ggplot(hazard_df, aes(x = time, y = hazard, colour = group)) +
  geom_line(linewidth = 1.2) +
  scale_colour_manual(
    values = c("HR-negative" = "#E74C3C", "HR-positive" = "#3498DB"),
    name = "HR Status"
  ) +
  labs(
    x = "Time since diagnosis (months)",
    y = "Hazard rate",
    title = "Smoothed Hazard Function by Hormone Receptor Status"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Figure 7: Smoothed hazard function estimates
Interpreting the Hazard Function

The hazard rate represents the instantaneous risk of death at any given time point. A higher hazard indicates greater risk.

9 Log-Rank Test

The log-rank test is the most widely used method to formally compare survival distributions between two or more groups. It tests the null hypothesis that there is no difference in survival between groups.

9.1 How the Log-Rank Test Works

The log-rank test is essentially a chi-squared test applied to survival data. At each time point where an event occurs, the test compares the observed number of events in each group to the expected number of events if there were truly no difference between groups.

The expected number of events in each group is calculated based on the proportion of patients still at risk in that group. If Group A has 60% of the remaining patients at a given time point, we would expect 60% of the events at that time to occur in Group A.

The test statistic is:

\[\chi^2 = \frac{(O - E)^2}{Var(O-E)}\]

Where O is the total observed events and E is the total expected events in a group. Under the null hypothesis of no difference, this follows a chi-squared distribution with degrees of freedom equal to the number of groups minus one.

Key Properties of the Log-Rank Test
  • It is a non-parametric test — it makes no assumptions about the shape of the survival curves
  • It compares the entire survival experience, not just survival at a single time point
  • It gives equal weight to all time points (early and late events contribute equally)
  • It is most powerful when the hazard ratio is constant over time (proportional hazards)
  • It does not provide an estimate of the size of the difference — only whether a difference exists

9.2 Comparing Hormone Receptor Groups

Code
# Log-rank test for hormone receptor status
logrank_hr <- survdiff(Surv(observed_time, status) ~ hormone_receptor, 
                       data = mbc_data)

logrank_hr
Call:
survdiff(formula = Surv(observed_time, status) ~ hormone_receptor, 
    data = mbc_data)

                               N Observed Expected (O-E)^2/E (O-E)^2/V
hormone_receptor=HR-negative 192      176      119     27.33      33.5
hormone_receptor=HR-positive 658      527      584      5.57      33.5

 Chisq= 33.5  on 1 degrees of freedom, p= 7e-09 
Code
# Extract p-value
p_val <- 1 - pchisq(logrank_hr$chisq, length(logrank_hr$n) - 1)

cat("Log-rank test Chi-squared:", round(logrank_hr$chisq, 2), "\n")
Log-rank test Chi-squared: 33.51 
Code
cat("Degrees of freedom:", length(logrank_hr$n) - 1, "\n")
Degrees of freedom: 1 
Code
cat("P-value:", format.pval(p_val, digits = 3), "\n")
P-value: 7.09e-09 
Reading the Output

The survdiff() output shows:

  • N: Number of patients in each group
  • Observed: Total events (deaths) observed in each group
  • Expected: Events expected if there were no difference between groups
  • (O-E)^2/E and (O-E)^2/V: Components of the chi-squared statistic

If the observed and expected values are similar, the groups have similar survival. Large differences between observed and expected suggest the groups differ. A group with fewer observed than expected events has better survival than average; a group with more observed than expected has worse survival.

9.3 Comparing ECOG Performance Status Groups

When comparing more than two groups, the log-rank test provides an overall (omnibus) test of whether any differences exist. A significant result tells you that at least one group differs from the others, but not which specific groups differ.

Code
# Log-rank test for ECOG PS
logrank_ecog <- survdiff(Surv(observed_time, status) ~ ecog_ps, 
                         data = mbc_data)

logrank_ecog
Call:
survdiff(formula = Surv(observed_time, status) ~ ecog_ps, data = mbc_data)

            N Observed Expected (O-E)^2/E (O-E)^2/V
ecog_ps=0 248      190    255.6     16.85     27.08
ecog_ps=1 348      278    303.6      2.16      3.85
ecog_ps=2 192      176    113.6     34.26     42.04
ecog_ps=3  62       59     30.2     27.58     29.20

 Chisq= 84  on 3 degrees of freedom, p= <2e-16 

9.4 Stratified Log-Rank Test

Sometimes we want to compare groups while adjusting for a potential confounding variable. The stratified log-rank test performs the comparison within strata of the adjusting variable, then combines the results. This is analogous to the Cochran-Mantel-Haenszel test for contingency tables.

Code
# Compare HR status, stratified by ECOG PS
logrank_strat <- survdiff(
  Surv(observed_time, status) ~ hormone_receptor + strata(ecog_ps), 
  data = mbc_data
)

logrank_strat
Call:
survdiff(formula = Surv(observed_time, status) ~ hormone_receptor + 
    strata(ecog_ps), data = mbc_data)

                               N Observed Expected (O-E)^2/E (O-E)^2/V
hormone_receptor=HR-negative 192      176      114     33.14        41
hormone_receptor=HR-positive 658      527      589      6.44        41

 Chisq= 41  on 1 degrees of freedom, p= 2e-10 

This tests whether hormone receptor status affects survival after accounting for differences in ECOG performance status across the groups.

10 Cox Proportional Hazards Regression

The Cox proportional hazards model is the cornerstone of survival analysis, allowing us to examine the relationship between multiple covariates and survival time simultaneously. It was introduced by Sir David Cox in 1972 and remains the most widely used regression method for time-to-event data.

10.1 Why Use Cox Regression?

The log-rank test tells us whether groups differ, but not by how much. It also cannot adjust for multiple confounders simultaneously. Cox regression addresses both limitations:

  • It provides hazard ratios that quantify the size of effects
  • It can include multiple covariates (both categorical and continuous)
  • It can adjust for confounders in a single model
  • It handles censored data appropriately

10.2 The Cox Model

The Cox model specifies the hazard function as:

\[h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p)\]

Where:

  • \(h(t|X)\) is the hazard at time \(t\) for a patient with covariates \(X\)
  • \(h_0(t)\) is the baseline hazard — the hazard when all covariates equal zero
  • \(\beta_1, \beta_2, ..., \beta_p\) are the regression coefficients
  • \(\exp(\beta)\) gives the hazard ratio for a one-unit increase in the covariate
Why “Semi-Parametric”?

The Cox model is called semi-parametric because:

  • The covariate effects (\(\beta\) coefficients) are estimated parametrically
  • The baseline hazard \(h_0(t)\) is left completely unspecified — no distributional assumption is made

This flexibility is a major advantage: the model works regardless of whether the underlying survival distribution is exponential, Weibull, or any other shape. The trade-off is that we cannot directly estimate survival probabilities without additional assumptions (though we can still compare groups via hazard ratios).

10.3 The Proportional Hazards Assumption

The model assumes that hazard ratios are constant over time. If patient A has twice the hazard of patient B at 1 month, they should still have twice the hazard at 12 months, 24 months, and so on.

Mathematically, for two patients with covariate values \(X_A\) and \(X_B\):

\[\frac{h(t|X_A)}{h(t|X_B)} = \frac{h_0(t) \exp(\beta X_A)}{h_0(t) \exp(\beta X_B)} = \exp(\beta(X_A - X_B))\]

The baseline hazard \(h_0(t)\) cancels out, leaving a ratio that does not depend on time.

10.4 Univariable Cox Models

First, let’s examine each potential predictor individually. Univariable analysis helps identify candidate variables for a multivariable model and provides unadjusted effect estimates.

Code
# Prepare data for modelling
mbc_model <- mbc_data |>
  mutate(
    ecog_ps = factor(ecog_ps),
    simd_quintile = factor(simd_quintile),
    age_group = factor(age_group, levels = c("<50", "50-64", "65-74", "75+"))
  )

# Create univariable Cox regression table
mbc_model |>
  select(
    observed_time, status,
    age_at_diagnosis, age_group, ecog_ps, hormone_receptor,
    her2_status, n_metastatic_sites, mets_group,
    liver_mets, brain_mets, presentation, simd_quintile
  ) |>
  tbl_uvregression(
    method = coxph,
    y = Surv(observed_time, status),
    exponentiate = TRUE,
    hide_n = TRUE
  ) |>
  bold_p() |>
  bold_labels()
Table 2: Univariable Cox regression analysis
Characteristic HR 95% CI p-value
age_at_diagnosis 1.01 1.01, 1.02 <0.001
age_group


    <50
    50-64 1.24 0.99, 1.56 0.066
    65-74 1.63 1.27, 2.10 <0.001
    75+ 1.67 1.26, 2.22 <0.001
ecog_ps


    0
    1 1.24 1.03, 1.49 0.022
    2 2.16 1.75, 2.66 <0.001
    3 2.73 2.04, 3.67 <0.001
hormone_receptor


    HR-negative
    HR-positive 0.60 0.51, 0.72 <0.001
her2_status


    Negative
    Positive 0.83 0.68, 1.01 0.062
n_metastatic_sites 1.20 1.13, 1.28 <0.001
mets_group


    1 site
    2 sites 1.14 0.94, 1.38 0.2
    3+ sites 1.49 1.25, 1.78 <0.001
liver_mets


    No
    Yes 1.08 0.93, 1.26 0.3
brain_mets


    No
    Yes 1.12 0.88, 1.42 0.4
presentation


    De novo metastatic
    Recurrent 1.07 0.91, 1.26 0.4
simd_quintile


    1
    2 1.02 0.82, 1.27 0.8
    3 1.11 0.88, 1.40 0.4
    4 1.20 0.95, 1.51 0.12
    5 1.06 0.83, 1.35 0.7
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

10.5 Multivariable Cox Model

Now we fit a multivariable model including clinically relevant predictors.

Code
# Fit multivariable Cox model
cox_multi <- coxph(
  Surv(observed_time, status) ~ 
    age_at_diagnosis + 
    ecog_ps + 
    hormone_receptor + 
    her2_status + 
    n_metastatic_sites + 
    liver_mets + 
    brain_mets,
  data = mbc_model
)

summary(cox_multi)
Call:
coxph(formula = Surv(observed_time, status) ~ age_at_diagnosis + 
    ecog_ps + hormone_receptor + her2_status + n_metastatic_sites + 
    liver_mets + brain_mets, data = mbc_model)

  n= 850, number of events= 703 

                                 coef exp(coef)  se(coef)      z Pr(>|z|)    
age_at_diagnosis             0.015183  1.015299  0.003205  4.737 2.17e-06 ***
ecog_ps1                     0.281729  1.325420  0.096426  2.922  0.00348 ** 
ecog_ps2                     0.926369  2.525322  0.109050  8.495  < 2e-16 ***
ecog_ps3                     1.039383  2.827472  0.152049  6.836 8.15e-12 ***
hormone_receptorHR-positive -0.584592  0.557333  0.089281 -6.548 5.84e-11 ***
her2_statusPositive         -0.125108  0.882401  0.100482 -1.245  0.21310    
n_metastatic_sites           0.222397  1.249067  0.034330  6.478 9.29e-11 ***
liver_metsYes                0.104616  1.110285  0.078715  1.329  0.18383    
brain_metsYes                0.139167  1.149316  0.123112  1.130  0.25830    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                            exp(coef) exp(-coef) lower .95 upper .95
age_at_diagnosis               1.0153     0.9849    1.0089    1.0217
ecog_ps1                       1.3254     0.7545    1.0972    1.6011
ecog_ps2                       2.5253     0.3960    2.0394    3.1271
ecog_ps3                       2.8275     0.3537    2.0988    3.8091
hormone_receptorHR-positive    0.5573     1.7943    0.4679    0.6639
her2_statusPositive            0.8824     1.1333    0.7247    1.0745
n_metastatic_sites             1.2491     0.8006    1.1678    1.3360
liver_metsYes                  1.1103     0.9007    0.9516    1.2955
brain_metsYes                  1.1493     0.8701    0.9029    1.4630

Concordance= 0.644  (se = 0.011 )
Likelihood ratio test= 177.8  on 9 df,   p=<2e-16
Wald test            = 184  on 9 df,   p=<2e-16
Score (logrank) test = 190.1  on 9 df,   p=<2e-16
Reading the Cox Model Output

The summary() output provides:

  • coef: The log hazard ratio (\(\beta\)) — positive values indicate increased hazard
  • exp(coef): The hazard ratio — the multiplicative effect on hazard
  • se(coef): Standard error of the coefficient
  • z: Wald test statistic (coef/se)
  • Pr(>|z|): P-value for testing whether the coefficient differs from zero
  • lower .95 / upper .95: 95% confidence interval for the hazard ratio
  • Concordance: A measure of model discrimination (0.5 = no better than chance, 1.0 = perfect)
  • Likelihood ratio test: Overall test of whether the model is better than a null model

10.6 Understanding Hazard Ratios

The hazard ratio (HR) is the key output from Cox regression. It represents the relative rate at which events occur in one group compared to another, at any given point in time.

Interpreting HR values:

  • HR = 1: No difference between groups
  • HR > 1: Increased hazard (worse survival) compared to reference
  • HR < 1: Decreased hazard (better survival) compared to reference

For categorical variables: The HR compares each category to a reference category. For example, if HR-positive vs HR-negative has HR = 0.6, then HR-positive patients have 40% lower hazard (die at 0.6 times the rate) compared to HR-negative patients.

For continuous variables: The HR represents the change in hazard for each one-unit increase. For example, if age has HR = 1.02, then each additional year of age increases the hazard by 2%.

HR is Not a Risk or Probability

The hazard ratio is often misinterpreted as a relative risk. While they can be similar in some situations, they measure different things:

  • Relative risk compares cumulative probabilities (e.g., “twice as likely to die by 5 years”)
  • Hazard ratio compares instantaneous rates (e.g., “dying at twice the rate at any moment”)

A HR of 2 does not mean twice the probability of the event. Be precise in your interpretation, and avoid phrases like “twice as likely” when reporting hazard ratios.

10.7 Formatted Multivariable Results

Code
# Create formatted Cox regression table
cox_multi |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      age_at_diagnosis ~ "Age at diagnosis (years)",
      ecog_ps ~ "ECOG Performance Status",
      hormone_receptor ~ "Hormone receptor status",
      her2_status ~ "HER2 status",
      n_metastatic_sites ~ "Number of metastatic sites",
      liver_mets ~ "Liver metastases",
      brain_mets ~ "Brain metastases"
    )
  ) |>
  bold_p() |>
  bold_labels()
Table 3: Multivariable Cox regression model
Characteristic HR 95% CI p-value
Age at diagnosis (years) 1.02 1.01, 1.02 <0.001
ECOG Performance Status


    0
    1 1.33 1.10, 1.60 0.003
    2 2.53 2.04, 3.13 <0.001
    3 2.83 2.10, 3.81 <0.001
Hormone receptor status


    HR-negative
    HR-positive 0.56 0.47, 0.66 <0.001
HER2 status


    Negative
    Positive 0.88 0.72, 1.07 0.2
Number of metastatic sites 1.25 1.17, 1.34 <0.001
Liver metastases


    No
    Yes 1.11 0.95, 1.30 0.2
Brain metastases


    No
    Yes 1.15 0.90, 1.46 0.3
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

10.8 Forest Plot of Hazard Ratios

Code
# Create forest plot data from model
forest_data <- tidy(cox_multi, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(
    # Clean up term names for display
    term_label = case_when(
      term == "age_at_diagnosis" ~ "Age (per year)",
      term == "ecog_ps1" ~ "ECOG PS 1 vs 0",
      term == "ecog_ps2" ~ "ECOG PS 2 vs 0",
      term == "ecog_ps3" ~ "ECOG PS 3 vs 0",
      term == "hormone_receptorHR-positive" ~ "HR-positive vs HR-negative",
      term == "her2_statusPositive" ~ "HER2-positive vs negative",
      term == "n_metastatic_sites" ~ "Metastatic sites (per site)",
      term == "liver_metsYes" ~ "Liver metastases",
      term == "brain_metsYes" ~ "Brain metastases",
      TRUE ~ term
    ),
    # Create HR label
    hr_label = sprintf("%.2f (%.2f-%.2f)", estimate, conf.low, conf.high),
    # Significance indicator
    significant = p.value < 0.05
  ) |>
  # Reverse order for plotting (top to bottom)
  mutate(term_label = fct_inorder(term_label) |> fct_rev())

# Create the forest plot
ggplot(forest_data, aes(x = estimate, y = term_label)) +
  # Reference line at HR = 1

geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
  # Confidence intervals
  geom_errorbarh(
    aes(xmin = conf.low, xmax = conf.high),
    height = 0.2,
    colour = "grey30"
  ) +
  # Point estimates
  geom_point(
    aes(colour = significant),
    size = 3
  ) +
  # HR labels on the right
  geom_text(
    aes(x = max(conf.high) * 1.3, label = hr_label),
    hjust = 0,
    size = 3
  ) +
  # Colour scale
  scale_colour_manual(
    values = c("TRUE" = "#E74C3C", "FALSE" = "#3498DB"),
    labels = c("TRUE" = "p < 0.05", "FALSE" = "p ≥ 0.05"),
    name = "Significance"
  ) +
  # Log scale for x-axis
  scale_x_log10(
    breaks = c(0.25, 0.5, 1, 2, 4),
    limits = c(0.2, max(forest_data$conf.high) * 2)
  ) +
  labs(
    x = "Hazard Ratio (log scale)",
    y = NULL,
    title = "Multivariable Cox Regression: Hazard Ratios",
    subtitle = "Adjusted for all variables shown"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
Figure 8: Forest plot of hazard ratios from the multivariable Cox model

10.9 Model Fit Statistics

We can assess how well the model fits the data using several metrics.

Code
# Model fit statistics
glance(cox_multi) |>
  select(n, nevent, concordance, std.error.concordance, logLik, AIC, BIC) |>
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") |>
  mutate(
    Statistic = case_when(
      Statistic == "n" ~ "Number of observations",
      Statistic == "nevent" ~ "Number of events",
      Statistic == "concordance" ~ "Concordance (C-statistic)",
      Statistic == "std.error.concordance" ~ "SE of concordance",
      Statistic == "logLik" ~ "Log-likelihood",
      Statistic == "AIC" ~ "AIC",
      Statistic == "BIC" ~ "BIC"
    ),
    Value = round(Value, 3)
  ) |>
  kable(caption = "Model fit statistics")
Model fit statistics
Statistic Value
Number of observations 850.000
Number of events 703.000
Concordance (C-statistic) 0.644
SE of concordance 0.011
Log-likelihood -4080.457
AIC 8178.915
BIC 8219.913
Interpreting Model Fit Statistics
  • Concordance (C-statistic): Measures discriminative ability — the probability that, for a random pair of patients where one dies first, the model correctly predicts which one. Values range from 0.5 (no better than chance) to 1.0 (perfect discrimination). In clinical models, values of 0.7–0.8 are typically considered acceptable.

  • AIC/BIC: Information criteria for comparing models. Lower values indicate better fit, penalised for model complexity. Useful for comparing alternative models fitted to the same data.

  • Log-likelihood: The basis for likelihood ratio tests comparing nested models.

10.10 Predicting Survival for Specific Patients

One practical application of Cox models is predicting survival curves for patients with specific characteristics. Although the Cox model does not directly estimate the baseline hazard, the survfit() function can estimate it from the data and combine it with covariate effects to produce predicted survival curves.

Code
# Create profiles for two hypothetical patients
new_patients <- tibble(
  patient_type = c("Lower risk", "Higher risk"),
  age_at_diagnosis = c(55, 75),
  ecog_ps = factor(c(0, 2), levels = levels(mbc_model$ecog_ps)),
  hormone_receptor = c("HR-positive", "HR-negative"),
  her2_status = c("Negative", "Negative"),
  n_metastatic_sites = c(1, 3),
  liver_mets = c("No", "Yes"),
  brain_mets = c("No", "No")
)

new_patients |>
  kable(caption = "Hypothetical patient profiles")
Hypothetical patient profiles
patient_type age_at_diagnosis ecog_ps hormone_receptor her2_status n_metastatic_sites liver_mets brain_mets
Lower risk 55 0 HR-positive Negative 1 No No
Higher risk 75 2 HR-negative Negative 3 Yes No
Code
# Generate predicted survival curves
pred_surv <- survfit(cox_multi, newdata = new_patients)

# Plot using ggsurvplot
ggsurvplot(
  pred_surv,
  data = new_patients,
  conf.int = TRUE,
  censor = FALSE,
  xlab = "Time since diagnosis (months)",
  ylab = "Survival probability",
  legend.labs = c("Lower risk patient", "Higher risk patient"),
  legend.title = "Patient profile",
  palette = c("#27AE60", "#E74C3C"),
  ggtheme = theme_minimal()
)
Figure 9: Predicted survival curves for two hypothetical patient profiles
How Predictions Work

The Cox model doesn’t directly estimate the survival function — it estimates hazard ratios. To predict survival, R first estimates a baseline survival function (for a patient with all covariates at reference levels), then adjusts this based on the covariate values using the estimated coefficients.

11 Practice Exercises

The following exercises use datasets built into R’s survival package. Try these to consolidate your learning.

11.1 Exercise 1: The Lung Cancer Dataset

The lung dataset contains survival data from patients with advanced lung cancer. Load it and explore:

Code
# Load the lung dataset
data(lung)

# Examine the structure
glimpse(lung)
Rows: 228
Columns: 10
$ inst      <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…
$ time      <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…
$ status    <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ age       <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …
$ sex       <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …
$ ph.ecog   <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…
$ ph.karno  <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…
$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…
$ meal.cal  <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…
$ wt.loss   <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …

Key variables:

  • time: Survival time in days
  • status: 1 = censored, 2 = dead (note: non-standard coding!)
  • sex: 1 = Male, 2 = Female
  • ph.ecog: ECOG performance score (0 = good, 5 = dead)

11.1.1 Task 1a: Recode the status variable

The status variable uses non-standard coding. Recode it so that 0 = censored and 1 = event (death).

Code
lung <- lung |>
  mutate(status = status - 1)  # Convert 1,2 to 0,1

# Check the recoding
table(lung$status)

  0   1 
 63 165 

11.1.2 Task 1b: Create a Kaplan-Meier curve by sex

Fit a survival curve comparing males and females, and create a plot.

Code
# Fit the model
lung_sex <- survfit(Surv(time, status) ~ sex, data = lung)

# Plot
ggsurvplot(
  lung_sex,
  data = lung,
  risk.table = TRUE,
  xlab = "Time (days)",
  ylab = "Survival probability",
  legend.labs = c("Male", "Female"),
  title = "Overall Survival by Sex"
)

Survival by sex in lung cancer patients

11.1.3 Task 1c: Test for differences and fit a Cox model

Conduct a log-rank test and fit a Cox model for sex.

Code
# Log-rank test
survdiff(Surv(time, status) ~ sex, data = lung)
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138      112     91.6      4.55      10.3
sex=2  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.001 
Code
# Cox model
cox_lung <- coxph(Surv(time, status) ~ sex, data = lung)
summary(cox_lung)
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

  n= 228, number of events= 165 

       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.021 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001
Wald test            = 10.09  on 1 df,   p=0.001
Score (logrank) test = 10.33  on 1 df,   p=0.001

Interpretation: Females (sex = 2) have significantly better survival than males. The hazard ratio of approximately 0.59 means females die at about 59% the rate of males, or equivalently, males have about 1.7 times higher hazard of death.

11.2 Exercise 2: The Colon Cancer Dataset

The colon dataset contains data from a trial of adjuvant chemotherapy for colon cancer. It has two rows per patient: one for recurrence and one for death.

Code
data(colon)

# Filter to death events only
colon_death <- colon |>
  filter(etype == 2)  # etype 2 = death

glimpse(colon_death)
Rows: 929
Columns: 16
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ study    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ rx       <fct> Lev+5FU, Lev+5FU, Obs, Lev+5FU, Obs, Lev+5FU, Lev, Obs, Lev, …
$ sex      <dbl> 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1…
$ age      <dbl> 43, 63, 71, 66, 69, 57, 77, 54, 46, 68, 47, 52, 64, 68, 46, 6…
$ obstruct <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1…
$ perfor   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ adhere   <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0…
$ nodes    <dbl> 5, 1, 7, 6, 22, 9, 5, 1, 2, 1, 1, 2, 1, 3, 4, 1, 6, 1, 1, 1, …
$ status   <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1…
$ differ   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ extent   <dbl> 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3…
$ surg     <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1…
$ node4    <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ time     <dbl> 1521, 3087, 963, 293, 659, 1767, 420, 3192, 3173, 3308, 2908,…
$ etype    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…

Key variables:

  • time: Days until event or censoring
  • status: 0 = censored, 1 = event
  • rx: Treatment (Obs = observation, Lev = levamisole, Lev+5FU = levamisole + 5-fluorouracil)
  • nodes: Number of positive lymph nodes

11.2.1 Task 2a: Compare survival by treatment

Create KM curves comparing the three treatment groups.

Code
colon_rx <- survfit(Surv(time, status) ~ rx, data = colon_death)

ggsurvplot(
  colon_rx,
  data = colon_death,
  risk.table = TRUE,
  xlab = "Time (days)",
  title = "Overall Survival by Treatment"
)

Survival by treatment in colon cancer
Code
# Log-rank test
survdiff(Surv(time, status) ~ rx, data = colon_death)
Call:
survdiff(formula = Surv(time, status) ~ rx, data = colon_death)

             N Observed Expected (O-E)^2/E (O-E)^2/V
rx=Obs     315      168      148      2.58      3.85
rx=Lev     310      161      146      1.52      2.25
rx=Lev+5FU 304      123      157      7.55     11.62

 Chisq= 11.7  on 2 degrees of freedom, p= 0.003 

11.2.2 Task 2b: Fit a multivariable Cox model

Fit a Cox model including treatment (rx), age, sex, and number of positive nodes.

Code
cox_colon <- coxph(
  Surv(time, status) ~ rx + age + sex + nodes,
  data = colon_death
)

summary(cox_colon)
Call:
coxph(formula = Surv(time, status) ~ rx + age + sex + nodes, 
    data = colon_death)

  n= 911, number of events= 441 
   (18 observations deleted due to missingness)

               coef exp(coef)  se(coef)      z Pr(>|z|)    
rxLev     -0.080072  0.923049  0.111613 -0.717  0.47312    
rxLev+5FU -0.402527  0.668628  0.120539 -3.339  0.00084 ***
age        0.005333  1.005347  0.004045  1.318  0.18739    
sex       -0.028257  0.972138  0.095728 -0.295  0.76786    
nodes      0.092755  1.097193  0.008871 10.456  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
rxLev        0.9230     1.0834    0.7417    1.1488
rxLev+5FU    0.6686     1.4956    0.5279    0.8468
age          1.0053     0.9947    0.9974    1.0134
sex          0.9721     1.0287    0.8058    1.1728
nodes        1.0972     0.9114    1.0783    1.1164

Concordance= 0.638  (se = 0.014 )
Likelihood ratio test= 87.79  on 5 df,   p=<2e-16
Wald test            = 123  on 5 df,   p=<2e-16
Score (logrank) test = 125  on 5 df,   p=<2e-16

Interpretation:

  • Levamisole alone shows no significant benefit over observation
  • Levamisole + 5FU significantly improves survival (HR ≈ 0.64)
  • Each additional positive lymph node increases hazard by about 9%
  • Age and sex are not significantly associated with survival after adjusting for treatment and nodes

11.3 Exercise 3: The AML Dataset

The aml dataset is a small dataset comparing maintenance chemotherapy for acute myelogenous leukaemia.

Code
data(aml)
glimpse(aml)
Rows: 23
Columns: 3
$ time   <dbl> 9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161, 5, 5, 8, 8, 12, 16,…
$ status <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
$ x      <fct> Maintained, Maintained, Maintained, Maintained, Maintained, Mai…

Variables:

  • time: Survival time in weeks
  • status: 0 = censored, 1 = relapsed
  • x: Treatment group (Maintained or Nonmaintained)

11.3.1 Task 3a: Calculate median survival by group

Estimate median survival time for each treatment group.

Code
aml_fit <- survfit(Surv(time, status) ~ x, data = aml)

# Median survival
print(aml_fit)
Call: survfit(formula = Surv(time, status) ~ x, data = aml)

                 n events median 0.95LCL 0.95UCL
x=Maintained    11      7     31      18      NA
x=Nonmaintained 12     11     23       8      NA

11.3.2 Task 3b: Estimate survival probability at 30 weeks

What proportion of patients in each group survived beyond 30 weeks?

Code
# Check the range of times in the data
range(aml$time)
[1]   5 161
Code
# Get survival estimates at 30 weeks
# Use extend = TRUE to handle times beyond observed data
summary(aml_fit, times = 30, extend = TRUE)
Call: survfit(formula = Surv(time, status) ~ x, data = aml)

                x=Maintained 
        time       n.risk      n.event     survival      std.err lower 95% CI 
      30.000        5.000        4.000        0.614        0.153        0.377 
upper 95% CI 
       0.999 

                x=Nonmaintained 
        time       n.risk      n.event     survival      std.err lower 95% CI 
      30.000        4.000        8.000        0.292        0.139        0.115 
upper 95% CI 
       0.741 

Note: The extend = TRUE argument allows estimation at time points beyond the last observed event for a group by carrying forward the last survival estimate.

11.4 Exercise 4: Cox Modelling with the Veteran Dataset

The veteran dataset describes survival of patients with lung cancer in a Veterans Administration trial.

Code
data(veteran)
glimpse(veteran)
Rows: 137
Columns: 8
$ trt      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ celltype <fct> squamous, squamous, squamous, squamous, squamous, squamous, s…
$ time     <dbl> 72, 411, 228, 126, 118, 10, 82, 110, 314, 100, 42, 8, 144, 25…
$ status   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0…
$ karno    <dbl> 60, 70, 60, 60, 70, 20, 40, 80, 50, 70, 60, 40, 30, 80, 70, 6…
$ diagtime <dbl> 7, 5, 3, 9, 11, 5, 10, 29, 18, 6, 4, 58, 4, 9, 11, 3, 9, 2, 4…
$ age      <dbl> 69, 64, 38, 63, 65, 49, 69, 68, 43, 70, 81, 63, 63, 52, 48, 6…
$ prior    <dbl> 0, 10, 0, 10, 10, 0, 10, 0, 0, 0, 0, 10, 0, 10, 10, 0, 0, 0, …

Key variables:

  • time: Survival time in days
  • status: 0 = censored, 1 = dead
  • trt: Treatment (1 = standard, 2 = test)
  • karno: Karnofsky performance score (0-100, higher is better)
  • age: Age in years

11.4.1 Task 4a: Fit a Cox model

Fit a Cox model with treatment, Karnofsky score, and age as predictors.

Code
cox_veteran <- coxph(Surv(time, status) ~ trt + karno + age, data = veteran)
summary(cox_veteran)
Call:
coxph(formula = Surv(time, status) ~ trt + karno + age, data = veteran)

  n= 137, number of events= 128 

           coef exp(coef)  se(coef)      z Pr(>|z|)    
trt    0.189546  1.208701  0.185531  1.022    0.307    
karno -0.034444  0.966143  0.005232 -6.583 4.62e-11 ***
age   -0.003864  0.996143  0.009187 -0.421    0.674    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      exp(coef) exp(-coef) lower .95 upper .95
trt      1.2087     0.8273    0.8402    1.7388
karno    0.9661     1.0350    0.9563    0.9761
age      0.9961     1.0039    0.9784    1.0142

Concordance= 0.712  (se = 0.022 )
Likelihood ratio test= 43.14  on 3 df,   p=2e-09
Wald test            = 44.52  on 3 df,   p=1e-09
Score (logrank) test = 46.84  on 3 df,   p=4e-10

Interpretation:

  • Treatment (test vs standard) shows no significant effect on survival
  • Each 1-point increase in Karnofsky score reduces hazard by about 3% (HR ≈ 0.97), and this is highly significant
  • Age shows no significant association with survival after adjusting for treatment and Karnofsky score

11.5 Exercise 5: Interpretation Questions

These questions test your understanding of survival analysis concepts. Think about each one before revealing the answer.

11.5.1 Question 5a

A Cox model gives a hazard ratio of 2.5 for a binary exposure variable. What does this mean?

The exposed group experiences the event at 2.5 times the rate of the unexposed group at any given time point. This indicates substantially worse survival in the exposed group.

11.5.2 Question 5b

Why can’t we simply calculate “proportion surviving” to estimate 2-year survival?

A naive calculation would either:

  1. Treat censored patients as if they survived the full period (overestimating survival), or
  2. Exclude them entirely (potentially biasing in either direction)

The Kaplan-Meier method correctly accounts for censoring by including each patient’s contribution up to their censoring time, then removing them from the risk set.

12 Summary

This tutorial has covered the fundamental techniques of survival analysis in R:

  1. Why survival analysis: Understanding censoring, non-informative censoring assumptions, and why standard methods don’t work for time-to-event data
  2. Survival objects: Created using Surv() to encode time-to-event data with censoring
  3. Kaplan-Meier estimation: Non-parametric estimation of survival curves using survfit()
  4. Survival estimates: Calculating median survival and survival probabilities at specific time points
  5. Hazard functions: Both cumulative and smoothed instantaneous hazard rates
  6. Log-rank tests: Formal comparison of survival between groups using survdiff(), including weighted alternatives
  7. Cox regression: Semi-parametric modelling of hazard ratios using coxph(), including interpretation of HRs
  8. Survival predictions: Generating predicted survival curves for specific patient profiles
Further Reading

Introductory tutorials:

Key papers:

  • Clark TG et al. (2003) Survival analysis part I-IV. British Journal of Cancer — four-part series covering basic to advanced concepts

Books:

  • Therneau TM & Grambsch PM (2000) Modeling Survival Data: Extending the Cox Model. Springer.
  • Kleinbaum DG & Klein M (2012) Survival Analysis: A Self-Learning Text. 3rd ed. Springer.

R package documentation:

  • vignette("survival") — comprehensive introduction to the survival package
  • vignette("timedep") — time-dependent covariates and coefficients

13 Session Information

Code
sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United Kingdom.utf8 
[2] LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] muhaz_1.2.6.4    flextable_0.9.10 knitr_1.50       readxl_1.4.5    
 [5] writexl_1.5.4    broom_1.0.8      gtsummary_2.4.0  survminer_0.5.1 
 [9] ggpubr_0.6.1     survival_3.8-3   lubridate_1.9.4  forcats_1.0.0   
[13] stringr_1.5.1    dplyr_1.1.4      purrr_1.0.4      readr_2.1.5     
[17] tidyr_1.3.1      tibble_3.3.0     ggplot2_4.0.0    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        viridisLite_0.4.2       farver_2.1.2           
 [4] S7_0.2.0                fastmap_1.2.0           fontquiver_0.2.1       
 [7] broom.helpers_1.22.0    labelled_2.15.0         digest_0.6.37          
[10] timechange_0.3.0        lifecycle_1.0.4         magrittr_2.0.3         
[13] compiler_4.5.1          sass_0.4.10             rlang_1.1.6            
[16] tools_4.5.1             gt_1.0.0                yaml_2.3.10            
[19] data.table_1.17.8       ggsignif_0.6.4          labeling_0.4.3         
[22] askpass_1.2.1           htmlwidgets_1.6.4       xml2_1.4.1             
[25] RColorBrewer_1.1-3      abind_1.4-8             withr_3.0.2            
[28] grid_4.5.1              gdtools_0.4.3           xtable_1.8-4           
[31] scales_1.4.0            cli_3.6.5               rmarkdown_2.29         
[34] ragg_1.4.0              generics_0.1.4          rstudioapi_0.17.1      
[37] km.ci_0.5-6             tzdb_0.5.0              commonmark_2.0.0       
[40] splines_4.5.1           cellranger_1.1.0        base64enc_0.1-3        
[43] survMisc_0.5.6          vctrs_0.6.5             Matrix_1.7-3           
[46] jsonlite_2.0.0          fontBitstreamVera_0.1.1 carData_3.0-5          
[49] litedown_0.7            car_3.1-3               hms_1.1.3              
[52] rstatix_0.7.2           Formula_1.2-5           systemfonts_1.2.3      
[55] glue_1.8.0              ggtext_0.1.2            stringi_1.8.7          
[58] gtable_0.3.6            pillar_1.10.2           htmltools_0.5.8.1      
[61] openssl_2.3.3           R6_2.6.1                KMsurv_0.1-6           
[64] textshaping_1.0.1       evaluate_1.0.4          lattice_0.22-7         
[67] haven_2.5.5             markdown_2.0            backports_1.5.0        
[70] cards_0.7.0             gridtext_0.1.5          ggsci_4.0.0            
[73] fontLiberation_0.1.0    cardx_0.3.0             Rcpp_1.1.0             
[76] zip_2.3.3               uuid_1.2-1              gridExtra_2.3          
[79] officer_0.7.0           xfun_0.52               zoo_1.8-14             
[82] pkgconfig_2.0.3