Overview

The goal of this workshop is to introduce core statistical concepts using simple base R syntax and a real clinical dataset from the OncoDataSets package.

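Topics covered:

  • Descriptive statistics and visualization
  • Two-sample t-test
  • Chi-square test
  • Linear regression (OLS)
  • Kaplan–Meier survival analysis
  • Log-rank test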

Tip: in RStudio, pressing F1 with the cursor on a function name opens the Help page for that function. Typing ?hist in the console does the same for hist().

Step 0: Install and Load Required Packages

Task: Install and load the required packages.

Suggested commands:

# Uncomment and run these two lines once if the packages are not yet installed:
#install.packages("OncoDataSets")
#install.packages("survival")

library(OncoDataSets)
library(survival)

Exercise 1: Load and Explore the Breast Cancer Dataset

We will use a breast cancer dataset from OncoDataSets, stored here as BreastCancer_df, which contains demographic, tumour, and survival information for patients with breast cancer. Paste the commands below into your script to load the dataset, copy it to the shorter name BreastCancer_df, and generate an artificial tumour size variable, which is not otherwise present in the dataset.

Task: Load the dataset into R.

Suggested command:

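data("WBreastCancer_tbl_df")
BreastCancer_df <- WBreastCancer_tbl_df   # copy to a shorter working name

# Simulate an artificial tumour size that decreases with age and with
# histological grade. set.seed() is optional; it makes the random draws
# reproducible so that everyone in the workshop gets the same values.
set.seed(1)
n <- nrow(BreastCancer_df)
BreastCancer_df$tumour_size <- abs(round(
  15 + rnorm(n, 25, 10)
  - 0.25 * BreastCancer_df$age
  - BreastCancer_df$histgrad * rgamma(n, 3)
))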

Questions

  1. How many patients are in the dataset?
  2. What variables are available, and which are numeric vs categorical?
  3. What is the median age of patients?

Suggested commands:

head() str() summary()
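
Example: a minimal sketch of how these commands are called, run on simulated data rather than the workshop dataset. The object toy_df and the column name er are hypothetical stand-ins; replace them with BreastCancer_df and the column names reported by str().

```r
# Toy data frame standing in for BreastCancer_df. Values are simulated;
# the column name `er` is an assumption made for illustration.
set.seed(42)
toy_df <- data.frame(
  age = round(rnorm(50, mean = 55, sd = 10)),
  er  = factor(sample(c("Negative", "Positive"), 50, replace = TRUE))
)

head(toy_df)      # first six rows
str(toy_df)       # column types: numeric (age) vs factor (er)
summary(toy_df)   # quartiles and mean for age, counts per level for er

n_patients <- nrow(toy_df)        # one row per patient
median_age <- median(toy_df$age)  # median of a numeric column
```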


Exercise 2: Descriptive Statistics and Visualization

Age Distribution

Task: Create a histogram of patient age.

Suggested commands:

hist()
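
Example: a minimal sketch of a labelled histogram, drawn from simulated ages rather than the workshop data. The vector toy_age is hypothetical; substitute BreastCancer_df$age.

```r
set.seed(42)
toy_age <- round(rnorm(200, mean = 55, sd = 10))  # simulated ages, not real data

# hist() draws the plot and invisibly returns the bin counts
age_hist <- hist(toy_age,
                 main = "Distribution of patient age",
                 xlab = "Age (years)",
                 col  = "grey80")

# Comparing mean and median gives a rough numerical check on skewness:
# a mean well above the median suggests right skew, and vice versa.
mean(toy_age)
median(toy_age)
```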

Questions:

  • Is the distribution symmetric or skewed?

Estrogen Receptor (ER) Status

Task: Create a bar chart showing ER-positive vs ER-negative patients.

Suggested commands:

table() barplot()
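
Example: a minimal sketch of counting a categorical variable and plotting the counts, using simulated ER labels. The vector toy_er and its level names are assumptions; check how ER status is actually coded in the dataset.

```r
set.seed(42)
# Simulated ER status labels (hypothetical coding)
toy_er <- sample(c("Negative", "Positive"), 100, replace = TRUE,
                 prob = c(0.3, 0.7))

er_counts <- table(toy_er)   # number of patients per category
er_counts

barplot(er_counts,
        main = "ER status",
        xlab = "ER status",
        ylab = "Number of patients")

prop.table(er_counts)        # the same counts expressed as proportions
```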

Questions:

  • Which ER group is more common?

Tumour Size

Task: Visualize tumour size.

Suggested commands:

hist()

Questions:

  • Is tumour size normally distributed?
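
Example: one common way to examine normality beyond the histogram is a normal Q-Q plot, optionally with a Shapiro-Wilk test. This is a sketch on simulated values; toy_size is a hypothetical stand-in for BreastCancer_df$tumour_size.

```r
set.seed(42)
toy_size <- abs(round(rnorm(200, mean = 25, sd = 10)))  # simulated sizes

# Points close to the reference line are consistent with normality
qqnorm(toy_size)
qqline(toy_size)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw <- shapiro.test(toy_size)
sw$p.value
```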

Exercise 3: Two-Sample t-Test

Clinical Question

Is patient age different between ER-positive and ER-negative breast cancer?

Tasks:

  1. Count how many patients are in each ER group.
  2. Perform a two-sample t-test.

Suggested commands:

table() t.test()
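
Example: a minimal sketch of a two-sample t-test using the formula interface, run on simulated data. The data frame toy_df and the column name er are hypothetical. Note that t.test() performs Welch's test (unequal variances) unless var.equal = TRUE is set.

```r
set.seed(42)
# Simulated ages for two ER groups (not the workshop data)
toy_df <- data.frame(
  er  = factor(rep(c("Negative", "Positive"), times = c(40, 60))),
  age = c(rnorm(40, mean = 52, sd = 10), rnorm(60, mean = 57, sd = 10))
)

table(toy_df$er)                       # group sizes

tt <- t.test(age ~ er, data = toy_df)  # outcome ~ grouping variable
tt

tt$estimate   # mean age in each group
tt$conf.int   # confidence interval for the difference in means
tt$p.value    # p-value for the null hypothesis of equal means
```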

Questions:

  • What is the null hypothesis?
  • What does the p-value represent?

Exercise 4: Chi-Square Test

Clinical Question

Is ER status associated with nodal involvement?

Tasks:

  1. Create a contingency table.
  2. Perform a chi-square test.

Suggested commands:

table() chisq.test()
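
Example: a minimal sketch of building a contingency table and testing it, run on simulated data. The column names er and nodes are assumptions; use the names shown by str(BreastCancer_df).

```r
set.seed(42)
# Two simulated categorical variables (not the workshop data)
toy_df <- data.frame(
  er    = sample(c("Negative", "Positive"), 200, replace = TRUE),
  nodes = sample(c("No", "Yes"), 200, replace = TRUE)
)

tab <- table(ER = toy_df$er, Nodes = toy_df$nodes)  # cross-counts
tab
addmargins(tab)   # adds row and column totals (the single-variable counts)

# For 2x2 tables chisq.test() applies Yates' continuity correction by default
chi <- chisq.test(tab)
chi
chi$expected      # expected counts under independence
```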

Questions:

  • What are the counts for each variable, and the cross-counts for the two together?
  • Is the result statistically significant?

Exercise 5: Linear Regression (OLS)

Question

Does age predict tumour size?

Tasks:

  1. Fit a linear regression model.
  2. Examine the model summary.
  3. Plot the relationship and regression line.

Suggested commands:

lm() summary() plot() abline()
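
Example: a minimal sketch of fitting and plotting a simple linear regression, run on simulated data. The data frame toy_df is hypothetical; the column names age and tumour_size match those created in Exercise 1.

```r
set.seed(42)
# Simulated data with a built-in negative trend (not the workshop data)
toy_df <- data.frame(age = round(runif(100, 30, 80)))
toy_df$tumour_size <- 40 - 0.25 * toy_df$age + rnorm(100, sd = 8)

fit <- lm(tumour_size ~ age, data = toy_df)   # outcome ~ predictor
summary(fit)

plot(tumour_size ~ age, data = toy_df,
     xlab = "Age (years)", ylab = "Tumour size")
abline(fit, col = "red", lwd = 2)             # add the fitted line

coef(fit)["age"]         # slope: change in tumour size per year of age
summary(fit)$r.squared   # proportion of variance explained
```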

Questions:

  • What does the slope represent clinically?
  • How strong is the relationship?

Exercise 6: Kaplan–Meier Survival Analysis

Question

Does ER status affect overall survival?

Tasks:

  1. Create a survival object.
  2. Fit Kaplan–Meier curves by ER status.
  3. Plot the survival curves.

Suggested commands:

Surv() survfit() plot() legend()
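
Example: a minimal sketch of Kaplan–Meier curves by group, run on simulated data. The column names er, time, and status (1 = event, 0 = censored) are assumptions; check the names and the event coding in the actual dataset.

```r
library(survival)

set.seed(42)
n <- 120
# Simulated follow-up times and event indicators (not the workshop data)
toy_df <- data.frame(
  er     = factor(rep(c("Negative", "Positive"), each = n / 2)),
  time   = c(rexp(n / 2, rate = 1 / 40), rexp(n / 2, rate = 1 / 70)),
  status = rbinom(n, 1, 0.7)
)

surv_obj <- Surv(toy_df$time, toy_df$status)  # censored times print with "+"
head(surv_obj)

km_fit <- survfit(Surv(time, status) ~ er, data = toy_df)
km_fit                                        # includes median survival per group

plot(km_fit, col = c("red", "blue"),
     xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(toy_df$er),
       col = c("red", "blue"), lty = 1)
abline(h = 0.5, lty = 2)  # medians are where each curve crosses this line
```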

Questions:

  • Which group appears to have better survival?
  • Can you estimate median survival from the plot?

Exercise 7: Log-Rank Test

Question

Is the difference in survival between ER groups statistically significant?

Task: Perform a log-rank test.

Suggested command:

survdiff()
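
Example: a minimal sketch of a log-rank test, run on simulated data. As in the Kaplan–Meier sketch, the column names er, time, and status are assumptions. The p-value is computed here from the chi-square statistic that survdiff() returns.

```r
library(survival)

set.seed(42)
n <- 120
# Simulated follow-up times and event indicators (not the workshop data)
toy_df <- data.frame(
  er     = factor(rep(c("Negative", "Positive"), each = n / 2)),
  time   = c(rexp(n / 2, rate = 1 / 40), rexp(n / 2, rate = 1 / 70)),
  status = rbinom(n, 1, 0.7)
)

# Log-rank test of equal survival curves across ER groups
lr <- survdiff(Surv(time, status) ~ er, data = toy_df)
lr

# Chi-square p-value with (number of groups - 1) degrees of freedom
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```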

Questions:

  • What hypothesis is being tested?
  • Comment on the prognostic value of ER status.