Descriptive Statistics

FRCR Medical Statistics Module

Author

Scottish Oncology Course

Published

December 2, 2025

1 Introduction

1.1 What is Statistics?

“Statistics…is the core science of evidence-based practice” — Martin Bland

Statistics is the science of:

  • Collecting data
  • Summarising data
  • Presenting data
  • Interpreting data

1.2 Learning Objectives

By the end of this tutorial, you will be able to:

  • Present and summarise individual variables
  • Recognise categorical data (nominal, ordinal)
  • Recognise discrete and continuous numerical data
  • Recognise symmetric and skewed distributions
  • Describe the normal distribution
  • Interpret bar charts and histograms
  • Define and apply measures of central tendency and spread

1.3 Descriptive Statistics

Descriptive statistics involves summarising and presenting data. This is essential because it:

  • Allows us to get “a feel” for the data
  • Must be done before any inferential analysis
  • Helps form subjective impressions of answers to research questions

2 Loading the Data

Throughout this tutorial, we will use a dataset of 102 patients with non-small cell lung cancer (NSCLC) treated with stereotactic ablative radiotherapy (SABR).

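# Packages used throughout this tutorial (assumed to be loaded in a hidden
# setup chunk; listed here so the code can be run on its own)
library(tidyverse)   # dplyr, tidyr, ggplot2, tibble
library(readxl)      # read_excel()
library(gt)          # gt(), tab_header()
library(janitor)     # adorn_totals()
library(scales)      # percent()
library(patchwork)   # p1 + p2 + p3
library(gtsummary)   # tbl_summary()

# Load patient data from Excel file
patients <- read_excel("data/nsclc_patient_data.xlsx")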

# View the first few rows
patients |>
  head(10) |>
  gt() |>
  tab_header(
    title = "NSCLC Patient Dataset",
    subtitle = "First 10 observations"
  )
NSCLC Patient Dataset
First 10 observations
patient_id age sex stage tumour_location performance_status tumour_size_cm suv_max height_m weight_kg neutrophil_count lymphocyte_count brain_metastases status_2yr bmi nlr age_group
1 81 Male IIB Right Lower Lobe 1 2.7 10.5 1.70 89.5 5.2 1.7 1 Dead 31.0 3.06 80+
2 76 Female IB Right Lower Lobe 1 3.7 12.6 1.65 62.9 6.0 1.5 1 Alive 23.1 4.00 70-79
3 71 Female IA Left Upper Lobe 2 3.0 16.3 1.55 65.3 5.6 0.6 1 Dead 27.2 9.33 70-79
4 70 Female IIB Right Upper Lobe 0 4.4 9.2 1.68 93.2 6.5 1.6 3 Alive 33.0 4.06 70-79
5 82 Female IA Left Upper Lobe 1 3.3 13.2 1.58 75.5 7.3 1.8 2 Alive 30.2 4.06 80+
6 84 Male IA Right Upper Lobe 1 1.2 12.8 1.93 74.1 5.3 1.7 1 Alive 19.9 3.12 80+
7 77 Male IB Right Upper Lobe 0 0.9 12.8 1.75 80.6 4.1 1.1 3 Alive 26.3 3.73 70-79
8 71 Male IB Right Lower Lobe 2 1.8 9.2 1.76 91.5 4.4 3.0 1 Alive 29.5 1.47 70-79
9 61 Female IB Right Lower Lobe 1 6.3 3.0 1.60 59.7 1.7 3.0 3 Dead 23.3 0.57 60-69
10 62 Female IA Right Upper Lobe 3 2.3 4.5 1.47 67.6 6.9 1.6 3 Alive 31.3 4.31 60-69

2.1 Dataset Structure

A collection of data is called a dataset. It contains information on subjects we are interested in. For computer analysis, data must have a clear structure where:

  • Each row represents an individual observation (patient)
  • Each column represents a variable (characteristic)
# Examine the structure
glimpse(patients)
Rows: 102
Columns: 17
$ patient_id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ age                <dbl> 81, 76, 71, 70, 82, 84, 77, 71, 61, 62, 57, 76, 79,…
$ sex                <chr> "Male", "Female", "Female", "Female", "Female", "Ma…
$ stage              <chr> "IIB", "IB", "IA", "IIB", "IA", "IA", "IB", "IB", "…
$ tumour_location    <chr> "Right Lower Lobe", "Right Lower Lobe", "Left Upper…
$ performance_status <dbl> 1, 1, 2, 0, 1, 1, 0, 2, 1, 3, 1, 0, 1, 3, 0, 0, 1, …
$ tumour_size_cm     <dbl> 2.7, 3.7, 3.0, 4.4, 3.3, 1.2, 0.9, 1.8, 6.3, 2.3, 2…
$ suv_max            <dbl> 10.5, 12.6, 16.3, 9.2, 13.2, 12.8, 12.8, 9.2, 3.0, …
$ height_m           <dbl> 1.70, 1.65, 1.55, 1.68, 1.58, 1.93, 1.75, 1.76, 1.6…
$ weight_kg          <dbl> 89.5, 62.9, 65.3, 93.2, 75.5, 74.1, 80.6, 91.5, 59.…
$ neutrophil_count   <dbl> 5.2, 6.0, 5.6, 6.5, 7.3, 5.3, 4.1, 4.4, 1.7, 6.9, 7…
$ lymphocyte_count   <dbl> 1.7, 1.5, 0.6, 1.6, 1.8, 1.7, 1.1, 3.0, 3.0, 1.6, 1…
$ brain_metastases   <dbl> 1, 1, 1, 3, 2, 1, 3, 1, 3, 3, 1, 1, 2, 3, 1, 3, 3, …
$ status_2yr         <chr> "Dead", "Alive", "Dead", "Alive", "Alive", "Alive",…
$ bmi                <dbl> 31.0, 23.1, 27.2, 33.0, 30.2, 19.9, 26.3, 29.5, 23.…
$ nlr                <dbl> 3.06, 4.00, 9.33, 4.06, 4.06, 3.12, 3.73, 1.47, 0.5…
$ age_group          <chr> "80+", "70-79", "70-79", "70-79", "80+", "80+", "70…
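
As a minimal sketch of this row-and-column layout, the first three patients shown above can be typed in by hand as a data frame. This uses base R only, and `mini` is an illustrative name rather than part of the tutorial's dataset.

```r
# Three patients (rows) and four variables (columns), copied from the table above
mini <- data.frame(
  patient_id = c(1, 2, 3),
  age        = c(81, 76, 71),
  sex        = c("Male", "Female", "Female"),
  stage      = c("IIB", "IB", "IA")
)

nrow(mini)   # number of observations (patients)
ncol(mini)   # number of variables
str(mini)    # storage type of each column
```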

3 Types of Data

To produce descriptive statistics appropriately requires knowledge of different data types. Broadly, data are either numerical or categorical.

3.1 Categorical Data

Categorical (qualitative) data tells us which category an individual belongs to.

3.1.1 Nominal Data

Categories with no natural ordering.

patients |>
  count(tumour_location) |>
  gt() |>
  tab_header(title = "Tumour Location (Nominal Variable)")
Tumour Location (Nominal Variable)
tumour_location n
Left Lower Lobe 11
Left Upper Lobe 30
Right Lower Lobe 26
Right Middle Lobe 1
Right Upper Lobe 34

3.1.2 Ordinal Data

Categories with a natural ordering.

patients |>
  count(stage) |>
  arrange(stage) |>
  gt() |>
  tab_header(title = "Cancer Stage (Ordinal Variable)")
Cancer Stage (Ordinal Variable)
stage n
IA 25
IB 32
IIA 17
IIB 19
IIIA 9
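
Stage is stored as text in this dataset, so R sorts it alphabetically. That happens to match the clinical order for these five stages, but it is safer to state the order explicitly. A minimal sketch in base R, using a few stage values typed in by hand for illustration:

```r
# A few stage values, entered by hand for illustration
stage_chr <- c("IIB", "IB", "IA", "IIIA", "IIA", "IB")

# Declare the clinical ordering explicitly
stage_ord <- factor(
  stage_chr,
  levels  = c("IA", "IB", "IIA", "IIB", "IIIA"),
  ordered = TRUE
)

table(stage_ord)     # counts appear in clinical order
stage_ord < "IIA"    # comparisons are meaningful for ordered factors
max(stage_ord)       # highest stage present
```

A nominal variable such as tumour location would use `factor()` without `ordered = TRUE`, because no ordering of its categories is meaningful.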

3.1.3 Binary (Dichotomous) Data

A categorical variable with only two categories (e.g., alive or dead). Sometimes coded as 0 and 1.

patients |>
  count(status_2yr) |>
  mutate(proportion = n / sum(n)) |>
  gt() |>
  tab_header(title = "Two-Year Survival Status (Binary Variable)") |>
  fmt_percent(proportion, decimals = 1)
Two-Year Survival Status (Binary Variable)
status_2yr n proportion
Alive 74 72.5%
Dead 28 27.5%
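
One reason for the 0/1 coding is that the mean of a 0/1 variable equals the proportion of 1s. A short sketch in base R, rebuilding the status variable from the counts in the table above rather than from the data file:

```r
# Rebuild the status variable from the counts in the table above
status <- rep(c("Alive", "Dead"), times = c(74, 28))

# Code death as 1 and survival as 0
dead <- as.integer(status == "Dead")

sum(dead)    # number of deaths
mean(dead)   # proportion of deaths, 28/102
```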

3.2 Numerical Data

3.2.1 Discrete Data

Can only take specific values (usually whole numbers).

patients |>
  count(brain_metastases) |>
  mutate(
    brain_metastases = case_when(
      brain_metastases == 1 ~ "1",
      brain_metastases == 2 ~ "2",
      brain_metastases == 3 ~ "3 or more"
    )
  ) |>
  gt() |>
  tab_header(title = "Number of Brain Metastases (Discrete Variable)")
Number of Brain Metastases (Discrete Variable)
brain_metastases n
1 62
2 18
3 or more 22

3.2.2 Continuous Data

Can take any value within a range (infinitely divisible).

patients |>
  select(patient_id, age, height_m, weight_kg, tumour_size_cm, suv_max) |>
  head(8) |>
  gt() |>
  tab_header(title = "Examples of Continuous Variables")
Examples of Continuous Variables
patient_id age height_m weight_kg tumour_size_cm suv_max
1 81 1.70 89.5 2.7 10.5
2 76 1.65 62.9 3.7 12.6
3 71 1.55 65.3 3.0 16.3
4 70 1.68 93.2 4.4 9.2
5 82 1.58 75.5 3.3 13.2
6 84 1.93 74.1 1.2 12.8
7 77 1.75 80.6 0.9 12.8
8 71 1.76 91.5 1.8 9.2

3.3 Summary: Data Types

Data Type     Description                 Examples
Nominal       Categories without order    Tumour location, Sex
Ordinal       Categories with order       Stage, Performance status
Discrete      Countable numbers           Number of metastases
Continuous    Measurable values           Age, Height, SUVmax

4 Graphics for Data Visualisation

4.1 Visualising Categorical Data

4.1.1 Frequency Tables

A frequency distribution table lists data values and how often each occurs.

patients |>
  count(stage, name = "n_patients") |>
  mutate(percent = round(100 * n_patients / sum(n_patients), 1)) |>
  arrange(stage) |>
  adorn_totals("row") |>
  gt() |>
  tab_header(title = "Frequency Distribution of Cancer Stage")
Frequency Distribution of Cancer Stage
stage n_patients percent
IA 25 24.5
IB 32 31.4
IIA 17 16.7
IIB 19 18.6
IIIA 9 8.8
Total 102 100.0

The most frequent category is called the mode; here it is stage IB (32 patients, 31.4%). Percentages can also be expressed as proportions (e.g., 0.314 instead of 31.4%).
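
The same frequencies, proportions and mode can be reproduced in base R. This is a minimal sketch that rebuilds the stage variable from the counts in the table above instead of reading the data file:

```r
# Rebuild the stage variable from the counts in the table above
stage <- rep(c("IA", "IB", "IIA", "IIB", "IIIA"), times = c(25, 32, 17, 19, 9))

freq <- table(stage)         # frequencies
prop <- prop.table(freq)     # proportions (these sum to 1)
round(100 * prop, 1)         # percentages

# Mode: the most frequent category
names(freq)[which.max(freq)]
```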

4.1.2 Bar Charts

Bar charts display the frequency of categorical data with gaps between bars.

patients |>
  count(stage) |>
  ggplot(aes(x = stage, y = n, fill = stage)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(
    title = "Stage Distribution",
    subtitle = "102 patients with NSCLC treated with SABR",
    x = "Stage",
    y = "Frequency (n)"
  ) +
  scale_fill_brewer(palette = "Blues") +
  ylim(0, 45)

4.1.3 Pie Charts

Pie charts show proportions of a whole (best used for nominal data with few categories).

patients |>
  count(tumour_location) |>
  mutate(
    percent = round(100 * n / sum(n), 0),
    label = paste0(tumour_location, "\n(", percent, "%)")
  ) |>
  ggplot(aes(x = "", y = n, fill = tumour_location)) +
  geom_col(width = 1, colour = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(percent, "%")),
            position = position_stack(vjust = 0.5),
            colour = "white", fontface = "bold") +
  labs(
    title = "Tumour Location Distribution",
    subtitle = "Location of tumour among lung cancer patients",
    fill = "Location"
  ) +
  theme_void() +
  scale_fill_brewer(palette = "Set2")

4.1.4 Contingency Tables

Contingency tables show the relationship between two categorical variables.

# Create contingency table
cont_table <- patients |>
  count(sex, status_2yr) |>
  pivot_wider(names_from = sex, values_from = n) |>
  mutate(Total = Female + Male)

# Add column percentages (calculated within each sex)
cont_table_pct <- patients |>
  group_by(sex) |>
  count(status_2yr) |>
  mutate(
    pct = round(100 * n / sum(n), 0),
    cell = paste0(n, " (", pct, "%)")
  ) |>
  select(-n, -pct) |>
  pivot_wider(names_from = sex, values_from = cell)

cont_table_pct |>
  gt() |>
  tab_header(
    title = "Two-Year Survival by Sex",
    subtitle = "Column percentages shown"
  )
Two-Year Survival by Sex
Column percentages shown
status_2yr Female Male
Alive 48 (74%) 26 (70%)
Dead 17 (26%) 11 (30%)
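
Whether a contingency table shows row or column percentages changes how it is read. The sketch below uses base R and the counts from the table above; `margin = 2` divides by the column (sex) totals and `margin = 1` by the row (status) totals.

```r
# Counts taken from the table above (matrix() fills column by column)
tab <- matrix(
  c(48, 17,    # Female: alive, dead
    26, 11),   # Male:   alive, dead
  nrow = 2,
  dimnames = list(status_2yr = c("Alive", "Dead"), sex = c("Female", "Male"))
)

round(100 * prop.table(tab, margin = 2))  # column percentages: each sex sums to 100
round(100 * prop.table(tab, margin = 1))  # row percentages: each status sums to 100
addmargins(tab)                           # counts with row and column totals
```

Column percentages answer "what proportion of women were alive at two years?", whereas row percentages answer "what proportion of survivors were women?".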

4.1.5 Clustered Bar Charts

Compare two categorical variables side by side.

patients |>
  count(sex, status_2yr) |>
  ggplot(aes(x = sex, y = n, fill = status_2yr)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = n), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5) +
  labs(
    title = "Two-Year Survival Status by Sex",
    subtitle = "102 patients with NSCLC treated with SABR",
    x = "Sex",
    y = "Frequency (n)",
    fill = "Status"
  ) +
  scale_fill_brewer(palette = "Set1") +
  ylim(0, 55)

4.1.6 Stacked Bar Charts

patients |>
  count(sex, status_2yr) |>
  group_by(sex) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = sex, y = pct, fill = status_2yr)) +
  geom_col() +
  geom_text(aes(label = scales::percent(pct, accuracy = 1)), 
            position = position_stack(vjust = 0.5),
            colour = "white", fontface = "bold") +
  labs(
    title = "Two-Year Survival Status by Sex (Percentage)",
    subtitle = "102 patients with NSCLC treated with SABR",
    x = "Sex",
    y = "Proportion",
    fill = "Status"
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set1")

4.2 Visualising Numerical Data

4.2.1 Histograms

Histograms show the distribution of continuous data. Unlike bar charts, bars are adjacent (no gaps).

ggplot(patients, aes(x = height_m)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", colour = "white") +
  labs(
    title = "Distribution of Patient Height",
    subtitle = "102 patients with NSCLC",
    x = "Height (m)",
    y = "Frequency"
  )

4.2.2 Histograms by Group

ggplot(patients, aes(x = height_m, fill = sex)) +
  geom_histogram(binwidth = 0.05, colour = "white", alpha = 0.7) +
  facet_wrap(~sex) +
  labs(
    title = "Distribution of Height by Sex",
    x = "Height (m)",
    y = "Frequency"
  ) +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "none")

4.2.3 Dot Plots

Useful for small datasets or comparing before/after measurements.

patients |>
  slice_head(n = 30) |>
  ggplot(aes(x = suv_max)) +
  geom_dotplot(binwidth = 1, fill = "steelblue") +
  labs(
    title = "SUVmax Distribution",
    subtitle = "PET scan measurements (subset of patients)",
    x = "SUVmax",
    y = NULL
  ) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

4.2.4 Box Plots

Box plots (box-and-whisker plots) summarise the distribution showing:

  • Median (middle line)
  • Interquartile range (IQR) (box = Q1 to Q3)
  • Whiskers (extend to the most extreme values within 1.5 × IQR of the box)
  • Outliers (points beyond whiskers)
# Create annotated boxplot
bp_data <- tibble(
  group = "A",
  # 20 lies beyond the upper fence (Q3 + 1.5 × IQR = 16), so it is drawn as an outlier
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20)
)

ggplot(bp_data, aes(x = group, y = value)) +
  geom_boxplot(fill = "lightblue", width = 0.3) +
  annotate("text", x = 1.25, y = median(bp_data$value),
           label = "Median", hjust = 0) +
  annotate("text", x = 1.25, y = quantile(bp_data$value, 0.75),
           label = "Q3 (75th percentile)", hjust = 0) +
  annotate("text", x = 1.25, y = quantile(bp_data$value, 0.25),
           label = "Q1 (25th percentile)", hjust = 0) +
  annotate("text", x = 1.25, y = max(bp_data$value),
           label = "Outlier", hjust = 0) +
  labs(
    title = "Anatomy of a Box Plot",
    y = "Value"
  ) +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank())
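
The whisker and outlier rule can be checked numerically. The sketch below assumes the common 1.5 × IQR convention, which is the default in ggplot2's `geom_boxplot()`, and uses a small illustrative vector, not patient data.

```r
# Illustrative data with one unusually large value
value <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20)

q1  <- unname(quantile(value, 0.25))
q3  <- unname(quantile(value, 0.75))
iqr <- q3 - q1

# Fences: 1.5 x IQR beyond each edge of the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme values inside the fences
whiskers <- range(value[value >= lower_fence & value <= upper_fence])

# Values outside the fences are drawn individually as outliers
outliers <- value[value < lower_fence | value > upper_fence]

c(Q1 = q1, Q3 = q3, IQR = iqr, upper_fence = upper_fence)
whiskers
outliers
```

Note that the whiskers end at observed values (here 1 and 10), not at the fences themselves.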

ggplot(patients, aes(x = sex, y = nlr, fill = sex)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Neutrophil-Lymphocyte Ratio by Sex",
    subtitle = "102 patients with NSCLC prior to treatment with SABR",
    x = NULL,
    y = "Neutrophil-Lymphocyte Ratio"
  ) +
  scale_fill_brewer(palette = "Set2")

4.2.5 Scatter Plots

Show the relationship between two continuous variables.

ggplot(patients, aes(x = neutrophil_count, y = lymphocyte_count)) +
  geom_point(colour = "steelblue", alpha = 0.7, size = 2) +
  labs(
    title = "Blood Counts",
    subtitle = "102 patients with NSCLC prior to treatment with SABR",
    x = "Neutrophil Count (×10⁹/L)",
    y = "Lymphocyte Count (×10⁹/L)"
  )

4.3 Summary: Graphics for Different Data Types

Data Type                    Suitable Graphics
Continuous                   Histograms, box plots, scatter plots, dot plots
Categorical (Nominal)        Bar charts, pie charts, frequency tables
Categorical (Ordinal)        Bar charts, frequency tables
Two categorical variables    Contingency tables, clustered/stacked bar charts
Two continuous variables     Scatter plots

5 Measures of Central Tendency

Measures of central tendency (or “location”) describe the centre or “average” of the data.

5.1 Mean

The arithmetic mean is the sum of all values divided by the number of observations.

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

# Example with 10 SUVmax values
suv_sample <- patients |>
  slice_head(n = 10) |>
  pull(suv_max)

# Display the values
cat("SUVmax values:", paste(suv_sample, collapse = ", "), "\n")
SUVmax values: 10.5, 12.6, 16.3, 9.2, 13.2, 12.8, 12.8, 9.2, 3, 4.5 
cat("Sum:", sum(suv_sample), "\n")
Sum: 104.1 
cat("n:", length(suv_sample), "\n")
n: 10 
cat("Mean:", round(mean(suv_sample), 2))
Mean: 10.41

5.2 Median

The median is the middle value when data are ordered from smallest to largest.

  • For odd n: the middle value
  • For even n: the average of the two middle values
# Order the values
suv_ordered <- sort(suv_sample)
cat("Ordered values:", paste(suv_ordered, collapse = ", "), "\n")
Ordered values: 3, 4.5, 9.2, 9.2, 10.5, 12.6, 12.8, 12.8, 13.2, 16.3 
cat("Median:", median(suv_sample))
Median: 11.55

The median is robust to outliers—extreme values have little effect on it.
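
The odd and even cases can be written out by hand. The function below is a teaching sketch (`median_by_hand` is a hypothetical name); in practice the built-in `median()` does the same job.

```r
# Median by hand: sort, then take the middle value (odd n)
# or the average of the two middle values (even n)
median_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

# The ten SUVmax values used above, typed in by hand
suv_10 <- c(10.5, 12.6, 16.3, 9.2, 13.2, 12.8, 12.8, 9.2, 3.0, 4.5)

median_by_hand(suv_10)        # even n: average of 10.5 and 12.6
median_by_hand(suv_10[1:9])   # odd n: the 5th of the 9 ordered values
```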

5.3 Mode

The mode is the most frequently occurring value. It is particularly useful for categorical data.

# Mode of stage
mode_stage <- patients |>
  count(stage, sort = TRUE) |>
  slice_head(n = 1) |>
  pull(stage)

cat("Mode of stage:", mode_stage)
Mode of stage: IB

5.4 Effect of Outliers

# Create data with and without outlier
normal_data <- c(1.8, 2.7, 5.4, 5.8, 6.6, 8.9, 9.4, 13.1, 16.0, 17.9)
outlier_data <- c(1.8, 2.7, 5.4, 5.8, 6.6, 8.9, 9.4, 13.1, 16.0, 1000.0)

comparison <- tibble(
  Measure = c("Mean", "Median"),
  `Original Data` = c(mean(normal_data), median(normal_data)),
  `With Outlier (1000)` = c(mean(outlier_data), median(outlier_data))
)

comparison |>
  gt() |>
  tab_header(
    title = "Effect of Outliers on Mean and Median"
  ) |>
  fmt_number(columns = -Measure, decimals = 1)
Effect of Outliers on Mean and Median
Measure Original Data With Outlier (1000)
Mean 8.8 107.0
Median 7.8 7.8

Key insight: The median remains stable (7.75 in both cases) while the mean jumps dramatically (from 8.8 to 107.0) when an outlier is present.

6 Measures of Dispersion (Spread)

Measures of dispersion describe how “concentrated” or “spread out” the data are.

6.1 Range

The range is the difference between the maximum and minimum values.

suv_ordered <- sort(suv_sample)
cat("Ordered values:", paste(suv_ordered, collapse = ", "), "\n")
Ordered values: 3, 4.5, 9.2, 9.2, 10.5, 12.6, 12.8, 12.8, 13.2, 16.3 
cat("Minimum:", min(suv_sample), "\n")
Minimum: 3 
cat("Maximum:", max(suv_sample), "\n")
Maximum: 16.3 
cat("Range:", max(suv_sample) - min(suv_sample))
Range: 13.3

6.2 Interquartile Range (IQR)

The IQR is the range between the 25th percentile (Q1) and 75th percentile (Q3).

  • Contains the middle 50% of the data
  • Robust to outliers
cat("Q1 (25th percentile):", quantile(suv_sample, 0.25), "\n")
Q1 (25th percentile): 9.2 
cat("Q3 (75th percentile):", quantile(suv_sample, 0.75), "\n")
Q3 (75th percentile): 12.8 
cat("IQR:", IQR(suv_sample))
IQR: 3.6

6.3 Standard Deviation

The standard deviation (SD) measures how far observations are from the mean, on average.

\[SD = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\]

6.3.1 Calculation Step by Step

# Step-by-step SD calculation
sd_table <- tibble(
  SUVmax = suv_sample,
  `SUVmax - mean` = SUVmax - mean(SUVmax),
  `(SUVmax - mean)²` = (SUVmax - mean(SUVmax))^2
)

# Display with totals
sd_table |>
  adorn_totals("row") |>
  gt() |>
  tab_header(title = "Standard Deviation Calculation") |>
  fmt_number(columns = -SUVmax, decimals = 2) |>
  tab_footnote(
    footnote = paste0("Mean = ", round(mean(suv_sample), 2), 
                      ", SD = ", round(sd(suv_sample), 2)),
    locations = cells_title()
  )
Standard Deviation Calculation¹
SUVmax SUVmax - mean (SUVmax - mean)²
10.5 0.09 0.01
12.6 2.19 4.80
16.3 5.89 34.69
9.2 −1.21 1.46
13.2 2.79 7.78
12.8 2.39 5.71
12.8 2.39 5.71
9.2 −1.21 1.46
3 −7.41 54.91
4.5 −5.91 34.93
Total 0.00 151.47
¹ Mean = 10.41, SD = 4.1
cat("Sum of squared deviations:", round(sum((suv_sample - mean(suv_sample))^2), 2), "\n")
Sum of squared deviations: 151.47 
cat("n - 1:", length(suv_sample) - 1, "\n")
n - 1: 9 
cat("Variance:", round(var(suv_sample), 2), "\n")
Variance: 16.83 
cat("Standard Deviation:", round(sd(suv_sample), 2))
Standard Deviation: 4.1
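
Substituting the totals from the table into the formula (sum of squared deviations 151.47, n = 10) gives the same result:

\[SD = \sqrt{\frac{151.47}{10 - 1}} = \sqrt{16.83} \approx 4.10\]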

6.4 Reporting Summary Statistics

Important Convention
  • Report means with standard deviations: mean (SD)
  • Report medians with interquartile ranges: median (IQR or Q1-Q3)
  • Never mix medians with SD or means with IQR!
patients |>
  summarise(
    n = n(),
    `Mean (SD)` = paste0(round(mean(suv_max), 1), " (", round(sd(suv_max), 1), ")"),
    `Median (IQR)` = paste0(round(median(suv_max), 1), " (", 
                            round(quantile(suv_max, 0.25), 1), "-",
                            round(quantile(suv_max, 0.75), 1), ")")
  ) |>
  gt() |>
  tab_header(title = "SUVmax Summary Statistics")
SUVmax Summary Statistics
n Mean (SD) Median (IQR)
102 9.3 (4.9) 8.8 (5.6-13)

6.5 Which Measure to Use?

Data Type               Central Tendency    Spread
Nominal categorical     Mode
Ordinal categorical     Median              IQR
Numerical, symmetric    Mean                Standard deviation
Numerical, skewed       Median              IQR
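
For numerical data, the choice in the table can be wrapped in a small helper. This is only a sketch: `describe_numeric` is a hypothetical function, and the decision about whether a variable is roughly symmetric is left to the analyst (for example, after looking at a histogram).

```r
# Hypothetical helper: format a numeric variable as "mean (SD)" when it is
# roughly symmetric, or as "median (Q1-Q3)" when it is skewed
describe_numeric <- function(x, symmetric = TRUE) {
  if (symmetric) {
    sprintf("%.2f (%.2f)", mean(x), sd(x))
  } else {
    q <- quantile(x, c(0.25, 0.75))
    sprintf("%.2f (%.2f-%.2f)", median(x), q[1], q[2])
  }
}

# The ten SUVmax values used earlier, typed in by hand
suv_10 <- c(10.5, 12.6, 16.3, 9.2, 13.2, 12.8, 12.8, 9.2, 3.0, 4.5)

describe_numeric(suv_10, symmetric = TRUE)    # mean (SD)
describe_numeric(suv_10, symmetric = FALSE)   # median (Q1-Q3)
```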

7 Distribution Shapes

7.1 Symmetric vs Skewed Distributions

# Generate example distributions
set.seed(123)
symmetric <- rnorm(1000, mean = 50, sd = 10)
right_skew <- rgamma(1000, shape = 2, rate = 0.1)
left_skew <- 100 - rgamma(1000, shape = 2, rate = 0.1)

p1 <- ggplot(tibble(x = symmetric), aes(x)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  labs(title = "Symmetric", subtitle = "Mean ≈ Median") +
  theme(axis.title = element_blank())

p2 <- ggplot(tibble(x = right_skew), aes(x)) +
  geom_histogram(bins = 30, fill = "coral", colour = "white") +
  labs(title = "Right (Positive) Skew", subtitle = "Mean > Median") +
  theme(axis.title = element_blank())

p3 <- ggplot(tibble(x = left_skew), aes(x)) +
  geom_histogram(bins = 30, fill = "forestgreen", colour = "white") +
  labs(title = "Left (Negative) Skew", subtitle = "Mean < Median") +
  theme(axis.title = element_blank())

p1 + p2 + p3
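
The relationship between mean and median stated in the panel subtitles can be checked numerically. A minimal sketch using simulated data from the same gamma distribution; the exact values will differ from those behind the plots because the random draws are not identical.

```r
set.seed(123)
right_skew <- rgamma(1000, shape = 2, rate = 0.1)   # long tail to the right
left_skew  <- 100 - right_skew                      # mirror image: long tail to the left

# The long tail pulls the mean towards it, away from the median
c(mean = mean(right_skew), median = median(right_skew))  # mean > median
c(mean = mean(left_skew),  median = median(left_skew))   # mean < median
```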

8 The Normal Distribution

8.1 Properties

The normal (Gaussian) distribution is a symmetric, bell-shaped distribution completely specified by two parameters:

  • μ (mu): the population mean (centre)
  • σ (sigma): the population standard deviation (spread)
# Generate normal distribution curve
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)

normal_plot <- ggplot(tibble(x, y), aes(x, y)) +
  geom_line(linewidth = 1, colour = "steelblue") +
  geom_area(alpha = 0.3, fill = "steelblue") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  annotate("text", x = 0, y = 0.42, label = "μ (mean)", fontface = "bold") +
  labs(
    title = "The Normal Distribution",
    x = "Standard Deviations from Mean",
    y = "Probability Density"
  )

normal_plot

8.2 The 68-95-99.7 Rule (Reference Ranges)

In a normal distribution:

  • About 68% of values lie within ±1 SD of the mean
  • About 95% of values lie within ±2 SD of the mean (more precisely, ±1.96 SD)
  • About 99.7% of values lie within ±3 SD of the mean
# Create three panels showing the rule
plot_sd_range <- function(sd_range, fill_col, pct_text) {
  ggplot(tibble(x, y), aes(x, y)) +
    geom_area(data = tibble(x, y) |> filter(x >= -sd_range & x <= sd_range),
              fill = fill_col, alpha = 0.5) +
    geom_line(linewidth = 1) +
    geom_vline(xintercept = c(-sd_range, sd_range), linetype = "dashed", colour = "red") +
    annotate("text", x = 0, y = 0.15, label = pct_text, size = 6, fontface = "bold") +
    scale_x_continuous(breaks = -3:3, labels = c("μ-3σ", "μ-2σ", "μ-σ", "μ", "μ+σ", "μ+2σ", "μ+3σ")) +
    labs(title = paste0("±", sd_range, " SD"), y = NULL, x = NULL) +
    theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
}

p1 <- plot_sd_range(1, "coral", "68%")
p2 <- plot_sd_range(2, "steelblue", "95%")
p3 <- plot_sd_range(3, "forestgreen", "99.7%")

p1 + p2 + p3
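
These percentages come directly from the normal cumulative distribution function, available in base R as `pnorm()`. The sketch below also works through an illustrative 95% reference range; it takes the pooled height summary reported in this tutorial's summary table (mean 1.67 m, SD 0.10 m) and assumes, for illustration only, that height is normally distributed.

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

round(100 * within_k_sd(1:3), 1)   # percentages within 1, 2 and 3 SD

# Exact multiplier for the central 95%
qnorm(0.975)

# Illustrative 95% reference range: mean ± 1.96 SD
mu    <- 1.67   # mean height (m), from the summary table
sigma <- 0.10   # SD of height (m), from the summary table
mu + c(-1.96, 1.96) * sigma
```

Under these assumptions, about 95% of patients would be expected to have a height between roughly 1.47 m and 1.87 m.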

8.3 Example: Height Distribution

# Female heights (approximately normal)
female_heights <- patients |> 
  filter(sex == "Female") |> 
  pull(height_m)

mean_h <- mean(female_heights)
sd_h <- sd(female_heights)

ggplot(patients |> filter(sex == "Female"), aes(x = height_m)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15, 
                 fill = "steelblue", colour = "white", alpha = 0.7) +
  stat_function(fun = dnorm, args = list(mean = mean_h, sd = sd_h),
                colour = "red", linewidth = 1) +
  geom_vline(xintercept = mean_h, colour = "darkblue", linetype = "dashed") +
  annotate("text", x = mean_h + 0.02, y = 4, 
           label = paste0("Mean = ", round(mean_h, 2), "m"), hjust = 0) +
  labs(
    title = "Distribution of Height (Female Patients)",
    subtitle = paste0("Mean = ", round(mean_h, 2), "m, SD = ", round(sd_h, 2), "m"),
    x = "Height (m)",
    y = "Density"
  )

9 Practical Exercises

9.1 Exercise 1: Identify Data Types

For each variable in our dataset, identify whether it is:

  1. Nominal, Ordinal, Discrete, or Continuous
  2. Categorical or Numerical
# Check variables
tibble(
  Variable = c("patient_id", "age", "sex", "stage", "performance_status", 
               "tumour_size_cm", "brain_metastases", "status_2yr"),
  `Your Answer (Type)` = c("", "", "", "", "", "", "", ""),
  `Your Answer (Category)` = c("", "", "", "", "", "", "", "")
) |>
  gt()
Variable Your Answer (Type) Your Answer (Category)
patient_id
age
sex
stage
performance_status
tumour_size_cm
brain_metastases
status_2yr
Variable              Type              Category
patient_id            Discrete (ID)     Numerical
age                   Continuous        Numerical
sex                   Nominal           Categorical
stage                 Ordinal           Categorical
performance_status    Ordinal           Categorical
tumour_size_cm        Continuous        Numerical
brain_metastases      Discrete          Numerical
status_2yr            Binary/Nominal    Categorical

9.2 Exercise 2: Calculate Summary Statistics

Calculate the mean, median, SD, and IQR for tumour size.

# Calculate summary statistics for tumour_size_cm
patients |>
  summarise(
    Mean = mean(tumour_size_cm),
    Median = median(tumour_size_cm),
    SD = sd(tumour_size_cm),
    Q1 = quantile(tumour_size_cm, 0.25),
    Q3 = quantile(tumour_size_cm, 0.75),
    IQR = IQR(tumour_size_cm)
  ) |>
  gt() |>
  fmt_number(everything(), decimals = 2)
Mean Median SD Q1 Q3 IQR
2.84 2.80 1.21 2.02 3.70 1.68

9.3 Exercise 3: Create Appropriate Visualisations

# Create appropriate visualisations for different variable types

# Histogram for continuous variable
p1 <- ggplot(patients, aes(x = tumour_size_cm)) +
  geom_histogram(bins = 15, fill = "steelblue", colour = "white") +
  labs(title = "Tumour Size Distribution", x = "Size (cm)", y = "Count")

# Bar chart for categorical variable
p2 <- ggplot(patients, aes(x = factor(performance_status))) +
  geom_bar(fill = "coral") +
  labs(title = "Performance Status", x = "PS Score", y = "Count")

# Box plot comparing groups
p3 <- ggplot(patients, aes(x = status_2yr, y = age, fill = status_2yr)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Age by Survival Status", x = NULL, y = "Age (years)") +
  scale_fill_brewer(palette = "Set2")

p1 + p2 + p3

10 Practice Questions (FRCR Style)

10.1 Question 1

The standard error of the mean provides a measure of the:

  a) spread of the data
  b) centre of the data
  c) Normality of the data
  d) precision of the sample mean
  e) bias of the sample mean

d) precision of the sample mean

The standard error measures how precisely we have estimated the population mean from our sample.
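
As an illustration, using the usual definition of the standard error of the mean and the SUVmax summary reported in this tutorial (SD = 4.9, n = 102):

\[SE = \frac{SD}{\sqrt{n}} = \frac{4.9}{\sqrt{102}} \approx 0.49\]

The SD (4.9) describes the spread of individual SUVmax values, whereas the much smaller SE describes the uncertainty in the sample mean and shrinks as n increases.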

10.2 Question 2

In a normal distribution, 95% of values lie within:

  a) the range
  b) the interquartile range
  c) ±1 standard deviation from the mean
  d) ±1.5 standard deviations from the mean
  e) ±2 standard deviations from the mean

e) ±2 standard deviations from the mean

The 68-95-99.7 rule states that approximately 95% of values in a normal distribution fall within 2 standard deviations of the mean; the exact multiplier is 1.96.

10.3 Question 3

In a normal distribution it is expected that:

  a) the median and mean will be the same
  b) the median will be greater than the mean
  c) the median will be smaller than the mean
  d) the median cannot be calculated
  e) the mean and median will not be the same

a) the median and mean will be the same

A normal distribution is perfectly symmetric, so its mean, median, and mode are all equal.

11 Summary

11.1 Key Points

  1. Data types: Categorical (nominal, ordinal) vs Numerical (discrete, continuous)

  2. Visualisation: Match your graphic to your data type

  3. Central tendency: Mean (symmetric data), Median (skewed/ordinal), Mode (categorical)

  4. Spread: Standard deviation (symmetric), IQR (skewed)

  5. Normal distribution: Symmetric, bell-shaped, described by μ and σ

  6. Reporting convention: Mean (SD) or Median (IQR) — never mix!

11.2 Comprehensive Summary Table

patients |>
  select(age, height_m, weight_kg, bmi, tumour_size_cm, suv_max, nlr) |>
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    label = list(
      age ~ "Age (years)",
      height_m ~ "Height (m)",
      weight_kg ~ "Weight (kg)",
      bmi ~ "BMI (kg/m²)",
      tumour_size_cm ~ "Tumour size (cm)",
      suv_max ~ "SUVmax",
      nlr ~ "Neutrophil-Lymphocyte Ratio"
    )
  ) |>
  modify_header(label = "**Variable**") |>
  as_gt()
Variable N = 102¹
Age (years) 72 (9)
Height (m) 1.67 (0.10)
Weight (kg) 74 (16)
BMI (kg/m²) 26.6 (5.7)
Tumour size (cm) 2.84 (1.21)
SUVmax 9.3 (4.9)
Neutrophil-Lymphocyte Ratio 3.37 (2.78)
¹ Mean (SD)

12 Further Reading

  • Bland M. An Introduction to Medical Statistics. 4th ed. Oxford University Press.
  • Kirkwood BR, Sterne JAC. Essential Medical Statistics. 2nd ed. Blackwell Publishing.
  • The Royal College of Radiologists. Clinical Oncology Curriculum 2021: Medical Statistics Module.

Tutorial created for the FRCR Medical Statistics Module