5 Presenting Data

5.1 Introduction

The use of graphics and descriptive statistics is an important step to identifying the main features of data, for detecting outliers, and identifying data which has been recorded incorrectly.

Outliers are extreme observations which are not consistent with the rest of the data. The presence of outliers can distort statistical techniques.

5.2 Frequency Distribution Tables

A count of the number of times something occurs is called a frequency count.

Within a data set we may list the data values and count how many times each value occurs. A table with the set of data values and frequency of each value is called a frequency distribution table.

code

# Create the brain metastases data
brain_mets <- tibble(
  `Number of brain metastases` = c("1", "2", "3 or more", "Total"),
  `Number of patients (n)` = c(19, 4, 9, 32),
  `Percent of patients (%)` = c(59.4, 12.5, 28.1, 100)
)

brain_mets |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 5.1: Number of brain metastases among renal cancer patients treated with cranial radiotherapy

Number of brain metastases	Number of patients (n)	Percent of patients (%)
1	19	59.4
2	4	12.5
3 or more	9	28.1
Total	32	100.0

The most frequent category is called the mode. In the above table the mode would be 1 metastasis, which occurred in a total of 19 renal cancer patients.

The percentages in a frequency distribution table are sometimes expressed as relative frequencies or proportions. In the above table, the proportions would be 0.594, 0.125, 0.281.

Grouped Frequency Tables

When the set of potential data values is large the data are organised in groups (examples – age groups 20-29, 30-39, 40-49, 50-59, 60-69 etc). A count of the number of data values in each group (group frequency) is made. The frequency distribution table will then include the data groupings and the frequency of each group.

5.3 Graphs for Numeric Data

Histograms

A histogram consists of a set of rectangles which present the frequency distribution of a variable. It allows a visualisation of the shape of the data.

code

# Create simple histogram anatomy diagram
set.seed(100)
sample_data <- tibble(x = c(2.5, 3.5, 4.5, 5.5, 6.5))
sample_freq <- c(3, 7, 12, 8, 4)

ggplot() +
  geom_col(aes(x = sample_data$x, y = sample_freq),
           width = 0.9, fill = "#3498db", colour = "#2980b9", alpha = 0.7) +
  # Y-axis label annotation
  annotate("segment", x = 1.5, xend = 1.5, y = 0, yend = 12,
           arrow = arrow(length = unit(0.3, "cm"), ends = "both"), colour = "#e74c3c", linewidth = 1) +
  annotate("text", x = 0.8, y = 6, label = "Frequency\n(y-axis)", colour = "#e74c3c", size = 3.5, fontface = "bold") +
  # X-axis label annotation
  annotate("segment", x = 2, xend = 7, y = -1.5, yend = -1.5,
           arrow = arrow(length = unit(0.3, "cm"), ends = "both"), colour = "#27ae60", linewidth = 1) +
  annotate("text", x = 4.5, y = -2.5, label = "Variable values (x-axis)", colour = "#27ae60", size = 3.5, fontface = "bold") +
  # Bar width annotation
  annotate("segment", x = 4.05, xend = 4.95, y = 13, yend = 13,
           arrow = arrow(length = unit(0.2, "cm"), ends = "both"), colour = "#8e44ad", linewidth = 0.8) +
  annotate("text", x = 4.5, y = 13.8, label = "Group width", colour = "#8e44ad", size = 3, fontface = "bold") +
  # Midpoint annotation
  annotate("point", x = 4.5, y = 0, colour = "#e67e22", size = 3) +
  annotate("segment", x = 4.5, xend = 4.5, y = 0, yend = -0.8,
           colour = "#e67e22", linewidth = 0.8, linetype = "dashed") +
  annotate("text", x = 5.8, y = 0.5, label = "Midpoint", colour = "#e67e22", size = 3, fontface = "bold") +
  annotate("segment", x = 5.3, xend = 4.6, y = 0.5, yend = 0.1,
           arrow = arrow(length = unit(0.15, "cm")), colour = "#e67e22", linewidth = 0.6) +
  labs(x = "", y = "") +
  scale_y_continuous(limits = c(-3, 15), breaks = seq(0, 12, 3)) +
  scale_x_continuous(limits = c(0.5, 8), breaks = sample_data$x) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())

Figure 5.1: Anatomy of a histogram showing key components

code

# Generate synthetic height data (women, approximately normal)
set.seed(123)
heights <- tibble(
  height = rnorm(5682, mean = 1.61, sd = 0.07)
)

ggplot(heights, aes(x = height)) +
  geom_histogram(binwidth = 0.05, fill = "#3498db", colour = "white", alpha = 0.8) +
  labs(x = "Height (m)", y = "Frequency (n)") +
  scale_x_continuous(breaks = seq(1.2, 2.0, 0.2)) +
  theme_minimal(base_size = 14)

Figure 5.2: Histogram of the heights of 5,682 women aged 25-64

The histogram above shows that the most frequent grouping of heights occurs at 1.6-1.65 m. There is a tendency for the bars of the histogram to cluster around this central modal group in a symmetric fashion. This is a classic example of a Normal distribution which displays a symmetric bell shape.

Not all histograms are symmetric; skewness can exist in the distribution of values.

code

set.seed(42)

# Symmetric
symmetric <- tibble(x = rnorm(1000, 50, 10), type = "Symmetric (bell shaped)")

# Positively skewed (long right tail)
positive_skew <- tibble(x = rgamma(1000, shape = 2, scale = 10), type = "Positively skewed\n(long tail to the right)")

# Negatively skewed (long left tail)
negative_skew <- tibble(x = 100 - rgamma(1000, shape = 2, scale = 10), type = "Negatively skewed\n(long tail to the left)")

all_data <- bind_rows(symmetric, positive_skew, negative_skew)
all_data$type <- factor(all_data$type, levels = c("Symmetric (bell shaped)", 
                                                   "Positively skewed\n(long tail to the right)",
                                                   "Negatively skewed\n(long tail to the left)"))

ggplot(all_data, aes(x = x)) +
  geom_histogram(bins = 25, fill = "#3498db", colour = "white", alpha = 0.8) +
  facet_wrap(~type, scales = "free") +
  labs(x = "Variable", y = "Frequency") +
  theme_minimal(base_size = 12)

Figure 5.3: Examples of symmetric, positively skewed, and negatively skewed distributions

Dot Plots

Dot plots are a simple method of conveying as much information as possible by showing all of the data. It retains individual subject values and clearly demonstrates differences between groups. An additional advantage is that outliers will be detected.

code

set.seed(456)
psa_data <- tibble(
  patient = rep(1:41, 2),
  timepoint = rep(c("Initial", "Week 18"), each = 41),
  psa = c(
    rlnorm(41, meanlog = 2, sdlog = 0.8),  # Initial - higher values
    rlnorm(41, meanlog = 0.5, sdlog = 0.6)  # Week 18 - lower values
  )
)

psa_data$timepoint <- factor(psa_data$timepoint, levels = c("Initial", "Week 18"))

ggplot(psa_data, aes(x = timepoint, y = psa)) +
  geom_jitter(width = 0.1, alpha = 0.7, colour = "#2c3e50", size = 2) +
  labs(x = "", y = "PSA (ng/mL)") +
  theme_minimal(base_size = 14)

Figure 5.4: PSA levels before treatment and at week 18 among men who received SABR for prostate cancer

Box Plots

A boxplot is used for discrete and continuous data. It indicates 5 important statistics which describe the distribution of observations:

Minimum value
Lower quartile (Q1) - 25% of data values below this line
Median (Q2) - 50% of data values below this line
Upper quartile (Q3) - 75% of data values below this line
Maximum value

The median is the middle observation in a set of data. 50% of the sample lie below, and 50% lie above the median. It can be seen from the figure that 50% of the data sample lie between the lower quartile (Q1) and the upper quartile (Q3).

code

set.seed(789)
nlr_data <- tibble(
  sex = rep(c("Women", "Men"), c(50, 60)),
  nlr = c(
    rlnorm(50, meanlog = 1.0, sdlog = 0.5),
    rlnorm(60, meanlog = 1.1, sdlog = 0.6)
  )
)

ggplot(nlr_data, aes(x = sex, y = nlr)) +
  geom_boxplot(fill = "#3498db", alpha = 0.6, outlier.colour = "#e74c3c") +
  labs(x = "", y = "Neutrophil-lymphocyte ratio") +
  theme_minimal(base_size = 14)

Figure 5.5: Pre-treatment neutrophil-lymphocyte ratio in men and women who received SABR for lung cancer

5.4 Graphs for Bivariate Continuous Data

Scatterplots

Scatterplots show the relationship between two numeric variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point in the plot.

Two variables are positively associated when higher values of one variable tend to accompany higher values of the other, and lower values tend to occur together.

Two variables are negatively associated when higher values of one variable tend to accompany lower values of the other.

code

set.seed(321)
n <- 80
blood_data <- tibble(
  platelets = runif(n, 100, 450),
  wbc = 5 + 0.015 * platelets + rnorm(n, 0, 2.5)
)

ggplot(blood_data, aes(x = platelets, y = wbc)) +
  geom_point(colour = "#2c3e50", alpha = 0.7, size = 2) +
  labs(x = "Platelets", y = "White blood cells") +
  theme_minimal(base_size = 14)

Figure 5.6: Scatterplot showing relationship between platelets and white blood cell counts

The scatterplot above shows a positive association between platelets and white blood cell counts. Here’s another example showing the relationship between neutrophils and white blood cells:

code

set.seed(987)
n <- 80
neutrophil_data <- tibble(
  neutrophils = runif(n, 2, 8),
  wbc = 3 + 0.8 * neutrophils + rnorm(n, 0, 1.5)
)

ggplot(neutrophil_data, aes(x = neutrophils, y = wbc)) +
  geom_point(colour = "#e74c3c", alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, colour = "#2c3e50", linetype = "dashed", linewidth = 0.8) +
  labs(x = "Neutrophils (10⁹/L)", y = "White blood cells (10⁹/L)") +
  theme_minimal(base_size = 14)

Figure 5.7: Scatterplot showing relationship between neutrophils and white blood cell counts

This scatterplot demonstrates a strong positive association - as neutrophil counts increase, white blood cell counts tend to increase as well. The dashed line shows the linear trend in the data.

5.5 Graphs for Categorical Data

Bar Charts

Bar charts are typically used for categorical data (but can be employed for discrete numerical data). Data can be nominal or ordinal.

code

# Create bar chart anatomy diagram
category_data <- tibble(
  category = factor(c("A", "B", "C", "D"), levels = c("A", "B", "C", "D")),
  frequency = c(5, 12, 8, 15)
)

ggplot(category_data, aes(x = category, y = frequency)) +
  geom_col(fill = "#3498db", colour = "#2980b9", alpha = 0.7, width = 0.7) +
  # Y-axis annotation
  annotate("segment", x = 0.3, xend = 0.3, y = 0, yend = 15,
           arrow = arrow(length = unit(0.3, "cm"), ends = "both"), colour = "#e74c3c", linewidth = 1) +
  annotate("text", x = 0.1, y = 7.5, label = "Frequency\n(y-axis)", colour = "#e74c3c", size = 3.5, fontface = "bold", hjust = 0.5) +
  # X-axis annotation
  annotate("segment", x = 0.5, xend = 4.5, y = -2, yend = -2,
           arrow = arrow(length = unit(0.3, "cm"), ends = "both"), colour = "#27ae60", linewidth = 1) +
  annotate("text", x = 2.5, y = -3.2, label = "Categories (x-axis)", colour = "#27ae60", size = 3.5, fontface = "bold") +
  # Gap annotation
  annotate("segment", x = 1.35, xend = 1.65, y = 16.5, yend = 16.5,
           arrow = arrow(length = unit(0.2, "cm"), ends = "both"), colour = "#8e44ad", linewidth = 0.8) +
  annotate("text", x = 1.5, y = 17.5, label = "Gap between bars\n(categorical data)", colour = "#8e44ad", size = 3, fontface = "bold") +
  # Bar label annotation
  annotate("text", x = 4, y = 15/2, label = "Bar height =\nfrequency", colour = "#e67e22", size = 3, fontface = "bold") +
  annotate("segment", x = 3.5, xend = 4, y = 15/2, yend = 15,
           arrow = arrow(length = unit(0.15, "cm")), colour = "#e67e22", linewidth = 0.6) +
  annotate("segment", x = 3.5, xend = 4, y = 15/2, yend = 0,
           arrow = arrow(length = unit(0.15, "cm")), colour = "#e67e22", linewidth = 0.6) +
  labs(x = "", y = "") +
  scale_y_continuous(limits = c(-4, 19), breaks = seq(0, 15, 5)) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())

Figure 5.8: Anatomy of a bar chart showing key components

code

stage_data <- tibble(
  stage = factor(c("Ia", "Ib", "IIa", "IIb", "IIIa", "IIIb"), 
                 levels = c("Ia", "Ib", "IIa", "IIb", "IIIa", "IIIb")),
  frequency = c(7, 4, 11, 5, 46, 20)
)

ggplot(stage_data, aes(x = stage, y = frequency)) +
  geom_col(fill = "#3498db", alpha = 0.8) +
  labs(x = "Clinical stage", y = "Frequency") +
  theme_minimal(base_size = 14)

Figure 5.9: Clinical stage among 93 radical radiotherapy NSCLC patients

Pie Charts

Pie charts illustrate nominal data. Each slice of the pie represents a category. The size of the slice is proportional to the relative frequency of that category. It is best utilised with 3-5 categories.

code

location_data <- tibble(
  location = c("Left upper lobe", "Left lower lobe", "Right upper lobe", 
               "Right lower lobe", "Right middle lobe"),
  percent = c(38, 11, 30, 20, 1)
)

ggplot(location_data, aes(x = "", y = percent, fill = location)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Set2") +
  labs(fill = "Tumour location") +
  theme_void(base_size = 12) +
  theme(legend.position = "right")

Figure 5.10: Tumour location among 93 patients who received radical radiotherapy for lung cancer

5.6 Tables and Graphs for Bivariate Categorical Data

Contingency Tables

When the interest is in the relationship between two qualitative variables a contingency table (a cross-tabulation) is created. This has the categories for one variable as rows and the categories of the other variable as columns. Each cell of the table has a count of the number of individuals in both categories.

The following example concerns the two-year follow-up of 42 men and 60 women following stereotactic radiotherapy for lung cancer. In total 25 patients had died and 77 were still alive at the two-year follow-up.

code

contingency <- tibble(
  `Status at two years` = c("Alive", "Died", "Total"),
  `Women (n)` = c(48, 12, 60),
  `Men (n)` = c(29, 13, 42),
  `Total (n)` = c(77, 25, 102)
)

contingency |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 5.2: Cross-tabulation of status at two years by sex

Status at two years	Women (n)	Men (n)	Total (n)
Alive	48	29	77
Died	12	13	25
Total	60	42	102

Grouped Bar Charts

Grouped bar charts are useful for displaying contingency table data visually.

code

grouped_data <- tibble(
  sex = rep(c("Women", "Men"), each = 2),
  status = rep(c("Alive", "Died"), 2),
  count = c(48, 12, 29, 13)
)

ggplot(grouped_data, aes(x = sex, y = count, fill = status)) +
  geom_col(position = "dodge", alpha = 0.8) +
  scale_fill_manual(values = c("Alive" = "#27ae60", "Died" = "#e74c3c")) +
  labs(x = "", y = "Number of patients", fill = "Status at 2 years") +
  theme_minimal(base_size = 14)

Figure 5.11: Status at two years by sex following stereotactic radiotherapy for lung cancer

5.7 Summary

Graph Type	Data Type	Purpose
Histogram	Continuous numeric	Show distribution shape
Dot plot	Numeric	Show all individual values
Box plot	Numeric	Show median, quartiles, range
Scatter plot	Two numeric variables	Show relationship between variables
Bar chart	Categorical	Show frequencies of categories
Pie chart	Nominal (3-5 categories)	Show proportions
Contingency table	Two categorical variables	Show joint frequencies