10 Odds, Odds Ratios and Logistic Regression

10.1 Odds vs Risk

Odds are simply another way of describing probability. Odds are calculated by dividing the number of times an event happens by the number of times it does not happen.

If one in every 100 patients suffers a side-effect from a treatment, the odds are:

\[\text{Odds} = 1:99 = \frac{1}{99} \approx 0.0101\]

Risk, on the other hand, indicates the probability that an event will happen. It is calculated by dividing the number of events by the number of people at risk. In the example above the risk would be:

\[\text{Risk} = \frac{1}{100} = 0.01\]

Key Difference

Odds = Number of events / Number of non-events
Risk = Number of events / Total number at risk

While similar for rare events, odds and risk diverge as events become more common.
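
The divergence can be checked numerically. The sketch below is illustrative only: the risks are arbitrary values chosen for the example rather than study data, and `risk_to_odds()` and `odds_to_risk()` are hypothetical helper names.

```r
# Hypothetical helpers: convert between risk (a probability) and odds
risk_to_odds <- function(risk) risk / (1 - risk)
odds_to_risk <- function(odds) odds / (1 + odds)

# Arbitrary illustrative risks, from rare to common events
risk <- c(0.01, 0.05, 0.10, 0.25, 0.50, 0.75)

comparison <- data.frame(
  risk = risk,
  odds = risk_to_odds(risk),
  # Ratio of odds to risk: close to 1 when the event is rare
  odds_over_risk = risk_to_odds(risk) / risk
)

print(round(comparison, 4))
```

At a risk of 0.01 the odds are 1/99, matching the side-effect example above. At a risk of 0.5 the odds are 1, which is twice the risk.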

10.2 Odds Ratios

An odds ratio is calculated by dividing the odds in one group of patients (e.g. cases) by the odds in a comparison group of patients (e.g. controls).

An odds ratio of 1 indicates no difference between the groups, i.e. the odds in each group are the same.

Odds Ratio	Interpretation
= 1	No difference in odds between groups
> 1	Increased odds of exposure in cases
< 1	Reduced odds of exposure in cases

Odds ratios are usually reported with 95% confidence intervals. If the 95% confidence interval does not include 1 (no difference in odds), the odds ratio is statistically significant at the 5% level.

The 2 × 2 Table for Odds Ratios

In a case-control study, patients are selected on the basis of their disease status. We compare the odds of exposure between cases (those with disease) and controls (those without disease).

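library(tibble)      # tibble()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), add_header_above()

or_table <- tibble(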
  ` ` = c("Exposed", "Unexposed", "**Total**", "Odds of exposure"),
  `Case (Disease)` = c("a", "c", "a + c", "a / c"),
  `Control (No Disease)` = c("b", "d", "b + d", "b / d"),
  `Total` = c("a + b", "c + d", "n", "")
)

or_table |>
  kable(escape = FALSE) |>
  kable_styling(bootstrap_options = c("striped", "hover")) |>
  add_header_above(c(" " = 1, "Disease Status" = 2, " " = 1))

Table 10.1: Structure of a 2 × 2 table for calculating odds ratios

	Disease Status
	Case (Disease)	Control (No Disease)	Total
Exposed	a	b	a + b
Unexposed	c	d	c + d
Total	a + c	b + d	n
Odds of exposure	a / c	b / d

The odds ratio (OR) compares the odds of exposure in cases to the odds of exposure in controls:

\[OR = \frac{\text{Odds of exposure in cases}}{\text{Odds of exposure in controls}} = \frac{a/c}{b/d} = \frac{ad}{bc}\]

Worked Example: HPV and Oropharyngeal Cancer

A case-control study investigated whether human papillomavirus (HPV) infection was associated with oropharyngeal squamous cell carcinoma. Researchers recruited 250 patients with newly diagnosed oropharyngeal cancer (cases) and 250 age- and sex-matched patients without cancer (controls). HPV status was determined by serology testing.

hpv_data <- tibble(
  `HPV Status` = c("HPV positive", "HPV negative", "**Total**", "Odds of exposure"),
  `Case (Cancer)` = c("175 (a)", "75 (c)", "**250**", "175 / 75"),
  `Control (No Cancer)` = c("50 (b)", "200 (d)", "**250**", "50 / 200"),
  `Total` = c("225", "275", "**500**", "")
)

hpv_data |>
  kable(escape = FALSE) |>
  kable_styling(bootstrap_options = c("striped", "hover")) |>
  add_header_above(c(" " = 1, "Disease Status" = 2, " " = 1))

Table 10.2: Case-control study of HPV infection and oropharyngeal cancer

	Disease Status
HPV Status	Case (Cancer)	Control (No Cancer)	Total
HPV positive	175 (a)	50 (b)	225
HPV negative	75 (c)	200 (d)	275
Total	250	250	500
Odds of exposure	175 / 75	50 / 200

Calculating the odds ratio:

# Values from the 2×2 table
a <- 175  # Exposed cases (HPV positive with cancer)
b <- 50   # Exposed controls (HPV positive without cancer)
c <- 75   # Unexposed cases (HPV negative with cancer)
d <- 200  # Unexposed controls (HPV negative without cancer)

# Odds of HPV exposure in cases
odds_cases <- a / c

# Odds of HPV exposure in controls
odds_controls <- b / d

# Odds ratio
or <- odds_cases / odds_controls
# Equivalently: or <- (a * d) / (b * c)

Step 1: Calculate the odds of HPV exposure in cases (patients with cancer):

\[\text{Odds in cases} = \frac{a}{c} = \frac{175}{75} \approx 2.33\]

Step 2: Calculate the odds of HPV exposure in controls (patients without cancer):

\[\text{Odds in controls} = \frac{b}{d} = \frac{50}{200} = 0.25\]

Step 3: Calculate the odds ratio:

\[OR = \frac{a/c}{b/d} \approx \frac{2.33}{0.25} \approx 9.3\]

Or equivalently:

\[OR = \frac{ad}{bc} = \frac{175 \times 200}{50 \times 75} = \frac{35000}{3750} \approx 9.3\]

Interpretation: The odds of HPV exposure in patients with oropharyngeal cancer are about 9.3 times the odds in controls. This strong positive association suggests that HPV infection is an important risk factor for this malignancy.

Clinical Significance

An odds ratio of 9.3 indicates a very strong association between HPV infection and oropharyngeal cancer. If the 95% confidence interval excludes 1.0, the association is statistically significant at the 5% level. The finding is consistent with published literature identifying HPV infection as a cause of oropharyngeal cancer. HPV-positive oropharyngeal cancers also have distinct biology and a generally better prognosis than HPV-negative tumours.
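
Whether the interval excludes 1.0 can be checked directly from the counts in Table 10.2. The sketch below uses the standard large-sample (Woolf) approximation for the standard error of the log odds ratio. That choice is an assumption made for illustration: the chapter does not specify a method, and this unmatched formula ignores the age and sex matching in the study design. The `n_` prefixed names are introduced here to avoid masking base R's `c()`.

```r
# Counts from Table 10.2
n_a <- 175  # HPV positive cases
n_b <- 50   # HPV positive controls
n_c <- 75   # HPV negative cases
n_d <- 200  # HPV negative controls

odds_ratio <- (n_a * n_d) / (n_b * n_c)

# Standard error of log(OR) by Woolf's method
se_log_or <- sqrt(1 / n_a + 1 / n_b + 1 / n_c + 1 / n_d)

# Approximate 95% CI: build it on the log scale, then back-transform
z <- qnorm(0.975)
ci <- exp(log(odds_ratio) + c(-1, 1) * z * se_log_or)

round(c(OR = odds_ratio, lower = ci[1], upper = ci[2]), 2)
```

Under these assumptions the interval runs from roughly 6.2 to 14.1, well above 1, so the association would be statistically significant.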

10.3 Logistic Regression

Logistic regression is similar to linear regression but is used when the outcome variable is binary (e.g. having a disease or not) as opposed to continuous.

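The coefficients in a logistic regression are log odds ratios; exponentiating a coefficient gives an odds ratio. That odds ratio is the factor by which the odds of the event are multiplied for each one-unit increase in the explanatory variable, holding the other variables constant. Subtracting 1 and multiplying by 100 expresses it as a percentage change in the odds.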
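
A minimal sketch of this in R, using base R's `glm()` on simulated data. Everything about the data is an assumption made for illustration: the sample size, the single predictor `exposure`, the seed, and the "true" coefficient of 0.7 are arbitrary and do not come from any study.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated data (not from a real study)
n <- 500
exposure <- rnorm(n)                  # hypothetical continuous predictor
true_log_or <- 0.7                    # assumed true coefficient (log odds ratio)
p <- plogis(-0.5 + true_log_or * exposure)
outcome <- rbinom(n, size = 1, prob = p)

# Logistic regression: binomial family with the default logit link
fit <- glm(outcome ~ exposure, family = binomial)

log_or <- coef(fit)          # coefficients are on the log-odds scale
odds_ratios <- exp(log_or)   # exponentiate to obtain odds ratios

# Wald 95% CIs, computed on the log scale and back-transformed
ci <- exp(confint.default(fit))

round(cbind(log_or, odds_ratios, ci), 3)
```

The `exposure` row gives the estimated odds ratio per one-unit increase in the predictor. The `(Intercept)` row, once exponentiated, is the odds of the outcome when `exposure` is zero rather than an odds ratio.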

Practical Application

Logistic regression is commonly used in medical research to:

- Predict disease risk based on multiple factors
- Identify risk factors for binary outcomes (e.g., death vs survival)
- Adjust for confounding variables when examining associations

Worked Example: Logistic Regression for Treatment Response

A study investigated factors predicting complete response to chemotherapy in 150 cancer patients. The outcome was binary (complete response: yes/no) and predictors included age, tumour size, and performance status.

# Create example logistic regression output
logistic_results <- tibble(
  Predictor = c("Intercept", "Age (per year)", "Tumour size (per cm)", "Performance status (1 vs 0)"),
  `Coefficient (log OR)` = c(2.45, -0.03, -0.42, -1.15),
  `Odds Ratio` = c(11.59, 0.97, 0.66, 0.32),
  `95% CI Lower` = c(3.21, 0.95, 0.51, 0.15),
  `95% CI Upper` = c(41.85, 0.99, 0.85, 0.67),
  `P-value` = c("<0.001", "0.041", "0.002", "0.003")
)

logistic_results |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 10.3: Logistic regression output for predicting complete response to chemotherapy

Predictor	Coefficient (log OR)	Odds Ratio	95% CI Lower	95% CI Upper	P-value
Intercept	2.45	11.59	3.21	41.85	<0.001
Age (per year)	-0.03	0.97	0.95	0.99	0.041
Tumour size (per cm)	-0.42	0.66	0.51	0.85	0.002
Performance status (1 vs 0)	-1.15	0.32	0.15	0.67	0.003

Interpretation:

Intercept: exp(2.45) = 11.59
- This is the baseline odds of complete response when every predictor equals zero, not an odds ratio, and it is rarely interpreted on its own
Age: OR = 0.97 (95% CI: 0.95-0.99, p = 0.041)
- For each additional year of age, the odds of complete response decrease by 3% (1 - 0.97 = 0.03), holding tumour size and performance status constant
- Older patients have slightly lower odds of complete response
Tumour size: OR = 0.66 (95% CI: 0.51-0.85, p = 0.002)
- For each additional cm of tumour size, the odds of complete response decrease by 34% (1 - 0.66 = 0.34), holding age and performance status constant
- Larger tumours have significantly lower odds of complete response
Performance status: OR = 0.32 (95% CI: 0.15-0.67, p = 0.003)
- Patients with performance status 1 have 68% lower odds (1 - 0.32 = 0.68) of complete response than those with performance status 0, holding age and tumour size constant
- Poorer performance status is a strong negative predictor
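
The percentage changes quoted in this interpretation follow from exponentiating the coefficients in Table 10.3. The sketch below reproduces that arithmetic; the vector names and the 10-year comparison are introduced here for illustration only.

```r
# Coefficients (log odds ratios) copied from Table 10.3
log_or <- c(
  age_per_year       = -0.03,
  tumour_size_per_cm = -0.42,
  performance_status = -1.15
)

odds_ratio <- exp(log_or)

# Percentage change in odds per one-unit increase (negative = decrease)
pct_change <- (odds_ratio - 1) * 100

round(data.frame(log_or, odds_ratio, pct_change), 2)

# Effects multiply on the odds scale, so a hypothetical 10-year age
# difference corresponds to exp(10 * beta), not 10 times the yearly change
or_10_years <- exp(10 * log_or[["age_per_year"]])
round(or_10_years, 2)
```

Because the model is linear on the log-odds scale, the 10-year odds ratio is about 0.74, a 26% reduction in odds rather than 10 × 3% = 30%.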

Statistical Significance

All three predictors are statistically significant (p < 0.05), and none of the 95% confidence intervals include 1.0. This indicates that age, tumour size, and performance status are all independently associated with complete response when adjusting for the other variables.
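
The same coefficients can be turned into a predicted probability of complete response for an individual patient by applying the inverse logit to the linear predictor. This is a sketch under stated assumptions: the coefficients are the rounded values from Table 10.3, `predict_response()` is a hypothetical helper, and the two example patients (age 60, 3 cm tumour) are invented for illustration.

```r
# Coefficients from Table 10.3 (log-odds scale, rounded as published)
beta <- c(intercept = 2.45, age = -0.03, size_cm = -0.42, ps1 = -1.15)

# Hypothetical helper: predicted probability of complete response
predict_response <- function(age, size_cm, ps1) {
  log_odds <- beta[["intercept"]] + beta[["age"]] * age +
    beta[["size_cm"]] * size_cm + beta[["ps1"]] * ps1
  plogis(log_odds)  # inverse logit: exp(x) / (1 + exp(x))
}

# Two invented patients who differ only in performance status
p_ps0 <- predict_response(age = 60, size_cm = 3, ps1 = 0)
p_ps1 <- predict_response(age = 60, size_cm = 3, ps1 = 1)
round(c(ps0 = p_ps0, ps1 = p_ps1), 3)

# The ratio of their odds recovers the performance-status odds ratio
odds <- function(p) p / (1 - p)
odds(p_ps1) / odds(p_ps0)
```

The final line returns exp(-1.15), about 0.32, whatever age and tumour size are chosen. This is what it means for the odds ratio to be adjusted for the other variables in the model.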

10.4 Summary

Concept	Formula	Use
Odds	Events / Non-events	Alternative to probability
Risk	Events / Total at risk	Probability of event
Odds Ratio	(a/c) / (b/d) = ad/bc	Compare odds of exposure between cases and controls
Logistic Regression	log(p / (1 - p)) = b0 + b1x1 + ... + bkxk	Predict and explain binary outcomes; exp(b) is an odds ratio