8  Statistical Inference

8.1 The Logic of Statistical Inference

Until the end of the 17th century Europeans assumed that all swans were white. The hypothesis “All swans are white” was assumed to be true until the sighting of a black swan by Willem de Vlamingh in 1697. The black swan resulted in a rejection of the original null hypothesis (H0): “All swans are white” in favour of the alternative hypothesis (H1): “Not all swans are white”.

Statistical inference follows a similar logical process. Having come up with a research question, the procedure is to:

  1. Collect data from a sample of individuals
  2. Formulate an appropriate null hypothesis
  3. Assume it to be true
  4. Seek evidence to refute it

8.2 The Null Hypothesis

The null hypothesis, H0, is a statement of ‘no difference’ or ‘no effect’ which is assumed to be true.

For example, in a clinical trial of a new drug for hypertension, the null hypothesis might be that the new drug has a similar average effect on blood pressure as another drug in current use – i.e. that there is no difference between the drugs.

H0: there is no difference in the effect on blood pressure between the two drugs

8.3 The Alternative Hypothesis

The alternative hypothesis (H1 or HA) is the negation of the null hypothesis. It holds if the null hypothesis is not true. The alternative hypothesis relates more directly to the theory we are interested in.

In the anti-hypertensive example, we might have:

H1: the effects of the two anti-hypertensive drugs are not equal

8.4 The Test Statistic and P-values

Having set up the null hypothesis, we evaluate the probability that the observed data (or more extreme data) would be obtained if the null hypothesis were true. This is done by calculating a test statistic: a numerical summary of the sample data that is known to follow a specific probability distribution when the null hypothesis is true.

The test statistic is used to test the null hypothesis: its value is referred to the relevant probability distribution to obtain a P-value. The smaller the P-value, the greater the evidence against the null hypothesis.
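As a concrete sketch, suppose a test statistic z is known to follow a standard Normal distribution when the null hypothesis is true (the value of z below is invented for illustration). The two-sided P-value is the probability of a statistic at least as extreme in either direction:

code
# Invented test statistic, assumed to follow N(0, 1) under H0
z <- 2.1

# Two-sided P-value: probability of a value at least this extreme
# (in either direction) if H0 were true
p_value <- 2 * pnorm(-abs(z))
p_value  # roughly 0.036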

8.5 Using the P-value

A P-value less than 0.05 (P < 0.05) is conventionally considered enough evidence to reject the null hypothesis. P < 0.05 indicates only a small chance that the observed results (or more extreme results) would have occurred if the null hypothesis were true.

The null hypothesis is then rejected in favour of the alternative hypothesis and the results described as statistically significant at the 5% level.

In contrast, a P-value equal to or greater than 0.05 suggests insufficient evidence to reject the null hypothesis. The null hypothesis is not rejected, and the results are described as not statistically significant at the 5% level.

Key Point

This does not mean that the null hypothesis is true – just that it cannot be rejected.

8.6 One or Two-tailed Test?

In the above example the alternative hypothesis did not specify the direction for the difference in the effects of the two anti-hypertensive medications, i.e. it did not state whether the new drug provides better blood pressure control than the current drug or vice versa.

This is known as a two-tailed test because it allows for either eventuality.

In some circumstances, a one-tailed test, in which the direction of the difference is specified in H1, may be carried out. In general, one-tailed tests are discouraged, as it is rarely possible to know beforehand in which direction any difference will lie.
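In R's built-in test functions this choice is made through the alternative argument. A minimal sketch, using invented blood pressure reductions:

code
# Invented reductions in systolic blood pressure (mmHg)
new_drug     <- c(12, 9, 14, 8, 11, 13)
current_drug <- c(10, 7, 9, 11, 8, 9)

# Two-tailed: H1 is that the means differ in either direction
t.test(new_drug, current_drug, alternative = "two.sided")

# One-tailed: H1 specifies that the new drug gives a greater reduction
t.test(new_drug, current_drug, alternative = "greater")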

8.7 Making a Decision: Type I and Type II Errors

A Type I error leads to the conclusion that an effect or relationship exists when in fact it does not. Type I errors result from the incorrect rejection of a true null hypothesis (a “false positive”).

A Type II error is a failure to detect an effect that is present. It results from incorrectly retaining a false null hypothesis (a “false negative”).

code
library(tibble)      # tibble() for building the table
library(knitr)       # kable() for rendering
library(kableExtra)  # kable_styling() and add_header_above()

errors_table <- tibble(
  ` ` = c("H₀ true", "H₀ false"),
  `Reject H₀` = c("Type I error", "No error (correct)"),
  `Do not reject H₀` = c("No error (correct)", "Type II error")
)

errors_table |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover")) |>
  add_header_above(c(" " = 1, "Decision" = 2))  # spanning "Decision" header
Table 8.1: Type I and Type II errors in hypothesis testing

                         Decision
             Reject H₀             Do not reject H₀
H₀ true      Type I error          No error (correct)
H₀ false     No error (correct)    Type II error

In the previous example of a clinical trial of a new drug for hypertension:

H0: there is no difference in the effect on blood pressure between the two drugs

H1: the effects of the two drugs on blood pressure are not equal

A Type I error would occur if we concluded that the two drugs produced different effects when in fact there was no difference between them.

A Type II error would occur if we failed to reject the null hypothesis when there was a real difference in effect of the two drugs.

8.8 Alpha, Beta, and Power

The probability of making a Type I error, denoted by α (alpha), is simply the chosen significance level (conventionally 5%).

\[\text{Probability (Type I error)} = \text{Probability (reject null when true)} = \alpha\]

The chance of making a Type II error is denoted by β (beta):

\[\text{Probability (Type II error)} = \text{Probability (fail to reject null when false)} = \beta\]

The complement of β is (1-β). This is the probability of not making a Type II error:

\[\text{Probability (not Type II error)} = \text{Probability (reject null when false)} = 1 - \beta\]

(1 – β) is called the power of the test. The power, therefore, is the probability of rejecting the null hypothesis when it is false; i.e. it is the chance (usually expressed as a percentage) of detecting, as statistically significant, a real treatment effect.
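Power can also be illustrated by simulation. The sketch below, using arbitrary assumed means and standard deviation, repeatedly simulates trials in which a real difference exists and counts how often the null hypothesis is rejected at the 5% level:

code
set.seed(42)

# Assumed truth for illustration: a real difference of 5 mmHg, SD = 10
simulate_trial <- function(n = 50, diff = 5, sd = 10) {
  a <- rnorm(n, mean = 0,    sd = sd)
  b <- rnorm(n, mean = diff, sd = sd)
  t.test(a, b)$p.value < 0.05  # TRUE if H0 is rejected
}

# Proportion of simulated trials rejecting H0 = estimated power
mean(replicate(2000, simulate_trial()))  # roughly 0.7 under these assumptions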

8.9 P-values or Confidence Intervals?

We saw earlier that a confidence interval is a range of values within which the population value of the parameter of interest is likely to lie. The parameter might be the population mean or median, the mean difference between two groups, or a proportion.

Presenting study findings directly as confidence intervals provides information on the imprecision due to sampling variability, and has advantages over giving only P-values, which dichotomise results into significant or non-significant.

With a confidence interval, we can determine whether a parameter is or is not likely to differ from a specified value:

  • If the confidence interval contains a specific number (i.e. the number is between the lower and upper values of the interval), then there is no evidence that the parameter is different from that number
  • If the number is not within the interval, then there is evidence that the parameter is different from that number
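For example, t.test() reports a 95% confidence interval for the difference in means; whether the interval contains zero tells the same story as the P-value. A sketch with invented data:

code
# Invented blood pressure reductions for two drugs (mmHg)
new_drug     <- c(12, 9, 14, 8, 11, 13)
current_drug <- c(10, 7, 9, 11, 8, 9)

result <- t.test(new_drug, current_drug)

# 95% CI for the difference in means: if it excludes 0,
# there is evidence that the drugs differ
result$conf.int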

8.10 Parametric Tests for Numeric Data

If data are numeric and come from a Normal distribution we can use parametric tests to test whether the population mean equals a specific value, or whether the means from two samples are equal.

Parametric tests need the assumption that the data derive from a Normal distribution. If this assumption cannot be met (even after transformation) then non-parametric tests must be used.

code
# Summary of parametric tests by study design (packages loaded above)
parametric_tests <- tibble(
  Situation = c("1 sample", "2 independent samples", "2 paired samples"),
  Test = c("Student's t-test", "Student's two-sample t-test", "Student's paired t-test")
)

parametric_tests |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 8.2: Parametric tests for Normally distributed data

Situation                Test
1 sample                 Student's t-test
2 independent samples    Student's two-sample t-test
2 paired samples         Student's paired t-test
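These situations map directly onto base R's t.test() function; note that for two independent samples t.test() performs Welch's test unless var.equal = TRUE is set, which gives the classical Student's test. A sketch with invented data:

code
x <- c(5.1, 4.8, 5.6, 5.2, 4.7, 5.3)  # invented sample
y <- c(4.4, 4.9, 4.2, 4.6, 4.5, 4.3)  # invented second sample

t.test(x, mu = 5)               # 1 sample: is the population mean 5?
t.test(x, y, var.equal = TRUE)  # 2 independent samples (Student's)
t.test(x, y, paired = TRUE)     # 2 paired samples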

8.11 Non-parametric Tests for Numeric Data

Non-parametric tests compare the sample median with a specific value, or test whether the medians of the samples are equal in the wider population.

Non-parametric tests are less powerful than parametric tests: the probability of rejecting the null hypothesis when it is false is smaller.

code
# Summary of non-parametric tests by study design (packages loaded above)
nonparametric_tests <- tibble(
  Situation = c("1 sample", "2 independent samples", "2 paired samples"),
  Test = c("Wilcoxon signed rank test", "Wilcoxon Mann-Whitney test", "Wilcoxon signed rank test, or paired sign test")
)

nonparametric_tests |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 8.3: Non-parametric tests for non-Normally distributed data

Situation                Test
1 sample                 Wilcoxon signed rank test
2 independent samples    Wilcoxon Mann-Whitney test
2 paired samples         Wilcoxon signed rank test, or paired sign test
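The corresponding base R function is wilcox.test(), which covers all three situations. A sketch with invented, skewed data:

code
x <- c(5.1, 4.8, 5.6, 5.2, 4.7, 9.9)  # invented sample with an outlier
y <- c(4.4, 4.9, 4.2, 4.6, 4.5, 4.3)  # invented second sample

wilcox.test(x, mu = 5)            # 1 sample: Wilcoxon signed rank
wilcox.test(x, y)                 # 2 independent: Wilcoxon Mann-Whitney
wilcox.test(x, y, paired = TRUE)  # 2 paired: Wilcoxon signed rank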

8.12 Tests for Categorical Data

Tests for categorical data are concerned with comparing the proportions in each category of a variable. Just as for numeric data, a special analysis is required if paired data are involved.

code
# Summary of tests for categorical data (packages loaded above)
categorical_tests <- tibble(
  Situation = c("Unpaired, large sample (expected counts >5)", 
                "Unpaired, small sample (empty cells)",
                "Paired data"),
  Test = c("Pearson χ² test", "Fisher's exact test", "McNemar's test")
)

categorical_tests |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 8.4: Tests for categorical data

Situation                                      Test
Unpaired, large sample (expected counts >5)    Pearson χ² test
Unpaired, small sample (empty cells)           Fisher's exact test
Paired data                                    McNemar's test
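Base R provides chisq.test(), fisher.test(), and mcnemar.test() for these situations. A sketch using invented counts:

code
# Invented unpaired 2x2 table: treatment group vs outcome
counts <- matrix(c(30, 10,
                   20, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Drug = c("New", "Current"),
                                 Outcome = c("Improved", "Not improved")))

chisq.test(counts)   # unpaired, large sample (expected counts > 5)
fisher.test(counts)  # unpaired, small sample

# For McNemar's test the table instead cross-classifies paired
# responses, e.g. the same patients assessed before and after:
paired <- matrix(c(20, 5,
                   12, 13),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Before = c("High BP", "Normal"),
                                 After  = c("High BP", "Normal")))
mcnemar.test(paired)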

8.13 Sample Size and Power Considerations

How many subjects should be included in a study is a common consideration. If a study has too few people, the power to detect a statistically significant effect will be low. On the other hand, a sample size larger than required can be difficult and expensive to achieve.

Recruiting patients to a study which will be too small to detect the minimum effect we are looking for (under-powered), or recruiting more patients than necessary (over-powered), can be considered unethical.

Factors for Sample Size Calculation

To establish the sample size needed for a study the following factors should be considered:

  1. The minimum size of the effect to be detected
  2. The variability (standard deviation)
  3. The power required
  4. The significance level

For a chosen significance level, power, minimum size of effect to be detected and standard deviation, the sample size needed can be calculated.
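Base R's power.t.test() performs this calculation for t-tests; the difference, standard deviation, and power below are assumed values chosen for illustration:

code
# Sample size per group to detect an assumed difference of 5 mmHg,
# assuming SD = 10, at the 5% significance level with 90% power
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.90)
# gives n of roughly 85 per group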

General Principles

Power is the probability of rejecting the null hypothesis when it is false. In general:

  • As the sample size increases, the power increases
  • As the variability (standard deviation) increases, the power decreases
  • As the minimum size of effect to be detected increases, the power increases (i.e. small effects are more difficult to detect)

To increase the power of a study:

  • The sample size can be increased
  • The minimum size of the effect you are trying to detect can be increased

The significance level is not affected by choice of power or sample size. It is the decision rule that you employ in the study.
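The same function can be used to illustrate these general principles by varying one factor at a time (assumed values again):

code
# Power increases with sample size (difference = 5, SD = 10 fixed)
sapply(c(25, 50, 100), function(n) power.t.test(n = n, delta = 5, sd = 10)$power)

# Power decreases as variability increases (n = 50 per group, difference = 5)
sapply(c(5, 10, 20), function(s) power.t.test(n = 50, delta = 5, sd = s)$power)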

8.14 Summary

Concept                        Definition
Null hypothesis (H0)           Statement of ‘no difference’ or ‘no effect’
Alternative hypothesis (H1)    Negation of the null hypothesis
P-value                        Probability of the observed (or more extreme) results if H0 is true
Statistically significant      P < 0.05 (conventionally)
Type I error (α)               Rejecting a true H0 (“false positive”)
Type II error (β)              Failing to reject a false H0 (“false negative”)
Power (1 − β)                  Probability of detecting a real effect
Confidence interval            Range likely to contain the population parameter