Frequentist statistics and p-values

Learn about p-values and how to avoid common interpretation pitfalls

Author

Stefano Coretta

Published

September 27, 2024

Prerequisites

Frequentist and Bayesian statistics.

1 Introduction

The frequentist approach to probabilities has dominated the field of applied statistics for about a century. Researchers are usually taught frequentist statistics and the majority of studies employs frequentist statistics. This is the framework that generates p-values.

Alas, it is now known that there is a lot of misunderstanding around the real meaning of p-values and around what frequentist statistics can do for researchers (Cassidy et al. 2019; Gigerenzer 2004; John, Loewenstein, and Prelec 2012; Flake and Fried 2020; Bochynska et al. 2023). This post tries to shed light on these issues and prepare you so that you can read the literature that uses frequentist statistics without been led astray by the common misinterpretations.

2 What is a p-value?

Let’s imagine we are interested in knowing if being a bilingual speaker has a facilitating effect in a memory task. We run a study with monolingual and bilingual participants in which they have to perform a memory task and we note their accuracy in recall. 0% accuracy means they got all the trials wrong and 100% accuracy means they got all trials right.

We want to compare the average accuracy of the monolingual and bilingual group of participants. Specifically, our (quite vague) hypothesis is that the accuracy of the monolingual group should be lower than that of the bilingual group. We decide to take a frequentist approach to assess this hypothesis.

Warning

Get ready for a wild ride! Frequentist inference is just very counter-intuitive and our brains try to make it intuitive, by distorting the interpretation.

First we assume that the following hypothesis, called the null hypothesis, is true.

Null hypothesis \(H_0\)

The average accuracy of monolingual participants is the same as that of bilingual participants.

Let’s say we found that the average accuracy of monolinguals is 80% and that of bilinguals is 85%. The difference in accuracy between the two groups is 5 percent points.

Now we need to calculate the probability of finding a difference that is 5 percent points (i.e. the difference we found) or larger, given that the null hypothesis is true (i.e. given that in the real world there is no difference in accuracy between the two groups).

To calculate that probability, we run a statistical significance test and that gives us a p-value of 0.025.

This means there is a 2.5% probability that, even if in the real world there is no difference in accuracy between the two groups, we could find a difference of 5 percentage points or bigger.

So in the world in which there is no difference between groups, there is a quite low probability (2.5%) of finding a difference that is that big or bigger. Usually, we set a threshold below which we decide to take the risk of rejecting the null hypothesis: in social sciences, this is usually 0.05 (i.e. 5%). p-values that are below the threshold are said to be statistically significant. Those who are not below the threshold are said to be non-significant (please, don’t say “insignificant”).

Since the p-value we got, 0.025, is below the threshold of 0.05, we take the risk of rejecting the null hypothesis and we declare that the results are statistically significant.

Note that rejecting the null hypothesis does not mean we can accept the alternative hypothesis that there is a difference between the two groups. We did find a difference between the two groups, that’s 5 percentage points. But the only thing that frequentist statistics can tell us is whether we can take the risk or not of rejecting the null hypothesis.

(Classical) frequentist statistics, the Null Hypothesis Significance Testing approach, cannot corroborate the null hypothesis nor the alternative hypothesis. It only tells us if we can risk rejecting the null or not. Even if a difference between two groups is significant, that does not mean we have evidence for a difference. The only evidence we can get from frequentist statistics is for rejecting the null hypothesis (not even accepting it).

3 Statistical significance is not theoretically significant

But why does a p-value not tell us if the results we got are true? That’s because of the type of probability a p-value is: p-values are conditional probabilities.

p-value

A p-value is the probability of finding a difference as extreme or more extreme than the one found, given that there is no difference.

We could write this in mathematical notation as \(P(d|h)\), the probability of the data \(d\) (the difference we got), given \(h\) (the hypothesis that the difference is 0).

More often than not, we are actually intestered in a different probability, \(P(h|d)\): the probability of our hypothesis \(h\) (this does not have to be null) given the data we obtained (our results).

With non-conditional probabilities, like a coin toss, if the probability of head is \(p = 0.8\), then you know that the probability of tail is \(q = 1 - 0.8 = 0.2\). Alas, while \(q = 1 - p\), \(P(h|d)\) is not equal to \(1 - P(d|h)\) (nor \(P(h)\) for that mattter), that’s why a p-value does not tell you the probability of the hypothesis given the data nor the probability of the hypothesis.

Frequently, researchers interpret a significant result as evidence for the existence of a difference between groups but that is a mistake. You should be very careful when reading discussion sections in papers: you will very commonly find misinterpretations like this one.

Statistical significance is not theoretically meaningful because of the real nature of p-values. Moreover, statistical significance is binary: a result either is significant or not. There are no “almost significant” results or results “approaching significance”. Yet, these and similar wordings are very common in the literature.

Quiz 1

True or false?

The p-value is the probability that the null hypothesis is true.
The smaller the p-value the more probable the alternative hypothesis is.
The p-value is the probability that the result is due to chance alone.
The p-value is the probability of wrongly rejecting the null hypothesis.
The smaller the p-value the stronger the effect.
The p-value is the probability of finding an effect equal to or greater than the one found.

Explanation

The probability that the null hypothesis \(h\) is true would be \(P(h)\), but we have seen that a p-value is \(P(d|h)\).
The probability of the alternative hypothesis \(q\) is independent from the p-value \(P(d|h)\).
The probability that the result is due to chance is another way of saying the probability of the null hypothesis \(P(h)\), but we have seen that that is not what the p-value is. In other words, a p-value is not the probability of the null hypothesis.
The probability of wrongly rejecting the null hypothesis is the threshold we set for significance. It is also known as the alpha level (usually 0.05 in social sciences).
The magnitude of the difference (aka effect size) and the p-value are independent. You can get bigger differences with large p-values and smaller difference with smaller p-values.
That is almost correct. What is missing is the conditional part of the probability: given that the null hypothesis is true.

4 References

Bochynska, Agata, Liam Keeble, Caitlin Halfacre, Joseph V. Casillas, Irys-Amélie Champagne, Kaidi Chen, Melanie Röthlisberger, Erin M. Buchanan, and Timo B. Roettger. 2023. “Reproducible Research Practices and Transparency Across Linguistics.” Glossa Psycholinguistics 2 (1). https://doi.org/10.5070/g6011239.

Cassidy, Scott A., Ralitza Dimova, Benjamin Giguère, Jeffrey R. Spence, and David J. Stanley. 2019. “Failing Grade: 89 Per-Cent of Introduction to Psychology Textbooks That Define/Explain Statistical Significance Do so Incorrectly.” Advances in Methods and Practices in Psychological Science. https://doi.org/10.1177/2515245919858072.

Flake, Jessica Kay, and Eiko I. Fried. 2020. “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them.” Advances in Methods and Practices in Psychological Science 3 (4): 456–65. https://doi.org/10.1177/2515245920952393.

Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5): 587–606. https://doi.org/10.1016/j.socec.2004.09.033.

John, Leslie K., George Loewenstein, and Drazen Prelec. 2012. “Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling.” Psychological Science 23 (5): 524–32. https://doi.org/10.1177/0956797611430953.