XII. The Neyman-Pearson Approach

Reductio Ad Absurdum

We have seen the Bayesian schema inform us about how to update our beliefs, and the likelihood analysis schema disregard the priors and instead tell us about relative evidential strength. While it is possible to set up decision thresholds, such as “Bayes factor above 4” or “likelihood ratio above 32”, an argument could be made that these approaches do not support behavioral, binary decisions very well – such as confirming or rejecting a hypothesis – because, conceptually, wise decision-making should not necessarily be based on beliefs or relative evidence. Rather, the top priority could be to control how often, on average, we will make the wrong decision. This is the logical basis of the methodologically more complex yet by far most widespread statistical procedure, known as the Neyman-Pearson approach, or “null hypothesis significance testing”. It was institutionalized in the 1950s and is often described as the backbone of statistics, when, in reality, it is just one instrument of many in the toolbox that is statistics.

Central to this logic is the notion of the testing procedure itself being part of an infinite population, called a “reference class”. For example, the toss of a coin could be considered as a sample from a collective of all possible outcomes. If, in this reference class, “head” occupies 40%, then this will be reflected in a 40% long-run frequency of heads. This long-run frequency is what defines probability, such that a single event cannot meaningfully be said to have one – only the reference class has a probability attached to it. This idea is often described as meaning that probability is “objective”, but how the reference class is conceived is dependent on our subjective information. If we knew the aerodynamics surrounding the coin, the reference class would be narrowed down to the indefinite number of realizable tosses with those particular wind conditions.

More importantly, this conceptualization implies that, because a hypothesis has no reference class (it is not part of a well-defined space of possible hypotheses), it does not have an objective probability. It is either true or false. Our test’s reference class, meanwhile, has one, so our decision as to whether a hypothesis is true or false will have a long-run error rate attached to it. In essence, therefore, we ask: “If we were to conduct the exact same test and procedure again, with the same amount of subjective ignorance and information, an infinite number of times, how often would the true/false test give the wrong answer?”

The "reference class" represents a category of potential replications of the experiment, in which everything except the influence of chance is identical.

The true/false test in question is performed not on the hypothesis we are curious about, called the “alternative hypothesis”, but on the “null hypothesis”. This is occasionally defined as the one that is most costly to reject falsely – our default explanation that we don’t abandon unless we have to, because it is simple and approximates well. In practice, it is often, but far from always, the hypothesis of no difference (a nil hypothesis such as “all the samples from different conditions come from the same population”). 

As such, the test is somewhat equivalent to a mathematical type of proof known as reductio ad absurdum, in which a statement is assumed to be true, and if it can be legally manipulated so as to result in self-contradiction, this means that the original statement must have been false. Except, instead of self-contradiction, the rejection criterion is now based on extreme improbability. In other words, if the outcome is very improbable by happenstance, i.e. to have occurred as a result of chancy fluctuations, then this suggests that the experimental manipulation (or some other, unknown nonrandom influence) must be held accountable for it. We may say that an observed difference is significant, meaning that we reject the null, without directly quantifying the probability of the alternative hypothesis, since – given that only reference classes have probabilities – this would be meaningless.

We thus actively seek disconfirming evidence, something that we intuitively are very poor at, and place the burden on the alternative explanation, which costs more in terms of parsimony compared to the default (“it’s all due to chance” is more parsimonious). Rejecting the default hypothesis on the basis of low probability means that we again are concerned with a conditional probability. However, whereas in the Bayesian and likelihood approaches we calculated the likelihoods P(observed difference | hypothesis is true), holding the observed difference constant and letting the hypothesis vary, we are now concerned with the conditional probability P(getting as extreme or more extreme data | null is true), in which the hypothesis is fixed as the null so that we don’t consider every single parameter value, and the “possible data” is allowed to vary.

How we allocate the alpha-level's "extreme percentage constituting improbability" is up to us - either in only one direction or in both.

Moreover, we are no longer interested in the heights, but in areas, since the probability of the obtained data for a continuous density function is infinitesimally small, so we are interested in a low-probability range. Somewhat arbitrarily, this “rejection region” has by convention been set to 5%. This refers either to the two extreme 2.5% in both directions, in which case the test is two-tailed, or the most extreme 5% in one direction, in which case the test is one-tailed, and any effect in the other direction won’t be picked up. This makes the calculated probability dependent on unobserved things (choice of test), so that identical data can result in different conclusions. In effect, these are two different ways of calculating the conditional probability P(getting as extreme or more extreme data | null is true).
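To make the point about identical data leading to different conclusions concrete, here is a minimal sketch. It assumes a test statistic that follows a standard normal distribution under the null; the observed value of 1.80 and the function name `p_values` are made up for illustration.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one-tailed, two-tailed) p-values for a z statistic.

    Assumes the statistic is standard normal when the null is true.
    """
    null = NormalDist(0.0, 1.0)
    one_tailed = 1.0 - null.cdf(z)                # area beyond z in the predicted (upper) direction
    two_tailed = 2.0 * (1.0 - null.cdf(abs(z)))   # area beyond |z| in both directions
    return one_tailed, two_tailed

z_observed = 1.80  # hypothetical observed test statistic
one, two = p_values(z_observed)
print(f"one-tailed p = {one:.4f}, two-tailed p = {two:.4f}")
print("reject at alpha = 0.05, one-tailed:", one < 0.05)
print("reject at alpha = 0.05, two-tailed:", two < 0.05)
```

With this made-up value, the one-tailed test rejects the null while the two-tailed test does not, although the data are the same.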

Decision Thresholds

We – given the population distribution – calculate the probability of obtaining our sample or a more extreme one, and if this probability, called the p-value, is less than 0.05, we reject the null. If the null is actually true, a procedure that rejects whenever p is below 0.05 will make this mistake in only 5% of all future replications of the test. The 5% level, called alpha, is therefore a decision criterion set by ourselves in advance that gives us the proportion of false alarms we would get if we were to perform this test again and again while the null was true. It is an objective probability. The long-term behavior, alpha, is everything we know. Note that it is not correct to define alpha as “the probability of false alarm error”, since this could be interpreted to mean the probability of a false alarm after rejecting the null. If you obtain p < 0.011, it is a mistake to say that “My false alarm rate is less than 1.1%” or that “98.9% of all replications would get a significant result” or that “There is a higher probability that a replication will be significant”, since a single experiment cannot have a false alarm rate. Alpha is a property of the procedure, not of one particular experiment.
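The long-run reading of alpha can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the text: a two-tailed z-test, normally distributed data with a known standard deviation of 1, and a null that is true in every replication. The function name and the sample size of 30 are arbitrary.

```python
import random
from statistics import NormalDist

def false_alarm_rate(n_replications=20000, n=30, alpha=0.05, seed=1):
    """Fraction of replications that reject a true null (two-tailed z-test, known sd = 1)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(n_replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the null is true: population mean 0
        z = (sum(sample) / n) * n ** 0.5                  # sample mean divided by its standard error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_replications

rate = false_alarm_rate()
print(f"long-run false alarm rate: {rate:.3f}")
```

The printed rate should land close to 0.05. It describes the procedure across many replications and says nothing about any single one of them.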

The cost-benefit analysis of setting alpha and beta levels may be compared to deciding on the mesh width of a fishing net with which you wish to catch any small fish. The bigger the mesh (lower beta), the greater the chance of catching small fish, and the smaller the risk of mistakenly concluding from an empty net that no small fish exist (false negative). However, it also increases the risk of bigger fish being caught (false positive), which you could reduce by having a smaller width (lower alpha). If small and big fish differ more in size (larger effect size), choosing the levels becomes easier, because a small width would still catch the small fish, with less risk of false negatives.

Confusing the p-value with the probability of a fluke is called the “base rate fallacy”. It ignores the fact that the probability of a significant result being a fluke depends on the prior distribution of real effects. If you perform 100 tests at alpha=0.05 and the null is true in nearly all of them, you would expect about 5 false positives. If you obtained 15 significant results in total, the fraction of them that are false (a number known as the “false discovery rate”) would be about 5/15=33%. The lower the base rate (the fewer cases in which a real effect exists), the more opportunities for false positives. Therefore, in data-heavy research areas like genomics or early drug trials, where the base rate is very low, the vast majority of significant results may be flukes.

 The alpha and beta levels don't tell us how many of our findings are likely to be false. To know this, we must have an estimate of the base rate (Bayesian prior) proportion of true effects versus true null effects.
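A rough sketch of how the base rate drives the false discovery rate is given below. The base rates (10% and 1%) and the probability of 0.8 of detecting a real effect are illustrative assumptions, not figures from the text, and the function name is hypothetical.

```python
def false_discovery_rate(n_tests, base_rate, alpha, power):
    """Expected fraction of significant results that are false positives.

    base_rate: assumed proportion of tests in which a real effect exists.
    power:     assumed probability of detecting a real effect (1 - beta).
    """
    real_effects = n_tests * base_rate
    true_nulls = n_tests - real_effects
    true_positives = real_effects * power   # real effects that reach significance
    false_positives = true_nulls * alpha    # true nulls that reach significance by chance
    return false_positives / (true_positives + false_positives)

fdr_10 = false_discovery_rate(n_tests=100, base_rate=0.10, alpha=0.05, power=0.8)
fdr_1 = false_discovery_rate(n_tests=100, base_rate=0.01, alpha=0.05, power=0.8)
print(f"base rate 10%: false discovery rate = {fdr_10:.2f}")
print(f"base rate  1%: false discovery rate = {fdr_1:.2f}")
```

Under these assumed numbers, a base rate of 10% already makes about a third of the significant results false, and a base rate of 1% makes most of them false, even though alpha is 0.05 in both cases.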


There is a similar risk of failing to detect real effects, called beta and expressed as P(accepting the null | null is false). Like alpha, beta is an objective probability decided upon in advance of the experiment. Given the beta, the expected effect size, and the expected amount of noise, you can calculate the sample size required to keep the beta at this predetermined level. Because large sample sizes normally are expensive, the relationship between alpha and beta is usually a tradeoff: we can reduce the probability of false alarms by requiring a p-value below 1%, but only by increasing the probability of missing a true effect. Though in practice they often are, alpha and beta are not meant to be picked as a mindless ritual, but carefully chosen based on an experiment-specific cost-benefit analysis. False alarms are typically presumed to be more costly, but this is not always the case: in quality control, failure to detect malfunctioning is often a higher priority.
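The sample size calculation mentioned above can be sketched with a common normal-approximation formula for comparing two group means, n per group = 2 * ((z_alpha + z_beta) / d)^2. Both the formula and the standardized effect size d = 0.5 (expected difference divided by the expected standard deviation) are assumptions introduced here for illustration; an exact t-test calculation would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, beta=0.20):
    """Approximate subjects per group for a two-tailed, two-group comparison of means.

    effect_size: expected difference divided by expected standard deviation (assumed known).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_beta = z.inv_cdf(1 - beta)        # quantile matching the chosen beta
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

n_default = sample_size_per_group(0.5, alpha=0.05, beta=0.20)
n_strict = sample_size_per_group(0.5, alpha=0.01, beta=0.20)
print("alpha = 0.05:", n_default, "per group")
print("alpha = 0.01:", n_strict, "per group")
```

Tightening alpha while holding beta fixed raises the required sample size, which is the tradeoff in concrete terms: with a fixed budget of subjects, the lower alpha would have to be paid for with a higher beta.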

Two ways of visualizing power: as a function of effect size, or as the relative area of the values in the alternative hypothesis' probability distribution that would qualify as significant.

The complement of beta (1 – beta) is the probability of picking up an effect of the expected size. It is called “power” and can be thought of as the procedure’s sensitivity. Usually, a power of 0.8 or higher is considered sufficient, but in practice it is seldom achieved or even calculated. We mentioned previously that the hypotheses tested in the most common statistical tests tend to be low in content, because they only predict “there will be a difference” without specifying the effect size. Specifying the effect size is therefore not only needed for power calculations, but also makes the theory more falsifiable. In neuroscience, the median study has a power of only 0.2, with the hope that meta-studies aggregating the results will compensate for it. A non-significant result in an underpowered study is meaningless, because the study never stood a chance of finding what it was looking for. Underpowered studies also run the risk of “truth inflation” – observed effect sizes vary randomly around the true value, and if the power is low, only effects that by chance come out very large will be statistically significant. By reporting such an effect size, you over-estimate the magnitude that future replications will find, as these regress towards the mean.
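Truth inflation can be illustrated with a simulation of many underpowered studies of the same real effect. The sketch below takes a shortcut that is an assumption of this example: instead of generating raw data, it draws each study's observed difference directly from its sampling distribution (two groups of 20, standard deviation 1, true difference 0.3). All numbers are hypothetical.

```python
import random
from statistics import NormalDist

def simulate_studies(true_effect=0.3, n=20, alpha=0.05, n_studies=20000, seed=2):
    """Return (power, mean observed effect among the significant studies)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n) ** 0.5  # standard error of a difference between two group means, sd = 1
    significant = []
    for _ in range(n_studies):
        observed = rng.gauss(true_effect, se)  # one study's observed difference
        if abs(observed / se) > z_crit:        # two-tailed z-test against "no difference"
            significant.append(observed)
    power = len(significant) / n_studies
    return power, sum(significant) / len(significant)

power, mean_significant = simulate_studies()
print(f"power: {power:.2f}")
print(f"true effect: 0.30, mean effect among significant studies: {mean_significant:.2f}")
```

In this setup only the studies that happened to observe an unusually large difference cross the threshold, so the average effect among the significant studies comes out well above the true value of 0.3.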

"Truth inflation" occurs when underpowered experiments report their obtained effect size as the mean effect size.

Detailing the Reference Class

The essence of null-hypothesis testing thus is to control long-term error, and to do this, the test procedure needs to be specified in the finest detail, so that the reference class – the infinite number of replications that never happened – is well-defined. This includes the number of tests for different hypotheses we perform as part of the whole testing procedure – the “family” of tests. Each test does not stand on its own. The reference class then comprises an infinite number of replications of this whole family, so if we still wish to have false alarms in only 5% of our replications, we need to account for the fact that several tests increase the chance of a misleading significant result being found somewhere in the procedure. This inflation of the “family-wise error rate” can be understood as follows: if, for each test in the family, the probability of not making an error is (1-alpha), then for k independent tests the probability of making no error at all is (1-alpha)^k, and the probability of making at least one error is its complement, 1-(1-alpha)^k, which increases with k. Thus, the overall long-term false alarm rate is no longer controlled. Intuitively, by exposing ourselves to more chance events for each trial, there is more opportunity for chance to yield a significant result, and to curb this, we enforce much more conservative criteria. The most common and least sophisticated way to control this error rate is to let each test’s alpha be 0.05/k, something known as Bonferroni correction.
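The arithmetic behind the family-wise error rate is easy to check. The following sketch assumes the k tests are independent, which is what the formula 1-(1-alpha)^k requires; the function name is illustrative.

```python
def family_wise_error_rate(alpha, k):
    """Probability of at least one false alarm in k independent tests when every null is true."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    uncorrected = family_wise_error_rate(0.05, k)
    bonferroni = family_wise_error_rate(0.05 / k, k)  # each test run at alpha / k
    print(f"k = {k:2d}: uncorrected = {uncorrected:.3f}, Bonferroni-corrected = {bonferroni:.3f}")
```

With 20 uncorrected tests the chance of at least one false alarm is about 64%, while the Bonferroni-corrected family stays just under 5%.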

The presence of multiple comparisons can be subtle. Consider, for example, how in neuroscientific studies you test for increased activity in every 3D-pixel (voxel) of the brain – the required multiple comparison correction is massive. There is, furthermore, an intuitively disturbing aspect to the concept of a test family. Bonferroni correction reduces statistical power dramatically. Yet if the researcher had planned a priori to perform only one comparison (and pre-registered her statistical protocol), then the reference class can be re-defined as “Collect the data, perform the planned comparison at the set alpha and then perform the other tests for curiosity’s sake, Bonferroni-corrected”. Also, if the same comparisons had been part of different experiments, there would be no need for corrections. Again, the reason for this eerie feature is that the reference class needs to be well-defined.

Another reference class specification is that of when to stop testing, the “stopping rules”. Suppose we recruit subjects gradually and perform tests as we go along – a procedure known as “sequential analysis”. Testing until you have a significant effect is obviously not a good stopping rule, since you are guaranteed to get a significant effect if you wait long enough. In medical research, however, sequential analysis is often ethically required, and certain corrections are used to account for the multiple testing. Still, we cannot tell whether we stopped due to luck or due to an actual effect, so the effect is likely to be inflated. Notably, whereas in the Bayesian and likelihood approaches we could continue to collect data for as long as we wanted, now, if we get a non-significant result for our 40 subjects and decide to test another 10 subjects, the p-value would refer to the procedure “Run 40 subjects, test, if not significant, run another 10”. The false alarm rate of this procedure is bound to be above 0.05, since at n=40 there was a 5% chance of a false alarm, and at n=50 there was another opportunity for false positives. Meanwhile, if a colleague decides in advance to test 50 patients, there is no need for adjustments, because his test is a member of another reference class. Different stopping rules that coincidentally agree to stop at the same size can thus lead to different conclusions.
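The effect of the “run 40, test, if not significant run another 10” rule can be checked by simulation. The sketch below assumes a two-tailed z-test on normally distributed data with known standard deviation 1 and a true null. It compares that rule with a colleague's plan to test all 50 subjects once; names and defaults are illustrative.

```python
import random
from statistics import NormalDist

def false_alarm_rates(n_replications=10000, n_first=40, n_extra=10, alpha=0.05, seed=3):
    """Return (rate with an interim look at n_first, rate with a single test at the full n)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    n_total = n_first + n_extra
    peeking = fixed = 0
    for _ in range(n_replications):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_total)]  # the null is true
        z_first = sum(data[:n_first]) / n_first ** 0.5        # z = sum / sqrt(n) when sd = 1
        z_total = sum(data) / n_total ** 0.5
        # Interim rule: reject at n_first, or else collect the extra data and test again.
        if abs(z_first) > z_crit or abs(z_total) > z_crit:
            peeking += 1
        # Colleague's rule: a single test on all subjects, planned in advance.
        if abs(z_total) > z_crit:
            fixed += 1
    return peeking / n_replications, fixed / n_replications

peeking_rate, fixed_rate = false_alarm_rates()
print(f"test at 40, then at 50 if needed: {peeking_rate:.3f}")
print(f"single planned test at 50:        {fixed_rate:.3f}")
```

The same 50 data points yield a false alarm rate near 5% under the fixed plan and a noticeably higher one under the interim-look plan, which is why the two belong to different reference classes.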

Interpreting the Results

The p-value is the probability of obtaining data at least as extreme as ours if the null is true, i.e. the tail probability of the test statistic (such as the t-statistic). Therefore, a very small p-value does not indicate a larger effect size. A small difference can be significant while a large difference is non-significant. Nor is the p-value a measure of how important the result is. Another frequent mistake is to interpret a p-value as a Bayesian posterior (as a probability of the hypothesis being true), or as an indicator of evidential strength. Both the virtue and the weakness of the Neyman-Pearson approach lie in just how mercilessly binary it is. If p=0.051, accept null. If p=0.049, reject null. If p=0.00.., reject null too. There is no such thing as “different degrees of significance”, and you are not justified in switching alpha to 0.001 if you get a p-value below that value, as this would undermine its purpose. For the Neyman-Pearson framework, the p-value contains no information other than whether it is past the critical threshold. Because it depends on things other than the strength of the evidence, such as stopping rules and whether the test is one-tailed or two-tailed, it cannot be used as an index of evidential strength. If you decide to test more after a non-significant result, the p-value would increase, indicating less evidential strength, even though intuitively the evidential strength would increase with more data.

Another important aspect is that two tests, say A vs. C and B vs. C, cannot be interpreted as “A was significantly better than C, while B was not significantly better than C, thus A is better than B” without directly comparing A and B. This is because the alpha level is arbitrary (one p-value could be slightly below it, the other slightly above) and because statistical power is limited.

Confidence intervals are a highly informative way of summarising the results, representing a range that, in 95% of replications, will include the true population value.

There is an alternative, more informative and straightforward reporting strategy known as “confidence intervals”. By calculating the set of mean values that are non-significantly different from your sample mean at your chosen alpha (for alpha=0.05, mean ± 1.96 S.E.), you will obtain an interval that, in 95% of the times that you replicate the procedure, will include the true population mean. In essence, it gives you the range of answers consistent with your data. If it includes both the null-predicted mean and the alternative-predicted mean, the result is non-significant, but it also tells you directly that the sensitivity is insufficient. Moreover, the width of the interval indicates your precision. If it is narrow and includes zero (null), the effect is likely to be small, while if it is wide the procedure may be too imprecise to make any inference. Widths, like data, vary randomly, but at the planning stage of the experiment it is possible to calculate the sample size required for the interval to be of a desired width with a chosen probability, such as 95% (a number known as “assurance”). Confidence intervals are particularly useful for comparing differently sized groups, since the certainty of a large sample size will be reflected in that interval’s precision. For these reasons, confidence intervals are preferred, but even in a journal like Nature, only 10% of the articles report confidence intervals, perhaps because the finding may seem less exciting in light of its width, or because of pressures to conform.
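The claim that the interval includes the true mean in 95% of replications can be checked with a short simulation. The population values (mean 10, standard deviation 2) and the sample size of 100 are made up for this sketch. With samples much smaller than this, the 1.96 multiplier would be slightly too narrow and a t-based multiplier would be the usual choice.

```python
import random

def coverage(true_mean=10.0, sd=2.0, n=100, n_replications=5000, seed=4):
    """Fraction of replications whose interval mean +/- 1.96 S.E. contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_replications):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = sum(sample) / n
        s = (sum((x - m) ** 2 for x in sample) / (n - 1)) ** 0.5  # sample standard deviation
        se = s / n ** 0.5                                         # standard error of the mean
        lower, upper = m - 1.96 * se, m + 1.96 * se
        if lower <= true_mean <= upper:
            hits += 1
    return hits / n_replications

cov = coverage()
print(f"proportion of intervals containing the true mean: {cov:.3f}")
```

The 95% describes the procedure: any single interval either contains the true mean or it does not, but across replications roughly 95% of them do.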

Criticisms

A number of criticisms have been raised against the Neyman-Pearson approach. One is that an internally incoherent hybrid version is often taught, in which p-values are reported as p=0.0001 even if alpha was set a priori at 0.05, which is the only thing that should matter. Another is that the hypotheses it tests are typically low in content. A third is that it invites cheating, since the Bonferroni penalties for collecting more data after a non-significant result mean that a lot of expensive data may be wasted. A fourth is that it is often poorly implemented, without prior power calculations, and the “truth inflation” phenomenon means that real effects whose reported sizes were exaggerated may be unfairly dismissed, because the replication will have a power based on the initial effect size, not the deflated one, and consequently be underpowered.

Researchers' and journals' preference for significant results may make theories seem more promising than they are.

A fifth is that the binary nature of the Neyman-Pearson approach also invites “publication bias”, in which both scientists and journals prefer unexpected and significant results, as a kind of confirmation bias writ large. If an alpha=0.05 experiment on a true null is replicated 20 times, on average 1 of the replications will be significant due to chance. If 20 independent research teams performed the same experiment, one would be expected to get a significant result simply by chance, and that team would certainly not think of itself as merely lucky. If journals choose only to publish significant results, neglecting the 19 non-significant replications, it fosters something like a collective illusion. Moreover, it is not unusual to tweak, shoehorn and wiggle results – or to be selective about what data to collect in the first place – to make it past that 5% threshold, “torturing data until it confesses”.

Statistical methods for detecting publishing bias and "p-hacking".

How can we unravel these illusions more quickly? How can we attain maximum falsifiability, maximum evolvability, to accelerate progress?

  • Meta-research: There are statistical tools for detecting data-wiggling, and a lot of initiatives aiming to weed out dubious results.
  • Improving peer-review: Studies in which deliberately error-ridden articles have been sent out for peer-review indicate that it may not be a very reliable way to detect errors. Data is rarely re-analyzed from scratch.
  • Triple-blindedness: Let those who perform the statistical tests be unaware of the hypothesis.
  • Restructuring incentives: Replications are extremely rare because they are so thankless in an industry biased towards pioneering work. Science isn’t nearly as self-corrective as it wishes it were. We must reward scientists for high-quality and successfully replicated research, and not the quantity of published studies.
  • Transparency: To avoid hidden multiple comparisons, encourage code- and data-sharing policies.
  • Patience: Always remain critical of significant results until they have been robustly and extensively validated. Always try not to get too swept away by the hype.
  • Flexibility: Consider other statistical approaches than Neyman-Pearson.

But ultimately, we have to accept that there is no mindless, algorithmic slot-machine path to approximate truth. Science writer Jonah Lehrer put it as follows: “When the experiments are done, we still have to choose what to believe.”