
New trends in medical journals: limited use of p-values in favour of estimated indicators and confidence intervals

4/02/2020

Since the American Statistical Association issued its recommendations (Wasserstein, 2016), p-values have fallen out of favour. Following in the footsteps of other prestigious journals (Nature, JAMA, etc.), the New England Journal of Medicine (NEJM) has taken the plunge. The journal recently published new guidelines, including this requirement: replace the traditional p-value as a decision criterion with the estimated value of the effects and their confidence interval whenever neither the study protocol nor the statistical analysis plan describes adjustment methods for multiple comparisons (Harrington, 2019).

Why is the p-value, once so widely used, now being called into question?

Let’s take the example of a clinical trial in which the investigator seeks to demonstrate that a treatment increases the chances of cure. To do this, they can use an indicator called the odds ratio (OR), which is greater than 1 when the treatment is effective, and less than or equal to 1 otherwise. Suppose that, according to the data, patients receiving the treatment have 1.25 times greater odds of being cured, i.e. OR = 1.25 and therefore log(OR) ≈ 0.22. In this case, the appropriate statistical test is a comparison of the following hypotheses:

H0 : log(OR) = 0      (no effect of the intervention)
H1 : log(OR) ≠ 0      (effect of the intervention)

The p-value is the probability, when the null hypothesis (H0) is true, of observing an effect at least as extreme as the one actually observed. If the p-value of the previous test is equal to 0.03, this means that if the null hypothesis is true (i.e., the intervention has no effect), the probability of observing an effect at least as large as OR = 1.25 is 3%. A p-value below 5% generally leads to rejection of the null hypothesis: such data are considered too unlikely (less than one chance in 20) under the null hypothesis, and it is preferable to accept the alternative hypothesis. In our example, we accept the alternative hypothesis and conclude that the intervention has an effect on recovery, with a 5% risk of error.
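
As an illustration, here is a minimal sketch of how such a p-value could be computed from a hypothetical 2×2 table of cured/not-cured counts, using a Wald test on log(OR). The counts are invented for the example (chosen so that OR = 1.25 and the p-value lands near the 0.03 used in the text) and do not come from any real trial.

```python
import math
from scipy.stats import norm

# Hypothetical 2x2 table (invented counts, not from a real trial):
#                cured   not cured
# treatment      a=400     b=320
# control        c=400     d=400
a, b, c, d = 400, 320, 400, 400

odds_ratio = (a / b) / (c / d)          # 1.25
log_or = math.log(odds_ratio)           # ~0.22

# Standard error of log(OR) for a 2x2 table (Woolf's formula)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)

# Two-sided Wald test of H0: log(OR) = 0
z = log_or / se
p_value = 2 * norm.sf(abs(z))           # ~0.03

print(f"OR = {odds_ratio:.2f}, log(OR) = {log_or:.2f}, p = {p_value:.3f}")
```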

However, this statistical test assumes that no factors “confound” the relationship between the intervention and recovery. Only in this case can the estimated effect of the intervention be interpreted as a causal effect, free of bias. Randomisation of patients, double-blinding, intention-to-treat analysis… these various principles applied to randomised controlled trials are designed precisely to avoid any bias linked to the presence of other factors. In this context, the p-value has long been considered a reliable indicator of the reality of the effect, and a valuable decision-making aid.

The NEJM nevertheless identifies several limitations to the use of the p-value, including in clinical trials:

  • The p-value is often misinterpreted as the probability that the null hypothesis is true. A p-value of less than 5% does not mean that there is less than a 5% chance that the null hypothesis is true.
  • The p-value does not provide any indication of the magnitude of the effect or its variability.
  • There are often several criteria of interest in a study, which leads to multiple comparisons and increases the risk of a type I error (alpha risk). The probability of at least one erroneous rejection grows rapidly, at the rate 1 − (1 − α)^K, where K is the number of criteria evaluated. The NEJM points out that when 10 statistical association tests are conducted on 10 different criteria, the probability of wrongly concluding that the intervention is effective on at least one of the 10 tests is about 40% (see the sketch after this list).
  • Finally, reducing the notion of the efficacy of an intervention to reaching a threshold of 5% may be seen as a reductionist vision of medicine that does not reflect reality.
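
The sketch below reproduces this multiple-comparisons calculation; the 5% threshold and the 10 tests come from the NEJM example, and the tests are assumed to be independent.

```python
# Family-wise error rate for K independent tests at level alpha:
# P(at least one false rejection) = 1 - (1 - alpha)^K
alpha = 0.05

for k in (1, 2, 5, 10):
    fwer = 1 - (1 - alpha) ** k
    print(f"K = {k:2d} tests -> P(at least one false positive) = {fwer:.1%}")

# K = 10 gives ~40.1%, the figure quoted by the NEJM.
```
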
What alternative solutions are recommended?

The NEJM recommends using statistical thresholds to conclude that an intervention has an effect only when the statistical analysis plan specifies methods for adjusting the alpha risk. It also recommends presenting the estimates of the effects and their confidence intervals when evaluating the benefits and risks of the intervention.
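
To make this concrete, here is a minimal sketch of the kind of reporting this favours, reusing the hypothetical 2×2 table from the earlier sketch: the point estimate of the odds ratio together with its 95% confidence interval, computed on the log scale.

```python
import math

# Same hypothetical counts as in the earlier sketch
a, b, c, d = 400, 320, 400, 400

log_or = math.log((a / b) / (c / d))
se = math.sqrt(1/a + 1/b + 1/c + 1/d)

# 95% confidence interval for the OR, built on the log scale
z_crit = 1.96
low = math.exp(log_or - z_crit * se)
high = math.exp(log_or + z_crit * se)

print(f"OR = {math.exp(log_or):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```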

The journal Nature goes further, asserting that we must stop basing conclusions on a significance threshold altogether (Amrhein, 2019):

  • For example, concluding that there is no difference or effect just because the p-value is greater than 5% or, equivalently, because the confidence interval contains the value 0 (or 1, depending on the indicator considered).
  • Or concluding that two studies contradict each other because one shows significant results and the other does not.

Nature advocates abandoning the significance threshold without banishing the p-value. Instead, it recommends looking at the observed effect (the point estimate) and its confidence interval, and discussing the practical implications of the whole range of values the effect could plausibly take. The confidence interval should be seen as a “compatibility interval”, i.e., a set of possible values for the effect that are compatible with the data, under the assumptions of the analysis. A value outside the interval is not “incompatible”, merely less compatible. The journal is also open to confidence levels other than 95% for the calculation of these intervals.
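
As a small illustration of this reading, the sketch below computes intervals for the hypothetical odds ratio at several confidence levels; values inside a given interval are more compatible with the data than values outside it. The counts are the same invented ones as in the earlier sketches.

```python
import math
from scipy.stats import norm

# Same hypothetical counts as in the earlier sketches
a, b, c, d = 400, 320, 400, 400
log_or = math.log((a / b) / (c / d))
se = math.sqrt(1/a + 1/b + 1/c + 1/d)

# "Compatibility intervals" at several confidence levels
for level in (0.90, 0.95, 0.99):
    z_crit = norm.ppf(1 - (1 - level) / 2)
    low = math.exp(log_or - z_crit * se)
    high = math.exp(log_or + z_crit * se)
    print(f"{level:.0%} interval for the OR: [{low:.2f}, {high:.2f}]")
```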

Recently, a new article from the American Statistical Association identified several approaches that go beyond point estimates and their confidence intervals (Wasserstein, 2019). A multitude of “second-generation p-values” are currently emerging, such as one derived from a statistical test in which the alternative hypothesis corresponds to a minimal effect of interest. Other indicators, such as the Bayes factor or the calculation of plausibility intervals for the various hypotheses under consideration, are also worth considering. Overall, the trend is to abandon dichotomous criteria in a field where uncertainty is omnipresent, and where the design, the quality of the data and methodological biases often weigh more heavily on the conclusions of a study than the significance threshold.
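
To give a flavour of these proposals, here is a minimal sketch of one formulation of a second-generation p-value proposed in the statistical literature: the fraction of the interval estimate that overlaps an interval null hypothesis of negligible effects, with a correction for very imprecise estimates. The interval bounds below are the ones computed in the earlier sketches; the null zone (odds ratios between 1/1.1 and 1.1 treated as negligible) is an assumption made purely for illustration.

```python
import math

def second_generation_p(est_low, est_high, null_low, null_high):
    """Fraction of the interval estimate [est_low, est_high] that
    overlaps the interval null [null_low, null_high], with a
    correction that caps the value at 1/2 for very wide (i.e.
    uninformative) interval estimates."""
    overlap = max(0.0, min(est_high, null_high) - max(est_low, null_low))
    est_len = est_high - est_low
    null_len = null_high - null_low
    return (overlap / est_len) * max(est_len / (2 * null_len), 1.0)

# 95% interval for log(OR) from the earlier sketch, and a null zone
# of "negligible" odds ratios between 1/1.1 and 1.1 (illustrative only)
est_low, est_high = math.log(1.02), math.log(1.53)
null_low, null_high = math.log(1 / 1.1), math.log(1.1)

p_delta = second_generation_p(est_low, est_high, null_low, null_high)
print(f"second-generation p-value = {p_delta:.2f}")   # ~0.20 here
# 0 means the estimate lies fully outside the null zone, 1 fully
# inside it, and values near 1/2 are inconclusive.
```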

Need to know more? Contact our teams at onedt@efor-group.com

References

Amrhein et al. (2019). Scientists rise up against statistical significance. Nature, 567, 305–307.

Harrington et al. (2019). New guidelines for statistical reporting in the Journal. New England Journal of Medicine, 381, 285–286.

Wasserstein et al. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133.

Wasserstein et al. (2019). Moving to a world beyond “p < 0.05”. The American Statistician, 73(sup1), 1–19.