Vance Berger proposed illustrating the problem of post hoc testing with a deck of 52 cards. If H0 is true, every card has the same chance of being drawn (1/52). If H1 states that a particular ace shows up with probability 1, then observing that ace yields a p value of 1/52. Even in such a simple case, the problem of “inherent multiplicity” arises.
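To make the card example concrete, here is a minimal sketch that enumerates all 52 equally likely draws under H0. The function names and the encoding of cards as the integers 0 to 51 are my own assumptions for illustration, not Berger's notation. It compares how often p ≤ 1/52 occurs when the card is named before the draw versus when it is named after seeing the draw.

```python
from fractions import Fraction

DECK_SIZE = 52  # cards are encoded as integers 0..51 (illustrative assumption)


def p_value_prespecified(drawn, predicted):
    """p value under H0 when the predicted card was fixed before the draw.

    Only the predicted card counts as evidence against H0, so the p value
    is 1/52 on a match and 1 otherwise.
    """
    return Fraction(1, DECK_SIZE) if drawn == predicted else Fraction(1)


def p_value_post_hoc(drawn):
    """p value when the 'prediction' is chosen after seeing the card."""
    return p_value_prespecified(drawn, predicted=drawn)


alpha = Fraction(1, DECK_SIZE)
cards = range(DECK_SIZE)
predicted = 0  # any fixed card chosen in advance

# Under H0 each card has probability 1/52, so the rejection rate is the
# share of cards whose p value falls at or below alpha.
rate_pre = Fraction(sum(p_value_prespecified(c, predicted) <= alpha for c in cards), DECK_SIZE)
rate_post = Fraction(sum(p_value_post_hoc(c) <= alpha for c in cards), DECK_SIZE)

print("prespecified:", rate_pre)
print("post hoc:", rate_post)
```

With a prespecified card, the test rejects at the nominal rate; with a card chosen after the fact, every draw looks "significant", which is the multiplicity the example is meant to expose.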

- If H1 is not postulated in advance, any card drawn yields the same low p value (1/52), which violates the requirement that the p value be uniformly distributed under H0.
- “PSG always wins on Saturday when Ibra is playing”. What is the probability of this event? How many other outcomes should we consider to judge the significance of such an event? We need both the H0 probabilities of all the events in the extreme region and a ranking of those events by the extent to which they contradict the null.
- If I planned a single analysis with a two-sided alpha level of .05, my critical value is 1.96. If I instead planned five analyses using Pocock sequential boundaries (in case an interim analysis shows unexpectedly promising results that make continuing the experiment ethically untenable), my critical value is now around 2.41. If the fifth analysis yields a value of 2.01, I may feel disappointed at having missed a result that would have been significant had I not applied such a penalty. However, we stipulated a priori that a large early difference should matter more than a small later one. As Berger put it, “lamenting this regret is tantamount to requesting a refund on a losing lottery ticket”.
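The sequential-testing point in the last bullet can be checked with a small Monte Carlo sketch. It assumes five equally spaced looks at accumulating standard-normal data under H0, and uses 2.413, the tabulated Pocock constant for five looks at a two-sided alpha of .05. The function name and simulation settings are illustrative assumptions.

```python
import random
from math import sqrt


def familywise_error(critical_value, n_looks=5, n_sim=100_000, seed=0):
    """Estimate P(at least one |Z_k| > critical_value) under H0.

    Each look adds one equally sized batch of data, summarised here as a
    N(0, 1) increment, so the z statistic at look k is the cumulative sum
    divided by sqrt(k).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(look)
            if abs(z) > critical_value:
                rejections += 1
                break  # the trial stops at the first boundary crossing
    return rejections / n_sim


naive = familywise_error(1.96)     # reuse the single-look critical value at every look
pocock = familywise_error(2.413)   # constant Pocock boundary for five looks

print(f"five looks at 1.96:  {naive:.3f}")
print(f"five looks at 2.413: {pocock:.3f}")
```

Under these assumptions the first estimate should land near .14 and the second near .05: the higher boundary is the price paid for the chance to stop early, which is why a final value of 2.01 cannot be rescued after the fact.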

Berger acknowledged that questions such as “where does alpha come from?” or “how much alpha should be applied?” are difficult ones, but it is clearly easier to specify an alpha when one has already formulated a prespecified hypothesis (Berger, 2004). The problem is that confounders sometimes show up during an analysis, or some assumptions of the study turn out to be violated. In the context of analysis of variance, you can use Fisher’s LSD procedure to compare two specific means as long as the omnibus ANOVA is significant. This procedure has more power than Tukey’s because the alpha is not corrected for multiple comparisons, which in turn increases the Type I error rate (i.e., finding a difference when there is none). The Fisher-Hayter procedure or the Scheffé method are plausible alternatives.
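The trade-off between power and Type I error can be illustrated with a rough simulation. To be clear about what it is not: it implements neither Fisher’s LSD (there is no omnibus gate) nor Tukey’s procedure. As a simplified stand-in, it compares uncorrected pairwise z tests with a Bonferroni correction, assuming four groups with equal sample sizes and known variance, so that the standardized group means are N(0, 1) under H0. All names and settings are illustrative assumptions.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def familywise_error_pairwise(critical_value, n_groups=4, n_sim=20_000, seed=1):
    """Estimate P(at least one pairwise |z| > critical_value) under H0."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sim):
        # Standardized group means under H0 (equal n, known variance).
        means = [rng.gauss(0.0, 1.0) for _ in range(n_groups)]
        # A difference of two independent N(0, 1) means has sd sqrt(2).
        if any(abs(a - b) / sqrt(2) > critical_value
               for a, b in combinations(means, 2)):
            false_positives += 1
    return false_positives / n_sim


alpha = 0.05
n_groups = 4
n_pairs = n_groups * (n_groups - 1) // 2  # 6 pairwise comparisons

z_uncorrected = NormalDist().inv_cdf(1 - alpha / 2)
z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_pairs))

fwer_uncorrected = familywise_error_pairwise(z_uncorrected)
fwer_bonferroni = familywise_error_pairwise(z_bonferroni)

print(f"uncorrected: critical z = {z_uncorrected:.2f}, FWER = {fwer_uncorrected:.3f}")
print(f"Bonferroni:  critical z = {z_bonferroni:.2f}, FWER = {fwer_bonferroni:.3f}")
```

The uncorrected tests use a lower critical value, so they detect true differences more easily, but their familywise error rate climbs well above the nominal .05. This is the same mechanism that makes an uncorrected procedure more powerful and more error-prone than a corrected one.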

Post hoc issues also emerge when we don’t know which outcome measures will be the most appropriate. Berger takes the example of an intervention to reduce childhood smoking, drug use and crime. Sometimes we are not sure, in advance, which of these outcomes will show a significant decrease. Alternative tests such as the IPCE (information-preserving composite end point) or the Smirnov test allow you to avoid prespecifying the particular sub-end point to analyze: the data select the outcome measure, which is then compared not to its own null sampling distribution but to the null distribution of the other outcomes.

When you preregister a study, even if you cannot necessarily predict something like the primary outcome measure, you can still specify the type of analysis you will pursue and report it in an exploratory section (still, it is definitely better to prespecify it when possible). Theoretical considerations rather than statistics should guide your plan (bio-psychological differences between two concepts rather than different exploratory data trends, for example). Finally, comparisons considered post hoc or exploratory in one study should lead to a follow-up study in which those earlier post hoc statements become prespecified hypotheses.