If you open a Borenstein class on YouTube, you may end up reading comments like this:
“A meta-analysis is NOT science. It is meaningless garbage for people too lazy and too stupid to do their own randomized clinical study.”
Don’t get me wrong, this is more interesting than it seems at first glance. One of the goals of empirical science is precisely to estimate effect sizes, so the first sentence does not make any sense, of course. The second part of the second sentence comes from a hater who probably spent countless hours running interesting and costly clinical studies that failed to reach significance (welcome to Science, dude). But the first part of the second sentence is interesting because it contains the adjective “meaningless”, and it is true that medical and social sciences have recently received severe criticism from top experts (including the famous “Why Most Published Research Findings Are False”), who conclude that many findings, and thus the meta-analyses built on them, may be really difficult to interpret, if not uninterpretable.
With the overwhelming evidence of publication bias in psychology (and, probably, to a larger extent social science), Robbie van Aert and his colleagues recently gave recommendations on how to apply p-curve and p-uniform to detect publication bias in meta-analyses based on p values. Those two tools show great promise for evaluating publication bias, but van Aert et al. (2016) showed that they are highly sensitive to p-hacking (data peeking, selectively reporting DVs or excluding outliers) and to moderate-to-large heterogeneity. They offer guidance for applying p-curve and p-uniform and for drawing inferences from meta-analytic data, along with R code and web resources.
Effect sizes can be largely overestimated because of publication bias. Indeed, journals tend to publish more significant results than non-significant results, despite the tremendous efforts of some researchers and editors. If H0 is true (no effect), the distribution of p values must be uniform, whereas it is right-skewed if there is an effect (see here for more details).
Based on that, Simonsohn, Nelson & Simmons (2014), following our master (in statistics, biology, genetics, experimental design, etc., etc.) Sir Ronald A. Fisher (1925), created p-curve. They stated that “because only true effects are expected to generate right-skewed p-curves – containing more low (.01s) than high (.04s) significant p-values – only right-skewed p-curves are diagnostic of evidential value.” Since p-curve does not estimate effect sizes or CIs, Van Assen et al. (2015) created p-uniform to provide those additional features.
Van Aert et al. (2016) proposed several recommendations for those tools, along with more general criteria for meta-analysis (Table 1).
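To make the uniform vs. right-skewed claim concrete, here is a minimal simulation sketch. It assumes two-group z-tests with a known SD of 1; the function names, the sample size of 20 per group, the effect of d = 0.5 and the seed are all illustrative choices of mine, not anything taken from the papers.

```python
import random
from statistics import NormalDist


def simulate_p_values(delta, n_per_group, n_studies, rng):
    """Two-sided p-values of simulated two-group z-tests (SD assumed known, = 1)."""
    std_normal = NormalDist()
    # Expected z statistic when the true standardized difference is delta.
    expected_z = delta * (n_per_group / 2) ** 0.5
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(expected_z, 1.0)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def share_below_025(p_values):
    """Among significant results (p < .05), the share with p < .025."""
    significant = [p for p in p_values if p < 0.05]
    return sum(p < 0.025 for p in significant) / len(significant)


rng = random.Random(2016)  # arbitrary seed, only for reproducibility
null_share = share_below_025(simulate_p_values(0.0, 20, 40_000, rng))
effect_share = share_below_025(simulate_p_values(0.5, 20, 40_000, rng))
print(f"H0 true: {null_share:.2f} of significant p-values are below .025")
print(f"d = 0.5: {effect_share:.2f} of significant p-values are below .025")
```

Under H0 the significant p-values should split roughly evenly around .025 (flat curve), while with a true effect clearly more than half of them should fall below .025 (right skew).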
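As a rough illustration of the right-skew idea, here is a simplified sketch that combines significant p-values with Fisher's method. It is not the official p-curve app: under H0 a significant p-value divided by .05 is treated as uniform on (0, 1), and those rescaled values are then combined. The helper names and the two sets of p-values are made up for the example, and the chi-square tail formula used here only holds for an even number of degrees of freedom (which is always the case with Fisher's method).

```python
from math import exp, log


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, valid for even df only."""
    half = x / 2
    term, total = 1.0, 1.0  # i = 0 term of the sum of half**i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


def right_skew_test(p_values, alpha=0.05):
    """Fisher's method on significant p-values rescaled to (0, 1)."""
    significant = [p for p in p_values if p < alpha]
    rescaled = [p / alpha for p in significant]  # uniform under H0
    statistic = -2 * sum(log(v) for v in rescaled)
    return statistic, chi2_sf_even_df(statistic, 2 * len(significant))


# Hypothetical p-values, for illustration only.
skewed = [0.001, 0.002, 0.004, 0.01, 0.03]  # mostly very small
flat = [0.02, 0.03, 0.04, 0.045, 0.049]     # piled up just under .05

_, p_skewed = right_skew_test(skewed)
_, p_flat = right_skew_test(flat)
print(f"mostly small p-values: test of right skew p = {p_skewed:.3f}")
print(f"p-values near .05:     test of right skew p = {p_flat:.3f}")
```

A small result for the first set suggests right skew (evidential value); the second set, with p-values huddled just below .05, gives no such signal.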
© 2016 SAGE
As an example, let’s take Rabelo et al. (2015), who conducted a meta-analysis investigating the effect of weight (e.g., holding a heavy object) on judgments of importance (e.g., importance of fairness in decision making). They pooled 25 studies and found that the fixed- and random-effects models gave the same result: 0.571 (95% CI [0.468; 0.673], p < .001) (for those who lost track of fixed vs. random effects, click here for more details). This medium-to-large effect means that the heavier the object you’re carrying during your decision process, the more importance you attribute to certain features (such as fairness, for instance). At first sight, this seems convincing: the CI is pretty narrow, the effect size decent and the p-value really low. But when running p-curve and p-uniform, we end up with a really different story (see Table 2).
Results of p-uniform revealed the presence of publication bias (z = 5.058, p < .001, first column) with a NEGATIVE effect size estimate (opposite conclusion), a CI including 0 and, at best, a small positive ES (0.160, upper limit). This is largely due to the fact that the meta-analysis contained studies with really small sample sizes and p-values around .05. We should then follow Recommendation 3 of Table 1 and use p-uniform and p-curve instead of fixed- or random-effects models to draw inferences here. Since the average p-value is > .025, the authors recommend setting the effect size estimate of p-uniform and p-curve equal to 0 (Recommendation 4) and, given the absence of heterogeneity (Q(24) = 4.55, p = 1, I^2 = 0), p-curve and p-uniform may be accurate estimators of the true effect (Recommendation 5a). Be careful, though: a p-value of the Q-test this close to 1 indicates excessive homogeneity, which is unlikely to occur under regular sampling conditions and may indicate publication bias or p-hacking (Ioannidis, Trikalinos, & Zintzaras, 2006). As a result, it is difficult to conclude for now that such an effect of the physical experience of weight on importance actually exists; at the very least, the data are too inconclusive to make such a claim.
On the other hand, if the meta-analysis had revealed high heterogeneity between studies, relying on p-curve and p-uniform would be unlikely to guide you properly and, as Recommendation 5b states, it can be more appropriate to first create subgroups of homogeneous studies based on theoretical and methodological considerations before running p-curve and p-uniform a second time. It is in general difficult to interpret meta-analytical results under high heterogeneity. Van Aert et al. (2016) ran a simulation of 5,000 studies based on a true effect of .397, confirming that, with high heterogeneity, most of the available tools are biased to a large extent (overestimating the true effect by a factor of 2 to 3, see Table 3).
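For readers who want to check the heterogeneity numbers by hand, here is a small sketch that recomputes I^2 and the p-value of the Q-test from the reported Q(24) = 4.55. The function names are mine, and the chi-square tail formula is a shortcut that only works for an even number of degrees of freedom (24 here).

```python
from math import exp


def i_squared(q, df):
    """I^2 in percent: share of total variation attributed to heterogeneity."""
    return max(0.0, (q - df) / q) * 100


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, valid for even df only."""
    half = x / 2
    term, total = 1.0, 1.0  # i = 0 term of the sum of half**i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


q, df = 4.55, 24  # values reported for the Rabelo et al. (2015) meta-analysis
p_q = chi2_sf_even_df(q, df)
print(f"I^2 = {i_squared(q, df):.0f}%")
print(f"p-value of the Q-test = {p_q:.6f}")
print(f"probability of a Q this small or smaller = {1 - p_q:.6f}")
```

Because Q is far below its degrees of freedom, I^2 is truncated at 0 and the Q-test p-value is essentially 1. The last line is the other side of the coin: observing so little variation between 25 studies would itself be very unlikely under ordinary sampling, which is the excessive homogeneity mentioned above.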
Two other pitfalls seem to slow down the use of p-curve and p-uniform. First, the authors note that those tools are extremely sensitive to p values close to .05, which leads p-uniform, for example, to behave erratically. Second, p-hacking techniques such as data peeking (adding observations until you get a p value < .05), selectively reporting DVs or selectively excluding outliers distort their estimates. P-curve, in particular, underestimates the effect size under data peeking and when only the first significant DV is reported.
We thus need to develop those tools further in order to make them reliable for handling publication bias in meta-analysis. If you want to generalize your meta-analytic results to the general population, don’t forget to also compare your random-effects model to your fixed-effect model, since the former may lead to a higher estimate of the effect size than the latter: small-N studies with overestimated effect sizes get less weight in the fixed-effect model.
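To see why data peeking is a problem in the first place, here is a minimal simulation sketch with no true effect at all. It assumes two-group z-tests with a known SD of 1; the peeking schedule (test at 10 per group, then every 5 more up to 50), the number of simulations and the seed are arbitrary choices of mine.

```python
import random
from statistics import NormalDist


def study_is_significant(rng, looks, alpha=0.05):
    """Simulate one null study; `looks` are the per-group sizes at which we test.

    Returns True as soon as any look gives a two-sided p-value below alpha.
    """
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    diff_sum, n = 0.0, 0
    for target_n in looks:
        while n < target_n:
            # One new observation per group; the true difference is 0.
            diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            n += 1
        z = (diff_sum / n) / (2 / n) ** 0.5
        if abs(z) > critical_z:
            return True  # the researcher stops and reports this result
    return False


rng = random.Random(42)  # arbitrary seed, only for reproducibility
n_sims = 4000
peeking_looks = list(range(10, 51, 5))  # 10, 15, ..., 50 per group

peeking_rate = sum(study_is_significant(rng, peeking_looks) for _ in range(n_sims)) / n_sims
fixed_rate = sum(study_is_significant(rng, [50]) for _ in range(n_sims)) / n_sims
print(f"false positive rate with a single test at n = 50: {fixed_rate:.3f}")
print(f"false positive rate with peeking:                 {peeking_rate:.3f}")
```

With a single planned test the false positive rate stays around the nominal 5%, while testing repeatedly and stopping at the first p < .05 pushes it well above that. This only illustrates how peeking manufactures significant results; it does not reproduce the simulations of van Aert et al. (2016).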
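The weighting argument can be checked with a toy example. The sketch below computes an inverse-variance fixed-effect estimate and a random-effects estimate using the DerSimonian-Laird estimator of the between-study variance (one common choice among several). The four studies are entirely made up: two large studies with small effects and two small studies with large effects.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted mean of the study effects."""
    weights = [1 / v for v in variances]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def random_effects_dl(effects, variances):
    """Random-effects mean with the DerSimonian-Laird estimate of tau^2."""
    weights = [1 / v for v in variances]
    fe = fixed_effect(effects, variances)
    q = sum(w * (y - fe) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at 0
    re_weights = [1 / (v + tau2) for v in variances]
    re = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return re, tau2


# Hypothetical data: small variance = large study.
effects = [0.10, 0.15, 0.80, 0.90]
variances = [0.01, 0.01, 0.09, 0.09]

fe_estimate = fixed_effect(effects, variances)
re_estimate, tau2 = random_effects_dl(effects, variances)
print(f"fixed-effect estimate:   {fe_estimate:.3f}")
print(f"random-effects estimate: {re_estimate:.3f} (tau^2 = {tau2:.3f})")
```

Adding tau^2 to every study's variance flattens the weights, so the two small studies with large effects count for more in the random-effects model and pull its estimate upward.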