Publication Bias in Meta-Analysis: How to Slay the Dragon?

McShane and his colleagues recently published a paper on publication bias and selection methods. Publication bias refers to concerns one can have about the representativeness of a study or set of studies, and it has been the subject of debate for centuries now (McShane even quotes Boyle, the Anglo-Irish chemist who was supposedly one of the first to shed light on these concerns; I haven't read the book, but I'll let you check whether this is true).

Publication bias raises issues not only about the correct estimation of effect sizes, their direction, or statistical significance, but also about accessibility, language, and familiarity. While it is generally viewed as a problem specific to meta-analysis, it is just as much a problem for a single study. If you generate a hypothesis based on a set of biased studies, you might very well end up unable to reproduce any major finding and, a fortiori, unable to find a robust effect for your specific claim. This view raises questions about the nature of the hypothesis-generation process, implying that one should gather as much unpublished work as possible before even starting to draw inferences and generate testable hypotheses.


More than 30 years ago, Larry Hedges (yes, the guy who created the g index) proposed a selection method to reduce bias in meta-analysis. The model contains two main components: 1) a data model, describing how data are produced when there is no publication bias (based on assumptions of normality and a common but unknown variance), and 2) a selection model, based on the publication process, which can be adjusted to estimate the likelihood that non-significant studies are published (Hedges, 1984; Iyengar & Greenhouse, 1988). For those of you who would like more details, I recommend reading Hedges' article here (in particular pp. 64-70, which describe the model). He assumed that data are reported only when the mean difference reaches a certain critical value of F(α, n), and then ran simulations showing that when n = 15 and d = .50, the bias of g (the equivalent of d, correcting for the bias due to small n) is nearly 100%!
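You can reproduce the flavor of that result with a quick simulation. This is a minimal sketch (not Hedges' actual code), assuming two-group studies with n = 15 per group, a true standardized mean difference of .50, and a selection rule that only "publishes" studies whose t-test is significant at the two-sided .05 level:

```python
import random
import math

def simulate_published_d(true_d=0.5, n=15, n_sims=20000, t_crit=2.048, seed=1):
    """Simulate two-group studies and keep only those with |t| above the
    critical value (i.e., only significant studies are 'published').
    Returns the mean observed d among published studies.
    t_crit ~ 2.048 is the two-sided .05 critical value for df = 28."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_sims):
        g1 = [rng.gauss(true_d, 1.0) for _ in range(n)]
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m1, m2 = sum(g1) / n, sum(g2) / n
        v1 = sum((x - m1) ** 2 for x in g1) / (n - 1)
        v2 = sum((x - m2) ** 2 for x in g2) / (n - 1)
        sp = math.sqrt((v1 + v2) / 2)            # pooled SD
        t = (m1 - m2) / (sp * math.sqrt(2 / n))  # two-sample t statistic
        if abs(t) > t_crit:                      # selection: significant only
            published.append((m1 - m2) / sp)     # observed Cohen's d
    return sum(published) / len(published)

print(round(simulate_published_d(), 2))  # markedly larger than the true 0.5
```

The mean published effect lands far above the true .50, on the order of the near-100% bias Hedges reported for this setting.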

McShane and his colleagues then ran simulations comparing Hedges’ method, Iyengar & Greenhouse’s method (correcting for likelihood of non-significant results to be published), p-curve and p-uniform across 5 metrics:

  1. Bias: the difference between the average estimate of d and the true d.
  2. RMSE: the root mean square error of the estimates of d; it represents the sample standard deviation of the differences between estimated and true values.
  3. Log(SE/SD): the logarithm of the average estimated standard error of d divided by the standard deviation of the estimates of d, giving the accuracy of the estimated standard errors (zero is preferable).
  4. Coverage %: the percentage of the estimated 95% CIs that cover d.
  5. Coverage width: the average width of the estimated 95% CIs; smaller widths are more desirable.
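Given a set of simulated estimates, these five metrics are straightforward to compute. A minimal sketch (a hypothetical helper, not the paper's code), assuming each simulated estimate comes with a standard error and a 95% CI:

```python
import math

def evaluate(estimates, ses, cis, true_d):
    """Compute the five evaluation metrics for simulated estimates of true_d."""
    k = len(estimates)
    mean_est = sum(estimates) / k
    bias = mean_est - true_d                                          # metric 1
    rmse = math.sqrt(sum((e - true_d) ** 2 for e in estimates) / k)   # metric 2
    sd = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (k - 1))
    log_se_sd = math.log((sum(ses) / k) / sd)                         # metric 3
    coverage = sum(lo <= true_d <= hi for lo, hi in cis) / k          # metric 4
    width = sum(hi - lo for lo, hi in cis) / k                        # metric 5
    return {"bias": bias, "rmse": rmse, "log(SE/SD)": log_se_sd,
            "coverage": coverage, "width": width}

# toy usage: two estimates of a true d of .5
print(evaluate([0.4, 0.6], [0.1, 0.1], [(0.2, 0.8), (0.3, 0.9)], 0.5))
```

With these toy numbers the bias is zero, the RMSE is .1, and both CIs cover the true value (coverage 100%).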

In the first simulation, McShane and colleagues compared p-curve and p-uniform to the initial settings of Hedges (1984), where (a) only statistically significant studies are published and (b) effect sizes are homogeneous across studies (q = 0 and τ = 0, respectively).


Fig. 1. “Simulation 1 results. The figure plots the bias, root mean square error (RMSE), log(SE/SD), coverage percentage, and coverage width of the three methods for three values of δ as a function of the number of studies.” © 2016 SAGE

As displayed in Figure 1 above, the bias is asymptotically zero for the Hedges approach (the "Selection" line, in cyan, in the first three panels), but the bias is globally negligible for all three methods. For RMSE, Hedges' method beats p-uniform, which beats p-curve, in terms of reducing the error. Of course, with a medium effect size (d = .5) and 100 studies, all three approaches perform decently, but those settings are unlikely in the actual social-science context (in psychology, the average effect size is closer to .2, for instance).

# EDIT: As one of my colleagues mentioned, I forgot to indicate the metric and specific field for that assertion: the average effect size (ES) for the last 100 years of social psychology is r = .21, d ≈ .4 (Richard, Bond, & Stokes-Zoota, 2003), but this estimate may itself be sensitive to publication bias and overestimate the real average ES…

Since only the Hedges approach produces standard errors, it was the only one evaluated on the log(SE/SD) metric. The SE is quite precise, within 5% of the SD of the estimates of d. Finally, since p-curve does not produce CIs, McShane compared p-uniform only to Hedges, with the latter performing a bit better on coverage width.

Let us now relax the two assumptions. Non-statistically-significant results are allowed to be published, either ten times or four times less likely than significant results (q = .10 or .25, respectively), and heterogeneity across studies is set to τ = .20, equivalent to I² = 50%, considered medium heterogeneity (half the total variability across simulated studies is due to heterogeneity).
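The relaxed publication rule can be sketched as a one-step selection model: a significant study is always published, a non-significant one only with probability q. Here is an illustrative simulation (my own sketch, not the paper's code, with assumed values true_d = .2, τ = .2, and a normal approximation for the test):

```python
import random
import math

def norm_sf(z):
    """Upper-tail probability of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def mean_published_d(true_d=0.2, tau=0.2, n=50, q=0.10, n_sims=20000, seed=2):
    """One-step selection model: each study's true effect is drawn from
    N(true_d, tau^2); the study is published with certainty if significant
    (two-sided .05), otherwise with probability q. Returns the mean
    observed d among published studies."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n)                  # approx. SE of d, two groups of n
    published = []
    for _ in range(n_sims):
        theta = rng.gauss(true_d, tau)     # study-level true effect
        d_obs = rng.gauss(theta, se)       # observed effect
        p = 2 * norm_sf(abs(d_obs) / se)   # two-sided p-value
        if p < 0.05 or rng.random() < q:   # selection step
            published.append(d_obs)
    return sum(published) / len(published)

print(round(mean_published_d(q=0.10), 2))  # heavy selection: inflated
print(round(mean_published_d(q=1.00), 2))  # no selection: close to true_d
```

As q shrinks toward zero, the published literature drifts further from the true mean effect, which is exactly the gradient the simulations explore.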


Fig. 2. "Simulation 3 results. The figure plots the bias, root mean square error (RMSE), log(SE/SD), coverage percentage, and coverage width of the three methods for three values of δ as a function of the number of studies and the relative likelihood of publication q." © 2016 SAGE

With those barriers broken down, the story, now much more realistic, is quite different. Looking at the bias first, we can see that within the small effect-size range the Hedges selection approach performs much better than p-curve and p-uniform; same story with the RMSE. And while p-uniform roughly ties with the selection method on coverage %, its coverage width is much larger, in particular when a higher proportion of non-significant results gets published. Note that, even though this is not McShane's main point, the selection approach also produces a much better estimate of heterogeneity than the traditional meta-analytic approaches (fixed- and random-effects models), which tend to underestimate heterogeneity.

In sum, it seems that the selection method performs better than p-curve and p-uniform in these settings, in particular once we relax the publication and homogeneity assumptions (q and τ > 0); other values of τ gave approximately the same results. The correct setting of q is also difficult to evaluate, though Franco et al. (cited above) estimate it at around 20%; hopefully this will change in the future with new journals publishing non-significant results.

As McShane and colleagues mention at the beginning of the article (in Table 1 below), it can be interesting to use the three approaches alongside standard meta-analytic tools to compare estimates. However, with high heterogeneity and looser assumptions concerning non-significant publications, Hedges' approach performs better than the others.

Table 1. © 2016 SAGE

