Among the many goals a meta-analysis can serve, estimating heterogeneity is a major one. Thanks to the Dutch team (Augusteijn, van Aert & van Assen, see https://osf.io/gv25c), we have a really good summary of the classical heterogeneity tests as well as of the effect of publication bias on them. I'll keep you in suspense for now, but they basically showed that the effect of publication bias on the Q-test, for instance, is large, complex and non-linear (which seems like bad news at first glance). But what I like about this team is that they always propose tools to address the issue, and this article was no exception.

A meta-analysis (MA) tries to estimate the true population effect size (ES), and heterogeneity emerges when the variation between the studies' effect sizes is greater than expected by chance alone. In that case, the different studies do not estimate the same true effect because they are influenced by other factors (population, study design, measures, etc.). Let's take the Q-test we mentioned as an example:

Q = Σ [(Yi − M) / Si]² (sum over the i = 1, …, k studies)

where Yi is the study effect size, M is the summary effect, Si the within-study standard deviation and k the number of studies. We simply compute the deviation of each ES from the mean and divide it by the SD, making Q a standardized measure (not affected by the metric of the ES, such as Cohen's d). We postulate H0 (all studies share a common effect size) and then test this hypothesis. Under H0, Q follows a central chi-squared distribution with degrees of freedom equal to k−1 (if p < .05, we conclude the studies do not share a common ES; p > .05 does not necessarily mean they do share one, "absence of evidence is not evidence of absence"). While Q is not sensitive to the ES's metric, it is extremely sensitive to the number of studies.
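To make the computation concrete, here is a minimal Python sketch of the Q-test just described (the function name and toy data are my own illustration; scipy provides the chi-squared reference distribution):

```python
from scipy.stats import chi2

def q_test(effects, ses):
    """Cochran's Q: squared standardized deviations of each effect size
    from the inverse-variance-weighted summary effect M, summed over
    studies and referred to a chi-squared(k - 1) distribution."""
    w = [1 / s**2 for s in ses]                            # inverse-variance weights
    m = sum(wi * y for wi, y in zip(w, effects)) / sum(w)  # summary effect M
    q = sum(wi * (y - m) ** 2 for wi, y in zip(w, effects))
    p = chi2.sf(q, df=len(effects) - 1)                    # survival function = p-value
    return q, p

# Toy data (made up): three standardized effects with equal precision.
q, p = q_test([0.2, 0.5, 0.1], [0.1, 0.1, 0.1])
```

With identical effect sizes, Q is (numerically) zero and p is 1, which is the "perfect homogeneity" baseline.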

In the figure below (from Borenstein et al., 2008), plots A and B have 6 studies while plots C and D have 12. If we compare A and B, when the dispersion increases (when Q increases), the p-value becomes significant. If we compare A and C, which look the same (in terms of τ², I², etc.), adding studies moves the p-value away from zero (.83 > .70).

Again, Borenstein and colleagues (2008) are clear on that point:

“First, while a significant p-value provides evidence that the true effects vary, the converse is not true. A nonsignificant p-value should not be taken as evidence that the effect sizes are consistent, since the lack of significance may be due to low power. With a small number of studies and/or large within-study variance (small studies), even substantial between-studies dispersion might yield a nonsignificant p-value.

Second, the Q statistic and p-value address only the test of significance and should never be used as surrogates for the amount of true variance. A nonsignificant p value could reflect a trivial amount of observed dispersion, but could also reflect a substantial amount of observed dispersion with imprecise studies. Similarly, a significant p-value could reflect a substantial amount of observed dispersion, but could also reflect a minor amount of observed dispersion with precise studies.”

Knowing how large this dispersion is and what affects it is an essential part of a meta-analyst's job. The Dutch team expressed Q as a function of I²: Q = H² × χ²(k−1), where H² = 1/(1−I²) and χ²(k−1) is a chi-squared-distributed variable with k−1 degrees of freedom. They were then able to run simulations varying I² (the percentage of total variation across studies that is due to heterogeneity rather than chance) and the number of studies k.
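That relation between Q, H² and I² can be turned into a small simulation sketch (my own code, not the team's; it assumes Q is distributed as H² times a chi-squared variable with k−1 degrees of freedom, which is my reading of their setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_q(i_squared, k, n_sim=100_000):
    """Draw Q values under the approximation Q = H^2 * chi2(k - 1),
    where H^2 = 1 / (1 - I^2)."""
    h2 = 1.0 / (1.0 - i_squared)
    return h2 * rng.chisquare(df=k - 1, size=n_sim)

# No heterogeneity (I^2 = 0): Q averages about its df, here k - 1 = 9.
q_none = simulate_q(0.0, k=10)
# I^2 = 0.5 doubles H^2, so the average Q roughly doubles as well.
q_half = simulate_q(0.5, k=10)
```

This is exactly the kind of knob-turning (I² and k) the team's simulations rely on.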

Applying the web application they developed (Q-sense: https://augusteijn.shinyapps.io/Q-sense/) to a real dataset, they examined the impact of publication bias on the Q-test through a meta-analysis of the effect of breastfeeding on intelligence in children (Der, Batty & Deary, 2006). This MA contained 9 effect sizes, with an average random-effects estimate of d = 0.138, 95% CI [0.059; 0.217], p = 0.0006, and Q(8) = 21.05, p = .007 (I² = 65.76%, τ² = 0.0071). With half of the studies not being significant, we could say that there is a small true effect size, with moderate-to-large heterogeneity and probably little or no publication bias.
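For readers who want to reproduce that kind of random-effects summary, here is a sketch using the DerSimonian-Laird τ² estimator (a common default; I cannot confirm it is the estimator used in this MA, and the inputs here would be the study-level effects and variances, which I invent for illustration):

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects summary: estimate tau^2 (between-study variance)
    from Q via the DerSimonian-Laird method, then re-weight the studies."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1 / v                                   # fixed-effect weights
    m_fe = np.sum(w * y) / np.sum(w)            # fixed-effect summary
    q = np.sum(w * (y - m_fe) ** 2)             # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)               # truncated at 0
    w_star = 1 / (v + tau2)                     # random-effects weights
    m = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1 / np.sum(w_star))
    return m, se, tau2

# Invented example: three effects with their sampling variances.
m, se, tau2 = dersimonian_laird([0.2, 0.1, 0.4], [0.01, 0.02, 0.015])
```

When the studies agree perfectly, τ² is estimated as 0 and the result collapses to the fixed-effect summary.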

As we can see in figure 8 from Augusteijn and colleagues (2017), if we express the average value of Q as a function of the MA's effect size (d = 0.138) and heterogeneity (τ² = 0.0071), Q's average and CI are barely affected by publication bias, which indicates that this meta-analysis is robust to it. Drawing a horizontal line at δ = 0.138, we are really close to the average Q-value being almost constant across the different levels of publication bias.

Now consider another meta-analysis, by Rabelo et al. (2015) (which we mentioned in a previous post here), investigating the effect of weight (e.g., holding a heavy object) on judgments of importance (e.g., the importance of fairness in decision making). They gathered 25 studies and found that the fixed- and random-effects models gave the same result: 0.571, 95% CI [0.468; 0.673], p < .001, with Q(24) = 4.7 (p = .999993; H² = 0.196), signaling extreme homogeneity. This medium-to-large effect would mean that the heavier the object you are carrying during your decision process, the more importance you attribute to certain features (such as fairness). However, the Q seems to indicate an extreme homogeneity, a sign of publication bias.
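The H² and I² figures quoted in these two examples follow directly from Q and its degrees of freedom; here is a quick sanity check (my own snippet; note that the Q-based I² for Der et al., about 62%, differs slightly from the reported 65.76%, presumably because a different estimator was used there):

```python
def h_squared(q, df):
    """H^2 = Q / df; values near 1 mean dispersion compatible with chance,
    values well below 1 mean suspiciously little dispersion."""
    return q / df

def i_squared(q, df):
    """I^2 = (Q - df) / Q as a percentage, truncated at 0."""
    return max(0.0, (q - df) / q) * 100

# Der et al. (2006):    Q(8) = 21.05 -> H^2 ~ 2.6,   I^2 ~ 62%
# Rabelo et al. (2015): Q(24) = 4.7  -> H^2 ~ 0.196, I^2 = 0
```

The Rabelo numbers are the striking ones: H² far below 1 and a truncated I² of 0 both say the studies agree with each other far more than sampling error alone would allow.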

In their Q-sense software, Augusteijn and colleagues set the true ES to d = 0 instead of .57 because of this extreme homogeneity. As you can see down below, Q's average value and CI are greatly influenced by publication bias: if there is no bias and all studies are published, the average Q = 24.01 (p = .46), and if only 3% of the studies are published, the average Q = 47.63 (p = .003).

In that case, it is more reasonable to conclude that this MA is not robust to publication bias and that the true ES may be close to 0. This is interesting, as it includes many more studies than the previous MA, and one might naively think it should therefore be more precise and less heterogeneous. This work from the Tilburg University team represents one step further toward correctly assessing the robustness of meta-analytic results…