The Ganzfeld Experiments: Quality - Part 2

The GRADE approach

One begins by reviewing each study individually and rating it for quality. Based on those ratings, one then decides the overall quality of the collection of studies.

The GRADE approach, adopted by The Cochrane Collaboration, specifies four levels of quality (high, moderate, low and very low) where the highest quality rating is for a body of evidence based on randomized trials. Review authors can downgrade randomized trial evidence depending on the presence of five factors and upgrade the quality of evidence of observational studies depending on three factors.
-Cochrane Handbook Section 12

Randomized trials start out as high quality evidence and observational trials as low quality. Unfortunately, the ganzfeld is not a randomized trial in the medical sense as it lacks a control group. A randomized trial in medicine is one where patients are randomly assigned to a group.
Observational evidence is something you have when that is not possible. You can’t, for example, randomly tell people to smoke or not. You can only observe people who chose to do that or not. The problem is so-called confounding variables. People who smoke might also make other poor health choices, for example.

The typical ganzfeld experiment is neither. It could be, and has been, modified to be either of them.
If you were to randomly tell senders to sneak off, without anyone knowing, you would have two groups: one with a sender and one without. That would be a randomized placebo-controlled trial and definitely high quality evidence for the influence of a sender.
If, on the other hand, you simply compared trials where the sender just did not happen to show up, to those where he or she did, then you would have observational evidence.
Both of these things have been done but we’re not interested in that, for now.

Experiments like the ganzfeld are not explicitly considered by the Cochrane Collaboration. It seems to me clearly superior to an observational study because the experimenter has full control over what goes on.
On the other hand, the absence of a control group is a serious problem. It means that there is no way of knowing if the experimenters really managed to rule out all biases, which in this case means all conventional means of communication. The design of the experiment may not have any apparent biases but the implementation may be faulty. This is especially problematic because we are at the same time facing evidence suggesting that something was going on after all. We have no means of identifying the cause of the apparent effect.

The average ganzfeld experiment comprises a mere 40 trials. A trial, in relation to the ganzfeld, means a single attempt at guessing, rather than an actual study with many attempts as in medicine. I hope that’s not too confusing but I can’t change it.
That means that, if something is going wrong, the experimenter has relatively few chances of finding it. We would expect 10 hits just by chance. The claimed effect means that you should expect a mere 2 or 3 hits on top of that. That’s not a lot…
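To put rough numbers on this, here is a minimal sketch in Python. It assumes every trial is an independent guess with a 25% chance of a hit, so that the number of hits follows a binomial distribution. The threshold of 13 hits (10 by chance plus 3 extra) is my own choice for illustration.

```python
from math import comb, sqrt

def binom_tail(k, n, p):
    """Probability of k or more hits in n trials when each trial hits with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_trials = 40        # average experiment size quoted in the text
chance_rate = 0.25   # one correct target among four candidates

# Hits expected from guessing alone, and how much that count typically wobbles.
expected_hits = n_trials * chance_rate
sd_hits = sqrt(n_trials * chance_rate * (1 - chance_rate))

# Assumption for illustration: "10 by chance plus 3 extra" means 13 or more hits.
p_13_or_more = binom_tail(13, n_trials, chance_rate)

print(f"expected hits by chance: {expected_hits:.0f}")
print(f"standard deviation: {sd_hits:.2f} hits")
print(f"P(13 or more hits by chance alone): {p_13_or_more:.3f}")
```

The point of comparing the standard deviation with the 2 or 3 extra hits is that the claimed effect is about as large as the ordinary chance wobble of a single experiment.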


Regardless of the level of evidence at which a study starts out, there are factors for which one downgrades or upgrades the quality. One reason to downgrade is indirectness of evidence.

An example of indirectness is if one study compares drug A to placebo and another drug B to placebo. You could use those results to compare drug A and B but that would mean a downgrade of the quality of evidence. That is, even if both studies were High Quality, you would only be dealing with Moderate Evidence for which drug is better than the other.

There are other types of indirectness which don’t need to bother us.

The relevance for the ganzfeld is that we only have indirect evidence that the hit rate should be 25%. This is based on a number of assumptions such as that the random sequence is really random, or that there is no sound (or other sensory) leakage between sender and receiver and so on.

All of these assumptions can be tested and well justified but the fact remains that it is indirect evidence.

The bottom line is that the typical ganzfeld experiment cannot be regarded as the equal of a randomized trial in medicine. We are starting out with Moderate Quality.

Other factors to consider

There are other factors which should result in downgrading the quality of an individual study. These would merit downgrading some individual experiments but I do not think that such factors are prevalent enough to globally downgrade the ganzfeld experiments as a whole.

One clear-cut example would be ‘Selective outcome reporting bias’. Remember that some ganzfeld studies had not just the receiver guessing at the correct target but also independent judges. This would, in principle, allow experimenters to report only the better of the two results. So where we do not have the hit-rate of the actual receiver, we should downgrade.

Another thing that is problematic is when the hit-rate is not reported. Some studies had the receiver rank the targets according to preference. This allows computing the direct hit-rate, where a hit means that the correct target is ranked #1. However, it also allows turning the guess into a binary choice, where a hit is when the correct target is ranked #1 or #2 and a miss if it is #3 or #4. Alternatively, one may simply compute whether the average rank of the correct targets was higher than expected by chance.

However, the mere fact that this was done is not reason to downgrade. Only those studies where we do not have access to the straight hit-rate must be downgraded. We’re only interested in the reliability of the evidence, and the fact that some experimenters made dubious choices is not in itself relevant to that.

On the whole, I think that the early studies would have to be downgraded further as a body, but not the later ones. This is something we will consider when we get to looking at actual results.

Normally one must not make such summary judgements. One should consider each study separately and give a reason why one downgrades or not. Transparency is vital because these quality assessments always involve a degree of subjectivity.

But for now, let’s just remember that we have this open issue and move on.

Publication Bias

It’s generally the case in all sciences that “successful” experiments are more likely to be published than unsuccessful ones. This is quite understandable. You are more likely to tell people about the unexpected than the expected, about what happened rather than what did not happen.

The ganzfeld experiments should have a hit rate of 25%. What would happen if no one ever published experiments with a hit rate lower than that?
The published studies that remain would then have an average hit rate of well over 25%. Of course, it doesn’t have to be that extreme. Every study that is not available will distort the average of the rest.
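The extreme scenario is easy to simulate. The following is a minimal sketch, not a model of the real literature: it assumes 2000 imaginary studies of 40 trials each, a true hit rate of exactly 25%, and a "file drawer" that swallows every study that scored below chance. All of these numbers are made up for illustration.

```python
import random

def simulate_study(rng, n_trials=40, true_rate=0.25):
    """Return the number of hits in one simulated study."""
    return sum(rng.random() < true_rate for _ in range(n_trials))

def pooled_rate(hit_counts, n_trials=40):
    """Overall hit rate across a list of equally sized studies."""
    return sum(hit_counts) / (n_trials * len(hit_counts))

rng = random.Random(1)  # fixed seed so the run is repeatable
all_studies = [simulate_study(rng) for _ in range(2000)]

# Extreme file drawer: only studies at or above chance (10+ hits of 40) get published.
published = [hits for hits in all_studies if hits >= 10]

rate_all = pooled_rate(all_studies)
rate_published = pooled_rate(published)

print(f"studies run: {len(all_studies)}, published: {len(published)}")
print(f"hit rate over all studies:      {rate_all:.3f}")
print(f"hit rate over published studies: {rate_published:.3f}")
```

Nothing but chance is at work in the simulated studies, yet the published subset shows a hit rate clearly above 25%. A milder filter would produce a milder distortion in the same direction.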

Publication bias is notoriously difficult to detect because it results from studies that one doesn’t know anything about, not even if they exist. There are statistical methods to detect it but they are far from perfect. The problem with them is that they rely on certain assumptions about what the unbiased effect is like, and which results are most likely to remain unpublished. I won’t discuss these methods in detail for now. That’s for another time.

In general, review authors and guideline developers should consider rating down for likelihood of publication bias when the evidence consists of a number of small studies. The inclination to rate down for publication bias should increase if most of those small studies are industry sponsored or likely to be industry sponsored (or if the investigators share another conflict of interest).

GRADE guidelines: 5. Rating the quality of evidence—publication bias by Guyatt et al.

On its face, this suggests that we should rate down. The studies are all small and it can be argued that a conflict of interest is present, at least among some labs. I am not sure, though, whether it is warranted to assume that whatever interests parapsychologists have would show up in the form of publication bias.

Let’s look at counter-arguments for now.

Arguments against publication bias

One argument is that the Parapsychological Association Council adopted a policy opposing the selective reporting of positive outcomes in 1975.

Unfortunately, this argument is empirically not tenable. Research on publication bias in mainstream science suggests that the cause lies with authors not bothering to write up negative studies, rather than journals not publishing them. So, such a policy should not have an effect.
There is evidence that this is indeed the case. A researcher sent a survey to parapsychologists asking whether they knew of any unpublished ganzfeld experiments and what the results were. All in all 19 unpublished studies were reported. (The extent of selective reporting of ESP ganzfeld studies by S. Blackmore, 1980)

This leads to the next argument. The proportion of significant studies among unpublished and published studies was similar (37% vs 58%). This is supposed to indicate that there was no bias to publish only the successful ones. Obviously this argument is nonsense.
What matters is that the average hit-rate among the unpublished studies is systematically different. If that is the case then the average of the published studies will be biased. The fact that there are more significant studies among those published points in that direction, regardless of whether this fraction is similar.
In short, using that survey to argue that there was no publication bias is just Orwellian.

However, we must bear in mind that statistical significance tells us little about the hit-rate. Whether a study is significant depends both on the hit-rate and the number of trials in that study. If the number of trials per study in the unpublished studies is comparable to that in the published ones, then the hit-rate should be assumed to be lower. If the number of trials per study is much smaller, then the hit-rate may be the same or even higher.
For our purposes, we do not need solid evidence that the result is biased. We should downgrade whenever there is a high probability of publication bias.
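The point that significance depends on both the hit-rate and the number of trials can be made concrete with a small sketch. It assumes independent trials with a 25% chance rate and a one-sided binomial test. The two study sizes, 40 and 400 trials, are invented for illustration.

```python
from math import comb

def binom_tail(k, n, p):
    """Probability of k or more hits in n trials when each trial hits with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

chance_rate = 0.25

# Two hypothetical studies with the same 32.5% hit-rate but different sizes.
p_small_study = binom_tail(13, 40, chance_rate)    # 13 hits in 40 trials
p_large_study = binom_tail(130, 400, chance_rate)  # 130 hits in 400 trials

print(f"13/40 hits:   one-sided p = {p_small_study:.3f}")
print(f"130/400 hits: one-sided p = {p_large_study:.5f}")
```

The same hit-rate is unremarkable in the small study and significant in the large one, which is why a count of significant studies says little about hit-rates unless the study sizes are known.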

That means that the early studies would definitely have to be rated down.
The later studies are a different matter, though. The results don’t look as if they were overly influenced by publication bias. That’s something for later.

For now, I won’t downgrade. Others would. It is a borderline case.


If results from different studies are inconsistent with each other, then rating down is a must. One method to diagnose inconsistency is by applying a statistical test for heterogeneity.

Let’s say that the true hit-rate were always 25%. Because that is just a probability, few experiments would end up with an actual hit-rate of exactly 25%. Most experiments would have a hit-rate close to that and few further away from that. In the same way, few people are of exactly average height, but most are around it, with few being very tall or very short.

A test for heterogeneity checks whether the hit-rates are distributed as one would expect if the true hit-rate is always the same, whatever its value. One should still find this pattern even if something causes the true hit-rate to be 33% or whatever, as long as it is the same in all experiments.
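Here is one way such a test can be sketched in Python. It uses the ordinary chi-square statistic for homogeneity of proportions and gets a p-value by simulation rather than from a table. I am not claiming that this is the exact test used in the ganzfeld meta-analyses. The hit counts are invented for illustration and are not real ganzfeld data.

```python
import random

def homogeneity_q(hits, trials):
    """Chi-square statistic for the hypothesis that all studies share one hit-rate."""
    pooled = sum(hits) / sum(trials)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation at all, nothing to test
    return sum((h - n * pooled) ** 2 / (n * pooled * (1 - pooled))
               for h, n in zip(hits, trials))

def monte_carlo_p(hits, trials, n_sims=2000, seed=0):
    """Share of simulated same-rate data sets whose Q is at least the observed Q."""
    rng = random.Random(seed)
    pooled = sum(hits) / sum(trials)
    observed = homogeneity_q(hits, trials)
    at_least = 0
    for _ in range(n_sims):
        # Simulate every study with the same pooled hit-rate.
        fake = [sum(rng.random() < pooled for _ in range(n)) for n in trials]
        if homogeneity_q(fake, trials) >= observed:
            at_least += 1
    return (at_least + 1) / (n_sims + 1)

trials = [40] * 5                 # five hypothetical studies of 40 trials each
consistent = [10, 11, 9, 12, 10]  # invented: hit counts close together
scattered = [4, 20, 9, 18, 5]     # invented: hit counts all over the place

q_consistent = homogeneity_q(consistent, trials)
q_scattered = homogeneity_q(scattered, trials)
p_consistent = monte_carlo_p(consistent, trials)
p_scattered = monte_carlo_p(scattered, trials)

print(f"consistent: Q = {q_consistent:.2f}, p = {p_consistent:.3f}")
print(f"scattered:  Q = {q_scattered:.2f}, p = {p_scattered:.4f}")
```

With k studies, Q should come out at around k - 1 if the hit-rate really is the same everywhere. Note that the pooled hit-rate is estimated from the data, so the test does not care whether that common rate is 25% or something else.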

The ganzfeld results are heterogeneous. The hit-rates found in the individual experiments are more varied than random chance would lead one to expect.

It is easy to see why inconsistency is a great problem for the ganzfeld in particular. If someone makes a mistake in implementing the standard ganzfeld design this may bias the results. It is unlikely that different people in different locations all make the same mistake. It’s more likely that only some results will be biased, and to different degrees. And that’s something that would pop up as heterogeneity.

Heterogeneity is a clear warning signal that something may have gone wrong.

Of course, undetected mistakes are not the only possible cause for heterogeneity. There are always variations between different experiments which might also explain differences. A medical example is when patients in some studies are sicker, or when some receive a higher dose of a drug, and so on.
If a robust explanation for the heterogeneity can be found then one need not downgrade. Preferably one would perform a subgroup analysis, which means that one splits the studies into different groups that are themselves not heterogeneous.

Parapsychologists speculate that something analogous is responsible for the heterogeneity in the ganzfeld. Some experiments used different target types while others recruited ‘creative’ participants who are supposed to be especially good at the task.
That does not rise to the level of a robust explanation, though. In medicine, one could invoke genetic differences, differences in food and climate, and any number of such things but that’s just hand-waving.
Future research may or may not uncover the causes of the heterogeneity but right now we simply lack any sort of robust explanation.

The way things are, downgrading for heterogeneity is a must.

Reasons for upgrading

One might upgrade the quality of evidence from an observational trial if the effect is especially large, if there is a dose-response relationship, or if all plausible confounding variables would work against the effect.

A large effect is defined as a relative risk (RR) of over 2, which for us means a hit rate of over 50%. That is clearly not the case.
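The arithmetic behind that is short. This sketch assumes that the 25% chance hit rate plays the role of the baseline risk, and it takes 12 or 13 hits in 40 trials as a stand-in for the claimed effect. Both are my reading for illustration rather than anything GRADE prescribes.

```python
def relative_risk(observed_rate, baseline_rate):
    """Ratio of the observed rate to the baseline rate."""
    return observed_rate / baseline_rate

chance_rate = 0.25   # baseline: hit rate expected from guessing
rr_threshold = 2.0   # cut-off for a "large effect" as quoted in the text

# Hit rate that would be needed to reach the threshold.
hit_rate_needed = rr_threshold * chance_rate

# Assumed stand-in for the claimed effect: 12 or 13 hits in 40 trials.
rr_low = relative_risk(12 / 40, chance_rate)
rr_high = relative_risk(13 / 40, chance_rate)

print(f"hit rate needed for RR of 2: {hit_rate_needed:.0%}")
print(f"RR at 12/40 hits: {rr_low:.2f}")
print(f"RR at 13/40 hits: {rr_high:.2f}")
```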

A dose-response relationship is clearly not present. There is nothing equivalent either because that implies knowledge of some underlying mechanism, which we do not have by definition.

The last one regarding confounders may be a bit confusing, so I’ll quote an example.

In another example of this phenomenon, an unpublished systematic review addressed the effect of condom use on HIV infection among men who have sex with men. The pooled effect estimate of RR from the five eligible observational studies was 0.34 [0.21, 0.54] in favor of condom use compared with no condom use. Two of these studies that examined the number of partners in those using condoms and not using condoms found that condom users were more likely to have more partners (but did not adjust for this confounding factor in their analyses). Considering the number of partners would, if anything, strengthen the effect estimate in favor of condom use.

-GRADE guidelines: 9. Rating up the quality of evidence by Guyatt et al. in JCE Vol. 64

That’s it for now. Still a lot of ground to cover…


[sound of sighing]

The May issue of The Psychologist carries an article by Stuart Ritchie, Richard Wiseman and Chris French titled Replication, Replication, Replication plus some reactions to it. The Psychologist is the official monthly publication of The British Psychological Society. And the article is, of course, about the problems the 3 skeptics had in getting their failed replications published.

Yes, replication is important

That the importance of replications receives attention is good, of course. Depositories for failed experiments are important and have the potential to aid the scientific enterprise.

What is sad, however, is that the importance of proper methodology is largely overlooked. Even the 3 skeptics who should know all about the dangers of data-dredging cavalierly dismiss the issue with these words:

While many of these methodological problems are worrying, we don’t think any of them completely undermine what appears to be an impressive dataset.

But replication is still not the answer

I have written before about how replication cannot be the whole answer. In a nutshell, by cunning abuse of statistical methods it is possible to give any mundane and boring result the appearance of showing some amazing, unheard-of effect. That takes hardly any extra work, but experimentally debunking the supposed effect is a huge effort. It takes more searching to be sure that something is not there than to simply find it. For statistical reasons, an experiment needs more subjects to “prove” the absence of an effect with the same confidence as finding it.
There is also the possibility that some difference between the original experiment and the replication explains the lack of effect. In this case it was claimed that maybe the 3 failed because they did not believe in the effect. It takes just seconds to make such a claim. Disproving it requires finding a “believer” who will again run an experiment with more subjects than the original.

Quoth the 3 skeptics:

Most obviously, we have only attempted to replicate one of Bem’s nine experiments; much work is yet to be done.

It should be blindingly obvious that science just can’t work like that.

There are a few voices that take a more sensible approach. Daniel Bor writes a little about how neuroimaging, which has, or had, extreme problems with useless statistics, might improve by fostering greater expertise among the practitioners. Neuroimaging seems to have made methodological improvements. What social psychology needs is a drink from the same cup.

The difficulty of publishing and the crying of rivers

On the whole, I find the article by the 3 skeptics to be little more than a whine about how difficult it is to get published, hardly an unusual experience. The first journal refused because they don’t publish replications.
Top journals are supposed to make sure that the results they publish are worthwhile. Showing that people can see into the future is amazing; not being able to show that is not. Back in the day, there was simply a limited number of pages that could be stuffed into an issue. These days, with online publishing, there is still the limited attention of readers.
The second journal refused to publish because one of the peer-reviewers, who happened to be Daryl Bem, requested further experiments to be done. That’s a perfectly normal thing and it’s also normal that researchers should be annoyed by what they see as a frivolous request.
In this case, one more experiment should have made sure that the failure to replicate wasn’t due to the beliefs of the experimenters. The original results published by Bem were almost certainly not due to chance. Looking for a reason for the different results is good science.

I’ve given a simple explanation for the obvious reason here. If the 3 skeptics are unwilling or unable to actually give such an explanation they are hardly in a position to complain.

Beware the literature

As a general rule, failed experiments have a harder time getting published than successful ones. That’s something of a problem because it means that information about what doesn’t work is lost to the larger community. When there is an interesting result in the older literature that seems not to have been followed up on, then it is probably the case that it didn’t work after all. The original report was a fluke and the “debunking” was never really published. Of course, one can’t be sure that it was not simply overlooked, which is a problem.
One must be aware that the scientific literature is not a complete record of all available scientific information. Failures will mostly live on in the memory of professors and will still be available to their ‘apprentices’ but it would be much more desirable if the information could be made available to all. With the internet, this possibility now exists, and the discussion about such means is probably the most valuable result of the Bem affair so far.