The Ganzfeld Experiments: Quality – Part 3

I have touched on a number of issues, so far, and parapsychologists have not been silent on them either. So, now I’m going to address a few things that seem especially pertinent.

Tressoldi and GRADE

I am not the first to think that the GRADE approach could shed light on the solidity of parapsychological evidence. Patrizio E. Tressoldi mentions the GRADE approach in Extraordinary claims require extraordinary evidence: the case of non-local perception, a classical and Bayesian review of evidences.
Unfortunately “non-local perception” is not defined in the article. A connection to certain quantum physical phenomena is made but there is no explanation of the relationship. Most importantly, there is no explanation of how any of that relates to the experimental data.
These are the same fatal flaws that regrettably are the norm rather than the exception in parapsychology but that’s not of importance here.

The experimental data consists of data from several previous meta-analyses which are reexamined using different statistical methods. There is no attempt made to apply the GRADE guidelines. The quality of evidence is not evaluated in any way whatsoever.
Tressoldi simply asserts that the evidence should be High Quality by the GRADE standard, which has the unfortunate potential to leave a rather misleading impression. A normal reader, not bothering to track down all the references, might be led to believe that the simple statistical exercises performed constitute the GRADE approach.

Such bad scholarship is best, and most politely, ignored and forgotten. Yet, I would have been remiss not to mention it in this context.

How not to assess evidence

One approach to quality in the ganzfeld debate has been to use quality scales. This method is explicitly discouraged.

1. Do not use quality scales

Quality scales and resulting scores are not an appropriate way to appraise clinical trials. They tend to combine assessments of aspects of the quality of reporting with aspects of trial conduct, and to assign weights to different items in ways that are difficult to justify. Both theoretical considerations and empirical evidence suggest that associations of different scales with intervention effect estimates are inconsistent and unpredictable.

The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials

A quality scale is like a checklist of things that are thought to bias the experiment. For each item that is checked, a point value is assigned, and the sum then gives the quality of the study as a single value. It's this part, expressing the quality in a single value, that is problematic. We'll see why in a moment.

Quality scales have been used on multiple occasions within parapsychology, and more than once on ganzfeld studies. The way things are done in parapsychology is that the studies are first rated for quality. Then it is checked whether there is a correlation between quality and effect size. That means one looks at whether studies of lower quality have a higher hit-rate, on average, than studies of high quality.

The correlation is typically weak and non-significant, which is supposed to show that the results are not due to quality problems. The argument is quite curious on its face because one would think that parapsychologists, of all people, would understand that absence of evidence does not equal evidence of absence.

Medicine has standardized quality scales and even so, it is found that these scales may give contradictory answers. So when you fail to find a correlation between a particular scale and outcome, that may simply mean that you used the wrong scale. And when you find one, well… Try enough scales and you will find a correlation just by chance.
The problem is especially acute in parapsychology where there are no standard scales. The scales are simply made up on the spot and never reused.

An Example

I’ll use Meta-Analysis of Free-Response Studies, 1992–2008: Assessing the Noise Reduction Model in Parapsychology by Storm, Tressoldi and DiRisio as an example for a closer look at the problem of quality scales.

The first item on the scale is this:

appropriate randomization (using electronic apparatuses or random tables),

Randomization is obviously important. If the target is not randomly selected, then we are certainly not justified in thinking that there is a 1 in 4 chance of guessing correctly. If the target was selected, just for example, based on how visually appealing it is, then it would not be surprising to find a higher hit-rate.
However, there is a long story behind this. We'll get back to this item.

random target positioning during judgment (i.e., target was randomly placed in the presentation with decoys),

Obviously, if you always place the correct target in the same place, that’s really bad. Even if no one figures out the correct place, it offers a ready explanation for any especially high or low hit-rate.

If you present people with a list of items and ask them to pick one at random, they will prefer some simply based on their position in the list. Of course, that's only true as long as there aren't some overriding considerations, and it's only true on average, but the fact is that people aren't very random.

That’s one the more important findings to come out of psychological science. How so? Think commercial web-sites, super-market shelves, etc…

It’s not actually necessary to randomize the order of the choices every time, though. For example, if you always had the same 4 images in the same order, and simply designated one at random as the target, then that would work as well.

In a way, this is an odd item. If all the experimenters are blind and the target selection is random, then there should be no need to explicitly randomize the positioning, because it would already be random by necessity.

The further items are these:

• blind response transcription or impossibility to know the target in advance,
• sensory shielding from sender (agent) and receiver (perceiver),
• target independently checked by a second judge, and
• experimenters blind to target identity.

I won’t pick them apart in detail.

All of these items could potentially bias the hit-rate but, and that's the problem, we don't know to what degree or even if they do at all.
Take sensory shielding: that's an absolute must for any ganzfeld experiment. If an article fails to specifically mention sensory shielding, then this can only be an omission in the report, not necessarily a sign that the shielding was insufficient. On the other hand, if it is mentioned, it is not knowable from the report whether it was truly sufficient.

For the sake of the argument, imagine that one of the items will always lead to a higher (or lower) hit-rate and the rest do nothing. Then you will have studies that were rated as high-quality that are still biased, and "low-quality" studies that are unbiased.
Now you look for a correlation. Do the high-quality studies have a different hit-rate?
Strictly speaking, you should still expect a slight difference because the biased studies can never, assuming perfect reporting, be top-quality. So there will be somewhat fewer biased studies among the high-quality ones, but the true extent of the bias will be hidden because you are basically mixing apples and oranges.

Basically, when you use a quality scale in this way, you are implicitly assuming that all your potential biases have exactly the same effect, and nothing more. The more items with no effect you put into the scale (and the more items with a real effect you leave off), the less likely you are to find any correlation between effect and quality rating.
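To make the dilution concrete, here is a minimal simulation with entirely invented numbers: a hypothetical database of 60 studies, each rated on six items, where only the first item actually biases the hit-rate (failing it inflates the true rate from 25% to 35%) and the other five do nothing. The point is only how the summed score washes out the one real signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_studies, n_trials, n_items = 60, 40, 6

# Each item is "passed" (1) or "failed" (0) at random for each study.
items = rng.integers(0, 2, size=(n_studies, n_items))

# Hypothetical assumption: only item 0 matters. Failing it inflates the
# true hit probability from 25% to 35%; the other five items do nothing.
true_p = np.where(items[:, 0] == 1, 0.25, 0.35)
hits = rng.binomial(n_trials, true_p)
hit_rate = hits / n_trials

# Summed quality score vs. hit-rate: the single real signal is diluted
# by five irrelevant items plus ordinary binomial noise.
score = items.sum(axis=1)
print("summed score vs hit-rate: r =", round(np.corrcoef(score, hit_rate)[0, 1], 2))

# Item-by-item correlations: only item 0 should stand out, on average.
for i in range(n_items):
    r = np.corrcoef(items[:, i], hit_rate)[0, 1]
    print(f"item {i} vs hit-rate: r = {r:+.2f}")
```

With a database this small, even the one genuinely biasing item often produces only a weak, non-significant correlation on any given run, which is a second reason why failing to find a quality–effect correlation proves very little.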

It would be far more relevant to look for a correlation between each item individually and the hit-rate. This would allow parapsychologists to identify potential biases and remedy them. I fear that such undertakings are unlikely to happen. Such a thing is contrary to the culture of parapsychology.
Parapsychology is focused on showing that something cannot possibly have happened in any known way, including by error. In order to study the impact of biases, it would first be necessary to acknowledge that error can never truly be ruled out. Acknowledging that would render the entire parapsychological method moot.

And that leads us back to the first item.

Manual vs. Automatic Randomization

A couple of mainstream scientists (i.e. scientists not part of the usual handful of skeptics and believers) had a look at the database that Storm and his colleagues created in the previously mentioned paper.
In the main, they reanalysed it using Bayesian methods but that’s a whole ‘nother can of worms.

They obtained the full database from Storm et al., which contained not only the cumulative quality score but also the individual item ratings. It turned out that the experiments using manual randomization had much better hit-rates than those using automatic randomization.

Here’s the relevant figure from their paper:
[Figure 1 from Rouder, Morey & Province (2013), showing individual study results separated by manual vs. automatic randomization]
Rouder, J. N., Morey, R. D., & Province, J. M. (2013): A Bayes factor meta-analysis of recent extrasensory perception experiments: Comment on Storm, Tressoldi, and Di Risio (2010). Psychological Bulletin

As you can see, the studies cluster around the expected chance hit-rate but some studies just run off. And that is particularly true for the manually randomized studies.
What this indicates, on its face, is that manual randomization is associated with a considerable risk of bias. The size of the bias is not the same across all studies but that’s just what you’d expect. However, clearly that does not explain all high scores.

In reply, the original authors pointed out that a few studies had been mis-rated (while glossing over the fact that the errors were largely their own – classy!).

They still found that there was a significant difference in effect size between the two groups, with a p-value of 1.9%. That means that if you randomly split the database into two groups and then compare the hit-rates, a difference at least as large as the one found will turn up only about 1 in 50 times.
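For readers who want to see what such a test amounts to in practice, here is a rough permutation-test sketch. The per-study hit counts below are invented placeholders, not the Storm et al. database; the point is only the logic of "randomly split the studies and compare".

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder (hits, trials) pairs for two groups; NOT the real database.
manual    = [(14, 40), (12, 30), (18, 40), (11, 25)]
automatic = [(9, 40), (11, 40), (13, 50), (10, 40), (12, 45)]

def pooled_rate(studies):
    hits = sum(h for h, n in studies)
    trials = sum(n for h, n in studies)
    return hits / trials

observed = pooled_rate(manual) - pooled_rate(automatic)

# Shuffle the group labels many times and count how often a random split
# produces a difference at least as large as the observed one.
studies = manual + automatic
n_manual = len(manual)
n_perm, count = 10_000, 0
for _ in range(n_perm):
    perm = rng.permutation(len(studies))
    group_a = [studies[i] for i in perm[:n_manual]]
    group_b = [studies[i] for i in perm[n_manual:]]
    if abs(pooled_rate(group_a) - pooled_rate(group_b)) >= abs(observed):
        count += 1

print(f"observed difference in hit-rate: {observed:.3f}")
print(f"permutation p-value: {count / n_perm:.3f}")
```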

This finding is rather suggestive but note that this is far from solid evidence. Bear in mind that many things that limit our ability to draw firm conclusions from the ganzfeld studies are also present here.

For one, it’s possible that there are confounding variables. Maybe it is not about the randomization at all, but about something else that people who chose one method also did differently.
And also, it may just be a false positive, a random association. Such a difference may only be found 1 in 50 times, but this 1 time may just have been this time.

There are two things that add credibility to thinking that this points to a bias due to improper randomization. For one, Rouder and colleagues did not go about ‘data-mining’ for some association. They had a limited number of factors that they “inherited” from Storm and colleagues.
These factors, in turn, were certainly not the result of some data-mining either. They came up with a limited number of factors that they thought might indicate the presence of bias and then had the database rated for these factors.
That’s the second thing that adds credibility. It is not some correlation we have simply noticed. We know how improper randomization can improve the hit-rate and that was why this was looked at in the first place.

Still, even so, the finding is of limited usefulness to us because that particular database consisted of both ganzfeld and non-ganzfeld studies, and not even all ganzfeld studies.

Storm and his colleagues offer some rather curious counter-arguments.

For one, they point out that the z-scores were not significantly different, but that's just statistical nonsense. The z-score is the p-value expressed in terms of standard deviations, so it's basically a measure of how improbable a certain number of hits in a given number of trials is under chance.
The z-score depends both on the hit-rate and the number of trials. A 40% hit-rate in a 10-trial study gives a low z-score while the same hit-rate in a 100-trial study gives a high z-score. That is because such a high hit-rate happens often, by chance, in a 10-trial study but rarely in a 100-trial study.
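As a quick illustration (just the usual normal approximation to the binomial, nothing specific to any particular ganzfeld paper):

```python
from math import sqrt

def z_score(hits, trials, p=0.25):
    """Binomial z-score against a chance hit probability p."""
    expected = trials * p
    sd = sqrt(trials * p * (1 - p))
    return (hits - expected) / sd

print(z_score(4, 10))    # 40% of 10 trials  -> about 1.1
print(z_score(40, 100))  # 40% of 100 trials -> about 3.5
```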

So basically, if one group of studies consists of smaller studies then they will have relatively low z-scores even if they have a much higher hit-rate on average. I can’t think of any way in which testing for a difference in z-scores makes sense.

The other counter-argument is that… well, I'll just quote them:

No argument is presented as to (a) why the use of random number tables is problematic or (b) why [automatic randomization] means both "true" randomization caused by radioactive decay, as in state-of-the-art random number generators (see Stanford, 1977), and pseudorandomization with computer algorithms, but not one or the other.

Wait… what?

One wonders, then, why they used it for their scale. Can it be that Storm et al. have already forgotten who came up with this criterion?
That's not to say that what they point out is wrong. It's true that there is no reason to think that the randomization in one group is necessarily bad, or necessarily good in the other. That is exactly a problem with their quality scale in particular, which they have apparently just disowned.
However, it is exactly the fact that one finds a difference that retrospectively validates this item.

The Ganzfeld Experiments: Quality – Conclusion

Previously, I assessed the quality of evidence provided by the ganzfeld experiments. I found that the typical ganzfeld experiment could only be considered to yield Moderate evidence, and further that it was necessary to downgrade the entire body of studies at least once, for heterogeneity.
That leaves the overall quality of evidence for the ganzfeld trials as Low. There is no way that I could justify any higher grade but one could certainly justify a lower one. One could justify a double downgrade for the heterogeneity, on account of the serious implications. One could also justify downgrading for publication bias. And then, I didn’t look in detail at the individual studies, which could only uncover reasons for downgrading, as I found that there was no reason to upgrade.
When two (or more) factors are borderline, one should downgrade for at least one.

Put like that, calling the evidence of Low Quality is a very favorable assessment.

The best argument for a better grade is claiming that the ganzfeld design as it is should be regarded as High Quality, like a medical RCT. I’ve already laid out why I don’t agree with that. It would simply lead to another borderline case and at some point you can’t ignore all these borderline calls and must downgrade for at least one of them.

But what does that mean?

Quality level – Current definition
High – We are very confident that the true effect lies close to that of the estimate of the effect
Moderate – We are moderately confident in the effect estimate: The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different
Low – Our confidence in the effect estimate is limited: The true effect may be substantially different from the estimate of the effect
Very low – We have very little confidence in the effect estimate: The true effect is likely to be substantially different from the estimate of effect

Note well what this does not mean. It does not mean that there is no effect. It means that we can have no confidence that there is one. But equally it means that we can have no confidence that there is none.

And that simply means that everyone will retain whatever opinion they had beforehand which leads us to another curious feature of parapsychology in general.
Parapsychologists say that the hit-rate should be 25%. Any conventional cause for a deviation is not of interest and should be regarded as a bias. The basic ganzfeld design has been intensely scrutinized for any such potential bias and modified to rule it out.
This puts parapsychologists into the position to make a solid and credible argument that the hit-rate must be 25% by any conventional expectation. And it is that which lends credence to the argument that any systematic deviation, any effect, must be due to something amazing, that some worthwhile scientific discovery is waiting there.

Unfortunately, the sheer solidity of the theoretical argument means that few mainstream scientists will be swayed by low quality evidence. Curiously, many vocal parapsychologists seem unable to understand this.
They accuse mainstream science of being "dogmatic", and yet the failure to convince the mainstream with low quality evidence stems precisely from the solidity of their own argument that, by all prior evidence, the hit-rate must be 25%.

  1. Parapsychologists work hard and convince people that the hit-rate should be 25%.
  2. Parapsychologists accuse people of being dogmatic for believing it.

It’s one of those things about parapsychology that does not make the slightest bit of sense. Such displays of irrationality are ultimately responsible for parapsychology’s bad reputation. Low quality evidence is normal enough. That’s why there is such a thing as the GRADE approach.
If someone appears irrational, you probably won't attempt a rational dialogue. And if you try anyway and your open-mindedness is rewarded with accusations of dogmatism and even dishonesty, then you probably give up.
It is that which leads mainstream science to, for the most part, shun parapsychology. Which then leads prominent parapsychologists to double down and declare that there is a “taboo” against dealing with them. But that’s a different matter.

Does GRADE work?

That’s a very good question. I hope it occurred to you.

One thing we would like to know is how reliable the assessment is. How much agreement is there between different raters? And the answer is: Not as much as we’d like. There is human judgement involved in the rating which is one reason that the GRADE approach demands transparency.
I have tried my best to make the reasoning as clear as possible and have already discussed where others might differ in their assessment.

The other thing we would like to know is how solid the conclusion is. Say you have 1,000 different apparent effects, each based on evidence rated Low or Very Low. How many of those effects would really turn out to be substantially different?
The answer: No one knows, yet.

In relation to the ganzfeld, however, we can say that the assessment would have been spot on. I've talked about a 33% hit-rate because that is often claimed but, in truth, the hit-rate has varied wildly. When some of the earliest experiments were analyzed in 1985, a hit-rate of 37% was obtained; when studies from between 1987 and 1997 were analyzed, the hit-rate was only 27%.
In the latter case it was, of course, the parapsychologists who were not impressed and argued that this was due to certain specific biases. That’s something for a later post.

Conclusion

So eventually we find that the ganzfeld evidence is of Low Quality but that should not come as a surprise to anyone.
The more important lesson is probably that this is so according to the standards of mainstream medical science. Other sciences may have a lower standard; I'm thinking of psychology in particular. Indeed, the psychologist Richard Wiseman has asserted that the ganzfeld is solid by the standards of his field but, as far as I can tell, his colleagues are, on the whole, not particularly impressed, which would seem to contradict his assessment.
In any case, accusations of a double standard clearly lack merit.

What I find more worrying are the problems that parapsychology has in interpreting the evidence and drawing supportable conclusions, regardless of quality considerations. Low Quality evidence is not unusual, but the irrationality surrounding the whole issue is.

What If: High Quality Evidence?

I think many parapsychologists have very unrealistic expectations about that. Remember that all that could be concluded from the ganzfeld experiments is the presence of some unexplained effect causing the chance expectation to be wrong.
High quality ganzfeld evidence would just indicate that there is probably something worth studying there. Some scientists would become interested enough to look into it. Most would simply be too busy with whatever they are already doing.
The interested scientists would then start out by repeating the original, standard ganzfeld experiment to recreate the effect in their own lab. And then, once they had succeeded in that, they would study the effect. If they found themselves unable to recreate the effect, they would simply give up. If you can't create an effect, you can't study it, even if you are convinced it exists.
And that’s all that would happen.

The situation currently, with low quality evidence, is not fundamentally different! It just means that fewer people are going to think they can recreate the effect or that the effect is due to something worthwhile.
This idea that high quality evidence for psi would lead to some sort of “paradigm shift” because of some single experiment is just nonsense. That kind of thing has never happened before and I don’t see how it could happen even in principle.

While this concludes the GRADE business, this does not conclude the quality series. There are some more issues we need to talk about, such as what parapsychologists had to say about all this.

Examples of mishaps

I want to give some examples of things that actually went wrong in the ganzfeld experiments. I hope it may illustrate how these vague biases may look “on the ground”.
Do not take these examples as a reason to dismiss the experiments. You can take the Low Quality as a reason for that, but these examples are just, you know, life. Things don't go perfectly.
Parapsychology doesn’t stand apart in that respect.

These results differ slightly from those reported earlier since an independent check of our database by Ulla Böwadt found an extra hit in study V. Two trials in study IV (a hit and a miss) had also been included although the experimenters apparently were not agreed on this prior to the results. Their exclusion would make however virtually no effect on the final figures.
A review of the ganzfeld work at Gothenburg University by Adrian Parker, 2000

This shows how individual trials may simply fall through the cracks. It would have been completely justifiable not to include those. One has to wonder if the media demonstration, in particular, was conducted with the same diligence as the regular trials.
Somewhat similar problems are known in medicine. In a medical study, patients may drop out; they quit the experiment. One must suspect that it will usually be those who are disillusioned with the offered treatment, or, perhaps, those who see no need to continue because they feel cured. In either case, this will bias the results. This so-called attrition is considered by GRADE.
Another thing this is similar to is transcription errors. You probably won't be surprised to learn that people have actually been maimed and killed because of doctors' illegible hand-writing, but it's also a problem in science. Bias may be introduced into a study simply through faulty data entry. Published values have had to be corrected on occasion in ganzfeld experiments, and particularly in meta-analyses.

 

An amusing, recent example is an analysis published in 2010: Meta-analysis of free-response studies, 1992-2008: assessing the noise reduction model in parapsychology by Storm L, Tressoldi PE, Di Risio L.
The article is more full of errors than I care to point out, but this is just about one of them (one of the less embarrassing ones).
As part of their analysis they rated the ganzfeld experiments for quality. What they did wrong there is for a later post. The details of the rating were obtained by a couple of scientists, Jeffrey N. Rouder, Richard D. Morey and Jordan M. Province. They argued that improper randomization could explain a part of the results. More on that later.
One of the counter-arguments by the original authors was that the ratings, obtained from them, contained errors!

 

After about 80% of the sessions were completed, it was becoming clear that our hypothesis concerning the superiority of dynamic targets over static targets was receiving substantial confirmation. Because dynamic targets contain auditory as well as visual information, we conducted a supplementary test to assess the possibility of auditory leakage from the VCR soundtrack to R. With the VCR audio set to normal amplification, no auditory signal could be detected through R’s headphones, with or without white noise. When an external amplifier was added between the VCR and R’s headphones and with the white noise turned completely off, the soundtrack could sometimes be faintly detected.
Psi Communication in the ganzfeld: experiments with an automated testing system and a comparison with a meta-analysis of earlier studies by Honorton et al, 1990

This means that the receiver (R) may have heard the sound of the correct target, which certainly would have allowed him or her to make the correct guess. That's potentially serious. The counter-argument, however, sounds quite convincing as well: there was no drop in the hit-rate after the sound system was modified to rule that out.

That’s certainly quite suggestive but mind that it is not high quality evidence. We have a bunch of trials conducted before the sound system was fixed and a bunch afterwards but there is no direct, randomizzed comparison.
And what does the unchanging hit-rate indicate anyway? Maybe they just failed to remove the problem with the modification.

For what it’s worth, when that lab closed the equipment was moved elsewhere where it was used by different experimenters. They were unable recreate the effect.
You could take that as evidence that maybe the sound system played a role but once again: Low quality evidence. There certainly were other documented potential biases in the experiments at that lab which may not have been present at the new location.

The Ganzfeld Experiments: Quality – Part 2

The GRADE approach

One begins by reviewing each single study and rating it for quality. Based on that one will decide the overall quality of the collection of studies.

The GRADE approach, adopted by The Cochrane Collaboration, specifies four levels of quality (high, moderate, low and very low) where the highest quality rating is for a body of evidence based on randomized trials. Review authors can downgrade randomized trial evidence depending on the presence of five factors and upgrade the quality of evidence of observational studies depending on three factors.
-Cochrane Handbook Section 12

Randomized trials start out as high quality evidence and observational trials as low quality. Unfortunately, the ganzfeld is not a randomized trial in the medical sense as it lacks a control group. A randomized trial in medicine is one where patients are randomly assigned to a group.
Observational evidence is something you have when that is not possible. You can’t, for example, randomly tell people to smoke or not. You can only observe people who chose to do that or not. The problem is so-called confounding variables. People who smoke might also make other poor health choices, for example.

The typical ganzfeld experiment is neither. It could be, and has been, modified to be either of them.
If you were to randomly tell senders to sneak off, without anyone knowing, so that you had two groups, one with a sender and one without, that would be a randomized, placebo-controlled trial and definitely high quality evidence for the influence of a sender.
If, on the other hand, you simply compared trials where the sender just did not happen to show up, to those where he or she did, then you would have observational evidence.
Both of these things have been done but we’re not interested in that, for now.

Experiments like the ganzfeld are not explicitly considered by the Cochrane Collaboration. It seems to me clearly superior to an observational study because the experimenter has full control over what goes on.
On the other hand, the absence of a control group is a serious problem. It means that there is no way of knowing if the experimenters really managed to rule out all biases, which in this case means all conventional means of communication. The design of the experiment may not have any apparent biases but the implementation may be faulty. This is especially problematic because we are at the same time facing evidence suggesting that something was going on after all. We have no means of identifying the cause of the apparent effect.

The average ganzfeld experiment comprises a mere 40 trials. A trial, in relation to the ganzfeld, means a single attempt at guessing, rather than an actual study with many attempts as in medicine. I hope that's not too confusing but I can't change it.
That means that, if something is going wrong, the experimenter has relatively few chances of finding it. We would expect 10 hits just by chance. The claimed effect means that you should expect a mere 2 or 3 hits on top of that. That’s not a lot…
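To put numbers on that (standard binomial arithmetic, nothing specific to any particular study):

```python
from math import sqrt

trials, p_chance = 40, 0.25
expected = trials * p_chance                     # 10 hits expected by chance
sd = sqrt(trials * p_chance * (1 - p_chance))    # about 2.7 hits

claimed_rate = 0.33                              # the often-claimed hit-rate
extra_hits = trials * (claimed_rate - p_chance)  # about 3 extra hits

print(f"expected hits: {expected:.0f}, sd: {sd:.2f}, extra hits at 33%: {extra_hits:.1f}")
print(f"that is only about {extra_hits / sd:.1f} standard deviations in a single study")
```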

Indirectness

Regardless of the level of evidence at which a study starts out, there are factors for which one downgrades or upgrades the quality. One reason to downgrade is indirectness of evidence.

An example of indirectness is when one study compares drug A to placebo and another compares drug B to placebo. You could use those results to compare drugs A and B, but that would mean a downgrade of the quality of evidence. That is, even if both studies were High Quality, you would only be dealing with Moderate Quality evidence for which drug is better than the other.

There are other types of indirectness which don’t need to bother us.

The relevance for the ganzfeld is that we only have indirect evidence that the hit rate should be 25%. This is based on a number of assumptions such as that the random sequence is really random, or that there is no sound (or other sensory) leakage between sender and receiver and so on.

All of these assumptions can be tested and well justified but the fact remains that it is indirect evidence.

The bottom line is that the typical ganzfeld experiment cannot be regarded as the equal of a randomized trial in medicine. We are starting out with Moderate Quality.

Other factors to consider

There are other factors which should result in downgrading the quality of an individual study. These would merit downgrading some individual experiments but I do not think that such factors are prevalent enough to globally downgrade the ganzfeld experiments as a whole.

One clear-cut example would be 'selective outcome reporting bias'. Remember that some ganzfeld studies had not just the receiver guessing at the correct target but also independent judges. This would, in principle, allow reporting only the better result. So where we do not have the hit-rate of the actual receiver, we should downgrade.

Another problematic thing is when the hit-rate is not reported. Some studies had the receiver rank the targets according to preference. This allows computing the direct hit-rate, depending on whether the correct target is ranked #1. However, it also allows turning the guess into a binary choice, where a hit is when the correct target is ranked #1 or #2 and a miss when it is ranked #3 or #4. And one may simply compute whether the average rank of the correct targets was higher than expected by chance.
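To illustrate the three outcome measures side by side, here is a tiny sketch with invented ranks (1 means the correct target was ranked first); under chance the expectations are a 25% direct hit-rate, a 50% binary hit-rate, and a mean rank of 2.5.

```python
# Invented ranks given to the correct target in 12 hypothetical trials.
ranks = [1, 3, 1, 2, 4, 1, 2, 1, 3, 2, 1, 4]
n = len(ranks)

direct_hit_rate = sum(r == 1 for r in ranks) / n        # chance: 25%
binary_hit_rate = sum(r in (1, 2) for r in ranks) / n   # chance: 50%
mean_rank = sum(ranks) / n                              # chance: 2.5

print(direct_hit_rate, binary_hit_rate, mean_rank)
```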

However, the mere fact that this was done is not a reason to downgrade. Only those studies where we do not have access to the straight hit-rate must be downgraded. We're only interested in the reliability of the evidence, and the fact that some experimenters made dubious choices is not in itself relevant to that.

On the whole, I think that the early studies would have to be downgraded further as a body, but not the later ones. This is something we will consider when we get to looking at actual results.

Normally one must not make such summary judgements. One should consider each study separately and give a reason why one downgrades or not. Transparency is vital because quality assessments always involve a degree of subjectivity.

But for now, let’s just remember that we have this open issue and move on.

Publication Bias

It’s generally the case in all sciences that “successful” experiments are more likely to be published than unsuccessful ones. This is quite understandable. You are more likely to tell people about the unexpected than the expected, about what happened rather than what did not happen.

The ganzfeld experiments should have a hit rate of 25%. What would happen if no one ever published experiments with a hit rate lower than that?
The published studies would then, on average, have a hit rate well over 25%. Of course, it doesn't have to be that extreme. Every study that is not available will distort the average of the rest.
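A toy simulation with invented numbers shows how large the distortion would be in that extreme case: a thousand 40-trial studies with a true hit probability of 25%, of which every below-chance result lands in the file drawer.

```python
import numpy as np

rng = np.random.default_rng(2)
n_studies, trials = 1000, 40

hits = rng.binomial(trials, 0.25, size=n_studies)
rates = hits / trials

# Extreme file drawer: every below-chance result goes unpublished.
published = rates[rates >= 0.25]

print(f"true mean hit-rate:      {rates.mean():.3f}")
print(f"published mean hit-rate: {published.mean():.3f}")
```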

Publication bias is notoriously difficult to detect because it results from studies that one doesn’t know anything about, not even if they exist. There are statistical methods to detect it but they are far from perfect. The problem with them is that they rely on certain assumptions about what the unbiased effect is like, and which results are most likely to remain unpublished. I won’t discuss these methods in detail for now. That’s for another time.

In general, review authors and guideline developers should consider rating down for likelihood of publication bias when the evidence consists of a number of small studies. The inclination to rate down for publication bias should increase if most of those small studies are industry sponsored or likely to be industry sponsored (or if the investigators share another conflict of interest).

GRADE guidelines: 5. Rating the quality of evidence—publication bias by Guyatt et al.

On its face, this suggests that we should rate down. The studies are all small and it can be argued that a conflict of interest is present, at least among some labs. I am not sure, though, if it is warranted to assume that whatever interests parapsychologists have, this would show in the form of publication bias.

Let’s look at counter-arguments for now.

Arguments against publication bias

One argument is that the Parapsychological Association Council adopted a policy opposing the selective reporting of positive outcomes in 1975.

Unfortunately, this argument is empirically not tenable. Research on publication bias in mainstream science suggests that the cause lies with authors not bothering to write up negative studies, rather than with journals refusing to publish them. So such a policy should not have an effect.
There is evidence that this is indeed the case. A researcher sent a survey to parapsychologists asking whether they knew of any unpublished ganzfeld experiments and what the results were. All in all, 19 unpublished studies were reported. (The extent of selective reporting of ESP ganzfeld studies by S. Blackmore, 1980)

This leads to the next argument. The proportion of significant studies among published and unpublished studies was similar (37% vs 58%). This is supposed to indicate that there was no bias to publish only the successful ones. Obviously this argument is nonsense.
What matters is whether the average hit-rate among the unpublished studies is systematically different. If that is the case, then the average of the published studies will be biased. The fact that there are more significant studies among those published points in that direction, regardless of whether the fractions are 'similar'.
In short, using that survey to argue that there was no publication bias is just Orwellian.

However, we must bear in mind that statistical significance tells us little about the hit-rate. Whether a study is significant depends both on the hit-rate and the number of trials in that study. If the number of trials per study in the unpublished studies is comparable to that in the published ones, then the hit-rate should be assumed to be lower. If the number of trials per study is much smaller, then the hit-rate may be the same or even higher.
For our purposes, we do not need solid evidence that the result is biased. We should downgrade whenever there is a high probability of publication bias.

That means that the early studies would definitely have to be rated down.
The later studies are a different matter, though. The results don't look as if they were overly influenced by publication bias. That's something for later.

For now, I won’t downgrade. Others would. It is a borderline case.

Inconsistency

If results from different studies are inconsistent with each other, then rating down is a must. One method to diagnose inconsistency is by applying a statistical test for heterogeneity.

Let’s say that the true hit-rate were always 25%. Because that is just a probability, few experiments would actually average an actual hit-rate of 25%. Most experiments would have a hit-rate close to that and few further away from that. In the same way, few people are of average height, but most are around it with few being very tall or very small.

A test for heterogeneity finds out if the hit-rates are distributed as one would expect if it is always the same, no matter what it is. One should still find this pattern even if something causes the true hit-rate to be 33% or whatever, as long as it is the same in all experiments.

The ganzfeld results are heterogeneous. The hit-rates found in the individual experiments are more varied than random chance would lead one to expect.
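One common way to test for this is a chi-square test of homogeneity (Cochran's Q): weight each study's squared deviation from the pooled hit-rate by its precision and compare the sum to a chi-square distribution. A rough sketch with invented study data, not the actual ganzfeld database:

```python
import numpy as np
from scipy.stats import chi2

# Invented (hits, trials) pairs; substitute real study data to use this.
studies = [(14, 40), (8, 40), (19, 40), (10, 30), (25, 50), (6, 40)]
hits = np.array([h for h, n in studies], dtype=float)
trials = np.array([n for h, n in studies], dtype=float)
rates = hits / trials

# Pooled hit-rate under the assumption that all studies share one true rate.
pooled = hits.sum() / trials.sum()

# Cochran's Q: precision-weighted squared deviations from the pooled rate.
weights = trials / (pooled * (1 - pooled))
Q = np.sum(weights * (rates - pooled) ** 2)
df = len(studies) - 1

print(f"Q = {Q:.1f} on {df} df, p = {chi2.sf(Q, df):.4f}")
```

A small p-value here says the spread of hit-rates is larger than a single common rate can plausibly produce; it does not say why.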

It is easy to see why inconsistency is a great problem for the ganzfeld in particular. If someone makes a mistake in implementing the standard ganzfeld design, this may bias the results. It is unlikely that different people in different locations all make the same mistake. It's more likely that only some results will be biased, and to different degrees. And that's something that would pop up as heterogeneity.

Heterogeneity is a clear warning signal that something may have gone wrong.

Of course, undetected mistakes are not the only possible cause of heterogeneity. There are always variations between different experiments which might also explain differences. A medical example is when patients in some study are sicker, or when some receive a higher dose of a drug, and so on.
If a robust explanation for the heterogeneity can be found then one need not downgrade. Preferably one would perform a subgroup analysis, which means that one splits the studies into different groups that are themselves not heterogeneous.

Parapsychologists speculate that something analogous is responsible for the heterogeneity in the ganzfeld. Some experiments used different target types while others recruited 'creative' participants, who are supposed to be especially good at the task.
That does not rise to the level of a robust explanation, though. In medicine, one could invoke genetic differences, differences in food and climate, and any number of such things but that’s just hand-waving.
Future research may or may not uncover the causes of the heterogeneity but right now we simply lack any sort of robust explanation.

The way things are, downgrading for heterogeneity is a must.

Reasons for upgrading

One might upgrade the quality of evidence from an observational trial if the effect is especially large, if there is a dose-response relationship, or if all plausible confounding variables would work against the effect.

A large effect is defined as a relative risk (RR) of over 2, which for us means a hit rate of over 50%: Clearly not the case.

A dose-response relationship is clearly not present. There is nothing equivalent either because that implies knowledge of some underlying mechanism, which we do not have by definition.

The last one regarding confounders may be a bit confusing, so I’ll quote an example.

In another example of this phenomenon, an unpublished systematic review addressed the effect of condom use on HIV infection among men who have sex with men. The pooled effect estimate of RR from the five eligible observational studies was 0.34 [0.21, 0.54] in favor of condom use compared with no condom use. Two of these studies that examined the number of partners in those using condoms and not using condoms found that condom users were more likely to have more partners (but did not adjust for this confounding factor in their analyses). Considering the number of partners would, if anything, strengthen the effect estimate in favor of condom use.

-GRADE guidelines: 9. Rating up the quality of evidence by Guyatt et al. in JCE Vol. 64

That’s it for now. Still a lot of ground to cover…

The Ganzfeld Experiments: Telepathy?

Discussions of parapsychological experiments usually focus on the quality of the alleged evidence. I will break with this tradition and begin by analyzing what conclusions are warranted if the results are as parapsychologists claim. Let’s begin by looking at the typical Ganzfeld experiment.

The typical Ganzfeld experiment

We have two experimental subjects or participants. One is called “sender” and the other “receiver”.
The sender is shown a video clip or a still image (in one case music) which was randomly selected from a collection. He or she is supposed to watch or listen to this for around 20 minutes while the receiver is "in the Ganzfeld", that is, he or she experiences a mild state of sensory deprivation. During this time the receiver just says whatever goes through their mind, and this is recorded.

Afterwards comes the judging procedure. The target is presented along with 3 decoys (though occasionally a different number was used) from the same collection of possible targets. Sometimes the judging is done by the receivers themselves, possibly with the help of an experimenter, and sometimes by an experimenter. Naturally, we are assured that the experimenters involved in the judging do not know the correct answer themselves.

Who is being tested?

That the receiver does not give the answer alone provides an immediate problem in interpreting the result. We cannot know if any correct answer was truly given by the test subject, or by someone else.

One might correctly point out that the only thing that matters is if something paranormal happens. So what if the telepathy is between the sender and the experimenter rather than the designated receiver?

Unfortunately, there are practical implications to this. It is not enough to ensure that the receiver had no way of knowing the correct answer, one must ensure this for the judges, and/or helper as well, which multiplies the problem.

Consider a situation where something like distance-dependence is to be investigated. Not only must sender and receiver be separated by the prescribed distance but also sender and judge and any judging helpers.

Evidence for telepathy?

The conclusion, or rather explanation, offered by parapsychologists is implied in the terminology. The sender sends and the receiver receives in some unknown way simply called telepathy. The catch is that there is absolutely no evidence that the sender actually does any sending.

Yes, if there was some unidentified form of human communication not blocked by whatever separates the two, then that could explain the results. However, the results could also be explained by, just for example, remote viewing.
Some will now say that it doesn’t matter if we are dealing with remote viewing or telepathy, as long as we are dealing with something inexplicable. In the sense that both would be extremely interesting, that is certainly true.

However, it does not change the fact that drawing unwarranted conclusions is simply bad science. Surely, if you have something interesting on your hands you want to do good science on it. Does it really make sense to say: “It doesn’t matter that we are misrepresenting the results, only that they are interesting?”

There are also practical considerations. If it turns out that a sender is not necessary, then you need only half as many test subjects and, on top of that, can schedule the sessions more easily.

Finally, I need to be blunt. If someone is unable or unwilling to correctly interpret evidence, then one can’t help but wonder if they are able to correctly implement the experimental protocol. If they tell me that this “proves” telepathy and I see it does not, can I trust them when they tell me that they ruled out all conventional explanations?

A few experiments have been conducted that modified that standard experiment to properly test for the influence of the sender. Discussion of that will need to wait for later.

Consciousness Research?

Many parapsychologists think of themselves as conducting consciousness research. Somehow, in some unknown way, consciousness is thought to be responsible for the Ganzfeld results. Whether we are dealing with telepathy or remote viewing, it must surely be a psychic ability, or so some say.
But again, this does not follow from the data. We have a number of known sensory organs which are clearly responsible for gathering most, if not all, of the information we have about the world. If there was an additional extra-sensory means of perception, there is no reason to assume from the get-go that it would be fundamentally different from the known means. We can only say that it must be much less effective than the known means or we would all be aware of it.

Not only is there no good reason to think that the psyche or consciousness should have anything to do with the results, there is a strong argument against that. The receiver is typically not aware of receiving any information. He or she simply babbles out whatever goes through their mind. Afterwards they will be provided with some record of their babblings and possibly even with help in matching them to the target. In other words, the process is completely unconscious.

Evidence for psi?

Here it gets more tricky, as there is no single definition of psi accepted by all parapsychologists. The Parapsychological Association (PA) says on its web-page:

A general blanket term, proposed by B. P. Wiesner and seconded by R. H. Thouless (1942), and used either as a noun or adjective to identify paranormal processes and paranormal causation; the two main categories of psi are psi-gamma (paranormal cognition; extrasensory perception) and psi-kappa (paranormal action; psychokinesis), although the purpose of the term “psi” is to suggest that they might simply be different aspects of a single process, rather than distinct and essentially different processes.

Obviously, the idea that all alleged psi phenomena are connected cannot, in principle, receive any support from the standard Ganzfeld experiment. And even more obviously, it cannot provide evidence for any other psi phenomenon.
Nothing about the experiment allows any conclusion about the mechanism.
One might say, that, if parapsychologists stumbled across something with one experiment, it would be worthwhile to look at what else they have. That is a reasonable argument, but it is entirely based on social considerations, namely an assessment of how credible parapsychologists are as a group. It does not follow from the experiment.

Another definition, given in some Ganzfeld papers (those by Daryl Bem) goes like so:

The term psi denotes anomalous processes of information or energy transfer, processes such as telepathy or other forms of extrasensory perception that are currently unexplained in terms of known physical or biological mechanisms. The term is purely descriptive: It neither implies that such anomalous phenomena are paranormal nor connotes anything about their underlying mechanisms.

Clearly, this second definition is in direct contradiction to the one by the PA. Not only does the PA explicitly say that psi denotes something paranormal (whatever that means), it also says that it does imply something about the mechanism: Namely that there is a single process for all alleged psi phenomena.

An amusing consequence of the second definition is that psi definitely exists. Whenever someone knows something and you don’t know how, that’s psi.
By that definition, the Ganzfeld experiment can provide evidence for psi but it is not clear why you would even bother. By that definition, the existence of psi is a necessary consequence of us not being all-knowing.

Many psi believers feel that psi is this awesome spiritual thing. If people could be convinced of its existence, it would revolutionize all of science, from physics to psychology, and even all of society. That's not something a single experiment can ever provide.

What it is not evidence for

There is any number of people who claim psychic powers. Some of them do so on a professional basis, that is they charge money for psychic readings. Some people might say that if something is happening in these Ganzfeld experiments then that supports that there might be something to that. Unfortunately, it’s the other way around.

It’s like saying that, because anyone can leap over a low fence, therefore some people might be able to leap over a tall building. That doesn’t follow, of course. No matter how many people you watch jumping a low fence, it will not change the fact that no one can jump over the average building. Even worse, the longer you go on, the clearer it will become that it is simply not possible.

It is much easier to demonstrate an amazing ability, with big effects on the world, than a tiny one. If someone claims the Ganzfeld experiments as "the best evidence for psi", that is basically a tacit admission that psychic powers as portrayed on TV, or in the New Age literature, do not exist.

Evidence for what?

Ultimately, the standard Ganzfeld experiment can only provide us with evidence that something unexpected happens in such experiments, and nothing else. By design the experiment does not provide us with any clues as to what is going on.

This is the fundamental problem in the standard Ganzfeld design.

We think that the hit rate in a typical Ganzfeld experiment should be 25% but there is any number of reasons why this might be wrong.

Early on, a few possibilities were raised. For example, when a sender is given a photograph, they might leave fingerprints, kinks, scratches or other such clues on the paper. These so-called handling cues might enable the receiver to tell that photograph from other, fresh photographs supplied as decoys.
Such things are usually referred to as flaws in the protocol. I, personally, don’t think it is constructive to label some explanations as somehow inherently reflecting badly on the experimenters. In my opinion, the single flaw is that the design does not allow any conclusions about the explanation, not that it allows for boring explanations.

Parapsychologists frequently point out a catch with such proposed explanations: There is no evidence that they are true. The experiment, after all, only tells us that there is something to be explained, not how.
Unfortunately, they rarely realize that the same is doubly true for their suggestions. As vague as the term is, there is no evidence that something like telepathy exists, so invoking it to explain the Ganzfeld results is simply building on air. Arguing that the Ganzfeld results themselves are evidence for telepathy is simply circular reasoning. One might equally claim the results as evidence for fraud, error, or gremlins.

Attention! Double-slit!

Recently Dean Radin and others published an article that purports to study the effects of attention on a double slit experiment.

Originally I wanted to do just a rebuttal to that but then found it necessary to also review the entire background. The simple rebuttal spiraled out of control into a 3-part series. My old math teacher was right. Once you add the imaginary things get complex, for reals. And not only for them.

A Word of Caution

People often ask for evidence when they are faced with something they find unlikely. The more skeptical will also ask for evidence for something they consider credible, at least sometimes. For the academically educated evidence means articles published in peer-reviewed, reputable, scientific journals.
For example, all the articles I cite as evidence in the first part, where I look at mainstream quantum physics, are from such journals.
So here comes the warning: not all journals that call themselves peer-reviewed are reputable. For example, there is a peer-reviewed journal dedicated to creationist ideas. And I probably don't need to tell you what scientists on the whole think of creationism.

The journals that published the articles discussed in this series are not reputable. Mainstream science does not take note of them. Physics Essays, where the most recent article appeared, may very well be the closest to the mainstream, and still it is mostly ignored.
It is largely an outlet for people who believe that Einstein was wrong. We’re not talking about scientists looking for the next big thing, we’re talking about people who are to Einstein’s theory what creationists are to evolution.
This is not meant as an argument against these ideas, I just don’t want to mislead anyone into believing that there is a legitimate scientific debate going on here.

That’s not to say that science ignores fringe ideas. For example, Stanley Jeffers who appears in the second part of this series is a mainstream physicist who decided to follow up on some of those.
He just didn’t find that there was anything there. It was a dead end.
James Alcock has a few words on that in his editorial Give the Null Hypothesis a Chance.

There are many cranks out there. These are people who hold onto some theory in the face of contrary evidence. They will not go away but they will, almost invariably, accuse the mainstream of science of being dogmatic. Eventually, there is nothing to be done but ignore them.

On to the Review

The first part gives a brief overview over the quantum physics background to the experiment. Dean Radin gets this completely wrong. And I fear the misunderstandings he propagates will pop up in many places.

Part 1: A Quantum Understanding

In the next part we will look at the experiment in question. Let’s call it the parapsychological double-slit experiment. We will learn who came up with the idea and what he found out and also what a positive result should look like and what it might mean.

Part 2: A Physicist Investigates

The 3rd and last part, for now, looks at the two articles authored by Dean Radin, presenting seven replications of the original design.

Part 3: Radin for a Rerun

Further studies are being conducted so more parts are likely to follow at some point.

About another skeptical award

A few weeks ago, I blogged about Daryl Bem being awarded a Pigasus by James Randi.

Today, I am going to tell you about another such negative award. This one is called ‘Das goldene Brett’ and is awarded by the Austrian Society for critical thinking (Gesellschaft für kritisches Denken). This society is the Vienna chapter of the GWUP which is the German language equivalent of the CSI.

"Das goldene Brett" means "the golden board". In German, saying that someone has 'a board before his head' (ein Brett vor'm Kopf) means that he or she is an idiot: someone who obviously can't see and is unable to work out why.

Perhaps this recalls the Bible, Matthew 7:3:
And why do you look at the splinter in your brother’s eye, and not notice the beam which is in your own eye?
But enough about that quaint and unwieldy language.

 

Is it a bird? Is it a plane? No! It’s a food buffet!

But why, my dear reader, would I bother you with the local affairs of an obscure mountain province?
The reason is that one of the three 2011 prize winners has managed to make the international news. Just the skeptical news, but still.

That winner was P.A. Straubinger, who directed the movie In the Beginning There Was Light. That movie promoted Breatharianism, which is the belief that eating (or drinking) is not necessary for survival: people can survive on (sun-)light alone. It boggles the mind that people could possibly believe such a thing. Yet, not only do people believe it, a select few of them have died trying to live by it.

When news of the movie reached the local skeptics, they immediately saw the danger: the publicity would inspire further copy-cats and, among them, deaths.
Precisely that has now happened. A Swiss woman was found dead of starvation. (English report) Her journey into death started with Straubinger's movie.

This leaves us with many open questions.
How much blame should we assign the propagandists? Or was it just the dead woman’s free choice?
Was she open-minded or gullible? Was she gullible or mentally ill?

What should skeptics learn from such a case? What should be done to protect the vulnerable from dangerous nonsense?

How about counter-arguments? There is some extensive debunking of the supposed "evidence" in the film available on German skeptic blogs. But it seems unlikely that one can reach the vulnerable with information; otherwise they would not be vulnerable. Everyone already knows that one can't survive without nourishment. If someone is willing to dismiss such an everyday fact as merely a 'materialistic belief', then any further details must fall on deaf ears.
Even worse, a nuanced reply might even be seen as confirmatory. A scientific person will not ever rule out anything as impossible. Nothing can be known with such certainty. Distinguishing between the practically impossible and the literally impossible is a fine point that is rarely made in daily life. A scientist acknowledging the fundamental, philosophical limits of our knowledge may be heard as endorsing a practical possibility where none exists.

What about ridicule and a clear word? Some warn that this will just push believers away, but I wonder if it might not still be the best method. I don't know what truly motivates people to believe in the clearly untrue, but if it is largely driven by emotion, then emotional appeals must be made to reach them, even if many skeptics will disagree with such methods on principle. In truth, it seems dishonest to me to seek to convince others with emotional, rationally invalid rhetoric. But if there are lives at stake, maybe I should swallow my distaste?
It seems plausible that ridicule will not reach the entrenched and only push them away but maybe it is a good method to reach the broad mass of people. A more open approach is surely needed to reach the truly vulnerable, the ‘spiritual’.

Should such nonsense be banned? In Germany Holocaust denial is illegal. And yet science denial is not, even when the danger is clear and present. It seems impossible to get a legislature to ban certain kinds of speech based on objective danger rather than offense taken.
However, I do not see a clear conflict with the principle of free speech. Hardly anyone would seriously say that an ad offering money for someone's death, that is, an ad seeking a contract killer, should be legal. Such speech is aimed solely at getting someone killed, that is, at denying someone a right even more important than the right to free speech: the right to live. There is no ethical duty to tolerate speech that will get people killed.

Shut up and ignore? Straubinger has actually thanked skeptics for the attention they paid to the movie and the extra publicity it gave him. That raises a worrying specter: may skeptics share part of the responsibility for the breatharian deaths? Many people have a reflexive sort of sympathy for the underdog, especially when that underdog is an enemy of an enemy. When skeptics denounce a dangerous fringe idea, does that perhaps drive some people into accepting it?