I have touched on a number of issues, so far, and parapsychologists have not been silent on them either. So, now I’m going to address a few things that seem especially pertinent.
Tressoldi and GRADE
I am not the first to think that the GRADE approach could shed light on the solidity of parapsychological evidence. Patrizio E. Tressoldi mentions the GRADE approach in Extraordinary claims require extraordinary evidence: the case of non-local perception, a classical and Bayesian review of evidences.
Unfortunately “non-local perception” is not defined in the article. A connection to certain quantum physical phenomena is made but there is no explanation of the relationship. Most importantly, there is no explanation of how any of that relates to the experimental data.
These are the same fatal flaws that, regrettably, are the norm rather than the exception in parapsychology, but that's not of importance here.
The experimental data consists of data from several previous meta-analyses which are reexamined using different statistical methods. There is no attempt made to apply the GRADE guidelines. The quality of evidence is not evaluated in any way whatsoever.
Tressoldi simply asserts that the evidence should be rated High Quality by the GRADE standard, which has the unfortunate potential to leave a rather misleading impression. A normal reader, not bothering to track down all references, might be led to believe that the simple statistical exercises performed constitute the GRADE approach.
Such bad scholarship is best, and most politely, ignored and forgotten. Yet, I would have been remiss not to mention it in this context.
How not to assess evidence
One approach to quality in the ganzfeld debate has been to use quality scales. This method is explicitly discouraged.
1. Do not use quality scales
Quality scales and resulting scores are not an appropriate way to appraise clinical trials. They tend to combine assessments of aspects of the quality of reporting with aspects of trial conduct, and to assign weights to different items in ways that are difficult to justify. Both theoretical considerations and empirical evidence suggest that associations of different scales with intervention effect estimates are inconsistent and unpredictable.
A quality scale is like a checklist of things that are thought to bias the experiment. For each item that is checked a point value is assigned and the sum then gives the quality of the study in a single value. It’s this part of expressing the quality in a single value that is problematic. We’ll see why in a moment.
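To make the checklist idea concrete, here is a minimal sketch of how such a scale collapses several items into a single number. The item names are taken from the scale discussed below, but the scoring function itself is my own illustration, not the exact procedure used in any particular paper.

```python
# Hypothetical sketch of a quality scale: one point per checklist item
# that the study report satisfies, summed into a single score.

ITEMS = [
    "appropriate_randomization",
    "random_target_positioning",
    "blind_transcription",
    "sensory_shielding",
    "independent_judge",
    "experimenters_blind",
]

def quality_score(study):
    """Sum of satisfied checklist items; the whole study becomes one number."""
    return sum(1 for item in ITEMS if study.get(item, False))

# A study whose report mentions only two of the six items:
study = {"appropriate_randomization": True, "sensory_shielding": True}
print(quality_score(study))  # -> 2
```

The information lost here is exactly the point: a score of 2 does not tell you *which* items were satisfied, even though the items may matter to very different degrees.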
Quality scales have been used on multiple occasions within parapsychology, and more than once on ganzfeld studies. The way things are done in parapsychology is that the studies are first rated for quality. Then it is checked whether there is a correlation between quality and effect size. That means one checks whether studies of lower quality have a higher hit-rate, on average, than studies of high quality.
The correlation is typically weak and non-significant, which is supposed to show that the results are not due to quality problems. The argument is quite curious on its face because one would think that parapsychologists, of all people, would understand that absence of evidence does not equal evidence of absence.
Medicine has standardized quality scales, and even so, it is found that these scales may give contradictory answers. So when you fail to find a correlation between a particular scale and outcome, that may simply mean that you used the wrong scale. And when you find one, well… try enough scales and you will find a correlation just by chance.
The problem is especially acute in parapsychology where there are no standard scales. The scales are simply made up on the spot and never reused.
I’ll use Meta-Analysis of Free-Response Studies, 1992–2008: Assessing the Noise Reduction Model in Parapsychology by Storm, Tressoldi and DiRisio as an example for a closer look at the problem of quality scales.
The first item on the scale is this:
appropriate randomization (using electronic apparatuses or random tables),
Randomization is obviously important. If the target is not randomly selected, then we are certainly not justified in thinking that there is a 1 in 4 chance of guessing correctly. If the target was selected, just for example, based on how visually appealing it is, then it would not be surprising to find a higher hit-rate.
However, there is a long story behind this. We’ll get back to this item.
random target positioning during judgment (i.e., target was randomly placed in the presentation with decoys),
Obviously, if you always place the correct target in the same place, that’s really bad. Even if no one figures out the correct place, it offers a ready explanation for any especially high or low hit-rate.
If you present people with a list of items and ask them to pick one at random, then they will prefer some simply based on their position in the list. Of course, that’s only true as long as there aren’t some overriding considerations, and it’s only true on average, but the fact is that people aren’t very random.
That’s one of the more important findings to come out of psychological science. How so? Think commercial websites, supermarket shelves, etc.
It’s not actually necessary to randomize the order of the choices every time, though. For example, if you always had the same 4 images in the same order, and simply designated one at random as the target, then that would work as well.
In a way, this is an odd item. If all the experimenters are blind and the target selection is random, then there should be no need for explicitly randomizing the positioning because it would already be random by necessity.
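The point about fixed presentation order can be sketched in a few lines. Assuming a hypothetical four-image set shown in an unchanging order, drawing only the *designation* of the target uniformly at random already gives every position an equal 1-in-4 chance:

```python
import random

# Hypothetical decoy set, always presented in this fixed order.
IMAGES = ["A", "B", "C", "D"]

def pick_target():
    """Designate one position as the target, uniformly at random."""
    return random.randrange(len(IMAGES))

random.seed(0)
counts = [0, 0, 0, 0]
for _ in range(100_000):
    counts[pick_target()] += 1

# Each position ends up as the target about a quarter of the time,
# even though the presentation order was never shuffled.
print([round(c / 100_000, 2) for c in counts])
```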
The further items are these:
• blind response transcription or impossibility to know the target in advance,
• sensory shielding from sender (agent) and receiver (perceiver),
• target independently checked by a second judge, and
• experimenters blind to target identity.
I won’t pick them apart in detail.
All of these items could potentially bias the hit-rate but (and that’s the problem) we don’t know to what degree, or even whether, they do at all.
Take sensory shielding: That’s a complete must for any ganzfeld experiment. If an article failed to specifically mention sensory shielding, then this can only be an omission in the report, not necessarily a sign that the shielding was insufficient. On the other hand, if it is mentioned, it is not knowable from the report whether it was truly sufficient.
For the sake of the argument, imagine that one of the items always leads to a higher (or lower) hit-rate and the rest do nothing. Then you will have studies that were rated as high-quality that are still biased, and “low-quality” studies that are unbiased.
Now you look for a correlation. Do the high-quality studies have a different hit-rate?
Strictly speaking, you should still expect a slight difference because the biased studies can never, assuming perfect reporting, be top-quality. So, there will be a few less biased studies among the high-quality studies, but the true extent of the bias will be hidden because you are basically mixing apples and oranges.
Basically, when you use a quality scale in this way, you are implicitly assuming that all your potential biases have exactly the same effect and that’s all. The more factors that have no effect that you put into the scale (and the more factors that have one which you leave off), the less likely you are to find any correlation between effect and quality rating.
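This dilution effect can be shown with a minimal simulation. All numbers here are invented for illustration: 500 fictitious studies, six checklist items, and only item 0 actually inflates the hit-rate. The correlation between the *composite* score and the hit-rate comes out much weaker than the correlation with the one item that matters:

```python
import random
import statistics

random.seed(0)

N_ITEMS = 6       # checklist items; only item 0 actually biases the result
N_STUDIES = 500

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

scores, hit_rates, item0 = [], [], []
for _ in range(N_STUDIES):
    items = [random.random() < 0.5 for _ in range(N_ITEMS)]
    # Chance hit-rate is 25%; a study failing item 0 is inflated by
    # 10 percentage points, plus some random noise.
    rate = 0.25 + (0.0 if items[0] else 0.10) + random.gauss(0, 0.05)
    scores.append(sum(items))
    hit_rates.append(rate)
    item0.append(int(items[0]))

# The five irrelevant items dilute the one real signal in the sum:
print("composite score vs hit-rate:", round(pearson(scores, hit_rates), 2))
print("item 0 alone vs hit-rate:   ", round(pearson(item0, hit_rates), 2))
```

The item-level correlation is strong and negative (satisfying the item removes the inflation), while the composite-score correlation is visibly attenuated, exactly as argued above.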
It would be far more relevant to look for a correlation between any item individually and the hit-rate. This would allow parapsychologists to identify potential biases and make amends. I fear that such undertakings are unlikely to happen. Such a thing is contrary to the culture of parapsychology.
Parapsychology is focused on showing that something cannot possibly have happened in any known way, including by error. In order to study the impact of biases, it would first be necessary to acknowledge that error can never truly be ruled out. Acknowledging that would render the entire parapsychological method moot.
And that leads us back to the first item.
Manual vs. Automatic Randomization
A couple of mainstream scientists (i.e. scientists not part of the usual handful of skeptics and believers) had a look at the database that Storm and his colleagues created in the previously mentioned paper.
In the main, they reanalysed it using Bayesian methods but that’s a whole ‘nother can of worms.
They obtained the full database from Storm et al which contained not only the cumulative quality score but also the individual item ratings. It turned out that the experiments using manual randomization had much better hit-rates than those using automatic randomization.
Here’s the relevant figure from their paper:
Rouder, J. N., Morey, R. D., & Province, J. M. (2013). A Bayes factor meta-analysis of recent extrasensory perception experiments: Comment on Storm, Tressoldi, and Di Risio (2010). Psychological Bulletin.
As you can see, the studies cluster around the expected chance hit-rate but some studies just run off. And that is particularly true for the manually randomized studies.
What this indicates, on its face, is that manual randomization is associated with a considerable risk of bias. The size of the bias is not the same across all studies but that’s just what you’d expect. However, clearly that does not explain all high scores.
In reply, the original authors pointed out that a few studies had been mis-rated (while glossing over the fact that the errors were largely their own – classy!).
They still found a significant difference in effect size between the two groups, with a p-value of 1.9%. That means that if you randomly split the database into two groups and then compare the hit-rates, a difference as large as the one found will occur only about 1 in 50 times.
This finding is rather suggestive but note that this is far from solid evidence. Bear in mind that many things that limit our ability to draw firm conclusions from the ganzfeld studies are also present here.
For one, it’s possible that there are confounding variables. Maybe it is not about the randomization at all, but about something else that people who chose one method also did differently.
And also, it may just be a false positive, a random association. Such a difference may only be found 1 in 50 times, but this 1 time may just have been this time.
There are two things that add credibility to thinking that this points to a bias due to improper randomization. For one, Rouder and colleagues did not go about ‘data-mining’ for some association. They had a limited number of factors that they “inherited” from Storm and colleagues.
These factors, in turn, were certainly not the result of some data-mining either. They came up with a limited number of factors that they thought might indicate the presence of bias and then had the database rated for these factors.
That’s the second thing that adds credibility. It is not some correlation we have simply noticed. We know how improper randomization can improve the hit-rate and that was why this was looked at in the first place.
Still, even so, the finding is of limited usefulness to us because that particular database consisted of both ganzfeld and non-ganzfeld studies, and not even all ganzfeld studies.
Storm and his colleagues offer some rather curious counter-arguments.
For one, they point out that the z-scores were not significantly different, but that’s just statistical nonsense. The z-score is the p-value expressed in terms of standard deviations, so it’s basically a measure of how unlikely a certain number of hits is in a given number of trials.
The z-score depends both on the hit-rate and the number of trials. A 40% hit-rate in a 10-trial study gives a low z-score, while the same hit-rate in a 100-trial study gives a high z-score. That is because such a high hit-rate happens often, by chance, in a 10-trial study but rarely in a 100-trial study.
So basically, if one group of studies consists of smaller studies then they will have relatively low z-scores even if they have a much higher hit-rate on average. I can’t think of any way in which testing for a difference in z-scores makes sense.
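The 40% example above can be checked directly with the usual normal approximation to the binomial (chance hit-rate 25%, as in a four-choice ganzfeld design):

```python
from math import sqrt

def z_score(hits, trials, p0=0.25):
    """Normal approximation to the binomial: how many standard
    deviations the hit count lies above the chance expectation."""
    return (hits - trials * p0) / sqrt(trials * p0 * (1 - p0))

print(round(z_score(4, 10), 2))    # 40% in 10 trials  -> about 1.1
print(round(z_score(40, 100), 2))  # 40% in 100 trials -> about 3.46
```

The same hit-rate, so the same apparent effect, yields z-scores differing by a factor of sqrt(10), which is why comparing z-scores between groups of differently sized studies says little about the hit-rates themselves.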
The other counter-argument is that there is… well, I’ll just quote them:
No argument is presented as to (a) why the use of random number tables is problematic or (b) why [automatic randomization] means both “true” randomization caused by radioactive decay, as in state-of-the-art random number generators (see Stanford, 1977), and pseudorandomization with computer algorithms, but not one or the other.
One wonders, then, why they used it for their scale. Can it be that Storm et al. have already forgotten who came up with this criterion?
That’s not to say that what they point out is wrong. It’s true that there is no reason to think that the randomization is necessarily bad in one group or necessarily good in the other. That is exactly a problem with their quality scale in particular, which they have apparently just disowned.
However, it is exactly the fact that one finds a difference that retrospectively validates this item.