The Ganzfeld Experiments: Quality – Part 3

I have touched on a number of issues, so far, and parapsychologists have not been silent on them either. So, now I’m going to address a few things that seem especially pertinent.

Tressoldi and GRADE

I am not the first to think that the GRADE approach could shed light on the solidity of parapsychological evidence. Patrizio E. Tressoldi mentions the GRADE approach in Extraordinary claims require extraordinary evidence: the case of non-local perception, a classical and Bayesian review of evidences.
Unfortunately “non-local perception” is not defined in the article. A connection to certain quantum physical phenomena is made but there is no explanation of the relationship. Most importantly, there is no explanation of how any of that relates to the experimental data.
These are the same fatal flaws that regrettably are the norm rather than the exception in parapsychology but that’s not of importance here.

The experimental data consists of data from several previous meta-analyses which are reexamined using different statistical methods. There is no attempt made to apply the GRADE guidelines. The quality of evidence is not evaluated in any way whatsoever.
Tressoldi simply asserts that the evidence should be High Quality by the GRADE standard, which has the unfortunate potential to leave a rather misleading impression. A normal reader, not bothering to track down all references, might be led to believe that the simple statistical exercises performed constitute the GRADE approach.

Such bad scholarship is best, and most politely, ignored and forgotten. Yet, I would have been remiss not to mention this in this context.

How not to assess evidence

One approach to quality in the ganzfeld debate has been to use quality scales. This method is explicitly discouraged.

1. Do not use quality scales

Quality scales and resulting scores are not an appropriate way to appraise clinical trials. They tend to combine assessments of aspects of the quality of reporting with aspects of trial conduct, and to assign weights to different items in ways that are difficult to justify. Both theoretical considerations and empirical evidence suggest that associations of different scales with intervention effect estimates are inconsistent and unpredictable.

-The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials

A quality scale is like a checklist of things that are thought to bias the experiment. For each item that is checked a point value is assigned and the sum then gives the quality of the study in a single value. It’s this part of expressing the quality in a single value that is problematic. We’ll see why in a moment.

Quality scales have been used on multiple occasions within parapsychology and also more than once on ganzfeld studies. The way things are done in parapsychology is that the studies are first rated for quality. Then it is checked whether there is a correlation between quality and effect size. That means one checks whether studies of lower quality have a higher hit-rate, on average, than studies of high quality.

The correlation is typically weak and non-significant, which is supposed to show that results are not due to quality problems. The argument is quite curious on its face because one would think that parapsychologists, of all people, would understand that absence of evidence does not equal evidence of absence.

Medicine has standardized quality scales and even so, it is found that these scales may give contradictory answers. So when you fail to find a correlation between a particular scale and outcome, that may simply mean that you used the wrong scale. And when you find one, well… Try enough scales and you will find a correlation just by chance.
The problem is especially acute in parapsychology where there are no standard scales. The scales are simply made up on the spot and never reused.

An Example

I’ll use Meta-Analysis of Free-Response Studies, 1992–2008: Assessing the Noise Reduction Model in Parapsychology by Storm, Tressoldi and DiRisio as an example for a closer look at the problem of quality scales.

The first item on the scale is this:

appropriate randomization (using electronic apparatuses or random tables),

Randomization is obviously important. If the target is not randomly selected, then we are certainly not justified in thinking that there is a 1 in 4 chance of guessing correctly. If the target was selected, just for example, based on how visually appealing it is, then it would not be surprising to find a higher hit-rate.
However, there is a long story behind this. We’ll get back to this item.

random target positioning during judgment (i.e., target was randomly placed in the presentation with decoys),

Obviously, if you always place the correct target in the same place, that’s really bad. Even if no one figures out the correct place, it offers a ready explanation for any especially high or low hit-rate.

If you present people with a list of items, and ask them to pick one at random, then they will prefer some simply based on their position in the list. Of course, that’s only true as long as there aren’t some over-riding considerations and it’s only true on the average but the fact is that people aren’t very random.

That’s one of the more important findings to come out of psychological science. How so? Think commercial web-sites, super-market shelves, etc…

It’s not actually necessary to randomize the order of the choices every time, though. For example, if you always had the same 4 images in the same order, and simply designated one at random as the target, then that would work as well.

In a way, this is an odd item. If all the experimenters are blind and the target selection is random, then there should be no need for explicitly randomizing the positioning because it would already be random by necessity.

The further items are these:

• blind response transcription or impossibility to know the target in advance,
• sensory shielding from sender (agent) and receiver (perceiver),
• target independently checked by a second judge, and
• experimenters blind to target identity.

I won’t pick them apart in detail.

All of these items could potentially bias the hit-rate but -and that’s the problem- we don’t know to what degree or even if they do at all.
Take sensory shielding: That’s a complete must for any ganzfeld experiment. If any article failed to specifically mention sensory shielding, then this may well be only an omission in the report and not necessarily a sign that the shielding was insufficient. On the other hand, if it is mentioned, it is not knowable from the report if it was truly sufficient.

For the sake of the argument, imagine that one of the items will always lead to a higher (or lower) hit-rate and the rest do nothing. Then you will have studies that were rated as high-quality that are still biased and “low-quality” studies that are unbiased.
Now you look for a correlation. Do the high-quality studies have a different hit-rate?
Strictly speaking, you should still expect a slight difference because the biased studies can never, assuming perfect reporting, be top-quality. So, there will be slightly fewer biased studies among the high-quality studies but the true extent will be hidden because you are basically mixing apples and oranges.

Basically, when you use a quality scale in this way, you are implicitly assuming that all your potential biases have exactly the same effect and that’s all. The more factors that have no effect that you put into the scale (and the more factors that have one which you leave off), the less likely you are to find any correlation between effect and quality rating.
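The thought experiment above can be played through in a few lines of Python. This is only a sketch under made-up assumptions: six checklist items that are each present in half of the studies, 40 trials per study, and a single item (item 0) whose absence adds a hypothetical 10 percentage points to the hit-rate. None of these numbers come from any real database.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n_studies=2000, n_items=6, n_trials=40, bias=0.10, seed=1):
    """Simulate studies in which only item 0 matters (all values are assumptions)."""
    rng = random.Random(seed)
    item0, totals, hit_rates = [], [], []
    for _ in range(n_studies):
        # 1 = safeguard reported, 0 = not reported
        items = [rng.randint(0, 1) for _ in range(n_items)]
        # Only a missing item 0 inflates the true hit probability.
        p = 0.25 + (bias if items[0] == 0 else 0.0)
        hits = sum(rng.random() < p for _ in range(n_trials))
        item0.append(items[0])
        totals.append(sum(items))      # the single-number quality score
        hit_rates.append(hits / n_trials)
    return pearson(item0, hit_rates), pearson(totals, hit_rates)

r_item, r_scale = simulate()
print("item 0 vs hit-rate:     ", round(r_item, 2))
print("total score vs hit-rate:", round(r_scale, 2))
```

In this toy setup the correlation with the summed score comes out much weaker than the correlation with the one item that actually matters, because the five irrelevant items only add noise to the score.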

It would be far more relevant to look for a correlation between any item individually and the hit-rate. This would allow parapsychologists to identify potential biases and make amends. I fear that such undertakings are unlikely to happen. Such a thing is contrary to the culture of parapsychology.
Parapsychology is focused on showing that something cannot possibly have happened in any known way, including by error. In order to study the impact of biases, it would first be necessary to acknowledge that error can never truly be ruled out. Acknowledging that would render the entire parapsychological method moot.

And that leads us back to the first item.

Manual vs. Automatic Randomization

A couple of mainstream scientists (i.e. scientists not part of the usual handful of skeptics and believers) had a look at the database that Storm and his colleagues created in the previously mentioned paper.
In the main, they reanalysed it using Bayesian methods but that’s a whole ‘nother can of worms.

They obtained the full database from Storm et al which contained not only the cumulative quality score but also the individual item ratings. It turned out that the experiments using manual randomization had much better hit-rates than those using automatic randomization.

Here’s the relevant figure from their paper:
[Figure 1 from Rouder et al. (2013); image not available]
Rouder, J. N., Morey, R. D., & Province, J. M. (2013): A Bayes factor meta-analysis of recent extrasensory perception experiments: Comment on Storm, Tressoldi, and Di Risio (2010). Psychological Bulletin

As you can see, the studies cluster around the expected chance hit-rate but some studies just run off. And that is particularly true for the manually randomized studies.
What this indicates, on its face, is that manual randomization is associated with a considerable risk of bias. The size of the bias is not the same across all studies but that’s just what you’d expect. However, clearly that does not explain all high scores.

In reply, the original authors pointed out that a few studies had been mis-rated (while glossing over the fact that the errors were largely their own – classy!).

They still found that there was a significant difference in effect size between the two groups with a p-value of 1.9%. That means that if you randomly split the database in two groups and then compare the hit-rate, the difference will be larger than that found only about 1 in 50 times.
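The "randomly split the database" reading of the p-value can be sketched as a permutation test. To be clear, this is an illustration of that description and not the test Storm and colleagues actually ran; the hit-rates below are invented, and the function name is my own.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10000, seed=0):
    """Share of random re-splits whose mean difference is at least
    as large as the observed one."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(pooled)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random re-assignment of studies to groups
        # small tolerance guards against floating-point rounding
        if mean_diff(pooled) >= observed - 1e-12:
            count += 1
    return count / n_perm

# Hypothetical per-study hit-rates, NOT taken from the real database.
manual = [0.45, 0.40, 0.38, 0.30, 0.28, 0.50]
automatic = [0.27, 0.25, 0.24, 0.30, 0.22, 0.26]
p_value = permutation_p_value(manual, automatic)
print("approximate p-value:", p_value)
```

For simplicity the sketch averages study hit-rates without weighting by study size, which a real analysis would probably not do.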

This finding is rather suggestive but note that this is far from solid evidence. Bear in mind that many things that limit our ability to draw firm conclusions from the ganzfeld studies are also present here.

For one, it’s possible that there are confounding variables. Maybe it is not about the randomization at all, but about something else that people who chose one method also did differently.
And also, it may just be a false positive, a random association. Such a difference may only be found 1 in 50 times, but this 1 time may just have been this time.

There are two things that add credibility to thinking that this points to a bias due to improper randomization. For one, Rouder and colleagues did not go about ‘data-mining’ for some association. They had a limited number of factors that they “inherited” from Storm and colleagues.
These factors, in turn, were certainly not the result of some data-mining either. They came up with a limited number of factors that they thought might indicate the presence of bias and then had the database rated for these factors.
That’s the second thing that adds credibility. It is not some correlation we have simply noticed. We know how improper randomization can improve the hit-rate and that was why this was looked at in the first place.

Still, even so, the finding is of limited usefulness to us because that particular database consisted of both ganzfeld and non-ganzfeld studies, and not even all ganzfeld studies.

Storm and his colleagues offer some rather curious counter-arguments.

For one they point out that the z-scores were not significantly different but that’s just statistical nonsense. The z-score is the p-value expressed in terms of standard deviations, so it’s basically a measure of how frequently one obtains a certain number of hits in a given number of trials.
The z-score depends both on the hit-rate and the number of trials. A 40% hit-rate in a 10-trial study gives a low z-score while the same hit-rate in a 100-trial study will give a high z-score. That is because such a high hit-rate can happen often, by chance, in a 10-trial study but rarely in a 100-trial study.

So basically, if one group of studies consists of smaller studies then they will have relatively low z-scores even if they have a much higher hit-rate on average. I can’t think of any way in which testing for a difference in z-scores makes sense.
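The 10-trial versus 100-trial comparison is easy to check. The sketch below assumes the usual normal approximation to the binomial with a chance hit-rate of 25%; whether the original authors computed their z-scores exactly this way is an assumption on my part.

```python
from math import sqrt

def z_score(hits, trials, p_chance=0.25):
    """z-score of a hit count under the normal approximation to the binomial."""
    expected = trials * p_chance
    sd = sqrt(trials * p_chance * (1 - p_chance))
    return (hits - expected) / sd

z_small = z_score(4, 10)     # 40% hit-rate in 10 trials, about 1.10
z_large = z_score(40, 100)   # 40% hit-rate in 100 trials, about 3.46
print(round(z_small, 2), round(z_large, 2))
```

Same hit-rate, very different z-scores, purely because of the number of trials.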

The other counter-argument is that there is… well, I’ll just quote them:

No argument is presented as to (a) why the use of random number tables is problematic or (b) why [automatic randomization] means both “true” randomization caused by radioactive decay, as in state-of-the-art random number generators (see Stanford, (1977), and pseudorandomization with computer algorithms, but not one or the other.

Wait… what?

One wonders then, why did they use it for their scale? Can it be that Storm et al have already forgotten who came up with this criterion?
That’s not to say that what they point out is wrong. It’s true that there is no reason to think that the randomization in one group is necessarily bad or necessarily good in the other. That is exactly a problem with their quality scale in particular, which they have apparently just disowned.
However, it is exactly the fact that one finds a difference that retrospectively validates this item.

The Ganzfeld Experiments: Quality – Conclusion

Previously, I assessed the quality of evidence provided by the ganzfeld experiments. I found that the typical ganzfeld experiment could only be considered to yield Moderate evidence, and further that it was necessary to downgrade the entire body of studies at least once, for heterogeneity.
That leaves the overall quality of evidence for the ganzfeld trials as Low. There is no way that I could justify any higher grade but one could certainly justify a lower one. One could justify a double downgrade for the heterogeneity, on account of the serious implications. One could also justify downgrading for publication bias. And then, I didn’t look in detail at the individual studies, which could only uncover reasons for downgrading, as I found that there was no reason to upgrade.
When two (or more) factors are borderline, one should downgrade for at least one.

Put like that, calling the evidence of Low Quality is a very favorable assessment.

The best argument for a better grade is claiming that the ganzfeld design as it is should be regarded as High Quality, like a medical RCT. I’ve already laid out why I don’t agree with that. It would simply lead to another borderline case and at some point you can’t ignore all these borderline calls and must downgrade for at least one of them.

But what does that mean?

Quality level and current definition:
High: We are very confident that the true effect lies close to that of the estimate of the effect
Moderate: We are moderately confident in the effect estimate: The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different
Low: Our confidence in the effect estimate is limited: The true effect may be substantially different from the estimate of the effect
Very low: We have very little confidence in the effect estimate: The true effect is likely to be substantially different from the estimate of effect

Note well what this does not mean. It does not mean that there is no effect. It means that we can have no confidence that there is one. But equally it means that we can have no confidence that there is none.

And that simply means that everyone will retain whatever opinion they had beforehand which leads us to another curious feature of parapsychology in general.
Parapsychologists say that the hit-rate should be 25%. Any conventional cause for a deviation is not of interest and should be regarded as a bias. The basic ganzfeld design has been intensely scrutinized for any such potential bias and modified to rule it out.
This puts parapsychologists in a position to make a solid and credible argument that the hit-rate must be 25% by any conventional expectation. And it is that which lends credence to the argument that any systematic deviation, any effect, must be due to something amazing, that some worthwhile scientific discovery is waiting there.

Unfortunately, the sheer solidity of the theoretical argument means that few mainstream scientists will be swayed by low quality evidence. Curiously, many vocal parapsychologists seem unable to understand this.
They accuse mainstream science of being “dogmatic” and yet the failure to convince the mainstream with low quality evidence is precisely because of the solidity of their arguments that, by all prior evidence, the hit-rate must be 25%.

  1. Parapsychologists work hard and convince people that the hit-rate should be 25%.
  2. Parapsychologists accuse people of being dogmatic for believing it.

It’s one of those things about parapsychology that does not make the slightest bit of sense. Such displays of irrationality are ultimately responsible for parapsychology’s bad reputation. Low quality evidence is normal enough. That’s why there is such a thing as the GRADE approach.
If someone appears irrational, you probably won’t attempt a rational dialogue. And if you try anyway and your open-mindedness is rewarded with accusations of dogmatism and even dishonesty, then you probably give up.
It is that which leads mainstream science to, for the most part, shun parapsychology. Which then leads prominent parapsychologists to double down and declare that there is a “taboo” against dealing with them. But that’s a different matter.

Does GRADE work?

That’s a very good question. I hope it occurred to you.

One thing we would like to know is how reliable the assessment is. How much agreement is there between different raters? And the answer is: Not as much as we’d like. There is human judgement involved in the rating which is one reason that the GRADE approach demands transparency.
I have tried my best to make the reasoning as clear as possible and have already discussed where others might differ in their assessment.

The other thing we would like to know is how solid the conclusion is. Say, you have 1,000 different apparent effects but based on evidence rated Low or Very Low. How many of those effects would really be found to be substantially different?
The answer: No one knows, yet.

In relation to the ganzfeld, however, we can say that the assessment would have been exactly spot on. I’ve talked about a 33% hit-rate because that is often claimed but, in truth, the hit-rate has varied wildly. When some of the earliest experiments were analyzed in 1985, a hit-rate of 37% was obtained; while when studies from between 1987 and 1997 were analyzed a hit-rate of only 27% was obtained.
In the latter case it was, of course, the parapsychologists who were not impressed and argued that this was due to certain specific biases. That’s something for a later post.

Conclusion

So eventually we find that the ganzfeld evidence is of Low Quality but that should not come as a surprise to anyone.
The more important lesson is probably that this is so according to the standards of mainstream medical science. Other sciences may have a lower standard; I’m thinking of psychology in particular. Indeed, it has been asserted by psychologist Richard Wiseman that the ganzfeld is solid by the standards of his field but, as far as I can tell, his colleagues are, on the whole, not particularly impressed which would seem to contradict his assessment.
In any case, accusations of a double standard clearly lack merit.

What I find more worrying are the problems that parapsychology has in interpreting the evidence and drawing supportable conclusions, regardless of quality considerations. Low Quality evidence is not unusual, but the irrationality surrounding the whole issue is.

What If: High Quality Evidence?

I think many parapsychologists have very unrealistic expectations about that. Remember that all that could be concluded from the ganzfeld experiments is the presence of some unexplained effect causing the chance expectation to be wrong.
High quality ganzfeld evidence would just indicate that there is probably something worth studying there. Some scientists would become interested enough to look into it. Most would simply be too busy with whatever they are already doing.
The interested scientists would then start out by repeating the original, standard ganzfeld experiment to create the effect in their own lab. And then, once they have succeeded in that, they would study the effect. If they found themselves unable to recreate the effect, they would still give up. If you can’t create an effect, you can’t study it, even if you are convinced it exists.
And that’s all that would happen.

The situation currently, with low quality evidence, is not fundamentally different! It just means that fewer people are going to think they can recreate the effect or that the effect is due to something worthwhile.
This idea that high quality evidence for psi would lead to some sort of “paradigm shift” because of some single experiment is just nonsense. That kind of thing has never happened before and I don’t see how it could happen even in principle.

While this concludes the GRADE business, this does not conclude the quality series. There are some more issues we need to talk about, such as what parapsychologists had to say about all this.

Examples of mishaps

I want to give some examples of things that actually went wrong in the ganzfeld experiments. I hope it may illustrate how these vague biases may look “on the ground”.
Do not take these examples as a reason to dismiss the experiments. You can take the Low Quality as a reason for that but these examples are just, you know, life. Things don’t go perfectly.
Parapsychology doesn’t stand apart in that respect.

These results differ slightly from those reported earlier since an independent check of our database by Ulla Böwadt found an extra hit in study V. Two trials in study IV (a hit and a miss) had also been included although the experimenters apparently were not agreed on this prior to the results. Their exclusion would make however virtually no effect on the final figures.
-A review of the ganzfeld work at Gothenburg University by Adrian Parker, 2000

This shows how individual trials may simply fall through the cracks. It would have been completely justifiable not to include those. One has to wonder if the media demonstration, in particular, was conducted with the same diligence as the regular trials.
Somewhat similar problems are known in medicine. In medical studies, patients may drop out; they quit the experiment. One must suspect that it will usually be those who are disillusioned by the offered treatment, or, perhaps, those who see no need to continue because they feel cured. In either case, this will bias the results. This so-called attrition is considered by GRADE.
Another thing this is similar to is transcription errors. You probably won’t be surprised to learn that people have actually been maimed and killed because of doctors’ illegible handwriting but it’s also a problem in science. Bias may be introduced into a study simply because of faulty data entry. That published values had to be corrected has happened on occasion in ganzfeld experiments and particularly meta-analyses.

An amusing, recent example is an analysis published in 2010: Meta-analysis of free-response studies, 1992-2008: assessing the noise reduction model in parapsychology by Storm L, Tressoldi PE, Di Risio L.
The article is more full of errors than I care to point out but this is just about one of them (one of the less embarrassing ones).
As part of their analysis they rated the ganzfeld experiments for quality. What they did wrong there is for a later post. The details of the rating were obtained by a couple of scientists, Jeffrey N. Rouder, Richard D. Morey and Jordan M. Province. They argued that improper randomization could explain a part of the results. More on that later.
One of the counter-arguments by the original authors was that the ratings, obtained from them, contained errors!

After about 80% of the sessions were completed, it was becoming clear that our hypothesis concerning the superiority of dynamic targets over static targets was receiving substantial confirmation. Because dynamic targets contain auditory as well as visual information, we conducted a supplementary test to assess the possibility of auditory leakage from the VCR soundtrack to R. With the VCR audio set to normal amplification, no auditory signal could be detected through R’s headphones, with or without white noise. When an external amplifier was added between the VCR and R’s headphones and with the white noise turned completely off, the soundtrack could sometimes be faintly detected.
-Psi Communication in the ganzfeld: experiments with an automated testing system and a comparison with a meta-analysis of earlier studies by Honorton et al, 1990

This means that the receiver (R) may have heard the sound of the correct target, which certainly would have allowed him or her to make the correct guess. That’s potentially serious. The counter-argument, however, sounds quite convincing, as well: There was no drop in the hit-rate after the sound system was modified to rule that out.

That’s certainly quite suggestive but mind that it is not high quality evidence. We have a bunch of trials conducted before the sound system was fixed and a bunch afterwards but there is no direct, randomized comparison.
And what does the unchanging hit-rate indicate anyway? Maybe they just failed to remove the problem with the modification.

For what it’s worth, when that lab closed the equipment was moved elsewhere where it was used by different experimenters. They were unable to recreate the effect.
You could take that as evidence that maybe the sound system played a role but once again: Low quality evidence. There certainly were other documented potential biases in the experiments at that lab which may not have been present at the new location.

The Ganzfeld Experiments: Quality – Part 2

The GRADE approach

One begins by reviewing each single study and rating it for quality. Based on that one will decide the overall quality of the collection of studies.

The GRADE approach, adopted by The Cochrane Collaboration, specifies four levels of quality (high, moderate, low and very low) where the highest quality rating is for a body of evidence based on randomized trials. Review authors can downgrade randomized trial evidence depending on the presence of five factors and upgrade the quality of evidence of observational studies depending on three factors.
-Cochrane Handbook Section 12

Randomized trials start out as high quality evidence and observational trials as low quality. Unfortunately, the ganzfeld is not a randomized trial in the medical sense as it lacks a control group. A randomized trial in medicine is one where patients are randomly assigned to a group.
Observational evidence is something you have when that is not possible. You can’t, for example, randomly tell people to smoke or not. You can only observe people who chose to do that or not. The problem is so-called confounding variables. People who smoke might also make other poor health choices, for example.

The typical ganzfeld experiment is neither. It could be, and has been, modified to be either of them.
Suppose you were to randomly tell senders to sneak off, without anyone knowing, so that you had two groups, one with a sender and one without. That would be a randomized placebo controlled trial and definitely high quality evidence for the influence of a sender.
If, on the other hand, you simply compared trials where the sender just did not happen to show up, to those where he or she did, then you would have observational evidence.
Both of these things have been done but we’re not interested in that, for now.

Experiments like the ganzfeld are not explicitly considered by the Cochrane Collaboration. It seems to me clearly superior to an observational study because the experimenter has full control over what goes on.
On the other hand, the absence of a control group is a serious problem. It means that there is no way of knowing if the experimenters really managed to rule out all biases, which in this case means all conventional means of communication. The design of the experiment may not have any apparent biases but the implementation may be faulty. This is especially problematic because we are at the same time facing evidence suggesting that something was going on after all. We have no means of identifying the cause of the apparent effect.

The average ganzfeld experiment comprises a mere 40 trials. Trial in relation to the ganzfeld means a single attempt at guessing, rather than an actual study with many attempts as in medicine. I hope that’s not too confusing but I can’t change it.
That means that, if something is going wrong, the experimenter has relatively few chances of finding it. We would expect 10 hits just by chance. The claimed effect means that you should expect a mere 2 or 3 hits on top of that. That’s not a lot…
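The arithmetic behind those numbers can be spelled out as follows. The sketch uses the 25% chance rate from the text and a 33% hit-rate as the claimed effect (the figure this series uses elsewhere as the commonly claimed one); the helper names are my own. It also computes how often 13 or more hits would turn up by chance alone in a 40-trial study.

```python
from math import comb

def expected_hits(trials, hit_rate):
    """Expected number of hits for a given hit-rate."""
    return trials * hit_rate

def prob_at_least(k, trials, p):
    """Exact binomial probability of k or more hits in `trials` attempts."""
    return sum(comb(trials, i) * p**i * (1 - p) ** (trials - i)
               for i in range(k, trials + 1))

chance = expected_hits(40, 0.25)    # 10 hits expected by chance
claimed = expected_hits(40, 0.33)   # about 13 hits under the claimed effect
p_13_by_chance = prob_at_least(13, 40, 0.25)

print("expected by chance:", chance)
print("expected under claimed effect:", round(claimed, 1))
print("P(13 or more hits by chance):", round(p_13_by_chance, 3))
```

A single study of this size will therefore often look "successful" by luck alone, and will just as often hide a small systematic error.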

Indirectness

Regardless of the level of evidence at which the study starts out, there are factors for which one downgrades or upgrades the quality. A reason to downgrade is indirectness of evidence.

An example of indirectness is if one study compares drug A to placebo and another drug B to placebo. You could use those results to compare drug A and B but that would mean a downgrade of the quality of evidence. That is, even if both studies were High Quality, you would only be dealing with Moderate Evidence for which drug is better than the other.

There are other types of indirectness which don’t need to bother us.

The relevance for the ganzfeld is that we only have indirect evidence that the hit rate should be 25%. This is based on a number of assumptions such as that the random sequence is really random, or that there is no sound (or other sensory) leakage between sender and receiver and so on.

All of these assumptions can be tested and well justified but the fact remains that it is indirect evidence.

The bottom line is that the typical ganzfeld experiment cannot be regarded as the equal of a randomized trial in medicine. We are starting out with Moderate Quality.

Other factors to consider

There are other factors which should result in downgrading the quality of an individual study. These would merit downgrading some individual experiments but I do not think that such factors are prevalent enough to globally downgrade the ganzfeld experiments as a whole.

One clear-cut example would be ‘Selective outcome reporting bias’. Remember that some ganzfeld studies had not just the receiver guessing at the correct target but also independent judges. This would, in principle, allow reporting only the better result. So where we do not have the hit-rate of the actual receiver, we should downgrade.

Another thing that is problematic is when the hit-rate is not reported. Some studies had the receiver rank the targets according to preference. This allows computing the direct hit-rate depending on whether the correct target is ranked as #1. It also, however, allows turning the guess into a binary choice, where a hit is when the correct target is ranked #1 or #2 and a miss if it is #3 or #4. And then one may simply compute whether the average rank of the correct targets was higher than expected by chance.
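To see how the same rankings can yield three different outcome measures, here is a small sketch. The list of ranks is invented for illustration and the function name is hypothetical. With four choices, chance expectation is a direct hit-rate of 25%, a binary hit-rate of 50% and a mean rank of 2.5.

```python
def score_ranks(target_ranks):
    """Return (direct hit-rate, binary hit-rate, mean rank) for a list of
    ranks (1 to 4) that receivers assigned to the correct target."""
    n = len(target_ranks)
    direct = sum(r == 1 for r in target_ranks) / n   # hit only if ranked #1
    binary = sum(r <= 2 for r in target_ranks) / n   # hit if ranked #1 or #2
    mean_rank = sum(target_ranks) / n                # lower is better
    return direct, binary, mean_rank

# Hypothetical ranks from eight trials, not real data.
ranks = [1, 3, 2, 4, 1, 2, 2, 3]
direct, binary, mean_rank = score_ranks(ranks)
print(direct, binary, mean_rank)   # 0.25 0.625 2.25
```

In this made-up case the direct hit-rate sits exactly at chance while the other two measures look better than chance, which is why it matters which measure gets reported.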

However, the mere fact that this was done is not a reason to downgrade. Only those studies where we do not have access to the straight hit-rate must be downgraded. We’re only interested in the reliability of the evidence and the fact that some experimenters made dubious choices is not in itself relevant to that.

On the whole, I think that the early studies would have to be downgraded further as a body, but not the later ones. This is something we will consider when we get to looking at actual results.

Normally one must not make such summary judgements. One should consider each study separately and give a reason why one downgrades or not. Transparency is vital because these quality assessments always involve a degree of subjectivity.

But for now, let’s just remember that we have this open issue and move on.

Publication Bias

It’s generally the case in all sciences that “successful” experiments are more likely to be published than unsuccessful ones. This is quite understandable. You are more likely to tell people about the unexpected than the expected, about what happened rather than what did not happen.

The ganzfeld experiments should have a hit rate of 25%. What would happen if no one ever published experiments with a hit rate lower than that?
The remaining published studies would then have an average hit rate of well over 25%. Of course, it doesn’t have to be that extreme. Every study that is not available will distort the average of the rest.
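The thought experiment can be played through in a few lines of Python. This is only a sketch under made-up assumptions: 2000 simulated studies of 40 trials each, a true hit rate of exactly 25%, and the extreme rule that any study scoring below 25% stays in the file drawer. The function name and all parameters are hypothetical.

```python
import random

def simulate(n_studies=2000, n_trials=40, true_rate=0.25, seed=1):
    """Return (mean hit rate of all studies, mean hit rate of 'published' ones)."""
    rng = random.Random(seed)
    all_rates, published = [], []
    for _ in range(n_studies):
        # Each trial is a hit with probability true_rate, nothing else going on.
        hits = sum(rng.random() < true_rate for _ in range(n_trials))
        rate = hits / n_trials
        all_rates.append(rate)
        # Extreme file drawer: below-chance studies are never published.
        if rate >= true_rate:
            published.append(rate)
    return sum(all_rates) / len(all_rates), sum(published) / len(published)

mean_all, mean_published = simulate()
print(f"all studies: {mean_all:.3f}, published only: {mean_published:.3f}")
```

The average over all simulated studies stays close to 25%, while the average over the published ones sits clearly above it, even though nothing but chance is at work.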

Publication bias is notoriously difficult to detect because it results from studies that one doesn’t know anything about, not even if they exist. There are statistical methods to detect it but they are far from perfect. The problem with them is that they rely on certain assumptions about what the unbiased effect is like, and which results are most likely to remain unpublished. I won’t discuss these methods in detail for now. That’s for another time.

In general, review authors and guideline developers should consider rating down for likelihood of publication bias when the evidence consists of a number of small studies. The inclination to rate down for publication bias should increase if most of those small studies are industry sponsored or likely to be industry sponsored (or if the investigators share another conflict of interest).

GRADE guidelines: 5. Rating the quality of evidence—publication bias by Guyatt et al.

On its face, this suggests that we should rate down. The studies are all small and it can be argued that a conflict of interest is present, at least among some labs. I am not sure, though, if it is warranted to assume that whatever interests parapsychologists have, this would show in the form of publication bias.

Let’s look at counter-arguments for now.

Arguments against publication bias

One argument is that the Parapsychological Association Council adopted a policy opposing the selective reporting of positive outcomes in 1975.

Unfortunately, this argument is empirically not tenable. Research on publication bias in mainstream science suggests that the cause lies with authors not bothering to write up negative studies, rather than journals not publishing them. So, such a policy should not have an effect.
There is evidence that this is indeed the case. A researcher sent a survey to parapsychologists asking whether they knew of any unpublished ganzfeld experiments and what the results were. All in all 19 unpublished studies were reported. (The extent of selective reporting of ESP ganzfeld studies by S. Blackmore, 1980)

This leads to the next argument. The proportion of significant studies among unpublished and published studies was similar (37% vs 58%). This is supposed to indicate that there was no bias to publish only the successful ones. Obviously this argument is nonsense.
What matters is that the average hit-rate among the unpublished studies is systematically different. If that is the case then the average of the published studies will be biased. The fact that there are more significant studies among those published points in that direction, regardless of whether this fraction is similar.
In short, using that survey to argue that there was no publication bias is just Orwellian.

However, we must bear in mind that statistical significance tells us little about the hit-rate. Whether a study is significant depends both on the hit-rate and the number of trials in that study. If the number of trials per study in the unpublished studies is comparable to that in the published ones, then the hit-rate should be assumed to be lower. If the number of trials per study is much smaller, then the hit-rate may be the same or even higher.
For our purposes, we do not need solid evidence that the result is biased. We should downgrade whenever there is a high probability of publication bias.
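A small calculation shows how the same hit-rate can be significant or not depending on the number of trials. The sketch below uses an exact one-sided binomial test against the 25% chance rate. The two studies (7 hits in 20 trials and 70 hits in 200 trials) are invented examples, and the helper name `p_value` is my own.

```python
from math import comb

def p_value(hits, trials, chance=0.25):
    """One-sided exact binomial p-value: P(X >= hits) if only chance is at work."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(hits, trials + 1))

small = p_value(7, 20)     # hit rate 35% in a small study
large = p_value(70, 200)   # hit rate 35% in a study ten times larger

print(f"7/20:   p = {small:.4f}")
print(f"70/200: p = {large:.4f}")
```

Both studies have a hit-rate of 35%, yet only the larger one falls below the conventional 0.05 threshold. Counting significant studies therefore says little about the hit-rate unless one also knows how many trials each study had.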

That means that the early studies would definitely have to be rated down.
The later studies are a different matter, though. The results do not look as if they were overly influenced by publication bias. That’s something for later.

For now, I won’t downgrade. Others would. It is a borderline case.

Inconsistency

If results from different studies are inconsistent with each other, then rating down is a must. One method to diagnose inconsistency is by applying a statistical test for heterogeneity.

Let’s say that the true hit-rate were always 25%. Because that is just a probability, few experiments would actually end up with an observed hit-rate of exactly 25%. Most experiments would have a hit-rate close to that and few further away from that. In the same way, few people are of exactly average height, but most are around it with few being very tall or very short.

A test for heterogeneity finds out if the hit-rates are distributed as one would expect if it is always the same, no matter what it is. One should still find this pattern even if something causes the true hit-rate to be 33% or whatever, as long as it is the same in all experiments.
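As a rough illustration of the idea, the following sketch computes a chi-square-style homogeneity statistic and estimates its p-value by simulation. It is one possible way to test for heterogeneity, not necessarily the test used in any ganzfeld meta-analysis. The two data sets are invented, and the function names are my own.

```python
import random

def q_statistic(hits, trials):
    """Sum of squared, standardized deviations from the pooled hit rate."""
    pooled = sum(hits) / sum(trials)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation possible, avoid division by zero
    return sum((h - n * pooled) ** 2 / (n * pooled * (1 - pooled))
               for h, n in zip(hits, trials))

def heterogeneity_p(hits, trials, n_sim=2000, seed=1):
    """Monte Carlo p-value: how often do studies sharing one common hit rate
    scatter at least as much as the observed ones?"""
    rng = random.Random(seed)
    pooled = sum(hits) / sum(trials)
    observed = q_statistic(hits, trials)
    exceed = 0
    for _ in range(n_sim):
        sim = [sum(rng.random() < pooled for _ in range(n)) for n in trials]
        if q_statistic(sim, trials) >= observed:
            exceed += 1
    return observed, exceed / n_sim

trials = [40] * 6                      # six hypothetical studies, 40 trials each
consistent = [12, 14, 13, 11, 15, 13]  # hit counts scattered about as chance predicts
varied = [5, 22, 9, 20, 6, 16]         # same total hits, far more scatter

q_c, p_c = heterogeneity_p(consistent, trials)
q_v, p_v = heterogeneity_p(varied, trials)
print(f"consistent: Q = {q_c:.2f}, p = {p_c:.3f}")
print(f"varied:     Q = {q_v:.2f}, p = {p_v:.3f}")
```

Both sets have the same pooled hit rate of 32.5%, which is above 25%. Only the second set is flagged, because the test asks whether the studies agree with each other, not whether they agree with chance.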

The ganzfeld results are heterogeneous. The hit-rates found in the individual experiments are more varied than random chance would lead one to expect.

It is easy to see why inconsistency is a great problem for the ganzfeld in particular. If someone makes a mistake in implementing the standard ganzfeld design this may bias the results. It is unlikely that different people in different locations, all make the same mistake. It’s more likely that only some results will be biased, and to different degrees. And that’s something that would pop up as heterogeneity.

Heterogeneity is a clear warning signal that something may have gone wrong.

Of course, undetected mistakes are not the only possible cause for heterogeneity. There are always variations between different experiments which might also explain differences. A medical example is when patients in some studies are sicker, or if some receive a higher dose of a drug and so on.
If a robust explanation for the heterogeneity can be found then one need not downgrade. Preferably one would perform a subgroup analysis, which means that one splits the studies into different groups that are themselves not heterogeneous.

Parapsychologists speculate that something analogous is responsible for the heterogeneity in the ganzfeld. Some experiments used different target types while others recruited ‘creative’ participants who are supposed to be especially good at the task.
That does not rise to the level of a robust explanation, though. In medicine, one could invoke genetic differences, differences in food and climate, and any number of such things but that’s just hand-waving.
Future research may or may not uncover the causes of the heterogeneity but right now we simply lack any sort of robust explanation.

The way things are, downgrading for heterogeneity is a must.

Reasons for upgrading

One might upgrade the quality of evidence from an observational trial if the effect is especially large, if there is a dose-response relationship, or if all plausible confounding variables would work against the effect.

A large effect is defined as a relative risk (RR) of over 2, which for us means a hit rate of over 50%: Clearly not the case.

A dose-response relationship is clearly not present. There is nothing equivalent either because that implies knowledge of some underlying mechanism, which we do not have by definition.

The last one regarding confounders may be a bit confusing, so I’ll quote an example.

In another example of this phenomenon, an unpublished systematic review addressed the effect of condom use on HIV infection among men who have sex with men. The pooled effect estimate of RR from the five eligible observational studies was 0.34 [0.21, 0.54] in favor of condom use compared with no condom use. Two of these studies that examined the number of partners in those using condoms and not using condoms found that condom users were more likely to have more partners (but did not adjust for this confounding factor in their analyses). Considering the number of partners would, if anything, strengthen the effect estimate in favor of condom use.

-GRADE guidelines: 9. Rating up the quality of evidence by Guyatt et al. in JCE Vol. 64

That’s it for now. Still a lot of ground to cover…

The Ganzfeld Experiments: Quality -Part 1

Previously I have discussed what might be concluded from the ganzfeld experiments. In this post I will address a notoriously contentious issue: The quality of the evidence.
Practitioners and proponents of parapsychology often voice concern that their experiments are held to a different standard than experiments in ordinary science. To alleviate these concerns I will be scrupulous in applying the standards of medicine.
Medical experiments are in many ways similar to parapsychological ones. They are conducted on humans with all the messiness and limitations that come with that. There are also some important differences, of course. I will discuss that as I go along.
Medicine, more than any other science, faces the task of drawing conclusions from multiple studies, at least if the goal is to give patients truly evidence-based advice. Guidelines have been developed for that. Of course, these guidelines are themselves based on research on what methods are best suited to drawing correct conclusions and are still evolving.
I will mainly be relying on the Cochrane Handbook (v 5.1.0). Much of the handbook is not relevant to our purpose since we are not preparing a Cochrane review. The relevant material for this post is to be found in Part 2 of the handbook.

What is Quality of evidence?

It should be obvious what is meant by quality of evidence. It means how reliable it is. If it is of high quality then we probably won’t have to revise our judgement later on. If it is of low quality then we can’t be certain of that. The conclusion we draw from it might still be true but we would probably not be willing to bet the farm on it.

Let’s look at a practical example of how a medical trial might go wrong:
Experimenter: We have just given you a very expensive treatment. Do you feel any better?
Patient: I… don’t know?
Experimenter: It was really very expensive. I’m sure it must have worked. You do feel better, right?
Patient: Err…. Sure.

Clearly, if that were to happen in a medical trial, the results would be completely worthless for anyone outside of the marketing department. You probably already know the remedy to the problems here: Blinding and a control group.
The control group will typically receive a placebo. That is something that looks almost exactly like the real treatment but with the essential part removed, for example a pill without the drug. Often the control group will receive the old treatment instead because one cannot ethically let people go without treatment. Besides, one is usually more interested in whether a new treatment is better and should replace the current treatment rather than if it is better than nothing at all.
Blinding means concealing who belongs to the treatment group and who to the control group. A double-blind trial is one where neither subjects nor experimenters know who is in what group, which in turn means that they cannot be influenced by their personal preferences.

But hold on a moment. Just because someone is not blinded, does not mean that he or she will go to such lengths to influence the patient. Maybe the patient will not be influenced at all. If we know that a trial was unblinded we know that there was a risk of getting a wrong result. We don’t know if that actually happened, though.

Risk of bias

The Cochrane Collaboration talks about “risk of bias”. A bias in science is something that causes a result to be false. It has nothing to do with the personal motivations of the people involved. A completely impartial experimenter may still conduct a biased study while a biased experimenter may conduct an unbiased experiment.

The reason that the Collaboration talks about risk of bias rather than quality is that in some cases it may be impossible to conduct a study with all the necessary safeguards. In such cases it would be unfair to talk about a low quality study, but unfair or not, such a study would still be at risk of yielding a false result, i.e. of being biased. Another reason is that study quality is often regarded as being about more than just the reliability of the results. Other factors, such as properly following ethical guidelines, also come into play.

According to the Collaboration’s suggested scheme, studies are assigned either a ‘Low risk’ of bias, ‘High risk’ of bias, or ‘Unclear risk’ of bias. Note that there is no category for no risk. This will probably come as a surprise to many, particularly fans of parapsychology. When mainstream scientists regard a parapsychological study as potentially flawed, that is not because of some prejudice toward parapsychology, it is because they apply the same standard as is applied to all studies.

The basic reason for that is that we can never rule out human error. We can never presume to know everything. In particular, the risk of bias is assessed only by reading the report from the experimenters. Such reports are always idealized. If you ever read a scientific paper reporting an experiment, you will almost never find mention of equipment malfunction or such mishaps. Of course, they do happen but every student learns not to report them. They are not supposed to whine about how hard the experiment was to get right or what personal drama they experienced. No one wants to hear that. Only events that might compromise the results are to be reported.
That leaves a lot to the judgement of the experimenters who have every reason not to mention things which might reflect badly on them or imply that they wasted their time.

By the way, the FDA does conduct on-site inspections to ensure that proper protocol is followed but that is only for some trials connected to drug licensing and the like. Regulatory bodies around the world will not accept anything but high quality evidence. Be aware, though, that alternative medicine is not held to these standards.

Bias vs. Conflict of interest

Personal motivations, the biases of the people involved, are not assessed.

Scientific journals, particularly in medicine, require researchers to declare conflicts of interest. That means mainly financial conflicts, such as when the researcher owns a considerable amount of stock in a company that will benefit from a certain result, or is employed by one, or funded by one. Financial benefits can be reaped even if the result is unmasked as false within a short time.
Other conflicts of interest such as personal friendships or family associations generally receive less attention. Religious or ideological commitments are generally not considered at all.

Studies do indeed show that, for example, trials financed by industry are a little more likely to find a beneficial effect for the drug. To that extent, extra care is advised when conflicts of interest exist. If the majority of studies being looked at are industry-funded then one may even take this as evidence that publication bias is likely.

I am going to spare a few words on conflict of interest related to the ganzfeld and let that be enough.

Declaring conflicts of interest is not done in parapsychology. I don’t know why but I think it is not widely practiced in psychology either (do not mix up psychology with psychiatry or psychotherapy, though).
With regard to the ganzfeld experiment, there may be financial conflicts of interest related to funding. Parapsychological research is financed mostly by private grants and donations. I suspect that receiving such moneys is closely related to being able to obtain positive results. However, if or how much such factors correlate with positive results is, of course, unknown.
More serious are perhaps the religious or spiritual convictions of parapsychologists. I have previously analyzed what one can conclude from the ganzfeld experiments and found that they do not allow conclusions about the correctness of any world-view. Nevertheless parapsychologists argue that they do, and to that extent may be strongly committed to obtaining certain results.

Fans of parapsychology and parapsychologists often take it for granted that people will go to any length, even falsifying their research to uphold their world-view. Of course, that is an accusation only made at people who are not able to reproduce parapsychological results.
I, personally, do not know that anyone actually does that and rather suspect it to be rare outside of parapsychology. However, those accusations make it clear that some parapsychologists regard this as a completely obvious and natural thing to do. Which in turn makes me suspect that they themselves may be prone to such behavior.

Such considerations are not evidence-based, though, and as such I will not take them into account when assessing the evidence. One could point to past failures of parapsychology as a sort of indirect evidence for such problems and, of course, many people do that in practice when they consider parapsychology to be ridiculous. Ultimately that is not different in principle from displaying knee-jerk suspicion of industry-funded studies.

To be continued…

The Ganzfeld Experiments: Telepathy?

Discussions of parapsychological experiments usually focus on the quality of the alleged evidence. I will break with this tradition and begin by analyzing what conclusions are warranted if the results are as parapsychologists claim. Let’s begin by looking at the typical Ganzfeld experiment.

The typical Ganzfeld experiment

We have two experimental subjects or participants. One is called “sender” and the other “receiver”.
The sender is shown a video clip or a still image (in one case music) which was randomly selected from a collection. He or she is supposed to watch or listen to this for around 20 minutes while the receiver is “in the Ganzfeld”, that is, he or she experiences a mild state of sensory deprivation. During this time the receiver just says whatever goes through their mind and this is recorded.

Afterwards comes the judging procedure. The target is presented along with 3 decoys (though occasionally a different number was used) from the same collection of possible targets. Sometimes the judging is done by the receivers themselves, possibly with the help of an experimenter, and sometimes by an experimenter. Naturally, we are assured that the experimenters involved in the judging do not know the correct answer themselves.

Who is being tested?

That the receiver does not give the answer alone provides an immediate problem in interpreting the result. We cannot know if any correct answer was truly given by the test subject, or by someone else.

One might correctly point out that the only thing that matters is if something paranormal happens. So what if the telepathy is between the sender and the experimenter rather than the designated receiver?

Unfortunately, there are practical implications to this. It is not enough to ensure that the receiver had no way of knowing the correct answer, one must ensure this for the judges, and/or helper as well, which multiplies the problem.

Consider a situation where something like distance-dependence is to be investigated. Not only must sender and receiver be separated by the prescribed distance but also sender and judge and any judging helpers.

Evidence for telepathy?

The conclusion, or rather explanation, offered by parapsychologists is implied in the terminology. The sender sends and the receiver receives in some unknown way simply called telepathy. The catch is that there is absolutely no evidence that the sender actually does any sending.

Yes, if there was some unidentified form of human communication not blocked by whatever separates the two, then that could explain the results. However, the results could also be explained by, just for example, remote viewing.
Some will now say that it doesn’t matter if we are dealing with remote viewing or telepathy, as long as we are dealing with something inexplicable. In the sense that both would be extremely interesting, that is certainly true.

However, it does not change the fact that drawing unwarranted conclusions is simply bad science. Surely, if you have something interesting on your hands you want to do good science on it. Does it really make sense to say: “It doesn’t matter that we are misrepresenting the results, only that they are interesting?”

There are also practical considerations. If it turns out that a sender is not necessary, then you need only half as many test subjects and, on top of that, can schedule the sessions more easily.

Finally, I need to be blunt. If someone is unable or unwilling to correctly interpret evidence, then one can’t help but wonder if they are able to correctly implement the experimental protocol. If they tell me that this “proves” telepathy and I see it does not, can I trust them when they tell me that they ruled out all conventional explanations?

A few experiments have been conducted that modified the standard experiment to properly test for the influence of the sender. Discussion of that will need to wait for later.

Consciousness Research?

Many parapsychologists think of themselves as conducting consciousness research. Somehow, in some unknown way, consciousness is thought to be responsible for the Ganzfeld results. Whether we are dealing with telepathy or remote viewing, it must surely be a psychic ability, or so some say.
But again, this does not follow from the data. We have a number of known sensory organs which are clearly responsible for gathering most, if not all, of the information we have about the world. If there was an additional extra-sensory means of perception, there is no reason to assume from the get-go that it would be fundamentally different from the known means. We can only say that it must be much less effective than the known means or we would all be aware of it.

Not only is there no good reason to think that the psyche or consciousness should have anything to do with the results, there is a strong argument against that. The receiver is typically not aware of receiving any information. He or she simply babbles out whatever goes through their mind. Afterwards they will be provided with some record of their babblings and possibly even with help in matching them to the target. In other words, the process is completely unconscious.

Evidence for psi?

Here it gets more tricky, as there is no single definition of psi accepted by all parapsychologists. The Parapsychological Association (PA) says on its web-page:

A general blanket term, proposed by B. P. Wiesner and seconded by R. H. Thouless (1942), and used either as a noun or adjective to identify paranormal processes and paranormal causation; the two main categories of psi are psi-gamma (paranormal cognition; extrasensory perception) and psi-kappa (paranormal action; psychokinesis), although the purpose of the term “psi” is to suggest that they might simply be different aspects of a single process, rather than distinct and essentially different processes.

Obviously, the idea that all alleged psi phenomena are connected cannot, in principle, receive any support from the standard Ganzfeld experiment. And even more obviously, it cannot provide evidence for any other psi phenomenon.
Nothing about the experiment allows any conclusion about the mechanism.
One might say, that, if parapsychologists stumbled across something with one experiment, it would be worthwhile to look at what else they have. That is a reasonable argument, but it is entirely based on social considerations, namely an assessment of how credible parapsychologists are as a group. It does not follow from the experiment.

Another definition, given in some Ganzfeld papers (those by Daryl Bem) goes like so:

The term psi denotes anomalous processes of information or energy transfer, processes such as telepathy or other forms of extrasensory perception that are currently unexplained in terms of known physical or biological mechanisms. The term is purely descriptive: It neither implies that such anomalous phenomena are paranormal nor connotes anything about their underlying mechanisms.

Clearly, this second definition is in direct contradiction to the one by the PA. Not only does the PA explicitly say that psi denotes something paranormal (whatever that means), it also says that it does imply something about the mechanism: Namely that there is a single process for all alleged psi phenomena.

An amusing consequence of the second definition is that psi definitely exists. Whenever someone knows something and you don’t know how, that’s psi.
By that definition, the Ganzfeld experiment can provide evidence for psi but it is not clear why you would even bother. By that definition, the existence of psi is a necessary consequence of us not being all-knowing.

Many psi believers feel that psi is this awesome spiritual thing. If people could be convinced of its existence it would revolutionize all of science, from physics to psychology, and even all of society.  That’s not something a single experiment can ever provide.

What it is not evidence for

There is any number of people who claim psychic powers. Some of them do so on a professional basis, that is they charge money for psychic readings. Some people might say that if something is happening in these Ganzfeld experiments then that supports that there might be something to that. Unfortunately, it’s the other way around.

It’s like saying that, because anyone can leap over a low fence, therefore some people might be able to leap over a tall building. That doesn’t follow, of course. No matter how many people you watch jumping a low fence, it will not change the fact that no one can jump over the average building. Even worse, the longer you go on, the clearer it will become that it is simply not possible.

It is much easier to demonstrate an amazing ability, with big effects on the world, than a tiny one. If someone claims the Ganzfeld experiments as “the best evidence for psi” that is basically a tacit admission that psychic powers as portrayed on TV, or in the New Age literature, do not exist.

Evidence for what?

Ultimately, the standard Ganzfeld experiment can only provide us with evidence that something unexpected happens in such Ganzfeld experiments and nothing else. By design the experiment does not provide us with any clues as to what is going on.

This is the fundamental problem in the standard Ganzfeld design.

We think that the hit rate in a typical Ganzfeld experiment should be 25% but there is any number of reasons why this might be wrong.

Early on, a few possibilities were raised. For example, when a sender is given a photograph, they might leave fingerprints, kinks, scratches or other such clues on the paper. These so-called handling cues might enable the receiver to tell that photograph from other, fresh photographs supplied as decoys.
Such things are usually referred to as flaws in the protocol. I, personally, don’t think it is constructive to label some explanations as somehow inherently reflecting badly on the experimenters. In my opinion, the single flaw is that the design does not allow any conclusions about the explanation, not that it allows for boring explanations.

Parapsychologists frequently point out a catch with such proposed explanations: There is no evidence that they are true. The experiment, after all, only tells us that there is something to be explained, not how.
Unfortunately, they rarely realize that the same is doubly true for their suggestions. As vague as the term is, there is no evidence that something like telepathy exists, so invoking it to explain the Ganzfeld results is simply building on air. Arguing that the Ganzfeld results themselves are evidence for telepathy is simply circular reasoning. One might equally claim the results as evidence for fraud, error, or gremlins.

The Ganzfeld reviewed

With this post, I can begin a new series. The topic will be the infamous Ganzfeld experiments.
The story is long and there are many angles to it. First I will start with a quick overview of what the Ganzfeld is anyway and what the history is. That’s well-trodden ground but the series would be incomplete without these basics.
Next, I will review the experimental design and what conclusions can be drawn from positive results. That is an often neglected topic.
After that, I will review the debate so far in some depth. That will be several parts, I don’t know how many exactly yet.
At some point I will do some analyzing of the data myself, to break out of the tedious he said/she said mode.

Now let’s begin with the quick overview.

What is the Ganzfeld?

The Ganzfeld is a way of inducing mild and benign hallucinations. The word is German and means “whole field”. This refers to the whole field of vision being taken up by a uniform stimulus.
In practice one halves a table tennis ball and puts one half over each eye. Then a (usually red) light is shone on the participant’s face. He or she now sees only a uniform red glow and after a while will begin seeing things.
The effect can also be elicited in different ways, for example, by staring at a white wall. Additionally, parapsychologists usually have their test subjects wear headphones playing white or pink noise.

It is thought that the hallucinations result from the way the nervous system tries to deal with the lack of sense data. It “turns up the gain” in an attempt to get some usable signal but ends up with only noise. The hallucinations are the equivalent of the hissing that results when cheap speakers have the volume turned to max.

What is a Ganzfeld Experiment in parapsychology?

Parapsychologists in the early 1970s had the idea that we are constantly, somehow receiving ESP impressions but that these are usually crowded out by the perceptions coming from the known senses. They reasoned that if the known senses were shut down and everything else amplified, then this should allow the ever present ESP to come to the fore.
However, in the almost 40 years since, there has been very little effort to actually test this hypothesis. In fact, it seems to have taken over a quarter century before anyone even asked the question. I would have expected that one would first test if the Ganzfeld technique can boost ESP perception. If it can I would expect it to be used as an add-on in most parapsychological experiments. The more ESP perception you can produce, the more data you have to study, right?

It has not happened like that, though. Instead a specific type of experiment was created. The typical Ganzfeld Experiment purports to be a test of telepathy. It involves two subjects, one called “sender” the other “receiver”. The “sender” is presented with a “target”, typically a photo or video clip, while the “receiver”, in another room, is in a Ganzfeld state and simply speaks his impressions.
The task then is to identify the “target”, using these impressions. This may be done by the subject alone, by a third person or by the subject with the aid of a third person.
There are usually 3 “decoys” along with which the “target” is presented. This means that, theoretically, there should be a 1 in 4 chance of guessing right just by chance.

The claim is that subjects in Ganzfeld Experiments guess correctly about 1 in 3 times rather than the expected 1 in 4 times. This then is supposed to be evidence for telepathy. Whether it is, I will examine in detail in another post. For now I will just point out the obvious. One cannot conclude from such experiments whether the Ganzfeld actually does anything, consideration of ESP aside.

A brief history

The parapsychological Ganzfeld experiment was pioneered by Charles Honorton who, together with Sharon Harper, published the first account of one in 1974. In the following years more experiments were conducted and published. The supposed success of this method prompted Ray Hyman, a mainstream psychologist with a long-standing interest in parapsychology, to analyze the results. He argued that they were best explained by a number of flaws in the set-up. Honorton disputed this and a lengthy debate ensued.
Eventually, in 1986, during a lunch to which they both had been invited, they discovered that besides their disagreements they also agreed on many things. Instead of continuing the debate they co-authored an article agreeing that only further experiments could bring clarity and also outlining certain procedural musts to avoid any possibility of error. Honorton then went on to carry out experiments following these guidelines. The quality of his work convinced mainstream psychologist Daryl Bem. With his help the results of the new experiments were published in a respectable psychology journal in 1994.

If this were a Hollywood movie it would end here, with the triumph of the underdog.

Science does not end, however. These high quality experiments all took place in one lab; independent confirmation was still required. Besides, as Hyman pointed out, parapsychology had been at this point before only to see seemingly solid evidence implode.
In 1998 then another analysis came out, this one by Milton and Wiseman. It looked at all the smaller experiments that had come out following the publication of the guidelines in order to see if they had done as well as Honorton’s lab.
This turned out not to be the case. Independent confirmation had failed.

This would be another place for a movie to end. The good guys come riding into town and set people straight on the nonsense the snake oil salesman is peddling.

But again life does not offer neat endings. Parapsychologists were quick to explain away this failure, although not with arguments that could convince mainstream thinkers.
Meanwhile Ganzfeld work continued. In 2010 another analysis came out purporting to show that there is something there after all.

And that concludes the brief history. It has a lot of holes and leaves much unsaid. The details will be found in the posts to come.

The Entity case reviewed

‘The Entity’ is a 1982 movie (trailer), “based on a true story”. It’s about a woman, Carlotta “Carla” Moran, who is harassed and raped by a ghost and who seeks the help of parapsychologists.

Just a few days ago, someone posted a YouTube link to a video in which parapsychologist Barry Taff, one of the real life investigators on the case, talks about it. I decided to investigate. One result is that I think the YouTube clip is originally an extra on the DVD release of the movie (see Cinema of the Psychic Realm: A Critical Survey by Paul Meehan).
Between the case and the movie there is a novel by Frank DeFelitta, who is mostly a writer, director and producer.

A thorough investigation would require interviewing witnesses and, most importantly, reviewing what documentation is available. It is a fact that human memory is malleable, leading to things like False Memory Syndrome. This problem is especially acute in this case because the novel and the movie offer rich sources of entirely fictitious events and imagery. So-called source amnesia is very common. That means that you can remember a fact but not where or how you learned it.
For example you may know that the US constitution asserts the existence of “certain inalienable Rights, [and] that among these are Life, Liberty and the pursuit of Happiness.”
Where did you learn that? Did you read it? Were you told? Or did you hear it on TV?

If you’re saying that you never learned this because it is not actually in the US constitution then grab yourself a cookie, otherwise you have some more food for thought. Either way, I’m sure the point is clear. Memory is not too reliable.

So, basically, a proper investigation must rely on recordings made at the time, photos, video and sound, as well as contemporary written accounts or notes. I will try to do that but I’m only a poor blogger and have only so much time and energy. This post is restricted to what one can find on the net. That said, let’s go.

Sources

The case was investigated by Kerry Gaynor and Barry Taff, of UCLA in 1974. Their findings were apparently published in IEEE publications for some reason. The IEEE is an electrical/electronics engineering association.
A deleted Wikipedia entry gives these 2 references:

Taff, B.E. & Gaynor, K., “Another Wild Ghost Chase? No, One Hell of a Haunt”, in Wescon Special Session: “Psychotronics, 1975”, 1975 Wescon Professional Program, Western Electronic Show & Convention, San Francisco, September 16-19th, 1975 (Proceedings of the IEEE) [Institute of Electrical and Electronics Engineers, Inc.]

Taff, B.E. & Gaynor, K., “Another Wild Ghost Chase? No, One Hell of a Haunt”, in ELECTRO/76, Special Session: “Psychotronics III”, Electro 76/Professional Program, Boston, May 11-14th, 1976, (Proceedings of the IEEE) [Institute of Electrical and Electronics Engineers, Inc.]

Apparently these do not exist in electronic form at present so I am unable to confirm or deny the accuracy of these refs. The TV show Sightings seemed to, at least, confirm that there was something in ELECTRO/76.

Parapsychological articles generally cite a different source.

Taff, B.E & Gaynor, K., “A New Poltergeist Effect”, in Theta: A Journal For Research On the Question of Survival After Death, Journal of the Psychical Research Foundation, Durham, N.C., Vol. 4, No.2, Spring 1976, pp. 1-7

Apparently there are plans to make that archive electronically available but not yet.

What I did find was a sort of reprint in PSI Journal of Investigative Psychical Research, volume 4(1) from 2008. The introduction mentions changes made.

Desiring to set matters straight on what really occurred thirty-four years ago in a Los Angeles suburb, I am reprinting the original article with only minor upgrades and adjustments to compensate for three decades of time and acquired knowledge regarding this type of research.

I hope this only refers to background information. I am basing most of this post on that article.
I also listened to a few interviews with Barry Taff and watched a few segments from TV shows. Notably, a segment in the show Sightings (I think Season 1, Episode 3) featured Kerry Gaynor talking a lot and showing the photos. Unfortunately, that show was first aired in 1992, almost 20 years after the original occurrences and a decade after the movie. Gaynor seems to back up Taff’s article.
California’s Most Haunted is not very informative but has sound bites by Mort Zarkoff.

Problems with the Witnesses

I’ve already pointed out that long term memory is not too reliable. Now I must point out that even accounts written directly after an event are problematic. This realization dates back over 100 years. At the time parapsychology was very interested in séances. These took place in dark, even pitch-dark rooms and were led by a medium.
Supposedly all sorts of supernatural events, objects flying, ghosts materializing and so on, took place. Skeptics explained this as the medium using magic tricks while believers pointed out that what witnesses recounted could not possibly be trickery.
In 1887, Richard Hodgson, no less than the president of the American Society for Psychical Research, put the skill of witnesses to the test. With the help of magician John Davey, he staged a fake séance, not exactly for unsuspecting people, but rather for people who did not know that anything out of the ordinary would happen.
Their accounts were so incomplete and jumbled that it was impossible to reconstruct the tricks used. In fact, going by the accounts alone, trickery could be ruled out, in defiance of the facts. (Read more)

Another consideration is that some of those involved have a financial motive to propagate the mystery of the case. They draw revenue from the movie and the novel. That is particularly true for DeFelitta, of course. The woman on whom all this was based also received money until her death or disappearance. I don’t know if she received any immediate, proper compensation for letting dozens of spectators into her home. It would have been fair.

According to my main source, DeFelitta was only present, with camera man Mort Zarkoff, on one evening where nothing much happened. That’s not the impression one would get from TV shows on the case, where he invariably appears. He happily relates seeing things that apparently happened when he was not there…
There were a few sound bites from Zarkoff on California’s Most Haunted. They make it appear as if he is, like DeFelitta, talking about something he didn’t see. However, that could be due to misleading editing.

Gaynor and Taff were technical advisors on the movie. I do not know how substantial the royalties are.

Barry Taff appears in a number of interviews, for example on Coast to Coast. He comes across as a very colorful character. He claims amazing psychic powers for himself, such as the ability to psychically diagnose medical conditions. He also relates how he once was beaten up by a ghost, except that he also says that other witnesses saw him get beaten up by a young man. As a witness he certainly has a credibility problem.

Kerry Gaynor now seems to be a hypnotist specializing in smoking cessation. At least I think that the photo in this article shows the same person. I feel he’s the most credible of the bunch.

Another person who appears in TV shows is Dick Thompson, professional photographer. The implication is that he was also a witness but he is not mentioned by name in my main source.

Allegedly, there were over 50 witnesses in total to one paranormal event or another. The problem here is that there are not that many witnesses testifying to the phenomena. We only have these five who testify to the phenomena and to the presence of other witnesses. What those other witnesses would say is simply unknown.

If anyone feels like digging up the original reports in a library, by all means check the following issues of the journals as well. Someone may have written a letter to the editor in which they confirm or deny the accuracy of the account.
However, even someone who saw absolutely nothing might still assume that all the good bits happened on another occasion or that the matter was simply not important enough to follow up on.

In any case, all those people that were present cannot give us much additional trust in the accuracy of the account. With so many people present, the likelihood is higher that one of them would be a “bad witness”. Maybe one of them is not quite sane or not quite honest and will see or say anything with the right prodding.
Also, so many people present does increase the possibility that one of them may have been motivated to play a little hoax.

At the end of the day, that so many people were present is a problem for the paranormal interpretation, rather than a plus.

Recent developments

As late as 2011, Taff said on Coast-to-Coast (23.01.2011) that he did not know what had happened to the female victim. He had not maintained contact with her but he knew from DeFelitta that she had stopped cashing the checks sent to her and that DeFelitta was unable to contact her.
More recently someone surfaced who claims to be the woman’s second child. I don’t know how or if the identity of that person was established. He claims that she died in 1996 from pulmonary disease but otherwise I see no new information. Read more about this here.

I am going to go through what allegedly happened according to Taff’s article in PSI. The article is contradicted in a few details by other sources.

What happened?

The meeting

In real life, the victim/experiencer was called Doris Bither at the time. She was married multiple times and apparently changed her last name accordingly. She is described as having been intoxicated during most of her dealings with the parapsychologists. She had four children: 3 boys aged 10, 13 and 16, who were interviewed, and one daughter, 6, who was never seen.
She lived in Culver City, California in a shabby house that was “twice condemned” by the city.

Her first meeting with the parapsychologists was a chance encounter. She overheard Kerry Gaynor talking with a friend about hauntings in a bookstore and approached them.

The first visit

On August 22, 1974, Taff and Gaynor visited her small home in Culver City and interviewed the family.

Their accounts were fairly uniform in reference to a particular apparition whom they called “Mr. Whose-it.” The alleged apparition would appear in semi-solid form and was well over six feet in height, according to their testimony. Both Doris and her eldest son claimed to have seen two dark, solid figures with Asian features appear from out of nowhere within their mother’s bedroom, who at times appeared to be struggling with each other.
This particular event occurred several times, with one episode where Doris claimed to have physically bumped into the apparition in the hallway. Neither Doris nor her eldest son would accept the possibility that the apparitions might have been imagined or simply prowlers or intruders who forcibly entered the house.

Doris Bither also reported that she was sexually assaulted on several occasions, suffering large bruises. However, there were no medical records to substantiate this claim and the investigators could not see any bruising. The last attack had allegedly happened weeks prior so that any injuries had healed in the meantime.
The ‘spectral rape’ was pretty much what the movie centered around but there is no evidence, no matter how tenuous, that could corroborate Bither’s claims. By the by, California’s Most Haunted asserts that the investigators saw the bruising complete with a dramatic reenactment in which an actress pulls her top down.

Another incident featured in the movie is simply hearsay as well.

Even more dramatic was Doris’ claim that during one particular attack, her eldest son overheard the scuffle and entered the bedroom. According to Doris, he witnessed her being tossed around like a rag doll by the entities. She alleges that when her son came to her aid, an invisible force picked him up and threw him backwards into the wall. The son corroborated his mother’s story, speaking of the sheer terror he experienced during that struggle.

Apparently these stories are only the tip of the iceberg, for Taff goes on:

I will refrain from going into all the bizarre stories that were related to us for we cannot substantiate them.

The bleak picture that emerges is one of a dysfunctional family, with an alcoholic mother, living in a run-down house. If you’re thinking, based on this, that these interviews promised an intriguing case, then Taff and Gaynor would, reportedly, have disagreed with you.

Our initial impression was to totally discount Doris’ claims and simply refer her to one of the psychiatrists at the NPI.

They changed their opinion only because:

However, a few days hence, Doris called to inform us that five individuals outside her family had now seen the alleged apparitions.

There is no mention of what exactly was seen, or of whether the claim was corroborated by asking the witnesses. Presumably it was not.

The second visit

The parapsychologists returned to the house with cameras. They noticed cold spots, a rotting smell and a feeling of pressure on the inner ear. None of these reports is particularly odd. Such sensations can easily be induced by suggestion.
There is no mention of measurement devices being used to confirm that any objective temperature or air pressure differences existed. However, I can’t help speculating what it might have been if it was real.
The stench is obviously consistent with the destitute circumstances of the family. On a more speculative note, the stench may also have been connected with the feeling of pressure in the ears. The feeling might have simply resulted from breathing differently. Or there might have been noxious chemicals or mildew spores in the air, coming with the stench from whatever was rotting there, which might cause swelling of the mucous membranes in the nose or ears and so lead to that feeling of pressure. But that’s really a lot of speculation without knowing if there was anything more than suggestion going on.
About the feeling of cold:

An intriguing factor, which in our opinion is highly significant, was that from the very first occasion we entered Doris’ bedroom, we both immediately noticed that the temperature was unusually low in comparison to the rest of the house, even though it was a hot August night and all the bedroom’s windows were closed.

A room can be cold because of closed windows, rather than despite them. Closed windows prevent warm air from flowing in. If the room is also shadowed by the rest of the house, then it will be cooler than the outside or the rest of the house, because it receives neither warm air nor hot sun.

Moving on…

The first of many to come, seemingly inexplicable happenings, occurred while Gaynor was talking to the elder son in the kitchen.
Gaynor was standing approximately one foot away from the lower cabinets when suddenly the cabinet door swung open. A frying pan flew out of the cabinet, following a curved path to the floor over 2.5 feet [ca. 75 cm] away, hitting with quite a thud. Now, of course, the immediate thing to surmise is that the pan was leaning against the cabinet door and finally pushed it open as it fell out. But we cannot accept this explanation for the trajectory of the pan as it came out of the cabinet was elliptical. It was seemingly propelled out of the cabinet by a substantial force.

2.5 feet does not suggest much force to me. Moreover it suggests a very short flight. It seems doubtful that either Gaynor or the son were paying much attention to the cabinet. Which raises the question of how much of the flight path was really seen rather than just subconsciously surmised. This event is only as inexplicable as eyewitness testimony is reliable. That can be summed up in three words: Not at all.
By the by, over the years the pan has begun flying further. In California’s Most Haunted, Taff says that it flew across the kitchen, many feet. Retroactive PK at work, perhaps?
As far as I am concerned the falling hypothesis is entirely viable. Of course, it’s also possible that one of the kids played a prank. Come to think of it, the report doesn’t even say that Gaynor and the elder son were alone in the kitchen.
By the way, the term “elliptical trajectory” is nonsense in this context. That shows a lack of formal physical knowledge but whether that is relevant here is a different matter.
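To put a number on how short such a flight is, here is a rough sketch. It only uses the textbook formula for free fall and ignores air resistance and any upward push. The drop heights are my own assumptions for a lower kitchen cabinet. The report does not give the height.

```python
from math import sqrt

G = 9.81  # gravitational acceleration in m/s^2

def fall_time(drop_height_m):
    """Seconds an object needs to fall drop_height_m metres from rest
    (vertically), regardless of how fast it moves sideways."""
    return sqrt(2 * drop_height_m / G)

# Assumed drop heights for a lower cabinet; not stated in the report.
for drop_height_m in (0.2, 0.5):
    t = fall_time(drop_height_m)
    print(f"drop of {drop_height_m} m: about {t:.2f} s in the air")
```

Under these assumptions the pan is in the air for a fraction of a second. That is not a lot of time for two people in conversation to turn their heads and observe the shape of a trajectory.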

During this visit, an alleged psychic and friend of Bither called “Candy” was also present. At some point she called out that she felt a presence in the bedroom. Taff rushed in from the kitchen and immediately took a Polaroid photo. At the time, there were no digital cameras. Normal cameras took a chemical film which had to be developed in a lab but Polaroid cameras produced pictures that developed on their own, on the spot.
So they could immediately examine the picture. It was completely “bleached” white. Taff uses the word “bleached”; to me these photos simply look overexposed. Psychic Candy felt a presence several more times and each time either Gaynor or Taff took another photo.
It’s not particularly remarkable that they would have produced a number of bad photos. Electronics in the 1970s was not as advanced as now and so cameras required a lot of manual adjustments. The question is rather why the photos for which Candy indicated a “presence” were overexposed.
One answer may simply be that this is selective memory. Maybe they just didn’t bother mentioning pictures that came out normal when Candy shouted.
Another point may be that when Candy shouted they simply did not have the time to adjust the settings properly. They may also have held the camera differently at those times, covering the built-in brightness sensor. It may also be possible that the sensor was not fast enough to adjust to different conditions when they rushed from one room to another.
Whatever the case may be, a few overexposed photos hardly deserve to be called evidence.

In one close-up of Candy, which was taken when she said the presence was right in front of her, her face is “bleached” but the surroundings not as much. That can happen when using a flash. The bright flash is reflected by the face but quickly dissipates over longer distances.
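How quickly flash light falls off with distance can be sketched with the inverse-square law. This is an idealization that treats the flash as a small light source. The distances are assumptions of mine for the sake of the example. The report gives none.

```python
def relative_illuminance(near_m, far_m):
    """How many times more flash light per unit area reaches a subject at
    near_m metres than one at far_m metres (inverse-square law)."""
    return (far_m / near_m) ** 2

# Assumed distances for illustration only; not taken from the report.
face_m = 0.5   # close-up of a face
wall_m = 2.5   # wall behind the subject
ratio = relative_illuminance(face_m, wall_m)
print(f"The face receives about {ratio:.0f} times as much light as the wall")
```

With a difference like that, an exposure that suits the background will wash out a nearby face.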

One more picture showed “a small ball of light” but none of those present had seen it. White spots may result from damage to the photographic material. The specific camera used, the SX-70, was popular among artists, not just because it was the first instant SLR camera but also because the pictures could be manipulated by applying the right kind of pressure afterwards.
I haven’t seen the specific photo, though.

Standing there in amazement for several minutes discussing this phenomenal picture, I happened to glance over toward the bedroom’s eastern window and suddenly observed several rapidly moving, electric-blue balls of light.

The most obvious explanation here is that Taff simply saw a car’s headlights, or whatever, reflected in the window glass. No one else saw anything.

What is interesting is that these are the first strange lights mentioned. There is no mention that the family had ever seen any strange lights.
Now something appears on a picture that they interpret as a ball of light and immediately Taff sees a strange light.

The last paranormal phenomenon of the day:

[...]the Polaroid suddenly and inexplicably took the last photograph by itself.

duN-dun-DUNNNN! Boo!
I have to ridicule this because I have absolutely no explanation for this. File under total proof.

They also had an infrared camera but due to mishandling the film was exposed. They do not consider the possibility that it was malicious manipulation by the entity.

The third visit

On this occasion they take a professional photographer with them. Unfortunately, just on that day they did not obtain any photographic evidence. What a strange coincidence. Or maybe the work of a malicious spirit?

Our third to Doris’ house was most notable in that it was the first occasion where we both collectively witnessed identical visual phenomena. On more than twenty separate occasions, all of us present in the bedroom, including Doris and the female photographer, simultaneously observed what appeared to be small, pulsing flashes of light. It was at this point that we decided to further darken the candle-lit room and hung several heavy quilts and bedspreads over the windows and curtains. Our attempt partially succeeded in that we significantly attenuated the outside light.
The change in light intensity within the bedroom did not affect our most unusual luminous “friend” that now appeared even more brilliant against the darkened surround. It should be noted that we both alternately watched the various window areas in the hope of determining if the source of light was originating from outside the house, perhaps from a passing vehicle or neighbor’s flashlight.
After several such attempts, we were satisfied that whatever these moving, pulsing lights were, they were not originating from outside the house as the thick quilts draped over the window curtains would have easily told us of such an photonic intrusion. The sudden and rapid appearance and disappearance of the lights on this night made it virtually impossible to obtain any photographs, regardless of the fact that it appeared over ten times on the front of the bar area alone.

I don’t know what I should say to that. I don’t understand how they became satisfied that the lights did not come from outside.
If I had to guess then my guess would be that it was light from outside after all.
Whatever, with a bit of creativity I can think of other causes for such elusive lights. One is reflected, flickering, candle light, or perhaps light from equipment. Also, it just might be that someone played a trick on them. The three teen-aged sons seem ideal suspects.
Another possibility is misperceptions in the low light environment, aided by suggestion.
The quirks of human vision will also make it more difficult to track down a dim light source in a dark room. The human retina is not equally light sensitive everywhere. The edges of our vision are much more light sensitive than the center.
So we might see a light, that is really there, out of the corner of the eye and then lose it when looking straight at it. An occasional glance into a bright candle may leave afterimages.

We learned the next day that our attractive female photographer, after being dropped off by us at her apartment, became so overpoweringly ill from the effects of the bedroom’s malodorous environment that she regurgitated heavily before retiring.

They don’t say why they consider this particular detail noteworthy, nor why they feel it necessary to point out that the photographer was attractive. I suppose the implication is that she suffered because of the rapist ghost.
One might also conjecture that she, or maybe all of them, was (accidentally or not) drugged or poisoned, which caused them to see lights and her to throw up. That is not the most likely possibility, though, in my opinion.

The fourth visit

On their fourth visit they brought a number of other people with them and conducted a séance. The séance thing is usually glossed over in documentaries, for some reason. They only mention dozens of witnesses, or however many. Perhaps the producers fear that saying that they were there for a séance detracts from their credibility.

The séance circle consisted of some eight individuals, including more than ten spectators, many of who came with cameras loaded with high-speed infrared film with IR flashes, and high-speed black and white film with deep-red filtered strobes.

There is no mention of who these people were or how they were recruited. Nor is there any mention that written testimonials were collected.

On this night we all observed what appeared to be extremely intense lights, which were not stable either in size or luminosity. The lights were at times three-dimensional in nature, reaching out between various individuals within the circle. Judging from the rapidly changing size, dimensional characteristics and intensity of the lights observed, it is our opinion that these manifestations were not fraudulently created, nor the result of collective hallucinations.

Apparently the lights were different from those on the previous night. That suggests a different cause.
Taff asserts that these lights could only have been “faked” using lasers and spends several paragraphs arguing that a concealed laser apparatus is an impossibility. He does not argue at all for the assertion that faking them would have required a laser.

In Sightings Gaynor mentions the possibility that it might have been a flashlight but dismisses the idea based on a photo. I wonder why that is not mentioned by Taff. The photo he bases his dismissal on was apparently taken during the next visit, according to my primary source so it is discussed below.

As far as photos on this day go:

With three 35 mm. cameras continuously firing at these oscillating greenish white, three-dimensional lights, only one photograph depicted anything significant. The camera loaded with Kodak Tri-X black and white film with a deep-red filtered strobe captured what appears to be a small ball of light flying across the corner of the room. The sixth obtained photograph displayed an object bearing strong resemblance to a comet with a tail behind it.

And later:

However, several other pictures showed what appeared to be faces or figures outlined in light against a sliding closet door. But, as these images are highly subjective in nature, much like a Rorschach, we did not subject them to further analysis. Another photograph depicted an intense light against the south-facing wall in several separate frames.
The professional photographer who took these pictures was convinced that this exposure could not be explained away as irregularities in paint or a “hot spot” of reflection. The photographer was similarly convinced that the flying ball of light, discussed earlier, which he also caught, was not an artifact of overdeveloping or scratch marks on the negative.
The criticisms raised against the facial and figure outlines on the walls were, in most respects valid, in that the lack of uniformity of paint on the bedroom walls in conjunction with the slight penetrating power of the pushed Kodak Tri-X film could have conceivably accounted for these unidentifiable figures, which unfortunately were not recognizable by everyone examining the photographs.

The phenomenon of seeing shapes in random patterns is known as pareidolia. There is no mention of who examined the photographs.
The assertions of the professional photographer can be believed or not. That he reportedly denies pareidolia harms his credibility but whether that matters is a different matter.

So what happened that night? No one knows. Remember the limitations of eyewitness reports. Nevertheless, I will indulge in some speculation.

Green afterimages are seen after a bright red light. This could be a candle or the aforementioned red strobe. Particularly in a dark room such an afterimage can appear as a three-dimensional glowing mass. It would require an iron will to believe in order not to see it for what it is, though.
Whether this will is present in the witnesses from whom we heard, you decide.

Lights from outside may also have played a role.

Then, there is still the possibility that someone, one of the family’s kids or a spectator, played a prank with a flashlight. It seems to me, though, that there should be more photos with a bright light then. Still, an iron will to disbelieve might lead one to dismiss such photos, especially if the light was perhaps usually turned less bright. So, who knows.

The fifth visit

Our fifth visit to Doris’ house resulted in a large-scale magnification of all phenomena. We began by duct taping large black poster boards up on the walls and ceiling of the bedroom, all of which were numbered and identified with a magnetic orientation. White duct tape was placed between the dark panels that formed a grid network, like graph paper, therein providing us with a reference for further attempts at photographing the lights. Black poster boards were also used to seal off all the light entrances into the bedroom that rendered the environment almost pitch black.
With over 30 individuals, some of whom were volunteers from our UCLA Parapsychology laboratory, the lights returned and were even more brilliant than before, as well as demonstrating a direct responsiveness to our verbal suggestions. The three-dimensional lights seemingly reacted and responded to our jokes and various provoking remarks, especially those of Doris.

They asked the lights to blink on a certain panel, which they did.

[...]two blinks in panel three for “yes” and four flashes in panel six for “no.”

There is no full account of the questions asked, nor the answers given. In short:

The answers we received could not be confirmed, and never really made any sense.

These answers are interesting because if everyone saw the same answers then that would indicate that there really was a light, regardless of origin.
Unfortunately, I don’t think that can be taken as a given. They were sitting around in the dark waiting for something. Now if someone just shouts out that they see something then, quite possibly others will go along. That someone would buck the trend and say that he or she does not see something seems less likely. They might figure that their night sight is not good enough.
And then there’s the question of to what degree Taff and Gaynor would have been willing or able to credit skeptical remarks.

This night produced the best piece of evidence in the form of this photo.

In the photo you see two arcs of light, which is not what was seen, according to the report.
The smaller one on the left looks to me like a “kink mark” (don’t google that at work). That’s what happens when a chemical film is roughly handled and bent which causes the light sensitive chemicals to be displaced.
Often, the smaller, left arc is cut off, leaving only the large arc above Doris Bither. Much is made of that.
The arcs are interpreted as traces of the balls of light, left because the shutter speed was not high enough. On its face that seems plausible but notice how the arcs fade out at the ends. That points rather to a kink mark in my opinion.
In either case, I have to agree with the often made argument that these arcs cannot be light projected on the background since otherwise there should be bends and discontinuities. This is particularly true for the left arc that covers someone’s head. Though for some reason Taff always points to the larger arc not being bent by the corner.

I don’t see any poster boards on the walls or windows. That could mean that Taff is mixing up dates and this photo was actually taken the previous day.
Or it may be that it simply was not taken in the bedroom. That would indicate that it was not taken during the séance when the lights were seen.
Either way, it is a worrying discrepancy.

When Adrian Vance, the West Coast Editor of Popular Photography examined the negatives of these photos, he was as perplexed as we were. According to Vance, the very nature of optical glass in a 35 mm. SLR camera prohibits such inverted arcs from occurring. Yet here they are. Vance could not conceive of any known artifact or anomaly to account for such images.

I find it hard to believe that the smaller arc is not a kink mark as it matches the examples I have seen. I don’t know about the bigger one.
There is reason to believe that Vance judges the likelihood of errors in pictures differently than many of his colleagues: he is, or was, also an occasional UFO researcher who validated some UFO photos. Whether his judgment is more or less accurate than that of his colleagues, I cannot say.
The photo was published in the magazine. It would be very interesting to know what letters-to-the-editor that produced. (“UCLA Group Uses Camera to Hunt Ghosts”, Popular Photography, May 1976, pp. 102 & 115.)

They also had a Geiger counter which gave no reading at one point during the evening. This is interpreted as radiation being actually shielded rather than the device simply malfunctioning. No comment.

In the same night, the poster boards were pulled from the walls. It was suggested that maybe the heat could have weakened the adhesive tape but Taff asserts that some paint and plaster had been pulled from the wall as well and was still sticking to the tapes.
The possibility that one of the family was responsible was not discussed. In some interviews the idea that Doris Bither may have done it herself is repudiated by saying that she was petite. I think ladder technology had already been developed by 1974 but whatever. When I hear of petty vandalism, I’m thinking teen-age male.

The sixth visit

Our sixth session at the house took place five days later and in most respects was a repeat performance of our fifth visit with the exception that the lights repeatedly began to take shape; forming the lime green, partially three-dimensional, apparitional image of a very large, muscular man whose shoulders, head and arms were readily discernible by the more than twenty individuals’ present. However, no salient facial characteristics of this apparition were discernible.

Nothing is made of it but this is the first time an apparition is seen by outsiders, similar to what the family had claimed.
Allegedly two persons, called Jeff and Craig, fainted when they saw the apparition. Perhaps fainting is not so unusual in the California heat in a crowded, ill-smelling room.

The display of lights this evening was so intense that they easily illuminated the numbered poster boards covering the walls of the bedroom’s corner. Even the clothes of the individuals observing the lights from outside the séance circle were brightly lit by the luminous activity. In fact, so piercing were the lights that they were seen to reflect off the camera’s aluminum frame and lenses, all of which were aimed directly at the corner where the optical display was concentrated.

Not surprising to us, considering the past two attempts at photographing the lights, all the negatives were perfectly clear, as if no light whatsoever was present to expose the film.

Undoubtedly, that was the most amazing night. If there was real light then someone must have very skillfully manipulated seven cameras unnoticed which seems just implausible.
The apparition is attested to by both Gaynor and Taff. Could they have convinced each other of having seen something that was not there at all? The answer is yes, of course. And while this is not implausible as an explanation, it is deeply unsatisfying to me. What really happened there?

The seventh visit

Again, the poster boards had been torn down. Also Doris Bither and two of her sons reported other psychokinetic phenomena. She showed a large bruise on her arm which supposedly resulted from having been hit by a candelabra thrown by an invisible force.

Accompanying us on this evening was Dr. Thelma Moss, head of our laboratory at UCLA’s Neuropsychiatric Institute, various assistants from the lab, several psychiatrists from the institute who professed an interest in such phenomena, and Frank De Felitta, a renowned writer, producer, and director of The Stately Ghosts of England (NBC, 1965).

This appears to be the only occasion on which De Felitta and Mort Zarkoff accompanied them, despite how it sounds in interviews.

There were some faint glimmerings of light, but they were in no way intense enough to cause any real excitement.
[..]
Sadly, those attending only on this evening did not witness the “magnificent” display of swirling, three-dimensional lights or the apparition that had occurred within the house.

The obvious question is why no show? An obvious but speculative answer is the presence of some more skeptical authority figures in the form of Thelma Moss and several psychiatrists. If the lights were really just afterimages and hysteria then a few more sober persons, not going with the flow and not afraid to voice doubt, would have been quite a damper.
Of course, it’s also possible that a hoaxer simply lost interest, was not invited or any number of other things.
Taff remarks that Bither was much calmer during that visit and for the first time not intoxicated.

However, at one point during the séance, Gaynor suggested to the “presence” in the house, whatever it was, that it should demonstrate its strength by again tearing the poster boards off the walls, but this time, in our presence. As if in immediate reply, within five seconds following Gaynor’s request, several of the poster boards directly over Doris’ head were suddenly torn loose from their position and sharply struck her in the face.
Both Gaynor and I, as well as others in the room, could easily observe the bizarre sight of the duct tape being pulled, again as if by unseen hands, from the boards on the wall.

That sounds quite amazing and even more so when you hear Gaynor relating that in an interview.

Needless to say, the opinions of some of those attending only for the seventh visit to Doris’ house were anything but positive, as our claims were only marginally supported. As far as the activity surrounding the poster boards was concerned, many of those present felt that their sudden removal from the walls and ceiling was explainable under the heating and humidity hypothesis discussed earlier. Yeah, right. And pigs can fly too.

This is a remarkable passage. It reveals that there are two completely different views of the happenings. And it is the only passage that reveals that. I also wonder why it should be needless to say that.
Apparently what Gaynor and Taff witnessed was a solidly fixed board being torn from the wall and flying straight at Bither, right on cue. What the others witnessed was boards just falling down. I wonder if these less impressed witnesses would even agree that it happened on cue. Witness reports are often unreliable with regard to timing.

Needless to say, Frank and Mort were absolutely amazed by even this, less than expected, occurrence. Unfortunately, when they had their special films processed, it did not reveal anything significant as related to what we all had observed that night in Doris’s bedroom.

That nothing remarkable was caught on camera means either that it was not pointed at Bither or the right poster boards at the right time, or that the falling boards did not look impressive.

Note that Taff refers to events that all witnessed. That makes one wonder about the other occasions where supposedly many people saw something. Did everyone really see something amazing or just something?

The last visit

The 8th and last visit took place on Halloween 1974. It is described as even less remarkable than the previous one. No more details are given.

Radin for a Rerun

This is the 3rd and currently last part in my series on parapsychological double-slit experiments. It discusses the pilot study by Dean Radin and the six following experiments by him and others.

The Less Said…

In 2008, Dean Radin published an article in Explore, The Journal of Science & Healing, a journal dedicated to alternative medicine. That’s certainly a good way to hide it from anyone who knows or cares about physics. Or ethics.
Dean Radin is currently coeditor-in-chief.

Now hold tight because this paper is bad.

Obviously, the original design and justification for the experiment is taken from Jeffers, though Radin fails to give him the appropriate credit. However, the implementation is different.

Radin used a so-called Michelson interferometer for the experiment. This is physically equivalent to Jeffers’ set-up in all relevant aspects. However, due to being extremely sensitive to environmental factors, like temperature or vibrations, it seems a less than ideal choice.
Another thing that is different is the outcome measure. That’s where things really go south. He uses a CCD chip to capture the interference pattern. Unfortunately he then completely disregards it.
Instead he computes the total intensity of the light reaching the chip.
With that, the experiment becomes a test of whether the subjects can cast a shadow by concentrating on a spot.
Radin seems completely oblivious of that simple fact.
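To see in numbers why the total intensity says nothing about the interference pattern, here is a minimal sketch. It assumes an idealised fringe pattern I(x) = 1 + V cos(x), sampled over whole periods, where V is the fringe contrast. The function names and sample counts are made up for illustration and nothing here is taken from Radin's paper or his actual analysis.

```python
import math

def fringe_pattern(visibility, periods=5, samples_per_period=200):
    """Idealised fringes: I(x) = 1 + V*cos(x), sampled over whole periods."""
    n = periods * samples_per_period
    return [1.0 + visibility * math.cos(2 * math.pi * i / samples_per_period)
            for i in range(n)]

def contrast(pattern):
    """Fringe contrast (visibility): (Imax - Imin) / (Imax + Imin)."""
    return (max(pattern) - min(pattern)) / (max(pattern) + min(pattern))

def total_intensity(pattern):
    """Sum over all pixels, i.e. what is left once the pattern is ignored."""
    return sum(pattern)

sharp = fringe_pattern(0.99)    # nearly perfect interference (hypothetical value)
washed = fringe_pattern(0.50)   # interference partly destroyed (hypothetical value)

print("contrast:", contrast(sharp), contrast(washed))
print("total:   ", total_intensity(sharp), total_intensity(washed))
```

The two contrast values differ clearly, while the two totals agree up to rounding. In this idealised model, a change in the interference pattern leaves the summed intensity untouched, so the sum can only respond to light being added or blocked.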

In all likelihood, Radin simply did not get a positive result and then, instead of accepting it, went fishing. How that works (except not really) is simply explained here.

The title of the paper, Testing Nonlocal Observation as a Source of Intuitive Knowledge, tells us that Radin sees this as relevant to intuition. Intuition, as he points out, is regarded as an important source of artistic inspiration or for scientific insights. It’s a bit hard to see what the connection between that and casting shadows should be. But remember that Radin doesn’t know what he is doing. He believes this is like Jeffers’ original design.
His first error is in thinking that the original design could show that someone gains knowledge about the photons’ paths. Learn why not in this post. This is a simple misunderstanding and easy to follow.
His second error is weirder and more typically parapsychological. He thinks that, if people can gain knowledge about the paths of a few photons in some unknown way, then it’s reasonable to assume that they can gain any knowledge via the same unknown mechanism.
We detect photons all the time with our eyes, and much more effectively. If that doesn’t tell us anything about intuition, then why should this?

The bottom line is that evidently the experiment was botched. And even if it hadn’t been, the conclusions he tries to draw just wouldn’t follow. The connections he sees are just not there, at least as far as there is evidence.

Six More Experiments

Finally, we get to the current paper by Dean Radin, Leena Michel, Karla Galdamez, Paul Wendland, Robert Rickenbach, and Arnaud Delorme. I think Radin wrote most of the paper. It repeats so many errors of the previous one and the statistics are his style.

The paper is miles better. It follows Jeffers’ original set-up quite closely. For one, they use a standard double-slit set-up. They don’t measure the contrast in the way Jeffers did but something that should work just as well. At least, the justification seems solid to me though I don’t know enough to actually vouch for it.
What’s more, they say that the ups and downs of the measure were given as feedback to the subjects. This again follows Jeffers’ lead and, importantly, gives me some confidence that it was not chosen after the fact. But again Jeffers is not properly credited with coming up with the original design of the experiment.

Unfortunately this paper again does not report the actual physical outcome. They don’t report how much the subjects were able to influence the pattern. What seems clear is that none of them was able to alter the pattern in a way that stood out above the noise.
It would have been interesting to know if the measurements were any more precise than those conducted by Jeffers and Ibison. If their apparatus is better, and the effect still couldn’t be clearly measured, then that would suggest confirmation that Ibison’s result was just chance.

They calculate that, on average, the pattern in periods where subjects paid attention was slightly different from the pattern when they did not pay attention. That is valid in principle because you can get a good measurement with a bad apparatus by simply repeating the process a lot of times.
However, any physicist or engineer would try their utmost to improve that apparatus rather than rely on repetitions.
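The point about repetition can be sketched with a small simulation. All the numbers below (baseline contrast, noise level, size of the shift) are made up for illustration and are not taken from the paper. The sketch only shows the standard result that the standard error of a mean shrinks like 1/sqrt(n).

```python
import math
import random
import statistics

def mean_with_error(values):
    """Return the mean and its standard error (sample stdev / sqrt(n))."""
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))

def simulate(n, true_shift=0.0002, noise_sd=0.001, seed=1):
    """Simulate n noisy readings per condition; every number here is hypothetical."""
    rng = random.Random(seed)
    attention = [0.991 - true_shift + rng.gauss(0, noise_sd) for _ in range(n)]
    rest = [0.991 + rng.gauss(0, noise_sd) for _ in range(n)]
    mean_a, se_a = mean_with_error(attention)
    mean_r, se_r = mean_with_error(rest)
    # Difference of the two means and the standard error of that difference.
    return mean_r - mean_a, math.sqrt(se_a ** 2 + se_r ** 2)

for n in (10, 100, 10000):
    diff, se = simulate(n)
    print(f"n={n:>5}  difference={diff:+.6f}  standard error={se:.6f}")
```

With only a few readings the assumed shift is smaller than the error bar. With many readings it stands out, but only because the error bar has shrunk, not because any single measurement got better. That is why improving the apparatus is the more convincing route.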

On the whole, this is not handled like a physics experiment but more like one in social science.
Most importantly, the effect size reported is not about any physically measurable change to the interference pattern. It is simply a measure of how much the result deviated from “chance expectation”. That’s not exactly what should be of top interest.

The paper reports six experiments. To give credit where credit should not be due, some would have pooled the data and pretended to have run fewer but larger experiments to make the results seem more impressive. That’s not acceptable, of course, but still done (e.g. by Daryl Bem).

Results and Methods

The first four experiments presented all failed to reach a significant result, even by the loose standards common in social science. However, they all pointed somewhat in the right direction which might be considered encouraging enough to continue.

Among these experiments there were two ill-conceived attempts to identify other factors that might influence the result.
The first idea is that somewhat regular ups and downs in the outcome measure could have coincided with periods of attention and no attention. I can’t stress enough that this would have been better addressed by trying to get the apparatus to behave.
Instead, Radin performs a plainly bizarre statistical analysis. I’m sure this was thought up by him rather than a co-author because it is just his style.
Basically, he takes out all the big ups and downs. So far so fine. This should indeed remove any spurious ups and downs coming from within the apparatus. But wait, it should also remove any real effect!
Radin, however, is satisfied with still getting a positive result even when there is nothing left that could cause a true positive result. The “positive” result Radin gets is obviously a meaningless coincidence that almost certainly would not repeat in any of the other experiments. And indeed, he reports the analysis only for this one experiment.
Make no mistake here, once a method for such an analysis has been implemented on the computer for one set of data, it takes only seconds to perform it on any other set of data.
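The objection can be illustrated with a toy example. The paper's exact procedure is not reproduced here. As a stand-in, this sketch strips slow swings by subtracting a centred moving average, and it assumes that a real effect would switch on and off with the attention blocks. Block length, window size and effect size are all hypothetical.

```python
def block_signal(effect, block_len, n_blocks):
    """Alternate rest blocks (level 0) and attention blocks (level `effect`)."""
    signal = []
    for b in range(n_blocks):
        level = effect if b % 2 == 1 else 0.0
        signal.extend([level] * block_len)
    return signal

def remove_slow_swings(signal, window):
    """Subtract a centred moving average (stand-in for removing big ups and downs)."""
    half = window // 2
    out = []
    for i, value in enumerate(signal):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(value - sum(signal[lo:hi]) / (hi - lo))
    return out

def attention_minus_rest(signal, block_len):
    """Mean during attention blocks minus mean during rest blocks."""
    on = [v for i, v in enumerate(signal) if (i // block_len) % 2 == 1]
    off = [v for i, v in enumerate(signal) if (i // block_len) % 2 == 0]
    return sum(on) / len(on) - sum(off) / len(off)

raw = block_signal(effect=1.0, block_len=50, n_blocks=20)
filtered = remove_slow_swings(raw, window=5)

print("before filtering:", attention_minus_rest(raw, 50))
print("after filtering: ", attention_minus_rest(filtered, 50))
```

In this toy case the filter wipes out almost all of the difference between attention and rest, leaving only small residues at the block edges. A filter that removes swings on the time scale of the attention blocks cannot tell an apparatus drift from a genuine effect on that same time scale.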

The second attempt concerns the possibility that warmth might have affected the result. A good way to test this is probably to introduce heat sources into the room and see how that affects the apparatus.
What is done is quite different. Four thermometers are placed in the room while an experiment is conducted. The idea seems to have been that if the room gets warmer this indicates that warmth may have been responsible. Unfortunately, since you don’t know if you are measuring at the right places, you can’t conclude that warmth is not responsible if you don’t find any. Besides it might not be a steady increase you are looking for. In short, you don’t know if your four thermometers could pick up anything relevant or how to recognize it, if they did.
Conversely, even if the room got warmer with someone in it, this would not necessarily affect the measurement adversely.
In any case, temperature indeed seemed to increase slightly. Why the same temperature measurements were not conducted in the other experiments, or why the possible temperature influence was not investigated further, is unclear to me. They believe this should work, so why don’t they continue with it?

The last two experiments were somewhat more elaborate. They were larger, comprising about 50 subjects rather than about 30, and took an EEG of subjects. The fifth experiment is the one success in the lot insofar as it reports a significant result.

Conclusion

If you have read the first part of this series then you have encountered a mainstream physics article that studied how the thermal emission of photons affects the interference pattern. What that paper shares with this one is that both are interested in how a certain process affects the interference pattern.

And yet the papers could hardly be more different. The mainstream paper contains extensive theoretical calculations that place the results in the context of known physics. The fringe paper has no such calculations and relies mainly on pop science accounts of quantum physics.

The mainstream paper presents a clear and unambiguous change in the interference pattern. Let’s look at it again.

The dots are the particles and the lines mark the theoretically expected interference patterns fitted to the actual results. As you can see the dots don’t exactly follow the lines. That’s just unavoidable random variation due to any number of reasons. And yet the change in the pattern can be clearly seen.

From what is reported in Radin’s paper we can deduce that the change associated with attention was not even remotely as clean. In fact, the patterns should be virtually identical the whole time.
That means that if there is a real effect in Radin’s paper, it is tiny. So tiny that it can’t be properly seen with the equipment they used.

That is hardly a surprising result. If paying attention to something was able to change its quantum behavior in a noticeable way, then this should have been noticed long ago. Careful experiments should be plagued by inexplicable noise, depending on what the experimenters are thinking about.

The “positive” result that he reports suffers from the same problem as virtually all positive results in parapsychology and also many in certain recognized scientific disciplines. It may simply be due to kinks in the social science methodology employed.
Some of the weirdness in the paper, not all of which I mentioned, leaves me with no confidence that there is more than “flexible methods” going on here.

Poor Quantum Physics

Radin believes that a positive result supports “consciousness causes collapse”.  He bemoans a lack of experimental tests of that idea and attributes it, quite without justification, to a “taboo” against including consciousness in physics.
In other words, thousands upon thousands of physicists, and many times more students, have, out of some desire to conform, simply refused to do some simple and obvious experiment. I think it says a lot about Radin and the company he keeps that he has no problem believing that.
I don’t know about you, my dear readers, but in such a situation I would have thought differently: either all those people who should know more about the subject than I do have their heads up their behinds, or maybe it is just me. And I would have wondered if there was maybe something I am missing. And I would have found out what it was and avoided making an ass of myself. Then again, I would have (and have) also avoided book deals and the adoration of many fans and the like, all of which Radin secured for himself.
So who’s to say that reasonable thinking is actually the same as sensible thinking.

But back to the physics. As is obvious when one manages to find the relevant literature, conscious awareness of any information is not necessary to affect an interference pattern. Moreover, wave function collapse is not necessary to explain this. Both of these should be plain from the mainstream paper mentioned here.

Outlook

My advice to anyone who thinks that there’s something to this is to try to build a more sensitive apparatus and/or to calibrate it better. If the effect still doesn’t rise over the noise, it probably wasn’t there in the first place. If it does, however, future research becomes much easier.
For example, if tiny magnetic fields influence this, as Radin suggests, that could be learned in a few days.

Unfortunately, it does not appear that this is the way Dean Radin and his colleagues are going about it but I’ll refrain from comment until I have more solid information.
But at least they are continuing the line of investigation. They deserve some praise for that. It is all too often the case that parapsychologists present a supposedly awesome, earth-shattering result and then move on to do something completely different.

 

Update

I omitted to comment on a lot of details in the second paper to keep things halfway brief. In doing so I overlooked one curiosity that really should be mentioned.

The fourth experiment is “retrocausal”. That means, in this case, that the double-slit part of the experiment was run and recorded three months before the humans viewed this record, and tried to influence it. The retrocausality in itself is not really such an issue. Time is a curious thing in modern physics and not at all like we intuit.

The curious thing is that it implies that the entire recording was in a state of quantum superposition for a whole three months. Getting macroscopic objects into and keeping them in such states is enormously difficult. It certainly does not just happen on its own. What they claim to have done there is simply impossible as far as mainstream quantum physics is concerned. Not just in theory, but it can’t be done in practice despite physicists trying really hard.

A Physicist Investigates

This is the second part in my series on the parapsychological double-slit experiment.

The original double-slit experiment was performed in 1803 by Thomas Young.
The parapsychological version was thought up by one of his intellectual heirs, almost 200 years later.

In the Beginning

Stanley Jeffers, a professor of physics at York University in Toronto, Canada, came across the results from PEAR lab during a sabbatical in 1992. The PEAR project claimed to have shown that people could influence Random Number Generators (RNGs) by their intention alone. There were also earlier experiments which claimed that people could influence radioactive decay.
Jeffers became interested enough to follow up on that. He came up with a more direct way to test this idea by having people try to influence the diffraction of photons, without success. (Jeffers and Sloan 1992)

Afterwards, he refined that experiment by using a double-slit set-up. This makes for an exquisite test of whether humans have some unknown means of affecting the world.
If there is some way of making information available about which slit the particles went through, then this should affect the interference pattern. Moreover the change in the pattern would allow one to estimate just how much information was made available.
It’s not necessary for anyone to become consciously aware of this information, it’s enough that it exists. For example, if people can somehow influence what happens at one slit but not the other, then that could “tag” the particle in such a way that it becomes possible to distinguish through which slit it went. Such a thing would be sufficient to affect the interference pattern.
More directly, if people can clairvoyantly observe one slit (aka remote view it), then this too should make which-slit information available.
There’s a catch, though. That’s only true as far as we understand the laws of nature. However, the paranormal, by definition, is not bound by that.

There’s another thing that could lead to a positive result. The experimental apparatus needs to be carefully calibrated. If the right parts are bent out of shape by just a few micrometers or if the laser is somehow affected then this could also lead to a change in the pattern. It would be quite tricky to mimic the change expected from available information but it could still lead to a positive result under the right (or wrong) circumstances: either if the subconscious psychokinetic ability is also ingenious or if the pattern is not scrutinized properly.
Jeffers seems not to have considered this possibility but, of course, outside of parapsychology such ideas, i.e. psychokinesis, tend to be regarded as philosophically rather than actually possible. Nevertheless, parapsychologists have little hesitation in jumping to any conclusion and it would be quite in line with what is often claimed.

All in all, Jeffers’ double-slit set-up promises to be a very sensitive means of detecting even the smallest influence, whatever its nature.

A Double-Slit Diffraction Experiment to Investigate Claims of Consciousness-Related Anomalies

Stanley Jeffers ran 74 sessions, each consisting of many attempts to influence the pattern. Then he gave up and presented his findings at a conference. After that, the people of the PEAR lab borrowed the device and used it to run 20 sessions.
I do not have the original conference report by Jeffers available to me but was able to find first hand information by Jeffers in the book Psi Wars: Getting to Grips with the Paranormal.
The two experiments were reported together in The Journal of Scientific Exploration, which is dedicated to publishing stuff that is, put bluntly, too stupid or crazy for normal peer-reviewed journals. (direct link to paper)
The article is authored by Ibison and Jeffers but appears written by Ibison.

While Jeffers completely failed to find anything in his 74 sessions, Ibison of the PEAR lab reported a “significant” result with only 20. Some will say that the latter experiment had a positive result.

I fear, I have to go into a little detail on that. This is an issue that will be relevant for all the following experiments. The interesting thing here is the contrast of the interference pattern. But you can’t measure it with arbitrary precision.
The pattern itself will be influenced by tiny disturbances of the apparatus caused by changes in ambient heat or vibrations. The sensor which picks up the brightness, basically a predecessor to the CCD chip in current digital cameras, is not perfect either. And then there are more esoteric things, too.
The upshot is that every measurement will have a slightly different result.
According to Jeffers in Psi Wars, repeated calibrations yield typical values for the contrast of 0.991 with a standard deviation of 0.001, while Ibison talks of “around 5% of the peak value”. I don’t quite know what Ibison means by that since it’s not how measurement uncertainty is usually reported. However, it does seem to imply more uncertainty than Jeffers’ figure. Maybe there’s a typo in one of the sources or maybe Ibison at the PEAR lab was not able to calibrate the device as well as Jeffers who was, after all, a pro at this.

What is clear is that any influence by the subjects must have been so small that it never rose above the noise, that is, the fluctuations due to imperfect measurement. Unfortunately, none of the sources I have at hand provides information on what kind of an upper bound this puts on the subjects’ maximal influence.

It is possible to measure something even if it hides in the noise. To do so, one must simply repeat the measurement over and over again and form an average. In the long run the errors in either direction will balance out. The more often one repeats the measurement, the more reliable the average. Using statistics, one can compute how reliable a result is, which is then expressed with error bars.
Basically, Ibison found that the contrast was slightly altered and computed that there was only a 1 in 20 chance for that to happen merely from noise. That means, if you repeated the experiment many times, you’d get such a result or a better one 1 in 20 times, if you’re only gazing at noise. That’s, of course, not very impressive, even without taking into account that such figures tend to be exaggerated. There are a couple of things that can make a result appear more unlikely than it really is. Physicists tend to ask for much more clear-cut results.
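What a 1 in 20 chance means can be checked with a short simulation. The sketch below runs many experiments in which there is, by construction, nothing but noise, and counts how often the difference between two conditions still passes a 5% cut-off. The sample sizes, the noise level and the two-sided threshold of 1.96 are my assumptions for illustration and are not details of Ibison's analysis.

```python
import math
import random
import statistics

def noise_only_experiment(rng, n=100, noise_sd=0.001):
    """One experiment with no real effect: z-score of the difference in means."""
    attention = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    rest = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    diff = statistics.fmean(attention) - statistics.fmean(rest)
    se = math.sqrt(statistics.variance(attention) / n + statistics.variance(rest) / n)
    return diff / se

def false_positive_rate(experiments=2000, threshold=1.96, seed=7):
    """Fraction of noise-only experiments that pass a two-sided 5% cut-off."""
    rng = random.Random(seed)
    hits = sum(abs(noise_only_experiment(rng)) > threshold for _ in range(experiments))
    return hits / experiments

print("share of 'significant' noise-only experiments:", false_positive_rate())
```

The printed share should come out close to 0.05, that is, about one in twenty, even though no effect exists anywhere in the simulated data.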

Conclusion

Fundamentally, this should be a discouraging result for parapsychology. Even though a device was constructed that was able to detect influences far smaller than what most parapsychological experiments require, the supposed effect was still hidden in the noise.

Someone who believes that there really was an effect there has two basic options. The preferred one should be to build a better apparatus, that has less noise and so will let the true effect stand out.
The other one is, as mentioned, to run this experiment a lot of times but that will not be nearly as useful, nor as convincing.

Further Thoughts

I mentioned that one way of achieving a positive result in this test is by exerting some physical force on the apparatus. A physicist or engineer would test this directly by having people try to exert that force on a force sensor of some kind.
Obviously, parapsychologists don’t do it that way. It fails to show what they know to be true.
One answer to that is to abandon careful experiments and simply argue that some feat can’t possibly have been a trick because they surely would have noticed. Parapsychologists apparently do not suffer from the same limitations to perception as the rest of us. Or perhaps they suffer from a few more inabilities to see something.
The other answer is to conduct experiments involving randomness, like the RNGs of PEAR or radioactive decay or something as simple as a die. These experiments seem much less silly than any argument for the genuineness of something that looks like a magic trick.
But are they really? Why not measure the effect directly?
Why is there always randomness involved?
An obvious answer is that the “effect” is simply the result of data mining. My take on that here.

But, since I aim to deliver the whole truth, I must also tell you that parapsychologists have ideas on the issue, too. One idea that Ibison mentions in that article is that maybe experimenters can see into the future and somehow know, subconsciously, when to start an experimental run so that the purely random processes will deliver a favorable result. Ibison is quite open about the fact that the apparatus was basically just generating random numbers and not necessarily measuring anything.

Footnote

Stanley Jeffers has published his conclusions on the PEAR lab results in Skeptical Inquirer in 2006 and 2007, and has also written an essay which was published in the book Psi Wars: Getting to Grips with the Paranormal. (Review by Caroline Watts)

Attention! Double-slit!

Recently Dean Radin and others published an article that purports to study the effects of attention on a double slit experiment.

Originally I wanted to do just a rebuttal to that but then found it necessary to also review the entire background. The simple rebuttal spiraled out of control into a 3-part series. My old math teacher was right. Once you add the imaginary, things get complex, for reals. And not only for them.

A Word of Caution

People often ask for evidence when they are faced with something they find unlikely. The more skeptical will also ask for evidence for something they consider credible, at least sometimes. For the academically educated, evidence means articles published in peer-reviewed, reputable, scientific journals.
For example, all the articles I cite as evidence in the first part, where I look at mainstream quantum physics, are from such journals.
So here comes the warning. Not all journals that call themselves peer-reviewed are reputable. For example, there is a peer-reviewed journal dedicated to creationist ideas. And I probably don’t need to tell you what scientists on the whole think of creationism.

The journals that published the articles discussed in this series are not reputable. Mainstream science does not take note of them. Physics Essays, where the most recent article appeared, may very well be the closest to the mainstream, and still it is mostly ignored.
It is largely an outlet for people who believe that Einstein was wrong. We’re not talking about scientists looking for the next big thing, we’re talking about people who are to Einstein’s theory what creationists are to evolution.
This is not meant as an argument against these ideas, I just don’t want to mislead anyone into believing that there is a legitimate scientific debate going on here.

That’s not to say that science ignores fringe ideas. For example, Stanley Jeffers, who appears in the second part of this series, is a mainstream physicist who decided to follow up on some of those.
He just didn’t find that there was anything there. It was a dead end.
James Alcock has a few words on that in his editorial Give the Null Hypothesis a Chance.

There are many cranks out there. These are people who hold onto some theory in the face of contrary evidence. They will not go away but they will, almost invariably, accuse mainstream science of being dogmatic. Eventually, there is nothing to be done but ignore them.

On to the Review

The first part gives a brief overview of the quantum physics background to the experiment. Dean Radin gets this completely wrong. And I fear the misunderstandings he propagates will pop up in many places.

Part 1: A Quantum Understanding

In the next part we will look at the experiment in question. Let’s call it the parapsychological double-slit experiment. We will learn who came up with the idea and what he found out and also what a positive result should look like and what it might mean.

Part 2: A Physicist Investigates

The 3rd and last part, for now, looks at the two articles authored by Dean Radin, presenting seven replications of the original design.

Part 3: Radin for a Rerun

Further studies are being conducted so more parts are likely to follow at some point.
