Why doesn’t experiment 9 replicate?

I have written about Daryl Bem’s paper “Feeling the Future” before and laid out a few of the serious issues that invalidate it.

Recently it’s been in the news again because one of the nine experiments presented in it, experiment 9, was repeated and failed to yield a positive result. Of course, no one was particularly surprised by this, except perhaps the usual die-hard believers. Still, some may wonder where the positive result came from in the first place. Just chance, or something more?

Before we can look at the actual research we need to look at the dangers of pattern seeking…

Patterns are for kilts

[Image: a group of nine people]

Let’s play a little game. We pick a few people in this image and then try to find some way to split the nine people into two groups such that most of our picks end up in one group.
For example, let’s take the 1st from the left in the top row and the 2nd in the bottom row.
Answer: males vs. females.

Again: We take the 2nd in the top row and the 2nd and 4th in the bottom row.
Possible answer: People with and without sunglasses.
It doesn’t work perfectly, but mostly it does.

If you’re creative you can find a more or less good solution for any possible combination of picks. That’s the first take-away point.

Now let’s add a bit of back story and extend our game. The group went to a casino, and some of them won big; those are the people we point out.
The goal of the game is now not only to find a good grouping but also to make up some story for why one group had most of the winners.

For example: The sunglasses are a lucky charm and that’s why the group with glasses did better.
That’s alright, but lucky charm is kind of lame.
How about: Hiding the eyes helps bluffing in poker. Much better…
But wait, correlation does not equal causation as statisticians never tire of telling us. Pro-players like to wear sunglasses, as everyone knows, and that’s why that group did better.

So if you’re creative you can even find some semi-plausible explanation for why a group did better than another.
And when the explanation need not even be semi-plausible, you can always find one without any creativity: a lucky charm, magic, or divine favor fits any case. That’s the second take-away point.

You can always find some sort of pattern in any set of random data; think of shapes in clouds. But random also means that you will rarely find the same pattern again.

For one final encore, let’s make up, for each person in that picture, how much money they won or lost in the casino. Say top left: lost $145; top, 2nd from left: won $78; and so on.
The game is now to find some attribute that tracks the winnings. An answer might be skin bared in square centimeters, or height in inches, and so on.

Experiment 9

Experiment 9 is derived from a simple psychological experiment that could run something like this:
Step 1
Ask a subject to remember a list of words. The words are flashed one at a time on a computer screen for 3 seconds each.
Step 2
Then randomly select some words for the subject to practice. The selected words appear on the screen again and the subject types them. Of course, the subject can’t make notes.
Step 3
The subject is asked to recall the words.

The result is, unsurprisingly, that more of the practiced words are recalled.
Bem switched steps 2 and 3. That is, the words are practiced after they are recalled. You wouldn’t expect that what one does after the fact makes a difference, but Bem claimed that the experiment was a success.

If you are new to parapsychology, you would probably assume that this means that more practiced words were recalled. In fact, Bem does not tell us that. We don’t know if that was the case, but the omission is telling.
Bem constructs what he calls a “differential recall” (DR) index for each subject. You compute it by first subtracting the number of control words (words that were not practiced) recalled from the number of practiced words recalled. Then you multiply this difference by the total number of words recalled. This is then turned into a percentage, but I’ll omit that step in the examples.

So if subject 1 recalls 39 words in total and 20 of these are practiced later, then 19 are control words and the index is (20 - 19) * 39 = 39.
And if subject 2 recalls only 18 words and 8 are practiced, then the index is (8 - 10) * 18 = -36.

You can already guess where this is going. The justification that Bem gives for this manipulation is:

“Unlike in a traditional experiment in which all participants contribute the same fixed number of trials, in the recall test each word the participant recalls constitutes a trial and is scored as either a practice word or a control word.”

This is just massive nonsense. As we have seen, not every recalled word counts equally: words from participants who recalled many words count more heavily. The function of the index thus runs counter to its stated purpose.
Let’s combine the examples above. Subject 1 recalled one more practiced word than control words, but subject 2 recalled two fewer. Taken at face value, this indicates that practicing after the fact does not work, although in an actual experiment two subjects would be far too few to state anything with confidence.
But now look at the combined index: 39 - 36 = +3. This indicates success. Obviously the index misleads here.
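To make the arithmetic concrete, here is a minimal Python sketch of the index as I understand it from the paper’s description (the function name is mine, and the final percentage step is omitted, as above):

```python
def dr_index(total_recalled, practiced_recalled):
    """Bem's differential recall index, without the final percentage step:
    (practiced words recalled - control words recalled) * total words recalled."""
    control_recalled = total_recalled - practiced_recalled
    return (practiced_recalled - control_recalled) * total_recalled

# Subject 1: 39 words recalled, 20 of them later practiced (19 control).
s1 = dr_index(39, 20)   # (20 - 19) * 39 = 39

# Subject 2: 18 words recalled, 8 of them later practiced (10 control).
s2 = dr_index(18, 8)    # (8 - 10) * 18 = -36

# Raw counts across both subjects: 28 practiced vs. 29 control words
# recalled, i.e. one FEWER practiced word -- no sign of an effect.
raw_difference = (20 + 8) - (19 + 10)   # -1

# Yet the combined index comes out positive, suggesting "success".
combined = s1 + s2   # 39 + (-36) = +3
```

The high-recall subject’s one-word surplus is weighted by 39, while the low-recall subject’s two-word deficit is weighted by only 18, which is how the index flips the sign.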

That the reviewers let this through is certainly a screw-up on their part. There’s no sugar-coating it.

Charitably, one might assume that Bem also made a mistake and got a significant result through sheer luck. However, that is unlikely.
The evidence, namely the advice he gives on writing articles as well as his handling of the other experiments, indicates that the index was created to force a positive result.
Still, that does not necessarily imply ill intent. He may have played around with a statistics program until he got results he liked, without ever realizing that this is scientifically worthless, even though, objectively, it is scientific misconduct.
Unfortunately, Bem displays an awareness of the inappropriateness of such methods.

The fact that Bem reports not the actual result of the experiment but only the flawed and potentially misleading DR index makes me conclude that the experiment was probably a failure. There was simply a chance association between high recall and a favorable outcome, on which the DR index capitalizes.
By random chance such a pattern may arise again but only rarely, hence the failure to replicate.

Conceptual vs. Close Replication

Believers often insist that Bem has only replicated previous work, the implication being that these experiments are replicable. But by replication they mean a so-called “conceptual replication”: experiments in general that purport to show retroactive effects, that is, the present affecting the past. Of course, when one makes up a whole new experiment, one can simply use the now familiar tricks to force a positive result.
A close replication actually repeats the experiment and is therefore bound to the same method of analysis. Only a close replication is a real replication.


Back from hiatus

As you can see, I took a half-year time-out from this blog and never delivered the promised Ganzfeld series. It’s tedious and unrewarding work, and I simply had better things to do. Hopefully I’ll bring things home in the next couple of months. Even though I have no idea who really cares, I feel a sense of duty to finish what I started.

Randi’s Prize
What I won’t finish is the chapter-by-chapter review of Randi’s Prize by Robert McLuhan. It simply doesn’t work. He cites a lot of research, and it is really that research that should be addressed rather than McLuhan’s take on it. The basic errors he himself makes are already pointed out in the reviews of the first few chapters.

Next up will be my take on the current hoopla about the failed replication of one of Bem’s experiments. Stay tuned.