Getting Wagenmakers wrong

EJ Wagenmakers et al. published the first reply to the horribly flawed Feeling the Future paper by Daryl Bem. I’ve blogged about it more times than I care to count right now.

Their most important point was regarding the abuse of statistics. Or, as they put it, that Bem’s study was exploratory rather than confirmatory.
They also suggested a different statistical method as a remedy. I’ve expressed doubts about that because I don’t think that there is a non-abusable method.

Unfortunately, what they proposed has been completely and thoroughly misunderstood. The latest misrepresentation appeared in an article by 3 skeptics in The Psychologist. I blogged about it.

How to get Wagenmakers right

The traditional method of evaluating a scientific claim or idea is Null-Hypothesis Significance Testing (NHST). This involves coming up with a mathematical prediction of what happens if the new idea is wrong. It’s not enough to say that people can’t see into the future; you must say what the results should look like if they can’t.
After the experiment is done, this prediction is used to work out how likely it was to get results such as those one actually got. If that is unlikely, one concludes that the prediction was wrong. The null hypothesis is refuted. Something other than chance is going on, and that is then taken as evidence for the original idea.
There are a number of things that can go wrong with this method. One is choosing the null prediction after the fact, based on whatever results you got.

The method that Wagenmakers argued for is different. It involves making a prediction not only about what happens if the original idea is wrong, but also about what happens if it is right.
Then, with the results of the experiment, one works out how likely the result was under either prediction. Finally, calculate how much more likely the result is under one hypothesis rather than the other. This last number is called the Bayes Factor.

For an example, imagine an ordinary 6-sided die, except that instead of the usual numbers it carries only the letters “A” and “B”. The die comes in 2 variations: one has 5 “A”s and 1 “B”, the other 1 “A” and 5 “B”s.
You roll a die once and get an “A”. This result is 5 times as likely with the first variant as with the second, so the Bayes Factor is 5 in favor of the first.
You could use this result to speculate about what kind of die you rolled. But what if there is a third variant of die? One that has, say, 3 “A”s and 3 “B”s. Then your Bayes Factor would be different.
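
Here is that calculation as a quick Python sketch, using only the likelihoods from the toy example; the 3 “A”/3 “B” die is the hypothetical third variant mentioned above.

```python
# Probability of rolling an "A" once, for each die variant.
p_A_die1 = 5 / 6   # five "A" faces, one "B"
p_A_die2 = 1 / 6   # one "A" face, five "B"s
p_A_die3 = 3 / 6   # hypothetical third variant: three of each

# Bayes Factor: how much more likely the observed "A" is
# under one hypothesis than under the other.
print(p_A_die1 / p_A_die2)  # 5.0   -> the roll favors die 1 over die 2 by a factor of 5
print(p_A_die1 / p_A_die3)  # ~1.67 -> against the third variant the same roll says far less
```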

The Bayes Factor depends crucially on the two hypotheses being compared. Depending on which 2 hypotheses are compared, one or the other can come out looking more likely.

In the case of Feeling the Future, the question is basically what we should assume the results will look like if people really can feel the future. How much feeling for the future should we assume?
Wagenmakers et al. said that if one cannot assume anything for lack of information, then one should use a default assumption suggested by several statisticians. That assumption implied that people might be a little good at feeling the future or maybe very good.
Bem, along with two statisticians, countered that we already know people are not very good at feeling the future. Parapsychological abilities are always weak, and therefore one should use a different assumption, under which the evidence comes out looking much stronger.

Let’s make this intuitively clear. Think again of the dice with 5 As or 5 Bs. You are told that one die was rolled 100 times and showed 30 As and 70 Bs. Clearly that is more likely to be a 5 B die than a 5 A die. But wait: what if, instead of comparing those 2 with each other, we compare either of them with a die that has 2 As and 4 Bs? That die would win.
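
Here is that comparison as a small Python sketch, using the binomial likelihood of 30 “A”s in 100 rolls for each of the three dice mentioned above.

```python
from scipy.stats import binom

n, k = 100, 30   # 100 rolls, 30 of them "A"

# Probability of rolling "A" for each die variant.
dice = {"5A/1B": 5 / 6, "1A/5B": 1 / 6, "2A/4B": 2 / 6}

for name, p in dice.items():
    print(name, binom.pmf(k, n, p))

# The 1A/5B die beats the 5A/1B die by an enormous factor, but the
# 2A/4B die is more likely still: which hypothesis "wins" depends
# entirely on which hypotheses are in the comparison.
```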

I have simplified a lot here. If something doesn’t seem to make sense it’s probably because of that and not because of a problem in the original literature.

Bem’s argument makes a lot of sense but overlooks that belief in strong precognition is widespread, even among parapsychologists. Tiny effects are what they get, not what they hope for or believe in. Both parties have valid arguments for their assumptions but neither makes a compelling case. On the whole, however, it does show a problem with the default Bayesian t-test.

Let me emphasize again that Wagenmakers made two points: first, that Bem made mistakes in applying the statistics; and second, that it would be better to use the default Bayesian t-test rather than the traditional NHST. These are separate issues.
In my opinion, the abuse of statistical methods is the crucial issue that cannot be solved by using a different method.

How to get Wagenmakers wrong

Bayesian statistics is often thought of as involving a prior probability. In fact, the defining characteristic of Bayesian statistics is that it includes prior knowledge.

Again let’s go with the example. You’re only concerned with the 2 die variants, the one with 5 “A”s and the one with 5 “B”s. Someone keeps throwing the same die and telling you the result. You can’t see the die, of course, but are supposed to guess which die was thrown solely based on the results.
Intuitively, you’ll probably be tending more toward the first kind with every “A” and more toward the second type with every “B”.
But what if I told you that I randomly picked the die out of a box with 100 dice of the 5 “A” variant and only 1 of the 5 “B” variant? You’ll start out assuming it should be the 5 “A” variant and will require a lot of “B”s before switching.
Formally, we’d compute the Bayes Factor from the data and then use that factor to update the prior probability to get the posterior probability. The clearer the data is, and the more data one has, the greater the shift in what we should hold to be the case.
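
A minimal sketch of that update in Python, using the box-of-dice numbers from above (100 dice of the 5 “A” kind, one of the 5 “B” kind):

```python
# Prior odds in favor of the 5"A" die: 100 of them in the box vs. a single 5"B" die.
prior_odds = 100 / 1

# Each observed "B" is 5 times as likely under the 5"B" die, so every "B"
# multiplies the odds for the 5"A" die by 1/5 (a Bayes Factor of 5 against it).
bf_per_B = 1 / 5

for n_b in range(5):
    posterior_odds = prior_odds * bf_per_B ** n_b
    print(f"after {n_b} B's: odds {posterior_odds:g} : 1 in favor of the 5'A' die")

# Output: 100, 20, 4, 0.8, 0.16 -- only after the third "B" does the
# posterior tip over in favor of the 5"B" die.  The data shift belief;
# the prior decides how much data it takes.
```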

In reality one will hardly ever know which of several competing hypotheses is more likely to be true. Different people will make their own guess. Some, maybe most, people will regard precognition as a virtual impossibility, a few as a virtual certainty.
Wagenmakers et al. showed that even if one assigns a very low prior probability to the idea that people can feel the future (or rather to the mathematical prediction based on that idea), 2,000 test subjects would yield enough data to shift opinion firmly towards precognition being true.

Unfortunately, some people completely misunderstood that. They thought that Wagenmakers et al were saying that we should not regard Bem’s data as convincing because they assigned a low prior probability. In truth the only assumption that went into the Bayes factor calculation was regarding the effect size. That point was strongly emphasized but still people miss it.


[sound of sighing]

The May issue of The Psychologist carries an article by Stuart Ritchie, Richard Wiseman and Chris French titled Replication, Replication, Replication plus some reactions to it. The Psychologist is the official monthly publication of The British Psychological Society. And the article is, of course, about the problems the 3 skeptics had in getting their failed replications published.

Yes, replication is important

That the importance of replications receives attention is good, of course. Repositories for failed experiments are important and have the potential to aid the scientific enterprise.

What is sad, however, is that the importance of proper methodology is largely overlooked. Even the 3 skeptics who should know all about the dangers of data-dredging cavalierly dismiss the issue with these words:

While many of these methodological problems are worrying, we don’t think any of them completely undermine what appears to be an impressive dataset.

But replication is still not the answer

I have written about how replication cannot be the whole answer before. In a nutshell, by cunning abuse of statistical methods it is possible to give any mundane and boring result the impression of showing some amazing, unheard of effect. That takes hardly any extra work but experimentally debunking the supposed effect is a huge effort. It takes more searching to be sure that something is not there than to simply find it. For statistical reasons, an experiment needs more subjects to “prove” the absence of an effect with the same confidence as finding it.
But there is also the possibility that some difference between the original experiment and the replication explains the lack of effect. In this case it was claimed that maybe the 3 failed because they did not believe in the effect. It takes just seconds to make such a claim. Disproving it requires finding a “believer” who will again run an experiment with more subjects than the original.

Quoth the 3 skeptics:

Most obviously, we have only attempted to replicate one of Bem’s nine experiments; much work is yet to be done.

It should be blindingly obvious that science just can’t work like that.

There are a few voices that take a more sensible approach. Daniel Bor writes a little about how neuroimaging, which has (or had) extreme problems with useless statistics, might improve by fostering greater expertise among its practitioners. Neuroimaging seems to have made methodological improvements. What social psychology needs is a drink from the same cup.

The difficulty of publishing and the crying of rivers

On the whole, I find the article by the 3 skeptics to be little more than a whine about how difficult it is to get published, hardly an unusual experience. The first journal refused because they don’t publish replications.
Top journals are supposed to make sure that the results they publish are worthwhile. Showing that people can see into the future is amazing; not being able to show that is not. Back in the day the constraint was simply the limited number of pages that could be stuffed into an issue; these days, with online publishing, there is still the limited attention of readers.
The second journal refused to publish because one of the peer-reviewers, who happened to be Daryl Bem, requested further experiments to be done. That’s a perfectly normal thing and it’s also normal that researchers should be annoyed by what they see as a frivolous request.
In this case, one more experiment should have made sure that the failure to replicate wasn’t due to the beliefs of the experimenters. The original results published by Bem were almost certainly not due to chance. Looking for a reason for the different results is good science.

I’ve given a simple explanation for the obvious reason here. If the 3 skeptics are unwilling or unable to actually give such an explanation they are hardly in a position to complain.

Beware the literature

As a general rule, failed experiments have a harder time getting published than successful ones. That’s something of a problem because it means that information about what doesn’t work is lost to the larger community. When there is an interesting result in the older literature that seems not to have been followed up on, it probably is the case that it didn’t work after all: the original report was a fluke and the “debunking” was never really published. Of course, one can’t be sure that it wasn’t simply overlooked, which is a problem.
One must be aware that the scientific literature is not a complete record of all available scientific information. Failures will mostly live on in the memory of professors and will still be available to their ‘apprentices’ but it would be much more desirable if the information could be made available to all. With the internet, this possibility now exists and that discussion about such means is probably the most valuable result of the Bem affair so far.

Is Replication the Answer?

One question that is forced on us by the publication of papers like Daryl Bem’s Feeling the Future is what went wrong and how it can be fixed.

One demand that often arises is for replication. It is one of the standard demands made by interested skeptics in forums and such places. I can understand why calling for replication is seductive.
It is shrewd and skeptical. It says: Not so fast, let’s be sure first, while at the same time offering a highly technical criticism. Replication is technical jargon, don’t you know? On the other hand it’s also nice and open-minded. It says: This is totally serious science and some people who aren’t me should spend a lot of time on it.
And perhaps most important of all, it requires not a moment’s thought.

Cynicism aside, replication really is important. As long as a result is not replicated it is very likely wrong. If you don’t replicate you’re not really generating knowledge. Not only can you not rely on the results, you also lose the ability to determine whether you are using good methods or applying them correctly, which I’d speculate will decrease reliability still further over time.

Replication is essential but is replication really all that is needed?

Put yourself in the shoes of a scientist. You have just run an experiment and found absolutely no evidence that people can see the future.  That’s going to be tough to publish.
Journals are sometimes criticized for being biased against negative results but the simple fact is that they are biased against uninteresting results. Attention is a limited quantity; there’s only so much time in a day that can be spent reading. Most ideas don’t work out, so it is hardly news when an idea fails in an experiment. Think, for example, of all the chemicals that are not drugs of any kind.

Before computers and the information age it probably wouldn’t even have been possible to handle all the information about failed ideas. Things have changed now but the scientific community is still struggling to incorporate these new possibilities. However, one still can’t expect real life humans to pay attention to evidence of the completely expected.

Now you could try a new idea and hope that you have more luck with that.
Or you could do what Bem did and work some statistical magic on the data. And by magic I mean sleight of hand. The additional work required is much less and it is almost certain to work.
The question is simply if you want to advance science and humanity or your career and yourself.

If you go the 2nd route, the Bem route, your result will almost certainly fail to replicate.

So you might say that replication, if it is attempted, solves the problem. Until then you have a public confused by premature press reports, perhaps bad policy decisions, and certainly a lot of time wasted trying to replicate the effect. Establishing that an effect is not there always takes more effort than just demonstrating it.

To this one might say that the nature of science is just so, tentative and self-correcting. Meanwhile the original data magician, our Bem-alike, has produced a publication in a respectable journal, which indicates quality work, and received numerous citations (in the form of failed replications), which indicates that the paper was fruitful and stimulated further research. These factors (number of publications, reputation of the journal, and number of citations) are usually used to judge the quality of a scientist’s work in some objective way.

Eventually, if replication is all the answer needed, one should expect science to devolve into producing seemingly amazing results that are then slowly disproven by subsequent failed replications. The progress we have come to expect would be merely an accidental byproduct.

The problem might be said to lie rather in judging scientists in such a way. Maybe we should include the replicability of results in such judgments. But now we’re no longer talking about replication as the sole answer. We’re now talking about penalizing bad research.

And that’s the point. Science only works if people play by the rules. Those who won’t or can’t must be dealt with somehow. In the extreme case that means labeling them crackpots and ostracizing them.
But there are less extreme examples.

The case of the faster than light neutrinos

You probably have heard that some scientists recently announced that they had measured neutrinos traveling faster than light. This turned out to be due to a faulty cable.

This story is currently a favorite of skeptics, who pointed out that few physicists took the result seriously, despite the fact that it was originally claimed that all technical issues had been ruled out. It makes a good cautionary tale about how implausible results should be handled and why. Human error is just always possible and plausible.

There’s another chapter to this story, one that I fear will not get much attention.

The leaders of the experiment were forced to resign as a consequence of the affair.

There were very many scientists involved in the experiment due to the sheer size of the experimental apparatus. Among them, there was much discontentment about how the results were handled. Some said that they should have run more tests, including the test that found the fault, before publishing. Which means, of course, that they shouldn’t have published at all.

It is easy to see how a publish-or-perish environment that puts a premium on exciting results encourages not looking too closely for faults. But what’s the alternative? No incentive to publish equals no incentive to work. No incentive for exciting results just cements the status quo and hinders progress.

A Pigasus for Daryl Bem

Every year on April Fools’ Day, James Randi hands out the Pigasus Award. Here is the announcement for the 2011 awards, delivered on April 1, 2012.

One award went to Daryl Bem for “his shoddy research that has been discredited on many accounts by prominent critics, such as Drs. Richard Weisman, Steven Novella, and Chris French.”

I’ve called this well deserved but there’s certainly much that can be quibbled about. For example, these critics are hardly those who delivered the hardest hitting critiques. Far more deserving of honorable mention are Wagenmakers, Francis and Simmons (and their respective co-authors) for their contribution of peer reviewed papers that tackle the problem.

A point actually concerning the award is whether it is fair to single out Bem for a type of misconduct that may be very widespread in psychological research. Let’s be clear on this: his methods are not just “strange” or “shoddy”, as Randi kindly puts it, they border on the fraudulent. Someone else, in a different field, might have found themselves in serious trouble with a paper like this. Though I think it would be very hard to get such a paper past peer review in a more math-savvy discipline.
But even if you think it is just a highly visible example of normal bad practice, surely it is appropriate to use the high visibility to bring attention to it. Numerous people have done exactly that, either using it to argue for different statistical techniques or to draw attention to the lack of necessary replication in psychology.

I doubt that Randi calling this out will do much good since I doubt that many psychologists will even notice. And even if they do, I doubt that it will cause them to rethink their current (mal)practice. There’s a good chance that Bem will be awarded an Ig Nobel Prize later this year. That would probably get more attention, but even so…

 

The reactions from believers have been completely predictable. They have so far ignored the criticisms of the methods and so they ignore that Randi explicitly justifies the award with the “strange methods”. They simply pretend that any doubt or criticism is the result of utter dogmatism.

Sadly, some skeptical individuals have also voiced disappointment, for example Stuart Ritchie on his Twitter feed. Should I ever come across a justification for such reactions I will report and comment.

Why doesn’t experiment 9 replicate?

I have written about Daryl Bem’s paper “Feeling the Future” before and laid out a few of the serious issues that invalidate it.

Recently it’s been in the news again because one of the nine experiments presented in it, experiment 9, was repeated and failed to yield a positive result. Of course, no one was particularly surprised by this, except perhaps the usual die hard believers. Still, some may wonder where the positive result came from in the first place. Just chance or something more?

Before we can look at the actual research we need to look at the dangers of pattern seeking…

Patterns are for kilts

[Image: a group of nine people]

Let’s do a little game. We pick a few people in this image and then we try to find some way to split those nine people into two groups in such a way that most of our picks end up in one group.
For example let’s take the 1st from the left in the first row and the 2nd in the bottom row.
Answer: Males vs. Females.

Again: We take the 2nd in the top row and the 2nd and 4th in the bottom row.
Possible answer: People with and without sunglasses.
It doesn’t work perfectly but mostly.

If you’re creative you can find a more or less good solution for any possible combination of picks. That’s the first take-away point.

Now let’s add a bit of back story and extend our game. The group went to a casino and some of them won big and those are the people we point out.
The goal of the game is now not only to find a good grouping but also to make up some story for why the one group had most of the winners.

For example: The sunglasses are a lucky charm and that’s why the group with glasses did better.
That’s alright, but lucky charm is kind of lame.
How about: Hiding the eyes helps bluffing in poker. Much better…
But wait, correlation does not equal causation as statisticians never tire of telling us. Pro-players like to wear sunglasses, as everyone knows, and that’s why that group did better.

So if you’re creative you can even find some semi-plausible explanation for why a group did better than another.
And when the explanation need not even be semi-plausible then you can always find one without any creativity. Lucky charm, magic or divine favor fits any case. That’s the second take home point.

You can always find some sort of pattern in any set of random data. For example, shapes in clouds. Random means that you rarely find the same pattern again.

For one final encore, let’s make up, for each person in that picture, how much money they won or lost in the casino. Say top left: lost $145; top 2nd from left: won $78; and so on…
The game is now to find some measurable attribute that roughly tracks these numbers. An answer might be the amount of skin bared in square centimeters, or height in inches, and so on.

Experiment 9

Experiment 9 is derived from a simple psychological experiment that could run something like this:
Step 1
Ask a subject to remember a list of words. The words are flashed one at a time on a computer screen for 3 seconds each.
Step 2
Then randomly select some words for the subject to practice. The selected words appear on the screen again and the subject types them. Of course, the subject can’t make notes.
Step 3
The subject is asked to recall the words.

The result is, unsurprisingly, that more of the practiced words are recalled.
Bem switched steps 2 and 3. That is, the words are practiced after they have been recalled. You wouldn’t expect that what one does after the fact makes a difference, but Bem claimed that the experiment was a success.

If you are new to parapsychology you would probably assume that this means that more practice words were recalled. In fact, Bem does not tell us that. We don’t know if that was the case but the omission is telling.
Bem constructs what he calls a “differential recall index” for each subject. You compute this by first subtracting the number of control words recalled (words that did not get practiced) from the number of practice words recalled. Then you multiply this by the number of words recalled in total. This is then turned into a percentage, but I’ll omit that in the examples.

So if subject 1 recalls 39 words in total and 20 of these are practiced later, then the index is (20 - 19)*39 = 39.
And if subject 2 recalls only 18 words and 8 are practiced, then the index is (8 - 10)*18 = -36.

You can already guess where this is going. The justification that Bem gives for this manipulation is:
Unlike in a traditional experiment in which all participants contribute the same fixed number of trials, in the recall test each word the participant recalls constitutes a trial and is scored as either a practice word or a control word.

This is just massive nonsense. As we have seen, not every recalled word is equal: words that come from participants who recalled many count more heavily. The function of the index runs counter to the stated purpose.
Let’s combine the examples above. Subject 1 recalled one more practice word but subject 2 recalled two fewer. This indicates that practicing after the fact does not work, although in an actual experiment 2 subjects would be far too few to state anything with confidence.
But now look at the combined index: 39 – 36 = +3. This indicates success. Obviously the index misleads here.
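
The same arithmetic as a tiny Python sketch, with the two hypothetical subjects from the example above:

```python
# (practice words recalled, control words recalled) for each subject.
subjects = [(20, 19),   # subject 1: 39 words recalled, one more practiced than control
            (8, 10)]    # subject 2: 18 words recalled, two fewer practiced than control

# Bem's weighted "differential recall" index (the percentage step is omitted).
dr_index = sum((p - c) * (p + c) for p, c in subjects)

practiced_total = sum(p for p, c in subjects)   # 28
control_total = sum(c for p, c in subjects)     # 29

print(dr_index)                          # +3: the index points toward success
print(practiced_total - control_total)   # -1: overall, one FEWER practice word was recalled
```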

That the reviewers let this through is certainly a screw-up on their part. There’s no sugar-coating it.

Charitably, one might assume that Bem also simply made a mistake and just through luck got a significant result. However, that is unlikely.
The evidence, namely the advice he gives on writing articles as well as his handling of the other experiments, indicates that the index was created to force a positive result.
Still, that does not necessarily imply ill intent. He may have played around with a statistics program until he got results he liked, without ever realizing that this is scientifically worthless. Even so, objectively this is scientific misconduct.
Unfortunately, Bem displays an awareness of the inappropriateness of such methods.

The fact that the actual result of the experiment is not reported by Bem but only the flawed and potentially misleading Differential Recall index makes me conclude that the experiment was probably a failure. There was simply a random association between high recall and favorable outcome on which the DR index capitalizes.
By random chance such a pattern may arise again but only rarely, hence the failure to replicate.

Conceptual vs. Close Replication

Believers often insist that Bem has only replicated previous work. The implication being that these experiments are replicable. But when they say replication they mean a so-called “conceptual replication”. By that they mean experiments in general that purport to show retroactive effects, that is the present affecting the past. Of course, when one makes up a whole new experiment one can simply use the now familiar tricks to force a positive result.
A close replication actually repeats the experiment and is therefore bound to the same method of analysis. Only a close replication is a real replication.

Feeling the Future: Part 2

In my first post on Feeling the Future, I discussed mainly how its misuse of statistics relates to science in general. I said little about how exactly the statistics were misused. My thinking was that a detailed examination would be too boring for the average reader.
I still think that, but nevertheless I will lay out exactly how we know that Bem misused statistics.

The Problem Explained

The good news is that you don’t need to know statistics to understand this problem. You surely know games that use dice, Monopoly for example. In that game you throw 2 dice that tell you how far to move.
What if someone isn’t happy with the outcome and decides to roll again? That’s cheating!

Even small kids intuitively understand that this is an advantage, something that skews the outcome in a direction of one’s choosing. There’s no knowledge of statistics or probability theory necessary. While the outcome of each roll is still random, there’s a non-random choice involved.

If you roll 3 dice and pick 2, that’s the same thing, right?

How about we roll 4 and then pick 2, with the stipulation that the 2 remaining dice must be used on the next turn? Again there’s a choice involved. Within limitations the player can choose how to move, which gives an advantage. The player’s moves are no longer random.
This is despite the fact that the dice rolls are random and none are discarded.
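
A short simulation makes the advantage visible. This is only a sketch of the simplest variant from above: roll 3 dice and keep the best 2 instead of rolling 2.

```python
import random

def roll_two():
    """An honest move: roll two dice, use both."""
    return random.randint(1, 6) + random.randint(1, 6)

def roll_three_keep_best_two():
    """The cheat: roll three dice, quietly drop the lowest one."""
    rolls = sorted(random.randint(1, 6) for _ in range(3))
    return rolls[1] + rolls[2]

n = 100_000
print(sum(roll_two() for _ in range(n)) / n)                  # about 7
print(sum(roll_three_keep_best_two() for _ in range(n)) / n)  # noticeably higher, about 8.5

# Every single die roll is random, yet the non-random choice of which
# rolls to keep skews the outcome in the direction the player wants.
```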

Now we’re ready to get to Feeling the Future.
The results presented were very unlikely to have arisen by chance. Therefore, the argument goes, they probably didn’t arise by chance. Which means there must have been some unknown factor influencing the outcome.

You may realize that this is a shaky argument. Just because something is unlikely does not mean it doesn’t happen. The impossible doesn’t happen but the unlikely, by definition, must and does. The unlikely is set apart from the likely merely by happening less often.
Then again, the impossible is only impossible as far as we know. And that we’re wrong on something is at best unlikely, if that. In reality, as opposed to in mathematics, we’re always dealing with probability judgements, never with absolutes.
In other words, that argument is all we have. It is used in the same way in almost every scientific experiment.

So the argument is solid enough. In fact, I believe that there is something other than chance involved. Of course, dear reader, if you didn’t know that already you must have skipped the beginning of the post.

Bem’s experiments each had, according to Feeling the Future, 100-200 participants. In reality, at least some of them were assembled from smaller blocks of maybe around 50. This is a problem for exactly the same reason as the dice examples. Even if the outcomes in every block were completely random, once hand-picked blocks are assembled into a larger whole, that whole no longer is.
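
Here is a sketch of one way this can play out. The numbers are invented and the “data” are pure chance; the only trick is that the experimenter looks at the pooled result after every block of 50 and stops as soon as it looks good.

```python
import random
from scipy.stats import binomtest

def one_study(max_blocks=4, block_size=50):
    """Pure-chance data, but tested after every block; stop when 'significant'."""
    hits = trials = 0
    for _ in range(max_blocks):
        hits += sum(random.random() < 0.5 for _ in range(block_size))
        trials += block_size
        if binomtest(hits, trials, 0.5, alternative="greater").pvalue < 0.05:
            return True   # assemble and report this pooled 'experiment'
    return False

random.seed(0)
runs = 2000
print(sum(one_study() for _ in range(runs)) / runs)
# Each block is completely random, yet the rate of 'significant' pooled
# results comes out well above the nominal 5%.
```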

Proof that it happened

How do we know that this happened? This doesn’t require knowledge of statistics either, just a bit of sleuthing.
First we note what it says in the footnote about experiment 5:

This experiment was our first psi study and served as a pilot for the basic procedures adopted in all the other studies reported in this article. When it was conducted, we had not yet introduced the hardware based random number generator or the stimulus seeking scale. Preliminary results were reported at the 2003 convention of the Parapsychological Association in Vancouver, Canada (Bem, 2003); subsequent results and analyses have revised some of the conclusions presented there.

Fortunately, this presentation is also available in written form. Unfortunately it is immediately obvious that it doesn’t present anything corresponding to experiment 5.
The presentation from 2003 reported not 1 but 8 experiments, each with at most 60 participants. The experimental design, however, matches that reported in 2011.
The 8 experiments are grouped into 3 experimental series, so perhaps he pooled these together for the later paper? But no, that doesn’t work either.

I could write several more paragraphs of this kind, trying to write up a logic puzzle full of numbers as if it were a car chase. But my sense of compassion wins out. I know I would merely bore you half blind, my dear readers, and I won’t have that on my conscience.

Therefore I shall only give my answers as one does with puzzles. Check them with the links at the bottom if you like. I could easily have overlooked something or made a typo.

Experimental series 300 of the presentation is the “small retroactive habituation experiment that used supraliminal rather than subliminal exposures” that is mentioned in the File-Drawer section of “Feeling the Future”.
Experiment 102 with 60 participants must have been excluded because it has 60 rather than 48 trials per session.
Experiments 103, 201, 202, 203 combined form experiment 6. They have the same number of participants (n=150). Moreover, the method matches precisely. 100 of these 150 were tested for “erotic reactivity“. This is true for experiment 6 as well as the combination.
Experiment 101 could be part of experiment 5 but there aren’t enough participants. Additional data must have been collected later.
Note that the footnote points to “subsequent results“.

Warning signs

Even without following up the footnotes and references, there are some warning signs in Feeling the Future that hint that something is amiss. For example:

The number of exposures varied, assuming the values of 4, 6, 8, or 10 across this experiment and its replication.

The only reason one would change something in an experiment is to determine if that one factor has any influence on the results. Here we learn that a factor was varied, but there is neither reason nor justification given, much less any results.

These two items were administered to 100 of the 150 participants in this replication prior to the relaxation period and experimental trials.

The same thing applies here. A good experiment is completely preplanned and rigidly carried through. There’s no problem with doing less formal, exploratory work to find good candidate ideas that merit the effort necessary for a rigid test. But such exploratory experiments have almost no evidential weight.

Such warning signs are also present in the other experiments described in Feeling the Future. That could indicate that the same thing was done there as well. But don’t make the mistake of assuming that this issue is the only one that invalidates Bem’s conclusions. There’s also the issue of data dredging which is like deciding which card game to play depending on what hand you were dealt. Small wonder then, if you find your cards to be unusually good, according to the rules of the game you chose.

In terms of an experiment that means analyzing the results in various ways and then reporting those results that favor the desired conclusion. That Bem did this is also evident from a comparison of the 2003 and 2011 description of what is apparently and purportedly the same data.

Particularly worrying is that Bem has explicitly and repeatedly denied using such misleading methods. I shall restrain myself from speculating about what made him deny such an obvious, documented fact. It does not have to be dishonesty but none of the other possibilities is flattering, either.

There’s a common conceit among believers that skeptics don’t look at the data. Whenever someone claims this, ask them if there is anything wrong with Feeling the Future and you will know the truth of that.

Sources:
Bem, D. J. (2003, August). Precognitive habituation: Replicable evidence for a process of anomalous cognition. Paper presented at the 46th Annual Convention of the Parapsychological Association, Vancouver, Canada.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology.

Feeling The Future, Smelling The Rot

Daryl Bem is (or was?) a well-respected social scientist who used to lecture at Cornell University. The Journal of Personality and Social Psychology is a peer-reviewed, scientific journal, also well-respected in its field. So it should be no surprise that when Bem published an article that claimed to demonstrate precognition in that journal it made quite a splash.

It was even mentioned, at length, in more serious newspapers like the New York Times, though at least with the skepticism a subject with such a lousy track record deserves. In fact, if the precognition effect that Bem claims were real, casinos would be impossible, as a reply by Dutch scientists around EJ Wagenmakers points out.

By now, several people have attempted to replicate some of Bem’s experiments without finding the claimed effect. That’s hardly surprising but it does not explain how Bem got his results.

What’s wrong with the article?

It becomes obvious pretty quickly that the statistics were badly mishandled and a deeper look only makes things look worse. The article should never have passed review but that mistake didn’t bother me at first. Bem is experienced, with many papers under his belt. He knows how to game the system.

The mishandled statistics were not just obvious to me, of course. They were pointed out almost immediately by a number of different people.

These issues should be obvious to anyone doing science. If you don’t understand statistics you can’t do social science. What does statistics have to do with understanding people? About the same thing that literacy has to do with writing novels. At its core nothing, it’s just a necessary tool.

Mishandled statistics are not all that uncommon. Statistics is difficult and fields such as psychology are not populated by people with an affinity for math. Nevertheless, omitting key information and presenting the rest in a misleading manner really stretched my tolerance. That he simply lied about his method when responding to criticism went too far. But that’s just my opinion.

Such an accusation demands evidence, of course. The article is full of tell-tale hints which you can read about here or in Wagenmakers’ manuscript (link at the bottom).
But there is clear proof, too. As Bem mentions in the article, some results were already published in 2003. Comparing that article to the current one reveals that he originally performed several experiments with around 50 subjects each. He thoroughly analyzed these batches and then assembled them into packets of 100-200 subjects, which he presents as experiments in his new paper.

[Update: There is now a more extensive post on this available.]

That he did that is the omitted key information. The tell-tale hints suggest that he did that and more in all experiments. Yet he has stated that exploratory analysis did not take place. Something that is clearly shown to be false by the historical record.

Scientists aren’t supposed to do that sort of thing. Honesty and integrity are considered to be pretty important and going by the American Psychological Association’s ethics code that is even true for psychologists. But hey, it’s just parapsychology.

And here’s where my faith in science takes a hit…

The Bem Exploration Method

Bem Exploration Method (BEM) is what Wagenmakers and company, with unusual sarcasm for a scientific paper, called the way by which Bem manufactured his results. They quote from an essay Bem wrote that gives advice for “writing the empirical journal article”. In this essay, Bem outlines the very methods he used in “Feeling the Future”.

Bem’s essay is widely used to teach budding social psychologists how to do science. In other words, they are trained in misconduct.

Let me give some examples.

There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b).
The conventional view of the research process is that we first derive a set of hypotheses from a theory, design and conduct a study to test these hypotheses, analyze the data to see if they were confirmed or disconfirmed, and then chronicle this sequence of events in the journal article.

I just threw a die 3 times (via random.org) and got the sequence 6, 3, 3. If you, dear reader, want to duplicate this feat you will have to try an average of 216 times (there are 6 × 6 × 6 = 216 equally likely sequences). Now, if I had said in advance that I was going to get 6, 3, 3, this would have been impressive, but of course I didn’t. I could have said the same thing about any other combination, so you’re probably just rolling your eyes.
Scientific testing works a lot like that. You work out how likely it is that something happens by chance and if that chance is low, you conclude that something else was going on. But as you can see, this only works if the outcome is called in advance.
This is why the “conventional view” is as it is. Calling the shot after making the shot just doesn’t work.

In real life, it can be tricky finding some half-way convincing idea that you can pretend to have tested. Bem gives some advice on that:

[T]he data. Examine them from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find additional evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something —anything —interesting.

There is nothing wrong, as such, with exploring data to come up with new hypotheses to test in further experiments. In my dice example, I might notice that I rolled two 3s and proceed to test whether the die is biased towards 3s.
Well-meaning people, or those so well-educated in scientific methodology that they can’t believe anyone would advocate such misbehavior, will understand this passage to mean exactly that. Unfortunately, that’s not what Bem did in Feeling The Future.
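
To see why a fishing expedition almost always “works”, here is a small simulation sketch in Python. The data are pure noise; the only ingredient is that twenty different made-up analyses are tried and the best one is reported.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)

# 100 subjects and 20 different ways of slicing or scoring them
# (sexes separately, new composite indexes, dropped 'anomalous' trials, ...).
# Here every analysis is just independent noise with no effect at all.
n_subjects, n_analyses = 100, 20
data = rng.normal(loc=0.0, scale=1.0, size=(n_analyses, n_subjects))

p_values = [ttest_1samp(scores, popmean=0.0).pvalue for scores in data]

print(min(p_values))                    # the 'finding' that gets written up
print(sum(p < 0.05 for p in p_values))  # how many hits the fishing trip produced
# With 20 independent looks at pure noise, the chance that at least one
# comes out 'significant' at p < .05 is about 64%.
```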

And again, he was only following his own advice, which is given to psychology students around the world.

When you are through exploring, you may conclude that the data are not strong enough to justify your new insights formally, but at least you are now ready to design the “right” study. If you still plan to report the current data, you may wish to mention the new insights tentatively, stating honestly that they remain to be tested adequately. Alternatively, the data may be strong enough to justify recentering your article around the new findings and subordinating or even ignoring your original hypotheses.

The truth is that once you go fishing, the data are never strong (or more precisely, the result never is).

Bem claimed that his results were not exploratory. Maybe he truly believes that “strong data” turns an exploratory study into something else?
In practice, this advice means that it is okay to lie (at least by omission) if you’re certain that you’re right. I am reminded of a quote by a rather more accomplished scientist. He said about science:

The first principle is that you must not fool yourself--and you are the easiest person to fool. So you have to be very careful about that. After you’ve not fooled yourself, it’s easy not to fool other scientists. You just have to be honest in a conventional way after that.

That quote is from Richard Feynman. He had won a Nobel prize in physics and advocated scrupulous honesty in science. I imagine he would have used Bem’s advice as a prime example of what he called cargo cult science.

Bayesians to the rescue?

Bem has inadvertently brought this widespread malpractice in psychology into the limelight.
Naturally, these techniques for misleading others also work in other fields and are employed there as well. But it is my personal opinion that other fields have a greater awareness of the problem. Other fields are more likely to recognize such techniques as being scientifically worthless and, when done intentionally, fraud.
If anyone knows of similar advice given to students in other fields, please inform me.

The first “official” response had the promising title: Why psychologists must change the way they analyze their data by Wagenmakers and colleagues. It is from this paper that I took the term Bem Exploration Method.
The solution they suggest, the new way to analyze data, is to calculate Bayes factors instead of p-values.
They aren’t the first to suggest this. Statisticians have long been arguing the relative merits of these methods.
This isn’t the place to rehash this discussion or even to explain it. I will simply say that I don’t think it will work. The Bayesian methods are just as easily manipulated as the more common ones.

Wagenmakers & co show that the specific method they use fails to find much evidence for precognition in Bem’s data. But this is only because that method is less easy to “impress” with small effects, not because it is tamper-proof. Bayesian methods, like traditional methods, can be more or less sensitive.

The problem can’t be solved by teaching different methods. Not as long as students are simultaneously taught to misapply these methods. It must be made clear that the Bem Exploration Method is simply a form of cheating.

Sources:
Bem, D. J. (2003). Writing the empirical journal article. In J. M. Darley, M. P. Zanna, & H. L. Roediger III (Eds.), The compleat academic: A career guide (pp. 171–201). Washington, DC: American Psychological Association.
Bem, D. J. (2003, August). Precognitive habituation: Replicable evidence for a process of anomalous cognition. Paper presented at the 46th Annual Convention of the Parapsychological Association, Vancouver, Canada.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology.
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? A response to Wagenmakers, Wetzels, Borsboom, & van der Maas (2011). Manuscript submitted for publication.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (in press). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology.

See here for an extensive list of links on the topic. If I missed anything it will be there.