Feeling The Future, Smelling The Rot

Daryl Bem is (or was?) a well-respected social scientist who used to lecture at Cornell University. The Journal of Personality and Social Psychology is a peer-reviewed scientific journal, also well-respected in its field. So it should be no surprise that when Bem published an article in that journal claiming to demonstrate precognition, it made quite a splash.

It was even covered, at length, in serious newspapers like the New York Times, though at least with the skepticism a subject with such a lousy track record deserves. In fact, if the precognition effect that Bem claims were real, casinos would be impossible, as a reply by a group of Dutch scientists around E.-J. Wagenmakers points out.

By now, several people have attempted to replicate some of Bem’s experiments without finding the claimed effect. That’s hardly surprising, but it does not explain how Bem got his results.

What’s wrong with the article?

It becomes obvious pretty quickly that the statistics were badly mishandled, and a deeper look only makes things worse. The article should never have passed review, but that mistake didn’t bother me at first. Bem is experienced, with many papers under his belt. He knows how to game the system.

The mishandled statistics were not just obvious to me, of course. They were pointed out almost immediately by a number of different people.

These issues should be obvious to anyone doing science. If you don’t understand statistics, you can’t do social science. What does statistics have to do with understanding people? About the same thing that literacy has to do with writing novels: at its core nothing, it’s just a necessary tool.

Mishandled statistics are not all that uncommon. Statistics is difficult, and fields such as psychology are not populated by people with an affinity for math. Nevertheless, omitting key information and presenting the rest in a misleading manner really stretched my tolerance. That he simply lied about his method when responding to criticism went too far. But that’s just my opinion.

Such an accusation demands evidence, of course. The article is full of tell-tale hints, which you can read about here or in Wagenmakers’ manuscript (link at the bottom).
But there is clear proof, too. As Bem mentions in the article, some results were already published in 2003. Comparing that article to the current one reveals that he originally performed several experiments with around 50 subjects each. He thoroughly analyzed these batches and then assembled them into packets of 100–200 subjects, which he presents as experiments in his new paper.

[Update: There is now a more extensive post on this available.]

That he did this is the omitted key information. The tell-tale hints suggest that he did this, and more, in all experiments. Yet he has stated that exploratory analysis did not take place, something the historical record clearly shows to be false.
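
For illustration, here is a minimal sketch (my own, in Python; it is not Bem’s documented procedure) of why batch-wise analysis is not innocent. If you test after every batch of 50 subjects and stop, or keep going, depending on what you see, the advertised 5% false-positive rate no longer holds:

```python
# Simulate "peek after every batch of 50 subjects, stop when significant".
# The data contain no effect; a single pre-planned test would be falsely
# significant only 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, false_positives = 2000, 0
for _ in range(n_sims):
    data = []
    for _ in range(4):  # up to four batches of 50 subjects each
        data.extend(rng.normal(0.0, 1.0, 50).tolist())  # pure noise
        if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
            false_positives += 1  # stop here and "publish" the packet
            break
print(f"False-positive rate with batch-wise peeking: {false_positives / n_sims:.1%}")
# Typically lands around 12-13%, far above the nominal 5%.
```

Reporting the assembled packets as if they were single pre-planned experiments hides exactly this inflation.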

Scientists aren’t supposed to do that sort of thing. Honesty and integrity are considered to be pretty important and going by the American Psychological Association’s ethics code that is even true for psychologists. But hey, it’s just parapsychology.

And here’s where my faith in science takes a hit…

The Bem Exploration Method

The Bem Exploration Method (BEM) is what Wagenmakers and company, with unusual sarcasm for a scientific paper, called the way in which Bem manufactured his results. They quote from an essay Bem wrote that gives advice on “writing the empirical journal article”. In this essay, Bem outlines the very methods he used in “Feeling the Future”.

Bem’s essay is widely used to teach budding social psychologists how to do science. In other words, they are trained in misconduct.

Let me give some examples.

There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b).
The conventional view of the research process is that we first derive a set of hypotheses from a theory, design and conduct a study to test these hypotheses, analyze the data to see if they were confirmed or disconfirmed, and then chronicle this sequence of events in the journal article.

I just threw a die three times (via random.org) and got the sequence 6, 3, 3. If you, dear reader, want to duplicate this feat, you will have to try 216 times on average. Now, if I had announced in advance that I was going to get 6, 3, 3, this would have been impressive, but of course I didn’t. I could have said the same thing about any other combination, so you’re probably just rolling your eyes.
Scientific testing works a lot like that. You work out how likely it is that something happens by chance, and if that chance is low, you conclude that something else was going on. But as you can see, this only works if the outcome is called in advance.
This is why the “conventional view” is as it is. Calling the shot after making the shot just doesn’t work.
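
For the record, the arithmetic behind the dice example (this snippet is only an illustration of the logic, nothing more):

```python
# Probability of hitting a specific three-roll sequence called IN ADVANCE:
p_advance = (1 / 6) ** 3  # 1/216, about 0.5% -- genuinely surprising
print(f"pre-called sequence: 1 in {round(1 / p_advance)}")
# A sequence "called" AFTER the rolls is hit with probability 1: some
# sequence always comes up, so the 1/216 figure no longer means anything.
```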

In real life, it can be tricky to find a halfway convincing idea that you can pretend to have tested. Bem gives some advice on that:

[T]he data. Examine them from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find additional evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something – anything – interesting.

There is nothing wrong, as such, with exploring data to come up with new hypotheses to test in further experiments. In my dice example, I might notice that I rolled two 3s and proceed to test whether the die is biased towards 3s.
Well-meaning people, or those so well-educated in scientific methodology that they can’t believe anyone would advocate such misbehavior, will understand this passage to mean exactly that. Unfortunately, that’s not what Bem did in Feeling The Future.
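
To see how reliably fishing produces a publishable “effect”, here is a small simulation (my own sketch; the twenty subgroup splits are an arbitrary choice for illustration) that follows the advice above on pure noise:

```python
# Fishing expedition on data with NO real effect: try many post-hoc
# subgroup splits and report the first "significant" comparison found.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_subjects, n_splits = 1000, 100, 20
hits = 0
for _ in range(n_studies):
    scores = rng.normal(0.0, 1.0, n_subjects)               # pure noise
    for _ in range(n_splits):                               # twenty ways to slice it
        subgroup = rng.normal(0.0, 1.0, n_subjects) > 0     # arbitrary split
        p = stats.ttest_ind(scores[subgroup], scores[~subgroup]).pvalue
        if p < 0.05:
            hits += 1
            break                                           # "publish" this finding
print(f"Noise-only studies yielding a 'finding': {hits / n_studies:.0%}")
# With 20 shots at alpha = .05, independent tests would give
# 1 - 0.95**20, about 64%: fishing nearly guarantees a result.
```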

And again, he was only following his own advice, which is given to psychology students around the world.

When you are through exploring, you may conclude that the data are not strong enough to justify your new insights formally, but at least you are now ready to design the “right” study. If you still plan to report the current data, you may wish to mention the new insights tentatively, stating honestly that they remain to be tested adequately. Alternatively, the data may be strong enough to justify recentering your article around the new findings and subordinating or even ignoring your original hypotheses.

The truth is that once you go fishing, the data (or more precisely, the result) is never strong.

Bem claimed that his results were not exploratory. Maybe he truly believes that “strong data” turns an exploratory study into something else?
In practice, this advice means that it is okay to lie (at least by omission) if you’re certain that you’re right. I am reminded of a quote by a rather more accomplished scientist. He said about science:

The first principle is that you must not fool yourself – and you are the easiest person to fool. So you have to be very careful about that. After you’ve not fooled yourself, it’s easy not to fool other scientists. You just have to be honest in a conventional way after that.

That quote is from Richard Feynman, who won a Nobel Prize in physics and advocated scrupulous honesty in science. I imagine he would have used Bem’s advice as a prime example of what he called cargo cult science.

Bayesians to the rescue?

Bem has inadvertently brought this widespread malpractice in psychology into the limelight.
Naturally, these techniques of misleading others work in other fields as well, and are employed there too. But it is my personal opinion that other fields have a greater awareness of the problem and are more likely to recognize these techniques as scientifically worthless and, when used intentionally, as fraud.
If anyone knows of similar advice given to students in other fields, please inform me.

The first “official” response, by Wagenmakers and colleagues, had the promising title “Why psychologists must change the way they analyze their data”. It is from this paper that I took the term Bem Exploration Method.
The solution they suggest, the new way to analyze data, is to calculate Bayes factors instead of p-values.
They aren’t the first to suggest this. Statisticians have long been arguing the relative merits of these methods.
This isn’t the place to rehash this discussion or even to explain it. I will simply say that I don’t think it will work. The Bayesian methods are just as easily manipulated as the more common ones.

Wagenmakers & co show that the specific method they use fails to find much evidence for precognition in Bem’s data. But this is only because that method is less easily “impressed” by small effects, not because it is tamper-proof. Bayesian methods, like traditional methods, can be more or less sensitive.
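
To make that concrete, here is a minimal sketch of a default Bayesian t-test of the kind Wagenmakers et al. advocate (my own Python implementation of the standard JZS setup, not their code; the numbers are illustrative, not taken from Bem’s data). The Bayes factor depends on the scale r of the Cauchy prior on the effect size, and that scale is the analyst’s choice:

```python
# Bayes factor BF10 for a one-sample t-test, with a Cauchy(0, r) prior on
# the standardized effect size delta under H1 (the "default" JZS setup).
import numpy as np
from scipy import stats, integrate

def bf10(t, n, r=1.0):
    df = n - 1
    # Marginal likelihood under H1: noncentral-t likelihood averaged over
    # the Cauchy prior on delta (noncentrality = delta * sqrt(n)).
    def integrand(delta):
        return stats.nct.pdf(t, df, delta * np.sqrt(n)) * stats.cauchy.pdf(delta, 0.0, r)
    m1, _ = integrate.quad(integrand, -np.inf, np.inf)
    m0 = stats.t.pdf(t, df)  # likelihood under H0: central t, no effect
    return m1 / m0

# A small effect that is classically "significant" (t = 2.0, N = 100, p ~ .048):
for r in (0.1, 0.5, 1.0):
    print(f"prior scale r = {r}: BF10 = {bf10(2.0, 100, r):.2f}")
# The wider the prior, the more the Bayes factor leans towards H0 for the
# same data -- the verdict moves with a tuning knob the analyst picks.
```

A wide default prior is simply hard to impress with small effects; choose a narrower one and the same data look far more convincing.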

The problem can’t be solved by teaching different methods. Not as long as students are simultaneously taught to misapply these methods. It must be made clear that the Bem Exploration Method is simply a form of cheating.

Sources:
Bem, D. J. (2003). Writing the empirical journal article. In J. M. Darley, M. P. Zanna, & H. L. Roediger III (Eds.), The compleat academic: A career guide (pp. 171–201). Washington, DC: American Psychological Association.
Bem, D. J. (2003, August). Precognitive habituation: Replicable evidence for a process of anomalous cognition. Paper presented at the 46th Annual Convention of the Parapsychological Association, Vancouver, Canada.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology.
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? A response to Wagenmakers, Wetzels, Borsboom, & van der Maas (2011). Manuscript submitted for publication.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (in press). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology.

See here for an extensive list of links on the topic. If I missed anything it will be there.


9 Comments

  1. May 1, 2011 at 10:15 am

    […] This is where it gets a bit ranty, so feel free to stop reading here. Maybe you’ve read my previous article and wondered why this seemed so serious to me. Perhaps this article gave an answer. Part of the BEM […]

  2. Mike said,

    May 1, 2011 at 6:45 pm

    Don’t you think that it’s a tad hypocritical of both yourself and Wagenmakers et al. to call Dr. Bem out for supposedly using statistics to generate hypotheses after the fact rather than beforehand, and then to cite Wagenmakers et al. as a better way of doing statistics, even though they used a different statistical approach and came to different conclusions, after the fact?

    >>Calling the shot after making the shot just doesn’t work.

    It’s scientific Monday-morning quarterbacking, plain and simple. Hindsight is 20/20. And just because your team is winning doesn’t make it right, either. Are skeptics somehow magically immune to confirmation bias? I don’t think so.

    [Thank you for your question.
    First of all, Bem’s methods have long been recognized as problematic. This is not something that was only realized after he published his paper. See for example Wikipedia on data dredging.
    That is one reason for my shock at the apparent widespread acceptance of these methods.
    Bem’s advice has also been criticized before, for example by Kerr in 1998. (Something I was very relieved to learn!)
    There’s nothing Monday-morning about this.
    Secondly, I don’t cite Wagenmakers as a better way of doing statistics. I state plainly that I don’t think it will work.
    However, there is nothing hypocritical in what they do. As I explain, there is nothing wrong with exploring the data (or in Wagenmakers’ case using it for demonstrating a method). The wrong part is in misleading people about this.]

    • Mike said,

      May 2, 2011 at 5:00 pm

      I do agree Bem did a fair bit of data dredging. However, many of the hypotheses tested – such as the correlations between stimulus seeking and sex differences – are relationships that have been hypothesized for decades. I don’t have the papers at my fingertips, but I’ve read articles showing various differences between the sexes with regard to reactions to violent and/or erotic stimuli, in standard psychological experiments. Belief in ESP as a factor in ESP performance has been around for probably a century. Yeah, people forget (or refuse to believe) that parapsychology has been conducting legitimate science for a decent time and has collected quite a large amount of information, data, theory and hypotheses. The fact that the questions about stimulus seeking and ESP belief are written into the software of the test programs indicates that these are hypotheses he would have wanted to test to begin with.

      Heh, despite the nature of the paper (jokingly), you can’t go back in time to collect data that you didn’t collect the first time around. BUT, I completely agree that there should have been more separation between what could be construed as speculative or exploratory work, and the hypothesis confirmation.

      All that said, the primary data being tested from the get-go are the retro effects. This is what you test to see if you got psi effects or not, and since this is your first line of analysis, t-testing is completely valid. All the other effects are variables that affect performance, and that is where it is questionable whether post-hoc correction should be used and what kind.

      The thing about Wagenmakers’ paper is that it is definitely *not* being cited as exploring the data or demonstrating a method. It is completely misleading in that it makes the positive claim that Bem’s data analysis is wrong, and that if people analyzed the data *their* way, they’d find the null hypothesis is confirmed, and this is what skeptics, scientists, and journalists are all claiming based on that paper. It is every bit HARKing. But they treat the Wagenmakers paper as if it totally discredits Dr Bem’s. In reality, both are, in their own ways, extremely misleading.

      What it boils down to, really, is that there is an extremely fine line between Type I and Type II error. You can’t decrease one without increasing the other. You don’t want to confirm too much – that dilutes science – and you don’t want to be too conservative – that causes scientific gridlock and slows discovery. Bem errs on the side of confirmation and Wagenmakers et al err on the side of conservatism. And that’s also why Bayes factors really don’t fix anything. They still, at some point in the process, require a subjective judgement about how much Type I error you’re willing to allow. Frequentist testing is nice because 0.05 has been the convention for a very long time and everyone is familiar with it, and most people agree it is a decent convention for setting the bar (regardless of whether it truly is or not).

      The only way to fix that, and one that I’ve been advocating for a while, is to have a consortium *before* the experiments are conducted, in order to agree upon some specific Bayes prior or factor to which the data will be subjected. The experiment is conducted, the data is tallied, and the posterior is calculated, and it either makes it over the bar, or it doesn’t, and no one can argue about how high the bar was set, because it was set and agreed upon at the beginning.

      [Going back to the example in my post. Let’s say I’ve been saying for years that I can predict dice throws. Does that make it any better?
      Of course not. I need to call every throw in advance.
      In the same way, it doesn’t matter that “stimulus seekers” have long been claimed to be better at psi. The point is that you need to spell out in advance who counts as a “stimulus seeker” and what result counts as better at psi. Picking in retrospect doesn’t work.
      But that’s not even the problem. That’s just why the paper isn’t evidence for anything.
      The problem is that Bem simply is not being honest.
      In contrast, what Wagenmakers et al do is completely open. Disagreeing with their conclusions doesn’t make them misleading.]

  3. Mike said,

    May 4, 2011 at 9:00 pm

    I did not say the Wagenmakers paper was misleading because I disagree with their conclusions, and I am glad they are open about their choices. What is misleading is the conclusion they come to with their analysis of the data (I’m talking just about their numbers and null hypothesis rejection, or lack of), and thus what people report about the conclusions of the paper, without actually having any understanding of what Wagenmakers did and why.

    Did you know what a Default Cauchy Prior Distribution is, before reading this comment and googling it? And furthermore, do you know why Wagenmakers picked it? You may or may not, but I reckon most people who read such papers don’t have a firm handle on the manipulations used (or mis-used, as you clearly point out).

    It’s a completely kosher way of analyzing data (using d-Cauchy to set the priors distro), but doing so after the fact, strapping on what is an extremely conservative distribution of effect size priors, already knowing it’ll ensure the null hypothesis isn’t rejected, is disingenuous. Bem could have designed his experiments around using it, and thus would have chosen much larger group sizes, and this would probably have been a good thing. But he didn’t; he did his power analysis around frequentist Gaussian distributions, like 90% of research psychologists, and thus chose sample sizes to match.

    Saying he should have used an ultraconservative prior distro after he finished all of his experiments doesn’t discredit his data. That’s what most skeptics reading it don’t realize, and that’s my biggest problem. All they did for their data analysis (multiple comparisons aside, which is a whole nother bag of worms) is say, metaphorically, “You set the bar at 10 feet, just like psychologists have been for decades, and cleared it. When we set the bar at 50 feet, you didn’t clear it by a long shot.” Whether research psychologists need to raise the bar is a good discussion brought up by the paper, and a debate that has been raging for some time. But it doesn’t dissolve the data and its effect size any more than saying, “You’re doing it wrong. Your argument is invalid.”

    Bem DID spell out ahead of time that subjects who scored higher on a Stimulus-Seeking parameter (SS), determined by certain questions on a questionnaire, “Do you get bored easily?” (SS1, higher score is more SS) and “Do you like to watch movies you have seen before?” (SS2, lower score is more SS), should score higher on these precognition tasks. You get a single number, SS12, which is a rough measure of how stimulus-seeking the subject is. The subject pool is divided into two groups, high, SS > 3, and low, SS < 3, and the retro data compared for each. He must have supposed it ahead of time and put the questions in the questionnaire, or else he wouldn’t have had the data to analyze, because he would not have asked those questions. It’d be like deciding after the experiment is done, “I want to test the effects of 30mg aspirin vs 60mg aspirin daily.” You can’t. You didn’t test for it.

    [What you’re saying is that you find fault with Wagenmakers’ Bayesian argument and therefore reject the conclusion. I can see why you would find that misleading.
    A faulty argument leads to the wrong conclusion.
    But that is normal. People are wrong all the time, even in science. When you say ‘misleading’, that implies that Wagenmakers et al arrived at the wrong conclusion on purpose. I see no indication of that.
    Be aware that the specific method they advocate has been published previously. (By the way, I wouldn’t have criticized them if I didn’t understand what they did, Default Cauchy and all.)
    I largely agree with your arguments against the method they advocate. As I say in the post, it is not tamper-proof. The effect is still there.
    However, I think you’re missing the main point: namely, that the effect got there because the BEM put it there.

    As you are obviously aware, Bem implies in his paper that he chose the sample sizes based on a power analysis. Yet in reality he performed experiments that were much smaller and then assembled these into the larger units he presents.
    In truth, at least some of Bem’s experiments never happened as described. Now that’s what I call misleading.

    Regarding the stimulus seeking: he could have changed what high and low stimulus seeking means by changing the point on the scale dividing the two, or by changing how the questions are evaluated. He apparently didn’t do that; that’s at least something. (I misrecalled on that.)
    He doesn’t report what other personality traits he tested, but that’s all right, too, because he at least tells us that there were others.
    But that’s not good enough as long as it isn’t also spelled out what being good at psi means. And that really is the main point. The stimulus seeking analysis is correctly presented as post hoc, so there is no legitimate complaint there anyway.]

  4. Mike said,

    May 6, 2011 at 7:12 am

    Actually, he does report that he tested different personality traits during pilot experiments, explicitly to address the potential issues of exploratory-confirmation overlap and the file drawer effect.

    “First, several individual-difference variables that had been reported in the psi literature to predict psi performance were pilot tested in these two experiments, including openness to experience; belief in psi; belief that one has had some psi experiences in everyday life; and practicing a mental discipline such as meditation, yoga, self-hypnosis, or biofeedback. None of them reliably predicted psi performance, even before application of a Bonferroni correction for multiple tests. Second, an individual-difference variable (negative reactivity) that I reported as a correlate of psi in my convention presentation of these experiments (Bem, 2003) failed to emerge as significant in the final overall database.” – Bem 2011, p. 421

    Naturally, Dr. Bem wanted to avoid exactly the sort of accusations being leveled at him presently.

    The question still stands as to exactly which data was used for which experiments (the assembled units, so to speak). What exactly are these hints that suggest the experiments in Bem 2003 were repackaged and used in Bem 2011? Is there any actual evidence for this claim?

    It may at some point come down to Dr. Bem divulging all of the original data for examination. He takes great pains to ensure the data is not tampered with during the experimenting process, so I would imagine he would have little trouble providing these records (to qualified experimenters, obviously) to ensure overall integrity.

    Also, the criterion cutoff for stimulus seeking hi/lo is 2.5 and is coded into the source code for RPrime v4.0, and appears to be that way throughout all versions of the program.

    [1. Yes, he does report having tested different traits. I said as much.
    2. Regarding evidence for the repackaged experiments: the 2003 article is linked at the bottom of the post.
    3. Cut-off at 2.5? The paper says hi-SS is ‘above the midpoint’, which I took to mean 3.5 and up.]

  5. June 4, 2011 at 6:51 pm

    […] my first post on Feeling the Future, I discussed mainly how it’s misuse of statistics related to science in general. I said […]

  6. Jon Mannsaker said,

    January 6, 2012 at 9:36 pm

    Who are you behind this blog? Anonymous persons have no say in a scientific debate, especially not when you are accusing a scientist of cheating and lying. Come out of the closet! Give your name and background.

    • March 17, 2012 at 11:07 am

      Ideally, science is all about the arguments and not at all about the person. For example, peer review is usually anonymous to prevent the author’s reputation from biasing the reviewers.

      I guard my anonymity simply because paranormal topics seem to attract a greater than usual share of deranged individuals.

