Getting Wagenmakers wrong

EJ Wagenmakers et al published the first reply to the horribly flawed Feeling the Future paper by Daryl Bem. I’ve blogged about it more times than I care to count right now.

Their most important point was regarding the abuse of statistics. Or, as they put it, that Bem’s study was exploratory rather than confirmatory.
They also suggested a different statistical method as a remedy. I’ve expressed doubts about that because I don’t think that there is a non-abusable method.

Unfortunately, what they proposed has been completely and thoroughly misunderstood. The latest misrepresentation appeared in an article by 3 skeptics in The Psychologist. I blogged.

How to get Wagenmakers right

The traditional method of evaluating a scientific claim or idea is Null-Hypothesis Significance Testing (NHST). This involves coming up with a mathematical prediction of what happens if the new idea is wrong. It’s not enough to say that people can’t see into the future, you must say what the results should look like if they can’t.
After the experiment is done, this prediction is used to work out how likely it was to get results such as those one actually got. If that is unlikely, one concludes that the prediction was wrong. The null hypothesis is refuted: something is happening, and this is then taken as evidence for the original idea.
There are a number of things that can go wrong with this method. One is choosing the null prediction after the fact, based on whatever results you got.

The method that Wagenmakers argued for is different. It involves making a prediction not only about what happens when the original idea is wrong, but also about what happens if it is right.
Then, with the results of the experiment, one works out how likely the result was under either prediction. Finally, calculate how much more likely the result is under one hypothesis rather than the other. This last number is called the Bayes Factor.

For an example, imagine an ordinary 6-sided die but instead of being ordinarily labeled it has only the letters “A” and “B”. The die comes in 2 variations, one has 5 “A”s and 1 “B”, the other 1 “A” and 5 “B”s.
You roll a die once and get an “A”. This result is 5 times as likely with the first variant as with the second, so the Bayes Factor is 5.
You could use this result to speculate about what kind of die you rolled. But what if there is a third variant of die? One that has, say, 3 “A”s and 3 “B”s. Then your Bayes Factor would be different.
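For concreteness, here is that comparison as a small sketch (the variable and function names are mine, not from any particular library):

```python
# Likelihood of rolling an "A" under each die variant from the example.
p_a_given_5A = 5 / 6  # die with five "A"s and one "B"
p_a_given_5B = 1 / 6  # die with one "A" and five "B"s
p_a_given_3A = 3 / 6  # hypothetical third variant: three "A"s, three "B"s

def bayes_factor(likelihood_h1, likelihood_h2):
    """How much more likely the observed result is under H1 than under H2."""
    return likelihood_h1 / likelihood_h2

bf_5A_vs_5B = bayes_factor(p_a_given_5A, p_a_given_5B)  # 5: strongly favors the 5-"A" die
bf_5A_vs_3A = bayes_factor(p_a_given_5A, p_a_given_3A)  # 5/3: much weaker evidence
```

The same single roll yields a Bayes Factor of 5 against one alternative but only 5/3 against another, which is the point: the factor is a property of the pair of hypotheses, not of the data alone.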

The Bayes Factor depends crucially on the two hypotheses being compared. Depending on which 2 hypotheses are chosen, one can seem more likely or the other.

In the case of Feeling the Future, the question is basically what we should assume about how strong the effect is, if it exists. How much feel for the future should we assume?
Wagenmakers et al said that if one cannot assume anything for lack of information, then one should use a default assumption suggested by several statisticians. This assumption implied that people might be a little good at feeling the future, or maybe very good.
Bem, along with two statisticians, countered that we already know people are not good at feeling the future. Parapsychological abilities are always weak, and therefore one should use a different assumption, under which the evidence came out much stronger.

Let’s make this intuitively clear. Think again of the dice with 5 As or 5 Bs. You are told that one die was rolled 100 times and showed 30 As and 70 Bs. Clearly it is more likely to be a 5 B die than a 5 A die. But wait: what if, instead of comparing those 2 with each other, we compare either of them with a die that has 2 As and 4 Bs? That die would win.
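This intuition can be checked with a quick binomial calculation (a sketch; the helper and names are mine):

```python
from math import comb

def binom_likelihood(k, n, p):
    """Probability of exactly k successes in n trials, each with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_rolls, n_a = 100, 30  # 100 rolls, 30 "A"s observed

like_5A = binom_likelihood(n_a, n_rolls, 5/6)  # die with five "A"s
like_5B = binom_likelihood(n_a, n_rolls, 1/6)  # die with five "B"s
like_2A = binom_likelihood(n_a, n_rolls, 2/6)  # die with two "A"s, four "B"s

# The 5-"B" die beats the 5-"A" die, but the 2-"A" die beats both:
# 30 "A"s in 100 rolls sits close to its expected 33.3.
assert like_2A > like_5B > like_5A
```

Which hypothesis “wins” again depends entirely on which hypotheses were entered into the comparison.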

I have simplified a lot here. If something doesn’t seem to make sense it’s probably because of that and not because of a problem in the original literature.

Bem’s argument makes a lot of sense but overlooks that belief in strong precognition is widespread, even among parapsychologists. Tiny effects are what they get, but not what they hope for or believe in. Both parties have valid arguments for their assumptions, but neither makes a compelling case. On the whole, however, it does show a problem with the default Bayesian t-test.

Let me emphasize again that Wagenmakers made two points. The first is that Bem made mistakes in applying the statistics. The second is that it would be better to use the default Bayesian t-test rather than traditional NHST. These are separate issues.
In my opinion, the abuse of statistical methods is the crucial issue that cannot be solved by using a different method.

How to get Wagenmakers wrong

Bayesian statistics is often thought of as involving a prior probability. In fact, the defining characteristic of Bayesian statistics is that it includes prior knowledge.

Again, let’s go with the example. You’re only concerned with the 2 die variants, the one with 5 “A”s and the one with 5 “B”s. Someone always throws the same die and tells you the result. You can’t see the die, of course, but are supposed to guess which die was thrown based solely on the results.
Intuitively, you’ll probably be tending more toward the first kind with every “A” and more toward the second type with every “B”.
But what if I told you that I randomly picked the die out of a box with 100 dice of the 5 “A” variant and only 1 of the 5 “B” variant? You’ll start out assuming it should be the 5 “A” variant and will require a lot of “B”s before switching.
Formally, we’d compute the Bayes Factor from the data and then use that factor to update the prior probability to get the posterior probability. The clearer the data is, and the more data one has, the greater the shift in what we should hold to be the case.
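Under the assumptions of the box example (100 five-“A” dice, one five-“B” die), that updating can be sketched like this (names are mine):

```python
def posterior_odds(prior_odds, rolls):
    """Update the odds in favor of the 5-"A" die, one observed letter at a time."""
    odds = prior_odds
    for letter in rolls:
        # Per-roll Bayes Factor: an "A" is 5 times as likely from the
        # 5-"A" die; a "B" is 5 times as likely from the 5-"B" die.
        odds *= 5 if letter == "A" else 1 / 5
    return odds

prior = 100 / 1  # 100 dice of the 5-"A" variant, 1 of the 5-"B" variant

# Two "B"s in a row still leave the odds favoring the 5-"A" die (100/25 = 4),
# but a third "B" finally tips them the other way (100/125 = 0.8).
assert posterior_odds(prior, "BB") > 1
assert posterior_odds(prior, "BBB") < 1
```

With an even prior (one die of each kind in the box), a single “B” would already have tipped the scales; the lopsided box is what makes the extra evidence necessary.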

In reality one will hardly ever know which of several competing hypotheses is more likely to be true. Different people will make their own guess. Some, maybe most, people will regard precognition as a virtual impossibility, a few as a virtual certainty.
Wagenmakers et al showed that even if one assigns a very low prior probability to the idea that people can feel the future (or rather to the mathematical prediction based on that idea), 2,000 test subjects would yield enough data to shift opinion firmly towards precognition being true.

Unfortunately, some people completely misunderstood that. They thought that Wagenmakers et al were saying that we should not regard Bem’s data as convincing because they assigned a low prior probability. In truth the only assumption that went into the Bayes factor calculation was regarding the effect size. That point was strongly emphasized but still people miss it.


[sound of sighing]

The May issue of The Psychologist carries an article by Stuart Ritchie, Richard Wiseman and Chris French titled Replication, Replication, Replication plus some reactions to it. The Psychologist is the official monthly publication of The British Psychological Society. And the article is, of course, about the problems the 3 skeptics had in getting their failed replications published.

Yes, replication is important

That the importance of replications receives attention is good, of course. Repositories for failed experiments are important and have the potential to aid the scientific enterprise.

What is sad, however, is that the importance of proper methodology is largely overlooked. Even the 3 skeptics, who should know all about the dangers of data-dredging, cavalierly dismiss the issue with these words:

While many of these methodological problems are worrying, we don’t think any of them completely undermine what appears to be an impressive dataset.

But replication is still not the answer

I have written before about how replication cannot be the whole answer. In a nutshell: by cunning abuse of statistical methods it is possible to give any mundane and boring result the appearance of some amazing, unheard-of effect. That takes hardly any extra work, but experimentally debunking the supposed effect is a huge effort. It takes more searching to be sure that something is not there than to simply find it. For statistical reasons, an experiment needs more subjects to “prove” the absence of an effect than to find it with the same confidence.
There may also be some difference between the original experiment and the replication that explains the lack of effect. In this case it was claimed that maybe the 3 failed because they did not believe in the effect. It takes just seconds to make such a claim. Disproving it requires finding a “believer” who will again run an experiment with more subjects than the original.

Quoth the 3 skeptics:

Most obviously, we have only attempted to replicate one of Bem’s nine experiments; much work is yet to be done.

It should be blindingly obvious that science just can’t work like that.

There are a few voices that take a more sensible approach. Daniel Bor writes a little about how neuroimaging, which has (or had) extreme problems with useless statistics, might improve by fostering greater expertise among its practitioners. Neuroimaging seems to have made methodological improvements. What social psychology needs is a drink from the same cup.

The difficulty of publishing and the crying of rivers

On the whole, I find the article by the 3 skeptics to be little more than a whine about how difficult it is to get published, hardly an unusual experience. The first journal refused because it doesn’t publish replications.
Top journals are supposed to make sure that the results they publish are worthwhile. Showing that people can see into the future is amazing; not being able to show it is not. Back in the day, there was simply a limited number of pages that could be stuffed into an issue; these days, with online publishing, there’s still the limited attention of readers.
The second journal refused to publish because one of the peer-reviewers, who happened to be Daryl Bem, requested further experiments to be done. That’s a perfectly normal thing and it’s also normal that researchers should be annoyed by what they see as a frivolous request.
In this case, one more experiment should have made sure that the failure to replicate wasn’t due to the beliefs of the experimenters. The original results published by Bem were almost certainly not due to chance. Looking for a reason for the different results is good science.

I’ve given a simple explanation for the obvious reason here. If the 3 skeptics are unwilling or unable to actually give such an explanation they are hardly in a position to complain.

Beware the literature

As a general rule, failed experiments have a harder time getting published than successful ones. That’s something of a problem, because it means that information about what doesn’t work is lost to the larger community. When there is an interesting result in the older literature that seems not to have been followed up on, it probably is the case that it didn’t work after all: the original report was a fluke, and the “debunking” was never widely published. Of course, one can’t be sure that it was not simply overlooked, which is a problem.
One must be aware that the scientific literature is not a complete record of all available scientific information. Failures will mostly live on in the memory of professors and will still be available to their ‘apprentices’, but it would be much more desirable if the information could be made available to all. With the internet, this possibility now exists, and the discussion about such means is probably the most valuable result of the Bem affair so far.

Is Replication the Answer?

One question that is forced on us by the publication of papers like Daryl Bem’s Feeling the Future is what went wrong and how it can be fixed.

One demand that often arises is for replication. It is one of the standard demands made by interested skeptics in forums and such places. I can understand why calling for replication is seductive.
It is shrewd and skeptical. It says: not so fast, let’s be sure first, while at the same time offering a highly technical criticism. Replication is technical jargon, don’t you know? On the other hand, it’s also nice and open-minded. It says: this is totally serious science, and some people who aren’t me should spend a lot of time on it.
And perhaps most important of all, it requires not a moment’s thought.

Cynicism aside, replication really is important. As long as a result is not replicated it is very likely wrong. If you don’t replicate, you’re not really generating knowledge. Not only can you not rely on the results, you also lose the ability to determine whether you are using good methods or applying them correctly, which, I’d speculate, will decrease reliability still further over time.

Replication is essential but is replication really all that is needed?

Put yourself in the shoes of a scientist. You have just run an experiment and found absolutely no evidence that people can see the future. That’s going to be tough to publish.
Journals are sometimes criticized for being biased against negative results, but the simple fact is that they are biased against uninteresting results. Attention is a limited quantity; there’s only so much time in a day that can be spent reading. Most ideas don’t work out, so it is hardly news when an idea fails in an experiment. Think, for example, of all the chemicals that are not drugs of any kind.

Before computers and the information age it probably wouldn’t even have been possible to handle all the information about failed ideas. Things have changed now but the scientific community is still struggling to incorporate these new possibilities. However, one still can’t expect real life humans to pay attention to evidence of the completely expected.

Now you could try a new idea and hope that you have more luck with that.
Or you could do what Bem did and work some statistical magic on the data. And by magic I mean sleight of hand. The additional work required is much less and it is almost certain to work.
The question is simply if you want to advance science and humanity or your career and yourself.

If you go the 2nd route, the Bem route, your result will almost certainly fail to replicate.

So you might say that replication, if it is attempted, solves the problem. Until then you have a public confused by premature press reports, perhaps bad policy decisions, and certainly a lot of time wasted trying to replicate the effect. Establishing that an effect is not there always takes more effort than just demonstrating it.

To this one might say that the nature of science is just so, tentative and self-correcting. Meanwhile the original data magician, our Bem-alike, has produced a publication in a respectable journal, which indicates quality work, and received numerous citations (in the form of failed replications), which indicates that the paper was fruitful and stimulated further research. These factors, number of publications, reputation of journal and number of citations are usually used to judge the quality of work by a scientist in some objective way.

Eventually, if replication is all the answer needed, one should expect science to devolve into producing seemingly amazing results that are then slowly disproven by subsequent failed replications. The progress we have come to expect would be merely an accidental byproduct.

The problem might be said to lie rather in judging scientists in such a way. Maybe we should include the replicability of results in such judgments. But now we’re no longer talking about replication as the sole answer. We’re now talking about penalizing bad research.

And that’s the point. Science only works if people play by the rules. Those who won’t or can’t must be dealt with somehow. In the extreme case that means labeling them crackpots and ostracizing them.
But there are less extreme examples.

The case of the faster than light neutrinos

You probably have heard that some scientists recently announced that they had measured neutrinos to go faster than light. This turned out to be due to a faulty cable.

This story is currently a favorite of skeptics, who point out that few physicists took the result seriously, despite the fact that it was originally claimed that all technical issues had been ruled out. It makes a good cautionary tale about how implausible results should be handled and why. Human error is just always possible and plausible.

There’s another chapter to this story, one that I fear will not get much attention.

The leaders of the experiment were forced to resign as a consequence of the affair.

There were very many scientists involved in the experiment due to the sheer size of the experimental apparatus. Among them, there was much discontent about how the results were handled. Some said that they should have run more tests, including the test that found the fault, before publishing. Which means, of course, that they shouldn’t have published at all.

It is easy to see how a publish-or-perish environment that puts a premium on exciting results encourages not looking too closely for faults. But what’s the alternative? No incentive to publish equals no incentive to work. No incentive for exciting results just cements the status quo and hinders progress.

Scientific Skepticism

You’ve probably had some physics or chemistry at some point. If so, you probably remember being shown experiments. If not, or if you can’t remember, click here. In my memory, these were deadly boring, but at least you didn’t need to pay so much attention.
Have you ever wondered why that has such an important part in the science curriculum?

Doing experiments yourself teaches you some skills and, hopefully, is more engaging than just reading in a book. But what’s the point of having the teacher perform experiments? Surely it would be much cheaper and easier to just have the pupils read about them.

In my experience, the answer is never spelled out in class. That answer is that you’re not supposed to believe things just because it says so in a book. You’re supposed to know, for yourself, that what you’re taught is right, that it works.

The Battle Cry of the Revolution

Nullius in verba is Latin for ‘take nobody’s word for it’. In the 17th century it became the motto of the Royal Society and a battle cry of the scientific revolution. The men of the Royal Society resolved not to believe things merely because they were written in some ancient book or said by some respected personage. They would experiment.
Think about how radical this was in a time of absolute monarchy when the bible was regarded as the word of god.
If these men had realized that the same attitude would eventually challenge even monarchy and church, most of them would probably have been horrified and abandoned their quest.

Of course, especially in our modern times, one cannot personally repeat every single experiment. And when one gets a different answer on one experiment, well, maybe it is not everyone else that has made a mistake. Skepticism is one thing, solipsism another.

Science is Social

Science is a social enterprise. Just like society as a whole, science relies on people doing their jobs and following certain rules. Both have fail-safes to deal with instances where individuals either make mistakes or cheat. In science, this fail-safe is largely the skepticism of the peers, who are relied on to find and weed out erroneous results. Unfortunately, science has little ability to deal with actual frauds. Only occasionally, when someone has gone completely overboard in making stuff up, are frauds identified as malicious rather than simple mistakes.
Scientific skepticism ensures that fraudulent results are weeded out just like innocent errors. Scientific fraud can still harm society by causing bad decisions to be made. It also harms science by causing people to waste time and money trying to confirm the unconfirmable, or trying to build on a rotten foundation.
The only way a false result can escape correction is if it so unimportant as to be completely ignored. That’s, ironically enough, the best case.

I hope I made clear why skepticism (in this particular version) is so fundamental to science. It should also be clear why fabricating or altering data is such a heinous crime against science. What if everyone did it?
Science would become a meaningless sham.

There are two more things I want to say. They’re not so much about scientific skepticism, but they fit here.

In science, you’re supposed to judge a claim on its merits, not on the merits of the person making that claim. Still, reputation plays an important part when judging a claim.
The reason is that whenever someone publishes scientific results, they are not just offering the fruits of their labor. They are also asking others to do work. The result must be vetted; it must be replicated before it can be used.
This request doesn’t go out to some faceless, abstract entity called science. It goes out to individuals. And each of these individuals will ask themselves: If I take this seriously, am I wasting my time?
That’s where reputation (and a whole number of other soft factors) comes into play. It’s not logically valid to judge the claim by the reputation of the person making it. But judging the likelihood of wasting your time on these soft factors certainly works.

The Bem Exploration Method
Yes, this again. This is where it gets a bit ranty, so feel free to stop reading here.
Maybe you’ve read my previous article and wondered why this seemed so serious to me. Perhaps this article gave an answer.
Data dredging and the similar abuses of statistics that are part of the BEM produce false results. The more widespread this technique becomes, the more waste there is in science. It takes away resources that could be used gainfully.
Even worse is the omission of negative results, something that turns replication itself into a sham. It is much more work-intensive and easier to detect than fabricating data, but the effect on the scientific enterprise is the same.
Seeing how Bem’s advice is used to teach students makes me seriously wonder about the integrity of (social) psychological science.

This blog post was the most shocking on Feeling the Future that I’ve read. And I’ve read a lot.
Lots of stupid and credulous things were written about Feeling the Future, but its author had the good sense to see the signs of questionable research practices in Bem’s article. His post is very good in that department.
And still he calls it a good paper. He also calls the publishing standards too lax, so the only thing I can really blame him for is not being sufficiently outraged.

Feeling The Future, Smelling The Rot

Daryl Bem is (or was?) a well-respected social scientist who used to lecture at Cornell University. The Journal of Personality and Social Psychology is a peer-reviewed, scientific journal, also well-respected in its field. So it should be no surprise that when Bem published an article that claimed to demonstrate precognition in that journal it made quite a splash.

It was even mentioned, at length, in more serious newspapers like the New York Times, though at least with the skepticism a subject with such a lousy track record deserves. In fact, if the precognition effect that Bem claims were real, casinos would be impossible, as a reply by Dutch scientists around EJ Wagenmakers points out.

By now, several people have attempted to replicate some of Bem’s experiments without finding the claimed effect. That’s hardly surprising but it does not explain how Bem got his results.

What’s wrong with the article?

It becomes obvious pretty quickly that the statistics were badly mishandled and a deeper look only makes things look worse. The article should never have passed review but that mistake didn’t bother me at first. Bem is experienced, with many papers under his belt. He knows how to game the system.

The mishandled statistics were not just obvious to me, of course. They were pointed out almost immediately by a number of different people.

These issues should be obvious to anyone doing science. If you don’t understand statistics, you can’t do social science. What does statistics have to do with understanding people? About the same as literacy has to do with writing novels: at its core nothing, it’s just a necessary tool.

Mishandled statistics are not all that uncommon. Statistics is difficult, and fields such as psychology are not populated by people with an affinity for math. Nevertheless, omitting key information and presenting the rest in a misleading manner really stretched my tolerance. That he simply lied about his method when responding to criticism went too far. But that’s just my opinion.

Such an accusation demands evidence, of course. The article is full of tell-tale hints which you can read about here or in Wagenmakers’ manuscript (link at the bottom).
But there is clear proof, too. As Bem mentions in the article, some results were already published in 2003. Comparing that article to the current one reveals that he originally performed several experiments with around 50 subjects each. He thoroughly analyzed these batches and then assembled them into packets of 100-200 subjects, which he presents as experiments in his new paper.

[Update: There is now a more extensive post on this available.]

That he did this is the omitted key information. The tell-tale hints suggest that he did this, and more, in all experiments. Yet he has stated that exploratory analysis did not take place, something that is clearly shown to be false by the historical record.

Scientists aren’t supposed to do that sort of thing. Honesty and integrity are considered to be pretty important and going by the American Psychological Association’s ethics code that is even true for psychologists. But hey, it’s just parapsychology.

And here’s where my faith in science takes a hit…

The Bem Exploration Method

Bem Exploration Method (BEM) is what Wagenmakers and company, with unusual sarcasm for a scientific paper, called the way by which Bem manufactured his results. They quote from an essay Bem wrote that gives advice for “writing the empirical journal article”. In this essay, Bem outlines the very methods he used in “Feeling the Future”.

Bem’s essay is widely used to teach budding social psychologists how to do science. In other words, they are trained in misconduct.

Let me give some examples.

There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b).
The conventional view of the research process is that we first derive a set of hypotheses from a theory, design and conduct a study to test these hypotheses, analyze the data to see if they were confirmed or disconfirmed, and then chronicle this sequence of events in the journal article.

I just threw a die 3 times and got the sequence 6,3,3. If you, dear reader, want to duplicate this feat you will have to try an average of 216 times. Now, if I had said in advance that I was going to get 6,3,3, this would have been impressive, but of course I didn’t. I could have said the same thing about any other combination, so you’re probably just rolling your eyes.
Scientific testing works a lot like that. You work out how likely it is that something happens by chance and if that chance is low, you conclude that something else was going on. But as you can see, this only works if the outcome is called in advance.
This is why the “conventional view” is as it is. Calling the shot after making the shot just doesn’t work.
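The arithmetic behind the dice example, as a sketch:

```python
from fractions import Fraction

# Probability of one *pre-specified* sequence of three die rolls, e.g. 6,3,3:
p_called_shot = Fraction(1, 6) ** 3
assert p_called_shot == Fraction(1, 216)  # hence ~216 attempts on average

# Probability of getting *some* three-roll sequence and then "calling" it:
# there are 216 equally likely sequences, so this is a certainty.
p_shot_after_the_fact = 216 * p_called_shot
assert p_shot_after_the_fact == 1
```

A 1-in-216 event called in advance is evidence; the same event called afterwards has probability 1 and is evidence of nothing.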

In real life, it can be tricky finding some half-way convincing idea that you can pretend to have tested. Bem gives some advice on that:

[T]he data. Examine them from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find additional evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something—anything—interesting.

There is nothing wrong, as such, with exploring data to come up with new hypotheses to test in further experiments. In my dice example, I might notice that I rolled two 3s and proceed to test whether the die is biased towards 3s.
Well-meaning people, or those so well-educated in scientific methodology that they can’t believe anyone would advocate misbehavior, will understand this passage to mean exactly that. Unfortunately, that’s not what Bem did in Feeling The Future.

And again, he was only following his own advice, which is given to psychology students around the world.

When you are through exploring, you may conclude that the data are not strong enough to justify your new insights formally, but at least you are now ready to design the “right” study. If you still plan to report the current data, you may wish to mention the new insights tentatively, stating honestly that they remain to be tested adequately. Alternatively, the data may be strong enough to justify recentering your article around the new findings and subordinating or even ignoring your original hypotheses.

The truth is that once you go fishing, the data (or more precisely, the result) is never strong.

Bem claimed that his results were not exploratory. Maybe he truly believes that “strong data” turns an exploratory study into something else?
In practice, this advice means that it is okay to lie (at least by omission) if you’re certain that you’re right. I am reminded of a quote by a rather more accomplished scientist. He said about science:

The first principle is that you must not fool yourself--and you are the easiest person to fool. So you have to be very careful about that. After you’ve not fooled yourself, it’s easy not to fool other scientists. You just have to be honest in a conventional way after that.
That quote is from Richard Feynman, who won a Nobel Prize in physics and advocated scrupulous honesty in science. I imagine he would have used Bem’s advice as a prime example of what he called cargo cult science.

Bayesians to the rescue?

Bem has inadvertently brought this widespread malpractice in psychology into the limelight.
Naturally, these techniques of misleading others also work in other fields and are also employed there. But it is my personal opinion that other fields have a greater awareness of the problem and are more likely to recognize these techniques as scientifically worthless and, when done intentionally, as fraud.
If anyone knows of similar advice given to students in other fields, please inform me.

The first “official” response had the promising title: Why psychologists must change the way they analyze their data by Wagenmakers and colleagues. It is from this paper that I took the term Bem Exploration Method.
The solution they suggest, the new way to analyze data, is to calculate Bayes factors instead of p-values.
They aren’t the first to suggest this. Statisticians have long been arguing the relative merits of these methods.
This isn’t the place to rehash this discussion or even to explain it. I will simply say that I don’t think it will work. The Bayesian methods are just as easily manipulated as the more common ones.

Wagenmakers & co show that the specific method they use fails to find much evidence for precognition in Bem’s data. But this is only because that method is less easily “impressed” by small effects, not because it is tamper-proof. Bayesian methods, like traditional methods, can be more or less sensitive.

The problem can’t be solved by teaching different methods. Not as long as students are simultaneously taught to misapply these methods. It must be made clear that the Bem Exploration Method is simply a form of cheating.

Bem, D. J. (2003). Writing the empirical journal article. In J. M. Darley, M. P. Zanna, & H. L. Roediger III (Eds.), The compleat academic: A career guide (pp. 171–201). Washington, DC: American Psychological Association.
Bem, D. J. (2003, August). Precognitive habituation: Replicable evidence for a process of anomalous cognition. Paper presented at the 46th Annual Convention of the Parapsychological Association, Vancouver, Canada.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology.
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? A response to Wagenmakers, Wetzels, Borsboom, & van der Maas (2011). Manuscript submitted for publication.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (in press). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology.

See here for an extensive list of links on the topic. If I missed anything it will be there.

Extraordinary Claims and Extraordinary Evidence

“Extraordinary claims require extraordinary evidence” is a common skeptical quote and there has already been a lot written about it.

So rather than reinvent the wheel and talk about the history of the statement or give some abstract justification, I am going to give an example of how it is applied.

A Medical Example

Think of a medical test like an HIV test or a pregnancy test. Such tests can go wrong. A pregnancy test could say you or your partner is pregnant when she is not. Or it may fail to say so when she actually is. There can be a false positive (aka Type I error) or a false negative (aka Type II error). Such medical tests are extensively tested themselves before being marketed. It is therefore well known how often they are wrong on average.

For our example we will imagine a test that has a 5% chance of a false positive and, to keep things simple, we will ignore false negatives. What happens when we apply that test to 10,000 people, 20 of whom have the disease and 9,980 who don't?

We get 20 true positives and 9,980 × 5% = 499 false positives, for a total of 519 positives. That means that although the test proudly proclaims to show a false positive in only 5% of cases, we found that of all the positives less than 4% are true.

Of course, in reality one will only know the test results and not what the actual truth is but one may have a good idea from previous experience. And, of course, one will rarely get a result that is exactly average but let’s not get into probability theory.

Now let's say we test 10,000 people of whom 1,000 have the disease. We find 1,450 positives, 450 of which are false (9,000 × 5%). This time over two-thirds of the positives are true.

In both cases we have the same evidence, namely a positive test result, but in the first case it is very, very likely false but in the second case probably true.
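Both scenarios can be checked with a few lines of Python, a sketch that keeps the simplifying assumption of no false negatives:

```python
def screen(total, sick, false_pos_rate):
    """Return (number of positives, share of positives that are true)
    when screening `total` people of whom `sick` actually have the
    disease.  Assumes the test never misses a real case."""
    true_pos = sick
    false_pos = (total - sick) * false_pos_rate
    positives = true_pos + false_pos
    return positives, true_pos / positives

print(screen(10_000, 20, 0.05))     # rare disease: 519 positives, under 4% true
print(screen(10_000, 1_000, 0.05))  # common disease: 1450 positives, ~69% true
```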

Such situations are encountered frequently in medicine. It is the reason that only at-risk populations are screened for diseases. Screening everyone would produce an overwhelming number of false positives that would cause needless distress.

Think back to the first example where we had 519 positives. If we apply another test, even a better one with only a 1% rate of false positives, we would still get about 5 false positives besides the 20 true ones. So even though we applied two tests, one with a 5% rate of false positives and another with a 1% rate, we still have only 80% true positives.
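The chained calculation, again ignoring false negatives and assuming the two tests fail independently of each other:

```python
true_pos = 20
false_pos = 9_980 * 0.05   # 499 false positives survive the 5% test
false_pos *= 0.01          # about 5 of those survive the follow-up 1% test
share_true = true_pos / (true_pos + false_pos)
print(round(false_pos), round(share_true, 2))  # roughly 5 and 0.80
```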

There's an implication here. Have you ever heard of a case where a tumor was suddenly gone? Well, maybe it wasn't there in the first place. Even though medicine is well aware of this logic and compensates for it by using tests that are very reliable, we cannot, as a matter of principle, achieve certainty. Especially since real life also throws in mixed-up paperwork and human error besides any technical faults.

From The Medical To The General

The same logic can be applied to real life in a very straightforward way. Say someone claims to have developed a perpetual motion machine. Many people have claimed that in the past, but it never panned out. This is so extreme that patent offices these days refuse to consider applications for such machines.

So at the very least we must assume that there are thousands, if not tens of thousands, of perpetual motion claims that are untrue for every one that is true, even if such a thing is possible.

But what does a test look like? Typically there will be a demonstration where the machine is shown in action. This will at least prove that there is some sort of machine. Any claims that exist merely in the form of an April Fools' press release or the like will fall by the wayside. So we can say that this is a true test in that it may be either negative or positive.

On the other hand, a demonstration only proves that some machine exists, it does not prove that this machine is perpetual, so the rate of false positives must be very high indeed.

The next test might be allowing an engineer or physicist to inspect the machine. How weighty is that evidence? Hard to say. If it is a scam, that person may simply be in on it. And even if not, that person may be fooled. How likely is that? That depends very much on how much leeway he or she is allowed. Your chances of seeing through a magic trick while sitting in an audience are pretty much nil, but very high if you have free rein to roam backstage and to set up cameras etc. Don't mistake knowing how a trick is done with actually seeing through it.

At what point would it be reasonable to believe in perpetual motion? Generally, when a radical change to something considered a law of nature is proposed one would like to have many independent replications in different labs and by different groups. In the case of a perpetual motion machine this should be quite easy.

An entirely different example of an extraordinary claim is Bigfoot. There are many, many species around the world. There’s even a fair number of large mammals that are different enough to be easily distinguishable by an amateur. There also are many fictitious beasts. I couldn’t put a number on it but the ratio of fictitious to real can’t be all that bad.

Two or three hundred years ago one would surely have considered the report of some reasonably credible person as sufficient to establish the existence of Bigfoot. So what has changed?

Nowadays, there are people everywhere. Zoologists have scoured every corner of the globe for new species. Discovering a large mammal means fame eternal. But there simply can't be many left, especially not in North America. One needn't even find a live specimen; with modern DNA techniques, simply finding a few hairs, feces or bones is enough.

Every expedition that fails to find Bigfoot, or even every hiker, equals a negative test result. Of course, there can be false negatives. Just because you failed to find something doesn't mean it is not there, but it does mean that it is less likely.

You probably have a usual spot for your keys. But when you don't see them there you look somewhere else. Before you looked, you thought it most likely that you'd find them there, but when you didn't see them you adjusted the probabilities. You've probably had the experience of finding something in a spot where you had searched before, even exhaustively. There you were the victim of a false negative. Every test can throw a false result!
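That adjustment is just Bayes' rule applied with a false-negative rate. A sketch with invented numbers: suppose you were 80% sure the keys were in the usual spot, and a quick look overlooks keys that are really there 10% of the time.

```python
prior = 0.8   # belief the keys are in the usual spot (assumed)
miss = 0.1    # chance of overlooking them even when they are there (assumed)

# P(usual spot | didn't see them there), by Bayes' rule.  A spot without
# the keys always yields "didn't see them", so its likelihood is 1.
posterior = prior * miss / (prior * miss + (1 - prior) * 1.0)
print(round(posterior, 2))  # the failed look drops the belief sharply
```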

Back to Bigfoot: if we think of every person who might have found evidence of Bigfoot as a test, then we have a massive lot of negative results. Yet we don't know if that means anything, because we should expect a lot of false negatives even if Bigfoot exists. At the same time we also have a few positive results. People who reported seeing Bigfoot or who found Bigfoot tracks; there are even Bigfoot films. These might be false positives, though. Misidentifications, hoaxes and the like.

On balance, what carries more weight, positives or negatives? The answer is, as you might have guessed, the negatives. There is no DNA sample and no carcass, evidence that would be all but conclusive. And we should have such evidence if even a fraction of the positives, i.e. the tracks, the sightings, and especially the films, were true.

You're probably thinking that the medical analogy, all this talk of tests, gets quite strained here. Well, it's what I'm thinking anyway, so let's just let it be. There's just one more thing I want to mention.

Think about what would happen if I tested 100 healthy people with a test that has a 5% rate of false positives. You get 5 positives, of course (on average). Test 100 more and you have 10. Test 1,000 more and you have 60! Why, you have an epidemic on your hands!
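The arithmetic behind the mock epidemic, in a couple of lines (expected counts, since everyone tested is healthy):

```python
rate = 0.05  # false-positive rate of the test
running_total = 0
for batch in (100, 100, 1_000):
    running_total += batch * rate  # expected false positives in this batch
    print(f"after {batch} more tests: {round(running_total)} positives")
```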

At least that's how it might seem to a casual observer. The point is, it only takes dedication to amass a growing pile of evidence for anything. So whenever you hear of a "statistically significant result", and newspapers are full of them, remember that.

Science as a whole has ways of dealing with that and will probably not be fooled for long, at least, if reasonably unbiased people take an interest. The lasting facts tend to emerge more slowly, over the course of several years, or even decades. They rarely make newspaper headlines because there is no one event that could be reported on.


I hope this was enough to bring home the principle behind “extraordinary claims require extraordinary evidence”.

The important thing to take away is that claims are not different by their nature but by what we know. Ordinary claims can become extraordinary once counter-evidence piles up and vice versa.

One might equally well say: “Look at all the evidence!”

Don’t just look at the demonstration for that perpetual motion machine, look at all the identical claims in the past that have failed. Don’t just look at those Bigfoot tracks, look at what’s not been found.

We already know a thing or two. There’s nothing open-minded about ignoring that for the sake of evidence that may or may not materialize.