Crisis? What crisis?
Research findings often crumble under the microscope. Rows over the best way to fix this must end so we can stop trust in science crumbling too

In these times of fake news, it's good to know that there's still one source we can rely on: the scientific community. Wielding rigorous standards of evidence, researchers can be counted on to give us trustworthy insights amid a sea of nonsense.

Yet this, too, is fake news. For decades, scientists have been using flawed methods for turning raw data into insight about, say, the effectiveness of a new medical therapy or method of teaching. As a result, the research literature is awash with findings that are nothing more than meaningless flukes. No less shocking is the fact that researchers have been repeatedly warned about the problem, to no effect.

This week, the American Statistical Association (ASA) hopes to change that. It is hosting a conference intended to get the scientific community to mend its ways. But what has this scientific crisis got to do with statistics?

Last year, the ASA stated its concern about the misuse of the standard data analysis methods that researchers use to tell if they've found something worth reporting. Known as significance testing, these are suspected of playing a key role in the replication crisis in science, in which startling claims collapse when other researchers try to confirm them.

Taking the p

The focus of the ASA's concern lies in the calculated probability, or p-value, of experimental results. After crunching data on, say, the fitness of people before and after some new form of exercise, researchers calculate the p-value of the difference in performance.

If the outcome leads to a p-value below 5 per cent, the findings are conventionally said to be "statistically significant" and worth taking seriously. That's because the p-value is widely thought to be the chances of the findings being a fluke. As such, those with a p-value of 5 per cent seem 95 per cent likely to be the result of a genuine effect.

But that's not what a p-value means. It's actually the chances of getting at least as impressive a finding as that seen, assuming fluke was the real cause. Which might sound the same, but is subtly - and dangerously - different.

Because it is based on the assumption that fluke is the real cause of the effect, a p-value can't just be flipped around to give the chances of that assumption being true.
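For readers who want to see this in numbers, here is a minimal simulation sketch in Python (the group sizes, seed and number of experiments are arbitrary choices for illustration, not anything from the ASA). It runs thousands of experiments in which there is no real effect at all, yet roughly 1 in 20 still clears the conventional 5 per cent bar:

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(42)
norm = NormalDist()

n, sigma = 30, 1.0       # two groups of 30, known spread (illustrative)
experiments = 2000
false_alarms = 0

for _ in range(experiments):
    # Both groups are drawn from the SAME distribution,
    # so any difference between them is pure fluke.
    a = [random.gauss(0, sigma) for _ in range(n)]
    b = [random.gauss(0, sigma) for _ in range(n)]

    z = (mean(a) - mean(b)) / (sigma * sqrt(2 / n))
    # The p-value: chance of a gap at least this big,
    # assuming fluke really is the cause.
    p = 2 * (1 - norm.cdf(abs(z)))
    if p < 0.05:
        false_alarms += 1

print(f"'Significant' flukes: {false_alarms / experiments:.1%}")
```

The point of the sketch: a p-value below 5 per cent tells you how surprising the data would be if fluke were the cause; it does not tell you there is only a 5 per cent chance that fluke was the cause.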

Statistical headache

Confused? Then try this medical analogy. You're a doctor with a patient complaining of recurrent headaches. You know there's a 60 per cent chance of getting these headaches - if there is a brain tumour.

So does that mean the chances your patient has a tumour are also 60 per cent? Hardly: brain tumours are mercifully rare, so the vast majority of headache sufferers don't have one. It's clearly a mistake to simply flip the problem around and assume the chances of having headaches if you have a tumour are the same as those of having a tumour if you're having headaches.

Yet that's exactly the kind of blunder researchers make with p-values, leading them to see "significance" in meaningless flukes, like careless doctors confusing symptoms with causes.
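Bayes' theorem makes the flip explicit. The sketch below uses the 60 per cent figure from the analogy, but the tumour prevalence and background headache rate are made-up numbers, chosen only to show how dramatically the two probabilities can differ:

```python
# All rates except the 60% figure are illustrative assumptions.
p_tumour = 0.0001                # assumed prevalence: 1 in 10,000
p_headache_given_tumour = 0.60   # the 60% figure from the analogy
p_headache_given_healthy = 0.10  # assumed background headache rate

# Total probability of headaches, with or without a tumour.
p_headache = (p_headache_given_tumour * p_tumour
              + p_headache_given_healthy * (1 - p_tumour))

# Bayes' theorem: flip the conditional the RIGHT way.
p_tumour_given_headache = p_headache_given_tumour * p_tumour / p_headache

print(f"P(headache | tumour) = {p_headache_given_tumour:.0%}")
print(f"P(tumour | headache) = {p_tumour_given_headache:.2%}")
```

With these assumed rates, the chance of a tumour given headaches comes out well under 1 per cent, despite the 60 per cent figure in the other direction; that gap is the whole point of the analogy.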

The ASA was right to go public with its concerns last year, and to call on researchers to adopt more reliable techniques for extracting insights from data. Now it is asking statisticians to come up with alternatives.

Civil war

But there's a problem. There's no consensus on what these alternatives should be. For decades, a civil war has raged between statisticians who think the p-value approach just needs tweaking and those seeking radical reform.

In the run-up to this week's meeting, one group of leading statisticians called for a simple tightening of the p-value threshold to 0.5 per cent. It's already been attacked as simplistic and beset with unintended consequences - such as discouraging follow-up studies, which will need to be much larger and more expensive if they're to meet the tougher standard.

A host of alternatives, including outright bans on p-values and claims of "significance", will be discussed at the conference, but if history is any guide, they too will be torn apart by one faction or another.

It is time for statisticians to bury their hatchets and reach a pragmatic consensus. Unless they do, researchers will simply stick with the flawed methods they've always used - and leave the rest of us struggling to tell the fake breakthroughs from the real ones.