Recently, there was an interesting article in BusinessWeek about the flip-flop of studies on the efficacy of echinacea to cure the common cold. The article focused on the possibility of incorrectly performed studies. But, there may have been nothing wrong with any of the studies, even though they differed in their results. The statistical nature of clinical studies means there is always a small possibility that false effects will be seen. However, biases inherent to statistical research may result in a surprisingly large percentage of published studies being wrong. In fact, it has been suggested that the majority of such studies are.
First, I’ll have to briefly explain something about how statistically-based studies are done. When people do such trials, they consider as “significant” any result that would happen by chance at most 1 in 20 times if there were no real effect. In the language of statistics, they design the study so that the “null” hypothesis (e.g. that echinacea has no effect on a cold) would only be rejected falsely at most 5% of the time based on the normal random variability expected in their study. In other words, they accept that 5% of the time (at most) they will erroneously see an effect where there truly isn’t any. This 5% chance of a mistake arises from unavoidable randomness, such as the normal variation in disease duration and severity; in the case of the echinacea studies, you might just happen to test your drug on a group of people who got lucky and caught abnormally mild colds.
In summary, to say a drug study is conducted at the 5% significance level is to say that you designed the study so that you would falsely conclude a positive effect, when there was none, only 5% of the time. In practice, scientists usually publish the p-value, which is the lowest significance level (which you can only compute after the fact) that would have still allowed you to conclude an effect. The main point, however, is that any study that is significant at the 5% level or better is generally considered significant enough to publish.
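To make the 5% figure concrete, here is a minimal simulation sketch. It assumes a made-up setup that is not taken from any real echinacea study: cold durations are drawn from a normal distribution with a hypothetical mean of 7 days and a known standard deviation of 2 days, the "drug" does nothing at all, and each simulated study compares 30 treated people with 30 controls using a two-sided z-test.

```python
import random
from statistics import NormalDist


def false_positive_rate(trials=4000, n=30, alpha=0.05, seed=0):
    """Fraction of simulated studies that 'find' an effect when none exists."""
    rng = random.Random(seed)
    mu, sigma = 7.0, 2.0  # hypothetical cold duration in days (assumed values)
    # Two-sided critical value for the chosen significance level.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two group means (sigma known).
    std_err = sigma * (2 / n) ** 0.5
    false_positives = 0
    for _ in range(trials):
        # Both groups come from the SAME distribution: the null is true.
        treated = [rng.gauss(mu, sigma) for _ in range(n)]
        control = [rng.gauss(mu, sigma) for _ in range(n)]
        z = (sum(treated) / n - sum(control) / n) / std_err
        if abs(z) > z_crit:
            false_positives += 1
    return false_positives / trials


print(f"share of null studies declared significant: {false_positive_rate():.3f}")
```

Under these assumptions the printed share should land close to 0.05, even though every simulated study was run correctly.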
So, being wrong at most 1 in 20 times is pretty good, right? The cost of putting a study out there that is wrong pales in comparison to the good of the 19 that actually help, right? Does it really matter if, of the 1000s of studies telling us what we should and shouldn’t do, dozens of them are wrong? In theory, there will always be an order of magnitude more studies that are truly helpful.
The problem is, this conclusion assumes a lot. Just because the average study may have a p-value of, say 2%, it doesn’t mean only 2% of the studies out there are wrong. We have no idea how many studies are performed and not published. Another way of looking at the significance level of an experiment is “How many times does this experiment have to be repeated before I have a high probability of being able to publish the result I want?” This may sound cynical, but I’m not suggesting any dishonesty. This kind of specious statistics occurs innocently all the time due to unknown repeated efforts in the community, an effect called publication bias. Scientists rarely publish null findings, and even if they do, such results are unlikely to get much attention.
Taking 5% as the accepted norm for statistical significance, this means only 14 groups need to have independently looked at the same question, in the entire history of medicine, before it’s probable that one of them will find a falsely significant result. Perhaps more problematically, consider that many studies actually look at multitudes of variables, and it becomes clear that if you just ask enough questions on a survey, you’re virtually guaranteed to have plenty of statistically significant “effects” to publish. Perhaps this is why companies find funding statistical studies so much more gratifying than funding the physical sciences.
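The figure of 14 follows from simple arithmetic, sketched below. If each of n independent tests of a true null hypothesis has a 5% chance of a false positive, the chance that at least one of them comes up significant is 1 - 0.95^n. The function names are my own, and the 20-question survey is a hypothetical example.

```python
def prob_at_least_one_false_positive(n, alpha=0.05):
    """Chance that at least one of n independent null tests is 'significant'."""
    return 1 - (1 - alpha) ** n


def tests_needed(alpha=0.05, threshold=0.5):
    """Smallest number of independent tests that makes a false positive probable."""
    n = 1
    while prob_at_least_one_false_positive(n, alpha) <= threshold:
        n += 1
    return n


print(tests_needed())  # 14
print(round(prob_at_least_one_false_positive(14), 3))  # 0.512
# Hypothetical survey with 20 unrelated questions, none with a real effect:
print(round(prob_at_least_one_false_positive(20), 3))  # 0.642
```

The same formula covers both cases in the paragraph above: 14 separate groups asking one question, or one group asking many questions, as long as the tests are independent.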
None of what I have said so far is likely to be considered novel to anybody involved in clinical research. However, I think there is potentially another, more insidious source of bias that I don’t believe has been mentioned before. The medical research community is basically a big hypothesis-generating machine, and the weirder, the better. There is fame to be found in overturning existing belief and finding counterintuitive effects, so people are biased towards attempting studies where the null hypothesis represents existing belief. However, assuming that there is some correlation between our current state of knowledge and the truth, this implies a bias towards studies where the null hypothesis is actually correct. In classical statistics, the null hypothesis can only be refuted, not confirmed. Thus, by focusing on studies that seek to overturn existing belief, there may be an inherent bias in the medical profession to find false results. If so, it’s possible that a significant percentage of published studies are wrong, far in excess of that suggested by the published significance level of the studies.
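A rough way to put numbers on this argument is to ask what share of significant results are false, given how often the hypotheses being tested are actually true. The sketch below is only an illustration: the 80% power and the prior probabilities are values I have assumed, not measurements of the medical literature, and it ignores publication bias entirely.

```python
def false_share_of_significant(prior_true, power=0.8, alpha=0.05):
    """Share of significant results that are false positives.

    prior_true: assumed fraction of tested hypotheses where an effect is real.
    power:      assumed chance of detecting a real effect.
    alpha:      significance level (chance of a false positive under the null).
    """
    false_pos = alpha * (1 - prior_true)  # null true, yet declared significant
    true_pos = power * prior_true         # effect real and detected
    return false_pos / (false_pos + true_pos)


# Even odds that the tested effect is real:
print(round(false_share_of_significant(0.5), 3))  # 0.059
# Counterintuitive hypotheses, real only 1 time in 10 (assumed):
print(round(false_share_of_significant(0.1), 3))  # 0.36
```

Under these assumptions, the more a field favors long-shot hypotheses, the further the share of false findings climbs above the nominal 5%.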
Statistical studies are certainly appropriate when attempting to confirm a scientific theory grounded in logic and understanding of the underlying mechanism. A random question, however, is not a theory, and using statistics to blindly fish for novel correlations will always produce false results at a rate proportional to the effort applied. Furthermore, as mentioned above, this may be further exacerbated by the bias towards disproving existing knowledge as opposed to confirming it. The quality expert W. Edwards Deming (1975) once suggested that the reason students have problems understanding hypothesis tests is that they “may be trying to think.” Using statistics as a primary scientific investigative tool, as opposed to merely a confirmative one, is a recipe for the production of junk science.