Recently, there was an interesting article in BusinessWeek about the flip-flopping of studies on the efficacy of echinacea in curing the common cold. The article focused on the possibility of incorrectly performed studies. But there may have been nothing wrong with any of the studies, even though they differed in their results. The statistical nature of clinical studies means there is always a small possibility that false effects will be seen. However, biases inherent to statistical research may result in a surprisingly large percentage of published studies being wrong. In fact, it has been suggested that the majority of such studies are wrong.
First, I’ll have to briefly explain something about how statistically based studies are done. When people do such trials, they consider as “significant” any result that would happen by chance at most 1 time in 20 if there were no real effect. In the language of statistics, they design the study so that the “null” hypothesis (e.g. that echinacea has no effect on a cold) would be rejected falsely at most 5% of the time, given the normal random variability expected in their study. In other words, they accept that when there truly is no effect, they will erroneously see one at most 5% of the time. This 5% chance of a mistake arises from unavoidable randomness, such as the normal variation in disease duration and severity; in the case of the echinacea studies, you might just happen to test your drug on a group of people who got lucky and caught unusually mild colds.
In summary, to say a drug study is conducted at the 5% significance level is to say that the study was designed so that it would falsely conclude a positive effect, when there was none, only 5% of the time. In practice, scientists usually publish the p-value, which is the lowest significance level (which you can only compute after the fact) at which the data would still have allowed you to conclude an effect. The main point, however, is that any study that is significant at the 5% level or better is generally considered significant enough to publish.
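To make the 5% figure concrete, here is a rough simulation sketch of many trials of a drug that does nothing. Everything specific in it is my own assumption for illustration, not anything taken from the echinacea studies: the cold durations (normally distributed, mean 7 days, standard deviation 2), the group size of 30, and the use of a simple two-sided z-test with a known standard deviation.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(n_trials=4000, n_per_group=30, alpha=0.05, seed=1):
    """Fraction of simulated no-effect trials that still look 'significant'."""
    rng = random.Random(seed)
    mu, sigma = 7.0, 2.0  # hypothetical cold duration in days (assumed values)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    std_err = sigma * (2 / n_per_group) ** 0.5    # std. error of the difference in means
    hits = 0
    for _ in range(n_trials):
        # Both groups are drawn from the same distribution: the drug has no effect.
        treated = [rng.gauss(mu, sigma) for _ in range(n_per_group)]
        placebo = [rng.gauss(mu, sigma) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(placebo)) / std_err
        if abs(z) > z_crit:  # would be reported as a significant effect
            hits += 1
    return hits / n_trials

rate = false_positive_rate()
print(f"share of no-effect trials that look significant: {rate:.3f}")
```

The printed share should land close to 0.05, even though no trial in the simulation was done incorrectly.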
So, being wrong at most 1 in 20 times is pretty good, right? The cost of putting a study out there that is wrong pales in comparison to the good of the 19 that actually help, right? Does it really matter if, of the 1000s of studies telling us what we should and shouldn’t do, dozens of them are wrong? In theory, there will always be an order of magnitude more studies that are truly helpful.
The problem is, this conclusion assumes a lot. Just because the average study may have a p-value of, say, 2%, it doesn’t mean only 2% of the studies out there are wrong. We have no idea how many studies are performed and not published. Another way of looking at the significance level of an experiment is “How many times does this experiment have to be repeated before I have a high probability of being able to publish the result I want?” This may sound cynical, but I’m not suggesting any dishonesty. This kind of specious statistics occurs innocently all the time because nobody knows how many times the same question has already been tried in the community, an effect called publication bias. Scientists rarely publish null findings, and even if they do, such results are unlikely to get much attention.
Taking 5% as the accepted norm for statistical significance, and supposing there is no real effect, the chance that at least one of n independent studies finds a falsely significant result is 1 - 0.95^n, which first exceeds one half at n = 14 (about 51%). This means only 14 groups need to have independently looked at the same question, in the entire history of medicine, before it’s more likely than not that one of them will find a falsely significant result. Perhaps more problematically, consider that many studies actually look at many variables at once, and it becomes clear that if you just ask enough questions on a survey, you’re virtually guaranteed to have plenty of statistically significant “effects” to publish. Perhaps this is why companies find funding statistical studies so much more gratifying than funding the physical sciences.
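The figure of 14 groups can be checked with a few lines of arithmetic. This is only a sketch, and it assumes the studies are independent and that there is truly no effect to find; the function names are mine.

```python
import math

def prob_at_least_one_false_positive(n_studies, alpha=0.05):
    """Chance that at least one of n independent no-effect studies is 'significant'."""
    return 1.0 - (1.0 - alpha) ** n_studies

def studies_needed(alpha=0.05, target=0.5):
    """Smallest n for which the chance of a false positive reaches the target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - alpha))

print(studies_needed())                      # 14
print(prob_at_least_one_false_positive(14))  # about 0.51
print(prob_at_least_one_false_positive(20))  # about 0.64
```

The same formula covers the survey case: 20 unrelated questions asked in a single study, each tested at the 5% level, give roughly a 64% chance of at least one spurious “effect”.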
None of what I have said so far is likely to be considered novel to anybody involved in clinical research. However, I think there is potentially another, more insidious source of bias that I don’t believe has been mentioned before. The medical research community is basically a big hypothesis-generating machine, and the weirder, the better. There is fame to be found in overturning existing belief and finding counterintuitive effects, so people are biased towards attempting studies where the null hypothesis represents existing belief. However, assuming that there is some correlation between our current state of knowledge and the truth, this implies a bias towards studies where the null hypothesis is actually correct. In classical statistics, the null hypothesis can only be refuted, not confirmed. The 5% significance level only limits how often a true null hypothesis is falsely rejected; it says nothing about what share of the rejections are false. If most of the null hypotheses being tested are in fact true, false rejections can make up a large share of the significant results. Thus, by focusing on studies that seek to overturn existing belief, there may be an inherent bias in the medical profession toward finding false results. If so, it’s possible that a significant percentage of published studies are wrong, far in excess of that suggested by the published significance level of the studies. One might call this predisposition toward finding counterintuitive results “fame bias”. It may be how we get such ludicrous results as “eating McDonald’s french fries decreases the risk of breast cancer,” an actual published result from Harvard.
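To put rough numbers on the “fame bias” argument, here is a small sketch that computes what share of significant results would be false for a given chance that the tested effect is real. The prior probabilities and the 80% power figure (the chance a study detects an effect that really exists) are hypothetical values I picked for illustration; none of them come from any actual study.

```python
def false_share_of_positives(prior_true, power=0.8, alpha=0.05):
    """Share of significant results that are false positives.

    prior_true: assumed chance that the tested effect is real (hypothetical)
    power:      assumed chance a real effect is detected (hypothetical)
    alpha:      significance level, i.e. chance a non-effect looks significant
    """
    false_pos = alpha * (1.0 - prior_true)  # null is true, but rejected anyway
    true_pos = power * prior_true           # effect is real and detected
    return false_pos / (false_pos + true_pos)

for prior in (0.5, 0.1, 0.01):
    share = false_share_of_positives(prior)
    print(f"chance effect is real = {prior}: {share:.0%} of significant results are false")
```

Under these assumptions, when only 1 in 10 of the hypotheses being tested is actually true, 36% of the significant results are false, even though every individual study was run at the 5% level. The less plausible the hypotheses being chased, the worse it gets.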
Statistical studies are certainly appropriate when attempting to confirm a scientific theory grounded in logic and understanding of the underlying mechanism. A random question, however, is not a theory, and using statistics to blindly fish for novel correlations will always produce false results at a rate proportional to the effort applied. As mentioned above, this may be exacerbated by the bias towards disproving existing knowledge as opposed to confirming it. The quality expert W. Edwards Deming (1975) once suggested that the reason students have problems understanding hypothesis tests is that they “may be trying to think.” Using statistics as a primary scientific investigative tool, as opposed to merely a confirmatory one, is a recipe for the production of junk science.