I may have seemed a bit hard on meta-analysis last week, but I should say that there’s really no way to mount a good scientific argument without some form of it. You have to consider results across multiple studies, and come up with some sort of synthesis about what they might mean when taken together. Without this kind of reasoning, each new study would have to be treated as sui generis, and science couldn’t function as a cumulative discipline. Mostly this is done informally, by looking over the relevant papers and thinking about them; making meta-analysis quantitative and algorithmic has the advantage of taking much of the subjectivity out of the process.
The biggest weakness of meta-analysis, in principle, is that it accurately reflects the biases and assumptions of the field at large. Under some circumstances, quantitative meta-analysis can even reveal shortcomings in the literature, as I tried to demonstrate in last week’s discussion of the neurosynth results for “speech perception” vs. “words” and “sentences.” But there are certain kinds of findings that would violate assumptions that motivate the design of most, if not all studies in the field, and these are things that will never be found by even the most sophisticated consideration of the existing literature. You could not discover the law of conservation of matter by conducting a meta-analysis of all of the data accumulated by alchemists. They weren’t asking the right questions to provide that kind of answer. It would be crazy to assert that, in just a few decades of functional neuroimaging research, we’ve figured out exactly the kinds of questions and techniques to carve nature at its joints. We’ve made a lot of progress, but there are still a lot of unknown unknowns. So, that’s one reason meta-analysis has to remain just one arrow in our quiver, and why we have to be prepared to chuck the results of even the most sophisticated meta-analyses if new studies demonstrate that some of our assumptions are not valid.
Another cultural issue that impacts the validity of meta-analysis is the way in which results are reported. We depend to a staggering degree on significant inferential statistics in deciding whether a paper should be published. Inferential statistics are supposed to give us some idea of how likely a particular result was to have occurred by chance, or, positively, how likely it would be to “replicate.” So, when I report that a result is “statistically significant” or “reliable,” that’s supposed to mean that it is very unlikely to have happened by chance, and that, if you re-did my experiment, you’d be very likely to find the same thing.
The problem is that, because journals tend overwhelmingly to publish only those results that are significant, we don’t actually know how reliable many of the effects in the literature are. In the worst case, an unscrupulous scientist could just run the same experiment over and over again, hoping to produce the “correct” result by chance (since, after all, the chance of producing a significant result is never exactly zero, even when the effect is not real). Say this was done a thousand times, and in 999 of those cases, the desired effect was not found, or the effect went in the opposite direction of what was predicted. If the literature only contains the one instance in which the desired outcome was observed, it obviously presents a misleading view.
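The run-it-until-it-works scenario is easy to check with a quick simulation. The sketch below is my own illustration, not anything from the post: the coin-flip “experiment,” the sample size of 30, and the exact binomial test are all assumed choices. Each run tests a fair coin for bias, so there is no real effect and every “significant” result is a false positive:

```python
import math
import random

random.seed(1)

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all
    outcomes whose probability under the null is <= that of k."""
    pmf = lambda j: math.comb(n, j) * p**j * (1 - p)**(n - j)
    return sum(pmf(j) for j in range(n + 1) if pmf(j) <= pmf(k) + 1e-12)

def null_experiment(n=30):
    """One 'study': flip a fair coin n times and test it for bias.
    Since the coin really is fair, any significant result is spurious."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return binom_two_sided_p(heads, n)

runs = 1000
significant = sum(null_experiment() < 0.05 for _ in range(runs))
print(f"{significant} of {runs} null experiments came out 'significant'")
```

Over 1000 null runs, a few percent come out “significant” by chance alone; a literature that keeps only those runs, and file-drawers the rest, misleads in exactly the way described above.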
But you don’t need willful misrepresentation like p-hacking or HARKing to create this kind of problem. More often, in fact, the “file drawer” problem arises when an interesting, counterintuitive result appears in the literature, and then lots of labs try to reproduce it as the first step in designing new experiments based on the finding. Often, the result cannot be reproduced, and the failure to replicate ends up getting hidden away in the file drawer. This happens because you can’t easily publish failures to replicate, even if most people in the field believe that the study you are trying to replicate is obviously not right.
There are many good posts elsewhere about the file-drawer problem, and even a virtual file drawer where people can register and post failed replications of psychology experiments. This discussion has spilled out of the blogs and into the literature, perhaps most notably in a special section of Perspectives on Psychological Science, including an article by meta-analysis aficionado John Ioannidis on, among other things, how the file drawer problem impacts meta-analysis. (See also the special topic in Frontiers in Computational Neuroscience.) It is very exciting that we are having a serious discussion about how to resolve this issue by changing the criteria for publication, creating specific venues for the publication of replication studies (successful and failed), and developing models for post-publication commentary and annotation. In short, a cultural shift in how scientists publish and report data is under way. One can optimistically look forward to a time when too-good-to-be-true findings are joined in the literature by failed replications, and these failed replications find their way into an open annotation system, instead of hiding in the “lore” of a subdiscipline.
One strategy for getting null results into the literature is the Registered Report model that Cortex has begun to experiment with. On this model, scientists submit an introduction motivating their experiment, and a full experimental design with planned analyses before collecting any data. If the study is deemed sound via a peer review process, the article is accepted in principle, meaning that however the results turn out, it will be accepted as long as the authors follow their proposed methods and analyses exactly.
This is a fantastic idea, because it reduces the degrees of freedom for reporting. As Chambers puts it: “Whether an experiment supports the stated hypothesis is the one aspect of science that scientists (should) have no control over – yet the traditional publishing model encourages a host of dodgy practices to exert such control.” Registered reports are judged on the things that the experimenter rightly has control over — the design, the analysis, the hypotheses tested, etc.
I don’t worry that a proliferation of failed replications, null results and inconclusive experiments will negatively impact the signal-to-noise ratio in the literature. I think it will give us a better idea of what the signal actually is by providing a clear view of the noise. I do, however, think it would be unhealthy for the field if this model completely took over. There are many cases in which researchers start out asking a specific question, find inconclusive effects, and then play around with some exploratory analyses to see what else they can find and come up with something interesting. The only problem with this is when, because of the way journal articles are structured, the scientists then need to pretend that their experiment was designed to make the serendipitous finding. The Cortex model anticipates this by encouraging researchers to report such exploratory analyses, but segregating them from the main results and conclusions. This is as it should be. Exploratory results typically involve multiple tests, and are thus less likely to be truly reliable than planned comparisons even if they are very compelling and produce “reliable” effects when tested. It’s a bit like calling your shots in pool: if a ball drops, but you didn’t call it, it is unlikely that, given the same situation, you would be able to sink that same ball again.
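The multiple-testing point can be made concrete with a small simulation. This is my own illustrative sketch, not anything prescribed by Cortex: the 20 tests and the .05 threshold are assumed numbers. Under the null hypothesis, p-values are uniform on [0, 1], so we can draw them directly rather than simulating full experiments:

```python
import random

random.seed(0)

def exploratory_session(n_tests=20, alpha=0.05):
    """One dataset probed with n_tests uncorrected exploratory tests,
    all under the null; returns True if any test is 'significant'."""
    return any(random.random() < alpha for _ in range(n_tests))

sessions = 10_000
hits = sum(exploratory_session() for _ in range(sessions))
print(f"At least one 'significant' finding in {hits / sessions:.0%} of sessions")
```

Analytically, the chance of at least one false positive across 20 uncorrected tests is 1 − 0.95^20 ≈ 64%, which is why a ball sunk on an uncalled shot is such weak evidence.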
The more serious concern is that review of initial submissions will exaggerate the impact of reviewers’ assumptions and conservatism on what experiments are even attempted. This certainly seems to be the case in the NIH review process (where, essentially, reviewers are asked to decide whether experiments should be funded based on a description similar to what is envisioned for registered reports). Despite the fact that “Innovation” is a key review criterion, there is a strong feeling among researchers that reviewers heavily favor projects that are consistent with their biases and assumptions. This is hard to prove, but certainly jibes with my experience.
There is an approach to doing science that would be impossible under this model. Let’s call it “Bee Science,” based on this post from Language Log, and the Blackawton Bees project. If you’re von Frisch and colleagues, or a group of clever elementary school students, you don’t start with a clear set of hypotheses, planned analyses, or even, necessarily, a fixed methodology. You watch the bees. You wonder about what they’re doing. You take painstakingly accurate notes. You improvise. Then you think about what you’ve found, and start trying some things out. Starting from a position of naïveté — either because there is not much known about your area, or because you are eight years old, or because you are willfully practicing counter-induction — creates incredible freedom, and is an important path toward innovation.
Maybe it will be important in the scientific literature of the future to recognize a genre distinction between the Registered Report, and something I’ll call a Narrative Account. Narrative Accounts would be more free-form. They would contain reports of failed experiments and null effects, and provide enough detail about methods to permit replications, but they could be freely structured post-hoc to emphasize what the authors thought to be the most interesting results, and the results most likely to open up new directions for future research. It’s clear that in the current literature, authors calling their shots after seeing which balls they’ve sunk is a huge problem. This doesn’t mean we shouldn’t let people take a wild crack at the table once in a while, just to shake things up. Sometimes that’s the only way to find new angles. The Narrative Account would provide a means to say, essentially: “We tried a lot of stuff, some of it was interesting, maybe some of it would replicate. Check it out!” This is what much scientific work is actually like, and I think most scientists would be happy to be freed from the requirement to pretend it is otherwise.
Image credit: aussigal on Flickr.