I was on a rooftop at an OHBM party, chatting with Niko Kriegeskorte about the ineluctable cruelty and arbitrariness of nature when my partner texted from back in NYC to say she had just seen Werner Herzog on the subway. As an enthusiast of Hebbian learning, I can’t help but notice how coincidences, even when they have no intrinsic meaning, can really sear something into your memory.
The recent eruption of discussion about what should be done about shoddy fMRI studies reminded me of some of what we said. In particular, I don’t think the conversation so far has acknowledged the degree to which the publication of over-cooked science is motivated by a fear of failure that has been amplified out of control by the tight margins with which neuroimaging work is carried out, and the unforgiving standards of productivity against which we are measured at every stage of our careers.
Up on the roof, Niko and I were discussing the role of failure in research, especially in training students. The first independent project one does in graduate school is, for many, a first direct encounter with an impartial, implacable universe that doesn’t owe you an interesting result just because you are extremely bright and worked very hard for years on a project. Failure is hard to take after a career of schooling in which rewards accrue for meeting course requirements — or for bargaining with professors who have more incentive to get good teaching evaluations than to give fair grades. But failure is part of the scientist’s job: if we are asking interesting questions, and pushing at the boundaries of what is known, our inductions and intuitions are bound to be off, often by a wide margin, and the data should tell us this.
Some projects are like the Amazon basin, swallowing up the research team like a band of malarial would-be conquistadors, floating downstream on a monkey-infested raft. It’s not just naivete, though, or a hyperactive imagination fueled by the early Kinski/Herzog collaborations that makes failure so crushing for people at this stage. The fact is that building a career depends on generating some sort of publishable result, because no one is going to hire you for a postdoc, much less a permanent faculty position, if you have no publications.
This pressure is only amplified later in one’s career, when you are advising multiple students and postdocs, all of whom “need” to publish something sooner rather than later. But by this point the constant anxiety that your projects will fail to produce anything interesting or publishable is so thoroughly eclipsed by the anxiety that your grant applications will fail to be funded that you hardly notice it. One day you become capable of saying things like “well, there was always a chance it wouldn’t work; that’s why it’s called an experiment” to brokenhearted trainees.
As in any industry, academic science has productivity metrics, and decisions about hiring, funding, promotion, tenure, etc. are largely based on these. Further, contrary to the “education crisis myth,” there is actually a huge glut of science PhDs in the US at the moment, due in large part to the doubling of the NIH budget between 1998 and 2003, and the abrupt halt in increases thereafter. Here’s a description of how bad things were five years ago, before the financial crisis. They haven’t improved since.
So, you need to be productive just to stay in the game, and productivity means publishing a lot of papers. And yet publishing depends on many things that are largely beyond your control. Even armed with interesting results, there’s always the possibility that your manuscript will come up against a reviewer (usually Reviewer #3) whose theoretical framework is threatened by your findings, or who just doesn’t like the way you’ve framed your argument, or, for whatever reason, is going to insist that you do more experiments.
All of that falls under the category of human cruelty and arbitrariness, and my topic for the moment is nature’s cruelty and arbitrariness. As in war, where no plan survives contact with the enemy, in science, no hypothesis survives contact with actual data.
There is an easy way around this: making shit up. By which I don’t necessarily mean Stapeling or Hauserizing your data. There are actually a lot of different ways researchers put their fingers on the scale, many of which, though quite common and innocuous-seeming, can greatly inflate the likelihood of finding a false positive. A recent paper by Simmons et al. (2011) (see also discussion here) demonstrates just how badly these practices can distort our estimates of how likely our results are to have arisen by chance. They conducted simulations to explore the effects of selectively reporting which variables they collected, making ad hoc decisions about when to stop collecting data (i.e., stopping when they had a significant effect), and applying multiple different statistical controls to a common statistical test. As an example, they conducted analyses that showed, with great statistical reliability, that listening to a song about being old — “When I’m Sixty-Four” — makes people chronologically older than listening to some random Microsoft tune. That is, the actual ages of the people in the experimental group were found to be greater than the ages of the people in the control group.
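The selective-reporting trick is easy to simulate yourself. Here is a minimal sketch (my own, not the Simmons et al. code; all function names are mine): measure several unrelated outcome variables on the same two groups drawn from identical distributions, and count the experiment as a “finding” if any one of them crosses the p < .05 threshold.

```python
import math
import random

def significant(group_a, group_b):
    # Two-sample z-test with known unit variance (exact for our
    # simulated standard-normal data), two-tailed alpha = .05.
    na, nb = len(group_a), len(group_b)
    z = (sum(group_a) / na - sum(group_b) / nb) / math.sqrt(1 / na + 1 / nb)
    return abs(z) > 1.96

def best_of_k(rng, n=20, k=3):
    """One null experiment measuring k unrelated outcome variables on
    two groups of n subjects each; count it as a 'finding' if ANY
    outcome crosses the threshold."""
    return any(
        significant([rng.gauss(0, 1) for _ in range(n)],
                    [rng.gauss(0, 1) for _ in range(n)])
        for _ in range(k)
    )

rng = random.Random(0)
n_sims = 2000
rate = sum(best_of_k(rng) for _ in range(n_sims)) / n_sims
print(f"false-positive rate reporting the best of 3 outcomes: {rate:.3f}")
# Analytically 1 - 0.95**3, about .14, when the outcomes are independent.
```

With three independent outcomes and the freedom to report whichever one “worked,” the nominal 5% error rate nearly triples, and nothing about any single reported test would look amiss to a reviewer.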
While Simmons et al. were deliberately (and hilariously) cooking the data to make a point, it is easy to see how less extreme versions of this can happen in practically any lab. It is very common to collect an arbitrary amount of data, and stop collecting only when the experiment is producing publishable results. This is probably the most benign-seeming of the tactics they explored, and it is, as far as I know, standard operating procedure in many labs. Their analyses suggest that it can have a pretty dramatic impact on false positives, in the worst case creating about a 20% chance of finding a “significant” result from pure noise.
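To see the optional-stopping effect concretely, here is a small simulation sketch (again my own code, under my own naming, not the authors’ implementation): both groups are drawn from the same distribution, so any “effect” is pure noise, and we compare a fixed a priori sample size against a rule that re-tests after each batch of added subjects and stops as soon as the result looks significant.

```python
import math
import random

def significant(group_a, group_b):
    # Two-sample z-test with known unit variance, two-tailed alpha = .05.
    na, nb = len(group_a), len(group_b)
    z = (sum(group_a) / na - sum(group_b) / nb) / math.sqrt(1 / na + 1 / nb)
    return abs(z) > 1.96

def one_null_experiment(rng, peeks):
    """Both groups come from the SAME distribution. Test after reaching
    each sample size in `peeks`; return True if any peek looks
    significant (i.e., we stop collecting and 'publish')."""
    a, b = [], []
    for n in peeks:
        while len(a) < n:
            a.append(rng.gauss(0, 1))
            b.append(rng.gauss(0, 1))
        if significant(a, b):
            return True
    return False

def false_positive_rate(peeks, n_sims=2000, seed=1):
    rng = random.Random(seed)
    return sum(one_null_experiment(rng, peeks) for _ in range(n_sims)) / n_sims

fixed = false_positive_rate(peeks=[50])               # a priori n = 50 per group
peeked = false_positive_rate(peeks=[20, 30, 40, 50])  # add 10 subjects, re-test
print(f"fixed sample size:  {fixed:.3f}")
print(f"optional stopping:  {peeked:.3f}")
```

The fixed-n rate hovers near the nominal 5%, while even this modest peeking schedule roughly doubles it; with more frequent peeks and no cap on the final sample size, the inflation climbs toward the figures Simmons et al. report.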
I have certainly done this in the past, without having the slightest idea of how wildly the data can be distorted by failing to set an a priori sample size. Mostly it happens when the first, planned run of an experiment produces a promising but non-publishable result. If there were a fixed sample size in place, the experiment would stop there; maybe we would do some exploratory analyses to see if there was anything weird about the study, but then we would have to start over, collecting an entirely new (and probably somewhat larger) data set. That is a hard thing to do with the clock ticking on whatever deadline is looming ahead of you (and there is always something: a dissertation date? grant submission? tenure review?). In practice it always seems more efficient to just keep adding subjects until the result you “know is true” comes out in publishable form.
Further, running experiments is often expensive, and collecting a new data set can sometimes be impractical, e.g., if the semester is ending and the “participant pool” of first-year psychology students conscripted into participating in research will run out. If you’re running an fMRI study, the last thing you want to do is throw out tens of thousands of dollars worth of data and then re-run the same experiment. Finally, people don’t often do this because it would be a bit depressing to complete an experiment, get an interesting result, and then go back and do more or less exactly the same experiment over from scratch because it failed to reach statistical significance. After all, either we will replicate the experiment — in which case we can only publish it once anyway — or we will fail to replicate, in which case we really shouldn’t publish the results from the time it “worked.”
It’s a commonplace that failure builds character. It’s also true that scientists as a group have unusually high standards for integrity and honesty, not to mention persistence and dogged hard work. But the pressure to maintain a high level of productivity — measured in terms of number of publications — combined with the wide variety of levers available to make data look more interesting than they are, creates perverse incentives for intellectual dishonesty. I don’t see this changing unless we give people more permission to fail.
Image credit: A normal distribution, by Leah Beeferman.