Here's a short quiz concerning several popular findings from different subfields of psychology. True or False?

1. Brain training games strengthen cognitive skills in ways that generalize to everyday life tasks.

2. Selective serotonin reuptake inhibitors (SSRIs), the class of anti-depressants that includes Prozac, Zoloft, and Paxil (among others), are more effective than older anti-depressants and are significantly better than placebo for most people with mild or moderate depression.

3. Standing in a "power pose" prior to a job interview, hands on your hips or interlocked behind your head, increases testosterone production as well as the odds of being selected for the job.

4. Most types of psychotherapy are helpful because they share a set of common factors, such as empathy, not because of specific methods unique to each approach.

5. Girls and women perform better on math tests when they are told the test doesn't really measure anything about their true math ability. This is one variant of the stereotype threat effect.

6. If you are asked to resist eating a freshly baked chocolate cookie on a nearby plate for 15 minutes, your performance on a cognitive test is likely to diminish — a phenomenon called ego-depletion.

7. The Stanford Prison Study, in which participants were randomly assigned to be guards or prisoners in a mock prison, showed definitively that specific contexts can lead people to act sadistically.

And now the answers:

1. False. Brain training games appear to simply make people better at brain training games, without strengthening cognitive skills in other areas of life. That hasn't stopped a multimillion-dollar industry from claiming otherwise, based on weak and unreliable evidence. To learn more, read this excellent essay by Atlantic science writer Ed Yong.

2. False. The SSRIs are generally no more effective than older anti-depressants, nor are they more effective than placebo for most people with mild or moderate depression. That's not anything you're likely to read in the literature from the pharmaceutical companies, however. In fact, until some investigative reporting turned up a trove of unpublished studies some years ago, there was no way to know that studies demonstrating the superiority of the SSRIs were not representative of the larger body of unpublished research. The selective publishing of findings is called publication bias, and it's a big problem in all of science, not just psychology.

3. False. Excitement about power poses skyrocketed after a TED talk by former Harvard Business School psychologist Amy Cuddy went viral. It's been seen 53 million times as of this writing. There's just one problem: the data behind power poses simply doesn't hold up. The evidence is so weak, in fact, that Dr. Cuddy's collaborator disavowed the research and concluded that the power pose effect simply isn't real. You can read an excellent and balanced article about the case in the New York Times by Susan Dominus.

4. We still don't know. We know that psychotherapy works well for the majority of people who make use of it; however, despite decades of research costing millions of dollars, and hundreds of articles and books claiming the greater importance of "common factors" or model-specific techniques, we still can't really say with confidence why therapy is helpful. In a 2019 systematic review, psychologist Pim Cuijpers and his colleagues concluded that methodological problems such as small samples and inadequate study designs precluded any clear resolution of the debate between advocates of the so-called common factors view and those who prioritize the techniques specific to different therapies.

5. It's unclear but increasingly appears to be false. Stereotype threat was once widely believed to explain the performance gap on standardized tests of black versus white students, and on math tests of girls versus boys. It was a groundbreaking line of research, and generated a lot of excitement. However, more recent studies using rigorous designs and larger samples have failed to find evidence of the effect. There's a great podcast on the rise and fall of stereotype threat research from the good folks at Radiolab that's worth a listen. It's called Stereothreat.

6. False. Go ahead and eat the cookie. It won't affect your performance. Although earlier studies on ego-depletion made a big splash, larger replication studies have failed to find evidence for the phenomenon. In that same Radiolab episode, Stereothreat, the decline of the ego-depletion effect is wonderfully brought to life.

7. False. The Stanford Prison Study has long been considered one of the most influential studies in social psychology. However, it has emerged that Philip Zimbardo, the head of the study, actually coached the guards to be cruel, thus helping to bring about the very effect he claimed was caused by the mock prison context. To read more, here's an excellent essay by Brian Resnick.

What do these research topics have in common? They are part of a growing list of widely accepted "truths" in psychology that have failed the test of time — or to be more accurate, they have failed the test of replication with large samples and rigorous designs, or alternatively, they have simply not withstood close scrutiny of their methods. They are a part of what has come to be known more generally as the replication crisis in psychology.

The replication crisis got its name from a massive undertaking by Brian Nosek and his many colleagues in the Reproducibility Project, in which they sought to replicate 100 studies from three high-ranking psychology journals. Their results, published in the highly esteemed journal Science, sent shockwaves through the field: they were able to replicate the original findings in only about half of the studies. A few years later, a group of researchers responded to criticisms of the Reproducibility Project by conducting another set of replications, known as Many Labs 2, with even greater methodological rigor. The result? Of the 28 studies they attempted to replicate, only 14 (50%) showed an effect comparable to the original studies.

Why does the replication crisis matter? It's certainly stirring up something of a tempest. Prominent psychologists have written impassioned op-eds in the New York Times denouncing the very idea of a "crisis", while others have responded that in fact, psychology has a real and pervasive problem determining what is and isn't true. The failure to replicate is so common that the science writer Ed Yong concluded, "It seems that one of the most reliable findings in psychology is that only half of psychological studies can be successfully repeated."

The replication crisis, which might more broadly be called a crisis of trustworthiness, matters because every semester, we require thousands of psychology students to memorize hundreds of research findings and regurgitate them on multiple-choice exams, without carefully examining the methodological strengths and limitations of the studies that produced those findings. It matters because policymakers, clinicians, and ordinary consumers of psychological research make important decisions based on what they read, or what they see in TED Talks and other social media. It matters because researchers spend tens of millions of dollars of taxpayer money on research using designs that are often inadequate to answer the questions they are asking.

The Reproducibility Project and Many Labs 2 only looked at a total of 128 studies. That's a tiny percentage of studies conducted every month in psychology departments around the world. But the problem of replicability, and more broadly of low trustworthiness, extends well beyond that relatively small number of studies. As the examples in the first part of this post illustrate, some of the biggest findings in psychology have fallen in recent years in the face of increased methodological rigor and an increasingly critical look at the original methods used to achieve those big findings.

What contributes to the failure of studies to replicate? Why do findings that seem true so often turn out to be unreliable? Two of the biggest culprits are small sample sizes, and the preference of researchers and journal editors to publish "significant" findings — the so-called publication bias.

Findings based on small samples are notoriously unreliable. Small samples have two main hazards: either they are too small to detect meaningful effects that do exist (so-called false negatives), or they detect effects that are not actually real (false positives). Unfortunately, larger studies require more resources (time, money, participants), so the lure of small samples is powerful, especially for researchers working under the pressure of a tenure clock and the urgency to publish. The widespread publication of small-sample studies leads to the proliferation of unreliable findings, including a lot of false positives that eventually disappear when studies are replicated with sufficiently large samples.
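
To make the hazards concrete, here's a small simulation sketch in Python (my own toy setup, not data from any real study). It assumes a true but modest effect and shows that small studies usually miss it, and that the small studies that do cross the significance line tend to exaggerate its size, which is one reason small-sample findings so often shrink or vanish on replication.

```python
# A minimal sketch (hypothetical numbers) of why small samples mislead:
# with a true but modest effect (a standardized difference of 0.3), small
# studies usually fail to detect it, and the ones that do reach p < .05
# overestimate how big it is.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TRUE_EFFECT = 0.3      # assumed true standardized mean difference
N_SIMS = 5000          # number of simulated studies per sample size

def simulate(n_per_group):
    significant_estimates = []
    for _ in range(N_SIMS):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(TRUE_EFFECT, 1.0, n_per_group)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            # both groups have SD = 1, so the mean difference is the effect size
            significant_estimates.append(treated.mean() - control.mean())
    power = len(significant_estimates) / N_SIMS
    inflation = np.mean(significant_estimates) / TRUE_EFFECT
    print(f"n = {n_per_group:3d} per group: "
          f"detects the effect {power:.0%} of the time; "
          f"significant estimates average {inflation:.1f}x the true effect")

for n in (20, 50, 200):
    simulate(n)
```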

There's also a lot of pressure on researchers to publish so-called "significant" findings — findings with a p-value below .05, meaning that results at least as extreme would turn up less than 5% of the time if there were no real effect. Why 5%? There's actually no compelling reason; it's the arbitrary number that the field has come to view as the dividing line between what is true and what is not. With a small sample, just a few high- or low-scoring participants can easily tip a finding below or above that critical 5% line. And with a bit of creativity (also known as "p-hacking"), outliers (extreme scores) can be dropped, analyses can be rerun just a bit differently, and a non-significant finding suddenly crosses the magical threshold of p<.05. The finding may have no real-world significance whatsoever, but it now has that much-heralded status: statistical significance. The odds of the study being submitted for publication, and of it being accepted, have just increased substantially. With large samples, the problem of p-hacking still exists, but it's harder to pull off because the weight of any single participant is greatly diminished by the size of the sample.
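
And here's a toy illustration of the p-hacking just described (again my own sketch, not anyone's actual analysis): with no true effect at all and only 15 participants per group, allowing yourself to re-run the test after dropping a couple of "outliers" from whichever group is convenient pushes the false-positive rate above the advertised 5%.

```python
# A toy sketch (hypothetical setup) of p-hacking by outlier removal:
# there is NO true effect, but taking the best p-value across a few
# "reasonable-looking" analyses inflates the false-positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N_PER_GROUP, N_SIMS = 15, 10_000

def best_p(a, b):
    """Smallest p-value across the planned test and two 'outlier-trimmed' reruns."""
    candidates = [stats.ttest_ind(a, b).pvalue]
    for x, y in ((a, b), (b, a)):
        trimmed = np.sort(x)[1:-1]   # quietly drop that group's lowest and highest score
        candidates.append(stats.ttest_ind(trimmed, y).pvalue)
    return min(candidates)

planned = flexible = 0
for _ in range(N_SIMS):
    a = rng.normal(0, 1, N_PER_GROUP)
    b = rng.normal(0, 1, N_PER_GROUP)     # same distribution: any "effect" is pure noise
    planned += stats.ttest_ind(a, b).pvalue < 0.05
    flexible += best_p(a, b) < 0.05

print(f"false-positive rate, single pre-planned test: {planned / N_SIMS:.3f}")
print(f"false-positive rate, 'flexible' analysis:     {flexible / N_SIMS:.3f}")
```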

Publication bias, another major culprit, is a serious threat to the integrity of psychological science. Imagine that five researchers study the impact of eating an apple on math test performance. Four of the studies find no effect at all, while one finds that apple eating is modestly related to better test performance: 80% of the studies find nothing, and the remaining one finds only a modest effect. Now assume that the four researchers who didn't find the "apple effect" decide not to publish their findings, because they perceive, accurately, that publishing "non-significant" findings will be difficult and may also be viewed negatively by their colleagues. So the only study that gets published is the one that did find the "apple effect". To the public, it appears as though only one study has examined the impact of apples on math performance, and that study found it to be real. Soon the media are advocating eating apples before math exams, and the apple effect is out in the world. Eventually, someone will probably do a large replication study of the apple effect and find that apples really don't influence performance on math tests, and the phenomenon will fade away, but not before a lot of time, money, and effort have been expended on related studies all premised on this one misleading finding (e.g., How about pears? Bananas? What about citrus fruit? Do apples also help with other types of exams?).
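
You can even watch the apple effect being born in a few lines of code. The sketch below (entirely hypothetical numbers) runs 500 small apple studies of an effect that is truly zero, then "publishes" only the ones that happen to come out positive and statistically significant; the published literature ends up showing a healthy-looking benefit that doesn't exist.

```python
# A sketch of the apple-effect thought experiment (all numbers invented):
# simulate many small studies of an effect that is exactly zero, publish
# only the "significant" positive ones, and compare what the full set of
# studies shows with what the published record shows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N_STUDIES, N_PER_GROUP = 500, 30

all_effects, published_effects = [], []
for _ in range(N_STUDIES):
    no_apple = rng.normal(70, 10, N_PER_GROUP)   # math scores without an apple
    apple = rng.normal(70, 10, N_PER_GROUP)      # same distribution: apples do nothing
    diff = apple.mean() - no_apple.mean()
    all_effects.append(diff)
    if diff > 0 and stats.ttest_ind(apple, no_apple).pvalue < 0.05:
        published_effects.append(diff)           # only these studies see print

print(f"average effect across all {N_STUDIES} studies run: {np.mean(all_effects):+.2f} points")
print(f"average effect in the {len(published_effects)} published studies: "
      f"{np.mean(published_effects):+.2f} points")
```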

This is publication bias, and it's a serious problem. We can only make evidence-based decisions based on published findings; if those findings are just a partial, non-representative sample of what's actually been found, we are likely to reach inaccurate conclusions. Systematic reviews, which summarize the state of our knowledge on a given topic, are routinely used by practitioners and policymakers in their decision making. But systematic reviews generally only include the findings of published research. Publication bias threatens the integrity of such reviews by causing all those unpublished "non-significant" findings to be excluded. Using methods recently developed to estimate and adjust for the effect of publication bias in systematic reviews, researchers have found that seemingly powerful effects are often reduced dramatically, in some cases to non-significance.
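
For the curious, here's a rough sketch of the intuition behind one family of bias-detection and adjustment tools, funnel-plot asymmetry checks; the specific methods those researchers used may differ, so treat this purely as illustration. When only significant results survive, smaller (noisier) published studies tend to report larger effects, and that tell-tale correlation between a study's precision and its effect size is what such methods exploit.

```python
# A rough sketch (my own, not a method named in this post) of the intuition
# behind funnel-plot-style asymmetry checks: simulate 2,000 studies of a
# zero effect, let only the "positive and significant" ones get published,
# and see that among the survivors, noisier (smaller) studies report
# larger effects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

published_effects, published_ses = [], []
for _ in range(2000):
    n = int(rng.integers(10, 200))         # studies of very different sizes
    control = rng.normal(0, 1, n)
    treated = rng.normal(0, 1, n)          # true effect is exactly zero
    diff = treated.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / n + treated.var(ddof=1) / n)
    if diff > 0 and stats.ttest_ind(treated, control).pvalue < 0.05:
        published_effects.append(diff)     # only these studies see print
        published_ses.append(se)

r, _ = stats.pearsonr(published_ses, published_effects)
print(f"'published' studies: {len(published_effects)} of 2000")
print(f"correlation between a study's standard error and its effect size: r = {r:.2f}")
# A strong positive correlation among published studies is the asymmetry that
# adjustment methods use to estimate how much the literature overstates an effect.
```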

So where does that leave us? How do we know what findings are trustworthy in psychology? Unfortunately, there's no easy answer. Many studies have been replicated successfully and can be considered robust. The Many Labs 2 replication project and Reproducibility Project both list all of the studies they were able to successfully replicate (as well as those they were not). There are countless other studies in all areas of psychology that have held up well over time and across replications. On the other hand, some popular studies may seem reliable, but have not yet been subjected to rigorous replication.

As a general rule, look for replications: has the study been repeated by other researchers, and did they get similar results? Look to see whether the researchers provide a rationale for the number of participants in their study and whether they raise any concerns about their sample size. Good researchers will be extremely cautious about any conclusions they reach based on a small number of participants. There's no hard-and-fast rule about how many participants are needed, since the number required for a trustworthy finding varies according to what's being studied and the methods being used. If only one research team has found an effect, and has done so with a small sample, I'd suggest a bit of caution, perhaps viewing the findings as provisional.

Of course, replication and large samples still don't guarantee that a study's findings are trustworthy. A host of other biases, from sampling to research design to the type of statistical analyses run, all influence the validity of a study's findings. A great example is the so-called bystander effect, the well-known finding that the greater the number of people who witness a person in distress, the lower the likelihood that anyone will intervene. It's been well studied in the lab; unfortunately, it lacks ecological or "real world" validity. In a recent study of real-world situations, researchers found just the opposite of the bystander effect: the greater the number of witnesses to people in distress, the more likely someone was to take action. And, parenthetically, in the original incident that gave rise to bystander research, the rape and killing of Kitty Genovese reportedly in front of 38 witnesses, a new report reveals that, contrary to what's written in most textbooks, several people did call the police and come to her aid.

In a future post, I'll consider several ways in which psychology can strengthen the trustworthiness of its findings. I'll also highlight studies that have withstood scrutiny, successfully been replicated, and are rightfully seen as exemplars of trustworthy psychological science.