© Bryan Satalino
Some "noncoding" RNAs do encode proteins
Bits of the transcriptome once believed to function as RNA molecules are in fact translated into small proteins.

In 2002, a group of plant researchers studying legumes at the Max Planck Institute for Plant Breeding Research in Cologne, Germany, discovered that a 679-nucleotide RNA believed to function in a noncoding capacity was in fact a protein-coding messenger RNA (mRNA).1 It had been classified as a long (or large) noncoding RNA (lncRNA) by virtue of being more than 200 nucleotides in length. The RNA, transcribed from a gene called early nodulin 40 (ENOD40), contained short open reading frames (ORFs)—putative protein-coding sequences bookended by start and stop codons—but the ORFs were so short that they had previously been overlooked. When the Cologne collaborators examined the RNA more closely, however, they found that two of the ORFs did indeed encode tiny peptides: one of 12 and one of 24 amino acids. Sampling the legumes confirmed that these micropeptides were made in the plant, where they interacted with a sucrose-synthesizing enzyme.

Five years later, another ORF-containing mRNA that had been posing as a lncRNA was discovered in Drosophila.2,3 After performing a screen of fly embryos to find lncRNAs, Yuji Kageyama, then of the National Institute for Basic Biology in Okazaki, Japan, suppressed each transcript's expression. "Only one showed a clear phenotype," says Kageyama, now at Kobe University. Because embryos missing this particular RNA lacked certain cuticle features, giving them the appearance of smooth rice grains, the researchers named the RNA "polished rice" (pri).

Turning his attention to how the RNA functioned, Kageyama thought he should first rule out the possibility that it encoded proteins. But he couldn't. "We actually found it was a protein-coding gene," he says. "It was an accident—we are RNA people!" The pri gene turned out to encode four tiny peptides—three of 11 amino acids and one of 32—that Kageyama and colleagues showed are important for activating a key developmental transcription factor.4

Since then, a handful of other lncRNAs have switched to the mRNA ranks after being found to harbor micropeptide-encoding short ORFs (sORFs)—those less than 300 nucleotides in length. And given the vast number of documented lncRNAs—most of which have no known function—the chance of finding others that contain micropeptide codes seems high.

The hunt for these tiny treasures is now on, but it's a challenging quest. After all, there are good reasons why these itty-bitty peptides and their codes went unnoticed for so long.

Overlooked ORFs

From the late 1990s into the 21st century, as species after species had their genomes sequenced and deposited in databases, the search for novel genes and their associated mRNAs duly followed. With millions or even billions of nucleotides to sift through, researchers devised computational shortcuts to hunt for canonical gene and mRNA features, such as promoter regions, exon/intron splice sites, and, of course, ORFs.

ORFs can exist in practically any stretch of RNA sequence by chance, but many do not encode actual proteins. Because the chance that an ORF encodes a protein increases with its length, most ORF-finding algorithms had a size cut-off of 300 nucleotides—translating to 100 amino acids. This allowed researchers to "filter out garbage—that is, meaningless ORFs that exist randomly in RNAs," says Eric Olson of the University of Texas Southwestern Medical Center in Dallas.
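A back-of-the-envelope calculation illustrates why length is a useful filter. The sketch below assumes, purely for illustration, that each codon following a start codon is drawn independently and uniformly from the 64 possible triplets, 3 of which are stop codons in the standard genetic code. Real transcripts have biased composition, so the numbers are only indicative, and the function name is hypothetical.

```python
# Illustrative model only: codons are assumed independent and uniformly random,
# which real RNA sequences are not.
STOP_CODONS = 3      # UAA, UAG, UGA in the standard genetic code
TOTAL_CODONS = 64

def prob_stop_free(n_codons: int) -> float:
    """Probability that n_codons random codons in a row contain no stop codon."""
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** n_codons

# Compare short, medium, and conventional-cut-off lengths.
for n in (10, 30, 100):
    print(f"{n:>3} codons: {prob_stop_free(n):.4f}")
```

Under this toy model, a stop-free run of 100 codons arises by chance less than 1 percent of the time, whereas a stop-free run of 10 codons arises more often than not. That gap is the statistical logic behind a length cut-off.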

Of course, by excluding all ORFs less than 300 nucleotides in length, such algorithms inevitably missed those encoding genuine small peptides. "I'm sure that the people who came up with [the cut-off] understood that this rule would have to miss anything that was shorter than 100 amino acids," says Nicholas Ingolia of the University of California, Berkeley. "As people applied this rule more and more, they sort of lost track of that caveat." Essentially, sORFs were thrown out with the computational trash and forgotten.
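The effect of the cut-off can be sketched in a few lines of code. This is a minimal illustration, not any particular published ORF finder: it scans only the three forward reading frames, treats the first AUG in a frame as the start, and uses a hypothetical toy transcript. The function name and the `min_nt` parameter are assumptions made for the example.

```python
# Minimal ORF scan over the three forward reading frames of an RNA sequence.
# min_nt mimics the conventional 300-nucleotide cut-off described above.
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}

def find_orfs(rna: str, min_nt: int = 300):
    """Return (start, end, length_nt) for each AUG...stop ORF of at least min_nt.

    length_nt counts the start codon through the last sense codon (stop excluded);
    end is the index just past the stop codon.
    """
    rna = rna.upper().replace("T", "U")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == START:
                start = i                      # open an ORF at the first AUG
            elif start is not None and codon in STOPS:
                length = i - start
                if length >= min_nt:           # the size filter
                    orfs.append((start, i + 3, length))
                start = None                   # resume scanning after the stop
    return orfs

# Hypothetical toy transcript: one 12-codon ORF (36 nt) flanked by filler.
toy = "GGCC" + "AUG" + "GCU" * 11 + "UAA" + "GGCC"
print(find_orfs(toy))             # default 300-nt cut-off
print(find_orfs(toy, min_nt=30))  # relaxed cut-off
```

With the default cut-off the toy transcript yields no ORFs at all; lowering `min_nt` to 30 recovers the 36-nucleotide ORF. The trade-off is that a relaxed cut-off also admits many chance ORFs in real data, which is why additional evidence is needed to confirm that a short ORF is genuinely translated.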

Aside from statistical practicality and human oversight, technical limitations also contributed to sORFs and their encoded micropeptides being missed. Because sORFs are small, those in model organisms such as mice, flies, and fish are less likely than larger ORFs to be hit in random mutagenesis screens, meaning their functions are less likely to be revealed. Also, many important proteins are identified based on their conservation across species, says Andrea Pauli of the Research Institute of Molecular Pathology in Vienna, but "the shorter [the ORF], the harder it gets to find and align this region to other genomes and to know that this is actually conserved."

As for the proteins themselves, the standard practice of using electrophoresis to separate peptides by size often meant micropeptides would be lost, notes Doug Anderson, a postdoc in Olson's lab. "A lot of times we run the smaller things off the bottom of our gels," he says. Standard protein mass spectrometry was also problematic for identifying small peptides, says Gerben Menschaert of Ghent University in Belgium, because "there is a washout step in the protocol so that only larger proteins are retained."

But as researchers take a deeper dive into the function of the thousands of lncRNAs believed to exist in genomes, they continue to uncover surprise micropeptides. In February 2014, for example, Pauli, then a postdoc in Alex Schier's lab at Harvard University, discovered a hidden code in a zebrafish lncRNA. She had been hunting for lncRNAs involved in zebrafish development because "we hadn't really anticipated that there would be any coding regions out there that had not been discovered—at least not something that is essential," she says. But one lncRNA she identified actually encoded a 58-amino-acid micropeptide, which she called Toddler, that functioned as a signaling protein necessary for cell movements that shape the early embryo.5

Then, last year, Anderson and his colleagues reported another. Since joining Olson's lab in 2010, Anderson had been searching for lncRNAs expressed in the heart and skeletal muscles of mouse embryos. He discovered a number of candidates, but one stood out for its high level of sequence conservation, which suggested to Anderson that it might have an important function. He was right that the RNA was important, but for a reason that neither he nor Olson had considered: it was in fact an mRNA encoding a 46-amino-acid micropeptide.6

"When we zeroed in on the conserved region [of the gene], Doug found that it began with an ATG [start] codon and it terminated with a stop codon," Olson says. "That's when he looked at whether it might encode a peptide and found that indeed it did." The researchers dubbed the peptide myoregulin, and found that it functioned as a critical calcium pump regulator for muscle relaxation.

With more and more overlooked peptides now being revealed, the big question is how many are left to be discovered. "Were there going to be dozens of [micropeptides]? Were there going to be hundreds, like there are hundreds of microRNAs?" says Ingolia. "We just didn't know."

Olson suspects the number is quite large. The fact that "myoregulin went below the radar screen for all these years . . . really told us that there's likely to be a gold mine of undiscovered micropeptides out there," he says. "So we are aggressively mining that right now."

Hunting for hidden peptides

In the mid-2000s, Menschaert was working on mass spectrometry protocols to enrich small peptides, which at that time were believed to be cleaved from larger proteins, when he read the papers about the polished rice sORFs. If there is one example of sORF-encoded micropeptides, he thought, there are bound to be others.
FOLLOWING THE CODE: With the advent of genome sequencing technologies, researchers began combing genomes for open reading frames (ORFs). To enrich for genuine protein-coding ORFs and to eliminate those random sequences that by chance were bookended by start and stop codons, most ORF-finding algorithms ignored any stretches shorter than 300 nucleotides. Unfortunately, this also meant that many short ORFs encoding micropeptides were missed. Now, new techniques are helping scientists identify tiny ORFs within what were presumed to be long noncoding RNAs.

To find out if his hunch was correct, Menschaert performed extensive RNA sequencing to identify sORFs and extensive mass spectrometry to find the putative peptides. But it was a slow and painstaking endeavor, as he could only survey a small number of sORFs at a time. Then, in 2009, researchers developed a new, rapid, genome-wide approach called ribosome profiling, which enabled the translation of all ORFs, large and small, to be assessed en masse using next-generation sequencing of ribosome-associated RNA.

The technique was an update of another method called ribosome footprinting, in which researchers would isolate ribosome-associated RNAs, digest them with a nuclease, and then recover and sequence the short fragments of RNA protected from digestion by the bound ribosomes. Mass spec was still required to confirm that the proteins generated from these RNAs actually existed in the cell; even truly noncoding RNAs can sometimes associate with ribosomes by chance. But ribosome footprinting was a straightforward way to identify RNAs that, at the very least, associated with the translation machinery.

Until the past decade of advances in sequencing technology, however, this too was a time-consuming process, says Ingolia. "People had used ribosome footprinting on single, specific messages, but you couldn't apply it to everything that was going on in a cell." Then next-gen sequencing was developed, giving researchers the power to "read hundreds of millions of these footprints at once," says Jonathan Weissman of the University of California, San Francisco.

So he, Ingolia—then a postdoc in his lab—and their colleagues turned ribosome footprinting into ribosome profiling to obtain a global snapshot of translation events across the entire transcriptome. In 2011, the researchers reported that in mouse embryonic stem cells, the majority of lncRNAs transcribed from apparently noncoding regions of the genome were in fact associated with ribosomes.7 "Very early on . . . we could see that we were getting signals outside of the canonical open reading frames," says Weissman.

"That paper was really a milestone in terms of showing that there is a lot of translation outside of [known] coding regions," says Pauli.
