proteins
© Argonne National Laboratory
Starting with a database of all known structural features in proteins, researchers have performed a network analysis based on their similarities. The surprising result is that almost every protein we know of is related to the rest by seven degrees of separation.

Life requires lots of chemicals, from the DNA and RNA that carry genetic information to the lipids that keep the contents of cells separated from their environment. But it's fair to say a lot of the action involves proteins, which do everything from catalyzing chemical reactions to providing structural scaffolding for various parts of the cell. All these different functions are dependent upon how the protein is organized in three dimensions, which occurs through a process called protein folding.

The dizzying variety of known proteins is generated by linking together chains built from 20 common amino acids (and a few rare variations on those). When you consider how quickly the number of possible combinations of these amino acids increases as the length of the protein does, however, it should be clear that the proteins that exist occupy only a small portion of the potential protein space. So, in this view, evolution has generated the rare, useful solutions within a sea of possibilities.
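To get a feel for how fast that space grows, here is a minimal Python sketch. It assumes only the 20 common amino acids and ignores the rare variants; the function name `sequence_space` is made up for illustration.

```python
# Number of distinct chains of a given length built from the 20 common amino acids.
# Rare amino acid variants are ignored here for simplicity.
AMINO_ACIDS = 20

def sequence_space(length: int) -> int:
    """Count the possible amino acid sequences of the given length."""
    return AMINO_ACIDS ** length

# Report how many digits the count has for a few chain lengths.
for n in (10, 50, 100):
    digits = len(str(sequence_space(n)))
    print(f"length {n}: a {digits}-digit number of possible sequences")
```

Even a modest chain of 100 amino acids has more possible sequences than any process could sample exhaustively, which is the point the "sea of possibilities" view rests on.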

But that's not the only way to look at things. The alternative view is that the backbone that links together the amino acids can adopt only a limited number of well-defined structures, such as spirals called α-helices and flat ribbons called β-sheets, and that there are a relatively limited number of ways to combine these elements into a larger arrangement that biologists call a fold. A new analysis of protein folds has now suggested that almost every existing fold fits into a network where it's possible to link any protein fold with any other through a series of seven or fewer steps, each of which goes through another, existing protein fold.

The basic idea behind the paper is to use databases that contain the three-dimensional coordinates of all the atoms in every protein for which we know the structure. Using this data, researchers have identified a large collection of folds, which the authors define as "a particular spatial arrangement of α-helical and/or β-sheet secondary structures."

That's where the new paper (and a hefty dose of computing power) comes in. The authors went through and compared every known fold with every other one in a pairwise fashion, calculating the degree to which the folds are related, using a measure called the "TM-score." (This paper seems to have more details on the alignment.) Because TM-scores have a known measure of statistical significance - the best alignment of two random proteins is 0.3, with a standard deviation of 0.01 - the authors required a score of 0.4 before they'd consider two folds to be related.
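The thresholding step can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the fold names and scores below are invented, and a real analysis would compute each score with a structural alignment program rather than look it up in a dictionary. The constants 0.3, 0.01, and 0.4 are the figures quoted above.

```python
from itertools import combinations

RANDOM_MEAN = 0.3   # best alignment score of two random proteins (quoted above)
RANDOM_SD = 0.01    # standard deviation of that random score (quoted above)
THRESHOLD = 0.4     # score the authors required to call two folds related

def sds_above_random(score):
    """How many standard deviations a score sits above the random baseline."""
    return (score - RANDOM_MEAN) / RANDOM_SD

def related_pairs(folds, scores, threshold=THRESHOLD):
    """Return every pair of folds whose pairwise score meets the threshold."""
    edges = []
    for a, b in combinations(folds, 2):
        # Pairs missing from the table are treated as unrelated (score 0.0).
        score = scores.get((a, b), scores.get((b, a), 0.0))
        if score >= threshold:
            edges.append((a, b))
    return edges

# Hypothetical folds and scores, invented purely for illustration.
folds = ["A", "B", "C", "D"]
scores = {("A", "B"): 0.55, ("B", "C"): 0.42, ("A", "C"): 0.31, ("C", "D"): 0.47}

edges = related_pairs(folds, scores)
print(edges)
print(round(sds_above_random(THRESHOLD), 6))
```

With these numbers, the 0.4 cutoff sits about ten standard deviations above the random baseline, which is why passing it is treated as a meaningful relationship. The resulting list of pairs is what a network analysis would take as its edges.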

They then performed network analysis to create clusters of related folds. The surprise was that they got what's in essence a single, densely-packed cluster. Over 80 percent of the total fold-space was within four hops (where each hop brings you to another related fold) from the rest of the cluster. If you extend out to seven hops, you can incorporate over 98 percent of the known protein folds. This densely packed graph remained even after the authors had eliminated all the proteins that were known to be related via evolution.
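Counting hops in a network like this is a standard breadth-first search. The sketch below runs on a toy graph invented for illustration, and it measures the fraction of fold pairs reachable within a given number of hops, which is one plausible reading of the statistic and not necessarily the exact measure used in the paper.

```python
from collections import deque

def hops_from(graph, start):
    """Breadth-first search: fewest hops from start to every reachable fold."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def fraction_within(graph, max_hops):
    """Fraction of ordered fold pairs connected by at most max_hops hops."""
    nodes = list(graph)
    total = len(nodes) * (len(nodes) - 1)
    close = 0
    for start in nodes:
        dist = hops_from(graph, start)
        close += sum(1 for n, d in dist.items() if n != start and d <= max_hops)
    return close / total

# Toy network, invented for illustration: a chain A-B-C-D-E plus an isolated fold F.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
    "F": [],
}

for k in (1, 2, 4):
    print(f"within {k} hops: {fraction_within(graph, k):.2f} of fold pairs")
```

In the toy graph the isolated fold F never joins the cluster, so the fraction tops out below 1. The surprising part of the real result is how few folds end up in F's position.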

The authors argue that there's simply a limited number of ways to pack together a hydrogen bonded structure, and nature has explored more or less all of them. They support this argument by showing that they can see a similar network when they feed their system a set of structures generated from random, hydrogen-bonded peptides.

So, what's it all mean? Well, in practical terms, if the authors are right, then the protein folding problem - the one that some of you may have been donating your spare CPU and GPU cycles to - may not be as difficult as some might have thought. If any potential arrangement is possible, then it's tough to see where an unfolded protein might wind up. But if we've really exhausted the biologically relevant fold space already, then the solution to the protein-folding problem would be severely constrained.

In evolutionary terms, the authors suggest the results may support the "Big Bang theory of protein evolution," which suggests that early life was quickly able to explore most of the useful fold space, and has just been tinkering with variations on those folds since.

I'm not entirely convinced it does. In many cases, only a few key amino acids form the hydrogen bonds and charge interactions that hold a fold together. It's easy to imagine that you only have to tweak a few of these to switch from a given fold to something that's closely related. But it's easy to imagine a lot of things; the authors haven't gone through and determined whether that's actually true in the cases that their approach has identified. It's not even clear that we currently have enough information to do this sort of comparison, given that we'd need to have some measure of all the amino acid combinations compatible with a given fold.

Still, it's an intriguing idea, and one that fits in nicely with the growing recognition that emergent properties - situations where ensembles of something behave differently from individual instances - may play a significant role in dictating the behavior of natural systems. If this idea turns out to be right, it's possible that the diversity of protein structures we see arises simply from the properties of the amino acids they're composed of.